-
I'm looking at about 22GB of VRAM usage with a 1B model (Llama 3.2 1B Instruct Q3_M). By the way, using a different model like Llama 2 13B uses only about 17GB of VRAM, which only makes this more confusing.
-
Try changing the context size. Llama 3.2 1B has a much larger context window than Llama 2, and the context size has a very large impact on memory usage.
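In case it helps, here's a minimal sketch of capping the context size with node-llama-cpp (the model path is a placeholder, and `context.contextSize` reporting the resolved size is my assumption; adjust for your setup):

```typescript
import path from "path";
import {fileURLToPath} from "url";
import {getLlama} from "node-llama-cpp";

const __dirname = path.dirname(fileURLToPath(import.meta.url));

const llama = await getLlama();
const model = await llama.loadModel({
    // placeholder path - point this at your local GGUF file
    modelPath: path.join(__dirname, "models", "Llama-3.2-1B-Instruct.Q3_M.gguf")
});

// cap the context at 4096 tokens instead of letting node-llama-cpp
// pick the largest context size that fits in VRAM
const context = await model.createContext({
    contextSize: {max: 4096}
});

console.log("resolved context size:", context.contextSize);
```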
-
I'm working on incremental memory allocation, which will greatly help with this, but it may take some time since it's complicated to get it to work well with good performance. |
`node-llama-cpp` attempts to use the largest context size that can fit optimally on the current hardware, so you can process as much information as the model is capable of handling, but that also means it can use a lot of VRAM.

If you don't need a large context size, you can limit it with the `contextSize` option when calling `createContext`, e.g. `.createContext({contextSize: {max: 4096}})`.

You can also enable flash attention for an even smaller memory footprint without compromising performance, but it may not work well with every model.
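Here's a rough sketch of what combining both could look like (the model path is a placeholder, and the `defaultContextFlashAttention` option is assumed here and may differ depending on your node-llama-cpp version; check the docs of your installed version if it isn't recognized):

```typescript
import {getLlama} from "node-llama-cpp";

const llama = await getLlama();

const model = await llama.loadModel({
    modelPath: "path/to/model.gguf", // placeholder path to a local GGUF file
    // assumption: enables flash attention for contexts created from this model
    defaultContextFlashAttention: true
});

// a capped context size plus flash attention reduces the VRAM footprint further
const context = await model.createContext({
    contextSize: {max: 4096}
});
```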