
High memory consumption on 1B model #373

Closed · Answered by giladgd
physimo asked this question in Q&A

node-llama-cpp tries to use the largest context size that fits comfortably on the current hardware, so you can process as much information as the model can handle, but that also means it may use a lot of VRAM.
If you don't need a large context size, you can limit it with the contextSize option when calling createContext, e.g. .createContext({contextSize: {max: 4096}}).
You can also enable flash attention for an even smaller memory footprint without compromising performance, though it may not work well with every model.
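For example, a minimal sketch of both options together (the model path is a placeholder, and the flashAttention context option is assumed to be available in your node-llama-cpp version and supported by your model):

```typescript
import {getLlama} from "node-llama-cpp";

const llama = await getLlama();

// "model.gguf" is a placeholder path to your 1B model file
const model = await llama.loadModel({modelPath: "model.gguf"});

const context = await model.createContext({
    contextSize: {max: 4096}, // cap the context size to reduce VRAM usage
    flashAttention: true      // smaller memory footprint; may not work well with every model
});
```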

I'm working on incremental memory allocation, which will greatly help with this, but it may take some time since it's complicated to get…
