-
I'm looking at about 22GB of VRAM usage with a 1B model (Llama 3.2 1B Instruct Q3_M). By the way, using a different model like Llama 2 13B uses only about 17GB of VRAM, which only makes this more confusing.
-
Try changing the context size. Llama 3.2 1B has a much larger context window than Llama 2, and the context size has a very large impact on memory usage.
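In case it helps, here's a minimal sketch of capping the context size with node-llama-cpp (the model path is a placeholder, and `context.contextSize` reporting the resolved size is my assumption; adjust for your setup):

```typescript
import path from "path";
import {fileURLToPath} from "url";
import {getLlama} from "node-llama-cpp";

const __dirname = path.dirname(fileURLToPath(import.meta.url));

const llama = await getLlama();
const model = await llama.loadModel({
    // placeholder path - point this at your local GGUF file
    modelPath: path.join(__dirname, "models", "Llama-3.2-1B-Instruct.Q3_M.gguf")
});

// cap the context at 4096 tokens instead of letting node-llama-cpp
// pick the largest context size that fits in VRAM
const context = await model.createContext({
    contextSize: {max: 4096}
});

console.log("resolved context size:", context.contextSize);
```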
-
I'm working on incremental memory allocation, which will greatly help with this, but it may take some time since it's complicated to get it to work well with good performance. |
`node-llama-cpp` attempts to use the largest context size that can fit optimally on the current hardware, so you can process as much information as the model is capable of handling, but that also means it can use a lot of VRAM.

If you don't need a large context size, you can limit it with the `contextSize` option when calling `createContext`, e.g. `.createContext({contextSize: {max: 4096}})`.

You can also enable flash attention for an even smaller memory footprint without compromising performance, but it may not work well with every model.
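Here's a rough sketch of what combining both could look like (the model path is a placeholder, and the `defaultContextFlashAttention` option is assumed here and may differ depending on your node-llama-cpp version; check the docs of your installed version if it isn't recognized):

```typescript
import {getLlama} from "node-llama-cpp";

const llama = await getLlama();

const model = await llama.loadModel({
    modelPath: "path/to/model.gguf", // placeholder path to a local GGUF file
    // assumption: enables flash attention for contexts created from this model
    defaultContextFlashAttention: true
});

// a capped context size plus flash attention reduces the VRAM footprint further
const context = await model.createContext({
    contextSize: {max: 4096}
});
```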