
Efficient model loading and usage across multiple threads on different devices (desktop, CUDA, Metal, CPU, etc.) #2326

Answered by LaurentMazare
louis030195 asked this question in Q&A
If you load the same model multiple times (whether on the same thread or on multiple threads doesn't make much of a difference), each load duplicates the memory footprint of the weights.
Instead, you probably want to load the model once and then clone it to get a separate KV cache (if it's a model with a KV cache). Cloning shares the weights, so there is no duplicated memory, while each clone gets its own KV cache. Fwiw that's how we handle it to serve moshi.
And yes, your errors (segfault 11 / killed 9 / bus error) certainly look like the OS running out of memory and killing the process.
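
To make the load-once-then-clone pattern concrete, here is a minimal sketch using only the Rust standard library. It is not candle's API: `Model`, `load`, `step`, `weight_refs`, `cache_len` and `run_workers` are hypothetical names, and the `Arc<Vec<f32>>` merely stands in for reference-counted weight tensors. The assumption is that cloning the real model behaves similarly, i.e. the weights are shared by reference and the KV cache state is per clone.

```rust
use std::sync::Arc;
use std::thread;

/// Hypothetical stand-in for a model: read-only weights behind an `Arc`
/// plus a per-instance KV cache. Not a candle type.
#[derive(Clone)]
pub struct Model {
    /// Shared by every clone; cloning only bumps the reference count.
    weights: Arc<Vec<f32>>,
    /// Owned by each instance; `clone()` copies it, so clones diverge freely.
    kv_cache: Vec<f32>,
}

impl Model {
    /// Pretend to load `n` weights once (assumption: this is the expensive step).
    pub fn load(n: usize) -> Self {
        Model {
            weights: Arc::new(vec![0.5; n]),
            kv_cache: Vec::new(),
        }
    }

    /// Toy "forward step": appends to this instance's cache and reads the shared weights.
    pub fn step(&mut self, x: f32) -> f32 {
        self.kv_cache.push(x);
        let sum: f32 = self.kv_cache.iter().sum();
        sum * self.weights[0]
    }

    /// Number of live handles pointing at the same weight allocation.
    pub fn weight_refs(&self) -> usize {
        Arc::strong_count(&self.weights)
    }

    /// Number of entries in this instance's own cache.
    pub fn cache_len(&self) -> usize {
        self.kv_cache.len()
    }
}

/// Spawn one thread per worker, each owning its own clone of `base`.
/// Returns the final cache length seen by each worker.
pub fn run_workers(base: &Model, n_workers: usize, steps: usize) -> Vec<usize> {
    let handles: Vec<_> = (0..n_workers)
        .map(|_| {
            let mut model = base.clone(); // cheap: weights are shared, cache is separate
            thread::spawn(move || {
                for i in 0..steps {
                    model.step(i as f32);
                }
                model.cache_len()
            })
        })
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).collect()
}
```

Calling `Model::load` once and then `run_workers(&base, 4, 3)` keeps a single copy of the weights in memory while each thread fills its own cache. Calling `Model::load` inside each thread would instead allocate the weights once per thread, which is the duplication described above.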

Answer selected by louis030195