Re-using model weights after initial load? #2158
Replies: 2 comments 1 reply
-
The reason the model needs to be mutable is that it keeps internal state (the KV cache) between forward passes. However, if you are running multiple inference sessions at once, you should use an inference platform designed for this. If that is not necessary, you can just clear the KV cache after one session is done and repeat the process.
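A minimal sketch of the sequential pattern. This assumes a model type that exposes a cache-reset method; `candle_transformers::models::quantized_mistral::Model` is used here as an illustration, and the exact method name (`clear_kv_cache`) and forward signature should be checked against the model type you actually use:

```rust
use candle_core::{Device, Tensor};
use candle_transformers::models::quantized_mistral::Model;

// Run one prompt through `model`, then reset its per-session state so the
// same weights can serve the next, unrelated prompt.
fn run_session(model: &mut Model, tokens: &[u32], device: &Device) -> anyhow::Result<()> {
    for (pos, &token) in tokens.iter().enumerate() {
        // Shape the single token as a (batch = 1, seq_len = 1) tensor.
        let input = Tensor::new(&[token], device)?.unsqueeze(0)?;
        let _logits = model.forward(&input, pos)?;
        // ... sample the next token from the logits and feed it back in ...
    }
    // Drop the accumulated KV cache before the next session re-uses the model.
    model.clear_kv_cache();
    Ok(())
}
```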
-
The recommended solution for handling multiple sessions at once is to load the model and then, for each session, clone this "main" model (most models are cloneable). This ensures that each session uses a separate KV cache while the weights are shared between the different sessions. (You could re-use sessions by clearing the KV caches, but that is not really necessary.)
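A minimal sketch of that clone-per-session pattern, assuming a GGUF model loaded through `candle_transformers::models::quantized_llama::ModelWeights` (the same type the quantized-phi example uses). Whether a given model type implements `Clone` should be checked case by case, as noted above:

```rust
use candle_core::quantized::gguf_file;
use candle_core::Device;
use candle_transformers::models::quantized_llama::ModelWeights;

fn main() -> anyhow::Result<()> {
    let device = Device::Cpu;

    // Load the weights once from a GGUF file.
    let mut file = std::fs::File::open("model.gguf")?;
    let content = gguf_file::Content::read(&mut file)?;
    let base = ModelWeights::from_gguf(content, &mut file, &device)?;

    // One clone per inference session. Candle tensors are reference-counted,
    // so the clones share the underlying weight buffers; only the per-session
    // mutable state (the KV cache) is independent.
    let mut session_a = base.clone();
    let mut session_b = base.clone();

    // Each session can now run its own forward passes independently, e.g.:
    // let logits = session_a.forward(&input_tokens, position)?;
    let _ = (&mut session_a, &mut session_b);
    Ok(())
}
```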
-
A bit of a noob question, mostly because I don't yet have a full grasp of how transformers work, but I noticed that the `ModelWeights` loaded from file needs to be mutable in order to perform an inference: https://github.com/huggingface/candle/blob/main/candle-examples/examples/quantized-phi/main.rs#L189
I'm guessing this is because it needs to keep some internal state to accurately predict the next token.
I'd like to load a model from file and create many unrelated inference sessions with that same model. What's the best approach for this? Save off the initial model weights and clone them every time a new inference session needs to be created? Worried I might be barking up the wrong tree here. TIA