dan-homebrew changed the title from "feedback: mmap for keeping Model in VRAM" to "feedback: mmap for keeping Model in VRAM when Flash Attention is used" on Nov 23, 2024.
dan-homebrew changed the title from "feedback: mmap for keeping Model in VRAM when Flash Attention is used" to "roadmap: mmap for keeping Model in VRAM when Flash Attention is used" on Dec 16, 2024.
Goal
Feedback from WiseFarAI:
I am also wondering whether you use mmap or other mechanisms to keep the model in VRAM, and if/when Flash Attention is used, since the current settings for these parameters cannot easily be observed or changed from within the UI. Sometimes the app appears to reload the model mid-conversation, and generation speed drops from 6 tokens per second to 2.
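For context, the knobs the feedback refers to exist at the llama.cpp layer that the llamacpp engine wraps. A minimal sketch of those parameters using the llama-cpp-python bindings (an illustration only; the model path is a placeholder, and how these settings would be surfaced in the UI is an open question, not something confirmed here):

```python
# Sketch: the loader-level parameters the feedback asks to expose.
# Uses the llama-cpp-python bindings; the model path is a placeholder.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/example-7b-q4_k_m.gguf",  # placeholder path
    n_gpu_layers=-1,   # offload all layers so the weights stay in VRAM
    use_mmap=True,     # memory-map the GGUF file instead of copying it into RAM
    use_mlock=False,   # set True to pin pages in RAM so the OS cannot swap them out
    flash_attn=True,   # enable Flash Attention kernels where the backend supports them
)

out = llm("Hello", max_tokens=16)
print(out["choices"][0]["text"])
```

If mmap is enabled but layers are not fully offloaded, or the OS evicts the mapped pages, the runtime can end up re-reading weights from disk mid-conversation, which would be consistent with the reported drop from 6 to 2 tokens per second.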