
roadmap: mmap for keeping Model in VRAM when Flash Attention is used #1717

Open
dan-homebrew opened this issue Nov 23, 2024 · 1 comment

dan-homebrew (Contributor) commented Nov 23, 2024

Goal

Feedback from WiseFarAI:

I am also wondering whether you use mmap and other ways of keeping the model in VRAM, and if/when Flash Attention is used, since these parameters cannot easily be observed (their current setting) or changed from within the UI. Sometimes it seems like the model is reloaded mid-conversation, and the generation speed drops from 6 tokens per second to 2.
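For context, on a llama.cpp-based engine these behaviours are controlled by a handful of load-time parameters. Below is a minimal sketch using the llama-cpp-python bindings to show the knobs in question; the parameter names follow upstream llama.cpp, and the model path is hypothetical. How (or whether) the equivalent settings are named and exposed in Cortex/Jan is exactly what this issue asks to surface in the UI.

```python
from llama_cpp import Llama

# Sketch only: illustrates the mmap / mlock / Flash Attention toggles,
# not Cortex's actual configuration surface.
llm = Llama(
    model_path="model.gguf",  # hypothetical model file
    n_gpu_layers=-1,          # offload all layers to VRAM
    use_mmap=True,            # memory-map the weights instead of copying them
    use_mlock=True,           # pin pages so the OS cannot evict the model
    flash_attn=True,          # enable Flash Attention kernels where supported
)
```

If the model is evicted or partially offloaded between turns, a reload mid-conversation would explain the reported drop from 6 to 2 tokens per second.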

@dan-homebrew converted this from a draft issue Nov 23, 2024
@dan-homebrew changed the title from "feedback: mmap for keeping Model in VRAM" to "feedback: mmap for keeping Model in VRAM when Flash Attention is used" Nov 23, 2024
gabrielle-ong (Contributor) commented:

@vansangpfiev assigning to you to investigate, cc @dan-homebrew - will have a call with Sang next week

@dan-homebrew changed the title from "feedback: mmap for keeping Model in VRAM when Flash Attention is used" to "roadmap: mmap for keeping Model in VRAM when Flash Attention is used" Dec 16, 2024
@vansangpfiev moved this from Investigating to In Progress in Jan & Cortex Dec 23, 2024
Projects
Status: In Progress
Development

No branches or pull requests

4 participants