llama-cpp multi server support #316

Open
cdoern wants to merge 1 commit into main from laptop-pipeline

Conversation

@cdoern (Contributor) commented Oct 21, 2024

llama-cpp does not support batching, concurrent completion requests, or anything else that would speed up our processes.

The only clear solution here is to create our own form of parallelism by supporting running multiple servers at once.

Via a `--num-servers` flag from the CLI, a user can spin up 2, 3, or even 4 instances of the `mistral-7b-instruct` model, since each only takes about 5 GB of RAM.

This allows us to split our dataset into batches, as we do with vLLM, and execute threads running each batch in parallel. Each server handles its own batch; a rough sketch of the idea is below.
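
A minimal sketch of the idea (the port numbers, model path, and helper names here are illustrative assumptions, not the PR's actual code; readiness checks and error handling are omitted):

```python
import subprocess
import time
from concurrent.futures import ThreadPoolExecutor

from openai import OpenAI

NUM_SERVERS = 3          # would come from the proposed --num-servers flag
BASE_PORT = 8000         # assumed starting port
MODEL_PATH = "mistral-7b-instruct-v0.2.Q4_K_M.gguf"  # example local GGUF file

# Start one llama-cpp-python server per port (each exposes an OpenAI-compatible API).
servers = [
    subprocess.Popen(
        ["python", "-m", "llama_cpp.server",
         "--model", MODEL_PATH, "--port", str(BASE_PORT + i)]
    )
    for i in range(NUM_SERVERS)
]
time.sleep(10)  # crude stand-in for waiting until every server is ready

def run_batch(port: int, batch: list[str]) -> list[str]:
    """Send one batch of completion requests to a single server."""
    client = OpenAI(base_url=f"http://127.0.0.1:{port}/v1", api_key="not-needed")
    outputs = []
    for prompt in batch:
        resp = client.completions.create(model="local", prompt=prompt, max_tokens=256)
        outputs.append(resp.choices[0].text)
    return outputs

prompts = ["example prompt one", "example prompt two", "example prompt three"]
# Round-robin split: server i gets every NUM_SERVERS-th prompt.
batches = [prompts[i::NUM_SERVERS] for i in range(NUM_SERVERS)]
ports = range(BASE_PORT, BASE_PORT + NUM_SERVERS)

with ThreadPoolExecutor(max_workers=NUM_SERVERS) as pool:
    results = list(pool.map(run_batch, ports, batches))

for proc in servers:
    proc.terminate()
```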

@bbrowning (Contributor)

We can already do parallel requests with vLLM, which anyone on Linux should be able to use. So, the goal here is to get higher performance strictly with llama.cpp for Mac users? And the premise behind these changes is that today we are unable to efficiently utilize a Mac with a single llama.cpp instance, so we need to run multiple of them? Is there any upstream llama.cpp discussion or documentation around this to show that a single server cannot saturate typical Mac hardware?

@cdoern force-pushed the laptop-pipeline branch 2 times, most recently from 6b29f4a to a86542d on October 21, 2024 17:11
@mergify bot added and then removed the ci-failure label on Oct 21, 2024
@cdoern force-pushed the laptop-pipeline branch 8 times, most recently from d36a5b6 to 7e0fa83 on October 21, 2024 19:24
@cdoern (Contributor, Author) commented Oct 22, 2024

> We can already do parallel requests with vLLM, which anyone on Linux should be able to use. So, the goal here is to get higher performance strictly with llama.cpp for Mac users? And the premise behind these changes is that today we are unable to efficiently utilize a Mac with a single llama.cpp instance, so we need to run multiple of them? Is there any upstream llama.cpp discussion or documentation around this to show that a single server cannot saturate typical Mac hardware?

So the idea here is pretty close to what you have captured. It's not that a single llama server can't saturate Mac hardware; it's that it can't take more than one large completion request at a time.

I can run full SDG on a laptop with a two-page markdown file in ~2 hours. Using debug logging, I can see the completions come back quickly, since the chunks of data are quite small.

Now, if I turn this up to a 50-page markdown file, `gen_spellcheck` and `gen_knowledge` alone take 10 hours.

I tried a couple of things, like seeing whether `ThreadPoolExecutor` could just work against llama-cpp, but it seems that a llama server can only take one completion request at a time (from what I could tell). So splitting into threads and dividing the data that way didn't work.

Since it seems llama-cpp-python can't natively support parallel completion requests, the only way to get close to the behavior we support with vLLM is to spin up a few servers and kick off threads, where each thread has its own client and processes a subsection of the overall dataset.

This allows us to run something like three pipeline processes concurrently, each processing a subset of the data. The results are then returned, concatenated, and mixed normally, cutting the time to roughly a third of what it once was! A rough sketch of the split/recombine step is below.
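
Roughly what that split/recombine step could look like with the Hugging Face `datasets` API (the `run_pipeline` helper and the per-server `clients` list are placeholders, not actual SDG code):

```python
from concurrent.futures import ThreadPoolExecutor

from datasets import Dataset, concatenate_datasets

def run_pipeline(client, shard: Dataset) -> Dataset:
    """Placeholder: run the SDG pipeline for one shard against one server's client."""
    raise NotImplementedError

def generate_in_parallel(clients: list, dataset: Dataset) -> Dataset:
    n = len(clients)  # one OpenAI client per llama-cpp server
    shards = [dataset.shard(num_shards=n, index=i) for i in range(n)]
    with ThreadPoolExecutor(max_workers=n) as pool:
        results = list(pool.map(run_pipeline, clients, shards))
    # Recombine so downstream mixing sees a single dataset again.
    return concatenate_datasets(results)
```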

If I am wrong about llama-cpp not taking multiple completions at once across threads, I can try that again, but I had no luck there.

I have an ilab branch here showing how this would feed into SDG: https://github.com/cdoern/instructlab/tree/llama-batch @bbrowning

@aakankshaduggal requested a review from a team on October 28, 2024 14:49
mergify bot commented Nov 7, 2024

This pull request has merge conflicts that must be resolved before it can be merged. @cdoern please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify bot added the needs-rebase label on Nov 7, 2024
@khaledsulayman (Member)

@cdoern I know this is a bit stale, but is this redundant with #358?

@bbrowning (Contributor)

I don't think this exactly overlaps with #358; this one is more about speeding up SDG by executing parallel requests against multiple llama-cpp servers.

I agree that SDG should be able to send multiple requests in parallel. I disagree with the approach in this PR of having multiple different OpenAI Clients hitting different endpoints, as that's not typically how servers are load-balanced. In the production case, we'd have a single load-balanced endpoint and the user should have some knob to control how many SDG requests we execute in parallel against that backend.

Perhaps we need to separate this out a bit into a couple of phases. One phase is providing a knob so users can control how many SDG requests we execute in parallel, and that may require rethinking some of our concurrency primitives in use during the data generation loop.

The other phase would be giving options in the CLI to spin up multiple llama-cpp-python servers load-balanced behind the uvicorn instance it's already managing. That would be multiple llama-cpp-python servers, but all behind a single endpoint, so that we're not having to juggle multiple endpoints and multiple separate OpenAI Client instances.
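
As a rough illustration of what that user-facing knob might look like against a single endpoint (`num_workers`, the endpoint URL, and the function names are assumptions, not existing SDG options):

```python
from concurrent.futures import ThreadPoolExecutor

from openai import OpenAI

# One client, one endpoint; any load balancing across llama-cpp servers
# happens behind this URL and is invisible to SDG.
client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="not-needed")

def complete(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="local",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def generate(prompts: list[str], num_workers: int = 4) -> list[str]:
    # num_workers is the user-facing knob for in-flight requests; the backend
    # decides how those requests are actually scheduled.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        return list(pool.map(complete, prompts))
```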
