llama-cpp multi server support #316

Open
cdoern wants to merge 1 commit into main from laptop-pipeline

Conversation

@cdoern (Contributor) commented Oct 21, 2024

llama-cpp does not support batching, concurrent completion requests, or anything else that would speed up our processes.

The only clear solution here is to create our own form of parallelism by supporting running multiple servers at once.

Via a `--num-servers` flag from the CLI, a user can spin up 2, 3, or even 4 instances of the `mistral-7b-instruct` model, since each only takes about 5 GB of RAM.

This allows us to split our dataset into batches, as we do with vLLM, and execute threads running each batch in parallel. Each server handles its own batch; a rough sketch of the idea is below.
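
A minimal sketch of the idea (the port numbers, model path, and helper names here are illustrative assumptions, not the PR's actual code; readiness checks and error handling are omitted):

```python
import subprocess
import time
from concurrent.futures import ThreadPoolExecutor

from openai import OpenAI

NUM_SERVERS = 3          # would come from the proposed --num-servers flag
BASE_PORT = 8000         # assumed starting port
MODEL_PATH = "mistral-7b-instruct-v0.2.Q4_K_M.gguf"  # example local GGUF file

# Start one llama-cpp-python server per port (each exposes an OpenAI-compatible API).
servers = [
    subprocess.Popen(
        ["python", "-m", "llama_cpp.server",
         "--model", MODEL_PATH, "--port", str(BASE_PORT + i)]
    )
    for i in range(NUM_SERVERS)
]
time.sleep(10)  # crude stand-in for waiting until every server is ready

def run_batch(port: int, batch: list[str]) -> list[str]:
    """Send one batch of completion requests to a single server."""
    client = OpenAI(base_url=f"http://127.0.0.1:{port}/v1", api_key="not-needed")
    outputs = []
    for prompt in batch:
        resp = client.completions.create(model="local", prompt=prompt, max_tokens=256)
        outputs.append(resp.choices[0].text)
    return outputs

prompts = ["example prompt one", "example prompt two", "example prompt three"]
# Round-robin split: server i gets every NUM_SERVERS-th prompt.
batches = [prompts[i::NUM_SERVERS] for i in range(NUM_SERVERS)]
ports = range(BASE_PORT, BASE_PORT + NUM_SERVERS)

with ThreadPoolExecutor(max_workers=NUM_SERVERS) as pool:
    results = list(pool.map(run_batch, ports, batches))

for proc in servers:
    proc.terminate()
```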

@bbrowning (Contributor)

We can already do parallel requests with vLLM, which anyone on Linux should be able to use. So, the goal here is to get higher performance strictly with llama.cpp for Mac users? And the premise behind these changes is that today we are unable to efficiently utilize a Mac with a single llama.cpp instance, so we need to run multiple of them? Is there any upstream llama.cpp discussion or documentation around this to show that a single server cannot saturate typical Mac hardware?

@cdoern force-pushed the laptop-pipeline branch 2 times, most recently from 6b29f4a to a86542d on October 21, 2024 17:11
@mergify bot added and then removed the ci-failure label on Oct 21, 2024
@cdoern force-pushed the laptop-pipeline branch 8 times, most recently from d36a5b6 to 7e0fa83 on October 21, 2024 19:24
@cdoern (Contributor, Author) commented Oct 22, 2024

> We can already do parallel requests with vLLM, which anyone on Linux should be able to use. So, the goal here is to get higher performance strictly with llama.cpp for Mac users? And the premise behind these changes is that today we are unable to efficiently utilize a Mac with a single llama.cpp instance, so we need to run multiple of them? Is there any upstream llama.cpp discussion or documentation around this to show that a single server cannot saturate typical Mac hardware?

So the idea here is pretty close to what you have captured. It's not that a single llama server can't saturate Mac hardware; it's that it can't take more than one large completion request at a time.

I can run full SDG on a laptop with a two-page markdown file in ~2 hours. Using debug logging, I can see the completions come back quickly, since the chunks of data are quite small.

Now, if I turn this up to a 50-page markdown file, `gen_spellcheck` and `gen_knowledge` alone take 10 hours.

I tried a couple of things, like seeing whether `ThreadPoolExecutor` could just work against llama-cpp, but it seems that a llama server can only take one completion request at a time (from what I could tell). So splitting into threads and dividing the data that way didn't work.

Since it seems llama-cpp-python can't natively support parallel completion requests, the only way to get close to the behavior we support with vLLM is to spin up a few servers and kick off threads, where each thread has its own client and processes a subsection of the overall dataset.

This allows us to run something like three pipeline processes concurrently, each processing a subset of the data. The results are then returned, concatenated, and mixed normally, cutting the time to roughly a third of what it once was! A rough sketch of the split/recombine step is below.
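
Roughly what that split/recombine step could look like with the Hugging Face `datasets` API (the `run_pipeline` helper and the per-server `clients` list are placeholders, not actual SDG code):

```python
from concurrent.futures import ThreadPoolExecutor

from datasets import Dataset, concatenate_datasets

def run_pipeline(client, shard: Dataset) -> Dataset:
    """Placeholder: run the SDG pipeline for one shard against one server's client."""
    raise NotImplementedError

def generate_in_parallel(clients: list, dataset: Dataset) -> Dataset:
    n = len(clients)  # one OpenAI client per llama-cpp server
    shards = [dataset.shard(num_shards=n, index=i) for i in range(n)]
    with ThreadPoolExecutor(max_workers=n) as pool:
        results = list(pool.map(run_pipeline, clients, shards))
    # Recombine so downstream mixing sees a single dataset again.
    return concatenate_datasets(results)
```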

If I am wrong about llama-cpp not taking multiple completions at once across threads, I can try that again, but I had no luck there.

I have an ilab branch here showing how this would feed into SDG: https://github.com/cdoern/instructlab/tree/llama-batch @bbrowning

@aakankshaduggal requested a review from a team on October 28, 2024 14:49
mergify bot commented Nov 7, 2024

This pull request has merge conflicts that must be resolved before it can be merged. @cdoern please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify bot added the needs-rebase label on Nov 7, 2024
@khaledsulayman (Member)

@cdoern I know this is a bit stale, but is this redundant with #358?

@bbrowning (Contributor)

I don't think this exactly overlaps with #358; this one is more about speeding up SDG by executing parallel requests against multiple llama-cpp servers.

I agree that SDG should be able to send multiple requests in parallel. I disagree with the approach in this PR of having multiple different OpenAI Clients hitting different endpoints, as that's not typically how servers are load-balanced. In the production case, we'd have a single load-balanced endpoint and the user should have some knob to control how many SDG requests we execute in parallel against that backend.

Perhaps we need to separate this out a bit into a couple of phases. One phase is providing a knob so users can control how many SDG requests we execute in parallel, and that may require rethinking some of our concurrency primitives in use during the data generation loop.

The other phase would be giving options in the CLI to spin up multiple llama-cpp-python servers load-balanced behind the uvicorn instance it's already managing. That would be multiple llama-cpp-python servers, but all behind a single endpoint, so that we're not having to juggle multiple endpoints and multiple separate OpenAI Client instances.
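
As a rough illustration of what that user-facing knob might look like against a single endpoint (`num_workers`, the endpoint URL, and the function names are assumptions, not existing SDG options):

```python
from concurrent.futures import ThreadPoolExecutor

from openai import OpenAI

# One client, one endpoint; any load balancing across llama-cpp servers
# happens behind this URL and is invisible to SDG.
client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="not-needed")

def complete(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="local",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def generate(prompts: list[str], num_workers: int = 4) -> list[str]:
    # num_workers is the user-facing knob for in-flight requests; the backend
    # decides how those requests are actually scheduled.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        return list(pool.map(complete, prompts))
```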
