Idea: Support distributed computing inference serialized over layer-subset groups. #542
Replies: 1 comment · 3 replies
-
I'd like to switch from llama.cpp, but I need this feature! There's significant room for improvement vs. their implementation, too.
-
Hey @oldgithubman and @ghchris2021! This is something I am thinking about, probably starting with NCCL. @oldgithubman, can you please elaborate on:
-
Hi! Thank you very much for this great project overall, and for considering the addition of these kinds of distributed features. I think this is a big missed opportunity for personal-scale inference: lots of people have a couple / few decent machines, but not so many have really high-end workstations or servers with multiple state-of-the-art GPUs, hundreds of GB of RAM, dozens of cores, etc. We're blessed to be in a time where we're getting several 70-400B+ models with open weights for experimentation / personal use, but common PC architectures and GPU offerings, other than high-end servers, are not keeping up!
NCCL sounds like a great start, I can understand. I don't have much of a use case for it myself beyond one particular box with a couple of NVIDIA GPUs, but I'm sure many will have more capacity and more nodes.
IMO llama.cpp's distributed capability, going by my observations, could be improved by:
1: Better support for heterogeneous GPUs. In that project's status quo, CUDA GPUs have far fewer limitations; other vendors suffer from inability to use particular quantizations, failure to work as multiple GPUs in a single node, or failure to work at all in the distributed rpc-server use case. I've run into these limits with Intel GPUs, and I've heard from others that support is still not great for AMD GPUs either, depending on the particular use case. You can see from the feature matrix (https://github.com/ggerganov/llama.cpp/wiki/Feature-matrix) that while NVIDIA GPUs are almost fully supported, there are performance / quantization-support / multi-GPU scalability limits for AMD, and Intel GPUs still AFAIK do not work distributed; single-node multi-GPU Intel support was only implemented in the past day or so. Furthermore, the "Vulkan" and "Kompute" columns show that although those frameworks have great unifying potential to support multiple GPU architectures on multiple OS platforms by using Vulkan to abstract the compute, they are so far behind in performance and feature support that they are not yet a viable / attractive solution in llama.cpp.
Beyond the functional limitations, their implementation AFAICT has a lot of UX roughness for distributed use:
1: You can't easily tell it how much RAM or VRAM to use on each node, either as an absolute number of GB or as a percentage of device capacity, and you can't (IME) rely on it to correctly auto-scale memory usage even on a single node / GPU, much less in the rpc-server case. So scaling the compute / memory allocation per layer is very awkward and manually interactive, and the layer-allocation settings change for every model and for every change of cache / quantization / context-size parameters. (A sketch of what explicit per-node budgets could look like follows below.)
2: The (couple of) projects I've seen so far that are starting to do distributed inference (llama.cpp et al.) innately just send the model data to the worker nodes over the LAN. That's an appropriate baseline, yes, but in typical use cases every worker node will probably have local SSD storage that is 10x+ faster than the LAN bandwidth, so it's highly desirable to be able to load / cache model data on the workers' SSDs so they can load it much faster (ideally all workers in parallel) instead of waiting to download it over the LAN every single time the same model is rerun.
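To make that first UX point concrete, here is a minimal sketch of what an explicit per-node resource budget could look like. None of these types, fields, or addresses exist in mistral.rs (or llama.cpp) today; they are hypothetical illustrations of the kind of control being asked for:

```rust
/// Hypothetical per-worker resource budget for distributed inference.
/// Nothing here is an existing mistral.rs API; this only sketches the
/// explicit, per-node control described above.
struct NodeBudget {
    /// Worker address on the LAN, e.g. "192.168.1.20:50051".
    addr: String,
    /// Hard cap on system RAM to use, in GiB (absolute, not auto-guessed).
    ram_gib: Option<u32>,
    /// Hard cap on GPU VRAM to use, in GiB.
    vram_gib: Option<u32>,
    /// Alternatively, a fraction (0.0..=1.0) of total device capacity.
    vram_fraction: Option<f32>,
    /// Local directory where this worker caches model shards, so they
    /// load from its own SSD instead of being re-sent over the LAN.
    shard_cache_dir: Option<std::path::PathBuf>,
}

fn main() {
    // Two heterogeneous workers with very different budgets.
    let nodes = vec![
        NodeBudget {
            addr: "192.168.1.20:50051".into(),
            ram_gib: Some(32),
            vram_gib: Some(20),
            vram_fraction: None,
            shard_cache_dir: Some("/var/cache/models".into()),
        },
        NodeBudget {
            addr: "192.168.1.21:50051".into(),
            ram_gib: Some(16),
            vram_gib: None,
            vram_fraction: Some(0.9),
            shard_cache_dir: None, // falls back to streaming over LAN
        },
    ];
    for n in &nodes {
        println!("{} ram={:?} vram={:?}", n.addr, n.ram_gib, n.vram_gib);
    }
}
```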
So my ideal hope for the evolution of personal / local hobby- and study-scale distributed inference is getting to a level where one can: support at least good-quality 8-bit and 4-bit model quantization; work heterogeneously with NVIDIA / Intel / AMD GPUs and multi-core CPUs as workers in whatever combination; run over an IP LAN; run under Linux; and support the primary contemporary 70-400B LLMs, including the most prominent MoEs. It also seems that across all the inference projects, adding support for new models and supporting more heterogeneous platforms / GPUs are recurring pain points, so there's probably an opportunity for middleware or portable interface abstractions / encapsulations that make those easy, so the community doesn't have to independently reinvent those wheels across N different projects. But that's a tangent.
-
First, I'd like to echo @ghchris2021 in thanking you for everything so far. I haven't started using mistral.rs yet, but even just these few interactions have been way better than much of what I've experienced over at llama.cpp (after typing this, I discovered they've banned me from commenting, further illustrating my point; my final comments, for reference: ggerganov/llama.cpp#8684). I also agree with most of what @ghchris2021 said about distributed inference. As a data point, I currently have three relevant machines connected via LAN:
So, 100% NVIDIA/CUDA, but I absolutely welcome more competition regarding GPUs.
(It turns out most of the answers to these two questions overlap; the current state of llama.cpp is in parentheses.) In rough order of importance (the first two mirror the points of @ghchris2021): Critical:
Important:
Nice to have:
Possibly useful:
-
Idea: Support distributed computing inference serialized over layer-subset groups.
The primary use case would be enabling models larger than any single system's available RAM / VRAM to be inferenced at relatively high speed, by using the available compute capacity of any number of networked (e.g. LAN) computers, each with some available memory, some CPU compute, and possibly some GPU/VRAM compute resources.
There are now many compelling open-weights models between 30B and well over 200B parameters which, generally speaking, are too large for almost all users to inference effectively on any single commonly available laptop / desktop / modest workstation system, because of the limited VRAM capacity of their GPUs and even the limited RAM capacity on the CPU side.
It is well known that the GPUs in typical laptop / desktop / workstation computers have VRAM bandwidth on the order of 10x higher than the system's RAM bandwidth. It is also well known that most personal desktop systems have no more than 1-2 directly attached GPUs, with total VRAM capacity typically in the 8-48 GB range.
However, it is typically the case that a given person, household, or small business has access to two or more laptop / desktop computers, each with significant CPU/RAM computing resources and, in many cases, with modest but significantly useful GPU capabilities (e.g. GPUs with 8-24 GB of VRAM).
It has been demonstrated by other FOSS attempts at distributed LLM inference (e.g. Petals, llama.cpp's RPC-server mode, et al.) that this is a practically viable way to inference LLMs, using much the same codebase / logic as single-computer inference with zero, one, or multiple GPUs.
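To illustrate the "layer-subset groups" idea concretely, here is a minimal sketch (in Rust, since this is a Rust project) of splitting a model's layers across workers in proportion to the memory each contributes. Everything here is hypothetical; it ignores embeddings, the KV cache, and per-layer size differences:

```rust
use std::ops::Range;

/// Assign a contiguous range of transformer layers to each worker,
/// proportional to the memory it contributes. A hypothetical sketch:
/// it ignores embeddings, the KV cache, and per-layer size variation.
fn partition_layers(num_layers: usize, mem_gib: &[f64]) -> Vec<Range<usize>> {
    let total: f64 = mem_gib.iter().sum();
    let mut ranges = Vec::with_capacity(mem_gib.len());
    let mut start = 0;
    for (i, m) in mem_gib.iter().enumerate() {
        let share = if i + 1 == mem_gib.len() {
            num_layers - start // last worker takes the remainder
        } else {
            (((m / total) * num_layers as f64).round() as usize).min(num_layers - start)
        };
        ranges.push(start..start + share);
        start += share;
    }
    ranges
}

fn main() {
    // An 80-layer 70B-class model split over three unequal machines,
    // e.g. a 24 GB GPU, a 16 GB GPU, and 8 GB of spare RAM on a third box.
    for (i, r) in partition_layers(80, &[24.0, 16.0, 8.0]).iter().enumerate() {
        println!("worker {i}: layers {}..{} ({} layers)", r.start, r.end, r.len());
    }
}
```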
In the professional / enterprise / data center space, distributed computing is widely used by mainstream training and inference frameworks to let multiple servers, each with multiple GPUs, be used as a group for either training or inference, and thereby to gain capacity scaling.
I believe the extra "business logic" to support this could be small relative to a platform that already supports single-system inference on multi-core CPUs and on single systems with multiple locally attached GPUs.
Several low-level frameworks, e.g. MPI, PoCL, et al., already support / facilitate distributed computing as a primary or first-class optional capability.
And since we're discussing, at minimum, coarsely sharded distribution of model inference among several servers, the actual logic to inference a single layer or a group of adjacent layers is not technically very different from what would be done in the single-host or single-GPU case.
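A minimal sketch of that point: each worker only needs to accept an activation from the previous node, run its own contiguous layer range with the same forward-pass code a single host would use, and pass the result on. The framing (length-prefixed little-endian f32 buffer over TCP) and all names below are illustrative, not an existing protocol:

```rust
use std::io::{Read, Write};
use std::net::{TcpListener, TcpStream};

/// Stand-in for "run my contiguous layer range on this activation".
/// In a real system this would call the same single-node forward pass
/// the engine already has, restricted to the layers this worker owns.
fn forward_local_layers(hidden: &mut [f32]) {
    for x in hidden.iter_mut() {
        *x = x.tanh(); // placeholder computation, not a real transformer layer
    }
}

/// One pipeline stage: receive a length-prefixed f32 activation from the
/// previous node, apply the local layers, and forward it to the next node.
fn run_stage(listen_addr: &str, next_addr: &str) -> std::io::Result<()> {
    let (mut upstream, _) = TcpListener::bind(listen_addr)?.accept()?;
    let mut downstream = TcpStream::connect(next_addr)?;
    loop {
        // Frame: u32 little-endian element count, then n f32 values.
        let mut len = [0u8; 4];
        if upstream.read_exact(&mut len).is_err() {
            return Ok(()); // upstream closed: end of the token stream
        }
        let n = u32::from_le_bytes(len) as usize;
        let mut bytes = vec![0u8; n * 4];
        upstream.read_exact(&mut bytes)?;
        let mut hidden: Vec<f32> = bytes
            .chunks_exact(4)
            .map(|c| f32::from_le_bytes([c[0], c[1], c[2], c[3]]))
            .collect();

        forward_local_layers(&mut hidden);

        downstream.write_all(&(hidden.len() as u32).to_le_bytes())?;
        for x in &hidden {
            downstream.write_all(&x.to_le_bytes())?;
        }
        downstream.flush()?;
    }
}

fn main() -> std::io::Result<()> {
    // Each worker runs one stage; the addresses are illustrative.
    run_stage("0.0.0.0:50051", "192.168.1.21:50051")
}
```

Note that only the hidden state crosses the LAN per token (roughly 32 KB for a hidden size of 8192 at f32), which is negligible next to the model weights, so even a 1 GbE LAN should not be the bottleneck for this serialized decoding scheme.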
Models of key interest for such scaling would include all typical 70B-range current-generation models, models in the 100-140B range, models in the 200B+ range (e.g. DeepSeek-V2, Command R+, DBRX, etc.), and MoE models with 70-200B+ total parameters (e.g. Mixtral 8x22B, et al.).
Eliminating the use of swap on a single host can frequently improve effective performance by 1+ orders of magnitude.
Eliminating slow CPU/RAM-based inference by enabling the aggregation of several "modest" GPUs + their VRAM for inference of a single model can also typically improve performance by O(10x) vs. even a common modern fast desktop PC with 20-50 GB/s RAM bandwidth, 6-16 CPU cores, etc.
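To put rough arithmetic behind that O(10x) figure: single-stream decoding is approximately memory-bandwidth bound, so tokens/sec is on the order of usable bandwidth divided by the bytes of weights touched per token. A back-of-the-envelope sketch with illustrative numbers:

```rust
/// Back-of-the-envelope: decoding is roughly memory-bandwidth bound,
/// so tokens/sec ~= usable bandwidth / bytes read per token (about the
/// size of the quantized weights). Illustrative numbers only.
fn tokens_per_sec(bandwidth_gb_s: f64, model_gb: f64) -> f64 {
    bandwidth_gb_s / model_gb
}

fn main() {
    let model_gb = 40.0; // e.g. a 70B model at ~4.5 bits per weight
    // A desktop's dual-channel DDR RAM vs. one modest GPU's VRAM.
    println!("CPU/RAM  (40 GB/s):  {:.1} tok/s", tokens_per_sec(40.0, model_gb));
    println!("GPU/VRAM (500 GB/s): {:.1} tok/s", tokens_per_sec(500.0, model_gb));
    // ~12.5x, consistent with the O(10x) claim above (ignoring overheads).
}
```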