Tensor parallelism is implemented with torch.dist; pipeline parallelism is implemented with rpc.
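As a rough illustration of the tensor-parallel half of that design, here is a minimal sketch of a row-parallel linear layer whose per-rank partial outputs are summed with `torch.distributed.all_reduce`. The class name and setup are assumptions for illustration, not this project's actual code; the pipeline-parallel half instead passes activations between stages over RPC.

```python
# Minimal tensor-parallel sketch: each rank holds a slice of a linear layer's
# weight, and the per-rank partial results are summed with all_reduce.
# Assumes dist.init_process_group() has already been called (e.g. via torchrun).
# `RowParallelLinear` is an illustrative name, not a class from this repo.
import torch
import torch.distributed as dist
import torch.nn as nn

class RowParallelLinear(nn.Module):
    """Each rank owns in_features // world_size input columns of the weight."""

    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        world_size = dist.get_world_size()
        assert in_features % world_size == 0
        self.linear = nn.Linear(in_features // world_size, out_features, bias=False)

    def forward(self, x_shard: torch.Tensor) -> torch.Tensor:
        # x_shard: this rank's slice of the activation,
        # shape (..., in_features // world_size).
        partial = self.linear(x_shard)
        # Sum the per-rank partial outputs so every rank gets the full result.
        dist.all_reduce(partial, op=dist.ReduceOp.SUM)
        return partial
```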
- Speed Up
  - Merge Linear
  - Pipeline Parallel by gRPC
  - Tensor Parallel by torch.dist
  - Sequence KV Cache
  - Performance Testing
  - Attention
    - SDPA
    - xformers
    - flash_attention
- Decoding Strategy (see the sampling sketch after this list)
  - Top-K Sampling
  - Top-P Sampling
  - Temperature Sampling
- Model
  - LLM
    - LLaMA
    - Qwen2
  - Multi-Modal
    - Qwen2-VL
- LLM
  - MLX Framework
    - With Torch Inference
      - Some bugs with multiple requests
    - Quantization
    - MLX Server
    - LoRA Training
      - With Torch Inference
- Web UI
  - Node Status
    - Display Multiple Models
  - ChatWeb Demo by Gradio
    - Parameters
    - System
    - Button
    - Node Status
- Backend
  - OpenAI API format
    - Streaming Output
    - chat completion (stream)
    - chat completion (non-stream)
    - using AnythingLLM
  - Client Sends URL and Port
  - Auto Layer Split (see the split sketch after this list)
    - get free layer idx
    - fix split layer pipeline
    - calculate layer memory and recommend split
    - split model before load
  - Async Generation
    - Multi-Sequence Batch=1
    - Queuing mechanism
    - Continuous Batch
    - Test Cases
    - Client Disconnect and Abort
      - await Event
  - Communication
    - Communication Time Benchmark
    - Async gRPC
    - Ring Communication
  - Auto Find Node
    - WebSocket Communication
    - Client Retry Connect
    - Client auto-updates URL
  - Master Exit
- KV Cache
  - Request/Sequence Cache
  - Custom KV Cache Class
  - Conversation KV Cache (in progress)
  - Token-Level Cache
    - Prefix-tree Cache
- Shard Storage
  - Auto Download
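The three decoding strategies listed above compose naturally into a single sampling function. Below is a minimal sketch in the standard formulation, assuming 1-D logits; the function name is illustrative, and this is not necessarily this repo's exact implementation.

```python
import torch

def sample_next_token(logits: torch.Tensor, temperature: float = 1.0,
                      top_k: int = 0, top_p: float = 1.0) -> int:
    """Sample a token id from 1-D logits using temperature, top-k, and top-p."""
    # Temperature: rescale logits; <1 sharpens, >1 flattens the distribution.
    logits = logits / max(temperature, 1e-5)

    # Top-K: keep only the k largest logits, mask the rest to -inf.
    if top_k > 0:
        kth_value = torch.topk(logits, top_k).values[-1]
        logits = logits.masked_fill(logits < kth_value, float("-inf"))

    # Top-P (nucleus): keep the smallest prefix of the sorted distribution
    # whose cumulative probability exceeds p.
    if top_p < 1.0:
        sorted_logits, sorted_idx = torch.sort(logits, descending=True)
        cum_probs = torch.softmax(sorted_logits, dim=-1).cumsum(dim=-1)
        remove = cum_probs > top_p
        remove[1:] = remove[:-1].clone()  # shift right: always keep the top token
        remove[0] = False
        logits[sorted_idx[remove]] = float("-inf")

    probs = torch.softmax(logits, dim=-1)
    return int(torch.multinomial(probs, num_samples=1))

# Usage: next_id = sample_next_token(logits, temperature=0.8, top_k=50, top_p=0.9)
```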
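For the "calculate layer memory and recommend split" item, the core idea can be sketched as dividing the decoder layers among clients in proportion to their reported free memory. The function name and inputs here are hypothetical, not the project's actual API.

```python
def recommend_split(num_layers: int, free_memory: list[int]) -> list[tuple[int, int]]:
    """Return one (start_layer, end_layer) range per client, proportional to free memory."""
    total = sum(free_memory)
    splits, start = [], 0
    for i, mem in enumerate(free_memory):
        if i == len(free_memory) - 1:
            end = num_layers  # last client takes whatever remains
        else:
            end = start + round(num_layers * mem / total)
        splits.append((start, end))
        start = end
    return splits

# Example: 32 layers across clients with 16 GB, 8 GB, and 8 GB free
# -> [(0, 16), (16, 24), (24, 32)]
```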
How the Master and Client interact over HTTP (sketched below):

- The Master starts first; it already knows the model name and the number of layers.
- The Client starts its gRPC service and sends its reachable address to the Master over HTTP (TODO: also report memory/VRAM size, compute capability, etc.).
- The Master returns the model name and the assigned start and end layer indices (a synchronous operation, no state needed).
- The Client downloads and loads the model, then sends an InitModel message to the Master to signal completion.
- After that, the Master periodically sends heartbeats to the Client to make sure the connection stays healthy.
  - If the Master restarts, it loses all Client information.
  - The Client runs its own periodic heartbeat check and reconnects with its existing state.
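The handshake above, sketched as client-side HTTP calls. The endpoint names and payload fields are assumptions for illustration; the real routes and schemas may differ.

```python
import requests

MASTER_URL = "http://127.0.0.1:8000"  # assumed address

# 1. Client starts its gRPC service, then tells the Master how to reach it.
resp = requests.post(f"{MASTER_URL}/register", json={
    "grpc_addr": "192.168.1.10:50051",
    # TODO per the notes above: memory / VRAM size / compute capability
}).json()

# 2. The Master replies synchronously with the model name and layer range.
model_name = resp["model_name"]
start_layer, end_layer = resp["start_layer"], resp["end_layer"]

# 3. Client downloads and loads its shard of layers [start_layer, end_layer),
#    then reports InitModel completion so the Master can route requests to it.
requests.post(f"{MASTER_URL}/init_model", json={"status": "ok"})

# 4. From here on, the Master pings the Client periodically (heartbeat); if the
#    Master restarts and loses state, the Client's own heartbeat check notices
#    and re-registers with its existing layer assignment.
```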
- Remove torch dependency