RoadMap

Tensor parallelism is implemented with torch.dist; pipeline parallelism is implemented with RPC.

  • Speed Up
    • Merge Linear
    • Pipeline Parallel by gRPC
    • Tensor Parallel by torch.dist
    • Sequence KV Cache
    • Performance Testing
    • Attention
      • SDPA
      • xformers
      • flash_attention
  • Decoding Strategy
    • Top-K Sampling
    • Top-P Sampling
    • Temperature Sampling
  • Model
    • LLM
      • LLaMA
      • Qwen2
    • Multi-Modal
      • Qwen2-VL
  • MLX Framework
    • With Torch Inference
      • Some bugs with multi requests
    • Quantization
    • MLX Server
    • LoRA Training
  • Web UI
    • Node Status
      • Display Multi Model
    • ChatWeb Demo by Gradio
      • Parameters
      • System
      • Button
  • Backend
    • OpenAI API format
      • Streaming Output
      • chat completion (stream)
      • chat completion (non-stream)
      • using AnythingLLM
    • Client Send URL and Port
    • Auto Layer Split
      • get free layer idx
      • fix split layer pipeline
      • calculate layer memory and recommend split
      • split model before load
    • Async Generation
      • Multi-Sequence Batch=1
      • Queuing mechanism
      • Continuous Batch
      • Test Cases
      • Client Disconnect and Abort
      • await Event
    • Communication
      • Communication Time Benchmark
      • Async gRPC
      • Ring Communication
    • Auto Find Node
      • WebSocket Communication
      • Client Retry Connect
      • Client auto update URL
      • Master Exit
  • KV Cache
    • Request/Sequence Cache
    • Custom KV Cache Class
    • Conversation KV Cache (in progress)
    • Token-Level Cache
      • Prefix-tree Cache
  • Shard Storage
  • Auto Download
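
The three Decoding Strategy items in the list above (temperature, Top-K, Top-P) are usually composed into a single filtering step before sampling. The following is a minimal pure-Python sketch of that composition, not this project's implementation: the function names `filter_probs` and `sample` are hypothetical, and applying Top-P to the probabilities computed before Top-K truncation is an assumption (libraries differ on this ordering).

```python
import math
import random
from typing import List, Optional


def filter_probs(logits: List[float], temperature: float = 1.0,
                 top_k: int = 0, top_p: float = 1.0) -> List[float]:
    """Return a renormalized distribution after temperature, top-k and top-p."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")

    # Temperature scaling followed by a numerically stable softmax.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Rank token ids from most to least probable.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)

    # Top-k: keep only the k most probable tokens (0 disables the filter).
    if top_k > 0:
        order = order[:top_k]

    # Top-p: keep the smallest prefix whose cumulative mass reaches top_p.
    kept = []
    cumulative = 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Renormalize over the surviving tokens; all others get probability 0.
    mass = sum(probs[i] for i in kept)
    out = [0.0] * len(probs)
    for i in kept:
        out[i] = probs[i] / mass
    return out


def sample(logits: List[float], temperature: float = 1.0, top_k: int = 0,
           top_p: float = 1.0, rng: Optional[random.Random] = None) -> int:
    """Draw one token id from the filtered distribution."""
    rng = rng or random.Random()
    probs = filter_probs(logits, temperature, top_k, top_p)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]
```

Setting `top_k=1` reduces the sampler to greedy decoding, and lowering `temperature` concentrates probability mass on the highest-scoring token.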

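Master and Client interaction over HTTP

  • Master starts first and already knows the model name and the number of layers
    • Client starts its gRPC service and sends the address it can be reached at to Master over HTTP (TODO: also send memory / GPU memory size / compute capability)

    • Master returns the model name and the assigned start and end layers (a synchronous operation; no state is required)

    • Client downloads the model, loads it, and sends an InitModel message to Master to report completion

    • After that, Master periodically sends heartbeat packets to Client to make sure the connection is healthy

  • If Master restarts, it loses all Client information
    • Client runs a periodic heartbeat check and reconnects carrying its existing state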
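
The handshake and the restart recovery described above can be sketched as an in-memory simulation. This is an illustrative sketch under stated assumptions rather than the project's code: the class and method names (`Master.register`, `init_model_done`, `knows`, `Client.check_master`) are hypothetical, direct method calls stand in for the HTTP requests, layer ranges are assumed to be half-open `[start, end)`, and a fixed `layers_per_client` stands in for whatever split policy the Master actually uses.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class ClientInfo:
    address: str              # gRPC address the Client reported
    layers: Tuple[int, int]   # assigned layer range, assumed half-open
    ready: bool = False       # True once InitModel has been reported


@dataclass
class Master:
    model_name: str
    num_layers: int
    layers_per_client: int    # hypothetical fixed split size
    clients: Dict[str, ClientInfo] = field(default_factory=dict)

    def register(self, address: str,
                 known_layers: Optional[Tuple[int, int]] = None) -> dict:
        """Register a Client and return the model name and its layer range."""
        if known_layers is not None:
            # Reconnect after a Master restart: accept the Client's state.
            layers, ready = known_layers, True
        else:
            start = max((c.layers[1] for c in self.clients.values()), default=0)
            if start >= self.num_layers:
                raise RuntimeError("all layers are already assigned")
            end = min(start + self.layers_per_client, self.num_layers)
            layers, ready = (start, end), False
        self.clients[address] = ClientInfo(address, layers, ready)
        return {"model_name": self.model_name,
                "start_layer": layers[0], "end_layer": layers[1]}

    def init_model_done(self, address: str) -> None:
        """Record that the Client finished downloading and loading."""
        self.clients[address].ready = True

    def knows(self, address: str) -> bool:
        return address in self.clients


@dataclass
class Client:
    address: str
    layers: Optional[Tuple[int, int]] = None

    def connect(self, master: Master) -> None:
        reply = master.register(self.address, self.layers)
        self.layers = (reply["start_layer"], reply["end_layer"])
        # Downloading and loading the model would happen here.
        master.init_model_done(self.address)

    def check_master(self, master: Master) -> None:
        """Periodic check: re-register with existing state if forgotten."""
        if not master.knows(self.address):
            self.connect(master)
```

Because each Client keeps its own layer range, a restarted Master can rebuild its table from the reconnecting Clients without persisting anything itself.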

remove torch dependency