Tensor parallelism is implemented with torch.dist; pipeline parallelism is implemented with rpc.
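As a rough illustration of the tensor-parallel half of that design, here is a minimal sketch of a row-parallel linear layer whose per-rank partial outputs are summed with `torch.distributed.all_reduce`. The class name and setup are assumptions for illustration, not this project's actual code; the pipeline-parallel half instead passes activations between stages over RPC.

```python
# Minimal tensor-parallel sketch: each rank holds a slice of a linear layer's
# weight, and the per-rank partial results are summed with all_reduce.
# Assumes dist.init_process_group() has already been called (e.g. via torchrun).
# `RowParallelLinear` is an illustrative name, not a class from this repo.
import torch
import torch.distributed as dist
import torch.nn as nn

class RowParallelLinear(nn.Module):
    """Each rank owns in_features // world_size input columns of the weight."""

    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        world_size = dist.get_world_size()
        assert in_features % world_size == 0
        self.linear = nn.Linear(in_features // world_size, out_features, bias=False)

    def forward(self, x_shard: torch.Tensor) -> torch.Tensor:
        # x_shard: this rank's slice of the activation,
        # shape (..., in_features // world_size).
        partial = self.linear(x_shard)
        # Sum the per-rank partial outputs so every rank gets the full result.
        dist.all_reduce(partial, op=dist.ReduceOp.SUM)
        return partial
```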
- Speed Up
  - Merge Linear
  - Pipeline Parallel by gRPC
  - Tensor Parallel by torch.dist
  - Sequence KV Cache
  - Performance Testing
  - Attention
    - SDPA
    - xformers
    - flash_attention
- Decoding Strategy (see the sampling sketch after this list)
  - Top-K Sampling
  - Top-P Sampling
  - Temperature Sampling
- Model
  - LLM
    - LLaMA
    - Qwen2
  - Multi-Modal
    - Qwen2-VL
- LLM
  - MLX Framework
    - With Torch Inference
      - Some bugs with multiple requests
    - Quantization
    - MLX Server
    - LoRA Training
      - With Torch Inference
- Web UI
  - Node Status
    - Display Multiple Models
  - ChatWeb Demo by Gradio
    - Parameters
    - System
    - Button
    - Node Status
- Backend
  - OpenAI API format
    - Streaming Output
    - chat completion (stream)
    - chat completion (non-stream)
    - using AnythingLLM
  - Client Sends URL and Port
  - Auto Layer Split (see the split sketch after this list)
    - get free layer idx
    - fix split layer pipeline
    - calculate layer memory and recommend split
    - split model before load
  - Async Generation
    - Multi-Sequence Batch=1
    - Queuing mechanism
    - Continuous Batch
    - Test Cases
    - Client Disconnect and Abort
      - await Event
  - Communication
    - Communication Time Benchmark
    - Async gRPC
    - Ring Communication
  - Auto Find Node
    - WebSocket Communication
    - Client Retry Connect
    - Client auto-updates URL
  - Master Exit
- KV Cache
  - Request/Sequence Cache
  - Custom KV Cache Class
  - Conversation KV Cache (in progress)
  - Token-Level Cache
    - Prefix-tree Cache
- Shard Storage
  - Auto Download
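The three decoding strategies listed above compose naturally into a single sampling function. Below is a minimal sketch in the standard formulation, assuming 1-D logits; the function name is illustrative, and this is not necessarily this repo's exact implementation.

```python
import torch

def sample_next_token(logits: torch.Tensor, temperature: float = 1.0,
                      top_k: int = 0, top_p: float = 1.0) -> int:
    """Sample a token id from 1-D logits using temperature, top-k, and top-p."""
    # Temperature: rescale logits; <1 sharpens, >1 flattens the distribution.
    logits = logits / max(temperature, 1e-5)

    # Top-K: keep only the k largest logits, mask the rest to -inf.
    if top_k > 0:
        kth_value = torch.topk(logits, top_k).values[-1]
        logits = logits.masked_fill(logits < kth_value, float("-inf"))

    # Top-P (nucleus): keep the smallest prefix of the sorted distribution
    # whose cumulative probability exceeds p.
    if top_p < 1.0:
        sorted_logits, sorted_idx = torch.sort(logits, descending=True)
        cum_probs = torch.softmax(sorted_logits, dim=-1).cumsum(dim=-1)
        remove = cum_probs > top_p
        remove[1:] = remove[:-1].clone()  # shift right: always keep the top token
        remove[0] = False
        logits[sorted_idx[remove]] = float("-inf")

    probs = torch.softmax(logits, dim=-1)
    return int(torch.multinomial(probs, num_samples=1))

# Usage: next_id = sample_next_token(logits, temperature=0.8, top_k=50, top_p=0.9)
```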
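For the "calculate layer memory and recommend split" item, the core idea can be sketched as dividing the decoder layers among clients in proportion to their reported free memory. The function name and inputs here are hypothetical, not the project's actual API.

```python
def recommend_split(num_layers: int, free_memory: list[int]) -> list[tuple[int, int]]:
    """Return one (start_layer, end_layer) range per client, proportional to free memory."""
    total = sum(free_memory)
    splits, start = [], 0
    for i, mem in enumerate(free_memory):
        if i == len(free_memory) - 1:
            end = num_layers  # last client takes whatever remains
        else:
            end = start + round(num_layers * mem / total)
        splits.append((start, end))
        start = end
    return splits

# Example: 32 layers across clients with 16 GB, 8 GB, and 8 GB free
# -> [(0, 16), (16, 24), (24, 32)]
```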
How the Master and Client interact over HTTP (sketched below):

- The Master starts first; it already knows the model name and the number of layers.
- The Client starts its gRPC service and sends its reachable address to the Master over HTTP (TODO: also report memory/VRAM size, compute capability, etc.).
- The Master returns the model name and the assigned start and end layer indices (a synchronous operation, no state needed).
- The Client downloads and loads the model, then sends an InitModel message to the Master to signal completion.
- After that, the Master periodically sends heartbeats to the Client to make sure the connection stays healthy.
  - If the Master restarts, it loses all Client information.
  - The Client runs its own periodic heartbeat check and reconnects with its existing state.
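The handshake above, sketched as client-side HTTP calls. The endpoint names and payload fields are assumptions for illustration; the real routes and schemas may differ.

```python
import requests

MASTER_URL = "http://127.0.0.1:8000"  # assumed address

# 1. Client starts its gRPC service, then tells the Master how to reach it.
resp = requests.post(f"{MASTER_URL}/register", json={
    "grpc_addr": "192.168.1.10:50051",
    # TODO per the notes above: memory / VRAM size / compute capability
}).json()

# 2. The Master replies synchronously with the model name and layer range.
model_name = resp["model_name"]
start_layer, end_layer = resp["start_layer"], resp["end_layer"]

# 3. Client downloads and loads its shard of layers [start_layer, end_layer),
#    then reports InitModel completion so the Master can route requests to it.
requests.post(f"{MASTER_URL}/init_model", json={"status": "ok"})

# 4. From here on, the Master pings the Client periodically (heartbeat); if the
#    Master restarts and loses state, the Client's own heartbeat check notices
#    and re-registers with its existing layer assignment.
```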
- Remove torch dependency