Where is the implementation of DP RadixAttention described in the paper? #1602
Closed
Misaka9468
started this conversation in
General
Replies: 2 comments 3 replies
-
By the way, I can only find the round-robin and shortest-queue methods in SGLang v0.3.2.
-
We had an experimental implementation, but we did not upstream it.
-
In Section A.4 of the SGLang paper (https://arxiv.org/pdf/2312.07104):
"To adapt the RadixAttention for distributed settings with multiple replica workers (i.e., data parallelism), we developed a mechanism wherein each worker maintains its own sub-tree, while the router oversees a meta-tree. This meta-tree acts as a trie that tracks all sub-trees and their associated devices. Upon the arrival of a new batch of requests at the router, prefix matching is executed on the meta-tree. We implement various policies based on each request’s affinity—measured by the length of the shared prefix with specific workers and other requests in the same group—to make efficient dispatch decisions that minimize redundant computations. Each time new requests are processed, both the router and workers independently update their respective trees. Should an eviction occur at a worker node, it commits this eviction to a queue, which the router then processes to update the meta-tree during periods of low activity. We benchmarked this distributed configuration using four workers and the MMLU dataset, observing that it achieves linear scaling and an optimal cache hit rate with minimal overhead from this weakly consistent distributed cache design. There exists a trade-off between maximizing data locality and parallel processing efficiency. Exploring advanced scheduling policies to optimize this trade-off is designated as an area for future research. In addition, concurrent work from Preble [45] studies data-parallel scheduling based on an early version of SGLang."
However, I cannot find any code for this implementation in SGLang.
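To make the quoted mechanism concrete, here is a minimal sketch of how I read it: the router keeps a meta-tree (a trie over token IDs) that records which workers are believed to cache each prefix, dispatches each request to the worker with the longest shared prefix, and applies worker evictions lazily from a queue. This is not SGLang code. The names `MetaNode`, `PrefixAwareRouter`, `dispatch`, `report_eviction`, and `drain_evictions` are all hypothetical, and the tie-breaking rule (lightest load, then lowest worker ID) is my own assumption rather than a policy from the paper.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class MetaNode:
    # Hypothetical meta-tree node: one node per token along a cached prefix.
    children: dict = field(default_factory=dict)  # token id -> MetaNode
    workers: set = field(default_factory=set)     # workers believed to cache this prefix


class PrefixAwareRouter:
    """Illustrative router only; not the unreleased SGLang implementation."""

    def __init__(self, num_workers):
        self.root = MetaNode()
        self.load = [0] * num_workers  # in-flight requests per worker
        self.evictions = deque()       # pending (worker, evicted prefix) notices

    def match_lengths(self, tokens):
        # For each worker, the length of the longest prefix of `tokens`
        # that the meta-tree believes the worker has cached.
        lengths = {}
        node = self.root
        for depth, tok in enumerate(tokens, start=1):
            node = node.children.get(tok)
            if node is None:
                break
            for w in node.workers:
                lengths[w] = depth
        return lengths

    def dispatch(self, tokens):
        lengths = self.match_lengths(tokens)
        # Assumed policy: longest shared prefix, then lightest load, then lowest id.
        best = min(range(len(self.load)),
                   key=lambda w: (-lengths.get(w, 0), self.load[w], w))
        # Record that the chosen worker will now cache this whole sequence.
        node = self.root
        for tok in tokens:
            node = node.children.setdefault(tok, MetaNode())
            node.workers.add(best)
        self.load[best] += 1
        return best

    def finish(self, worker):
        # Called when a request completes on `worker`.
        self.load[worker] -= 1

    def report_eviction(self, worker, prefix):
        # A worker commits an eviction; the meta-tree is stale until drained.
        self.evictions.append((worker, tuple(prefix)))

    def drain_evictions(self):
        # Meant to run during periods of low activity: the worker no longer
        # caches `prefix` or any sequence that extends it.
        while self.evictions:
            worker, prefix = self.evictions.popleft()
            node = self.root
            for tok in prefix:
                node = node.children.get(tok)
                if node is None:
                    break
            else:
                stack = [node]
                while stack:
                    n = stack.pop()
                    n.workers.discard(worker)
                    stack.extend(n.children.values())


router = PrefixAwareRouter(num_workers=2)
first = router.dispatch([1, 2, 3])   # no match anywhere, goes to the idle worker 0
second = router.dispatch([9, 9])     # no match, worker 1 has less load
third = router.dispatch([1, 2, 4])   # shares prefix [1, 2] with worker 0
```

Between `report_eviction` and `drain_evictions` the meta-tree can overestimate what a worker holds, which is how I understand the "weakly consistent" wording in the quote: a stale entry costs a cache miss, not correctness.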