
Aciddelgado/continuous #867

Draft

wants to merge 63 commits into main
Conversation

aciddelgado
Contributor

No description provided.

BowenBao and others added 21 commits August 2, 2024 16:53
Results are validated with model-generate.py by using an int4 quantized
model as the original model's assistant. The output sequence is identical
and increased tokens per second (tps) is observed.

NOTE: Only MHA decoder-only models, batch size 1, CPU, and greedy (select top)
sampling are supported in this initial version. GQA needs microsoft/onnxruntime#21523
to support seqlen > 1 in the token phase.

* Updated builder.py to produce an MHA graph that supports seqlen > 1
  in the token phase.
* Introduced speculative decoding, currently through a separate Generator
  class (see the sketch after this list). This could potentially be merged
  with the existing Generator at either the API level or the implementation
  level.
* Extended various components to support speculative search. Previously,
  most methods were hardcoded to assume seqlen == 1 in the token phase.
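The changes above implement the standard greedy draft-and-verify loop for speculative decoding. The sketch below only illustrates that loop under this PR's stated constraints (batch size 1, greedy select top); the `speculative_generate` function, the `draft_next`/`target_next` callables, and the `num_draft_tokens` parameter are assumptions made for the example and are not the API introduced by this PR.

```python
# Illustrative sketch of greedy speculative decoding (batch size 1, greedy select top).
# The callables below are hypothetical stand-ins for the assistant and original models.
from typing import Callable, List

def speculative_generate(
    draft_next: Callable[[List[int]], int],   # assistant (draft) model: greedy next token
    target_next: Callable[[List[int]], int],  # original (target) model: greedy next token
    prompt: List[int],
    max_new_tokens: int = 32,
    num_draft_tokens: int = 4,
    eos_id: int = 0,
) -> List[int]:
    tokens = list(prompt)
    generated = 0
    while generated < max_new_tokens and tokens[-1] != eos_id:
        # 1) Draft phase: the assistant proposes a short continuation.
        draft = []
        ctx = list(tokens)
        for _ in range(num_draft_tokens):
            t = draft_next(ctx)
            draft.append(t)
            ctx.append(t)
        # 2) Verify phase: the target checks the drafted positions and accepts the
        #    longest prefix whose greedy choices agree with the draft; its own token
        #    is emitted at the first disagreement.
        accepted = list(tokens)
        for t in draft:
            target_t = target_next(accepted)
            accepted.append(target_t)
            generated += 1
            if target_t != t or target_t == eos_id or generated >= max_new_tokens:
                break
        else:
            # All draft tokens accepted: the target contributes one bonus token.
            if generated < max_new_tokens:
                accepted.append(target_next(accepted))
                generated += 1
        tokens = accepted
    return tokens

if __name__ == "__main__":
    # Toy models: both emit an arithmetic sequence, so every draft token is accepted.
    toy = lambda ids: (ids[-1] + 1) % 100
    print(speculative_generate(toy, toy, prompt=[1], max_new_tokens=10))
```

Because the verify phase feeds several drafted tokens to the original model in one step, the token phase must accept seqlen > 1, which is why builder.py is updated and why GQA needs microsoft/onnxruntime#21523.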