Continuous decoding feature #858
Replies: 6 comments 1 reply
-
Could you please share more details on this requirement? Do you want the generator object to serve multiple inputs instead of one? If so, we are adding it.
-
@yufenglee Yes, if possible, I would like to modify the input of the generator before the next prediction without having to recreate the generator for every input modification. Would that be possible with what is currently being developed?
-
I'm guessing the idea is that if you're having a chat with the LLM, you want to append the conversation so far to the input and add a new question.
-
@WA225 and @elephantpanda, yes, we are working on the continuous decoding feature, which will allow you to do this. The feature will be available in the next release.
-
Modified the title to continuous decoding.
-
We are actively working on continuous decoding and a PR is coming soon.
-
Describe the bug
I am wondering if there is an API (C or Python, preferably Python) that allows us to modify the generator input without needing to recreate the generator.
I do not see any API that can do that in the documentation, but it would be really helpful to have it.
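To make the request concrete, here is a minimal sketch of what such a continuous-decoding interface could look like. All names here (`ContinuousGenerator`, `append_tokens`, `generate`) are hypothetical illustrations of the pattern, not the actual onnxruntime-genai API; the key idea is that the generator keeps its decoding state alive across turns, so new input tokens can be appended without rebuilding the generator.

```python
# Hypothetical sketch of a continuous-decoding interface.
# Names are illustrative, NOT the real onnxruntime-genai API.

class ContinuousGenerator:
    def __init__(self):
        # Persistent decoding state (stands in for a model's KV cache),
        # reused across every turn instead of being rebuilt per input.
        self.kv_cache = []

    def append_tokens(self, tokens):
        # Extend the existing context in place; the generator itself
        # is not recreated between user turns.
        self.kv_cache.extend(tokens)

    def generate(self, n):
        # Toy "decoding": echoes the last n context tokens and records
        # them back into the cache. A real model would run inference
        # against the cached state here.
        out = self.kv_cache[-n:]
        self.kv_cache.extend(out)
        return out

gen = ContinuousGenerator()
gen.append_tokens([1, 2, 3])   # first user turn
first = gen.generate(2)        # -> [2, 3]
gen.append_tokens([7, 8])      # follow-up turn, same generator object
second = gen.generate(2)       # -> [7, 8]
```

The point of the sketch is the call pattern: `append_tokens` then `generate` can alternate indefinitely on one generator object, which is what the multi-turn chat scenario above needs.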