Add ability to delete tokens (undo feed) #334
Conversation
Good job, I read through it and it looks good. Could you provide a link to your demo program? I would like to play around a bit with it.
Sure! Here's the demo code: https://gist.github.com/steventrouble/47cca1306bf3a41c4f62dc698d71f5af. I extracted it rather hastily from my project crate, but I think it should compile fine. Hopefully I didn't remove too much functionality when I yanked it out, and hopefully it's somewhat comprehensible. It's a little late over here 😅
Hi there! Great work on this; this is a very useful feature to have. The only thing I'm not sure about is the interaction with the model's memory tensors. Do the logits match with no further user intervention? My understanding is that the process of feeding prompts/running inference will mutate the memory tensors, so the logits will not match. (Feel free to correct me if I'm wrong - it's very hard to get visibility into exactly what's happening with the tensors, so I have to resort to conjecture.)
You bring up a great point. There's not a lot of visibility into the buffers. I verified with a hex editor that GptNeoX only modifies the memory corresponding to the latest token, but I'd be happy to do more tests. Do you have ggml copies of tiny models for each of the model types? That would save me a lot of time when testing. Also, I'm not sure what the policy is on integration tests, but I'd love to be able to submit some kind of repeatable test so others can verify it works on other hardware.
@steventrouble Currently we don't have integration tests, but we sure would like to have some :D (see #319). The problem is that we don't have miniature versions of our models, which makes testing with the original models kinda difficult.
@LLukas22 Not sure how much progress you've made on this already, but I can give you mock models that compress to about 2MB each. Does that work for your needs? I set most of the model weights and biases to 0, then compress it with gzip. We should be able to do this with any model. Here's an example with GptNeoX: https://github.com/steventrouble/llm/tree/smoke/crates/llm/tests
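For reference, producing such an archive is just ordinary gzip compression: zeroed weights compress extremely well, which is how a multi-gigabyte model shrinks to a couple of megabytes. A minimal sketch using the `flate2` crate (the file names are placeholders, and the weight-zeroing step is assumed to have happened already):

```rust
use flate2::{write::GzEncoder, Compression};
use std::{fs::File, io};

fn main() -> io::Result<()> {
    // A model file whose weights/biases have mostly been set to zero.
    let mut input = File::open("mock-gptneox.bin")?;
    let output = File::create("mock-gptneox.bin.gz")?;

    // Long runs of zero bytes compress to almost nothing.
    let mut encoder = GzEncoder::new(output, Compression::best());
    io::copy(&mut input, &mut encoder)?;
    encoder.finish()?;
    Ok(())
}
```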
Awesome! I saw your integration tests PR 😄 I wrote a basic integration test, but I'll hold off on uploading until your PR is stable. Then I'll update my PR to add the test in. Ping me when I can pick this up again!
Uhm, I don't know how these zeroed-out models will behave, but won't they always output the same results, since they'll zero out the input after the first matmul?
A few of the neurons will still have values, so it'll output random gibberish. It's mostly useful as a lightweight smoke test, to make sure the tensor loader still works and that the compute graph doesn't crash at inference time.
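For intuition, here is a toy illustration of both points above (pure illustration, not model code): with a zeroed weight matrix the matmul contributes nothing, so the output is the same for every input, but it isn't all-zero, because the few remaining non-zero parameters (biases, norms) still carry through.

```rust
// Toy matmul: with w all zeros, the result is zero for any input x.
fn matmul(w: &[Vec<f32>], x: &[f32]) -> Vec<f32> {
    w.iter()
        .map(|row| row.iter().zip(x).map(|(wi, xi)| wi * xi).sum())
        .collect()
}

fn main() {
    let w = vec![vec![0.0f32; 4]; 4]; // zeroed weights, as in the mock models
    let bias = [0.1f32, -0.2, 0.3, 0.0]; // a few non-zero parameters remain
    for x in [[1.0f32, 2.0, 3.0, 4.0], [9.0, 8.0, 7.0, 6.0]] {
        let y: Vec<f32> = matmul(&w, &x)
            .iter()
            .zip(&bias)
            .map(|(yi, bi)| yi + bi)
            .collect();
        println!("{y:?}"); // same output for every input: [0.1, -0.2, 0.3, 0.0]
    }
}
```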
Maybe we could use these tiny models; currently we're just using the smallest available models for each architecture 🤷 But honestly, we can just use the original models, as the GitHub Actions runners can download them in about 30 seconds 😅 If you want, you could pull the newest main into this branch and try to create an integration test for the token deletion 👍
Sweet! I'm on it 😃
Helps newbies like me set up the repo correctly.
Alright, all tested. Deleting tokens works on all the models! (except llama 😢)
Hm, that's strange. Are you recalculating the K/V memory, or are you removing the last n entries from it?
@LLukas22 Neither, I leave the memory unchanged. The old values are still in the memory, but it's fine because that memory is never read. E.g. in GPT-J, accesses to `memory_k` and `memory_v` only cover the current session length (see llm/crates/models/gptj/src/lib.rs, line 185 at 7d6eee3).
P.S.: I accidentally clicked "close issue", sorry 😅 Could you re-open the PR?
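To make that mechanism concrete, here is a minimal hypothetical sketch of the bookkeeping (the `Session` struct and `delete_tokens` name are illustrative stand-ins, not the crate's actual types): deletion shrinks the decoded-token list and the past-token count, and leaves the K/V tensors alone.

```rust
// Illustrative only: a toy stand-in for the real inference session.
struct Session {
    decoded_tokens: Vec<u32>, // tokens fed so far
    n_past: usize,            // how many K/V slots are considered valid
}

impl Session {
    fn delete_tokens(&mut self, num: usize) -> Result<(), String> {
        if num > self.n_past {
            return Err("cannot delete more tokens than were fed".into());
        }
        // The K/V memory itself is left untouched: attention only reads
        // slots 0..n_past, and feeding new tokens overwrites the rest.
        self.decoded_tokens.truncate(self.decoded_tokens.len() - num);
        self.n_past -= num;
        Ok(())
    }
}

fn main() {
    let mut s = Session { decoded_tokens: vec![5, 7, 11, 13], n_past: 4 };
    s.delete_tokens(2).unwrap();
    assert_eq!((s.decoded_tokens.len(), s.n_past), (2, 2));
}
```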
TBH, I'm not sure why it doesn't work for llama. I've looked over the code, and all accesses to `memory_k` and `memory_v` seem to use `session_len`. Is it possible it's storing state in the scratch buffers, or statically? Or that there's some subtle bug in one of the `memory_k`/`memory_v` accesses?
Maybe it has something to do with the scratch buffers. If I find some time, I'll give it a look and check why the results diverge.
Aha! Found the issue. The test was wrong, and the model works fine with delete. I used a debugger to verify that it only touches the section of memory corresponding to the current token, and that the scratch buffers don't (seem to) carry state between calls. The issue was with the test string itself 😄
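The test idea itself is worth sketching, as a hedged outline (the `TestModel` trait and its methods are placeholders, not the crate's real API): a session that deletes and re-feeds tokens should produce the same logits as a fresh session fed the final text directly, proving the stale K/V entries are never read.

```rust
// Placeholder trait standing in for a loadable model/session.
trait TestModel: Default {
    fn feed(&mut self, text: &str);
    fn delete_tokens(&mut self, num: usize);
    fn logits(&self) -> Vec<f32>;
}

fn check_delete_matches_fresh_session<M: TestModel>() {
    let mut edited = M::default();
    edited.feed("The quick brown fox");
    edited.delete_tokens(1); // assume "fox" tokenizes to a single token
    edited.feed("dog");      // session now represents "The quick brown dog"

    let mut fresh = M::default();
    fresh.feed("The quick brown dog");

    // If deletion leaked state, the logits would diverge here.
    assert_eq!(edited.logits(), fresh.logits());
}
```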
This'll help when I add my tests, because otherwise this file will explode in size.
Note that llama.json fails the tests, so it likely doesn't support token deletion. I may investigate further, though.
(Fixed a temp change I made while debugging. I'm still getting the hang of git again after 8 years of Perforce; thanks for your patience!)
Is this ready for a review?
Yup, it looks good for review. Thanks for your patience! Next time I contribute there will be a lot less back and forth.
LGTM, good job 👍
Maybe a few names/messages need to be changed 🤔
Looks great outside of what Lukas mentioned - ready to merge once those are fixed!
Fantastic work - thank you very much 🚀
The goal of this PR is to enable token feed undo. This is mostly useful for autocomplete functionality (e.g. keyboards or coding assistants), where the text changes often around the cursor's position.
Closes #332
Tested on `GptNeoX` by writing a demo program that uses the `delete_token` method to minimize model refreshes. I tested several use cases; the performance is pretty solid. Deleting 10 tokens takes less than a millisecond on my box.
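A hedged sketch of that autocomplete flow (the trait and method names below are made up for illustration; only the delete-tokens idea comes from this PR): when the user edits near the cursor, undo only the tokens after the edit point and feed the replacement, rather than rebuilding the whole session.

```rust
// Hypothetical trait standing in for the real session type; only the
// shape of the calls matters here, not the crate's actual API.
trait UndoableSession {
    fn delete_tokens(&mut self, num: usize);
    fn feed_text(&mut self, text: &str);
}

fn on_edit<S: UndoableSession>(session: &mut S, changed_tokens: usize, new_text: &str) {
    session.delete_tokens(changed_tokens); // cheap: bookkeeping only
    session.feed_text(new_text);           // only the new tokens are evaluated
}
```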
A potential issue with this PR is that you can't delete the start-of-content token, because `decoded_tokens.truncate` would fail. It might also fail on other special tokens, but I'm not sure how to trigger those as part of my testing.
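A sketch of the guard this implies (all names here are hypothetical): reject deletions that would remove the start-of-content token up front, rather than letting the truncate fail later.

```rust
// Hypothetical guard: `bot_token` is the start-of-content token id.
fn checked_delete(decoded_tokens: &mut Vec<u32>, num: usize, bot_token: u32) -> Result<(), String> {
    let keep = decoded_tokens
        .len()
        .checked_sub(num)
        .ok_or("cannot delete more tokens than were fed")?;
    if keep == 0 && decoded_tokens.first() == Some(&bot_token) {
        return Err("cannot delete the start-of-content token".into());
    }
    decoded_tokens.truncate(keep);
    Ok(())
}
```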