This repository has been archived by the owner on Jun 24, 2024. It is now read-only.

Add ability to delete tokens (undo feed) #334

Merged
5 commits merged into rustformers:main on Jul 9, 2023

Conversation

steventrouble
Contributor

@steventrouble steventrouble commented Jun 27, 2023

The goal of this PR is to enable token feed undo. This is mostly useful for autocomplete functionality (e.g. keyboards or coding assistants), where the text changes often around the cursor's position.

Closes #332

Tested on GptNeoX by writing a demo program that uses the delete_token method to minimize model refreshes. I tested these use cases:

  • Add 200 tokens, then delete one of them. Ensure topk predictions look the same as before the latest token.
  • Add 200 tokens, then delete all of them. Ensure topk predictions return to start of sentence.
  • Delete then add, add then delete, etc.
  • Used println!() extensively to verify it was deleting the expected tokens.

The performance is pretty solid. Deleting 10 tokens takes less than a millisecond on my box.

A potential issue with this PR is that you can't delete the start-of-content token, because decoded_tokens.truncate would fail. It might also fail on other special tokens, but I'm not sure how to trigger those in my testing.
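For reference, here's a rough standalone sketch of the undo idea and the start-of-content guard (toy types and names only, not the actual API added in this PR):

```rust
/// Toy stand-in for a session's token history; illustration only.
struct TokenBuffer {
    /// tokens[0] is assumed to be the start-of-content token.
    tokens: Vec<u32>,
}

#[derive(Debug)]
enum DeleteError {
    WouldRemoveStartToken,
}

impl TokenBuffer {
    /// Append freshly fed tokens.
    fn feed(&mut self, new_tokens: &[u32]) {
        self.tokens.extend_from_slice(new_tokens);
    }

    /// Undo the last `n` fed tokens, refusing to touch the start-of-content token.
    fn delete_tokens(&mut self, n: usize) -> Result<(), DeleteError> {
        if n >= self.tokens.len() {
            return Err(DeleteError::WouldRemoveStartToken);
        }
        self.tokens.truncate(self.tokens.len() - n);
        Ok(())
    }
}

fn main() {
    let mut buf = TokenBuffer { tokens: vec![1] }; // 1 = pretend start-of-content token
    buf.feed(&[10, 11, 12]);

    // Autocomplete-style churn: undo the most recent token, then feed a new one.
    buf.delete_tokens(1).unwrap();
    buf.feed(&[13]);
    assert_eq!(buf.tokens, vec![1, 10, 11, 13]);

    // Deleting everything would remove the start token, so it's rejected.
    assert!(buf.delete_tokens(4).is_err());
}
```

In the real session the token count the model sees (session_len / n_past) shrinks too, which is why the stale K/V memory is never read again; see the discussion further down in this thread.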

@LLukas22
Contributor

Good job, I read through it and it looks good. Could you provide a link to your demo program? I'd like to play around with it a bit.

@steventrouble
Contributor Author

Sure! Here's the demo code:

https://gist.github.com/steventrouble/47cca1306bf3a41c4f62dc698d71f5af

And you can run it with cargo run -- --model=my_model_path

I extracted it rather hastily from my project crate, but I think it should compile fine. Hopefully I didn't remove too much functionality when I yanked it out, and hopefully it's somewhat comprehensible. It's a little late over here 😅

@philpax
Collaborator

philpax commented Jun 28, 2023

Hi there! Great work on this; this is a very useful feature to have.

The only thing I'm not sure about is the interaction with the memory_k/memory_v tensors. If you do the following sequence of operations:

logits_0 = feed_prompt("blah")
logits_200 = infer_200(deterministic_sampler)

logits_0_new = rewind_200()
logits_200_new = infer_200(deterministic_sampler)

assert(logits_0 == logits_0_new)
assert(logits_200 == logits_200_new)

Do the logits match with no further user intervention? My understanding is that the process of feeding prompts/running inference will mutate the memory tensors, so the logits will not match. (Feel free to correct me if I'm wrong - it's very hard to get visibility into exactly what's happening with the tensors, so I have to resort to conjecture)
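For reference, the invariant in question can be modelled as a small self-contained check. This is only a toy stand-in (the "logits" here are a pure function of the visible prefix), not the crate's real API:

```rust
/// Toy deterministic "session": the fake logits depend only on the first
/// n_past entries of `memory`, mimicking how the real graph is claimed to
/// read memory_k/memory_v only up to n_past.
struct ToySession {
    memory: Vec<u64>, // one entry per position; never cleared on rewind
    n_past: usize,
}

impl ToySession {
    fn new() -> Self {
        ToySession { memory: Vec::new(), n_past: 0 }
    }

    /// Feed one token: write this position's entry (overwriting a stale one
    /// if present) and return the resulting "logits".
    fn feed(&mut self, token: u32) -> u64 {
        let entry = (token as u64).wrapping_mul(0x9E37_79B9);
        if self.n_past < self.memory.len() {
            self.memory[self.n_past] = entry;
        } else {
            self.memory.push(entry);
        }
        self.n_past += 1;
        self.logits()
    }

    /// Rewind by shrinking n_past only; stale entries stay in `memory`.
    fn rewind(&mut self, n: usize) -> u64 {
        self.n_past -= n;
        self.logits()
    }

    /// Fake logits: a hash of the visible prefix only.
    fn logits(&self) -> u64 {
        self.memory[..self.n_past]
            .iter()
            .fold(0, |h, &e| h.wrapping_mul(31).wrapping_add(e))
    }
}

fn main() {
    let mut s = ToySession::new();
    let logits_0 = s.feed(42); // pretend prompt

    let logits_200: Vec<u64> = (0u32..200).map(|t| s.feed(t)).collect();
    let logits_0_new = s.rewind(200);
    let logits_200_new: Vec<u64> = (0u32..200).map(|t| s.feed(t)).collect();

    assert_eq!(logits_0, logits_0_new);
    assert_eq!(logits_200, logits_200_new);
    println!("rewind + replay reproduced identical logits");
}
```

The asserts hold exactly because the toy only reads entries up to n_past and overwrites slots on re-feed, which is the behaviour being claimed for the real memory tensors.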

@steventrouble
Contributor Author

steventrouble commented Jun 28, 2023

You bring up a great point. There's not a lot of visibility into the buffers. I verified with a hex editor that GptNeoX only modifies the memory corresponding to the latest token, but I'd be happy to do more tests.

Do you have ggml copies of tiny models for each of the model types? That would save me a lot of time when testing.

Also, I'm not sure what the policy is on integration tests, but I'd love to be able to submit some kind of repeatable test so others can verify it works on other hardware.

@LLukas22
Contributor

@steventrouble Currently we don't have integration tests, but we'd sure like to have some :D (see #319)

The problem is that we don't have miniature versions of our models, which makes testing with the original models kinda difficult.
Maybe we could create some sort of Python script to convert only a single layer of a PyTorch model, but that's incredibly hacky.

@steventrouble
Contributor Author

steventrouble commented Jun 30, 2023

@LLukas22 Not sure how much progress you've made on this already, but I can give you mock models that compress to about 2MB each. Does that work for your needs?

I set most of the model weights and biases to 0, then compress it with gzip. We should be able to do this with any model. Here's an example with GptNeoX:

https://github.com/steventrouble/llm/tree/smoke/crates/llm/tests

@steventrouble
Contributor Author

Awesome! I saw your integration tests PR 😄

I wrote a basic integration test, but I'll hold off on uploading until your PR is stable. Then I'll update my PR to add the test in. Ping me when I can pick this up again!

@LLukas22
Contributor

LLukas22 commented Jul 1, 2023

Uhm, I don't know how these zeroed-out models will behave, but won't they always output the same results, since they'll zero out the input after the first matmul?

@steventrouble
Contributor Author

A few of the neurons will still have values, so it'll output random gibberish. It's mostly useful as a lightweight smoke test, to make sure the tensor loader still works and that the compute graph doesn't crash at inference time.
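As a tiny numerical illustration (a toy 4x4 layer, not any real architecture), a mostly-zeroed weight matrix with a couple of surviving entries still produces input-dependent output:

```rust
fn main() {
    // A 4x4 "weight matrix" with almost everything zeroed out, but a couple
    // of surviving entries, as in the mock models described above.
    let w: [[f32; 4]; 4] = [
        [0.0, 0.0, 0.0, 0.0],
        [0.0, 0.3, 0.0, 0.0],
        [0.0, 0.0, 0.0, 0.0],
        [0.7, 0.0, 0.0, 0.0],
    ];

    // One matrix-vector product standing in for the first matmul.
    let forward = |x: [f32; 4]| -> [f32; 4] {
        let mut y = [0.0f32; 4];
        for (i, row) in w.iter().enumerate() {
            y[i] = row.iter().zip(x.iter()).map(|(a, b)| a * b).sum();
        }
        y
    };

    // Different inputs give different (if meaningless) outputs, so the
    // loader and the compute path are still exercised end to end.
    println!("{:?}", forward([1.0, 0.0, 0.0, 0.0])); // [0.0, 0.0, 0.0, 0.7]
    println!("{:?}", forward([0.0, 1.0, 0.0, 0.0])); // [0.0, 0.3, 0.0, 0.0]
}
```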

@LLukas22
Contributor

LLukas22 commented Jul 5, 2023

Maybe we could use these tiny models; currently we're just using the smallest available models for each architecture 🤷
I also thought about creating some "tiny_shakespeare" models, where I'd use 1-2 layers and maybe 8 heads to train a small variant of each model on that dataset.

But honestly, we can just use the original models, since the GitHub Actions runners can download them in about 30 seconds 😅

If you want you could pull the newest main into this branch and try to create an integration test for the token deletion 👍

@steventrouble
Contributor Author

Sweet! I'm on it 😃

Helps newbies like me set up the repo correctly.
@steventrouble
Contributor Author

Alright, all tested. Deleting tokens works on all the models! (except llama 😢)

@LLukas22
Contributor

LLukas22 commented Jul 6, 2023

Hm, that's strange. Are you recalculating the K/V memory, or are you removing the last n entries from it?

@steventrouble
Contributor Author

steventrouble commented Jul 7, 2023

@LLukas22 Neither, I leave the memory unchanged. The old values are still in the memory, but that's fine because the stale entries are never read: e.g. in GPT-J, accesses to memory_k and memory_v use session_len (n_past) to select only up to the last token.

(memory_k_size * n_embd) * (il * ctx_size + session_len),
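To spell out the indexing argument, here's a toy version in plain Rust (element offsets instead of ggml byte offsets, made-up dimensions; not the actual graph code):

```rust
fn main() {
    // Made-up dimensions; the real ones come from the model hyperparameters.
    let n_embd = 4usize;   // embedding width
    let ctx_size = 8usize; // maximum context length
    let n_layer = 2usize;

    // Flat "memory_k": one row of n_embd values per (layer, position).
    let mut memory_k = vec![0.0f32; n_layer * ctx_size * n_embd];

    // Element offset of the row for layer `il` at position `pos`, the same
    // shape as the quoted GPT-J view offset, minus the byte-size factor.
    let row = |il: usize, pos: usize| n_embd * (il * ctx_size + pos);

    // Feed 5 tokens: write a K row for positions 0..5 in every layer.
    for pos in 0..5 {
        for il in 0..n_layer {
            let start = row(il, pos);
            memory_k[start..start + n_embd].fill(pos as f32 + 1.0);
        }
    }

    // "Delete" two tokens by shrinking n_past. Rows 3 and 4 are still in the
    // buffer, but attention only ever views rows 0..n_past, so they are
    // never selected again.
    let n_past = 3usize;
    for il in 0..n_layer {
        let visible = &memory_k[row(il, 0)..row(il, 0) + n_past * n_embd];
        assert!(visible.iter().all(|&v| v >= 1.0 && v <= 3.0));
    }
    println!("stale rows beyond n_past exist but are outside every view");
}
```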

P.S: I accidentally clicked "close issue", sorry 😅 Could you re-open the PR?

@hhamud hhamud reopened this Jul 7, 2023
@steventrouble
Contributor Author

TBH, I'm not sure why it doesn't work for llama. I've looked over the code, and all accesses to memory_k and memory_v seem to use session_len.

Is it possible it's storing state in the scratch buffers or statically? Or that there's some subtle bug in one of the memory_k/v accesses?

@LLukas22
Contributor

LLukas22 commented Jul 7, 2023

Maybe it has something to do with the scratch buffers. If I find some time I'll take a look and check why the results diverge.

@steventrouble
Contributor Author

steventrouble commented Jul 7, 2023

Aha! Found the issue. The test was wrong, and the model works fine with delete. I used a debugger to verify that it only touches the section of memory corresponding to the current token, and that the scratch buffers don't (seem to) carry state between calls.

The issue was that the string " crab" is actually tokenized as two separate tokens by llama, but the test only removed one token. I updated the test to use " ", which is a single token, and now it passes for llama.

😄
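For anyone hitting the same thing: if a test rewinds by text rather than by token count, it should tokenize the text first and delete that many tokens. A sketch with a stub tokenizer (the stub just stands in for the model's real tokenizer, which determines the actual count):

```rust
/// How many tokens need deleting to undo `text`, given some tokenizer.
fn tokens_to_delete(tokenize: impl Fn(&str) -> Vec<u32>, text: &str) -> usize {
    tokenize(text).len()
}

fn main() {
    // Stub tokenizer for illustration: it returns two pieces for " crab"
    // and one piece for " ", loosely mirroring the llama behaviour seen in
    // the test (the real split may differ in detail).
    let stub = |s: &str| -> Vec<u32> {
        let mut toks = Vec::new();
        if let Some(rest) = s.strip_prefix(' ') {
            toks.push(0); // leading-space piece
            if !rest.is_empty() {
                toks.push(1); // word piece
            }
        } else if !s.is_empty() {
            toks.push(1);
        }
        toks
    };

    assert_eq!(tokens_to_delete(&stub, " crab"), 2); // not 1!
    assert_eq!(tokens_to_delete(&stub, " "), 1);
}
```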

This'll help when I add my tests, because otherwise this file will explode in size
Note that llama.json fails the tests, so it's likely it doesn't support it.
I may investigate further, though.
@steventrouble
Contributor Author

(Fixed a temporary change I made while debugging. I'm still getting back into the swing of git after 8 years of Perforce, thanks for your patience!)

@LLukas22
Contributor

LLukas22 commented Jul 8, 2023

Is this ready for a review?

@steventrouble
Contributor Author

Yup, it looks good for review.

Thanks for your patience! Next time I contribute there will be a lot less back and forth.

Contributor

@LLukas22 LLukas22 left a comment


LGTM, good job 👍

Maybe a few naming/messages need to be changed 🤔

Review threads (outdated, resolved):
crates/llm-base/src/model/mod.rs
binaries/llm-test/src/delete.rs (2)
binaries/llm-test/src/tokens.rs
Collaborator

@philpax philpax left a comment


Looks great outside of what Lukas mentioned - ready to merge once those are fixed!

Review threads (outdated, resolved):
binaries/llm-test/src/delete.rs (2)
@philpax philpax merged commit 7f13bb9 into rustformers:main Jul 9, 2023
13 checks passed
@philpax
Collaborator

philpax commented Jul 9, 2023

Fantastic work - thank you very much 🚀

@hhamud hhamud mentioned this pull request Aug 7, 2023

Successfully merging this pull request may close these issues.

FR: Add methods for removing tokens