[llama] Produce real values to run the compiled vmfb on (for testing/benchmarking) #103

Open
ScottTodd opened this issue Jul 11, 2024 · 3 comments

ScottTodd commented Jul 11, 2024

Some parts of the llama model divide by input values, so providing empty inputs can crash the program or produce NaNs. We should generate real input values from tokenized prompts for testing and benchmarking.
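A minimal illustration (not the model's actual code) of why all-zero "empty" inputs go wrong: any division that ends up as 0/0 yields NaN, and NaNs then propagate through the rest of the computation.

```python
import numpy as np

# Stand-in for an "empty" input: all zeros.
zeros = np.zeros((2, 4), dtype=np.float16)

# Any normalization-style divide by a value derived from the input becomes 0/0.
print(zeros / zeros.sum(axis=-1, keepdims=True))  # NaN everywhere (plus a RuntimeWarning)
```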

Signatures for the quantized model we're focusing on:

  • @prefill_bs2(%arg0: !torch.vtensor<[2,?],si64>, %arg1: !torch.vtensor<[2],si64>, %arg2: !torch.vtensor<[2,?],si64>, %arg3: !torch.tensor<[?,65536],f16>) -> !torch.vtensor<[2,?,32000],f16>
  • @decode_bs2(%arg0: !torch.vtensor<[2,1],si64>, %arg1: !torch.vtensor<[2],si64>, %arg2: !torch.vtensor<[2],si64>, %arg3: !torch.vtensor<[2,?],si64>, %arg4: !torch.tensor<[?,65536],f16>)
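
A rough sketch of how the four prefill_bs2 arguments could be built from already-tokenized prompts with numpy. The argument meanings (tokens, seq_lens, seq_block_ids, cache_state), the pad id, the cache block stride, and the page count below are all assumptions for illustration; the real values come from the model's hyperparameters/config. A zero-initialized cache is plausibly fine for prefill since prefill populates it, but decode needs cache contents from a real prefill run (which is what dumping from the service, discussed below, would provide).

```python
import numpy as np

BLOCK_STRIDE = 16   # tokens per cache block (assumed)
CACHE_PAGES = 256   # number of cache pages to allocate (assumed)
PAD_ID = 0          # pad token id (assumed)


def build_prefill_inputs(token_ids_per_prompt):
    """token_ids_per_prompt: two lists of token ids (batch size 2)."""
    assert len(token_ids_per_prompt) == 2
    seq_lens = np.array([len(ids) for ids in token_ids_per_prompt], dtype=np.int64)

    # Pad the token matrix out to a multiple of the block stride.
    max_len = int(seq_lens.max())
    padded_len = -(-max_len // BLOCK_STRIDE) * BLOCK_STRIDE
    tokens = np.full((2, padded_len), PAD_ID, dtype=np.int64)
    for row, ids in enumerate(token_ids_per_prompt):
        tokens[row, : len(ids)] = ids

    # Give each sequence its own run of cache block ids.
    blocks_per_seq = padded_len // BLOCK_STRIDE
    seq_block_ids = np.arange(2 * blocks_per_seq, dtype=np.int64).reshape(2, blocks_per_seq)

    # Zero-initialized paged cache state; the trailing dim matches the signature.
    cache_state = np.zeros((CACHE_PAGES, 65536), dtype=np.float16)
    return tokens, seq_lens, seq_block_ids, cache_state
```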

Python code to start from is in https://github.com/nod-ai/sharktank/blob/main/shortfin/shortfin/llm/impl/service_v1_cli.py. I've added logging there and in https://github.com/nod-ai/sharktank/blob/main/shortfin/shortfin/llm/impl/service_v1.py for debugging, but I'd like a way to export the inputs to .bin (or .npy) files for use with iree-run-module and iree-benchmark-module. Maybe a Python script for prompt -> input files? We could also have a script that validates prompt -> reference model output against prompt -> compiled model output.
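
As a concrete sketch of the "prompt -> input files" idea: save the arrays from the snippet above as .npy files and feed them to iree-run-module with `--input=@file.npy` (that flag form is in the IREE tooling docs; the module path and function name below are placeholders). iree-benchmark-module accepts the same --module/--function/--input flags, so the same files should work for benchmarking.

```python
import numpy as np


def save_inputs(tokens, seq_lens, seq_block_ids, cache_state, prefix="prefill_bs2"):
    arrays = [("tokens", tokens), ("seq_lens", seq_lens),
              ("seq_block_ids", seq_block_ids), ("cache_state", cache_state)]
    paths = []
    for name, arr in arrays:
        path = f"{prefix}_{name}.npy"
        np.save(path, arr)
        paths.append(path)
    input_flags = " ".join(f"--input=@{p}" for p in paths)
    # Placeholder module path / function name -- substitute the real ones.
    print(f"iree-run-module --module=model.vmfb --function=prefill_bs2 {input_flags}")
    return paths
```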

@ScottTodd ScottTodd self-assigned this Jul 11, 2024
@ScottTodd ScottTodd changed the title Producing real values to run the compiled vmfb on (for testing) [llama] Produce real values to run the compiled vmfb on (for testing/benchmarking) Jul 11, 2024

ScottTodd (Member, Author) commented:

I'm trying to write a standalone script, but it might also make sense to teach service_v1_cli to just dump its inputs to disk. I might do that too (or instead) once I've gone through enough of this code to understand what's happening.

BTW, https://github.com/nod-ai/sharktank/blob/main/shortfin/shortfin/llm/config.py has a nice comment at the top. I'd find that useful in docs outside of the source, or closer to the model definition in sharktank/.

ScottTodd added a commit that referenced this issue Jul 15, 2024
Progress on #103. Sending early for design feedback.

I want something lighter weight than https://github.com/nod-ai/sharktank/blob/main/shortfin/shortfin/llm/impl/service_v1_cli.py and https://github.com/nod-ai/sharktank/blob/main/shortfin/shortfin/llm/impl/service_v1.py that can generate real inputs using a tokenizer and the provided hyperparameters (e.g. batch sizes, max sequence length), for use in offline tests and benchmarks. What I'm not sure about is the state tracking in the cache. That's probably easiest to just dump from the service, but for prefill at least we should be able to generate something sensible here.

Anyways, this is part learning exercise for me and part useful tool for others. Looking for feedback!

ScottTodd (Member, Author) commented:

Here's a WIP patch adding debug file dumping to service_v1.py: main...ScottTodd:sharktank:llama-generate-data-2. That gets the real values used in the service. Note that the service code itself is still under construction, but it at least models the stateful nature of this program.

Still need to read the cache from device memory to host memory and dump it, then try the .bin files with iree-run-module to see if it all works as expected.
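
Once the cache has been copied from device memory into a host numpy array (the shortfin API for doing that is exactly what the WIP patch is working through, so it's not shown here), dumping it as raw bytes and referencing it with an explicit shape/type prefix is one option. Treat the exact `--input=SHAPExTYPE=@file.bin` syntax and type tokens as assumptions to check against `iree-run-module --help`.

```python
import numpy as np


def dump_bin(host_array: np.ndarray, path: str) -> str:
    """Write a host array as raw bytes and return a candidate iree-run-module flag."""
    host_array.tofile(path)  # raw little-endian bytes, no .npy header
    shape = "x".join(str(d) for d in host_array.shape)
    # Map numpy dtypes to (assumed) IREE element type tokens.
    elem = {"float16": "f16", "int64": "i64", "float32": "f32"}[host_array.dtype.name]
    return f"--input={shape}x{elem}=@{path}"  # e.g. --input=256x65536xf16=@cache.bin
```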
