[llama] Produce real values to run the compiled vmfb on (for testing/benchmarking) #103
Comments
Trying to write a standalone script, but it might also make sense to teach […]. BTW, I found https://github.com/nod-ai/sharktank/blob/main/shortfin/shortfin/llm/config.py has a nice comment at the top; I'd find that useful in docs outside of the source, or closer to the model definition in sharktank/.
Progress on #103. Sending early for design feedback. I want something lighter weight than https://github.com/nod-ai/sharktank/blob/main/shortfin/shortfin/llm/impl/service_v1_cli.py and https://github.com/nod-ai/sharktank/blob/main/shortfin/shortfin/llm/impl/service_v1.py that can generate real inputs using a tokenizer and the provided hyperparameters (e.g. batch sizes, max sequence length), for use in offline tests and benchmarks. What I'm not sure about is the state tracking in the cache. That's probably easiest to just dump from the service, but for prefill at least we should be able to generate something sensible here. Anyway, this is part learning exercise for me and part useful tool for others. Looking for feedback!
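For the input-generation piece, here's a minimal sketch of the direction, using a HuggingFace `AutoTokenizer` as a stand-in (the tokenizer name, zero pad token, and pad-to-multiple scheme are all assumptions for illustration, not the sharktank API):

```python
# Hypothetical sketch: build real prefill inputs from text prompts.
import numpy as np
from transformers import AutoTokenizer  # assumption: HF tokenizer as stand-in

def prefill_inputs(prompts, tokenizer_name="openlm-research/open_llama_3b",
                   pad_to_multiple=16):
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
    encoded = [tokenizer.encode(p) for p in prompts]
    seq_lens = np.array([len(e) for e in encoded], dtype=np.int64)
    # Pad every sequence to a shared, aligned length so the dynamic dim of
    # the [bs, ?] tokens tensor is consistent across the batch.
    max_len = int(-(-seq_lens.max() // pad_to_multiple) * pad_to_multiple)
    tokens = np.zeros((len(prompts), max_len), dtype=np.int64)  # 0 = assumed pad id
    for i, e in enumerate(encoded):
        tokens[i, : len(e)] = e
    return tokens, seq_lens
```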
Here's a WIP patch adding debug file dumping to […]. Still need to read the cache from device memory to host memory and dump it, then try the .bin files with […].
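For reference, a minimal sketch of the file-dumping half, assuming the tensors have already been read back as host-side numpy arrays (the function and file names are placeholders; the device-to-host readback is the part still TODO above):

```python
# Hypothetical helper: dump one argument tensor in both formats.
import numpy as np

def dump_arg(arr: np.ndarray, path_stem: str):
    np.save(path_stem + ".npy", arr)  # .npy keeps dtype/shape metadata
    arr.tofile(path_stem + ".bin")    # raw bytes; shape must be tracked externally
```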
Some parts of the llama model divide by input values, so providing empty inputs can crash the program or produce NaNs. We should generate real input values from tokenized prompts for testing and benchmarking.
Signatures for the quantized model we're focusing on:

```mlir
@prefill_bs2(%arg0: !torch.vtensor<[2,?],si64>, %arg1: !torch.vtensor<[2],si64>, %arg2: !torch.vtensor<[2,?],si64>, %arg3: !torch.tensor<[?,65536],f16>) -> !torch.vtensor<[2,?,32000],f16>
@decode_bs2(%arg0: !torch.vtensor<[2,1],si64>, %arg1: !torch.vtensor<[2],si64>, %arg2: !torch.vtensor<[2],si64>, %arg3: !torch.vtensor<[2,?],si64>, %arg4: !torch.tensor<[?,65536],f16>)
```
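For illustration, a hedged sketch of allocating host arrays with the dtypes and shapes the prefill signature expects; the interpretation of %arg2 (taken here to be block/page indices into the cache) and the zero-initialized cache are assumptions, though a zeroed cache seems sensible for prefill:

```python
# Hypothetical sketch: host arrays matching @prefill_bs2 above.
import numpy as np

bs, max_len, cache_pages = 2, 64, 256  # cache page count is a placeholder
tokens = np.zeros((bs, max_len), dtype=np.int64)           # %arg0: [2,?]  si64
seq_lens = np.full((bs,), max_len, dtype=np.int64)         # %arg1: [2]    si64
block_ids = np.zeros((bs, max_len), dtype=np.int64)        # %arg2: [2,?]  si64 (assumed cache indices)
cache = np.zeros((cache_pages, 65536), dtype=np.float16)   # %arg3: [?,65536] f16
```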
Python code to start from is in https://github.com/nod-ai/sharktank/blob/main/shortfin/shortfin/llm/impl/service_v1_cli.py. I've added logging there and in https://github.com/nod-ai/sharktank/blob/main/shortfin/shortfin/llm/impl/service_v1.py for debugging, but I'd like a way to export to .bin (or .npy) files for use with `iree-run-module` and `iree-benchmark-module`. Maybe a Python script for prompt -> input files? Could also have a script for validating prompt -> reference model against prompt -> compiled model.
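Once per-argument .npy files exist, an invocation along these lines should feed them to the tools (the module path and file names are placeholders; `--input=@file.npy` is how `iree-run-module` consumes .npy inputs, and `iree-benchmark-module` takes the same flags):

```shell
iree-run-module --module=llama.vmfb --function=prefill_bs2 \
  --device=local-task \
  --input=@arg0_tokens.npy --input=@arg1_seq_lens.npy \
  --input=@arg2_block_ids.npy --input=@arg3_cache.npy
```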