[llama] Produce real values to run the compiled vmfb on (for testing/benchmarking) #103
Comments
Trying to write a standalone script, but it might also make sense to teach […]. BTW, I found https://github.com/nod-ai/sharktank/blob/main/shortfin/shortfin/llm/config.py has a nice comment at the top; I'd find that useful in docs outside of the source, or closer to the model definition in sharktank/.
Progress on #103. Sending early for design feedback. I want something lighter weight than https://github.com/nod-ai/sharktank/blob/main/shortfin/shortfin/llm/impl/service_v1_cli.py and https://github.com/nod-ai/sharktank/blob/main/shortfin/shortfin/llm/impl/service_v1.py that can generate real inputs using a tokenizer and the provided hyperparameters (e.g. batch sizes, max sequence length), for use in offline tests and benchmarks. What I'm not sure about is the state tracking in the cache. That's probably easiest to just dump from the service, but for prefill at least we should be able to generate something sensible here. Anyway, this is part learning exercise for me and part useful tool for others. Looking for feedback!
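For the input-generation piece, here's a minimal sketch of the direction, using a HuggingFace `AutoTokenizer` as a stand-in (the tokenizer name, zero pad token, and pad-to-multiple scheme are all assumptions for illustration, not the sharktank API):

```python
# Hypothetical sketch: build real prefill inputs from text prompts.
import numpy as np
from transformers import AutoTokenizer  # assumption: HF tokenizer as stand-in

def prefill_inputs(prompts, tokenizer_name="openlm-research/open_llama_3b",
                   pad_to_multiple=16):
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
    encoded = [tokenizer.encode(p) for p in prompts]
    seq_lens = np.array([len(e) for e in encoded], dtype=np.int64)
    # Pad every sequence to a shared, aligned length so the dynamic dim of
    # the [bs, ?] tokens tensor is consistent across the batch.
    max_len = int(-(-seq_lens.max() // pad_to_multiple) * pad_to_multiple)
    tokens = np.zeros((len(prompts), max_len), dtype=np.int64)  # 0 = assumed pad id
    for i, e in enumerate(encoded):
        tokens[i, : len(e)] = e
    return tokens, seq_lens
```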
Here's a WIP patch adding debug file dumping to […]. Still need to read the cache from device memory to host memory and dump it, then try the .bin files with […].
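For reference, a minimal sketch of the file-dumping half, assuming the tensors have already been read back as host-side numpy arrays (the function and file names are placeholders; the device-to-host readback is the part still TODO above):

```python
# Hypothetical helper: dump one argument tensor in both formats.
import numpy as np

def dump_arg(arr: np.ndarray, path_stem: str):
    np.save(path_stem + ".npy", arr)  # .npy keeps dtype/shape metadata
    arr.tofile(path_stem + ".bin")    # raw bytes; shape must be tracked externally
```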
Some parts of the llama model divide by input values, so providing empty inputs can crash the program or produce NaNs. We should generate real input values from tokenized prompts for testing and benchmarking.
Signatures for the quantized model we're focusing on:

```mlir
@prefill_bs2(%arg0: !torch.vtensor<[2,?],si64>, %arg1: !torch.vtensor<[2],si64>, %arg2: !torch.vtensor<[2,?],si64>, %arg3: !torch.tensor<[?,65536],f16>) -> !torch.vtensor<[2,?,32000],f16>
@decode_bs2(%arg0: !torch.vtensor<[2,1],si64>, %arg1: !torch.vtensor<[2],si64>, %arg2: !torch.vtensor<[2],si64>, %arg3: !torch.vtensor<[2,?],si64>, %arg4: !torch.tensor<[?,65536],f16>)
```
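For illustration, a hedged sketch of allocating host arrays with the dtypes and shapes the prefill signature expects; the interpretation of %arg2 (taken here to be block/page indices into the cache) and the zero-initialized cache are assumptions, though a zeroed cache seems sensible for prefill:

```python
# Hypothetical sketch: host arrays matching @prefill_bs2 above.
import numpy as np

bs, max_len, cache_pages = 2, 64, 256  # cache page count is a placeholder
tokens = np.zeros((bs, max_len), dtype=np.int64)           # %arg0: [2,?]  si64
seq_lens = np.full((bs,), max_len, dtype=np.int64)         # %arg1: [2]    si64
block_ids = np.zeros((bs, max_len), dtype=np.int64)        # %arg2: [2,?]  si64 (assumed cache indices)
cache = np.zeros((cache_pages, 65536), dtype=np.float16)   # %arg3: [?,65536] f16
```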
Python code to start from is in https://github.com/nod-ai/sharktank/blob/main/shortfin/shortfin/llm/impl/service_v1_cli.py. I've added logging there and in https://github.com/nod-ai/sharktank/blob/main/shortfin/shortfin/llm/impl/service_v1.py for debugging, but I'd like a way to export to .bin (or .npy) files for use with `iree-run-module` and `iree-benchmark-module`. Maybe a Python script for prompt -> input files? Could also have a script for validating prompt -> reference model against prompt -> compiled model.
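Once per-argument .npy files exist, an invocation along these lines should feed them to the tools (the module path and file names are placeholders; `--input=@file.npy` is how `iree-run-module` consumes .npy inputs, and `iree-benchmark-module` takes the same flags):

```shell
iree-run-module --module=llama.vmfb --function=prefill_bs2 \
  --device=local-task \
  --input=@arg0_tokens.npy --input=@arg1_seq_lens.npy \
  --input=@arg2_block_ids.npy --input=@arg3_cache.npy
```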