
Start on llama/tools/generate_data.py. #105

Merged: ScottTodd merged 4 commits into nod-ai:main from llama-generate-data on Jul 15, 2024

Conversation

ScottTodd (Member):

Progress on #103. Sending early for design feedback.

I want something lighter weight than https://github.com/nod-ai/sharktank/blob/main/shortfin/shortfin/llm/impl/service_v1_cli.py and https://github.com/nod-ai/sharktank/blob/main/shortfin/shortfin/llm/impl/service_v1.py that can generate real inputs using a tokenizer and the provided hyperparameters (e.g. batch sizes, max sequence length), for use in offline tests and benchmarks. What I'm not sure about is the state tracking in the cache. That is probably easiest to just dump from the service, but I think for prefill, at least, we should be able to generate something sensible here.

Anyway, this is part learning exercise for me and part useful tool for others. Looking for feedback!
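
To make the flow concrete, here is a minimal sketch of the kind of generator described above. This is not the actual generate_data.py: the function name, the use of a Hugging Face AutoTokenizer, and the .npy output layout are illustrative assumptions; only the hyperparameter names (max_seq_len, prefill_batch_sizes) come from the exported config.

import json

import numpy as np
from transformers import AutoTokenizer  # assumption: any tokenizer with encode() would do

def generate_prefill_inputs(config_path, tokenizer_path, prompt, output_dir):
    # Load the exported hyperparameters (max_seq_len, batch sizes, ...).
    with open(config_path) as f:
        config = json.load(f)

    tokenizer = AutoTokenizer.from_pretrained(tokenizer_path)
    token_ids = tokenizer.encode(prompt)

    prefill_batch_size = config["prefill_batch_sizes"][0]
    max_seq_len = config["max_seq_len"]

    # Pad each batch row out to max_seq_len with zeros; see the review
    # discussion below about using a smaller, block-aligned length instead.
    tokens = np.zeros((prefill_batch_size, max_seq_len), dtype=np.int64)
    tokens[:, : len(token_ids)] = token_ids
    seq_lens = np.full(prefill_batch_size, len(token_ids), dtype=np.int64)

    # Dump as .npy so offline tests and benchmarks can load them directly.
    np.save(f"{output_dir}/arg0_prefill_tokens.npy", tokens)
    np.save(f"{output_dir}/arg1_prefill_seq_lens.npy", seq_lens)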

@ScottTodd ScottTodd self-assigned this Jul 12, 2024
@ScottTodd ScottTodd requested a review from rsuderman July 12, 2024 16:34
@ScottTodd ScottTodd (Member, Author) left a comment

I'm making slow progress understanding each of these parameters. The stateful parameters are especially tricky. Are there 1:1 mappings with inputs used with other models (i.e. not our port in sharktank)? I'm wondering if we could get low level input/output datasets from something other than our own code. Otherwise we might just need to keep building out service_v1.py and teach it to dump binary files with a flag 🤔

Comment on lines 91 to 92
arg0_prefill_tokens = np.ndarray(
[prefill_batch_size, config["max_seq_len"]], dtype=np.int64
ScottTodd (Member, Author):

Ah, this is a hyperparameter defining an upper limit for the entire model, but individual function calls will typically use smaller values. That's computed based on the tokens: https://github.com/nod-ai/sharktank/blob/5005107768120df1a3e69ab1ac7abf40e701c34d/shortfin/shortfin/llm/impl/service_v1.py#L299 https://github.com/nod-ai/sharktank/blob/5005107768120df1a3e69ab1ac7abf40e701c34d/shortfin/shortfin/llm/impl/service_v1.py#L250-L261
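
For illustration, a minimal sketch of that computation, assuming the per-call length is simply the token count rounded up to the next multiple of block_seq_stride (the helper name is hypothetical, not the service's API):

def padded_seq_len(token_count: int, block_seq_stride: int) -> int:
    # Round the actual token count up to the next block boundary; this,
    # not max_seq_len, is what an individual prefill call would use.
    blocks = -(-token_count // block_seq_stride)  # ceiling division
    return blocks * block_seq_stride

For example, the 8-token prompt in the log below with block_seq_stride=16 would pad to 16, far below the model-wide max_seq_len of 2048.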

ScottTodd (Member, Author):

Added a comment for now, as well as debug logging showing the full tensors being saved:

INFO 07-15 11:33:50 [generate_data.py:78] Loaded config with hyperparameters:
INFO 07-15 11:33:50 [generate_data.py:79] {
    "module_name": "module",
    "module_abi_version": 1,
    "max_seq_len": 2048,
    "attn_head_count": 32,
    "attn_head_dim": 100,
    "prefill_batch_sizes": [
        4
    ],
    "decode_batch_sizes": [
        4
    ],
    "transformer_block_count": 26,
    "block_seq_stride": 16
}
INFO 07-15 11:33:50 [generate_data.py:103] prompt -> encoded tokens: [1, 1200, 325, 268, 4546, 296, 1161, 29584]
DEBUG 07-15 11:33:50 [generate_data.py:108] arg0_prefill_tokens:
DEBUG 07-15 11:33:50 [generate_data.py:109] [[    1  1200   325   268  4546   296  1161 29584     0
    0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
... (that should be a much smaller tensor for this input, hooray dynamic shapes)

[prefill_batch_size, config["max_seq_len"]], dtype=np.int64
)
arg1_prefill_seq_lens = np.ndarray(prefill_batch_size, dtype=np.int64)
# TODO(scotttodd): arg2 - attention block indices
ScottTodd (Member, Author):

This is also tricky to populate; see the logic in set_sequences.
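
To sketch the shape of that bookkeeping (a simplified guess, not the actual set_sequences logic): each sequence needs enough cache page indices to cover its block-aligned length, and arg2 carries those indices per batch row. Something like:

import numpy as np

def assign_block_indices(seq_lens, block_seq_stride, blocks_per_seq):
    # Hypothetical allocator: hand out cache page indices sequentially so
    # each sequence covers ceil(seq_len / block_seq_stride) blocks, padded
    # with zeros to a fixed per-row width so the batch stays rectangular.
    indices = np.zeros((len(seq_lens), blocks_per_seq), dtype=np.int64)
    next_free = 0
    for row, seq_len in enumerate(seq_lens):
        used = -(-seq_len // block_seq_stride)
        indices[row, :used] = np.arange(next_free, next_free + used)
        next_free += used
    return indices

The real service's allocation is stateful across calls, which is exactly the state tracking that is hard to reproduce in a standalone generator.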

)
arg1_prefill_seq_lens = np.ndarray(prefill_batch_size, dtype=np.int64)
# TODO(scotttodd): arg2 - attention block indices
# TODO(scotttodd): arg3 - attention block buffer
ScottTodd (Member, Author):

This is a stateful device buffer; we would need to read it back from device to host after a few real calls.

ScottTodd (Member, Author):

Ping? Would like feedback on the approach taken here (standalone script vs teaching the service to output its args/outputs).

@stellaraccident stellaraccident (Contributor) left a comment

Seems OK as a start. I also don't know a great answer for the states. Ideally you wouldn't include those in pre-baked data but would have something that loops them properly.

@dan-garvey dan-garvey (Member) left a comment

I also think this works as a first step. Thanks for all the notes; they answered my question from the pre-sync.

@ScottTodd ScottTodd marked this pull request as ready for review July 15, 2024 18:35
@ScottTodd ScottTodd enabled auto-merge (squash) July 15, 2024 18:37
@ScottTodd ScottTodd merged commit acad056 into nod-ai:main Jul 15, 2024
2 of 3 checks passed
@ScottTodd ScottTodd deleted the llama-generate-data branch July 15, 2024 18:37