Possibility of using Generate API after exporting for inference on device for a custom LLM model - Android? #819

martinkorelic · 2024-08-20T12:59:27Z

martinkorelic
Aug 20, 2024

Still fairly new to this framework, but would like to thank the contributors for your effort for providing us with this tool for generative AI.

My question is whether it is possible to include our own custom models or models with Gemma, TinyLLama arhictecture which are exported for inference and then using this genai onnxruntime framework to run faster inference for text generation (for on device inference - Android)?

The repository states that the models supported are Gemma and LLama architectures, however there aren't many provided examples on this situation except for phi-3 but in this case we are downloading an already pretrained model.

In my case, I would like to export an already existing transformer causal LM model on device and then use it with this framework for on device inference.
Would that be possible if we have a model which produces the logits output from the graph? For example, if we already had a inference_model.onnx file already available that accepts token_ids as input and outputs the logits. Almost same structure that phi3-mini-4k-instruct-cpu-int4-rtn-block-32-acc-level-4.onnx uses with input having input_ids, attention_mask and output with logits + hidden states.

Or are we only bounded by the models supported in the given example?

Thank you for your answer in advance:)

kunal-vaishnavi · 2024-08-20T15:11:57Z

kunal-vaishnavi
Aug 20, 2024
Collaborator

My question is whether it is possible to include our own custom models or models with Gemma, TinyLLama arhictecture which are exported for inference and then using this genai onnxruntime framework to run faster inference for text generation (for on device inference - Android)?

Yes, you can bring your own custom models or use models that have the same architecture as already-supported model architectures (e.g. Gemma, LLaMA, Phi, etc).

In my case, I would like to export an already existing transformer causal LM model on device and then use it with this framework for on device inference.
Would that be possible if we have a model which produces the logits output from the graph? For example, if we already had a inference_model.onnx file already available that accepts token_ids as input and outputs the logits. Almost same structure that phi3-mini-4k-instruct-cpu-int4-rtn-block-32-acc-level-4.onnx uses with input having input_ids, attention_mask and output with logits + hidden states.
Or are we only bounded by the models supported in the given example?

Yes, you can use your ONNX model and you are not bounded by the example models shown. To create the supporting files needed to run with ONNX Runtime GenAI (e.g. genai_config.json, tokenizer.json, tokenizer_config.json, etc), you can run the model builder like this.

The model builder will try to create the genai_config.json file automatically but you should verify its information and fix any issues. For example, if your model accepts input_ids but the input name in the ONNX model for input ids is named as token_ids, then you will need to change "input_ids": "input_ids" to "input_ids": "token_ids".

7 replies

martinkorelic Aug 24, 2024
Author

Yes, I have noticed this while experimenting with implementing it in the optimized graph. Another way if it is possible, I would avoid using the optimizations for the graph and instead using my own graph which has the input and the output of loss and logits for onnx genai.

However this raises issues with using the past keys and values, since I do not have set those as inputs and as outputs.

Without using these optimizations, would I be able to use the hidden states of keys and values of the default transformer implementation which would then be also used as an input, effectively allowing me to use the onnx genai framework?

kunal-vaishnavi Aug 25, 2024
Collaborator

Yes, you should be able to use them as inputs and outputs. Here's some pseudocode for how you can do this using the ONNX helper APIs.

import onnx
from onnx import helper, TensorProto

# 1) Load ONNX model into memory as `ModelProto`
model = onnx.load_model("/path/to/model.onnx")

# 2) Create new inputs and outputs
num_layers = number_of_layers
num_kv_heads = number_of_key_value_heads
head_size = hidden_size / number_of_attention_heads

dtype = TensorProto.FLOAT for float32 precision or TensorProto.FLOAT16 for float16 precision
input_shape = ['batch_size', num_kv_heads, 'past_sequence_length', head_size]
output_shape = ['batch_size', num_kv_heads, 'total_sequence_length', head_size]
inputs, outputs = [], []

for layer_id in range(num_layers):
    # Add KV cache to inputs
    key_name = f"past_key_values.{i}.key"
    inputs.append(helper.make_tensor_value_info(key_name, dtype, input_shape))
    value_name = f"past_key_values.{i}.value"
    inputs.append(helper.make_tensor_value_info(value_name, dtype, input_shape))

    # Add KV cache to outputs
    key_name = f"present.{i}.key"
    outputs.append(helper.make_tensor_value_info(key_name, dtype, output_shape))
    value_name = f"present.{i}.value"
    outputs.append(helper.make_tensor_value_info(value_name, dtype, output_shape))

# 3) Add new inputs and outputs to existing inputs and outputs
model.graph.input.extend(inputs)
model.graph.output.extend(outputs)

# 4) Save new ONNX model
onnx.save_model(
    model,
    "/path/to/new_model.onnx",
    save_as_external_data=True,
    all_tensors_to_one_file=True,
    location="new_model.onnx.data",
    size_threshold=0,
    convert_attribute=False,
)

Here are a few things to look out for.

ONNX Runtime GenAI currently expects the KV cache inputs and outputs to be named in the "past_key_values.{i}.key"/"past_key_values.{i}.value" and "present.{i}.key"/"present.{i}.value" formats based on the genai_config.json file. You can change the naming formats as long as you keep a defined naming structure and use %d to define where the layer number goes.

For example, your ONNX model may have the key output defined as "/models/layers.{layer_id}/self_attn/Concat_output_0" per layer. You can then change "present_key_names": "present.%d.key" to "present_key_names": "/models/layers.%d/self_attn/Concat_output_0" in genai_config.json where %d is replaced by the layer number.

The input and output KV cache names have to all be used as inputs or outputs to at least 1 node in the graph.

For example, you can have a Concat node defined as follows.

Input 0: past key or past value
Input 1: current key or current value calculated during attention
Output 0: present key or present value

Output 0 is then passed as an input to the next node. If needed, you can create a new node to use the KV cache names and then insert the new node into the model as follows.

new_node = helper.make_node(...)
model.graph.node.append(new_node)

martinkorelic Aug 25, 2024
Author

Thank you again for the detailed answer, this will surely help me in trying to make this implementation.
Will try this next time to add it to the existing graph and test it with onnx genai framework.

martinkorelic Aug 30, 2024
Author

@kunal-vaishnavi Seems like I made it work with KV caching inputs, however whenever I set to past_present_share_buffer : true, my model gets an error as presumably the input of KV caches are incorrect.

onnxruntime_genai.onnxruntime_genai.OrtException: Non-zero status code returned while running Concat node. Name:'/backbone/model/model/layers.0/self_attn/Concat_5' Status Message: /onnxruntime_src/onnxruntime/core/framework/execution_frame.cc:171 onnxruntime::common::Status onnxruntime::IExecutionFrame::GetOrCreateNodeOutputMLValue(int, int, const onnxruntime::TensorShape*, OrtValue*&, const onnxruntime::Node&) 
shape && tensor.Shape() == *shape was false. OrtValue shape verification failed.
Current shape:{1,4,256,64} Requested shape:{1,4,263,64}

Input ids are in this example: [1, 12] (batchsize, sequence length)

This is from the concat node, my inputs are as follows:

input ids, position ids -> [batchsize, sequence length]
attention mask -> [batch_size, past_sequence_length + 1]
present_key_values -> [batch_size,4,past_sequence_length + 1,64]
past_key_values -> [batch_size,4,past_sequence_length,64]

Looks like it is requesting something like [1, 4, total_sequence_length + input_id_seq_length, 64].
It should be noted that I am not using the optimizations like GroupAttentionQuery provided by onnx genai, but rather my own model. This are all operations from the default transformer with the use of cache.
I have tried setting the same outputs and input dimension names as in the given example of phi-3-mini4k, however it was the same error.

Here is the specific node in question:

Now, is this perhaps an error on my part due and different operations of concat node, now I am unsure if KV caching is even working or just the share buffer?
If so, do you have any advice on how this should work with onnx genai?

kunal-vaishnavi Aug 30, 2024
Collaborator

Glad to hear you were able to add the past and present KV caches inputs!

ONNX Runtime implements the GroupQueryAttention op and the op is designed to accept KV caches that use or do not use past-present buffer sharing. ONNX Runtime GenAI offers the user the option to enable or disable past-present buffer sharing so that it creates the KV caches according to the user's preferences and then passes them to the model. Thus, a model that contains the GroupQueryAttention op can support past-present buffer sharing because the op supports it.

Since your model does not have the GroupQueryAttention op, you should set past_present_share_buffer = false. Then, the past KV caches will be of size [batch_size, 4, past_sequence_length, 64] and the present KV caches will be of size [batch_size, 4, past_sequence_length + 1, 64] in ONNX Runtime GenAI.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Possibility of using Generate API after exporting for inference on device for a custom LLM model - Android? #819

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 1 comment 7 replies

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

Select a reply

Possibility of using Generate API after exporting for inference on device for a custom LLM model - Android? #819

martinkorelic Aug 20, 2024

Replies: 1 comment · 7 replies

kunal-vaishnavi Aug 20, 2024 Collaborator

martinkorelic Aug 24, 2024 Author

kunal-vaishnavi Aug 25, 2024 Collaborator

martinkorelic Aug 25, 2024 Author

martinkorelic Aug 30, 2024 Author

kunal-vaishnavi Aug 30, 2024 Collaborator

martinkorelic
Aug 20, 2024

Replies: 1 comment 7 replies

kunal-vaishnavi
Aug 20, 2024
Collaborator

martinkorelic Aug 24, 2024
Author

kunal-vaishnavi Aug 25, 2024
Collaborator

martinkorelic Aug 25, 2024
Author

martinkorelic Aug 30, 2024
Author

kunal-vaishnavi Aug 30, 2024
Collaborator