
Inference time issue #3608

@VEERA-NAGENDRA-KUMAR-EETHA

Description

I am using int8 weight-only quantization to quantize my Mamba model with the script below.

import torch
from torchao.quantization import quantize_, Int8WeightOnlyConfig
from executorch.exir import to_edge_transform_and_lower
from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner

# Quantize the model weights to int8 (weight-only)
quantize_(mamba, Int8WeightOnlyConfig())

# Export and lower to ExecuTorch, delegating supported ops to XNNPACK
exported_ep = torch.export.export(mamba, dummy_inputs)
et_program = to_edge_transform_and_lower(
    exported_ep,
    partitioner=[XnnpackPartitioner()],
).to_executorch()

with open(f"{args.export}_quant.pte", "wb") as file:
    file.write(et_program.buffer)
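
To help narrow this down, here is a sketch of how the lowered graph can be inspected before writing the .pte (assuming the get_delegation_info helper in executorch.devtools.backend_debug is available in the installed ExecuTorch build); it summarizes whether the int8 ops were delegated to XNNPACK or left to the portable kernels:

from executorch.devtools.backend_debug import get_delegation_info

edge_program = to_edge_transform_and_lower(
    exported_ep,
    partitioner=[XnnpackPartitioner()],
)

# Summarize which ops were lowered to the XNNPACK delegate and which
# remained in the default (portable) runtime.
delegation_info = get_delegation_info(edge_program.exported_program().graph_module)
print(delegation_info.get_summary())

et_program = edge_program.to_executorch()

If the quantized linear ops show up as non-delegated, they would run on the unoptimized portable kernels, which could explain a large gap versus FP16.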

Using the script above, the model size is reduced, but inference takes roughly 10 minutes to generate a single token, even though the generated tokens are meaningful.
Are there any suggestions for improving the inference speed?

The inference results are below.

For FP16: [screenshot of inference output]

For Int8: [screenshot of inference output]
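
For reference, the XNNPACK backend documentation describes a PT2E-based quantization flow (XNNPACKQuantizer with a symmetric, dynamic, per-channel config) rather than torchao weight-only quantization, which might be a direction worth trying. A minimal sketch is below; the import paths have moved between releases (older versions expose XNNPACKQuantizer under torch.ao.quantization.quantizer.xnnpack_quantizer), so treat the exact module paths as assumptions for a recent ExecuTorch build.

import torch
from torch.ao.quantization.quantize_pt2e import prepare_pt2e, convert_pt2e
from executorch.backends.xnnpack.quantizer.xnnpack_quantizer import (
    XNNPACKQuantizer,
    get_symmetric_quantization_config,
)
from executorch.exir import to_edge_transform_and_lower
from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner

# Capture the model and annotate it with the XNNPACK quantization config
training_gm = torch.export.export_for_training(mamba, dummy_inputs).module()
quantizer = XNNPACKQuantizer()
quantizer.set_global(get_symmetric_quantization_config(is_per_channel=True, is_dynamic=True))

prepared = prepare_pt2e(training_gm, quantizer)
prepared(*dummy_inputs)  # calibration pass
quantized = convert_pt2e(prepared)

# Export and lower the quantized graph as before
exported_ep = torch.export.export(quantized, dummy_inputs)
et_program = to_edge_transform_and_lower(
    exported_ep,
    partitioner=[XnnpackPartitioner()],
).to_executorch()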
