I am using int8 weight-only quantization to quantize my Mamba model with the script below.
```python
import torch
from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner
from executorch.exir import to_edge_transform_and_lower
from torchao.quantization import quantize_, Int8WeightOnlyConfig

# Quantize the model weights to int8 (weight-only)
quantize_(mamba, Int8WeightOnlyConfig())

# Export the quantized model and lower it for the XNNPACK backend
exported_ep = torch.export.export(mamba, dummy_inputs)
et_program = to_edge_transform_and_lower(
    exported_ep,
    partitioner=[XnnpackPartitioner()],
).to_executorch()

with open(f"{args.export}_quant.pte", "wb") as file:
    file.write(et_program.buffer)
```

With this script, the exported model size is reduced as expected. However, inference is extremely slow: it takes ~10 minutes to generate a single token. The generated tokens are still meaningful.
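Roughly, the timing looks like the sketch below. This is only a minimal sketch, not my exact inference script: it assumes the `.pte` is run through ExecuTorch's Python pybindings (`_load_for_executorch`, whose import path may differ between ExecuTorch versions), and `mamba_quant.pte` plus the input shape are placeholders.

```python
# Rough sketch of how one token step is timed (placeholders, not the exact
# inference script). Assumes the .pte is run via ExecuTorch's Python
# pybindings; the import path may differ between ExecuTorch versions.
import time

import torch
from executorch.extension.pybindings.portable_lib import _load_for_executorch

module = _load_for_executorch("mamba_quant.pte")  # placeholder file name

# Placeholder input: one forward pass corresponds to generating one token
example_inputs = [torch.zeros(1, 1, dtype=torch.long)]

module.forward(example_inputs)  # warm-up pass
start = time.perf_counter()
module.forward(example_inputs)
print(f"one token step: {time.perf_counter() - start:.2f} s")
```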
Are there any suggestions to improve my inference speed?
The inference results are below:
For FP16:
For Int8:
