alma


A Python library for benchmarking PyTorch model speed for different conversion options 🚀

With just one function call, you can get a full report on how fast your PyTorch model runs for inference across over 90 conversion options, such as JIT tracing, torch.compile, torch.export, torchao, ONNX, OpenVINO, TensorRT, and many more!

This allows one to find the best option for one's model, data, and hardware. See the Conversion Options Summary below for all supported options.

Table of Contents

  • Getting Started
  • Basic usage
  • Documentation
  • Video
  • Conversion Options
  • Testing
  • Future work
  • How to contribute
  • Citation

Getting Started

Installation

alma is available as a Python package.

One can install the package from the Python Package Index (PyPI) by running:

pip install alma-torch

Alternatively, it can be installed from the root of this repository by running:

pip install -e .

Docker

We recommend building the provided Dockerfile to ensure an easy installation of all of the system dependencies and the alma pip package.

Working with the docker image
  1. Build the Docker Image

    bash scripts/build_docker.sh
  2. Run the Docker Container
    Create and start a container named alma:

    bash scripts/run_docker.sh
  3. Access the Running Container
    Enter the container's shell:

    docker exec -it alma bash
  4. Mount Your Repository
    By default, the run_docker.sh script mounts your /home directory to /home inside the container.
    If your alma repository is in a different location, update the bind mount, for example:

    -v /Users/myuser/alma:/home/alma

Basic usage

The core API is benchmark_model, which is used to benchmark the speed of a model for different conversion options. The usage is as follows:

import torch

from alma import benchmark_model
from alma.benchmark import BenchmarkConfig
from alma.benchmark.log import display_all_results

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

# Load the model
model = ...

# Load the dataloader used in benchmarking
data_loader = ...

# Set the configuration (this can also be passed in as a dict)
config = BenchmarkConfig(
    n_samples=2048,
    batch_size=64,
    device=device,  # The device to run the model on
)

# Choose which conversions to benchmark
conversions = ["EAGER", "TORCH_SCRIPT", "COMPILE_INDUCTOR_MAX_AUTOTUNE", "COMPILE_OPENXLA"]

# Benchmark the model
results = benchmark_model(model, config, conversions, data_loader=data_loader)

# Print all results
display_all_results(results)
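
As the configuration comment above notes, the config can also be passed in as a plain dict. A minimal sketch using the same keys as the BenchmarkConfig fields shown above:

# The config can also be a dict with the same keys as BenchmarkConfig
config = {
    "n_samples": 2048,
    "batch_size": 64,
    "device": device,
}

results = benchmark_model(model, config, conversions, data_loader=data_loader)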

The results will look like the following, with exact numbers depending on your model, dataloader, and hardware:

EAGER results:
Device: cuda
Total elapsed time: 0.0206 seconds
Total inference time (model only): 0.0074 seconds
Total samples: 2048 - Batch size: 64
Throughput: 275643.45 samples/second

TORCH_SCRIPT results:
Device: cuda
Total elapsed time: 0.0203 seconds
Total inference time (model only): 0.0043 seconds
Total samples: 2048 - Batch size: 64
Throughput: 477575.34 samples/second

COMPILE_INDUCTOR_MAX_AUTOTUNE results:
Device: cuda
Total elapsed time: 0.0159 seconds
Total inference time (model only): 0.0035 seconds
Total samples: 2048 - Batch size: 64
Throughput: 592801.70 samples/second

COMPILE_OPENXLA results:
Device: xla:0
Total elapsed time: 0.0146 seconds
Total inference time (model only): 0.0033 seconds
Total samples: 2048 - Batch size: 64
Throughput: 611865.07 samples/second
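
If you want to post-process the numbers rather than print them, the return value of benchmark_model can be inspected directly. A minimal sketch, assuming the results behave like a dict keyed by conversion option name (check the examples for the exact structure):

# Sketch only: assumes `results` maps conversion option names to their metrics
for conversion_name, result in results.items():
    print(f"{conversion_name}: {result}")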

Documentation

See the examples for a discussion of design choices and for examples of more advanced usage, e.g. controlling the multiprocessing setup, controlling graceful failures, setting default device fallbacks if a conversion option is incompatible with your specified device, memory-efficient usage of alma, etc.
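
As an illustration of how such options might be set, here is a hedged sketch of an extended configuration. The field names multiprocessing, fail_on_error, and allow_device_override are assumptions made for this sketch; refer to the examples for the authoritative parameter names.

# Hedged sketch: the extra field names below are assumptions, not confirmed API
config = BenchmarkConfig(
    n_samples=2048,
    batch_size=64,
    device=device,
    multiprocessing=True,        # assumed: run each conversion in an isolated process
    fail_on_error=False,         # assumed: fail gracefully and report per-conversion errors
    allow_device_override=True,  # assumed: fall back to a compatible device if needed
)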

Video

See below for a YouTube video that gives an overview of alma and discusses the documentation for advanced usage.

YouTube link.

Conversion Options

Naming conventions

The naming convention for conversion options is as follows:

  • Short but descriptive names for each technique, e.g. EAGER, EXPORT, etc.
  • Underscores _ are used within each technique name to separate the words for readability, e.g. AOT_INDUCTOR, COMPILE_CUDAGRAPHS, etc.
  • If multiple "techniques" are used in a conversion option, then the names are separated by a + sign in chronological order of operation. For example, EXPORT+EAGER, EXPORT+COMPILE_INDUCTOR_MAX_AUTOTUNE. In both cases, EXPORT is the first operation, followed by EAGER or COMPILE_INDUCTOR_MAX_AUTOTUNE.
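
Combined options are passed to benchmark_model by their full "+"-joined names, exactly as they appear in the summary table below. For example:

# Select combined conversion options by their full names
conversions = ["EXPORT+EAGER", "EXPORT+COMPILE_INDUCTOR_MAX_AUTOTUNE"]
results = benchmark_model(model, config, conversions, data_loader=data_loader)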

Conversion Options Summary

Below is a table summarizing the currently supported conversion options and their identifiers:

ID Conversion Option Device Support Project
0 EAGER CPU, MPS, GPU PyTorch
1 EXPORT+EAGER CPU, MPS, GPU torch.export
2 ONNX_CPU CPU ONNXRT
3 ONNX_GPU GPU ONNXRT
4 ONNX+DYNAMO_EXPORT CPU ONNXRT
5 COMPILE_CUDAGRAPHS GPU (CUDA) torch.compile
6 COMPILE_INDUCTOR_DEFAULT CPU, MPS, GPU torch.compile
7 COMPILE_INDUCTOR_REDUCE_OVERHEAD CPU, MPS, GPU torch.compile
8 COMPILE_INDUCTOR_MAX_AUTOTUNE CPU, MPS, GPU torch.compile
9 COMPILE_INDUCTOR_EAGER_FALLBACK CPU, MPS, GPU torch.compile
10 COMPILE_ONNXRT CPU, MPS, GPU torch.compile + ONNXRT
11 COMPILE_OPENXLA XLA_GPU torch.compile + OpenXLA
12 COMPILE_TVM CPU, MPS, GPU torch.compile + Apache TVM
13 EXPORT+AI8WI8_FLOAT_QUANTIZED CPU, MPS, GPU torch.export
14 EXPORT+AI8WI8_FLOAT_QUANTIZED+RUN_DECOMPOSITION CPU, MPS, GPU torch.export
15 EXPORT+AI8WI8_STATIC_QUANTIZED CPU, MPS, GPU torch.export
16 EXPORT+AI8WI8_STATIC_QUANTIZED+RUN_DECOMPOSITION CPU, MPS, GPU torch.export
17 EXPORT+AOT_INDUCTOR CPU, MPS, GPU torch.export + aot_inductor
18 EXPORT+COMPILE_CUDAGRAPHS GPU (CUDA) torch.export + torch.compile
19 EXPORT+COMPILE_INDUCTOR_DEFAULT CPU, MPS, GPU torch.export + torch.compile
20 EXPORT+COMPILE_INDUCTOR_REDUCE_OVERHEAD CPU, MPS, GPU torch.export + torch.compile
21 EXPORT+COMPILE_INDUCTOR_MAX_AUTOTUNE CPU, MPS, GPU torch.export + torch.compile
22 EXPORT+COMPILE_INDUCTOR_DEFAULT_EAGER_FALLBACK CPU, MPS, GPU torch.export + torch.compile
23 EXPORT+COMPILE_ONNXRT CPU, MPS, GPU torch.export + torch.compile + ONNXRT
24 EXPORT+COMPILE_OPENXLA XLA_GPU torch.export + torch.compile + OpenXLA
25 EXPORT+COMPILE_TVM CPU, MPS, GPU torch.export + torch.compile + Apache TVM
26 NATIVE_CONVERT_AI8WI8_STATIC_QUANTIZED CPU CPU (PyTorch)
27 NATIVE_FAKE_QUANTIZED_AI8WI8_STATIC CPU, GPU CPU (PyTorch)
28 COMPILE_TENSORRT GPU (CUDA) torch.compile + NVIDIA TensorRT
29 EXPORT+COMPILE_TENSORRT GPU (CUDA) torch.export + torch.compile + NVIDIA TensorRT
30 COMPILE_OPENVINO CPU (Intel) torch.compile + OpenVINO
31 JIT_TRACE CPU, MPS, GPU PyTorch
32 TORCH_SCRIPT CPU, MPS, GPU PyTorch
33 OPTIMUM_QUANTO_AI8WI8 CPU, MPS, GPU optimum quanto
34 OPTIMUM_QUANTO_AI8WI4 CPU, MPS, GPU (not all GPUs supported) optimum quanto
35 OPTIMUM_QUANTO_AI8WI2 CPU, MPS, GPU (not all GPUs supported) optimum quanto
36 OPTIMUM_QUANTO_WI8 CPU, MPS, GPU optimum quanto
37 OPTIMUM_QUANTO_WI4 CPU, MPS, GPU (not all GPUs supported) optimum quanto
38 OPTIMUM_QUANTO_WI2 CPU, MPS, GPU (not all GPUs supported) optimum quanto
39 OPTIMUM_QUANTO_Wf8E4M3N CPU, MPS, GPU optimum quanto
40 OPTIMUM_QUANTO_Wf8E4M3NUZ CPU, MPS, GPU optimum quanto
41 OPTIMUM_QUANTO_Wf8E5M2 CPU, MPS, GPU optimum quanto
42 OPTIMUM_QUANTO_Wf8E5M2+COMPILE_CUDAGRAPHS GPU (CUDA) optimum quanto + torch.compile
43 FP16+EAGER CPU, MPS, GPU PyTorch
44 BF16+EAGER CPU, MPS, GPU (not all GPUs natively supported) PyTorch
45 COMPILE_INDUCTOR_MAX_AUTOTUNE+TORCHAO_AUTOQUANT_DEFAULT GPU torch.compile + torchao
46 COMPILE_INDUCTOR_MAX_AUTOTUNE+TORCHAO_AUTOQUANT_NONDEFAULT GPU torch.compile + torchao
47 COMPILE_CUDAGRAPHS+TORCHAO_AUTOQUANT_DEFAULT GPU (CUDA) torch.compile + torchao
48 COMPILE_INDUCTOR_MAX_AUTOTUNE+TORCHAO_QUANT_I4_WEIGHT_ONLY GPU (requires bf16 support) torch.compile + torchao
49 TORCHAO_QUANT_I4_WEIGHT_ONLY GPU (requires bf16 support) torchao
50 FP16+COMPILE_CUDAGRAPHS GPU (CUDA) PyTorch + torch.compile
51 FP16+COMPILE_INDUCTOR_DEFAULT CPU, MPS, GPU PyTorch + torch.compile
52 FP16+COMPILE_INDUCTOR_REDUCE_OVERHEAD CPU, MPS, GPU PyTorch + torch.compile
53 FP16+COMPILE_INDUCTOR_MAX_AUTOTUNE CPU, MPS, GPU PyTorch + torch.compile
54 FP16+COMPILE_INDUCTOR_EAGER_FALLBACK CPU, MPS, GPU PyTorch + torch.compile
55 FP16+COMPILE_ONNXRT CPU, MPS, GPU PyTorch + torch.compile + ONNXRT
56 FP16+COMPILE_OPENXLA XLA_GPU PyTorch + torch.compile + OpenXLA
57 FP16+COMPILE_TVM CPU, MPS, GPU PyTorch + torch.compile + Apache TVM
58 FP16+COMPILE_TENSORRT GPU (CUDA) PyTorch + torch.compile + NVIDIA TensorRT
59 FP16+COMPILE_OPENVINO CPU (Intel) PyTorch + torch.compile + OpenVINO
60 FP16+EXPORT+COMPILE_CUDAGRAPHS GPU (CUDA) torch.export + torch.compile
61 FP16+EXPORT+COMPILE_INDUCTOR_DEFAULT CPU, MPS, GPU torch.export + torch.compile
62 FP16+EXPORT+COMPILE_INDUCTOR_REDUCE_OVERHEAD CPU, MPS, GPU torch.export + torch.compile
63 FP16+EXPORT+COMPILE_INDUCTOR_MAX_AUTOTUNE CPU, MPS, GPU torch.export + torch.compile
64 FP16+EXPORT+COMPILE_INDUCTOR_DEFAULT_EAGER_FALLBACK CPU, MPS, GPU torch.export + torch.compile
65 FP16+EXPORT+COMPILE_ONNXRT CPU, MPS, GPU torch.export + torch.compile + ONNXRT
66 FP16+EXPORT+COMPILE_OPENXLA XLA_GPU torch.export + torch.compile + OpenXLA
67 FP16+EXPORT+COMPILE_TVM CPU, MPS, GPU torch.export + torch.compile + Apache TVM
68 FP16+EXPORT+COMPILE_TENSORRT GPU (CUDA) torch.export + torch.compile + NVIDIA TensorRT
69 FP16+EXPORT+COMPILE_OPENVINO CPU (Intel) torch.export + torch.compile + OpenVINO
70 FP16+JIT_TRACE CPU, MPS, GPU PyTorch
71 FP16+TORCH_SCRIPT CPU, MPS, GPU PyTorch
72 BF16+COMPILE_CUDAGRAPHS GPU (CUDA) PyTorch + torch.compile
73 BF16+COMPILE_INDUCTOR_DEFAULT CPU, MPS, GPU PyTorch + torch.compile
74 BF16+COMPILE_INDUCTOR_REDUCE_OVERHEAD CPU, MPS, GPU PyTorch + torch.compile
75 BF16+COMPILE_INDUCTOR_MAX_AUTOTUNE CPU, MPS, GPU PyTorch + torch.compile
76 BF16+COMPILE_INDUCTOR_EAGER_FALLBACK CPU, MPS, GPU PyTorch + torch.compile
77 BF16+COMPILE_ONNXRT CPU, MPS, GPU PyTorch + torch.compile + ONNXRT
78 BF16+COMPILE_OPENXLA XLA_GPU PyTorch + torch.compile + OpenXLA
79 BF16+COMPILE_TVM CPU, MPS, GPU PyTorch + torch.compile + Apache TVM
80 BF16+COMPILE_TENSORRT GPU (CUDA) PyTorch + torch.compile + NVIDIA TensorRT
81 BF16+COMPILE_OPENVINO CPU (Intel) PyTorch + torch.compile + OpenVINO
82 BF16+EXPORT+COMPILE_CUDAGRAPHS GPU (CUDA) torch.export + torch.compile
83 BF16+EXPORT+COMPILE_INDUCTOR_DEFAULT CPU, MPS, GPU torch.export + torch.compile
84 BF16+EXPORT+COMPILE_INDUCTOR_REDUCE_OVERHEAD CPU, MPS, GPU torch.export + torch.compile
85 BF16+EXPORT+COMPILE_INDUCTOR_MAX_AUTOTUNE CPU, MPS, GPU torch.export + torch.compile
86 BF16+EXPORT+COMPILE_INDUCTOR_DEFAULT_EAGER_FALLBACK CPU, MPS, GPU torch.export + torch.compile
87 BF16+EXPORT+COMPILE_ONNXRT CPU, MPS, GPU torch.export + torch.compile + ONNXRT
88 BF16+EXPORT+COMPILE_OPENXLA XLA_GPU torch.export + torch.compile + OpenXLA
89 BF16+EXPORT+COMPILE_TVM CPU, MPS, GPU torch.export + torch.compile + Apache TVM
90 BF16+EXPORT+COMPILE_TENSORRT GPU (CUDA) torch.export + torch.compile + NVIDIA TensorRT
91 BF16+EXPORT+COMPILE_OPENVINO CPU (Intel) torch.export + torch.compile + OpenVINO
92 BF16+JIT_TRACE CPU, MPS, GPU PyTorch
93 BF16+TORCH_SCRIPT CPU, MPS, GPU PyTorch

These conversion options are also all hard-coded in the conversion options file, which is the source of truth.

Testing:

We use pytest for testing. Simply run:

pytest

Test coverage is not yet comprehensive; we are working on adding more tests to ensure that the conversion options work as expected in known environments (e.g. the Docker container).

Future work:

  • Add more conversion options. This is a work in progress, and we are always looking to add more.
  • Multi-device benchmarking. Currently alma only supports single-device benchmarking, but ideally a model could be split across multiple devices.
  • Integrating conversion options beyond PyTorch, e.g. HuggingFace, JAX, llama.cpp, etc.

How to contribute:

Contributions are welcome! If you have a new conversion option, feature, or other improvement you would like to add so that the whole community can benefit, please open a pull request! We are always looking for new conversion options, and we are happy to help you get started with adding one.

See the CONTRIBUTING.md file for more detailed information on how to contribute.

Citation

@Misc{alma,
  title =        {Alma: PyTorch model speed benchmarking across all conversion types},
  author =       {Oscar Savolainen and Saif Haq},
  howpublished = {\url{https://github.com/saifhaq/alma}},
  year =         {2024}
}