
Benchmarking

Laurent Mazare edited this page Jul 6, 2023 · 4 revisions

There are two main metrics of interest, the time to process a prompt (for large prompts) and the time to generate each subsequent token once the initial prompt has been processed.

Prompt Processing Time

Subsequent Per Token Time

CPU Benchmarking

The following command can be used to benchmark the per-token generation time (note that this uses f16 and a single thread).

OMP_NUM_THREADS=1 RAYON_NUM_THREADS=1 cargo run --release --example llama -- \
    --cpu --npy llama.npz --prompt "the answer to life in the universe and everything is"

On a Ryzen 5 2600X, this results in roughly 2s per token (flamegraph).

Bert Sentence Encoding

Still on a Ryzen 5 2600X, the time to process a sentence (cpu/f32) with the benchmark command below is roughly as follows (commit cd230d26fecb2ba69352a125d8ba1a4e75f3e6d1):

  • ~24ms using the default setting (1 thread per core).
  • ~17ms using a single thread via RAYON_NUM_THREADS=1.

With the mkl branch, commit e5b68fa490790e0f903008fda828354bbbadae65:

  • ~12ms using the default setting.
  • ~13ms when forced to a single thread via OMP_NUM_THREADS=1.

The benchmark command:

cargo run --example bert --release -- --cpu --prompt "this is an example sentence that is not too long" --n 100