
rope_benchmark #3550 (Open)

jjsjann123 wants to merge 44 commits into main
Conversation

@jjsjann123 (Collaborator) commented Dec 10, 2024

Rope benchmark extracted from lightning trace.

TODO:

  • add iobytes measurement for benchmarks.
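For the iobytes TODO, a minimal sketch of what the measurement could look like, assuming it simply sums the bytes of all input and output tensors (the helper name `rope_iobytes` is hypothetical):

```python
import torch


def rope_iobytes(inputs: list[torch.Tensor], outputs: list[torch.Tensor]) -> int:
    # Hypothetical helper: bytes read (inputs) plus bytes written (outputs);
    # the benchmark's bandwidth metric would divide this by kernel time.
    return sum(t.numel() * t.element_size() for t in inputs) + sum(
        t.numel() * t.element_size() for t in outputs
    )
```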



@pytest.mark.parametrize(
@jjsjann123 (Collaborator, Author) commented on the diff:

This is the only part that's worth reviewing.

The code above was dumped directly from Kevin's rope example script. (Note that I had to update the script with nv_enable_matmul in thunder.jit; otherwise we see segmentation at the nvFuser definition level.)
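For reference, a sketch of that change, modeled on the snippet later in this PR (the import location of `nvfuserex`, Thunder's nvFuser executor, is an assumption):

```python
import thunder
from thunder.executors.nvfuserex import nvfuserex  # assumed import location


def jit_with_nvfuser(fwd_fn):
    # nv_enable_matmul lets nvFuser claim the matmuls, so the RoPE region
    # stays a single fusion instead of segmenting at the definition level.
    return thunder.jit(
        fwd_fn, nv_enable_bookend=False, executors=[nvfuserex], nv_enable_matmul=True
    )
```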

@jjsjann123 (Collaborator, Author):

I also want to add another toy example where we'll sweep on the batch size. But I'll do that in a separate PR.
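A sketch of what that sweep could look like (the parametrization values and test name are placeholders, not the planned PR's contents):

```python
import pytest


@pytest.mark.parametrize("batch_size", [1, 2, 4, 8, 16])  # placeholder values
def test_rope_batch_sweep_benchmark(benchmark, batch_size):
    # Hypothetical follow-up: run the same RoPE module over a range of batch
    # sizes to see where the kernels go from latency-bound to bandwidth-bound.
    ...
```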

@naoyam (Collaborator) commented Dec 10, 2024

@Priya2698 is adding the Thunder backend #3394. Does it mean we can just have the forward functions?

@Priya2698 (Collaborator):

> @Priya2698 is adding the Thunder backend #3394. Does it mean we can just have the forward functions?

We will also benchmark backward pass with Thunder backend.

@naoyam (Collaborator) commented Dec 10, 2024

> > @Priya2698 is adding the Thunder backend #3394. Does it mean we can just have the forward functions?
>
> We will also benchmark backward pass with Thunder backend.

Yes, so we don't need to have the backward implementations explicitly, right?

@jjsjann123 jjsjann123 marked this pull request as draft December 10, 2024 21:28
@jjsjann123 (Collaborator, Author):

Looking at the thunder-nvfuser timing.

Strangely, the benchmark numbers don't match those from Kevin's example. These are the measurements from pytest:

Name (time in us)                                                                                       Min                   Max                  Mean            StdDev                Median               IQR            Outliers         OPS            Rounds  Iterations
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_rope_variations_fwd_benchmark[executor='thunder'-rope_variation='hf_qwen2_rope']              204.8290 (1.0)        212.5130 (1.0)        207.1972 (1.0)      2.5573 (2.49)       206.0485 (1.0)      4.0260 (4.17)          2;0  4,826.3200 (1.0)          10           1
test_rope_variations_fwd_benchmark[executor='thunder'-rope_variation='hf_mistral_nemo_rope']       320.3510 (1.56)       324.3850 (1.53)       322.8819 (1.56)     1.3519 (1.32)       322.8555 (1.57)     1.8470 (1.91)          3;0  3,097.1076 (0.64)         10           1
test_rope_variations_bwd_benchmark[executor='thunder'-rope_variation='hf_qwen2_rope']              356.9320 (1.74)       360.3840 (1.70)       357.8536 (1.73)     1.0271 (1.0)        357.7265 (1.74)     0.9920 (1.03)          1;1  2,794.4388 (0.58)         10           1
test_rope_variations_bwd_benchmark[executor='thunder'-rope_variation='hf_mistral_nemo_rope']       428.8940 (2.09)       432.8350 (2.04)       430.9671 (2.08)     1.1889 (1.16)       431.0560 (2.09)     1.8540 (1.92)          3;0  2,320.3627 (0.48)         10           1
test_rope_variations_fwd_benchmark[executor='thunder'-rope_variation='hf_phi3_rope']               548.0630 (2.68)       554.1090 (2.61)       552.0020 (2.66)     1.6203 (1.58)       552.3545 (2.68)     0.9650 (1.0)           2;2  1,811.5876 (0.38)         10           1
test_rope_variations_fwd_benchmark[executor='thunder'-rope_variation='llama_2_7b_hf_rope']         621.6160 (3.03)       626.1340 (2.95)       623.5093 (3.01)     1.6043 (1.56)       623.0065 (3.02)     2.3690 (2.45)          4;0  1,603.8253 (0.33)         10           1
test_rope_variations_bwd_benchmark[executor='thunder'-rope_variation='hf_phi3_rope']             1,022.0870 (4.99)     1,028.2720 (4.84)     1,024.4110 (4.94)     2.0313 (1.98)     1,024.3360 (4.97)     3.5130 (3.64)          2;0    976.1707 (0.20)         10           1
test_rope_variations_bwd_benchmark[executor='thunder'-rope_variation='llama_2_7b_hf_rope']       1,308.1660 (6.39)     1,313.6600 (6.18)     1,310.4751 (6.32)     2.0083 (1.96)     1,310.5750 (6.36)     3.5940 (3.72)          5;0    763.0820 (0.16)         10           1
test_rope_variations_fwd_benchmark[executor='thunder'-rope_variation='llama_3_8B_rope']          1,373.1600 (6.70)     1,382.4350 (6.51)     1,377.5739 (6.65)     2.3928 (2.33)     1,377.8270 (6.69)     2.2130 (2.29)          2;1    725.9139 (0.15)         10           1
test_rope_variations_bwd_benchmark[executor='thunder'-rope_variation='llama_3_8B_rope']          1,925.9490 (9.40)     1,936.4170 (9.11)     1,931.5364 (9.32)     2.8123 (2.74)     1,931.2535 (9.37)     2.3720 (2.46)          3;1    517.7226 (0.11)         10           1

But if I run the manual rope_example scripts, I get these:

root@a9fb56dcd91f:/volume/rope/rope_examples# python hf_phi3.py --execs Thunder-nvFuser
                             Model  Batch-Size  ...  Forward-Time(ms) Backward-Time(ms)
0  microsoft/Phi-3.5-mini-instruct           1  ...             0.597             0.739
root@a9fb56dcd91f:/volume/rope/rope_examples# python hf_qwen2.py --execs Thunder-nvFuser
                      Model  Batch-Size  ...  Forward-Time(ms) Backward-Time(ms)
0  Qwen/Qwen2.5-7B-Instruct           1  ...             0.397             0.507
root@a9fb56dcd91f:/volume/rope/rope_examples# python hf_mistral_nemo.py --execs Thunder-nvFuser
                              Model  Batch-Size  ...  Forward-Time(ms) Backward-Time(ms)
0  mistralai/Mistral-Nemo-Base-2407           1  ...             0.593             0.322
root@a9fb56dcd91f:/volume/rope/rope_examples# python lit_gpt_models.py --execs Thunder-nvFuser
           Model  Batch-Size  Sequence-Length         Executor  Forward-Time(ms)  Backward-Time(ms)
0  Llama-2-7b-hf           2             4096  Thunder-nvFuser             0.629              0.960
        Model  Batch-Size  Sequence-Length         Executor  Forward-Time(ms)  Backward-Time(ms)
0  Llama-3-8B           2             8192  Thunder-nvFuser             1.383              1.567

I'll double-check the measurement script, as well as the compile options (i.e. thunder trace options).

We need to do the same sanity check for torchcompile later.

@jjsjann123 jjsjann123 requested a review from Priya2698 December 23, 2024 21:28
@jjsjann123 (Collaborator, Author):

!test

@Priya2698 (Collaborator) left a comment:

Can you post your final numbers for the benchmark in the description, and note whether they match what you see from Kevin's script?

@jjsjann123 (Collaborator, Author) commented Dec 30, 2024

Pulled the numbers on this branch:

----------------------------------------------------------------------------------------------------------------------------- benchmark: 10 tests -----------------------------------------------------------------------------------------------------------------------------
Name (time in us)                                                                                       Min                   Max                  Mean            StdDev                Median               IQR            Outliers         OPS            Rounds  Iterations
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_rope_variations_fwd_benchmark[executor='thunder'-rope_variation='llama_2_7b_hf_rope']         618.3680 (3.01)       628.8100 (2.97)       624.3230 (3.02)     3.2188 (3.73)       624.2075 (3.04)     4.3820 (2.75)          3;0  1,601.7350 (0.33)         10           1
test_rope_variations_bwd_benchmark[executor='thunder'-rope_variation='llama_2_7b_hf_rope']       1,314.6250 (6.41)     1,319.4610 (6.24)     1,317.8842 (6.36)     1.5359 (1.78)     1,318.4260 (6.41)     1.9490 (1.22)          3;0    758.7920 (0.16)         10           1
test_rope_variations_fwd_benchmark[executor='thunder'-rope_variation='llama_3_8B_rope']          1,376.2430 (6.71)     1,383.2420 (6.54)     1,378.7306 (6.66)     2.1992 (2.55)     1,378.4055 (6.70)     2.3330 (1.46)          3;1    725.3049 (0.15)         10           1
test_rope_variations_bwd_benchmark[executor='thunder'-rope_variation='llama_3_8B_rope']          1,926.8670 (9.39)     1,940.1780 (9.17)     1,932.1506 (9.33)     4.2306 (4.90)     1,930.0695 (9.39)     6.3500 (3.98)          2;0    517.5580 (0.11)         10           1
test_rope_variations_fwd_benchmark[executor='thunder'-rope_variation='hf_mistral_nemo_rope']       322.8510 (1.57)       329.6040 (1.56)       325.2294 (1.57)     2.9409 (3.41)       323.6790 (1.57)     6.0960 (3.82)          3;0  3,074.7528 (0.64)         10           1
test_rope_variations_bwd_benchmark[executor='thunder'-rope_variation='hf_mistral_nemo_rope']       311.9970 (1.52)       317.2140 (1.50)       314.4097 (1.52)     1.6342 (1.89)       314.2355 (1.53)     2.3720 (1.49)          3;0  3,180.5634 (0.66)         10           1
test_rope_variations_fwd_benchmark[executor='thunder'-rope_variation='hf_phi3_rope']               695.5210 (3.39)       701.7980 (3.32)       699.4310 (3.38)     1.9682 (2.28)       699.9025 (3.40)     2.8590 (1.79)          2;0  1,429.7336 (0.30)         10           1
test_rope_variations_bwd_benchmark[executor='thunder'-rope_variation='hf_phi3_rope']               851.0710 (4.15)       856.7360 (4.05)       854.0387 (4.12)     1.6840 (1.95)       853.9010 (4.15)     2.0720 (1.30)          3;0  1,170.9071 (0.24)         10           1
test_rope_variations_fwd_benchmark[executor='thunder'-rope_variation='hf_qwen2_rope']              205.1190 (1.0)        211.5500 (1.0)        207.0699 (1.0)      2.4760 (2.87)       205.6310 (1.0)      4.0360 (2.53)          2;0  4,829.2871 (1.0)          10           1
test_rope_variations_bwd_benchmark[executor='thunder'-rope_variation='hf_qwen2_rope']              356.7350 (1.74)       359.1000 (1.70)       357.8798 (1.73)     0.8628 (1.0)        357.8220 (1.74)     1.5960 (1.0)           5;0  2,794.2343 (0.58)         10           1
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| Model | Batch-Size | Sequence-Length | Executor | Forward-Time(ms) | Backward-Time(ms) |
|---|---|---|---|---|---|
| Llama-2-7b-hf | 2 | 4096 | Thunder-nvFuser | 0.628 | 0.959 |
| Llama-3-8B | 2 | 8192 | Thunder-nvFuser | 1.387 | 1.574 |
| mistralai/Mistral-Nemo-Base-2407 | 1 | 4096 | Thunder-nvFuser | 0.583 | 0.323 |
| microsoft/Phi-3.5-mini-instruct | 1 | 8192 | Thunder-nvFuser | 0.587 | 0.738 |
| Qwen/Qwen2.5-7B-Instruct | 1 | 4096 | Thunder-nvFuser | 0.391 | 0.498 |

Kernel times mostly check out. The difference between the pytest benchmark and the manual measurement comes from the earlier comment: #3550 (comment)

E.g. for llama_2_7b_hf_rope backward: the benchmark measurement includes a pointwise apply kernel, which takes about 363 us, and that explains why the benchmark shows a longer kernel time compared to the manual benchmark.

@jjsjann123 jjsjann123 requested a review from Priya2698 December 30, 2024 22:00
@naoyam (Collaborator) commented Dec 30, 2024

> Pulled the numbers on this branch: […]
>
> Kernel times mostly check out. The difference between the pytest benchmark and the manual measurement comes from the earlier comment: #3550 (comment)

Can you run them on a PJNL H100 machine? Also, what do the results with torch.compile look like?

@jjsjann123 (Collaborator, Author):

These are the perf numbers on A100 for torch.compile:

Name (time in us)                                                                                            Min                   Max                  Mean            StdDev                Median               IQR            Outliers         OPS            Rounds  Iterations
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_rope_variations_fwd_benchmark[executor='torchcompile'-rope_variation='llama_2_7b_hf_rope']         160.5770 (1.39)       162.0170 (1.37)       161.3019 (1.38)     0.4042 (2.06)       161.3440 (1.38)     0.1930 (1.0)           3;3  6,199.5550 (0.73)         10           1
test_rope_variations_bwd_benchmark[executor='torchcompile'-rope_variation='llama_2_7b_hf_rope']         909.7630 (7.87)       913.4440 (7.70)       911.5028 (7.79)     1.0986 (5.59)       911.5555 (7.79)     0.8960 (4.64)          3;2  1,097.0893 (0.13)         10           1
test_rope_variations_fwd_benchmark[executor='torchcompile'-rope_variation='llama_3_8B_rope']            288.2230 (2.49)       294.4970 (2.48)       291.6220 (2.49)     1.6176 (8.23)       291.7120 (2.49)     0.7390 (3.83)          2;3  3,429.0966 (0.40)         10           1
test_rope_variations_bwd_benchmark[executor='torchcompile'-rope_variation='llama_3_8B_rope']            997.5660 (8.63)     1,002.8170 (8.45)     1,000.8923 (8.55)     1.9805 (10.07)    1,001.8440 (8.56)     3.1030 (16.08)         2;0    999.1085 (0.12)         10           1
test_rope_variations_fwd_benchmark[executor='torchcompile'-rope_variation='hf_mistral_nemo_rope']       235.5220 (2.04)       236.3820 (1.99)       235.8931 (2.01)     0.2827 (1.44)       235.8240 (2.01)     0.4470 (2.32)          5;0  4,239.2084 (0.50)         10           1
test_rope_variations_bwd_benchmark[executor='torchcompile'-rope_variation='hf_mistral_nemo_rope']       333.6660 (2.89)       334.9830 (2.82)       334.3401 (2.86)     0.4517 (2.30)       334.3825 (2.86)     0.7390 (3.83)          4;0  2,990.9664 (0.35)         10           1
test_rope_variations_fwd_benchmark[executor='torchcompile'-rope_variation='hf_phi3_rope']               194.0040 (1.68)       194.4960 (1.64)       194.2168 (1.66)     0.1966 (1.0)        194.1290 (1.66)     0.3890 (2.02)          5;0  5,148.8852 (0.60)         10           1
test_rope_variations_bwd_benchmark[executor='torchcompile'-rope_variation='hf_phi3_rope']             2,402.0750 (20.77)    2,407.7960 (20.30)    2,404.3932 (20.54)    1.8007 (9.16)     2,403.8470 (20.54)    2.8140 (14.58)         2;0    415.9054 (0.05)         10           1
test_rope_variations_fwd_benchmark[executor='torchcompile'-rope_variation='hf_qwen2_rope']              115.6500 (1.0)        118.6250 (1.0)        117.0693 (1.0)      0.9196 (4.68)       117.0470 (1.0)      1.4730 (7.63)          3;0  8,541.9491 (1.0)          10           1
test_rope_variations_bwd_benchmark[executor='torchcompile'-rope_variation='hf_qwen2_rope']              320.7930 (2.77)       322.5930 (2.72)       321.6137 (2.75)     0.4587 (2.33)       321.6130 (2.75)     0.2860 (1.48)          2;2  3,109.3203 (0.36)         10           1
| Model | Batch-Size | Sequence-Length | Executor | Forward-Time(ms) | Backward-Time(ms) |
|---|---|---|---|---|---|
| Llama-2-7b-hf | 2 | 4096 | torch.compile | 0.164 | 0.547 |
| Llama-3-8B | 2 | 8192 | torch.compile | 0.738 | 5.949 |
| mistralai/Mistral-Nemo-Base-2407 | 1 | 4096 | torch.compile | 0.298 | 0.238 |
| microsoft/Phi-3.5-mini-instruct | 1 | 8192 | torch.compile | 0.197 | 2.120 |
| Qwen/Qwen2.5-7B-Instruct | 1 | 4096 | torch.compile | 0.128 | 0.234 |

Llama-3-8B backward looks somewhat odd, but that seems to come from running both Llama configs in a single session. If I run Llama-3-8B by itself, the timing looks about right.

root@867d93f125ca:/volume/rope/rope_examples# python lit_gpt_models.py --execs torch.compile
        Model  Batch-Size  Sequence-Length       Executor  Forward-Time(ms)  Backward-Time(ms)
0  Llama-3-8B           2             8192  torch.compile             0.292              0.649

We could be picking up some cached kernels?!
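If it is a caching artifact, one way to rule it out is to reset the torch.compile state between configs; a minimal sketch:

```python
import torch

# Between the Llama-2 and Llama-3 runs, drop compiled artifacts so the second
# config cannot reuse (or be perturbed by) kernels compiled for the first.
torch._dynamo.reset()
torch.cuda.empty_cache()  # also release cached allocator blocks
```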

@naoyam (Collaborator) commented Dec 31, 2024

Kevin's script has two variations of torch.compile-based execution: torch.compile and Thunder-torch.compile. Could you please show both? He mentioned Thunder-torch.compile is what we should look at, and I'm seeing much faster performance than plain torch.compile on H100.

@jjsjann123 (Collaborator, Author):

Meanwhile, here are the H100 numbers for nvFuser:

Name (time in us)                                                                                     Min                 Max                Mean            StdDev              Median               IQR            Outliers  OPS (Kops/s)            Rounds  Iterations
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_rope_variations_fwd_benchmark[executor='thunder'-rope_variation='llama_2_7b_hf_rope']       305.5070 (2.81)     309.1870 (2.73)     306.8002 (2.78)     1.0669 (2.88)     306.5605 (2.80)     0.6750 (1.52)          3;2        3.2595 (0.36)         10           1
test_rope_variations_bwd_benchmark[executor='thunder'-rope_variation='llama_2_7b_hf_rope']       611.7420 (5.63)     614.2720 (5.43)     613.4171 (5.55)     0.7574 (2.04)     613.7120 (5.61)     1.0210 (2.30)          2;0        1.6302 (0.18)         10           1
test_rope_variations_fwd_benchmark[executor='thunder'-rope_variation='llama_3_8B_rope']          640.4120 (5.89)     643.6720 (5.69)     642.0621 (5.81)     0.8957 (2.42)     642.0985 (5.87)     1.1180 (2.52)          2;0        1.5575 (0.17)         10           1
test_rope_variations_bwd_benchmark[executor='thunder'-rope_variation='llama_3_8B_rope']          892.1050 (8.21)     896.1000 (7.92)     893.8670 (8.09)     1.1381 (3.07)     894.0450 (8.17)     1.4170 (3.19)          3;0        1.1187 (0.12)         10           1
test_rope_variations_fwd_benchmark[executor='thunder'-rope_variation='hf_mistral_nemo_rope']     192.5720 (1.77)     197.0260 (1.74)     194.4665 (1.76)     1.8428 (4.97)     194.0145 (1.77)     3.5230 (7.93)          4;0        5.1423 (0.57)         10           1
test_rope_variations_bwd_benchmark[executor='thunder'-rope_variation='hf_mistral_nemo_rope']     156.7050 (1.44)     158.7880 (1.40)     157.7894 (1.43)     0.6554 (1.77)     157.7595 (1.44)     0.6710 (1.51)          4;0        6.3376 (0.70)         10           1
test_rope_variations_fwd_benchmark[executor='thunder'-rope_variation='hf_phi3_rope']             348.2880 (3.20)     355.4230 (3.14)     352.0209 (3.19)     2.2721 (6.13)     352.3665 (3.22)     3.8130 (8.59)          2;0        2.8407 (0.31)         10           1
test_rope_variations_bwd_benchmark[executor='thunder'-rope_variation='hf_phi3_rope']             408.4810 (3.76)     409.7590 (3.62)     409.1938 (3.71)     0.3709 (1.0)      409.1870 (3.74)     0.4440 (1.0)           3;0        2.4438 (0.27)         10           1
test_rope_variations_fwd_benchmark[executor='thunder'-rope_variation='hf_qwen2_rope']            108.7010 (1.0)      113.1840 (1.0)      110.4311 (1.0)      1.7987 (4.85)     109.4530 (1.0)      3.3280 (7.50)          2;0        9.0554 (1.0)          10           1
test_rope_variations_bwd_benchmark[executor='thunder'-rope_variation='hf_qwen2_rope']            340.5750 (3.13)     344.0330 (3.04)     342.3673 (3.10)     1.0602 (2.86)     342.4130 (3.13)     1.1170 (2.52)          4;0        2.9208 (0.32)         10           1
| Model | Batch-Size | Sequence-Length | Executor | Forward-Time(ms) | Backward-Time(ms) |
|---|---|---|---|---|---|
| Llama-2-7b-hf | 2 | 4096 | Thunder-nvFuser | 0.308 | 0.453 |
| Llama-3-8B | 2 | 8192 | Thunder-nvFuser | 0.655 | 0.725 |
| mistralai/Mistral-Nemo-Base-2407 | 1 | 4096 | Thunder-nvFuser | 0.536 | 0.384 |
| microsoft/Phi-3.5-mini-instruct | 1 | 8192 | Thunder-nvFuser | 0.499 | 0.276 |
| Qwen/Qwen2.5-7B-Instruct | 1 | 4096 | Thunder-nvFuser | 0.453 | 0.571 |

@jjsjann123 (Collaborator, Author):

> Kevin's script has two variations of torch.compile-based execution: torch.compile and Thunder-torch.compile. Could you please show both? He mentioned Thunder-torch.compile is what we should look at, and I'm seeing much faster performance than plain torch.compile on H100.

Interesting.

Is Thunder-torch.compile what we should be using in our benchmark as well? I'm asking since we do not have that executor in the pytest benchmark yet.

@naoyam (Collaborator) commented Dec 31, 2024

Here's what I got on a PJNL H100 machine with the mistral benchmark:

 Model  Batch-Size  Sequence-Length               Executor  Forward-Kernels  Forward-Time(ms)  Backward Kernels  Backward-Time(ms)
0  mistralai/Mistral-Nemo-Base-2407           1             4096            Torch-Eager               20             0.498                39              0.552
1  mistralai/Mistral-Nemo-Base-2407           1             4096          torch.compile                7             0.158                 4              0.146
2  mistralai/Mistral-Nemo-Base-2407           1             4096  Thunder-torch.compile                9             0.077                 4              0.105
3  mistralai/Mistral-Nemo-Base-2407           1             4096          Thunder-Torch               42             1.196                55              1.466
4  mistralai/Mistral-Nemo-Base-2407           1             4096        Thunder-nvFuser                8             0.202                 6              0.199

As you can see, the Thunder-torch.compile version is twice as fast as plain torch.compile. The nvFuser result here is also much faster than your results.

@naoyam (Collaborator) commented Dec 31, 2024

> Is Thunder-torch.compile what we should be using in our benchmark as well? I'm asking since we do not have that executor in the pytest benchmark yet.

That's what I heard from @kevinstephano.

@naoyam (Collaborator) commented Dec 31, 2024

Another question, maybe for @Priya2698: can we enable result verification by default? I remember there's still a tolerance issue, but since these RoPE benchmarks have almost no reductions (there are some in the backward cases), maybe verification would work fine?
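For context, verification here would amount to comparing each executor's outputs against eager under a tolerance; a minimal sketch (tolerances are placeholders, not the benchmark suite's actual settings):

```python
import torch


def verify_against_eager(out: torch.Tensor, eager_out: torch.Tensor) -> None:
    # Placeholder tolerances: RoPE forward is mostly pointwise and should match
    # eager tightly; backward has some reductions and may need looser bounds.
    torch.testing.assert_close(out, eager_out, rtol=1e-3, atol=1e-3)
```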

@jjsjann123 (Collaborator, Author):

> Here's what I got on a PJNL H100 machine with the mistral benchmark: […]
>
> As you can see, the Thunder-torch.compile version is twice as fast as plain torch.compile. The nvFuser result here is also much faster than your results.

I can double-check the profile tomorrow; I was using the H100x2 node today.
It's possible the forward trace has gaps between kernels, so the manual benchmark could be bottlenecked by CPU time instead of kernel time.
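One way to check: compare the wall-clock per-iteration time against the summed GPU kernel time from the profiler; a large gap means the run is launch-bound. A minimal sketch (`fn` and `args` stand in for the jitted forward and its inputs):

```python
import time

import torch


def wall_time_ms(fn, *args, iters: int = 10) -> float:
    # Wall-clock time per iteration. If this is much larger than the GPU
    # kernel time, the trace has launch gaps and the manual benchmark is
    # measuring CPU overhead rather than kernel time.
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        fn(*args)
    torch.cuda.synchronize()
    return (time.perf_counter() - t0) * 1e3 / iters
```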

@jjsjann123 (Collaborator, Author):

> > Is Thunder-torch.compile what we should be using in our benchmark as well? I'm asking since we do not have that executor in the pytest benchmark yet.
>
> That's what I heard from @kevinstephano.

Noted. I'll add another executor.
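A sketch of what the extra branch could look like, mirroring the existing thunder branch in this PR and assuming Thunder exposes its torch.compile executor as `torch_compile_ex` (import location assumed):

```python
import thunder
from thunder.executors.torch_compile import torch_compile_ex  # assumed location


def jit_with_torchcompile(fwd_fn):
    # Route Thunder's fusion regions to torch.compile instead of nvFuser,
    # matching the Thunder-torch.compile executor in Kevin's script.
    return thunder.jit(fwd_fn, nv_enable_bookend=False, executors=[torch_compile_ex])
```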

@jjsjann123 (Collaborator, Author):

I realized that Kevin's benchmark script had been updated to also measure profiler time, and I was two commits behind that. The previous discrepancy came from the different measurement methods.
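For reference, "profiler time" here means summing only the GPU kernel durations, ignoring launch gaps; a sketch of one way to measure that with torch.profiler (`fn` and `args` are placeholders):

```python
import torch
from torch.profiler import ProfilerActivity, profile


def kernel_time_ms(fn, *args, iters: int = 10) -> float:
    # Sum GPU kernel durations only, so the number is comparable to what the
    # pytest benchmark reports rather than to wall-clock time.
    with profile(activities=[ProfilerActivity.CUDA]) as prof:
        for _ in range(iters):
            fn(*args)
        torch.cuda.synchronize()
    total_us = sum(e.self_cuda_time_total for e in prof.key_averages())
    return total_us / iters / 1e3
```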

@jjsjann123 (Collaborator, Author):

With the updated manual benchmark, we are now making an apples-to-apples comparison.

On H100:

Manual benchmark:

| Model | Executor | Forward-Kernels | Forward-Time(ms) | Backward-Kernels | Backward-Time(ms) |
|---|---|---|---|---|---|
| Llama-2-7b-hf | Thunder-nvFuser | 4 | 0.346 | 5 | 0.504 |
| Llama-2-7b-hf | Thunder-torch.compile | 1 | 0.093 | 2 | 0.277 |
| Llama-3-8B | Thunder-nvFuser | 5 | 0.733 | 5 | 0.814 |
| Llama-3-8B | Thunder-torch.compile | 2 | 0.162 | 3 | 0.616 |
| microsoft/Phi-3.5-mini-instruct | Thunder-nvFuser | 7 | 0.298 | 6 | 0.383 |
| microsoft/Phi-3.5-mini-instruct | Thunder-torch.compile | 6 | 0.084 | 2 | 0.228 |
| mistralai/Mistral-Nemo-Base-2407 | Thunder-nvFuser | 8 | 0.163 | 6 | 0.136 |
| mistralai/Mistral-Nemo-Base-2407 | Thunder-torch.compile | 9 | 0.074 | 4 | 0.103 |
| Qwen/Qwen2.5-7B-Instruct | Thunder-nvFuser | 5 | 0.109 | 8 | 0.318 |
| Qwen/Qwen2.5-7B-Instruct | Thunder-torch.compile | 4 | 0.049 | 5 | 0.157 |

Pytest benchmark:

Name (time in us)                                                                                                Median
----------------------------------------------------------------------------------------------------------------------------------
test_rope_variations_fwd_benchmark[executor='thunder'-rope_variation='llama_2_7b_hf_rope']                     347.5185 (6.90)
test_rope_variations_fwd_benchmark[executor='thunder-torchcompile'-rope_variation='llama_2_7b_hf_rope']         93.2800 (1.85)
test_rope_variations_bwd_benchmark[executor='thunder'-rope_variation='llama_2_7b_hf_rope']                     707.8535 (14.06)
test_rope_variations_bwd_benchmark[executor='thunder-torchcompile'-rope_variation='llama_2_7b_hf_rope']        471.2945 (9.36)
 
test_rope_variations_fwd_benchmark[executor='thunder'-rope_variation='llama_3_8B_rope']                        735.9030 (14.62)
test_rope_variations_fwd_benchmark[executor='thunder-torchcompile'-rope_variation='llama_3_8B_rope']           161.2320 (3.20)
test_rope_variations_bwd_benchmark[executor='thunder'-rope_variation='llama_3_8B_rope']                      1,017.9635 (20.22)
test_rope_variations_bwd_benchmark[executor='thunder-torchcompile'-rope_variation='llama_3_8B_rope']           809.3740 (16.08)
 
test_rope_variations_fwd_benchmark[executor='thunder'-rope_variation='hf_phi3_rope']                           363.3590 (7.22)    
test_rope_variations_fwd_benchmark[executor='thunder-torchcompile'-rope_variation='hf_phi3_rope']               83.9865 (1.67)    
test_rope_variations_bwd_benchmark[executor='thunder'-rope_variation='hf_phi3_rope']                           471.2105 (9.36)    
test_rope_variations_bwd_benchmark[executor='thunder-torchcompile'-rope_variation='hf_phi3_rope']              374.9610 (7.45)    
 
test_rope_variations_fwd_benchmark[executor='thunder'-rope_variation='hf_mistral_nemo_rope']                   175.7105 (3.49)    
test_rope_variations_fwd_benchmark[executor='thunder-torchcompile'-rope_variation='hf_mistral_nemo_rope']       74.0490 (1.47)
test_rope_variations_bwd_benchmark[executor='thunder'-rope_variation='hf_mistral_nemo_rope']                   178.3185 (3.54)
test_rope_variations_bwd_benchmark[executor='thunder-torchcompile'-rope_variation='hf_mistral_nemo_rope']      162.8475 (3.24)
 
test_rope_variations_fwd_benchmark[executor='thunder'-rope_variation='hf_qwen2_rope']                          120.9745 (2.40)
test_rope_variations_fwd_benchmark[executor='thunder-torchcompile'-rope_variation='hf_qwen2_rope']              50.3365 (1.0)
test_rope_variations_bwd_benchmark[executor='thunder'-rope_variation='hf_qwen2_rope']                          381.1360 (7.57)
test_rope_variations_bwd_benchmark[executor='thunder-torchcompile'-rope_variation='hf_qwen2_rope']             211.7920 (4.21)
----------------------------------------------------------------------------------------------------------------------------------

Fwd time mostly matches (except for hf_phi3; that's because we enable matmul in the pytest benchmark, which the manual benchmark does not).
Bwd time looks about right; the pytest benchmark shows higher times coming from the grad accumulation kernel.
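For reference, the accumulation kernel comes from how backward is driven; a small self-contained illustration of the difference (assuming the pytest benchmark drives backward via `.backward()` with pre-populated grads):

```python
import torch

x = torch.randn(1024, 1024, device="cuda", requires_grad=True)
y = (x * 2.0).sum()

# After the first backward, x.grad exists, so each subsequent .backward()
# launches an extra pointwise kernel to accumulate into it. That kernel falls
# inside the pytest benchmark's timing window.
y.backward(retain_graph=True)
y.backward(retain_graph=True)  # adds into the existing x.grad

# torch.autograd.grad returns a fresh tensor and skips the accumulation step,
# which is closer to what the manual benchmark measured.
(gx,) = torch.autograd.grad(y, x)
```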

):
    kwargs = {}
    if executor == "thunder":
        kwargs["nv_enable_matmul"] = True
@jjsjann123 (Collaborator, Author) commented on the diff:

BTW, this is where I'm enabling matmul in nvFuser.

This gives us a single fusion region, which I believe is something we want. cc'ing @naoyam

        return thunder.jit(
            fwd_fn, nv_enable_bookend=False, executors=[nvfuserex], **kwargs
        )
    if executor == "thunder-torchcompile":
@jjsjann123 (Collaborator, Author) commented on the diff:

thunder-torchcompile is the config we want for the RoPE comparison. Not sure if this is something we would also like to enable for other benchmarks. cc'ing @Priya2698

-def with_executor(executor: str, fwd_fn: Callable) -> Callable:
-    assert executor in ["eager", "torchcompile", "thunder"]
+def with_executor(executor: str, fwd_fn: Callable, **kwargs) -> Callable:
+    assert executor in ["eager", "torchcompile", "thunder", "thunder-torchcompile"]
A Collaborator commented on the diff:

Suggested change:
-    assert executor in ["eager", "torchcompile", "thunder", "thunder-torchcompile"]
+    assert executor in ["eager", "torchcompile", "thunder-nvfuser", "thunder-torchcompile"]

@@ -221,9 +226,9 @@ def set_metrics(
     % Peak Bandwidth (SOL): 100 * Bandwidth / PEAK_BANDWIDTH
     """
     if not iobytes:
-        if isinstance(inputs, torch.Tensor):
+        if not isinstance(inputs, Iterable):
A Collaborator commented on the diff:

Why is this changed?

@pytest.mark.parametrize(
    "executor", ["eager", "torchcompile", "thunder", "thunder-torchcompile"]
)
def test_rope_variations_fwd_benchmark(
A Collaborator commented on the diff:

IIUC, we also have test_rope_benchmark, which is separate from test_rope_variations_fwd_benchmark and test_rope_variations_bwd_benchmark. What does test_rope_benchmark evaluate?
