
rope_benchmark #3550 (Open)

jjsjann123 wants to merge 44 commits into main
Conversation

@jjsjann123 (Collaborator) commented Dec 10, 2024

Rope benchmark extracted from lightning trace.

TODO:

  • add iobytes measurement for benchmarks.
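For the iobytes TODO, a minimal sketch of what the measurement could look like, assuming it simply sums the bytes of all input and output tensors (the helper name `rope_iobytes` is hypothetical):

```python
import torch


def rope_iobytes(inputs: list[torch.Tensor], outputs: list[torch.Tensor]) -> int:
    # Hypothetical helper: bytes read (inputs) plus bytes written (outputs);
    # the benchmark's bandwidth metric would divide this by kernel time.
    return sum(t.numel() * t.element_size() for t in inputs) + sum(
        t.numel() * t.element_size() for t in outputs
    )
```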



@pytest.mark.parametrize(
@jjsjann123 (Collaborator, Author) commented on the diff:

This is the only part that's worth reviewing.

The code above was dumped directly from Kevin's rope example script. (Note that I had to update the script with nv_enable_matmul in thunder.jit; otherwise we see segmentation at the nvFuser definition level.)
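For reference, a sketch of that change, modeled on the snippet later in this PR (the import location of `nvfuserex`, Thunder's nvFuser executor, is an assumption):

```python
import thunder
from thunder.executors.nvfuserex import nvfuserex  # assumed import location


def jit_with_nvfuser(fwd_fn):
    # nv_enable_matmul lets nvFuser claim the matmuls, so the RoPE region
    # stays a single fusion instead of segmenting at the definition level.
    return thunder.jit(
        fwd_fn, nv_enable_bookend=False, executors=[nvfuserex], nv_enable_matmul=True
    )
```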

@jjsjann123 (Collaborator, Author):

I also want to add another toy example where we'll sweep on the batch size. But I'll do that in a separate PR.
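A sketch of what that sweep could look like (the parametrization values and test name are placeholders, not the planned PR's contents):

```python
import pytest


@pytest.mark.parametrize("batch_size", [1, 2, 4, 8, 16])  # placeholder values
def test_rope_batch_sweep_benchmark(benchmark, batch_size):
    # Hypothetical follow-up: run the same RoPE module over a range of batch
    # sizes to see where the kernels go from latency-bound to bandwidth-bound.
    ...
```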

@naoyam (Collaborator) commented Dec 10, 2024

@Priya2698 is adding the Thunder backend #3394. Does it mean we can just have the forward functions?

@Priya2698 (Collaborator):

> @Priya2698 is adding the Thunder backend #3394. Does it mean we can just have the forward functions?

We will also benchmark backward pass with Thunder backend.

@naoyam (Collaborator) commented Dec 10, 2024

> > @Priya2698 is adding the Thunder backend #3394. Does it mean we can just have the forward functions?
>
> We will also benchmark backward pass with Thunder backend.

Yes, so we don't need to have the backward implementations explicitly, right?

@jjsjann123 jjsjann123 marked this pull request as draft December 10, 2024 21:28
@jjsjann123 (Collaborator, Author):

Looking at the thunder-nvfuser timing.

Strangely, the benchmark numbers don't match those from Kevin's example. These are the measurements from pytest:

Name (time in us)                                                                                       Min                   Max                  Mean            StdDev                Median               IQR            Outliers         OPS            Rounds  Iterations
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_rope_variations_fwd_benchmark[executor='thunder'-rope_variation='hf_qwen2_rope']              204.8290 (1.0)        212.5130 (1.0)        207.1972 (1.0)      2.5573 (2.49)       206.0485 (1.0)      4.0260 (4.17)          2;0  4,826.3200 (1.0)          10           1
test_rope_variations_fwd_benchmark[executor='thunder'-rope_variation='hf_mistral_nemo_rope']       320.3510 (1.56)       324.3850 (1.53)       322.8819 (1.56)     1.3519 (1.32)       322.8555 (1.57)     1.8470 (1.91)          3;0  3,097.1076 (0.64)         10           1
test_rope_variations_bwd_benchmark[executor='thunder'-rope_variation='hf_qwen2_rope']              356.9320 (1.74)       360.3840 (1.70)       357.8536 (1.73)     1.0271 (1.0)        357.7265 (1.74)     0.9920 (1.03)          1;1  2,794.4388 (0.58)         10           1
test_rope_variations_bwd_benchmark[executor='thunder'-rope_variation='hf_mistral_nemo_rope']       428.8940 (2.09)       432.8350 (2.04)       430.9671 (2.08)     1.1889 (1.16)       431.0560 (2.09)     1.8540 (1.92)          3;0  2,320.3627 (0.48)         10           1
test_rope_variations_fwd_benchmark[executor='thunder'-rope_variation='hf_phi3_rope']               548.0630 (2.68)       554.1090 (2.61)       552.0020 (2.66)     1.6203 (1.58)       552.3545 (2.68)     0.9650 (1.0)           2;2  1,811.5876 (0.38)         10           1
test_rope_variations_fwd_benchmark[executor='thunder'-rope_variation='llama_2_7b_hf_rope']         621.6160 (3.03)       626.1340 (2.95)       623.5093 (3.01)     1.6043 (1.56)       623.0065 (3.02)     2.3690 (2.45)          4;0  1,603.8253 (0.33)         10           1
test_rope_variations_bwd_benchmark[executor='thunder'-rope_variation='hf_phi3_rope']             1,022.0870 (4.99)     1,028.2720 (4.84)     1,024.4110 (4.94)     2.0313 (1.98)     1,024.3360 (4.97)     3.5130 (3.64)          2;0    976.1707 (0.20)         10           1
test_rope_variations_bwd_benchmark[executor='thunder'-rope_variation='llama_2_7b_hf_rope']       1,308.1660 (6.39)     1,313.6600 (6.18)     1,310.4751 (6.32)     2.0083 (1.96)     1,310.5750 (6.36)     3.5940 (3.72)          5;0    763.0820 (0.16)         10           1
test_rope_variations_fwd_benchmark[executor='thunder'-rope_variation='llama_3_8B_rope']          1,373.1600 (6.70)     1,382.4350 (6.51)     1,377.5739 (6.65)     2.3928 (2.33)     1,377.8270 (6.69)     2.2130 (2.29)          2;1    725.9139 (0.15)         10           1
test_rope_variations_bwd_benchmark[executor='thunder'-rope_variation='llama_3_8B_rope']          1,925.9490 (9.40)     1,936.4170 (9.11)     1,931.5364 (9.32)     2.8123 (2.74)     1,931.2535 (9.37)     2.3720 (2.46)          3;1    517.7226 (0.11)         10           1

But if I run the manual rope_example scripts, I get these:

root@a9fb56dcd91f:/volume/rope/rope_examples# python hf_phi3.py --execs Thunder-nvFuser
                             Model  Batch-Size  ...  Forward-Time(ms) Backward-Time(ms)
0  microsoft/Phi-3.5-mini-instruct           1  ...             0.597             0.739
root@a9fb56dcd91f:/volume/rope/rope_examples# python hf_qwen2.py --execs Thunder-nvFuser
                      Model  Batch-Size  ...  Forward-Time(ms) Backward-Time(ms)
0  Qwen/Qwen2.5-7B-Instruct           1  ...             0.397             0.507
root@a9fb56dcd91f:/volume/rope/rope_examples# python hf_mistral_nemo.py --execs Thunder-nvFuser
                              Model  Batch-Size  ...  Forward-Time(ms) Backward-Time(ms)
0  mistralai/Mistral-Nemo-Base-2407           1  ...             0.593             0.322
root@a9fb56dcd91f:/volume/rope/rope_examples# python lit_gpt_models.py --execs Thunder-nvFuser
           Model  Batch-Size  Sequence-Length         Executor  Forward-Time(ms)  Backward-Time(ms)
0  Llama-2-7b-hf           2             4096  Thunder-nvFuser             0.629              0.960
        Model  Batch-Size  Sequence-Length         Executor  Forward-Time(ms)  Backward-Time(ms)
0  Llama-3-8B           2             8192  Thunder-nvFuser             1.383              1.567

I'll double-check the measurement script, as well as the compile options (i.e. thunder trace options).

We need to do the same sanity check for torchcompile later.

@jjsjann123 jjsjann123 requested a review from Priya2698 December 23, 2024 21:28
@jjsjann123 (Collaborator, Author):

!test

@Priya2698 (Collaborator) left a comment:

Can you post your final numbers for the benchmark in the description, and note whether they match what you see from Kevin's script?

@jjsjann123 (Collaborator, Author) commented Dec 30, 2024

Pulled the numbers on this branch:

----------------------------------------------------------------------------------------------------------------------------- benchmark: 10 tests -----------------------------------------------------------------------------------------------------------------------------
Name (time in us)                                                                                       Min                   Max                  Mean            StdDev                Median               IQR            Outliers         OPS            Rounds  Iterations
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_rope_variations_fwd_benchmark[executor='thunder'-rope_variation='llama_2_7b_hf_rope']         618.3680 (3.01)       628.8100 (2.97)       624.3230 (3.02)     3.2188 (3.73)       624.2075 (3.04)     4.3820 (2.75)          3;0  1,601.7350 (0.33)         10           1
test_rope_variations_bwd_benchmark[executor='thunder'-rope_variation='llama_2_7b_hf_rope']       1,314.6250 (6.41)     1,319.4610 (6.24)     1,317.8842 (6.36)     1.5359 (1.78)     1,318.4260 (6.41)     1.9490 (1.22)          3;0    758.7920 (0.16)         10           1
test_rope_variations_fwd_benchmark[executor='thunder'-rope_variation='llama_3_8B_rope']          1,376.2430 (6.71)     1,383.2420 (6.54)     1,378.7306 (6.66)     2.1992 (2.55)     1,378.4055 (6.70)     2.3330 (1.46)          3;1    725.3049 (0.15)         10           1
test_rope_variations_bwd_benchmark[executor='thunder'-rope_variation='llama_3_8B_rope']          1,926.8670 (9.39)     1,940.1780 (9.17)     1,932.1506 (9.33)     4.2306 (4.90)     1,930.0695 (9.39)     6.3500 (3.98)          2;0    517.5580 (0.11)         10           1
test_rope_variations_fwd_benchmark[executor='thunder'-rope_variation='hf_mistral_nemo_rope']       322.8510 (1.57)       329.6040 (1.56)       325.2294 (1.57)     2.9409 (3.41)       323.6790 (1.57)     6.0960 (3.82)          3;0  3,074.7528 (0.64)         10           1
test_rope_variations_bwd_benchmark[executor='thunder'-rope_variation='hf_mistral_nemo_rope']       311.9970 (1.52)       317.2140 (1.50)       314.4097 (1.52)     1.6342 (1.89)       314.2355 (1.53)     2.3720 (1.49)          3;0  3,180.5634 (0.66)         10           1
test_rope_variations_fwd_benchmark[executor='thunder'-rope_variation='hf_phi3_rope']               695.5210 (3.39)       701.7980 (3.32)       699.4310 (3.38)     1.9682 (2.28)       699.9025 (3.40)     2.8590 (1.79)          2;0  1,429.7336 (0.30)         10           1
test_rope_variations_bwd_benchmark[executor='thunder'-rope_variation='hf_phi3_rope']               851.0710 (4.15)       856.7360 (4.05)       854.0387 (4.12)     1.6840 (1.95)       853.9010 (4.15)     2.0720 (1.30)          3;0  1,170.9071 (0.24)         10           1
test_rope_variations_fwd_benchmark[executor='thunder'-rope_variation='hf_qwen2_rope']              205.1190 (1.0)        211.5500 (1.0)        207.0699 (1.0)      2.4760 (2.87)       205.6310 (1.0)      4.0360 (2.53)          2;0  4,829.2871 (1.0)          10           1
test_rope_variations_bwd_benchmark[executor='thunder'-rope_variation='hf_qwen2_rope']              356.7350 (1.74)       359.1000 (1.70)       357.8798 (1.73)     0.8628 (1.0)        357.8220 (1.74)     1.5960 (1.0)           5;0  2,794.2343 (0.58)         10           1
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| Model | Batch-Size | Sequence-Length | Executor | Forward-Time(ms) | Backward-Time(ms) |
|---|---|---|---|---|---|
| Llama-2-7b-hf | 2 | 4096 | Thunder-nvFuser | 0.628 | 0.959 |
| Llama-3-8B | 2 | 8192 | Thunder-nvFuser | 1.387 | 1.574 |
| mistralai/Mistral-Nemo-Base-2407 | 1 | 4096 | Thunder-nvFuser | 0.583 | 0.323 |
| microsoft/Phi-3.5-mini-instruct | 1 | 8192 | Thunder-nvFuser | 0.587 | 0.738 |
| Qwen/Qwen2.5-7B-Instruct | 1 | 4096 | Thunder-nvFuser | 0.391 | 0.498 |

Kernel times mostly check out. The difference between the pytest benchmark and the manual measurement comes from the earlier comment: #3550 (comment)

E.g. for llama_2_7b_hf_rope backward: the benchmark measurement includes a pointwise apply kernel, which takes about 363 us, and that explains why the benchmark shows a longer kernel time compared to the manual benchmark.

@jjsjann123 jjsjann123 requested a review from Priya2698 December 30, 2024 22:00
@naoyam (Collaborator) commented Dec 30, 2024

> Pulled the numbers on this branch: […]
>
> Kernel times mostly check out. The difference between the pytest benchmark and the manual measurement comes from the earlier comment: #3550 (comment)

Can you run them on a PJNL H100 machine? Also, what do the results with torch.compile look like?

@jjsjann123 (Collaborator, Author):

These are the perf numbers on A100 for torch.compile:

Name (time in us)                                                                                            Min                   Max                  Mean            StdDev                Median               IQR            Outliers         OPS            Rounds  Iterations
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_rope_variations_fwd_benchmark[executor='torchcompile'-rope_variation='llama_2_7b_hf_rope']         160.5770 (1.39)       162.0170 (1.37)       161.3019 (1.38)     0.4042 (2.06)       161.3440 (1.38)     0.1930 (1.0)           3;3  6,199.5550 (0.73)         10           1
test_rope_variations_bwd_benchmark[executor='torchcompile'-rope_variation='llama_2_7b_hf_rope']         909.7630 (7.87)       913.4440 (7.70)       911.5028 (7.79)     1.0986 (5.59)       911.5555 (7.79)     0.8960 (4.64)          3;2  1,097.0893 (0.13)         10           1
test_rope_variations_fwd_benchmark[executor='torchcompile'-rope_variation='llama_3_8B_rope']            288.2230 (2.49)       294.4970 (2.48)       291.6220 (2.49)     1.6176 (8.23)       291.7120 (2.49)     0.7390 (3.83)          2;3  3,429.0966 (0.40)         10           1
test_rope_variations_bwd_benchmark[executor='torchcompile'-rope_variation='llama_3_8B_rope']            997.5660 (8.63)     1,002.8170 (8.45)     1,000.8923 (8.55)     1.9805 (10.07)    1,001.8440 (8.56)     3.1030 (16.08)         2;0    999.1085 (0.12)         10           1
test_rope_variations_fwd_benchmark[executor='torchcompile'-rope_variation='hf_mistral_nemo_rope']       235.5220 (2.04)       236.3820 (1.99)       235.8931 (2.01)     0.2827 (1.44)       235.8240 (2.01)     0.4470 (2.32)          5;0  4,239.2084 (0.50)         10           1
test_rope_variations_bwd_benchmark[executor='torchcompile'-rope_variation='hf_mistral_nemo_rope']       333.6660 (2.89)       334.9830 (2.82)       334.3401 (2.86)     0.4517 (2.30)       334.3825 (2.86)     0.7390 (3.83)          4;0  2,990.9664 (0.35)         10           1
test_rope_variations_fwd_benchmark[executor='torchcompile'-rope_variation='hf_phi3_rope']               194.0040 (1.68)       194.4960 (1.64)       194.2168 (1.66)     0.1966 (1.0)        194.1290 (1.66)     0.3890 (2.02)          5;0  5,148.8852 (0.60)         10           1
test_rope_variations_bwd_benchmark[executor='torchcompile'-rope_variation='hf_phi3_rope']             2,402.0750 (20.77)    2,407.7960 (20.30)    2,404.3932 (20.54)    1.8007 (9.16)     2,403.8470 (20.54)    2.8140 (14.58)         2;0    415.9054 (0.05)         10           1
test_rope_variations_fwd_benchmark[executor='torchcompile'-rope_variation='hf_qwen2_rope']              115.6500 (1.0)        118.6250 (1.0)        117.0693 (1.0)      0.9196 (4.68)       117.0470 (1.0)      1.4730 (7.63)          3;0  8,541.9491 (1.0)          10           1
test_rope_variations_bwd_benchmark[executor='torchcompile'-rope_variation='hf_qwen2_rope']              320.7930 (2.77)       322.5930 (2.72)       321.6137 (2.75)     0.4587 (2.33)       321.6130 (2.75)     0.2860 (1.48)          2;2  3,109.3203 (0.36)         10           1
| Model | Batch-Size | Sequence-Length | Executor | Forward-Time(ms) | Backward-Time(ms) |
|---|---|---|---|---|---|
| Llama-2-7b-hf | 2 | 4096 | torch.compile | 0.164 | 0.547 |
| Llama-3-8B | 2 | 8192 | torch.compile | 0.738 | 5.949 |
| mistralai/Mistral-Nemo-Base-2407 | 1 | 4096 | torch.compile | 0.298 | 0.238 |
| microsoft/Phi-3.5-mini-instruct | 1 | 8192 | torch.compile | 0.197 | 2.120 |
| Qwen/Qwen2.5-7B-Instruct | 1 | 4096 | torch.compile | 0.128 | 0.234 |

Llama-3-8B backward looks somewhat odd, but that seems to come from running both Llama configs in a single session. If I run Llama-3-8B by itself, the timing looks about right.

root@867d93f125ca:/volume/rope/rope_examples# python lit_gpt_models.py --execs torch.compile
        Model  Batch-Size  Sequence-Length       Executor  Forward-Time(ms)  Backward-Time(ms)
0  Llama-3-8B           2             8192  torch.compile             0.292              0.649

We could be picking up some cached kernels?!
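If it is a caching artifact, one way to rule it out is to reset the torch.compile state between configs; a minimal sketch:

```python
import torch

# Between the Llama-2 and Llama-3 runs, drop compiled artifacts so the second
# config cannot reuse (or be perturbed by) kernels compiled for the first.
torch._dynamo.reset()
torch.cuda.empty_cache()  # also release cached allocator blocks
```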

@naoyam (Collaborator) commented Dec 31, 2024

Kevin's script has two variations of torch.compile-based execution: torch.compile and Thunder-torch.compile. Could you please show both? He mentioned Thunder-torch.compile is what we should look at, and I'm seeing much faster performance than plain torch.compile on H100.

@jjsjann123 (Collaborator, Author):

Meanwhile, here are the H100 numbers for nvFuser:

Name (time in us)                                                                                     Min                 Max                Mean            StdDev              Median               IQR            Outliers  OPS (Kops/s)            Rounds  Iterations
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_rope_variations_fwd_benchmark[executor='thunder'-rope_variation='llama_2_7b_hf_rope']       305.5070 (2.81)     309.1870 (2.73)     306.8002 (2.78)     1.0669 (2.88)     306.5605 (2.80)     0.6750 (1.52)          3;2        3.2595 (0.36)         10           1
test_rope_variations_bwd_benchmark[executor='thunder'-rope_variation='llama_2_7b_hf_rope']       611.7420 (5.63)     614.2720 (5.43)     613.4171 (5.55)     0.7574 (2.04)     613.7120 (5.61)     1.0210 (2.30)          2;0        1.6302 (0.18)         10           1
test_rope_variations_fwd_benchmark[executor='thunder'-rope_variation='llama_3_8B_rope']          640.4120 (5.89)     643.6720 (5.69)     642.0621 (5.81)     0.8957 (2.42)     642.0985 (5.87)     1.1180 (2.52)          2;0        1.5575 (0.17)         10           1
test_rope_variations_bwd_benchmark[executor='thunder'-rope_variation='llama_3_8B_rope']          892.1050 (8.21)     896.1000 (7.92)     893.8670 (8.09)     1.1381 (3.07)     894.0450 (8.17)     1.4170 (3.19)          3;0        1.1187 (0.12)         10           1
test_rope_variations_fwd_benchmark[executor='thunder'-rope_variation='hf_mistral_nemo_rope']     192.5720 (1.77)     197.0260 (1.74)     194.4665 (1.76)     1.8428 (4.97)     194.0145 (1.77)     3.5230 (7.93)          4;0        5.1423 (0.57)         10           1
test_rope_variations_bwd_benchmark[executor='thunder'-rope_variation='hf_mistral_nemo_rope']     156.7050 (1.44)     158.7880 (1.40)     157.7894 (1.43)     0.6554 (1.77)     157.7595 (1.44)     0.6710 (1.51)          4;0        6.3376 (0.70)         10           1
test_rope_variations_fwd_benchmark[executor='thunder'-rope_variation='hf_phi3_rope']             348.2880 (3.20)     355.4230 (3.14)     352.0209 (3.19)     2.2721 (6.13)     352.3665 (3.22)     3.8130 (8.59)          2;0        2.8407 (0.31)         10           1
test_rope_variations_bwd_benchmark[executor='thunder'-rope_variation='hf_phi3_rope']             408.4810 (3.76)     409.7590 (3.62)     409.1938 (3.71)     0.3709 (1.0)      409.1870 (3.74)     0.4440 (1.0)           3;0        2.4438 (0.27)         10           1
test_rope_variations_fwd_benchmark[executor='thunder'-rope_variation='hf_qwen2_rope']            108.7010 (1.0)      113.1840 (1.0)      110.4311 (1.0)      1.7987 (4.85)     109.4530 (1.0)      3.3280 (7.50)          2;0        9.0554 (1.0)          10           1
test_rope_variations_bwd_benchmark[executor='thunder'-rope_variation='hf_qwen2_rope']            340.5750 (3.13)     344.0330 (3.04)     342.3673 (3.10)     1.0602 (2.86)     342.4130 (3.13)     1.1170 (2.52)          4;0        2.9208 (0.32)         10           1
| Model | Batch-Size | Sequence-Length | Executor | Forward-Time(ms) | Backward-Time(ms) |
|---|---|---|---|---|---|
| Llama-2-7b-hf | 2 | 4096 | Thunder-nvFuser | 0.308 | 0.453 |
| Llama-3-8B | 2 | 8192 | Thunder-nvFuser | 0.655 | 0.725 |
| mistralai/Mistral-Nemo-Base-2407 | 1 | 4096 | Thunder-nvFuser | 0.536 | 0.384 |
| microsoft/Phi-3.5-mini-instruct | 1 | 8192 | Thunder-nvFuser | 0.499 | 0.276 |
| Qwen/Qwen2.5-7B-Instruct | 1 | 4096 | Thunder-nvFuser | 0.453 | 0.571 |

@jjsjann123 (Collaborator, Author):

> Kevin's script has two variations of torch.compile-based execution: torch.compile and Thunder-torch.compile. Could you please show both? He mentioned Thunder-torch.compile is what we should look at, and I'm seeing much faster performance than plain torch.compile on H100.

Interesting.

Is Thunder-torch.compile what we should be using in our benchmark as well? I'm asking since we do not have that executor in the pytest benchmark yet.

@naoyam (Collaborator) commented Dec 31, 2024

Here's what I got on a PJNL H100 machine with the mistral benchmark:

 Model  Batch-Size  Sequence-Length               Executor  Forward-Kernels  Forward-Time(ms)  Backward Kernels  Backward-Time(ms)
0  mistralai/Mistral-Nemo-Base-2407           1             4096            Torch-Eager               20             0.498                39              0.552
1  mistralai/Mistral-Nemo-Base-2407           1             4096          torch.compile                7             0.158                 4              0.146
2  mistralai/Mistral-Nemo-Base-2407           1             4096  Thunder-torch.compile                9             0.077                 4              0.105
3  mistralai/Mistral-Nemo-Base-2407           1             4096          Thunder-Torch               42             1.196                55              1.466
4  mistralai/Mistral-Nemo-Base-2407           1             4096        Thunder-nvFuser                8             0.202                 6              0.199

As you can see, the Thunder-torch.compile version is twice as fast as plain torch.compile. The nvFuser result here is also much faster than your results.

@naoyam (Collaborator) commented Dec 31, 2024

> Is Thunder-torch.compile what we should be using in our benchmark as well? I'm asking since we do not have that executor in the pytest benchmark yet.

That's what I heard from @kevinstephano.

@naoyam (Collaborator) commented Dec 31, 2024

Another question, maybe for @Priya2698: can we enable result verification by default? I remember there's still a tolerance issue, but since these RoPE benchmarks have almost no reductions (there are some in the backward cases), maybe verification would work fine?
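For context, verification here would amount to comparing each executor's outputs against eager under a tolerance; a minimal sketch (tolerances are placeholders, not the benchmark suite's actual settings):

```python
import torch


def verify_against_eager(out: torch.Tensor, eager_out: torch.Tensor) -> None:
    # Placeholder tolerances: RoPE forward is mostly pointwise and should match
    # eager tightly; backward has some reductions and may need looser bounds.
    torch.testing.assert_close(out, eager_out, rtol=1e-3, atol=1e-3)
```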

@jjsjann123 (Collaborator, Author):

> Here's what I got on a PJNL H100 machine with the mistral benchmark: […]
>
> As you can see, the Thunder-torch.compile version is twice as fast as plain torch.compile. The nvFuser result here is also much faster than your results.

I can double-check the profile tomorrow; I was using the H100x2 node today.
It's possible the forward trace has gaps between kernels, so the manual benchmark could be bottlenecked by CPU time instead of kernel time.
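One way to check: compare the wall-clock per-iteration time against the summed GPU kernel time from the profiler; a large gap means the run is launch-bound. A minimal sketch (`fn` and `args` stand in for the jitted forward and its inputs):

```python
import time

import torch


def wall_time_ms(fn, *args, iters: int = 10) -> float:
    # Wall-clock time per iteration. If this is much larger than the GPU
    # kernel time, the trace has launch gaps and the manual benchmark is
    # measuring CPU overhead rather than kernel time.
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        fn(*args)
    torch.cuda.synchronize()
    return (time.perf_counter() - t0) * 1e3 / iters
```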

@jjsjann123 (Collaborator, Author):

> > Is Thunder-torch.compile what we should be using in our benchmark as well? I'm asking since we do not have that executor in the pytest benchmark yet.
>
> That's what I heard from @kevinstephano.

Noted. I'll add another executor.
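A sketch of what the extra branch could look like, mirroring the existing thunder branch in this PR and assuming Thunder exposes its torch.compile executor as `torch_compile_ex` (import location assumed):

```python
import thunder
from thunder.executors.torch_compile import torch_compile_ex  # assumed location


def jit_with_torchcompile(fwd_fn):
    # Route Thunder's fusion regions to torch.compile instead of nvFuser,
    # matching the Thunder-torch.compile executor in Kevin's script.
    return thunder.jit(fwd_fn, nv_enable_bookend=False, executors=[torch_compile_ex])
```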

@jjsjann123 (Collaborator, Author):

I realized that Kevin's benchmark script had been updated to also measure profiler time, and I was two commits behind that. The previous discrepancy came from the different measurement methods.
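For reference, "profiler time" here means summing only the GPU kernel durations, ignoring launch gaps; a sketch of one way to measure that with torch.profiler (`fn` and `args` are placeholders):

```python
import torch
from torch.profiler import ProfilerActivity, profile


def kernel_time_ms(fn, *args, iters: int = 10) -> float:
    # Sum GPU kernel durations only, so the number is comparable to what the
    # pytest benchmark reports rather than to wall-clock time.
    with profile(activities=[ProfilerActivity.CUDA]) as prof:
        for _ in range(iters):
            fn(*args)
        torch.cuda.synchronize()
    total_us = sum(e.self_cuda_time_total for e in prof.key_averages())
    return total_us / iters / 1e3
```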

@jjsjann123 (Collaborator, Author):

With the updated manual benchmark, we are now making an apples-to-apples comparison.

On H100:

Manual benchmark:

| Model | Executor | Forward-Kernels | Forward-Time(ms) | Backward-Kernels | Backward-Time(ms) |
|---|---|---|---|---|---|
| Llama-2-7b-hf | Thunder-nvFuser | 4 | 0.346 | 5 | 0.504 |
| Llama-2-7b-hf | Thunder-torch.compile | 1 | 0.093 | 2 | 0.277 |
| Llama-3-8B | Thunder-nvFuser | 5 | 0.733 | 5 | 0.814 |
| Llama-3-8B | Thunder-torch.compile | 2 | 0.162 | 3 | 0.616 |
| microsoft/Phi-3.5-mini-instruct | Thunder-nvFuser | 7 | 0.298 | 6 | 0.383 |
| microsoft/Phi-3.5-mini-instruct | Thunder-torch.compile | 6 | 0.084 | 2 | 0.228 |
| mistralai/Mistral-Nemo-Base-2407 | Thunder-nvFuser | 8 | 0.163 | 6 | 0.136 |
| mistralai/Mistral-Nemo-Base-2407 | Thunder-torch.compile | 9 | 0.074 | 4 | 0.103 |
| Qwen/Qwen2.5-7B-Instruct | Thunder-nvFuser | 5 | 0.109 | 8 | 0.318 |
| Qwen/Qwen2.5-7B-Instruct | Thunder-torch.compile | 4 | 0.049 | 5 | 0.157 |

Pytest benchmark:

Name (time in us)                                                                                                Median
----------------------------------------------------------------------------------------------------------------------------------
test_rope_variations_fwd_benchmark[executor='thunder'-rope_variation='llama_2_7b_hf_rope']                     347.5185 (6.90)
test_rope_variations_fwd_benchmark[executor='thunder-torchcompile'-rope_variation='llama_2_7b_hf_rope']         93.2800 (1.85)
test_rope_variations_bwd_benchmark[executor='thunder'-rope_variation='llama_2_7b_hf_rope']                     707.8535 (14.06)
test_rope_variations_bwd_benchmark[executor='thunder-torchcompile'-rope_variation='llama_2_7b_hf_rope']        471.2945 (9.36)
 
test_rope_variations_fwd_benchmark[executor='thunder'-rope_variation='llama_3_8B_rope']                        735.9030 (14.62)
test_rope_variations_fwd_benchmark[executor='thunder-torchcompile'-rope_variation='llama_3_8B_rope']           161.2320 (3.20)
test_rope_variations_bwd_benchmark[executor='thunder'-rope_variation='llama_3_8B_rope']                      1,017.9635 (20.22)
test_rope_variations_bwd_benchmark[executor='thunder-torchcompile'-rope_variation='llama_3_8B_rope']           809.3740 (16.08)
 
test_rope_variations_fwd_benchmark[executor='thunder'-rope_variation='hf_phi3_rope']                           363.3590 (7.22)    
test_rope_variations_fwd_benchmark[executor='thunder-torchcompile'-rope_variation='hf_phi3_rope']               83.9865 (1.67)    
test_rope_variations_bwd_benchmark[executor='thunder'-rope_variation='hf_phi3_rope']                           471.2105 (9.36)    
test_rope_variations_bwd_benchmark[executor='thunder-torchcompile'-rope_variation='hf_phi3_rope']              374.9610 (7.45)    
 
test_rope_variations_fwd_benchmark[executor='thunder'-rope_variation='hf_mistral_nemo_rope']                   175.7105 (3.49)    
test_rope_variations_fwd_benchmark[executor='thunder-torchcompile'-rope_variation='hf_mistral_nemo_rope']       74.0490 (1.47)
test_rope_variations_bwd_benchmark[executor='thunder'-rope_variation='hf_mistral_nemo_rope']                   178.3185 (3.54)
test_rope_variations_bwd_benchmark[executor='thunder-torchcompile'-rope_variation='hf_mistral_nemo_rope']      162.8475 (3.24)
 
test_rope_variations_fwd_benchmark[executor='thunder'-rope_variation='hf_qwen2_rope']                          120.9745 (2.40)
test_rope_variations_fwd_benchmark[executor='thunder-torchcompile'-rope_variation='hf_qwen2_rope']              50.3365 (1.0)
test_rope_variations_bwd_benchmark[executor='thunder'-rope_variation='hf_qwen2_rope']                          381.1360 (7.57)
test_rope_variations_bwd_benchmark[executor='thunder-torchcompile'-rope_variation='hf_qwen2_rope']             211.7920 (4.21)
----------------------------------------------------------------------------------------------------------------------------------

Fwd time mostly matches (except for hf_phi3; that's because we enable matmul in the pytest benchmark, which the manual benchmark does not).
Bwd time looks about right; the pytest benchmark shows higher times coming from the grad accumulation kernel.
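For reference, the accumulation kernel comes from how backward is driven; a small self-contained illustration of the difference (assuming the pytest benchmark drives backward via `.backward()` with pre-populated grads):

```python
import torch

x = torch.randn(1024, 1024, device="cuda", requires_grad=True)
y = (x * 2.0).sum()

# After the first backward, x.grad exists, so each subsequent .backward()
# launches an extra pointwise kernel to accumulate into it. That kernel falls
# inside the pytest benchmark's timing window.
y.backward(retain_graph=True)
y.backward(retain_graph=True)  # adds into the existing x.grad

# torch.autograd.grad returns a fresh tensor and skips the accumulation step,
# which is closer to what the manual benchmark measured.
(gx,) = torch.autograd.grad(y, x)
```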

):
    kwargs = {}
    if executor == "thunder":
        kwargs["nv_enable_matmul"] = True
@jjsjann123 (Collaborator, Author) commented on the diff:

BTW, this is where I'm enabling matmul in nvFuser.

This gives us a single fusion region, which I believe is something we want. cc'ing @naoyam

        return thunder.jit(
            fwd_fn, nv_enable_bookend=False, executors=[nvfuserex], **kwargs
        )
    if executor == "thunder-torchcompile":
@jjsjann123 (Collaborator, Author) commented on the diff:

thunder-torchcompile is the config we want for the RoPE comparison. Not sure if this is something we would also like to enable for other benchmarks. cc'ing @Priya2698

-def with_executor(executor: str, fwd_fn: Callable) -> Callable:
-    assert executor in ["eager", "torchcompile", "thunder"]
+def with_executor(executor: str, fwd_fn: Callable, **kwargs) -> Callable:
+    assert executor in ["eager", "torchcompile", "thunder", "thunder-torchcompile"]
A Collaborator commented on the diff:

Suggested change:
-    assert executor in ["eager", "torchcompile", "thunder", "thunder-torchcompile"]
+    assert executor in ["eager", "torchcompile", "thunder-nvfuser", "thunder-torchcompile"]

@@ -221,9 +226,9 @@ def set_metrics(
     % Peak Bandwidth (SOL): 100 * Bandwidth / PEAK_BANDWIDTH
     """
     if not iobytes:
-        if isinstance(inputs, torch.Tensor):
+        if not isinstance(inputs, Iterable):
A Collaborator commented on the diff:

Why is this changed?

@pytest.mark.parametrize(
    "executor", ["eager", "torchcompile", "thunder", "thunder-torchcompile"]
)
def test_rope_variations_fwd_benchmark(
A Collaborator commented on the diff:

IIUC, we also have test_rope_benchmark, which is separate from test_rope_variations_fwd_benchmark and test_rope_variations_bwd_benchmark. What does test_rope_benchmark evaluate?
