
rope_benchmark #3550

Open · wants to merge 38 commits into base: main
Conversation

@jjsjann123 (Collaborator) commented Dec 10, 2024

Rope benchmark extracted from lightning trace.

TODO:

  • add iobytes measurement for benchmarks.

}


@pytest.mark.parametrize(
@jjsjann123 (Collaborator, Author) commented on the diff:

This is the only part that's worth reviewing.

The code above was dumped directly from Kevin's rope example script. (Note that I had to update the script to use nv_enable_matmul in thunder.jit; otherwise we see segmentation at the nvfuser definition level.)
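
For reference, a minimal sketch of the kind of script change described above. `fwd_fn` stands in for the forward function from the example script, and the sketch assumes thunder.jit forwards nv_enable_matmul to the nvFuser executor as a compile option (the with_executor change later in this PR relies on the same pass-through):

import thunder
from thunder.executors.nvfuserex import nvfuserex  # nvFuser executor for thunder

compiled_fn = thunder.jit(
    fwd_fn,
    executors=[nvfuserex],
    nv_enable_matmul=True,  # keep matmuls inside the nvFuser definition to avoid segmentation
)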

@jjsjann123 (Collaborator, Author):

I also want to add another toy example where we sweep over the batch size, but I'll do that in a separate PR.

@naoyam (Collaborator) commented Dec 10, 2024:

@Priya2698 is adding the Thunder backend #3394. Does it mean we can just have the forward functions?

@Priya2698 (Collaborator):

> @Priya2698 is adding the Thunder backend #3394. Does it mean we can just have the forward functions?

We will also benchmark backward pass with Thunder backend.

@naoyam (Collaborator) commented Dec 10, 2024:

> @Priya2698 is adding the Thunder backend #3394. Does it mean we can just have the forward functions?
>
> We will also benchmark backward pass with Thunder backend.

Yes, so, we don't need to have the backward implementations explicitly, right?

@jjsjann123 marked this pull request as draft December 10, 2024 21:28
@jjsjann123 (Collaborator, Author):

Looking at the thunder-nvfuser timing.

Strangely, the benchmark numbers don't match the numbers from Kevin's example.
These are the measurements from pytest:

Name (time in us)                                                                                       Min                   Max                  Mean            StdDev                Median               IQR            Outliers         OPS            Rounds  Iterations
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_rope_variations_fwd_benchmark[executor='thunder'-rope_variation='hf_qwen2_rope']              204.8290 (1.0)        212.5130 (1.0)        207.1972 (1.0)      2.5573 (2.49)       206.0485 (1.0)      4.0260 (4.17)          2;0  4,826.3200 (1.0)          10           1
test_rope_variations_fwd_benchmark[executor='thunder'-rope_variation='hf_mistral_nemo_rope']       320.3510 (1.56)       324.3850 (1.53)       322.8819 (1.56)     1.3519 (1.32)       322.8555 (1.57)     1.8470 (1.91)          3;0  3,097.1076 (0.64)         10           1
test_rope_variations_bwd_benchmark[executor='thunder'-rope_variation='hf_qwen2_rope']              356.9320 (1.74)       360.3840 (1.70)       357.8536 (1.73)     1.0271 (1.0)        357.7265 (1.74)     0.9920 (1.03)          1;1  2,794.4388 (0.58)         10           1
test_rope_variations_bwd_benchmark[executor='thunder'-rope_variation='hf_mistral_nemo_rope']       428.8940 (2.09)       432.8350 (2.04)       430.9671 (2.08)     1.1889 (1.16)       431.0560 (2.09)     1.8540 (1.92)          3;0  2,320.3627 (0.48)         10           1
test_rope_variations_fwd_benchmark[executor='thunder'-rope_variation='hf_phi3_rope']               548.0630 (2.68)       554.1090 (2.61)       552.0020 (2.66)     1.6203 (1.58)       552.3545 (2.68)     0.9650 (1.0)           2;2  1,811.5876 (0.38)         10           1
test_rope_variations_fwd_benchmark[executor='thunder'-rope_variation='llama_2_7b_hf_rope']         621.6160 (3.03)       626.1340 (2.95)       623.5093 (3.01)     1.6043 (1.56)       623.0065 (3.02)     2.3690 (2.45)          4;0  1,603.8253 (0.33)         10           1
test_rope_variations_bwd_benchmark[executor='thunder'-rope_variation='hf_phi3_rope']             1,022.0870 (4.99)     1,028.2720 (4.84)     1,024.4110 (4.94)     2.0313 (1.98)     1,024.3360 (4.97)     3.5130 (3.64)          2;0    976.1707 (0.20)         10           1
test_rope_variations_bwd_benchmark[executor='thunder'-rope_variation='llama_2_7b_hf_rope']       1,308.1660 (6.39)     1,313.6600 (6.18)     1,310.4751 (6.32)     2.0083 (1.96)     1,310.5750 (6.36)     3.5940 (3.72)          5;0    763.0820 (0.16)         10           1
test_rope_variations_fwd_benchmark[executor='thunder'-rope_variation='llama_3_8B_rope']          1,373.1600 (6.70)     1,382.4350 (6.51)     1,377.5739 (6.65)     2.3928 (2.33)     1,377.8270 (6.69)     2.2130 (2.29)          2;1    725.9139 (0.15)         10           1
test_rope_variations_bwd_benchmark[executor='thunder'-rope_variation='llama_3_8B_rope']          1,925.9490 (9.40)     1,936.4170 (9.11)     1,931.5364 (9.32)     2.8123 (2.74)     1,931.2535 (9.37)     2.3720 (2.46)          3;1    517.7226 (0.11)         10           1

But if I run the manual rope_example scripts, I get these:

root@a9fb56dcd91f:/volume/rope/rope_examples# python hf_phi3.py --execs Thunder-nvFuser
                             Model  Batch-Size  ...  Forward-Time(ms) Backward-Time(ms)
0  microsoft/Phi-3.5-mini-instruct           1  ...             0.597             0.739
root@a9fb56dcd91f:/volume/rope/rope_examples# python hf_qwen2.py --execs Thunder-nvFuser
                      Model  Batch-Size  ...  Forward-Time(ms) Backward-Time(ms)
0  Qwen/Qwen2.5-7B-Instruct           1  ...             0.397             0.507
root@a9fb56dcd91f:/volume/rope/rope_examples# python hf_mistral_nemo.py --execs Thunder-nvFuser
                              Model  Batch-Size  ...  Forward-Time(ms) Backward-Time(ms)
0  mistralai/Mistral-Nemo-Base-2407           1  ...             0.593             0.322
root@a9fb56dcd91f:/volume/rope/rope_examples# python lit_gpt_models.py --execs Thunder-nvFuser
           Model  Batch-Size  Sequence-Length         Executor  Forward-Time(ms)  Backward-Time(ms)
0  Llama-2-7b-hf           2             4096  Thunder-nvFuser             0.629              0.960
        Model  Batch-Size  Sequence-Length         Executor  Forward-Time(ms)  Backward-Time(ms)
0  Llama-3-8B           2             8192  Thunder-nvFuser             1.383              1.567

I'll double-check the measurement script, as well as the compile options (i.e., thunder trace options).

We need to do the same sanity check for torchcompile later.

@Priya2698 (Collaborator):

> Looking at the thunder-nvfuser timing.
>
> Strangely, the benchmark numbers don't match the numbers from Kevin's example. These are the measurements from pytest: [benchmark table quoted above]
>
> But if I run the manual rope_example scripts, I get these: [script output quoted above]
>
> I'll double-check the measurement script, as well as the compile options (i.e., thunder trace options).
>
> We need to do the same sanity check for torchcompile later.

I think the backward difference is because of gradient accumulation between multiple runs, which I am fixing in PR #3394. Let me try running your PR with that fix. Are you seeing any difference in the forward pass as well?

I will try to push PR #3394 in so that we can add your RoPE example easily. I was verifying the measurements in that PR against nsys, and they match for operators like rmsnorm, softmax, and layernorm for all configs and executors.

@jjsjann123 (Collaborator, Author):

> Are you seeing any difference in the forward pass also?

Yes, if you look at the mistral example, the forward time looks quite different:

test_rope_variations_fwd_benchmark[executor='thunder'-rope_variation='hf_mistral_nemo_rope'] 322.8819 (1.56)
vs

root@a9fb56dcd91f:/volume/rope/rope_examples# python hf_mistral_nemo.py --execs Thunder-nvFuser
                              Model  Batch-Size  ...  Forward-Time(ms) Backward-Time(ms)
0  mistralai/Mistral-Nemo-Base-2407           1  ...             0.593             0.322

I checked the nsys profile, and it looks like the same trace (same kernels, lined up the same way), yet the reported numbers differ.

I tried to verify the event times recorded in our benchmark. If I run everything vanilla in the for loop below, the forward number does match. with_executor adds a bit of time on top, which is understandable, but run_benchmark changes everything. The mistral example is host-bound; is run_benchmark measuring only the kernel times? The reported number roughly matches that.

import pytest
import torch

# clear_dynamo_cache, rope_setup, with_executor, and run_benchmark come from the
# nvFuser python benchmark harness (benchmarks/python).


@pytest.mark.parametrize(
    "rope_variation",
    [
        "llama_2_7b_hf_rope",
        "llama_3_8B_rope",
        "hf_qwen2_rope",
        "hf_phi3_rope",
        "hf_mistral_nemo_rope",
    ],
)
@pytest.mark.parametrize("executor", ["eager", "torchcompile", "thunder"])
def test_rope_variations_fwd_benchmark(
    benchmark,
    rope_variation: str,
    executor: str,
):
    if executor == "torchcompile":
        clear_dynamo_cache()

    model, inputs, _ = rope_setup[rope_variation]()

    def fwd_call(inp):
        return model(*inp)

    # benchmark_fn = with_executor(executor, fwd_call)
    # run_benchmark(benchmark, benchmark_fn, inputs())

    import thunder

    benchmark_fn = thunder.jit(fwd_call)

    # CUDA events bracket the whole call, so this measures end-to-end time
    # (host overhead included), unlike run_benchmark's kernel-only timing.
    evt_start_fwd = torch.cuda.Event(enable_timing=True)
    evt_end_fwd = torch.cuda.Event(enable_timing=True)

    def bn_fn(inp):
        torch.cuda.nvtx.range_push("Inputs Generation")
        evt_start_fwd.record()
        ret = benchmark_fn(inp)
        evt_end_fwd.record()
        torch.cuda.nvtx.range_pop()
        torch.cuda.synchronize()
        fwd_time = evt_start_fwd.elapsed_time(evt_end_fwd)
        print(f"{fwd_time=}")
        return ret

    for i in range(10):
        bn_fn(inputs())
    # run_benchmark(benchmark, bn_fn, inputs())

@jjsjann123 (Collaborator, Author):

A few reasons for the performance discrepancy between @kevinstephano's example and the benchmark here:

  1. In our CI tests we use run_benchmark, which only measures kernel time, unlike Kevin's example, which measures end-to-end time in each call. So the benchmark numbers hide some host latency, which is why some of the forward times reported here are lower than in Kevin's Python examples. The same applies to the backward pass.
  2. We currently run the backward functions in a loop without wiping the grads on inputs/parameters, so every backward pass ends up calling an extra pointwise kernel that accumulates gradients, which bumps up the total measured time. E.g., above, the pytest benchmark reports an accumulated mistral backward kernel time of 430us, higher than the 322us end-to-end time reported by the Python script, because the Python script doesn't run the accumulation kernels. (A minimal sketch of clearing grads between rounds follows below.)
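
A minimal sketch of the kind of fix for point 2, assuming the backward benchmark loop has access to the model and its inputs (the names below are placeholders, not the actual harness API):

import torch

def clear_grads(model, inputs):
    # Setting .grad to None (rather than zeroing) avoids launching the
    # gradient-accumulation kernel on the next backward pass.
    for p in model.parameters():
        p.grad = None
    for t in inputs:
        if isinstance(t, torch.Tensor) and t.requires_grad:
            t.grad = None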

benchmarks/python/core.py (outdated review thread, resolved)
@jjsjann123 (Collaborator, Author):

!build

@jjsjann123 marked this pull request as ready for review December 20, 2024 23:39
benchmark,
unary_bwd_torch,
[output, grad()],
iobytes=iobytes() if executor == "thunder" else None,
@jjsjann123 (Collaborator, Author) commented on the diff:

@Priya2698 does this look about right, if I just want a manual IObytes computation for the thunder backward?

@jjsjann123 (Collaborator, Author):

Note to self: double-check the backward IObytes computation again.

I did it once while looking at the backward thunder trace, but I'm not totally confident in it.

Collaborator:

So these will be the IObytes based on the inputs/outputs of the nvfuser definition? They should be used for all 3 executors then.

@jjsjann123 (Collaborator, Author):

That sounds fine to me. I'll make the change.

Another question on how we plan to handle this in the long run.

If we do use the same IOBytes across executors, then for this instance we'll just be using the thunder autograd as the reference point. The other two executors might run a different autograd strategy, which means their backward IOBytes would not be calculated faithfully. Is that the right way to interpret this? The reported IOBytes would just be a reference point.
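
For reference, a minimal sketch of how a manual IObytes number could be computed from the tensors crossing the fusion boundary. This is an illustrative helper, not the computation actually used in this PR; it assumes each input is read once and each output written once:

import torch

def manual_iobytes(inputs, outputs):
    # Total bytes moved = sum of (element count * element size) over all I/O tensors.
    tensors = list(inputs) + list(outputs)
    return sum(t.numel() * t.element_size() for t in tensors if isinstance(t, torch.Tensor))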

     if executor == "thunder":
-        return thunder.jit(fwd_fn, nv_enable_bookend=False, executors=[nvfuserex])
+        return thunder.jit(fwd_fn, nv_enable_bookend=False, executors=[nvfuserex], **kwargs)
@jjsjann123 (Collaborator, Author) commented on the diff:

Adding this so I can run with nv_enable_matmul=True for RoPE. cc @naoyam @kevinstephano
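
A usage sketch, assuming with_executor forwards its extra keyword arguments to thunder.jit as in the diff above (fwd_call being the benchmark's forward callable):

# Extra thunder/nvFuser compile options can now be threaded through the helper:
benchmark_fn = with_executor("thunder", fwd_call, nv_enable_matmul=True)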

Collaborator:

It might be better to put the RoPE configuration and setup in a separate file, so this file contains only the benchmark function itself, for easier readability.

@jjsjann123 requested a review from @Priya2698 December 23, 2024 21:28
@jjsjann123 (Collaborator, Author):

!test
