
rope_benchmark #3550

Open · wants to merge 38 commits into base: main
Conversation

@jjsjann123 (Collaborator) commented Dec 10, 2024

Rope benchmark extracted from lightning trace.

TODO:

  • add iobytes measurement for benchmarks.

}


@pytest.mark.parametrize(
@jjsjann123 (Collaborator, Author) commented on the diff:

This is the only part that's worth reviewing.

The code above was dumped directly from Kevin's rope example script. (Note that I had to update the script to use nv_enable_matmul in thunder.jit; otherwise we see segmentation at the nvfuser definition level.)
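
For reference, a minimal sketch of the kind of script change described above. `fwd_fn` stands in for the forward function from the example script, and the sketch assumes thunder.jit forwards nv_enable_matmul to the nvFuser executor as a compile option (the with_executor change later in this PR relies on the same pass-through):

import thunder
from thunder.executors.nvfuserex import nvfuserex  # nvFuser executor for thunder

compiled_fn = thunder.jit(
    fwd_fn,
    executors=[nvfuserex],
    nv_enable_matmul=True,  # keep matmuls inside the nvFuser definition to avoid segmentation
)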

@jjsjann123 (Collaborator, Author):

I also want to add another toy example where we sweep over the batch size, but I'll do that in a separate PR.

@naoyam (Collaborator) commented Dec 10, 2024:

@Priya2698 is adding the Thunder backend #3394. Does it mean we can just have the forward functions?

@Priya2698 (Collaborator):

> @Priya2698 is adding the Thunder backend #3394. Does it mean we can just have the forward functions?

We will also benchmark backward pass with Thunder backend.

@naoyam (Collaborator) commented Dec 10, 2024:

> @Priya2698 is adding the Thunder backend #3394. Does it mean we can just have the forward functions?
>
> We will also benchmark backward pass with Thunder backend.

Yes, so, we don't need to have the backward implementations explicitly, right?

@jjsjann123 marked this pull request as draft December 10, 2024 21:28
@jjsjann123 (Collaborator, Author):

Looking at the thunder-nvfuser timing.

Strangely, the benchmark numbers don't match the numbers from Kevin's example.
These are the measurements from pytest:

Name (time in us)                                                                                       Min                   Max                  Mean            StdDev                Median               IQR            Outliers         OPS            Rounds  Iterations
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_rope_variations_fwd_benchmark[executor='thunder'-rope_variation='hf_qwen2_rope']              204.8290 (1.0)        212.5130 (1.0)        207.1972 (1.0)      2.5573 (2.49)       206.0485 (1.0)      4.0260 (4.17)          2;0  4,826.3200 (1.0)          10           1
test_rope_variations_fwd_benchmark[executor='thunder'-rope_variation='hf_mistral_nemo_rope']       320.3510 (1.56)       324.3850 (1.53)       322.8819 (1.56)     1.3519 (1.32)       322.8555 (1.57)     1.8470 (1.91)          3;0  3,097.1076 (0.64)         10           1
test_rope_variations_bwd_benchmark[executor='thunder'-rope_variation='hf_qwen2_rope']              356.9320 (1.74)       360.3840 (1.70)       357.8536 (1.73)     1.0271 (1.0)        357.7265 (1.74)     0.9920 (1.03)          1;1  2,794.4388 (0.58)         10           1
test_rope_variations_bwd_benchmark[executor='thunder'-rope_variation='hf_mistral_nemo_rope']       428.8940 (2.09)       432.8350 (2.04)       430.9671 (2.08)     1.1889 (1.16)       431.0560 (2.09)     1.8540 (1.92)          3;0  2,320.3627 (0.48)         10           1
test_rope_variations_fwd_benchmark[executor='thunder'-rope_variation='hf_phi3_rope']               548.0630 (2.68)       554.1090 (2.61)       552.0020 (2.66)     1.6203 (1.58)       552.3545 (2.68)     0.9650 (1.0)           2;2  1,811.5876 (0.38)         10           1
test_rope_variations_fwd_benchmark[executor='thunder'-rope_variation='llama_2_7b_hf_rope']         621.6160 (3.03)       626.1340 (2.95)       623.5093 (3.01)     1.6043 (1.56)       623.0065 (3.02)     2.3690 (2.45)          4;0  1,603.8253 (0.33)         10           1
test_rope_variations_bwd_benchmark[executor='thunder'-rope_variation='hf_phi3_rope']             1,022.0870 (4.99)     1,028.2720 (4.84)     1,024.4110 (4.94)     2.0313 (1.98)     1,024.3360 (4.97)     3.5130 (3.64)          2;0    976.1707 (0.20)         10           1
test_rope_variations_bwd_benchmark[executor='thunder'-rope_variation='llama_2_7b_hf_rope']       1,308.1660 (6.39)     1,313.6600 (6.18)     1,310.4751 (6.32)     2.0083 (1.96)     1,310.5750 (6.36)     3.5940 (3.72)          5;0    763.0820 (0.16)         10           1
test_rope_variations_fwd_benchmark[executor='thunder'-rope_variation='llama_3_8B_rope']          1,373.1600 (6.70)     1,382.4350 (6.51)     1,377.5739 (6.65)     2.3928 (2.33)     1,377.8270 (6.69)     2.2130 (2.29)          2;1    725.9139 (0.15)         10           1
test_rope_variations_bwd_benchmark[executor='thunder'-rope_variation='llama_3_8B_rope']          1,925.9490 (9.40)     1,936.4170 (9.11)     1,931.5364 (9.32)     2.8123 (2.74)     1,931.2535 (9.37)     2.3720 (2.46)          3;1    517.7226 (0.11)         10           1

But if I run the manual rope_example scripts, I get these:

root@a9fb56dcd91f:/volume/rope/rope_examples# python hf_phi3.py --execs Thunder-nvFuser
                             Model  Batch-Size  ...  Forward-Time(ms) Backward-Time(ms)
0  microsoft/Phi-3.5-mini-instruct           1  ...             0.597             0.739
root@a9fb56dcd91f:/volume/rope/rope_examples# python hf_qwen2.py --execs Thunder-nvFuser
                      Model  Batch-Size  ...  Forward-Time(ms) Backward-Time(ms)
0  Qwen/Qwen2.5-7B-Instruct           1  ...             0.397             0.507
root@a9fb56dcd91f:/volume/rope/rope_examples# python hf_mistral_nemo.py --execs Thunder-nvFuser
                              Model  Batch-Size  ...  Forward-Time(ms) Backward-Time(ms)
0  mistralai/Mistral-Nemo-Base-2407           1  ...             0.593             0.322
root@a9fb56dcd91f:/volume/rope/rope_examples# python lit_gpt_models.py --execs Thunder-nvFuser
           Model  Batch-Size  Sequence-Length         Executor  Forward-Time(ms)  Backward-Time(ms)
0  Llama-2-7b-hf           2             4096  Thunder-nvFuser             0.629              0.960
        Model  Batch-Size  Sequence-Length         Executor  Forward-Time(ms)  Backward-Time(ms)
0  Llama-3-8B           2             8192  Thunder-nvFuser             1.383              1.567

I'll double-check the measurement script, as well as the compile options (i.e., thunder trace options).

We need to do the same sanity check for torchcompile later.

@Priya2698 (Collaborator):

> Looking at the thunder-nvfuser timing.
>
> Strangely, the benchmark numbers don't match the numbers from Kevin's example. These are the measurements from pytest: [benchmark table quoted above]
>
> But if I run the manual rope_example scripts, I get these: [script output quoted above]
>
> I'll double-check the measurement script, as well as the compile options (i.e., thunder trace options).
>
> We need to do the same sanity check for torchcompile later.

I think the backward difference is because of gradient accumulation between multiple runs, which I am fixing in PR #3394. Let me try running your PR with that fix. Are you seeing any difference in the forward pass as well?

I will try to push PR #3394 in so that we can add your RoPE example easily. I was verifying the measurements in that PR against nsys, and they match for operators like rmsnorm, softmax, and layernorm for all configs and executors.

@jjsjann123 (Collaborator, Author):

> Are you seeing any difference in the forward pass also?

Yes, if you look at the mistral example, the forward time looks quite different:

test_rope_variations_fwd_benchmark[executor='thunder'-rope_variation='hf_mistral_nemo_rope'] 322.8819 (1.56)
vs

root@a9fb56dcd91f:/volume/rope/rope_examples# python hf_mistral_nemo.py --execs Thunder-nvFuser
                              Model  Batch-Size  ...  Forward-Time(ms) Backward-Time(ms)
0  mistralai/Mistral-Nemo-Base-2407           1  ...             0.593             0.322

I checked the nsys profile, and it looks like the same trace (same kernels, lined up the same way), yet the reported numbers differ.

I tried to verify the event times recorded in our benchmark. If I run everything vanilla in the for loop below, the forward number does match. with_executor adds a bit of time on top, which is understandable, but run_benchmark changes everything. The mistral example is host-bound; is run_benchmark measuring only the kernel times? The reported number roughly matches that.

import pytest
import torch

# clear_dynamo_cache, rope_setup, with_executor, and run_benchmark come from the
# nvFuser python benchmark harness (benchmarks/python).


@pytest.mark.parametrize(
    "rope_variation",
    [
        "llama_2_7b_hf_rope",
        "llama_3_8B_rope",
        "hf_qwen2_rope",
        "hf_phi3_rope",
        "hf_mistral_nemo_rope",
    ],
)
@pytest.mark.parametrize("executor", ["eager", "torchcompile", "thunder"])
def test_rope_variations_fwd_benchmark(
    benchmark,
    rope_variation: str,
    executor: str,
):
    if executor == "torchcompile":
        clear_dynamo_cache()

    model, inputs, _ = rope_setup[rope_variation]()

    def fwd_call(inp):
        return model(*inp)

    # benchmark_fn = with_executor(executor, fwd_call)
    # run_benchmark(benchmark, benchmark_fn, inputs())

    import thunder

    benchmark_fn = thunder.jit(fwd_call)

    # CUDA events bracket the whole call, so this measures end-to-end time
    # (host overhead included), unlike run_benchmark's kernel-only timing.
    evt_start_fwd = torch.cuda.Event(enable_timing=True)
    evt_end_fwd = torch.cuda.Event(enable_timing=True)

    def bn_fn(inp):
        torch.cuda.nvtx.range_push("Inputs Generation")
        evt_start_fwd.record()
        ret = benchmark_fn(inp)
        evt_end_fwd.record()
        torch.cuda.nvtx.range_pop()
        torch.cuda.synchronize()
        fwd_time = evt_start_fwd.elapsed_time(evt_end_fwd)
        print(f"{fwd_time=}")
        return ret

    for i in range(10):
        bn_fn(inputs())
    # run_benchmark(benchmark, bn_fn, inputs())

@jjsjann123 (Collaborator, Author):

A few reasons for the performance discrepancy between @kevinstephano's example and the benchmark here:

  1. In our CI tests we use run_benchmark, which only measures kernel time, unlike Kevin's example, which measures end-to-end time in each call. So the benchmark numbers hide some host latency, which is why some of the forward times reported here are lower than in Kevin's Python examples. The same applies to the backward pass.
  2. We currently run the backward functions in a loop without wiping the grads on inputs/parameters, so every backward pass ends up calling an extra pointwise kernel that accumulates gradients, which bumps up the total measured time. E.g., above, the pytest benchmark reports an accumulated mistral backward kernel time of 430us, higher than the 322us end-to-end time reported by the Python script, because the Python script doesn't run the accumulation kernels. (A minimal sketch of clearing grads between rounds follows below.)
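
A minimal sketch of the kind of fix for point 2, assuming the backward benchmark loop has access to the model and its inputs (the names below are placeholders, not the actual harness API):

import torch

def clear_grads(model, inputs):
    # Setting .grad to None (rather than zeroing) avoids launching the
    # gradient-accumulation kernel on the next backward pass.
    for p in model.parameters():
        p.grad = None
    for t in inputs:
        if isinstance(t, torch.Tensor) and t.requires_grad:
            t.grad = None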

benchmarks/python/core.py (outdated review thread, resolved)
@jjsjann123 (Collaborator, Author):

!build

@jjsjann123 marked this pull request as ready for review December 20, 2024 23:39
benchmark,
unary_bwd_torch,
[output, grad()],
iobytes=iobytes() if executor == "thunder" else None,
@jjsjann123 (Collaborator, Author) commented on the diff:

@Priya2698 does this look about right, if I just want a manual IObytes computation for the thunder backward?

@jjsjann123 (Collaborator, Author):

Note to self: double-check the backward IObytes computation again.

I did it once while looking at the backward thunder trace, but I'm not totally confident in it.

Collaborator:

So these will be the IObytes based on the inputs/outputs of the nvfuser definition? They should be used for all 3 executors then.

@jjsjann123 (Collaborator, Author):

That sounds fine to me. I'll make the change.

Another question on how we plan to handle this in the long run.

If we do use the same IOBytes across executors, then for this instance we'll just be using the thunder autograd as the reference point. The other two executors might run a different autograd strategy, which means their backward IOBytes would not be calculated faithfully. Is that the right way to interpret this? The reported IOBytes would just be a reference point.
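
For reference, a minimal sketch of how a manual IObytes number could be computed from the tensors crossing the fusion boundary. This is an illustrative helper, not the computation actually used in this PR; it assumes each input is read once and each output written once:

import torch

def manual_iobytes(inputs, outputs):
    # Total bytes moved = sum of (element count * element size) over all I/O tensors.
    tensors = list(inputs) + list(outputs)
    return sum(t.numel() * t.element_size() for t in tensors if isinstance(t, torch.Tensor))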

     if executor == "thunder":
-        return thunder.jit(fwd_fn, nv_enable_bookend=False, executors=[nvfuserex])
+        return thunder.jit(fwd_fn, nv_enable_bookend=False, executors=[nvfuserex], **kwargs)
@jjsjann123 (Collaborator, Author) commented on the diff:

Adding this so I can run with nv_enable_matmul=True for RoPE. cc @naoyam @kevinstephano
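
A usage sketch, assuming with_executor forwards its extra keyword arguments to thunder.jit as in the diff above (fwd_call being the benchmark's forward callable):

# Extra thunder/nvFuser compile options can now be threaded through the helper:
benchmark_fn = with_executor("thunder", fwd_call, nv_enable_matmul=True)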

Collaborator:

It might be better to put the RoPE configuration and setup in a separate file, so this file contains only the benchmark function itself, for easier readability.

@jjsjann123 requested a review from @Priya2698 December 23, 2024 21:28
@jjsjann123 (Collaborator, Author):

!test
