rope_benchmark #3550
base: main
Conversation
@pytest.mark.parametrize(
This is the only part that's worth reviewing.
The code above was dumped directly from Kevin's rope example script. (Note that I had to update the script with nv_enable_matmul in thunder.jit; otherwise we see segmentation at the nvfuser definition level.)
I also want to add another toy example where we sweep on the batch size, but I'll do that in a separate PR.
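For reference, a minimal sketch of how a forward function ends up jitted for the nvFuser executor with matmul enabled, based on the diff further down in this PR; the import path for nvfuserex and the helper name are assumptions, not the benchmark's exact code:

```python
import thunder
# Assumed import path for Thunder's nvFuser executor; verify against your install.
from thunder.executors.nvfuserex import nvfuserex

def compile_rope_fwd(fwd_fn):
    # nv_enable_matmul=True keeps matmul inside the nvFuser region, which avoids
    # the segmentation we otherwise hit at the nvfuser definition level.
    return thunder.jit(
        fwd_fn,
        nv_enable_bookend=False,
        executors=[nvfuserex],
        nv_enable_matmul=True,
    )
```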
@Priya2698 is adding the Thunder backend in #3394. Does that mean we can just have the forward functions?
We will also benchmark the backward pass with the Thunder backend.
Yes, so we don't need to have the backward implementations explicitly, right?
Looking at the thunder-nvfuser timing: strangely, the benchmark numbers don't match the benchmark from Kevin's example.
But if I run the manual rope_example, I'm getting these:
I'll double-check the measurement script as well as the compile options (i.e. thunder trace options). We need to do the same sanity check for torchcompile later.
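As a side note on checking the thunder trace options: one way to see what Thunder actually executed is to print the final traces of the jitted function. A sketch below, assuming thunder.last_traces and thunder.last_backward_traces are available in the installed Thunder version; fwd_fn here is just a stand-in:

```python
import torch
import thunder

def fwd_fn(x):
    # stand-in for one of the RoPE forward functions
    return torch.sin(x) * x

jitted = thunder.jit(fwd_fn)
out = jitted(torch.randn(8, 8, device="cuda", requires_grad=True))
out.sum().backward()

# The last entry is the final (post-transform) execution trace.
print(thunder.last_traces(jitted)[-1])
print(thunder.last_backward_traces(jitted)[-1])
```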
!test
Can you post your final numbers for the benchmark in the description, and whether they match what you see from Kevin's script?
Pulled the numbers on this branch:
Kernel time somewhat checks out. The difference between the benchmark and the manual measurement comes from the earlier comment: #3550 (comment). E.g., for llama_2_7b_hf_rope backward, the benchmark measurement includes a pointwise apply kernel, which takes about 363 us; that explains why the benchmark shows a longer kernel time compared to the manual benchmark.
Can you run them on a PJNL H100 machine? Also, what do the results with torch.compile look like?
These are the perf numbers on A100 for torch.compile:
Llama-3-8B backward looks somewhat odd, but that seems to come from running both Llama configs in a single setting. If I run Llama-3-8B by itself, the timing looks about right.
We could be getting some cached kernel?!
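One way to rule out cross-config kernel caching when both Llama configs run in one process is to clear the torch.compile state between configs. A sketch using torch._dynamo.reset(); the config tuple shape is hypothetical and this is not necessarily what the benchmark harness does:

```python
import torch

def run_isolated(configs):
    # configs: iterable of (name, fn, inputs) tuples (hypothetical shape).
    results = {}
    for name, fn, inputs in configs:
        torch._dynamo.reset()  # drop graphs/kernels cached by earlier configs
        compiled = torch.compile(fn)
        results[name] = compiled(*inputs)
        torch.cuda.synchronize()
    return results
```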
Kevin's script has two variations of torch.compile-based execution: torch.compile and Thunder-torch.compile. Could you please show both? He mentioned Thunder-torch.compile is what we should look at, and I'm seeing much faster performance than plain torch.compile on H100.
Meanwhile, the H100 numbers for nvfuser:
Interesting. Is thunder-torch.compile what we should be using in our benchmarks as well? I'm asking since we do not have that executor in the pytest benchmark yet.
Here's what I got on a PJNL H100 machine with the mistral benchmark:
As you can see, the
That's what I heard from @kevinstephano.
Another question, maybe to @Priya2698: can we enable result verification by default? I remember there's still a tolerance issue, but for these RoPE benchmarks, since there's almost no reduction (there's some in the backward cases), maybe verification would work fine?
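For context, a minimal sketch of what opt-in verification could look like: compare each executor's output against eager with a loose tolerance. The helper name and tolerances are illustrative, not the benchmark's actual API:

```python
import torch

def verify_against_eager(eager_fn, compiled_fn, inputs, rtol=1e-3, atol=1e-3):
    # Run the eager reference and the executor-under-test on the same inputs
    # and compare elementwise within the given tolerance.
    expected = eager_fn(*inputs)
    actual = compiled_fn(*inputs)
    torch.testing.assert_close(actual, expected, rtol=rtol, atol=atol)
```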
I can double-check the profile tomorrow. I was using the H100x2 node today.
Noted. I'll add another executor.
I realized that Kevin's benchmark script had been updated to measure profiler time as well, and I was two commits behind that. The previous discrepancy came from the different measurement.
With the updated manual benchmark, we are making an apples-to-apples comparison now. On H100, manual benchmark:
pytest benchmark:
Forward time mostly matches (except for hf_phi3; that's because we enable matmul in the pytest benchmark, which is not enabled in the manual benchmark).
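For reference, a sketch of one way to sum GPU kernel time from torch.profiler, in the spirit of the profiler-time measurement mentioned above. This is not Kevin's script, and the event attribute name may differ slightly across PyTorch versions:

```python
import torch
from torch.profiler import profile, ProfilerActivity

def kernel_time_us(fn, *args, warmup=3, iters=10):
    for _ in range(warmup):
        fn(*args)
    torch.cuda.synchronize()
    with profile(activities=[ProfilerActivity.CUDA]) as prof:
        for _ in range(iters):
            fn(*args)
        torch.cuda.synchronize()
    # Sum self CUDA time over all recorded events, averaged per iteration.
    total_us = sum(evt.self_cuda_time_total for evt in prof.key_averages())
    return total_us / iters
```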
):
    kwargs = {}
    if executor == "thunder":
        kwargs["nv_enable_matmul"] = True
BTW, this is where I'm enabling matmul in nvfuser.
This gives us a single fusion region, which I believe is something we would like. cc'ing @naoyam
        return thunder.jit(
            fwd_fn, nv_enable_bookend=False, executors=[nvfuserex], **kwargs
        )
    if executor == "thunder-torchcompile":
thunder-torchcompile is the config we wanted for the RoPE comparison. Not sure if this is something we would also like to enable for other benchmarks. cc'ing @Priya2698
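A sketch of what the thunder-torchcompile branch might look like. The executor symbol below is an assumption about Thunder's API surface (e.g. thunder.executors.torch_compile.torch_compile_ex) and should be verified against the installed Thunder version:

```python
import thunder
# Assumed location of Thunder's torch.compile-backed executor; verify before use.
from thunder.executors.torch_compile import torch_compile_ex

def compile_thunder_torchcompile(fwd_fn):
    # Route tracing through Thunder, but let torch.compile generate the kernels.
    return thunder.jit(fwd_fn, executors=[torch_compile_ex])
```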
-def with_executor(executor: str, fwd_fn: Callable) -> Callable:
-    assert executor in ["eager", "torchcompile", "thunder"]
+def with_executor(executor: str, fwd_fn: Callable, **kwargs) -> Callable:
+    assert executor in ["eager", "torchcompile", "thunder", "thunder-torchcompile"]
Suggested change:
-    assert executor in ["eager", "torchcompile", "thunder", "thunder-torchcompile"]
+    assert executor in ["eager", "torchcompile", "thunder-nvfuser", "thunder-torchcompile"]
@@ -221,9 +226,9 @@ def set_metrics(
     % Peak Bandwidth (SOL): 100 * Bandwidth /PEAK_BANDWIDTH
     """
     if not iobytes:
-        if isinstance(inputs, torch.Tensor):
+        if not isinstance(inputs, Iterable):
Why is this changed?
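For context on the question above, a quick check of how the old and new guards classify typical inputs (behavior assumed from standard collections.abc and PyTorch semantics; worth double-checking on the versions in use):

```python
import torch
from collections.abc import Iterable

t = torch.randn(4)
print(isinstance(t, torch.Tensor))   # True
print(isinstance(t, Iterable))       # also True: Tensor defines __iter__
print(isinstance((t, t), Iterable))  # True
print(isinstance(3.0, Iterable))     # False
```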
@pytest.mark.parametrize(
    "executor", ["eager", "torchcompile", "thunder", "thunder-torchcompile"]
)
def test_rope_variations_fwd_benchmark(
IIUC, we also have test_rope_benchmark, which is separate from test_rope_variations_fwd_benchmark and test_rope_variations_bwd_benchmark. What does test_rope_benchmark evaluate?
Rope benchmark extracted from lightning trace.
TODO: