rope_benchmark #3550
base: main
Conversation
@pytest.mark.parametrize(
This is the only part that's worth reviewing.
The code above was dumped directly from Kevin's RoPE example script. (Note that I had to update the script with nv_enable_matmul in thunder.jit; otherwise we see a segmentation fault at the nvFuser definition level.)
I also want to add another toy example where we sweep over the batch size, but I'll do that in a separate PR.
@Priya2698 is adding the Thunder backend in #3394. Does that mean we can just have the forward functions?
We will also benchmark the backward pass with the Thunder backend.
Yes, so we don't need to have the backward implementations explicitly, right?
Looking at the thunder-nvfuser timing: strangely, the benchmark number doesn't match the number from Kevin's example.
But if I run the manual rope_example, I'm getting these:
I'll double-check the measurement script, as well as the compile options (i.e. the Thunder trace options). We need to do the same sanity check for torch.compile later.
I think the backward difference is because of gradient accumulation between multiple runs, which I am fixing in PR #3394. Let me try to run your PR with that fix. Are you seeing any difference in the forward pass as well? I will try to push that PR in so that we can add your RoPE example easily. I was verifying the measurements in that PR against nsys, and they match for operators like rmsnorm, softmax, and layernorm for all configs and executors.
Yes, if you look at the Mistral example, the forward time looks quite different.
I checked the nsys profile; it looks like the same trace (same kernels, lining up the same way). Yet the reported numbers differ. I tried to verify the event time recorded in our benchmark: if I run everything vanilla as a for loop below, the forward number does match.
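For reference, here is a minimal sketch of the kind of vanilla for-loop measurement described above. This is not the benchmark's actual harness: the names (`time_fwd`, `fwd_fn`) are made up for illustration, and it uses wall-clock `time.perf_counter` for portability where the real benchmark records CUDA events and synchronizes the device.

```python
import time

def time_fwd(fwd_fn, n_warmup=5, n_iters=20):
    """Vanilla for-loop timing: warm up, then average over n_iters calls.

    NOTE: wall-clock time is used here for portability; a CUDA benchmark
    would instead record CUDA events and synchronize around the loop.
    """
    for _ in range(n_warmup):
        fwd_fn()
    start = time.perf_counter()
    for _ in range(n_iters):
        fwd_fn()
    return (time.perf_counter() - start) / n_iters

# Hypothetical stand-in for a forward pass.
avg_s = time_fwd(lambda: sum(i * i for i in range(1000)))
```

Comparing a number measured this way against the harness's reported time is one way to localize whether the discrepancy is in the kernels or in the measurement itself.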
A few reasons for the performance discrepancy between @kevinstephano 's example and the benchmark here:
!build
benchmarks/python/test_rope.py
Outdated
benchmark,
unary_bwd_torch,
[output, grad()],
iobytes=iobytes() if executor == "thunder" else None,
@Priya2698 does this look about right, if I just want a manual IObytes computation for the Thunder backward?
Note to myself: double-check the backward IObytes computation again.
I did it once while looking at the backward Thunder trace, but I'm not totally confident in it.
So these will be the IObytes based on the inputs/outputs of the nvfuser definition? They should be used for all 3 executors then.
That sounds fine to me; I'll make the change.
Another question on how we plan to handle these in the long run: if we do use the same IObytes across executors, then for this instance we'll just be using the Thunder autograd as a reference point. But the other two executors might be running a different autograd strategy, which means their IObytes for backward is not calculated faithfully. Is that the right way to interpret this? The reported IObytes would just be a reference point.
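To make the discussion concrete, a manual IObytes computation of the kind referred to above just sums `numel * dtype_size` over the tensors read and written by the trace. The helper names, shapes, and dtypes below are made up for illustration; they are not the actual inputs/outputs of the RoPE backward trace.

```python
DTYPE_BYTES = {"float32": 4, "bfloat16": 2, "float16": 2}

def numel(shape):
    """Number of elements in a tensor of the given shape."""
    n = 1
    for d in shape:
        n *= d
    return n

def iobytes(tensor_specs):
    """Total bytes moved for a list of (shape, dtype) tensor specs."""
    return sum(numel(shape) * DTYPE_BYTES[dtype] for shape, dtype in tensor_specs)

# Hypothetical inputs/outputs of a backward trace.
inputs = [((2, 32, 4096, 128), "bfloat16")]   # e.g. grad_output
outputs = [((2, 32, 4096, 128), "bfloat16")]  # e.g. grad_input
total = iobytes(inputs + outputs)
```

The caveat raised above applies directly: this number is only faithful for an executor whose autograd actually materializes these tensors; for the others it serves as a shared reference point.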
benchmarks/python/core.py
Outdated
if executor == "thunder":
-    return thunder.jit(fwd_fn, nv_enable_bookend=False, executors=[nvfuserex])
+    return thunder.jit(fwd_fn, nv_enable_bookend=False, executors=[nvfuserex], **kwargs)
Adding this so I can run nv_enable_matmul=True for RoPE. cc @naoyam @kevinstephano
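The change under discussion is just kwargs passthrough: extra keyword arguments given to the benchmark's compile helper flow straight into `thunder.jit`. A minimal sketch of the pattern, with a hypothetical `fake_jit` standing in for `thunder.jit` (which in the benchmark is called with `nv_enable_bookend=False` and the nvFuser executor):

```python
def fake_jit(fn, **options):
    # Stand-in for thunder.jit: records the options, returns fn unchanged.
    fn.compile_options = options
    return fn

def with_executor(fwd_fn, executor, **kwargs):
    if executor == "thunder":
        # Extra kwargs (e.g. nv_enable_matmul=True) flow straight through.
        return fake_jit(fwd_fn, nv_enable_bookend=False, **kwargs)
    return fwd_fn

compiled = with_executor(lambda x: x, "thunder", nv_enable_matmul=True)
```

This keeps the helper's signature stable while letting individual benchmarks opt into executor-specific options.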
It might be better to put the RoPE configuration and setup in a separate file. This file could then contain only the benchmark function itself, for easier readability.
!test
RoPE benchmark extracted from a Lightning trace.
TODO: