Refactor to use FastDivmod for predicated strided dgrad iterators. #1453
base: main
Not seeing benefits from this one either. ran
@ZelboK, you can compile and run only the align8 kernels for this shape. Use the string "cutlass_tensorop_h16816dgrad_optimized*align8" both for cmake and when running cutlass_profiler. Are the results in comparison_hgrad.csv with fast_divmod applied to both loads and stores?
@manishucsd Sorry, that file isn't complete; please ignore it. I'll paste the complete one here (also run with align8 only). This one will have load, store, load-and-store, and normal GFLOPS benchmarks.
Thanks @ZelboK for the work on this and the analysis. @hwu36, are you profiling this further with more problem sizes on A100 and potentially merging it?
This PR has been labeled
I tried on an A100 and observed a small regression in perf.
On my 3080
BEFORE
line:
int n = npq_offset / (p_ * q_);
translates to
before_first_line_sass.txt
line:
int residual = npq_offset % (p_ * q_);
translates to
before_second_line_sass.txt
(I'll omit the assembly for the other two lines for brevity for now.)
AFTER
this code:
leads to
assembly
Last 3 columns are:
Live Registers, Warp Stall Sampling, Instructions Executed
The FastDivmod was formed like this:
All tests pass. @hwu36
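For readers without the attachments, here is a minimal host-side sketch of the idea: replacing the `/` and `%` by `p_ * q_` with a precomputed multiply-and-shift. `FastDivmodSketch`, `NPQ`, `decompose_plain`, and `decompose_fast` are hypothetical names for illustration, not the CUTLASS `FastDivmod` class or the code in this PR. The `p`/`q` split of `residual` by `q_` is an assumption about the two omitted lines. The sketch only handles non-negative dividends and positive divisors.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a FastDivmod-style helper (NOT the CUTLASS class).
// Precomputes a multiplier and shift so that, for 0 <= dividend < 2^31,
// dividend / divisor == (dividend * multiplier) >> shift.
struct FastDivmodSketch {
  int divisor;
  uint64_t multiplier;
  unsigned shift;

  explicit FastDivmodSketch(int d) : divisor(d), multiplier(0), shift(0) {
    assert(d > 0);
    // ceil(log2(d))
    unsigned log2_ceil = 0;
    while ((uint64_t(1) << log2_ceil) < uint64_t(d)) ++log2_ceil;
    shift = 31 + log2_ceil;
    // multiplier = ceil(2^shift / d)
    multiplier = ((uint64_t(1) << shift) + uint64_t(d) - 1) / uint64_t(d);
  }

  // Computes quotient and remainder without a hardware divide.
  void operator()(int &quotient, int &remainder, int dividend) const {
    assert(dividend >= 0);
    quotient = int((uint64_t(dividend) * multiplier) >> shift);
    remainder = dividend - quotient * divisor;
  }
};

struct NPQ { int n, p, q; };

// Plain version, mirroring the two lines quoted above (P, Q play the role of p_, q_).
NPQ decompose_plain(int npq_offset, int P, int Q) {
  int n = npq_offset / (P * Q);
  int residual = npq_offset % (P * Q);
  // Assumed form of the two omitted lines:
  return {n, residual / Q, residual % Q};
}

// Same decomposition with precomputed divmod objects for P*Q and Q.
NPQ decompose_fast(int npq_offset,
                   const FastDivmodSketch &pq_divmod,
                   const FastDivmodSketch &q_divmod) {
  int n, residual, p, q;
  pq_divmod(n, residual, npq_offset);
  q_divmod(p, q, residual);
  return {n, p, q};
}
```

The divmod objects would be built once on the host (for example `FastDivmodSketch pq(P * Q), q(Q);`) and passed in, so the per-access cost is a multiply, a shift, and a multiply-subtract. Whether that beats the compiler's own integer-division sequence on a given GPU is exactly what the benchmarks in this thread are probing.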
Here are the benchmarks from cutlass_profiler from running
GFLOPS
load_store_k1024.conv2d.csv
loadk1024.conv2d.csv
normal_k1024.conv2d.csv
store_k1024.conv2d.csv
the_four.csv