Expand 7.2 example to have double buffer scheduling and scale preshuffle #912
willghatch wants to merge 1 commit into main
Previously, 7.2 only did scale preshuffle and didn't build on the optimizations of 7.1. This adds a scheduled preshuffle kernel to 7.2, like the one in 7.1. However, the exact schedule is not the same: 7.2 does not use LDS memory for the scales, because the optimization that merges loads into vector loads does not yet work for LDS. So this includes a new schedule. It has a `--benchmark` flag that benchmarks and reports results for the vanilla mxfp4, scheduled, preshuffled (non-scheduled), and preshuffled scheduled kernels. Signed-off-by: William G Hatch <william@hatch.uno>
|
This is mostly just an LLM merging things. It imports kernels from 7.1 for the benchmarks to try to make the comparison fair. Running the benchmarks a few times on a shared machine, the scheduled preshuffle kernel typically wins, but there is enough jitter that the runtime of any of the kernels can go high enough to lose.
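One way to reduce the effect of that jitter is to time each kernel many times and compare medians (or minimums) rather than single runs. The sketch below is only an illustration and is not the `--benchmark` implementation in this PR: `bench`, `compare`, and the stand-in callables are hypothetical names, and it assumes each callable blocks until its kernel has finished (for a GPU kernel that means synchronizing the device inside the callable).

```python
import statistics
import time
from typing import Callable, Dict


def bench(fn: Callable[[], None], warmup: int = 3, repeats: int = 20) -> Dict[str, float]:
    """Time fn repeatedly and summarize, so one noisy run does not decide the result."""
    # Warm-up runs absorb one-time costs (compilation, caches) and are not timed.
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()  # assumption: fn returns only after the work is complete
        samples.append(time.perf_counter() - start)
    return {
        "min": min(samples),
        "median": statistics.median(samples),
        "max": max(samples),
    }


def compare(kernels: Dict[str, Callable[[], None]], **kwargs) -> Dict[str, Dict[str, float]]:
    """Benchmark every named kernel with the same settings."""
    return {name: bench(fn, **kwargs) for name, fn in kernels.items()}


if __name__ == "__main__":
    # Hypothetical stand-ins for the four variants; replace with real kernel launches.
    kernels = {
        "vanilla": lambda: sum(range(20_000)),
        "scheduled": lambda: sum(range(15_000)),
        "preshuffled": lambda: sum(range(15_000)),
        "preshuffled_scheduled": lambda: sum(range(10_000)),
    }
    for name, stats in compare(kernels).items():
        print(f"{name:>22}: median {stats['median'] * 1e6:8.1f} us, "
              f"min {stats['min'] * 1e6:8.1f} us, max {stats['max'] * 1e6:8.1f} us")
```

On a shared machine the minimum is often the most stable figure, since interference from other jobs can only make a run slower. The median shows what a typical run looks like.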
|
Just a thought: I think we should keep 7.2 as just scale preshuffle. The benchmarking script could be a separate one.
|
I'm fine with moving out the benchmark, it's just been convenient when working on it. But do you mean that you want the preshuffle scale + scheduling in a different file, too? Or leave it as it is here except to remove the benchmarking?
|
I think moving the benchmark to a different file should be fine. Requesting @panditsa to comment.
|
@panditsa since you also made a similar PR to this one that merges the preshuffle and the scheduling, would you drop a link to yours here for reference? Then I'm happy to close this one.