MXFP4 double buffer barrier fix by adedespirlet · Pull Request #866 · iree-org/wave

adedespirlet · 2026-02-11T03:12:12Z

This PR changes the barrier placement in the double-buffered MXFP4 GEMM to reduce stalls observed in the trace.

The previous schedule used amdgpu.lds_barrier before the wavefront-staggering branch. This forced an immediate s_waitcnt vmcnt(0), causing large stalls in the prologue. Because this LDS synchronization (vmcnt(0)) came too early, the indexing logic could not overlap with the in-flight global memory loads.

To resolve this, I replaced the amdgpu.lds_barrier with a plain rocdl.s.barrier, and inside the loop I inserted an amdgpu.memory_counter_wait load(0) immediately before the first vector load from LDS.
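A rough way to see why the later wait helps is a toy cost model in Python. This is only a sketch: the function name `prologue_cycles` and the cycle counts are made up for illustration and are not measurements from the trace. It assumes the global loads are issued first and that indexing work can proceed independently of them.

```python
def prologue_cycles(load_latency, index_cycles, wait_before_index):
    """Toy model: cycles until the first LDS read can issue.

    All numbers are hypothetical; this is not taken from the kernel.
    """
    if wait_before_index:
        # Early vmcnt(0): the wave idles until the loads land,
        # and only then starts the indexing logic.
        return load_latency + index_cycles
    # Late wait: indexing runs while the loads are still in flight,
    # so only the longer of the two is paid.
    return max(load_latency, index_cycles)


# Hypothetical latencies, chosen only to make the difference visible.
early = prologue_cycles(load_latency=500, index_cycles=120, wait_before_index=True)
late = prologue_cycles(load_latency=500, index_cycles=120, wait_before_index=False)
print(early, late, early - late)
```

With these made-up numbers the early wait costs 620 cycles against 500 for the late wait, so the saving is the indexing time that now overlaps with the loads.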

Signed-off-by: Aurore De Spirlet <aurore.despirlet@amd.com>

That is, rather than extracting an i8 (and bitcasting it to f8e8m0fnu) from a 4-element vector (i.e. one VGPR) to use as the mfma argument, use the extraction offset as the opsel argument of the mfma. This should reduce register pressure. Signed-off-by: William G Hatch <william@hatch.uno>
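The idea can be sketched in Python as byte selection from a packed 32-bit word. This is an illustration under assumptions, not the lowering itself: the helper names are hypothetical, the byte order (element 0 in the low byte) is assumed, and the actual mapping from opsel value to byte is defined by the hardware. The E8M0 decoding (biased exponent with bias 127, 0xFF as NaN) follows the usual definition of that format.

```python
def e8m0_to_float(byte):
    """Decode an E8M0 scale: an 8-bit biased exponent, no sign or mantissa."""
    if byte == 0xFF:
        return float("nan")
    return 2.0 ** (byte - 127)


def select_scale_byte(packed, sel):
    """Pick byte `sel` (0..3) from a 32-bit word holding four scales.

    Assumes element 0 sits in the low byte (hypothetical layout).
    """
    return (packed >> (8 * sel)) & 0xFF


# Four example scale bytes packed into one 32-bit "register".
scales = [127, 128, 126, 130]
packed = sum(b << (8 * i) for i, b in enumerate(scales))

# Passing the selector instead of a pre-extracted byte yields the same scale,
# without needing a separate register for the extracted value.
decoded = [e8m0_to_float(select_scale_byte(packed, sel)) for sel in range(4)]
print(decoded)
```

In this model the packed word is the only value kept live; the selector is a small constant, which is the register-pressure argument made above.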

Signed-off-by: William G Hatch <william@hatch.uno>

The speculative_decoding.py tests change because the extra canonicalize pass optimizes some things away, e.g. nested conditionals are simplified to one conditional with a merged condition expression. The pipelined_attention.py changes are similar: it looks like double canonicalization merged some things, which moved bits around and changed some counts. Signed-off-by: William G Hatch <william@hatch.uno>
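For readers unfamiliar with this kind of rewrite, here is a small Python analogy of merging nested conditionals. It only illustrates the shape of the simplification; the function names are hypothetical and the real change happens on the generated MLIR, not on Python code.

```python
import itertools


def nested(a, b):
    # Shape before canonicalization: one conditional inside another.
    if a:
        if b:
            return "taken"
    return "skipped"


def merged(a, b):
    # Shape after: a single conditional with a merged condition.
    if a and b:
        return "taken"
    return "skipped"


# Both forms agree on every input, so only the op counts in the
# test expectations change, not the behavior.
agree = all(
    nested(a, b) == merged(a, b)
    for a, b in itertools.product([False, True], repeat=2)
)
print(agree)
```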

Signed-off-by: William G Hatch <william@hatch.uno>

Signed-off-by: Aurore De Spirlet <aurore.despirlet@amd.com>

panditsa · 2026-02-20T15:45:23Z

examples/python/7.1_schedule.py

 ):
    """Double-buffered MXFP4 GEMM, 8 waves, with stagger."""
+
+    mlir = """


Do we still need to pass the MLIR text entirely?

fix with barriers to reduce vmcnt stall

49c69ef

Signed-off-by: Aurore De Spirlet <aurore.despirlet@amd.com>

adedespirlet requested a review from panditsa February 11, 2026 03:12

adedespirlet and others added 15 commits February 11, 2026 16:25

run 100 times

bee5282

Signed-off-by: Aurore De Spirlet <aurore.despirlet@amd.com>

fix barriers

6562db4

Signed-off-by: Aurore De Spirlet <aurore.despirlet@amd.com>

add triple buffer

f54ab83

Signed-off-by: Aurore De Spirlet <aurore.despirlet@amd.com>

latest most performant kernel for big gemms

08c0a40

Signed-off-by: Aurore De Spirlet <aurore.despirlet@amd.com>

modify wave mapping, to prevent inter cluster dependency

f213322

new schedule mixed ping pong

87ec7e7

Signed-off-by: Aurore De Spirlet <aurore.despirlet@amd.com>

add benchmark to preshuffle example

45c5a03

Signed-off-by: William G Hatch <william@hatch.uno>

add lit test for preshuffle with vector load and opsel

f88f412

Signed-off-by: William G Hatch <william@hatch.uno>

fix some tests

538661d

Signed-off-by: William G Hatch <william@hatch.uno>

move test into merge_scale_reads.py

4bd36da

Signed-off-by: William G Hatch <william@hatch.uno>

tighten checks for specific ops

b8f1b5f

Signed-off-by: William G Hatch <william@hatch.uno>

mxfp ping pong with shuffling

4f80642

Signed-off-by: Aurore De Spirlet <aurore.despirlet@amd.com>

mxfp4 b shuffle

2381b35

Signed-off-by: Aurore De Spirlet <aurore.despirlet@amd.com>

panditsa reviewed Feb 20, 2026

adedespirlet wants to merge 16 commits into iree-org:main from adedespirlet:mxfp4

3 participants