JIT: Visit blocks in RPO during LSRA #107927
Conversation
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
This reverts commit 4a9e0ff.
/azp run runtime-coreclr outerloop, Fuzzlyn
Azure Pipelines successfully started running 2 pipeline(s).
cc @dotnet/jit-contrib, @AndyAyersMS @kunalspathak PTAL. Fuzzlyn failures are known or NaN false positives. Note that LSRA previously had three ordering strategies: the default preds-first one, lexical order, and randomized order (which was never implemented). RPO looks like a viable replacement for the first one, and lexical order doesn't make much sense if we plan to move block layout later, so I removed the functionality for specifying LSRA's block order. Is it ok to remove this for now, and add it back in if/when we decide to implement a randomized order?
Diffs are large, though a net size improvement. Looking at the instructions retired per collection, the larger MinOpts TP regressions are concentrated in collections with relatively few MinOpts methods, so I think the TP impact isn't as bad as it looks.
The Fuzzlyn linux/arm failures seem to expose some more issues caused by this change. We should address all of them before merging this PR.
The assertion
/azp run Fuzzlyn
Azure Pipelines successfully started running 1 pipeline(s).
@kunalspathak the Linux arm failure didn't repro in the last Fuzzlyn run, so if it is a bug, it doesn't readily repro. Aside from inner/outerloop tests, are there any other suites you'd like me to run? JitStress doesn't seem to do anything interesting for block ordering, though if it's still worthwhile to run LSRA stress modes, I can do that -- thanks!
Yes, since Fuzzlyn is randomized, it might not necessarily repro on every run, but we should take the failure that we saw in the previous run, see why it showed up with this change, and go from there. I would usually run
/azp run runtime-coreclr jitstressregs, runtime-coreclr jitstressregs-x86, runtime-coreclr jitstress2-jitstressregs
Azure Pipelines successfully started running 3 pipeline(s).
Note that the diffs from the latest run look quite different because the collections got a bit messed up on x64: we're missing
When visiting a block during an RPO traversal, we check if the block is a loop header, and if so, we visit the rest of the loop's body before visiting anything else -- value numbering currently does this, if you want to see what the implementation looks like. This visit ordering has the nice property of keeping loop bodies compact.
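For intuition, here is a minimal sketch of that compaction property (Python for brevity rather than the JIT's C++; the `loops` mapping from each header to its body blocks is an assumed input, not the JIT's actual data structure). Starting from a plain RPO, each loop's body blocks are pulled forward to sit immediately after their header, preserving their relative order:

```python
def compact_loops(rpo, loops):
    """Reorder an RPO so each loop body is contiguous.

    rpo:   list of blocks in reverse post-order
    loops: dict mapping each loop header to the set of blocks
           in that loop's body (header excluded)
    """
    order, placed = [], set()
    for block in rpo:
        if block in placed:
            continue
        order.append(block)
        placed.add(block)
        if block in loops:
            # Pull the loop's body up to follow its header,
            # keeping the body blocks' relative RPO order.
            for member in rpo:
                if member in loops[block] and member not in placed:
                    order.append(member)
                    placed.add(member)
    return order
```

For example, with a plain RPO `[0, 1, 2, 4, 3]` where block `1` heads a loop with body `{2, 3}` and `4` is the loop exit, the compacted order is `[0, 1, 2, 3, 4]`. A real implementation would also handle nested loops (e.g. by recursing innermost-first), which this sketch glosses over.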
Do we have a tracking issue in the dotnet/runtime repo? There are more regressions than improvements, even though the asmdiff code size/PerfScore numbers suggested otherwise. We should revisit some of the regressions.
The first one is #108201. Still waiting on arm64 results (tomorrow). Also, some of this might be mitigated by the loop-aware RPO.
@amanasifkhalid - can we check locally whether the loop-aware RPO helps mitigate some of the regressions?
Sure, I'll try that today. I'll have the top regressions collated here soon.
x64 regressions:

x64 improvements:
@amanasifkhalid - thanks for sharing the data, but can you please summarize the takeaway from it and the next steps?
I'm still trying to repro the top regressions locally -- if I revert this change on top of [...] You're correct that we have more regressions than improvements (286 vs. 176) at the moment. To get an idea of how the magnitudes of regressions/improvements compare, here are some histograms. Note that some improvements became regressions over time, and vice versa, hence the odd tails for the recent scores. Looking at the original scores, it looks like the improvements tend to be bigger than the regressions, which seems promising for the loop-aware RPO?
I've looked at some regressions locally, and some look like they can easily be fixed by the loop-aware RPO. In the absence of high-fidelity edge likelihoods, we can end up with flowgraphs like this:
Since the profile-aware RPO only considers edge likelihoods, there isn't an obvious successor of
The old LSRA block order, by contrast, uses block weights to decide on the next successor, so it gets this one right:
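To make the difference concrete, here is a hypothetical sketch (the helper name and inputs are illustrative, not JIT code) of successor selection that consults edge likelihoods first and falls back to block weights on ties, which is how a 50/50 likelihood split can still resolve to the hot path:

```python
def pick_next_block(candidates, block_weight, edge_likelihood):
    """Pick the next successor to sequence.

    candidates:      list of successor blocks to choose from
    block_weight:    dict of profile weights per block
    edge_likelihood: dict of outgoing-edge likelihoods per block
    """
    best = max(edge_likelihood[b] for b in candidates)
    tied = [b for b in candidates if edge_likelihood[b] == best]
    if len(tied) == 1:
        # Likelihoods alone give a clear winner.
        return tied[0]
    # Likelihoods tie (e.g. 0.5/0.5): break the tie with block
    # weights, as the old preds-first order effectively did.
    return max(tied, key=lambda b: block_weight[b])
```

With equal likelihoods but very different block weights, a likelihood-only policy has no basis to prefer either successor, while this fallback still picks the hot one.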
The loop-aware RPO gets such examples right because of the presence of loops, but it's otherwise not aware of successor blocks' weights. For example, consider this flowgraph, which doesn't have any loops:
Both RPO-based orderings interleave the cold block with the hot paths:
Whereas the previous implementation doesn't:
To handle these cases, I think we can emulate what we do during block reordering, and push rarely-run blocks to the end of the order. It's trivial to implement, and we don't have to worry about EH constraints like we do during block reordering. I think we eventually want these mismatches between likelihoods and block weights to disappear by running profile synthesis late in the frontend, though I don't think I'll get to that until later, so this seems like a decent fix for now. For the remaining regressions I looked at, I'm seeing slight differences in code layout due to more critical edges being split. This seems to happen in cases where the old ordering breaks ties using [...] @AndyAyersMS @kunalspathak does this all sound reasonable? Thanks!
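Pushing rarely-run blocks to the end amounts to a stable partition of the visitation order; a sketch (illustrative names, assuming per-block profile weights are available):

```python
def defer_cold_blocks(order, block_weight, cold_threshold=0.0):
    """Push rarely-run blocks to the end of the visitation order.

    Relative order is preserved within the hot and cold partitions
    (a stable partition), so hot paths stay contiguous up front.
    """
    hot = [b for b in order if block_weight[b] > cold_threshold]
    cold = [b for b in order if block_weight[b] <= cold_threshold]
    return hot + cold
```

Because this only permutes the sequence handed to the allocator rather than the blocks themselves, none of the EH-region placement constraints from real block reordering apply.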
Emulating reordering seems plausible, I guess, but then perhaps we should simply run ordering before LSRA (and re-ordering later if there are new blocks), and have LSRA just use the lexical order? For benchmark runs, I'm surprised we don't see PGO everywhere... are we measuring non-PGO code in some tests?
I was thinking about going this route; from what we see above, better LSRA block orderings also tend to look like better block layouts, so it seems reasonable to just use lexical ordering. The only hurdles I see to this are the fact that we cannot move cold EH blocks to the end of the main body, and the fact that switch lowering can change flow in between block layout and LSRA. We already don't put much effort into ordering switch successors optimally (though 3-opt will probably fix this automatically), so maybe the latter point is fine? I'll give this a shot.
As far as I know, all the microbenchmarks use PGO; the non-PGO examples were PerfScore regressions handpicked from non-tiered SPMI collections to illustrate limitations. For the few benchmark regressions I was able to repro locally, the churn was primarily driven by more critical edges being split, and thus by changes in code layout. My understanding of LSRA's edge resolution is limited, but I don't see an obvious fix for these cases.
Ah, good point... LSRA "layout" need not be EH aware at all. |
While snooping around LSRA, I noticed this TODO, where the logic for remembering the first cold location assumes the first cold block is the beginning of a contiguous cold section. As mentioned above, block layout cannot satisfy this property when we have EH regions, so if keeping this state accurate is important, then we cannot rely on layout order for LSRA. The current RPO traversal doesn't ensure cold blocks are visited last either, so I think it's worth pursuing this invariant as a next step, alongside getting the loop-aware RPO checked in. (Sorry for the recent silence on this front. I've been trying to figure out the source of a TP regression for a massive MinOpts method in #108147, but I haven't been able to get a good trace from Pin on multiple machines. That change is a nice-to-have, so I guess we can get these tweaks into LSRA's FullOpts block order first.)
Part of #107749, and follow-up to #107927. When computing an RPO of the flow graph, ensuring that the entirety of a loop body is visited before any of the loop's successors has the benefit of keeping the loop body compact in the traversal. This is certainly ideal when computing an initial block layout, and may be preferable for register allocation, too. Thus, this change formalizes loop-aware RPO creation as part of the flowgraph API surface, and uses it for LSRA's block sequence.
Part of #107749. LSRA currently does a lexical pass over the block list to build a visitation order. Since we intend to run block layout after LSRA with #107634, LSRA ideally shouldn't be sensitive to lexical ordering, and since the current logic tries to visit a block's predecessors before the block itself, it seems easier and faster to just use an RPO traversal.
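For reference, the classic RPO construction is just a depth-first post-order, reversed -- a sketch (Python rather than the JIT's C++; the adjacency-dict graph representation is an assumption for illustration):

```python
def rpo(succs, entry):
    """Reverse post-order via iterative DFS: every block appears
    before its successors, except along back edges."""
    visited, postorder = {entry}, []
    stack = [(entry, iter(succs.get(entry, ())))]
    while stack:
        block, it = stack[-1]
        for s in it:
            if s not in visited:
                visited.add(s)
                stack.append((s, iter(succs.get(s, ()))))
                break
        else:
            # All successors handled: record this block post-order.
            postorder.append(block)
            stack.pop()
    return postorder[::-1]
```

This gives the predecessors-before-block property the old preds-first logic worked to approximate (up to back edges, where it is unachievable) without depending on lexical block order at all.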