Redesigned "caar loop pre-boundary exchange", tuned for Frontier #6522

Open · wants to merge 6 commits into master
Conversation

@trey-ornl (Contributor)

This pull request attempts to provide Frontier optimizations like those used in the 2023 Gordon Bell Climate runs, but with a software architecture that meets the requirements of https://acme-climate.atlassian.net/wiki/x/oICd6. Here is a summary of the changes.

  • Create multiple new structs in SphereOperators.hpp to use for the new Caar pre-boundary exchange.
  • Add new file CaarFunctorImpl.cpp that implements the new caar_compute function and template functions with Kokkos loops.
  • Update CaarFunctorImpl.hpp with new functions, a different number of buffers, and an #if to toggle the new caar_compute on and off. If we adopt the new implementation permanently, significant code can be eliminated from this file.
  • Add new viewAsReal functions in ViewUtils.hpp.
  • Remove LaunchBounds<512,1> calls, which I think are incorrect for AMD GPUs, where the Kokkos teams sometimes use 1024 threads.
  • Update ExecSpaceDefs.cpp to use newer Kokkos command-line argument for MPI.
  • Move to Cray Programming Environment cpe/23.12 and ROCm rocm/5.7.1.
  • Add frontier.cmake for standalone Homme builds.
  • Delete old Crusher files for CMake.

I confirmed that the modified code passes the caar_ut unit test using frontier-bfb.cmake, and I ran a single-node performance test of stand-alone Homme provided to me by @oksana, building with the new frontier.cmake. Here are the walltotal times recorded in HommeTime_stats for "caar compute".

Existing Code (#if 0): 153.6221 s
New Code (#if 1): 4.777188 s

This improvement seems almost too good to be true, but I confirmed that the outputs of the two performance runs matched within roundoff, and I confirmed the performance difference using rocprof.

I then compiled CaarFunctor with -Rpass-analysis=kernel-resource-usage. While all the kernels of the new code had a grand total of just one vector-register spill, the monolithic existing kernel had 156115 scalar-register spills and 2820 vector-register spills using rocm/5.7.1. Is this enough to explain the performance gulf? I wonder if current E3SM runs are seeing this slowdown.

Here are a few caveats about this pull request.

  • I don't know how it performs on NVIDIA GPUs. I am hopeful that it increases performance.
  • I don't know if it even works on CPUs. It uses standard Kokkos, but it relies on active team parallelism (instead of TeamThreadRange loops).

@oksanaguba oksanaguba requested review from bartgol and ambrad July 23, 2024 16:15
@ambrad ambrad removed their request for review July 23, 2024 16:20

@ambrad (Member) commented Jul 23, 2024

This PR looks like it will be quite a bit of effort to merge in its current state. Perhaps it's premature to post it. Perhaps it would be better to discuss the branch on a Confluence page before finalizing a PR.

Trey, when you have a chance, would you post full timer output for the two runs in which you see the 32x performance discrepancy?

@trey-ornl (Contributor Author)

This PR looks like it will be quite a bit of effort to merge in its current state. Perhaps it's premature to post it. Perhaps it would be better to discuss the branch on a Confluence page before finalizing a PR.

Trey, when you have a chance, would you post full timer output for the two runs in which you see the 32x performance discrepancy?

I created the following Confluence page for discussion. It includes the full timer output.
https://acme-climate.atlassian.net/wiki/x/JwDvCgE

@bartgol (Contributor) left a comment

I have some preliminary comments. As Andrew said, this PR is quite big, and it may be worth seeing if/how we can chunk it. But before then, we should find a way to make things work for the Serial/OpenMP/Cuda backends.

du += dvvy[j] * t0.sv(x,j);
dv += dvvx[j] * t1.sv(j,y);
}
if (rrdmd == 0) rrdmd = (1.0 / metdet) * scale_factor_inv;

@bartgol (Contributor):

This seems to always be true. Maybe you could just do it in the constructor?

@trey-ornl (Contributor Author):

Divisions are expensive on AMD GPUs. The attempt here is to do the division only once per kernel, even if the kernel calls div or vort more than once (or both), and to do it zero times if the kernel never calls div or vort (but calls only grad, for example).
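
For illustration, a minimal sketch of that lazy pattern (the struct and member names are hypothetical stand-ins, not the PR's actual SphereBlockOps; only the rrdmd guard mirrors the snippet above):

#include <Kokkos_Core.hpp>

using Real = double;

// Hypothetical stand-in: rrdmd starts at zero, the reciprocal is computed at
// most once per kernel, and never if neither div nor vort is called.
struct LazyRecipSketch {
  Real metdet;           // would be loaded once per point by the real constructor
  Real scale_factor_inv;
  Real rrdmd = 0;        // zero marks "not computed yet"

  KOKKOS_INLINE_FUNCTION Real recip() {
    if (rrdmd == 0) rrdmd = (1.0 / metdet) * scale_factor_inv;  // the only division
    return rrdmd;
  }

  KOKKOS_INLINE_FUNCTION Real div(const Real du, const Real dv) {
    return (du + dv) * recip();  // later calls reuse the cached reciprocal
  }
};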

@bartgol (Contributor):

I see. The F90 data structures do store rmetdet = 1/metdet, but the C++ ones don't. I don't recall why we chose not to. Perhaps we should do as in F90, since 1/metdet does not change across the run.

@trey-ornl (Contributor Author):

Want me to try converting to a stored rmetdet and see how performance compares?

@bartgol (Contributor):

No need for that. To avoid doing the division at every kernel call, we'd have to precompute the rmetdet view during init and store it in the ElementsGeometry structure. I don't think that's a change for this PR. I was just thinking out loud :)
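
For what it's worth, a minimal sketch of that init-time precomputation (the function name and MDRange launch are illustrative, not the PR's or ElementsGeometry's actual code):

#include <Kokkos_Core.hpp>

constexpr int NP = 4;

// Hypothetical helper: fill an rmetdet view once during init so kernels can
// multiply by the stored reciprocal instead of dividing.
void initRmetdet(const Kokkos::View<const double*[NP][NP]> &metdet,
                 const Kokkos::View<double*[NP][NP]> &rmetdet) {
  const int nelem = static_cast<int>(metdet.extent(0));
  Kokkos::parallel_for("init_rmetdet",
      Kokkos::MDRangePolicy<Kokkos::Rank<3>>({0, 0, 0}, {nelem, NP, NP}),
      KOKKOS_LAMBDA(const int e, const int x, const int y) {
        rmetdet(e, x, y) = 1.0 / metdet(e, x, y);  // one division per point, paid once
      });
}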


KOKKOS_INLINE_FUNCTION void vortInit(SphereBlockScratch &t0, SphereBlockScratch &t1, const Real v0, const Real v1) const
{
t0.sv(x,y) = g.d(e,0,0,x,y) * v0 + g.d(e,0,1,x,y) * v1;

@bartgol (Contributor):

I noticed for dinv you store a copy of the point-local 2x2 matrix in the class, while for d you don't. Would it benefit to store d locally too?

@trey-ornl (Contributor Author):

Part of the tuning process was trying different combinations of local copies. A local copy of dinv happened to show speedup, while a local copy of d did not. I hypothesize that the tiny L1 cache happens to be big enough to reuse the required d values without making an explicit local copy.

const ExecViewManaged<const Real *[NP][NP]> metdet;
const Real scale_factor_inv;

SphereGlobal(const SphereOperators &op):

@bartgol (Contributor):

Since SphereGlobal stores stuff that is already in SphereOperators, what is the point of SphereGlobal objects? One could simply pass around a SphereOperator object, no?

@trey-ornl (Contributor Author):

When a lambda captures an object, it makes a full copy of that object, regardless of what members the lambda actually uses. The SphereGlobal struct attempts to minimize the number of SphereOperators members that get copied into the kernel. (When a Kokkos::View gets copied, it copies only the array descriptor, not all the values of the array. But SphereOperators still has more of them than are actually used by each kernel.)
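
A hedged illustration of that capture cost (the types and members below are made up, not the PR's SphereOperators/SphereGlobal):

#include <Kokkos_Core.hpp>

// Made-up "big" struct: many View descriptors, of which the kernel needs one.
struct BigOpsSketch {
  Kokkos::View<double***> a, b, c, d, e, f;
};

// Slim capture struct: copies only the one descriptor the kernel uses.
struct SlimGlobalSketch {
  Kokkos::View<const double***> a;
  explicit SlimGlobalSketch(const BigOpsSketch &op) : a(op.a) {}
};

// Assumes op.a is allocated with extent(0) >= n.
void run(const BigOpsSketch &op, const int n) {
  const SlimGlobalSketch g(op);  // the device copy carries one descriptor, not six
  Kokkos::parallel_for("slim_capture", n, KOKKOS_LAMBDA(const int i) {
    // capturing `op` instead of `g` would copy every descriptor in BigOpsSketch
    (void) g.a(i, 0, 0);
  });
}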

struct SphereBlockOps;

struct SphereBlockScratch {
SphereBlockScratchView v;

@bartgol (Contributor):

I don't see the v part of this type ever used. So maybe we only need (NP,NP) scratch views?

@trey-ornl (Contributor Author):

It provides the actual space for the sv subviews.

sv(Kokkos::subview(v, b.z % SPHERE_BLOCK_LEV, Kokkos::ALL, Kokkos::ALL))

Each vertical level uses a separate (NP,NP) piece of the shared array.
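
A rough sketch of that layout, with illustrative sizes (the real SphereBlockScratchView and SPHERE_BLOCK_LEV are defined in the PR):

#include <Kokkos_Core.hpp>

constexpr int NP = 4;
constexpr int SPHERE_BLOCK_LEV = 8;  // assumed value, for illustration only

// Team-shared backing storage: SPHERE_BLOCK_LEV stacked (NP,NP) tiles.
using BlockScratchSketch =
    Kokkos::View<double[SPHERE_BLOCK_LEV][NP][NP],
                 Kokkos::DefaultExecutionSpace::scratch_memory_space,
                 Kokkos::MemoryUnmanaged>;

// Each level works through a window onto its own tile; the full view exists
// only to own the storage the subview points into.
KOKKOS_INLINE_FUNCTION auto levelTile(const BlockScratchSketch &v, const int z) {
  return Kokkos::subview(v, z % SPHERE_BLOCK_LEV, Kokkos::ALL, Kokkos::ALL);
}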

const SphereGlobal &g;
const Team &t;
Real scale_factor_inv;
Real metdet;

@bartgol (Contributor):

Why do you think you need to store these local values (like dvvx, dvvy, metdet, ...)? Does it save on indexing time? I didn't expect this to matter much, tbh.

@trey-ornl (Contributor Author):

I don't know the particular reason each local copy wins. It was all empirical. If a value was reused, I tried making a local copy. If the result was faster, I kept it.
Possible reasons are that it keeps the compiler from computing indices or loading values more than once. Integer calculation for array indices can be surprisingly expensive on AMD GPUs.
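
A tiny sketch of the kind of hoisting being described (the view and loop below are hypothetical; the real code makes such copies in the SphereBlockOps constructor):

#include <Kokkos_Core.hpp>

// Hypothetical device function: read metdet(e,x,y) into a register once and
// reuse it, instead of redoing the index arithmetic and load at every use.
KOKKOS_INLINE_FUNCTION double scaleColumn(const Kokkos::View<const double***> &metdet,
                                          const Kokkos::View<const double*> &col,
                                          const int e, const int x, const int y) {
  const double md = metdet(e, x, y);  // one index computation, one global load
  double sum = 0;
  for (int k = 0; k < static_cast<int>(col.extent(0)); ++k)
    sum += md * col(k);               // md stays in a register
  return sum;
}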

metdet = g.metdet(e,x,y);
for (int i = 0; i < 2; i++) for (int j = 0; j < 2; j++) dinv[i][j] = g.dinv(e,i,j,x,y);
for (int j = 0; j < NP; j++) {
dvvx[j] = g.dvv(x,j);

@bartgol (Contributor):

Since you subview the first index, maybe you could just make dvvx/dvvy two kokkos subviews of g.dvv, to avoid a copy?

@trey-ornl (Contributor Author):

Because SphereBlockOps objects are on the kernel stack, dvvx and dvvy only exist in vector registers. Each "copy" is a single load that is required regardless. And Kokkos subviews can be register hogs.


const Real v0 = state_v(b.e,data_n0,0,b.x,b.y,b.z) * state_dp3d(b.e,data_n0,b.x,b.y,b.z);
const Real v1 = state_v(b.e,data_n0,1,b.x,b.y,b.z) * state_dp3d(b.e,data_n0,b.x,b.y,b.z);
buffers_vdp(b.e,0,b.x,b.y,b.z) = v0;

@bartgol (Contributor):

The buffers are allocated based on the number of concurrent slots needed, but with SphereBlockOps (as well as with SphereColOps and SphereScanOps) we access them with the element index. We need to sort this out, so that we can use "team_idx" instead of the element index for the buffers.

@trey-ornl (Contributor Author):

I use buffers to store values across kernels, so I need separate values for every element. I think you are saying that the buffers are actually allocated based on the number of teams, and the number of teams being the same as the number of local elements is a coincidence that I should not rely upon?

@bartgol (Contributor):

Ah, I did not notice that the epochs are now separate kernels (rather than separate parts of the same kernel). That's a big change. It reminds me of what we did in SCREAM for some physics parametrizations, where an implementation with many smaller kernels was better than a monolithic one. But we should definitely see if this impl is also good for CUDA.

That said, yes, the buffers are allocated based on the maximum number of concurrent teams. If the number of (rank-local) elements is large enough, even on GPU we can get num_buffer_slots<num_elements, which means those buffers may be smaller than num_elems. This is pretty much always the case on CPU, btw, so this impl needs to be adjusted to be portable.
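
For concreteness, a hedged sketch of the two sizing schemes being contrasted (all names and numbers are illustrative, not the PR's buffer code):

#include <Kokkos_Core.hpp>
#include <algorithm>

// A pool sized by concurrent teams can be reused within one kernel (indexed by
// a team/slot id), while values that must survive across kernels need one slot
// per rank-local element.
void allocateBuffersSketch(const int num_elems, const int team_size, const int doubles_per_slot) {
  const int concurrency = Kokkos::DefaultExecutionSpace().concurrency();
  const int num_slots = std::max(1, concurrency / team_size);  // rough concurrent-team count

  Kokkos::View<double**> slot_pool("slot_pool", num_slots, doubles_per_slot);             // within-kernel scratch
  Kokkos::View<double**> per_elem_buffer("per_elem_buffer", num_elems, doubles_per_slot); // persists across kernels

  // On GPUs num_slots often exceeds num_elems, so indexing slot_pool by element
  // happens to work; on CPUs num_slots is usually much smaller, and only
  // per_elem_buffer is safe for cross-kernel storage.
  (void) slot_pool;
  (void) per_elem_buffer;
}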

@trey-ornl (Contributor Author):

How about this as a to-do for me: convert to using a separate set of buffers, created with the required sizes.

@bartgol (Contributor):

You mean make the new impl use different buffers from the original impl? That would work. But related to that, I am getting a bit concerned about how implementations for different backends may start to dramatically diverge. So yes, do the fix you suggest, but we need to start verifying the other backends.

using Team = TeamPolicy::member_type;

static constexpr int NPNP = NP * NP;
static constexpr int WARP_SIZE = warpSize;

@bartgol (Contributor):

I think warpSize is a HIP-specific variable?

@trey-ornl (Contributor Author):

I think CUDA has it, but it's not a compile-time constant. I need a compile-time workaround. #ifdef? Other suggestions?

@bartgol (Contributor):

It's also not defined for Serial/OpenMP, which means we do need some sort of CPP guard. I guess #ifdef would do.
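
One possible guard, sketched with assumed widths (64-wide wavefronts on MI250X-class AMD GPUs, 32-wide warps on NVIDIA, no warp concept on host backends):

// Sketch of a compile-time WARP_SIZE, since warpSize is not a compile-time
// constant under CUDA and is undefined for Serial/OpenMP.
#if defined(KOKKOS_ENABLE_HIP)
static constexpr int WARP_SIZE = 64;  // wavefront width on MI250X-class GPUs
#elif defined(KOKKOS_ENABLE_CUDA)
static constexpr int WARP_SIZE = 32;  // NVIDIA warp width
#else
static constexpr int WARP_SIZE = 1;   // host backends: each thread is its own "warp"
#endif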

const int ixy = tr / SPHERE_BLOCK_LEV;
x = ixy / NP;
y = ixy % NP;
const int dz = tr % SPHERE_BLOCK_LEV;

@bartgol (Contributor):

This snippet confuses me. I don't think team_rank() uniquely identifies a thread in a warp; rather, it identifies the warp as a whole. So I'm confused as to why this code can work correctly for all levels.

In Kokkos, team_rank() returns threadIdx.y, but a block (i.e., a team) is 2D.

@trey-ornl (Contributor Author):

For SphereBlockOps::policy, the team_rank() does uniquely identify a thread, because the Kokkos "vector" dimension is one (no third argument in the constructor):

return TeamPolicy(num_elems * SPHERE_BLOCKS_PER_COL, SPHERE_BLOCK).

static constexpr int SPHERE_BLOCK = NPNP * SPHERE_BLOCK_LEV;

Other team policies do have vector lengths greater than one, so their team_rank() values do not uniquely identify a thread. This is the case for the existing code and for SphereScanOps::policy in the new code (three arguments in the constructor):
return TeamPolicy(num_elems, NPNP, WARP_SIZE);
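
A small sketch of the two policy shapes (sizes are illustrative and assume a GPU backend that supports team sizes above one):

#include <Kokkos_Core.hpp>

using Policy = Kokkos::TeamPolicy<>;
using Team   = Policy::member_type;

void comparePolicies(const int num_teams) {
  // No vector length: team_rank() is unique per thread, running 0..63.
  Kokkos::parallel_for("flat_team", Policy(num_teams, 64),
      KOKKOS_LAMBDA(const Team &t) { (void) t.team_rank(); });

  // Vector length 32: 16*32 threads per team, but team_rank() only runs 0..15
  // and is shared by all 32 vector lanes of each "row".
  Kokkos::parallel_for("team_plus_vector", Policy(num_teams, 16, 32),
      KOKKOS_LAMBDA(const Team &t) { (void) t.team_rank(); });
}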

@rljacob (Member) commented Aug 15, 2024

Is this ready?

@trey-ornl (Contributor Author)

Is this ready?

I doubt it. It has only been tested on Frontier.

@rljacob (Member) commented Sep 5, 2024

How do you want to test this on other machines? @trey-ornl do you have perlmutter access?

@trey-ornl (Contributor Author) commented Sep 5, 2024

How do you want to test this on other machines? @trey-ornl do you have perlmutter access?

Sorry, no, I don't have Perlmutter access.

Once I have E3SM-Project/scream#2969 through on scream, I want to retest performance on Frontier.

trey-ornl added a commit to trey-ornl/scream that referenced this pull request Sep 24, 2024
trey-ornl added a commit to trey-ornl/scream that referenced this pull request Oct 4, 2024

@rljacob (Member) commented Oct 31, 2024

@trey-ornl did you get a chance to try on perlmutter?

@rljacob (Member) commented Oct 31, 2024

Notes: need to check CPU performance. Look for a less-invasive way to do it.

@trey-ornl (Contributor Author)

@trey-ornl did you get a chance to try on perlmutter?

Yes. It also sped up on pm-gpu, but it slowed down pm-cpu too much. I'm working on improvements for pm-cpu.

@rljacob (Member) commented Nov 21, 2024

@trey-ornl any updates?

@trey-ornl (Contributor Author)

@trey-ornl any updates?

Still working on speeding up pm-cpu. No recent progress because of SC24 distractions.
