Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parallel cyclic reduction #2330

Merged
merged 45 commits into from
May 25, 2021
Merged

Parallel cyclic reduction #2330

merged 45 commits into from
May 25, 2021

Conversation

JosephThomasParker
Copy link
Contributor

As mentioned in #2296, cyclic reduction wasn't actually performing cyclic reduction. Also, as noted in #2314, its scaling with processor count on Archer2 is much worse than we've seen on other machines.

This PR adds a parallel cyclic reduction solver, incorporating code from this repo, published under MIT. It can be selected using type=pcr.

This solver has a much better scaling performance than "old cyclic" while still being a parameter-free direct solver. The graph below shows run times on Archer2 (with 128 cores per node) for blob2d with (nx,nz)=(8192,1024) and rk4 time advance. Old cyclic (in magenta) doesn't scale nearly so well as parallel cyclic reduction (in red). The multigrid-Thomas solver (blue) is available in BOUT++ as type=ipt (impls/iterative_parallel_tri), while multigrid (green) is not the multigrid in next - it hasn't been pushed yet. However, PCR is a good drop-in replacement for old cyclic, without the need to fiddle with parameters (and possible non-convergence) that comes with multigrid.

run_time_8196

@JosephThomasParker JosephThomasParker added work in progress Not ready for merging feature backport candidiate Does not break backward compatibility, so can be back-ported to v4.4 labels May 19, 2021
@ZedThree
Copy link
Member

/clang-format

@ZedThree
Copy link
Member

Should we name this one "cyclic" and rename the direct partitioned one to something else?

We might also want to put the citation for this solver into CITATION.cff

@bendudson
Copy link
Contributor

I think on balance renaming the old one (to partitioned?) and calling the new one is good. This will change the algorithm used for the same input file, but the result should be the same so shouldn't affect reproducibility.

@JosephThomasParker
Copy link
Contributor Author

JosephThomasParker commented May 19, 2021

There are some limitations:

  • now can use NYPE>1
  • now can run in serial
  • requires nx=2^n** and NXPE=2^m with n >= 2*m - this is from the library README, so it's unlikely to find a simple fix for this.

Other possible enhancements

  • The wrapper around the actual tridiagonal solve is identical to that in cyclic_laplace. This should probably be shared between them somehow.
  • The repo (and code here) can also support two other solvers: (non-parallel) cyclic reduction and hybrid Thomas-PCR, both of which could probably implemented quite easily, given code here.

Copy link
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clang-tidy made some suggestions

There were too many comments to post at once. Showing the first 25 out of 89. Check the log or trigger a new build to see more.

src/invert/laplace/impls/cyclic/cyclic_laplace.hxx Outdated Show resolved Hide resolved
src/invert/laplace/impls/pcr/pcr.cxx Show resolved Hide resolved
src/invert/laplace/impls/pcr/pcr.cxx Outdated Show resolved Hide resolved
src/invert/laplace/impls/pcr/pcr.cxx Show resolved Hide resolved
src/invert/laplace/impls/pcr/pcr.cxx Show resolved Hide resolved
src/invert/laplace/impls/pcr/pcr.cxx Show resolved Hide resolved
src/invert/laplace/impls/pcr/pcr.cxx Show resolved Hide resolved
src/invert/laplace/impls/pcr/pcr.cxx Show resolved Hide resolved
src/invert/laplace/impls/pcr/pcr.cxx Show resolved Hide resolved
src/invert/laplace/impls/pcr/pcr.cxx Outdated Show resolved Hide resolved
Copy link
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clang-tidy made some suggestions

There were too many comments to post at once. Showing the first 25 out of 66. Check the log or trigger a new build to see more.

outbndry = 1;

if (dst) {
BOUT_OMP(parallel) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

warning: if with identical then and else branches [bugprone-branch-clone]

if (dst) {
  ^
src/invert/laplace/impls/pcr/pcr.cxx:225:5: note: else branch starts here
  } else {
    ^

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is wrong, I think it's just confused by the openmp loop turning into a function call.

src/invert/laplace/impls/pcr/pcr.cxx Show resolved Hide resolved
src/invert/laplace/impls/pcr/pcr.cxx Show resolved Hide resolved
src/invert/laplace/impls/pcr/pcr.cxx Show resolved Hide resolved
src/invert/laplace/impls/pcr/pcr.cxx Show resolved Hide resolved
src/invert/laplace/impls/pcr/pcr.cxx Show resolved Hide resolved
src/invert/laplace/impls/pcr/pcr.cxx Outdated Show resolved Hide resolved
src/invert/laplace/impls/pcr/pcr.cxx Outdated Show resolved Hide resolved
src/invert/laplace/impls/pcr/pcr.cxx Outdated Show resolved Hide resolved
src/invert/laplace/impls/pcr/pcr.cxx Outdated Show resolved Hide resolved
Copy link
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clang-tidy made some suggestions

There were too many comments to post at once. Showing the first 25 out of 41. Check the log or trigger a new build to see more.

src/invert/laplace/impls/pcr/pcr.cxx Outdated Show resolved Hide resolved
src/invert/laplace/impls/pcr/pcr.cxx Show resolved Hide resolved
src/invert/laplace/impls/pcr/pcr.cxx Show resolved Hide resolved
src/invert/laplace/impls/pcr/pcr.cxx Show resolved Hide resolved
src/invert/laplace/impls/pcr/pcr.cxx Show resolved Hide resolved
src/invert/laplace/impls/pcr/pcr.cxx Show resolved Hide resolved
src/invert/laplace/impls/pcr/pcr.cxx Show resolved Hide resolved
src/invert/laplace/impls/pcr/pcr.cxx Outdated Show resolved Hide resolved
src/invert/laplace/impls/pcr/pcr.cxx Outdated Show resolved Hide resolved
src/invert/laplace/impls/pcr/pcr.cxx Outdated Show resolved Hide resolved
Copy link
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clang-tidy made some suggestions

src/invert/laplace/impls/pcr/pcr.cxx Outdated Show resolved Hide resolved
src/invert/laplace/impls/pcr/pcr.cxx Outdated Show resolved Hide resolved
src/invert/laplace/impls/pcr/pcr.cxx Outdated Show resolved Hide resolved
src/invert/laplace/impls/pcr/pcr.cxx Outdated Show resolved Hide resolved
src/invert/laplace/impls/pcr/pcr.cxx Outdated Show resolved Hide resolved
src/invert/laplace/impls/pcr/pcr.hxx Show resolved Hide resolved
src/invert/laplace/impls/pcr/pcr.hxx Show resolved Hide resolved
src/invert/laplace/impls/pcr/pcr.hxx Show resolved Hide resolved
src/invert/laplace/impls/pcr/pcr.hxx Show resolved Hide resolved
src/invert/laplace/impls/pcr/pcr.hxx Show resolved Hide resolved
JosephThomasParker and others added 6 commits May 20, 2021 12:07
xproc = localmesh->getXProcIndex(); // Local rank in x proc space
const int yproc = localmesh->getYProcIndex();
nprocs = localmesh->getNXPE(); // Number of processors in x
myrank = yproc * nprocs + xproc; // Global rank for communication
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could just be BoutComm::rank()?

Copy link
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clang-tidy made some suggestions

src/invert/laplace/impls/pcr/pcr.cxx Show resolved Hide resolved
src/invert/laplace/impls/pcr/pcr.cxx Show resolved Hide resolved
@ZedThree
Copy link
Member

What do we want to do about the names? I'm happy for this to go in now, and we can sort out the names in another PR

@JosephThomasParker
Copy link
Contributor Author

PCR is a parameter-free direct solver, so it is a like-for-like replacement for cyclic that scales better. The only concern I have is the restriction that we require powers of two for both processors and points in x. If we change the default name, then there's a good chance that previously-working submission scripts will error (particularly on Marconi, 48 cores per node). We could change the name, and also make the old cyclic the fallback solver when the procs/points conditions aren't met, but then there would be an unexplained performance drop-off.

The way around this to make the solver node-aware: solving within a node using (something like) the Thomas algorithm or the old cyclic (that doesn't care about point or core counts), and between nodes with PCR. This would only require using 2^n nodes, which is much less restrictive. The "within a node" bit will require work beyond this PR though. The Kang library is set up to do Thomas + PCR but it only has MPI so I doubt it would be node-aware out-of-the-box. Doing old cyclic on-node and PCR off-node might be nice though (and wouldn't require OpenMP (or SYCL 😛 ))

@JosephThomasParker
Copy link
Contributor Author

After a bit of code-digging:

  • Thomas+PCR would get rid of the nx=2^n restriction without needing to be node-aware, but we'd still need NXPE=2^m. This could be implemented reasonably easily from the Kang code.
  • old cyclic + PCR would get rid of all restrictions except nnodes=2^m. This would require thinking about the algorithm though - the PCR bit requires coefficients as inputs (in the form that they happen to come out from the Thomas algorithm). On the other hand, solvers tend to give "the solution" as their output. It takes some work to translate between "solution to a node's subsystem" and "what this means in terms of coefficients".

@JosephThomasParker JosephThomasParker removed the work in progress Not ready for merging label May 24, 2021
@ZedThree ZedThree merged commit 8ad1d8d into next May 25, 2021
@ZedThree ZedThree deleted the pcr_cyclic_copy+asan branch May 25, 2021 16:12
Comment on lines +704 to +706
BoutReal zlen = coords->dz * (localmesh->LocalNz - 3);
BoutReal kwave =
kz * 2.0 * PI / (2. * zlen); // wave number is 1/[rad]; DST has extra 2.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It sounds like "DST has extra 2" means that there are two extra cells, but why is then nz - 3 used?

@ZedThree ZedThree mentioned this pull request Aug 4, 2021
28 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport candidiate Does not break backward compatibility, so can be back-ported to v4.4 feature
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants