Optimised Laplace solver using Shared-memory MPI? #3164

@johnomotani

Edit: The original proposal in this top comment seems impractical (see discussion below), but shared-memory MPI might be useful for a Laplace solver optimisation #3164 (comment).

For another project, I've been working a lot with shared-memory MPI. I think it might apply fairly straightforwardly to BOUT++ and could provide a useful speed-up, with relatively localised code changes (only at a low level in BOUT++, without changing any PhysicsModel code).

Proposal:

  • Pick some groups of processes that would share memory - either a node or a subset of a node (e.g. a NUMA region, although my experience so far is that restricting shared-memory arrays to NUMA regions is surprisingly unimportant, so it might be worth defaulting to full nodes?).
  • Allocate the memory backing Field3D and Field2D using MPI_Win_allocate_shared (https://docs.open-mpi.org/en/main/man-openmpi/man3/MPI_Win_allocate_shared.3.html). MPI will want to assign each process's local 'view' as a non-overlapping slice of the full shared-memory array, but BOUT++ probably wants to modify this view to also include the guard cells that are 'owned' by neighbouring processes - this should be possible but might require calling MPI_Win_shared_query (https://docs.open-mpi.org/en/v5.0.4/man-openmpi/man3/MPI_Win_shared_query.3.html) after MPI_Win_allocate_shared. The pattern I used in my other project was to allocate with all indices 'owned' by the root process of the shared-memory communicator, then use MPI_Win_shared_query to select the actual index ranges wanted by each process (see the sketch after this list).
  • In Mesh::communicate(), guard cells that lie in shared memory would not need any data to be copied, but an MPI_Barrier is probably needed to synchronise the MPI ranks on the shared-memory communicator to avoid race conditions (also shown in the sketch below).
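
A minimal standalone sketch of this allocation pattern (plain MPI, not BOUT++ code; the node-local communicator, array size and slice assignment are illustrative assumptions): all of the data is placed in the root rank's segment of a shared window, every rank obtains a pointer to it with MPI_Win_shared_query, and the node-local 'communication' step reduces to a barrier plus window synchronisation.

```cpp
#include <mpi.h>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);

  // Split MPI_COMM_WORLD into node-local communicators whose ranks can share memory
  MPI_Comm node_comm;
  MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL,
                      &node_comm);
  int node_rank, node_size;
  MPI_Comm_rank(node_comm, &node_rank);
  MPI_Comm_size(node_comm, &node_size);

  // Total number of doubles backing the node-local field data (illustrative)
  const MPI_Aint ntotal = 1 << 20;

  // Allocate everything in the root rank's segment; other ranks request size 0
  MPI_Aint my_size = (node_rank == 0) ? ntotal * sizeof(double) : 0;
  double* my_base = nullptr;
  MPI_Win win;
  MPI_Win_allocate_shared(my_size, sizeof(double), MPI_INFO_NULL, node_comm,
                          &my_base, &win);

  // Every rank queries the root's segment to get a pointer to the whole array,
  // from which it can carve out its own slice plus the guard cells it reads
  MPI_Aint root_size;
  int root_disp;
  double* shared = nullptr;
  MPI_Win_shared_query(win, 0, &root_size, &root_disp, &shared);

  // Open a passive-target epoch so direct load/store plus MPI_Win_sync is valid
  MPI_Win_lock_all(MPI_MODE_NOCHECK, win);

  // Each rank fills its own non-overlapping slice ...
  const MPI_Aint chunk = ntotal / node_size;
  for (MPI_Aint i = node_rank * chunk; i < (node_rank + 1) * chunk; ++i) {
    shared[i] = static_cast<double>(node_rank);
  }

  // ... and the node-local part of Mesh::communicate() would reduce to this:
  // no copies, just make every rank's writes visible before anyone reads them
  MPI_Win_sync(win);
  MPI_Barrier(node_comm);
  MPI_Win_sync(win);

  MPI_Win_unlock_all(win);
  MPI_Win_free(&win);
  MPI_Finalize();
  return 0;
}
```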

I think the MPI shared memory would be a fairly straightforward addition. At present processes should never modify guard cells (not counting boundary cells, which would not be shared between different processes anyway), so as long as Mesh::communicate() synchronises the shared-memory communicators, there should be no danger of race conditions.

This kind of shared memory should (I think) be composable with OpenMP threads, so it should be just an extra optimisation and not restrict any other features.

The main initial benefit would be to remove the need for copies to fill guard cells within an HPC node. This may not be the rate-limiting step - the cost might well be hidden by the time spent in inter-node MPI communications (although, for example, 2D transport simulations that fit on a single node might see a bigger benefit). It would also open up other optimisations. For example, in FFT-based Laplace solvers, if the x-direction is local to a single node you could parallelise the tridiagonal x-direction solves over z instead of x without needing to do any communication (sketched below). It would also be possible to generalise so that the x-domain can extend over multiple nodes, using the same algorithm as the current distributed-memory MPI implementation; I'm not sure how much more complicated it would be to include that on a first pass.
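
As a rough sketch of that inner solve (hypothetical layout and interface, not the existing BOUT++ Laplacian code): assume the tridiagonal coefficients and the RHS for the full x-extent of each z (Fourier) mode sit in node-local shared memory with layout index = iz * nx + ix, and that the arrays are real-valued for simplicity (the actual FFT-based solvers work on complex Fourier coefficients). Each node-local rank then takes a disjoint block of z modes and runs a plain serial Thomas algorithm over the whole x range, with no halo communication in x.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Serial Thomas algorithm over the full x range for one z mode.
// a, b, c are the sub-, main- and super-diagonals; rhs is overwritten with the solution.
void thomas_solve_x(const double* a, const double* b, const double* c,
                    double* rhs, std::size_t nx) {
  std::vector<double> cp(nx), dp(nx);
  cp[0] = c[0] / b[0];
  dp[0] = rhs[0] / b[0];
  for (std::size_t i = 1; i < nx; ++i) {
    const double m = b[i] - a[i] * cp[i - 1];
    cp[i] = c[i] / m;
    dp[i] = (rhs[i] - a[i] * dp[i - 1]) / m;
  }
  rhs[nx - 1] = dp[nx - 1];
  for (std::size_t i = nx - 1; i-- > 0;) {
    rhs[i] = dp[i] - cp[i] * rhs[i + 1];
  }
}

// Parallelise over z instead of x: each node-local rank solves its own
// contiguous block of z modes directly from the shared-memory arrays
// (the layout index = iz * nx + ix is an assumption for this sketch).
void solve_my_z_modes(const double* a, const double* b, const double* c,
                      double* rhs, std::size_t nx, std::size_t nz,
                      int node_rank, int node_size) {
  const std::size_t chunk = (nz + node_size - 1) / node_size;
  const std::size_t z0 = node_rank * chunk;
  const std::size_t z1 = std::min(nz, z0 + chunk);
  for (std::size_t iz = z0; iz < z1; ++iz) {
    thomas_solve_x(a + iz * nx, b + iz * nx, c + iz * nx, rhs + iz * nx, nx);
  }
}
```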

There is a risk of creating race conditions, which can be very annoying to debug (e.g. intermittently incorrect results or occasional segfaults), but I think the risk is low (as mentioned above).

I won't have any chance to work on this, but would guess that it's maybe an O(1) month project for an RSE (plus extra if you wanted to add shared-memory optimisation for Laplace solvers, etc.).
