WeeklyTelcon_20200804
Geoffrey Paulsen edited this page Jan 19, 2021 · 2 revisions
- Dialup Info: (Do not post to public mailing list or public wiki)
Did not capture attendance accurately -- this may not be fully correct. I put a "yes" next to the people I know were there today.
- NOT-YET-UPDATED
Blockers All Open Blockers
Review v4.0.x Milestones v4.0.5
- Still waiting on blocker (also v4.1): cache line stuff
- Why is this a correctness issue (not just a performance optimization)?
- We align the data in the shared memory stuff to be on cache line sizes
- Local rank 0 lays out the ring with entries starting every 128 bytes.
- Other processes then find out the real cache line size of 64.
- Then other processes attach to shared memory, and use the cache line size/alignment of 64.
- First message will get sent, but then the 2nd message will never be received (and/or it's reading corrupt data because it's reading at offset 64 instead of 128).
- How is this not happening anywhere else?
- Previously, cache line size was set up very, very late (after all the shmem stuff was set up -- even the non-local-process-rank-0). I.e., we got lucky.
- I.e., we brought the hwloc initialization forward at some point and broke this.
- This only happens in the smcuda BTL (and possibly only in single-node runs, because other BTLs/PMLs may have been selected).
- The plain sm and vader BTLs do this differently.
- Meaning: this is a very specific corner case.
- Solutions?
- Trivial fix: just have everyone use a fixed value (e.g., 128 or 64).
- Pretty simple: modex-send the size to be used from local rank 0 to the others. The others modex recv the value and use it.
- A little more complicated: also add code to smcuda to read the Linux /proc / /sys / whatever to get the cache line size.
- There's a PR for master that does the fix -- but in a way that will kill scalability.
- Once Brian's configury fixes are in, this is easy to fix on master.
- Or it could be done the "A little more complicated" way, above. Neither is difficult.
- For 4.0 and 4.1: George will make a one-liner patch to make everyone use a fixed value.
- This clears the blocker.
- https://github.com/open-mpi/ompi/issues/7968: added a note to the README for v4.0: there's a known issue when using UCX with very, very old IB hardware (pre-ConnectX) -- it'll segv. According to Mellanox, UCX 1.10 will fix this issue.
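
To make the cache line blocker above concrete, here is a minimal sketch of why a stride mismatch is a correctness problem rather than a performance one. It is a simplified model, not the smcuda code: the function names and the idea of a flat ring indexed by stride are assumptions made for illustration. The 128 and 64 values are the ones from the discussion.

```python
# Simplified model of the stride mismatch (NOT the actual smcuda code).
# All names here are hypothetical and exist only for illustration.

WRITER_STRIDE = 128  # alignment local rank 0 used when laying out the ring
READER_STRIDE = 64   # cache line size the other processes detected later


def slot_offset(index: int, stride: int) -> int:
    """Byte offset of ring entry `index` when entries start every `stride` bytes."""
    return index * stride


def compare_layouts(n_slots: int, writer_stride: int, reader_stride: int):
    """Return (slot, writer_offset, reader_offset) for the first n_slots entries."""
    return [
        (i, slot_offset(i, writer_stride), slot_offset(i, reader_stride))
        for i in range(n_slots)
    ]


def first_mismatch(n_slots: int, writer_stride: int, reader_stride: int):
    """Index of the first entry where the two sides disagree, or None if they agree."""
    for i, w_off, r_off in compare_layouts(n_slots, writer_stride, reader_stride):
        if w_off != r_off:
            return i
    return None


if __name__ == "__main__":
    for i, w_off, r_off in compare_layouts(3, WRITER_STRIDE, READER_STRIDE):
        status = "ok" if w_off == r_off else "MISMATCH"
        print(f"entry {i}: writer at {w_off}, reader at {r_off} -> {status}")
```

In this model entry 0 lands at offset 0 for both sides, which matches the observation that the first message goes through, while the second entry is written at 128 and read at 64. Passing the same stride to both sides (the fixed-value fix planned for 4.0 and 4.1) removes the mismatch.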
Review v4.1.x Milestones v4.1.0
- Same cache line blocker as v4.0.
- https://github.com/open-mpi/ompi/issues/7982: OFI BTL and FI_DELIVERY_COMPLETE. This only matters for MPI one-sided.
- EFA and other providers are misbehaving
- https://github.com/open-mpi/ompi/pull/7973: PR for fix: Disable EFA provider
- ...but then later discovered that other providers also misbehave in the same way.
- AWS proposal: extend #7973 to exclude other providers that misbehave.
- Meaning: if you're using libfabric over verbs, the OFI BTL won't be used.
- In v4.0.x, there is no OFI BTL. So this is not an issue.
- In v4.1 this is a minor inconvenience because we still have osc/pt2pt. I.e., OMPI will automatically fall back to osc/pt2pt.
- This is unfortunately a big problem for master/v5.0. Need to figure this out -- i.e., coordinate with libfabric community.
- NOTE: This is a different code path than the MPI-one-sided problem Cisco MTT discovered when we removed osc/rdma (and all MPI_WIN_CREATE operations failed).
- Looks like Cisco MTT is still failing one-sided tests -- need to follow up with Nathan.
- Howard asks: how can I see this problem?
- Anything with MPI_PUT. E.g., IBM one-sided tests.
- ADAPT / HAN.
- Need to test and produce some documentation for ADAPT and HAN.
Review v5.0.0 Milestones v5.0.0
- No update this week other than master discussion.
- osc/pt2pt removal on master
- George: There are many machines where osc/pt2pt is the only mechanism, and it was the most performant.
- Brian: osc/pt2pt wasn't removed because it wasn't needed; it was removed because it's very buggy (including no good path to becoming multi-thread safe) and "unrecoverably broken" (Brian's words! And he wrote it!), and no one will take ownership of fixing it.
- ...so if someone wants to take ownership of fixing it, they can!
- Ralph points out:
- AWS MTT builds for SLURM, need to fix up the compiles for external hwloc/libevent. Brian+William will talk internally.
- Java: builds failing from Aurelien PR. He'll have a look.
- It's after July, so Jeff will go de-activate people.
- Brian will go do it today.
- Agenda items for next week.
- Talk through MPI-4 features. Howard will make a list of big-ticket MPI-4 features (from MPI-4 changelog).
- Sessions
- Default error handler
- ...etc.
- Walk through PRRTE issues.
- Figure out: which are blockers for v5.0? (etc.)
- With these two, we're good enough for Monday's meeting.
- Please add any other items to the wiki.
- We'll evaluate if we still need Tuesday's meeting.