
WeeklyTelcon_20200317

Geoffrey Paulsen edited this page Mar 17, 2020 · 1 revision

Open MPI Weekly Telecon

  • Dialup Info: (Do not post to public mailing list or public wiki)

Attendees (on Web-ex)

  • Geoffrey Paulsen (IBM)
  • Jeff Squyres (Cisco)
  • Austen Lauria (IBM)
  • Akshay Venkatesh (NVIDIA)
  • Brian Barrett (AWS)
  • Brendan Cunningham (Intel)
  • David Bernhold (ORNL)
  • Harumi Kuno (HPE)
  • Howard Pritchard (LANL)
  • Joseph Schuchart
  • Josh Hursey (IBM)
  • Joshua Ladd (Mellanox)
  • Michael Heinz (Intel)
  • Thomas Naughton (ORNL)
  • Noah Evans (Sandia)
  • Ralph Castain (Intel)
  • Scott Breyer (Sandia?)
  • Todd Kordenbrock (Sandia)
  • William Zhang (AWS)

not there today (I keep this for easy cut-n-paste for future notes)

  • Artem Polyakov (Mellanox)
  • Edgar Gabriel (UH)
  • Nathan Hjelm (Google)
  • Charles Shereda (LLNL)
  • George Bosilca (UTK)
  • Matthew Dosanjh (Sandia)
  • Brandon Yates (Intel)
  • Erik Zeiske
  • Mark Allen (IBM)
  • Matias Cabral (Intel)
  • Xin Zhao (Mellanox)
  • mohan (AWS)

Old Business

  • MTT -

    • If you change your MTT setup to start PRRTE at the beginning of the session and then just use prun, run times can drop by half or more.
    • This is good, but we also need to test the mpirun wrapper.
    • Cisco is converting half of its MPI installs to use prrte/prun.
  • OMPI master submodule pointers are set up to track PMIx and PRRTE master.

    • Jeff discussed an idea for PRRTE integration: putting a specific string in a PRRTE PR would automatically open an Open MPI PR to update the PRRTE submodule once that PRRTE PR is merged to PRRTE master.
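
The trigger condition in Jeff's idea can be sketched roughly as below. This is only an illustration: the marker string `bot:ompi-update-submodule` and the function name are hypothetical (the notes do not say what string would be used), and actually opening the Open MPI PR is left out.

```python
import re

# Hypothetical opt-in marker; the meeting notes do not specify the real string.
MARKER = re.compile(r"^\s*bot:ompi-update-submodule\s*$",
                    re.MULTILINE | re.IGNORECASE)


def wants_ompi_submodule_update(pr_body, merged):
    """Return True when a merged PRRTE PR carries the opt-in marker line."""
    # Only act after the PRRTE PR has actually been merged to master.
    return bool(merged) and bool(MARKER.search(pr_body or ""))
```

Requiring the marker on a line of its own keeps ordinary prose that happens to mention the string from triggering an update.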

Release Branches

Review v3.0.x Milestones v3.0.6

Review v3.1.x Milestones v3.1.6

  • Brian merged a one-sided shared memory fix last night to kick off MTT.
  • Assuming everything looks good, will do the release today.
  • Some questions about MTT running on v3.0.x and v3.1.x

Review v4.0.x Milestones v4.0.4

  • v4.0.4 in the works.
    • No Schedule yet.
    • Jeff is looking at a PMIx issue (something with dstore), working with Ralph.
  • May need a new PMIx v3.1.x release.
    • It's a bug, but may not expose itself unless you direct launch.
  • Issue 7507
  • A one-line code fix is needed for SLURM with more than 64 processes per node (ppn > 64).
    • This may drive a v4.0.4
    • Right in MPI_Init (not related to a specific component)

v5.0.0

  • Schedule:
    • Feature Freeze: End of April
    • Release: End of June
  • Austen took an initial stab at the issues and is starting a Google Sheet of v5.0 features.
  • PMIx v4.0.0
    • Totalview says it's on track.
  • PRRTE v2.0
    • Steadily making progress. Other than Comm_Spawn, just a few more little things.
  • Remove OSC pt2pt - Not straightforward.
    • SUMMARY: Significant technical investigation needed.
      • Intel will see about path forward.
    • If we remove this, Omnipath won't have an OSC component.
    • Timeframe is end of April.
    • Michael works on the Omnipath team.
    • It's not working for multithreaded use.
    • It can crash quite a bit.
    • May have a data corruption issue; haven't investigated deeply.
      • No Issue opened.
    • Nathan suggested removing this.
    • Not even a good reference implementation.
    • TCP can use OSC_RDMA via the tcp btl.
    • Need to do testing with OSC UCX
      • Mellanox UCX is as good as UCT, and more supportable.
      • Relies on UCT, so it needs to be hardened over time.
      • OFI btl works
      • But PSM/PSM2 are problematic.
    • Mixing the PSM2 MTL and OSC_OFI_BTL is a problem.
      • But not building the PSM2 MTL helps.
      • Intel still wants to support the PSM2 MTL, as CUDA support in OFI isn't as performant.

master

  • SLURM PMIx plugin has been locked on PMIx v2 for some time.
    • There are some NEW PMIx calls that SHOULD be added to bring it up to date.
      • Ralph has started a PR, but needs help.
    • So for now, there's some optional info that won't be passed correctly.
      • No OMPI_INFO for now.
      • Ralph gets pinged occasionally.
    • Not sure of the priority of this.
  • MTT on master is looking pretty good.

Face to face

  • Deferred.

Infrastructure

  • Scale testing: PRs have to opt into it.

Review Master Pull Requests

CI status


Dependencies

PMIx Update

  • CI testing only checks that the build succeeded and that it ran, but doesn't test HOW it ran.
    • Environment setup can be a bit different.
    • For example, no permissions in /tmp: a test might pass on one machine and fail on another that lacks /tmp permissions.
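
One way a test harness could surface this kind of environment difference up front is a small probe like the sketch below. It is not part of PMIx or of any existing CI setup; the function name `tmp_is_usable` is made up for illustration, and it checks only that a file can be created and written in the temp directory.

```python
import tempfile


def tmp_is_usable(tmp_dir=None):
    """Return True if a file can be created and written in tmp_dir.

    Defaults to the system temp directory (usually /tmp on Linux).
    """
    tmp_dir = tmp_dir or tempfile.gettempdir()
    try:
        # Creating a real file is more reliable than inspecting mode bits,
        # since ACLs or read-only mounts can also block writes.
        with tempfile.NamedTemporaryFile(dir=tmp_dir) as probe:
            probe.write(b"probe")
            probe.flush()
        return True
    except OSError:
        return False
```

Logging the result at the start of a CI run would make it easier to tell an environment failure from a real regression.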

ORTE/PRRTE

MTT


Back to 2019 WeeklyTelcon-2019
