WeeklyTelcon_20191105
- Dialup Info: (Do not post to public mailing list or public wiki)
- Geoffrey Paulsen (IBM)
- Jeff Squyres (Cisco)
- Austen Lauria (IBM)
- Harumi Kuno (HPE)
- Howard Pritchard (LANL)
- Artem Polyakov (Mellanox)
- David Bernhold (ORNL)
- Akshay Venkatesh (NVIDIA)
- William Zhang (AWS)
- Josh Hursey (IBM)
- Matthew Dosanjh (Sandia)
- Michael Heinz (Intel)
- Noah Evans (Sandia)
- Thomas Naughton (ORNL)
- Brian Barrett (AWS)
- Edgar Gabriel (UH)
- George Bosilca (UTK)
- Todd Kordenbrock (Sandia)
- Brendan Cunningham (Intel)
- Brandon Yates (Intel)
- Charles Shereda (LLNL)
- Erik Zeiske
- Joshua Ladd (Mellanox)
- Mark Allen (IBM)
- Matias Cabral (Intel)
- Nathan Hjelm (Google)
- Ralph Castain (Intel)
- Tom Naughton
- Xin Zhao (Mellanox)
- mohan (AWS)
- All of this is in the context of v5.0.
- Intel is no longer driving PRRTE work, and Ralph won't be available for PRRTE much either.
- PRRTE will be a good PMIx development environment, but it is no longer focused on being a scalable and robust launcher.
- The OMPI community could come into PRRTE and put in production/scalability testing, features, etc.
- Given that we have not been good at contributing to PRRTE (other than Ralph), there's another proposal:
  - There's been drift between ORTE and PRRTE, so transitioning is risky.
- Step 1: Make PMIx a first class citizen.
  - Still good to keep PMIx as a static framework (no more glue, but still under `orte/mca/pmix`); it basically just passes through and makes `PMIx_` calls directly (see the sketch below).
    - Allows us to still have an internal backup PMIx if no external PMIx is found.
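A minimal sketch of the pass-through idea above, assuming a PMIx client has already been initialized elsewhere; the helper name and return convention are illustrative, not actual OMPI symbols:

```c
/* Hypothetical pass-through helper: instead of translating through an
 * ORTE-level glue layer, OMPI code would call the PMIx client API directly.
 * Assumes PMIx_Init() has already been called elsewhere. */
#include <string.h>
#include <pmix.h>

/* Look up a string value that a peer published during the modex. */
static int example_modex_recv_string(const char *key,
                                     const char *peer_nspace,
                                     pmix_rank_t peer_rank,
                                     char **value_out)
{
    pmix_proc_t peer;
    pmix_value_t *val = NULL;

    PMIX_PROC_LOAD(&peer, peer_nspace, peer_rank);

    /* Direct PMIx call: no orte/mca/pmix translation in between. */
    if (PMIX_SUCCESS != PMIx_Get(&peer, key, NULL, 0, &val) || NULL == val) {
        return -1;
    }

    *value_out = (PMIX_STRING == val->type && NULL != val->data.string)
                     ? strdup(val->data.string) : NULL;
    PMIX_VALUE_RELEASE(val);
    return (NULL != *value_out) ? 0 : -1;
}
```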
- Step 2: We can whittle down ORTE, since PMIx does much of this.
- Two things PRRTE won't care about are scale and all binding patterns.
- Only recent versions of SLURM have PMIx support.
- Need to continue to support ssh.
  - Not just core PMIx; we still need daemons for ssh to work, but they're not part of PMIx.
  - This is a part of ORTE that we wouldn't be deleting.
- What do Altair PBS Pro and open source PBS Pro do?
  - Torque is different than PBS Pro.
- Are there OLD systems that we currently support but no longer care about, where we could discontinue support in v5.x?
  - Who supports PMIx, and who doesn't?
- If PMIx becomes a first class citizen and the rest of the code base just makes PMIx calls, how do we support these things?
  - mpirun would still have to launch ORTEDs via the PLM.
  - srun wouldn't need to.
  - But this is how it works today. Torque doesn't support PMIx at all, but TM just launches ORTEDs.
  - ALPS - `aprun ./a.out` - requires a.out to connect up to the ALPS daemons.
  - Cray still supports PMI - someone would need to write a PMI -> PMIx adapter (a sketch follows this list).
  - ORTE does not have the concept of persistent daemons.
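A rough sketch of what such a PMI -> PMIx adapter could look like, implementing a few PMI-1 entry points over the PMIx client API; the `ADAPTER_*` return codes are stand-ins, and real PMI has many more entry points (KVS put/get, spawn, etc.):

```c
/* Sketch of a PMI-1 -> PMIx adapter: provide PMI-1 symbols implemented on
 * top of the PMIx client library, so a PMI-only launcher environment could
 * still serve a runtime that only speaks PMIx. Illustrative only. */
#include <stdbool.h>
#include <pmix.h>

#define ADAPTER_PMI_SUCCESS 0
#define ADAPTER_PMI_FAIL    1

static pmix_proc_t adapter_myproc;   /* filled in by PMIx_Init() */

int PMI_Init(int *spawned)
{
    if (PMIX_SUCCESS != PMIx_Init(&adapter_myproc, NULL, 0)) {
        return ADAPTER_PMI_FAIL;
    }
    if (NULL != spawned) {
        *spawned = 0;   /* spawn support omitted in this sketch */
    }
    return ADAPTER_PMI_SUCCESS;
}

int PMI_Get_rank(int *rank)
{
    *rank = (int)adapter_myproc.rank;
    return ADAPTER_PMI_SUCCESS;
}

int PMI_Barrier(void)
{
    /* A PMI barrier maps naturally onto a PMIx fence across the namespace;
     * PMIX_COLLECT_DATA makes the fence also exchange posted modex data. */
    pmix_info_t info;
    bool collect = true;
    PMIX_INFO_LOAD(&info, PMIX_COLLECT_DATA, &collect, PMIX_BOOL);
    pmix_status_t rc = PMIx_Fence(NULL, 0, &info, 1);
    PMIX_INFO_DESTRUCT(&info);
    return (PMIX_SUCCESS == rc) ? ADAPTER_PMI_SUCCESS : ADAPTER_PMI_FAIL;
}

int PMI_Finalize(void)
{
    return (PMIX_SUCCESS == PMIx_Finalize(NULL, 0))
               ? ADAPTER_PMI_SUCCESS : ADAPTER_PMI_FAIL;
}
```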
- Is there a situation where we might have a launcher launching ORTEDs and we'd need to relay PMIx calls to the correct PMIx server layer?
  - Generally we won't have that situation, since the launcher won't launch ORTEDs.
- George's work currently depends on PRRTE.
  - If ORTEDs provide PMIx events, would that be enough (see the event-handler sketch below)?
    - No, George needs PRRTE's fault-tolerant overlay network.
    - George will scope the effort to port that feature from PRRTE to ORTE.
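For context on what PMIx events provide, a minimal (illustrative) sketch of a client registering a default PMIx event handler; a fault-tolerance layer would act on the event instead of just acknowledging it:

```c
/* Minimal PMIx event-handler registration sketch (illustrative only). */
#include <stdio.h>
#include <pmix.h>

static void example_evhandler(size_t evhdlr_registration_id,
                              pmix_status_t status,
                              const pmix_proc_t *source,
                              pmix_info_t info[], size_t ninfo,
                              pmix_info_t results[], size_t nresults,
                              pmix_event_notification_cbfunc_fn_t cbfunc,
                              void *cbdata)
{
    (void)evhdlr_registration_id; (void)info; (void)ninfo;
    (void)results; (void)nresults;

    if (NULL != source) {
        printf("event %d from %s:%u\n", status, source->nspace, source->rank);
    }
    /* Tell the PMIx library we are done with this notification. */
    if (NULL != cbfunc) {
        cbfunc(PMIX_EVENT_ACTION_COMPLETE, NULL, 0, NULL, NULL, cbdata);
    }
}

static void example_reg_callback(pmix_status_t status,
                                 size_t evhandler_ref,
                                 void *cbdata)
{
    /* Registration completed; evhandler_ref could be kept for deregistration. */
    (void)evhandler_ref; (void)cbdata;
    if (PMIX_SUCCESS != status) {
        fprintf(stderr, "event handler registration failed\n");
    }
}

int main(int argc, char **argv)
{
    pmix_proc_t myproc;
    if (PMIX_SUCCESS != PMIx_Init(&myproc, NULL, 0)) {
        return 1;
    }
    /* NULL/0 codes == register as a default handler for all events. */
    PMIx_Register_event_handler(NULL, 0, NULL, 0,
                                example_evhandler, example_reg_callback, NULL);
    /* ... application runs; events are delivered asynchronously ... */
    PMIx_Finalize(NULL, 0);
    return 0;
}
```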
- ACTION: Please gather a list of resource managers and tools that we care about supporting in Open MPI v5.0.x.
- Today - Howard
  - Summary: make PMIx a first class citizen.
  - Then whittle away ORTE as much as possible.
  - We think the only one who uses PMI1 and PMI2 might be Cray.
    - Howard doesn't think Cray is even going to go that direction; they might be adopting PMIx as a future direction. Good Supercomputing question.
    - Most places will do whatever SLURM does.
    - What will MPICH do? Suspect PMIx.
  - Howard thinks that by the time Open MPI v5 gets out
  - Is SLURM + PMIx dead? No, it's supported, just not all of the
- George looked into scoping the amount of work to bring the reliable overlay network from PRRTE to ORTE.
  - PRRTE frameworks not in ORTE.
- Howard also brought up that Sessions only works with PRRTE right now, so we would need to backport this as well.
- The only things that depend on PRRTE are Sessions, reliable connections, and resource allocation support (the thing Geoffroy Vallee was working on before). Howard will investigate.
- Sounds like no one needs PMI1 or PMI2 in Open MPI v5+.
- DVM - persistent daemon aspect (even outside of MPI).
  - People pulling the runtime out and not using MPI.
  - A lot of fixes for this went into PRRTE but didn't make it into ORTE - Thomas.
  - Something we should/could bring back to ORTE.
  - Some people are finding benefits inside of a resource manager.
  - Portability: an abstraction so they have a runtime layer they can carry around.
  - One other thing that standalone PRRTE gives you.
- Thomas feels PRRTE has more value long term.
  - MPICH used to use OMPI's ORTE to launch at some point.
  - A standalone project is useful for other projects.
- Two major items favor PRRTE over ORTE:
  - Reliable overlay network was done on PRRTE, not ORTE.
    - Nothing pushed upstream to PMIx/PRRTE yet.
  - Sessions - prototype done with PRRTE.
    - PMIx as a first class citizen (nice, but not sufficient for Sessions).
    - Ralph and Howard convinced themselves that what Sessions needs is in the PMIx layer, not the ORTE layer.
    - Still some items only in PRRTE are needed, but not huge.
- This has been a bit of a roller coaster.
- Either way, we've been largely relying on Ralph, so we'll need to step up.
- We need to have a way to make the decision and go forward.
- Howard could look into:
  - How bit-rotted IS DVM support?
  - Are there feature enhancements in ORTE that are not in PRRTE?
- PRRTE - we need a diagram that high
- Sometimes if you launch with a persistent daemon, IT does the binding, but your later prun might be in conflict.
- Need a feature list.
- Don't lose sight of the usefulness of an external runtime environment.
- Need unit tests.
- Who can do this feature comparison?
- No reasons to NOT
- Wanted comments from community.
- George - hit this a few weeks ago on AWS.
- William Zhang has not yet committed some graph code for reachability similar to usnic.
- Brian/William will get with Josh Hursey to potentially test some more.
- Please register on the Wiki page, since Jeff has to register you.
- Think
- Date looks good: Feb 17th, right before the MPI Forum.
- 2pm Monday, and maybe most of Tuesday.
- Cisco has a Portland facility and is happy to host.
  - But willing to step aside if others want to host.
  - About a 20-30 minute drive from the MPI Forum; will probably need a car.
- It's official! Portland, Oregon, Feb 17, 2020.
- Safe to begin booking travel now.
- Can we just turn on locbot / probot until we can get the AWS bot online?
- OMPI has been waiting for some git submodule work in Jenkins on AWS.
  - Need someone to figure out why Jenkins doesn't like Jeff's PR.
    - Anyone with a GitHub account on the ompi team should have access.
    - PR 6821
    - Apparently Jenkins isn't behaving as it should.
  - Three pieces: Jenkins, CI, bot.
  - AWS has a libfabric setup like this for testing.
    - Issue is that they're reworking the design, and will roll it out for both libfabric and Open MPI.
  - William Zhang talked to Brian.
    - Not something the AWS team will work on, but Brian will work on it.
    - Jeff will talk to Brian as well.
- Howard and Jeff have access to Jenkins on AWS. Part of the problem is that we don't have much expertise on Jenkins/AWS.
  - William will probably be administering Jenkins/AWS, or communicating with those who will.
- Merged the `--recurse-submodules` update into the `ompi-scripts` Jenkins script as a first step. Let's see if that works.
- PR used to see if Poll would wait at all, and if it would
  - Howard is working on getting the configure stuff to work.
  - Argobots - the problem is integrating libevent.
    - libevent today is a framework.
    - libev support - would it solve the problem? From a high level it's a stripped-down version of libevent.
    - No mechanism yet for one user-level thread to switch to another user-level thread.
    - Problem? libevent is polling too hard and breaking things (see the sketch after this list).
  - UGNI and Vader BTLs were getting better performance, not sure why.
  - For a modular threading library, it might be interesting to decide at compile time or runtime.
    - Previously, similar things seemed to be related to ICACHE (instruction cache).
    - Howard will look at it.
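An illustrative sketch (not OMPI's actual progress engine) of the tension described above: driving libevent with a non-blocking loop spins unless the Argobots user-level thread explicitly yields between polls:

```c
/* Sketch of the "libevent polls too hard" issue with user-level threads.
 * Driving libevent with EVLOOP_NONBLOCK in a tight loop busy-spins; the
 * explicit ABT_thread_yield() is what lets other Argobots ULTs run, since
 * libevent itself has no way to hand control to another ULT. */
#include <abt.h>
#include <event2/event.h>

static void example_progress_loop(struct event_base *base, int iterations)
{
    for (int i = 0; i < iterations; i++) {
        /* Dispatch any ready events and return immediately. */
        event_base_loop(base, EVLOOP_NONBLOCK);

        /* Without this, a busy progress ULT starves every other ULT
         * scheduled on the same execution stream. */
        ABT_thread_yield();
    }
}

int main(int argc, char **argv)
{
    ABT_init(argc, argv);
    struct event_base *base = event_base_new();

    example_progress_loop(base, 1000);   /* stand-in for a real progress driver */

    event_base_free(base);
    ABT_finalize();
    return 0;
}
```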
- Artem - Mellanox developers are making some changes,
  - might require enabling Google actions? On the repo.
  - PMIx has already migrated to this.
Blockers: All Open Blockers
Review v3.0.x Milestones: v3.0.4
Review v3.1.x Milestones: v3.1.4
- Will put out RCs for v3.0.5 and v3.1.5 this week.
- Please test RCs when they become available.
- Start drawing up a list of fixes that won't be backported to v3.0.x.
  - The datatype bug won't be backported, because it snowballed too big.
- With the new v3.0.x and v3.1.x releases, we will put out a list of issues fixed in v4.0.x that are NOT being backported (please upgrade), in either NEWS or README.
Review v4.0.x Milestones: v4.0.2
- v4.0.2 was released and we haven't had any catastrophic issues come in.
- We're beginning to merge in new v4.0.3 PRs.
- PR 7116 - Gilles updated some code for flang on master.
  - There were 3 commits. One definitely broke ABI, but the other two shouldn't have.
  - Some concern that even those 2 might break something on the release branch.
  - Going to ask Gilles on the PR if having this on v4.0.x is important for him.
  - Another detail is that flang is old and being replaced.
- Another REAL problem with v4.0.x - no MPI_NO_OP.
  - Geoff will just PR the missing MPI_NO_OP change to v4.0.x (see the example below).
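For reference, a small standalone example of what MPI_NO_OP is used for: an atomic remote read via MPI_Fetch_and_op, the kind of code that cannot build or run if the constant is missing:

```c
/* MPI_NO_OP with MPI_Fetch_and_op: atomically fetch a remote value without
 * modifying it.  Run with at least 2 ranks; rank 0 reads rank 1's counter. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    long counter = 100 + rank;          /* each rank exposes one long */
    MPI_Win win;
    MPI_Win_create(&counter, sizeof(long), (int)sizeof(long),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_lock_all(0, win);
    if (0 == rank && size > 1) {
        long dummy = 0, result = -1;
        /* With MPI_NO_OP the origin buffer is not applied to the target;
         * this is simply an atomic read of rank 1's counter. */
        MPI_Fetch_and_op(&dummy, &result, MPI_LONG,
                         1 /* target */, 0 /* disp */, MPI_NO_OP, win);
        MPI_Win_flush(1, win);
        printf("rank 0 atomically read %ld from rank 1\n", result);
    }
    MPI_Win_unlock_all(win);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```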
- Schedule: April 2020?
- Wiki - go look at the items, and we should discuss a bit in weekly calls.
- Some items:
  - Removed MPI-1 stuff.
Review Master: Master Pull Requests
- IBM's PGI test has NEVER worked. Is it a real issue or local to IBM?
  - Austen is looking into it.
- Absoft 32-bit Fortran failures.
  - No discussion this week.
- See older weekly notes for prior items.
- No discussion this week.