Meeting Minutes 2017 01
Open MPI Face to Face - Cisco, San Jose - Jan 24-26, 2017
Ralph Castain (Intel)
George Bosilca (UTK)
Shinji Sumimoto (Fujitsu)
Howard Pritchard (LANL)
Geoff Paulsen (IBM)
Jeff Squyres (Cisco)
Andrew Friedley (Intel)
Annu Dasari (Intel)
Artem Polyakov (Mellanox)
Joshua Ladd (Mellanox)
Nathan Hjelm (LANL)
Matias Cabral (Intel)
David Bernholdt (ORNL)
Brian Barrett (AWS)
Sylvain Jeaugey (NVIDIA)
Chris Chambreau (LLNL)
Attending Remotely:
1. Josh Hursey - IBM (Available from 6:30am-3pm Pacific) (Added a ☎️ icon next to the items I'd like to call in for, if possible)
1. Murali Emani - Livermore
Cisco aps
- Discuss the Difference between --host and --hostfile
- Will close PR, and make master behave like 2.0.x
- Add a --host :auto to request autoselection of number of slots, otherwise just default to 1.
- Reviewed blockers on v2.0.x
- Discussed HFI backtrace infinite loop in PSM/PSM2 in 1.10.x
- It's in 1.10.x AND 2.0.x.
- One algorithm in basic does not handle MPI_IN_PLACE correctly. Will have more details tomorrow.
- IBM will have PR tomorrow.
- Issue 2644 - Performance comparison between v1.10 and v2.0.x
- What to look at next? Remember some change in PMLs from 1.10 -> 2.0
- Diff the CM subdirectory between 1.10 and v2.0.x? They are very different.
- Issue 2151 - Nonblocking Collective Datatype cleanup
- George - the correct answer would be to refcount all of the MPI API objects.
- Need to refcount Ops and Datatypes
- For v2.0.2 we can document (Howard).
- IBM will look at fixing this for v2.1 and v2.0.3.
- More general case - as we have more components that can create their own types, we need a more generic way to allow those components to cleanup in progress correctly.
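The refcounting idea discussed for Issue 2151 can be sketched in a few lines. This is only an illustrative model, not Open MPI code: `RefCounted` and `Request` are hypothetical names, and the `destroyed` flag stands in for a real destructor. It shows why a datatype or op freed by the user must stay alive until the nonblocking collective that uses it completes.

```python
class RefCounted:
    """Object that is destroyed only when its last reference is released."""

    def __init__(self, name):
        self.name = name
        self.refcount = 1       # the user's handle holds the first reference
        self.destroyed = False

    def retain(self):
        self.refcount += 1

    def release(self):
        self.refcount -= 1
        if self.refcount == 0:
            self.destroyed = True   # stand-in for the real destructor


class Request:
    """Hypothetical nonblocking collective that pins the objects it uses."""

    def __init__(self, datatype, op):
        self.pinned = [datatype, op]
        for obj in self.pinned:
            obj.retain()

    def complete(self):
        # Drop the references taken when the operation was started.
        for obj in self.pinned:
            obj.release()
        self.pinned = []


dtype = RefCounted("my_datatype")
op = RefCounted("my_op")
req = Request(dtype, op)    # a nonblocking reduction is started

dtype.release()             # user frees both objects while the op is in flight
op.release()
still_alive = not dtype.destroyed and not op.destroyed

req.complete()              # the last references are dropped here
```

With this scheme the user-side free only drops one reference; the actual cleanup is deferred to completion of the request.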
- PMIx - 1.2.1 - update from this morning's call.
- thread locking issue / fix
- Like to put this into 1.2.1. Still seeing a difference between x86 and powerpc.
- One place we could think of is flock vs pthread_lock, futex.
- ppc64le still seeing bad performance compared to x86_64.
- Mellanox will get done next week.
- Schedule for v2.1?
- Originally October, then January.
- Mellanox wants this for PMIx and for OSHMEM updates.
- Nathan saw some ODS loop possible race condition in orted.
- Change v2.x default so that if async modex is off, then set default RTE barrier to init to off (since we don't need it).
- Big Endian / Little Endian support
- v2.0.3, v2.1, master - make configure error out if someone passes --enable-heterogeneous.
- mpirun confidently talks to orteds of other endianness.
- At runtime, check whether the architecture is heterogeneous, and error out if it is.
- Went through some of v2.1 Milestones
- David Bernholdt presented OMPI-x: Open MPI for Exascale project at ORNL
- Slides will be made available.
- exaMPI - for MPICH side of things.
- hwloc Memory footprint reduction
- Problem: on complex chips with a very high PPN, the hwloc topology tree grows into the MB range.
- HWLOC topology tree is a driver - can we devise strategies for not holding this object in memory for the entire execution?
- See PR 2629 for details and possible solutions
- Can we do this for v2.x?
- Each rank instantiates the tree from the orted / resource manager daemon.
- Mostly a PPN scale issue, but somewhat also Node scale issue.
- How complete is this PR? Closed into Master. Not a major change.
- Thinking about Difference between scaling on Master versus v2.x
- v2.x is close. Need the fix to not send the signature back.
- If v2.x has to go to today's enterprise scale machines.
- Memory consumption an issue for certain class of nodes v2.1.1?
- Took Ralph a fair amount of time to do.
- Launch speed v2.1.0? - Easy. Needs PMIx 1.2.1.
- Just a backporting issue what / when. Concerned that it's not well tested and may cause slippage.
- Radix is building an entire tree at mpirun time, taking some time (Nathan). Maybe need to bump debruijn priority back up.
- Release mechanisms:
- Put launch speed pieces in v2.1.0 (mid-Feb)
- Put memory scale piece into a v2.1.1 (late March).
- Exposing PMIx functionality
- Martin Schulz proposes MPI_T wrapper - Similar
- MPI_T - Some parameters must be scoped local (process), (global)
- If you want to change something globally, need to do a
- Discussed, going to have some OMPI inbox, and continue discussing, and get back.
- No portability in MPI_T. Will get compile/link time portability.
- We only install pmix.h if users request install devel headers.
- PMIx talks to resource manager.
- Should OMPI consider installing pmix.h and exposing libpmix symbols?
- There is a libpmix.so (doesn't work for --disable-shared).
- Globally visible inside the library, but not exported outside of opal.
- prefix -> lib -> libmpi, libopen-rte, libpmix
- prefix -> include -> pmix -> pmix.h
- Then users could mpicc .... -lpmix.
- with external headers, --expose-devel-headers, doesn't put libpmix.so out there, and users would use the external one.
- Advantage of installing the one that's in Open MPI, if you're running against the Slurm plugin: the internal one is versioned correctly with the internal orted.
- Applications are actually calling PMIx.
- Always have to coordinate with distro guys (they build externally anyway).
- Don't think it's necessary to guarantee pmix.h backward compatibility.
- Should we allow variadic macros? They're part of C99.
- E.g., OBJ_NEW to allow constructors with arguments.
- The issue is constructors for the super class.
- Some usage already, In PR 2773, already in ______ today.
- Yes! Already have documentation
- Some discussion about possibly requiring C11 compiler
- Some pushback, as older RHELs and other distros shipped with compilers that are too old.
- Caution against
- Generics will have to wait until GCC 4.9+
- Fix type-checking for atomics to use Generics in C11, or turn off atomics for those older compilers.
- Moving verbosity, enable, priority, (and other?) parameters into the MCA super class to be shared across all MCA components?
- There is a lot of history around this.
- There are some frameworks that don't use priority.
- No harm about having a priority, but don't want to force a priority based selection.
- Have a flags field, and one flag could say a framework doesn't care about priority.
- There is a priority now. Components either specify, or it's the order that we load them... might be better to be explicit.
- Need to look at discussion.
- The standard query function in the mca base component takes 2 params. There are frameworks where that is not adequate, and they just extend that framework to add that additional component. In these, just leave base as NULL because the extension does what they want.
- Need to write a special select function.
- Query Interface would have to change, because that's how we get priority.
- It's nice to have enable mca parameters, because they can be used command line, env, conf file.
- --tuned is a different way to do it.
- Need to understand use cases a bit more.
- list with ^ or not ^ to enable or disable.
- Down in the guts of MCA base. When they get component opened, they return an error. Others do something at query time, to decide if they want to be selected or not.
- For 3.0 should clean up some of the btl stuff that is hardcoded to 1.
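The "list with ^ or not ^" convention mentioned above can be illustrated with a small parser. This is a sketch under assumptions, not the MCA base code: `parse_component_list` and `select` are hypothetical helper names, and the rule that `^` may only lead the whole list (no mixing of include and exclude entries) is the behavior this sketch assumes.

```python
def parse_component_list(spec):
    """Split an MCA-style component list into (mode, names).

    "tcp,self"  -> ("include", ["tcp", "self"])
    "^openib"   -> ("exclude", ["openib"])
    """
    spec = spec.strip()
    mode = "exclude" if spec.startswith("^") else "include"
    body = spec[1:] if mode == "exclude" else spec
    names = [n.strip() for n in body.split(",") if n.strip()]
    if any(n.startswith("^") for n in names):
        raise ValueError("'^' may only appear once, at the start of the list")
    return mode, names


def select(available, spec):
    """Return the available components that survive the include/exclude list."""
    mode, names = parse_component_list(spec)
    if mode == "include":
        return [c for c in available if c in names]
    return [c for c in available if c not in names]
```

For example, `select(["self", "tcp", "openib"], "^openib")` keeps `self` and `tcp`, while `"tcp,self"` names the components to keep explicitly.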
- Does anyone care about 32bit on 64bit platforms? (x86_64, ppc64le, and arm/arm64)
- Good way to approach this is like BTLs, who will sign up to support it and test regularly.
- IBM cares about 32bit on ppc64le, and will add regular testing for it.
- Should we state other levels of support? 1. Works and tested regularly, 2. Have code, but don't test, 3. Don't have code, and are not interested.
- Support list in README (Issue to update)
- Discussion about mca parameters: include, exclude, enable, disable, developer vs. user.
- Use case for 'mandatory include' - example 'btl/self', 'coll_basic', PR 2773
- Use case for 'exclude unless expressly included' - example buggy or developer-only components.
- Consensus that 'mandatory disable' is okay.
- Concern about putting a feature upstream that isn't used upstream (only 1 vendor, IBM).
- Some thoughts that it could be useful for other components.
- Now considering a Recommended state, that prints a component specific warning (with an env to turn off warning for specific warning).
- put enum in base, only enforced in hook,
- 3 new flags: Mandatory, Recommended, and Disabled.
- define flag in base, put enforcement in hooks framework.
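A minimal sketch of the three proposed flags, with the flag definition kept separate from the enforcement step, might look like the following. Everything here is hypothetical (`ComponentFlag`, `enforce`, and the `silenced` argument that stands in for the per-component "turn off warning" environment setting); it only models the behavior described in the discussion.

```python
import enum


class ComponentFlag(enum.Flag):
    """Flags a component could carry (defined once, in the base)."""
    NONE = 0
    MANDATORY = enum.auto()
    RECOMMENDED = enum.auto()
    DISABLED = enum.auto()


def enforce(flags_by_component, selected, silenced=()):
    """Enforcement step (the part that would live in a hook).

    Returns (errors, warnings) for the given set of selected components.
    """
    errors, warnings = [], []
    for name, flags in sorted(flags_by_component.items()):
        chosen = name in selected
        if ComponentFlag.MANDATORY in flags and not chosen:
            errors.append(f"{name}: mandatory component was excluded")
        if ComponentFlag.DISABLED in flags and chosen:
            errors.append(f"{name}: disabled component was selected")
        if (ComponentFlag.RECOMMENDED in flags and not chosen
                and name not in silenced):
            warnings.append(f"{name}: recommended component is not in use")
    return errors, warnings
```

Keeping `enforce` outside the enum mirrors the "define flag in base, put enforcement in hooks framework" split: components only declare flags, and one place decides what an error or a warning is.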
- We now have subprojects that have their own mca parameters that are prefixed properly, but mpirun doesn't know to forward those.
- In orte we only pick up mca-prefixed parameters. Need a registration function to know which prefixes to forward.
- Direct launch case, usually just propagate entire environment, and subprojects would fall under that case.
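The registration idea can be sketched as a small registry of environment-variable prefixes that the launcher consults when deciding what to forward. This is an illustrative model only: `ForwardRegistry` is a hypothetical name, and `SUBPROJ_MCA_` is a made-up prefix standing in for whatever a subproject would register; `OMPI_MCA_` is used as the default prefix.

```python
class ForwardRegistry:
    """Tracks which environment-variable prefixes the launcher forwards."""

    def __init__(self):
        self._prefixes = ["OMPI_MCA_"]   # prefix picked up by default

    def register(self, prefix):
        """Called by a subproject to announce its own parameter prefix."""
        if prefix not in self._prefixes:
            self._prefixes.append(prefix)

    def select(self, environ):
        """Return only the variables whose names match a registered prefix."""
        prefixes = tuple(self._prefixes)
        return {k: v for k, v in environ.items() if k.startswith(prefixes)}


registry = ForwardRegistry()
registry.register("SUBPROJ_MCA_")        # hypothetical subproject prefix

env = {
    "OMPI_MCA_btl": "tcp,self",
    "SUBPROJ_MCA_verbose": "1",
    "HOME": "/home/user",
}
forwarded = registry.select(env)
```

Without the `register` call, only the `OMPI_MCA_` variable would be forwarded, which is the gap described above.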
- Showed PMIx memory usage improvements.
- Define some work they're working on in PR 2758
- Brought up some communicator size issues, and Binary backward compatibility.
- Would like to fix the communicator size issue once and for all, by creating a superset comm structure that would never change size, and then have a dereference to lower level piece that could change size, but would cost a dereference each time.
- Would need to do this if/when endpoints become standardized.
- For v3.0, good candidate fields are neighborhood collectives pointers. Good time to move other items out too.
- Consider at that time possible candidates for shared memory, useful for ranks in same comm across node to save size.
- If putting stuff into sharedmem, Moving peer hostname to accessor function that would request it from PMIx.
- BTLs vs UCX vs libfabric discussion
- BTLs were invented first, but now libfabric and UCX
- It's been nice/easy in Open MPI to add a new device.
- What does STACK from MPI down look like in 2 years?
- Depends on capabilities of networking layer.
- Natural evolution of networking, networks are moving up to use PMLs now, to better take advantage of networking features.
- If you provide network in libfabric / UCX, then easy for other MPIs to adopt.
- Are we willing to bet the farm on software we don't control?
- HPC is becoming commoditized, and MPI usage is going up commercially, possibly down from labs.
- Some thoughts about revisiting OSHMEM support, and perhaps decouple the interconnects that it supports from MPI's interconnects.
- Some arguments that it's difficult to support many different BTLs.
- Where are vendors? Some in PMLs, some in BTLs.
- What are we discussing? Throwing out OB1 (and BTLs?)
- Open MPI + PMIx + UCX is very compelling.
- We do have a UCX PML, we should probably have a libfabric PML.
- When CM was written, it wasn't easy to write a PML.
- Okay to let BTLs atrophy... even if multiple paths to same hardware.
- Motivating discussion - Had a Requirement that high level Languages: MPI / OSHMEM must support all interconnects.
- Easiest way is to just say, that this isn't a requirement.
- If an upper layer supports the BTLs, they should support them all.
- TCP is weird because it doesn't support RDMA semantics.
- Some requirements for MPI + OSHMEM in same job (picking best features of each).
- Does UCX support TCP? There is a Pull Request.
- DECISION - at top framework level (one-sided, pt2pt, collectives, files), if it supports BTLs, it must support ALL BTLs.
- Shared progress is important, possibly necessary.
- DECISION - We agreed to remove Yoda in v3.0
- Nice if get good message when no transport available.
- Very nice if can develop on laptop with shared memory.
---- Lunch ----
- Embedding - libhwloc, libevent, libpmix, ...
- What do we mean by embedding?
- Is it just a library we ship, and build just as if the user built it? Or do we have more integration like libevent?
- Two places we got in trouble: When we started modifying it, and when we exposed (libevent) outside of the component.
- Distros only want external components.
- Every time you do a new external component, it is super painful to write configury.
- Brian wouldn't build a framework; he would make it another top-level library (like libmpi doesn't know about libopal).
- If you make libUCX like we did libevent, it will be super painful. Keeping it at the top level is easier than pushing it down into a component, which makes you copy the code across different components. It needs to be at the top because both MPI and OSHMEM need it.
- Put it in opal common? Will then get slurped into opal.
- configure time-stamps are still an issue with Git. So autogen needs to traverse down to component to run autogen down there.
- In a year, we may not want internal components.
- PAINFUL to have to reship OMPI, due to a bug in an embedded component.
- hwloc there can only be one (for m4 reasons).
- RALPH suggests we should strip out libevent in our distro.
- If we can get into Fedora and Debian.
- If we embed UCX, we should embed it in a way that's easy to take it out.
- Don't think we want the 'auto-find-it' configury.
- Components that use it will -l and -L it, and directly call functions in it.
- BTLs are currently always opened in case they are needed by an MTL later.
- Warnings from BTL when not using BTLs.
- what does the MTL need to do to let the OSC know that it can handle OSC communications?
- General suggestion is that most MTL owners should write their own OSC component.
- To support Yoda:
- In 1.10, the only thing that opened BTLs was the BML.
- Nathan can now (that we've decided to dump Yoda) move the open.
- osc / files are lazy open frameworks (first call).
- Extending the MTLs is not the right answer.
- Recommendation is that MTL owners also provide an OSC component.
- usnic provided a dummy libverbs provider that doesn't emit the warning.
- Suggestion to have the PSM2 MTL force the openib off (prevent loading).
- Nathan suggests always running with osc_sm.
- MPI_Win_allocate_shared - osc_sm supports.
- The irony of suggesting that the PSM2 MTL force the openib btl off, given this morning's decision (about forcing things on/off because we know better than the user), is funny.
- Set as a goal for OMPI v3.0 that the MTLs are fully MPI_THREAD_MULTIPLE compliant.
- Howard wrote a performance of uGNI MTL.
- Problem is, each MTL will have its own threading issues.
- When MPI_THREAD_MULTIPLE is enabled, MTL gets everything, but all datatype work is already done.
- Only difference between MTL and PML is if datatype has been done yet or not.
- Think OFI MTL is already thread safe.
- Documentation on what MTLs are doing, so we can make large general statements.
- If customer requests MPI_THREAD_MULTIPLE the MTL detects and disqualifies itself.
- Anything driving a new release?
- Too big to copy/paste from master -> v2.x
- Performance improvements features, refactoring done in master, can't easily pull back.
- People at BOF - what were the promised features that were driving the poll?
- A bit of a push poll. Because it's becoming difficult to keep backporting to v2.x from master.
- One thing that would need a 3.0 is removal of components btl_sm, others...
- People didn't like cutting a 2nd digit change on a branch from master... too unstable.
- Big problem with 1.8 was that it was a couple of years from 1.6. So it was painful for customers.
- Regular releases is a good thing... but frequency is the trick.
- What's a natural place to branch for v3.0?
- A change from George's group that changes the way that progress is done.
- 10 lines in each BTL, not invasive.
- NVIDIA and George have an extension to MPI to support GPU buffers and end to end collectives.
- Somewhat localized, somewhat easy to cherry pick back to v2.x
- Basically taking tune, and moving it into a more flexible component.
- Ralph would like v3.0 out in 2017.
- If releases come out too fast, sites won't deploy them.
- Howard's name was on QA in OMPI-x / HPC from yesterday.
- something happened to break --disable-dlopen. Broken on master / v2.x.
- vendors seem to have more frequent internal release frequency.
- Are we thinking about doing a date based release?
- Let's assume v2.0.2 and v1.10.6 are imminent.
- Everyone seems to want time based releases.
- Would need to be part of conf call, publish schedule and individual members will backfit when a feature will be fit.
- Are we talking about 4 month schedules? 1 year didn't work. Let's miss mid-Oct, Nov, Dec, Jan.
- 3 releases per year: Feb, June, Oct
- v2.1 - sooner than June. Still use old schedule.
- Could set some date for no more features in release.
- PMIx drops are somewhat special for new feature drops, due to all eggs in that basket.
- For ROMIO we grab a specific commit, and then cherry-pick a few things from MPICH (keep a file).
- Have done this with ROMIO, because it's a boat load of work for new ROMIO.
- A point in the process of branching for release is requiring a release of PMIx.
- Basically this is what's slipping OMPI v2.0.2
- Shortly (1 month tops) after prior release, we cut release branch from master for next release.
- After we create the branch, no new features for that release.
- 3.1, 3.2, 3.3 might be different branches from master.
- Is it time to consider we always do Pull Request?
- Some are doing dev every day on master, and it's generally better than v2.x
- Release manager could go to date based.
- All of our CI, is single node.
- Get some configure builds, and replicate
- Get some parallel multi-node testing in CI (on PR).
- Duration of support? How many streams will we simultaneously support.
- Depends on what customers needs.
- Some thought that if some vendors latched onto certain versions and kept stabilizing them, that would provide value.
- Realistically 2 or at most 3 in operation. Could probably push people out of v2.0 and into v2.1
- Moving customers from v2.1 -> 3.0 will be harder. Customers don't like .0 releases, and from master more unstable.
- Moving from 3.0 -> 3.1 people adopt quickly.
- Every 4 months, branch out of master, keep master stable!
- At that time decide if it's 1st digit or 2nd digit release.
- One thing that would really help. We have MPI tests, but only handful of runtime tests.
- Better coverage suite of command line options / runtime flavors.
- Ralph ran mapping, ranking, and binding: ran a nice matrix of those to see if there was a difference.
- Need topology inputs, just need topology strings from lstopo.
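A runtime test matrix like the one described above can be generated mechanically. The sketch below builds `mpirun` command lines from the cross product of a few `--map-by`, `--rank-by`, and `--bind-to` values; the particular values, the process count, and the `./hello` test program are assumptions chosen for illustration, not an agreed test plan.

```python
import itertools

# Sample option values; a real suite would choose these per platform.
MAP_BY = ["slot", "node", "socket"]
RANK_BY = ["slot", "node"]
BIND_TO = ["none", "core", "socket"]


def command_matrix(np=4, program="./hello"):
    """Yield one mpirun argument list per mapping/ranking/binding combination."""
    for map_by, rank_by, bind_to in itertools.product(MAP_BY, RANK_BY, BIND_TO):
        yield ["mpirun", "-np", str(np),
               "--map-by", map_by,
               "--rank-by", rank_by,
               "--bind-to", bind_to,
               program]


commands = list(command_matrix())
```

Each entry in `commands` can be handed to a process runner, and the resulting placements compared across combinations.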
- v3.0
- Will require recompile / relink - changing size of MPI_COMMs.
- How to default async_modex - good for sparse connectivity.
- Some talk about two weeks of quiet time stabilizing master before branching from master.
- Same as branching, and stabilizing there.
- Need to branch for next release same day (or next day) as release to allow for 4 month date based cycle.
- v2.1
- All outstanding PRs against v2.1 including and older than 2729, please rebase.
- Feb, June, Oct rotation is good for schedule reasons.
- Are there blocker bugs in the new world order?
- No, we can always just write it up as a KNOWN error for that specific situation.
- What if there is a known bug that we ALL care about?
- No blockers, still ship it!
- If it's critical for you, you need to help fix it.
- Right now we have single point failures all over this code base.
- This is idealistic. We will struggle against this concept.
- But we should stop doing the things we know don't work.
- Use the clock as the hammer.
- Performance Regression Monitoring options (e.g. a dashboard?)
- Mellanox presented what they do.
- Run baseline and target branch, with some community agreed parameters.
- keep it small, at first, 2 nodes. Run through resource manager.
- Script that generates CSV from OSU output.
- MongoDB - turning it into the HTML that browser is looking at.
- Amazon would prefer DynamoDB, or the MySQL database we already have.
- Want a separate MTT performance database.
- Backend support (new table, etc). Even with Python reporter.
- Payload that goes to daemon, needs to handle the new performance data.
- Josh Ladd can help with browser rendering side.
- Josh Hursey can help with backend piece.
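The "script that generates CSV from OSU output" step, plus a baseline-versus-target comparison, could look roughly like this. It is a sketch, not the Mellanox tooling: it assumes the usual two-column OSU text layout (comment lines starting with `#`, then message size and value), and the function names and the 5% tolerance are hypothetical choices.

```python
import csv
import io


def parse_osu(text):
    """Parse two-column OSU benchmark output into (size, value) pairs."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue                      # skip banner and header lines
        size, value = line.split()[:2]
        rows.append((int(size), float(value)))
    return rows


def to_csv(rows, metric="latency_us"):
    """Render parsed rows as CSV text with a header line."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["size_bytes", metric])
    writer.writerows(rows)
    return out.getvalue()


def regressions(baseline, target, tolerance=0.05):
    """Return message sizes where target latency exceeds baseline by > tolerance."""
    base = dict(baseline)
    return [size for size, value in target
            if size in base and value > base[size] * (1 + tolerance)]
```

For a latency test, `regressions(parse_osu(baseline_text), parse_osu(target_text))` lists the message sizes worth flagging on a dashboard; for bandwidth tests the comparison direction would need to be reversed.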
- Selection of 3.0 release manager.
- Time commitment is generally a weekly hour phone call.
- As it gets closer to release, a bit more work.
- 2 candidates investigating:
- Amazon - Brian is checking.
- Howard - is checking.
- IBM is checking, but probably not.
- Send names to Ralph about who is interested in OpenMP + Open MPI.
- Still want v2.0.2 this morning.
- Getting last-minute PRs ready.
- What to do about our 3rd digit bugfixes?
- Unless REALLY needed no more than once a month.
- release managers decide when to release.
- 1st and 2nd digit releases will be done:
- on the 15th of Feb, June, October.
- On the 15th we Branch for NEXT release (4 months later).
- Release manager can plan RCs leading up to the release.
- First will be v3.0. Branched from master June 15th. All features need to be in master by June 15th.
- Making RCs as relevant.
- v3.0 will then release no later than Oct 15th.
- If Oct 15th comes and still blocker bugs, doesn't matter, we still ship. Document known bugs.
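The date-based cycle above (release on the 15th of Feb, June, and October; branch on a release date for the release four months later) is simple enough to compute. The helper names below are hypothetical; the dates come from the schedule in these minutes.

```python
import datetime

RELEASE_MONTHS = (2, 6, 10)   # Feb, June, Oct
RELEASE_DAY = 15


def next_release(today):
    """Return the first scheduled release date on or after `today`."""
    for year in (today.year, today.year + 1):
        for month in RELEASE_MONTHS:
            candidate = datetime.date(year, month, RELEASE_DAY)
            if candidate >= today:
                return candidate
    raise AssertionError("unreachable: a release always exists within a year")


def release_for_branch(branch_date):
    """A branch cut on one release date ships on the following release date."""
    return next_release(branch_date + datetime.timedelta(days=1))
```

For example, a branch cut on June 15th, 2017 maps to a release date of October 15th, 2017, matching the v3.0 plan above.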
- What are all the things we're trimming out for v3.0?
- MXM MTL (keeping Yalla)
- Supported Platforms:
- OSX (if it's not in jenkins)
- Fujitsu test sparc
- Like the idea, that it's not a supported platform unless we've got a MTT
- Platform 3 categories in README:
- test and support (need MTT and jenkins support)
- take patches to support (FreeBSD?)
- Sorry, we're not going to take patches on these platforms: native Windows support.
- Jeff files some issues for removing components: 2849, 2850, 2851
- RoCE - Doesn't work directly out of the box.
- Need RDMACM
- No one is testing RoCE much.
- Breakout cable is a fork, plug 1 port in 100GB, and have 4 10GB into individual servers.
- Switch port needs to be in multi-port not single port mode.
- This is indicative of a switch config issue, or the lower level stack.
- If this is iWARP or RoCE, you can tell if you're Ethernet or InfiniBand physical.
- Jeff will open an Issue for v2.0.3
- -prot concept
- PR 2825
- BML we are building list for each endpoint, ones it can use for rdma (BW), eager (LAT), send(BW), slightly different.
- Put it in ob1, but using BML - pulling out BML send.
- Eager also uses exclusivity - doesn't have an impact on
- Probably want the union of all of the lists (all that COULD be used).
- R2 cleaned it up a bit when added dynamic add_procs.
- Never cleanup BML data.
- BML can't know about PMLs today.
- Need to look at add_procs of everyone if you want the full map.
- We're interested in it understanding other PMLs, also.
- dlsym is pretty clumsy, and doesn't work for --disable-dlopen.
- registration method could be good.
- If there is a separate registration thing in opal, then it's a buy-in.
- in the component structure - "i am infiniband, I am uGNI". Nathan will see if PR still exists.
- Because we use subobject naming everywhere,
- MCA component filter - PR 1974
- this PR would allow components who want to register who they are, to register.
- orte needs TCP, but want apps to be able to disable TCP.
- orterun has the --net parameter.
- Want to make sure that this PR doesn't cause any problems for other parts of the code-base.
- orte can separate what the daemons see, versus what we give to
- HCOLL can say I'm Infiniband, and I'm MXM.
- This allows everything to happen in the base, before it's loaded.
- Nathan can create a pointer (in C99) &(int){1} <- this gives a pointer to an int with a value of 1. Unnamed on stack.
- Like Nathan's approach.
- Want to use the hook framework for display, but use Nathan's ability for discovery.
- IBM had about a dozen new PRs come upstream.
- will push upstream more frequently.