WeeklyTelcon_20160112
- Dialup Info: (Do not post to public mailing list or public wiki)
- Brad Benton
- Edgar Gabriel
- Geoffroy Vallee
- George
- Howard
- Josh Hursey
- Nathan Hjelm
- Ralph
- Ryan Grant
- Sylvain Jeaugey
- Todd Kordenbrock
- Milestones: https://github.com/open-mpi/ompi-release/milestones/v1.10.2
- mpirun hangs on ONLY SLES 12. Minimum 40 procs/node. at very end of mpirun. Only seeing it in certain cases. Not sure what's going on.
- Is mpirun not exiting because ORTED not exiting? Nathan saw this on 2.0
- Wait for Paul Hargrove.
- No objections for Ralph shipping 1.10.2
- Wiki: https://github.com/open-mpi/ompi/wiki/Releasev20
- Blocker Issues: https://github.com/open-mpi/ompi/issues?utf8=%E2%9C%93&q=is%3Aopen+milestone%3Av2.0.0+label%3Ablocker
- Milestones: https://github.com/open-mpi/ompi-release/milestones/v2.0.0
- Group Comms weren't working for Comms of powers of 2. Nathan found massive memory issue.
- https://github.com/open-mpi/ompi/issues/1252 - Nathan working on a decay function for progress functions to "fix" this.
- Nathan's been delayed until later this week. Could get done by middle of next week.
- George commented that the openib BTL specifically could be made to progress only if there is a send/recv message posted.
- ugni progress - could only check for datagrams every (only 200ns hit).
- Prefer to stick with Nathan's original decay function without modifying openib.
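The decay idea discussed for issue 1252 can be illustrated with a minimal sketch. This is not Nathan's implementation; the class name `DecayingProgress`, the doubling policy, and the `max_interval` cap are all assumptions made for illustration. The sketch polls a progress callback on every pass while it reports activity, and backs off to polling it less often while it stays idle.

```python
class DecayingProgress:
    """Wrap a progress callback and call it less often while it stays idle.

    Hypothetical illustration only: the doubling policy and the cap are
    assumptions, not the behavior of any Open MPI component.
    """

    def __init__(self, callback, max_interval=64):
        self.callback = callback          # returns the number of events it completed
        self.max_interval = max_interval  # upper bound on passes between real calls
        self.interval = 1                 # start by calling on every pass
        self.countdown = 1

    def poll(self):
        """One pass of the progress loop; returns events completed this pass."""
        self.countdown -= 1
        if self.countdown > 0:
            return 0                      # skipped: callback is currently decayed
        events = self.callback()
        if events > 0:
            self.interval = 1             # activity seen: poll every pass again
        else:
            # idle: double the gap between calls, up to the cap
            self.interval = min(self.interval * 2, self.max_interval)
        self.countdown = self.interval
        return events
```

With an always-idle callback, the number of real calls grows far more slowly than the number of passes, which is the point of the decay; any completed event resets the interval to 1.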
- https://github.com/open-mpi/ompi/issues/1225 - Totalview debugger problem + PMIx.
- SLURM users use srun, doesn't have this issue.
- DDT does NOT have this issue either. Don't know why it's different. Attach FIFO.
- mpirun waits on a pipe for debugger to write a 1 on that pipe.
- Don't see how that CAN work.
- Nathan's been using attach, rather than mpirun --debug. Attach happens after launch, so then it's not going through this step. Nathan thinks not so critical since attach works.
- Anything will work, as long as you're ATTACHING to a running job, rather than launching through debugger.
- Barring a breakthrough with PMIx notify in the next week, we'll do an RC2 and just carefully document what works and what doesn't as far as debuggers go.
- Will disable "mpirun --debug" and print an error on 2.0 branch that says it's broken.
- No longer a blocker for 2.0.0 due to schedule. Still want to fix this for next release.
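The launch-through-debugger handshake described above (mpirun waits on a pipe until the debugger writes a 1) can be sketched generically as follows. This is not Open MPI or Totalview code; the function names are hypothetical, and an anonymous `os.pipe` with a thread stands in for the real attach FIFO and debugger process.

```python
import os
import threading


def wait_for_debugger(read_fd):
    """Block until one byte arrives on the pipe; True if it is the release byte."""
    byte = os.read(read_fd, 1)   # blocks here until the debugger side writes
    return byte == b"1"


def debugger_release(write_fd):
    """Stand-in for the debugger: write a single '1' to release the launcher."""
    os.write(write_fd, b"1")


# Simulate the two sides in one process (hypothetical setup).
read_fd, write_fd = os.pipe()
debugger = threading.Thread(target=debugger_release, args=(write_fd,))
debugger.start()
released = wait_for_debugger(read_fd)
debugger.join()
os.close(read_fd)
os.close(write_fd)
```

Attaching to an already running job never enters this wait, which matches the observation that attach works while launching through the debugger does not.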
- No new features (except for
- Howard will review
- review group comm
- Don't know if we'll bother with the PLFS filesystem.
- UCX using Modex stuff.
- OMPI-IO + Lustre slow on 2.0.0 (and master) branches. Discussed making ROMIO the default for OMPI on Lustre (only).
- Bunch of failures on Master branch. No chance to look at yet.
- Cisco and Ivy cluster.
- Nathan's seeing a "resource deadlock avoided" error on OMPI Waitall. Some TCP BTL issue. Looks like something going on down there. Should be fairly easy to test this. Cisco TCP one-sided stuff.
- Nathan will see if he can figure this out. Haven't changed one-sided pt2pt recently. Surprised. Maybe proclocks on by default? Need to work this out. Just changed locks from being conditional to being unconditional.
- Edgar found some Lustre issues. OMPI master has bad MPI-IO performance on Lustre. Looked reasonable on master, but now performance is poor. Not completely sure when get performance
- For Lustre itself, could switch back to ROMIO as the default.
- GPFS and others will look good, but Lustre is bad. Can't have OMPI-IO as default on Lustre.
- Problem for 2.0.0 AND Master Branch.
- https://github.com/open-mpi/ompi/issues/398 ready for pull request
- Nathan - Should go to 2.1 (since mpool changes pushed to 2.1).
- https://github.com/open-mpi/ompi/pull/1118 - mpool rewrite should be ready to go, but want George to look at it and make comments. Probably one of the first 2.1 requests after it goes into master.
- https://github.com/open-mpi/ompi/pull/1296 - PMIx - Spreading changes from PMIx across non-PMIx infrastructure. Is that okay?
- This is just making changes in glue that is OMPI specific.
- Should go into 2.0.0. Plugs leaks, but minor. Still good.
- https://github.com/open-mpi/ompi/pull/1290 - OPAL HOTEL problem. Do we need to get this into 2.0 as well?
- Definitely needs to go into 2.0! Jeff is using it in 1.10.
- https://github.com/open-mpi/ompi/pull/1278 - Nathan might want to look at. Gilles fixing derived datatypes in one-sided.
- Nathan says it looks okay. Perfectly reasonable to use two different sets of tags.
- Absolutely a 2.0.0 bug as well.
- Nathan will merge it, and open the PR.
- Mellanox - (via email update after the meeting)
We are just now preparing the patch to open a PR. We’ve just finished testing this morning and got the ‘OK’ from UCX folks to open a PR. Sorry for the delay, we just wanted to be sure all the ‘t’s were crossed and ‘I’s dotted before submission.
- https://github.com/open-mpi/ompi-release/pull/891
- Sandia - Ryan, working on getting some bug fixes for 2.0. No major issues
- Intel - Working on MTT rewrite. Trying to track down error notification thing. Not many cycles.
- Rewriting the client in Python to make it more pluggable, and extending the feature set to handle a broader range of stages.
- Josh has been working on reporter side (last 6 months) with some students. Thinking about more flexible architecture.
- REST interface around the database, to support Python and a more flexible JavaScript reporter. Hopefully get that to a stage where people can play with it.
- Mellanox, Sandia, Intel
- LANL, Houston, IBM
- Cisco, ORNL, UTK, NVIDIA