Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make segment IDs unique again! #257

Merged
merged 29 commits into from
Feb 18, 2017
Merged

Make segment IDs unique again! #257

merged 29 commits into from
Feb 18, 2017

Conversation

devreal
Copy link
Member

@devreal devreal commented Jan 30, 2017

Store the team ID in the gptr and give every team it's own segment namespace. Segments can thus be uniquely identified on all units in the team that allocated the segment.

Flags are now stored per segment, accessed through dart_gptr_getflags and dart_gptr_setflags.

Points left to do:

  1. Re-use of segment IDs is missing, will add it tomorrow.
  2. The internal handling of team data is still messy, using an index into a global array. We can do better than that!

Tests passed locally but this is a crucial piece of code, so please review carefully.

Addresses #221

@fuchsto
Copy link
Member

fuchsto commented Jan 30, 2017

We have all the flags, we have the best flags!

@fmoessbauer
Copy link
Member

Let's make DART segment IDs great again!

@rkowalewski
Copy link

rkowalewski commented Jan 31, 2017

Hmm...why not following the suggestion by @fuerlinger. If OK for you I will implement it and see if it is feasible.

@rkowalewski rkowalewski self-requested a review January 31, 2017 08:10
Copy link
Member

@fuchsto fuchsto left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hear ya, sorry! It worked for all my use cases but it's not practicable in general.

@devreal
Copy link
Member Author

devreal commented Jan 31, 2017

@rkowalewski You're right, I forgot about this proposal. It's easily done using bitfields. Should we still store the flags centrally and only use the gptr for quick access? Only storing them in the gptr has the disadvantage of a possible (accidental) loss of information regarding the segment if the user manipulates the gptr.

@devreal
Copy link
Member Author

devreal commented Jan 31, 2017

@fmoessbauer I am stuck with the failing HDF5 test cases. After adding typesafe checks I get errors like the following:

[  RUN     ] HDF5ArrayTest.CustomType
[...]
[  ERROR   ] [UNIT 0] in dash/dash/test/HDF5ArrayTest.cc:468
      Expected: fillin.a
      Which is: 1
To be equal to: el.a
      Which is: 1.0000009536743162
Unit 0

This seems to me that there is a loss in precision somewhere. The way I understand the HDF5 implementation is that it reads data directly into local memory so I have no idea where things go wrong and why these test fail only in this branch. Maybe you have an idea what's going on there?

@rkowalewski
Copy link

@devreal We should store it both in the segment and gptr. You are right. And yes, with bitfields it is very convenient.

@fmoessbauer
Copy link
Member

fmoessbauer commented Feb 4, 2017

@devreal It seems like our HDF5 adapter is broken, as the reported failures are not only precision problems. Will have a look at that.

Edit: I am not too happy with changes in the HDF5 adapter (DASH side), as this will lead to massive conflicts with the new implementation (almost from scratch) in branch dash-hdf5-views. Unfortunately parts of the new implementation also depends on the DART threadsafety features.

Copy link
Member

@fmoessbauer fmoessbauer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently building with DART logging enabled is not possible, as log statements are not adapted to new interface.

@fuchsto
Copy link
Member

fuchsto commented Feb 4, 2017

@fmoessbauer That's one of the reasons why I reverted your merge. The integration of log levels as proposed in #247 is not complete, the logging run-time configuration does not comply to the DART config interface and de-centralizes parsing of user settings, etc.

@devreal
Copy link
Member Author

devreal commented Feb 6, 2017

@fmoessbauer Building with debug output should be fixed now, sorry for that. We can defer this PR until #252 has been approved, working on it.

@fuchsto AFAICS, this was unrelated to the changes in #247.

@fmoessbauer
Copy link
Member

@devreal I tried to merge with current development but the conflicts in the DART part are not trivial to merge. Please have a look at them. As the new HDF5 adapter is already merged in development, I guess that the HDF5 bugs will not appear anymore. If still, let me know. I will fix them as soon as possible so we can review and merge this PR.

@devreal
Copy link
Member Author

devreal commented Feb 13, 2017

@fmoessbauer Thanks for the heads-up. I finished merging but I now see a TransformTest fail. I will investigate this and fix it asap. HDF5 tests look good so far 👍

@devreal
Copy link
Member Author

devreal commented Feb 14, 2017

I think I am finished merging the current development in this trunk, tests passed locally. the TransformTest failures I mentioned earlier came from an older version of MPI (1.10.2), which seems to have problems with atomic operations.

Please start your review at your convenience. It ended up being quite a big chunk, sorry for that.

@fmoessbauer
Copy link
Member

From my side it looks good so far, but someone else who is more familiar with DART should review too.

@fuchsto
Copy link
Member

fuchsto commented Feb 17, 2017

@devreal Ready to merge! Please resolve the conflicts and merge.

Copy link

@rkowalewski rkowalewski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work! Thanks for this.

@devreal
Copy link
Member Author

devreal commented Feb 17, 2017

Thanks for your positive reviews. I merged the conflicts. However, with the recent changes from #282 I see multiple threads hang in the ThreadSafety tests in MPI_Win_flush when running it locally on my machine:

Thread 4 (Thread 0x7fffe27a8700 (LWP 27525)):
#0  0x00007ffff62b1c47 in sched_yield () at ../sysdeps/unix/syscall-template.S:84
#1  0x00007fffebdfb715 in ompi_osc_pt2pt_flush_lock () from /home/joseph/opt/openmpi-1.10.5/lib/openmpi/mca_osc_pt2pt.so
#2  0x00007ffff752ebe4 in PMPI_Win_flush () from /home/joseph/opt/openmpi-1.10.5/lib/libmpi.so.12
#3  0x000000000095c23b in dart_put_blocking (gptr=..., src=0x7fffe27a7c44, nelem=1, dtype=DART_TYPE_INT) at /home/joseph/src/dash/dash/dart-impl/mpi/src/dart_communication.c:759
#4  0x00000000007cc2a3 in dash::GlobRef<int>::operator= (this=0x7fffe27a7cf0, val=2) at /home/joseph/src/dash/dash/dash/include/dash/GlobRef.h:201
#5  0x000000000089ae16 in dash::GlobRef<int>::operator= (this=0x7fffe27a7cf0, other=...) at /home/joseph/src/dash/dash/dash/include/dash/GlobRef.h:112
#6  0x0000000000898731 in ThreadsafetyTest_ConcurrentPutGet_Test::TestBody () at /home/joseph/src/dash/dash/dash/test/ThreadsafetyTest.cc:74
#7  0x00007ffff67bee46 in gomp_thread_start (xdata=<optimized out>) at ../../../src/libgomp/team.c:119
#8  0x00007ffff79b96ba in start_thread (arg=0x7fffe27a8700) at pthread_create.c:333
#9  0x00007ffff62ce82d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

I seem to remember a discussion about this kind of hangs but cannot straight remember the potential solution. I gotta do some debugging. However, I assume that this is not related with the changes in this branch. Not sure I can spend much time on the weekend on this, though.

EDIT: For some reason the tests passed here. I ran my tests with OpenMPI 1.10.5 with MPI_THREAD_MULTIPLE enabled (not sure if that's the case with CircleCI).

@fuchsto
Copy link
Member

fuchsto commented Feb 18, 2017

@devreal So it passes in CI but currently still fails on your local machine? Please attach details on your build environment (esp. compiler versions), I will try to reproduce this.

@devreal
Copy link
Member Author

devreal commented Feb 18, 2017

I just tried again with MPICH and the tests pass with shared memory windows disabled. I think we can go ahead and merge this PR, I will open an issue with the details of the hang. Let's just wait for the latest round of CI to pass...

@fuchsto fuchsto merged commit 6f7f508 into development Feb 18, 2017
@fuchsto fuchsto deleted the feat-221-uniqsegid branch February 18, 2017 15:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants