Testsuite hangs in collective operations when using MPICH #56

Open · devreal opened this issue Nov 3, 2016 · 5 comments

devreal (Member) commented Nov 3, 2016

I am running the testsuite on our Linux cluster (Laki), this time using MPICH v3.4.1. There seems to be a race condition that leaves one process waiting in MPI_Win_free while the others wait in an allgather. DDT shows the following stack trace:

Processes,Threads,Function
4,4,gomp_thread_start (team.c:121)
4,4,  gomp_barrier_wait_end
4,4,main (main.cc:39)
4,4,  RUN_ALL_TESTS (gtest.h:2233)
4,4,    testing::UnitTest::Run (TeamLocality.cc:15)
4,4,      bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool> (TeamLocality.cc:15)
4,4,        bool testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool> (TeamLocality.cc:15)
4,4,          testing::internal::UnitTestImpl::RunAllTests (TeamLocality.cc:15)
4,4,            testing::TestCase::Run (TeamLocality.cc:15)
4,4,              testing::TestInfo::Run (TeamLocality.cc:15)
4,4,                testing::Test::Run (TeamLocality.cc:15)
4,4,                  void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void> (TeamLocality.cc:15)
4,4,                    void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void> (TeamLocality.cc:15)
3,3,                      MinElementTest_TestFindArrayDefault_Test::TestBody (MinElementTest.cc:33)
3,3,                        dash::GlobIter<long, dash::Pattern<1, (dash::MemArrange)1, long>, dash::GlobMem<long, dash::allocator::CollectiveAllocator<long> >, dash::GlobPtr<long, dash::Pattern<1, (dash::MemArrange)1, long> >, dash::GlobRef<long> > dash::min_element<long, dash::Pattern<1, (dash::MemArrange)1, long> > (MinMax.h:241)
3,3,                          dart_allgather
3,3,                            PMPI_Allgather
3,3,                              MPIR_Allgather_impl
3,3,                                MPIR_Allgather
3,3,                                  MPIR_Allgather_intra
3,3,                                    MPIC_Sendrecv
3,3,                                      MPIC_Wait
3,3,                                        MPIDI_CH3I_Progress
1,1,                      MinElementTest_TestFindArrayDefault_Test::TestBody (MinElementTest.cc:42)
1,1,                        dash::Array<long, long, dash::Pattern<1, (dash::MemArrange)1, long> >::~Array (Array.h:781)
1,1,                          dash::Array<long, long, dash::Pattern<1, (dash::MemArrange)1, long> >::deallocate (Array.h:1116)
1,1,                            dash::GlobMem<long, dash::allocator::CollectiveAllocator<long> >::~GlobMem (GlobMem.h:165)
1,1,                              dash::allocator::CollectiveAllocator<long>::deallocate (CollectiveAllocator.h:210)
1,1,                                dart_team_memfree
1,1,                                  PMPI_Win_free
1,1,                                    MPIDI_CH3_SHM_Win_free
1,1,                                      MPIR_Reduce_scatter_block_impl
1,1,                                        MPIR_Reduce_scatter_block_intra
1,1,                                          MPIC_Sendrecv
1,1,                                            MPIC_Wait
1,1,                                              MPIDI_CH3I_Progress
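
For illustration, the mismatch in the traces boils down to two different collectives being entered on the same communicator: MPI_Win_free is collective and, in MPICH's CH3 shared-memory path, internally performs a reduce-scatter, so one rank sitting in it while the others sit in MPI_Allgather cannot make progress. A minimal standalone sketch of that shape (hypothetical code, not taken from the test suite):

#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    /* Hypothetical reproducer, not DASH code: all ranks collectively create a
     * window, then rank 0 frees it while the remaining ranks enter an
     * allgather. Both calls are collective over MPI_COMM_WORLD, so with a
     * blocking implementation the program hangs with the same call-stack
     * shape as shown by DDT above. */
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int     *base;
    MPI_Win  win;
    MPI_Win_allocate(sizeof(int), sizeof(int), MPI_INFO_NULL,
                     MPI_COMM_WORLD, &base, &win);

    if (rank == 0) {
        MPI_Win_free(&win);                 /* waits for all ranks to join */
    } else {
        int  sendval = rank;
        int *recvbuf = malloc(size * sizeof(int));
        MPI_Allgather(&sendval, 1, MPI_INT,
                      recvbuf,  1, MPI_INT, MPI_COMM_WORLD);  /* never completes */
        free(recvbuf);
        MPI_Win_free(&win);
    }

    MPI_Finalize();
    return 0;
}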

I'm not sure whether the following message is related to this issue, so I'm posting it here for completeness:

[    0 ERROR ] [ 20713 ] UnitLocality.h           :56   | dash::exception::AssertionFailed             | [ Unit 0 ] Assertion failed: Expected 0 /zhome/academic/HLRS/hlrs/hpcjschu/src/dash/dash-development/dash/include/dash/util/UnitLocality.h:56 

When filtering out this test, some tests run through until I hit another test that hangs. A random example:

Processes,Threads,Function
4,4,gomp_thread_start (team.c:121)
4,4,  gomp_barrier_wait_end
4,4,main (main.cc:39)
4,4,  RUN_ALL_TESTS (gtest.h:2233)
4,4,    testing::UnitTest::Run (TeamLocality.cc:15)
4,4,      bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool> (TeamLocality.cc:15)
4,4,        bool testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool> (TeamLocality.cc:15)
4,4,          testing::internal::UnitTestImpl::RunAllTests (TeamLocality.cc:15)
4,4,            testing::TestCase::Run (TeamLocality.cc:15)
4,4,              testing::TestInfo::Run (TeamLocality.cc:15)
4,4,                testing::Test::Run (TeamLocality.cc:15)
4,4,                  void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void> (TeamLocality.cc:15)
4,4,                    void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void> (TeamLocality.cc:15)
4,4,                      CopyTest_BlockingGlobalToLocalBarrierUnaligned_Test::TestBody (CopyTest.cc:351)
4,4,                        dash::Array<int, long, dash::Pattern<1, (dash::MemArrange)1, long> >::~Array (Array.h:781)
3,3,                          dash::Array<int, long, dash::Pattern<1, (dash::MemArrange)1, long> >::deallocate (Array.h:1107)
3,3,                            dash::Array<int, long, dash::Pattern<1, (dash::MemArrange)1, long> >::barrier (Array.h:1017)
3,3,                              dash::Team::barrier (Team.h:421)
3,3,                                dart_barrier
3,3,                                  PMPI_Barrier
3,3,                                    MPIR_Barrier_impl
3,3,                                      MPIR_Barrier
3,3,                                        MPIR_Barrier_intra
3,3,                                          MPIR_Barrier_impl
3,3,                                            MPIR_Barrier
3,3,                                              MPIR_Barrier_intra
3,3,                                                MPIC_Sendrecv
3,3,                                                  MPIC_Wait
3,3,                                                    MPIDI_CH3I_Progress
1,1,                                                      MPIDU_Sched_are_pending
1,1,                                                      MPID_nem_network_poll
1,1,                                                      MPID_nem_tcp_connpoll
1,1,                                                        poll
1,1,                          dash::Array<int, long, dash::Pattern<1, (dash::MemArrange)1, long> >::deallocate (Array.h:1116)
1,1,                            dash::GlobMem<int, dash::allocator::CollectiveAllocator<int> >::~GlobMem (GlobMem.h:165)
1,1,                              dash::allocator::CollectiveAllocator<int>::deallocate (CollectiveAllocator.h:210)
1,1,                                dart_team_memfree
1,1,                                  PMPI_Win_free
1,1,                                    MPIDI_CH3_SHM_Win_free
1,1,                                      MPIR_Reduce_scatter_block_impl
1,1,                                        MPIR_Reduce_scatter_block_intra
1,1,                                          MPIC_Sendrecv
1,1,                                            MPIC_Wait
1,1,                                              MPIDI_CH3I_Progress

The compiler was GCC 4.9.1; PAPI was used in version 5.4.2.0.

fuchsto (Member) commented Nov 3, 2016

@devreal I suppose that MPICH version doesn't support shared windows, then? Does it work when building with the CMake flag -DENABLE_SHARED_WINDOWS=OFF?
It looks like it's blocking in a progress handler; we had the same issue on SuperMUC with IBM MPI.
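
A quick way to check that outside of DASH is a small probe (hypothetical, not part of the repository) that tries to create an MPI-3 shared-memory window on a node-local communicator, which is roughly what shared-window support relies on:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    /* Hypothetical standalone probe, not part of DASH: try to allocate an
     * MPI-3 shared-memory window on a node-local communicator. */
    MPI_Init(&argc, &argv);

    MPI_Comm node_comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);
    MPI_Comm_set_errhandler(node_comm, MPI_ERRORS_RETURN);

    int     *baseptr = NULL;
    MPI_Win  win;
    int rc = MPI_Win_allocate_shared(sizeof(int), sizeof(int), MPI_INFO_NULL,
                                     node_comm, &baseptr, &win);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0)
        printf("MPI_Win_allocate_shared: %s\n",
               rc == MPI_SUCCESS ? "ok" : "not supported / failed");

    if (rc == MPI_SUCCESS)
        MPI_Win_free(&win);
    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}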

fuchsto (Member) commented Nov 3, 2016

Apparently the fixes in #58 resolved this.
I just repeated a full CI run with both OpenMPI and MPICH and it passed flawlessly:

https://circleci.com/gh/dash-project/dash/211#usage-queue/containers/1

@devreal Please test again with the current state of development and let me know if you can still reproduce this.

devreal (Member, Author) commented Nov 4, 2016

I ran the tests with MPICH v3.2 (CI uses 3.1.6) and they succeed when running on a single node. However, MPICH reports a faulty MPI_Bcast when running on multiple nodes, which I am investigating right now.

fuchsto (Member) commented Nov 5, 2016

In the meantime, I will repair the "suspicious" TODO code segment in the locality runtime module.
Thank you!
Locality services and dynamic global allocation are huge and very recent additions to the feature set; I'm glad the codebase is being tested so thoroughly now.

fuchsto added this to the dash-0.3.0 milestone on Nov 12, 2016
fuchsto added a commit that referenced this issue on Nov 12, 2016: "Broadcasting host topology from leaders to local team, addresses #56"
fuchsto (Member) commented Nov 12, 2016

@devreal Fixed the broadcast from the leader unit to the local team in #90.
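
For context, the usual MPI pattern for such a fix is to split MPI_COMM_WORLD into node-local communicators and broadcast the leader's data from local rank 0; the sketch below is hypothetical (not the actual DART code from #90, and the host_topology_t fields are invented):

#include <mpi.h>
#include <string.h>

typedef struct {
    char hostname[64];
    int  num_cores;     /* illustrative host-topology fields */
} host_topology_t;

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* one communicator per shared-memory node */
    MPI_Comm node_comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);

    int local_rank;
    MPI_Comm_rank(node_comm, &local_rank);

    host_topology_t topo;
    memset(&topo, 0, sizeof(topo));
    if (local_rank == 0) {
        /* the leader fills in the topology, e.g. from hwloc/PAPI queries */
        strncpy(topo.hostname, "node0", sizeof(topo.hostname) - 1);
        topo.num_cores = 16;
    }

    /* broadcast the leader's view to every unit on the same node */
    MPI_Bcast(&topo, (int)sizeof(topo), MPI_BYTE, 0, node_comm);

    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}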
