Tests fail if running on multiple nodes #61

Closed
devreal opened this issue Nov 4, 2016 · 20 comments

Comments

@devreal
Member

devreal commented Nov 4, 2016

I am running the test suite with 4 processes on 2 nodes using MPICH v3.2 and see the following error in the Shared test:

Fatal error in PMPI_Bcast: Message truncated, error stack:
PMPI_Bcast(1600)....................: MPI_Bcast(buf=0x7fff5850cb88, count=16, MPI_BYTE, root=0, MPI_COMM_WORLD) failed
MPIR_Bcast_impl(1452)...............:
MPIR_Bcast(1476)....................:
MPIR_Bcast_intra(1249)..............:
MPIR_SMP_Bcast(1081)................:
MPIR_Bcast_binomial(239)............:
MPIC_Recv(353)......................:
MPIDI_CH3U_Request_unpack_uebuf(568): Message truncated; 280 bytes received but buffer size is 16
MPIR_SMP_Bcast(1088)................:
MPIR_Bcast_binomial(310)............: Failure during collective

The 16 bytes are a DART gptr that is sent to everyone in the team. I tried to find out where the 280 bytes come from, and they seem to stem from dart__base__host_topology__init, where data is sent from one leader unit to all others. Notice the following log lines:

[    0 TRACE ] [  5492 ] host_topology.c          :355  :   DART: dart__base__host_topology__init: broadcasting module locations from leader unit 0 to units in team 0
[    1 TRACE ] [  5493 ] host_topology.c          :355  :   DART: dart__base__host_topology__init: broadcasting module locations from leader unit 0 to units in team 0
[    2 TRACE ] [ 22421 ] host_topology.c          :355  :   DART: dart__base__host_topology__init: broadcasting module locations from leader unit 2 to units in team 0
[    3 TRACE ] [ 22422 ] host_topology.c          :355  :   DART: dart__base__host_topology__init: broadcasting module locations from leader unit 2 to units in team 0

Note that two different leader units (0 and 2) operate on the same team (team 0), so all four units end up in a single broadcast with conflicting roots. Looking at the code, I see the following comment, which is highly suspicious:

if (DART_UNDEFINED_UNIT_ID != local_leader_unit_id) {
    /*
     * TODO: Use local_team instead of team if num_hosts > 1
     */

Maybe someone familiar with this code can comment on whether this might be causing the failing runs on multiple nodes?
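
To illustrate the mismatch suggested by the log lines above, here is a minimal, MPI-free sketch in C. It is not DART code: `select_host_leaders` and `roots_agree` are hypothetical helpers, and the rule "the lowest-numbered unit on a host is its leader" is an assumption inferred from the trace output (units 0 and 1 report leader 0, units 2 and 3 report leader 2). It shows that per-host leaders cannot all serve as the root of one collective on a team that spans several hosts.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of DART): for every unit, pick the
 * lowest-numbered unit running on the same host as that host's leader. */
void select_host_leaders(const char *const hosts[], size_t num_units,
                         int leaders[])
{
    assert(hosts != NULL && leaders != NULL);
    for (size_t u = 0; u < num_units; ++u) {
        leaders[u] = (int)u;              /* a unit leads itself by default */
        for (size_t v = 0; v < u; ++v) {
            if (strcmp(hosts[v], hosts[u]) == 0) {
                leaders[u] = (int)v;      /* first unit seen on this host */
                break;
            }
        }
    }
}

/* A single broadcast on one communicator needs the same root on every
 * participant. Returns 1 if all units chose the same leader, 0 otherwise. */
int roots_agree(const int leaders[], size_t num_units)
{
    for (size_t u = 1; u < num_units; ++u) {
        if (leaders[u] != leaders[0]) {
            return 0;
        }
    }
    return 1;
}
```

With four units on two hosts, the leaders come out as 0, 0, 2, 2 and `roots_agree` returns 0, which is the situation in the trace: every unit enters a broadcast on team 0, but the units on the second node expect a different root than the units on the first.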

@fuchsto
Member

fuchsto commented Nov 5, 2016

@devreal
Yes, the snippet you posted is suspicious, but I cannot reproduce this on SuperMUC.
If possible, please send the full log output via mail; there might be an earlier problem in the execution.

Also, does the test fail if you run it separately via runtime parameter --gtest_filter=SharedTest.*?

@fmoessbauer
Member

fmoessbauer commented Nov 5, 2016

While implementing the Docker openmpi2 container, I ran into the same bug. It is easily reproducible. For the container, see branch feat-docker-ompi2 under dash/scripts/docker-testing/openmpi2. Some tests also hang; I will open a new issue (#63) for that.

@fuchsto
Member

fuchsto commented Nov 7, 2016

Okay, I'm marking this bug report as bug detail then.

@fuchsto
Member

fuchsto commented Nov 12, 2016

Broadcast to local team has been implemented in #90

@devreal
Member Author

devreal commented Nov 14, 2016

Still seeing the MPI_Bcast fail on latest development using MPICH 3.2:

Processes 2-3: Fatal error in PMPI_Bcast: Invalid root, error stack:
Process 3: PMPI_Bcast(1600): MPI_Bcast(buf=0xb6c610, count=280, MPI_BYTE, root=2, comm=0xc4000002) failed
Process 2: PMPI_Bcast(1600): MPI_Bcast(buf=0xb6c6b0, count=280, MPI_BYTE, root=2, comm=0xc4000008) failed
Processes 2-3: PMPI_Bcast(1562): Invalid root (value given was 2)

Stacktrace:

Processes,Function
2,main (main.cc:21)
2,  dash::init (Init.cc:33)
2,    dart_init (dart_initialization.c:262)
2,      dart__mpi__locality_init (dart_locality_priv.c:21)
2,        dart__base__locality__init (locality.c:77)
2,          dart__base__locality__create (locality.c:174)
2,            dart__base__host_topology__create (host_topology.c:671)
2,              dart__base__host_topology__update_module_locations (host_topology.c:402)
2,                dart_bcast (dart_communication.c:1455)
2,                  PMPI_Bcast
2,                    MPIR_Err_return_comm
2,                      MPIR_Handle_fatal_error
2,                        MPID_Abort

After debugging this a bit, it appears that unit 2 passes root=2 to dart_bcast on a node-local communicator. This is an error because the root should be 0 on this communicator (unit numbering starts at 0 on every communicator). The global unit ID has to be translated to a team-local unit ID. Will commit a fix soon.
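
The translation described above can be sketched in plain C without MPI. `global_to_team_rank` is a hypothetical helper, not the DART function used in the actual fix. It assumes the node-local team is given as an array of global unit IDs in rank order, so a unit's position in that array is its team-local ID.

```c
#include <stddef.h>

/* Hypothetical helper (not the DART API): returns the team-local rank of
 * global_id within team_units, or -1 if the unit is not a team member.
 * team_units lists the global unit IDs of the team in rank order. */
int global_to_team_rank(const int team_units[], size_t team_size,
                        int global_id)
{
    for (size_t i = 0; i < team_size; ++i) {
        if (team_units[i] == global_id) {
            return (int)i;
        }
    }
    return -1;
}
```

For the node-local team made up of global units 2 and 3, global unit 2 maps to local rank 0, which is the root value the broadcast on that communicator expects. In MPI itself, the same mapping between two groups is what `MPI_Group_translate_ranks` provides.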

@fuchsto
Member

fuchsto commented Nov 14, 2016

@devreal Thank you, the ID-translation is required of course. Wondering why it worked on SuperMUC.
Did you validate the fix? Does the test suite pass on your cluster?

@devreal
Member Author

devreal commented Nov 14, 2016

I thought it did but I re-checked and it turns out that it crashes with a different error :(

This time I end up with a SIGSEGV. On unit 0 I see the following output:

#### Starting test on unit 0 (n032402 PID: 26192)
[    0 ERROR ] [ 26192 ] domain_locality.c        :147  !!! DART: dart__base__locality__domain__copy: domain  has num_units = 12091040 but domain->unit_ids is NULL
[    0 ERROR ] [ 26192 ] LocalityDomain.cc        :56   | dash::exception::AssertionFailed             | [ Unit 0 ] Assertion failed: Expected 0 /zhome/academic/HLRS/hlrs/hpcjschu/src/dash/bug-56-local-team/dash/src/util/LocalityDomain.cc:56 
[    0 ERROR ] [ 26192 ] LocalityDomain.cc        :263  | dash::exception::AssertionFailed             | [ Unit 0 ] Assertion failed: Expected 0 /zhome/academic/HLRS/hlrs/hpcjschu/src/dash/bug-56-local-team/dash/src/util/LocalityDomain.cc:263 
[    0 ERROR ] [ 26192 ] domain_locality.c        :147  !!! DART: dart__base__locality__domain__copy: domain  has num_units = 12091040 but domain->unit_ids is NULL
[    0 ERROR ] [ 26192 ] LocalityDomain.cc        :56   | dash::exception::AssertionFailed             | [ Unit 0 ] Assertion failed: Expected 0 /zhome/academic/HLRS/hlrs/hpcjschu/src/dash/bug-56-local-team/dash/src/util/LocalityDomain.cc:56 

Stack trace:

Processes,Function
1,main (main.cc:39)
1,  RUN_ALL_TESTS (gtest.h:2233)
1,    testing::UnitTest::Run (Logging.cc:28)
1,      bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool> (Logging.cc:28)
1,        bool testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool> (Logging.cc:28)
1,          testing::internal::UnitTestImpl::RunAllTests (Logging.cc:28)
1,            testing::TestCase::Run (Logging.cc:28)
1,              testing::TestInfo::Run (Logging.cc:28)
1,                testing::Test::Run (Logging.cc:28)
1,                  void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void> (Logging.cc:28)
1,                    void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void> (Logging.cc:28)
1,                      MinElementTest_TestFindArrayDefault_Test::TestBody (MinElementTest.cc:33)
1,                        dash::GlobIter<long, dash::Pattern<1, (dash::MemArrange)1, long>, dash::GlobMem<long, dash::allocator::CollectiveAllocator<long> >, dash::GlobPtr<long, dash::Pattern<1, (dash::MemArrange)1, long> >, dash::GlobRef<long> > dash::min_element<long, dash::Pattern<1, (dash::MemArrange)1, long> > (MinMax.h:200)
1,                          long const* dash::min_element<long> (MinMax.h:55)
1,                            dash::util::UnitLocality::UnitLocality (UnitLocality.h:71)
1,                              dash::util::UnitLocality::UnitLocality (UnitLocality.h:62)
1,                                dash::util::LocalityDomain::LocalityDomain (LocalityDomain.cc:73)
1,                                  dash::util::LocalityDomain::init (LocalityDomain.cc:519)
1,                                    __gnu_cxx::__normal_iterator<int*, std::vector<int, std::allocator<int> > > std::vector<int, std::allocator<int> >::insert<int*, void> (stl_vector.h:1100)
1,                                      void std::vector<int, std::allocator<int> >::_M_insert_dispatch<int*> (stl_vector.h:1375)
1,                                        void std::vector<int, std::allocator<int> >::_M_range_insert<int*> (vector.tcc:667)
1,                                          int* std::__uninitialized_copy_a<int*, int*, int> (stl_uninitialized.h:281)
1,                                            int* std::uninitialized_copy<int*, int*> (stl_uninitialized.h:126)
1,                                              int* std::__uninitialized_copy<true>::__uninit_copy<int*, int*> (stl_uninitialized.h:93)
1,                                                int* std::copy<int*, int*> (stl_algobase.h:456)
1,                                                  int* std::__copy_move_a2<false, int*, int*> (stl_algobase.h:424)
1,                                                    int* std::__copy_move_a<false, int*, int*> (stl_algobase.h:386)
1,                                                      int* std::__copy_move<false, true, std::random_access_iterator_tag>::__copy_m<int> (stl_algobase.h:368)
1,                                                        memmove
1,                                                          _wordcopy_fwd_dest_aligned
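
The error message above reports a domain whose `num_units` is set while `unit_ids` is NULL, and the crash happens while copying unit IDs. As a minimal sketch of how a copy routine can refuse such input instead of reading through it, consider the following C code. `domain_t` and `domain_copy` are hypothetical, simplified stand-ins and not the DART structures; the real domain type has more fields and nested subdomains.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified stand-in for a locality domain (not the DART
 * struct): a unit count plus a heap-allocated array of unit IDs. */
typedef struct {
    int  num_units;
    int *unit_ids;
} domain_t;

/* Deep-copies src into dst. Returns 0 on success and -1 if the source is
 * inconsistent (count without data, negative count) or allocation fails.
 * On failure dst is left empty, so later code never sees a stale count. */
int domain_copy(const domain_t *src, domain_t *dst)
{
    dst->num_units = 0;
    dst->unit_ids  = NULL;

    if (src->num_units < 0) {
        return -1;
    }
    if (src->num_units == 0) {
        return 0;                       /* nothing to copy */
    }
    if (src->unit_ids == NULL) {
        return -1;                      /* num_units > 0 but no data */
    }

    size_t bytes = (size_t)src->num_units * sizeof(int);
    dst->unit_ids = malloc(bytes);
    if (dst->unit_ids == NULL) {
        return -1;
    }
    memcpy(dst->unit_ids, src->unit_ids, bytes);
    dst->num_units = src->num_units;
    return 0;
}
```

A check like this does not fix the underlying defect (the implausible count of 12091040 hints at an uninitialized field somewhere upstream), but it would turn the SIGSEGV into a reportable error at the point where the inconsistency is first visible.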

@devreal
Member Author

devreal commented Nov 14, 2016

I made some progress I think but I am not done yet, gotta leave for today though. I committed some patches to bug-56-local-team since I think this is still related.

Right now there seems to be some invalid copying going on. Valgrind says:

==27607== Invalid read of size 8
==27607==    at 0x4A08C28: memcpy (mc_replace_strmem.c:882)
==27607==    by 0x823D4E: dart__base__locality__domain__copy (domain_locality.c:133)
==27607==    by 0x81B5F8: dart__base__locality__clone_domain (locality.h:49)
==27607==    by 0x81BD6C: dart_domain_clone (dart_locality.c:83)
==27607==    by 0x7EFA00: dash::util::LocalityDomain::LocalityDomain(dart_domain_locality_s const&) (LocalityDomain.cc:52)
==27607==    by 0x762EF1: dash::util::LocalityDomain::scope_domains(dash::util::Locality::Scope) const (LocalityDomain.h:278)
==27607==    by 0x7881E4: TeamLocalityTest_SplitNUMA_Test::TestBody() (TeamLocalityTest.cc:95)
==27607==    by 0x7DA296: void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) (in /zhome/academic/HLRS/hlrs/hpcjschu/src/dash/bug-56-local-team/build/bin/dash-test-mpi)
==27607==    by 0x7D5665: void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) (in /zhome/academic/HLRS/hlrs/hpcjschu/src/dash/bug-56-local-team/build/bin/dash-test-mpi)
==27607==    by 0x7BC5FF: testing::Test::Run() (in /zhome/academic/HLRS/hlrs/hpcjschu/src/dash/bug-56-local-team/build/bin/dash-test-mpi)
==27607==    by 0x7BCE61: testing::TestInfo::Run() (in /zhome/academic/HLRS/hlrs/hpcjschu/src/dash/bug-56-local-team/build/bin/dash-test-mpi)
==27607==    by 0x7BD47C: testing::TestCase::Run() (in /zhome/academic/HLRS/hlrs/hpcjschu/src/dash/bug-56-local-team/build/bin/dash-test-mpi)
==27607==  Address 0xc680dd0 is 0 bytes after a block of size 320 alloc'd
==27607==    at 0x4A06A2E: malloc (vg_replace_malloc.c:270)
==27607==    by 0x82926E: dart__base__locality__domain__create_module_subdomains (domain_locality.c:803)
==27607==    by 0x827A2A: dart__base__locality__domain__create_node_subdomains (domain_locality.c:644)
==27607==    by 0x826B8B: dart__base__locality__domain__create_subdomains (domain_locality.c:550)
==27607==    by 0x8381C0: dart__base__locality__create (locality.c:204)
==27607==    by 0x83709A: dart__base__locality__init (locality.c:77)
==27607==    by 0x81DA11: dart__mpi__locality_init (dart_locality_priv.c:21)
==27607==    by 0x81A921: dart_init (dart_initialization.c:262)
==27607==    by 0x7E61EC: dash::init(int*, char***) (Init.cc:33)
==27607==    by 0x621F9A: main (main.cc:21)

devreal reopened this Nov 14, 2016

@fuchsto
Member

fuchsto commented Nov 14, 2016

@devreal Thank you for crunching this, I have an idea where the invalid copy is coming from. Let me check my implementations in dart/base first; there might be conceptual issues.

@devreal
Member Author

devreal commented Nov 23, 2016

This still fails on multiple nodes (it works on a single node) with a segfault (see the very end of the output):

[  RUN     ] TeamLocalityTest.SplitNUMA
[=   2  LOG =]       TeamLocalityTest.h :  20 | >>> Test suite: TeamLocalityTest 
[=   3  LOG =]       TeamLocalityTest.h :  20 | >>> Test suite: TeamLocalityTest 
[=   1  LOG =]       TeamLocalityTest.h :  20 | >>> Test suite: TeamLocalityTest 
[=   0  LOG =]       TeamLocalityTest.h :  20 | >>> Test suite: TeamLocalityTest 
[=   2  LOG =]       TeamLocalityTest.h :  30 | ===> Running test case with 4 units ... 
[=   3  LOG =]       TeamLocalityTest.h :  30 | ===> Running test case with 4 units ... 
[=   1  LOG =]       TeamLocalityTest.h :  30 | ===> Running test case with 4 units ... 
[=   0  LOG =]       TeamLocalityTest.h :  30 | ===> Running test case with 4 units ... 
==17079== Invalid read of size 8
==17079==    at 0x4A08D4C: memcpy (mc_replace_strmem.c:882)
==17079==    by 0x7F79FF: dart__base__locality__domain__copy (domain_locality.c:133)
==17079==    by 0x7F2B94: dart__base__locality__clone_domain (locality.h:49)
==17079==    by 0x7F2FDC: dart_domain_clone (dart_locality.c:83)
==17079==    by 0x7D8584: dash::util::LocalityDomain::LocalityDomain(dart_domain_locality_s const&) (LocalityDomain.cc:52)
==17079==    by 0x751FA9: dash::util::LocalityDomain::scope_domains(dash::util::Locality::Scope) const (LocalityDomain.h:278)
==17079==    by 0x773E1C: TeamLocalityTest_SplitNUMA_Test::TestBody() (TeamLocalityTest.cc:95)
==17079==    by 0x7C33FA: void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) (in /zhome/academic/HLRS/hlrs/hpcjschu/src/dash/dash-development/build/bin/dash-test-mpi)
==17079==    by 0x7BE7E9: void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) (in /zhome/academic/HLRS/hlrs/hpcjschu/src/dash/dash-development/build/bin/dash-test-mpi)
==17079==    by 0x7A5783: testing::Test::Run() (in /zhome/academic/HLRS/hlrs/hpcjschu/src/dash/dash-development/build/bin/dash-test-mpi)
==17079==    by 0x7A5FE5: testing::TestInfo::Run() (in /zhome/academic/HLRS/hlrs/hpcjschu/src/dash/dash-development/build/bin/dash-test-mpi)
==17079==    by 0x7A6600: testing::TestCase::Run() (in /zhome/academic/HLRS/hlrs/hpcjschu/src/dash/dash-development/build/bin/dash-test-mpi)
==17079==  Address 0xc6804c8 is 4 bytes after a block of size 4 alloc'd
==17079==    at 0x4A06C20: realloc (vg_replace_malloc.c:662)
==17079==    by 0x7FA5C2: dart__base__locality__domain__create_module_subdomains (domain_locality.c:881)
==17079==    by 0x7F9816: dart__base__locality__domain__create_node_subdomains (domain_locality.c:638)
==17079==    by 0x7F8E64: dart__base__locality__domain__create_subdomains (domain_locality.c:544)
==17079==    by 0x80106C: dart__base__locality__create (locality.c:204)
==17079==    by 0x800802: dart__base__locality__init (locality.c:77)
==17079==    by 0x7F3A90: dart__mpi__locality_init (dart_locality_priv.c:21)
==17079==    by 0x7F24DC: dart_init (dart_initialization.c:262)
==17079==    by 0x7CF2F9: dash::init(int*, char***) (Init.cc:33)
==17079==    by 0x619EAA: main (main.cc:21)
==17079== 
==17079== Invalid read of size 4
==17079==    at 0x7F7A07: dart__base__locality__domain__copy (domain_locality.c:137)
==17079==    by 0x7F2B94: dart__base__locality__clone_domain (locality.h:49)
==17079==    by 0x7F2FDC: dart_domain_clone (dart_locality.c:83)
==17079==    by 0x7D8584: dash::util::LocalityDomain::LocalityDomain(dart_domain_locality_s const&) (LocalityDomain.cc:52)
==17079==    by 0x751FA9: dash::util::LocalityDomain::scope_domains(dash::util::Locality::Scope) const (LocalityDomain.h:278)
==17079==    by 0x773E1C: TeamLocalityTest_SplitNUMA_Test::TestBody() (TeamLocalityTest.cc:95)
==17079==    by 0x7C33FA: void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) (in /zhome/academic/HLRS/hlrs/hpcjschu/src/dash/dash-development/build/bin/dash-test-mpi)
==17079==    by 0x7BE7E9: void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) (in /zhome/academic/HLRS/hlrs/hpcjschu/src/dash/dash-development/build/bin/dash-test-mpi)
==17079==    by 0x7A5783: testing::Test::Run() (in /zhome/academic/HLRS/hlrs/hpcjschu/src/dash/dash-development/build/bin/dash-test-mpi)
==17079==    by 0x7A5FE5: testing::TestInfo::Run() (in /zhome/academic/HLRS/hlrs/hpcjschu/src/dash/dash-development/build/bin/dash-test-mpi)
==17079==    by 0x7A6600: testing::TestCase::Run() (in /zhome/academic/HLRS/hlrs/hpcjschu/src/dash/dash-development/build/bin/dash-test-mpi)
==17079==    by 0x7ACEF6: testing::internal::UnitTestImpl::RunAllTests() (in /zhome/academic/HLRS/hlrs/hpcjschu/src/dash/dash-development/build/bin/dash-test-mpi)
==17079==  Address 0xc6804b0 is 16 bytes before a block of size 4 alloc'd
==17079==    at 0x4A06C20: realloc (vg_replace_malloc.c:662)
==17079==    by 0x7FA5C2: dart__base__locality__domain__create_module_subdomains (domain_locality.c:881)
==17079==    by 0x7F9816: dart__base__locality__domain__create_node_subdomains (domain_locality.c:638)
==17079==    by 0x7F8E64: dart__base__locality__domain__create_subdomains (domain_locality.c:544)
==17079==    by 0x80106C: dart__base__locality__create (locality.c:204)
==17079==    by 0x800802: dart__base__locality__init (locality.c:77)
==17079==    by 0x7F3A90: dart__mpi__locality_init (dart_locality_priv.c:21)
==17079==    by 0x7F24DC: dart_init (dart_initialization.c:262)
==17079==    by 0x7CF2F9: dash::init(int*, char***) (Init.cc:33)
==17079==    by 0x619EAA: main (main.cc:21)
==17079== 
==17079== Invalid read of size 8
==17079==    at 0x7F7C09: dart__base__locality__domain__copy (domain_locality.c:150)
==17079==    by 0x7F2B94: dart__base__locality__clone_domain (locality.h:49)
==17079==    by 0x7F2FDC: dart_domain_clone (dart_locality.c:83)
==17079==    by 0x7D8584: dash::util::LocalityDomain::LocalityDomain(dart_domain_locality_s const&) (LocalityDomain.cc:52)
==17079==    by 0x751FA9: dash::util::LocalityDomain::scope_domains(dash::util::Locality::Scope) const (LocalityDomain.h:278)
==17079==    by 0x773E1C: TeamLocalityTest_SplitNUMA_Test::TestBody() (TeamLocalityTest.cc:95)
==17079==    by 0x7C33FA: void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) (in /zhome/academic/HLRS/hlrs/hpcjschu/src/dash/dash-development/build/bin/dash-test-mpi)
==17079==    by 0x7BE7E9: void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) (in /zhome/academic/HLRS/hlrs/hpcjschu/src/dash/dash-development/build/bin/dash-test-mpi)
==17079==    by 0x7A5783: testing::Test::Run() (in /zhome/academic/HLRS/hlrs/hpcjschu/src/dash/dash-development/build/bin/dash-test-mpi)
==17079==    by 0x7A5FE5: testing::TestInfo::Run() (in /zhome/academic/HLRS/hlrs/hpcjschu/src/dash/dash-development/build/bin/dash-test-mpi)
==17079==    by 0x7A6600: testing::TestCase::Run() (in /zhome/academic/HLRS/hlrs/hpcjschu/src/dash/dash-development/build/bin/dash-test-mpi)
==17079==    by 0x7ACEF6: testing::internal::UnitTestImpl::RunAllTests() (in /zhome/academic/HLRS/hlrs/hpcjschu/src/dash/dash-development/build/bin/dash-test-mpi)
==17079==  Address 0xc6804b8 is 8 bytes before a block of size 4 alloc'd
==17079==    at 0x4A06C20: realloc (vg_replace_malloc.c:662)
==17079==    by 0x7FA5C2: dart__base__locality__domain__create_module_subdomains (domain_locality.c:881)
==17079==    by 0x7F9816: dart__base__locality__domain__create_node_subdomains (domain_locality.c:638)
==17079==    by 0x7F8E64: dart__base__locality__domain__create_subdomains (domain_locality.c:544)
==17079==    by 0x80106C: dart__base__locality__create (locality.c:204)
==17079==    by 0x800802: dart__base__locality__init (locality.c:77)
==17079==    by 0x7F3A90: dart__mpi__locality_init (dart_locality_priv.c:21)
==17079==    by 0x7F24DC: dart_init (dart_initialization.c:262)
==17079==    by 0x7CF2F9: dash::init(int*, char***) (Init.cc:33)
==17079==    by 0x619EAA: main (main.cc:21)
==17079== 
==17079== Invalid read of size 4
==17079==    at 0x7F7D7E: dart__base__locality__domain__copy (domain_locality.c:161)
==17079==    by 0x7F2B94: dart__base__locality__clone_domain (locality.h:49)
==17079==    by 0x7F2FDC: dart_domain_clone (dart_locality.c:83)
==17079==    by 0x7D8584: dash::util::LocalityDomain::LocalityDomain(dart_domain_locality_s const&) (LocalityDomain.cc:52)
==17079==    by 0x751FA9: dash::util::LocalityDomain::scope_domains(dash::util::Locality::Scope) const (LocalityDomain.h:278)
==17079==    by 0x773E1C: TeamLocalityTest_SplitNUMA_Test::TestBody() (TeamLocalityTest.cc:95)
==17079==    by 0x7C33FA: void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) (in /zhome/academic/HLRS/hlrs/hpcjschu/src/dash/dash-development/build/bin/dash-test-mpi)
==17079==    by 0x7BE7E9: void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) (in /zhome/academic/HLRS/hlrs/hpcjschu/src/dash/dash-development/build/bin/dash-test-mpi)
==17079==    by 0x7A5783: testing::Test::Run() (in /zhome/academic/HLRS/hlrs/hpcjschu/src/dash/dash-development/build/bin/dash-test-mpi)
==17079==    by 0x7A5FE5: testing::TestInfo::Run() (in /zhome/academic/HLRS/hlrs/hpcjschu/src/dash/dash-development/build/bin/dash-test-mpi)
==17079==    by 0x7A6600: testing::TestCase::Run() (in /zhome/academic/HLRS/hlrs/hpcjschu/src/dash/dash-development/build/bin/dash-test-mpi)
==17079==    by 0x7ACEF6: testing::internal::UnitTestImpl::RunAllTests() (in /zhome/academic/HLRS/hlrs/hpcjschu/src/dash/dash-development/build/bin/dash-test-mpi)
==17079==  Address 0xc680498 is not stack'd, malloc'd or (recently) free'd
==17079== 
==17079== Invalid read of size 8
==17079==    at 0x7F7D90: dart__base__locality__domain__copy (domain_locality.c:162)
==17079==    by 0x7F2B94: dart__base__locality__clone_domain (locality.h:49)
==17079==    by 0x7F2FDC: dart_domain_clone (dart_locality.c:83)
==17079==    by 0x7D8584: dash::util::LocalityDomain::LocalityDomain(dart_domain_locality_s const&) (LocalityDomain.cc:52)
==17079==    by 0x751FA9: dash::util::LocalityDomain::scope_domains(dash::util::Locality::Scope) const (LocalityDomain.h:278)
==17079==    by 0x773E1C: TeamLocalityTest_SplitNUMA_Test::TestBody() (TeamLocalityTest.cc:95)
==17079==    by 0x7C33FA: void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) (in /zhome/academic/HLRS/hlrs/hpcjschu/src/dash/dash-development/build/bin/dash-test-mpi)
==17079==    by 0x7BE7E9: void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) (in /zhome/academic/HLRS/hlrs/hpcjschu/src/dash/dash-development/build/bin/dash-test-mpi)
==17079==    by 0x7A5783: testing::Test::Run() (in /zhome/academic/HLRS/hlrs/hpcjschu/src/dash/dash-development/build/bin/dash-test-mpi)
==17079==    by 0x7A5FE5: testing::TestInfo::Run() (in /zhome/academic/HLRS/hlrs/hpcjschu/src/dash/dash-development/build/bin/dash-test-mpi)
==17079==    by 0x7A6600: testing::TestCase::Run() (in /zhome/academic/HLRS/hlrs/hpcjschu/src/dash/dash-development/build/bin/dash-test-mpi)
==17079==    by 0x7ACEF6: testing::internal::UnitTestImpl::RunAllTests() (in /zhome/academic/HLRS/hlrs/hpcjschu/src/dash/dash-development/build/bin/dash-test-mpi)
==17079==  Address 0xc6804a0 is not stack'd, malloc'd or (recently) free'd
==17079== 
==17079== Invalid read of size 4
==17079==    at 0x7F7F00: dart__base__locality__domain__copy (domain_locality.c:169)
==17079==    by 0x7F2B94: dart__base__locality__clone_domain (locality.h:49)
==17079==    by 0x7F2FDC: dart_domain_clone (dart_locality.c:83)
==17079==    by 0x7D8584: dash::util::LocalityDomain::LocalityDomain(dart_domain_locality_s const&) (LocalityDomain.cc:52)
==17079==    by 0x751FA9: dash::util::LocalityDomain::scope_domains(dash::util::Locality::Scope) const (LocalityDomain.h:278)
==17079==    by 0x773E1C: TeamLocalityTest_SplitNUMA_Test::TestBody() (TeamLocalityTest.cc:95)
==17079==    by 0x7C33FA: void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) (in /zhome/academic/HLRS/hlrs/hpcjschu/src/dash/dash-development/build/bin/dash-test-mpi)
==17079==    by 0x7BE7E9: void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) (in /zhome/academic/HLRS/hlrs/hpcjschu/src/dash/dash-development/build/bin/dash-test-mpi)
==17079==    by 0x7A5783: testing::Test::Run() (in /zhome/academic/HLRS/hlrs/hpcjschu/src/dash/dash-development/build/bin/dash-test-mpi)
==17079==    by 0x7A5FE5: testing::TestInfo::Run() (in /zhome/academic/HLRS/hlrs/hpcjschu/src/dash/dash-development/build/bin/dash-test-mpi)
==17079==    by 0x7A6600: testing::TestCase::Run() (in /zhome/academic/HLRS/hlrs/hpcjschu/src/dash/dash-development/build/bin/dash-test-mpi)
==17079==    by 0x7ACEF6: testing::internal::UnitTestImpl::RunAllTests() (in /zhome/academic/HLRS/hlrs/hpcjschu/src/dash/dash-development/build/bin/dash-test-mpi)
==17079==  Address 0xc680498 is not stack'd, malloc'd or (recently) free'd
==17079== 
==17079== Invalid read of size 4
==17079==    at 0x7F8184: dart__base__locality__domain__copy (domain_locality.c:182)
==17079==    by 0x7F2B94: dart__base__locality__clone_domain (locality.h:49)
==17079==    by 0x7F2FDC: dart_domain_clone (dart_locality.c:83)
==17079==    by 0x7D8584: dash::util::LocalityDomain::LocalityDomain(dart_domain_locality_s const&) (LocalityDomain.cc:52)
==17079==    by 0x751FA9: dash::util::LocalityDomain::scope_domains(dash::util::Locality::Scope) const (LocalityDomain.h:278)
==17079==    by 0x773E1C: TeamLocalityTest_SplitNUMA_Test::TestBody() (TeamLocalityTest.cc:95)
==17079==    by 0x7C33FA: void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) (in /zhome/academic/HLRS/hlrs/hpcjschu/src/dash/dash-development/build/bin/dash-test-mpi)
==17079==    by 0x7BE7E9: void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) (in /zhome/academic/HLRS/hlrs/hpcjschu/src/dash/dash-development/build/bin/dash-test-mpi)
==17079==    by 0x7A5783: testing::Test::Run() (in /zhome/academic/HLRS/hlrs/hpcjschu/src/dash/dash-development/build/bin/dash-test-mpi)
==17079==    by 0x7A5FE5: testing::TestInfo::Run() (in /zhome/academic/HLRS/hlrs/hpcjschu/src/dash/dash-development/build/bin/dash-test-mpi)
==17079==    by 0x7A6600: testing::TestCase::Run() (in /zhome/academic/HLRS/hlrs/hpcjschu/src/dash/dash-development/build/bin/dash-test-mpi)
==17079==    by 0x7ACEF6: testing::internal::UnitTestImpl::RunAllTests() (in /zhome/academic/HLRS/hlrs/hpcjschu/src/dash/dash-development/build/bin/dash-test-mpi)
==17079==  Address 0xc680498 is not stack'd, malloc'd or (recently) free'd
==17079== 
==17079== Invalid read of size 8
==17079==    at 0x7F80D9: dart__base__locality__domain__copy (domain_locality.c:183)
==17079==    by 0x7F2B94: dart__base__locality__clone_domain (locality.h:49)
==17079==    by 0x7F2FDC: dart_domain_clone (dart_locality.c:83)
==17079==    by 0x7D8584: dash::util::LocalityDomain::LocalityDomain(dart_domain_locality_s const&) (LocalityDomain.cc:52)
==17079==    by 0x751FA9: dash::util::LocalityDomain::scope_domains(dash::util::Locality::Scope) const (LocalityDomain.h:278)
==17079==    by 0x773E1C: TeamLocalityTest_SplitNUMA_Test::TestBody() (TeamLocalityTest.cc:95)
==17079==    by 0x7C33FA: void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) (in /zhome/academic/HLRS/hlrs/hpcjschu/src/dash/dash-development/build/bin/dash-test-mpi)
==17079==    by 0x7BE7E9: void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) (in /zhome/academic/HLRS/hlrs/hpcjschu/src/dash/dash-development/build/bin/dash-test-mpi)
==17079==    by 0x7A5783: testing::Test::Run() (in /zhome/academic/HLRS/hlrs/hpcjschu/src/dash/dash-development/build/bin/dash-test-mpi)
==17079==    by 0x7A5FE5: testing::TestInfo::Run() (in /zhome/academic/HLRS/hlrs/hpcjschu/src/dash/dash-development/build/bin/dash-test-mpi)
==17079==    by 0x7A6600: testing::TestCase::Run() (in /zhome/academic/HLRS/hlrs/hpcjschu/src/dash/dash-development/build/bin/dash-test-mpi)
==17079==    by 0x7ACEF6: testing::internal::UnitTestImpl::RunAllTests() (in /zhome/academic/HLRS/hlrs/hpcjschu/src/dash/dash-development/build/bin/dash-test-mpi)
==17079==  Address 0xc6804a0 is not stack'd, malloc'd or (recently) free'd
==17079== 
==17079== Invalid read of size 8
==17079==    at 0x4A08D4C: memcpy (mc_replace_strmem.c:882)
==17079==    by 0x7F79FF: dart__base__locality__domain__copy (domain_locality.c:133)
==17079==    by 0x7F813C: dart__base__locality__domain__copy (domain_locality.c:186)
==17079==    by 0x7F2B94: dart__base__locality__clone_domain (locality.h:49)
==17079==    by 0x7F2FDC: dart_domain_clone (dart_locality.c:83)
==17079==    by 0x7D8584: dash::util::LocalityDomain::LocalityDomain(dart_domain_locality_s const&) (LocalityDomain.cc:52)
==17079==    by 0x751FA9: dash::util::LocalityDomain::scope_domains(dash::util::Locality::Scope) const (LocalityDomain.h:278)
==17079==    by 0x773E1C: TeamLocalityTest_SplitNUMA_Test::TestBody() (TeamLocalityTest.cc:95)
==17079==    by 0x7C33FA: void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) (in /zhome/academic/HLRS/hlrs/hpcjschu/src/dash/dash-development/build/bin/dash-test-mpi)
==17079==    by 0x7BE7E9: void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) (in /zhome/academic/HLRS/hlrs/hpcjschu/src/dash/dash-development/build/bin/dash-test-mpi)
==17079==    by 0x7A5783: testing::Test::Run() (in /zhome/academic/HLRS/hlrs/hpcjschu/src/dash/dash-development/build/bin/dash-test-mpi)
==17079==    by 0x7A5FE5: testing::TestInfo::Run() (in /zhome/academic/HLRS/hlrs/hpcjschu/src/dash/dash-development/build/bin/dash-test-mpi)
==17079==  Address 0xe8 is not stack'd, malloc'd or (recently) free'd
==17079== 
==17079== 
==17079== Process terminating with default action of signal 11 (SIGSEGV)
==17079==  Access not within mapped region at address 0xE8
==17079==    at 0x4A08D4C: memcpy (mc_replace_strmem.c:882)
==17079==    by 0x7F79FF: dart__base__locality__domain__copy (domain_locality.c:133)
==17079==    by 0x7F813C: dart__base__locality__domain__copy (domain_locality.c:186)
==17079==    by 0x7F2B94: dart__base__locality__clone_domain (locality.h:49)
==17079==    by 0x7F2FDC: dart_domain_clone (dart_locality.c:83)
==17079==    by 0x7D8584: dash::util::LocalityDomain::LocalityDomain(dart_domain_locality_s const&) (LocalityDomain.cc:52)
==17079==    by 0x751FA9: dash::util::LocalityDomain::scope_domains(dash::util::Locality::Scope) const (LocalityDomain.h:278)
==17079==    by 0x773E1C: TeamLocalityTest_SplitNUMA_Test::TestBody() (TeamLocalityTest.cc:95)
==17079==    by 0x7C33FA: void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) (in /zhome/academic/HLRS/hlrs/hpcjschu/src/dash/dash-development/build/bin/dash-test-mpi)
==17079==    by 0x7BE7E9: void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) (in /zhome/academic/HLRS/hlrs/hpcjschu/src/dash/dash-development/build/bin/dash-test-mpi)
==17079==    by 0x7A5783: testing::Test::Run() (in /zhome/academic/HLRS/hlrs/hpcjschu/src/dash/dash-development/build/bin/dash-test-mpi)
==17079==    by 0x7A5FE5: testing::TestInfo::Run() (in /zhome/academic/HLRS/hlrs/hpcjschu/src/dash/dash-development/build/bin/dash-test-mpi)
==17079==  If you believe this happened as a result of a stack
==17079==  overflow in your program's main thread (unlikely but
==17079==  possible), you can try to increase the size of the
==17079==  main thread stack using the --main-stacksize= flag.
==17079==  The main thread stack size used in this run was 16777216.

@fuchsto
Member

fuchsto commented Nov 28, 2016

@devreal
Can you run the tests again with the new CI script where test suites are executed in isolated runs?
I think only a small set of tests is failing, due to defect #153.

@devreal
Member Author

devreal commented Nov 28, 2016

Unfortunately, our cluster is down for maintenance until Thursday. I will try as soon as I get a chance to get on the cluster again.

@devreal
Member Author

devreal commented Dec 1, 2016

Re-ran just now (MPICH v3.2). I see the following errors:

[=  0 LOG =]              TeamTest.cc :  61 | team_all.dart_id(): 0, team_core.dart_id(): 3
[    4 ERROR ] [  8814 ] dart_team_private.c      :180  !!! DART: Invalid teamid input: -1
[    3 ERROR ] [  4824 ] dart_team_private.c      :180  !!! DART: Invalid teamid input: -1
[    3 ERROR ] [  4824 ] host_topology.c          :250  !!! DART: Assertion failed: dart_team_myid(leader_team, &my_leader_id) -- Expected return value 0
[    3 ERROR ] [  4824 ] dart_team_private.c      :180  !!! DART: Invalid teamid input: -1
[    4 ERROR ] [  8814 ] host_topology.c          :250  !!! DART: Assertion failed: dart_team_myid(leader_team, &my_leader_id) -- Expected return value 0
dash-test-mpi: /zhome/academic/HLRS/hlrs/hpcjschu/src/dash/dash-development/dart-impl/base/src/internal/host_topology.c:250: dart__base__host_topology__update_module_locations: Assertion `(dart_team_myid(leader_team, &my_leader_id)) == (DART_OK)' failed.
[    4 ERROR ] [  8814 ] dart_team_private.c      :180  !!! DART: Invalid teamid input: -1
dash-test-mpi: /zhome/academic/HLRS/hlrs/hpcjschu/src/dash/dash-development/dart-impl/base/src/internal/host_topology.c:250: dart__base__host_topology__update_module_locations: Assertion `(dart_team_myid(leader_team, &my_leader_id)) == (DART_OK)' failed.

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 4824 RUNNING AT n023002
=   EXIT CODE: 6
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
[proxy:0:0@n023102] HYD_pmcd_pmip_control_cmd_cb (../../../../src/pm/hydra/pm/pmiserv/pmip_cb.c:885): assert (!closed) failed
[proxy:0:0@n023102] HYDT_dmxu_poll_wait_for_event (../../../../src/pm/hydra/tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:0@n023102] main (../../../../src/pm/hydra/pm/pmiserv/pmip.c:206): demux engine error waiting for event
[mpiexec@n023102] HYDT_bscu_wait_for_completion (../../../../src/pm/hydra/tools/bootstrap/utils/bscu_wait.c:76): one of the processes terminated badly; aborting
[mpiexec@n023102] HYDT_bsci_wait_for_completion (../../../../src/pm/hydra/tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
[mpiexec@n023102] HYD_pmci_wait_for_completion (../../../../src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:218): launcher returned error waiting for completion
[mpiexec@n023102] main (../../../../src/pm/hydra/ui/mpich/mpiexec.c:344): process manager error waiting for completion
[[   FAIL ]] [ 20161201-171308 ] 0 failed, returned 1, completed: 0

The same error also occurred in GlobMemTest and ArrayTest.

[=  3 LOG =]       TeamLocalityTest.h :  20 | >>> Test suite: TeamLocalityTest
[=  3 LOG =]       TeamLocalityTest.h :  30 | ===> Running test case with 6 units ...
Fatal error in PMPI_Barrier: Unknown error class, error stack:
PMPI_Barrier(425).....................: MPI_Barrier(MPI_COMM_WORLD) failed
MPIR_Barrier_impl(332)................: Failure during collective
MPIR_Barrier_impl(327)................:
MPIR_Barrier(292).....................:
MPIR_Barrier_intra(150)...............:
barrier_smp_intra(96).................:
MPIR_Barrier_impl(332)................: Failure during collective
MPIR_Barrier_impl(327)................:
MPIR_Barrier(292).....................:
MPIR_Barrier_intra(169)...............:
MPIDU_Complete_posted_with_error(1137): Process failed
barrier_smp_intra(111)................:
MPIR_Bcast_impl(1452).................:
MPIR_Bcast(1476)......................:
MPIR_Bcast_intra(1287)................:
MPIR_Bcast_binomial(310)..............: Failure during collective
[[   FAIL ]] [ 20161201-171310 ] 0 failed, returned 1, completed: 0
[  RUN     ] DARTLocalityTest.ExcludeLocalityDomain
[=  2 LOG =]       DARTLocalityTest.h :  20 | >>> Test suite: DARTLocalityTest 
[=  2 LOG =]       DARTLocalityTest.h :  31 | ===> Running test case with 8 units ...
[=  1 LOG =]       DARTLocalityTest.h :  20 | >>> Test suite: DARTLocalityTest 
[=  1 LOG =]       DARTLocalityTest.h :  31 | ===> Running test case with 8 units ...
[=  3 LOG =]       DARTLocalityTest.h :  20 | >>> Test suite: DARTLocalityTest 
[=  3 LOG =]       DARTLocalityTest.h :  31 | ===> Running test case with 8 units ...
[=  4 LOG =]       DARTLocalityTest.h :  20 | >>> Test suite: DARTLocalityTest 
[=  4 LOG =]       DARTLocalityTest.h :  31 | ===> Running test case with 8 units ...
[=  6 LOG =]       DARTLocalityTest.h :  20 | >>> Test suite: DARTLocalityTest 
[=  6 LOG =]       DARTLocalityTest.h :  31 | ===> Running test case with 8 units ...
[=  5 LOG =]       DARTLocalityTest.h :  20 | >>> Test suite: DARTLocalityTest 
[=  5 LOG =]       DARTLocalityTest.h :  31 | ===> Running test case with 8 units ...
[=  7 LOG =]       DARTLocalityTest.h :  20 | >>> Test suite: DARTLocalityTest 
[=  7 LOG =]       DARTLocalityTest.h :  31 | ===> Running test case with 8 units ...
[=  0 LOG =]       DARTLocalityTest.h :  20 | >>> Test suite: DARTLocalityTest 
[=  0 LOG =]       DARTLocalityTest.h :  31 | ===> Running test case with 8 units ...
*** glibc detected *** /zhome/academic/HLRS/hlrs/hpcjschu/opt/dash-0.3.0/bin/dash/test/mpi/dash-test-mpi: realloc(): invalid next size: 0x00000000027bdd50 ***
*** glibc detected *** /zhome/academic/HLRS/hlrs/hpcjschu/opt/dash-0.3.0/bin/dash/test/mpi/dash-test-mpi: realloc(): invalid next size: 0x000000000147bd30 ***
*** glibc detected *** /zhome/academic/HLRS/hlrs/hpcjschu/opt/dash-0.3.0/bin/dash/test/mpi/dash-test-mpi: realloc(): invalid next size: 0x0000000001a5dbd0 ***
*** glibc detected *** /zhome/academic/HLRS/hlrs/hpcjschu/opt/dash-0.3.0/bin/dash/test/mpi/dash-test-mpi: realloc(): invalid next size: 0x00000000011c5db0 ***
======= Backtrace: =========
/lib64/libc.so.6[0x3c99075f3e]
======= Backtrace: =========
======= Backtrace: =========
/lib64/libc.so.6[0x3c99075f3e]
======= Backtrace: =========
/lib64/libc.so.6[0x3c99075f3e]
/lib64/libc.so.6[0x3c99075f3e]
/lib64/libc.so.6[0x3c9907be1a]
/lib64/libc.so.6[0x3c9907be1a]
/lib64/libc.so.6[0x3c9907be1a]
/lib64/libc.so.6(realloc+0x158)[0x3c9907c058]
/lib64/libc.so.6(realloc+0x158)[0x3c9907c058]
/lib64/libc.so.6(realloc+0x158)[0x3c9907c058]
/lib64/libc.so.6[0x3c9907be1a]
/lib64/libc.so.6(realloc+0x158)[0x3c9907c058]
/zhome/academic/HLRS/hlrs/hpcjschu/opt/dash-0.3.0/bin/dash/test/mpi/dash-test-mpi(dart__base__locality__domain__filter_subdomains+0x326)[0x7fcb1c]
/zhome/academic/HLRS/hlrs/hpcjschu/opt/dash-0.3.0/bin/dash/test/mpi/dash-test-mpi(dart__base__locality__domain__filter_subdomains+0x326)[0x7fcb1c]
/zhome/academic/HLRS/hlrs/hpcjschu/opt/dash-0.3.0/bin/dash/test/mpi/dash-test-mpi(dart__base__locality__domain__filter_subdomains+0x326)[0x7fcb1c]
/zhome/academic/HLRS/hlrs/hpcjschu/opt/dash-0.3.0/bin/dash/test/mpi/dash-test-mpi(dart__base__locality__domain__filter_subdomains+0x1e8)[0x7fc9de]
/zhome/academic/HLRS/hlrs/hpcjschu/opt/dash-0.3.0/bin/dash/test/mpi/dash-test-mpi(dart__base__locality__domain__filter_subdomains+0x1e8)[0x7fc9de]
/zhome/academic/HLRS/hlrs/hpcjschu/opt/dash-0.3.0/bin/dash/test/mpi/dash-test-mpi(dart__base__locality__domain__filter_subdomains+0x1e8)[0x7fc9de]
/zhome/academic/HLRS/hlrs/hpcjschu/opt/dash-0.3.0/bin/dash/test/mpi/dash-test-mpi(dart__base__locality__domain__filter_subdomains+0x326)[0x7fcb1c]
/zhome/academic/HLRS/hlrs/hpcjschu/opt/dash-0.3.0/bin/dash/test/mpi/dash-test-mpi(dart__base__locality__domain__filter_subdomains+0x1e8)[0x7fc9de]
/zhome/academic/HLRS/hlrs/hpcjschu/opt/dash-0.3.0/bin/dash/test/mpi/dash-test-mpi(dart__base__locality__domain__filter_subdomains+0x1e8)[0x7fc9de]
/zhome/academic/HLRS/hlrs/hpcjschu/opt/dash-0.3.0/bin/dash/test/mpi/dash-test-mpi(dart__base__locality__domain__filter_subdomains+0x1e8)[0x7fc9de]
/zhome/academic/HLRS/hlrs/hpcjschu/opt/dash-0.3.0/bin/dash/test/mpi/dash-test-mpi(dart__base__locality__domain__filter_subdomains+0x1e8)[0x7fc9de]
/zhome/academic/HLRS/hlrs/hpcjschu/opt/dash-0.3.0/bin/dash/test/mpi/dash-test-mpi[0x7f6c9d]
/zhome/academic/HLRS/hlrs/hpcjschu/opt/dash-0.3.0/bin/dash/test/mpi/dash-test-mpi[0x7f6c9d]
/zhome/academic/HLRS/hlrs/hpcjschu/opt/dash-0.3.0/bin/dash/test/mpi/dash-test-mpi[0x7f6c9d]
/zhome/academic/HLRS/hlrs/hpcjschu/opt/dash-0.3.0/bin/dash/test/mpi/dash-test-mpi(dart__base__locality__domain__filter_subdomains+0x1e8)[0x7fc9de]
/zhome/academic/HLRS/hlrs/hpcjschu/opt/dash-0.3.0/bin/dash/test/mpi/dash-test-mpi(dart_domain_exclude+0x29)[0x7f70d9]
/zhome/academic/HLRS/hlrs/hpcjschu/opt/dash-0.3.0/bin/dash/test/mpi/dash-test-mpi(dart_domain_exclude+0x29)[0x7f70d9]
/zhome/academic/HLRS/hlrs/hpcjschu/opt/dash-0.3.0/bin/dash/test/mpi/dash-test-mpi(dart_domain_exclude+0x29)[0x7f70d9]
/zhome/academic/HLRS/hlrs/hpcjschu/opt/dash-0.3.0/bin/dash/test/mpi/dash-test-mpi[0x7f6c9d]
/zhome/academic/HLRS/hlrs/hpcjschu/opt/dash-0.3.0/bin/dash/test/mpi/dash-test-mpi(dart__base__locality__domain__filter_subdomains+0x1e8)[0x7fc9de]
/zhome/academic/HLRS/hlrs/hpcjschu/opt/dash-0.3.0/bin/dash/test/mpi/dash-test-mpi(dart_domain_exclude+0x29)[0x7f70d9]
/zhome/academic/HLRS/hlrs/hpcjschu/opt/dash-0.3.0/bin/dash/test/mpi/dash-test-mpi(dart_domain_exclude+0x29)[0x7f70d9]
/zhome/academic/HLRS/hlrs/hpcjschu/opt/dash-0.3.0/bin/dash/test/mpi/dash-test-mpi(dart_domain_exclude+0x29)[0x7f70d9]
/zhome/academic/HLRS/hlrs/hpcjschu/opt/dash-0.3.0/bin/dash/test/mpi/dash-test-mpi[0x7f6c9d]
/zhome/academic/HLRS/hlrs/hpcjschu/opt/dash-0.3.0/bin/dash/test/mpi/dash-test-mpi(_ZN43DARTLocalityTest_ExcludeLocalityDomain_Test8TestBodyEv+0x3bb)[0x6bbf6d]
/zhome/academic/HLRS/hlrs/hpcjschu/opt/dash-0.3.0/bin/dash/test/mpi/dash-test-mpi(_ZN43DARTLocalityTest_ExcludeLocalityDomain_Test8TestBodyEv+0x3bb)[0x6bbf6d]
/zhome/academic/HLRS/hlrs/hpcjschu/opt/dash-0.3.0/bin/dash/test/mpi/dash-test-mpi(_ZN43DARTLocalityTest_ExcludeLocalityDomain_Test8TestBodyEv+0x3bb)[0x6bbf6d]
/zhome/academic/HLRS/hlrs/hpcjschu/opt/dash-0.3.0/bin/dash/test/mpi/dash-test-mpi(dart_domain_exclude+0x29)[0x7f70d9]
/zhome/academic/HLRS/hlrs/hpcjschu/opt/dash-0.3.0/bin/dash/test/mpi/dash-test-mpi(_ZN7testing8internal38HandleSehExceptionsInMethodIfSupportedINS_4TestEvEET0_PT_MS4_FS3_vEPKc+0x65)[0x7c6e2f]
/zhome/academic/HLRS/hlrs/hpcjschu/opt/dash-0.3.0/bin/dash/test/mpi/dash-test-mpi(_ZN7testing8internal38HandleSehExceptionsInMethodIfSupportedINS_4TestEvEET0_PT_MS4_FS3_vEPKc+0x65)[0x7c6e2f]
/zhome/academic/HLRS/hlrs/hpcjschu/opt/dash-0.3.0/bin/dash/test/mpi/dash-test-mpi(_ZN7testing8internal38HandleSehExceptionsInMethodIfSupportedINS_4TestEvEET0_PT_MS4_FS3_vEPKc+0x65)[0x7c6e2f]
/zhome/academic/HLRS/hlrs/hpcjschu/opt/dash-0.3.0/bin/dash/test/mpi/dash-test-mpi(_ZN43DARTLocalityTest_ExcludeLocalityDomain_Test8TestBodyEv+0x3bb)[0x6bbf6d]
/zhome/academic/HLRS/hlrs/hpcjschu/opt/dash-0.3.0/bin/dash/test/mpi/dash-test-mpi(_ZN7testing8internal35HandleExceptionsInMethodIfSupportedINS_4TestEvEET0_PT_MS4_FS3_vEPKc+0x4b)[0x7c221e]
/zhome/academic/HLRS/hlrs/hpcjschu/opt/dash-0.3.0/bin/dash/test/mpi/dash-test-mpi(_ZN7testing8internal35HandleExceptionsInMethodIfSupportedINS_4TestEvEET0_PT_MS4_FS3_vEPKc+0x4b)[0x7c221e]
/zhome/academic/HLRS/hlrs/hpcjschu/opt/dash-0.3.0/bin/dash/test/mpi/dash-test-mpi(_ZN7testing8internal35HandleExceptionsInMethodIfSupportedINS_4TestEvEET0_PT_MS4_FS3_vEPKc+0x4b)[0x7c221e]
/zhome/academic/HLRS/hlrs/hpcjschu/opt/dash-0.3.0/bin/dash/test/mpi/dash-test-mpi(_ZN7testing4Test3RunEv+0xd0)[0x7a91b8]
/zhome/academic/HLRS/hlrs/hpcjschu/opt/dash-0.3.0/bin/dash/test/mpi/dash-test-mpi(_ZN7testing8internal38HandleSehExceptionsInMethodIfSupportedINS_4TestEvEET0_PT_MS4_FS3_vEPKc+0x65)[0x7c6e2f]
/zhome/academic/HLRS/hlrs/hpcjschu/opt/dash-0.3.0/bin/dash/test/mpi/dash-test-mpi(_ZN7testing4Test3RunEv+0xd0)[0x7a91b8]
/zhome/academic/HLRS/hlrs/hpcjschu/opt/dash-0.3.0/bin/dash/test/mpi/dash-test-mpi(_ZN7testing4Test3RunEv+0xd0)[0x7a91b8]
/zhome/academic/HLRS/hlrs/hpcjschu/opt/dash-0.3.0/bin/dash/test/mpi/dash-test-mpi(_ZN7testing8TestInfo3RunEv+0x108)[0x7a9a1a]
/zhome/academic/HLRS/hlrs/hpcjschu/opt/dash-0.3.0/bin/dash/test/mpi/dash-test-mpi(_ZN7testing8TestInfo3RunEv+0x108)[0x7a9a1a]
/zhome/academic/HLRS/hlrs/hpcjschu/opt/dash-0.3.0/bin/dash/test/mpi/dash-test-mpi(_ZN7testing8internal35HandleExceptionsInMethodIfSupportedINS_4TestEvEET0_PT_MS4_FS3_vEPKc+0x4b)[0x7c221e]
/zhome/academic/HLRS/hlrs/hpcjschu/opt/dash-0.3.0/bin/dash/test/mpi/dash-test-mpi(_ZN7testing8TestInfo3RunEv+0x108)[0x7a9a1a]
/zhome/academic/HLRS/hlrs/hpcjschu/opt/dash-0.3.0/bin/dash/test/mpi/dash-test-mpi(_ZN7testing8TestCase3RunEv+0x101)[0x7aa035]
/zhome/academic/HLRS/hlrs/hpcjschu/opt/dash-0.3.0/bin/dash/test/mpi/dash-test-mpi(_ZN7testing8TestCase3RunEv+0x101)[0x7aa035]
/zhome/academic/HLRS/hlrs/hpcjschu/opt/dash-0.3.0/bin/dash/test/mpi/dash-test-mpi(_ZN7testing4Test3RunEv+0xd0)[0x7a91b8]
/zhome/academic/HLRS/hlrs/hpcjschu/opt/dash-0.3.0/bin/dash/test/mpi/dash-test-mpi(_ZN7testing8TestCase3RunEv+0x101)[0x7aa035]
/zhome/academic/HLRS/hlrs/hpcjschu/opt/dash-0.3.0/bin/dash/test/mpi/dash-test-mpi(_ZN7testing8internal12UnitTestImpl11RunAllTestsEv+0x2a9)[0x7b092b]
/zhome/academic/HLRS/hlrs/hpcjschu/opt/dash-0.3.0/bin/dash/test/mpi/dash-test-mpi(_ZN7testing8internal12UnitTestImpl11RunAllTestsEv+0x2a9)[0x7b092b]
/zhome/academic/HLRS/hlrs/hpcjschu/opt/dash-0.3.0/bin/dash/test/mpi/dash-test-mpi(_ZN7testing8TestInfo3RunEv+0x108)[0x7a9a1a]
/zhome/academic/HLRS/hlrs/hpcjschu/opt/dash-0.3.0/bin/dash/test/mpi/dash-test-mpi(_ZN7testing8internal12UnitTestImpl11RunAllTestsEv+0x2a9)[0x7b092b]
/zhome/academic/HLRS/hlrs/hpcjschu/opt/dash-0.3.0/bin/dash/test/mpi/dash-test-mpi(_ZN7testing8internal38HandleSehExceptionsInMethodIfSupportedINS0_12UnitTestImplEbEET0_PT_MS4_FS3_vEPKc+0x65)[0x7c7b5f]
/zhome/academic/HLRS/hlrs/hpcjschu/opt/dash-0.3.0/bin/dash/test/mpi/dash-test-mpi(_ZN7testing8internal38HandleSehExceptionsInMethodIfSupportedINS0_12UnitTestImplEbEET0_PT_MS4_FS3_vEPKc+0x65)[0x7c7b5f]
/zhome/academic/HLRS/hlrs/hpcjschu/opt/dash-0.3.0/bin/dash/test/mpi/dash-test-mpi(_ZN7testing8internal38HandleSehExceptionsInMethodIfSupportedINS0_12UnitTestImplEbEET0_PT_MS4_FS3_vEPKc+0x65)[0x7c7b5f]
/zhome/academic/HLRS/hlrs/hpcjschu/opt/dash-0.3.0/bin/dash/test/mpi/dash-test-mpi(_ZN7testing8TestCase3RunEv+0x101)[0x7aa035]
/zhome/academic/HLRS/hlrs/hpcjschu/opt/dash-0.3.0/bin/dash/test/mpi/dash-test-mpi(_ZN7testing8internal35HandleExceptionsInMethodIfSupportedINS0_12UnitTestImplEbEET0_PT_MS4_FS3_vEPKc+0x4b)[0x7c2de6]
/zhome/academic/HLRS/hlrs/hpcjschu/opt/dash-0.3.0/bin/dash/test/mpi/dash-test-mpi(_ZN7testing8internal35HandleExceptionsInMethodIfSupportedINS0_12UnitTestImplEbEET0_PT_MS4_FS3_vEPKc+0x4b)[0x7c2de6]
/zhome/academic/HLRS/hlrs/hpcjschu/opt/dash-0.3.0/bin/dash/test/mpi/dash-test-mpi(_ZN7testing8internal35HandleExceptionsInMethodIfSupportedINS0_12UnitTestImplEbEET0_PT_MS4_FS3_vEPKc+0x4b)[0x7c2de6]
/zhome/academic/HLRS/hlrs/hpcjschu/opt/dash-0.3.0/bin/dash/test/mpi/dash-test-mpi(_ZN7testing8internal12UnitTestImpl11RunAllTestsEv+0x2a9)[0x7b092b]
/zhome/academic/HLRS/hlrs/hpcjschu/opt/dash-0.3.0/bin/dash/test/mpi/dash-test-mpi(_ZN7testing8UnitTest3RunEv+0xa7)[0x7af665]
/zhome/academic/HLRS/hlrs/hpcjschu/opt/dash-0.3.0/bin/dash/test/mpi/dash-test-mpi(_ZN7testing8UnitTest3RunEv+0xa7)[0x7af665]
/zhome/academic/HLRS/hlrs/hpcjschu/opt/dash-0.3.0/bin/dash/test/mpi/dash-test-mpi(_Z13RUN_ALL_TESTSv+0x11)[0x61c717]
/zhome/academic/HLRS/hlrs/hpcjschu/opt/dash-0.3.0/bin/dash/test/mpi/dash-test-mpi(_Z13RUN_ALL_TESTSv+0x11)[0x61c717]
/zhome/academic/HLRS/hlrs/hpcjschu/opt/dash-0.3.0/bin/dash/test/mpi/dash-test-mpi(_ZN7testing8UnitTest3RunEv+0xa7)[0x7af665]
/zhome/academic/HLRS/hlrs/hpcjschu/opt/dash-0.3.0/bin/dash/test/mpi/dash-test-mpi(_ZN7testing8internal38HandleSehExceptionsInMethodIfSupportedINS0_12UnitTestImplEbEET0_PT_MS4_FS3_vEPKc+0x65)[0x7c7b5f]
/zhome/academic/HLRS/hlrs/hpcjschu/opt/dash-0.3.0/bin/dash/test/mpi/dash-test-mpi(main+0x1b4)[0x61c3d9]
/zhome/academic/HLRS/hlrs/hpcjschu/opt/dash-0.3.0/bin/dash/test/mpi/dash-test-mpi(main+0x1b4)[0x61c3d9]

For the last one, Valgrind reports (running on two nodes, ppn=4):

[=  0 LOG =]       DARTLocalityTest.h :  31 | ===> Running test case with 8 units ...
==6240== Invalid write of size 8
==6240==    at 0x4A08C33: memcpy (mc_replace_strmem.c:882)
==6240==    by 0x7FCA96: dart__base__locality__domain__filter_subdomains (domain_locality.c:449)
==6240==    by 0x7FC9DD: dart__base__locality__domain__filter_subdomains (domain_locality.c:430)
==6240==    by 0x7FC9DD: dart__base__locality__domain__filter_subdomains (domain_locality.c:430)
==6242==    by 0x7FCA96: dart__base__locality__domain__filter_subdomains (domain_locality.c:449)
==6240==    by 0x7F6C9C: dart__base__locality__domain_exclude_subdomains (locality.h:90)
==6240==    by 0x7F70D8: dart_domain_exclude (dart_locality.c:126)
==6240==    by 0x6BBF6C: DARTLocalityTest_ExcludeLocalityDomain_Test::TestBody() (DARTLocalityTest.cc:126)
==6240==    by 0x7C6E2E: void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) (in /zhome/academic/HLRS/hlrs/hpcjschu/src/dash/dash-development/build/bin/dash-test-mpi)
==6240==    by 0x7C221D: void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) (in /zhome/academic/HLRS/hlrs/hpcjschu/src/dash/dash-development/build/bin/dash-test-mpi)
==6240==    by 0x7A91B7: testing::Test::Run() (in /zhome/academic/HLRS/hlrs/hpcjschu/src/dash/dash-development/build/bin/dash-test-mpi)
==6240==    by 0x7A9A19: testing::TestInfo::Run() (in /zhome/academic/HLRS/hlrs/hpcjschu/src/dash/dash-development/build/bin/dash-test-mpi)
==6240==    by 0x7AA034: testing::TestCase::Run() (in /zhome/academic/HLRS/hlrs/hpcjschu/src/dash/dash-development/build/bin/dash-test-mpi)
==6240==  Address 0xe872bc0 is 0 bytes after a block of size 16 alloc'd
==6240==    at 0x4A06A2E: malloc (vg_replace_malloc.c:270)
==6240==    by 0x7FBDE2: dart__base__locality__domain__copy (domain_locality.c:144)
==6240==    by 0x7FC38C: dart__base__locality__domain__copy (domain_locality.c:186)
==6240==    by 0x7FC38C: dart__base__locality__domain__copy (domain_locality.c:186)
==6242==    by 0x7FBDE2: dart__base__locality__domain__copy (domain_locality.c:144)
==6242==    by 0x7FC38C: dart__base__locality__domain__copy (domain_locality.c:186)
==6240==    by 0x7F6BC8: dart__base__locality__clone_domain (locality.h:49)
==6240==    by 0x7F7010: dart_domain_clone (dart_locality.c:83)
==6240==    by 0x6BBD02: DARTLocalityTest_ExcludeLocalityDomain_Test::TestBody() (DARTLocalityTest.cc:117)
==6240==    by 0x7C6E2E: void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) (in /zhome/academic/HLRS/hlrs/hpcjschu/src/dash/dash-development/build/bin/dash-test-mpi)
==6240==    by 0x7C221D: void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) (in /zhome/academic/HLRS/hlrs/hpcjschu/src/dash/dash-development/build/bin/dash-test-mpi)
==6240==    by 0x7A91B7: testing::Test::Run() (in /zhome/academic/HLRS/hlrs/hpcjschu/src/dash/dash-development/build/bin/dash-test-mpi)
==6240==    by 0x7A9A19: testing::TestInfo::Run() (in /zhome/academic/HLRS/hlrs/hpcjschu/src/dash/dash-development/build/bin/dash-test-mpi)
==6240==    by 0x7AA034: testing::TestCase::Run() (in /zhome/academic/HLRS/hlrs/hpcjschu/src/dash/dash-development/build/bin/dash-test-mpi)
==6240==

Attaching full Valgrind log.
dash_locality_domain_vg.txt

Attaching full test run log.
dash_tests_mpich3.2.txt
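
One plausible reading of the Valgrind report above: the invalid write of size 8 starts exactly at the end of a 16-byte block, i.e. room for two 8-byte pointers, which is the usual signature of copying one more element into an array than was allocated for it. The sketch below shows that pattern with the array grown before the copy. `Node` and `append_children` are hypothetical names used for illustration only; this is not the DART code and not a confirmed diagnosis.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>

// Hypothetical container of child pointers, loosely modelled on a domain
// that owns an array of subdomains (NOT the DART data structure).
struct Node {
  void**      children     = nullptr;
  std::size_t num_children = 0;  // elements in use
  std::size_t capacity     = 0;  // elements allocated
};

// Appends `count` pointers from `src`. The array is grown first so that
// the memcpy never writes past the end of the allocation.
// Returns false if the allocation fails; the node is left unchanged then.
inline bool append_children(Node& n, void* const* src, std::size_t count) {
  if (count == 0) {
    return true;  // nothing to copy, avoid a zero-sized realloc/memcpy
  }
  const std::size_t needed = n.num_children + count;
  if (needed > n.capacity) {
    void** grown = static_cast<void**>(
        std::realloc(n.children, needed * sizeof(void*)));
    if (grown == nullptr) {
      return false;
    }
    n.children = grown;
    n.capacity = needed;
  }
  std::memcpy(n.children + n.num_children, src, count * sizeof(void*));
  n.num_children = needed;
  return true;
}
```

Keeping the capacity next to the element count makes the bound explicit, so a copy that exceeds the allocation triggers a reallocation instead of a write past the block.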

@fuchsto
Member

fuchsto commented Dec 1, 2016

@devreal Awesome, thank you!

So it's just a few singular test cases that are affected by the same bug in the new topology module.
Something about leader communication is broken when units are running on multiple nodes.

That helps a lot!

How did you configure / run the Valgrind test? Looks way better than with OpenMPI.

@devreal
Member Author

devreal commented Dec 20, 2016

Current development fails on the following test:

[  ERROR   ] [UNIT 0] in /zhome/academic/HLRS/hlrs/hpcjschu/src/dash/dash-development/dash/test/TeamTest.cc:79
      Expected: file_exists
      Which is: false
To be equal to: true
Unit 0
[...]
[  FAILED  ] TeamTest.SplitTeamSync
[  ERROR   ] [UNIT 0] in /zhome/academic/HLRS/hlrs/hpcjschu/src/dash/dash-development/dash/test/TestPrinter.h:145
Failed
Testcase failed at least on one unit
[==========] 2 tests from 1 test cases ran. (46 ms total)
[  FAILED  ] 1 tests, listed below
[  FAILED  ] TeamTest.SplitTeamSync

The test fails since dart_team_create is already called during initialization, i.e.:

#0  dart_team_create (teamid=0, group=0xbbc6b0, newteam=0x7fffffffba8c) at /zhome/academic/HLRS/hlrs/hpcjschu/src/dash/dash-development/dart-impl/mpi/src/dart_team_group.c:542
#1  0x0000000000836dad in dart__base__host_topology__update_module_locations (unit_mapping=0xb66930, topo=0xbba9c0) at /zhome/academic/HLRS/hlrs/hpcjschu/src/dash/dash-development/dart-impl/base/src/internal/host_topology.c:232
#2  0x0000000000839b26 in dart__base__host_topology__create (unit_mapping=0xb66930, host_topology=0x7fffffffc150) at /zhome/academic/HLRS/hlrs/hpcjschu/src/dash/dash-development/dart-impl/base/src/internal/host_topology.c:675
#3  0x000000000083c17b in dart__base__locality__create (team=0) at /zhome/academic/HLRS/hlrs/hpcjschu/src/dash/dash-development/dart-impl/base/src/locality.c:174
#4  0x000000000083baf7 in dart__base__locality__init () at /zhome/academic/HLRS/hlrs/hpcjschu/src/dash/dash-development/dart-impl/base/src/locality.c:77
#5  0x000000000082ea45 in dart__mpi__locality_init () at /zhome/academic/HLRS/hlrs/hpcjschu/src/dash/dash-development/dart-impl/mpi/src/dart_locality_priv.c:21
#6  0x000000000082d399 in dart_init (argc=0xb23930, argv=0xb23938) at /zhome/academic/HLRS/hlrs/hpcjschu/src/dash/dash-development/dart-impl/mpi/src/dart_initialization.c:262
#7  0x0000000000809223 in dash::init (argc=0xb23930, argv=0xb23938) at /zhome/academic/HLRS/hlrs/hpcjschu/src/dash/dash-development/dash/src/Init.cc:33

This increments the team counter in DART and thus the test team_core.dart_id() == 1 evaluates to false.
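
To illustrate the effect, here is a minimal sketch of a runtime that hands out team IDs from an increasing counter. `TeamRegistry`, `check_absolute` and `check_relative` are hypothetical names; the counter only mimics the behaviour described above and is not the DART implementation.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a runtime that assigns team IDs from a
// monotonically increasing counter (NOT the DART implementation).
class TeamRegistry {
public:
  // Returns the ID of a newly created team. ID 0 is taken to be the
  // team containing all units.
  std::int16_t create_team() { return ++last_id_; }

  // ID of the most recently created team.
  std::int16_t last_id() const { return last_id_; }

private:
  std::int16_t last_id_ = 0;
};

// Fragile check: assumes no team was created before the test ran.
inline bool check_absolute(std::int16_t new_team_id) {
  return new_team_id == 1;
}

// More robust check: only requires the new ID to lie beyond the last ID
// that was observed before the team was created.
inline bool check_relative(std::int16_t id_before, std::int16_t new_team_id) {
  return new_team_id > id_before;
}
```

If the runtime creates a team during initialization, the first team created by the test gets ID 2 in this model, so `check_absolute` fails while `check_relative` still holds.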

@fuchsto
Member

fuchsto commented Dec 20, 2016

@devreal Good catch!

The test case is a rather recent addition by @fmoessbauer.
Felix, could you provide a fix?
I'd suggest testing for something like team_core.dart_id() == prev_team_core_id.
Also, testing for the file fails when running on multiple nodes.
You can use the locality interface to skip the test in this case.

@fmoessbauer
Member

I will have a look at my test. Is it sufficient to just skip the test if it is executed on multiple nodes?
The file is necessary to check if the team barriers work correctly. IMO the only way to ensure this is to use some external state like a file.
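
A minimal sketch of such a skip condition, assuming the host name of every unit is already available as a list (in an MPI run it could be collected from `MPI_Get_processor_name` on each rank). `count_nodes` and `file_sync_test_applicable` are hypothetical helpers and do not use the DASH locality interface.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Counts the distinct host names among the per-unit host names.
inline std::size_t count_nodes(const std::vector<std::string>& unit_hosts) {
  const std::set<std::string> unique(unit_hosts.begin(), unit_hosts.end());
  return unique.size();
}

// The file-based check is reported in this thread to fail on multiple
// nodes, so it is only considered applicable if all units share one host.
inline bool file_sync_test_applicable(
    const std::vector<std::string>& unit_hosts) {
  return count_nodes(unit_hosts) == 1;
}
```

The test body would return early when `file_sync_test_applicable` is false, so the barrier check still runs on single-node setups.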

@fuchsto
Member

fuchsto commented Feb 28, 2017

@fmoessbauer Is this issue resolved by your PR #214?

@fmoessbauer
Member

Yes, the PR disables the unit test for runs on more than one node.

@fuchsto
Member

fuchsto commented Feb 28, 2017

Can we close this issue then?

@fuchsto fuchsto closed this as completed Mar 1, 2017