TeamLocalityTest: Subdomain tag too short #161

Open
devreal opened this issue Nov 24, 2016 · 8 comments

@devreal
Member

devreal commented Nov 24, 2016

It seems that there is a problem with the TeamLocalityTest. Valgrind reports:

[=   2  LOG =]       TeamLocalityTest.h :  30 | ===> Running test case with 4 units ... 
==6396== Conditional jump or move depends on uninitialised value(s)
==6396==    at 0x4C30A0A: __GI_strchr (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==6396==    by 0x6E4FD1: dart__base__locality__domain_group (locality.c:525)
==6396==    by 0x6B8174: dash::util::LocalityDomain::group(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&) (LocalityDomain.cc:256)
==6396==    by 0x62E3BE: group (TeamLocality.h:217)
==6396==    by 0x62E3BE: TeamLocalityTest_GroupUnits_Test::TestBody() (TeamLocalityTest.cc:182)
==6396==    by 0x6A773A: void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) (in /home/joseph/src/dash/dash/build/bin/dash-test-mpi)
==6396==    by 0x6A0D3E: void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) (in /home/joseph/src/dash/dash/build/bin/dash-test-mpi)
==6396==    by 0x68447B: testing::Test::Run() (in /home/joseph/src/dash/dash/build/bin/dash-test-mpi)
==6396==    by 0x684E13: testing::TestInfo::Run() (in /home/joseph/src/dash/dash/build/bin/dash-test-mpi)
==6396==    by 0x685506: testing::TestCase::Run() (in /home/joseph/src/dash/dash/build/bin/dash-test-mpi)
==6396==    by 0x68C64D: testing::internal::UnitTestImpl::RunAllTests() (in /home/joseph/src/dash/dash/build/bin/dash-test-mpi)
==6396==    by 0x6A8D12: bool testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) (in /home/joseph/src/dash/dash/build/bin/dash-test-mpi)
==6396==    by 0x6A1B80: bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) (in /home/joseph/src/dash/dash/build/bin/dash-test-mpi)

Adding some debug output to dart__base__locality__domain_group, I see that strlen(group_subdomain_tags[sd]) < (group_parent_domain_tag_len + 1):

[=   0  LOG =]       TeamLocalityTest.h :  30 | ===> Running test case with 4 units ... 
[=   3  LOG =]       TeamLocalityTest.h :  30 | ===> Running test case with 4 units ... 
[=   1  LOG =]       TeamLocalityTest.h :  30 | ===> Running test case with 4 units ... 
[=   2  LOG =]       TeamLocalityTest.h :  30 | ===> Running test case with 4 units ... 
[    0 ERROR ] [  7484 ] locality.c               :528  !!! DART: group_subdomain_tags[sd] = '.0.0.0.0.1.0.0' too short (required at least 15)
[    0 ERROR ] [  7484 ] locality.c               :528  !!! DART: group_subdomain_tags[sd] = '.0.0.0.0.1.0.0' too short (required at least 15)
[    0 ERROR ] [  7484 ] domain_locality.c        :174  !!! DART: dart__base__locality__domain__copy: domain .0.0.0.0.1.0.0 has num_domains = 0, expected domains = NULL

Note that (group_parent_domain_tag_len + 1) is 15 in this case, while group_subdomain_tags[sd] contains '.0.0.0.0.1.0.0', which is only 14 characters long.
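
To make the failing condition concrete: with a parent tag of length 14, the code looks at offset 15 of the subdomain tag, which lies past the terminator of a 14-character string. Below is a minimal sketch of a guard, assuming tags of the form used in this thread (parent ".0.1.2", children ".0.1.2.0", ...). The helper name relative_subdomain_tag is made up for illustration and is not the actual DART code; it does not handle a root tag.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of DART.
 * Returns a pointer to the part of subdomain_tag that follows the parent
 * tag and its '.' separator, or NULL if subdomain_tag is not a strict
 * descendant of parent_tag. Assumes tags like ".0.1.2" / ".0.1.2.0". */
static const char *relative_subdomain_tag(const char *parent_tag,
                                          const char *subdomain_tag)
{
    size_t parent_len = strlen(parent_tag);
    size_t sub_len    = strlen(subdomain_tag);

    /* A strict descendant needs the parent tag, a '.', and at least one
     * more character. Checking the length first keeps every later access
     * inside the string. */
    if (sub_len < parent_len + 2) {
        return NULL;
    }
    /* The subdomain tag must start with the full parent tag ... */
    if (strncmp(subdomain_tag, parent_tag, parent_len) != 0) {
        return NULL;
    }
    /* ... followed by a separator, so ".0.1.20.1" is not under ".0.1.2". */
    if (subdomain_tag[parent_len] != '.') {
        return NULL;
    }
    return subdomain_tag + parent_len + 1;
}
```

With parent and subdomain both set to '.0.0.0.0.1.0.0', this helper returns NULL instead of a pointer past the end of the string, so a caller could log an error and skip the strchr call.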

devreal added this to the dash-0.3.0 milestone Nov 24, 2016
fuchsto added the bug label Nov 24, 2016
@fuchsto
Member

fuchsto commented Dec 1, 2016

The root cause of this defect seems to be invalid parameters.
The parent domain tag (e.g. ".0.1.2") must be shorter than any tag of its children (".0.1.2.0", ".0.1.2.1", ...).
Checking domain tag parameters before executing the grouping operation would be too expensive, though.

When multiple units are mapped to the same domain (which is an invalid configuration we can't avoid in CI), the test case TeamLocalityTest.GroupUnits picks invalid domains to group.
This explanation is in line with your log output where the domain tag '.0.0.0.0.1.0.0' is added twice.

I added a check and error log message for stability (branch bug-161-group-domains) and now try to reproduce this behavior to rule out any other cause.
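
The duplicate-tag case can be illustrated with a small sketch. It assumes the parent tag is taken as the longest common prefix of the subdomain tags, cut at a '.' boundary. This thread does not confirm that DART derives the parent this way, and common_parent_tag is a hypothetical name used only for illustration.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of DART.
 * Writes the longest common parent tag of `tags` into `out` (capacity
 * `cap`) and returns its length, or -1 if there is no input or `out`
 * is too small. */
static int common_parent_tag(const char *const *tags, size_t ntags,
                             char *out, size_t cap)
{
    if (ntags == 0 || cap == 0) {
        return -1;
    }
    /* Character-wise common prefix length over all tags. */
    size_t n = strlen(tags[0]);
    for (size_t i = 1; i < ntags; i++) {
        size_t k = 0;
        while (k < n && tags[i][k] != '\0' && tags[i][k] == tags[0][k]) {
            k++;
        }
        n = k;
    }
    /* The prefix only names a domain if it ends at a component boundary
     * in every tag, i.e. the next character is '.' or the terminator. */
    int at_boundary = 1;
    for (size_t i = 0; i < ntags; i++) {
        char c = tags[i][n];
        if (c != '\0' && c != '.') {
            at_boundary = 0;
        }
    }
    if (!at_boundary) {
        /* Back up to just after the previous '.' separator. */
        while (n > 0 && tags[0][n - 1] != '.') {
            n--;
        }
    }
    /* Drop a trailing separator so ".0.1.2." becomes ".0.1.2". */
    while (n > 0 && tags[0][n - 1] == '.') {
        n--;
    }
    if (n + 1 > cap) {
        return -1;
    }
    memcpy(out, tags[0], n);
    out[n] = '\0';
    return (int)n;
}
```

For ".0.1.2.0" and ".0.1.2.1" this yields ".0.1.2", which is shorter than both children. If '.0.0.0.0.1.0.0' is passed twice, the result is the tag itself (length 14), so a child would need at least 15 characters, which matches the number in the error message above.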

@fuchsto
Member

fuchsto commented Dec 7, 2016

@devreal Can you still reproduce this? The use case you reported should be fixed in #177.

@devreal
Member Author

devreal commented Dec 8, 2016

Just tested with latest development and it still crashes when using 8 units. I am attaching the Valgrind log of one of the processes, which still shows invalid writes in the domain handling.
dash_locality_domain.vg.25710.txt

@fuchsto
Member

fuchsto commented Dec 8, 2016

Ok, thank you! (... but ... whhhyyyyy ... ?)

How did you configure / call Valgrind? Is it the bundled memcheck of OpenMPI?
On my clusters, Valgrind reports don't look sensible, let alone pretty.

@devreal
Member Author

devreal commented Dec 8, 2016

Right now, things seem to fall apart completely again: No matter the test I run, I get a SIGSEGV during initialization if I have debug output enabled in DART:

==16364== Invalid read of size 1
==16364==    at 0x3C99047DEC: vfprintf (in /lib64/libc-2.12.so)
==16364==    by 0x3C9906F711: vsnprintf (in /lib64/libc-2.12.so)
==16364==    by 0x3C9904F1E2: snprintf (in /lib64/libc-2.12.so)
==16364==    by 0x94F82C: dart__base__host_topology__update_module_locations (host_topology.c:288)
==16364==    by 0x9547CF: dart__base__host_topology__create (host_topology.c:675)
==16364==    by 0x9592FA: dart__base__locality__create (locality.c:174)
==16364==    by 0x958A8E: dart__base__locality__init (locality.c:77)
==16364==    by 0x93F4E7: dart__mpi__locality_init (dart_locality_priv.c:21)
==16364==    by 0x93C1A2: dart_init (dart_initialization.c:262)
==16364==    by 0x90391D: dash::init(int*, char***) (Init.cc:33)
==16364==    by 0x7A92FA: DARTOnesidedTest::SetUp() (DARTOnesidedTest.h:26)
==16364==    by 0x8F730A: void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) (in /zhome/academic/HLRS/hlrs/hpcjschu/src/dash/dash-development/build/bin/dash-test-mpi)
==16364==  Address 0x1 is not stack'd, malloc'd or (recently) free'd
==16364==

I have no idea why things go wrong there. The recvcounts array is used in dart_allgather but I cannot spot a mistake there. The use of MPI_IN_PLACE seems OK to me. I spent most of my day debugging this, and it occurs with both MPI 1.10.3 and the latest nightly build of 2.x, running on 2 nodes with 4 units each.

@devreal
Member Author

devreal commented Dec 8, 2016

Not sure if this is related, but I also see these error messages on unit 0:

#### Starting test on unit 0 (n042902 PID: 19042)
[    0 ERROR ] [ 19042 ] locality.c               :530  !!! DART: dart__base__locality__domain_group ! group subdomain .1.0.3 with invalid parent domain .1.0.3
[    0 ERROR ] [ 19042 ] LocalityDomain.cc        :262  | dash::exception::AssertionFailed             | [ Unit 0 ] Assertion failed: Expected 0 /zhome/academic/HLRS/hlrs/hpcjschu/src/dash/dash-development/dash/src/util/LocalityDomain.cc:262 
[    0 ERROR ] [ 19042 ] locality.c               :530  !!! DART: dart__base__locality__domain_group ! group subdomain .1.0.1 with invalid parent domain .1.0.1
[    0 ERROR ] [ 19042 ] LocalityDomain.cc        :262  | dash::exception::AssertionFailed             | [ Unit 0 ] Assertion failed: Expected 0 /zhome/academic/HLRS/hlrs/hpcjschu/src/dash/dash-development/dash/src/util/LocalityDomain.cc:262 

@devreal
Member Author

devreal commented Dec 20, 2016

This issue is still open and blocks at least #61 and #56.

@fuchsto
Member

fuchsto commented Dec 20, 2016

On it!
