Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dash::Matrix local column and row access and copy #3

Closed
fuchsto opened this issue Oct 12, 2016 · 10 comments
Closed

dash::Matrix local column and row access and copy #3

fuchsto opened this issue Oct 12, 2016 · 10 comments

Comments

@fuchsto
Copy link
Member

fuchsto commented Oct 12, 2016

I am trying to run a simple test using dash::Matrix local row() and col() views to copy a subset of a 2D matrix into a std::vector. I am making the following observations:

  1. I can copy the whole local part of the Matrix into the local vector using dash::copy.

  2. I can get a row using matrix.local.row(0) and correctly print its content on unit 0. However it fails with a std::out_out_range exception on unit 1. I later realized that when using the operator[] instead of operator() it works as expected. There seems to be a problem with the operator().

  3. I am trying to copy a local row into a local std::vector like this:

{
    size_t team_size = dash::Team::All().size();
    size_t myid = dash::Team::All().myid();

    dash::TeamSpec<2> teamspec_2d(team_size, 1);
    teamspec_2d.balance_extents();

    dash::SizeSpec<2> sspec(20, 20);
    dash::DistributionSpec<2> dspec(dash::BLOCKED, dash::BLOCKED);

    dash::Matrix<double, 2> matrix(sspec,dspec, dash::Team::All(), teamspec_2d);

    initialize(matrix);
    dash::barrier();

    auto row = matrix.local.row(0);
    std::vector<double> tmp(row.end() - row.begin());
    dash::copy(row.begin(), row.end(), tmp.data());
}

This code causes memory to be corrupted, e.g., on unit 0:

*** Error in `/home/joseph/src/dash/workspace_2dheat/2dheat/dash_row_test': free(): invalid next size (fast): 0x00000000008de4c0 ***
[...]
Thread 1 "dash_row_test" received signal SIGABRT, Aborted.
0x00007ffff686a418 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54
54      ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0  0x00007ffff686a418 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54
#1  0x00007ffff686c01a in __GI_abort () at abort.c:89
#2  0x00007ffff68ac72a in __libc_message (do_abort=do_abort@entry=2, fmt=fmt@entry=0x7ffff69c56b0 "*** Error in `%s': %s: 0x%s ***\n") at ../sysdeps/posix/libc_fatal.c:175
#3  0x00007ffff68b4f4a in malloc_printerr (ar_ptr=, ptr=, str=0x7ffff69c5728 "free(): invalid next size (fast)", action=3) at malloc.c:5007
#4  _int_free (av=, p=, have_lock=0) at malloc.c:3868
#5  0x00007ffff68b8abc in __GI___libc_free (mem=) at malloc.c:2969
#6  0x0000000000405d32 in main (argc=1, argv=0x7fffffffd9d8) at dash_row_test.cc:119
(gdb) 

Valgrind says:

==793== Invalid write of size 8
==793==    at 0x4C323E3: memcpy@GLIBC_2.2.5 (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==793==    by 0x4120A7: double* std::__copy_move::__copy_m(double const*, double const*, double*) (stl_algobase.h:384)
==793==    by 0x410C17: double* std::__copy_move_a(double*, double*, double*) (stl_algobase.h:402)
==793==    by 0x40E102: double* std::__copy_move_a2(double*, double*, double*) (stl_algobase.h:440)
==793==    by 0x40997D: double* std::copy(double*, double*, double*) (stl_algobase.h:472)
==793==    by 0x407736: double* dash::copy, dash::GlobMem >, dash::GlobPtr >, dash::GlobRef > >(dash::GlobViewIter, dash::GlobMem >, dash::GlobPtr >, dash::GlobRef >, dash::GlobViewIter, dash::GlobMem >, dash::GlobPtr >, dash::GlobRef >, double*) (Copy.h:880)
==793==    by 0x405D14: main (dash_row_test.cc:118)
==793==  Address 0x1a18a870 is 0 bytes after a block of size 80 alloc'd
==793==    at 0x4C2E80F: operator new[](unsigned long) (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==793==    by 0x405C2E: main (dash_row_test.cc:116)

Since I am new to DASH, I cannot rule out that the code itself is incorrect. However, looking at the API it looks plausible to me.

  1. The same is true when trying to copy a local column into a local vector (exchanging test 2 and 3 in the code). Note that in this case the access through operator() fails in all cases.

I ran all my tests using

$ mpirun -n 2 ./dash_row_test

I am attaching my test case. Please let me know if you think that the test case is malformed, I am happy to correct it. Please also let me know if you need additional information.

@fuchsto fuchsto added the bug label Oct 12, 2016
@fuchsto fuchsto added this to the dash-0.3.0 milestone Oct 12, 2016
@fuchsto fuchsto self-assigned this Oct 12, 2016
@fuchsto
Copy link
Member Author

fuchsto commented Oct 30, 2016

Created branch bug-3-mat-row-copy

@devreal
Copy link
Member

devreal commented Oct 31, 2016

I have played around with the test and it seems that the number of elements to be copied is incorrectly determine.

dash::copy reports that 381 will be copied:

[ 0 TRACE ] [ 27306 ] Copy.h :913 | dash::copy | copy local subrange num_copy_elem: 381
At this point, Valgrind reports the following out-of-bounds write:

==27306== Invalid write of size 8
==27306==    at 0x4C323E3: memcpy@GLIBC_2.2.5 (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==27306==    by 0x45F955: __copy_m<double> (stl_algobase.h:384)
==27306==    by 0x45F955: __copy_move_a<false, double*, double*> (stl_algobase.h:402)
==27306==    by 0x45F955: __copy_move_a2<false, double*, double*> (stl_algobase.h:440)
==27306==    by 0x45F955: copy<double*, double*> (stl_algobase.h:472)
==27306==    by 0x45F955: double* dash::copy<double, dash::GlobViewIter<double, dash::TilePattern<2, (dash::MemArrange)1, long>, dash::GlobMem<double, dash::allocator::CollectiveAllocator<double> >, dash::GlobPtr<double, dash::TilePattern<2, (dash::MemArrange)1, long> >, dash::GlobRef<double> > >(dash::GlobViewIter<double, dash::TilePattern<2, (dash::MemArrange)1, long>, dash::GlobMem<double, dash::allocator::CollectiveAllocator<double> >, dash::GlobPtr<double, dash::TilePattern<2, (dash::MemArrange)1, long> >, dash::GlobRef<double> >, dash::GlobViewIter<double, dash::TilePattern<2, (dash::MemArrange)1, long>, dash::GlobMem<double, dash::allocator::CollectiveAllocator<double> >, dash::GlobPtr<double, dash::TilePattern<2, (dash::MemArrange)1, long> >, dash::GlobRef<double> >, double*) (Copy.h:921)
==27306==    by 0x4518C2: main (dash_row_test.cc:121)
==27306==  Address 0x10f58c00 is 0 bytes after a block of size 160 alloc'd
==27306==    at 0x4C2E80F: operator new[](unsigned long) (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==27306==    by 0x45179C: main (dash_row_test.cc:119)
==27306==

The number of elements in a row should be 20 and I have verified that the target vector is large enough to hold 20 elements. However, I am not sure where the number of 381 comes from...

@fuchsto
Copy link
Member Author

fuchsto commented Nov 2, 2016

Thank you, probably a bug in dash::local _range.I'm about to extend and formalize this function interface anyways this week, I will add your failing use cases as unit tests and assign you for review once it's ready.

@fuchsto
Copy link
Member Author

fuchsto commented Nov 15, 2016

The defect could be easily reproduced and has been added as test case MatrixTest.CopyRow.
Fixed in pull request #105

@fuchsto
Copy link
Member Author

fuchsto commented Nov 15, 2016

@devreal Please close this issue after review.
Relevant part of the diff starts here:
https://github.com/dash-project/dash/pull/105/files#diff-3727b32cf9caed9a261d978fb1a5c2ce

@devreal
Copy link
Member

devreal commented Nov 15, 2016

My test case for copying a row succeeds now. However, trying to copy a column the same way still does not wok. Consider the following example (similar to the example above):

{
    size_t team_size = dash::Team::All().size();
    size_t myid = dash::Team::All().myid();

    dash::TeamSpec<2> teamspec_2d(team_size, 1);
    teamspec_2d.balance_extents();

    dash::SizeSpec<2> sspec(20, 20);
    dash::DistributionSpec<2> dspec(dash::BLOCKED, dash::BLOCKED);

    dash::Matrix<double, 2> matrix(sspec,dspec, dash::Team::All(), teamspec_2d);

    initialize(matrix);
    dash::barrier();

    auto col = matrix.local.col(0);
    auto num_elements = std::distance(col.begin(), col.end());
    std::vector<double> tmp(num_elements);
    dash::copy(row.begin(), row.end(), tmp.data());
}

Here is the log and the valgrind message:

[...]
[    0 TRACE ] [ 13032 ] Copy.h                   :875  | dash::copy                                   | global it. range of local subrange: begin: 0 end: 181
[    0 TRACE ] [ 13032 ] Copy.h                   :876  | dash::copy                                   | |- g_l_in_last.pos(): 181
[    0 TRACE ] [ 13032 ] GlobViewIter.h           :445  | GlobViewIter.pos()                           | idx: 10 vs_offset: 0
[    0 TRACE ] [ 13032 ] Copy.h                   :880  | dash::copy                                   | |- num_prelocal_elem: 0
[    0 TRACE ] [ 13032 ] Copy.h                   :881  | dash::copy                                   | |- num_postlocal_elem: -171
[    0 TRACE ] [ 13032 ] GlobIter.h               :325  | GlobIter.local=()                            | |- _idx: 0
[    0 TRACE ] [ 13032 ] GlobIter.h               :330  | GlobIter.local=                              | |- _max_idx: 399
[    0 TRACE ] [ 13032 ] GlobIter.h               :338  | GlobIter.local=                              | |- idx: 0
[    0 TRACE ] [ 13032 ] GlobIter.h               :339  | GlobIter.local=                              | |- offset: 0
[    0 TRACE ] [ 13032 ] TilePattern.h            :741  | TilePattern.local()                          | |- g_index: 0
[    0 TRACE ] [ 13032 ] TilePattern.h            :776  | TilePattern.local_index()                    | |- global_coords: { 0,0 }
[    0 TRACE ] [ 13032 ] TilePattern.h            :782  | TilePattern.local_index                      | |- l_coords: { 0,0 }
[    0 TRACE ] [ 13032 ] TilePattern.h            :783  | TilePattern.local_index                      | |- unit: 0
[    0 TRACE ] [ 13032 ] TilePattern.h            :678  | TilePattern.local_at()                       | local coords: { 0,0 } local blocks: { 1,1 }
[    0 TRACE ] [ 13032 ] TilePattern.h            :694  | TilePattern.local_at                         | local_coords: { 0,0 } local blocks: { 1,1 } local block coords: { 0,0 } block size: { 10,20 } phase coords: { 0,0 }
[    0 TRACE ] [ 13032 ] TilePattern.h            :700  | TilePattern.local_at >                       | |- local_index: 0
[    0 TRACE ] [ 13032 ] TilePattern.h            :816  | TilePattern.local_index >                    | |- l_index: 0
[    0 TRACE ] [ 13032 ] GlobIter.h               :342  | GlobIter.local= >                            | |- local_pos.unit: 0
[    0 TRACE ] [ 13032 ] GlobIter.h               :343  | GlobIter.local= >                            | |- local_pos.index: 0
[    0 TRACE ] [ 13032 ] Copy.h                   :913  | dash::copy                                   | |- l_in_first: 0x402a0a8
[    0 TRACE ] [ 13032 ] Copy.h                   :914  | dash::copy                                   | |- l_in_last: 0x402a650
[    0 TRACE ] [ 13032 ] Copy.h                   :921  | dash::copy                                   | copy local subrange num_copy_elem: 181
==13032== Invalid write of size 8
==13032==    at 0x4C323E3: memcpy@GLIBC_2.2.5 (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==13032==    by 0x45F2B1: __copy_m<double> (stl_algobase.h:384)
==13032==    by 0x45F2B1: __copy_move_a<false, double*, double*> (stl_algobase.h:402)
==13032==    by 0x45F2B1: __copy_move_a2<false, double*, double*> (stl_algobase.h:440)
==13032==    by 0x45F2B1: copy<double*, double*> (stl_algobase.h:472)
==13032==    by 0x45F2B1: double* dash::copy<double, dash::GlobViewIter<double, dash::TilePattern<2, (dash::MemArrange)1, long>, dash::GlobMem<double, dash::allocator::CollectiveAllocator<double> >, dash::GlobPtr<double, dash::TilePattern<2, (dash::MemArrange)1, long> >, dash::GlobRef<double> > >(dash::GlobViewIter<double, dash::TilePattern<2, (dash::MemArrange)1, long>, dash::GlobMem<double, dash::allocator::CollectiveAllocator<double> >, dash::GlobPtr<double, dash::TilePattern<2, (dash::MemArrange)1, long> >, dash::GlobRef<double> >, dash::GlobViewIter<double, dash::TilePattern<2, (dash::MemArrange)1, long>, dash::GlobMem<double, dash::allocator::CollectiveAllocator<double> >, dash::GlobPtr<double, dash::TilePattern<2, (dash::MemArrange)1, long> >, dash::GlobRef<double> >, double*) (Copy.h:929)
==13032==    by 0x4519CD: main (dash_row_test.cc:137)
==13032==  Address 0x1ac03ca0 is 0 bytes after a block of size 80 alloc'd
==13032==    at 0x4C2E0EF: operator new(unsigned long) (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==13032==    by 0x451925: allocate (new_allocator.h:104)
==13032==    by 0x451925: allocate (alloc_traits.h:491)
==13032==    by 0x451925: _M_allocate (stl_vector.h:170)
==13032==    by 0x451925: _M_create_storage (stl_vector.h:185)
==13032==    by 0x451925: _Vector_base (stl_vector.h:136)
==13032==    by 0x451925: vector (stl_vector.h:278)
==13032==    by 0x451925: main (dash_row_test.cc:136)
==13032==

The size of the column should be 20, not 181. Sorry for not making the column example explicit in my first post.

@fuchsto
Copy link
Member Author

fuchsto commented Nov 15, 2016

Yaaah, I'm kinda not surprised that this fails, there are commits missing. Hold on ...

@fuchsto
Copy link
Member Author

fuchsto commented Nov 15, 2016

Ah, no, it's a packing / strided copy issue. Will add more info later today.

@fuchsto
Copy link
Member Author

fuchsto commented Nov 15, 2016

So, it's trickier than that.
Slicing columns from a matrix with row-major storage order is fine, but copying this slice is not.
Strictly speaking, this is not a bug, it's a missing feature: DASH does not support strided copying yet.
That's why this is not covered in unit tests.

[ 0 1 2 3 ]  <- row contiguous in memory   | 8 | 4 | 0 |  <- column is strided, requires packing
[ 4 5 6 7 ]                                | 9 | 5 | 1 |     into temporary buffer [ 0 4 8 ] or
[ 8 9 0 1 ]        -- rot 90° -->          | 0 | 6 | 2 |     single copy operations for every
                                           | 1 | 7 | 3 |     value

So, good news is that it's supposed to fail, but an error message would be more polite than, say, a core dump.

But I think there are some members in our team at LMU that can hardly wait to prove their grit and
implement this.

EDIT: Ah, actually this is a feature in the stencil / halo functions. @dhinf, wanna dance?

@fuchsto
Copy link
Member Author

fuchsto commented Nov 17, 2016

Closing this bug issue, missing feature requested in issue #116

@fuchsto fuchsto closed this as completed Nov 17, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants