-
Notifications
You must be signed in to change notification settings - Fork 170
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement cudax::async_mdarray
#3095
base: main
Are you sure you want to change the base?
Conversation
cudax::stream stream{}; | ||
Env env{Resource{}, stream}; | ||
|
||
SECTION("Construction with explicit size") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pciolkosz : We should also consider CTAD and an make_mdarray
function
🟨 CI finished in 1h 53m: Pass: 98%/168 | Total: 1d 15h | Avg: 14m 03s | Max: 1h 31m | Hits: 10%/22034
|
Project | |
---|---|
CCCL Infrastructure | |
+/- | libcu++ |
CUB | |
Thrust | |
+/- | CUDA Experimental |
python | |
CCCL C Parallel Library | |
Catch2Helper |
Modifications in project or dependencies?
Project | |
---|---|
CCCL Infrastructure | |
+/- | libcu++ |
+/- | CUB |
+/- | Thrust |
+/- | CUDA Experimental |
+/- | python |
+/- | CCCL C Parallel Library |
+/- | Catch2Helper |
🏃 Runner counts (total jobs: 168)
# | Runner |
---|---|
124 | linux-amd64-cpu16 |
19 | linux-amd64-gpu-v100-latest-1 |
15 | windows-amd64-cpu16 |
10 | linux-arm64-cpu16 |
//! | ||
//! @endrst | ||
//! @tparam _Tp The underlying type of the elements the \c heterogeneous_iterator points at. | ||
//! @tparam _IsConst Boolean, if false the \c heterogeneous_iterator allows mutating the element pointed to. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we consider making this an enum class
instead of a bool
, so that users can say (e.g.,) heterogeneous_iterator<T, access::read_write, Properties...>
? This would help make code more self-documenting and avoid possible confusion with other template parameters in Properties...
.
"mdspan's Extents template parameter must be a specialization of _CUDA_VSTD::extents."); | ||
|
||
// At least one of the properties must signal an execution space | ||
static_assert(_CUDA_VMR::__contains_execution_space_property<_Properties...>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a suggestion: Kokkos::View
also needs to extract properties from a parameter pack. It takes the approach of inheriting from a "ViewTraits
" class that extracts out the needed properties and does compile-time error checking. Taking this approach in mdarray
would have two advantages.
- It would process the parameter pack all in one place, which might improve build times.
- It might would make the class definition shorter and easier to read.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can definitely move that out later, but for the PoC I would keep it as is
? (__is_host_only ? cudaMemcpyHostToHost : cudaMemcpyHostToDevice) | ||
: (__is_host_only ? cudaMemcpyDeviceToHost : cudaMemcpyDeviceToDevice); | ||
|
||
//! @brief Helper to return an async_resource_ref to the currently used resource. Used to grow the async_mdarray |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Regarding "[u]sed to grow the async_mdarray
," are you planning to make async_mdarray
resizable after construction?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a remnant from copying the code from vector.
Currently if we want to be able to assign one mdarray to another we need the ability to reallocate storage because they might have different sizes / memory resources
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I would say we do not want to make it resizable but want to be able t assign
{ | ||
if (__other.size() != 0) | ||
{ | ||
this->__copy_same(__other.__unwrapped_begin(), __other.__unwrapped_end(), __unwrapped_begin()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the layout is not exhaustive, then this will copy elements that do not belong to the input. If the layout is not unique, then this may copy elements multiple times. While creating an mdarray
with a nonexhaustive layout is a bit weird, nothing here prevents users from doing that.
In general, mdarray
(as defined in P1684) cannot be implemented without an mdspan copy algorithm.
|
||
//! @brief Constructs an empty async_mdarray using an environment | ||
//! @param __env The environment providing the needed information | ||
//! @note No memory is allocated. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If "[n]o memory is allocated," then does this mean that the extents are all zero? If so, then what would happen if all extents are compile-time constants, e.g., extents<int, 3, 4>
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So currently all constructors are taken from the one dimensional vector case.
In case we have fully static dimensions we should SFINAE that constructor away
//! @param __mr The memory resource to allocate the async_mdarray with. | ||
//! @param __size The size of the async_mdarray. Defaults to zero | ||
//! @note If `__size == 0` then no memory is allocated. | ||
_CCCL_HIDE_FROM_ABI explicit async_mdarray(const __env_t& __env, const size_type __size) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does it mean to initialize a multidimensional array with a single integer size
?
Note that layouts don't have to be exhaustive or unique, so we may have required_span_size()
(the actual required allocation size) not equal to size()
(the product of the extents).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Btw, I'm guessing that this is a left-over constructor from implementing an "async_vector
."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a constructor that only initializes a one dimensional mdspan with dynamic extents
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Normally there wouldn't be a separate constructor for the rank-1 case. The constructor would just take a constrained pack of index types, like this.
template<class... OtherIndexTypes>
require(
(std::is_convertible_v<OtherIndexTypes, index_type> && ...) &&
(std::is_nothrow_constructible_v<index_type, OtherIndexTypes> && ...) &&
std::is_constructible_v<extents_type, OtherIndexTypes...> &&
std::is_constructible_v<mapping_type, extents_type>
)
explicit constexpr
async_mdarray(const __ent_t& __env, OtherIndexTypes... exts);
//! @param __mr The memory resource to allocate the async_mdarray with. | ||
//! @param __ilist The initializer_list being copied into the async_mdarray. | ||
//! @note If `__ilist.size() == 0` then no memory is allocated | ||
_CCCL_HIDE_FROM_ABI async_mdarray(const __env_t& __env, _CUDA_VSTD::initializer_list<_Tp> __ilist) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we want construction from an initializer_list
, should we consider construction from a rank()
-nested initializer_list
, so that
a. users would not need to guess what a flat initializer_list<T>
means for a multidimensional array (e.g., does its order depend on the layout?);
b. users could express multidimensional arrays' initial values directly in C++ code; and
c. we could provide a deduction guide that deduces the rank automatically?
Please see the relevant section of P3308R0, "Consider adding construction from nested initializer list,", for details.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am a bit doubtfull we want that before it has made its way into the standard.
Also I am sceptical if this is possible with dynamic extents due to potential conflicts with list initialization of the stored type.
On the other hand direct-list initialization like here is super useful on its own, so I would definitely want to keep that in
|
||
//! @} | ||
|
||
//! @addtogroup iterators |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-
This iterator range does not coincide with the set of elements of the mdarray if the layout mapping is nonexhaustive.
-
There's no single definition of an iterator range if the layout mapping is nonunique. It depends on whether users want to do read-only access or read-and-write access.
-
The P1684 mdarray design does not include iterators at all. Therefore, given this and the complexities of defining an iterator range for general layouts, should we consider just not providing
begin
andend
?
//! @{ | ||
//! @brief Returns a reference to the \p __n 'th element of the async_mdarray | ||
//! @param __n The index of the element we want to access | ||
_CCCL_NODISCARD _CCCL_HIDE_FROM_ABI reference operator[](const size_type __n) noexcept |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks left over from an async_vector
implementation.
//! @brief Replaces the stored stream | ||
//! @param __new_stream the new stream | ||
//! @note Always synchronizes with the old stream | ||
_CCCL_HIDE_FROM_ABI constexpr void change_stream(::cuda::stream_ref __new_stream) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changing the stream after construction feels like changing the allocator after construction -- it's not something that one can do to vector
, and it makes reasoning about code that takes async_mdarray&
difficult.
Should we consider an alternate design, that makes it easier to move from this async_mdarray
into a new async_mdarray
with a different stream?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe that is different than changing the allocator. The stream is just something that tells the container where future work happens.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- (Re)allocation and future work are coupled through copy assignment.
std::vector
has the invariant that it always has the same allocator throughout its lifetime.- If (re)allocation and future work are coupled, then should we also consider adding the invariant that a container always has the same allocator and stream throughout its lifetime?
- Move construction is relatively cheap and can preserve existing allocations and values.
- Therefore, should we consider a design in which users can "change" the allocator and/or stream by move-constructing a new container with the desired allocator and/or stream? This would change the allocator and/or stream for future operations; the current allocation wouldn't necessarily change.
Here's an example of (5).
env e{allocator, stream}; // whatever the syntax may be
async_mdarray<float, std::dims<2, int>> A{env, num_rows, num_cols};
// ... code ...
async_mdarray<float, std::dims<2, int>> B = /* from somewhere */;
// ... more code ...
env e2{new_allocator, new_stream}; // either or both may change
async_mdarray<float, std::dims<2, int>> A2{e2, std::move(A)};
// Assignment from B uses A2's new allocator and new stream.
A2 = B;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- yes that is true but I do not see any issue here
- That is not correct, there is a whole machinery in allocator_traits that manages how allocators behave during assignment / construction. If you assign a vector that is not always equal you may need to reallocate
- I still dont see how future work and reallocation are coupled. We can happily change streams as long as it is ensured that the streams are properly synchronized
- We currently only allow the user to change the stream (and the execution policy). As long as every work is in proper stream order there is no reason to put additional constraints in.
In general we really want the ability to change the stream of a container, because we never know where we are getting it from. It can realistically happen that we get 3 containers allocated and prepared on different streams and we want to launch an algorithm that takes all of them. In that case we do not only want to synchronize but also ensure that the stream that is internally used is the right one.
Otherwise every operation on the container would need to take a stream as a required argument and we really do not want that
//! @brief Replaces the currently used execution policy | ||
//! @param __new_policy the new policy | ||
_CCCL_HIDE_FROM_ABI constexpr void set_execution_policy(__policy_t __new_policy) noexcept | ||
{ | ||
__policy_ = __new_policy; | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changing the execution policy or stream after construction feels like changing the allocator after construction -- it's not something that one can do to vector
, and it makes reasoning about code that takes async_mdarray
& difficult.
Should we consider an alternate design, that makes it easier to move from this async_mdarray into a new async_mdarray with a different execution policy?
//! @brief Replaces the currently used execution policy | |
//! @param __new_policy the new policy | |
_CCCL_HIDE_FROM_ABI constexpr void set_execution_policy(__policy_t __new_policy) noexcept | |
{ | |
__policy_ = __new_policy; | |
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not follow you here.
The execution policy determines where future operations on the mdarray happen. In what regard would that be problematic?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wrote this above; I'll just repeat it here for clarity.
- (Re)allocation and future work are coupled through copy assignment.
std::vector
has the invariant that it always has the same allocator throughout its lifetime.- If (re)allocation and future work are coupled, then should we also consider adding the invariant that a container always has the same allocator and stream throughout its lifetime?
- Move construction is relatively cheap and can preserve existing allocations and values.
- Therefore, should we consider a design in which users can "change" the allocator and/or stream by move-constructing a new container with the desired allocator and/or stream? This would change the allocator and/or stream for future operations; the current allocation wouldn't necessarily change.
Here's an example of (5).
env e{allocator, stream}; // whatever the syntax may be
async_mdarray<float, std::dims<2>> A{env, num_rows, num_cols};
// ... code ...
async_mdarray<float, std::dims<2>> B = /* from somewhere */;
// ... more code ...
env e2{new_allocator, new_stream}; // either or both may change
async_mdarray<float, std::dims<2>> A2{e2, std::move(A)};
// Assignment from B uses A2's new allocator and new stream.
A2 = B;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would change the allocator and/or stream for future operations; the current allocation wouldn't necessarily change.
This would be like changing the scheduler in std::execution: it affects things that happen after the change. I find this to be a reasonable interpretation of the name "async_mdarray
."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The execution policy determines where future operations on the mdarray happen. In what regard would that be problematic?
One feature of std::vector
is that I can't change its allocator after construction. In async_mdarray
, copy assignment couples (re)allocation and asynchronous execution (copying or fill). In a hypothetical async_vector
, both copy assignment and resizing would couple allocation and asynchronous execution. (Re)allocation also might depend on the stream, so users might reasonably expect that if they could change the stream, they could also change the allocator.
Seeing the above comments from Mark I was wondering if it wouldn't be better to start from his |
CHECK(vec.data() == nullptr); | ||
} | ||
|
||
{ // from env and size, no alllocation |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We also want a constructor that takes an mdspan
using MatchingResource = typename extract_properties<TestType>::matching_resource; | ||
Env matching_env{MatchingResource{resource}, stream}; | ||
|
||
SECTION("cudax::async_mdarray construction with matching async_mdarray") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We also want explicit conversion to mdspan
//! @brief Replaces the stored stream | ||
//! @param __new_stream the new stream | ||
//! @note Always synchronizes with the old stream | ||
_CCCL_HIDE_FROM_ABI constexpr void change_stream(::cuda::stream_ref __new_stream) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe that is different than changing the allocator. The stream is just something that tells the container where future work happens.
48595bb
to
cf41313
Compare
This is a derivation of the current `mdarray` proposal. To cater to our heterogeneous use cases we require an environment to be passed to the constructor of the mdspan. This replaces the container template argument in the current proposal. In contrast to `cudax::async_vector` `cudax::async_mdarray` does not offer any APIs that change it size.
cf41313
to
91b059a
Compare
🟨 CI finished in 1h 44m: Pass: 98%/168 | Total: 1d 18h | Avg: 15m 00s | Max: 1h 36m | Hits: 13%/22094
|
Project | |
---|---|
CCCL Infrastructure | |
+/- | libcu++ |
CUB | |
Thrust | |
+/- | CUDA Experimental |
python | |
CCCL C Parallel Library | |
Catch2Helper |
Modifications in project or dependencies?
Project | |
---|---|
CCCL Infrastructure | |
+/- | libcu++ |
+/- | CUB |
+/- | Thrust |
+/- | CUDA Experimental |
+/- | python |
+/- | CCCL C Parallel Library |
+/- | Catch2Helper |
🏃 Runner counts (total jobs: 168)
# | Runner |
---|---|
124 | linux-amd64-cpu16 |
19 | linux-amd64-gpu-v100-latest-1 |
15 | windows-amd64-cpu16 |
10 | linux-arm64-cpu16 |
🟨 CI finished in 1h 17m: Pass: 98%/168 | Total: 1d 00h | Avg: 8m 47s | Max: 36m 38s | Hits: 91%/22428
|
Project | |
---|---|
CCCL Infrastructure | |
+/- | libcu++ |
CUB | |
Thrust | |
+/- | CUDA Experimental |
python | |
CCCL C Parallel Library | |
Catch2Helper |
Modifications in project or dependencies?
Project | |
---|---|
CCCL Infrastructure | |
+/- | libcu++ |
+/- | CUB |
+/- | Thrust |
+/- | CUDA Experimental |
+/- | python |
+/- | CCCL C Parallel Library |
+/- | Catch2Helper |
🏃 Runner counts (total jobs: 168)
# | Runner |
---|---|
124 | linux-amd64-cpu16 |
19 | linux-amd64-gpu-v100-latest-1 |
15 | windows-amd64-cpu16 |
10 | linux-arm64-cpu16 |
This is a derivation of the current
mdarray
proposal.To cater to our heterogeneous use cases we require an environment to be passed to the constructor of the mdspan.
This replaces the container template argument in the current proposal. In contrast to
cudax::async_vector
cudax::async_mdarray
does not offer any APIs that change it size.