[BREAKING] Rust representation of GPU memory re-worked #412

Closed
wants to merge 8 commits into main

Conversation

DmytroTym (Contributor) commented Mar 1, 2024

Overview of the PR

Currently, we use a pretty awkward representation of on-device memory. This PR replaces it with a more idiomatic pair: DeviceVec (which is not really a vector but a boxed slice), which allocates, owns and deallocates device memory, and DeviceSlice, which provides mutable and immutable views into device memory.
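To make the ownership split concrete, here is a minimal host-only sketch of the pattern (the `Box<[T]>` backing, the `len` helper and the construction in `main` are illustrative stand-ins, not the CUDA-backed types this PR adds):

```rust
use std::ops::Deref;

/// Non-owning view into device memory, analogous to `&[T]`: dropping it never frees anything.
#[repr(transparent)]
pub struct DeviceSlice<T>([T]);

/// Owning handle, analogous to `Box<[T]>`: allocates once, frees exactly once on drop,
/// and hands out `DeviceSlice` views through `Deref`.
pub struct DeviceVec<T>(Box<[T]>); // stand-in backing; the real type wraps a raw device pointer

impl<T> Deref for DeviceVec<T> {
    type Target = DeviceSlice<T>;
    fn deref(&self) -> &DeviceSlice<T> {
        // Same trick std uses for Path/OsStr: reinterpret the slice as its transparent wrapper.
        unsafe { &*(&*self.0 as *const [T] as *const DeviceSlice<T>) }
    }
}

impl<T> DeviceSlice<T> {
    pub fn len(&self) -> usize {
        self.0.len()
    }
}

fn main() {
    let owned: DeviceVec<u32> = DeviceVec(vec![0u32; 8].into_boxed_slice());
    let view: &DeviceSlice<u32> = &owned; // borrow: no new allocation, no extra free
    assert_eq!(view.len(), 8);
}
```

The point of the split is that only `DeviceVec` carries allocation responsibility, while `DeviceSlice` behaves like a borrow.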

@ChickenLover I also changed vector operations a little bit: Montgomery conversion is removed, as it worked differently for multiplication versus addition/subtraction, and device id checks have been added. Also, I wasn't sure how to change the Poseidon-related data that is currently held in raw slices; is it correct to say that we should use DeviceVec there (which should help with the memory leak, at least on the Rust side), and that the digest should be a host slice?

Unresolved questions

One potential improvement for H2D/D2H memory operations would be to pin memory inside the HostSlice::from_slice method. But I'm still not sure how effective pinned memory is on modern GPUs, and since we don't really own the host data it's hard to make sure it gets pinned and unpinned exactly once (which may or may not be a real issue). If someone has a good understanding of pinned memory, please comment. Alternatively, we could simply provide a flag so the user can decide whether or not to pin memory.
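For reference, here is a rough sketch of what an opt-in pinning guard could look like, assuming direct FFI to the CUDA runtime's `cudaHostRegister`/`cudaHostUnregister`; the `PinnedGuard` type and its API are hypothetical, not something this PR adds. Tying register/unregister to an RAII guard is one way to get the "pinned and unpinned exactly once" property without owning the host data:

```rust
use std::ffi::c_void;

// Minimal FFI to the CUDA runtime's page-locking calls (requires linking against cudart).
extern "C" {
    fn cudaHostRegister(ptr: *mut c_void, size: usize, flags: u32) -> i32;
    fn cudaHostUnregister(ptr: *mut c_void) -> i32;
}

const CUDA_HOST_REGISTER_DEFAULT: u32 = 0;

/// RAII guard that pins a borrowed host slice for the duration of the borrow,
/// guaranteeing register/unregister each happen exactly once.
pub struct PinnedGuard<'a, T> {
    slice: &'a [T],
}

impl<'a, T> PinnedGuard<'a, T> {
    pub fn new(slice: &'a [T]) -> Result<Self, i32> {
        let bytes = std::mem::size_of_val(slice);
        let err = unsafe {
            cudaHostRegister(slice.as_ptr() as *mut c_void, bytes, CUDA_HOST_REGISTER_DEFAULT)
        };
        if err == 0 { Ok(Self { slice }) } else { Err(err) }
    }

    pub fn as_slice(&self) -> &[T] {
        self.slice
    }
}

impl<'a, T> Drop for PinnedGuard<'a, T> {
    fn drop(&mut self) {
        // Unpin exactly once when the guard goes out of scope.
        unsafe { cudaHostUnregister(self.slice.as_ptr() as *mut c_void) };
    }
}
```

Whether the registration cost pays off for a given transfer size would still need benchmarking on the target GPUs.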

@DmytroTym DmytroTym self-assigned this Mar 3, 2024
@jeremyfelder jeremyfelder changed the base branch from dev to main March 4, 2024 09:13
@DmytroTym DmytroTym marked this pull request as ready for review March 5, 2024 09:08
Comment on lines +268 to +276
pub fn cuda_malloc_for_device(count: usize, device_id: usize) -> CudaResult<Self> {
check_device(device_id);
Self::cuda_malloc(count)
}

pub fn cuda_malloc_async_for_device(count: usize, stream: &CudaStream, device_id: usize) -> CudaResult<Self> {
check_device(device_id);
Self::cuda_malloc_async(count, stream)
}
Collaborator
What's the use case here for passing a specific device_id if we require it to match the current device's id?

Contributor Author

This version is preferred by @vhnatyk over an implicit choice of device id in malloc.

Contributor

tl;dr: it's not that it's preferred, it just seems aligned with everything else. Longer version: the check ensures we maintain a correct device_id through our calls, since we don't have implicit device_id management anywhere except here (left over from my legacy WIP implementation and missed in review?). We could consider fully implicit, or rather automated, device_id management derived from calling get_device() on the current thread, but since we already have the field in DeviceContext, that would make it redundant and feels like it could cause multiple bugs. I probably should have provided a basic description in the doc, but I somehow hoped it would emerge from everyone's edits since it was discussed so intensively :) I still have to update the doc with the current version of the API and diagrams.

Collaborator

I think internal management of device_id by thread makes the most sense. DeviceContext could also be set this way, using get_device(). It might be more bug-prone on our end, but I think it will be less error-prone on the user end, which we don't have control over.
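For illustration, fully implicit per-thread device management could be built directly on the CUDA runtime's `cudaGetDevice`/`cudaSetDevice`, which are already per-thread; the helper names below are hypothetical, not an API proposed in this PR:

```rust
// Minimal FFI to the CUDA runtime's per-thread device selection (requires linking against cudart).
extern "C" {
    fn cudaGetDevice(device: *mut i32) -> i32;
    fn cudaSetDevice(device: i32) -> i32;
}

/// Returns the device currently bound to the calling thread.
pub fn current_device() -> Result<i32, i32> {
    let mut device = 0i32;
    let err = unsafe { cudaGetDevice(&mut device) };
    if err == 0 { Ok(device) } else { Err(err) }
}

/// Binds `device_id` to the calling thread, so subsequent allocations and
/// kernel launches implicitly target it instead of threading the id through every call.
pub fn set_current_device(device_id: i32) -> Result<(), i32> {
    let err = unsafe { cudaSetDevice(device_id) };
    if err == 0 { Ok(()) } else { Err(err) }
}
```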

@@ -205,9 +227,8 @@ impl<T, const D_ID: usize> DeviceSlice<T, D_ID> {
     }
 }

-impl<T, const D_ID: usize> DeviceVec<T, D_ID> {
+impl<T> DeviceVec<T> {
     pub fn cuda_malloc(count: usize) -> CudaResult<Self> {
Collaborator

Should this and cuda_malloc_async be private now that we have cuda_malloc_for_device<_async>?

Contributor Author

I don't know... If you're only using one card, there's no need for the for_device version of these methods. I personally would use cuda_malloc_async in multi-device settings too.

vhnatyk (Contributor) left a comment

Looks good, just one minor thing: move the similar error strings to a const. The PR is a pretty big one 😄 - maybe it's worth one more quick look.

wrappers/rust/icicle-core/src/ntt/tests.rs (resolved)
if let Some(device_id) = input.device_id() {
    assert_eq!(
        device_id, ctx_device_id,
        "Device ids in input and context are different"
    );
}
Contributor

Maybe worth making this a string const.

alxiong (Contributor) commented Mar 7, 2024

I like the distinction made between DeviceVec and DeviceSlice, and the usage of ManuallyDrop.

Would like to confirm one thing:

The updated msm() API now accepts either a DeviceVec or a DeviceSlice, so I just want to make sure the following flow won't cause a double free:

  • allocate a DeviceVec for some base points (say 2^20)
  • get a DeviceSlice for a sub-slice of it (say 2^15 out of it)
  • pass this slice to msm(), and for some reason the CUDA operation panics and fails (say due to GPU out of memory)

(Currently I'm using mem::forget() on the subslice so its destructor never runs, to prevent double-freeing. But double-freeing would still be a risk if msm() failed in the middle, before I got a chance to call forget(), since currently there's no difference between a subslice on device and the original owned vec on device, so both have the same cudaFree logic in their destructor.
So I just want to double-check that the changes introduced here obviate my concern, since DeviceSlice's destructor won't do anything, correct?)

DmytroTym (Contributor Author)
@alxiong yeah, the new functions accept either a DeviceSlice, a HostSlice or a DeviceVec. The latter is accepted via Deref to DeviceSlice (similar to how other Rust functions that accept slices can be used with vectors). And like normal Rust slices, DeviceSlice doesn't own any data, so it won't call cudaFree; only DeviceVec frees data :)
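To illustrate the destructor point, here is a host-only sketch in which only the owner "frees" (an atomic counter stands in for `cudaFree`; the types, the `range` method and the `msm` stub are illustrative, not the real icicle implementations):

```rust
use std::ops::Deref;
use std::sync::atomic::{AtomicUsize, Ordering};

static FREES: AtomicUsize = AtomicUsize::new(0);

/// Non-owning view: it has no Drop impl, so holding or dropping it
/// (even across a panic) can never trigger a second free.
#[repr(transparent)]
struct DeviceSlice([u64]);

/// Owning handle: "frees" its allocation exactly once, on drop.
struct DeviceVec(Box<[u64]>);

impl Drop for DeviceVec {
    fn drop(&mut self) {
        FREES.fetch_add(1, Ordering::SeqCst); // stand-in for cudaFree
    }
}

impl Deref for DeviceVec {
    type Target = DeviceSlice;
    fn deref(&self) -> &DeviceSlice {
        unsafe { &*(&*self.0 as *const [u64] as *const DeviceSlice) }
    }
}

impl DeviceSlice {
    /// Borrow a sub-range as another non-owning view.
    fn range(&self, start: usize, end: usize) -> &DeviceSlice {
        unsafe { &*(&self.0[start..end] as *const [u64] as *const DeviceSlice) }
    }
}

fn msm(_bases: &DeviceSlice) { /* even if this failed mid-way, the borrow frees nothing */ }

fn main() {
    {
        let bases = DeviceVec(vec![0u64; 1 << 20].into_boxed_slice());
        msm(bases.range(0, 1 << 15)); // sub-view is just a borrow, no mem::forget needed
    } // the single free happens here, in DeviceVec::drop only
    assert_eq!(FREES.load(Ordering::SeqCst), 1);
}
```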

jeremyfelder (Collaborator)

Closing in favor of #443, which includes these changes.
