Heapsize limit, Device memory management in general #183

ian-p-johnson · 2022-03-24T01:27:04Z

ian-p-johnson
Mar 24, 2022

Maybe I missed something in the documentation but I am having problems accessing more on device heap than 2GB (on my GTX 1070 8GB) By default it seems to be 1GB and i can increase it to 2GB using -Dtornado.heap.allocation=2GB but all values above 2GB are effectively capped at 2GB. I have confirmed the amount of data I am using and i get an error if i use more than 2GB

Is there any way i can use the full 8GB on the GTX 1070

Ubuntu 21.10 (Impish)
Driver: 510.54
CUDA: 11.6
OpenJdk-11

I am looking to use it for trading strategy optimisation and i hoped to upload chunks of source artefacts (ticks, bars, indicators etc) and then strategy parameter sets, outputting trade histories or at minimal summaries. I was hoping to fill the available GPU memory with an optimal data set and then call tasks to submit strategy parameter sets and pull back responses, effectively streaming until I ran out of source data (and then perhaps attempt to overwrite old unused data if i can't free it on the device, before submitting more parameter sets)

Can i interlace a number of TaskSchedule.execute, calling each in turn (say to upload data, upload work*, download results available*, upload more data, submit more work, etc) I don't see any threading primitives so is there an advised way to managing queing on the GPU (barriers etc), or ensuring that blocks of results have been consistently written to GPU memory (so i can use a flag to indicate completion) At the moment i can only imagine managing this on the client and reusing GPU heap space when no longer needed

When i stream a value out, is the GPU memory recovered?
How do a close a "session", releasing all GPU resources - without closing my client VM - so i can start again with a new strategy/data set

(I see some stuff in TornadoDevice ensureAllocated/Present, enqueueBarrier etc but with no guide on how to use them safely, also some goodies in examples/memory & examples.MultipleTasks) - i can guess, but i'd rather not)

There is a general lack of documentation for memory management on the GPU

ian-p-johnson · 2022-03-24T01:57:26Z

ian-p-johnson
Mar 24, 2022
Author

How do i use TaskSchedule.syncObjects, updateReference (no javadoc) and other interesting looking stuff ?

TornadoAPI. Why is streamIn sometimes called, and other times not - when in both cases the data is used, what is forceCopyIn

Is execute synchronous ?

How to use KernelContext. allocate*, *Barrier in general (some clues in examples.kernelcontext.reductions is see)

1 reply

jjfumero Mar 24, 2022
Maintainer

How do i use TaskSchedule.syncObjects, updateReference (no javadoc) and other interesting looking stuff ?

For sync objects, see my previous answer. For updateReference, this is a special case in which we want to copy a new array (new reference). not just streamIn the data that contains the array. This is useful when we resize the generated application. The reason for this call is that TornadoVM performs a sort of partial evaluation, and it specializes and compiles per input application (materializing the loop bounds, for example) to trigger more optimizations, such as loop unrolling.
Recently, we added support to tune the thread deployment (for advanced users), in a similar way to OpenCL or CUDA. Since TornadoVM compiles using the loop-bounds, if the threads blocks are reconfigured, we would need to resize the application (the input data). To do this, users can updateReference the data and this will trigger a new specialized compilation with the new loop-bounds.

TornadoAPI. Why is streamIn sometimes called, and other times not - when in both cases the data is used, what is forceCopyIn

streamIn sends data (host->device) every time the execute method is invoked. This is for streaming applications in which the kernels are the same but taking different set of data. When the streamIn is not present, TornadoVM will cache data. This means that, the first time the execute method is invoked, TornadoVM sends data (host -> device). However, from the second time, if the task-schedule is executed again, data is not sent from host to device, it is cached. As a take away, read-only data should not use streamIn.

forceCopyIn is just for the dynamic reconfiguration, which uses the same API as the user API. We will remove this call from this place since it leads to errors.

Is execute synchronous ?

Yes

How to use KernelContext. allocate*, *Barrier in general (some clues in examples.kernelcontext.reductions is see)

There are two main APIs in TornadoVM that will trigger the JIT compiler:

a) Loop Parallel API: with annotations with @Parallel: The application is managed automatically by TornadoVM. There is no local memory, barriers, etc, exposed to developers.

b) Kernel Parallel API (KernelContext): similar to OpenCL and CUDA. This is for expert and advanced users who want more control over resources. Note that allocate will perform allocation on shared memory (CUDA terminology) or local memory (OpenCL term) of the device. Using this API, developers can access local memory, synchronization primitives, thread ids, etc.

Note that, in many cases, the first choice (using @Parallel), can perform better than the Kernel API. This is because it will trigger more optimizations, and usually the thread-block heuristic is good compared to a manual choice.

jjfumero · 2022-03-24T07:01:11Z

jjfumero
Mar 24, 2022
Maintainer

Hi @ian-p-johnson , thank you for your feedback, answers below:

Question:

I am having problems accessing more on device heap than 2GB (on my GTX 1070 8GB) By default it seems to be 1GB and i can increase it to 2GB using -Dtornado.heap.allocation=2GB but all values above 2GB are effectively capped at 2GB.

Answer:
I assume you are using the OpenCL backend. TornadoVM, when targeting the OpenCL C code, it runs with the OpenCL runtime.
Currently, TornadoVM tries to mimic the Java heap on the accelerator. At the bootstrap, TornadoVM allocates a big heap in a single buffer. However, in OpenCL, there is a limitation for the max allocation of a single buffer (which is approximately 1/4 of the max capacity of the GPU, or any accelerator).

Quoting from the OpenCL SPEC:

“CL_DEVICE_MAX_MEM_ALLOC_SIZE - Max size of memory object allocation in bytes. The minimum value is max (1/4th of CL_DEVICE_GLOBAL_MEM_SIZE, 12810241024) for devices that are not of type CL_DEVICE_TYPE_CUSTOM.”

Full discussion here: https://forums.developer.nvidia.com/t/why-is-cl-device-max-mem-alloc-size-never-larger-than-25-of-cl-device-global-mem-size-only-on-nvidia/47745

This limitation does not exist in the PTX Backend, and Level Zero exceeds the 25% limitation capacity. So, you can try those backends.

In TornadoVM, we enabled batch processing using the batch("size") to allow batch execution on the device using more memory on the host side than on the device.

https://github.com/beehive-lab/TornadoVM/blob/master/unittests/src/main/java/uk/ac/manchester/tornado/unittests/batches/TestBatches.java

Having said that, we are currently improving the device memory management in TornadoVM to better utilize device's memory. Taking as an example the memory management implemented in the Marawacc runtime:

Question:

I am looking to use it for trading strategy optimisation and i hoped to upload chunks of source artefacts (ticks, bars, indicators etc) and then strategy parameter sets, outputting trade histories or at minimal summaries. I was hoping to fill the available GPU memory with an optimal data set and then call tasks to submit strategy parameter sets and pull back responses, effectively streaming until I ran out of source data (and then perhaps attempt to overwrite old unused data if i can't free it on the device, before submitting more parameter sets)

Can i interlace a number of TaskSchedule.execute, calling each in turn (say to upload data, upload work*, download results available*, upload more data, submit more work, etc) I don't see any threading primitives so is there an advised way to managing queing on the GPU (barriers etc), or ensuring that blocks of results have been consistently written to GPU memory (so i can use a flag to indicate completion) At the moment i can only imagine managing this on the client and reusing GPU heap space when no longer needed

Answer:
All data management is fully handled by TornadoVM. In TornadoVM there are some calls to force copies:

For instance:

ts.streamIn(vars).task(....);
ts.execute();

// Force CopyOut
ts.syncObject(output);

API Call: https://github.com/beehive-lab/TornadoVM/blob/master/tornado-api/src/main/java/uk/ac/manchester/tornado/api/TaskSchedule.java#L364

Examples: https://github.com/beehive-lab/TornadoVM/blob/master/benchmarks/src/main/java/uk/ac/manchester/tornado/benchmarks/blackscholes/BlackScholesTornado.java#L90

In essence, TornadoVM will only copies data when the execute method is invoked. There, TornadoVM assigns input buffers to specific areas of the device heap, performs the data copies (host -> device), compiles the kernels (from Java bytecode to <OpenCL,PTX,SPIRV>), deploys the generated kernel/s on the accelerator, and copies back the result (device->host). The execute method is blocking.

At the user level, when invoking the streamIn/streamOut methods, there are actually no data transfers involved at this point. It only indicates the TornadoVM runtime which buffers should be copy In and Out when running the execute method. Note that there is an exception to this. As my previous example shows, the execute method was before the streamOut. This is ideal for applications that run in an streaming fashion on the device, in which the user can force copy out when needed (and save data transfer time).

Question:

When i stream a value out, is the GPU memory recovered? How do a close a "session", releasing all GPU resources - without closing my client VM - so i can start again with a new strategy/data set

Answer:
In the current implementation, these resources are not freed after the execution, it is cached. However there is a way to flush:

for (int i = 0; i < TornadoRuntime.getTornadoRuntime().getNumDrivers(); i++) {
    final TornadoDriver driver = TornadoRuntime.getTornadoRuntime().getDriver(i);
    for (int j = 0; j < driver.getDeviceCount(); j++) {
          driver.getDevice(j).reset();
    }
}

With the new model we are working on, this will be automatic within TornadoVM.

Question:

(I see some stuff in TornadoDevice ensureAllocated/Present, enqueueBarrier etc but with no guide on how to use them safely, also some goodies in examples/memory & examples.MultipleTasks) - i can guess, but i'd rather not)

Answer:
This corresponds to the TornadoVM bytecode interpreter implementation. They are not exposed to the user. For a full description you can check this publication:

https://github.com/jjfumero/jjfumero.github.io/blob/master/files/VEE2019_Fumero_Preprint.pdf

Hope this helps. Please let's know what to clarify.

0 replies

ian-p-johnson · 2022-03-24T14:02:44Z

ian-p-johnson
Mar 24, 2022
Author

Is there any way i can use the full 8GB on the GTX 1070
This limitation does not exist in the PTX Backend, and Level Zero exceeds the 25% limitation capacity. So, you can try those backends.

PTX worked perfectly - I just backed off a little from the max to leave space for code etc but I am now managing to use 1_044_000_000 x 2 floats using tornado -Xmx10G -Dtornado.heap.allocation=7970MB which takes it to around 8110MiB / 8192MiB on nvidia-smi

I'll take a look through the rest of the stuff you sent now - thx

Great project BTW

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Heapsize limit, Device memory management in general #183

{{title}}

Replies: 3 comments 1 reply

{{title}}

{{title}}

{{title}}

{{title}}

Select a reply

Heapsize limit, Device memory management in general #183

ian-p-johnson Mar 24, 2022

Replies: 3 comments · 1 reply

ian-p-johnson Mar 24, 2022 Author

jjfumero Mar 24, 2022 Maintainer

jjfumero Mar 24, 2022 Maintainer

ian-p-johnson Mar 24, 2022 Author

ian-p-johnson
Mar 24, 2022

Replies: 3 comments 1 reply

ian-p-johnson
Mar 24, 2022
Author

jjfumero Mar 24, 2022
Maintainer

jjfumero
Mar 24, 2022
Maintainer

ian-p-johnson
Mar 24, 2022
Author