Fix memory leak for disk checkpointing #4020
base: master
I don't believe this is correct. Merely referencing
Can I just check how many degrees of freedom this is? I'm trying to work out how many vectors we must have leaked.
Firstly, I have identified memory leaks in two cases involving disk checkpointing:
Additionally, the
The following chart evaluates Case 2 on the master branch (black curve) and with a modification replacing
The red curve represents the case where no disk checkpointing is applied. Although changing

Next Steps
We have two options moving forward:
For Option 2 (this PR proposal), here are the results obtained:
These charts use the example added in issue #4014.
I believe this answer. Storing the
Moving to checkpoint schedules for everything is our aim in any event, so I am happy to always use it in this case. @sghelichkhani: the PR enables this in a way that shouldn't require any code changes for G-Adopt.
Perfect. I will convert it to a draft to run more tests, and I will reopen it for review as soon as possible.
…_mem_leak_diskcheckpointing
@Ig-dolci I have tried hard to reproduce your black curve, but I still see a big leak on
@dham For the graph I have above (and presumably what @Ig-dolci is plotting) I am using this reproducer https://github.com/g-adopt/g-adopt/blob/adjoint-memory/demos/mantle_convection/test/tester.py, where
I think what is missing is replacing the time loop from
Are you also using the extra garbage collection from dolfin-adjoint/pyadjoint#187?
Yes, I just noticed that I used it. Running garbage collection every 20 time steps decreased memory usage by almost 20% for this case.
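For anyone wanting to try the periodic collection mentioned above, here is a minimal sketch using only the standard library. The `run_time_loop` helper, its `step` callback, and the `collect_every` parameter are hypothetical names for illustration. This is not the code from dolfin-adjoint/pyadjoint#187.

```python
import gc


def run_time_loop(num_steps, step, collect_every=20):
    """Run `step(i)` for i = 1..num_steps, forcing a full garbage
    collection every `collect_every` steps.

    Returns the number of forced collections. All names here are
    hypothetical and only illustrate the pattern.
    """
    forced_collections = 0
    for i in range(1, num_steps + 1):
        step(i)  # one forward time step (placeholder callback)
        if i % collect_every == 0:
            # Reclaim objects trapped in reference cycles, which
            # reference counting alone cannot free.
            gc.collect()
            forced_collections += 1
    return forced_collections


# Example: 100 dummy steps with the 20-step interval mentioned above.
count = run_time_loop(100, lambda i: None, collect_every=20)
```

The interval trades collection overhead against peak memory, so the 20-step value from this thread is a starting point rather than a recommendation.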
Description
This PR seeks to fix the memory leak by calling the `function.name` method in the `CheckpointFunction` constructor. In addition, this PR proposes disk checkpointing using the `SingleDiskStorageSchedule` schedule and the pyadjoint checkpoint manager to execute the solvers. This code also enables the user to choose other schedules from `checkpoint_schedules` with disk storage.

There is a point to consider here: the `SingleDiskStorageSchedule` schedule stores only the adjoint dependency data. So we have to pay attention to cases that require all checkpoint data to be stored on disk at every time step, i.e., both the data required to restart the forward solver and the data required for the adjoint computations. The `SingleDiskStorageSchedule` schedule does not support that.

Observation:
The tests will pass once pyadjoint PR 195 is merged.
I will wait for PR 4023 to be merged before opening this PR.
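As an illustration of why calling `name` rather than merely referencing it can matter, the sketch below uses plain Python and no Firedrake code. It assumes the leak comes from storing a bound method, which holds a reference to its owning object. `Field`, `LeakyCheckpoint`, and `LeanCheckpoint` are hypothetical stand-ins, not the actual `CheckpointFunction` implementation.

```python
import gc
import weakref


class Field:
    """Hypothetical stand-in for a large function object."""

    def __init__(self, label):
        self._label = label
        self.data = bytearray(10**6)  # pretend this is the DoF vector

    def name(self):
        return self._label


class LeakyCheckpoint:
    def __init__(self, field):
        # Stores the bound method, which keeps `field` alive.
        self.name = field.name


class LeanCheckpoint:
    def __init__(self, field):
        # Calls the method and stores only the resulting string.
        self.name = field.name()


def field_survives(checkpoint_cls):
    """Return True if the field is still alive after its last
    direct reference is dropped while the checkpoint exists."""
    field = Field("u")
    probe = weakref.ref(field)
    checkpoint = checkpoint_cls(field)  # kept alive until return
    del field
    gc.collect()
    return probe() is not None


leaky_alive = field_survives(LeakyCheckpoint)
lean_alive = field_survives(LeanCheckpoint)
```

Under this assumption, the field outlives its last direct reference only in the `LeakyCheckpoint` case, which is consistent with vectors accumulating in memory while checkpoints are retained.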