
Fix memory leak for disk checkpointing #4020

Open · Ig-dolci wants to merge 25 commits into master from dolci/fix_mem_leak_diskcheckpointing
Conversation

@Ig-dolci (Contributor) commented Feb 8, 2025

Description

This PR seeks to fix the memory leak by calling the function.name() method in the CheckpointFunction constructor and storing its result (a string) rather than the bound method itself. In addition, it proposes driving disk checkpointing through the SingleDiskStorageSchedule from checkpoint_schedules, with the pyadjoint checkpoint manager executing the solvers. The changes also let the user choose other schedules from checkpoint_schedules that use disk storage.
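For orientation, here is a minimal sketch of how these pieces might be wired together from user code. The import paths and call signatures are assumptions based on this thread (SingleDiskStorageSchedule from checkpoint_schedules, the pyadjoint tape's checkpointing hook, and Firedrake's enable_disk_checkpointing), not something this PR text specifies:

```python
from firedrake import *
from firedrake.adjoint import *
from checkpoint_schedules import SingleDiskStorageSchedule  # assumed import path

continue_annotation()
enable_disk_checkpointing()  # Firedrake's disk storage for checkpoint data

tape = get_working_tape()
# Hand pyadjoint's checkpoint manager a schedule that stores the adjoint
# dependency data on disk at every time step; any other disk-storage schedule
# from checkpoint_schedules could be passed here instead.
tape.enable_checkpointing(SingleDiskStorageSchedule())
# The forward time loop then has to be driven via tape.timestepper(...),
# as discussed later in this thread.
```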

There is a point to consider here:

  • SingleDiskStorageSchedule stores only the adjoint dependency data. We therefore have to pay attention to cases that require all checkpoint data to be stored on disk at every single time step, i.e. both the data required to restart the forward solver and the data required for the adjoint computations. SingleDiskStorageSchedule does not support that.

Observation:

  • The tests will pass once pyadjoint PR 195 is merged.

  • I will wait for PR 4023 to be merged before opening this PR for review.

@Ig-dolci linked an issue Feb 8, 2025 that may be closed by this pull request

github-actions bot commented Feb 8, 2025

| Tests | Ran | Passed ✅ | Skipped ⏭️ | Failed ❌ |
|---|---|---|---|---|
| Firedrake real | 8158 | 7449 | 709 | 0 |


github-actions bot commented Feb 8, 2025

| Tests | Ran | Passed ✅ | Skipped ⏭️ | Failed ❌ |
|---|---|---|---|---|
| Firedrake complex | 8313 | 6632 | 1681 | 0 |

@dham (Member) commented Feb 9, 2025

I don't believe this is correct. Merely referencing function.function_space() in CheckpointFunction.__init__ does not create persistent references to function. That isn't going to be the source of the memory leak.

@sghelichkhani (Contributor) commented

Just confirming that this doesn't change any of our reproducers. Also noting that the behaviour is slightly different on my mac and on our HPC system: unlike on my mac, the RSS seems to go flat after the first derivative (see image). I'm not sure why it behaves differently on the mac, but at any rate the memory usage goes higher with disk checkpointing.
(Screenshot 2025-02-10: RSS memory usage plot)

@dham (Member) commented Feb 10, 2025

Can I just check how many Degrees of Freedom this is? I'm trying to work out how many vectors we must have leaked.

@Ig-dolci (Contributor, Author) commented Feb 10, 2025

Firstly, I have identified memory leaks in two cases involving disk checkpointing:

  1. When using only enable_disk_checkpointing().
  2. When using enable_disk_checkpointing() along with checkpoint_schedules.

Additionally, function.function_space() is not the source of the leak. However, assigning self.name = function.name (where function.name is a bound method rather than a string) contributes to the leak in both cases.

The following chart evaluates Case 2 on the master branch (black curve) and with a modification replacing self.name = function.name with self.name = function.name() (blue curve):

(Figure 1: memory usage plot)

The red curve represents the case where no disk checkpointing is applied.

Although changing self.name = function.name to self.name = function.name() mitigates the issue in Case 2, it does not resolve the memory leak in Case 1. The cause of this leak is still unknown.
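For reference, a minimal sketch of the change being discussed; the constructor is abbreviated to the single assignment that matters here:

```python
class CheckpointFunction:
    def __init__(self, function, *args, **kwargs):
        # Before: self.name = function.name
        # Storing the bound method keeps `function` (and, through it, its
        # function space and mesh) reachable for as long as this checkpoint
        # object lives, so the Function is never garbage collected.
        #
        # After: call the method and keep only the resulting string, which
        # holds no reference back to the Function.
        self.name = function.name()
```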

Next Steps

We have two options moving forward:

  1. Investigate the memory leak in Case 1 (I know it is coming from the derivative/gradient computations).
  2. Use disk checkpointing only with checkpoint_schedules.

For Option 2 (this PR proposal), here are the results obtained:

(Figure 2: memory usage plot)

  • The black curve corresponds to the results with this PR.
  • The blue curve represents the use of enable_disk_checkpointing() without checkpoint_schedules.
  • The red curve corresponds to the case where no disk checkpointing is applied.

These charts use the example added in issue #4014, run for 500 time steps; 20 time steps alone is not enough to expose these leaks.

@dham (Member) commented Feb 10, 2025

I believe this answer: storing the name method rather than its result will definitely cause the leak observed.

Moving to checkpoint schedules for everything is our aim in any event, so I am happy to always use it in this case. @sghelichkhani: the PR enables this in a way that shouldn't require any code changes for G-Adopt.

@Ig-dolci (Contributor, Author) commented

Perfect. I will convert this PR to a draft to run more tests, and reopen it for review as soon as possible.

@Ig-dolci marked this pull request as draft February 10, 2025 09:59
@sghelichkhani (Contributor) commented

@Ig-dolci I have tried hard to reproduce your black curve, but I still see a big leak on dolci/fix_mem_leak_diskcheckpointing:
(image: memory usage comparison plot)
Am I missing something?

@sghelichkhani (Contributor) commented

@dham For the graph I have above (and presumably what @Ig-dolci is plotting) I am using this reproducer https://github.com/g-adopt/g-adopt/blob/adjoint-memory/demos/mantle_convection/test/tester.py, where Q.dim() is 10201, and we are doing 500 timesteps.

@Ig-dolci (Contributor, Author) commented

> @Ig-dolci I have tried hard to reproduce your black curve, but I still see a big leak on dolci/fix_mem_leak_diskcheckpointing. Am I missing something?

I think what is missing is replacing the time loop for i in range(total_steps): with for i in tape.timestepper(iter(range(total_steps))):. My bad; I need to document this properly or add a warning so that users are informed.
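Concretely, the reproducer's loop would change roughly like this (a sketch only; total_steps and the loop body stand in for the reproducer's own code):

```python
from firedrake.adjoint import get_working_tape

tape = get_working_tape()
total_steps = 500  # placeholder for the reproducer's step count

# Before:
# for i in range(total_steps):
#     ...  # advance the forward model

# After: let the tape drive the iteration so the checkpoint schedule can
# store and evict data per time step.
for i in tape.timestepper(iter(range(total_steps))):
    ...  # advance the forward model one step
```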

@angus-g (Contributor) commented Feb 11, 2025


Are you also using the extra garbage collection from dolfin-adjoint/pyadjoint#187?

@Ig-dolci (Contributor, Author) commented


> Are you also using the extra garbage collection from dolfin-adjoint/pyadjoint#187?

Yes, I just noticed that I used it. Applying garbage collection every 20 time steps helped decrease memory usage by almost 20% for this case.
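For context, the idea of collecting every N steps can be sketched by hand as below; this is only an illustration of the pattern, and the actual mechanism added in dolfin-adjoint/pyadjoint#187 may expose it differently:

```python
import gc

GC_EVERY = 20  # hypothetical interval; the ~20% saving above used 20 steps

def run_forward(tape, total_steps):
    """Time loop with periodic explicit garbage collection."""
    for i in tape.timestepper(iter(range(total_steps))):
        # ...advance the forward model one step...
        if (i + 1) % GC_EVERY == 0:
            # Force a full collection so reference cycles created while
            # taping are reclaimed instead of accumulating in RSS.
            gc.collect()
```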

@Ig-dolci marked this pull request as ready for review February 13, 2025 16:53
Development

Successfully merging this pull request may close these issues.

Memory Growth and Unexpected Behaviour in Firedrake Adjoint