
Fix memory leak for disk checkpointing #4020

Open · Ig-dolci wants to merge 25 commits into master from dolci/fix_mem_leak_diskcheckpointing
Conversation

@Ig-dolci (Contributor) commented Feb 8, 2025

Description

This PR seeks to fix the memory leak by calling the function.name() method in the CheckpointFunction constructor and storing its result (a string) rather than the bound method itself. In addition, it proposes driving disk checkpointing through the SingleDiskStorageSchedule from checkpoint_schedules, with the pyadjoint checkpoint manager executing the solvers. The changes also let the user choose other schedules from checkpoint_schedules that use disk storage.
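For orientation, here is a minimal sketch of how these pieces might be wired together from user code. The import paths and call signatures are assumptions based on this thread (SingleDiskStorageSchedule from checkpoint_schedules, the pyadjoint tape's checkpointing hook, and Firedrake's enable_disk_checkpointing), not something this PR text specifies:

```python
from firedrake import *
from firedrake.adjoint import *
from checkpoint_schedules import SingleDiskStorageSchedule  # assumed import path

continue_annotation()
enable_disk_checkpointing()  # Firedrake's disk storage for checkpoint data

tape = get_working_tape()
# Hand pyadjoint's checkpoint manager a schedule that stores the adjoint
# dependency data on disk at every time step; any other disk-storage schedule
# from checkpoint_schedules could be passed here instead.
tape.enable_checkpointing(SingleDiskStorageSchedule())
# The forward time loop then has to be driven via tape.timestepper(...),
# as discussed later in this thread.
```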

There is a point to consider here:

  • SingleDiskStorageSchedule stores only the adjoint dependency data. We therefore have to pay attention to cases that require all checkpoint data to be stored on disk at every single time step, i.e. both the data required to restart the forward solver and the data required for the adjoint computations. SingleDiskStorageSchedule does not support that.

Observation:

  • The tests will pass once pyadjoint PR 195 is merged.

  • I will wait for PR 4023 to be merged before opening this PR for review.

@Ig-dolci linked an issue Feb 8, 2025 that may be closed by this pull request

github-actions bot commented Feb 8, 2025

| Tests | Ran | Passed ✅ | Skipped ⏭️ | Failed ❌ |
|---|---|---|---|---|
| Firedrake real | 8158 | 7449 | 709 | 0 |


github-actions bot commented Feb 8, 2025

| Tests | Ran | Passed ✅ | Skipped ⏭️ | Failed ❌ |
|---|---|---|---|---|
| Firedrake complex | 8313 | 6632 | 1681 | 0 |

@dham (Member) commented Feb 9, 2025

I don't believe this is correct. Merely referencing function.function_space() in CheckpointFunction.__init__ does not create persistent references to function. That isn't going to be the source of the memory leak.

@sghelichkhani (Contributor) commented

Just confirming that this doesn't change any of our reproducers. Also noting that the behaviour is slightly different on my mac and on our HPC system: unlike on my mac, the RSS seems to go flat after the first derivative (see image). I'm not sure why it behaves differently on the mac, but at any rate the memory usage goes higher with disk checkpointing.
(Screenshot 2025-02-10: RSS memory usage plot)

@dham (Member) commented Feb 10, 2025

Can I just check how many Degrees of Freedom this is? I'm trying to work out how many vectors we must have leaked.

@Ig-dolci (Contributor, Author) commented Feb 10, 2025

Firstly, I have identified memory leaks in two cases involving disk checkpointing:

  1. When using only enable_disk_checkpointing().
  2. When using enable_disk_checkpointing() along with checkpoint_schedules.

Additionally, function.function_space() is not the source of the leak. However, assigning self.name = function.name (where function.name is a bound method rather than a string) contributes to the leak in both cases.

The following chart evaluates Case 2 on the master branch (black curve) and with a modification replacing self.name = function.name with self.name = function.name() (blue curve):

(Figure 1: memory usage plot)

The red curve represents the case where no disk checkpointing is applied.

Although changing self.name = function.name to self.name = function.name() mitigates the issue in Case 2, it does not resolve the memory leak in Case 1. The cause of this leak is still unknown.
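For reference, a minimal sketch of the change being discussed; the constructor is abbreviated to the single assignment that matters here:

```python
class CheckpointFunction:
    def __init__(self, function, *args, **kwargs):
        # Before: self.name = function.name
        # Storing the bound method keeps `function` (and, through it, its
        # function space and mesh) reachable for as long as this checkpoint
        # object lives, so the Function is never garbage collected.
        #
        # After: call the method and keep only the resulting string, which
        # holds no reference back to the Function.
        self.name = function.name()
```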

Next Steps

We have two options moving forward:

  1. Investigate the memory leak in Case 1 (I know it is coming from the derivative/gradient computations).
  2. Use disk checkpointing only with checkpoint_schedules.

For Option 2 (this PR proposal), here are the results obtained:

(Figure 2: memory usage plot)

  • The black curve corresponds to the results with this PR.
  • The blue curve represents the use of enable_disk_checkpointing() without checkpoint_schedules.
  • The red curve corresponds to the case where no disk checkpointing is applied.

These charts use the example added in issue #4014, run for 500 time steps; 20 time steps alone is not enough to expose these leaks.

@dham (Member) commented Feb 10, 2025

I believe this answer: storing the name method rather than its result will definitely cause the leak observed.

Moving to checkpoint schedules for everything is our aim in any event, so I am happy to always use it in this case. @sghelichkhani: the PR enables this in a way that shouldn't require any code changes for G-Adopt.

@Ig-dolci (Contributor, Author) commented

Perfect. I will convert this PR to a draft to run more tests, and reopen it for review as soon as possible.

@Ig-dolci marked this pull request as draft February 10, 2025 09:59
@sghelichkhani (Contributor) commented

@Ig-dolci I have tried hard to reproduce your black curve, but I still see a big leak on dolci/fix_mem_leak_diskcheckpointing:
(image: memory usage comparison plot)
Am I missing something?

@sghelichkhani (Contributor) commented

@dham For the graph I have above (and presumably what @Ig-dolci is plotting) I am using this reproducer https://github.com/g-adopt/g-adopt/blob/adjoint-memory/demos/mantle_convection/test/tester.py, where Q.dim() is 10201, and we are doing 500 timesteps.

@Ig-dolci (Contributor, Author) commented

> @Ig-dolci I have tried hard to reproduce your black curve, but I still see a big leak on dolci/fix_mem_leak_diskcheckpointing. Am I missing something?

I think what is missing is replacing the time loop for i in range(total_steps): with for i in tape.timestepper(iter(range(total_steps))):. My bad; I need to document this properly or add a warning so that users are informed.
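Concretely, the reproducer's loop would change roughly like this (a sketch only; total_steps and the loop body stand in for the reproducer's own code):

```python
from firedrake.adjoint import get_working_tape

tape = get_working_tape()
total_steps = 500  # placeholder for the reproducer's step count

# Before:
# for i in range(total_steps):
#     ...  # advance the forward model

# After: let the tape drive the iteration so the checkpoint schedule can
# store and evict data per time step.
for i in tape.timestepper(iter(range(total_steps))):
    ...  # advance the forward model one step
```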

@angus-g (Contributor) commented Feb 11, 2025


Are you also using the extra garbage collection from dolfin-adjoint/pyadjoint#187?

@Ig-dolci (Contributor, Author) commented


> Are you also using the extra garbage collection from dolfin-adjoint/pyadjoint#187?

Yes, I just noticed that I used it. Applying garbage collection every 20 time steps helped decrease memory usage by almost 20% for this case.
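For context, the idea of collecting every N steps can be sketched by hand as below; this is only an illustration of the pattern, and the actual mechanism added in dolfin-adjoint/pyadjoint#187 may expose it differently:

```python
import gc

GC_EVERY = 20  # hypothetical interval; the ~20% saving above used 20 steps

def run_forward(tape, total_steps):
    """Time loop with periodic explicit garbage collection."""
    for i in tape.timestepper(iter(range(total_steps))):
        # ...advance the forward model one step...
        if (i + 1) % GC_EVERY == 0:
            # Force a full collection so reference cycles created while
            # taping are reclaimed instead of accumulating in RSS.
            gc.collect()
```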

@Ig-dolci marked this pull request as ready for review February 13, 2025 16:53
Development

Successfully merging this pull request may close these issues.

Memory Growth and Unexpected Behaviour in Firedrake Adjoint