Memory leak #349
Open · mzouink opened this issue Nov 23, 2024 · 21 comments

Comments

@mzouink (Member) commented Nov 23, 2024

After the new version there is a memory leak. I think it is coming from funlib.persistence, because funlib.show.neuroglancer is also becoming really slow and buggy.

[Screenshot 2024-11-23 at 1:42 PM: memory usage plot]

@pattonw

@pattonw (Contributor) commented Nov 23, 2024

Interesting. You're just visualizing an array with funlib.show.neuroglancer and it's using 500 GB?

@mzouink (Member, Author) commented Nov 24, 2024

No, no, that graph is from a dacapo train job; I mentioned neuroglancer only as a side example.

@pattonw (Contributor) commented Nov 25, 2024

Can you provide some (ideally simplified) config combination that leads to a similar memory profile?

@mzouink (Member, Author) commented Nov 25, 2024

I can't give you this exact code because I was using 200+ nrs crops, but this is the code I was running (do you still have access to the cluster?):

# %%
import csv
import json
import math  # needed for math.pi in the rotation_interval below
import os
import dacapo

# %%
datasplit_path = "datasplit_v2.csv"
classes_to_be_used_path = "to_be_used_v2.json"
#%%
with open(classes_to_be_used_path, 'r') as f:
    classes = ["bg"]+list(json.load(f).keys())
# %%
from dacapo.experiments.datasplits import DataSplitGenerator
from funlib.geometry import Coordinate
from dacapo.store.create_store import create_config_store
config_store = create_config_store()
# %%

input_resolution = Coordinate(8, 8, 8)
output_resolution = Coordinate(8,8,8)
datasplit_config = DataSplitGenerator.generate_from_csv(
    datasplit_path,
    input_resolution,
    output_resolution,
    # targets=classes,
    name="base_model_20241120_20_target_classes",
    # max_validation_volume_size = 400**3,
).compute()
# %%

datasplit = datasplit_config.datasplit_type(datasplit_config)

config_store.store_datasplit_config(datasplit_config)
# %%

from dacapo.experiments.tasks import OneHotTaskConfig

simple_one_hot = OneHotTaskConfig(
    name="one_hot_task",
    classes=classes,
    kernel_size=1,
)
config_store.store_task_config(simple_one_hot)
# %%
from dacapo.experiments.architectures import CNNectomeUNetConfig
architecture_config = CNNectomeUNetConfig(
    name="simple_unet",
    input_shape=(2, 132, 132),
    eval_shape_increase=(8, 32, 32),
    fmaps_in=1,
    num_fmaps=8,
    fmaps_out=8,
    fmap_inc_factor=2,
    downsample_factors=[(1, 4, 4), (1, 4, 4)],
    kernel_size_down=[[(1, 3, 3)] * 2] * 3,
    kernel_size_up=[[(1, 3, 3)] * 2] * 2,
    constant_upsample=True,
    padding="valid",
)
config_store.store_architecture_config(architecture_config)
# %%
from dacapo.experiments.trainers import GunpowderTrainerConfig
# the augment configs also need to be imported; this path is assumed and may vary by dacapo version
from dacapo.experiments.trainers.gp_augments import (
    ElasticAugmentConfig,
    GammaAugmentConfig,
    IntensityAugmentConfig,
    IntensityScaleShiftAugmentConfig,
)

trainer_config = GunpowderTrainerConfig(
    name="default_v3",
    batch_size=2,
    learning_rate=0.0001,
    num_data_fetchers=20,
    augments=[
        ElasticAugmentConfig(
            control_point_spacing=[100, 100, 100],
            control_point_displacement_sigma=[10.0, 10.0, 10.0],
            rotation_interval=(0, math.pi / 2.0),
            subsample=8,
            uniform_3d_rotation=True,
        ),
        IntensityAugmentConfig(
            scale=(0.25, 1.75),
            shift=(-0.5, 0.35),
            clip=True,
        ),
        GammaAugmentConfig(gamma_range=(0.5, 2.0)),
        IntensityScaleShiftAugmentConfig(scale=2, shift=-1),
    ],
    snapshot_interval=100000,
    clip_raw=False,
)
config_store.store_trainer_config(trainer_config)
# %%
from dacapo.experiments import RunConfig
from dacapo.experiments.run import Run

iterations = 1000000
validation_interval = 10000
run_config = RunConfig(
    name=f"simple_base_model",
    datasplit_config=datasplit_config,
    task_config=simple_one_hot,
    architecture_config=architecture_config,
    trainer_config=trainer_config,
    num_iterations=iterations,
    validation_interval=validation_interval,
)
config_store.store_run_config(run_config)
# %%
# I submitted it as a separate job: $ dacapo train run_name
from dacapo import train
train(run_config.name)

@pattonw (Contributor) commented Nov 25, 2024

Ah, so this might not be a memory leak, just lots of data.
Do you get the same pattern if you train with just 1 dataset?
How many iterations did you train before memory became a problem?

@mzouink (Member, Author) commented Nov 25, 2024

Usually this is not a problem even with a lot of data, because of the lazy loading.
The error happens after ~1000 iterations.
I will try multiple scenarios to narrow down the problem and come back.

@mzouink (Member, Author) commented Nov 25, 2024

I submitted:
datasplit: [1 crop, 100 crops]
trainer: [basic, with 3D augmentation]
each combination with 3 repetitions

All 3 repetitions of the big datasplit hit out-of-memory at roughly the same time (with both trainers), at ~2600 iterations. The one-crop datasplit is still running, so there seems to be a memory leak tied to holding crop info over time.

@mzouink (Member, Author) commented Nov 25, 2024

@pattonw I think this is the problem:
https://github.com/funkelab/funlib.persistence/blob/3c0760e48edf1b287c4f75d7d11dc6b775332b2b/funlib/persistence/arrays/array.py#L73
After asking GPT I got:

When Does self.data Contain Binary Data?
Before Computation: Only metadata and references to the underlying storage (lazy evaluation).
After .compute(): The binary data is loaded into memory as a concrete array (e.g., NumPy).
After .persist(): The chunks of the Dask array are computed and stored in memory, allowing for quick access but requiring memory proportional to the size of the computed data.
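
For context, a minimal dask sketch of the distinction GPT describes (generic dask code, not the actual funlib.persistence internals):

import dask.array as da

# Lazy: only metadata and a task graph, no voxel data in memory yet.
lazy = da.random.random((1000, 1000, 1000), chunks=(100, 100, 100))

# .compute() materializes the requested result as a NumPy array; memory is
# used for the returned array, but the dask object does not cache the chunks.
block_sum = lazy[:100, :100, :100].sum().compute()

# .persist() is what would pin computed chunks in memory on the dask side,
# growing memory with the size of the persisted data.
pinned = lazy.persist()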

@pattonw (Contributor) commented Nov 25, 2024

It could be the dask array, but we never call persist, and I'm pretty sure using compute doesn't cache data in memory

@mzouink (Member, Author) commented Nov 26, 2024

Now it is clear that it is related to a high number of crops, but I don't know how to narrow down the cause of the bug further.

@pattonw (Contributor) commented Nov 26, 2024

My best guess is the masking.
Can you try replacing this line with dask.array.ones(dataset.gt.data.shape, dtype=dataset.gt.data.dtype) and see if that solves it? (A sketch of the swap is below.)
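
For reference, a minimal self-contained sketch of the suggested swap (the gt_data array here is a hypothetical stand-in for dataset.gt.data; only the da.ones(...) call comes from the suggestion above):

import dask.array as da

# Hypothetical ground-truth array standing in for dataset.gt.data.
gt_data = da.zeros((128, 128, 128), dtype="uint8")

# The suggested swap: an all-ones mask of the same shape and dtype,
# replacing whatever the linked line currently builds, to rule the masking out.
mask_data = da.ones(gt_data.shape, dtype=gt_data.dtype)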

@mzouink (Member, Author) commented Nov 26, 2024

didn't work :/

@pattonw (Contributor) commented Dec 2, 2024

I tried memory profiling funlib.persistence.Array directly. Here's the result of randomly accessing (100, 100, 100) cubes:

[Figure: memory profile of random cube access]

script:

from funlib.persistence import Array, prepare_ds
from funlib.geometry import Coordinate, Roi
import dask.array as da

import random

a = prepare_ds(
    "scratch/test.zarr", (10_000, 10_000, 10_000), chunk_shape=(10, 10, 10)
)

for ii in range(1000):
    print(ii)
    roi = Roi(
        Coordinate(*[random.randint(0, a.shape[i] // 100) for i in range(3)]),
        (10, 10, 10),
    )
    x = a[roi]
    print(x.sum())

It doesn't seem like there's a memory leak from repeatedly accessing zarrs through funlib.persistence.
Can you try to find a combination of somewhat minimal dacapo configs that leads to increasing memory usage during training?

Memory profiling setup I was using: pip install memory_profiler (https://pypi.org/project/memory-profiler/), then mprof run --multiprocess --include-children {your_script}.py, followed by mprof plot.
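
If it is easier to sample from inside Python than via the mprof CLI, memory_profiler also exposes memory_usage; a minimal sketch (the workload function is just a stand-in for the zarr-access loop or a few training iterations):

from memory_profiler import memory_usage

def workload():
    # stand-in for the real access/training loop
    return sum(i * i for i in range(10_000_000))

# Sample memory every 0.1 s while the callable runs, including child processes.
samples = memory_usage((workload, (), {}), interval=0.1, include_children=True)
print(f"peak memory: {max(samples):.1f} MiB")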

@pattonw (Contributor) commented Dec 2, 2024

I adapted your script with a fairly basic data setup and tested 3 different modalities. See the script here. I ran each modality for about 10 minutes at around 50 iterations/sec, so about 30k iterations each, with 20 data-fetching workers.

It looks like the memory cost stays fairly constrained across all runs.

Here are the results:
One small (100, 132, 132) crop:
[plot: one_small_crop]
One large (1000, 1320, 1320) crop:
[plot: one_large_crop]
Many (13) large (1000, 1320, 1320) crops:
[plot: many_large_crops]

@mzouink (Member, Author) commented Dec 3, 2024

Hi Will, you need to keep the scripts running for at least an hour, because generating the random crops takes ~2000 sec.
Then I hit this error from your script:
[Screenshot 2024-12-03 at 10:14 AM]

Creating FileConfigStore:
	path: /groups/cellmap/cellmap/zouinkhim/base_model/dacapo_files/configs
Training run test_run_will_3ddd
Creating FileConfigStore:
	path: /groups/cellmap/cellmap/zouinkhim/base_model/dacapo_files/configs
Starting/resuming training for run test_run_will_3ddd...
Creating FileStatsStore:
	path    : /groups/cellmap/cellmap/zouinkhim/base_model/dacapo_files/stats
Traceback (most recent call last):
 File "/groups/cellmap/cellmap/zouinkhim/base_model/generate_run/profile_test_runs_3d_.py", line 128, in <module>
   train(run_config.name)
 File "/groups/cellmap/cellmap/zouinkhim/dacapo_release/dacapo/dacapo/train.py", line 47, in train
   return train_run(run, do_validate)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^
 File "/groups/cellmap/cellmap/zouinkhim/dacapo_release/dacapo/dacapo/train.py", line 66, in train_run
   run.validation_scores.scores = stats_store.retrieve_validation_iteration_scores(
   ^^^^^^^^^^^^^^^^^^^^^
 File "/groups/cellmap/cellmap/zouinkhim/dacapo_release/dacapo/dacapo/experiments/run.py", line 160, in validation_scores
   self.datasplit.validate,
   ^^^^^^^^^^^^^^^^^^^^^^^
 File "/groups/cellmap/cellmap/zouinkhim/dacapo_release/dacapo/dacapo/experiments/datasplits/simple_config.py", line 69, in validate
   for x in self.get_paths(self.validate_group_name)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 File "/groups/cellmap/cellmap/zouinkhim/dacapo_release/dacapo/dacapo/experiments/datasplits/simple_config.py", line 50, in get_paths
   raise Exception(f"No raw data found at {level_0} or {level_1} or {level_2}")
Exception: No raw data found at f{data_path}/test.zarr/raw or f{data_path}/test.zarr/test/raw or f{data_path}/test.zarr/test/**/raw
mprof: Sampling memory every 0.1s
running new process

@mzouink (Member, Author) commented Dec 3, 2024

Script: here.
When I use your script but with my data and a 2D model, it was OK:
[image]

But with a 3D model the problem started:
[image]
I am still doing more tests.

@pattonw (Contributor) commented Dec 3, 2024

(Quoting the comment above about needing to run the scripts for at least an hour and the "No raw data found" exception.)

Yeah, I picked large ROIs that took a few minutes to generate on my Mac; depending on the computer, generating those could take longer. Feel free to shrink them to something you can generate reasonably quickly.

I made a pull request with a small bugfix that should address the exception you got.

@pattonw (Contributor) commented Dec 3, 2024

(Quoting the comment above: OK with the 2D model, problem with the 3D model.)

Looks like the problem is present in both, just on a smaller scale with the 2D model. Very strange that it depends on the model architecture.

@pattonw (Contributor) commented Dec 3, 2024

Are these the 2 configs you are calling "2D" and "3D"?

# architecture_config = CNNectomeUNetConfig(
#     name="simple_unet_tt",
#     input_shape=(2, 64, 64),
#     eval_shape_increase=(8, 32, 32),
#     fmaps_in=1,
#     num_fmaps=8,
#     fmaps_out=8,
#     fmap_inc_factor=2,
#     downsample_factors=[(1, 2, 2), (1, 2, 2)],
#     kernel_size_down=[[(1, 3, 3)] * 2] * 3,
#     kernel_size_up=[[(1, 3, 3)] * 2] * 2,
#     constant_upsample=True,
#     padding="valid",
# )
# config_store.store_architecture_config(architecture_config)

architecture_config = CNNectomeUNetConfig(
    name="unet_tt",
    input_shape=(216, 216, 216),
    eval_shape_increase=(72, 72, 72),
    fmaps_in=1,
    num_fmaps=12,
    fmaps_out=72,
    fmap_inc_factor=6,
    downsample_factors=[(2, 2, 2), (3, 3, 3), (3, 3, 3)],
    constant_upsample=True,
    upsample_factors=[],
)
config_store.store_architecture_config(architecture_config)

One thing I notice is that the input shape is much larger for the 3D model than for the 2D one.
Doing some quick math, that looks like it would use a lot of memory:
You have quite a lot of downsampling, so I wouldn't be surprised if the context you need is around 84 voxels, meaning your input data is potentially almost a 300-voxel cube. Assuming float32, and an increase in volume size by a factor of about 3 for the rotation augmentation, that's almost 0.3 GB for a single raw input.
Including all the other arrays (target, gt, mask, prediction, etc.), you could be getting up to about 1 GB per batch. With 20 workers and a default cache size of 50, your pipeline could be using up to around 70 GB just to hold data ready for training. A worked version of this arithmetic is sketched below.
There's still something going wrong, since that's still almost an order of magnitude off the 500 GB you see occupied, but it does mean it's not surprising that the 2D model isn't running into the same memory problems.
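
A rough back-of-the-envelope version of that arithmetic (the cube size, rotation factor, arrays-per-batch, worker count, and cache size are the assumptions stated above, not measured values):

# Rough memory estimate for the 3D setup described above.
bytes_per_voxel = 4              # float32
cube_side = 300                  # ~216 input shape plus ~84 voxels of context (assumed)
rotation_factor = 3              # extra volume requested for 3D rotation augmentation

raw_input_gb = cube_side**3 * bytes_per_voxel * rotation_factor / 1e9
print(f"single raw input: ~{raw_input_gb:.2f} GB")         # ~0.32 GB

arrays_per_batch = 3             # raw plus target/gt/mask/etc., roughly
batch_gb = raw_input_gb * arrays_per_batch
print(f"per batch: ~{batch_gb:.1f} GB")                     # ~1 GB

num_workers = 20
cache_size = 50                  # assumed default precache size
pipeline_gb = batch_gb * (cache_size + num_workers)
print(f"data held by the pipeline: ~{pipeline_gb:.0f} GB")  # ~70 GB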

@mzouink (Member, Author) commented Dec 3, 2024

Yes, that's what I mean by 2D / 3D.

@mzouink (Member, Author) commented Dec 3, 2024

Update:
Your example:
[image]

Your example but with my datasplit:
[image]

Your example but with the 3D model:
[image]

With my data and the 3D model:
[image]

I think the old datasplit has a leak somewhere.
