Describe the bug
When running cosem_example.ipynb on a local workstation with GPUs, the validation step during training throws the following error:

```
Creating FileConfigStore:
path: /home/[email protected]/dacapo/configs
Creating local weights store in directory /home/[email protected]/dacapo
Retrieving weights for run cosem_distance_run_4nm, iteration 2000
Validating run cosem_distance_run_4nm at iteration 2000...
Creating FileStatsStore:
path : /home/[email protected]/dacapo/stats
Validating run cosem_distance_run_4nm on dataset jrc_hela-2_recon-1/labels/groundtruth/crop6/[mito]_gt_jrc_hela-2_recon-1/labels/groundtruth/crop6/mito_s1_uint8_None_4nm
validation inputs already copied!
Predicting with input size (2304, 2304, 2304), output size (848, 848, 848)
Total input ROI: [11272:13728, 872:3328, 11352:13808] (2456, 2456, 2456), output ROI: [12000:13000, 1600:2600, 12080:13080] (1000, 1000, 1000)
Running blockwise prediction with worker_file: /home/[email protected]/dacapo-ml/dacapo/blockwise/predict_worker.py
Running blockwise with worker_file: /home/[email protected]/dacapo-ml/dacapo/blockwise/predict_worker.py
Using compute context: LocalTorch(distribute_workers=False, _device=None, oom_limit=4.2)
ERROR:daisy.worker:worker (hostname=10.101.50.108:port=35859:task_id=predict_worker2024-09-25_16-08-03:worker_id=2) received exception: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method
```

If I directly call validate_run outside of train_run, I get the same error. Happy to provide a full stack trace if it helps.
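For reference, this is roughly how I trigger validation directly in the notebook. I'm writing it from memory, so treat the import paths and the validate_run signature as assumptions rather than the exact dacapo API; the run name and iteration match the log above.

```python
# Hypothetical sketch, mirroring (from memory) what the cosem example notebook does.
# Import paths and the validate_run signature are my assumptions, not copied
# from the dacapo source.
from dacapo.store.create_store import create_config_store
from dacapo.experiments import Run
from dacapo.validate import validate_run

config_store = create_config_store()
run = Run(config_store.retrieve_run_config("cosem_distance_run_4nm"))
validate_run(run, 2000)  # fails with the same CUDA re-initialization error for me
```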
I tried to fix this issue by explicitly setting the torch multiprocessing start method to spawn, but then I got a different error and decided not to go too deep into that hole. I then worked around it by enabling distribute_workers in the LocalTorch compute context, and this somehow fixes the issue.
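In case it helps, this is a minimal sketch of the spawn attempt, placed at the top of the notebook before any training or validation call (exact placement may matter):

```python
import torch.multiprocessing as mp

# Force the 'spawn' start method so CUDA can be initialized inside worker
# subprocesses; the default 'fork' method is what triggers the error above.
# force=True overrides a start method that may already have been set elsewhere.
mp.set_start_method("spawn", force=True)
```

With this in place I hit the different error mentioned above and stopped digging there.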
To Reproduce
Just run cosem_example.ipynb on any local workstation with a GPU.

Versions:

Haha, I totally forgot about configuring with the yaml file. I changed the default value in the LocalTorch class, but I think the end result was the same, in that setting distribute_workers to True stopped the crash from happening (a rough sketch of the workaround is below).

What do you think about just changing the default, though? If distribute_workers is False, what is the intended behavior on local machines?
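For reference, this is roughly what my current workaround amounts to in code. The import path is my best guess, and in practice the value would presumably come from the dacapo yaml config rather than from editing the class default, so treat this as an illustration only:

```python
# Rough sketch of the workaround, not a proposal for how it should ship.
# The field names mirror what the log printed
# (LocalTorch(distribute_workers=False, _device=None, oom_limit=4.2));
# the import path is my best guess at where LocalTorch lives.
from dacapo.compute_context import LocalTorch

context = LocalTorch(distribute_workers=True)
# Flipping distribute_workers to True is what stopped the crash for me;
# I haven't dug into why, hence the question about the intended behavior.
```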