
Multiprocessing error during validation in LocalTorch compute context #298

Open
atc3 opened this issue Sep 25, 2024 · 3 comments

atc3 (Contributor) commented Sep 25, 2024

Describe the bug

When running cosem_example.ipynb on a local workstation with GPUs, the validation step during training throws the following error:

...
RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method

If I directly call validate_run outside of train_run, I get the same error:

from dacapo import validate_run

validate_run("cosem_distance_run_4nm", 2000)
Creating FileConfigStore:
	path: /home/[email protected]/dacapo/configs
Creating local weights store in directory /home/[email protected]/dacapo
Retrieving weights for run cosem_distance_run_4nm, iteration 2000
Validating run cosem_distance_run_4nm at iteration 2000...
Creating FileStatsStore:
	path    : /home/[email protected]/dacapo/stats
Validating run cosem_distance_run_4nm on dataset jrc_hela-2_recon-1/labels/groundtruth/crop6/[mito]_gt_jrc_hela-2_recon-1/labels/groundtruth/crop6/mito_s1_uint8_None_4nm
validation inputs already copied!
Predicting with input size (2304, 2304, 2304), output size (848, 848, 848)
Total input ROI: [11272:13728, 872:3328, 11352:13808] (2456, 2456, 2456), output ROI: [12000:13000, 1600:2600, 12080:13080] (1000, 1000, 1000)
Running blockwise prediction with worker_file:  /home/[email protected]/dacapo-ml/dacapo/blockwise/predict_worker.py
Running blockwise with worker_file:  /home/[email protected]/dacapo-ml/dacapo/blockwise/predict_worker.py
Using compute context: LocalTorch(distribute_workers=False, _device=None, oom_limit=4.2)
ERROR:daisy.worker:worker (hostname=10.101.50.108:port=35859:task_id=predict_worker2024-09-25_16-08-03:worker_id=2) received exception: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method

Happy to provide a full stack trace if it helps.

I tried to fix this by explicitly setting the torch multiprocessing start method to 'spawn', but that raised a different error and I decided not to go too deep down that hole. Instead, I got around the crash by enabling distribute_workers in the LocalTorch compute context, which somehow fixes the issue.
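
(For reference, forcing the spawn start method looks roughly like the sketch below; this is not the exact code I ran.)

import torch.multiprocessing as mp

# Force the 'spawn' start method so CUDA is not re-initialized in a forked
# child process; this has to run before any CUDA context is created.
mp.set_start_method("spawn", force=True)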

To Reproduce

Run cosem_example.ipynb on any local workstation with a GPU.

Versions:

  • OS: Ubuntu 22.04
  • CUDA Version: 12.2
  • 3 x NVIDIA RTX A5000, 24 GB memory each
vaxenburg (Collaborator) commented:

The distribute_workers flag can also be set from the dacapo.yaml file by adding this to it:

compute_context:
  type: LocalTorch
  config:
    distribute_workers: True

Or maybe that's what you did?

atc3 (Contributor, Author) commented Sep 30, 2024

Haha, I totally forgot about configuring this with the yaml file. I changed the default value in the LocalTorch class instead, but I think the end result was the same: setting distribute_workers to True stopped the crash from happening.
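
(For anyone else hitting this, constructing the context with the flag turned on should be equivalent to flipping the class default; the import path below is my assumption.)

from dacapo.compute_context import LocalTorch  # import path is my guess

# Build a compute context with worker distribution enabled; in my hands this
# avoided the "Cannot re-initialize CUDA in forked subprocess" crash.
context = LocalTorch(distribute_workers=True)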

What do you think about just changing the default, though? If distribute_workers is False, what is the intended behavior on local machines?

vaxenburg (Collaborator) commented:

@rhoadesScholar ?
