
bcbio-vc docker image not able to find mounted biodata containing reference genome files #9

@marrojwala

Description

Hi!

I am trying to incorporate the bcbio-vc Docker image into my pipeline manager, so I am running a test variant-calling workflow inside a bcbio-vc container. However, bcbio_nextgen.py is unable to find the genome builds in the bcbio installation.

This is how I ran it:

  1. Created the ~/bcbio/biodata/genomes and ~/bcbio/biodata/galaxy directories on the local system, to be mounted into the Docker container. I also created another directory, ~/bcbio-test, as the scratch space for my test.
  2. Started a docker container using docker run -ti -v ~/bcbio/biodata:/mnt/biodata -v ~/bcbio-test:/data quay.io/bcbio/bcbio-vc
  3. In the container, I ran bcbio_nextgen.py upgrade -u skip --genomes hg38 --genomes mm10 --aligners bwa to download the reference genomes. These were downloaded to /usr/local/share/bcbio-nextgen/genomes, and the corresponding galaxy directory was updated at /usr/local/share/bcbio-nextgen/galaxy. The tail of the stdout is attached below:
List of genomes to get (from the config file at '{'genomes': [{'dbkey': 'hg38', 'name': 'Human (hg38) full', 'indexes': ['seq', 'twobit', 'bwa', 'hisat2'], 'annotations': ['ccds', 'capture_regions', 'coverage', 'prioritize', 'dbsnp', 'hapmap_snps', '1000g_omni_snps', 'ACMG56_genes', '1000g_snps', 'mills_indels', '1000g_indels', 'clinvar', 'qsignature', 'genesplicer', 'effects_transcripts', 'varpon', 'vcfanno', 'viral', 'purecn_mappability', 'simple_repeat', 'af_only_gnomad', 'transcripts', 'RADAR', 'rmsk', 'salmon-decoys', 'fusion-blacklist', 'mirbase'], 'validation': ['giab-NA12878', 'giab-NA24385', 'giab-NA24631', 'platinum-genome-NA12878', 'giab-NA12878-remap', 'giab-NA12878-crossmap', 'dream-syn4-crossmap', 'dream-syn3-crossmap', 'giab-NA12878-NA24385-somatic', 'giab-NA24143', 'giab-NA24149', 'giab-NA24694', 'giab-NA24695']}, {'dbkey': 'mm10', 'name': 'Mouse (mm10)', 'indexes': ['seq', 'twobit'], 'annotations': ['problem_regions', 'prioritize', 'dbsnp', 'vcfanno', 'transcripts', 'rmsk', 'mirbase']}], 'genome_indexes': ['bwa', 'rtg'], 'install_liftover': False, 'install_uniref': False}'): Human (hg38) full, Mouse (mm10)
bcbio-nextgen data upgrade complete.
Upgrade completed successfully.
  4. I then started to run this tutorial in the /data directory, and it failed with the following error:
root@edc1034c416f:/data/cancer-dream-syn3/work# bcbio_nextgen.py ../config/cancer-dream-syn3.yaml -n 8
Running bcbio version: 1.2.4
global config: /data/cancer-dream-syn3/work/bcbio_system.yaml
run info config: /data/cancer-dream-syn3/config/cancer-dream-syn3.yaml
[2021-05-20T00:31Z] System YAML configuration: /data/cancer-dream-syn3/work/bcbio_system-merged.yaml.
[2021-05-20T00:31Z] Locale set to C.UTF-8.
[2021-05-20T00:31Z] Resource requests: bwa, sambamba, samtools; memory: 4.00, 4.00, 4.00; cores: 16, 16, 16
[2021-05-20T00:31Z] Configuring 1 jobs to run, using 8 cores each with 32.1g of memory reserved for each job
[2021-05-20T00:31Z] Timing: organize samples
[2021-05-20T00:31Z] multiprocessing: organize_samples
[2021-05-20T00:31Z] Using input YAML configuration: /data/cancer-dream-syn3/config/cancer-dream-syn3.yaml
[2021-05-20T00:31Z] Checking sample YAML configuration: /data/cancer-dream-syn3/config/cancer-dream-syn3.yaml
Traceback (most recent call last):
  File "/usr/local/bin/bcbio_nextgen.py", line 245, in <module>
    main(**kwargs)
  File "/usr/local/bin/bcbio_nextgen.py", line 46, in main
    run_main(**kwargs)
  File "/usr/local/share/bcbio-nextgen/anaconda/lib/python3.7/site-packages/bcbio/pipeline/main.py", line 50, in run_main
    fc_dir, run_info_yaml)
  File "/usr/local/share/bcbio-nextgen/anaconda/lib/python3.7/site-packages/bcbio/pipeline/main.py", line 91, in _run_toplevel
    for xs in pipeline(config, run_info_yaml, parallel, dirs, samples):
  File "/usr/local/share/bcbio-nextgen/anaconda/lib/python3.7/site-packages/bcbio/pipeline/main.py", line 128, in variant2pipeline
    [x[0]["description"] for x in samples]]])
  File "/usr/local/share/bcbio-nextgen/anaconda/lib/python3.7/site-packages/bcbio/distributed/multi.py", line 28, in run_parallel
    return run_multicore(fn, items, config, parallel=parallel)
  File "/usr/local/share/bcbio-nextgen/anaconda/lib/python3.7/site-packages/bcbio/distributed/multi.py", line 86, in run_multicore
    for data in joblib.Parallel(parallel["num_jobs"], batch_size=1, backend="multiprocessing")(joblib.delayed(fn)(*x) for x in items):
  File "/usr/local/share/bcbio-nextgen/anaconda/lib/python3.7/site-packages/joblib/parallel.py", line 1041, in __call__
    if self.dispatch_one_batch(iterator):
  File "/usr/local/share/bcbio-nextgen/anaconda/lib/python3.7/site-packages/joblib/parallel.py", line 859, in dispatch_one_batch
    self._dispatch(tasks)
  File "/usr/local/share/bcbio-nextgen/anaconda/lib/python3.7/site-packages/joblib/parallel.py", line 777, in _dispatch
    job = self._backend.apply_async(batch, callback=cb)
  File "/usr/local/share/bcbio-nextgen/anaconda/lib/python3.7/site-packages/joblib/_parallel_backends.py", line 208, in apply_async
    result = ImmediateResult(func)
  File "/usr/local/share/bcbio-nextgen/anaconda/lib/python3.7/site-packages/joblib/_parallel_backends.py", line 572, in __init__
    self.results = batch()
  File "/usr/local/share/bcbio-nextgen/anaconda/lib/python3.7/site-packages/joblib/parallel.py", line 263, in __call__
    for func, args, kwargs in self.items]
  File "/usr/local/share/bcbio-nextgen/anaconda/lib/python3.7/site-packages/joblib/parallel.py", line 263, in <listcomp>
    for func, args, kwargs in self.items]
  File "/usr/local/share/bcbio-nextgen/anaconda/lib/python3.7/site-packages/bcbio/utils.py", line 59, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/share/bcbio-nextgen/anaconda/lib/python3.7/site-packages/bcbio/distributed/multitasks.py", line 459, in organize_samples
    return run_info.organize(*args)
  File "/usr/local/share/bcbio-nextgen/anaconda/lib/python3.7/site-packages/bcbio/pipeline/run_info.py", line 81, in organize
    item = add_reference_resources(item, remote_retriever)
  File "/usr/local/share/bcbio-nextgen/anaconda/lib/python3.7/site-packages/bcbio/pipeline/run_info.py", line 177, in add_reference_resources
    data["dirs"]["galaxy"], data)
  File "/usr/local/share/bcbio-nextgen/anaconda/lib/python3.7/site-packages/bcbio/pipeline/genome.py", line 233, in get_refs
    galaxy_config, data)
  File "/usr/local/share/bcbio-nextgen/anaconda/lib/python3.7/site-packages/bcbio/pipeline/genome.py", line 180, in _get_ref_from_galaxy_loc
    (genome_build, os.path.normpath(loc_file)))
ValueError: Did not find genome build hg38 in bcbio installation: /data/cancer-dream-syn3/work/tool-data/sam_fa_indices.loc

I am not sure this is how the Docker image is intended to be used, but the bcbio installation appears to be all-encompassing based on the Dockerfile. After some sleuthing, I think the issue is that bcbio_nextgen.py derives the base installation directory from this function, which causes it to look for the .loc file at /data/cancer-dream-syn3/work/tool-data/sam_fa_indices.loc instead of /usr/local/share/bcbio-nextgen/galaxy/tool-data/sam_fa_indices.loc.
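To make the suspected path resolution concrete, here is a minimal sketch of what I think is happening. Note that loc_file_path is a hypothetical name of my own, not bcbio's actual function; the point is only that if the galaxy base directory resolves to the current work directory rather than the install directory, the .loc lookup produces exactly the failing path from the traceback.

```python
import os

def loc_file_path(galaxy_base_dir: str, loc_name: str = "sam_fa_indices.loc") -> str:
    """Hypothetical reconstruction of the lookup: resolve the tool-data
    directory relative to whatever bcbio treats as the galaxy base dir."""
    return os.path.normpath(os.path.join(galaxy_base_dir, "tool-data", loc_name))

# If the base directory resolves to the work directory, we get the failing path
# from the ValueError above:
print(loc_file_path("/data/cancer-dream-syn3/work"))
# -> /data/cancer-dream-syn3/work/tool-data/sam_fa_indices.loc

# If it instead resolved to the container's install directory, the lookup
# would hit the .loc file that actually exists:
print(loc_file_path("/usr/local/share/bcbio-nextgen/galaxy"))
# -> /usr/local/share/bcbio-nextgen/galaxy/tool-data/sam_fa_indices.loc
```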

Let me know if you have any questions about this, and thanks in advance!
