Commit

Converted the tutorial into a jupyter notebook; expanded docs
viq854 committed May 24, 2022
1 parent cd2d7e9 commit 66e3456
Showing 8 changed files with 884 additions and 197 deletions.
165 changes: 53 additions & 112 deletions README.md
Original file line number Diff line number Diff line change
@@ -3,14 +3,14 @@
##### Table of Contents
[Overview](#overview)
[Installation](#install)
[Demo / Tutorial](#demo)
[Execution](#structure)
[Tutorial](#demo)
[User Guide](#guide)
[Recommended workflow](#workflow)

<a name="overview"></a>
### Overview

```cue``` is a deep learning framework for SV calling and genotyping. At a high-level, ```cue``` operates
Cue is a deep learning framework for SV calling and genotyping. At a high level, Cue operates
in the following stages illustrated in the figure below:
* sequence alignments are converted into images that capture multiple alignment signals across two genome intervals,
* a trained neural network is used to generate Gaussian response confidence maps for each image,
@@ -21,7 +21,7 @@ which encode the location, type, and genotype of the SVs in this image, and
<img src="docs/img/Cue_overview.png" alt="drawing" width="1000"/>
</p>

The current version of ```cue``` can be used to detect and genotype the following SV types:
The current version of Cue can be used to detect and genotype the following SV types:
deletions (DELs), tandem duplications (DUPs), inversions (INVs), deletion-flanked inversions (INVDELs),
and inverted duplications (INVDUPs) larger than 5kbp.

@@ -50,123 +50,64 @@ provided in the ```install/requirements.txt``` file; for example:

* Set the ```PYTHONPATH``` as follows: ```export PYTHONPATH=${PYTHONPATH}:/path/to/cue```

<a name="demo"></a>
### Demo

We recommend trying the following demo/tutorial to ensure that the software was properly installed
and to experiment running ```cue```. In this demo we will use ```cue``` to discover variants in
a small BAM file provided here: ```data/demo/inputs/chr21.small.bam```.
This file contains alignments to a region of ```GRCh38``` ```chr21``` of short reads from a synthetic genome.
The ground-truth variants simulated in this region of the genome are provided here:
```data/demo/ground_truth/svs.chr21.small.vcf.gz```.
The associated YAML config files needed to execute this workflow are provided
in the ```data/demo/exec``` directory.
The expected results are provided in the ```data/demo/expected_results``` directory.
All commands should be executed from the top-level directory of the ```cue``` repository.

1. To call variants given the input BAM file and the provided pretrained ```cue``` model on a single CPU:
```python engine/call.py --data_config data/demo/exec/data.yaml --model_config data/demo/exec/model.yaml```

This command will launch ```cue``` in variant discovery mode. The input BAM file will first be indexed and
then the genome will be scanned to generate images and make predictions.
This process should take about 5-6 minutes in total on one CPU. This command can be executed on the GPU by providing
the desired GPU id in the ```gpu_ids``` field of the ```data/demo/exec/model.yaml``` file.
For example, to run on GPU 0, set this field as follows: ```gpu_ids: [0]```. Multiple GPU ids can be provided
to parallelize SV calling across multiple devices when running on a full-genome dataset.
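For instance, a sketch of the relevant ```model.yaml``` fragment (only the GPU field is shown; the surrounding fields are omitted):

```yaml
# Fragment of data/demo/exec/model.yaml -- only the GPU field shown
gpu_ids: [0]        # run on GPU 0
# gpu_ids: [0, 1]   # parallelize across two devices (useful for full-genome runs)
# gpu_ids: []       # empty list: fall back to CPU
```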

The output of the program will be generated in the ```data/demo/exec``` directory (generally ```cue``` writes
the results of each command in the parent folder of the provided YAML config files).
In particular, the discovered SVs will be provided here: ```data/demo/exec/reports/svs.vcf```
(```cue``` outputs both a BED and a VCF file with results).

For example, here are three SVs discovered by the model, ranked by confidence score (given in the ```QUAL``` column).

```
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMPLE
chr21 10399416 DUP N <DUP> 86 PASS END=10427813;SVTYPE=DUP;SVLEN=28398 GT 0/1
chr21 14855666 INV N <INV> 82 PASS END=14870587;SVTYPE=INV;SVLEN=14922 GT 0/1
chr21 14874025 DEL N <DEL> 82 PASS END=14891603;SVTYPE=DEL;SVLEN=17579 GT 1/1
```
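The ```INFO``` column above uses the standard VCF ```key=value;...``` encoding; a minimal sketch of extracting these fields in plain Python (no pysam dependency; the record string is copied from the output above):

```python
def parse_info(info: str) -> dict:
    """Split a VCF INFO column like 'END=10427813;SVTYPE=DUP;SVLEN=28398' into a dict."""
    fields = {}
    for entry in info.split(";"):
        key, _, value = entry.partition("=")
        fields[key] = value
    return fields

# First record from the demo output above (columns are tab-separated in a real VCF)
record = "chr21\t10399416\tDUP\tN\t<DUP>\t86\tPASS\tEND=10427813;SVTYPE=DUP;SVLEN=28398\tGT\t0/1"
info = parse_info(record.split("\t")[7])
print(info["SVTYPE"], info["SVLEN"])  # DUP 28398
```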

2. (Optional) To visualize the SV results:

```python engine/view.py --config data/demo/exec/view.yaml```

This command will generate annotated ```cue``` images and separate image channels centered around each discovered SV.
These channels can be used to evaluate various alignment signals contributing to each SV call.

For example, here are the annotations and channels produced for the above three events
(note that other events may appear in the same image if they are nearby).

```DUP: chr21 10399416-10427813```

| Annotation | READ-DEPTH | SPLIT-READ/READ-PAIRS | LL+RR PAIRS | RL PAIRS |
|:-------------------------:|:-------------------------:|:-------------------------:|:-------------------------:|:-------------------------:|
|<img src="data/demo/expected_results/annotated_images/8411211701_0_chr21_10349415-10477813_&_chr21_10349415-10477813.png" alt="drawing" width="200"/> | <img src="data/demo/expected_results/images/8411211701_chr21_10349415-10477813_&_chr21_10349415-10477813.png.ch1.png" alt="drawing" width="200"/> | <img src="data/demo/expected_results/images/8411211701_chr21_10349415-10477813_&_chr21_10349415-10477813.png.ch2.png" alt="drawing" width="200"/> | <img src="data/demo/expected_results/images/8411211701_chr21_10349415-10477813_&_chr21_10349415-10477813.png.ch3.png" alt="drawing" width="200"/> | <img src="data/demo/expected_results/images/8411211701_chr21_10349415-10477813_&_chr21_10349415-10477813.png.ch4.png" alt="drawing" width="200"/> |

```INV: chr21 14855666-14870587```

| Annotation | READ-DEPTH | SPLIT-READ/READ-PAIRS | LL+RR PAIRS | RL PAIRS |
|:-------------------------:|:-------------------------:|:-------------------------:|:-------------------------:|:-------------------------:|
|<img src="data/demo/expected_results/annotated_images/8411211701_1_chr21_14805665-14920587_&_chr21_14805665-14920587.png" alt="drawing" width="200"/> | <img src="data/demo/expected_results/images/8411211701_chr21_14805665-14920587_&_chr21_14805665-14920587.png.ch1.png" alt="drawing" width="200"/> | <img src="data/demo/expected_results/images/8411211701_chr21_14805665-14920587_&_chr21_14805665-14920587.png.ch2.png" alt="drawing" width="200"/> | <img src="data/demo/expected_results/images/8411211701_chr21_14805665-14920587_&_chr21_14805665-14920587.png.ch3.png" alt="drawing" width="200"/> | <img src="data/demo/expected_results/images/8411211701_chr21_14805665-14920587_&_chr21_14805665-14920587.png.ch4.png" alt="drawing" width="200"/>|


```DEL: chr21 14874025-14891603```

| Annotation | READ-DEPTH | SPLIT-READ/READ-PAIRS | LL+RR PAIRS | RL PAIRS |
|:-------------------------:|:-------------------------:|:-------------------------:|:-------------------------:|:-------------------------:|
|<img src="data/demo/expected_results/annotated_images/8411211701_2_chr21_14824024-14941603_&_chr21_14824024-14941603.png" alt="drawing" width="200"/> | <img src="data/demo/expected_results/images/8411211701_chr21_14824024-14941603_&_chr21_14824024-14941603.png.ch1.png" alt="drawing" width="200"/> | <img src="data/demo/expected_results/images/8411211701_chr21_14824024-14941603_&_chr21_14824024-14941603.png.ch2.png" alt="drawing" width="200"/> | <img src="data/demo/expected_results/images/8411211701_chr21_14824024-14941603_&_chr21_14824024-14941603.png.ch3.png" alt="drawing" width="200"/> | <img src="data/demo/expected_results/images/8411211701_chr21_14824024-14941603_&_chr21_14824024-14941603.png.ch4.png" alt="drawing" width="200"/>|


3. (Optional) To evaluate the results against the ground truth SVs using Truvari:

To run Truvari, we first need to post-process the produced VCF into the input format it requires;
in particular, we need to sort, compress, and index the SV VCF file. We recommend using ```bcftools``` and ```htslib```
for this task as follows.

a. Sort the VCF by position: ```bcftools sort data/demo/exec/reports/svs.vcf > data/demo/exec/reports/svs.sorted.vcf```
b. Compress the sorted VCF: ```bgzip -f data/demo/exec/reports/svs.sorted.vcf```
c. Index the compressed VCF: ```bcftools index -t data/demo/exec/reports/svs.sorted.vcf.gz```
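The sort step (a) amounts to keeping the header first and ordering the data lines by position; a minimal sketch in plain Python (illustrative only — ```bcftools sort``` additionally handles contig ordering and malformed records and should be preferred):

```python
def sort_vcf_lines(lines):
    """Keep header lines (starting with '#') first, then sort records by (CHROM, POS)."""
    header = [l for l in lines if l.startswith("#")]
    records = [l for l in lines if not l.startswith("#")]
    records.sort(key=lambda l: (l.split("\t")[0], int(l.split("\t")[1])))
    return header + records

# Toy VCF with two of the demo records, deliberately out of order
vcf = [
    "#CHROM\tPOS\tID\tREF\tALT",
    "chr21\t14874025\tDEL\tN\t<DEL>",
    "chr21\t10399416\tDUP\tN\t<DUP>",
]
print(sort_vcf_lines(vcf)[1].split("\t")[1])  # 10399416
```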

We can now execute Truvari as follows:
```truvari bench -b data/demo/ground_truth/svs.chr21.small.vcf.gz -c data/demo/exec/reports/svs.sorted.vcf.gz -o data/demo/expected_results/reports/truvari --pctsize=0.5 --pctov=0.5 --passonly --sizemax 5000000 --pctsim 0 --gtcomp```

Truvari will report the precision, recall, and F1 score for this small benchmark.
The above command will require a genotype match to consider an SV call a true positive (TP).

Here is a snapshot of the Truvari report:
```
"TP-base": 16,
"TP-call": 16,
"FP": 0,
"FN": 1,
"precision": 1.0,
"recall": 0.9411764705882353,
"f1": 0.9696969696969697,
"base cnt": 17,
"call cnt": 16
```
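The reported metrics follow the standard definitions; a quick sanity check of the numbers above:

```python
tp, fp, fn = 16, 0, 1  # values from the Truvari report above

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(precision)        # 1.0
print(round(recall, 4)) # 0.9412
print(round(f1, 4))     # 0.9697
```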

We can see that ```cue``` discovered 16 out of the 17 events simulated in this region (with no false positives).
Additional information produced by Truvari can be found here: ```data/demo/expected_results/reports/truvari```.
<a name="demo"></a>
### Tutorial

We recommend trying the provided [demo notebook](notebooks/tutorial.ipynb) to ensure that the software was properly
installed and to experiment running Cue. In this demo we will use Cue to discover variants in
a small BAM file provided here: ```data/demo/inputs/chr21.small.bam``` (with the associated YAML config files needed
to execute this workflow provided in the ```data/demo/config``` directory).

<a name="structure"></a>
### Execution
<a name="guide"></a>
### User guide

In addition to the functionality to call structural variants, the framework can be used to execute custom model training, evaluation, and image generation. The ```engine``` directory contains the following key high-level scripts to train/evaluate the model
and generate image datasets:
In addition to the functionality to call structural variants, the framework can be used to execute
custom model training, evaluation, and image generation. The ```engine``` directory contains the following
key high-level scripts to train/evaluate the model and generate image datasets:

* ```generate.py```: creates an annotated image dataset from alignments (BAM file(s))
* ```train.py```: trains a deep learning model (currently, this is a stacked hourglass network architecture)
to detect SV keypoints in images
* ```call.py```: calls structural variants given a pre-trained model and an input BAM file
(can be executed on multiple GPUs or CPUs)
* ```view.py```: plots images annotated with SVs from a VCF/BEDPE file given genome alignments (BAM format);
* ```train.py```: trains a deep learning model (currently, this is a stacked hourglass network architecture)
to detect SV keypoints in images
* ```generate.py```: creates an annotated image dataset from alignments (BAM file(s))
* ```view.py```: plots images annotated with SVs from a VCF/BED file given genome alignments (BAM format);
can be used to visualize model predictions or ground truth SVs

Each script accepts as input one or multiple YAML config files, which encode a variety of parameters. Template config files are provided
in the ```config``` directory.
Each script accepts as input one or multiple YAML config files, which encode a variety of parameters.
Template config files are provided in the ```config``` directory.

The key required and optional YAML parameters for each Cue command are listed below.

```call.py``` (data YAML):
* ```bam``` [*required*] path to the alignments BAM file
* ```fai``` [*required*] path to the reference FASTA FAI file
* ```n_cpus``` [*optional*] number of CPUs to use for calling (parallelized by chromosome)
* ```chr_names``` [*optional*] list of chromosomes to process: null (all) or a specific list e.g. ["chr1", "chr21"]

```call.py``` (model YAML):
* ```model_path``` [*required*] path to the pretrained Cue model
* ```gpu_ids``` [*optional*] list of GPU ids to use for calling -- CPU(s) will be used if empty

```train.py```:
* ```dataset_dirs``` [*required*] list of annotated imagesets to use for training
* ```gpu_ids``` [*optional*] GPU id to use for training -- a CPU will be used if empty
* ```report_interval``` [*optional*] frequency (in number of batches) for reporting training stats and image predictions

```generate.py```:
* ```bam``` [*required*] path to the alignments BAM file
* ```bed``` [*required*] path to the ground truth BED or VCF file
* ```fai``` [*required*] path to the reference FASTA FAI file
* ```n_cpus``` [*optional*] number of CPUs to use for image generation (parallelized by chromosome)
* ```chr_names``` [*optional*] list of chromosomes to process: null (all) or a specific list e.g. ["chr1", "chr21"]

```view.py```:
* ```bam``` [*required*] path to the alignments BAM file
* ```bed``` [*required*] path to the BED or VCF file with SVs to visualize
* ```fai``` [*required*] path to the reference FASTA FAI file
* ```n_cpus``` [*optional*] number of CPUs (parallelized by chromosome)
* ```chr_names``` [*optional*] list of chromosomes to process: null (all) or a specific list e.g. ["chr1", "chr21"]

<a name="workflow"></a>
#### Recommended workflow
15 changes: 9 additions & 6 deletions data/demo/exec/data.yaml → data/demo/config/data.yaml
@@ -1,10 +1,15 @@
bam: "data/demo/inputs/chr21.small.bam"
fai: "data/demo/inputs/GRCh38.fa.fai"
bed: null
blacklist_bed: null
#### REQUIRED ####
bam: "../data/demo/inputs/chr21.small.bam" # path to the alignments BAM file
fai: "../data/demo/inputs/GRCh38.fa.fai" # path to the reference FASTA FAI file
#### OPTIONAL ####
n_cpus: 1 # number of CPUs to use (parallelized by chromosome)
chr_names: ["chr21"] # list of chromosomes to process: null (all) or a specific list e.g. ["chr1", "chr21"]
#### FIXED (do not modify) ####
bam_type: "SHORT"
signal_set: "SHORT"
signal_set_origin: "SHORT"
bed: null
blacklist_bed: null
signal_vmax: {"RD": 600, "RD_LOW": 800, "RD_CLIPPED": 600, "SM": 200, "SR_RP": 600, "LR": 600, "LLRR": 100, "RL": 100, "LLRR_VS_LR": 1}
signal_mapq: {"RD": 20, "RD_LOW": 0, "RD_CLIPPED": 20, "SM": 20, "SR_RP": 0, "LR": 0, "LLRR": 1, "RL": 1, "LLRR_VS_LR": 1}
bin_size: 750
@@ -16,5 +21,3 @@ image_dim: 256
class_set: "BASIC5ZYG"
num_keypoints: 1
bbox_padding: 0
n_cpus: 1
chr_names: ["chr21"]
16 changes: 16 additions & 0 deletions data/demo/config/model.yaml
@@ -0,0 +1,16 @@
#### REQUIRED ####
model_path: "../data/models/cue.pt" # path to the pretrained Cue model
#### OPTIONAL ####
gpu_ids: [] # list of GPU ids to use for calling -- a CPU will be used if empty
report_interval: 5 # frequency (in number of batches) for reporting training stats and image predictions
pretrained_refinenn_path: null # path to the pretrained keypoint refinement model
#### FIXED (do not modify) ####
image_dim: 256
class_set: "BASIC5ZYG"
signal_set: "SHORT"
num_keypoints: 1
model_architecture: "HG"
batch_size: 16
sigma: 10
stride: 4
heatmap_peak_threshold: 0.65
23 changes: 23 additions & 0 deletions data/demo/config/view.yaml
@@ -0,0 +1,23 @@
#### REQUIRED ####
bam: "../data/demo/inputs/chr21.small.bam" # path to the alignments BAM file
fai: "../data/demo/inputs/GRCh38.fa.fai" # path to the reference FASTA FAI file
bed: "../data/demo/results/reports/svs.vcf" # SVs to visualize (BED or VCF)
#### OPTIONAL ####
n_cpus: 1 # number of CPUs to use (parallelized by chromosome)
chr_names: ["chr21"] # list of chromosomes to process: null (all) or a specific list e.g. ["chr1", "chr21"]
#### FIXED (do not modify) ####
blacklist_bed: null
bam_type: "SHORT"
signal_set: "SHORT"
signal_set_origin: "SHORT"
signal_vmax: {"RD": 600, "RD_LOW": 800, "RD_CLIPPED": 600, "SM": 200, "SR_RP": 600, "LR": 600, "LLRR": 100, "RL": 100, "LLRR_VS_LR": 1}
signal_mapq: {"RD": 20, "RD_LOW": 0, "RD_CLIPPED": 20, "SM": 20, "SR_RP": 0, "LR": 0, "LLRR": 1, "RL": 1, "LLRR_VS_LR": 1}
bin_size: 750
interval_size: [150000]
step_size: [50000]
shift_size: [0, 70000, 140000]
heatmap_dim: 1000
image_dim: 256
class_set: "BASIC5ZYG"
num_keypoints: 1
bbox_padding: 0
13 changes: 0 additions & 13 deletions data/demo/exec/model.yaml

This file was deleted.

3 changes: 2 additions & 1 deletion engine/config_utils.py
@@ -127,7 +127,8 @@ def __init__(self, config_file, **entries):
self.bam_type = constants.BAM_TYPE[self.bam_type]
self.signal_set = constants.SV_SIGNAL_SET[self.signal_set]
self.num_signals = len(constants.SV_SIGNALS_BY_TYPE[self.signal_set])
self.uid = ''.join(["{}".format(randint(0, 9)) for _ in range(0, 10)])
# self.uid = ''.join(["{}".format(randint(0, 9)) for _ in range(0, 10)])
self.uid = ''.join(["{}".format(0) for _ in range(0, 10)])

# setup the dataset directory structure
Path(self.info_dir).mkdir(parents=True, exist_ok=True)
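The edit above swaps a random 10-digit run UID for a fixed one (presumably so output paths are reproducible across runs); a standalone comparison of the two expressions, lifted out of the class context:

```python
from random import randint

# Original: a random 10-digit string, different on every run
random_uid = ''.join("{}".format(randint(0, 9)) for _ in range(10))

# New: deterministic, always the same string
fixed_uid = ''.join("{}".format(0) for _ in range(10))

print(len(random_uid), fixed_uid)  # 10 0000000000
```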
69 changes: 4 additions & 65 deletions install/requirements.txt
@@ -1,81 +1,20 @@
ACEBinf==1.0.2
argcomplete==2.0.0
argh==0.26.2
attrs==21.4.0
setuptools==58.0.0
bitarray==1.6.3
bleach==3.1.5
bwapy==0.1.4
cachetools==4.1.0
certifi==2021.10.8
cffi==1.14.2
chardet==3.0.4
charset-normalizer==2.0.12
cycler==0.11.0
Cython==0.29.21
fonttools==4.33.3
future==0.18.2
gast==0.3.3
gffutils==0.10.1
idna==3.3
importlib-metadata==4.11.3
cython==0.29.21
intervaltree==3.1.0
ipython-genutils==0.2.0
joblib==0.16.0
jsonpatch==1.32
jsonpointer==2.3
kiwisolver==1.4.2
mappy==2.18
matplotlib==3.2.1
more-itertools==8.3.0
networkx==2.6.3
numpy==1.18.5
opencv-python==4.5.1.48
packaging==21.3
pandas==1.0.5
pickleshare==0.7.5
Pillow==7.2.0
pluggy==0.13.1
progressbar2==4.0.0
py==1.8.1
pycocotools==2.0.4
pycparser==2.20
pyfaidx==0.5.9.5
pyparsing==3.0.8
pysam==0.16.0.1
pytabix==0.1
pytest==5.4.2
python-dateutil==2.8.1
python-Levenshtein==0.12.1
python-utils==3.1.0
pytz==2022.1
PyVCF==0.6.8
PyYAML==5.3.1
pyzmq==22.3.0
RCK==1.1.0
requests==2.27.1
scikit-learn==0.23.2
scipy==1.4.1
pyyaml==5.3.1
seaborn==0.11.0
simplejson==3.17.6
six==1.14.0
sklearn==0.0
sortedcontainers==2.3.0
Tcl==0.2
termcolor==1.1.0
threadpoolctl==2.1.0
threadsafe-tkinter==1.0.2
torch==1.5.1
torchfile==0.1.0
torchvision==0.6.1
tornado==6.1
traitlets==5.0.4
Truvari==2.1
typing_extensions==4.2.0
urllib3==1.25.9
visdom==0.1.8.9
wcwidth==0.1.9
wdl==1.0.22
webencodings==0.5.1
websocket-client==1.3.2
xtermcolor==1.3
zipp==3.1.0
jupyter
