Commit

Converted the tutorial into a jupyter notebook; expanded docs
viq854 committed May 24, 2022
1 parent cd2d7e9 commit 66e3456
Showing 8 changed files with 884 additions and 197 deletions.
165 changes: 53 additions & 112 deletions README.md
Original file line number Diff line number Diff line change
@@ -3,14 +3,14 @@
##### Table of Contents
[Overview](#overview)
[Installation](#install)
[Demo / Tutorial](#demo)
[Execution](#structure)
[Tutorial](#demo)
[User Guide](#guide)
[Recommended workflow](#workflow)

<a name="overview"></a>
### Overview

```cue``` is a deep learning framework for SV calling and genotyping. At a high-level, ```cue``` operates
Cue is a deep learning framework for SV calling and genotyping. At a high level, Cue operates
in the following stages illustrated in the figure below:
* sequence alignments are converted into images that capture multiple alignment signals across two genome intervals,
* a trained neural network is used to generate Gaussian response confidence maps for each image,
@@ -21,7 +21,7 @@ which encode the location, type, and genotype of the SVs in this image, and
<img src="docs/img/Cue_overview.png" alt="drawing" width="1000"/>
</p>

The current version of ```cue``` can be used to detect and genotype the following SV types:
The current version of Cue can be used to detect and genotype the following SV types:
deletions (DELs), tandem duplications (DUPs), inversions (INVs), deletion-flanked inversions (INVDELs),
and inverted duplications (INVDUPs) larger than 5kbp.

@@ -50,123 +50,64 @@ provided in the ```install/requirements.txt``` file; for example:

* Set the ```PYTHONPATH``` as follows: ```export PYTHONPATH=${PYTHONPATH}:/path/to/cue```

<a name="demo"></a>
### Demo

We recommend trying the following demo/tutorial to ensure that the software was properly installed
and to experiment running ```cue```. In this demo we will use ```cue``` to discover variants in
a small BAM file provided here: ```data/demo/inputs/chr21.small.bam```.
This file contains alignments to a region of ```GRCh38``` ```chr21``` of short reads from a synthetic genome.
The ground-truth variants simulated in this region of the genome are provided here:
```data/demo/ground_truth/svs.chr21.small.vcf.gz```.
The associated YAML config files needed to execute this workflow are provided
in the ```data/demo/exec``` directory.
The expected results are provided in the ```data/demo/expected_results``` directory.
All commands should be executed from the top-level directory of the ```cue``` repository.

1. To call variants given the input BAM file and the provided pretrained ```cue``` model on a single CPU:
```python engine/call.py --data_config data/demo/exec/data.yaml --model_config data/demo/exec/model.yaml```

This command will launch ```cue``` in variant discovery mode. The input BAM file will first be indexed and
then the genome will be scanned to generate images and make predictions.
This process should take about 5-6 minutes in total on one CPU. This command can be executed on the GPU by providing
the desired GPU id in the ```gpu_ids``` field of the ```data/demo/exec/model.yaml``` file.
For example, to run on GPU 0, set this field as follows: ```gpu_ids: [0]```. Multiple GPU ids can be provided
to parallelize SV calling across multiple devices when running on a full-genome dataset.
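For instance, a sketch of the relevant ```model.yaml``` fragment (only the GPU field is shown; the surrounding fields are omitted):

```yaml
# Fragment of data/demo/exec/model.yaml -- only the GPU field shown
gpu_ids: [0]        # run on GPU 0
# gpu_ids: [0, 1]   # parallelize across two devices (useful for full-genome runs)
# gpu_ids: []       # empty list: fall back to CPU
```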

The output of the program will be generated in the ```data/demo/exec``` directory (generally ```cue``` writes
the results of each command in the parent folder of the provided YAML config files).
In particular, the discovered SVs will be provided here: ```data/demo/exec/reports/svs.vcf```
(```cue``` outputs both a BED and a VCF file with results).

For example, here are three SVs discovered by the model, ranked by confidence score (given in the ```QUAL``` column).

```
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMPLE
chr21 10399416 DUP N <DUP> 86 PASS END=10427813;SVTYPE=DUP;SVLEN=28398 GT 0/1
chr21 14855666 INV N <INV> 82 PASS END=14870587;SVTYPE=INV;SVLEN=14922 GT 0/1
chr21 14874025 DEL N <DEL> 82 PASS END=14891603;SVTYPE=DEL;SVLEN=17579 GT 1/1
```
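The ```INFO``` column above uses the standard VCF ```key=value;...``` encoding; a minimal sketch of extracting these fields in plain Python (no pysam dependency; the record string is copied from the output above):

```python
def parse_info(info: str) -> dict:
    """Split a VCF INFO column like 'END=10427813;SVTYPE=DUP;SVLEN=28398' into a dict."""
    fields = {}
    for entry in info.split(";"):
        key, _, value = entry.partition("=")
        fields[key] = value
    return fields

# First record from the demo output above (columns are tab-separated in a real VCF)
record = "chr21\t10399416\tDUP\tN\t<DUP>\t86\tPASS\tEND=10427813;SVTYPE=DUP;SVLEN=28398\tGT\t0/1"
info = parse_info(record.split("\t")[7])
print(info["SVTYPE"], info["SVLEN"])  # DUP 28398
```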

2. (Optional) To visualize the SV results:

```python engine/view.py --config data/demo/exec/view.yaml```

This command will generate annotated ```cue``` images and separate image channels centered around each discovered SV.
These channels can be used to evaluate various alignment signals contributing to each SV call.

For example, here are the annotations and channels produced for the above three events
(note that other events may appear in the same image if they are nearby).

```DUP: chr21 10399416-10427813```

| Annotation | READ-DEPTH | SPLIT-READ/READ-PAIRS | LL+RR PAIRS | RL PAIRS |
|:-------------------------:|:-------------------------:|:-------------------------:|:-------------------------:|:-------------------------:|
|<img src="data/demo/expected_results/annotated_images/8411211701_0_chr21_10349415-10477813_&_chr21_10349415-10477813.png" alt="drawing" width="200"/> | <img src="data/demo/expected_results/images/8411211701_chr21_10349415-10477813_&_chr21_10349415-10477813.png.ch1.png" alt="drawing" width="200"/> | <img src="data/demo/expected_results/images/8411211701_chr21_10349415-10477813_&_chr21_10349415-10477813.png.ch2.png" alt="drawing" width="200"/> | <img src="data/demo/expected_results/images/8411211701_chr21_10349415-10477813_&_chr21_10349415-10477813.png.ch3.png" alt="drawing" width="200"/> | <img src="data/demo/expected_results/images/8411211701_chr21_10349415-10477813_&_chr21_10349415-10477813.png.ch4.png" alt="drawing" width="200"/> |

```INV: chr21 14855666-14870587```

| Annotation | READ-DEPTH | SPLIT-READ/READ-PAIRS | LL+RR PAIRS | RL PAIRS |
|:-------------------------:|:-------------------------:|:-------------------------:|:-------------------------:|:-------------------------:|
|<img src="data/demo/expected_results/annotated_images/8411211701_1_chr21_14805665-14920587_&_chr21_14805665-14920587.png" alt="drawing" width="200"/> | <img src="data/demo/expected_results/images/8411211701_chr21_14805665-14920587_&_chr21_14805665-14920587.png.ch1.png" alt="drawing" width="200"/> | <img src="data/demo/expected_results/images/8411211701_chr21_14805665-14920587_&_chr21_14805665-14920587.png.ch2.png" alt="drawing" width="200"/> | <img src="data/demo/expected_results/images/8411211701_chr21_14805665-14920587_&_chr21_14805665-14920587.png.ch3.png" alt="drawing" width="200"/> | <img src="data/demo/expected_results/images/8411211701_chr21_14805665-14920587_&_chr21_14805665-14920587.png.ch4.png" alt="drawing" width="200"/>|


```DEL: chr21 14874025-14891603```

| Annotation | READ-DEPTH | SPLIT-READ/READ-PAIRS | LL+RR PAIRS | RL PAIRS |
|:-------------------------:|:-------------------------:|:-------------------------:|:-------------------------:|:-------------------------:|
|<img src="data/demo/expected_results/annotated_images/8411211701_2_chr21_14824024-14941603_&_chr21_14824024-14941603.png" alt="drawing" width="200"/> | <img src="data/demo/expected_results/images/8411211701_chr21_14824024-14941603_&_chr21_14824024-14941603.png.ch1.png" alt="drawing" width="200"/> | <img src="data/demo/expected_results/images/8411211701_chr21_14824024-14941603_&_chr21_14824024-14941603.png.ch2.png" alt="drawing" width="200"/> | <img src="data/demo/expected_results/images/8411211701_chr21_14824024-14941603_&_chr21_14824024-14941603.png.ch3.png" alt="drawing" width="200"/> | <img src="data/demo/expected_results/images/8411211701_chr21_14824024-14941603_&_chr21_14824024-14941603.png.ch4.png" alt="drawing" width="200"/>|


3. (Optional) To evaluate the results against the ground truth SVs using Truvari:

To run Truvari, we first need to post-process the produced VCF into the input format it requires;
in particular, we need to sort, compress, and index the SV VCF file. We recommend using ```bcftools``` and ```htslib```
for this task as follows.

a. Sort the VCF by position: ```bcftools sort data/demo/exec/reports/svs.vcf > data/demo/exec/reports/svs.sorted.vcf```
b. Compress the sorted VCF: ```bgzip -f data/demo/exec/reports/svs.sorted.vcf```
c. Index the compressed VCF: ```bcftools index -t data/demo/exec/reports/svs.sorted.vcf.gz```
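The sort step (a) amounts to keeping the header first and ordering the data lines by position; a minimal sketch in plain Python (illustrative only — ```bcftools sort``` additionally handles contig ordering and malformed records and should be preferred):

```python
def sort_vcf_lines(lines):
    """Keep header lines (starting with '#') first, then sort records by (CHROM, POS)."""
    header = [l for l in lines if l.startswith("#")]
    records = [l for l in lines if not l.startswith("#")]
    records.sort(key=lambda l: (l.split("\t")[0], int(l.split("\t")[1])))
    return header + records

# Toy VCF with two of the demo records, deliberately out of order
vcf = [
    "#CHROM\tPOS\tID\tREF\tALT",
    "chr21\t14874025\tDEL\tN\t<DEL>",
    "chr21\t10399416\tDUP\tN\t<DUP>",
]
print(sort_vcf_lines(vcf)[1].split("\t")[1])  # 10399416
```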

We can now execute Truvari as follows:
```truvari bench -b data/demo/ground_truth/svs.chr21.small.vcf.gz -c data/demo/exec/reports/svs.sorted.vcf.gz -o data/demo/expected_results/reports/truvari --pctsize=0.5 --pctov=0.5 --passonly --sizemax 5000000 --pctsim 0 --gtcomp```

Truvari will report the precision, recall, and F1 score for this small benchmark.
The above command will require a genotype match to consider an SV call a true positive (TP).

Here is a snapshot of the Truvari report:
```
"TP-base": 16,
"TP-call": 16,
"FP": 0,
"FN": 1,
"precision": 1.0,
"recall": 0.9411764705882353,
"f1": 0.9696969696969697,
"base cnt": 17,
"call cnt": 16
```
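The reported metrics follow the standard definitions; a quick sanity check of the numbers above:

```python
tp, fp, fn = 16, 0, 1  # values from the Truvari report above

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(precision)        # 1.0
print(round(recall, 4)) # 0.9412
print(round(f1, 4))     # 0.9697
```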

We can see that ```cue``` discovered 16 out of the 17 events simulated in this region (with no false positives).
Additional information produced by Truvari can be found here: ```data/demo/expected_results/reports/truvari```.
<a name="demo"></a>
### Tutorial

We recommend trying the provided [demo notebook](notebooks/tutorial.ipynb) to ensure that the software was properly
installed and to experiment running Cue. In this demo we will use Cue to discover variants in
a small BAM file provided here: ```data/demo/inputs/chr21.small.bam``` (with the associated YAML config files needed
to execute this workflow provided in the ```data/demo/config``` directory).

<a name="structure"></a>
### Execution
<a name="guide"></a>
### User guide

In addition to the functionality to call structural variants, the framework can be used to execute custom model training, evaluation, and image generation. The ```engine``` directory contains the following key high-level scripts to train/evaluate the model
and generate image datasets:
In addition to the functionality to call structural variants, the framework can be used to execute
custom model training, evaluation, and image generation. The ```engine``` directory contains the following
key high-level scripts to train/evaluate the model and generate image datasets:

* ```generate.py```: creates an annotated image dataset from alignments (BAM file(s))
* ```train.py```: trains a deep learning model (currently, this is a stacked hourglass network architecture)
to detect SV keypoints in images
* ```call.py```: calls structural variants given a pre-trained model and an input BAM file
(can be executed on multiple GPUs or CPUs)
* ```view.py```: plots images annotated with SVs from a VCF/BEDPE file given genome alignments (BAM format);
* ```train.py```: trains a deep learning model (currently, this is a stacked hourglass network architecture)
to detect SV keypoints in images
* ```generate.py```: creates an annotated image dataset from alignments (BAM file(s))
* ```view.py```: plots images annotated with SVs from a VCF/BED file given genome alignments (BAM format);
can be used to visualize model predictions or ground truth SVs

Each script accepts as input one or multiple YAML config files, which encode a variety of parameters. Template config files are provided
in the ```config``` directory.
Each script accepts as input one or multiple YAML config files, which encode a variety of parameters.
Template config files are provided in the ```config``` directory.

The key required and optional YAML parameters for each Cue command are listed below.

```call.py``` (data YAML):
* ```bam``` [*required*] path to the alignments BAM file
* ```fai``` [*required*] path to the reference FASTA FAI file
* ```n_cpus``` [*optional*] number of CPUs to use for calling (parallelized by chromosome)
* ```chr_names``` [*optional*] list of chromosomes to process: null (all) or a specific list e.g. ["chr1", "chr21"]

```call.py``` (model YAML):
* ```model_path``` [*required*] path to the pretrained Cue model
* ```gpu_ids``` [*optional*] list of GPU ids to use for calling -- CPU(s) will be used if empty

```train.py```:
* ```dataset_dirs``` [*required*] list of annotated imagesets to use for training
* ```gpu_ids``` [*optional*] GPU id to use for training -- a CPU will be used if empty
* ```report_interval``` [*optional*] frequency (in number of batches) for reporting training stats and image predictions

```generate.py```:
* ```bam``` [*required*] path to the alignments BAM file
* ```bed``` [*required*] path to the ground truth BED or VCF file
* ```fai``` [*required*] path to the reference FASTA FAI file
* ```n_cpus``` [*optional*] number of CPUs to use for image generation (parallelized by chromosome)
* ```chr_names``` [*optional*] list of chromosomes to process: null (all) or a specific list e.g. ["chr1", "chr21"]

```view.py```:
* ```bam``` [*required*] path to the alignments BAM file
* ```bed``` [*required*] path to the BED or VCF file with SVs to visualize
* ```fai``` [*required*] path to the reference FASTA FAI file
* ```n_cpus``` [*optional*] number of CPUs (parallelized by chromosome)
* ```chr_names``` [*optional*] list of chromosomes to process: null (all) or a specific list e.g. ["chr1", "chr21"]

<a name="workflow"></a>
#### Recommended workflow
15 changes: 9 additions & 6 deletions data/demo/exec/data.yaml → data/demo/config/data.yaml
@@ -1,10 +1,15 @@
bam: "data/demo/inputs/chr21.small.bam"
fai: "data/demo/inputs/GRCh38.fa.fai"
bed: null
blacklist_bed: null
#### REQUIRED ####
bam: "../data/demo/inputs/chr21.small.bam" # path to the alignments BAM file
fai: "../data/demo/inputs/GRCh38.fa.fai" # path to the reference FASTA FAI file
#### OPTIONAL ####
n_cpus: 1 # number of CPUs to use (parallelized by chromosome)
chr_names: ["chr21"] # list of chromosomes to process: null (all) or a specific list e.g. ["chr1", "chr21"]
#### FIXED (do not modify) ####
bam_type: "SHORT"
signal_set: "SHORT"
signal_set_origin: "SHORT"
bed: null
blacklist_bed: null
signal_vmax: {"RD": 600, "RD_LOW": 800, "RD_CLIPPED": 600, "SM": 200, "SR_RP": 600, "LR": 600, "LLRR": 100, "RL": 100, "LLRR_VS_LR": 1}
signal_mapq: {"RD": 20, "RD_LOW": 0, "RD_CLIPPED": 20, "SM": 20, "SR_RP": 0, "LR": 0, "LLRR": 1, "RL": 1, "LLRR_VS_LR": 1}
bin_size: 750
@@ -16,5 +21,3 @@ image_dim: 256
class_set: "BASIC5ZYG"
num_keypoints: 1
bbox_padding: 0
n_cpus: 1
chr_names: ["chr21"]
16 changes: 16 additions & 0 deletions data/demo/config/model.yaml
@@ -0,0 +1,16 @@
#### REQUIRED ####
model_path: "../data/models/cue.pt" # path to the pretrained Cue model
#### OPTIONAL ####
gpu_ids: [] # list of GPU ids to use for calling -- a CPU will be used if empty
report_interval: 5 # frequency (in number of batches) for reporting training stats and image predictions
pretrained_refinenn_path: null # path to the pretrained keypoint refinement model
#### FIXED (do not modify) ####
image_dim: 256
class_set: "BASIC5ZYG"
signal_set: "SHORT"
num_keypoints: 1
model_architecture: "HG"
batch_size: 16
sigma: 10
stride: 4
heatmap_peak_threshold: 0.65
23 changes: 23 additions & 0 deletions data/demo/config/view.yaml
@@ -0,0 +1,23 @@
#### REQUIRED ####
bam: "../data/demo/inputs/chr21.small.bam" # path to the alignments BAM file
fai: "../data/demo/inputs/GRCh38.fa.fai" # path to the reference FASTA FAI file
bed: "../data/demo/results/reports/svs.vcf" # SVs to visualize (BED or VCF)
#### OPTIONAL ####
n_cpus: 1 # number of CPUs to use (parallelized by chromosome)
chr_names: ["chr21"] # list of chromosomes to process: null (all) or a specific list e.g. ["chr1", "chr21"]
#### FIXED (do not modify) ####
blacklist_bed: null
bam_type: "SHORT"
signal_set: "SHORT"
signal_set_origin: "SHORT"
signal_vmax: {"RD": 600, "RD_LOW": 800, "RD_CLIPPED": 600, "SM": 200, "SR_RP": 600, "LR": 600, "LLRR": 100, "RL": 100, "LLRR_VS_LR": 1}
signal_mapq: {"RD": 20, "RD_LOW": 0, "RD_CLIPPED": 20, "SM": 20, "SR_RP": 0, "LR": 0, "LLRR": 1, "RL": 1, "LLRR_VS_LR": 1}
bin_size: 750
interval_size: [150000]
step_size: [50000]
shift_size: [0, 70000, 140000]
heatmap_dim: 1000
image_dim: 256
class_set: "BASIC5ZYG"
num_keypoints: 1
bbox_padding: 0
13 changes: 0 additions & 13 deletions data/demo/exec/model.yaml

This file was deleted.

3 changes: 2 additions & 1 deletion engine/config_utils.py
@@ -127,7 +127,8 @@ def __init__(self, config_file, **entries):
self.bam_type = constants.BAM_TYPE[self.bam_type]
self.signal_set = constants.SV_SIGNAL_SET[self.signal_set]
self.num_signals = len(constants.SV_SIGNALS_BY_TYPE[self.signal_set])
self.uid = ''.join(["{}".format(randint(0, 9)) for _ in range(0, 10)])
# self.uid = ''.join(["{}".format(randint(0, 9)) for _ in range(0, 10)])
self.uid = ''.join(["{}".format(0) for _ in range(0, 10)])

# setup the dataset directory structure
Path(self.info_dir).mkdir(parents=True, exist_ok=True)
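The edit above swaps a random 10-digit run UID for a fixed one (presumably so output paths are reproducible across runs); a standalone comparison of the two expressions, lifted out of the class context:

```python
from random import randint

# Original: a random 10-digit string, different on every run
random_uid = ''.join("{}".format(randint(0, 9)) for _ in range(10))

# New: deterministic, always the same string
fixed_uid = ''.join("{}".format(0) for _ in range(10))

print(len(random_uid), fixed_uid)  # 10 0000000000
```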
69 changes: 4 additions & 65 deletions install/requirements.txt
@@ -1,81 +1,20 @@
ACEBinf==1.0.2
argcomplete==2.0.0
argh==0.26.2
attrs==21.4.0
setuptools==58.0.0
bitarray==1.6.3
bleach==3.1.5
bwapy==0.1.4
cachetools==4.1.0
certifi==2021.10.8
cffi==1.14.2
chardet==3.0.4
charset-normalizer==2.0.12
cycler==0.11.0
Cython==0.29.21
fonttools==4.33.3
future==0.18.2
gast==0.3.3
gffutils==0.10.1
idna==3.3
importlib-metadata==4.11.3
cython==0.29.21
intervaltree==3.1.0
ipython-genutils==0.2.0
joblib==0.16.0
jsonpatch==1.32
jsonpointer==2.3
kiwisolver==1.4.2
mappy==2.18
matplotlib==3.2.1
more-itertools==8.3.0
networkx==2.6.3
numpy==1.18.5
opencv-python==4.5.1.48
packaging==21.3
pandas==1.0.5
pickleshare==0.7.5
Pillow==7.2.0
pluggy==0.13.1
progressbar2==4.0.0
py==1.8.1
pycocotools==2.0.4
pycparser==2.20
pyfaidx==0.5.9.5
pyparsing==3.0.8
pysam==0.16.0.1
pytabix==0.1
pytest==5.4.2
python-dateutil==2.8.1
python-Levenshtein==0.12.1
python-utils==3.1.0
pytz==2022.1
PyVCF==0.6.8
PyYAML==5.3.1
pyzmq==22.3.0
RCK==1.1.0
requests==2.27.1
scikit-learn==0.23.2
scipy==1.4.1
pyyaml==5.3.1
seaborn==0.11.0
simplejson==3.17.6
six==1.14.0
sklearn==0.0
sortedcontainers==2.3.0
Tcl==0.2
termcolor==1.1.0
threadpoolctl==2.1.0
threadsafe-tkinter==1.0.2
torch==1.5.1
torchfile==0.1.0
torchvision==0.6.1
tornado==6.1
traitlets==5.0.4
Truvari==2.1
typing_extensions==4.2.0
urllib3==1.25.9
visdom==0.1.8.9
wcwidth==0.1.9
wdl==1.0.22
webencodings==0.5.1
websocket-client==1.3.2
xtermcolor==1.3
zipp==3.1.0
jupyter
