Internal parameter default setting (#12)
viq854 authored Feb 20, 2023
1 parent 37aa1cd commit 46b1005
Showing 17 changed files with 261 additions and 218 deletions.
29 changes: 18 additions & 11 deletions README.md
Original file line number Diff line number Diff line change
@@ -59,9 +59,13 @@ from Cue's capsule in CodeOcean.

The latest pre-trained Cue model can be downloaded from this [link](https://storage.googleapis.com/cue-models/latest/cue.v2.pt).

All the models are stored in the following public [Google Cloud Storage bucket](https://console.cloud.google.com/storage/browser/cue-models). Synthetic datasets are available in the public [Google Cloud Storage datasets bucket](https://console.cloud.google.com/storage/browser/cue-synth-datasets). Files can be viewed/downloaded using [gsutil](https://cloud.google.com/storage/docs/gsutil) or directly from the browser using the [Google Cloud console](https://cloud.google.com/storage/docs/cloud-console).
To download the latest model into the data/models directory:

```wget --directory-prefix=data/models/ https://storage.googleapis.com/cue-models/latest/cue.v2.pt```


All the models are stored in the following public [Google Cloud Storage bucket](https://console.cloud.google.com/storage/browser/cue-models). Synthetic datasets are available in the public [Google Cloud Storage datasets bucket](https://console.cloud.google.com/storage/browser/cue-synth-datasets). Files can be viewed/downloaded using [gsutil](https://cloud.google.com/storage/docs/gsutil) or directly from the browser using the [Google Cloud console](https://cloud.google.com/storage/docs/cloud-console).
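
For scripted setups, the `wget` command above can be mirrored in Python. The following is a minimal sketch using only the standard library; the helper names `model_destination` and `download_model` are illustrative and are not part of Cue.

```python
import os
import urllib.request
from urllib.parse import urlparse

# URL of the latest pre-trained model, as given in the README.
MODEL_URL = "https://storage.googleapis.com/cue-models/latest/cue.v2.pt"


def model_destination(url, directory="data/models"):
    """Return the local path the file at `url` would be saved to."""
    filename = os.path.basename(urlparse(url).path)
    return os.path.join(directory, filename)


def download_model(url=MODEL_URL, directory="data/models"):
    """Download the model into `directory`, creating it if needed (hypothetical helper)."""
    os.makedirs(directory, exist_ok=True)
    destination = model_destination(url, directory)
    urllib.request.urlretrieve(url, destination)
    return destination
```

Calling `download_model()` from the repository root would place the file at the path returned by `model_destination(MODEL_URL)`, which can then be used as `model_path` in the model YAML.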

<a name="demo"></a>
### Tutorial

@@ -87,38 +91,41 @@ to detect SV keypoints in images
can be used to visualize model predictions or ground truth SVs

Each script accepts as input one or multiple YAML config files, which encode a variety of parameters.
Template config files are provided in the ```config``` directory.
Template config files with key parameters are provided in the ```config``` directory.
The ```config/custom``` directory contains template config files with additional parameters that
can be useful when generating custom models.

The key required and optional YAML parameters for each Cue command are listed below.

```call.py``` (data YAML):
* ```bam``` [*required*] path to the alignments file (BAM/CRAM format)
* ```fai``` [*required*] path to the reference FASTA FAI file
* ```n_cpus``` [*optional*] number of CPUs to use for calling (parallelized by chromosome)
* ```chr_names``` [*optional*] list of chromosomes to process: null (all) or a specific list e.g. ["chr1", "chr21"]
* ```chr_names``` [*optional*] list of chromosomes to process: null (all) or a specific list e.g. ["chr1", "chr21"] (default: null)

```call.py``` (model YAML):
* ```model_path``` [*required*] path to the pretrained Cue model
* ```gpu_ids``` [*optional*] list of GPU ids to use for calling -- CPU(s) will be used if empty
* ```model_path``` [*required*] path to the pretrained Cue model (recommended: the latest available model)
* ```gpu_ids``` [*optional*] list of GPU ids to use for calling (default: CPU(s) will be used if empty)
* ```n_jobs_per_gpu``` [*optional*] number of parallel jobs to launch on the same GPU (default: 1)
* ```n_cpus``` [*optional*] number of CPUs to use for calling if no GPUs are listed (default: 1)

```train.py```:
* ```dataset_dirs``` [*required*] list of annotated imagesets to use for training
* ```gpu_ids``` [*optional*] GPU id to use for training -- a CPU will be used if empty
* ```report_interval``` [*optional*] frequency (in number of batches) for reporting training stats and image predictions
* ```report_interval``` [*optional*] frequency (in number of batches) for reporting training stats and image predictions (default: 50)

```generate.py```:
* ```bam``` [*required*] path to the alignments file (BAM/CRAM format)
* ```bed``` [*required*] path to the ground truth BED or VCF file
* ```fai``` [*required*] path to the reference FASTA FAI file
* ```n_cpus``` [*optional*] number of CPUs to use for image generation (parallelized by chromosome)
* ```chr_names``` [*optional*] list of chromosomes to process: null (all) or a specific list e.g. ["chr1", "chr21"]
* ```n_cpus``` [*optional*] number of CPUs to use for image generation (parallelized by chromosome) (default: 1)
* ```chr_names``` [*optional*] list of chromosomes to process: null (all) or a specific list e.g. ["chr1", "chr21"] (default: null)

```view.py```:
* ```bam``` [*required*] path to the alignments file (BAM/CRAM format)
* ```bed``` [*required*] path to the BED or VCF file with SVs to visualize
* ```fai``` [*required*] path to the reference FASTA FAI file
* ```n_cpus``` [*optional*] number of CPUs (parallelized by chromosome)
* ```chr_names``` [*optional*] list of chromosomes to process: null (all) or a specific list e.g. ["chr1", "chr21"]
* ```n_cpus``` [*optional*] number of CPUs (parallelized by chromosome) (default: 1)
* ```chr_names``` [*optional*] list of chromosomes to process: null (all) or a specific list e.g. ["chr1", "chr21"] (default: null)
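
The required/optional split listed above can be illustrated with a small sketch of how user-supplied parameters might be merged with internal defaults. This is an assumption about the general approach, not Cue's actual implementation; `VIEW_DEFAULTS`, `VIEW_REQUIRED`, and `apply_defaults` are hypothetical names, and the default values are copied from the `view.py` list.

```python
# Hypothetical defaults, taken from the "(default: ...)" notes for view.py.
VIEW_DEFAULTS = {"n_cpus": 1, "chr_names": None}
VIEW_REQUIRED = ("bam", "bed", "fai")


def apply_defaults(user_config, defaults, required):
    """Check required keys, then overlay user values on top of the defaults."""
    missing = [key for key in required if key not in user_config]
    if missing:
        raise ValueError(f"missing required parameter(s): {', '.join(missing)}")
    merged = dict(defaults)      # start from the internal defaults
    merged.update(user_config)   # user-provided values take precedence
    return merged
```

With this scheme a user config only needs the required paths plus any optional values that differ from the defaults.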

<a name="workflow"></a>
#### Recommended workflow
35 changes: 2 additions & 33 deletions config/call_data.yaml
@@ -1,37 +1,6 @@
#### REQUIRED ####
bam: "/path/to/bam" # path to the alignments file (BAM/CRAM format)
fai: "/path/to/fai" # path to the referene FASTA FAI file
fai: "/path/to/fai" # path to the reference FASTA FAI file
#### OPTIONAL ####
n_cpus: 1 # number of CPUs to use (parallelized by chromosome)
chr_names: null # list of chromosomes to process: null (all) or a specific list e.g. ["chr1", "chr21"]
logging_level: "INFO" # verbosity level (set to "ERROR" to reduce logging volume)
min_qual_score: 50 # minimum SV confidence/quality score to output
#### FIXED (do not modify) ####
bam_type: "SHORT"
min_sv_len: 4000
signal_set: "SHORT"
signal_set_origin: "SHORT"
signal_vmax: {"RD": 600, "RD_LOW": 800, "RD_CLIPPED": 600, "SM": 200, "SR_RP": 600, "LR": 600, "LLRR": 100, "RL": 100, "LLRR_VS_LR": 1}
signal_mapq: {"RD": 20, "RD_LOW": 0, "RD_CLIPPED": 20, "SM": 20, "SR_RP": 0, "LR": 0, "LLRR": 1, "RL": 1, "LLRR_VS_LR": 1}
blacklist_bed: null
bed: null
bin_size: 750
interval_size: 150000
step_size: 50000
shift_size: null
min_pair_support: 2
min_pair_distance: 4000
max_pair_distance: 1000000
scan_target_intervals: True
bins_per_block: 8000
stream: True
min_refine_buffer: 2000
refine_buffer_frac_size: 5
refine_pair_dist_frac_size: 2
refine_bp_kernels: [0, 50, 500]
refine_min_support: 2
heatmap_dim: 1000
image_dim: 256
class_set: "BASIC5ZYG"
num_keypoints: 1
bbox_padding: 0
logging_level: "INFO" # verbosity level (set to "ERROR" to reduce logging volume)
18 changes: 4 additions & 14 deletions config/call_model.yaml
@@ -1,18 +1,8 @@
#### REQUIRED ####
model_path: "/path/to/model" # path to the pretrained Cue model
#### OPTIONAL ####
gpu_ids: [] # list of GPU ids to use for calling -- a CPU will be used if empty
gpu_ids: [] # list of GPU ids to use for calling (default: CPU(s) will be used if empty)
n_jobs_per_gpu: 1 # how many parallel jobs to launch on the same GPU
report_interval: 10 # frequency (in number of batches) for reporting training stats and image predictions
pretrained_refinenn_path: null # path to the pretrained keypoint refinement model
logging_level: "INFO" # verbosity level (set to "ERROR" to reduce logging volume)
#### FIXED (do not modify) ####
image_dim: 256
class_set: "BASIC5ZYG"
signal_set: "SHORT"
num_keypoints: 1
model_architecture: "HG"
batch_size: 16
sigma: 10
stride: 4
heatmap_peak_threshold: 0.4
n_cpus: 1 # number of CPUs to use for calling if no GPUs are listed
report_interval: 100 # frequency (in number of batches) for reporting image predictions
batch_size: 16 # number of images per batch
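
The interaction between `gpu_ids`, `n_jobs_per_gpu`, and `n_cpus` can be sketched as follows. This is only one plausible reading of the parameter comments (GPU workers when GPUs are listed, CPU workers otherwise); `plan_workers` is a hypothetical helper, and the `cuda:<id>` strings follow the usual PyTorch device naming rather than anything confirmed by this commit.

```python
def plan_workers(gpu_ids, n_jobs_per_gpu=1, n_cpus=1):
    """Return one device label per parallel calling job (illustrative only)."""
    if gpu_ids:
        # n_jobs_per_gpu jobs are launched on each listed GPU.
        return [f"cuda:{gpu_id}" for gpu_id in gpu_ids for _ in range(n_jobs_per_gpu)]
    # No GPUs listed: fall back to n_cpus CPU jobs.
    return ["cpu"] * n_cpus
```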
20 changes: 20 additions & 0 deletions config/custom/call_data.yaml
@@ -0,0 +1,20 @@
#### REQUIRED ####
bam: "/path/to/bam" # path to the alignments file (BAM/CRAM format)
fai: "/path/to/fai" # path to the reference FASTA FAI file
#### OPTIONAL ####
chr_names: null # list of chromosomes to process: null (all) or a specific list e.g. ["chr1", "chr21"]
logging_level: "INFO" # verbosity level (set to "ERROR" to reduce logging volume)
min_sv_len: 4000 # minimum length of SVs to output
min_qual_score: 50 # minimum SV confidence/quality score to output
blacklist_bed: null # blacklist intervals to filter out SVs
min_pair_support: 2 # minimum number of discordant read pairs to retain an interval pair in targeted mode
min_pair_distance: 4000 # minimum discordant read-pair distance
max_pair_distance: 1000000 # maximum discordant read-pair distance
refine_disable: False # disable SV refinement
bin_size: 750 # size of index bins (in bps)
interval_size: 150000 # size of genome intervals on each axis (in bps)
step_size: 50000 # sliding-window step size in interval generation
stream: True # set to True to enable streaming during targeted indexing (to reduce RAM requirements)
bins_per_block: 8000 # streaming block size
signal_vmax: {"RD": 600, "RD_LOW": 800, "RD_CLIPPED": 600, "SM": 200, "SR_RP": 600, "LR": 600, "LLRR": 100, "RL": 100, "LLRR_VS_LR": 1}
signal_mapq: {"RD": 20, "RD_LOW": 0, "RD_CLIPPED": 20, "SM": 20, "SR_RP": 0, "LR": 0, "LLRR": 1, "RL": 1, "LLRR_VS_LR": 1}
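
To make the relationship between `interval_size`, `step_size`, and `bin_size` concrete, here is a minimal sketch of sliding-window interval generation. It assumes windows start at position 0 and that the last windows are clipped at the chromosome end; how Cue actually handles chromosome boundaries is not shown in this commit, and both function names are hypothetical.

```python
def sliding_intervals(chr_len, interval_size=150000, step_size=50000):
    """Yield (start, end) windows of up to interval_size bps, one every step_size bps."""
    start = 0
    while start < chr_len:
        yield start, min(start + interval_size, chr_len)
        start += step_size


def bins_per_interval(interval_size=150000, bin_size=750):
    """Number of index bins covering one full interval."""
    return interval_size // bin_size
```

With the default values, each full interval spans 200 bins and consecutive intervals overlap by 100,000 bps.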
22 changes: 22 additions & 0 deletions config/custom/generate.yaml
@@ -0,0 +1,22 @@
#### REQUIRED ####
bam: "/path/to/bam" # path to the alignments BAM or CRAM file
bed: "/path/to/gt_bed_or_vcf" # path to the ground truth BED or VCF file
fai: "/path/to/fai" # path to the reference FASTA FAI file
#### OPTIONAL ####
n_cpus: 1 # number of CPUs (parallelized by chromosome)
chr_names: null # list of chromosomes to process: null (all) or a specific list e.g. ["chr1", "chr21"]
logging_level: "INFO" # verbosity level (set to "ERROR" to reduce logging volume)
store_img: True # store generated images and labels
allow_empty: False # set to True to include images that don't overlap any SVs
scan_target_intervals: False # set to True to keep only interval pairs with discordant read pairs
stream: False # set to True to enable streaming during targeted indexing (to reduce RAM requirements)
bins_per_block: 8000 # streaming block size
min_pair_support: 2 # minimum number of discordant read pairs to retain an interval pair in targeted mode
min_pair_distance: 4000 # minimum discordant read-pair distance
max_pair_distance: 1000000 # maximum discordant read-pair distance
shift_size: [0, 75000, 150000] # y-interval shifts (set to null for targeted interval pairs)
bin_size: 750 # size of index bins (in bps)
interval_size: 150000 # size of genome intervals on each axis (in bps)
step_size: 50000 # sliding-window step size in interval generation
signal_vmax: {"RD": 600, "RD_LOW": 800, "RD_CLIPPED": 600, "SM": 200, "SR_RP": 600, "LR": 600, "LLRR": 100, "RL": 100, "LLRR_VS_LR": 1}
signal_mapq: {"RD": 20, "RD_LOW": 0, "RD_CLIPPED": 20, "SM": 20, "SR_RP": 0, "LR": 0, "LLRR": 1, "RL": 1, "LLRR_VS_LR": 1}
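
The three `min_pair_*`/`max_pair_*` parameters can be read as a simple filter on interval pairs in targeted mode. The sketch below is an interpretation of the parameter comments, not Cue's code: `retain_interval_pair` is a hypothetical function, and treating the distance bounds as inclusive is an assumption.

```python
def retain_interval_pair(pair_distances, min_pair_support=2,
                         min_pair_distance=4000, max_pair_distance=1000000):
    """Keep an interval pair if enough discordant read pairs fall in the distance range."""
    supporting = [
        distance for distance in pair_distances
        if min_pair_distance <= distance <= max_pair_distance  # inclusive bounds (assumed)
    ]
    return len(supporting) >= min_pair_support
```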
18 changes: 18 additions & 0 deletions config/custom/training.yaml
@@ -0,0 +1,18 @@
#### REQUIRED ####
dataset_dirs: ["/path/to/imageset"]
num_epochs: 32 # number of epochs
#### OPTIONAL ####
batch_size: 16 # number of images per batch
gpu_ids: [] # id of the GPU to use for training
pretrained_model: null
logging_level: "INFO" # verbosity level (set to "ERROR" to reduce logging volume)
report_interval: 50 # frequency (in number of batches) for reporting training stats and predictions
model_checkpoint_interval: 10000 # how often to checkpoint the model as it trains
validation_ratio: 0.1 # fraction of the data to use for validation
plot_confidence_maps: False # output the predicted confidence maps
learning_rate: 0.0001
learning_rate_decay_interval: 5
learning_rate_decay_factor: 1
sigma: 10
stride: 4
heatmap_peak_threshold: 0.4
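
The `learning_rate*` parameters suggest a step-decay schedule. The sketch below assumes the rate is multiplied by `learning_rate_decay_factor` once every `learning_rate_decay_interval` epochs; that interpretation and the function name `learning_rate_at` are assumptions, since the commit does not show the training loop.

```python
def learning_rate_at(epoch, learning_rate=0.0001, decay_interval=5, decay_factor=1):
    """Step decay: scale the base rate by decay_factor after every decay_interval epochs."""
    return learning_rate * decay_factor ** (epoch // decay_interval)
```

Under this reading, the template value `learning_rate_decay_factor: 1` keeps the learning rate constant for the whole run.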
31 changes: 10 additions & 21 deletions config/generate.yaml
@@ -1,27 +1,16 @@
#### REQUIRED ####
bam: "/path/to/bam" # path to the alignments BAM or CRAM file
bed: "/path/to/gt_bed_or_vcf" # path to the ground truth BED or VCF file
fai: "/path/to/fai" # path to the referene FASTA FAI file
fai: "/path/to/fai" # path to the reference FASTA FAI file
#### OPTIONAL ####
n_cpus: 1 # number of CPUs (parallelized by chromosome)
chr_names: null # list of chromosomes to process: null (all) or a specific list e.g. ["chr1", "chr21"]
allow_empty: False
empty_annotation: False
#### FIXED ####
scan_target_intervals: False
stream: False
store_img: True
bam_type: "SHORT"
signal_set: "SHORT"
signal_set_origin: "SHORT"
class_set: "BASIC5ZYG"
signal_vmax: {"RD": 600, "RD_LOW": 800, "RD_CLIPPED": 600, "SM": 200, "SR_RP": 600, "LR": 600, "LLRR": 100, "RL": 100, "LLRR_VS_LR": 1}
signal_mapq: {"RD": 20, "RD_LOW": 0, "RD_CLIPPED": 20, "SM": 20, "SR_RP": 0, "LR": 0, "LLRR": 1, "RL": 1, "LLRR_VS_LR": 1}
bin_size: 750
interval_size: 150000
step_size: 50000
shift_size: [0, 75000, 150000]
heatmap_dim: 1000
image_dim: 256
num_keypoints: 1
bbox_padding: 0
logging_level: "INFO" # verbosity level (set to "ERROR" to reduce logging volume)
store_img: True # store generated images and labels
allow_empty: False # set to True to include images that don't overlap any SVs
scan_target_intervals: False # set to True to keep only interval pairs with discordant read pairs
stream: False # set to True to enable streaming during targeted indexing (to reduce RAM requirements)
min_pair_support: 2 # minimum number of discordant read pairs to retain an interval pair in targeted mode
min_pair_distance: 4000 # minimum discordant read-pair distance
max_pair_distance: 1000000 # maximum discordant read-pair distance
shift_size: [0, 75000, 150000] # y-interval shifts (set to null for targeted interval pairs)
29 changes: 7 additions & 22 deletions config/training.yaml
@@ -1,26 +1,11 @@
#### REQUIRED ####
dataset_dirs: ["/path/to/imageset"]
num_epochs: 32 # number of epochs
#### OPTIONAL ####
gpu_ids: []
batch_size: 16
num_epochs: 32
batch_size: 16 # number of images per batch
gpu_ids: [] # id of the GPU to use for training
pretrained_model: null
logging_level: "INFO"
report_interval: 50
model_checkpoint_interval: 10000
plot_confidence_maps: False
validation_ratio: 0.1
#### FIXED ####
n_jobs_per_gpu: 1
signal_set: "SHORT"
signal_set_origin: "SHORT"
class_set: "BASIC5ZYG"
image_dim: 256
num_keypoints: 1
model_architecture: "HG"
learning_rate: 0.0001
learning_rate_decay_interval: 5
learning_rate_decay_factor: 1
sigma: 10
stride: 4
heatmap_peak_threshold: 0.4
logging_level: "INFO" # verbosity level (set to "ERROR" to reduce logging volume)
report_interval: 50 # frequency (in number of batches) for reporting training stats and predictions
model_checkpoint_interval: 10000 # how often to checkpoint the model as it trains
validation_ratio: 0.1 # fraction of the data to use for validation
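
As an illustration of `validation_ratio`, the following sketch holds out a fraction of a dataset for validation. The shuffling strategy, the fixed seed, and the name `split_dataset` are assumptions made for the example; the commit does not show how Cue performs the split.

```python
import random


def split_dataset(items, validation_ratio=0.1, seed=0):
    """Shuffle items and hold out a validation_ratio fraction; returns (train, validation)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_validation = int(len(shuffled) * validation_ratio)
    return shuffled[n_validation:], shuffled[:n_validation]
```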
32 changes: 0 additions & 32 deletions data/demo/config/data.yaml
@@ -2,36 +2,4 @@
bam: "../data/demo/inputs/chr21.small.bam" # path to the alignments file (BAM or CRAM format)
fai: "../data/demo/inputs/GRCh38.fa.fai" # path to the reference FASTA FAI file
#### OPTIONAL ####
n_cpus: 1 # number of CPUs to use (parallelized by chromosome)
chr_names: ["chr21"] # list of chromosomes to process: null (all) or a specific list e.g. ["chr1", "chr21"]
logging_level: "INFO" # verbosity level (set to "ERROR" to reduce logging volume)
min_qual_score: 50 # minimum SV confidence/quality score to output
#### FIXED (do not modify) ####
bam_type: "SHORT"
min_sv_len: 4000
signal_set: "SHORT"
signal_set_origin: "SHORT"
signal_vmax: {"RD": 600, "RD_LOW": 800, "RD_CLIPPED": 600, "SM": 200, "SR_RP": 600, "LR": 600, "LLRR": 100, "RL": 100, "LLRR_VS_LR": 1}
signal_mapq: {"RD": 20, "RD_LOW": 0, "RD_CLIPPED": 20, "SM": 20, "SR_RP": 0, "LR": 0, "LLRR": 1, "RL": 1, "LLRR_VS_LR": 1}
blacklist_bed: null
bed: null
bin_size: 750
interval_size: 150000
step_size: 50000
shift_size: null
min_pair_support: 2
min_pair_distance: 4000
max_pair_distance: 1000000
scan_target_intervals: True
bins_per_block: 8000
stream: True
min_refine_buffer: 2000
refine_buffer_frac_size: 5
refine_pair_dist_frac_size: 2
refine_bp_kernels: [0, 50, 500]
refine_min_support: 2
heatmap_dim: 1000
image_dim: 256
class_set: "BASIC5ZYG"
num_keypoints: 1
bbox_padding: 0
17 changes: 1 addition & 16 deletions data/demo/config/model.yaml
@@ -1,18 +1,3 @@
#### REQUIRED ####
model_path: "../data/demo/models/cue.pt" # path to the pretrained Cue model
#### OPTIONAL ####
gpu_ids: [] # list of GPU ids to use for calling -- a CPU will be used if empty
n_jobs_per_gpu: 1 # how many parallel jobs to launch on the same GPU
report_interval: 5 # frequency (in number of batches) for reporting training stats and image predictions
pretrained_refinenn_path: null # path to the pretrained keypoint refinement model
logging_level: "INFO" # verbosity level (set to "ERROR" to reduce logging volume)
#### FIXED (do not modify) ####
image_dim: 256
class_set: "BASIC5ZYG"
signal_set: "SHORT"
num_keypoints: 1
model_architecture: "HG"
batch_size: 16
sigma: 10
stride: 4
heatmap_peak_threshold: 0.4
n_cpus: 1
18 changes: 0 additions & 18 deletions data/demo/config/view.yaml
@@ -5,21 +5,3 @@ bed: "../data/demo/results/reports/svs.vcf" # SVs to visualize (BED or VCF)
#### OPTIONAL ####
n_cpus: 1 # number of CPUs to use (parallelized by chromosome)
chr_names: ["chr21"] # list of chromosomes to process: null (all) or a specific list e.g. ["chr1", "chr21"]
logging_level: "INFO" # verbosity level (set to "ERROR" to reduce logging volume)
#### FIXED (do not modify) ####
scan_target_intervals: False
stream: False
bam_type: "SHORT"
signal_set: "SHORT"
signal_set_origin: "SHORT"
signal_vmax: {"RD": 600, "RD_LOW": 800, "RD_CLIPPED": 600, "SM": 200, "SR_RP": 600, "LR": 600, "LLRR": 100, "RL": 100, "LLRR_VS_LR": 1}
signal_mapq: {"RD": 20, "RD_LOW": 0, "RD_CLIPPED": 20, "SM": 20, "SR_RP": 0, "LR": 0, "LLRR": 1, "RL": 1, "LLRR_VS_LR": 1}
bin_size: 750
interval_size: 150000
step_size: null
shift_size: null
heatmap_dim: 1000
image_dim: 256
class_set: "BASIC5ZYG"
num_keypoints: 1
bbox_padding: 0
1 change: 1 addition & 0 deletions engine/__init__.py
@@ -0,0 +1 @@
__version__ = "v0.2.2"