`README.md` (1 addition, 1 deletion)
@@ -59,7 +59,7 @@ By default, the pipeline will automatically delete some files it deems unnecessa
# files and directories

### [Snakefile](Snakefile)
-A [Snakemake](https://snakemake.readthedocs.io/en/stable/) pipeline for calling variants from a set of ATAC-seq reads. This pipeline is made up of two subworkflows:
+A [Snakemake](https://snakemake.readthedocs.io/en/stable/) pipeline for calling variants from a set of ATAC-seq reads. This pipeline automatically executes two subworkflows:
1. the [`prepare` subworkflow](rules/prepare.smk), which prepares the reads for classification and
2. the [`classify` subworkflow](rules/classify.smk), which creates a VCF containing predicted variants
`rules/README.md` (5 additions, 5 deletions)
@@ -6,7 +6,7 @@ The `prepare` subworkflow can use FASTQ or BAM/BED files as input. The `classify
If a pre-trained model is available (orange), the two subworkflows can be executed together automatically via the master pipeline. However, the subworkflows must be executed separately for training and testing (see [below](#training-and-testing-varca)).
## The `prepare` subworkflow
-The [`prepare` subworkflow](prepare.smk) is a [Snakemake](https://snakemake.readthedocs.io/en/stable/)pipeline for preparing data for the classifier. It generates a tab-delimited table containing variant caller output for every site in open chromatin regions of the genome. The `prepare` subworkflow uses the scripts in the [callers directory](callers) to run every variant caller in the ensemble.
+The [`prepare` subworkflow](prepare.smk) is a [Snakemake](https://snakemake.readthedocs.io/en/stable/) workflow for preparing data for the classifier. It generates a tab-delimited table containing variant caller output for every site in open chromatin regions of the genome. The `prepare` subworkflow uses the scripts in the [callers directory](callers) to run every variant caller in the ensemble.
### execution
The `prepare` subworkflow is included within the [master pipeline](/Snakefile) automatically. However, you can also execute the `prepare` subworkflow on its own, as a separate Snakefile.
@@ -18,15 +18,15 @@ Then, just call Snakemake with `-s rules/prepare.smk`:
snakemake -s rules/prepare.smk --use-conda -j
### output
-The primary outputs of the `prepare`pipeline will be in `<output_directory>/merged_<variant_type>/<sample_ID>/final.tsv.gz`. However, several intermediary directories and files are also generated:
+The primary outputs of the `prepare` subworkflow will be in `<output_directory>/merged_<variant_type>/<sample_ID>/final.tsv.gz`. However, several intermediary directories and files are also generated:
- `align/` - output from the BWA FASTQ alignment step and samtools PCR duplicate removal steps
- `peaks/` - output from MACS 2 and other files required for calling peaks
- `callers/` - output from each [caller script](/callers) in the ensemble (see the [callers README](/callers/README.md) for more information) and the variant normalization and feature extraction steps
- `merged_<variant_type>/` - all other output in the `prepare` subworkflow, including the merged and final datasets for each variant type (ie SNV or indels)
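As a sanity check of that path pattern, the final table can be inspected with ordinary shell tools. In this sketch, the output directory `out`, the sample ID `NA12878`, and the header columns are all invented stand-ins for your configured values; only the directory layout mirrors the description above.

```shell
# Hypothetical sketch: "out" and "NA12878" stand in for your configured
# output directory and sample ID; the header columns are invented here.
mkdir -p out/merged_indel/NA12878
printf 'CHROM\tPOS\tREF\tALT\n' | gzip > out/merged_indel/NA12878/final.tsv.gz

# Peek at the table's header row without decompressing it to disk.
gzip -dc out/merged_indel/NA12878/final.tsv.gz | head -n 1
```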
## The `classify` subworkflow
-The [`classify` subworkflow](classify.smk) is a [Snakemake](https://snakemake.readthedocs.io/en/stable/)pipeline for training and testing the classifier. It uses the TSV output from the `prepare` subworkflow. Its final output is a VCF containing predicted variants.
+The [`classify` subworkflow](classify.smk) is a [Snakemake](https://snakemake.readthedocs.io/en/stable/) workflow for training and testing the classifier. It uses the TSV output from the `prepare` subworkflow. Its final output is a VCF containing predicted variants.
### execution
The `classify` subworkflow is included within the [master pipeline](/Snakefile) automatically. However, you can also execute the `classify` subworkflow on its own, as a separate Snakefile.
@@ -83,7 +83,7 @@ You may want to test a trained model if:
3. You'd like to reproduce our results (since the training and testing steps are usually skipped by the master pipeline)
## Creating your own trained model
-For the sake of this example, let's say you'd like to include a new indel variant caller (ie #3 above). You've also already followed the directions in the [callers README](/callers/README.md) to create your own caller script, and you've modified the `prepare.yaml` and `callers.yaml` config files to include your new indel caller. However, before you can predict variants using the indel caller, you must create a new trained classification model that knows how to interpret your new input.
+For the sake of this example, let's say you'd like to include a new indel variant caller (ie [#3 above](#training)). You've also already followed the directions in the [callers README](/callers/README.md) to create your own caller script, and you've modified the `prepare.yaml` and `callers.yaml` config files to include your new indel caller. However, before you can predict variants using the indel caller, you must create a new trained classification model that knows how to interpret your new input.
To do this, we recommend downloading the truth set we used to create our model. First, download the [GM12878 FASTQ files from Buenrostro et al](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE47753). Specify the path to these files in `data/samples.tsv`, the samples file within the example data. Then, download the corresponding [Platinum Genomes VCF for that sample](https://www.illumina.com/platinumgenomes.html).
@@ -93,7 +93,7 @@ After the `prepare` subworkflow has finished running, add the sample (specficall
The `classify` subworkflow can only create one trained model at a time, so you will need to repeat these steps if you'd also like to create a trained model for SNVs. Just replace every mention of "indel" in `classify.yaml` with "snp". Also remember to use only the SNV callers (ie GATK, VarScan 2, and VarDict).
-## Testing your model / Reproducing our Results
+## Testing your model / Reproducing our results
For this example, we will demonstrate how you can reproduce the results in our paper using the `indel.tsv.gz` truth dataset we provided in the example data. This data was generated by running the `prepare` subworkflow on the GM12878 data as described [above](#creating-your-own-trained-model). If you [ran the `prepare` subworkflow to create your own trained model](#creating-your-own-trained-model), just use your truth dataset instead of the one we provided in the example data.
First, split the truth dataset by chromosome parity using `awk` commands like this:
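The exact `awk` commands were not captured here. The following is a hedged sketch of what such a parity split might look like: the file name `truth.tsv` and its rows are invented demo data, and it assumes the chromosome name sits in the first column with numeric contig names (non-numeric contigs like X, Y, and MT are skipped).

```shell
# Hypothetical sketch of a split by chromosome parity; "truth.tsv" and
# its rows are invented demo data. Assumes the chromosome is column 1.
printf 'CHROM\tPOS\nchr1\t100\nchr2\t200\nchr3\t300\n' > truth.tsv

awk -F'\t' '
    NR == 1 { print > "odd.tsv"; print > "even.tsv"; next }  # header to both
    {
        c = $1; sub(/^chr/, "", c)          # strip any "chr" prefix
        if (c ~ /^[0-9]+$/)                 # skip X, Y, MT, etc.
            print > ((c % 2) ? "odd.tsv" : "even.tsv")
    }' truth.tsv
```

In real use, you would first decompress the provided `indel.tsv.gz` (for example with `gzip -dc`) and run the same split on the resulting table, keeping the odd and even halves for the two runs.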