
Commit 8217916

retitle plots created by 2vcf and clean up READMEs
1 parent: af46e6a

3 files changed: +7 -7 lines

README.md

Lines changed: 1 addition & 1 deletion
@@ -59,7 +59,7 @@ By default, the pipeline will automatically delete some files it deems unnecessa
 # files and directories
 
 ### [Snakefile](Snakefile)
-A [Snakemake](https://snakemake.readthedocs.io/en/stable/) pipeline for calling variants from a set of ATAC-seq reads. This pipeline is made up of two subworkflows:
+A [Snakemake](https://snakemake.readthedocs.io/en/stable/) pipeline for calling variants from a set of ATAC-seq reads. This pipeline automatically executes two subworkflows:
 
 1. the [`prepare` subworkflow](rules/prepare.smk), which prepares the reads for classification and
 2. the [`classify` subworkflow](rules/classify.smk), which creates a VCF containing predicted variants
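
For reference, a minimal sketch of launching that master pipeline from the repository root (this invocation is an assumption; adjust the flags and config to your setup):

    snakemake --use-conda -j    # runs the prepare and classify subworkflows end to end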

rules/README.md

Lines changed: 5 additions & 5 deletions
@@ -6,7 +6,7 @@ The `prepare` subworkflow can use FASTQ or BAM/BED files as input. The `classify
 If a pre-trained model is available (orange), the two subworkflows can be executed together automatically via the master pipeline. However the subworkflows must be executed separately for training and testing (see [below](#training-and-testing-varca)).
 
 ## The `prepare` subworkflow
-The [`prepare` subworkflow](prepare.smk) is a [Snakemake](https://snakemake.readthedocs.io/en/stable/) pipeline for preparing data for the classifier. It generates a tab-delimited table containing variant caller output for every site in open chromatin regions of the genome. The `prepare` subworkflow uses the scripts in the [callers directory](callers) to run every variant caller in the ensemble.
+The [`prepare` subworkflow](prepare.smk) is a [Snakemake](https://snakemake.readthedocs.io/en/stable/) workflow for preparing data for the classifier. It generates a tab-delimited table containing variant caller output for every site in open chromatin regions of the genome. The `prepare` subworkflow uses the scripts in the [callers directory](callers) to run every variant caller in the ensemble.
 
 ### execution
 The `prepare` subworkflow is included within the [master pipeline](/Snakefile) automatically. However, you can also execute the `prepare` subworkflow on its own, as a separate Snakefile.
@@ -18,15 +18,15 @@ Then, just call Snakemake with `-s rules/prepare.smk`:
 snakemake -s rules/prepare.smk --use-conda -j
 
 ### output
-The primary outputs of the `prepare` pipeline will be in `<output_directory>/merged_<variant_type>/<sample_ID>/final.tsv.gz`. However, several intermediary directories and files are also generated:
+The primary outputs of the `prepare` subworkflow will be in `<output_directory>/merged_<variant_type>/<sample_ID>/final.tsv.gz`. However, several intermediary directories and files are also generated:
 
 - `align/` - output from the BWA FASTQ alignment step and samtools PCR duplicate removal steps
 - `peaks/` - output from MACS 2 and other files required for calling peaks
 - `callers/` - output from each [caller script](/callers) in the ensemble (see the [callers README](/callers/README.md) for more information) and the variant normalization and feature extraction steps
 - `merged_<variant_type>/` - all other output in the `prepare` subworkflow, including the merged and final datasets for each variant type (ie SNV or indels)
 
 ## The `classify` subworkflow
-The [`classify` subworkflow](classify.smk) is a [Snakemake](https://snakemake.readthedocs.io/en/stable/) pipeline for training and testing the classifier. It uses the TSV output from the `prepare` subworkflow. Its final output is a VCF containing predicted variants.
+The [`classify` subworkflow](classify.smk) is a [Snakemake](https://snakemake.readthedocs.io/en/stable/) workflow for training and testing the classifier. It uses the TSV output from the `prepare` subworkflow. Its final output is a VCF containing predicted variants.
 
 ### execution
 The `classify` subworkflow is included within the [master pipeline](/Snakefile) automatically. However, you can also execute the `classify` subworkflow on its own, as a separate Snakefile.
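
For reference, a sketch of invoking each subworkflow separately, as the execution sections above describe (assuming the config files are already filled out; the classify invocation simply mirrors the prepare one and is an assumption):

    snakemake -s rules/prepare.smk --use-conda -j -n     # dry run; drop -n to produce <output_directory>/merged_<variant_type>/<sample_ID>/final.tsv.gz
    snakemake -s rules/classify.smk --use-conda -j -n    # dry run; the classify subworkflow's final output is the VCF of predicted variants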
@@ -83,7 +83,7 @@ You may want to test a trained model if:
 3. You'd like to reproduce our results (since the training and testing steps are usually skipped by the master pipeline)
 
 ## Creating your own trained model
-For the sake of this example, let's say you'd like to include a new indel variant caller (ie #3 above). You've also already followed the directions in the [callers README](/callers/README.md) to create your own caller script, and you've modified the `prepare.yaml` and `callers.yaml` config files to include your new indel caller. However, before you can predict variants using the indel caller, you must create a new trained classification model that knows how to interpret your new input.
+For the sake of this example, let's say you'd like to include a new indel variant caller (ie [#3 above](#training)). You've also already followed the directions in the [callers README](/callers/README.md) to create your own caller script, and you've modified the `prepare.yaml` and `callers.yaml` config files to include your new indel caller. However, before you can predict variants using the indel caller, you must create a new trained classification model that knows how to interpret your new input.
 
 To do this, we recommend downloading the truth set we used to create our model. First, download the [GM12878 FASTQ files from Buenrostro et al](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE47753). Specify the path to these files in `data/samples.tsv`, the samples file within the example data. Then, download the corresponding [Platinum Genomes VCF for that sample](https://www.illumina.com/platinumgenomes.html).
 
@@ -93,7 +93,7 @@ After the `prepare` subworkflow has finished running, add the sample (specficall
 
 The `classify` subworkflow can only create one trained model at a time, so you will need to repeat these steps if you'd also like to create a trained model for SNVs. Just replace every mention of "indel" in `classify.yaml` with "snp". Also remember to use only the SNV callers (ie GATK, VarScan 2, and VarDict).
 
-## Testing your model / Reproducing our Results
+## Testing your model / Reproducing our results
 For this example, we will demonstrate how you can reproduce the results in our paper using the `indel.tsv.gz` truth dataset we provided in the example data. This data was generated by running the `prepare` subworkflow on the GM12878 data as described [above](#creating-your-own-trained-model). If you [ran the `prepare` subworkflow to create your own trained model](#creating-your-own-trained-model), just use your truth dataset instead of the one we provided in the example data.
 
 First, split the truth dataset by chromosome parity using `awk` commands like this:
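
A minimal sketch of such a split (not necessarily the project's exact commands), assuming a tab-separated table whose first column is the chromosome, with or without a `chr` prefix, and a single header line to keep in both halves; sex chromosomes are left out here, and the output filenames are illustrative:

    zcat indel.tsv.gz | awk -F'\t' 'NR==1 {print; next} {c=$1; sub(/^chr/, "", c); if (c ~ /^[0-9]+$/ && c % 2 == 1) print}' | gzip > indel.odd.tsv.gz
    zcat indel.tsv.gz | awk -F'\t' 'NR==1 {print; next} {c=$1; sub(/^chr/, "", c); if (c ~ /^[0-9]+$/ && c % 2 == 0) print}' | gzip > indel.even.tsv.gz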

scripts/2vcf.py

Lines changed: 1 addition & 1 deletion
@@ -41,7 +41,7 @@ def plot_line(lst, show_discards=False):
         plt.xlabel("Reverse Arcsin of RF Probability")
     else:
         plt.xlabel("Phred-Scaled RF Probability")
-    plt.ylabel("Phred-Scaled Accuracy (QUAL)")
+    plt.ylabel("Phred-Scaled Precision (QUAL)")
     plt.plot(
         roc[0],
         p(roc[0]),
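
The retitled y-axis reports precision on the Phred scale. As a reminder of what that scale means (a generic illustration of Phred scaling, not code taken from 2vcf.py): Phred-scaling an error probability p gives -10 * log10(p), so treating 1 - precision as the error rate, a precision of 0.99 corresponds to a QUAL of roughly 20:

    python3 -c 'import math; prec = 0.99; print(-10 * math.log10(1 - prec))'   # prints ~20.0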
