diff --git a/data/README.md b/data/README.md
index 6d208bc..93d5f74 100644
--- a/data/README.md
+++ b/data/README.md
@@ -1,5 +1,5 @@
 ### Data
-This directory includes the following subdirectories containing the inputs used to evaluate VCFdist. If you would like to reproduce our analyses using our scripts from the `pipeline` directory, please create the directory structure shown below, using publicly available data.
+This directory includes the following subdirectories containing the inputs used to evaluate VCFdist, as well as our output summary files. If you would like to reproduce our analyses using our scripts from the `pipeline` directory, please create the directory structure shown below, using publicly available data.
 
 ```
 pfda-v2/
@@ -46,3 +46,30 @@ refs/
     GCA_000001405.15_GRCh38_no_alt_analysis_set.fasta [1]
 ```
 1. [GRCh38 Reference](https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/references/GRCh38)
+
+#### Outputs
+##### 'Results.zip'
+This ZIP file contains the primary comparison results from both vcfeval (baseline) and vcfdist (our work) on the NIST (whole genome) and CMRG (challenging medically-relevant genes) datasets. This includes all 5 variant representations (the original and A, B, C, D in Figure 2) for all 64 pFDA submission VCFs. The data organization is shown below, where `dataset = (cmrg, nist)` and `rep = (O, A, B, C, D)`. The 64 `submission_id`s are defined by the source pFDA dataset.
+
+```
+source_data/
+    {dataset}_vcfeval/
+        {submission_id}_HG002_{rep}.summary.csv
+        {submission_id}_HG002_{rep}.roc.all.csv
+    {dataset}_vcfdist/
+        {submission_id}_HG002_{rep}.precision-recall.tsv
+        {submission_id}_HG002_{rep}.precision-recall-summary.tsv
+        {submission_id}_HG002_{rep}.distance.tsv
+        {submission_id}_HG002_{rep}.distance-summary.tsv
+```
+
+In total, there are 5 * 2 * 64 = 640 versions of each TSV and CSV listed above.
+With these files in the directory structure shown above, Figures 3-6 in our manuscript can be regenerated by the following scripts in the `analysis/` directory of the GitHub repository.
+- Figure 3: `7_vcfeval_pr_plot.py`
+- Figure 4: `8_vcfdist_pr_plot.py`
+- Figure 5: `4_vcfdist_output.py`
+- Figure 6: `9_f1_pr_plot.py`, `9_f1_ed_plot.py`
+- Supplementary Figure 5: `9_f1_pr_plot.py`, `9_f1_ed_plot.py`
+
+##### 'Source Data.xlsx'
+This file is an Excel document of 16 sheets in total, with each sheet containing the raw data for each subfigure plot (Figures 3a, 3b, 3c, 4a, 4b, 4c, 4d, 5bi, 5bii, 5biii, 6a, 6b, 6c, and Supplementary Figures 5a, 5b, 5c). Each sheet contains a table listing the evaluation dataset, variant type and representation, submission ID, and evaluation metrics (precision, recall, edit distance, distinct edits, or F1 score) for each data point in the corribuonding plot.
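
Reviewer note: the `Results.zip` layout described in the added README text can be checked mechanically before running the plotting scripts. The sketch below is only an illustration, not part of the repository. It enumerates the file paths implied by the README template and confirms the stated count of 5 * 2 * 64 = 640 versions per file type. The function name `expected_paths` and the placeholder IDs `SUB00` to `SUB63` are hypothetical; the real `submission_id` values come from the pFDA dataset and are not listed in this diff.

```python
from itertools import product

# Values taken from the README text in this diff.
DATASETS = ("cmrg", "nist")
REPS = ("O", "A", "B", "C", "D")

# File suffixes per tool directory, as listed in the README tree.
SUFFIXES = {
    "vcfeval": ("summary.csv", "roc.all.csv"),
    "vcfdist": (
        "precision-recall.tsv",
        "precision-recall-summary.tsv",
        "distance.tsv",
        "distance-summary.tsv",
    ),
}


def expected_paths(submission_ids, root="source_data"):
    """Return every path implied by the README layout (hypothetical helper)."""
    paths = []
    for dataset, sub_id, rep in product(DATASETS, submission_ids, REPS):
        for tool, suffixes in SUFFIXES.items():
            for suffix in suffixes:
                paths.append(
                    f"{root}/{dataset}_{tool}/{sub_id}_HG002_{rep}.{suffix}"
                )
    return paths


# ASSUMPTION: placeholder IDs only; substitute the 64 real pFDA submission IDs.
placeholder_ids = [f"SUB{i:02d}" for i in range(64)]
paths = expected_paths(placeholder_ids)

# Each of the 6 file types should appear 2 * 64 * 5 = 640 times.
per_type = len(paths) // 6
print(per_type, len(paths))
```

With the real submission IDs substituted, filtering `paths` through `os.path.exists` would list any files missing from an extracted copy of the archive.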