Update README.md
Added description of 'Source Data.xlsx' and 'Results.zip'
TimD1 authored Nov 2, 2023
1 parent 4309dd6 commit a36cd67
Showing 1 changed file with 28 additions and 1 deletion.
29 changes: 28 additions & 1 deletion data/README.md
@@ -1,5 +1,5 @@
### Data
This directory includes the following subdirectories containing the inputs used to evaluate VCFdist. If you would like to reproduce our analyses using our scripts from the `pipeline` directory, please create the directory structure shown below, using publicly available data.
This directory includes the following subdirectories containing the inputs used to evaluate VCFdist, as well as our output summary files. If you would like to reproduce our analyses using our scripts from the `pipeline` directory, please create the directory structure shown below, using publicly available data.

```
pfda-v2/
@@ -46,3 +46,30 @@ refs/
GCA_000001405.15_GRCh38_no_alt_analysis_set.fasta [1]
```
1. [GRCh38 Reference](https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/references/GRCh38)

#### Outputs
##### 'Results.zip'
This ZIP file contains the primary comparison results from both vcfeval (baseline) and vcfdist (our work) on the NIST (whole genome) and CMRG (challenging medically-relevant genes) datasets. This includes all 5 variant representations (the original and A, B, C, D in Figure 2) for all 64 pFDA submission VCFs. The data organization is shown below, where `dataset = (cmrg, nist)` and `rep = (O, A, B, C, D)`. The 64 `submission_id`s are defined by the source pFDA dataset.

```
source_data/
    {dataset}_vcfeval/
        {submission_id}_HG002_{rep}.summary.csv
        {submission_id}_HG002_{rep}.roc.all.csv
    {dataset}_vcfdist/
        {submission_id}_HG002_{rep}.precision-recall.tsv
        {submission_id}_HG002_{rep}.precision-recall-summary.tsv
        {submission_id}_HG002_{rep}.distance.tsv
        {submission_id}_HG002_{rep}.distance-summary.tsv
```

In total, there are 5 * 2 * 64 = 640 versions of each TSV and CSV listed above.
With these files in the directory structure shown above, Figures 3-6 in our manuscript can be regenerated using the following scripts in the `analysis/` directory of the GitHub repository.
- Figure 3: `7_vcfeval_pr_plot.py`
- Figure 4: `8_vcfdist_pr_plot.py`
- Figure 5: `4_vcfdist_output.py`
- Figure 6: `9_f1_pr_plot.py`, `9_f1_ed_plot.py`
- Supplementary Figure 5: `9_f1_pr_plot.py`, `9_f1_ed_plot.py`
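
Before running the plotting scripts, it may help to confirm that the unzipped results match the layout described above. The following is a minimal sketch, not part of the repository: the helper names `expected_files` and `missing_files` are hypothetical, and the submission IDs must be supplied by the caller, since they are defined by the source pFDA dataset.

```python
from itertools import product
from pathlib import Path

# Values taken from the description above.
DATASETS = ("cmrg", "nist")
REPS = ("O", "A", "B", "C", "D")

# File suffixes per tool, as listed in the directory layout.
SUFFIXES = {
    "vcfeval": ("summary.csv", "roc.all.csv"),
    "vcfdist": (
        "precision-recall.tsv",
        "precision-recall-summary.tsv",
        "distance.tsv",
        "distance-summary.tsv",
    ),
}


def expected_files(root, submission_ids):
    """Return every path implied by the layout, located under `root`."""
    root = Path(root)
    paths = []
    for dataset, rep, sub_id in product(DATASETS, REPS, submission_ids):
        for tool, suffixes in SUFFIXES.items():
            for suffix in suffixes:
                name = f"{sub_id}_HG002_{rep}.{suffix}"
                paths.append(root / f"{dataset}_{tool}" / name)
    return paths


def missing_files(root, submission_ids):
    """Return the expected paths that do not exist on disk."""
    return [p for p in expected_files(root, submission_ids) if not p.is_file()]
```

With all 64 submission IDs, `expected_files` yields 640 paths for each of the six file types. An empty list from `missing_files("source_data", ids)` would suggest the directory structure is complete.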

##### 'Source Data.xlsx'
This file is an Excel document of 16 sheets in total, with each sheet containing the raw data for one subfigure plot (Figures 3a, 3b, 3c, 4a, 4b, 4c, 4d, 5bi, 5bii, 5biii, 6a, 6b, 6c, and Supplementary Figures 5a, 5b, 5c). Each sheet contains a table listing the evaluation dataset, variant type and representation, submission ID, and evaluation metrics (precision, recall, edit distance, distinct edits, or F1 score) for each data point in the corresponding plot.
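
As a rough sketch of how the workbook could be read programmatically, the snippet below loads every sheet with pandas and checks the sheet count against the 16 subfigures listed above. It assumes pandas and an Excel engine such as openpyxl are installed. The function names and the labels in `EXPECTED_SUBFIGURES` are hypothetical, and the actual sheet names in the file may differ, so only the count is checked.

```python
# Subfigure labels from the description above; "S5*" is our own shorthand
# for the Supplementary Figure 5 panels, not necessarily the sheet names.
EXPECTED_SUBFIGURES = (
    ["3a", "3b", "3c"]
    + ["4a", "4b", "4c", "4d"]
    + ["5bi", "5bii", "5biii"]
    + ["6a", "6b", "6c"]
    + ["S5a", "S5b", "S5c"]
)


def load_source_data(path):
    """Read all sheets into a dict mapping sheet name to DataFrame."""
    # Imported here so the count check below works without pandas installed.
    import pandas as pd

    # sheet_name=None asks pandas to load every sheet in the workbook.
    return pd.read_excel(path, sheet_name=None)


def check_sheet_count(sheets, expected=len(EXPECTED_SUBFIGURES)):
    """Raise ValueError if the number of sheets is not the expected one."""
    if len(sheets) != expected:
        raise ValueError(f"expected {expected} sheets, found {len(sheets)}")
    return True
```

For example, `check_sheet_count(load_source_data("Source Data.xlsx"))` should succeed if the workbook holds one sheet per subfigure.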
