Major overhaul with many minor changes
 - updated `superclusters.tsv` and `phase-blocks.tsv` outputs, which now
   include a SUPERCLUSTER and a PHASE_BLOCK id column, respectively
 - renamed the `phase-blocks.tsv` PHASE column to BLOCK_STATE
 - added more warnings for potential edit distance errors
 - added optional `print` argument to `wf_ed()`
 - fixed edge case of `wf_ed()` causing an error
 - added `max_reach_size` global variable to limit supercluster size
   explosion from large INDELs; the reach of a larger variant is now
   computed as if the variant had size `max_reach_size`
 - removed obsolete query and truth specific alignment parameter options
 - collapsed and re-organized command-line argument printing
 - renamed `summary.vcf` tags `PS` to `BS` and `PF` to `FE`
 - phase and flip errors are now printed per-contig
 - phase blocks are now correctly printed if there is only one
 - if there are two INSertions at the same location, one is filtered now
 - added warning for when the ratio of heterozygous variants on each
   haplotype is too far off
 - added Dockerfile
TimD1 committed Oct 18, 2023
1 parent 88f9d19 commit 15e00e6
Showing 14 changed files with 376 additions and 436 deletions.
2 changes: 1 addition & 1 deletion README.md
@@ -70,7 +70,7 @@ If you do already have HTSlib installed elsewhere, make sure you've added it to
> cd vcfdist/src
> make
> ./vcfdist --version
-vcfdist v2.0.3
+vcfdist v2.1.0
```

### Option 2: Docker Image
7 changes: 4 additions & 3 deletions docs/outputs.md
@@ -37,13 +37,13 @@ This file reports for each edit (where called query sequence differs from truth
#### phase-blocks.tsv
Reports the size, location, and composition of each phase block.

-| CONTIG | START | STOP | SIZE | SUPERCLUSTERS | PHASE |
+| CONTIG | PHASE_BLOCK | START | STOP | SIZE | SUPERCLUSTERS | BLOCK_STATE |
|-|-|-|-|-|-|

#### superclusters.tsv
Reports the size, location, and composition of each supercluster.

-| CONTIG | START | STOP | SIZE | QUERY1_VARS | QUERY2_VARS | TRUTH1_VARS | TRUTH2_VARS | ORIG_ED | SWAP_ED | PHASE | PHASE_BLOCK |
+| CONTIG | SUPERCLUSTER | START | STOP | SIZE | QUERY1_VARS | QUERY2_VARS | TRUTH1_VARS | TRUTH2_VARS | ORIG_ED | SWAP_ED | PHASE | PHASE_BLOCK |
|-|-|-|-|-|-|-|-|-|-|-|-|
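
Because `PHASE_BLOCK` is documented as an index within each contig, a downstream script that groups superclusters by phase block needs to key on both `CONTIG` and `PHASE_BLOCK`. The following is a minimal Python sketch of that grouping. It assumes the file is tab-separated with a single header row using the column names listed above; the sample rows are invented for illustration and the helper name `superclusters_per_phase_block` is hypothetical.

```python
import csv
import io
from collections import defaultdict

# Invented sample data; column names follow the superclusters.tsv header
# listed in the docs. Real files may differ in details (assumption).
SAMPLE = (
    "CONTIG\tSUPERCLUSTER\tSTART\tSTOP\tSIZE\tQUERY1_VARS\tQUERY2_VARS\t"
    "TRUTH1_VARS\tTRUTH2_VARS\tORIG_ED\tSWAP_ED\tPHASE\tPHASE_BLOCK\n"
    "chr1\t0\t100\t180\t80\t1\t0\t1\t0\t0\t2\t=\t0\n"
    "chr1\t1\t900\t990\t90\t0\t2\t2\t0\t4\t0\tX\t1\n"
    "chr2\t0\t50\t120\t70\t1\t1\t1\t1\t0\t0\t?\t0\n"
)


def superclusters_per_phase_block(handle):
    """Count superclusters for each (CONTIG, PHASE_BLOCK) pair."""
    counts = defaultdict(int)
    for row in csv.DictReader(handle, delimiter="\t"):
        # PHASE_BLOCK is indexed within a contig, so the same index can
        # appear on several contigs; CONTIG must be part of the key.
        counts[(row["CONTIG"], int(row["PHASE_BLOCK"]))] += 1
    return dict(counts)


counts = superclusters_per_phase_block(io.StringIO(SAMPLE))
print(counts)
```

Keying on `PHASE_BLOCK` alone would merge block 0 of `chr1` with block 0 of `chr2` in this sample.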

#### query.tsv, truth.tsv
@@ -89,7 +89,8 @@ Output query and truth VCFs, standardized by vcfdist (at point C).
| (QUERY/TRUTH)(1/2)_VARS | integer | Total variants on a particular haplotype within this region. |
| (ORIG/SWAP)_ED | integer | Total edit distance (minimum) of supercluster for both possible phasings. |
| PHASE | char | Character representing phasing. (=/X/?) for same, swap, unknown |
-| PHASE_BLOCK | integer | 0-based index of current phase block. |
+| PHASE_BLOCK | integer | 0-based index of current phase block within contig. |
+| BLOCK_STATE | integer | Current phase state for truth to query haplotype mapping (0 = T1Q1:T2Q2, 1 = T1Q2:T2Q1). |
| POS | integer | 0-based index of position within contig. |
| REF/ALT | string | String of reference/alternate sequence at this position. |
| CREDIT | float | Fraction of partial credit this variant received. |
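
The `PHASE` and `BLOCK_STATE` encodings described in this table can be decoded with two small lookup tables. This is a minimal sketch rather than anything taken from vcfdist: the function name `describe` and the label strings are illustrative, and the mapping for state 1 assumes the intended pairing is truth haplotype 1 with query haplotype 2 and truth haplotype 2 with query haplotype 1.

```python
# Meaning of the PHASE character reported per supercluster.
PHASE_MEANING = {"=": "same", "X": "swap", "?": "unknown"}

# Meaning of BLOCK_STATE: which query haplotype each truth haplotype maps to.
# State 1 is assumed to be the swapped pairing (T1 with Q2, T2 with Q1).
BLOCK_STATE_MAPPING = {
    0: {"T1": "Q1", "T2": "Q2"},
    1: {"T1": "Q2", "T2": "Q1"},
}


def describe(phase, block_state):
    """Return a human-readable summary of one PHASE / BLOCK_STATE pair."""
    mapping = BLOCK_STATE_MAPPING[block_state]
    pairs = ":".join(f"{truth}{query}" for truth, query in sorted(mapping.items()))
    return f"phase={PHASE_MEANING[phase]}, mapping={pairs}"


print(describe("=", 0))
print(describe("X", 1))
```

An unrecognized phase character or block state raises `KeyError`, which is usually preferable to silently mislabeling a block.
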
5 changes: 5 additions & 0 deletions docs/update.md
@@ -0,0 +1,5 @@
1. Update `vcfdist` version in `globals.h`
2. Replace `vcfdist --help` text in `src/README.md`
3. Update `vcfdist` version in `README.md`
4. Build and deploy new Docker image
5. Make new release on GitHub
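
Steps 1 to 3 all touch the same version string, so a quick consistency check before building the Docker image can catch a missed file. The sketch below is one possible way to do that, not part of the repository. It assumes every file contains the version written as `vX.Y.Z` somewhere in its text; how `globals.h` actually stores the version is not shown in this commit, so the regular expression and the `globals.h` sample line are assumptions.

```python
import re

# Assumed version format: "v" followed by three dot-separated integers.
VERSION_RE = re.compile(r"v(\d+\.\d+\.\d+)")


def check_consistent(files):
    """Return the single version found in all files, or raise ValueError.

    `files` maps a file name to its text content.
    """
    seen = {name: set(VERSION_RE.findall(text)) for name, text in files.items()}
    missing = sorted(name for name, versions in seen.items() if not versions)
    if missing:
        raise ValueError(f"no version string found in: {missing}")
    all_versions = set().union(*seen.values())
    if len(all_versions) != 1:
        raise ValueError(f"inconsistent versions: {seen}")
    return all_versions.pop()


# Sample contents; the globals.h line is hypothetical.
sample = {
    "src/globals.h": 'version = "v2.1.0"',
    "README.md": "vcfdist v2.1.0",
    "src/README.md": "print vcfdist version (v2.1.0)",
}
print(check_consistent(sample))
```

In practice the dictionary would be filled by reading the three files from disk before calling `check_consistent`.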
37 changes: 37 additions & 0 deletions src/Dockerfile
@@ -0,0 +1,37 @@
FROM ubuntu:20.04

# set environment variables
ENV LANG=C.UTF-8 \
    LC_ALL=C.UTF-8 \
    PATH=/opt/bin/vcfdist/src:$PATH \
    DEBIAN_FRONTEND=noninteractive \
    LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH

# install packages
RUN apt-get update --fix-missing && \
    yes | apt-get upgrade && \
    apt-get install -y \
        git \
        make \
        g++ \
        curl \
        wget \
        zlib1g-dev \
        libbz2-dev \
        liblzma-dev

# set up HTSlib
RUN wget https://github.com/samtools/htslib/releases/download/1.17/htslib-1.17.tar.bz2 && \
    tar -xvf htslib-1.17.tar.bz2
WORKDIR ./htslib-1.17
RUN ./configure --prefix=/usr/local && \
    make && \
    make install

# clone repo
WORKDIR /opt/bin
RUN git clone https://github.com/TimD1/vcfdist

# build vcfdist
WORKDIR /opt/bin/vcfdist/src
RUN make
58 changes: 23 additions & 35 deletions src/README.md
@@ -8,68 +8,56 @@ Required:
<STRING> ref.fasta FASTA file containing draft reference sequence
Options:
Inputs/Outputs:
-b, --bed <STRING>
BED file containing regions to evaluate
-p, --prefix <STRING> [./]
prefix for output files (directory needs a trailing slash)
-v, --verbosity <INTEGER> [1]
printing verbosity (0: succinct, 1: default, 2:verbose)
Variant Filtering:
-s, --smallest-variant <INTEGER> [1]
minimum variant size, smaller variants ignored (SNPs are size 1)
-l, --largest-variant <INTEGER> [5000]
maximum variant size, larger variants ignored
--min-qual <INTEGER> [0]
minimum variant quality, lower qualities ignored
--max-qual <INTEGER> [60]
maximum variant quality, higher qualities kept but thresholded
ReAlignment:
-r, --realign-only
standardize truth and query variant representations, then exit
-q, --keep-query
do not realign query variants, keep original representation
-t, --keep-truth
do not realign truth variants, keep original representation
-x, --mismatch-penalty <INTEGER> [3]
Smith-Waterman mismatch (substitution) penalty
-o, --gap-open-penalty <INTEGER> [2]
Smith-Waterman gap opening penalty
-e, --gap-extend-penalty <INTEGER> [1]
Smith-Waterman gap extension penalty
--min-qual <INTEGER> [0]
minimum variant quality, lower qualities ignored
--max-qual <INTEGER> [60]
maximum variant quality, higher qualities kept but thresholded
-s, --smallest-variant <INTEGER> [1]
minimum variant size, smaller variants ignored (SNPs are size 1)
-l, --largest-variant <INTEGER> [5000]
maximum variant size, larger variants ignored
-i, --max-iterations <INTEGER> [4]
maximum iterations for expanding/merging clusters
-g, --supercluster-gap <INTEGER> [50]
minimum base gap between independent superclusters
Clustering:
--simple-cluster
instead of biWFA-based clustering, use gap-based clustering
Utilization:
--max-threads <INTEGER> [64]
maximum threads to use for precision/recall alignment
(haps*contigs used for wavefront clustering)
maximum threads to use for clustering and precision/recall alignment
--max-ram <FLOAT> [64.000GB]
maximum RAM to use for precision/recall alignment
(work in-progress, more may be used in other steps)
(approximate) maximum RAM to use for precision/recall alignment
Miscellaneous:
-h, --help
show this help message
-a, --advanced
show advanced options
show advanced options, not recommended for most users
-c, --citation
please cite vcfdist if used in your analyses
please cite vcfdist if used in your analyses: thanks!
-v, --version
print vcfdist version (v2.0.3)
print vcfdist version (v2.1.0)
```
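
The size and quality options in the help text can be read as a keep-or-drop rule plus a quality cap. The sketch below illustrates that reading using the documented defaults. It is not vcfdist's implementation: the function `filter_variant` is hypothetical, and it assumes both size bounds are inclusive and that the variant's size has already been computed (the help text only states that SNPs have size 1).

```python
from typing import Optional


def filter_variant(
    size: int,
    qual: int,
    smallest: int = 1,    # --smallest-variant default
    largest: int = 5000,  # --largest-variant default
    min_qual: int = 0,    # --min-qual default
    max_qual: int = 60,   # --max-qual default
) -> Optional[int]:
    """Return the (possibly capped) quality if the variant is kept, else None."""
    # Variants outside the size range are ignored (bounds assumed inclusive).
    if size < smallest or size > largest:
        return None
    # Variants below the minimum quality are ignored.
    if qual < min_qual:
        return None
    # Higher qualities are kept but thresholded at max_qual.
    return min(qual, max_qual)


print(filter_variant(size=1, qual=75))     # SNP with quality above the cap
print(filter_variant(size=6000, qual=30))  # larger than --largest-variant
```

Returning `None` for ignored variants keeps "dropped" distinct from "kept with quality 0", which matters because the default `--min-qual` is 0.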