This is my own fork of the SNPTools::hapfuse tool in SNPTools v1.0.
-
Dependency on htslib for parsing input chunks. This allows me to avoid nasty buffer overruns when input VCFs contain too many samples. Also allows for the use of BCF files.
-
Input VCFs must contain the GT, and APP or GP field. APP stands for phred-scaled allelic probability. SNPTools uses the AP field, which stands for allelic probability.
-
Input VCFs can have fields with arbitrary precision. SNPTools::hapfuse uses the AP field with fixed precision floats while hapfuse v0.4 accepts APP and GP fields of arbitrary precision.
-
Speed and Reliability. Hapfuse is now way faster than SNPTools::hapfuse as it uses htslib for VCF/BCF parsing. Hapfuse also concurrently loads, fuses and writes chunks, so expect to see CPU usage at about %200.
-
Regression Tests. Hapfuse comes with regression tests to make sure it works!
Hapfuse does not work with data generated by SNPTools::impute. This is because I do not use the AP field and am too lazy to implement support for it. Patches welcome.
This code was written and is maintained by Warren Kretzschmar [email protected]. For bug reports, please raise an issue on the bitbucket page.
I ran these commands on a fresh ubuntu 16.04 LTS server installation
to get a dynamic and static binary of hapfuse
:
sudo apt install perl
cpan install File::Slurp
cpan install Test::Files # only needed for testing
sudo apt install libboost-all-dev
sudo apt install zlib1g-dev
sudo apt install libhts-dev # installs version 1.2.1 for me
This code explains how to build hapfuse v1.1
# clone the repo
git clone [email protected]:wkretzsch/hapfuse.git
cd hapfuse
# configure and build hapfuse in hapfuse.1.1
# the executable will be built at hapfuse.1.1/hapfuse
./bootstrap.pl
# optionally, run regression tests
cd hapfuse.1.1 && make check
I'm not sure why, but it appears that the final link command is
missing -lpthread
. This should be fixed by adding a pthread
dependency in configure.ac
. However, if I rerun the final link step
with the addition of -lpthread
then linking works fine (see for
example static linking commands below).
In order to link boost and hts statically I ran this command instead of the final linking command:
/bin/sh ./libtool --tag=CXX --mode=link g++ -O3 -std=gnu++11 -L/usr/local/lib -Wl,-R,/usr/local/lib /usr/users/flint/winni/lib/libboost_iostreams.a /users/flint/winni/lib/libhts.a /usr/lib64/libz.a /usr/lib64/libbz2.a -o hapfuse hapfuse.o hapSamp.o main.o utils.o -lpthread
When compiling on ubuntu 16.04 LTS I got the final linking step to work with the command:
/bin/bash ./libtool --tag=CXX --mode=link g++ -O3 -std=gnu++11 -o hapfuse src/hapfuse.o src/hapSamp.o src/utils.o src/writer.o src/hfHelper.o src/main.o /usr/lib/x86_64-linux-gnu/libhts.a /usr/lib/x86_64-linux-gnu/libz.a /usr/lib/x86_64-linux-gnu/libboost_iostreams.a /usr/lib/x86_64-linux-gnu/libbz2.a -lpthread
A pre-compiled binary that statically links boost and hts can now be
found in precompiled-binaries/ubuntu_x86-64/static_boost_hts/
.
Fusing three bcf files together and save as gzip compressed bcf
hapfuse file1.bcf file2.bcf file3.bcf -Ob -o fused.bcf.gz
Fusing three WTCCC style hap/sample file pairs together (see https://mathgen.stats.ox.ac.uk/genetics_software/shapeit/shapeit.html#hapsample) and saving as gzip compressed BCF
cat hap.files
file1.hap.gz
file2.hap.gz
file3.hap.gz
cat sample.files
file1.sample
file2.sample
file3.sample
# output as bcf
hapfuse -h hap.files -s sample.files -Ob -o fused.bcf.gz
# since version 1.0: output as WTCCC file pair
hapfuse -h hap.files -s sample.files -Ow -o fused.wtccc.haps.gz,fused.wtccc.sample
Hapfuse ligates overlapping haplotype chunks with identical samples by matching haplotypes at overlap sites. The -o option is required. Defaults are given in []. If VCF/BCF files are given on the command line, then hapfuse assumes that the VCF/BCF files contain haplotype chunks to ligate. WTCCC style haplotype chunks can be provided via the -h and -s arguments. The input and output format tags are restricted to 'GT' when using the -h and -s arguments.
The 'ligation method' is the method used to determine the diplotype of a sample at overlapping chunk sites.
The 'step' method discards the outer half of a chunk's overlap region haplotypes when determining the diplotype of an overlap region. This is the most commonly used method for ligating haplotypes.
The 'average' method uses the average of a sample's two overlapping diplotypes to determine the new diplotypes. This is the method that SNPTools::hapfuse uses, but it only makes sense when using GP or APP fields.
The 'linear' method is experimental. It is the same as average, but it gives more weight to diplotypes that are closer to a chunk's center.
I thank the following people for contributing code to this project:
- Joshua Randall for help with the autoconf files