UPDATE: JULY 2019
I no longer work for STFC. All versions of HULK pre 1.0.0 have been renamed and archived to the STFC github. The STFC Hartree Centre are building genomic solutions based on these and other tools - if you are interested, please contact them.
This repo now hosts HULK >= version 1.0.0, which is a complete re-implementation of HULK based solely on the method described in the open-access paper.
I've tried to keep much of the syntax and existing functionality, but make sure to check the change log below. It's a work in progress but the master branch should be a close drop-in replacement for the old HULK (for sketching at least). There are a few algorithmic differences, mainly that HULK now uses minimizer frequencies to represent the underlying microbiome sample.
Importantly, this project is now fully open source!
HULK is a tool that creates small, fixed-size sketches from streaming microbiome sequencing data, enabling rapid metagenomic dissimilarity analysis. HULK approximates a k-mer spectrum from a FASTQ data stream, incrementally sketches it and makes similarity search queries against other microbiome sketches.
HULK works by collecting minimizers from sequences. Minimizers are assigned to a finite number of histogram bins using a consistent jump hash; these bins are incremented as their corresponding minimizers are found. At set intervals (i.e. after X sequences have been processed), the bins are histosketched by HULK. Similarly to MinHash sketches, histosketches can be used to estimate similarity between sequence data sets.
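The minimizer binning step described above can be sketched roughly as follows. This is a minimal illustration and not HULK's actual code: the function names (`hashKmer`, `minimizers`, `jumpHash`), the FNV-1a hash, the window size and the bin count are all assumptions made for the example, and reverse-complement (canonical) k-mers are ignored for brevity. The jump hash itself follows the published jump consistent hash algorithm of Lamping and Veach.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// hashKmer hashes a k-mer with FNV-1a (an arbitrary choice for this sketch).
func hashKmer(kmer string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(kmer))
	return h.Sum64()
}

// minimizers returns the hash of the smallest-hash k-mer in each window of
// w consecutive k-mers, reporting each minimizer position only once.
func minimizers(seq string, k, w int) []uint64 {
	if k <= 0 || w <= 0 {
		return nil
	}
	n := len(seq) - k + 1 // number of k-mers in the sequence
	if n < w {
		return nil
	}
	hashes := make([]uint64, n)
	for i := range hashes {
		hashes[i] = hashKmer(seq[i : i+k])
	}
	var out []uint64
	last := -1 // position of the most recently reported minimizer
	for start := 0; start+w <= n; start++ {
		best := start
		for i := start + 1; i < start+w; i++ {
			if hashes[i] < hashes[best] {
				best = i
			}
		}
		if best != last {
			out = append(out, hashes[best])
			last = best
		}
	}
	return out
}

// jumpHash is the jump consistent hash: it maps a 64-bit key to a bucket
// in the range [0, numBuckets).
func jumpHash(key uint64, numBuckets int32) int32 {
	var b, j int64 = -1, 0
	for j < int64(numBuckets) {
		b = j
		key = key*2862933555777941757 + 1
		j = int64(float64(b+1) * (float64(int64(1)<<31) / float64((key>>33)+1)))
	}
	return int32(b)
}

func main() {
	const numBins = 16 // hypothetical histogram size
	bins := make([]uint32, numBins)

	// Place every minimizer of a toy read into a bin and count it.
	for _, m := range minimizers("ACGTACGTTGCAACGTAGCTAGCTAGGATCC", 7, 4) {
		bins[jumpHash(m, numBins)]++
	}
	fmt.Println(bins)
}
```

In a streaming setting the same `bins` slice would keep being updated as reads arrive, and would be handed to the histosketching step at each interval.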
The advantages of HULK include:
- it's fast and can run on a laptop
- hulk sketches are compact, fixed size and incorporate k-mer frequency information
- it works on data streams and does not require complete data instances
- it can use concept drift for histosketching
- you get to type hulk smash into the command line...
Finally, you can use hulk sketches with a machine learning classifier to predict microbiome sample origin (see the paper and BANNER).
- WASM interface
  - run HULK locally and from a browser
  - based on my baby-GROOT user interface
- HULK will output additional sketches
  - KMV MinHash
  - HyperMinHash
- Indexing
  - re-implementation of the LSH Forest index
- fully re-written codebase
  - I've aimed for it to be largely backwards compatible with previous releases
- fully open-sourced!
  - MIT license (OSI approved)
- algorithm changes
  - underlying histogram is now based on minimizer frequencies
  - count-min sketch for k-mer frequencies is now replaced with a fixed-size array and a jump-hash for minimizer placement
- changes to the sketch subcommand:
  - sketches saved to JSON by default (à la sourmash)
  - histosketch count-min sketch is no longer configurable by the user (this was Epsilon and Delta)
  - spectrum size is determined based on k-mer size
  - minCount for k-mer frequencies is removed
- changes to the smash subcommand:
  - operates on JSON input
  - outputs matrix as CSV
- replaced some unnecessary features
  - the functionality of the print and distance subcommands is available in the smash subcommand
- all versions of HULK (and BANNER) pre v1.0.0 have been moved to the UKRI github and renamed. I can no longer work on these code bases.
Check out the releases to download a binary. Alternatively, install using Bioconda or compile the software from source.
For versions <1.0.0, use bioconda. I will add the recipe for HULK 1.0.0 asap.
conda install -c bioconda hulk
HULK is written in Go (v1.12) - to compile from source you will first need the Go toolchain. Once you have it, try something like this to compile:
# Clone this repository
git clone https://github.com/will-rowe/hulk.git
# Go into the repository and get the package dependencies
cd hulk
go get -d -t -v ./...
# Run the unit tests
go test -v ./...
# Compile the program
go build ./
# Call the program
./hulk --help
HULK is called by typing hulk, followed by the subcommand you wish to run. The main subcommands are sketch and smash:
# Create a hulk sketch
gunzip -c microbiome.fq.gz | hulk sketch -o sketches/sampleA
# Get a pairwise weighted Jaccard similarity matrix for a set of hulk histosketches
hulk smash -k 31 -m weightedjaccard -d ./sketches -o myOutfile
I'm working on some new documentation and this will be available on readthedocs soon.
A paper describing the HULK method is published in Microbiome:
Rowe WPM et al. Streaming histogram sketching for rapid microbiome analytics. Microbiome. 2019.