umi-transfer

A command line tool for transferring Unique Molecular Identifiers (UMIs) provided as a separate FastQ file to the headers of records in paired FastQ files.



License: MIT

Background

To increase the accuracy of quantitative DNA sequencing experiments, Unique Molecular Identifiers may be used. UMIs are short sequences used to uniquely tag each molecule in a sample library, enabling precise identification of read duplicates. They must be added during library preparation, prior to sequencing, and therefore require appropriate arrangements with your sequencing provider.

Most tools that can take UMIs into account during an analysis workflow expect the respective UMI sequence to be embedded in the read's ID. Please consult your tools' manuals regarding the exact specification.

For some library preparation kits and sequencing adapters, the UMI sequence needs to be read together with the index from the antisense strand. Consequently, it will be output as a separate FastQ file during the demultiplexing process.

This tool efficiently integrates these separate UMIs into the headers and can also correct divergent read numbers back to the canonical 1 and 2.
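To make the header edit concrete, the sketch below mimics it with standard shell tools on a single made-up header. The read name, the UMI `ACGTACGT`, and the exact placement of the UMI (joined to the read ID before the first space, using the default `:` delimiter) are assumptions for illustration only. The functions `add_umi` and `correct_read_number` are hypothetical helpers, not part of umi-transfer, and the tool's real output remains the authoritative reference.

```shell
#!/bin/sh
# Illustration only: mimic the header edit for one made-up record.
# Assumption: the UMI is joined to the read ID (the part before the first space).

add_umi() {
    # $1 = FastQ header line, $2 = UMI sequence, $3 = delimiter (default ":")
    header=$1
    umi=$2
    delim=${3:-:}
    case $header in
        *" "*) printf '%s%s%s %s\n' "${header%% *}" "$delim" "$umi" "${header#* }" ;;
        *)     printf '%s%s%s\n' "$header" "$delim" "$umi" ;;
    esac
}

correct_read_number() {
    # $1 = header line, $2 = canonical read number (1 or 2)
    # Assumes an Illumina-style comment field such as "3:N:0:ATCACG"
    # and rewrites its leading read number.
    printf '%s\n' "$1" | sed "s/ [0-9]:/ $2:/"
}

add_umi '@SEQ_ID:1:FC:1:1101:1000:2000 3:N:0:ATCACG' 'ACGTACGT'
# prints: @SEQ_ID:1:FC:1:1101:1000:2000:ACGTACGT 3:N:0:ATCACG
```

Piping the result through `correct_read_number ... 2` would turn the leading `3` of the comment field into `2`, which is the kind of change the read number correction is described as making.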

Installation

Binary Installation

Binaries for umi-transfer are available for most platforms and can be obtained from the Releases page on GitHub. Simply navigate to the releases and download the appropriate binary for your operating system. Once downloaded, you can place it in a directory of your choice and optionally add the binary to your system's $PATH.

Bioconda

umi-transfer is also available on Bioconda. Please refer to the Bioconda documentation for comprehensive installation instructions. If you are already familiar with conda and Bioconda, here's a quick reference:

mamba install umi-transfer

If you wish to create a separate virtual environment for the tool, replace <myenvname> with a suitable environment name of your choice and run

mamba create --name <myenvname> umi-transfer

Containerized execution (Docker)

Docker provides a platform for packaging software into self-contained units called containers. Containers encapsulate all the dependencies and libraries needed to run an application, making it easy to deploy and run the software consistently across different environments.

To use umi-transfer with Docker, you can pull the pre-made Docker image from Docker Hub. Open a terminal or command prompt and run the following command:

docker pull mzscilifelab/umi-transfer:latest

Once the image is downloaded, you can run umi-transfer within a Docker container using:

docker run -t -v `pwd`:`pwd` -w `pwd` mzscilifelab/umi-transfer:latest umi-transfer --help

A complete command might look like the example below. The Docker options -v and -w mount your current directory into the container and make it the working directory there, while -t allocates a terminal. Everything after the image name follows the standard command line syntax:

docker run -t -v `pwd`:`pwd` -w `pwd` mzscilifelab/umi-transfer:latest umi-transfer external --in=read1.fq --in2=read2.fq --umi=umi.fq

Optionally, you can create an alias for the Docker part of the command to use the containerized version as if it were installed locally. Add the line below to your ~/.profile, ~/.bash_aliases, ~/.bashrc or ~/.zprofile (depending on the terminal and configuration being used). Note the single quotes: they ensure that pwd is evaluated each time the alias is used, rather than once when the alias is defined.

alias umi-transfer='docker run -t -v `pwd`:`pwd` -w `pwd` mzscilifelab/umi-transfer:latest umi-transfer'

Compile from source

Provided that you have Rust installed on your computer, clone or download this repository and run:

cargo build --release

That should create an executable target/release/umi-transfer that can be placed anywhere in your $PATH or be executed directly by specifying its path:

./target/release/umi-transfer --version
umi-transfer 1.5.0

Usage

The tool requires three FastQ files as input: two files with reads and one with the UMIs. You can specify the names and locations of the output files with --out and --out2; otherwise, the tool automatically appends a with_UMI suffix to your input file names. You can also choose a custom UMI delimiter with --delim and set the flags -f, -c and -z.

-c ensures the canonical read numbers 1 and 2 in the output, regardless of the read numbers of the input reads. -f / --force overwrites existing output files without prompting the user, and -z enables internal compression of the output files. Alternatively, you can specify an output file name with a .gz suffix to obtain compressed output.

$ umi-transfer external --help


Integrate UMIs from a separate FastQ file

Usage: umi-transfer external [OPTIONS] --in <R1_IN> --in2 <R2_IN> --umi <RU_IN>

Options:
  -c, --correct_numbers
          Read numbers will be altered to ensure the canonical read numbers 1 and 2 in output file sequence headers.


  -z, --gzip
          Compress output files. Turned off by default.


  -l, --compression_level <COMPRESSION_LEVEL>
          Choose the compression level: Maximum 9, defaults to 3. Higher numbers result in smaller files but take longer to compress.


  -t, --threads <NUM_THREADS>
          Number of threads to use for processing. Defaults to the number of logical cores available.


  -f, --force
          Overwrite existing output files without further warnings or prompts.


  -d, --delim <DELIM>
          Delimiter to use when joining the UMIs to the read name. Defaults to `:`.


      --in <R1_IN>
          [REQUIRED] Input file 1 with reads.


      --in2 <R2_IN>
          [REQUIRED] Input file 2 with reads.


  -u, --umi <RU_IN>
          [REQUIRED] Input file with UMI.


      --out <R1_OUT>
          Path to FastQ output file for R1.


      --out2 <R2_OUT>
          Path to FastQ output file for R2.


  -h, --help
          Print help
  -V, --version
          Print version

Example

A typical run may look like this:

umi-transfer external -fz -d '_' --in 'R1.fastq' --in2 'R3.fastq' --umi 'R2.fastq'

umi-transfer requires paired input files. To run on singletons, use the same input twice and redirect one output to /dev/null:

umi-transfer external --in read1.fastq --in2 read1.fastq --umi read2.fastq --out output1.fastq --out2 /dev/null

Benchmarks and parameter recommendations

With the release of version 1.5, umi-transfer features internal multi-threaded output compression. As a result, umi-transfer 1.5 now runs approximately 25 times faster than version 1.0 when using internal compression and about twice as fast compared to using an external compression tool. This improvement is enabled by the outstanding gzp crate, which abstracts a lot of the underlying complexity away from the main software.

Benchmark of different tool versions

In our first benchmark using 17 threads, version 1.5 of umi-transfer processed approximately 550,000 paired records per second with the default gzip compression level of 3. At the highest compression level of 9, the rate dropped to just below 200,000 records per second. While the exact numbers may vary depending on your storage, file system, and processors, we expect the relative performance rates to remain approximately constant.
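As a rough worked example, the throughput figures above translate into run times as follows. The sketch assumes a hypothetical library of 100 million read pairs and uses the benchmark rates as they are; `estimate_seconds` is a made-up helper, and the results are order-of-magnitude estimates rather than predictions for your hardware. Since the level 9 rate was just below 200,000 records per second, the second figure is a lower bound.

```shell
#!/bin/sh
# Rough wall-clock estimate: records divided by throughput, rounded up.
estimate_seconds() {
    # $1 = number of paired records, $2 = records processed per second
    echo $(( ($1 + $2 - 1) / $2 ))
}

estimate_seconds 100000000 550000   # compression level 3: prints 182
estimate_seconds 100000000 200000   # compression level 9: prints 500
```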

Benchmark of thread numbers

In a subsequent benchmark, we tested the effect of increasing the number of threads. For the default compression level, the maximum speed was achieved with 9 to 11 threads. Since umi-transfer writes two output files simultaneously, this configuration allows for 4 to 5 threads per file to handle the output compression.

Adding more threads per file proved unhelpful, as other steps became the rate-limiting factors. These factors include file system I/O, input file decompression, and the actual editing of the file contents, which now determine the performance of umi-transfer. Only when increasing the compression level to higher settings did adding more threads continue to provide a performance benefit. For the highest compression setting, we did not reach the plateau phase during the benchmark, but it is likely to occur in the range of 53-55 total threads, or about 26 threads per output file.

In summary, we recommend running umi-transfer with 9 or 11 threads for compression. Odd numbers are favorable as they allow one dedicated main thread, while evenly splitting the remaining threads between the two output files. It's important to note that specifying more threads than the available physical or logical cores on your machine will result in a severe performance loss, since umi-transfer operates synchronously.
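The recommendation can be turned into a small helper that picks a thread count for a given machine. This is a minimal sketch, not part of umi-transfer: `recommend_threads` is a hypothetical function that caps the count at 11, never exceeds the available cores, and prefers an odd number where possible.

```shell
#!/bin/sh
# Pick a thread count following the recommendation above:
# at most 11, never more than the available cores, and odd where possible.
recommend_threads() {
    # $1 = number of logical cores available
    threads=$1
    if [ "$threads" -gt 11 ]; then
        threads=11
    fi
    # Prefer an odd count: one main thread plus an even split across two outputs.
    if [ "$threads" -gt 1 ] && [ $((threads % 2)) -eq 0 ]; then
        threads=$((threads - 1))
    fi
    echo "$threads"
}

recommend_threads 16   # prints 11
recommend_threads 8    # prints 7

# Possible use (assumes getconf reports the online core count on your system):
# umi-transfer external --threads "$(recommend_threads "$(getconf _NPROCESSORS_ONLN)")" ...
```

For higher compression levels, the cap of 11 is likely too conservative, since the benchmark showed continued gains from additional threads in that setting.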

Chaining with other software

umi-transfer cannot be used with the pipe operator, because it neither supports writing output to stdout nor reading input from stdin. However, FIFOs (First In, First Out buffered pipes) can be used to elegantly combine umi-transfer with other software on GNU/Linux and macOS operating systems.

For example, we may want to use external compression software like Parallel Gzip together with umi-transfer. For this purpose, it would be unfavorable to write the data uncompressed to disk before compressing it. Instead, we create named pipes with mkfifo, which can be provided to umi-transfer as if they were regular output file paths. In reality, the data is directly passed on to pigz via a buffered stream.

First, the named pipes are created:

mkfifo output1
mkfifo output2

Then a multi-threaded pigz compression is tied to the FIFO. Note the trailing & to leave these processes running in the background.

$ pigz -p 10 -c > output1.fastq.gz < output1 &
[4] 233394
$ pigz -p 10 -c > output2.fastq.gz < output2 &
[5] 233395

The argument -p 10 specifies the number of threads that each pigz process may use. The optimal setting is hardware-specific and will require some testing.

Finally, we can run umi-transfer using the FIFOs as output paths:

umi-transfer external --in read1.fastq --in2 read3.fastq --umi read2.fastq --out output1 --out2 output2

It's good practice to remove the FIFOs after the program has finished:

rm output1 output2
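The FIFO pattern itself can be tried without umi-transfer or pigz installed. In the minimal sketch below, `gzip` stands in for pigz and a `printf` stands in for umi-transfer writing to the FIFO; the temporary directory, file names, and the record are made up for illustration.

```shell
#!/bin/sh
# Self-contained demonstration of the FIFO pattern with stand-in tools.
workdir=$(mktemp -d)
mkfifo "$workdir/output1"

# Reader: compress whatever arrives on the FIFO, in the background.
gzip -c < "$workdir/output1" > "$workdir/output1.fastq.gz" &

# Writer: stand-in for umi-transfer, which would receive the FIFO via --out.
printf '@read1:ACGT 1:N:0\nTTGA\n+\nIIII\n' > "$workdir/output1"

# Wait for the background compressor to finish before using its output.
wait

# Remove the FIFO, then inspect the compressed result.
rm "$workdir/output1"
result=$(gzip -dc "$workdir/output1.fastq.gz")
echo "$result"
rm -r "$workdir"
```

The `wait` matters: without it, a script could go on to read or move the compressed file while the background process is still flushing its output.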

Contribution guide for developers

umi-transfer is free and open-source software developed and maintained by scientists of the Swedish National Genomics Infrastructure. We gladly welcome suggestions for improvement, bug reports and code contributions.

If you'd like to contribute code, the best way to get started is to create a personal fork of the repository. Subsequently, use a new branch to develop your feature or contribute your bug fix. Ideally, use a code linter like rust-analyzer in your code editor and run the tests with cargo test.

Before developing a new feature, we recommend opening an issue on the main repository to discuss your proposal upfront. Once you're ready, simply open a pull request to the dev branch and we'll happily review your changes. Thanks for your interest in contributing to umi-transfer!