Efficient Bioinformatics Workflows for High-Throughput Sequence Analysis
The Combinatorial Bioinformatic Meta-Framework (CBMF) is a single point
of access to thousands of biomedical research software packages with additional
tools and resources for analyzing high-throughput sequencing data. It uses
the micromamba package manager to create conda environments,
install software, and manage dependencies. The framework is designed to be
modular, allowing users to select the tools they need for their specific analyses.
The rapid growth of next-generation sequencing technologies has generated an unprecedented volume of biological data, posing significant challenges for bioinformatics analysis. Traditional scripting approaches often lack reproducibility and can be complex for users without extensive programming expertise. This project introduces a collection of Linux shell scripts designed to address these challenges. These scripts implement standardized workflows for quality control, alignment, and report generation tasks across diverse datasets. By automating these processes, the scripts ensure reproducibility, minimize human error, and promote consistent data processing. This suite offers a scalable and reliable solution for comprehensive bioinformatics analysis, representing an important advancement in making high-throughput sequencing data more accessible and manageable for the broader research community.
- Use Python to make things typed or iterable.
- Rely on ultra-portable shell scripts for the tools we interface with.
The wiki is no longer maintained and has been migrated here.
Install prerequisites:

    "${SHELL}" <(curl -L micro.mamba.pm/install.sh)

Clone the repository and change into it:

    git clone https://github.com/rdnajac/cbmf && cd cbmf

Create the environment:

    micromamba create -f cbmf.yml -y

Activate the environment and install the project:

    micromamba activate cbmf && pip install -e .

Optionally, set up tab completion for the cbmf command:

    # add this to your .bashrc or .zshrc
    # uncomment these lines if using zsh
    # autoload -U bashcompinit
    # bashcompinit
    eval "$(register-python-argcomplete cbmf)"

Bundled with CBMF is the lightweight package manager micromamba,
with access to the entire suite of bioinformatics software available on
Bioconda [1], a channel for the conda package manager (including
all available Bioconductor [2] software).
Micromamba is not a conda distribution, but a statically linked C++ executable
that can be used to install conda environments. It is a lightweight binary
that handles the installation of conda environments without root privileges,
or the need for a base environment or a Python installation, making it ideal
for use in high-performance computing clusters.
If you want to install it on your own, skip the init scripts and run:

    "$SHELL" <(curl -L micro.mamba.pm/install.sh)

CBMF comes with some sensible defaults and pre-configured environments for common bioinformatics tasks, but users can easily create their own environments using the Mamba API.
Skip this section and read about Quality Control if you have already received the demultiplexed FASTQ files.
The following command is the default bcl2fastq
command for demultiplexing on the NextSeq, but with the --no-lane-splitting
option added to combine the reads from all four lanes into a single FASTQ file:

    bcl2fastq --no-lane-splitting \
      --ignore-missing-bcls \
      --ignore-missing-filter \
      --ignore-missing-positions \
      --ignore-missing-controls \
      --auto-set-to-zero-barcode-mismatches \
      --find-adapters-with-sliding-window \
      --adapter-stringency 0.9 \
      --mask-short-adapter-reads 35 \
      --minimum-trimmed-read-length 35 \
      -R "$run_folder" \
      -o "$output_folder" \
      --sample-sheet "$sample_sheet"

You can copy and paste this command once you have set the variables $run_folder,
$output_folder, and $sample_sheet to the appropriate values.
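
In keeping with the goal of using Python to make things typed or iterable, the same invocation could be assembled as an argument list instead of a pasted string. This is a minimal sketch and not CBMF's actual implementation: the function name `build_bcl2fastq_command` and its parameters are hypothetical, the paths in the demo are placeholders, and only the flags from the command above are used.

```python
from __future__ import annotations

import shlex
from pathlib import Path


def build_bcl2fastq_command(
    run_folder: Path, output_folder: Path, sample_sheet: Path
) -> list[str]:
    """Assemble the bcl2fastq command shown above as an argument list.

    Hypothetical helper for illustration: it only builds the command
    and never runs it.
    """
    return [
        "bcl2fastq",
        "--no-lane-splitting",
        "--ignore-missing-bcls",
        "--ignore-missing-filter",
        "--ignore-missing-positions",
        "--ignore-missing-controls",
        "--auto-set-to-zero-barcode-mismatches",
        "--find-adapters-with-sliding-window",
        "--adapter-stringency", "0.9",
        "--mask-short-adapter-reads", "35",
        "--minimum-trimmed-read-length", "35",
        "-R", str(run_folder),
        "-o", str(output_folder),
        "--sample-sheet", str(sample_sheet),
    ]


if __name__ == "__main__":
    # Placeholder paths; substitute your own run folder and sample sheet.
    cmd = build_bcl2fastq_command(
        Path("run"), Path("fastq"), Path("run/SampleSheet.csv")
    )
    # Print a shell-quoted version for inspection before running anything.
    print(shlex.join(cmd))
```

A list like this can be passed directly to `subprocess.run(cmd, check=True)`, which avoids shell quoting problems when paths contain spaces.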
Warning:
As of this document's last revision, bcl2fastq is no longer supported;
use bclconvert
if you are using a recent Illumina sequencer (NovaSeq, NextSeq 1000/2000, etc.).
Quality control is an essential step in the analysis of high-throughput sequencing data. It allows us to assess the quality of the reads and identify any issues that may affect downstream analysis, like adapter contamination or low-quality reads. More interesting quality issues include GC bias, mitochondrial contamination, and over-representation of certain sequences.
| Tool | Description | Source |
|---|---|---|
| FastQC [3] | Generates HTML reports containing straightforward metrics | GitHub |
| GATK [4] | Analyzes high-throughput sequencing data | GitHub |
| Picard Tools | Manipulates high-throughput sequencing data | Comes packaged with GATK4 |
| MultiQC [5] | Aggregates results from bioinformatics analyses | GitHub |
To run these QC applications, you need a suitable Java Runtime Environment (JRE).
Let micromamba handle the installation of the JRE and the tools from Bioconda:

    micromamba create -n qc -c conda-forge -c bioconda fastqc gatk4 picard multiqc
    micromamba run -n qc fastqc -o <output_dir> <fastq_file>

You can also use the qc configuration file in the dev folder to create the environment:

    micromamba env create -f dev/qc.yml
    micromamba activate qc

Tip:
After aligning the reads to the reference genome, these tools can be rerun on the resulting SAM/BAM files to check that the alignment was successful or to consolidate the results from paired-end sequencing.
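
When there are many FASTQ files, the per-file FastQC call can be generated in a loop and followed by a single MultiQC run over the output directory. The sketch below assumes the `qc` environment created above. The function name `qc_commands`, its parameters, and the list of file suffixes are hypothetical and are not part of CBMF.

```python
from __future__ import annotations

from pathlib import Path

# Assumed FASTQ file suffixes; adjust to match your own naming scheme.
FASTQ_SUFFIXES = (".fastq", ".fastq.gz", ".fq", ".fq.gz")


def qc_commands(
    fastq_dir: Path, output_dir: Path, env: str = "qc"
) -> list[list[str]]:
    """Return one FastQC command per FASTQ file, then one MultiQC command.

    Hypothetical helper for illustration: commands are only built,
    not executed. Returns an empty list if no FASTQ files are found.
    """
    prefix = ["micromamba", "run", "-n", env]
    fastq_files = sorted(
        p for p in Path(fastq_dir).iterdir() if p.name.endswith(FASTQ_SUFFIXES)
    )
    commands = [
        prefix + ["fastqc", "-o", str(output_dir), str(p)] for p in fastq_files
    ]
    if commands:
        # MultiQC scans the FastQC output directory and writes its report there.
        commands.append(prefix + ["multiqc", str(output_dir), "-o", str(output_dir)])
    return commands
```

Each returned command can be passed to `subprocess.run(cmd, check=True)`. FastQC does not create its output directory, so create it first (for example with `Path(output_dir).mkdir(parents=True, exist_ok=True)`).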
Read the wiki for details on experiment-specific processing and analysis.
The manuscript for this project is currently in preparation (kinda) and uses the
Oxford University Press (OUP) Bioinformatics template.
Bioinformatics is an official journal of the International Society for Computational Biology (ISCB).
Announcement: Bioinformatics flipped to become a fully open access journal on 1st January 2023. Any new submissions or accepted manuscripts will publish under an OA license. All material published prior to 2023 is free to view, and all rights are reserved. Please find more details on our Open Access page.
Writing guides:
Nerd stuff:
Message boards:
FAQs:
Shout out to these awesome docs:
Thank you to my labmates in the Palomero Lab for their feedback and guidance.
Footnotes
1. Grüning, B., Dale, R., Sjödin, A. et al. Bioconda: sustainable and comprehensive software distribution for the life sciences. Nat Methods 15, 475–476 (2018). https://doi.org/10.1038/s41592-018-0046-7
2. Gentleman, R.C., Carey, V.J., Bates, D.M. et al. Bioconductor: open software development for computational biology and bioinformatics. Genome Biol 5, R80 (2004). https://doi.org/10.1186/gb-2004-5-10-r80
3. Andrews S. (2010). FastQC: a quality control tool for high throughput sequence data. Available online at: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
4. McKenna A, Hanna M, Banks E, et al. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 2010;20(9):1297-1303. PMID: 20644199
5. Ewels, P., Magnusson, M., Lundin, S., & Käller, M. (2016). MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics, 32(19), 3047–3048. https://doi.org/10.1093/bioinformatics/btw354