diff --git a/index.html b/index.html
index 74d4c27..8abc299 100644
--- a/index.html
+++ b/index.html
@@ -1,37 +1,38 @@
+ Welcome to the CodonTransformer project page!
- CodonTransformer is a cutting-edge, multispecies deep learning model designed for state-of-the-art codon optimization. Trained on over 1 million gene-protein pairs from 164 organisms spanning all kingdoms of life, CodonTransformer leverages advanced neural network architectures to generate host-specific DNA sequences with natural-like codon usage patterns and minimized negative cis-regulatory elements for any protein sequence.
+ CodonTransformer is a cutting-edge, multispecies deep learning model designed for
+ state-of-the-art codon optimization. Trained on over 1 million gene-protein pairs from 164 organisms spanning
+ all kingdoms of life, CodonTransformer leverages advanced neural network architectures to generate
+ host-specific DNA sequences with natural-like codon usage patterns and minimized negative cis-regulatory
+ elements for any protein sequence.
+ The genetic code's degeneracy allows for multiple DNA sequences to encode the same protein, but not all codons are equal in the eyes of the host organism. Codon usage bias can significantly impact the efficiency of heterologous protein production due to differences in tRNA abundance, protein folding regulation, and evolutionary constraints.
-CodonTransformer addresses this challenge by using the Transformer architecture and a novel training strategy named STREAM to learn and replicate the intricate codon usage patterns across a diverse array of organisms. By doing so, it provides a powerful tool for tailoring DNA sequences to optimize protein expression in various host species.
- Fig. 1: CodonTransformer multispecies model with combined organism-amino acid-codon embedding.
-a. An encoder-only BigBird Transformer model trained by combined amino acid-codon tokens along with organism encoding for host-specific codon usage representation. b. CodonTransformer was trained with ~1 million genes from 164 organisms across all kingdoms of life and fine-tuned with highly expressed genes (top 10% codon usage index, CSI) of 13 organisms and two chloroplast genomes.
- The genetic code's degeneracy allows for multiple DNA sequences to encode the same protein, but not all codons are equal in the eyes of the host organism. Codon usage bias can significantly impact the efficiency of heterologous protein production due to differences in tRNA abundance, protein folding regulations, and evolutionary constraints.
+ CodonTransformer uses the Transformer architecture and a novel training strategy named STREAM (Shared Token
+ Representation and Encoding with Aligned Multi-masking) to learn and replicate the intricate codon usage
+ patterns across a diverse array of organisms. By doing so, it provides a powerful tool for optimizing DNA
+ sequences for expression in various host species.
- CodonTransformer utilizes the BigBird Transformer architecture, an advanced neural network model capable of efficiently processing long sequences. A novel sequence representation strategy, Shared Token Representation and Encoding with Aligned Multi-masking (STREAM), combines organism encoding with tokenized amino acid-codon pairs to achieve context-awareness.
+
+
+
+ CodonTransformer addresses the challenge of codon optimization by translating protein sequences into optimized
+ codon sequences
+ using the encoder-only BigBird Transformer architecture. We frame this task as a Masked Language Modeling
+ (MLM) problem,
+ where the model predicts codons by unmasking tokens from [aminoacid_UNK] to [aminoacid_codon].
+ Our innovative STREAM training strategy allows the model to learn codon usage patterns by unmasking multiple
+ mask tokens
+ while organism-specific embeddings are added to the sequence to contextualize predictions.
+
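As a concrete illustration of this unmasking scheme, the sketch below assembles a STREAM-style masked input. The token strings and organism-ID handling are simplified assumptions for illustration, not the released tokenizer's exact vocabulary.

# Illustrative sketch only: token names mimic the [aminoacid_codon] scheme
# described above; the real vocabulary is defined by the released tokenizer.
def build_mlm_input(protein: str, organism_id: int) -> list[str]:
    # Each residue starts as an "unknown codon" token for its amino acid;
    # the model unmasks every [aminoacid_UNK] token to a concrete
    # [aminoacid_codon] token, e.g. M_UNK -> M_ATG.
    tokens = [f"{aa}_UNK" for aa in protein]
    # An organism identifier accompanies the sequence so predictions are
    # conditioned on host-specific codon usage.
    return [f"ORG_{organism_id}"] + tokens

print(build_mlm_input("MALW", organism_id=83333))  # 83333: E. coli K-12 taxonomy ID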
+ The training process involves two stages: pretraining on over one million DNA-protein pairs from 164 diverse
+ organisms
+ to capture universal codon usage patterns, followed by fine-tuning on a curated subset of highly optimized
+ genes
+ specific to target organisms. This dual training strategy enables CodonTransformer to generate DNA sequences
+ with
+ natural-like codon distributions tailored to each host, effectively optimizing gene expression across multiple
+ species.
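The two-stage recipe can be pictured with a standard masked-LM training loop. In this sketch the tiny model configuration and random placeholder batches merely stand in for the paper's ~1 million-pair pretraining set and the top-10% CSI fine-tuning subset; none of the hyperparameters are the paper's.

# Hedged sketch of the two-stage training idea using plain PyTorch.
import torch
from transformers import BigBirdConfig, BigBirdForMaskedLM

config = BigBirdConfig(vocab_size=512, hidden_size=64, intermediate_size=128,
                       num_hidden_layers=2, num_attention_heads=2,
                       attention_type="original_full")
model = BigBirdForMaskedLM(config)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

def random_batches():  # placeholder for real tokenized amino acid-codon pairs
    while True:
        ids = torch.randint(0, 512, (2, 32))
        yield ids, ids.clone()

def train_stage(batches, steps):
    for _ in range(steps):
        input_ids, labels = next(batches)
        loss = model(input_ids=input_ids, labels=labels).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

train_stage(random_batches(), steps=3)  # stage 1: multispecies pretraining
train_stage(random_batches(), steps=3)  # stage 2: fine-tune on top-10% CSI genes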
+ Key Features
- Through extensive testing and benchmarking, CodonTransformer demonstrates superior performance in generating natural-like codon distributions and minimizing negative cis-regulatory elements compared to existing codon optimization tools.
-
+ CodonTransformer demonstrates superior performance in generating
+ natural-like codon distributions and minimizing negative cis-regulatory elements compared to existing codon
+ optimization tools.
+ CodonTransformer effectively learned codon usage patterns across multiple species, as shown by high codon
+ similarity indices (CSI) when generating DNA sequences for various organisms. The model adapts to the specific
+ codon preferences of each host, ensuring optimal expression.
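CSI itself is defined in the paper's Methods; as rough intuition only, the classical codon adaptation index (geometric mean of each codon's relative adaptiveness in the host) can be sketched as below. This is an analogy, not the paper's exact CSI definition; freq (codon to host usage frequency) and syn (codon to its synonymous codons) are assumed inputs.

# CAI-style index as an intuition for CSI: geometric mean of each codon's
# relative adaptiveness w = freq(codon) / freq(most frequent synonym).
import math

def cai_like_index(codons, freq, syn):
    weights = [freq[c] / max(freq[s] for s in syn[c]) for c in codons]
    return math.exp(sum(math.log(w) for w in weights) / len(weights))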
+ Fig. 2: CodonTransformer learned codon patterns across organisms. Codon similarity index (CSI) for all and the top 10% CSI original genes and generated DNA sequences for all
+ original proteins by CodonTransformer (base and fine-tuned models) for 9 out of 15 genomes used for
+ fine-tuning in this study. See Supplementary Figs. 2-16 for all 15 genomes and additional metrics of GC
+ content and codon distribution frequency (CDF). Source data for Fig. 2 and Supplementary Figs. 2-16 is
+ available at https://zenodo.org/records/13262517.
- CodonTransformer effectively learned codon usage patterns across multiple species, as shown by high codon similarity indices (CSI) when generating DNA sequences for various organisms. The model adapts to the specific codon preferences of each host, ensuring optimal expression.
-
+ The model produces DNA sequences with codon usage patterns closely resembling those found in nature, avoiding
+ clusters of rare or highly frequent codons that can negatively affect protein folding and expression. This is
+ visualized using %MinMax profiles and Dynamic Time Warping (DTW) distance metrics.
+ Fig. 3: CodonTransformer generates natural-like codon distributions. a. Schematic representation of %MinMax and dynamic time warping (DTW). %MinMax represents
+ the proportion of common and rare codons in a sliding window of 18 codons. The DTW algorithm computes the minimal
+ distance between two %MinMax profiles by finding the matching positions (Methods). b. %MinMax
+ profiles for sequences generated by different models for genes yahG (E. coli), SER33 (S. cerevisiae),
+ AT4G12540 (A. thaliana), Csad (M. musculus), ZBTB7C (H. sapiens). c. DTW distances between
+ %MinMax profiles of model-generated sequences and their genomic counterparts for 50 random genes selected
+ among the top 10% codon similarity index (CSI). For each organism, the gene for which the %MinMax profiles are
+ represented above (b) is highlighted in grey. d. Mean and standard deviation of normalized
+ DTW distances by sequence length between sequences for the 5 organisms (for organism-specific DTW distances,
+ see Supplementary Fig. 17). Data underlying this figure is provided in Supplementary Data 1.
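For readers who want the metrics themselves, below is a compact sketch of the standard %MinMax computation (with the 18-codon window from Fig. 3a) and a plain DTW distance. freq and syn are the same assumed host codon-usage inputs as in the CSI sketch above.

# Sketch of the %MinMax / DTW evaluation described above.
import numpy as np

def percent_minmax(codons, freq, syn, window=18):
    # %MinMax over a sliding window of 18 codons (cf. Fig. 3a): positive
    # values mean the window uses more common codons than the synonymous
    # average, negative values mean rarer codons.
    out = []
    for i in range(len(codons) - window + 1):
        win = codons[i:i + window]
        actual = np.mean([freq[c] for c in win])
        fmax = np.mean([max(freq[s] for s in syn[c]) for c in win])
        fmin = np.mean([min(freq[s] for s in syn[c]) for c in win])
        favg = np.mean([np.mean([freq[s] for s in syn[c]]) for c in win])
        if actual >= favg:
            out.append(100 * (actual - favg) / max(fmax - favg, 1e-12))
        else:
            out.append(-100 * (favg - actual) / max(favg - fmin, 1e-12))
    return np.array(out)

def dtw_distance(a, b):
    # Dynamic time warping distance between two %MinMax profiles.
    D = np.full((len(a) + 1, len(b) + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[-1, -1]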
+ When benchmarked against proteins of biotechnological interest, CodonTransformer consistently generates
+ sequences with minimized negative cis-regulatory elements, outperforming other tools. This enhances the
+ potential for successful heterologous expression.
+ Fig. 4: Model benchmark with proteins of biotechnological interest. Mean and standard deviation of Jaccard index (a), sequence similarity (b),
+ and dynamic time warping (c) distance between corresponding sequences for the 52 benchmark
+ proteins across the 5 organisms (for organism-specific results, see Supplementary Figs. 19, 20, and 21,
+ respectively). (d), Number of negative cis-elements in the sequences generated by different
+ tools (✕ shows the mean). Data underlying this figure is provided in Supplementary Data 2.
- The model produces DNA sequences with codon usage patterns closely resembling those found in nature, avoiding clusters of rare or highly frequent codons that can negatively affect protein folding and expression. This is visualized using %MinMax profiles and Dynamic Time Warping (DTW) distance metrics.
+
+ Along with open-sourcing the data and model, we also provide a comprehensive Python package for
+ codon optimization. The CodonTransformer package has 5 modules:
- When benchmarked against proteins of biotechnological interest, CodonTransformer consistently generates sequences with minimized negative cis-regulatory elements, outperforming other tools. This enhances the potential for successful heterologous expression.
- Fig. 4: Model benchmark with proteins of biotechnological interest. Mean and standard deviation of Jaccard index (a), sequence similarity (b), and dynamic time warping (c) distance between corresponding sequences for the 52 benchmark proteins across the 5 organisms (for organism-specific results, see Supplementary Figs. 19, 20, and 21, respectively). (d), Number of negative cis-elements in the sequences generated by different tools (✕ shows the mean). Data underlying this figure is provided in Supplementary Data 2.
+ You can use the inference template for batch inference in Google Colab.
+
+ CodonTransformer represents a significant advancement in codon optimization by leveraging a multispecies, context-aware deep learning approach trained on 164 diverse organisms. Its ability to generate natural-like codon distributions and minimize negative cis-regulatory elements ensures optimized gene expression while preserving protein structure and function.
+
+ The model's flexibility is further enhanced through customizable fine-tuning, allowing users to tailor optimizations to specific gene sets or unique organisms. As an open-access tool, CodonTransformer provides comprehensive resources, including a Python package and an interactive Google Colab notebook, facilitating widespread adoption and adaptation for various biotechnological applications.
+
+ By integrating evolutionary insights and advanced neural network architectures, CodonTransformer sets a new standard for efficient and accurate gene design, with potential extensions to protein engineering and therapeutic development.
+ Model
+
-
- CodonTransformer Outperforms Existing Tools
+
Here are the benchmarking and evaluation results of CodonTransformer:
+ Learning Codon Patterns Across Organisms
+
+
Learning Codon Patterns Across Organisms
-
+
+
+
Generating Natural-Like Codon Distributions
+
+
Benchmarking with Real World Proteins
+
+
Getting Started
-Generating Natural-Like Codon Distributions
-
+
CodonData facilitates processing of genetic information by cleaning and translating DNA and protein sequences, FASTA files, and managing codon frequencies from databases like NCBI and Kazusa.
+ Benchmarking with Real World Proteins
-
+
CodonPrediction enables preprocessing of sequences, prediction of optimized DNA sequences using the CodonTransformer model, and supports various other optimization strategies.
+
CodonEvaluation provides tools to compute evaluation metrics such as Codon Similarity Index (CSI), GC content, and Codon Frequency Distribution, allowing for detailed assessment of optimized sequences.
+
CodonUtils offers essential constants and helper functions for genetic sequence analysis, including amino acid mappings, codon tables, taxonomy ID management, and sequence validation.
+
CodonJupyter enhances Jupyter notebook workflows with interactive widgets for selecting organisms and inputting protein sequences, formatting and displaying optimized DNA sequence outputs.
+ Getting Started
-
- Installation
-
- pip install CodonTransformer
git clone https://github.com/adibvafa/CodonTransformer.git
+
+
Installation
+
+ Install CodonTransformer via pip:
+ pip install CodonTransformer
+ Or clone the repository:
- git clone https://github.com/adibvafa/CodonTransformer.git
cd CodonTransformer
pip install -r requirements.txt
- The package requires python>=3.9. The requirements are available here.
+ The package requires python>=3.9. The requirements are available
+ here.
+
Use Case
+
+ After installing CodonTransformer, you can use:
- import torch
from transformers import AutoTokenizer, BigBirdForMaskedLM
from CodonTransformer.CodonPrediction import predict_dna_sequence
from CodonTransformer.CodonJupyter import format_model_output
@@ -361,7 +454,7 @@ Use Case
)
print(format_model_output(output))

-----------------------------
+
- -----------------------------
| Organism |
-----------------------------
Escherichia coli general
@@ -380,41 +473,60 @@ Use Case
| Predicted DNA |
-----------------------------
ATGGCTTTATGGATGCGTCTGCTGCCGCTGCTGGCGCTGCTGGCGCTGTGGGGCCCGGACCCGGCGGCGGCGTTTGTGAATCAGCACCTGTGCGGCAGCCACCTGGTGGAAGCGCTGTATCTGGTGTGCGGTGAGCGCGGCTTCTTCTACACGCCCAAAACCCGCCGCGAAGCGGAAGATCTGCAGGTGGGCCAGGTGGAGCTGGGCGGCTAA
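Because the hunk above elides the middle of the snippet, here is a hedged reconstruction of the complete example. The Hugging Face checkpoint ID and the exact keyword arguments of predict_dna_sequence are assumptions based on the package README; consult the repository for the authoritative version.

# Hedged reconstruction of the full Use Case (checkpoint ID and keyword
# arguments are assumptions; see the repository README for the exact snippet).
import torch
from transformers import AutoTokenizer, BigBirdForMaskedLM
from CodonTransformer.CodonPrediction import predict_dna_sequence
from CodonTransformer.CodonJupyter import format_model_output

DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")

tokenizer = AutoTokenizer.from_pretrained("adibvafa/CodonTransformer")
model = BigBirdForMaskedLM.from_pretrained("adibvafa/CodonTransformer").to(DEVICE)

protein = "MALWMRLLPLLALLALWGPDPAAA"  # example protein sequence
organism = "Escherichia coli general"

output = predict_dna_sequence(
    protein=protein,
    organism=organism,
    device=DEVICE,
    tokenizer=tokenizer,
    model=model,
)
print(format_model_output(output))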
Key Features
-
-
- Why Choose CodonTransformer?
-
-
Why Choose CodonTransformer?
+
+
+
+ Trained on 164 organisms, CodonTransformer can optimize codon usage
+ for a wide range of host species, including prokaryotes and eukaryotes.
+ The model considers both global codon usage biases and local
+ sequence patterns, ensuring optimal DNA sequence design.
+ Generates sequences with codon distributions similar to
+ natural genes, aiding in proper protein folding and function.
+ The base and fine-tuned models are openly available, along
+ with a comprehensive Python package and a user-friendly Google Colab notebook.
+ Users can fine-tune the model on custom datasets to meet specific
+ requirements or optimize for unique organisms.

Conclusion
+ BibTeX
@@ -430,33 +542,38 @@ BibTeX
journal = {bioRxiv}
}