Skip to content

Latest commit

 

History

History
144 lines (109 loc) · 7.38 KB

README.md

File metadata and controls

144 lines (109 loc) · 7.38 KB

Representing biological sequences as Markov chains

Documentation Latest Release DOI
CI Workflow License Work in Progress Downloads Aqua QA JET


BioMarkovChains

A Julia package to represent biological sequences as Markov chains

Installation

BioMarkovChains is a   Julia Language   package. To install BioMarkovChains, please open Julia's interactive session (known as REPL) and press ] key in the REPL to use the package mode, then type the following command

pkg> add BioMarkovChains

Creating Markov chain out of DNA sequences

An important step before developing several gene finding algorithms consist of having a Markov chain representation of the DNA. To do so, we implemented the BioMarkovChain method that will capture the initials and transition probabilities of a DNA sequence (LongSequence) and will create a dedicated object storing relevant information of a DNA Markov chain. Here an example:

Let find one ORF in a random LongDNA :

using BioSequences, BioMarkovChains, GeneFinder


# > 180195.SAMN03785337.LFLS01000089 -> finds only 1 gene in Prodigal (from Pyrodigal tests)
seq = dna"AACCAGGGCAATATCAGTACCGCGGGCAATGCAACCCTGACTGCCGGCGGTAACCTGAACAGCACTGGCAATCTGACTGTGGGCGGTGTTACCAACGGCACTGCTACTACTGGCAACATCGCACTGACCGGTAACAATGCGCTGAGCGGTCCGGTCAATCTGAATGCGTCGAATGGCACGGTGACCTTGAACACGACCGGCAATACCACGCTCGGTAACGTGACGGCACAAGGCAATGTGACGACCAATGTGTCCAACGGCAGTCTGACGGTTACCGGCAATACGACAGGTGCCAACACCAACCTCAGTGCCAGCGGCAACCTGACCGTGGGTAACCAGGGCAATATCAGTACCGCAGGCAATGCAACCCTGACGGCCGGCGACAACCTGACGAGCACTGGCAATCTGACTGTGGGCGGCGTCACCAACGGCACGGCCACCACCGGCAACATCGCGCTGACCGGTAACAATGCACTGGCTGGTCCTGTCAATCTGAACGCGCCGAACGGCACCGTGACCCTGAACACAACCGGCAATACCACGCTGGGTAATGTCACCGCACAAGGCAATGTGACGACTAATGTGTCCAACGGCAGCCTGACAGTCGCTGGCAATACCACAGGTGCCAACACCAACCTGAGTGCCAGCGGCAATCTGACCGTGGGCAACCAGGGCAATATCAGTACCGCGGGCAATGCAACCCTGACTGCCGGCGGTAACCTGAGC"

orfseq = findorfs(seq)[3] |> sequence

21nt DNA Sequence:
ATGCGTCGAATGGCACGGTGA

If we translate it, we get a 7aa sequence:

translate(orfseq)

7aa Amino Acid Sequence:
MRRMAR*

Now supposing I do want to see how transitions are occurring in this ORF sequence, the I can use the BioMarkovChain method and tune it to 2nd-order Markov chain:

BioMarkovChain(orfseq, 2)

BioMarkovChain of DNA alphabet and order 1:
  - Transition Probability Matrix -> Matrix{Float64}(4 × 4):
   0.25    0.25    0.0     0.5
   0.25    0.0     0.75    0.0
   0.25    0.25    0.25    0.25
   0.0     0.25    0.75    0.0
  - Initial Probabilities -> Vector{Float64}(4 × 1):
   0.2     0.2     0.4     0.2

But I can also have a BioMarkovChain instance of the Ammino Acid sequence:

BioMarkovChain(translate(orfseq), 2)

BioMarkovChain of AminoAcid alphabet and order 1:
  - Transition Probability Matrix -> Matrix{Float64}(20 × 20):
   0.0    1.0    0.0    0.0    0.0    0.0          0.0    0.0    0.0    0.0    0.0    0.0    0.0
   0.0    0.333  0.0    0.0    0.0    0.0          0.0    0.0    0.0    0.0    0.0    0.0    0.0
   0.0    0.0    0.0    0.0    0.0    0.0          0.0    0.0    0.0    0.0    0.0    0.0    0.0
   0.0    0.0    0.0    0.0    0.0    0.0          0.0    0.0    0.0    0.0    0.0    0.0    0.0
   0.0    0.0    0.0    0.0    0.0    0.0          0.0    0.0    0.0    0.0    0.0    0.0    0.0
   0.0    0.0    0.0    0.0    0.0    0.0          0.0    0.0    0.0    0.0    0.0    0.0    0.0
   0.0    0.0    0.0    0.0    0.0    0.0          0.0    0.0    0.0    0.0    0.0    0.0    0.0
   0.0    0.0    0.0    0.0    0.0    0.0          0.0    0.0    0.0    0.0    0.0    0.0    0.0
   0.0    0.0    0.0    0.0    0.0    0.0          0.0    0.0    0.0    0.0    0.0    0.0    0.0
   0.0    0.0    0.0    0.0    0.0    0.0          0.0    0.0    0.0    0.0    0.0    0.0    0.0
   0.0    0.0    0.0    0.0    0.0    0.0          0.0    0.0    0.0    0.0    0.0    0.0    0.0
   0.0    0.0    0.0    0.0    0.0    0.0          0.0    0.0    0.0    0.0    0.0    0.0    0.0
   0.5    0.5    0.0    0.0    0.0    0.0          0.0    0.0    0.0    0.0    0.0    0.0    0.0
   0.0    0.0    0.0    0.0    0.0    0.0          0.0    0.0    0.0    0.0    0.0    0.0    0.0
   0.0    0.0    0.0    0.0    0.0    0.0          0.0    0.0    0.0    0.0    0.0    0.0    0.0
   0.0    0.0    0.0    0.0    0.0    0.0          0.0    0.0    0.0    0.0    0.0    0.0    0.0
   0.0    0.0    0.0    0.0    0.0    0.0          0.0    0.0    0.0    0.0    0.0    0.0    0.0
   0.0    0.0    0.0    0.0    0.0    0.0          0.0    0.0    0.0    0.0    0.0    0.0    0.0
   0.0    0.0    0.0    0.0    0.0    0.0          0.0    0.0    0.0    0.0    0.0    0.0    0.0
   0.0    0.0    0.0    0.0    0.0    0.0          0.0    0.0    0.0    0.0    0.0    0.0    0.0
  - Initial Probabilities -> Vector{Float64}(20 × 1):
   0.167  0.5    0.0    0.0    0.0    0.0          0.0    0.0    0.0    0.0    0.0    0.0    0.0

This is useful to later create HMMs and calculate sequence probability based on a given model, for instance we now have the E. coli CDS and No-CDS transition models or Markov chain implemented:

ECOLICDS

BioMarkovChain of DNA alphabet and order 1:
  - Transition Probability Matrix -> Matrix{Float64}(4 × 4):
   0.31    0.224   0.199   0.268
   0.251   0.215   0.313   0.221
   0.236   0.308   0.249   0.207
   0.178   0.217   0.338   0.267
  - Initial Probabilities -> Vector{Float64}(4 × 1):
   0.245   0.243   0.273   0.239

What is then the probability of the previous DNA sequence given this model?

markovprobability(orfseq, model=ECOLICDS, logscale=true)

-39.71754773536592

This is off course not very informative, but we can later use different criteria to then classify new ORFs. For a more detailed explanation see the docs