A Julia package for encoding biological sequences into Voss representations
- Provides the fastest encoding of biological sequences into Voss representations (a.k.a. one-hot vectors) among the approaches benchmarked below.
- Can encode all `BioSequence` types and `String`s with unambiguous nucleotides and amino acids.
- Can handle ambiguous nucleotides and amino acids from a `BioSequence`.
- Provides a simple and intuitive API for encoding biological sequences.
- Includes a dedicated type, `VossEncoder`, that matches the `BioSequence` types.
- Can be used for single-nucleotide encoding, e.g. `vv = vossvector(dna"ACGT", DNA_A)`.
Warning
This package uses internals of `BioSequences` and of the `BitMatrix` type, which might not be stable. Use with caution.
BioVossEncoder is a Julia Language package. To install it, open Julia's interactive session (the REPL), press the `]` key to enter package mode, and type the following command:
pkg> add BioVossEncoder
This package provides a simple and fast way to encode biological sequences into Voss representations. The main struct provided by this package is `VossEncoder`, a wrapper around a `BitMatrix` that stores the encoded biological sequence together with its corresponding alphabet. The following example shows how to encode a DNA sequence into a Voss matrix.
julia> using BioSequences, BioVossEncoder
julia> seq = dna"ACGT"
julia> VossEncoder(seq)
4×4 Voss Matrix of DNAAlphabet{4}():
1 0 0 0
0 1 0 0
0 0 1 0
0 0 0 1
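To make the layout concrete, here is a minimal Base-Julia sketch of what a Voss matrix is: one row per alphabet symbol and one column per sequence position. The helper `voss_sketch` is hypothetical and is not how BioVossEncoder is implemented; it only mirrors the shape of the output shown above.

```julia
# Hypothetical helper (not part of BioVossEncoder): build a Voss matrix
# for a plain string over a given alphabet, using Base Julia only.
function voss_sketch(str::AbstractString, alphabet::AbstractVector{Char})
    chars = collect(str)
    M = falses(length(alphabet), length(chars))  # rows: symbols, columns: positions
    for (j, c) in enumerate(chars)
        i = findfirst(==(c), alphabet)
        # Symbols outside the alphabet leave an all-zero column.
        i === nothing || (M[i, j] = true)
    end
    return M
end

M = voss_sketch("ACGT", ['A', 'C', 'G', 'T'])
```

Each column of `M` contains at most one `true`, which is why the representation is also called one-hot.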
For convenience, the `VossEncoder` struct provides a `bitmatrix` property that returns the `BitMatrix` representation of the sequence.
julia> VossEncoder(seq).bitmatrix
4×4 BitMatrix:
1 0 0 0
0 1 0 0
0 0 1 0
0 0 0 1
Similarly, the `vossmatrix` function, which builds on the `VossEncoder` struct, returns the `BitMatrix` representation of a sequence directly.
julia> vossmatrix(seq)
4×4 BitMatrix:
1 0 0 0
0 1 0 0
0 0 1 0
0 0 0 1
Sometimes it is useful to encode a sequence into a Voss vector, i.e., the bit vector marking the positions of a single symbol of the corresponding molecule alphabet.
This package provides the function `vossvector`, which returns the Voss vector of a sequence given a `BioSequence` and the specific molecule (a `BioSymbol`), which can be of type `DNA` or `AA`.
julia> vossvector(seq, DNA_A)
4-element view(::BitMatrix, 1, :) with eltype Bool:
1
0
0
0
Note that, behind the scenes, the output is a view into the `BitMatrix` representation of the sequence rather than a copy. This is done for performance reasons.
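The practical consequence of returning a view can be sketched with Base Julia alone. The matrix below is a stand-in built by hand, not an object produced by the package: a view shares memory with its parent matrix, whereas indexing with `M[1, :]` allocates a copy.

```julia
# Stand-in 4×4 Voss matrix for "ACGT" (built by hand for illustration).
M = falses(4, 4)
for i in 1:4
    M[i, i] = true
end

row_view = view(M, 1, :)   # no copy: shares memory with M
row_copy = M[1, :]         # independent copy of the first row

M[1, 3] = true             # mutate the parent matrix

row_view[3]                # true: the view sees the change
row_copy[3]                # false: the copy does not
```

If an independent vector is needed, `collect(row_view)` materializes the view into its own array.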
using BioSymbols, BioSequences
function onehot(s::NucSeq)
    M = falses(4, length(s))
    for (i, nt) in enumerate(s)
        bits = compatbits(nt)  # bit mask of the nucleotides compatible with nt
        while !iszero(bits)
            M[trailing_zeros(bits) + 1, i] = true
            bits &= bits - one(bits) # clear lowest bit
        end
    end
    M
end
julia> onehot(dna"TGNTKCTW-T")
4×10 BitMatrix:
0 0 1 0 0 0 0 1 0 0
0 0 1 0 0 1 0 0 0 0
0 1 1 0 1 0 0 0 0 0
1 0 1 1 1 0 1 1 0 1
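The `while` loop in `onehot` walks over the set bits of a mask from lowest to highest. The sketch below isolates that step in Base Julia. It assumes the 4-bit layout A = `0b0001`, C = `0b0010`, G = `0b0100`, T = `0b1000`, which is consistent with the output above (for instance, `K` sets rows 3 and 4); `set_rows` is a hypothetical helper name.

```julia
# Hypothetical helper: return the 1-based row indices of the set bits in a mask.
function set_rows(bits::UInt8)
    rows = Int[]
    while !iszero(bits)
        push!(rows, trailing_zeros(bits) + 1)  # index of the lowest set bit
        bits &= bits - one(bits)               # clear the lowest set bit
    end
    return rows
end

set_rows(0b1100)  # mask for K (G or T under the assumed layout): [3, 4]
set_rows(0b1111)  # mask for N (any nucleotide): [1, 2, 3, 4]
set_rows(0b0000)  # mask for a gap: Int[]
```

The loop runs once per set bit, so unambiguous nucleotides cost a single iteration and gaps cost none.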
julia> function onehot_reinterpretator(seq::BioSequence)
           seqlen = length(seq)
           modvect = Vector{Int8}(undef, seqlen)
           modifier(value) = (value == DNA_G) ? DNA_M : (value == DNA_T) ? DNA_G : value
           reinterpreter = seq -> reinterpret.(Int8, modifier.(seq))[1]
           @inbounds for i in 1:seqlen
               modvect[i] = reinterpreter(seq[i])
           end
           return 1:4 .== permutedims(modvect)
       end
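The `modifier` step is easier to follow with the raw byte values written out. The sketch below assumes that DNA symbols are encoded as A = `0x01`, C = `0x02`, M = `0x03`, G = `0x04`, T = `0x08`; the variable names and the `remap` function are hypothetical stand-ins for the symbols and for `modifier`. Remapping G to the code of M and T to the code of G turns the four unambiguous codes into 1, 2, 3, 4, which can then be compared against `1:4`.

```julia
# Assumed byte codes of the DNA symbols (stand-ins for DNA_A, DNA_C, DNA_M, DNA_G, DNA_T).
code_A, code_C, code_M, code_G, code_T = 0x01, 0x02, 0x03, 0x04, 0x08

# Same remapping as `modifier`, expressed on raw bytes.
remap(v::UInt8) = v == code_G ? code_M : v == code_T ? code_G : v

codes = remap.([code_A, code_C, code_G, code_T])  # 0x01, 0x02, 0x03, 0x04
onehot_rows = 1:4 .== permutedims(codes)          # 4×4 identity pattern
```

Under these assumptions the trick is only safe for unambiguous sequences: an input `M` keeps the code `0x03` and would be indistinguishable from a remapped `G`.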
- SequenceTokenizers.jl: A Julia package for tokenizing biological sequences, providing efficient and flexible tokenization methods for various sequence types. It focuses on the `String` type.
julia> using SequenceTokenizers
julia> function onehot_tokenizer(str::String)
           alphabet = ['A', 'C', 'G', 'T']
           tokenizer = SequenceTokenizer(alphabet, 'N')
           tokens = tokenizer([str])
           return onehot_batch(tokenizer, tokens)
       end # 5×N×1 Array{Float32, 3}
julia> onehot_tokenizer("ACATCAGCATC")
5×11×1 Array{Float32, 3}:
[:, :, 1] =
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0
0.0 1.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 1.0
0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0
julia> using OneHotArrays
onehotbatch(str, ('A', 'C', 'G','T'))
4×1000000 OneHotMatrix(::Vector{UInt32}) with eltype Bool:
⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 1 1 ⋅ ⋅ 1 ⋅ ⋅ 1 ⋅ ⋅ 1 ⋅ ⋅ ⋅ ⋅ ⋅ … ⋅ ⋅ ⋅ 1 1 1 ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 1 ⋅ ⋅ ⋅ 1 ⋅ ⋅ ⋅ ⋅
1 ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 1 ⋅ ⋅ 1 ⋅ ⋅ 1 ⋅ ⋅ ⋅ ⋅ 1 ⋅ 1 ⋅ 1 ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 1 1 1 ⋅ ⋅ ⋅ ⋅ ⋅ 1 1 ⋅ 1
⋅ 1 ⋅ 1 1 ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 1 ⋅ ⋅ 1 ⋅ ⋅ 1 ⋅ 1 ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 1 ⋅ ⋅ 1 1 ⋅ ⋅ ⋅ ⋅ 1 ⋅ 1 ⋅ ⋅ ⋅ ⋅ ⋅
⋅ ⋅ 1 ⋅ ⋅ 1 1 1 1 ⋅ ⋅ ⋅ 1 ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 1 ⋅ ⋅ ⋅ ⋅ 1 ⋅ 1 ⋅ ⋅ ⋅ 1 1 ⋅ 1 1 ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 1 ⋅ ⋅ ⋅ ⋅ 1 ⋅
- Fasta2onehot.jl: A Julia package for converting FASTA sequences into one-hot encoded matrices.
- With `StatsBase.jl`:
julia> using StatsBase
function onehot_indicator(str::String)
    codeunits(str) |> indicatormat
end # returns a Matrix{Bool} with one row per distinct code unit (4×N for ACGT input)
- With `collect`: The output is a `Vector{BitVector}`, which is somewhat less organized than a matrix, but it is still a valid one-hot encoding.
julia> function onehot_collector(str::String)::Vector{BitVector}
           [collect(str) .== x for x ∈ ['A', 'C', 'G', 'T']]
       end # returns 4-element Vector{BitVector}:
- With `permutedims` and `reinterpret`:
julia> function onehot_permutator(seq::BioSequence)
           modifier(value) = (value == DNA_G) ? DNA_M : (value == DNA_T) ? DNA_G : value
           reinterpreter = seq -> reinterpret.(Int8, modifier.(seq))[1]
           return 1:4 .== permutedims(reinterpreter.(seq))
       end
A more efficient version of the previous function, using `codeunits` and `permutedims`:
julia> function onehot_codeunits(str::String)
           #            A     C     G     T
           return UInt8[0x41, 0x43, 0x47, 0x54] .== permutedims(codeunits(str))
       end
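As a small sanity check in Base Julia, the sketch below compares the `codeunits` broadcast with the `collect`-based approach on a short string. It only illustrates that both encode the same information; the variable names are hypothetical. Broadcasting a 4-element column against a 1×N row of bytes produces the 4×N matrix in a single pass, while the `collect` version produces four separate vectors that have to be stacked to obtain the same layout.

```julia
str = "AACG"

# Byte values of the alphabet: 0x41, 0x43, 0x47, 0x54.
alphabet_bytes = collect(codeunits("ACGT"))

# Column of 4 bytes compared against a 1×N row of bytes: a 4×N matrix.
by_codeunits = alphabet_bytes .== permutedims(codeunits(str))

# The collect-based approach yields one BitVector per symbol ...
rows = [collect(str) .== x for x in ['A', 'C', 'G', 'T']]
# ... which can be stacked into the same symbols × positions layout.
by_collect = permutedims(reduce(hcat, rows))
```

Working on code units avoids decoding the string into `Char`s, which is one plausible reason for the lower allocation count reported for `onehot_codeunits`.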
julia> using BenchmarkTools
str = rand(codeunits("ACGT"),10^6) |> String
seq = randdnaseq(10^6)
# BioVossEncoder.jl
@btime vossmatrix($seq); # 32.056 μs (4 allocations: 488.42 KiB)
@btime vossmatrix($str); # 11.565 ms (10 allocations: 488.62 KiB)
# Others
@btime onehot($seq); # 4.408 ms (4 allocations: 488.42 KiB)
@btime onehot_codeunits($str); # 8.124 ms (6 allocations: 488.48 KiB)
@btime onehot_reinterpretator($seq); # 10.140 ms (7 allocations: 1.43 MiB)
@btime onehot_permutator($seq); # 9.670 ms (10 allocations: 2.38 MiB)
@btime onehot_indicator($str); # 17.413 ms (14 allocations: 3.82 MiB)
@btime onehot_collector($str); # 12.659 ms (32 allocations: 15.74 MiB)
@btime onehot_tokenizer($str) # 22.816 ms (19 allocations: 26.70 MiB)
# From the special FluxML ecosystem
@btime onehotbatch($str, ('A', 'C', 'G','T')); # 11.418 ms (3 allocations: 3.81 MiB)
versioninfo()
Julia Version 1.11.1
Commit 8f5b7ca12ad (2024-10-16 10:53 UTC)
Build Info:
Official https://julialang.org/ release
Platform Info:
OS: macOS (x86_64-apple-darwin22.4.0)
CPU: 8 × Intel(R) Core(TM) i5-8257U CPU @ 1.40GHz
WORD_SIZE: 64
LLVM: libLLVM-16.0.6 (ORCJIT, skylake)
Threads: 1 default, 0 interactive, 1 GC (on 8 virtual cores)
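The allocation figures in the benchmark above can be roughly reproduced with back-of-the-envelope arithmetic. This sketch assumes that a `BitMatrix` packs its bits into 64-bit chunks and that a `Matrix{Bool}` uses one byte per element; small per-array overheads are ignored.

```julia
n = 10^6                               # sequence length used in the benchmark

# Bit-packed 4×n matrix: bits rounded up to whole 64-bit chunks, 8 bytes each.
bit_bytes = cld(4n, 64) * 8            # 500_000 bytes
# Dense 4×n Matrix{Bool}: one byte per element.
dense_bytes = 4n * sizeof(Bool)        # 4_000_000 bytes

bit_kib = bit_bytes / 1024             # ≈ 488.28 KiB
dense_mib = dense_bytes / 1024^2       # ≈ 3.81 MiB
```

These values line up with the roughly 488 KiB reported for the bit-packed outputs and the roughly 3.8 MiB reported for the approaches that return one byte or more per element.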