SequenceTokenizers.jl is a Julia convenience package that offers a simplified and efficient way to tokenize character sequences, wrapping functionality from OneHotArrays while handling String
sequences and padding automatically. It provides a SequenceTokenizer struct that can:
- Convert characters to integer tokens based on a predefined alphabet
- Handle unknown characters with a customizable unknown symbol
- Tokenize single characters, arrays of characters, and batches of sequences with variable length with automatic padding
- Convert token indices back to characters
- Create one-hot encoded representations of tokenized sequences
- Convert one-hot encoded representations back to characters
📘 Limitations
- It is not a Flux layer to keep dependencies minimal, therefore it cannot be placed inside a gradient block.
- Single characters must be Char type
- Multiple character sequences must be String type
- No mixed type arrays are supported, it's either Strings or Chars that produce correct behaviour
using SequenceTokenizers
# Create tokenizer
alphabet = ['A', 'C', 'G', 'T']
tokenizer = SequenceTokenizer(alphabet, 'N') # 'N' is the unknown symbol
# Tokenize sequences (perhaps most common use case)
tokens = tokenizer(["AGTCAGGACA","AGCGTGCGGGTAGGCTCGCC"]) # Returns UInt32[2 2; 4 4; 5 3; 3 4; 2 5; 4 4; 4 3; 2 4; 3 4; 2 4; 1 5; 1 2; 1 4; 1 4; 1 3; 1 5; 1 3; 1 4; 1 3; 1 3]
# Create a batch of sequences
batch = UInt32[2 2; 4 4; 5 3; 3 4; 2 5; 4 4; 4 3; 2 4; 3 4; 2 4; 1 5; 1 2; 1 4; 1 4; 1 3; 1 5; 1 3; 1 4; 1 3; 1 3]
# Create one-hot encoded representation
onehot_encoded = onehot_batch(tokenizer, batch)
# Convert one-hot encoded representation back to characters
decoded_batch = onecold_batch(tokenizer, onehot_encoded)
# Returns ['A' 'A'; 'G' 'G'; 'T' 'C'; 'C' 'G'; 'A' 'T'; 'G' 'G'; 'G' 'C'; 'A' 'G'; 'C' 'G'; 'A' 'G'; 'N' 'T'; 'N' 'A'; 'N' 'G'; 'N' 'G'; 'N' 'C'; 'N' 'T'; 'N' 'C'; 'N' 'G'; 'N' 'C'; 'N' 'C'] with N characters right padding shorter sequence
This package is useful for natural language processing tasks, sequence modeling, and any application that requires mapping between characters and integer tokens or one-hot encoded representations.