mashu/SequenceTokenizers.jl

SequenceTokenizers.jl

SequenceTokenizers.jl is a Julia convenience package that offers a simplified and efficient way to tokenize character sequences, wrapping functionality from OneHotArrays while handling String sequences and padding automatically. It provides a SequenceTokenizer struct that can:

  • Convert characters to integer tokens based on a predefined alphabet
  • Handle unknown characters with a customizable unknown symbol
  • Tokenize single characters, arrays of characters, and batches of variable-length sequences, padding shorter sequences automatically
  • Convert token indices back to characters
  • Create one-hot encoded representations of tokenized sequences
  • Convert one-hot encoded representations back to characters
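
To make the mapping concrete, here is a minimal sketch in plain Julia of how characters can be mapped to integer tokens with an unknown-symbol fallback and right padding. It does not use or reproduce the package's internals: `encode_char`, `encode_batch`, and `decode` are hypothetical names, and placing the unknown symbol at index 1 is an assumption based on the output shown in the Usage section.

```julia
# Illustrative sketch only; these helpers are hypothetical and not part of SequenceTokenizers.jl.
alphabet = ['A', 'C', 'G', 'T']
unk = 'N'

# Vocabulary with the unknown symbol first, so it gets index 1 (assumed layout).
vocab = pushfirst!(copy(alphabet), unk)
lookup = Dict(c => UInt32(i) for (i, c) in enumerate(vocab))

# Map one character to its token, falling back to the unknown token.
encode_char(c::Char) = get(lookup, c, lookup[unk])

# Encode a batch of strings into a (max_length × batch_size) matrix,
# right-padding shorter sequences with the unknown token.
function encode_batch(seqs::Vector{String})
    maxlen = maximum(length, seqs)
    out = fill(lookup[unk], maxlen, length(seqs))
    for (j, s) in enumerate(seqs)
        for (i, c) in enumerate(s)
            out[i, j] = encode_char(c)
        end
    end
    return out
end

# Map token indices back to characters.
decode(tokens) = map(t -> vocab[t], tokens)

toks = encode_batch(["AGT", "AC"])  # UInt32[2 2; 4 3; 5 1]
chars = decode(toks)                # ['A' 'A'; 'G' 'C'; 'T' 'N']
```

Each column holds one sequence, which is the layout the Usage example returns for a batch of strings.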

📘 Limitations

  • To keep dependencies minimal, it is not a Flux layer, so it cannot be placed inside a gradient block.
  • Single characters must be Char type
  • Multiple character sequences must be String type
  • Mixed-type arrays are not supported: only arrays made up entirely of Strings or entirely of Chars produce correct behaviour
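
The type restrictions above follow Julia's distinction between `Char` (single quotes) and `String` (double quotes). A small sketch of that distinction, with a hypothetical `is_uniform` helper (not part of the package) that one could use to check input before tokenizing:

```julia
single = 'A'          # Char: single quotes
sequence = "AGT"      # String: double quotes
mixed = ['A', "AGT"]  # mixing the two widens the element type to Any

# Hypothetical pre-check: true only if every element is a Char,
# or every element is a String.
is_uniform(xs) = all(x -> x isa Char, xs) || all(x -> x isa String, xs)

is_uniform(["AGT", "AC"])    # true
is_uniform(collect("AGT"))   # true, collect yields a Vector{Char}
is_uniform(mixed)            # false
```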

Usage

```julia
using SequenceTokenizers

# Create a tokenizer; 'N' is the unknown symbol
alphabet = ['A', 'C', 'G', 'T']
tokenizer = SequenceTokenizer(alphabet, 'N')

# Tokenize a batch of sequences (probably the most common use case).
# Returns a 20×2 UInt32 matrix with one column per sequence; the shorter
# sequence is right-padded with token 1, the unknown symbol:
# UInt32[2 2; 4 4; 5 3; 3 4; 2 5; 4 4; 4 3; 2 4; 3 4; 2 4; 1 5; 1 2; 1 4; 1 4; 1 3; 1 5; 1 3; 1 4; 1 3; 1 3]
tokens = tokenizer(["AGTCAGGACA","AGCGTGCGGGTAGGCTCGCC"])

# The same batch of token indices, written out explicitly
batch = UInt32[2 2; 4 4; 5 3; 3 4; 2 5; 4 4; 4 3; 2 4; 3 4; 2 4; 1 5; 1 2; 1 4; 1 4; 1 3; 1 5; 1 3; 1 4; 1 3; 1 3]

# Create the one-hot encoded representation
onehot_encoded = onehot_batch(tokenizer, batch)

# Convert the one-hot encoded representation back to characters.
# Returns the matrix below; the shorter sequence is right-padded with 'N':
# ['A' 'A'; 'G' 'G'; 'T' 'C'; 'C' 'G'; 'A' 'T'; 'G' 'G'; 'G' 'C'; 'A' 'G'; 'C' 'G'; 'A' 'G'; 'N' 'T'; 'N' 'A'; 'N' 'G'; 'N' 'G'; 'N' 'C'; 'N' 'T'; 'N' 'C'; 'N' 'G'; 'N' 'C'; 'N' 'C']
decoded_batch = onecold_batch(tokenizer, onehot_encoded)
```
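
For readers unfamiliar with one-hot encoding, the round trip above can be sketched in plain Julia. This is an illustration under assumptions, not the package's implementation: `onehot_sketch` and `onecold_sketch` are hypothetical names, and the (vocabulary × length × batch) layout is assumed here; the arrays returned by `onehot_batch` may differ in type and layout.

```julia
# Illustrative sketch only; helper names and array layout are assumptions.
vocab = ['N', 'A', 'C', 'G', 'T']  # unknown symbol at index 1, as in the output above

# One-hot encode a (length × batch) token matrix into a
# (vocab × length × batch) Bool array.
function onehot_sketch(tokens::AbstractMatrix{<:Integer}, nvocab::Integer)
    out = falses(nvocab, size(tokens)...)
    for idx in CartesianIndices(tokens)
        out[tokens[idx], idx] = true  # set the row matching the token index
    end
    return out
end

# Invert the encoding: find the active row in each column and look up its character.
function onecold_sketch(onehot::AbstractArray{Bool,3}, vocab::Vector{Char})
    return [vocab[argmax(view(onehot, :, i, j))]
            for i in axes(onehot, 2), j in axes(onehot, 3)]
end

tokens = UInt32[2 2; 4 3; 5 1]
oh = onehot_sketch(tokens, length(vocab))  # size (5, 3, 2)
decoded = onecold_sketch(oh, vocab)        # ['A' 'A'; 'G' 'C'; 'T' 'N']
```

Exactly one entry along the first dimension is `true` for each position, so decoding recovers the original characters, with padded positions mapping back to 'N'.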

This package is useful for natural language processing tasks, sequence modeling, and any application that requires mapping between characters and integer tokens or one-hot encoded representations.

About

Character level tokenizers for sequence data
