SequenceTokenizers.jl

SequenceTokenizers.jl is a Julia convenience package for tokenizing character sequences simply and efficiently. It wraps functionality from OneHotArrays and handles String sequences and padding automatically. It provides a SequenceTokenizer struct that can:

Convert characters to integer tokens based on a predefined alphabet
Handle unknown characters with a customizable unknown symbol
Tokenize single characters, arrays of characters, and batches of variable-length sequences, padding the shorter sequences automatically
Convert token indices back to characters
Create one-hot encoded representations of tokenized sequences
Convert one-hot encoded representations back to characters
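
The character-to-token mapping and padding listed above can be illustrated with a small, package-independent sketch. It assumes the unknown symbol sits at index 1 and is also used for padding, which is what the usage example further down suggests. The helper names `build_lookup` and `encode_batch` are hypothetical and are not part of SequenceTokenizers.jl.

```julia
# Illustrative sketch only: hypothetical helpers, not the package's source.

# Build the symbol list and a Char => index lookup,
# with the unknown symbol placed at index 1 (an assumption).
function build_lookup(alphabet::Vector{Char}, unksym::Char)
    symbols = vcat(unksym, alphabet)
    lookup = Dict{Char,UInt32}(c => UInt32(i) for (i, c) in enumerate(symbols))
    return symbols, lookup
end

# Encode strings into a (max length × batch size) matrix.
# Unknown characters and unused positions get the unknown symbol's index.
function encode_batch(lookup::Dict{Char,UInt32}, seqs::Vector{String},
                      unkidx::UInt32 = UInt32(1))
    maxlen = maximum(length, seqs)
    out = fill(unkidx, maxlen, length(seqs))
    for (j, seq) in enumerate(seqs)
        for (i, c) in enumerate(seq)
            out[i, j] = get(lookup, c, unkidx)
        end
    end
    return out
end

symbols, lookup = build_lookup(['A', 'C', 'G', 'T'], 'N')
tokens = encode_batch(lookup, ["ACGT", "AX"])  # 'X' is not in the alphabet
decoded = symbols[tokens]                      # map indices back to characters
```

Here `tokens` holds one column per sequence, and indexing `symbols` with it recovers the characters, with 'N' standing in for both the unknown character and the padding.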

📘 Limitations

To keep dependencies minimal, the tokenizer is not a Flux layer, so it cannot be placed inside a gradient block.

Single characters must be Char type

Multiple character sequences must be String type

Mixed-type arrays are not supported: an array must contain only Strings or only Chars to produce correct behaviour

Usage

using SequenceTokenizers

# Create tokenizer
alphabet = ['A', 'C', 'G', 'T']
tokenizer = SequenceTokenizer(alphabet, 'N')  # 'N' is the unknown symbol

# Tokenize sequences (perhaps the most common use case)
tokens = tokenizer(["AGTCAGGACA", "AGCGTGCGGGTAGGCTCGCC"])
# Returns a 20×2 matrix with one column per sequence; index 1 (the unknown symbol 'N') pads the shorter sequence:
# UInt32[2 2; 4 4; 5 3; 3 4; 2 5; 4 4; 4 3; 2 4; 3 4; 2 4; 1 5; 1 2; 1 4; 1 4; 1 3; 1 5; 1 3; 1 4; 1 3; 1 3]

# Create a batch of token indices (the same values as `tokens` above)
batch = UInt32[2 2; 4 4; 5 3; 3 4; 2 5; 4 4; 4 3; 2 4; 3 4; 2 4; 1 5; 1 2; 1 4; 1 4; 1 3; 1 5; 1 3; 1 4; 1 3; 1 3]

# Create one-hot encoded representation
onehot_encoded = onehot_batch(tokenizer, batch)

# Convert one-hot encoded representation back to characters
decoded_batch = onecold_batch(tokenizer, onehot_encoded)
# Returns ['A' 'A'; 'G' 'G'; 'T' 'C'; 'C' 'G'; 'A' 'T'; 'G' 'G'; 'G' 'C'; 'A' 'G'; 'C' 'G'; 'A' 'G'; 'N' 'T'; 'N' 'A'; 'N' 'G'; 'N' 'G'; 'N' 'C'; 'N' 'T'; 'N' 'C'; 'N' 'G'; 'N' 'C'; 'N' 'C']
# The shorter sequence is right-padded with the unknown symbol 'N'.
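
To make the one-hot round trip concrete, here is a minimal sketch in plain Julia, independent of the package and of OneHotArrays. It assumes the vocabulary dimension comes first, giving a (vocabulary size × sequence length × batch size) array. The package's actual layout and element types may differ. The helper names `onehot_sketch` and `onecold_sketch` are hypothetical.

```julia
# Illustrative sketch only: hypothetical helpers, not the package's API.

# One-hot encode a matrix of 1-based token indices into a
# (vocabulary size × sequence length × batch size) Bool array.
function onehot_sketch(tokens::AbstractMatrix{<:Integer}, vocabsize::Integer)
    out = falses(vocabsize, size(tokens)...)
    for idx in CartesianIndices(tokens)
        out[tokens[idx], idx] = true
    end
    return out
end

# Invert the encoding: locate the `true` entry along the first dimension
# and look up the corresponding character.
function onecold_sketch(onehot::AbstractArray{Bool,3}, symbols::Vector{Char})
    indices = dropdims(argmax(onehot; dims = 1); dims = 1)
    return [symbols[I[1]] for I in indices]
end

symbols = ['N', 'A', 'C', 'G', 'T']   # unknown symbol first (an assumption)
batch = UInt32[2 2; 4 3; 1 5]         # two sequences of length 3; 1 is padding
onehot = onehot_sketch(batch, length(symbols))
decoded = onecold_sketch(onehot, symbols)
```

Each position carries exactly one `true` entry, so decoding with `argmax` recovers the original characters, including the 'N' padding.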

This package is useful for natural language processing tasks, sequence modeling, and any application that requires mapping between characters and integer tokens or one-hot encoded representations.
