What is SentencePiece?
SentencePiece is an unsupervised text tokenizer and detokenizer mainly for Neural Network-based text generation systems where the vocabulary size is predetermined prior to the neural model training. SentencePiece implements subword units (e.g., byte-pair-encoding (BPE) [Sennrich et al.]) and unigram language model [Kudo.]) with the extension of direct training from raw sentences. SentencePiece allows us to make a purely end-to-end system that does not depend on language-specific pre/postprocessing.
We trained sentencepiece model with bengali wiki data and saved our bengali sentencepiece model.
NB: See bnlp more update version about bengali sentencepiece
- sentencepiece installation
pip install sentencepiece
- preprocess wiki data
We download bengali wiki dump data and extract using bengali_wikiextractor.
We preprocess bengali wiki data into a text file with single sentence per line.
- data format
sentence 1
sentence 2
..........
sentence n
- Sentencepiece Training
import sentencepiece as spm
spm.SentencePieceTrainer.train('--model_prefix=bn_spm --input=data/bn_wiki.txt --vocab_size=50000')
it will save bn_spm.model
and bn_spm.vocab
in your train directory.
- Testing bengali sentencepiece model
import sentencepiece as spm
sp = spm.SentencePieceProcessor()
sp.Load("bn_spm.model")
# output: True
sp.EncodeAsPieces("আমি বাংলায় গান গাই।")
# output: ['▁আমি', '▁বাংলায়', '▁গান', '▁গাই', '।']
sp.EncodeAsIds("আমি বাংলায় গান গাই।")
# output: [914, 1852, 349, 6229, 3]
sp.DecodePieces(['▁আমি', '▁বাংলায়', '▁গান', '▁গাই', '।'])
# output: 'আমি বাংলায় গান গাই।'
sp.NBestEncodeAsPieces("আমি বাংলায় গান গাই।", 5)
"""
output:
[['▁আমি', '▁বাংলায়', '▁গান', '▁গাই', '।'],
['▁আমি', '▁বাংলা', 'য়', '▁গান', '▁গাই', '।'],
['▁আমি', '▁বাংলায়', '▁গান', '▁গা', 'ই', '।'],
['▁আমি', '▁বাংলায়', '▁গান', '▁', 'গাই', '।'],
['▁', 'আমি', '▁বাংলায়', '▁গান', '▁গাই', '।']]
"""
sp.DecodeIds([914, 1852, 349, 6229, 3])
# output: 'আমি বাংলায় গান গাই।'
sp.GetPieceSize()
# output: 50000 as our vocab size is 50000
# same as len(sp)