Adding Megatron Tokenization pipeline #304
Open
Hi!
In this PR I include the MegatronDocumentTokenizer pipeline, which stores the tokenized documents in a format compatible with Megatron/NeMo (I'll refer to both projects together, since both produce and require exactly the same files).

For Megatron/NeMo, we store the document tokens the same way DocumentTokenizer does, in a file with a .bin extension. The main difference is that we also need to store a .idx file, which contains some information about the tokenized documents in the .bin file plus some metadata. The metadata can be a bit tricky to understand, but I've included a byte-by-byte explanation of what it contains. Also, I've dropped support for shuffling, and we shouldn't merge the tokenized documents, since merging would require properly handling the .idx files. The main part of the code involved in this tokenization process can be found here (Megatron) & here (NeMo).

I've tested this implementation against both the NeMo and Megatron tokenization scripts (the two are 99% the same, with just some minor differences in parallelization) and we get exactly the same files. I can include some unit tests if necessary, but we would have to port some parts of NeMo/Megatron over here and actively maintain them, although it has to be said that these parts of the code haven't been modified for ages.
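To make the .idx description a bit more concrete, here is a minimal sketch of how such an index could be written and read back. The byte layout is an assumption based on my recollection of Megatron's memory-mapped indexed dataset format (a 9-byte magic header, a version, a dtype code, two counts, then three arrays); the byte-by-byte explanation in the PR code is the authoritative reference. The names `build_idx` and `parse_idx` are hypothetical and are not part of this PR, and the sketch assumes one sequence per document.

```python
import struct

# Assumed layout, recalled from Megatron's indexed dataset index files.
MAGIC = b"MMIDIDX\x00\x00"  # 9-byte magic header
VERSION = 1
# token dtype name -> (assumed dtype code, bytes per token); only a subset is shown
DTYPE_CODE_AND_SIZE = {"int32": (4, 4), "uint16": (8, 2)}


def build_idx(doc_lengths, token_dtype="uint16"):
    """Return the bytes of a .idx file for documents with the given token counts."""
    code, itemsize = DTYPE_CODE_AND_SIZE[token_dtype]
    n = len(doc_lengths)

    # Byte offset at which each document starts inside the .bin file.
    pointers, offset = [], 0
    for length in doc_lengths:
        pointers.append(offset)
        offset += length * itemsize

    # Document boundaries: with one sequence per document this is 0..n.
    doc_idx = list(range(n + 1))

    return b"".join([
        MAGIC,
        struct.pack("<Q", VERSION),                  # 8 bytes: format version
        struct.pack("<B", code),                     # 1 byte: token dtype code
        struct.pack("<Q", n),                        # 8 bytes: number of sequences
        struct.pack("<Q", len(doc_idx)),             # 8 bytes: length of doc_idx
        struct.pack(f"<{n}i", *doc_lengths),         # int32 token count per sequence
        struct.pack(f"<{n}q", *pointers),            # int64 byte offset per sequence
        struct.pack(f"<{len(doc_idx)}q", *doc_idx),  # int64 document boundaries
    ])


def parse_idx(data):
    """Parse bytes produced by build_idx back into a dict."""
    assert data[:9] == MAGIC, "not an index file in the assumed format"
    version, code, n, n_doc_idx = struct.unpack_from("<QBQQ", data, 9)
    pos = 9 + 8 + 1 + 8 + 8  # fixed-size header is 34 bytes
    sizes = list(struct.unpack_from(f"<{n}i", data, pos))
    pos += 4 * n
    pointers = list(struct.unpack_from(f"<{n}q", data, pos))
    pos += 8 * n
    doc_idx = list(struct.unpack_from(f"<{n_doc_idx}q", data, pos))
    return {"version": version, "dtype_code": code, "sizes": sizes,
            "pointers": pointers, "doc_idx": doc_idx}


if __name__ == "__main__":
    raw = build_idx([3, 5, 2])  # three documents of 3, 5 and 2 tokens
    print(parse_idx(raw))
```

Since every pointer depends on the sizes of all preceding documents in the same .bin file, concatenating two tokenized outputs would mean rewriting the pointers and doc_idx arrays as well, which is why merging is left out here.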
Toni