Adding Megatron Tokenization pipeline #304

TJ-Solergibert · 2024-11-14T16:30:44Z

Hi!

In this PR I include the MegatronDocumentTokenizer pipeline, which stores the tokenized documents in a format compatible with Megatron/NeMo (I'll refer to both projects, but both produce and require exactly the same files).

For Megatron/NeMo, we store the document tokens in the same way as DocumentTokenizer, in a file with a .bin extension. The main difference is that we also need to store a .idx file, which contains information about the tokenised documents in the .bin file plus some metadata. The metadata can be a bit tricky to understand, so I've included a byte-by-byte explanation of what it contains. Also, I've dropped support for shuffling, and we shouldn't merge the tokenised documents, since merging would require properly handling the .idx files. The main part of the code involved in this tokenisation process can be found here (Megatron) & here (NeMo).
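To make the role of the .idx file more concrete, here is a minimal sketch of how such an index could be assembled next to the .bin file. The layout below (magic header, version, dtype code, counts, then three arrays) is an assumption based on my reading of Megatron-LM's indexed dataset writer, and the names `build_index_bytes`, `sequence_pointers` and `DTYPE_CODE_UINT16` are hypothetical. It is not the code in this PR, so please check it against the linked Megatron/NeMo sources.

```python
import struct

# Assumed constants, based on my reading of Megatron-LM's indexed dataset writer.
INDEX_HEADER = b"MMIDIDX\x00\x00"  # 9-byte magic string
INDEX_VERSION = 1
DTYPE_CODE_UINT16 = 8  # assumed numeric code for uint16 tokens


def sequence_pointers(lengths, token_size):
    """Byte offset in the .bin file at which each document starts."""
    pointers, offset = [], 0
    for length in lengths:
        pointers.append(offset)
        offset += length * token_size
    return pointers


def build_index_bytes(lengths, token_size=2, dtype_code=DTYPE_CODE_UINT16):
    """Serialise an index for documents with the given token counts."""
    count = len(lengths)
    pointers = sequence_pointers(lengths, token_size)
    # One sequence per document, so the document boundaries are simply 0..count.
    doc_indices = list(range(count + 1))
    parts = [
        INDEX_HEADER,
        struct.pack("<Q", INDEX_VERSION),           # format version
        struct.pack("<B", dtype_code),              # token dtype code
        struct.pack("<Q", count),                   # number of sequences
        struct.pack("<Q", len(doc_indices)),        # number of document boundaries
        struct.pack(f"<{count}i", *lengths),        # tokens per sequence (int32)
        struct.pack(f"<{count}q", *pointers),       # byte offsets into .bin (int64)
        struct.pack(f"<{len(doc_indices)}q", *doc_indices),  # boundaries (int64)
    ]
    return b"".join(parts)
```

For example, `build_index_bytes([3, 5])` describes two documents of 3 and 5 tokens whose tokens start at byte offsets 0 and 6 of the .bin file, given 2-byte tokens.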

I've tested this implementation against both the NeMo and Megatron tokenisation scripts (the two are 99% equal, with only minor differences in parallelisation) and we get exactly the same files. I can include some unit tests if necessary, but we would have to port some parts of NeMo/Megatron over here and actively maintain them, although it has to be said that these parts of the code haven't been modified for ages.
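The "exactly the same files" check can be expressed without porting any NeMo/Megatron code, provided reference outputs are generated once and kept as fixtures. A minimal sketch, assuming both tools write `<prefix>.bin` and `<prefix>.idx`; the helper names `file_digest` and `outputs_match` are hypothetical:

```python
import hashlib


def file_digest(path, chunk_size=1 << 20):
    """SHA-256 of a file, read in chunks so large .bin files fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def outputs_match(ours_prefix, reference_prefix):
    """True if both the .bin and the .idx files are byte-for-byte identical."""
    return all(
        file_digest(ours_prefix + ext) == file_digest(reference_prefix + ext)
        for ext in (".bin", ".idx")
    )
```

The trade-off is that the fixtures would need regenerating if the upstream format ever changes, which seems unlikely given how stable that code has been.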

Toni

TJ-Solergibert added 2 commits November 14, 2024 17:11

Adding Megatron Tokenization pipeline

d36bc46

ups

7c60709
