Adding Megatron Tokenization pipeline #304

Open
wants to merge 2 commits into
base: main

TJ-Solergibert
Contributor

Hi!

In this PR I add the MegatronDocumentTokenizer pipeline, which stores the tokenized documents in a format compatible with Megatron/NeMo (I'll refer to both projects, but both produce and require exactly the same files).

For Megatron/NeMo, we store the document tokens the same way DocumentTokenizer does, in a file with a .bin extension. The main difference is that we also need to write a .idx file, which contains some information about the tokenised documents in the .bin file plus some metadata. The metadata can be a bit tricky to understand, so I've included a byte-by-byte explanation of what it contains. I've also dropped support for shuffling, and we shouldn't merge the tokenised documents, since merging would require properly handling the .idx files too. The main part of the code involved in this tokenisation process can be found here (Megatron) & here (NeMo).
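To make the role of the .idx file more concrete, here is a minimal, standalone sketch of an index writer and reader. It is not the code from this PR. The byte layout (a 9-byte magic string, a version, a dtype code, two counts, then per-document sizes, byte offsets into the .bin file, and document boundaries) reflects my understanding of Megatron's memory-mapped index format and should be treated as an assumption to verify against the Megatron/NeMo sources. The names `write_idx`, `read_idx` and `DTYPE_CODES` are hypothetical.

```python
import io
import struct

# ASSUMPTION: layout modelled on Megatron's memory-mapped index format;
# verify every field against the Megatron/NeMo sources before relying on it.
MAGIC = b"MMIDIDX\x00\x00"   # 9-byte magic string
VERSION = 1
# dtype name -> (assumed dtype code, bytes per token)
DTYPE_CODES = {"int32": (4, 4), "uint16": (8, 2)}
HEADER_FMT = "<QBQQ"         # version, dtype code, sequence count, document count


def write_idx(stream, doc_lengths, token_dtype="uint16"):
    """Write an index describing documents stored back to back in a .bin file."""
    code, itemsize = DTYPE_CODES[token_dtype]
    n = len(doc_lengths)
    # Byte offset of each document inside the .bin file.
    pointers, offset = [], 0
    for length in doc_lengths:
        pointers.append(offset)
        offset += length * itemsize
    # One sequence per document, so the boundaries are simply 0..n.
    doc_idx = list(range(n + 1))

    stream.write(MAGIC)
    stream.write(struct.pack(HEADER_FMT, VERSION, code, n, len(doc_idx)))
    stream.write(struct.pack(f"<{n}i", *doc_lengths))            # int32 token counts
    stream.write(struct.pack(f"<{n}q", *pointers))               # int64 byte offsets
    stream.write(struct.pack(f"<{len(doc_idx)}q", *doc_idx))     # int64 boundaries


def read_idx(data):
    """Parse the bytes produced by write_idx back into a dictionary."""
    if data[:len(MAGIC)] != MAGIC:
        raise ValueError("not an index file in the assumed format")
    pos = len(MAGIC)
    version, code, n_seq, n_doc = struct.unpack_from(HEADER_FMT, data, pos)
    pos += struct.calcsize(HEADER_FMT)
    sizes = list(struct.unpack_from(f"<{n_seq}i", data, pos))
    pos += 4 * n_seq
    pointers = list(struct.unpack_from(f"<{n_seq}q", data, pos))
    pos += 8 * n_seq
    doc_idx = list(struct.unpack_from(f"<{n_doc}q", data, pos))
    return {"version": version, "dtype_code": code, "sizes": sizes,
            "pointers": pointers, "doc_idx": doc_idx}


# Three documents of 3, 5 and 2 tokens, stored as 2-byte tokens.
buffer = io.BytesIO()
write_idx(buffer, [3, 5, 2])
index = read_idx(buffer.getvalue())
print(index["sizes"], index["pointers"], index["doc_idx"])
```

The sketch also shows why merging is not free: concatenating two .bin files would invalidate the byte offsets and document boundaries of the second index, so the .idx files would have to be rewritten rather than appended.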

I've tested this implementation against both the NeMo and Megatron tokenisation scripts (the two are 99% equal, with just some minor differences in parallelisation) and we get exactly the same files. I can include some unit tests if necessary, but we would have to port some parts of NeMo/Megatron over here and actively maintain them, although it has to be said that these parts of the code haven't been modified for ages.

Toni
