Load data on an RETROformer based on Deepmind Retroformer model #48

Open
2 tasks
rstarmer opened this issue Jun 28, 2023 · 3 comments
Labels
P0 Priority 0 - essential

Comments

@rstarmer
Member

rstarmer commented Jun 28, 2023

The RETRO model we are currently investigating: https://arxiv.org/pdf/2112.04426.pdf

An example implementation: https://github.com/lucidrains/RETRO-pytorch

Initial data set: https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T/tree/main

Acceptance criteria:

  • A Python test case that creates the kNN vector database used for the encoded-neighbor search
  • A Python capability that enables inference with a BERT model with the RETRO data applied
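
As a rough illustration of the first criterion, here is a minimal sketch of what such a test could look like. It is not the RETRO-pytorch implementation: the RETRO paper embeds 64-token chunks with a frozen BERT model and uses an approximate nearest-neighbor index, whereas this sketch swaps in a toy bag-of-words embedding and a brute-force cosine search so that it runs with the standard library only. The names `chunk_tokens`, `embed`, and `BruteForceKNN` are hypothetical and exist only for this example.

```python
import math
from collections import Counter


def chunk_tokens(text, chunk_size=4):
    """Split whitespace tokens into fixed-size chunks (toy size; RETRO uses 64 tokens)."""
    tokens = text.lower().split()
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]


def embed(tokens):
    """ASSUMPTION: toy stand-in for a frozen BERT embedding (L2-normalized bag of words)."""
    counts = Counter(tokens)
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {tok: c / norm for tok, c in counts.items()}


def cosine(a, b):
    """Dot product of two sparse, already-normalized vectors."""
    return sum(weight * b.get(tok, 0.0) for tok, weight in a.items())


class BruteForceKNN:
    """Hypothetical in-memory index; a real setup would use an ANN library instead."""

    def __init__(self):
        self.chunks = []
        self.vectors = []

    def add(self, chunks):
        for chunk in chunks:
            self.chunks.append(chunk)
            self.vectors.append(embed(chunk))

    def search(self, query_tokens, k=2):
        """Return the ids of the k chunks most similar to the query."""
        query = embed(query_tokens)
        ranked = sorted(
            range(len(self.vectors)),
            key=lambda i: cosine(query, self.vectors[i]),
            reverse=True,
        )
        return ranked[:k]


def test_knn_index_returns_most_similar_chunk():
    index = BruteForceKNN()
    # Three chunks of four tokens each.
    index.add(chunk_tokens("the cat sat down the dog ran away birds fly very high"))
    neighbor_ids = index.search("a dog ran".split(), k=2)
    assert len(neighbor_ids) == 2
    # The chunk "the dog ran away" (id 1) shares the most tokens with the query.
    assert neighbor_ids[0] == 1
```

The test function follows the pytest naming convention, so it would be collected by `pytest` if saved in a `test_*.py` file. Replacing `embed` with real BERT embeddings and `BruteForceKNN` with the index built by the training code would turn this into a test of the actual pipeline.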
@rstarmer rstarmer added the EXPLORATION An exploration PR contains too much code to ever be mergeable. It is useful to communicate ideas. label Jun 28, 2023
@rstarmer rstarmer changed the title Train an RETROformer based on Deepmind Retroformer model Load data on an RETROformer based on Deepmind Retroformer model Jun 28, 2023
@rstarmer rstarmer added P0 Priority 0 - essential and removed EXPLORATION An exploration PR contains too much code to ever be mergeable. It is useful to communicate ideas. labels Jun 28, 2023
@rstarmer
Member Author

@MostAwesomeDude I added a branch, platform/retro-pytorch, to the repository with a current attempt at this. The train.py is simply the code from the lucidrains/RETRO-pytorch README.

You will still need to load some txt files into the text_folder for the system to ingest.
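
For anyone setting this up, a minimal sketch of populating that folder is below. It assumes the ingest step picks up files matching a `**/*.txt` glob under `./text_folder`, as in the lucidrains README example; the helper name `write_documents` and the `doc_00000.txt` naming scheme are hypothetical.

```python
from pathlib import Path


def write_documents(docs, folder="./text_folder"):
    """Write each document string to its own .txt file and return the files found.

    ASSUMPTION: the training code discovers documents with a '**/*.txt' glob,
    so any file name ending in .txt inside the folder will do.
    """
    root = Path(folder)
    root.mkdir(parents=True, exist_ok=True)
    for i, text in enumerate(docs):
        (root / f"doc_{i:05d}.txt").write_text(text, encoding="utf-8")
    # Return what a '**/*.txt' glob would see, in a stable order.
    return sorted(root.glob("**/*.txt"))
```

Calling `write_documents(["first document", "second document"])` would create `text_folder/doc_00000.txt` and `text_folder/doc_00001.txt`. The document strings could come from any source, for example samples drawn from the RedPajama dataset linked above.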

@sdake
Member

sdake commented Jun 30, 2023

I recommend starting with standard datasets that are highly curated. I am using this dataset:

https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T/tree/main

Cheers,
-steve

@sdake
Member

sdake commented Jun 30, 2023

@rstarmer rather than stretch goals, would you consider adding a separate user story with the applicable priority?

Thank you,
-steve

@rstarmer rstarmer assigned sdake and unassigned MostAwesomeDude and sdake Jul 12, 2023
3 participants