Replies: 4 comments
-
This dataset would be useful for pretraining rather than instruction-tuning. Pretraining is very expensive and requires huge amounts of compute, which OA cannot currently commit to, so we are exclusively fine-tuning from existing models.
-
The RedPajama project by Together and several other organizations is supposed to have, according to their article, 3 components: the pre-training data, base models trained at scale on that data, and instruction-tuning data and models meant to make the base models usable and safe.
Pretraining data cannot be used directly, for the reason stated by Oliver. I think what we should hope for is high-quality base models (the second of those assets), which could be fine-tuned for Open Assistant to replace LLaMA, or at least to provide another open-source variant besides Pythia. I am skeptical of the third asset because it is unclear what "safe" would imply here (the word appears exactly once in the whole article); aligning with one organization's perception of safety often reduces utility and gives up neutrality. I am a lot more optimistic about the base-model asset, and I truly hope those models will be a viable base for Open Assistant in the near future.
-
Yes, agreed. We can definitely look at their models, and maybe the instruction data once those are available; my comment only applies to the pretraining text corpus.
-
This dataset can indeed also be helpful during SFT, to potentially reduce or delay overfitting; the small sample in particular seems interesting. Its distribution is close to the LLaMA training set, so mixing it in would amount to continuing training with a plain language-modelling objective. For pre-training in general, my impression is that RedPajama should ideally have included more code (e.g. it has less than LLaMA, according to the numbers I saw).
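To make the mixing idea concrete, here is a rough sketch (not from the original thread) of interleaving the small RedPajama sample with instruction data during SFT using Hugging Face `datasets`; the instruction-dataset id, the 90/10 ratio, and the shared `text` column are assumptions for illustration, and recent `datasets` releases may additionally require `trust_remote_code=True` for script-based datasets.

```python
from datasets import load_dataset, interleave_datasets

# Hypothetical SFT mixture: the instruction-dataset id and the 90/10 ratio
# are placeholders chosen purely for illustration.
sft_stream = load_dataset("OpenAssistant/oasst1", split="train", streaming=True)
rp_stream = load_dataset("togethercomputer/RedPajama-Data-1T-Sample",
                         split="train", streaming=True)

# Keep only the raw text so both streams share the same schema.
sft_stream = sft_stream.select_columns(["text"])
rp_stream = rp_stream.select_columns(["text"])

# Interleave so roughly 10% of examples are plain language-modelling text,
# acting as a regularizer against overfitting on the instruction data.
mixed = interleave_datasets([sft_stream, rp_stream],
                            probabilities=[0.9, 0.1], seed=42)

for example in mixed.take(5):
    print(example["text"][:80])
```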
-
RedPajama is an open dataset containing more than 1.2 trillion tokens - https://www.together.xyz/blog/redpajama.
It has a permissive license and lots of data, so it could bring a lot of knowledge into the project.
It would also make it possible to switch from a LLaMA-based model to a custom one or, for example, a Dolly-based one.
Github: https://github.com/togethercomputer/RedPajama-Data
Huggingface: https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T
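For reference, a sketch of how a single subset of the corpus could be streamed with the Hugging Face `datasets` library instead of downloading all 1.2 trillion tokens; the config name "arxiv" is assumed here to be one of the per-source subsets listed on the dataset card, so check the card for the exact names.

```python
from datasets import load_dataset

# Stream one subset of the ~1.2T-token corpus instead of downloading everything.
# "arxiv" is assumed to be one of the per-source configs on the dataset card.
ds = load_dataset("togethercomputer/RedPajama-Data-1T", "arxiv",
                  split="train", streaming=True)

for example in ds.take(2):
    print(example["text"][:200])
```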