Replies: 4 comments
-
This dataset would be useful for pretraining rather than instruction-tuning. Pretraining is very expensive and requires huge amounts of compute, which OA cannot currently commit to, so we are exclusively fine-tuning from existing models.
-
The RedPajama project by Together and several other organizations is supposed to have, according to their article, 3 components: the pre-training data, base models trained at scale on that data, and instruction-tuning data and models meant to make the base models usable and safe.
Pretraining data cannot be used directly, for the reason stated by Oliver. I think what we should hope for is high-quality base models (the second of those assets), which could be fine-tuned for Open Assistant to replace LLaMA, or at least to provide another open-source variant besides Pythia. I am skeptical of the third asset because it is unclear what "safe" would imply here (the word appears exactly once in the whole article); aligning with one organization's perception of safety often reduces utility and gives up neutrality. I am a lot more optimistic about the base-model asset, and I truly hope those models will be a viable base for Open Assistant in the near future.
-
Yes, agreed. We can definitely look at their models, and maybe the instruction data once those are available; my comment only applies to the pretraining text corpus.
-
This dataset can indeed also be helpful during SFT, to potentially reduce or delay overfitting; the small sample in particular seems interesting. Its distribution is close to the LLaMA training set, so mixing it in would amount to continuing training with a plain language-modelling objective. For pre-training in general, my impression is that RedPajama should ideally have included more code (e.g. it has less than LLaMA, according to the numbers I saw).
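To make the mixing idea concrete, here is a rough sketch (not from the original thread) of interleaving the small RedPajama sample with instruction data during SFT using Hugging Face `datasets`; the instruction-dataset id, the 90/10 ratio, and the shared `text` column are assumptions for illustration, and recent `datasets` releases may additionally require `trust_remote_code=True` for script-based datasets.

```python
from datasets import load_dataset, interleave_datasets

# Hypothetical SFT mixture: the instruction-dataset id and the 90/10 ratio
# are placeholders chosen purely for illustration.
sft_stream = load_dataset("OpenAssistant/oasst1", split="train", streaming=True)
rp_stream = load_dataset("togethercomputer/RedPajama-Data-1T-Sample",
                         split="train", streaming=True)

# Keep only the raw text so both streams share the same schema.
sft_stream = sft_stream.select_columns(["text"])
rp_stream = rp_stream.select_columns(["text"])

# Interleave so roughly 10% of examples are plain language-modelling text,
# acting as a regularizer against overfitting on the instruction data.
mixed = interleave_datasets([sft_stream, rp_stream],
                            probabilities=[0.9, 0.1], seed=42)

for example in mixed.take(5):
    print(example["text"][:80])
```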
-
RedPajama is an open dataset containing more than 1.2 trillion tokens - https://www.together.xyz/blog/redpajama.
It has a permissive license and lots of data, so it could bring a lot of knowledge into the project.
It would also make it possible to switch from a LLaMA-based model to a custom one or, for example, a Dolly-based one.
Github: https://github.com/togethercomputer/RedPajama-Data
Huggingface: https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T
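For reference, a sketch of how a single subset of the corpus could be streamed with the Hugging Face `datasets` library instead of downloading all 1.2 trillion tokens; the config name "arxiv" is assumed here to be one of the per-source subsets listed on the dataset card, so check the card for the exact names.

```python
from datasets import load_dataset

# Stream one subset of the ~1.2T-token corpus instead of downloading everything.
# "arxiv" is assumed to be one of the per-source configs on the dataset card.
ds = load_dataset("togethercomputer/RedPajama-Data-1T", "arxiv",
                  split="train", streaming=True)

for example in ds.take(2):
    print(example["text"][:200])
```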