Replies: 2 comments 4 replies
-
To train a chatbot-style model, you need to convert your data into a question-and-answer format. If you really want to continue pretraining on your own data without teaching a question-answering style, prepare a dataset with all your data in a single-column DataFrame. Make sure that the text in each row is not too long. In the experiment setup, remove all additional tokens (e.g. …). Your setup should look like this: [settings screenshot]
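A minimal sketch of preparing such a single-column dataset with pandas. The column name `text`, the chunk size, and the character-based splitting are illustrative choices, not requirements of the tool:

```python
import pandas as pd

def chunk_text(text, max_chars=2000):
    """Split a long document into pieces no longer than max_chars,
    so no single row of the dataset is too long."""
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

# Hypothetical raw domain documents; replace with your own corpus.
documents = [
    "First domain document ..." * 100,
    "Second domain document ..." * 50,
]

# One chunk per row, all in a single column.
rows = [chunk for doc in documents for chunk in chunk_text(doc)]
df = pd.DataFrame({"text": rows})
df.to_csv("pretraining_data.csv", index=False)
```

You would then point the experiment at this single column for both input and target, since continued pretraining is plain causal language modeling on the raw text.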
-
@dcruiz01 I'm curious to hear how well additional pretraining is working for you, and which models you're experimenting with. Are you using LoRA? From #155 (comment):
-
First off, thank you so much for sharing this awesome tool with the open source community!
I would like to run some experiments where I use LLMstudio to provide a pretrained model with additional domain-specific pretraining. In other words, I'm not trying to instruction tune a model just yet. I'm just trying to inject some additional knowledge into a pretrained model via standard causal language modeling.
How would you recommend I set up the dataset? LLMstudio expects two columns (prompt & response), but in this context, would you recommend I split my dataset chunks in half and place one half in each column per sample? Or some other methodology?
Thanks!
@maxjeblick @psinger @pascal-pfeiffer
P.S. I created an issue with the same question yesterday; I'll go ahead and close it, since this is probably a more appropriate place to ask.