Replies: 2 comments 4 replies
-
To train a chatbot-style model, you need to convert your data into a question-and-answer format. If you really want to continue pretraining on your own data without teaching a question-answering style, prepare a dataset with all your data in a single-column DataFrame. Make sure that the text in each row is not too long. In the experiment setup, remove all additional tokens (e.g. …). Your setup should look like this: [settings screenshot]
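A minimal sketch of preparing such a single-column dataset with pandas. The column name `text`, the chunk size, and the character-based splitting are illustrative choices, not requirements of the tool:

```python
import pandas as pd

def chunk_text(text, max_chars=2000):
    """Split a long document into pieces no longer than max_chars,
    so no single row of the dataset is too long."""
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

# Hypothetical raw domain documents; replace with your own corpus.
documents = [
    "First domain document ..." * 100,
    "Second domain document ..." * 50,
]

# One chunk per row, all in a single column.
rows = [chunk for doc in documents for chunk in chunk_text(doc)]
df = pd.DataFrame({"text": rows})
df.to_csv("pretraining_data.csv", index=False)
```

You would then point the experiment at this single column for both input and target, since continued pretraining is plain causal language modeling on the raw text.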
-
@dcruiz01 I'm curious to hear how well additional pretraining is working for you, and which models you're experimenting with. Are you using LoRA? From #155 (comment):
-
First off, thank you so much for sharing this awesome tool with the open source community!
I would like to run some experiments where I use LLMstudio to provide a pretrained model with additional domain-specific pretraining. In other words, I'm not trying to instruction tune a model just yet. I'm just trying to inject some additional knowledge into a pretrained model via standard causal language modeling.
How would you recommend I set up the dataset? LLMstudio expects two columns (prompt & response), but in this context, would you recommend I split my dataset chunks in half and place one half in each column per sample? Or some other methodology?
Thanks!
@maxjeblick @psinger @pascal-pfeiffer
P.S. I created an issue with the same question yesterday; I'll go ahead and close it, since this is probably a more appropriate place to ask.