Important hyper parameters for fine tuning Llama 2 #3579

msmmpts · 2023-09-02T06:01:03Z

Hi,

I am exploring hyperparameter tuning of the Llama-2 model through Ludwig.

Since LLMs have a large number of hyperparameters, which are the most important ones to consider while fine-tuning (e.g. learning rate, epochs)?

Can anyone share a list of hyperparameters and their corresponding ranges of values as a starting point, based on your experience of fine-tuning LLMs?

Thanks

justinxzhao (Maintainer) · 2023-09-05T19:25:23Z

Hi,

Great to hear you're exploring hyperparameter tuning for the Llama-2 model using Ludwig! Deep learning is an empirical science, so please take this with a grain of salt. Based on my experience, here are some hyperparameters that you might consider tuning:

Learning Rate (5e-5 to 5e-4): This is one of the most critical hyperparameters. A range of 5e-5 to 5e-4 for Llama-2 is generally a good starting point. Too high a learning rate can make training diverge, while too low a learning rate makes training unnecessarily slow.
```
trainer:
  learning_rate: 0.0002
```
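If you want to sweep over this range, learning rates are usually spaced on a log scale rather than linearly. Here is a minimal Python sketch of that idea; the helper name `log_spaced_learning_rates` is made up for illustration and is not part of Ludwig:

```python
import math


def log_spaced_learning_rates(low: float, high: float, count: int) -> list[float]:
    """Return `count` learning rates evenly spaced on a log scale, both ends included."""
    if count < 2:
        raise ValueError("count must be at least 2")
    log_low, log_high = math.log10(low), math.log10(high)
    step = (log_high - log_low) / (count - 1)
    return [10 ** (log_low + i * step) for i in range(count)]


# Candidate values covering the 5e-5 to 5e-4 range suggested above.
candidates = log_spaced_learning_rates(5e-5, 5e-4, 4)
print([f"{lr:.2e}" for lr in candidates])
```

Each candidate can then be plugged into `trainer.learning_rate` for a separate run.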
Learning Rate Scheduler: Choices like linear decay, cosine decay, or incorporating a warmup fraction can influence how fast your model converges. Different schedules have their pros and cons, so you might have to experiment to see what works best for your specific problem.
```
trainer:
  learning_rate_scheduler:
    type: cosine
    warmup_fraction: 0.03
```
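To get a feel for what a warmup fraction combined with cosine decay does to the learning rate, here is a rough Python sketch of the generic textbook shape (linear warmup, then cosine decay to zero). This is an assumption for illustration only: Ludwig's own `cosine` scheduler may compute something different, and `lr_at_step` is a hypothetical helper, not a Ludwig function.

```python
import math


def lr_at_step(step: int, total_steps: int, base_lr: float = 2e-4,
               warmup_fraction: float = 0.03) -> float:
    """Linear warmup up to base_lr, followed by cosine decay towards zero."""
    warmup_steps = max(1, round(total_steps * warmup_fraction))
    if step < warmup_steps:
        # Ramp up linearly so the first updates are small.
        return base_lr * (step + 1) / warmup_steps
    # Fraction of the post-warmup schedule completed so far, in [0, 1).
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


total = 1000  # hypothetical number of optimizer steps
for s in (0, 29, 30, 500, 999):
    print(s, f"{lr_at_step(s, total):.2e}")
```

With these example numbers, the first 30 steps are warmup, and the rate then falls smoothly for the rest of training.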
Epochs (2-10): Depending on your data size and time constraints, you might opt for anywhere from 2 to 10 epochs. Too few epochs can underfit the model, whereas too many can lead to overfitting, especially if you don't have a large dataset. With a good learning rate scheduler, I tend to be less concerned about overfitting, but it is still something to watch out for.
```
trainer:
  epochs: 10
```
Global Max Sequence Length (512-4096, None): The sequence length can be an essential factor for the performance of your model, especially for tasks that require understanding long contexts. You can either set this to 512 or try using full sequences if computational resources allow it.
```
preprocessing:
  global_max_sequence_length: 512
```
Amount of Training Data (0.1 - 1.0): The amount of training data can substantially affect the model's ability to generalize. `sample_ratio` sets the fraction of the dataset that is used for the run. Make sure the data you train on is balanced and representative for best results.
```
preprocessing:
  sample_ratio: 0.5
```
Quantization (4, 8): If you're concerned about model size and latency, you can consider quantization. Both 4-bit and 8-bit options will reduce the model size, though 4-bit quantization may result in a more substantial loss in performance compared to 8-bit.
```
quantization:
  bits: 4
```
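As a back-of-the-envelope check on what the bit width buys you, the memory taken by the weights alone scales linearly with the number of bits. The Python sketch below assumes a round 7 billion parameters (roughly the smallest Llama-2 size) and ignores activations, optimizer state, adapter weights, and any quantization overhead, so treat the numbers as a lower bound rather than a measurement:

```python
def weight_memory_gb(num_parameters: float, bits: int) -> float:
    """Approximate memory for the model weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_parameters * bits / 8 / 1e9


NUM_PARAMETERS = 7e9  # assumed round figure, not an exact parameter count

for bits in (16, 8, 4):
    print(f"{bits}-bit: ~{weight_memory_gb(NUM_PARAMETERS, bits):.1f} GB")
```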
Gradient Accumulation Steps (16-128): This parameter allows you to effectively increase the batch size by accumulating gradients over multiple small batches before performing an optimizer step. This is especially useful if you're constrained by memory resources, or if you want to make use of larger batch sizes for improved model generalization.
```
trainer:
  gradient_accumulation_steps: 16
```
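The arithmetic behind gradient accumulation is simple enough to sketch in Python. The batch size, dataset size, and helper names below are hypothetical and chosen only to illustrate how the effective batch size and the number of optimizer steps per epoch relate:

```python
import math


def effective_batch_size(batch_size: int, gradient_accumulation_steps: int,
                         num_devices: int = 1) -> int:
    """Number of examples that contribute to each optimizer step."""
    return batch_size * gradient_accumulation_steps * num_devices


def optimizer_steps_per_epoch(num_examples: int, batch_size: int,
                              gradient_accumulation_steps: int,
                              num_devices: int = 1) -> int:
    """Optimizer updates needed to pass over the dataset once."""
    per_step = effective_batch_size(batch_size, gradient_accumulation_steps, num_devices)
    return math.ceil(num_examples / per_step)


# Hypothetical setup: per-device batch size 1, accumulating over 16 micro-batches.
print(effective_batch_size(1, 16))               # examples per optimizer step
print(optimizer_steps_per_epoch(10_000, 1, 16))  # optimizer steps per epoch
```

Note that raising the accumulation steps means fewer optimizer updates per epoch, so the learning rate and epoch count may need revisiting together.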

Others, feel free to chime in, and if you give some of these suggestions a try, let me know whether they line up with your experience!
