TRL - Transformer Reinforcement Learning

Full stack library to post-train large language models.

What is it?

TRL is a library that post-trains LLMs and diffusion models using methods such as Supervised Fine-Tuning (SFT), Proximal Policy Optimization (PPO), and Direct Preference Optimization (DPO).

The library is built on top of 🤗 Transformers and is compatible with any model architecture available there.

Highlights

Efficient and scalable:
- 🤗 Accelerate is the backbone of TRL and lets model training scale from a single GPU to a large multi-node cluster with methods such as DDP and DeepSpeed.
- PEFT is fully integrated and makes it possible to train even the largest models on modest hardware using quantization and methods such as LoRA or QLoRA.
- Unsloth is also integrated and can significantly speed up training with dedicated kernels.
CLI: With the CLI, you can fine-tune and chat with LLMs without writing any code, using a single command and a flexible config system.
Trainers: The trainer classes, such as SFTTrainer, DPOTrainer, RewardTrainer, PPOTrainer, and ORPOTrainer, are an abstraction that makes it easy to apply many fine-tuning methods.
AutoModels: The AutoModelForCausalLMWithValueHead and AutoModelForSeq2SeqLMWithValueHead classes add a value head to the model, which allows it to be trained with RL algorithms such as PPO.
Examples: Fine-tune Llama for chat applications, apply full RLHF using adapters, and more by following the examples.

Installation

Python package

Install the library with pip:

pip install trl

From source

If you want to use the latest features before an official release, you can install TRL from source:

pip install git+https://github.com/huggingface/trl.git

Repository

If you want to use the examples you can clone the repository with the following command:

git clone https://github.com/huggingface/trl.git

Command Line Interface (CLI)

You can use the TRL Command Line Interface (CLI) to quickly get started with Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO), or to vibe check your model with the chat CLI:

SFT:

trl sft --model_name_or_path Qwen/Qwen2.5-0.5B --dataset_name trl-lib/Capybara --output_dir Qwen2.5-0.5B-SFT

DPO:

trl dpo --model_name_or_path Qwen/Qwen2.5-0.5B-Instruct --dataset_name argilla/Capybara-Preferences --output_dir Qwen2.5-0.5B-DPO

Chat:

trl chat --model_name_or_path Qwen/Qwen2.5-0.5B-Instruct

Read more about the CLI in the relevant documentation section, or use --help for more details.

How to use

For more flexibility and control over training, TRL provides dedicated trainer classes to post-train language models or PEFT adapters on a custom dataset. Each trainer in TRL is a light wrapper around the 🤗 Transformers trainer and natively supports distributed training methods like DDP, DeepSpeed ZeRO, and FSDP.

`SFTTrainer`

Here is a basic example of how to use the SFTTrainer:

from trl import SFTConfig, SFTTrainer
from datasets import load_dataset

dataset = load_dataset("trl-lib/Capybara", split="train")

training_args = SFTConfig(output_dir="Qwen2.5-0.5B-SFT")
trainer = SFTTrainer(
    args=training_args,
    model="Qwen/Qwen2.5-0.5B",
    train_dataset=dataset,
)
trainer.train()
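
To make the objective behind supervised fine-tuning concrete, here is a minimal pure-Python sketch of the usual token-level loss: the mean negative log-likelihood of the reference tokens. It is an illustration only, not TRL's implementation; the function name `sft_loss`, the `loss_mask` argument, and the example log-probabilities are hypothetical.

```python
import math


def sft_loss(token_log_probs, loss_mask=None):
    """Mean negative log-likelihood over the tokens selected by loss_mask.

    token_log_probs: log-probability the model assigns to each reference token.
    loss_mask: optional list of 0/1 flags; 0 excludes a token (e.g. prompt tokens).
    """
    if loss_mask is None:
        loss_mask = [1] * len(token_log_probs)
    kept = [lp for lp, m in zip(token_log_probs, loss_mask) if m]
    if not kept:
        raise ValueError("loss_mask selects no tokens")
    return -sum(kept) / len(kept)


# Hypothetical per-token log-probabilities for a three-token completion.
log_probs = [math.log(0.5), math.log(0.25), math.log(0.8)]

print(sft_loss(log_probs))                       # loss over all tokens
print(sft_loss(log_probs, loss_mask=[0, 1, 1]))  # loss over the last two tokens only
```

The mask is one common way to restrict the loss to completion tokens; whether and how a given trainer masks prompt tokens depends on its configuration.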

`RewardTrainer`

Here is a basic example of how to use the RewardTrainer:

from trl import RewardConfig, RewardTrainer
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
model = AutoModelForSequenceClassification.from_pretrained(
    "Qwen/Qwen2.5-0.5B-Instruct", num_labels=1
)
model.config.pad_token_id = tokenizer.pad_token_id

dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

training_args = RewardConfig(output_dir="Qwen2.5-0.5B-Reward", per_device_train_batch_size=2)
trainer = RewardTrainer(
    args=training_args,
    model=model,
    processing_class=tokenizer,
    train_dataset=dataset,
)
trainer.train()
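
A reward model trained on preference pairs is typically fit with a pairwise (Bradley-Terry) loss that pushes the score of the chosen response above the score of the rejected one. The sketch below shows that loss in pure Python for a single pair. It is a simplified illustration under that assumption, not the trainer's actual code; the function names and the scalar rewards are hypothetical.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def pairwise_reward_loss(chosen_reward, rejected_reward):
    """-log sigmoid(r_chosen - r_rejected) for one preference pair."""
    return -log_sigmoid(chosen_reward - rejected_reward)


# Hypothetical scalar scores produced by a reward model (num_labels=1).
print(pairwise_reward_loss(2.0, 0.0))  # small loss: chosen is ranked higher
print(pairwise_reward_loss(0.0, 2.0))  # large loss: ranking is inverted
```

The loss equals log(2) when both responses get the same score and shrinks toward zero as the margin in favor of the chosen response grows.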

`RLOOTrainer`

RLOOTrainer implements a REINFORCE-style optimization for RLHF that is more performant and memory-efficient than PPO. Here is a basic example of how to use the RLOOTrainer:

from trl import RLOOConfig, RLOOTrainer, apply_chat_template
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoModelForSequenceClassification,
    AutoTokenizer,
)

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
reward_model = AutoModelForSequenceClassification.from_pretrained(
    "Qwen/Qwen2.5-0.5B-Instruct", num_labels=1
)
ref_policy = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
policy = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")

dataset = load_dataset("trl-lib/ultrafeedback-prompt")
dataset = dataset.map(apply_chat_template, fn_kwargs={"tokenizer": tokenizer})
dataset = dataset.map(lambda x: tokenizer(x["prompt"]), remove_columns="prompt")

training_args = RLOOConfig(output_dir="Qwen2.5-0.5B-RL")
trainer = RLOOTrainer(
    config=training_args,
    processing_class=tokenizer,
    policy=policy,
    ref_policy=ref_policy,
    reward_model=reward_model,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
)
trainer.train()
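
The "leave-one-out" part of RLOO (REINFORCE Leave-One-Out) refers to the baseline: when several completions are sampled for the same prompt, each completion's reward is compared against the mean reward of the other completions. The following pure-Python sketch shows only that advantage computation; it is not TRL's implementation, and the function name and reward values are hypothetical.

```python
def rloo_advantages(rewards):
    """Leave-one-out advantages for k completions of the same prompt.

    advantage_i = reward_i - mean(rewards of the other k - 1 completions)
    """
    k = len(rewards)
    if k < 2:
        raise ValueError("RLOO needs at least two completions per prompt")
    total = sum(rewards)
    return [r - (total - r) / (k - 1) for r in rewards]


# Hypothetical rewards for three completions sampled from one prompt.
print(rloo_advantages([1.0, 2.0, 6.0]))  # [-3.0, -1.5, 4.5]
```

Because the baseline is built from sampled rewards, no separate value model is needed, which is one reason this style of method can use less memory than PPO.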

`DPOTrainer`

DPOTrainer implements the popular Direct Preference Optimization (DPO) algorithm that was used to post-train Llama 3 and many other models. Here is a basic example of how to use the DPOTrainer:

from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")
training_args = DPOConfig(output_dir="Qwen2.5-0.5B-DPO")
trainer = DPOTrainer(model=model, args=training_args, train_dataset=dataset, processing_class=tokenizer)
trainer.train()
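
For intuition about what DPO optimizes, here is a minimal pure-Python sketch of the standard sigmoid DPO loss for a single preference pair, computed from sequence log-probabilities under the policy and a frozen reference model. It is a simplified illustration, not the trainer's code; the function names and the example numbers are hypothetical, and `beta=0.1` is an assumed value chosen for the example.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Sigmoid DPO loss for one (chosen, rejected) pair."""
    # How much more (or less) likely each response is under the policy than the reference.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    return -log_sigmoid(beta * (chosen_logratio - rejected_logratio))


# Hypothetical sequence log-probabilities.
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))  # policy equals reference: log(2)
print(dpo_loss(-8.0, -14.0, -10.0, -12.0))   # policy favors chosen more: lower loss
```

The loss starts at log(2) when the policy matches the reference and decreases as the policy raises the chosen response's likelihood relative to the rejected one.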

Development

If you want to contribute to trl or customize it to your needs, read the contribution guide and make sure you do a dev install:

git clone https://github.com/huggingface/trl.git
cd trl/
make dev

Citation

@misc{vonwerra2022trl,
  author = {Leandro von Werra and Younes Belkada and Lewis Tunstall and Edward Beeching and Tristan Thrush and Nathan Lambert and Shengyi Huang and Kashif Rasul and Quentin Gallouédec},
  title = {TRL: Transformer Reinforcement Learning},
  year = {2020},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/huggingface/trl}}
}
