Train and fine-tune transformer language models using TRL (Transformers Reinforcement Learning). Supports SFT, DPO, GRPO, KTO, RLOO and Reward Model training via CLI commands.
---
name: trl-training
description: Train and fine-tune transformer language models using TRL (Transformers Reinforcement Learning). Supports SFT, DPO, GRPO, KTO, RLOO and Reward Model training via CLI commands.
license: Apache-2.0
metadata:
version: "1.0.0"
author: huggingface
commands:
- trl sft
- trl dpo
- trl grpo
- trl kto
- trl rloo
- trl reward
categories:
- machine-learning
- llm-training
- reinforcement-learning
tags:
- rlhf
- supervised-fine-tuning
- dpo
- grpo
- huggingface
- transformers
documentation: https://huggingface.co/docs/trl/en/clis
---
# TRL Training Skill
You are an expert at using the TRL (Transformers Reinforcement Learning) library to train and fine-tune large language models.
## Overview
TRL provides CLI commands for post-training foundation models using state-of-the-art techniques:
- **SFT** (Supervised Fine-Tuning): Fine-tune models on instruction-following or conversational datasets
- **DPO** (Direct Preference Optimization): Align models using preference data
- **GRPO** (Group Relative Policy Optimization): Train models by ranking multiple sampled outputs relative to each other and optimizing based on their comparative rewards.
- **RLOO** (Reinforce Leave One Out): Online RL training with generation-based rewards
- **Reward Model Training**: Train reward models for RLHF
TRL is built on top of Hugging Face Transformers and Accelerate, providing seamless integration with the Hugging Face ecosystem.
## Core Commands
### trl sft - Supervised Fine-Tuning
Fine-tune language models on instruction-following or conversational datasets.
**Full training:**
```bash
trl sft \
--model_name_or_path Qwen/Qwen2-0.5B \
--dataset_name trl-lib/Capybara \
--learning_rate 2.0e-5 \
--num_train_epochs 1 \
--packing \
--per_device_train_batch_size 2 \
--gradient_accumulation_steps 8 \
--eos_token '<|im_end|>' \
--eval_strategy steps \
--eval_steps 100 \
--output_dir Qwen2-0.5B-SFT \
--push_to_hub
```
**Train with LoRA adapters:**
```bash
trl sft \
--model_name_or_path Qwen/Qwen2-0.5B \
--dataset_name trl-lib/Capybara \
--learning_rate 2.0e-4 \
--num_train_epochs 1 \
--packing \
--per_device_train_batch_size 2 \
--gradient_accumulation_steps 8 \
--eos_token '<|im_end|>' \
--eval_strategy steps \
--eval_steps 100 \
--use_peft \
--lora_r 32 \
--lora_alpha 16 \
--output_dir Qwen2-0.5B-SFT \
--push_to_hub
```
### trl dpo - Direct Preference Optimization
Align models using preference data (chosen/rejected pairs).
**Full training:**
```bash
trl dpo \
--dataset_name trl-lib/ultrafeedback_binarized \
--model_name_or_path Qwen/Qwen2-0.5B-Instruct \
--learning_rate 5.0e-7 \
--num_train_epochs 1 \
--per_device_train_batch_size 2 \
--max_steps 1000 \
--gradient_accumulation_steps 8 \
--eval_strategy steps \
--eval_steps 50 \
--output_dir Qwen2-0.5B-DPO \
--no_remove_unused_columns
```
**Train with LoRA adapters:**
```bash
trl dpo \
--dataset_name trl-lib/ultrafeedback_binarized \
--model_name_or_path Qwen/Qwen2-0.5B-Instruct \
--learning_rate 5.0e-6 \
--num_train_epochs 1 \
--per_device_train_batch_size 2 \
--max_steps 1000 \
--gradient_accumulation_steps 8 \
--eval_strategy steps \
--eval_steps 50 \
--output_dir Qwen2-0.5B-DPO \
--no_remove_unused_columns \
--use_peft \
--lora_r 32 \
--lora_alpha 16
```
### trl grpo - Group Relative Policy Optimization
Train models using reward functions or LLM-as-a-judge for evaluating generations and providing rewards.
**Basic usage:**
```bash
trl grpo \
--model_name_or_path Qwen/Qwen2.5-0.5B \
--dataset_name trl-lib/gsm8k \
--reward_funcs accuracy_reward \
--output_dir Qwen2-0.5B-GRPO \
--push_to_hub
```
### trl rloo - Reinforce Leave One Out
Online RL training where the model generates text and receives rewards based on custom criteria.
**Basic usage:**
```bash
trl rloo \
--model_name_or_path Qwen/Qwen2.5-0.5B \
--dataset_name trl-lib/tldr \
--reward_model_name_or_path sentiment-analysis:nlptown/bert-base-multilingual-uncased-sentiment \
--output_dir Qwen2-0.5B-RLOO \
--push_to_hub
```
### trl reward - Reward Model Training
Train a reward model to score text quality for RLHF.
**Full training:**
```bash
trl reward \
--model_name_or_path Qwen/Qwen2-0.5B-Instruct \
--dataset_name trl-lib/ultrafeedback_binarized \
--output_dir Qwen2-0.5B-Reward \
--per_device_train_batch_size 8 \
--num_train_epochs 1 \
--learning_rate 1.0e-5 \
--eval_strategy steps \
--eval_steps 50 \
--max_length 2048
```
**Train with LoRA adapters:**
```bash
trl reward \
--model_name_or_path Qwen/Qwen2-0.5B-Instruct \
--dataset_name trl-lib/ultrafeedback_binarized \
--output_dir Qwen2-0.5B-Reward-LoRA \
--per_device_train_batch_size 8 \
--num_train_epochs 1 \
--learning_rate 1.0e-4 \
--eval_strategy steps \
--eval_steps 50 \
--max_length 2048 \
--use_peft \
--lora_task_type SEQ_CLS \
--lora_r 32 \
--lora_alpha 16
```
## Configuration Files
TRL supports YAML configuration files for reproducible training. All CLI arguments can be specified in a config file.
**Example config (sft_config.yaml):**
```yaml
model_name_or_path: Qwen/Qwen2.5-0.5B
dataset_name: trl-lib/Capybara
learning_rate: 2.0e-5
num_train_epochs: 1
per_device_train_batch_size: 8
gradient_accumulation_steps: 2
output_dir: ./sft_output
use_peft: true
lora_r: 16
lora_alpha: 16
report_to: trackio
```
**Launch with config:**
```bash
trl sft --config sft_config.yaml
```
**Override config values:**
```bash
trl sft --config sft_config.yaml --learning_rate 1.0e-5
```
## Distributed Training
TRL integrates with Accelerate for multi-GPU and multi-node training.
**Multi-GPU training:**
```bash
trl sft \
--config sft_config.yaml \
--num_processes 4
```
**Use predefined Accelerate configs:**
TRL provides predefined configs: `single_gpu`, `multi_gpu`, `fsdp1`, `fsdp2`, `zero1`, `zero2`, `zero3`
```bash
trl sft \
--config sft_config.yaml \
--accelerate_config zero2
```
**Custom Accelerate config:**
```bash
# Generate custom config
accelerate config
# Use custom config
trl sft --config sft_config.yaml --config_file ~/.cache/huggingface/accelerate/default_config.yaml
```
**Fully Sharded Data Parallel (FSDP):**
```bash
trl sft --config sft_config.yaml --accelerate_config fsdp2
```
**DeepSpeed ZeRO:**
```bash
trl sft --config sft_config.yaml --accelerate_config zero3
```
## Troubleshooting
### CUDA Out of Memory
- Reduce `--per_device_train_batch_size` and increase `--gradient_accumulation_steps`
- Enable `--use_peft` for LoRA training
- Use `--gradient_checkpointing` to save memory
- Try smaller model or longer sequence truncation
### Dataset Loading Issues
- Verify dataset exists: check Hugging Face Hub or local path
- Check dataset format matches expected columns
- Use `--dataset_config` for multi-config datasets
- Inspect dataset: `from datasets import load_dataset; ds = load_dataset(name)`
### Model Loading Issues
- Verify model exists on Hugging Face Hub
- Check if gated model requires authentication: `hf auth login`
- For local models, provide absolute path
- Ensure sufficient disk space and memory
### Slow Training
- Enable dataset `--packing` for short sequences
- Use larger `--per_device_train_batch_size` if memory allows
- Enable `--tf32` for faster computation on Ampere GPUs
- Use `--bf16` on supported hardware
- Consider multi-GPU training with `--num_processes`
### Generation Issues (GRPO/RLOO)
- Check prompt format in dataset
- Adjust `--temperature` and `--top_p` for generation
- Verify the reward function (for GRPO/RLOO)
## Additional Resources
- **Documentation**: https://huggingface.co/docs/trl
- **GitHub**: https://github.com/huggingface/trl
- **Examples**: https://github.com/huggingface/trl/tree/main/examples
## Best Practices
1. **Start with SFT**: Always fine-tune base models with SFT before preference alignment
2. **Use LoRA for efficiency**: Enable `--use_peft` for faster training and lower memory
3. **Monitor training**: Use `--report_to trackio` (or `--report_to wandb` or `--report_to tensorboard`) for tracking
4. **Save checkpoints**: TRL automatically saves checkpoints in `--output_dir`
5. **Test on small datasets first**: Verify pipeline works before full training
6. **Use configuration files**: Create YAML configs for reproducibility
7. **Leverage Accelerate**: Use multi-GPU training for faster iteration
When helping users with TRL:
- Always check which training method is appropriate for their use case
- Verify dataset format matches the expected schema
- Recommend starting with smaller models for testing
- Suggest LoRA for resource-constrained environments
- Point to specific documentation sections for advanced features
Creator's repository · huggingface/skills
License: Apache-2.0