This project provides a script for training a transformer-based language model using the Hugging Face transformers library and TRL (Transformer Reinforcement Learning). The script loads a pre-trained model, tokenizes a dataset, and fine-tunes the model on the provided training data.
- Supports various transformer-based models (e.g., Qwen, Llama)
- Loads datasets from Hugging Face Hub
- Implements a data collator for completion-only language modeling
- Uses `SFTTrainer` for supervised fine-tuning
- Saves the trained model and tokenizer
Before running the script, ensure you have the following dependencies installed:
```bash
pip install torch transformers datasets trl
```

The script uses a dataclass `TrainingConfig` to store training parameters (a sketch of the dataclass follows the list below). These parameters include:
- `model_name`: Name of the pre-trained model (default: `Qwen/Qwen2.5-7B-Instruct`)
- `block_size`: Maximum sequence length for training (default: `32768`)
- `train_file_path`: Path to the training dataset (default: `simplescaling/s1K_tokenized`)
- `dagger`: Boolean flag for an optional setting (default: `False`)
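A minimal sketch, assuming only the fields and defaults listed above (the actual `TrainingConfig` in `train.py` may define additional fields):

```python
from dataclasses import dataclass, field


@dataclass
class TrainingConfig:
    # Pre-trained model to fine-tune
    model_name: str = field(default="Qwen/Qwen2.5-7B-Instruct")
    # Maximum sequence length used during training
    block_size: int = field(default=32768)
    # Hugging Face Hub dataset (or local path) with the tokenized training data
    train_file_path: str = field(default="simplescaling/s1K_tokenized")
    # Optional boolean flag toggling an alternative training setup
    dagger: bool = field(default=False)
```

Because it is a dataclass, `HfArgumentParser` can expose each field as a command-line flag (e.g. `--model_name` or `--block_size 4096`).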
To train the model, simply run:
```bash
python train.py
```

The script will:
- Parse command-line arguments using `HfArgumentParser`
- Load the specified pre-trained model
- Load and split the dataset into training and evaluation sets
- Tokenize and preprocess the data
- Configure the trainer and start the fine-tuning process
- Save the trained model and tokenizer
After training, the fine-tuned model and tokenizer are saved to the directory specified by `args.output_dir`. A simplified sketch of this overall flow is shown below.
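The sketch assumes the `TrainingConfig` sketched earlier is in scope and a TRL version whose `SFTTrainer` accepts plain `TrainingArguments` and a `tokenizer` argument (newer releases rename or replace both); the real `train.py` may differ in preprocessing and trainer configuration:

```python
import transformers
from datasets import load_dataset
from trl import SFTTrainer

# Parse the custom dataclass plus the standard Hugging Face training arguments
parser = transformers.HfArgumentParser((TrainingConfig, transformers.TrainingArguments))
config, args = parser.parse_args_into_dataclasses()

# Load the pre-trained model and its tokenizer
model = transformers.AutoModelForCausalLM.from_pretrained(config.model_name)
tokenizer = transformers.AutoTokenizer.from_pretrained(config.model_name)

# Load the dataset (assumes a "train" split) and hold out a small evaluation
# set; the split fraction is illustrative.
dataset = load_dataset(config.train_file_path)
dataset = dataset["train"].train_test_split(test_size=0.1)

# Configure the trainer and start fine-tuning
trainer = SFTTrainer(
    model=model,
    args=args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    tokenizer=tokenizer,  # newer TRL versions call this `processing_class`
)
trainer.train()

# Persist the fine-tuned model and tokenizer
trainer.save_model(output_dir=args.output_dir)
tokenizer.save_pretrained(args.output_dir)
```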
- The script automatically adjusts settings based on the model type (e.g., `Qwen`, `Llama`)
- Uses a special padding token that is model-specific
- Implements `DataCollatorForCompletionOnlyLM` to mask user instructions and only compute loss over assistant responses (see the sketch after this list)
- Uses `FSDP` for efficient loading when training large models (e.g., `70B` models)
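For reference, TRL's collator matches template strings in the tokenized chat and sets the labels of everything outside the assistant's turn to `-100`. A minimal sketch, where the template strings are illustrative Qwen-style markers rather than values taken from `train.py`:

```python
from transformers import AutoTokenizer
from trl import DataCollatorForCompletionOnlyLM

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

# Markers for where the user's turn and the assistant's reply begin in the
# chat template -- illustrative values for Qwen-style chat formatting.
instruction_template = "<|im_start|>user"
response_template = "<|im_start|>assistant\n"

collator = DataCollatorForCompletionOnlyLM(
    instruction_template=instruction_template,
    response_template=response_template,
    tokenizer=tokenizer,
)
# Instruction tokens receive a label of -100, so the cross-entropy loss is
# computed only over the assistant's response tokens.
```

The resulting collator can then be passed to `SFTTrainer` via its `data_collator` argument.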
The script logs training configurations and progress using Python’s built-in `logging` module; a minimal setup sketch follows the list below. Logs include details about:
- Training configuration
- Dataset processing
- Training progress
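A minimal setup along those lines, with an illustrative format string and log level:

```python
import logging

# Basic configuration for the root logger; format and level are illustrative.
logging.basicConfig(
    format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
    level=logging.INFO,
)
logger = logging.getLogger(__name__)

# Example: record that a fine-tuning run is starting.
logger.info("Starting fine-tuning run")
```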
- Add support for more dataset formats
- Implement advanced training techniques such as LoRA or PEFT for efficient fine-tuning
- Provide pre-configured training scripts for different hardware setups
This project is licensed under the MIT License.
This script leverages the Hugging Face transformers and TRL libraries for model fine-tuning and training.