A powerful tool for translating datasets using local Large Language Models (LLMs) with efficient chunked processing and HuggingFace integration.
- Overview
- Features
- Project Structure
- Getting Started
- Configuration
- Model Zoo
- Project Roadmap
- Contributing
- License
- Acknowledgments
- Support
## Overview

Local Translate is a specialized tool designed for translating large datasets using local Large Language Models (LLMs). It leverages the power of models like LLaMAX to efficiently translate text data from one language to another, with robust chunked processing to handle large datasets and seamless integration with the HuggingFace ecosystem.
## Features

- Local LLM Translation: Uses local LLMs for private, offline translation
- Chunked Processing: Handles large datasets efficiently through chunked processing
- HuggingFace Integration: Direct integration with HuggingFace datasets and model hub
- Resumable Processing: Can resume translation from any chunk index
- Quantization Support: Supports 4-bit and 8-bit quantization for memory efficiency
- Multi-Column Translation: Translate multiple columns in a single pass
- Automatic Push to Hub: Automatically push translated datasets to HuggingFace Hub
### Key Features

- 🔒 Privacy-First: All translation happens locally using your own LLM models
- 📊 Dataset Support: Works with any HuggingFace dataset format
- 🚀 Efficient Processing: Chunked processing with configurable batch sizes
- 💾 Memory Optimization: Support for 4-bit and 8-bit quantization
- 🔄 Resumable: Continue translation from where you left off
- 🌐 Multi-Language: Support for any language pair your model supports
- 📈 Progress Tracking: Real-time progress monitoring with tqdm
- 🎯 Flexible Columns: Translate specific columns while preserving others
- 🔧 Easy Configuration: Simple command-line interface with comprehensive options
## Project Structure

```
local_translate/
├── README.md                  # Project documentation
├── requirements.txt           # Python dependencies
├── .pre-commit-config.yaml    # Pre-commit hooks configuration
├── scripts/
│   └── example_run.sh         # Example usage script
├── src/
│   ├── main.py                # Main translation processor
│   └── model.py               # LLM model wrapper
└── utils/
    └── utils.py               # Utility functions
```
Core components:

- `TranslateProcessor`: Main class handling the dataset translation workflow
- `TranslateModel`: Wrapper for LLM models with quantization support
- Utility functions: Dataset management, chunk loading, and HuggingFace integration
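To see how these components fit together, here is a minimal, hypothetical sketch of the chunked, resumable loop. Everything in it (the `translate` stub, the chunk paths, the function signatures) is illustrative; the real logic lives in `src/main.py` and `src/model.py` and may differ.

```python
# Hypothetical sketch of the chunked, resumable workflow -- not the actual
# implementation in this repository.
from datasets import load_dataset

def translate(text: str, src: str, trg: str) -> str:
    # Placeholder for the local LLM call (TranslateModel in this project).
    return f"[{trg}] {text}"

def run(dataset_name, columns, src, trg, writer_batch_size=20, start_inter=0):
    data = load_dataset(dataset_name, split="train")
    num_chunks = -(-len(data) // writer_batch_size)  # ceiling division
    for idx in range(start_inter, num_chunks):  # resume from any chunk index
        lo = idx * writer_batch_size
        hi = min(lo + writer_batch_size, len(data))
        chunk = data.select(range(lo, hi)).map(
            lambda row: {col: translate(row[col], src, trg) for col in columns}
        )
        # Persist each chunk so an interrupted run can resume at `idx`.
        chunk.save_to_disk(f".cache/chunk_{idx}")
```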
## Getting Started

### Prerequisites

Before using Local Translate, ensure your system meets these requirements:
- Python: 3.8 or higher
- CUDA: Compatible GPU with CUDA support (recommended)
- Memory: Sufficient RAM/VRAM for your chosen model
- Storage: Adequate disk space for datasets and translated outputs
### Installation

- Clone the repository:

```bash
git clone https://github.com/Datanyth/translate_data_huggingface_with_llm
cd translate_data_huggingface_with_llm
```

- Install dependencies:

```bash
pip install -r requirements.txt
```

- Verify installation:

```bash
python -c "import torch; print(f'PyTorch version: {torch.__version__}')"
```

### Usage

The main translation script can be run with various command-line arguments:

```bash
python -m src.main \
  --model_id <MODEL_ID> \
  --repo_id <REPO_ID> \
  --src_language <SOURCE_LANGUAGE> \
  --trg_language <TARGET_LANGUAGE> \
  --dataset_name <DATASET_NAME> \
  --column_name <COLUMN1> <COLUMN2> \
  --translated_dataset_dir <OUTPUT_DIR>
```

Here's a complete example translating an English dataset to Vietnamese:

```bash
export CUDA_VISIBLE_DEVICES=0
python -m src.main \
  --model_id LLaMAX/LLaMAX3-8B-Alpaca \
  --repo_id your-username/translated-dataset \
  --src_language English \
  --trg_language Vietnamese \
  --max_length_token 12800 \
  --dataset_name knoveleng/open-s1 \
  --column_name problem solution \
  --translated_dataset_dir ".cache" \
  --download_dataset_dir ".cache" \
  --start_inter 0 \
  --writer_batch_size 20 \
  --use_4bit
```

Or use the provided script:

```bash
chmod +x scripts/example_run.sh
./scripts/example_run.sh
```
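Once a run finishes with `--push` enabled (the default), the translated dataset can be pulled straight back from the Hub. A minimal sketch, assuming the placeholder repo ID from the example above and that column names are preserved:

```python
from datasets import load_dataset

# Repo ID and split name match the example above; both are assumptions.
translated = load_dataset("your-username/translated-dataset", split="train")
print(translated[0]["problem"])   # translated column
print(translated[0]["solution"])  # translated column
```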
## Configuration

### Required Arguments

| Argument | Description | Example |
|---|---|---|
| `--model_id` | HuggingFace model identifier | `LLaMAX/LLaMAX3-8B-Alpaca` |
| `--repo_id` | Target HuggingFace repo for translated dataset | `username/dataset-name` |
| `--src_language` | Source language name | `English` |
| `--trg_language` | Target language name | `Vietnamese` |
| `--dataset_name` | HuggingFace dataset name | `knoveleng/open-s1` |
| `--column_name` | Columns to translate (space-separated) | `problem solution` |
| `--translated_dataset_dir` | Output directory for translated data | `.cache` |
### Optional Arguments

| Argument | Description | Default |
|---|---|---|
| `--max_length_token` | Maximum tokens for model input | `8000` |
| `--subset` | Dataset subset to use | `train` |
| `--start_inter` | Chunk index to start from | `0` |
| `--batch_size` | Processing batch size | `1` |
| `--writer_batch_size` | Records per chunk | `20` |
| `--download_dataset_dir` | Local dataset cache directory | `None` |
| `--use_4bit` | Enable 4-bit quantization | `False` |
| `--use_8bit` | Enable 8-bit quantization | `False` |
| `--push` | Push to HuggingFace Hub | `True` |
| `--warning_skip` | Suppress warnings | `True` |
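To make the chunking arguments concrete, here is a small illustrative calculation (the row count is invented) of how `writer_batch_size` and `start_inter` interact:

```python
# Illustrative arithmetic only; the dataset size is made up.
num_rows = 10_000        # rows in the split being translated
writer_batch_size = 20   # records per chunk (--writer_batch_size)
num_chunks = -(-num_rows // writer_batch_size)   # ceiling division -> 500
start_inter = 250        # chunk index to resume from (--start_inter)
rows_already_done = start_inter * writer_batch_size  # 5,000 rows skipped
print(num_chunks, rows_already_done)  # 500 5000
```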
### Performance Tips

- Use quantization (`--use_4bit` or `--use_8bit`) for memory efficiency
- Adjust `writer_batch_size` based on your available memory
- Set `start_inter` to resume interrupted translations, as shown below
- Use `download_dataset_dir` to cache datasets locally for faster re-runs
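For example, if a run was interrupted, rerunning with the same arguments plus a higher `--start_inter` picks up at that chunk (the index 250 below is illustrative):

```bash
# Resume an interrupted run at chunk 250; keep all other arguments identical
# so the chunk boundaries line up with the first run.
python -m src.main \
  --model_id LLaMAX/LLaMAX3-8B-Alpaca \
  --repo_id your-username/translated-dataset \
  --src_language English \
  --trg_language Vietnamese \
  --dataset_name knoveleng/open-s1 \
  --column_name problem solution \
  --translated_dataset_dir ".cache" \
  --start_inter 250 \
  --use_4bit
```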
## Model Zoo

Local Translate is optimized for the LLaMAX3-8B-Alpaca model, which provides an excellent balance of translation quality and computational efficiency.
| Model | Size | HuggingFace ID | Memory (FP16) | Memory (4-bit) | Memory (8-bit) | Languages | Notes |
|---|---|---|---|---|---|---|---|
| LLaMAX3-8B-Alpaca | 8B | `LLaMAX/LLaMAX3-8B-Alpaca` | ~16GB | ~4GB | ~8GB | Multi | ⭐ Recommended - Perfect balance of quality and efficiency |
- Recommended: `LLaMAX/LLaMAX3-8B-Alpaca` with 4-bit quantization
- Memory requirement: ~4GB VRAM
- Best for: Most translation tasks, excellent quality/speed balance
With 4-bit quantization (recommended):

```bash
python -m src.main \
  --model_id LLaMAX/LLaMAX3-8B-Alpaca \
  --use_4bit \
  --max_length_token 12800 \
  # ... other arguments
```

Full precision (no quantization):

```bash
python -m src.main \
  --model_id LLaMAX/LLaMAX3-8B-Alpaca \
  --max_length_token 12800 \
  # ... other arguments
```

| Quantization Level | Memory Requirement | Recommended For |
|---|---|---|
| FP16 (Full Precision) | ~16GB | High-end GPUs with 24GB+ VRAM |
| 8-bit Quantization | ~8GB | Mid-range GPUs with 12GB+ VRAM |
| 4-bit Quantization | ~4GB | Most GPUs with 6GB+ VRAM |
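For reference, 4-bit loading of this kind is typically configured through `transformers` and `bitsandbytes`. The sketch below shows the common pattern; it is assumed, not verified, that `--use_4bit` in `src/model.py` does something equivalent:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Standard bitsandbytes 4-bit setup; assumed to approximate what --use_4bit
# enables in this project.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)
tokenizer = AutoTokenizer.from_pretrained("LLaMAX/LLaMAX3-8B-Alpaca")
model = AutoModelForCausalLM.from_pretrained(
    "LLaMAX/LLaMAX3-8B-Alpaca",
    quantization_config=bnb_config,
    device_map="auto",  # spread layers across available GPU memory
)
```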
Notes:

- Quantization: Always use `--use_4bit` for memory efficiency unless you have high-end hardware
- Model compatibility: Uses the LLaMA architecture (`LlamaForCausalLM`)
- Download size: ~15GB on first use
- Performance: Excellent translation quality at reasonable speed
- Memory: Ensure your GPU has at least 4GB of VRAM for 4-bit quantization
### Custom Models

To use a custom model, ensure it:
- Is LLaMA-compatible (uses the `LlamaForCausalLM` architecture)
- Has a HuggingFace model ID or local path
- Supports the required tokenizer
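A quick way to check the first requirement without downloading the weights is to inspect the model config (the model ID below is a placeholder):

```python
from transformers import AutoConfig

# Fetches only the small config file, not the ~15GB of weights.
config = AutoConfig.from_pretrained("your-username/your-custom-llama-model")
print(config.architectures)  # should include "LlamaForCausalLM"
```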
Example with a custom model:

```bash
python -m src.main \
  --model_id your-username/your-custom-llama-model \
  --use_4bit \
  # ... other arguments
```

## Project Roadmap

Planned:

- Enhanced Model Support: Add support for additional model architectures
- Multi-GPU Support: Parallel processing across multiple GPUs
- Advanced Quantization: Support for more quantization methods
- Benchmark Suite: Comprehensive benchmarking and metrics
- Web Interface: User-friendly web UI for configuration and monitoring
- Batch Inference: Optimized batch processing for faster translation
- Language Detection: Automatic language detection for source text
- Quality Metrics: Translation quality assessment tools
- API Support: REST API for integration with other tools
- Docker Support: Containerized deployment options
Completed:

- ✅ Chunked Processing: Efficient handling of large datasets
- ✅ Quantization Support: 4-bit and 8-bit model quantization
- ✅ Resumable Processing: Continue from any chunk index
- ✅ HuggingFace Integration: Seamless dataset and model management
## Contributing

We welcome contributions from the community! Here's how you can help:
- 💬 Discussions: Join our GitHub Discussions to share ideas and ask questions
- 🐛 Bug Reports: Report issues and bugs through GitHub Issues
- 💡 Feature Requests: Suggest new features and improvements
- 📝 Code Contributions: Submit pull requests with code improvements
### Development Setup

- Fork the repository
- Clone your fork:

```bash
git clone https://github.com/your-username/translate_data_huggingface_with_llm
cd translate_data_huggingface_with_llm
```

- Create a feature branch:

```bash
git checkout -b feature/your-feature-name
```

- Install development dependencies:

```bash
pip install -r requirements.txt
pre-commit install
```

- Make your changes and test them
- Commit with a clear message:

```bash
git commit -m "Add feature: description of changes"
```

- Push and create a pull request
### Code Style

This project uses:
- Black for code formatting
- Ruff for linting and import sorting
- Pre-commit hooks for automated code quality checks
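The hooks can also be run manually across the whole tree:

```bash
pre-commit run --all-files
```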
## License

This project is licensed under the MIT License. See the LICENSE file for details.
## Acknowledgments

- HuggingFace: For the excellent transformers library and dataset ecosystem
- LLaMAX Team: For providing the base models used in this project
- Open Source Community: For the various tools and libraries that make this project possible
- Contributors: Everyone who has contributed to improving this tool
## Support

If you need help or have questions:
- 📖 Documentation: Check this README and the code comments
- 💬 Discussions: Join our GitHub Discussions
- 🐛 Issues: Report bugs via GitHub Issues
- ⭐ Star the repo: If you find this project useful!
Made with ❤️ for the open-source community