Local Translate

A powerful tool for translating datasets using local Large Language Models (LLMs) with efficient chunked processing and HuggingFace integration.

📋 Table of Contents

  • 🎯 Overview
  • ✨ Features
  • 📁 Project Structure
  • 🚀 Getting Started
  • ⚙️ Configuration
  • 🤖 Model Zoo
  • 🎯 Project Roadmap
  • 🤝 Contributing
  • 📄 License
  • 🙏 Acknowledgments
  • 📞 Support

🎯 Overview

Local Translate is a specialized tool for translating large datasets with local Large Language Models (LLMs). It uses models such as LLaMAX to translate text from one language to another, with chunked processing that keeps large datasets tractable and direct integration with the HuggingFace ecosystem.

Key Capabilities:

  • Local LLM Translation: Runs entirely on local models, keeping data private and enabling offline use
  • Chunked Processing: Splits large datasets into fixed-size chunks so memory use stays bounded
  • HuggingFace Integration: Direct integration with HuggingFace datasets and model hub
  • Resumable Processing: Can resume translation from any chunk index
  • Quantization Support: Supports 4-bit and 8-bit quantization for memory efficiency
  • Multi-Column Translation: Translate multiple columns in a single pass
  • Automatic Push to Hub: Automatically push translated datasets to HuggingFace Hub
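
As a quick illustration of the HuggingFace integration above, loading a source dataset and pushing a processed copy back to the Hub uses the standard `datasets` API. This is a sketch of the underlying calls, not the project's internal code:

# Sketch of the HuggingFace `datasets` calls this workflow builds on;
# illustrative only, not the project's internal code.
from datasets import load_dataset

dataset = load_dataset("knoveleng/open-s1", split="train")   # source dataset from the example below
# ... translate the desired columns here ...
dataset.push_to_hub("your-username/translated-dataset")      # requires `huggingface-cli login` first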

✨ Features

  • 🔒 Privacy-First: All translation happens locally using your own LLM models
  • 📊 Dataset Support: Works with any HuggingFace dataset format
  • 🚀 Efficient Processing: Chunked processing with configurable batch sizes
  • 💾 Memory Optimization: Support for 4-bit and 8-bit quantization
  • 🔄 Resumable: Continue translation from where you left off
  • 🌐 Multi-Language: Support for any language pair your model supports
  • 📈 Progress Tracking: Real-time progress monitoring with tqdm
  • 🎯 Flexible Columns: Translate specific columns while preserving others
  • 🔧 Easy Configuration: Simple command-line interface with comprehensive options

📁 Project Structure

local_translate/
├── README.md                 # Project documentation
├── requirements.txt          # Python dependencies
├── .pre-commit-config.yaml   # Pre-commit hooks configuration
├── scripts/
│   └── example_run.sh        # Example usage script
├── src/
│   ├── main.py              # Main translation processor
│   └── model.py             # LLM model wrapper
└── utils/
    └── utils.py             # Utility functions

Core Components:

  • TranslateProcessor: Main class handling dataset translation workflow
  • TranslateModel: Wrapper for LLM models with quantization support
  • Utility Functions: Dataset management, chunk loading, and HuggingFace integration
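
The internals are not documented here, but the chunked, resumable flow can be pictured roughly as below. This is a hypothetical sketch; the actual TranslateProcessor in src/main.py may be organized differently:

# Hypothetical sketch of chunked, resumable translation; the real
# TranslateProcessor implementation may differ.
from datasets import Dataset

def translate_in_chunks(dataset: Dataset, translate_fn, writer_batch_size: int = 20,
                        start_inter: int = 0, out_dir: str = ".cache"):
    """Process `dataset` in chunks, checkpointing each one so an
    interrupted run can resume via `start_inter`."""
    num_chunks = (len(dataset) + writer_batch_size - 1) // writer_batch_size
    for chunk_idx in range(start_inter, num_chunks):
        lo = chunk_idx * writer_batch_size
        hi = min(lo + writer_batch_size, len(dataset))
        chunk = dataset.select(range(lo, hi))          # slice out one chunk
        translated = chunk.map(translate_fn)           # translate_fn fills the target columns
        translated.save_to_disk(f"{out_dir}/chunk_{chunk_idx}")  # checkpoint for resume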

🚀 Getting Started

Prerequisites

Before using Local Translate, ensure your system meets these requirements:

  • Python: 3.8 or higher
  • CUDA: Compatible GPU with CUDA support (recommended)
  • Memory: Sufficient RAM/VRAM for your chosen model
  • Storage: Adequate disk space for datasets and translated outputs
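
To see how much VRAM your GPU offers, a quick check with standard PyTorch calls (assuming a CUDA build of PyTorch):

# Quick VRAM check using standard PyTorch calls (requires a CUDA build).
import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"{props.name}: {props.total_memory / 1024**3:.1f} GiB VRAM")
else:
    print("No CUDA GPU detected; translation will be very slow on CPU.")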

Installation

  1. Clone the repository:
git clone https://github.com/Datanyth/translate_data_huggingface_with_llm
cd translate_data_huggingface_with_llm
  2. Install dependencies:
pip install -r requirements.txt
  3. Verify installation:
python -c "import torch; print(f'PyTorch version: {torch.__version__}')"

Usage

The main translation script can be run with various command-line arguments:

python -m src.main \
  --model_id <MODEL_ID> \
  --repo_id <REPO_ID> \
  --src_language <SOURCE_LANGUAGE> \
  --trg_language <TARGET_LANGUAGE> \
  --dataset_name <DATASET_NAME> \
  --column_name <COLUMN1> <COLUMN2> \
  --translated_dataset_dir <OUTPUT_DIR>

Example

Here's a complete example translating an English dataset to Vietnamese:

export CUDA_VISIBLE_DEVICES=0
python -m src.main \
  --model_id LLaMAX/LLaMAX3-8B-Alpaca \
  --repo_id your-username/translated-dataset \
  --src_language English \
  --trg_language Vietnamese \
  --max_length_token 12800 \
  --dataset_name knoveleng/open-s1 \
  --column_name problem solution \
  --translated_dataset_dir ".cache" \
  --download_dataset_dir ".cache" \
  --start_inter 0 \
  --writer_batch_size 20 \
  --use_4bit

Or use the provided script:

chmod +x scripts/example_run.sh
./scripts/example_run.sh

⚙️ Configuration

Required Arguments

| Argument | Description | Example |
|----------|-------------|---------|
| --model_id | HuggingFace model identifier | LLaMAX/LLaMAX3-8B-Alpaca |
| --repo_id | Target HuggingFace repo for the translated dataset | username/dataset-name |
| --src_language | Source language name | English |
| --trg_language | Target language name | Vietnamese |
| --dataset_name | HuggingFace dataset name | knoveleng/open-s1 |
| --column_name | Columns to translate (space-separated) | problem solution |
| --translated_dataset_dir | Output directory for translated data | .cache |

Optional Arguments

| Argument | Description | Default |
|----------|-------------|---------|
| --max_length_token | Maximum tokens for model input | 8000 |
| --subset | Dataset subset to use | train |
| --start_inter | Chunk index to start from | 0 |
| --batch_size | Processing batch size | 1 |
| --writer_batch_size | Records per chunk | 20 |
| --download_dataset_dir | Local dataset cache directory | None |
| --use_4bit | Enable 4-bit quantization | False |
| --use_8bit | Enable 8-bit quantization | False |
| --push | Push to HuggingFace Hub | True |
| --warning_skip | Suppress warnings | True |

Performance Tips

  • Use quantization (--use_4bit or --use_8bit) for memory efficiency
  • Adjust writer_batch_size based on your available memory
  • Set start_inter to resume interrupted translations
  • Use download_dataset_dir to cache datasets locally for faster re-runs
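
For reference, 4-bit loading with transformers and bitsandbytes typically looks like the sketch below; the repository's TranslateModel wrapper may configure this differently:

# Typical 4-bit loading via transformers + bitsandbytes; shown for
# orientation only, the repo's TranslateModel may differ in detail.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,   # compute in FP16, store weights in 4-bit
)
model = AutoModelForCausalLM.from_pretrained(
    "LLaMAX/LLaMAX3-8B-Alpaca",
    quantization_config=quant_config,
    device_map="auto",                      # let accelerate place layers on available GPUs
)
tokenizer = AutoTokenizer.from_pretrained("LLaMAX/LLaMAX3-8B-Alpaca")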

🤖 Model Zoo

Local Translate is optimized for the LLaMAX3-8B-Alpaca model, which provides an excellent balance of translation quality and computational efficiency.

🏆 Recommended Model

| Model | Size | HuggingFace ID | Memory (FP16) | Memory (4-bit) | Memory (8-bit) | Languages | Notes |
|-------|------|----------------|---------------|----------------|----------------|-----------|-------|
| LLaMAX3-8B-Alpaca | 8B | LLaMAX/LLaMAX3-8B-Alpaca | ~16GB | ~4GB | ~8GB | Multi | Recommended: good balance of quality and efficiency |

💡 Model Selection Guide

For Production Use

  • Recommended: LLaMAX/LLaMAX3-8B-Alpaca (4-bit quantization)
  • Memory requirement: ~4GB VRAM
  • Best for: Most translation tasks, excellent quality/speed balance

🔧 Model Usage Examples

Basic Usage (Recommended with 4-bit quantization):

python -m src.main \
  --model_id LLaMAX/LLaMAX3-8B-Alpaca \
  --use_4bit \
  --max_length_token 12800 \
  # ... other arguments

Full Precision Usage (if you have sufficient VRAM):

python -m src.main \
  --model_id LLaMAX/LLaMAX3-8B-Alpaca \
  --max_length_token 12800 \
  # ... other arguments

📊 Memory Requirements

| Quantization Level | Memory Requirement | Recommended For |
|--------------------|--------------------|-----------------|
| FP16 (full precision) | ~16GB | High-end GPUs with 24GB+ VRAM |
| 8-bit quantization | ~8GB | Mid-range GPUs with 12GB+ VRAM |
| 4-bit quantization | ~4GB | Most GPUs with 6GB+ VRAM |

These estimates follow from the 8B parameter count: roughly 2 bytes per weight in FP16, 1 byte in 8-bit, and 0.5 bytes in 4-bit, plus activation and KV-cache overhead.

⚠️ Important Notes

  • Quantization: Always use --use_4bit for memory efficiency unless you have high-end hardware
  • Model Compatibility: Uses LLaMA architecture (LlamaForCausalLM)
  • Download Size: ~15GB model download on first use
  • Performance: Excellent translation quality with reasonable speed
  • Memory: Ensure your GPU has at least 4GB VRAM for 4-bit quantization

🔄 Adding Custom Models

To use a custom model, ensure it:

  1. Is LLaMA-compatible (uses LlamaForCausalLM architecture)
  2. Has a HuggingFace model ID or local path
  3. Supports the required tokenizer

Example with custom model:

python -m src.main \
  --model_id your-username/your-custom-llama-model \
  --use_4bit \
  # ... other arguments
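
Before committing to a long run, you can sanity-check compatibility by inspecting the model's config. This is a convenience snippet, not part of the tool; the model ID is the placeholder from the example above:

# Checks the declared architecture without downloading the full weights.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("your-username/your-custom-llama-model")
print(config.architectures)   # should include "LlamaForCausalLM"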

🎯 Project Roadmap

Planned Features

  • Enhanced Model Support: Add support for additional model architectures
  • Multi-GPU Support: Parallel processing across multiple GPUs
  • Advanced Quantization: Support for more quantization methods
  • Benchmark Suite: Comprehensive benchmarking and metrics
  • Web Interface: User-friendly web UI for configuration and monitoring
  • Batch Inference: Optimized batch processing for faster translation
  • Language Detection: Automatic language detection for source text
  • Quality Metrics: Translation quality assessment tools
  • API Support: REST API for integration with other tools
  • Docker Support: Containerized deployment options

Recent Updates

  • Chunked Processing: Efficient handling of large datasets
  • Quantization Support: 4-bit and 8-bit model quantization
  • Resumable Processing: Continue from any chunk index
  • HuggingFace Integration: Seamless dataset and model management

🤝 Contributing

We welcome contributions from the community! Here's how you can help:

Ways to Contribute

  • 💬 Discussions: Join our GitHub Discussions to share ideas and ask questions
  • 🐛 Bug Reports: Report issues and bugs through GitHub Issues
  • 💡 Feature Requests: Suggest new features and improvements
  • 📝 Code Contributions: Submit pull requests with code improvements

Development Setup

  1. Fork the repository
  2. Clone your fork:
    git clone https://github.com/your-username/translate_data_huggingface_with_llm
    cd translate_data_huggingface_with_llm
  3. Create a feature branch:
    git checkout -b feature/your-feature-name
  4. Install development dependencies:
    pip install -r requirements.txt
    pre-commit install
  5. Make your changes and test them
  6. Commit with a clear message:
    git commit -m "Add feature: description of changes"
  7. Push and create a pull request

Code Style

This project uses:

  • Black for code formatting
  • Ruff for linting and import sorting
  • Pre-commit hooks for automated code quality checks

📄 License

This project is licensed under the MIT License. See the LICENSE file for details.


🙏 Acknowledgments

  • HuggingFace: For the excellent transformers library and dataset ecosystem
  • LLaMAX Team: For providing the base models used in this project
  • Open Source Community: For the various tools and libraries that make this project possible
  • Contributors: Everyone who has contributed to improving this tool

📞 Support

If you need help or have questions:

  • 📖 Documentation: Check this README and the code comments
  • 💬 Discussions: Join our GitHub Discussions
  • 🐛 Issues: Report bugs via GitHub Issues
  • Star the repo: If you find this project useful!

Made with ❤️ for the open-source community
