Toolkit for robust LLM fine-tuning on noisy training data.
Companion open-source implementation for the paper:
Fine-Tuning LLMs for Robust Classification in Noisy Data Environments. Journal of Information Systems Engineering and Management (JISEM), 2024, 9(1). e-ISSN: 2468-4376
Real-world training data is rarely clean. Mislabeled examples, inconsistent annotations, and sparse data are common in production environments such as e-commerce search, financial services, and customer support. This library provides the tooling to detect noise in classification datasets, clean and augment training data, evaluate model robustness across noise types and levels, and reproduce the benchmark experiments from the paper.
| Module | Purpose |
|---|---|
| `noisyllm.detect` | Cross-validation confidence scoring to flag likely mislabeled examples |
| `noisyllm.clean` | Filter high-confidence noise with configurable thresholds |
| `noisyllm.train` | Pydantic configs for noise-robust training (label smoothing, curriculum learning) |
| `noisyllm.eval` | Robustness evaluator across noise types and levels |
| `noisyllm.benchmark` | Synthetic noisy benchmark datasets for reproducible comparison |
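To make the idea behind cross-validation confidence scoring concrete, here is a minimal, standard-library-only sketch. It is not the `noisyllm.detect` implementation: the tiny naive Bayes classifier, the helper names (`train_nb`, `predict_proba`, `flag_suspects`), and the toy data are all assumptions chosen for illustration. Each sample is scored by a model that never saw it during training, and samples whose given label receives low out-of-fold probability are flagged.

```python
import math
from collections import Counter, defaultdict


def train_nb(samples):
    """Fit a tiny multinomial naive Bayes model on (text, label) pairs."""
    label_counts = Counter(label for _, label in samples)
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in samples:
        tokens = text.lower().split()
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return label_counts, word_counts, vocab


def predict_proba(model, text):
    """Return {label: posterior probability} using add-one smoothing."""
    label_counts, word_counts, vocab = model
    total = sum(label_counts.values())
    log_scores = {}
    for label, count in label_counts.items():
        # The extra +1 reserves probability mass for out-of-vocabulary tokens.
        denom = sum(word_counts[label].values()) + len(vocab) + 1
        score = math.log(count / total)
        for token in text.lower().split():
            score += math.log((word_counts[label][token] + 1) / denom)
        log_scores[label] = score
    peak = max(log_scores.values())  # subtract the max for numerical stability
    exp_scores = {k: math.exp(v - peak) for k, v in log_scores.items()}
    norm = sum(exp_scores.values())
    return {k: v / norm for k, v in exp_scores.items()}


def flag_suspects(samples, n_folds=5, confidence_threshold=0.3):
    """Return indices whose out-of-fold confidence in the given label is low."""
    flagged = []
    for fold in range(n_folds):
        # Train on every sample outside the current fold.
        train = [s for i, s in enumerate(samples) if i % n_folds != fold]
        model = train_nb(train)
        # Score only the held-out samples of this fold.
        for i in range(fold, len(samples), n_folds):
            text, label = samples[i]
            confidence = predict_proba(model, text).get(label, 0.0)
            if confidence < confidence_threshold:
                flagged.append(i)
    return sorted(flagged)


# Hypothetical toy data: index 4 is a billing message mislabeled as "account".
samples = [
    ("refund payment invoice", "billing"),
    ("password login reset", "account"),
    ("invoice refund charge", "billing"),
    ("login reset username", "account"),
    ("payment charge refund", "account"),  # deliberately wrong label
    ("reset password username", "account"),
    ("charge invoice payment", "billing"),
    ("username login password", "account"),
    ("refund invoice charge", "billing"),
    ("password reset login", "account"),
]
flagged = flag_suspects(samples)
print(flagged)  # [4]
```

The threshold here plays the same role as `confidence_threshold` in the quick-start example: lowering it flags fewer samples, raising it flags more. A real pipeline would most likely swap the toy classifier for a stronger model while keeping the out-of-fold structure.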
pip install noisyllm

Or with UV:

uv add noisyllm

from noisyllm.detect import NoiseProfiler
from noisyllm.clean import DataCleaner
from noisyllm.eval import RobustnessEvaluator
# Detect noise in a labeled dataset
profiler = NoiseProfiler(text_col="text", label_col="label", confidence_threshold=0.3)
report = profiler.analyze(dataset)
print(report.summary())
# Dataset: 5000 samples
# Estimated noise rate: 8.1%
# Flagged samples: 407
# Clean the dataset by removing high-confidence mislabels
cleaner = DataCleaner(noise_report=report, filter_threshold=0.7)
result = cleaner.clean(dataset)
print(result.summary())
# Original: 5000 | Filtered: 312 | Final: 4688
# Evaluate robustness of a trained classifier
evaluator = RobustnessEvaluator(predict_fn=model.predict)
eval_report = evaluator.evaluate(
clean_test=test_dataset,
noise_levels=[0.05, 0.10, 0.15, 0.20],
noise_types=["label_flip", "text_corruption"],
)
print(eval_report.summary())
# Base accuracy (clean): 94.2%
# Robustness index: 0.91

from noisyllm.benchmark import load_benchmark
dataset = load_benchmark("intent_classification", noise_level=0.10)
# Returns: BenchmarkDataset with train (noisy), test (clean), label_set

Available benchmarks: intent_classification, sentiment, document
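As an illustration of what a `noise_level` of 0.10 with the `label_flip` noise type could mean in practice, the sketch below injects symmetric label noise into a clean label list. How the library actually builds its benchmarks is not described here, so the function `flip_labels`, its parameters, and the intent names are hypothetical; the sketch assumes each selected label is reassigned uniformly at random to a different class.

```python
import random


def flip_labels(labels, label_set, noise_level, seed=0):
    """Return a copy of labels with a noise_level fraction moved to another class."""
    rng = random.Random(seed)  # a fixed seed keeps the corruption reproducible
    noisy = list(labels)
    n_flip = round(noise_level * len(labels))
    # Sample distinct positions so exactly n_flip labels change.
    for i in rng.sample(range(len(labels)), n_flip):
        alternatives = [c for c in label_set if c != labels[i]]
        noisy[i] = rng.choice(alternatives)
    return noisy


# Hypothetical intent labels, balanced across three classes.
label_set = ["book_flight", "cancel_order", "track_package"]
clean = [label_set[i % 3] for i in range(200)]
noisy = flip_labels(clean, label_set, noise_level=0.10)

n_changed = sum(a != b for a, b in zip(clean, noisy))
print(f"changed {n_changed} of {len(clean)} labels")
```

Because each flipped label is guaranteed to differ from the original, the observed noise rate matches the requested level exactly, which makes comparisons across noise levels easier to interpret. Keeping the test split untouched, as the benchmark does, means accuracy is always measured against clean labels.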
uv sync --all-extras
uv run pytest tests/ -v --cov=src
uv run isort src/ tests/ && uv run black src/ tests/

If you use this library in your research, please cite the paper:
Fine-Tuning LLMs for Robust Classification in Noisy Data Environments.
Journal of Information Systems Engineering and Management (JISEM), 2024, 9(1).
e-ISSN: 2468-4376
Apache 2.0