datamix

Dataset mixing & curriculum optimizer for LLM training. Profile datasets, create mix recipes, schedule curricula, allocate token budgets, and clean data — all with zero dependencies.

Training data composition is one of the most impactful decisions in LLM training, yet there's no standard tooling for it. datamix gives you a programmatic way to profile, blend, schedule, and budget your training data.

Why datamix?

Problem	datamix Solution
"What ratio of code vs. wiki should I use?"	Temperature-scaled mixing with automatic weight computation
No way to profile datasets before mixing	Instant profiling — token counts, lengths, quality metrics
Data curriculum is done manually in configs	Programmatic scheduling — linear, cosine, step functions
Token budget allocation is guesswork	Automatic budget computation with overflow detection
Quality filtering is scattered scripts	Built-in length filter, exact/near dedup, quality scoring

Installation

pip install datamix            # zero dependencies
pip install datamix[cli]       # + click, rich for terminal UI
pip install datamix[all]       # everything

Quick Start

1. Profile your datasets

from datamix import profile_dataset, profile_jsonl, compare_profiles

# From a JSONL file
wiki = profile_jsonl("data/wikipedia.jsonl")
code = profile_jsonl("data/code-python.jsonl")

print(f"{wiki.name}: {wiki.n_examples:,} examples, {wiki.size_tokens_m:.1f}M tokens")
print(f"{code.name}: {code.n_examples:,} examples, {code.size_tokens_m:.1f}M tokens")

# Compare multiple datasets
comparison = compare_profiles([wiki, code])
print(f"Total: {comparison['total_tokens']:,} tokens across {comparison['n_datasets']} datasets")

2. Create a mix recipe

from datamix import create_recipe, MixStrategy

recipe = create_recipe(
    [wiki, code],
    strategy=MixStrategy.TEMPERATURE,
    temperature=1.5,  # >1 = more uniform, <1 = proportional
    total_tokens=2_000_000_000,
)

for name, weight in recipe.normalized_weights.items():
    print(f"  {name}: {weight:.1%}")

3. Schedule a curriculum

from datamix import cosine_schedule, linear_schedule

# Cosine decay: primary dataset starts high, others increase
sched = cosine_schedule(
    ["wikipedia", "code", "arxiv", "books"],
    n_phases=4,
    primary="wikipedia",
    total_tokens=2_000_000_000,
)

# Get weights at any training progress point
weights_start = sched.weights_at(0.0)   # {"wikipedia": 0.93, ...}
weights_mid = sched.weights_at(0.5)     # {"wikipedia": 0.50, ...}
weights_end = sched.weights_at(1.0)     # {"wikipedia": 0.07, ...}

4. Allocate token budgets

from datamix import compute_budget, fit_to_budget, budget_report

# From a recipe
budget = compute_budget(recipe, [wiki, code])
print(budget_report(budget))

# Or fit datasets to a fixed budget
budget = fit_to_budget([wiki, code], token_budget=1_000_000_000)

5. Clean your data

from datamix import length_filter, dedup_exact, dedup_ngram, quality_score

# Filter by length
kept, stats = length_filter(texts, min_length=50, max_length=10000)
print(f"Kept {stats['kept']}, removed {stats['removed']}")

# Remove exact duplicates
kept, stats = dedup_exact(kept)

# Remove near-duplicates (n-gram Jaccard)
kept, stats = dedup_ngram(kept, n=5, threshold=0.8)

# Score individual examples
for text in kept[:5]:
    score = quality_score(text)
    print(f"  {score:.2f}  {text[:60]}...")

CLI

# Profile a JSONL file
datamix profile data/wiki.jsonl

# Create a mix recipe
datamix mix data/wiki.jsonl data/code.jsonl --strategy temperature --budget 2000000000

# Clean a dataset
datamix clean data/raw.jsonl --min-length 50 --dedup

Mixing Strategies

Strategy	Description	When to Use
`PROPORTIONAL`	Weight by dataset size	Default — larger datasets get more weight
`TEMPERATURE`	Temperature-scaled proportional	Control uniformity (T>1) vs. proportional (T<1)
`EQUAL`	Equal weight per dataset	When all datasets are equally important
`CUSTOM`	Explicit weights	When you know the exact ratios

Curriculum Types

Schedule	Description
`linear_schedule`	Linear interpolation from start to end weights
`cosine_schedule`	Cosine decay for primary dataset, others increase
`step_schedule`	Step function with explicit phase configs
`custom_schedule`	Build from CurriculumPhase objects

Architecture

datamix/
├── _types.py        # DatasetProfile, MixRecipe, CurriculumSchedule, TokenBudget
├── profile.py       # Dataset profiling from lists or JSONL files
├── mixer.py         # Mix recipe creation, merging, scaling
├── curriculum.py    # Linear, cosine, step, custom curriculum schedules
├── sampler.py       # Temperature, proportional, stratified sampling
├── budget.py        # Token budget computation and allocation
├── quality.py       # Length filter, exact/near dedup, quality scoring
└── cli.py           # Click CLI interface

License

Apache 2.0

Name		Name	Last commit message	Last commit date
Latest commit History 16 Commits
.github/workflows		.github/workflows
assets		assets
dist		dist
examples		examples
src/datamix		src/datamix
tests		tests
.gitignore		.gitignore
CHANGELOG.md		CHANGELOG.md
CITATION.cff		CITATION.cff
README.md		README.md
generate_svgs.py		generate_svgs.py
pyproject.toml		pyproject.toml

Project	What it does
tokonomics	Token counting & cost management for LLM APIs
datacrux	Training data quality — dedup, PII, contamination
castwright	Synthetic instruction data generation
toksight	Tokenizer analysis & comparison
trainpulse	Training health monitoring
ckpt	Checkpoint inspection, diffing & merging
quantbench	Quantization quality analysis
infermark	Inference benchmarking
modeldiff	Behavioral regression testing
vibesafe	AI-generated code safety scanner
injectionguard	Prompt injection detection

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

datamix

Why datamix?

Installation

Quick Start

1. Profile your datasets

2. Create a mix recipe

3. Schedule a curriculum

4. Allocate token budgets

5. Clean your data

CLI

Mixing Strategies

Curriculum Types

Architecture

See Also

License

About

Uh oh!

Releases 3

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

datamix

Why datamix?

Installation

Quick Start

1. Profile your datasets

2. Create a mix recipe

3. Schedule a curriculum

4. Allocate token budgets

5. Clean your data

CLI

Mixing Strategies

Curriculum Types

Architecture

See Also

License

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases 3

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages