
Customain

Fine-tune LLMs to sound like you.

Customain learns your writing style from your real text content and conversations, then fine-tunes large language models to mimic your tone, voice, and communication patterns. The result is a custom AI that sounds like you rather than like a generic model.

How It Works

Your emails → Extract & clean → Fine-tune → A model that writes like you
  1. Connect a content source (Gmail today, more coming)
  2. Process your text into high-quality, anonymized training pairs
  3. Fine-tune OpenAI models on your writing style
  4. Evaluate how well the model captures your tone — with both classical metrics and a trained authorship classifier

Supported Sources

Source       Status
Gmail        ✅ Available
Outlook      🔜 Planned
Slack        🔜 Planned
Notion       🔜 Planned
Google Docs  🔜 Planned

Supported Providers

Provider     Models                                    Methods   Status
OpenAI       GPT-4.1, 4.1-mini, 4.1-nano, 4o, 4o-mini  SFT, DPO  ✅ Available
Together AI  Llama, Mixtral, Qwen + any HF model       --        🔜 Planned

Quick Start

Prerequisites

  • Python 3.11+
  • uv package manager
  • OpenAI API key
  • Gmail OAuth credentials (for Gmail source)

Installation

git clone https://github.com/user/customain.git
cd customain
uv sync

Configure API Keys

Create .secrets/api_keys.json:

{
  "openai_api_key": "sk-...",
  "wandb_api_key": "optional-for-tracking"
}

For Gmail, you'll also need OAuth credentials — see Google's guide.
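
To sanity-check the key file before running anything, here is a minimal sketch (the repo reads this file internally; this snippet only verifies the layout and that the OpenAI key works):

import json
from openai import OpenAI

# Load the keys exactly as created above.
with open(".secrets/api_keys.json") as f:
    secrets = json.load(f)

# Any successful API call confirms the key is valid.
client = OpenAI(api_key=secrets["openai_api_key"])
print(client.models.list().data[0].id)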

Step 1 — Build Your Dataset

Run the full Gmail preprocessing pipeline:

uv run python -m gmail_preprocessing_pipeline.run_pipeline

Or skip steps you've already completed:

# Already exported Gmail — start from extract
uv run python -m gmail_preprocessing_pipeline.run_pipeline --start-from 2

# Re-run just anonymize + format
uv run python -m gmail_preprocessing_pipeline.run_pipeline --start-from 5

The pipeline runs 6 steps:

  1. Export Gmail threads to mbox
  2. Extract email-reply pairs
  3. Clean signatures, quotes, links (LLM)
  4. Filter low-quality pairs (LLM)
  5. Anonymize person names → [NAME] (LLM)
  6. Format into SFT train/test split

Output: data/sft_train.jsonl and data/sft_test.jsonl
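
The exact schema comes from the formatting step, but OpenAI SFT data uses chat-style message lists, so each line of data/sft_train.jsonl should look roughly like this (contents hypothetical):

{"messages": [
  {"role": "system", "content": "You write emails in the author's voice."},
  {"role": "user", "content": "Hi, could we push tomorrow's sync to the afternoon?"},
  {"role": "assistant", "content": "Afternoon works better for me anyway. 2pm?"}
]}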

Step 2 — Fine-Tune & Evaluate

Configure which models and hyperparameters to try in ft/training_configs.py, then run the full pipeline:

uv run python -m ft.run_pipeline \
  --train-file data/sft_train.jsonl \
  --test-file data/sft_test.jsonl

Or run a quick test with a small subset first:

uv run python -m ft.run_pipeline \
  --train-file data/sft_train.jsonl \
  --test-file data/sft_test.jsonl \
  --test-run

You can also skip steps you've already completed:

# Skip data upload and job launch, just evaluate
uv run python -m ft.run_pipeline \
  --train-file data/sft_train.jsonl \
  --test-file data/sft_test.jsonl \
  --skip 1 2

The pipeline will:

  1. Upload data and launch fine-tuning jobs across your configured model/hyperparameter combinations
  2. Poll until all jobs complete
  3. Run each fine-tuned model on the test set
  4. Evaluate results and log metrics to Weights & Biases
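
Under the hood, steps 1 and 2 map onto standard OpenAI fine-tuning calls. A stripped-down sketch using the official Python SDK directly (single model, no hyperparameter sweep, no W&B logging):

import time
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

# Step 1: upload training data and launch one SFT job.
train_file = client.files.create(file=open("data/sft_train.jsonl", "rb"),
                                 purpose="fine-tune")
job = client.fine_tuning.jobs.create(training_file=train_file.id,
                                     model="gpt-4o-mini-2024-07-18")

# Step 2: poll until the job reaches a terminal state.
while job.status not in ("succeeded", "failed", "cancelled"):
    time.sleep(30)
    job = client.fine_tuning.jobs.retrieve(job.id)
print(job.status, job.fine_tuned_model)

The real pipeline does this for every combination in ft/training_configs.py and then hands the resulting models to the evaluators.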

Evaluation

Customain includes a pluggable evaluation framework. Evaluators are auto-discovered: drop a new one into ft/evaluation/evaluators/ and it runs automatically. An evaluator can be ML-based, statistical, or any other form you prefer. The ML-based and metric/statistical evaluators already implemented:

Evaluator              What it measures
authorship_classifier  CNN-based authorship probability score
tone_judge             LLM-as-judge scoring of tone & style fidelity
bleu                   N-gram overlap (BLEU score)
meteor                 Token-level alignment (METEOR score)
semantic_similarity    Embedding cosine similarity
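
As a rough illustration of what the classical metrics compute, here are bleu and semantic_similarity by hand, using nltk and the OpenAI embeddings API (the repo's actual implementations may differ):

import numpy as np
from nltk.translate.bleu_score import sentence_bleu
from openai import OpenAI

reference = "3pm works on my end, talk then!"
candidate = "3pm works for me, talk soon!"

# BLEU: n-gram overlap between the model's reply and the real reply
# (bigram weights, since these texts are short).
bleu = sentence_bleu([reference.split()], candidate.split(), weights=(0.5, 0.5))

# Semantic similarity: cosine similarity between sentence embeddings.
client = OpenAI()
emb = client.embeddings.create(model="text-embedding-3-small",
                               input=[reference, candidate])
a, b = (np.array(d.embedding) for d in emb.data)
print(bleu, a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))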

Configure which evaluators to skip in ft/training_configs.py:

skip_evaluators = ["bleu", "meteor"]  # skip the n-gram metrics; all other evaluators still run
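
The evaluator interface itself isn't documented here, so the class and method names below are hypothetical, but a drop-in statistical evaluator would look something like this:

# ft/evaluation/evaluators/length_ratio.py (hypothetical example)
class LengthRatioEvaluator:
    """Scores how closely generated replies match the reference length."""
    name = "length_ratio"

    def score(self, generated: str, reference: str) -> float:
        if not reference or not generated:
            return 0.0
        ratio = len(generated) / len(reference)
        return min(ratio, 1 / ratio)  # 1.0 when lengths match exactly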

Authorship Classifier

A character-level CNN text classifier trained to distinguish the author's writing from other people's. Unlike LLM-as-judge evaluators, it learns style patterns directly from data, so it avoids the reliability issues of LLM judges. Its current best checkpoint reaches 91% precision.
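
The repo's exact architecture isn't reproduced here; as an illustrative sketch of the idea, a character-level CNN in PyTorch (layer sizes hypothetical):

import torch
import torch.nn as nn

class CharCNN(nn.Module):
    def __init__(self, vocab_size=128, embed_dim=32, num_filters=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.conv = nn.Conv1d(embed_dim, num_filters, kernel_size=5, padding=2)
        self.head = nn.Linear(num_filters, 1)

    def forward(self, char_ids):              # (batch, seq_len) of char codes
        x = self.embed(char_ids)              # (batch, seq_len, embed_dim)
        x = torch.relu(self.conv(x.transpose(1, 2)))
        x = x.max(dim=2).values               # global max pool over positions
        return torch.sigmoid(self.head(x)).squeeze(-1)  # P(author wrote this)

model = CharCNN()
texts = torch.randint(0, 128, (4, 200))  # 4 texts as 200 ASCII char ids
print(model(texts))                      # four authorship probabilities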

# Prepare training data from existing SFT data
uv run python -m classifiers.authorship.prepare_data

# Train (logs to W&B under customain-classifiers)
uv run python -m classifiers.authorship.train \
  --train-data data/classifiers/authorship/train.jsonl \
  --val-data data/classifiers/authorship/val.jsonl

# The authorship_classifier evaluator auto-registers and uses the trained checkpoint

License

This project is licensed under the GNU Affero General Public License v3.0 (AGPLv3).
