
Customain

Fine-tune LLMs to sound like you.

Customain learns your writing style from your real text content and conversations, then fine-tunes large language models to mimic your tone, voice, and communication patterns. The result is a custom AI that sounds like you rather than like a generic model.

How It Works

Your emails → Extract & clean → Fine-tune → A model that writes like you
  1. Connect a content source (Gmail today, more coming)
  2. Process your text into high-quality, anonymized training pairs
  3. Fine-tune OpenAI models on your writing style
  4. Evaluate how well the model captures your tone — with both classical metrics and a trained authorship classifier

Supported Sources

Source       Status
Gmail        ✅ Available
Outlook      🔜 Planned
Slack        🔜 Planned
Notion       🔜 Planned
Google Docs  🔜 Planned

Supported Providers

Provider     Models                                    Methods   Status
OpenAI       GPT-4.1, 4.1-mini, 4.1-nano, 4o, 4o-mini  SFT, DPO  ✅ Available
Together AI  Llama, Mixtral, Qwen + any HF model       --        🔜 Planned

Quick Start

Prerequisites

  • Python 3.11+
  • uv package manager
  • OpenAI API key
  • Gmail OAuth credentials (for Gmail source)

Installation

git clone https://github.com/user/customain.git
cd customain
uv sync

Configure API Keys

Create .secrets/api_keys.json:

{
  "openai_api_key": "sk-...",
  "wandb_api_key": "optional-for-tracking"
}

For Gmail, you'll also need OAuth credentials — see Google's guide.
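
To sanity-check the key file before running anything, here is a minimal sketch (the repo reads this file internally; this snippet only verifies the layout and that the OpenAI key works):

import json
from openai import OpenAI

# Load the keys exactly as created above.
with open(".secrets/api_keys.json") as f:
    secrets = json.load(f)

# Any successful API call confirms the key is valid.
client = OpenAI(api_key=secrets["openai_api_key"])
print(client.models.list().data[0].id)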

Step 1 — Build Your Dataset

Run the full Gmail preprocessing pipeline:

uv run python -m gmail_preprocessing_pipeline.run_pipeline

Or skip steps you've already completed:

# Already exported Gmail — start from extract
uv run python -m gmail_preprocessing_pipeline.run_pipeline --start-from 2

# Re-run just anonymize + format
uv run python -m gmail_preprocessing_pipeline.run_pipeline --start-from 5

The pipeline runs 6 steps:

  1. Export Gmail threads to mbox
  2. Extract email-reply pairs
  3. Clean signatures, quotes, links (LLM)
  4. Filter low-quality pairs (LLM)
  5. Anonymize person names → [NAME] (LLM)
  6. Format into SFT train/test split

Output: data/sft_train.jsonl and data/sft_test.jsonl
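
The exact schema comes from the formatting step, but OpenAI SFT data uses chat-style message lists, so each line of data/sft_train.jsonl should look roughly like this (contents hypothetical):

{"messages": [
  {"role": "system", "content": "You write emails in the author's voice."},
  {"role": "user", "content": "Hi, could we push tomorrow's sync to the afternoon?"},
  {"role": "assistant", "content": "Afternoon works better for me anyway. 2pm?"}
]}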

Step 2 — Fine-Tune & Evaluate

Configure which models and hyperparameters to try in ft/training_configs.py, then run the full pipeline:

uv run python -m ft.run_pipeline \
  --train-file data/sft_train.jsonl \
  --test-file data/sft_test.jsonl

Or run a quick test with a small subset first:

uv run python -m ft.run_pipeline \
  --train-file data/sft_train.jsonl \
  --test-file data/sft_test.jsonl \
  --test-run

You can also skip steps you've already completed:

# Skip data upload and job launch, just evaluate
uv run python -m ft.run_pipeline \
  --train-file data/sft_train.jsonl \
  --test-file data/sft_test.jsonl \
  --skip 1 2

The pipeline will:

  1. Upload data and launch fine-tuning jobs across your configured model/hyperparameter combinations
  2. Poll until all jobs complete
  3. Run each fine-tuned model on the test set
  4. Evaluate results and log metrics to Weights & Biases
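
Under the hood, steps 1 and 2 map onto standard OpenAI fine-tuning calls. A stripped-down sketch using the official Python SDK directly (single model, no hyperparameter sweep, no W&B logging):

import time
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

# Step 1: upload training data and launch one SFT job.
train_file = client.files.create(file=open("data/sft_train.jsonl", "rb"),
                                 purpose="fine-tune")
job = client.fine_tuning.jobs.create(training_file=train_file.id,
                                     model="gpt-4o-mini-2024-07-18")

# Step 2: poll until the job reaches a terminal state.
while job.status not in ("succeeded", "failed", "cancelled"):
    time.sleep(30)
    job = client.fine_tuning.jobs.retrieve(job.id)
print(job.status, job.fine_tuned_model)

The real pipeline does this for every combination in ft/training_configs.py and then hands the resulting models to the evaluators.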

Evaluation

Customain includes a pluggable evaluation framework. Evaluators are auto-discovered: drop a new one into ft/evaluation/evaluators/ and it runs automatically. An evaluator can be ML-based, statistical, or any other form you prefer. The ML-based and metric/statistical evaluators already implemented:

Evaluator              What it measures
authorship_classifier  CNN-based authorship probability score
tone_judge             LLM-as-judge scoring of tone & style fidelity
bleu                   N-gram overlap (BLEU score)
meteor                 Token-level alignment (METEOR score)
semantic_similarity    Embedding cosine similarity
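
As a rough illustration of what the classical metrics compute, here are bleu and semantic_similarity by hand, using nltk and the OpenAI embeddings API (the repo's actual implementations may differ):

import numpy as np
from nltk.translate.bleu_score import sentence_bleu
from openai import OpenAI

reference = "3pm works on my end, talk then!"
candidate = "3pm works for me, talk soon!"

# BLEU: n-gram overlap between the model's reply and the real reply
# (bigram weights, since these texts are short).
bleu = sentence_bleu([reference.split()], candidate.split(), weights=(0.5, 0.5))

# Semantic similarity: cosine similarity between sentence embeddings.
client = OpenAI()
emb = client.embeddings.create(model="text-embedding-3-small",
                               input=[reference, candidate])
a, b = (np.array(d.embedding) for d in emb.data)
print(bleu, a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))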

Configure which evaluators to skip in ft/training_configs.py:

skip_evaluators = ["bleu", "meteor"]  # skip the n-gram metrics; all other evaluators still run
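
The evaluator interface itself isn't documented here, so the class and method names below are hypothetical, but a drop-in statistical evaluator would look something like this:

# ft/evaluation/evaluators/length_ratio.py (hypothetical example)
class LengthRatioEvaluator:
    """Scores how closely generated replies match the reference length."""
    name = "length_ratio"

    def score(self, generated: str, reference: str) -> float:
        if not reference or not generated:
            return 0.0
        ratio = len(generated) / len(reference)
        return min(ratio, 1 / ratio)  # 1.0 when lengths match exactly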

Authorship Classifier

A character-level CNN text classifier trained to distinguish the author's writing from other people's. Unlike LLM-as-judge evaluators, it learns style patterns directly from data, so it avoids the reliability issues of LLM judges. Its current best checkpoint reaches 91% precision.
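
The repo's exact architecture isn't reproduced here; as an illustrative sketch of the idea, a character-level CNN in PyTorch (layer sizes hypothetical):

import torch
import torch.nn as nn

class CharCNN(nn.Module):
    def __init__(self, vocab_size=128, embed_dim=32, num_filters=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.conv = nn.Conv1d(embed_dim, num_filters, kernel_size=5, padding=2)
        self.head = nn.Linear(num_filters, 1)

    def forward(self, char_ids):              # (batch, seq_len) of char codes
        x = self.embed(char_ids)              # (batch, seq_len, embed_dim)
        x = torch.relu(self.conv(x.transpose(1, 2)))
        x = x.max(dim=2).values               # global max pool over positions
        return torch.sigmoid(self.head(x)).squeeze(-1)  # P(author wrote this)

model = CharCNN()
texts = torch.randint(0, 128, (4, 200))  # 4 texts as 200 ASCII char ids
print(model(texts))                      # four authorship probabilities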

# Prepare training data from existing SFT data
uv run python -m classifiers.authorship.prepare_data

# Train (logs to W&B under customain-classifiers)
uv run python -m classifiers.authorship.train \
  --train-data data/classifiers/authorship/train.jsonl \
  --val-data data/classifiers/authorship/val.jsonl

# The authorship_classifier evaluator auto-registers and uses the trained checkpoint

License

This project is licensed under the GNU Affero General Public License v3.0 (AGPLv3).
