
Optical Context Compression Is Just (Bad) Autoencoding

arXiv: https://arxiv.org/abs/2512.03643 · HuggingFace

[Figure: Optical Context Compression architecture]

Note: checkpoint uploads are in progress.

Code for reproducing the experiments from our paper, which investigates whether vision-based context compression (as proposed in DeepSeek-OCR) can serve as an effective mechanism for compressing LLM context.

Installation

This project uses uv for dependency management and requires Python 3.12+.

# Install uv if you haven't already
curl -LsSf https://astral.sh/uv/install.sh | sh

# Clone the repo
git clone https://github.com/ivnle/bad-autoencoding.git
cd bad-autoencoding

Dependencies are installed automatically the first time you invoke uv run.
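As a quick sanity check (assuming PyTorch is among the locked dependencies), the first uv run both builds the environment and executes the command:

# First run resolves pyproject.toml and installs the environment;
# later runs reuse the cache. (Assumes torch is a project dependency.)
uv run python -c "import torch; print(torch.__version__)"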

Quick Start

1. Prepare Data

Disk space required: ~920GB total (37GB download + 57GB filtered + 828GB rendered images)

See data/README.md for the full data preparation pipeline. In brief:

# 1. Download FineWiki dataset (~37GB)
uv run python data/scripts/download_finewiki.py

# 2. Filter articles to ≥2000 tokens (~57GB)
uv run python data/scripts/filter_finewiki.py

# 3. Chunk (1000+1000 tokens) and render images (~828GB for full dataset)
uv run python data/scripts/prepare_training_data.py \
    --input_jsonl data/filtered/finewiki_filtered.jsonl \
    --output_dir data/training/full \
    --num_workers 16

# 4. Create train/val splits
uv run python data/scripts/create_quick_splits.py \
    --input data/training/full/samples.jsonl \
    --output_dir data/training/full \
    --max_samples 510000 \
    --val_size 10000

2. Train

Model weights are automatically downloaded from HuggingFace (deepseek-ai/DeepSeek-OCR) on first run.
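If you would rather fetch the weights ahead of time (e.g. on a login node before an offline training job), the huggingface_hub CLI can populate the local cache; this assumes the huggingface-cli tool is available in the environment:

# Optional: pre-download the DeepSeek-OCR checkpoint into the local HF cache
uv run huggingface-cli download deepseek-ai/DeepSeek-OCR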

# Vision compression (main paper result)
uv run python train.py \
    --regime vision \
    --vision_mode small \
    --objective lm \
    --data_path ./data/training/full/train.jsonl \
    --val_data_path ./data/training/full/val.jsonl \
    --output_dir ./outputs/vision_small_lm \
    --use_wandb

# Text baseline (no compression)
uv run python train.py \
    --regime text \
    --objective lm \
    --data_path ./data/training/full/train.jsonl \
    --val_data_path ./data/training/full/val.jsonl \
    --output_dir ./outputs/text_baseline_lm \
    --use_wandb

See scripts/run_example.sh for more training configurations, or run uv run python train.py --help for all options.

Available Regimes

Regime           Description
vision           DeepSeek-OCR vision compression (renders text as image)
text             Text baseline (no compression)
meanpool         Sliding window mean pooling
conv1d_residual  Conv1d with residual skip connections
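The baseline regimes take the same flags as the Quick Start commands; a minimal meanpool run might look like the sketch below. The pooling regimes may expose additional options (e.g. window size), so check python train.py --help:

uv run python train.py \
    --regime meanpool \
    --objective lm \
    --data_path ./data/training/full/train.jsonl \
    --val_data_path ./data/training/full/val.jsonl \
    --output_dir ./outputs/meanpool_lm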

Text Regime Truncation

The text regime supports context truncation via --text_context_tokens N:

  • Keeps the last N tokens of context (most recent)
  • Default (omit flag) = full context, no truncation
  • --text_context_tokens 0 = empty context (target-only)

Example:

uv run python train.py \
    --regime text \
    --text_context_tokens 512 \
    --data_path ./data/training/full/train.jsonl \
    --val_data_path ./data/training/full/val.jsonl \
    --output_dir ./outputs/text_truncated_512
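The target-only ablation is the same command with the context dropped entirely:

uv run python train.py \
    --regime text \
    --text_context_tokens 0 \
    --data_path ./data/training/full/train.jsonl \
    --val_data_path ./data/training/full/val.jsonl \
    --output_dir ./outputs/text_target_only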

Reproducing Paper Results

The paper results used these hyperparameters:

Parameter             Value
Effective batch size  48
Learning rate         1e-4
Weight decay          0.01
Warmup ratio          0.1
Max grad norm         1.0
Dataset               510k samples (500k train + 10k val)

To match effective batch size on your hardware:

  • 80GB GPU (A100): --batch_size 16 --gradient_accumulation_steps 3
  • 48GB GPU (A6000): --batch_size 8 --gradient_accumulation_steps 6
  • 24GB GPU (4090): --batch_size 1 --gradient_accumulation_steps 48

The Quick Start examples use conservative settings (batch_size=1) for compatibility with smaller GPUs.
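Putting this together, a paper-style vision run on a single A100 might look like the sketch below. Only --batch_size and --gradient_accumulation_steps are documented above; the optimizer flag spellings (--learning_rate, --weight_decay, --warmup_ratio, --max_grad_norm) are assumptions, so verify them against python train.py --help:

# Sketch of a paper-configuration run on one 80GB A100.
# Optimizer flag names below are assumed; confirm with train.py --help.
uv run python train.py \
    --regime vision \
    --vision_mode small \
    --objective lm \
    --batch_size 16 \
    --gradient_accumulation_steps 3 \
    --learning_rate 1e-4 \
    --weight_decay 0.01 \
    --warmup_ratio 0.1 \
    --max_grad_norm 1.0 \
    --data_path ./data/training/full/train.jsonl \
    --val_data_path ./data/training/full/val.jsonl \
    --output_dir ./outputs/vision_small_lm_paper \
    --use_wandb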

Requirements

  • GPU: 24-32GB VRAM recommended (memory optimizations enabled by default)
  • Python: 3.12+
  • Dependencies: Managed via uv (see pyproject.toml)

Weights & Biases

Training metrics are logged to W&B when --use_wandb is passed. Configure with:

export WANDB_PROJECT="bad-autoencoding"  # Optional, defaults to bad-autoencoding

Vendor Modifications

This repository includes a modified version of DeepSeek-OCR (vendor/deepseek_ocr/) with two changes:

  1. Trainable encoder support: Allows gradient flow through the vision encoder (SAM + ViT + Projector) during training. Enable it with the --train_encoder flag (see the example after this list).

  2. Conditional encoder loading: Saves ~2GB GPU memory for text-only regimes by skipping vision encoder initialization when not needed.
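For example, to let gradients flow through the encoder during a vision-regime run (all other flags as in Quick Start):

# Fine-tune the vision encoder (SAM + ViT + Projector) jointly with the decoder
uv run python train.py \
    --regime vision \
    --vision_mode small \
    --objective lm \
    --train_encoder \
    --data_path ./data/training/full/train.jsonl \
    --val_data_path ./data/training/full/val.jsonl \
    --output_dir ./outputs/vision_small_lm_trainable_encoder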

Citation

@misc{lee2025opticalcontextcompressionjust,
      title={Optical Context Compression Is Just (Bad) Autoencoding},
      author={Ivan Yee Lee and Cheng Yang and Taylor Berg-Kirkpatrick},
      year={2025},
      eprint={2512.03643},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2512.03643},
}

License

This code is released under the MIT License. Note that the DeepSeek-OCR model weights have their own license; see the DeepSeek-OCR repository for details.
