SeedKit

Generate realistic, constraint-safe seed data for PostgreSQL, MySQL, and SQLite databases.

SeedKit connects to your PostgreSQL, MySQL, or SQLite database, reads the schema, and generates seed data that respects foreign keys, unique constraints, check constraints, and enum types -- all without copying production data.

seedkit generate --db postgres://localhost/myapp --rows 1000 --output seed.sql

Why SeedKit?

Every backend developer needs test data, but the options are broken:

Faker/factory_bot generate random gibberish with no schema awareness. Foreign keys break, unique constraints collide, and the data looks nothing like production.
Copying production data is a compliance nightmare. 93% of organizations aren't privacy-compliant in testing.
Snaplet (the best open-source option) shut down in July 2024.

SeedKit fills this gap. One command, realistic data, zero PII.

Features

Category	Feature
Databases	PostgreSQL, MySQL, and SQLite out of the box
Introspection	Auto-reads tables, columns, FKs, unique constraints, check constraints, enums
Classification	50+ semantic types (Email, FirstName, Price, CreatedAt, etc.) via pattern matching
FK Safety	Topological ordering ensures parent rows exist before child rows reference them
Cycle Resolution	Detects circular FKs (Tarjan SCC), breaks cycles with deferred `UPDATE` statements
Correlations	city/state/zip stay consistent, `created_at < updated_at`, first+last derive full name
Determinism	Lock file (`seedkit.lock`) + seed guarantees identical output across machines
Custom Values	Weighted value lists via `seedkit.toml` config
Smart Sampling	Extract production distributions and generate data that mirrors real patterns (with PII masking)
LLM-Enhanced	Optional `--ai` flag sends schema to Claude/GPT for smarter classification
Output Formats	SQL (`INSERT`/`COPY`), JSON, CSV, or direct database insertion
CI Integration	`seedkit check` detects schema drift (exit code 0/1)
Visualization	`seedkit graph` exports Mermaid.js or Graphviz DOT dependency diagrams
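
To make the Correlations row concrete, here is a minimal Python sketch of correlated column generation. It is an illustration only, not SeedKit's Rust implementation: the `LOCATIONS` table, the name lists, and the `make_user_row` helper are all hypothetical.

```python
import random
from datetime import datetime, timedelta

# Hypothetical lookup table: each entry keeps city, state, and zip consistent.
LOCATIONS = [
    ("Austin", "TX", "73301"),
    ("Denver", "CO", "80202"),
    ("Portland", "OR", "97201"),
]

def make_user_row(rng: random.Random) -> dict:
    """Build one row whose correlated columns agree with each other."""
    first = rng.choice(["Ada", "Grace", "Linus"])
    last = rng.choice(["Lovelace", "Hopper", "Torvalds"])
    # City, state, and zip are drawn together, never mixed across entries.
    city, state, zip_code = rng.choice(LOCATIONS)
    created_at = datetime(2024, 1, 1) + timedelta(seconds=rng.randrange(86_400 * 365))
    # updated_at is created_at plus a strictly positive offset.
    updated_at = created_at + timedelta(seconds=rng.randrange(1, 86_400 * 30))
    return {
        "first_name": first,
        "last_name": last,
        "full_name": f"{first} {last}",  # derived, not generated independently
        "city": city,
        "state": state,
        "zip": zip_code,
        "created_at": created_at,
        "updated_at": updated_at,
    }

row = make_user_row(random.Random(42))
```

The point of the sketch is that correlated columns are generated as a group from one draw, rather than column by column.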

Quick Start

# Install from source
cargo install --path crates/seedkit-cli

# Generate 1000 rows per table, output SQL
seedkit generate --db postgres://localhost/myapp --rows 1000 --output seed.sql

# Insert directly into database
seedkit generate --db postgres://localhost/myapp --rows 1000

# Use .env or seedkit.toml for connection -- no --db needed
seedkit generate --rows 500 --output seed.sql

Installation

PyPI (recommended)

pip install seedkit
# or
pipx install seedkit

npm

npm install -g @seed-kit/cli
# or run without installing
npx @seed-kit/cli --help

GitHub Releases

Download pre-built binaries for Linux (x64/ARM64), macOS (Intel/Apple Silicon), and Windows from the GitHub Releases page.

From Source

git clone https://github.com/kclaka/seedkit.git
cd seedkit
cargo install --path crates/seedkit-cli

Requirements: Rust 1.75+ (2021 edition)

Verify Installation

seedkit --version
# seedkit 1.5.1

Zero-Config Database Detection

SeedKit automatically finds your database URL by checking (in order):

1. `--db` CLI flag
2. `DATABASE_URL` environment variable
3. `.env` file in the current directory
4. `seedkit.toml` config file
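
The precedence above amounts to "first non-empty source wins". A minimal Python sketch of that rule follows. It is not SeedKit's code: the `resolve_db_url` function is hypothetical, and it assumes the `.env` file uses the same `DATABASE_URL` key as the environment variable.

```python
from typing import Optional

def resolve_db_url(cli_flag: Optional[str], environ: dict,
                   dotenv: dict, toml_url: Optional[str]) -> Optional[str]:
    """Return the first URL found, checking sources in precedence order."""
    candidates = [
        cli_flag,                     # 1. --db CLI flag
        environ.get("DATABASE_URL"),  # 2. environment variable
        dotenv.get("DATABASE_URL"),   # 3. .env file (key name is an assumption)
        toml_url,                     # 4. [database] url in seedkit.toml
    ]
    # Empty strings and None are both treated as "not set".
    return next((c for c in candidates if c), None)

url = resolve_db_url(
    cli_flag=None,
    environ={"DATABASE_URL": "postgres://localhost/from_env"},
    dotenv={"DATABASE_URL": "postgres://localhost/from_dotenv"},
    toml_url="postgres://localhost/from_toml",
)
print(url)
```

Here the environment variable wins because no `--db` value was given.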

Usage

`seedkit generate`

Generate seed data for your database.

# SQL file output
seedkit generate --db postgres://localhost/myapp --rows 500 --output seed.sql

# Direct insert into database
seedkit generate --db postgres://localhost/myapp --rows 1000

# JSON or CSV
seedkit generate --rows 100 --output data.json
seedkit generate --rows 100 --output data.csv

# PostgreSQL COPY format (10-50x faster bulk loading)
seedkit generate --rows 10000 --output seed.sql --copy

# Deterministic with seed
seedkit generate --rows 100 --seed 42 --output seed.sql

# Reproduce from lock file
seedkit generate --from-lock

# Per-table row counts
seedkit generate --rows 100 --table-rows users=500,orders=2000

# Include/exclude tables
seedkit generate --include users,orders --rows 100
seedkit generate --exclude audit_logs,migrations --rows 100

# LLM-enhanced classification (requires ANTHROPIC_API_KEY or OPENAI_API_KEY)
export ANTHROPIC_API_KEY=sk-ant-...
seedkit generate --rows 100 --ai --output seed.sql
seedkit generate --rows 100 --ai --model claude-opus-4-20250514 --output seed.sql

# Production-like with sampled distributions
seedkit generate --rows 1000 --subset seedkit.distributions.json

`seedkit sample`

Extract statistical distributions from a production database (a read-only replica is recommended). PII columns are masked automatically.

# Sample all tables
seedkit sample --db postgres://readonly-replica:5432/myapp

# Sample specific tables with custom limits
seedkit sample --db postgres://localhost/myapp --tables users,orders --categorical-limit 100

# Custom output path
seedkit sample --db postgres://localhost/myapp -o profiles.json

By default, this writes `seedkit.distributions.json`, which contains:

Categorical distributions -- value frequencies for text/enum columns (PII columns auto-masked)
Numeric distributions -- min, max, mean, stddev for numeric columns
FK ratios -- child-to-parent row count ratios (e.g., 3.2 orders per user)

Pass the file to `seedkit generate --subset seedkit.distributions.json` to produce data that mirrors production patterns.
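
As a rough picture of what the three kinds of statistics mean, the Python sketch below computes them from in-memory values. The function names are hypothetical, the real file layout of `seedkit.distributions.json` is not shown in this README, and PII masking is omitted.

```python
import statistics
from collections import Counter

def categorical_profile(values, limit=100):
    """Relative frequency of the most common values (at most `limit` of them)."""
    counts = Counter(values)
    total = sum(counts.values())
    return {value: n / total for value, n in counts.most_common(limit)}

def numeric_profile(values):
    """Summary statistics that could drive a numeric generator."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.fmean(values),
        "stddev": statistics.pstdev(values),  # population standard deviation
    }

def fk_ratio(child_rows, parent_rows):
    """Average number of child rows per parent row."""
    return child_rows / parent_rows

status_profile = categorical_profile(["paid", "paid", "refunded", "paid"])
total_profile = numeric_profile([10.0, 20.0, 30.0])
orders_per_user = fk_ratio(3200, 1000)
```

Only aggregates like these leave the production database in this sketch; no individual row is copied.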

`seedkit introspect`

Analyze your database schema and show classification results.

seedkit introspect --db postgres://localhost/myapp
seedkit introspect --db postgres://localhost/myapp --format json
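
Classification by pattern matching can be pictured with a short Python sketch. The four rules below are made up for illustration and stand in for SeedKit's 50+ rules, which are not reproduced here; only the semantic type names come from the feature list.

```python
import re

# Hypothetical rules: the first matching pattern wins.
RULES = [
    (re.compile(r"(^|_)e?mail($|_)"), "Email"),
    (re.compile(r"^first_?name$"), "FirstName"),
    (re.compile(r"(^|_)price($|_)"), "Price"),
    (re.compile(r"^created_(at|on)$"), "CreatedAt"),
]

def classify(column_name: str) -> str:
    """Map a column name to a semantic type, or "Unknown" if no rule matches."""
    name = column_name.lower()
    for pattern, semantic_type in RULES:
        if pattern.search(name):
            return semantic_type
    return "Unknown"

for column in ["user_email", "first_name", "unit_price", "created_at", "notes"]:
    print(column, "->", classify(column))
```

Columns that fall through to "Unknown" are the ones an optional `--ai` pass or a custom value list would be most useful for.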

`seedkit preview`

Preview a few sample rows without generating a full dataset.

seedkit preview --db postgres://localhost/myapp --rows 5

`seedkit check`

Detect schema drift against the lock file. Designed for CI pipelines.

seedkit check --db postgres://localhost/myapp
# Exit code 0 = no drift, 1 = drift detected

seedkit check --db postgres://localhost/myapp --format json
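
One common way to detect drift is to compare a fingerprint of the locked schema with a fingerprint of the live one. The Python sketch below shows that idea under stated assumptions: the schema dictionaries and both functions are hypothetical, and SeedKit's actual lock format and comparison logic may differ.

```python
import hashlib
import json

def schema_fingerprint(schema: dict) -> str:
    """Hash a canonical JSON encoding so that key order does not matter."""
    canonical = json.dumps(schema, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def check_drift(locked: dict, live: dict) -> int:
    """Return 0 when the schemas match and 1 when they differ."""
    return 0 if schema_fingerprint(locked) == schema_fingerprint(live) else 1

locked = {"users": {"id": "integer", "email": "text"}}
live = {"users": {"email": "text", "id": "integer", "phone": "text"}}

exit_code = check_drift(locked, live)  # the live schema gained a column
```

A CI job would then fail the build on a non-zero result, mirroring the 0/1 exit codes of `seedkit check`.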

`seedkit graph`

Visualize table dependencies.

seedkit graph --db postgres://localhost/myapp --format mermaid > schema.mmd
seedkit graph --db postgres://localhost/myapp --format dot | dot -Tpng > schema.png
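
For readers unfamiliar with Mermaid, the Python sketch below renders a list of foreign-key edges as a Mermaid flowchart. The table names, the arrow direction (parent to child), and the `to_mermaid` helper are assumptions; the exact text that `seedkit graph` emits is not shown in this README.

```python
def to_mermaid(edges):
    """Render (parent, child) pairs as a Mermaid top-down flowchart."""
    lines = ["graph TD"]
    for parent, child in sorted(edges):
        lines.append(f"    {parent} --> {child}")
    return "\n".join(lines)

fk_edges = [
    ("users", "orders"),
    ("orders", "order_items"),
    ("products", "order_items"),
]
print(to_mermaid(fk_edges))
```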

Configuration

Create a `seedkit.toml` in your project root:

[database]
url = "postgres://localhost/myapp"

[generate]
rows = 500
seed = 42

[tables.users]
rows = 1000

[tables.orders]
rows = 5000

# Custom value lists with optional weights
[columns."products.color"]
values = ["red", "blue", "green", "black", "white"]
weights = [0.25, 0.20, 0.20, 0.20, 0.15]

# Explicit cycle-breaking for circular foreign keys
[graph]
break_cycle_at = ["users.invited_by_id", "comments.parent_id"]

How It Works

                    seedkit generate
                          |
          [1] Introspect  |  Connect to DB, read information_schema
                          v
                   DatabaseSchema
                          |
          [2] Graph       |  Build FK dependency graph (petgraph)
                          |  Detect cycles (Tarjan SCC)
                          |  Break cycles, topological sort
                          v
                  Insertion Order
                          |
          [3] Classify    |  50+ regex rules match column names
                          |  Optional LLM pass (--ai flag)
                          |  Optional distribution profiles (--subset)
                          v
                  SemanticTypes
                          |
          [4] Generate    |  Row-by-row, FK-safe, unique-safe
                          |  Correlated groups, check constraints
                          |  Distribution-aware (normal, categorical)
                          v
                  Generated Data
                          |
          [5] Output      |  SQL / JSON / CSV / Direct Insert
                          v
                    seed.sql
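
Step [2] can be sketched in a few lines of Python. This is a simplified stand-in, not the SeedKit pipeline: it uses the standard library's `graphlib` cycle detection instead of Tarjan SCC via petgraph, and the `FOREIGN_KEYS` schema and `insertion_order` function are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical schema: child table -> set of (fk_column, parent_table).
FOREIGN_KEYS = {
    "users": {("invited_by_id", "users")},  # self-reference
    "orders": {("user_id", "users")},
    "comments": {("user_id", "users"), ("parent_id", "comments")},
}

def insertion_order(foreign_keys, break_cycle_at=()):
    """Return (order, deferred): tables parents-first, plus FKs left for UPDATEs."""
    deferred = set(break_cycle_at)
    graph = {}
    for table, fks in foreign_keys.items():
        # Keep only the FK edges that are not deferred.
        graph[table] = {
            parent for column, parent in fks
            if f"{table}.{column}" not in deferred
        }
    order = list(TopologicalSorter(graph).static_order())
    return order, sorted(deferred)

try:
    insertion_order(FOREIGN_KEYS)
    cycle_found = False
except CycleError:
    cycle_found = True  # self-references count as cycles

order, deferred = insertion_order(
    FOREIGN_KEYS,
    break_cycle_at=["users.invited_by_id", "comments.parent_id"],
)
```

In this sketch the deferred columns would be inserted as NULL and filled in by `UPDATE` statements once every table has rows.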

Lock File

`seedkit.lock` works like `package-lock.json`. It captures the schema snapshot, random seed, and all configuration so teammates can reproduce the exact same dataset:

# Generate (creates seedkit.lock)
seedkit generate --rows 100

# Teammate reproduces identical data
seedkit generate --from-lock
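
The reason a stored seed is enough to reproduce a dataset can be shown with a small Python sketch. It is a toy generator, not SeedKit's: `generate_rows` and its columns are hypothetical.

```python
import random

def generate_rows(seed: int, n: int) -> list:
    """Generate n toy rows from a private RNG seeded with `seed`."""
    rng = random.Random(seed)  # private instance; global random state is untouched
    return [
        {"id": i + 1, "quantity": rng.randint(1, 10), "score": rng.random()}
        for i in range(n)
    ]

first_run = generate_rows(seed=42, n=5)
second_run = generate_rows(seed=42, n=5)
other_seed = generate_rows(seed=43, n=5)

print(first_run == second_run)  # same seed, same rows
```

Reproducibility also depends on the schema and configuration being unchanged, which is why the lock file records those alongside the seed.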

If there's a merge conflict in `seedkit.lock`, don't resolve it by hand. Keep one side and regenerate:

git checkout --ours seedkit.lock
seedkit generate --force

Performance

Benchmarked with criterion on Apple Silicon (M-series). Run `cargo bench` to reproduce.

Operation	Throughput
Generation (10 cols, semantic providers)	~480K rows/sec
Generation (FK references only)	~3.7M rows/sec
Generation (weighted value lists)	~6.9M rows/sec
Generation (distribution sampling)	~8.6M rows/sec
Classification (100 tables x 20 cols)	~2.1M cols/sec
SQL output formatting	~1.5M rows/sec
JSON output formatting	~1.1M rows/sec
CSV output formatting	~1.5M rows/sec

Comparison

Feature	SeedKit	Faker/factory_bot	Snaplet
Schema-aware	Yes	No	Yes (shut down)
Multi-database	PG + MySQL + SQLite	N/A	PG only
FK resolution	Automatic	Manual	Automatic
Circular FK handling	Tarjan SCC + deferral	N/A	Manual
Deterministic	Seed + lock file	Seed only	No
Custom values	TOML config	Code	Code
Smart sampling	Production distributions	No	No
LLM-enhanced	Optional --ai	No	No
CI integration	`seedkit check`	N/A	No
Privacy	Synthetic + PII masking	Synthetic	Copies prod

Architecture

SeedKit is a Rust workspace with three crates:

seedkit/
  crates/
    seedkit-core/     # Library: introspection, graph, classification, generation, output, sampling
    seedkit-cli/      # Binary: clap-based CLI with 6 subcommands
    seedkit-testutil/  # Shared test helpers
  tests/
    fixtures/         # SQL schema fixtures for integration tests

Test suite: 221 tests (201 unit + 13 PostgreSQL integration + 7 MySQL integration)

Contributing

See CONTRIBUTING.md for development setup, testing, and PR guidelines.

# Run the full test suite
cargo test

# Run integration tests (requires Docker)
docker compose -f docker/docker-compose.test.yml up -d
TEST_POSTGRES_URL=postgres://seedkit:seedkit@localhost:5432/seedkit_test \
TEST_MYSQL_URL=mysql://seedkit:seedkit@localhost:3307/seedkit_test \
  cargo test --test '*' -- --test-threads=1

License

Licensed under the MIT License.
