Generate realistic, constraint-safe seed data for any database.
SeedKit connects to your PostgreSQL, MySQL, or SQLite database, reads the schema, and generates seed data that respects foreign keys, unique constraints, check constraints, and enum types -- all without copying production data.
seedkit generate --db postgres://localhost/myapp --rows 1000 --output seed.sql
Every backend developer needs test data, but the options are broken:
- Faker/factory_bot generate random gibberish with no schema awareness. Foreign keys break, unique constraints collide, and the data looks nothing like production.
- Copying production data is a compliance nightmare. 93% of organizations aren't privacy-compliant in testing.
- Snaplet (the best open-source option) shut down in July 2024.
SeedKit fills this gap. One command, realistic data, zero PII.
| Category | Feature |
|---|---|
| Databases | PostgreSQL, MySQL, and SQLite out of the box |
| Introspection | Auto-reads tables, columns, FKs, unique constraints, check constraints, enums |
| Classification | 50+ semantic types (Email, FirstName, Price, CreatedAt, etc.) via pattern matching |
| FK Safety | Topological ordering ensures parent rows exist before child rows reference them |
| Cycle Resolution | Detects circular FKs (Tarjan SCC), breaks cycles with deferred UPDATE statements |
| Correlations | city/state/zip stay consistent, created_at < updated_at, first+last derive full name |
| Determinism | Lock file (seedkit.lock) + seed guarantees identical output across machines |
| Custom Values | Weighted value lists via seedkit.toml config |
| Smart Sampling | Extract production distributions and generate data that mirrors real patterns (with PII masking) |
| LLM-Enhanced | Optional --ai flag sends schema to Claude/GPT for smarter classification |
| Output Formats | SQL (INSERT/COPY), JSON, CSV, or direct database insertion |
| CI Integration | seedkit check detects schema drift (exit code 0/1) |
| Visualization | seedkit graph exports Mermaid.js or Graphviz DOT dependency diagrams |
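The Correlations and Determinism rows above can be illustrated with a small sketch. This is not SeedKit's Rust implementation: the `make_user` and `generate_users` functions, the field names, and the name lists are all hypothetical. It only shows how a seeded generator can yield repeatable rows whose related fields stay consistent with each other.

```python
import random
from datetime import datetime, timedelta

# Hypothetical sample values; SeedKit's real providers are not shown here.
FIRST_NAMES = ["Ada", "Grace", "Linus", "Margaret"]
LAST_NAMES = ["Lovelace", "Hopper", "Torvalds", "Hamilton"]
BASE_TIME = datetime(2024, 1, 1)


def make_user(rng: random.Random) -> dict:
    """Build one row whose correlated fields agree with each other."""
    first = rng.choice(FIRST_NAMES)
    last = rng.choice(LAST_NAMES)
    created_at = BASE_TIME + timedelta(seconds=rng.randrange(365 * 24 * 3600))
    # updated_at is derived from created_at, so it can never precede it.
    updated_at = created_at + timedelta(seconds=rng.randrange(1, 30 * 24 * 3600))
    return {
        "first_name": first,
        "last_name": last,
        "full_name": f"{first} {last}",  # derived, not sampled independently
        "created_at": created_at,
        "updated_at": updated_at,
    }


def generate_users(seed: int, count: int) -> list:
    """The same seed and count always produce the same rows."""
    rng = random.Random(seed)  # private generator, no global state
    return [make_user(rng) for _ in range(count)]
```

Calling `generate_users(42, 100)` twice returns equal lists, which is the property a lock file plus seed is meant to preserve across machines.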
# Install from source
cargo install --path crates/seedkit-cli
# Generate 1000 rows per table, output SQL
seedkit generate --db postgres://localhost/myapp --rows 1000 --output seed.sql
# Insert directly into database
seedkit generate --db postgres://localhost/myapp --rows 1000
# Use .env or seedkit.toml for connection -- no --db needed
seedkit generate --rows 500 --output seed.sql
pip install seedkit
# or
pipx install seedkit
npm install -g @seed-kit/cli
# or run without installing
npx @seed-kit/cli --help
Download pre-built binaries for Linux (x64/ARM64), macOS (Intel/Apple Silicon), and Windows from Releases.
git clone https://github.com/kclaka/seedkit.git
cd seedkit
cargo install --path crates/seedkit-cli
Requirements: Rust 1.75+ (2021 edition)
seedkit --version
# seedkit 1.5.1
SeedKit automatically finds your database URL by checking (in order):
1. --db CLI flag
2. DATABASE_URL environment variable
3. .env file in the current directory
4. seedkit.toml config file
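The lookup order above amounts to a first-match search. The following is a minimal sketch of that precedence, not SeedKit's actual code: `resolve_database_url` and `parse_dotenv` are hypothetical helpers, and it assumes the .env file uses the same `DATABASE_URL` key as the environment variable and that the seedkit.toml value has already been read from `[database] url`.

```python
from typing import Mapping, Optional


def parse_dotenv(text: str) -> dict:
    """Minimal KEY=VALUE parser; skips blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def resolve_database_url(
    cli_db: Optional[str],
    environ: Mapping[str, str],
    dotenv_text: str = "",
    config_url: Optional[str] = None,
) -> Optional[str]:
    """Return the first URL found, checking sources in the documented order."""
    candidates = [
        cli_db,                                         # 1. --db CLI flag
        environ.get("DATABASE_URL"),                    # 2. environment variable
        parse_dotenv(dotenv_text).get("DATABASE_URL"),  # 3. .env file (key name assumed)
        config_url,                                     # 4. url from seedkit.toml
    ]
    for candidate in candidates:
        if candidate:
            return candidate
    return None
```

In practice the caller would pass `os.environ` and the contents of `.env`; taking them as arguments keeps the precedence logic easy to test.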
Generate seed data for your database.
# SQL file output
seedkit generate --db postgres://localhost/myapp --rows 500 --output seed.sql
# Direct insert into database
seedkit generate --db postgres://localhost/myapp --rows 1000
# JSON or CSV
seedkit generate --rows 100 --output data.json
seedkit generate --rows 100 --output data.csv
# PostgreSQL COPY format (10-50x faster bulk loading)
seedkit generate --rows 10000 --output seed.sql --copy
# Deterministic with seed
seedkit generate --rows 100 --seed 42 --output seed.sql
# Reproduce from lock file
seedkit generate --from-lock
# Per-table row counts
seedkit generate --rows 100 --table-rows users=500,orders=2000
# Include/exclude tables
seedkit generate --include users,orders --rows 100
seedkit generate --exclude audit_logs,migrations --rows 100
# LLM-enhanced classification (requires ANTHROPIC_API_KEY or OPENAI_API_KEY)
export ANTHROPIC_API_KEY=sk-ant-...
seedkit generate --rows 100 --ai --output seed.sql
seedkit generate --rows 100 --ai --model claude-opus-4-20250514 --output seed.sql
# Production-like with sampled distributions
seedkit generate --rows 1000 --subset seedkit.distributions.json
Extract statistical distributions from a production database (read-only replica recommended). Automatically masks PII columns.
# Sample all tables
seedkit sample --db postgres://readonly-replica:5432/myapp
# Sample specific tables with custom limits
seedkit sample --db postgres://localhost/myapp --tables users,orders --categorical-limit 100
# Custom output path
seedkit sample --db postgres://localhost/myapp -o profiles.json
Unless a custom output path is given, this creates seedkit.distributions.json with:
- Categorical distributions -- value frequencies for text/enum columns (PII columns auto-masked)
- Numeric distributions -- min, max, mean, stddev for numeric columns
- FK ratios -- child-to-parent row count ratios (e.g., 3.2 orders per user)
Then pass the file to seedkit generate --subset seedkit.distributions.json to produce data that mirrors production patterns.
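To make the three kinds of statistics concrete, here is a rough sketch of how each could be computed from column values. The function names are hypothetical, the use of population standard deviation is an assumption, and PII masking is left out; the actual layout of seedkit.distributions.json is not shown in this README.

```python
import statistics
from collections import Counter


def categorical_distribution(values, limit=100):
    """Relative frequency of the `limit` most common values (values must be non-empty)."""
    counts = Counter(values)
    total = sum(counts.values())
    return {value: n / total for value, n in counts.most_common(limit)}


def numeric_distribution(values):
    """Summary statistics for a non-empty numeric column."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.fmean(values),
        "stddev": statistics.pstdev(values),  # population stddev (an assumption)
    }


def fk_ratio(child_fk_values, parent_row_count):
    """Average number of child rows per parent row."""
    return len(child_fk_values) / parent_row_count
```

For example, 16 order rows referencing 5 users give `fk_ratio(...) == 3.2`, matching the "3.2 orders per user" figure used above.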
Analyze your database schema and show classification results.
seedkit introspect --db postgres://localhost/myapp
seedkit introspect --db postgres://localhost/myapp --format json
Preview a few sample rows without generating a full dataset.
seedkit preview --db postgres://localhost/myapp --rows 5
Detect schema drift against the lock file. Designed for CI pipelines.
seedkit check --db postgres://localhost/myapp
# Exit code 0 = no drift, 1 = drift detected
seedkit check --db postgres://localhost/myapp --format json
Visualize table dependencies.
seedkit graph --db postgres://localhost/myapp --format mermaid > schema.mmd
seedkit graph --db postgres://localhost/myapp --format dot | dot -Tpng > schema.png
Create a seedkit.toml in your project root:
[database]
url = "postgres://localhost/myapp"
[generate]
rows = 500
seed = 42
[tables.users]
rows = 1000
[tables.orders]
rows = 5000
# Custom value lists with optional weights
[columns."products.color"]
values = ["red", "blue", "green", "black", "white"]
weights = [0.25, 0.20, 0.20, 0.20, 0.15]
# Explicit cycle-breaking for circular foreign keys
[graph]
break_cycle_at = ["users.invited_by_id", "comments.parent_id"]
seedkit generate
      |
[1] Introspect   | Connect to DB, read information_schema
      v
DatabaseSchema
      |
[2] Graph        | Build FK dependency graph (petgraph)
                 | Detect cycles (Tarjan SCC)
                 | Break cycles, topological sort
      v
Insertion Order
      |
[3] Classify     | 50+ regex rules match column names
                 | Optional LLM pass (--ai flag)
                 | Optional distribution profiles (--subset)
      v
SemanticTypes
      |
[4] Generate     | Row-by-row, FK-safe, unique-safe
                 | Correlated groups, check constraints
                 | Distribution-aware (normal, categorical)
      v
Generated Data
      |
[5] Output       | SQL / JSON / CSV / Direct Insert
      v
seed.sql
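Step [2] can be sketched in a few lines. Note the substitution: SeedKit is described as using petgraph with Tarjan SCC, whereas this illustration uses Kahn's algorithm and only honors cycle breaks that are listed explicitly, in the style of `break_cycle_at`. The `insertion_order` function and its argument shapes are hypothetical.

```python
from collections import deque


def insertion_order(tables, foreign_keys, break_cycle_at=()):
    """Order tables so that parents come before the children referencing them.

    foreign_keys: iterable of (child_table, column, parent_table).
    break_cycle_at: "table.column" names whose FK is filled in later by UPDATE.
    Returns (ordered_tables, deferred_foreign_keys).
    """
    deferred = []
    children = {t: [] for t in tables}  # parent -> tables that reference it
    pending = {t: 0 for t in tables}    # table -> number of unmet parents
    for child, column, parent in foreign_keys:
        if f"{child}.{column}" in break_cycle_at:
            deferred.append((child, column, parent))  # insert NULL now, UPDATE later
            continue
        children[parent].append(child)
        pending[child] += 1

    # Kahn's algorithm: repeatedly emit tables with no unmet parents.
    ready = deque(t for t in tables if pending[t] == 0)
    ordered = []
    while ready:
        table = ready.popleft()
        ordered.append(table)
        for child in children[table]:
            pending[child] -= 1
            if pending[child] == 0:
                ready.append(child)

    if len(ordered) != len(tables):
        stuck = sorted(t for t in tables if pending[t] > 0)
        raise ValueError(f"unbroken foreign-key cycle among: {stuck}")
    return ordered, deferred
```

With the two self-referencing columns from the config example deferred, `users` is ordered before `orders` and `comments`; without them, the function raises because a self-reference is a cycle of length one.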
seedkit.lock works like package-lock.json. It captures the schema snapshot, random seed, and all configuration so teammates can reproduce the exact same dataset:
# Generate (creates seedkit.lock)
seedkit generate --rows 100
# Teammate reproduces identical data
seedkit generate --from-lock
If seedkit.lock has a merge conflict, don't resolve it by hand. Keep one side and regenerate:
git checkout --ours seedkit.lock
seedkit generate --force
Benchmarked with criterion on Apple Silicon (M-series). Run cargo bench to reproduce.
| Operation | Throughput |
|---|---|
| Generation (10 cols, semantic providers) | ~480K rows/sec |
| Generation (FK references only) | ~3.7M rows/sec |
| Generation (weighted value lists) | ~6.9M rows/sec |
| Generation (distribution sampling) | ~8.6M rows/sec |
| Classification (100 tables x 20 cols) | ~2.1M cols/sec |
| SQL output formatting | ~1.5M rows/sec |
| JSON output formatting | ~1.1M rows/sec |
| CSV output formatting | ~1.5M rows/sec |
| Feature | SeedKit | Faker/factory_bot | Snaplet |
|---|---|---|---|
| Schema-aware | Yes | No | Yes (shut down) |
| Multi-database | PG + MySQL + SQLite | N/A | PG only |
| FK resolution | Automatic | Manual | Automatic |
| Circular FK handling | Tarjan SCC + deferral | N/A | Manual |
| Deterministic | Seed + lock file | Seed only | No |
| Custom values | TOML config | Code | Code |
| Smart sampling | Production distributions | No | No |
| LLM-enhanced | Optional --ai | No | No |
| CI integration | seedkit check | N/A | No |
| Privacy | Synthetic + PII masking | Synthetic | Copies prod |
SeedKit is a Rust workspace with three crates:
seedkit/
  crates/
    seedkit-core/      # Library: introspection, graph, classification, generation, output, sampling
    seedkit-cli/       # Binary: clap-based CLI with 6 subcommands
    seedkit-testutil/  # Shared test helpers
  tests/
    fixtures/          # SQL schema fixtures for integration tests
Test suite: 221 tests (201 unit + 13 PostgreSQL integration + 7 MySQL integration)
See CONTRIBUTING.md for development setup, testing, and PR guidelines.
# Run the full test suite
cargo test
# Run integration tests (requires Docker)
docker compose -f docker/docker-compose.test.yml up -d
TEST_POSTGRES_URL=postgres://seedkit:seedkit@localhost:5432/seedkit_test \
TEST_MYSQL_URL=mysql://seedkit:seedkit@localhost:3307/seedkit_test \
cargo test --test '*' -- --test-threads=1
Licensed under the MIT License.