SeedKit

Generate realistic, constraint-safe seed data for any database.


SeedKit connects to your PostgreSQL, MySQL, or SQLite database, reads the schema, and generates seed data that respects foreign keys, unique constraints, check constraints, and enum types -- all without copying production data.

seedkit generate --db postgres://localhost/myapp --rows 1000 --output seed.sql

Why SeedKit?

Every backend developer needs test data, but the options are broken:

  • Faker/factory_bot generate random gibberish with no schema awareness. Foreign keys break, unique constraints collide, and the data looks nothing like production.
  • Copying production data is a compliance nightmare. 93% of organizations aren't privacy-compliant in testing.
  • Snaplet (the best open-source option) shut down in July 2024.

SeedKit fills this gap. One command, realistic data, zero PII.

Features

| Category | Feature |
|---|---|
| Databases | PostgreSQL, MySQL, and SQLite out of the box |
| Introspection | Auto-reads tables, columns, FKs, unique constraints, check constraints, enums |
| Classification | 50+ semantic types (Email, FirstName, Price, CreatedAt, etc.) via pattern matching |
| FK Safety | Topological ordering ensures parent rows exist before child rows reference them |
| Cycle Resolution | Detects circular FKs (Tarjan SCC), breaks cycles with deferred UPDATE statements |
| Correlations | city/state/zip stay consistent, created_at < updated_at, first+last derive full name |
| Determinism | Lock file (seedkit.lock) + seed guarantees identical output across machines |
| Custom Values | Weighted value lists via seedkit.toml config |
| Smart Sampling | Extract production distributions and generate data that mirrors real patterns (with PII masking) |
| LLM-Enhanced | Optional --ai flag sends schema to Claude/GPT for smarter classification |
| Output Formats | SQL (INSERT/COPY), JSON, CSV, or direct database insertion |
| CI Integration | seedkit check detects schema drift (exit code 0/1) |
| Visualization | seedkit graph exports Mermaid.js or Graphviz DOT dependency diagrams |
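
To make the Correlations row concrete, here is a minimal Python sketch of generating one row whose related columns stay consistent. It is an illustration only, not SeedKit's Rust implementation: the function name `correlated_row`, the sample names, and the `LOCATIONS` list are all hypothetical.

```python
import random
from datetime import datetime, timedelta

# Hypothetical lookup table: each entry keeps city, state, and zip together.
LOCATIONS = [
    ("Austin", "TX", "78701"),
    ("Seattle", "WA", "98101"),
    ("Boston", "MA", "02108"),
]


def correlated_row(rng: random.Random) -> dict:
    """Build one row in which related columns are derived together."""
    first = rng.choice(["Ada", "Grace", "Linus"])
    last = rng.choice(["Lovelace", "Hopper", "Torvalds"])
    # Pick the location as a unit so the three columns cannot disagree.
    city, state, zip_code = rng.choice(LOCATIONS)
    # Draw created_at first, then add a strictly positive offset for updated_at.
    created_at = datetime(2024, 1, 1) + timedelta(seconds=rng.randrange(365 * 86400))
    updated_at = created_at + timedelta(seconds=rng.randrange(1, 30 * 86400))
    return {
        "first_name": first,
        "last_name": last,
        "full_name": f"{first} {last}",  # derived, never generated independently
        "city": city,
        "state": state,
        "zip": zip_code,
        "created_at": created_at,
        "updated_at": updated_at,
    }
```

The point of the sketch is that correlated columns are generated as a group rather than column by column, which is what keeps the values mutually consistent.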

Quick Start

# Install from source
cargo install --path crates/seedkit-cli

# Generate 1000 rows per table, output SQL
seedkit generate --db postgres://localhost/myapp --rows 1000 --output seed.sql

# Insert directly into database
seedkit generate --db postgres://localhost/myapp --rows 1000

# Use .env or seedkit.toml for connection -- no --db needed
seedkit generate --rows 500 --output seed.sql

Installation

PyPI (recommended)

pip install seedkit
# or
pipx install seedkit

npm

npm install -g @seed-kit/cli
# or run without installing
npx @seed-kit/cli --help

GitHub Releases

Download pre-built binaries for Linux (x64/ARM64), macOS (Intel/Apple Silicon), and Windows from Releases.

From Source

git clone https://github.com/kclaka/seedkit.git
cd seedkit
cargo install --path crates/seedkit-cli

Requirements: Rust 1.75+ (2021 edition)

Verify Installation

seedkit --version
# seedkit 1.5.1

Zero-Config Database Detection

SeedKit automatically finds your database URL by checking (in order):

  1. --db CLI flag
  2. DATABASE_URL environment variable
  3. .env file in the current directory
  4. seedkit.toml config file
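
The lookup order above can be sketched in a few lines of Python. This is a hedged illustration, not SeedKit's code: `resolve_db_url` and `parse_dotenv` are hypothetical names, and the sketch assumes the `.env` file uses the same `DATABASE_URL` key as the environment variable, which the list above does not state.

```python
from typing import Mapping, Optional


def parse_dotenv(text: str) -> dict:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def resolve_db_url(
    cli_db: Optional[str],
    environ: Mapping[str, str],
    dotenv_text: Optional[str],
    toml_url: Optional[str],
) -> Optional[str]:
    """Return the first URL found, checking sources in the documented order."""
    if cli_db:                                   # 1. --db CLI flag
        return cli_db
    if environ.get("DATABASE_URL"):              # 2. environment variable
        return environ["DATABASE_URL"]
    if dotenv_text is not None:                  # 3. .env in the current directory
        url = parse_dotenv(dotenv_text).get("DATABASE_URL")  # key name is an assumption
        if url:
            return url
    return toml_url                              # 4. [database] url in seedkit.toml
```

Passing the sources in as arguments, instead of reading `os.environ` and the filesystem directly, keeps the precedence logic easy to test.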

Usage

seedkit generate

Generate seed data for your database.

# SQL file output
seedkit generate --db postgres://localhost/myapp --rows 500 --output seed.sql

# Direct insert into database
seedkit generate --db postgres://localhost/myapp --rows 1000

# JSON or CSV
seedkit generate --rows 100 --output data.json
seedkit generate --rows 100 --output data.csv

# PostgreSQL COPY format (10-50x faster bulk loading)
seedkit generate --rows 10000 --output seed.sql --copy

# Deterministic with seed
seedkit generate --rows 100 --seed 42 --output seed.sql

# Reproduce from lock file
seedkit generate --from-lock

# Per-table row counts
seedkit generate --rows 100 --table-rows users=500,orders=2000

# Include/exclude tables
seedkit generate --include users,orders --rows 100
seedkit generate --exclude audit_logs,migrations --rows 100

# LLM-enhanced classification (requires ANTHROPIC_API_KEY or OPENAI_API_KEY)
export ANTHROPIC_API_KEY=sk-ant-...
seedkit generate --rows 100 --ai --output seed.sql
seedkit generate --rows 100 --ai --model claude-opus-4-20250514 --output seed.sql

# Production-like with sampled distributions
seedkit generate --rows 1000 --subset seedkit.distributions.json

seedkit sample

Extract statistical distributions from a production database (read-only replica recommended). Automatically masks PII columns.

# Sample all tables
seedkit sample --db postgres://readonly-replica:5432/myapp

# Sample specific tables with custom limits
seedkit sample --db postgres://localhost/myapp --tables users,orders --categorical-limit 100

# Custom output path
seedkit sample --db postgres://localhost/myapp -o profiles.json

This creates seedkit.distributions.json with:

  • Categorical distributions -- value frequencies for text/enum columns (PII columns auto-masked)
  • Numeric distributions -- min, max, mean, stddev for numeric columns
  • FK ratios -- child-to-parent row count ratios (e.g., 3.2 orders per user)

Then pass the file to seedkit generate --subset seedkit.distributions.json to produce data that mirrors production patterns.
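
As a rough illustration of the three profile kinds listed above, the following Python sketch computes each from in-memory values. It is not SeedKit's sampling code, and the JSON layout of seedkit.distributions.json is not shown here. The function names are hypothetical, and treating `limit` as a cap on distinct values (in the spirit of --categorical-limit) is an assumption.

```python
from collections import Counter
from statistics import mean, pstdev


def categorical_distribution(values, limit=100):
    """Relative frequency of the most common values (capped at `limit`).

    Shares are fractions of all values, so they sum to less than 1
    when the cap drops some values.
    """
    counts = Counter(values)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.most_common(limit)}


def numeric_distribution(values):
    """Summary statistics for a numeric column (population stddev)."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
        "stddev": pstdev(values),
    }


def fk_ratio(child_rows: int, parent_rows: int) -> float:
    """Average number of child rows per parent row."""
    return child_rows / parent_rows
```

For example, `fk_ratio(16, 5)` gives 3.2, matching the "3.2 orders per user" style of ratio described above.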

seedkit introspect

Analyze your database schema and show classification results.

seedkit introspect --db postgres://localhost/myapp
seedkit introspect --db postgres://localhost/myapp --format json
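
The classification that introspect reports is described elsewhere in this README as pattern matching on column names. A minimal Python sketch of that idea follows. The four rules and the "Unknown" fallback are hypothetical stand-ins; SeedKit's actual 50+ rules are not reproduced here.

```python
import re

# Hypothetical rules: first match wins, so order matters.
RULES = [
    ("Email", re.compile(r"(^|_)e?mail($|_)")),
    ("FirstName", re.compile(r"^(first|given)_?name$")),
    ("Price", re.compile(r"(^|_)(price|amount|cost)($|_)")),
    ("CreatedAt", re.compile(r"^created(_at|_on)?$")),
]


def classify(column_name: str) -> str:
    """Map a column name to a semantic type, or "Unknown" if no rule matches."""
    name = column_name.lower()
    for semantic_type, pattern in RULES:
        if pattern.search(name):
            return semantic_type
    return "Unknown"
```

Anchoring the patterns on underscores or string boundaries is a deliberate choice in this sketch: it lets `user_email` match while keeping unrelated names that merely contain "mail" as a substring from matching.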

seedkit preview

Preview a few sample rows without generating a full dataset.

seedkit preview --db postgres://localhost/myapp --rows 5

seedkit check

Detect schema drift against the lock file. Designed for CI pipelines.

seedkit check --db postgres://localhost/myapp
# Exit code 0 = no drift, 1 = drift detected

seedkit check --db postgres://localhost/myapp --format json
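
One plausible way to implement this kind of drift check is to compare a fingerprint of the locked schema snapshot with a fingerprint of the live schema. The Python sketch below shows that idea under stated assumptions: how seedkit check actually compares schemas is not documented here, and `schema_fingerprint`, `check_exit_code`, and the dict-shaped schema are hypothetical.

```python
import hashlib
import json


def schema_fingerprint(schema: dict) -> str:
    """Hash a canonical JSON form so key order does not affect the result."""
    canonical = json.dumps(schema, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def check_exit_code(locked_schema: dict, live_schema: dict) -> int:
    """Return 0 when the schemas match and 1 when they have drifted."""
    same = schema_fingerprint(locked_schema) == schema_fingerprint(live_schema)
    return 0 if same else 1
```

A CI job only needs the exit code: a non-zero result fails the step and signals that the lock file should be regenerated.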

seedkit graph

Visualize table dependencies.

seedkit graph --db postgres://localhost/myapp --format mermaid > schema.mmd
seedkit graph --db postgres://localhost/myapp --format dot | dot -Tpng > schema.png
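
For readers unfamiliar with the Mermaid format, this Python sketch renders a list of foreign-key edges as a Mermaid flowchart. The exact text SeedKit emits is not shown in this README, so the layout, the child-to-parent arrow direction, and the name `to_mermaid` are assumptions.

```python
def to_mermaid(edges):
    """Render (child_table, parent_table) pairs as a top-down Mermaid flowchart."""
    lines = ["graph TD"]
    # Sort and de-duplicate so the output is stable across runs.
    for child, parent in sorted(set(edges)):
        lines.append(f"    {child} --> {parent}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(to_mermaid([("orders", "users"), ("order_items", "orders")]))
```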

Configuration

Create a seedkit.toml in your project root:

[database]
url = "postgres://localhost/myapp"

[generate]
rows = 500
seed = 42

[tables.users]
rows = 1000

[tables.orders]
rows = 5000

# Custom value lists with optional weights
[columns."products.color"]
values = ["red", "blue", "green", "black", "white"]
weights = [0.25, 0.20, 0.20, 0.20, 0.15]

# Explicit cycle-breaking for circular foreign keys
[graph]
break_cycle_at = ["users.invited_by_id", "comments.parent_id"]
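
To show what a weighted value list means in practice, here is a small Python sketch that draws values using the same list and weights as the products.color entry above. It uses the standard library's `random.choices` and is not SeedKit's generator. Whether SeedKit requires the weights to sum to 1 is not stated here; `random.choices` treats them as relative weights either way.

```python
import random
from collections import Counter

# Mirrors the [columns."products.color"] entry in the config above.
VALUES = ["red", "blue", "green", "black", "white"]
WEIGHTS = [0.25, 0.20, 0.20, 0.20, 0.15]


def sample_colors(seed: int, rows: int) -> list:
    """Draw `rows` values with the configured weights from a seeded RNG."""
    rng = random.Random(seed)
    return rng.choices(VALUES, weights=WEIGHTS, k=rows)


if __name__ == "__main__":
    # Over many rows the observed shares approach the configured weights.
    counts = Counter(sample_colors(seed=42, rows=10_000))
    for color in VALUES:
        print(color, counts[color] / 10_000)
```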

How It Works

                    seedkit generate
                          |
          [1] Introspect  |  Connect to DB, read information_schema
                          v
                   DatabaseSchema
                          |
          [2] Graph       |  Build FK dependency graph (petgraph)
                          |  Detect cycles (Tarjan SCC)
                          |  Break cycles, topological sort
                          v
                  Insertion Order
                          |
          [3] Classify    |  50+ regex rules match column names
                          |  Optional LLM pass (--ai flag)
                          |  Optional distribution profiles (--subset)
                          v
                  SemanticTypes
                          |
          [4] Generate    |  Row-by-row, FK-safe, unique-safe
                          |  Correlated groups, check constraints
                          |  Distribution-aware (normal, categorical)
                          v
                  Generated Data
                          |
          [5] Output      |  SQL / JSON / CSV / Direct Insert
                          v
                    seed.sql
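
Step [2] can be illustrated with a short Python sketch: build the FK dependency graph, set aside the edges chosen for cycle breaking, and topologically sort what remains. This is a simplified stand-in, not SeedKit's implementation. It uses the standard library's `graphlib` instead of petgraph and Tarjan SCC, and it relies on an explicit list of edges to break (like `break_cycle_at` in the config) instead of detecting them automatically. The name `insertion_order` is hypothetical.

```python
from graphlib import CycleError, TopologicalSorter


def insertion_order(foreign_keys, break_cycle_at=()):
    """Return (table order, deferred FK edges).

    foreign_keys: iterable of (child_table, fk_column, parent_table).
    break_cycle_at: "table.column" names whose FK is filled in later,
    for example by an UPDATE after all rows exist.
    Raises graphlib.CycleError if a cycle remains unbroken.
    """
    broken = set(break_cycle_at)
    sorter = TopologicalSorter()
    deferred = []
    for child, column, parent in foreign_keys:
        sorter.add(child)   # register both tables even if the edge is deferred
        sorter.add(parent)
        if f"{child}.{column}" in broken:
            deferred.append((child, column, parent))
        else:
            sorter.add(child, parent)  # parent must be inserted before child
    return list(sorter.static_order()), deferred


if __name__ == "__main__":
    fks = [
        ("orders", "user_id", "users"),
        ("users", "invited_by_id", "users"),     # self-reference
        ("comments", "parent_id", "comments"),   # self-reference
        ("comments", "order_id", "orders"),
    ]
    order, deferred = insertion_order(
        fks, ["users.invited_by_id", "comments.parent_id"]
    )
    print(order)     # parents come before the tables that reference them
    print(deferred)  # edges to fill in after the initial inserts
```

Deferred edges correspond to columns that would be inserted as NULL first and then set once every referenced row exists, which is the role the deferred UPDATE statements play.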

Lock File

seedkit.lock works like package-lock.json. It captures the schema snapshot, random seed, and all configuration so teammates can reproduce the exact same dataset:

# Generate (creates seedkit.lock)
seedkit generate --rows 100

# Teammate reproduces identical data
seedkit generate --from-lock

If there's a merge conflict in seedkit.lock, don't resolve it by hand. Keep one side and regenerate the file:

git checkout --ours seedkit.lock
seedkit generate --force
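
The reproducibility the lock file provides rests on seeded generation: the same seed and the same inputs yield the same rows. The Python sketch below demonstrates that property with the standard library's `random.Random`. It is an illustration only. SeedKit's generator is written in Rust, and `generate_users` and its columns are hypothetical.

```python
import random


def generate_users(seed: int, rows: int) -> list:
    """Generate rows from a private, seeded RNG (no global random state)."""
    rng = random.Random(seed)
    return [
        {
            "id": i + 1,
            "age": rng.randint(18, 90),
            "plan": rng.choice(["free", "pro", "team"]),
        }
        for i in range(rows)
    ]


if __name__ == "__main__":
    # Two runs with the same seed produce identical data.
    print(generate_users(42, 3) == generate_users(42, 3))
```

Using a dedicated `random.Random` instance instead of the module-level functions means unrelated code that consumes random numbers cannot change the output.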

Performance

Benchmarked with criterion on Apple Silicon (M-series). Run cargo bench to reproduce.

| Operation | Throughput |
|---|---|
| Generation (10 cols, semantic providers) | ~480K rows/sec |
| Generation (FK references only) | ~3.7M rows/sec |
| Generation (weighted value lists) | ~6.9M rows/sec |
| Generation (distribution sampling) | ~8.6M rows/sec |
| Classification (100 tables x 20 cols) | ~2.1M cols/sec |
| SQL output formatting | ~1.5M rows/sec |
| JSON output formatting | ~1.1M rows/sec |
| CSV output formatting | ~1.5M rows/sec |

Comparison

| Feature | SeedKit | Faker/factory_bot | Snaplet |
|---|---|---|---|
| Schema-aware | Yes | No | Yes (shut down) |
| Multi-database | PG + MySQL + SQLite | N/A | PG only |
| FK resolution | Automatic | Manual | Automatic |
| Circular FK handling | Tarjan SCC + deferral | N/A | Manual |
| Deterministic | Seed + lock file | Seed only | No |
| Custom values | TOML config | Code | Code |
| Smart sampling | Production distributions | No | No |
| LLM-enhanced | Optional --ai | No | No |
| CI integration | seedkit check | N/A | No |
| Privacy | Synthetic + PII masking | Synthetic | Copies prod |

Architecture

SeedKit is a Rust workspace with three crates:

seedkit/
  crates/
    seedkit-core/     # Library: introspection, graph, classification, generation, output, sampling
    seedkit-cli/      # Binary: clap-based CLI with 6 subcommands
    seedkit-testutil/  # Shared test helpers
  tests/
    fixtures/         # SQL schema fixtures for integration tests

Test suite: 221 tests (201 unit + 13 PostgreSQL integration + 7 MySQL integration)

Contributing

See CONTRIBUTING.md for development setup, testing, and PR guidelines.

# Run the full test suite
cargo test

# Run integration tests (requires Docker)
docker compose -f docker/docker-compose.test.yml up -d
TEST_POSTGRES_URL=postgres://seedkit:seedkit@localhost:5432/seedkit_test \
TEST_MYSQL_URL=mysql://seedkit:seedkit@localhost:3307/seedkit_test \
  cargo test --test '*' -- --test-threads=1

License

Licensed under the MIT License.
