Add smart sampling: distribution extraction, PII masking, and product…#1

Merged
kclaka merged 1 commit into main from release/v1.2.0
Feb 25, 2026

Conversation

@kclaka kclaka commented Feb 25, 2026

…ion-like generation

Smart Sampling (seedkit sample + seedkit generate --subset):

  • Extract statistical distributions from production databases (categorical frequencies, numeric min/max/mean/stddev, FK ratios)
  • PII masking: auto-detect and remove PII column distributions (email, phone, SSN, password, etc.) before saving profiles
  • Distribution-aware generation: categorical columns use weighted random selection; numeric columns use Box-Muller normal sampling clamped to the profiled min/max bounds
  • Ratio-adjusted row counts: child tables scale proportionally to parent tables based on production ratios
  • Profile save/load (seedkit.distributions.json)
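
To make the PII masking and ratio-adjusted row counts above more concrete, here is a minimal Rust sketch. It is not taken from the seedkit source: the function names, the substring-based detection, and the map-based profile shape are all assumptions for illustration, and only the four PII name fragments come from the description above.

```rust
use std::collections::BTreeMap;

/// Name fragments treated as PII markers. These four come from the PR
/// description; the real detector may use a longer list or other rules.
const PII_HINTS: [&str; 4] = ["email", "phone", "ssn", "password"];

/// Returns true when a column name contains one of the PII fragments.
/// (Hypothetical helper: plain substring matching, case-insensitive.)
fn looks_like_pii(column: &str) -> bool {
    let lower = column.to_lowercase();
    PII_HINTS.iter().any(|hint| lower.contains(hint))
}

/// Drops the distributions of PII-looking columns from a profile.
/// The profile is modelled here as column name -> serialized distribution,
/// which is an assumption about the profile shape.
fn mask_pii(profile: &mut BTreeMap<String, String>) {
    profile.retain(|column, _| !looks_like_pii(column));
}

/// Scales a child table's row count by the production child/parent ratio.
/// Returns 0 when the production parent table was empty.
fn scaled_child_rows(prod_parent: u64, prod_child: u64, target_parent: u64) -> u64 {
    if prod_parent == 0 {
        return 0;
    }
    let ratio = prod_child as f64 / prod_parent as f64;
    (target_parent as f64 * ratio).round() as u64
}
```

For example, if production holds 1000 parent rows and 5000 child rows, `scaled_child_rows(1000, 5000, 100)` asks for 500 child rows when generating 100 parents. Substring matching is crude and can flag harmless columns, so a real detector would likely need stricter rules.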

CLI Changes:

  • New seedkit sample command with --tables, --categorical-limit, --min-rows, --output options
  • New --subset <path> flag on seedkit generate to use profiles

Engine:

  • GenerationStrategy::Distribution variant with Categorical/Numeric arms
  • Box-Muller transform for normally-distributed numeric values
  • Distribution profiles integrate into GenerationPlan::build()
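
The two distribution arms listed above could look roughly like the following sketch. The `Distribution`, `Value`, and `Rng` types and the `sample` function are hypothetical names chosen for illustration, not the actual seedkit definitions, and the small xorshift64* generator is only a stand-in so the example needs no external crates.

```rust
use std::f64::consts::PI;

/// Tiny xorshift64* generator (stand-in for a real RNG crate).
struct Rng(u64);

impl Rng {
    /// The internal state must be non-zero, so a zero seed is bumped to 1.
    fn new(seed: u64) -> Self {
        Rng(seed.max(1))
    }

    fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x >> 12;
        x ^= x << 25;
        x ^= x >> 27;
        self.0 = x;
        x.wrapping_mul(0x2545F4914F6CDD1D)
    }

    /// Uniform f64 in [0, 1), built from the top 53 bits.
    fn next_f64(&mut self) -> f64 {
        (self.next_u64() >> 11) as f64 / (1u64 << 53) as f64
    }
}

/// Assumed shape of the Categorical/Numeric arms.
enum Distribution {
    Categorical(Vec<(String, f64)>),
    Numeric { min: f64, max: f64, mean: f64, stddev: f64 },
}

enum Value {
    Text(String),
    Number(f64),
}

fn sample(dist: &Distribution, rng: &mut Rng) -> Value {
    match dist {
        Distribution::Categorical(weights) => {
            // Weighted pick: walk the cumulative weights until the target falls inside.
            let total: f64 = weights.iter().map(|(_, w)| w).sum();
            let mut target = rng.next_f64() * total;
            for (label, w) in weights {
                if target < *w {
                    return Value::Text(label.clone());
                }
                target -= *w;
            }
            // Fallback for float rounding (or an empty list): use the last label.
            Value::Text(weights.last().map(|(l, _)| l.clone()).unwrap_or_default())
        }
        Distribution::Numeric { min, max, mean, stddev } => {
            // Box-Muller: two uniforms -> one standard normal deviate.
            let u1 = 1.0 - rng.next_f64(); // in (0, 1], so ln(u1) is finite
            let u2 = rng.next_f64();
            let z = (-2.0 * u1.ln()).sqrt() * (2.0 * PI * u2).cos();
            // Shift and scale, then clamp to the profiled bounds.
            Value::Number((*mean + *stddev * z).clamp(*min, *max))
        }
    }
}
```

Clamping keeps every value inside the profiled range but piles extra probability mass at `min` and `max` when the standard deviation is large relative to the range; rejection sampling would be an alternative if that matters.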

Version bump to 1.2.0. 221 tests passing (201 unit + 20 integration).

@kclaka kclaka merged commit 87e6c5a into main Feb 25, 2026