SynthGen creates realistic, constraint‑valid synthetic data for SQL Server schemas so that teams can develop, test, and demonstrate solutions without exposing production data.
SynthGen orchestrates a team of LLM-centric agents, supported by lightweight Python helpers, to transform SQL Server schema definitions into realistic synthetic data. Each agent in the pipeline specializes in a single task (a sketch of the hand-off pattern follows the list):
- Schema Parser: Analyzes SQL Server CREATE scripts to build an internal model
- Reference Data Loader: Incorporates lookup values from CSV files with optional distribution weights
- Data Synthesizer: Generates realistic data based on schema constraints and business rules
- Validator: Ensures the generated data meets all constraints
- Artifact Collector: Organizes outputs and debug information
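Conceptually, each agent transforms a shared context and hands it to the next. The sketch below illustrates that hand-off pattern; the class names, methods, and context shape are assumptions for illustration, not SynthGen's actual API:

```python
# Illustrative sketch of a sequential agent pipeline. The class names,
# methods, and context shape here are assumptions, not SynthGen's API.
from dataclasses import dataclass, field
from typing import Any


@dataclass
class PipelineContext:
    """Shared state threaded through the pipeline."""
    artifacts: dict[str, Any] = field(default_factory=dict)


class Agent:
    name = "base"

    def run(self, ctx: PipelineContext) -> PipelineContext:
        raise NotImplementedError


class SchemaParserAgent(Agent):
    name = "schema_parser"

    def run(self, ctx: PipelineContext) -> PipelineContext:
        # Parse the CREATE script into an intermediate representation (IR)
        # and stash it for the downstream agents.
        ctx.artifacts["schema.json"] = {"tables": []}  # placeholder IR
        return ctx


def run_pipeline(agents: list[Agent]) -> PipelineContext:
    """Run each agent in order, threading the shared context through."""
    ctx = PipelineContext()
    for agent in agents:
        ctx = agent.run(ctx)
    return ctx
```

In this pattern the orchestrator owns the agent order, which keeps each agent small and independently testable.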
SynthGen includes a comprehensive e-commerce demo that showcases the system's capabilities. This demo generates a complete synthetic e-commerce dataset with users, products, orders, reviews, and more, based on a 30+ table schema with rich relationships.
The demo demonstrates how SynthGen can:
- Parse complex SQL schemas with many tables and relationships
- Respect primary and foreign key constraints
- Generate realistic data that follows business rules
- Create coherent relationships between entities
View the E-commerce Demo Documentation for a step-by-step walkthrough of the process, including descriptions of all input and output files.
The Requirements Specification document provides a detailed overview of SynthGen's purpose, scope, and functionality. It includes:
- Comprehensive user stories from various stakeholder perspectives
- Functional and non-functional requirements
- Acceptance criteria and performance benchmarks
- Scope boundaries and future enhancements
This document serves as the foundation for SynthGen's development and helps ensure all critical features are implemented correctly.
The High-Level Design document outlines SynthGen's technical architecture and implementation approach. Key sections include:
- Component diagram showing the agent pipeline
- Intermediate Representation (IR) JSON schema
- Prompt engineering standards
- Artifact directory structure
- Performance targets and extensibility points
The design document provides developers with a clear roadmap for implementation while maintaining flexibility for future enhancements.
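To make the IR concrete, here is a rough, hypothetical example of what the IR for a tiny two-table schema might contain, written as a Python dict. The field names are illustrative; the authoritative shape is defined by the IR JSON schema in the design document:

```python
# Hypothetical IR snippet for a two-table schema. The field names are
# illustrative; the real shape is defined by SynthGen's IR JSON schema.
schema_ir = {
    "tables": [
        {
            "name": "dbo.Users",
            "columns": [
                {"name": "UserID", "type": "INT", "nullable": False},
                {"name": "Email", "type": "NVARCHAR(255)", "nullable": False},
            ],
            "primary_key": ["UserID"],
            "foreign_keys": [],
        },
        {
            "name": "dbo.Orders",
            "columns": [
                {"name": "OrderID", "type": "INT", "nullable": False},
                {"name": "UserID", "type": "INT", "nullable": False},
            ],
            "primary_key": ["OrderID"],
            "foreign_keys": [
                {"columns": ["UserID"], "references": "dbo.Users(UserID)"}
            ],
        },
    ]
}
```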
The agent pipeline and its intermediate artifacts:

```mermaid
flowchart TD
    %% Input files
    SQL[SQL CREATE Script] --> Parser
    RefData[Reference CSVs] --> RefLoader
    Rules[Generation Rules JSON] --> DataSynth

    %% Agents
    Parser[Schema Parser Agent] --> |"schema.json<br/>IR with tables, columns, constraints"| RefLoader
    RefLoader[Reference Data Agent] --> |"enriched_schema.json<br/>IR with reference data"| DataSynth
    DataSynth[Data Synthesis Agent] --> |"generated_data.csv<br/>Synthetic rows"| Validator
    Validator[Validation Agent] --> |"validation_report.json<br/>Constraint check results"| Artifact
    Artifact[Artifact Agent] --> |"artifacts directory<br/>All outputs & traces"| Output

    %% Outputs
    Output[Final Artifacts]

    class Parser,RefLoader,DataSynth,Validator,Artifact agent
    class SQL,RefData,Rules file
    class Output output
```
```bash
# Clone the repository
git clone https://github.com/your-org/synthgen.git
cd synthgen

# Create a virtual environment (optional but recommended)
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt
```

SynthGen requires an OpenAI API key to function. You have two options for setting up your key:
- Create a `.env` file in the project root (recommended for development):

  ```
  OPENAI_API_KEY=sk-your-api-key-here
  ```

- Set an environment variable:

  ```bash
  # Linux/macOS
  export OPENAI_API_KEY=sk-your-api-key-here

  # Windows
  set OPENAI_API_KEY=sk-your-api-key-here
  ```
You can verify your API key setup by running:

```bash
python tests/integration/test_openai_connectivity.py
```

For detailed information about API key management and security best practices, see API_KEY_MANAGEMENT.md.
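If you prefer an inline check, the following minimal probe is one way to do it. This is a sketch that assumes the `openai` v1 Python SDK and `python-dotenv` are installed; it is not the bundled test script:

```python
# Minimal API-key sanity check; assumes the openai>=1.0 SDK and
# python-dotenv. This is a sketch, not SynthGen's bundled test script.
import os

from dotenv import load_dotenv
from openai import OpenAI

load_dotenv()  # picks up OPENAI_API_KEY from a .env file if present

if not os.getenv("OPENAI_API_KEY"):
    raise SystemExit("OPENAI_API_KEY is not set")

client = OpenAI()              # reads the key from the environment
models = client.models.list()  # fails fast if the key is invalid
print(f"API key OK; {len(models.data)} models visible")
```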
Run the pipeline by pointing the CLI at a schema script and a reference-data directory:

```bash
./cli.py path/to/schema.sql path/to/ref-data/ --rules path/to/rules.json
```

The full CLI options:

```text
usage: cli.py [-h] [--rules RULES] [--artifacts-dir ARTIFACTS_DIR]
              [--run-id RUN_ID] [--seed SEED] [--llm-model LLM_MODEL]
              sql_script ref_data_dir

SynthGen: Generate realistic, constraint-valid synthetic data for SQL Server schemas

positional arguments:
  sql_script            Path to the SQL Server CREATE script for the target schema
  ref_data_dir          Directory containing reference data CSVs (one per lookup table)

optional arguments:
  -h, --help            show this help message and exit
  --rules RULES, -r RULES
                        Path to the Generation-Rules JSON document
  --artifacts-dir ARTIFACTS_DIR, -a ARTIFACTS_DIR
                        Directory to store artifacts and output files (default: artifacts)
  --run-id RUN_ID       Unique identifier for this run (for reproducibility)
  --seed SEED, -s SEED  Random seed for reproducible generation
  --llm-model LLM_MODEL
                        Specific OpenAI model to use (default: gpt-4o)
```
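The repository also ships a Python API (api.py in the layout below). Its exact entry point isn't documented in this README, so the following is only a hedged sketch of programmatic use; the function name, parameters, and sample file paths are assumptions:

```python
# Hypothetical programmatic use of SynthGen via api.py. The entry-point
# name, its parameters, and the sample file paths are assumptions --
# check api.py for the real interface.
from api import run_pipeline  # hypothetical entry point

artifacts_dir = run_pipeline(
    sql_script="samples/sql/ecommerce.sql",      # assumed sample path
    ref_data_dir="samples/ref_data/",
    rules="samples/rules/ecommerce_rules.json",  # assumed sample path
    seed=42,               # mirrors the --seed CLI flag
    llm_model="gpt-4o",    # mirrors the --llm-model CLI flag
)
print(f"Artifacts written to {artifacts_dir}")
```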
SynthGen allows you to specify distribution weights for reference data, which control how frequently each value appears in the generated data:
```csv
# [SchemaName.TableName]
CodeValue,Description,Weight
A,Active,0.7
I,Inactive,0.2
D,Deleted,0.1
```
Weights are supplied via an optional Weight column in each reference CSV. The Data Synthesis Agent uses them to bias the distribution of foreign key references to these values (see the sketch after the list below).
Benefits of distribution weights:
- Control over the realism of generated data
- Ability to model real-world frequencies (e.g., mostly active accounts)
- Predictable output patterns for testing specific scenarios
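To illustrate the mechanics, the sketch below biases value selection with Python's `random.choices`. This demonstrates the general technique, not SynthGen's internal implementation:

```python
# Sketch of weight-biased value selection for foreign keys. This shows
# the general technique, not SynthGen's internal implementation.
import csv
import io
import random
from collections import Counter

ref_csv = """CodeValue,Description,Weight
A,Active,0.7
I,Inactive,0.2
D,Deleted,0.1
"""

rows = list(csv.DictReader(io.StringIO(ref_csv)))
codes = [r["CodeValue"] for r in rows]
weights = [float(r["Weight"]) for r in rows]

random.seed(42)  # reproducible, mirroring the --seed CLI option
sample = random.choices(codes, weights=weights, k=10_000)
print(Counter(sample))  # roughly 70% A, 20% I, 10% D
```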
```text
synthgen/
├── agents/              # Agent implementations
├── models/              # Data models and schemas
├── utils/               # Utility functions
├── plugins/             # Extension points
├── schemas/             # JSON schemas
├── tests/               # Test suite
│   ├── unit/            # Unit tests for individual components
│   ├── integration/     # Integration tests across components
│   ├── end_to_end/      # Full pipeline tests
│   └── fixtures/        # Test data and fixtures
├── samples/             # Sample input files
│   ├── sql/             # Example SQL schema files
│   ├── ref_data/        # Example reference data files
│   └── rules/           # Example generation rules
├── artifacts/           # Generated outputs (gitignored)
├── cli.py               # Command-line interface
├── api.py               # Python API
├── orchestrator.py      # Pipeline orchestrator
└── constants.py         # Configuration constants
```
Run the test suite with:

```bash
pytest
```

To add a new agent (see the sketch after this list):

- Create a new file in the `agents/` directory
- Subclass the `Agent` base class
- Implement the required methods
- Update the orchestrator to include your agent in the pipeline
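A hedged sketch of such a subclass follows; the base class's import path and the context interface are assumptions inferred from the steps above, so check the `agents/` package for the real signatures:

```python
# agents/row_count_agent.py -- illustrative only. The base-class import
# path and context interface are assumptions, not the repo's real API.
from agents.base import Agent  # hypothetical import path


class RowCountAgent(Agent):
    """Example agent that records how many rows each table received."""

    name = "row_count"

    def run(self, context):
        # Read the synthesizer's output from the shared pipeline context,
        # derive a small summary, and store it for the Artifact Agent.
        generated = context.artifacts.get("generated_data", {})
        context.artifacts["row_counts"] = {
            table: len(rows) for table, rows in generated.items()
        }
        return context
```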
Contributions are welcome! Please feel free to submit a Pull Request.
