DataSynth v2.0.0


Synthetic enterprise data generation for ML training, audit analytics, and system testing.

DataSynth generates statistically realistic, fully interconnected enterprise financial data. It produces coherent General Ledger journal entries, document flows, subledger records, banking transactions, process mining event logs, and graph exports across 20+ enterprise process families.

Generated data respects accounting identities (debits = credits, Assets = Liabilities + Equity), follows empirical distributions (Benford's Law, log-normal mixtures), and maintains referential integrity across 100+ output tables.
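Benford's Law predicts that the leading digit d of naturally occurring amounts appears with frequency log10(1 + 1/d), so "1" leads about 30.1% of the time. A minimal, standalone sketch (not DataSynth code) of checking first-digit compliance on log-normally distributed amounts:

```python
import math
import random

def benford_expected(d: int) -> float:
    """Expected first-digit frequency under Benford's Law: log10(1 + 1/d)."""
    return math.log10(1 + 1 / d)

def first_digit(x: float) -> int:
    # Scientific notation puts the leading digit first: "2.98e+03" -> 2
    return int(f"{abs(x):.10e}"[0])

def first_digit_frequencies(amounts):
    counts = {d: 0 for d in range(1, 10)}
    for a in amounts:
        counts[first_digit(a)] += 1
    return {d: c / len(amounts) for d, c in counts.items()}

# Log-normal amounts spanning several orders of magnitude are
# approximately Benford-compliant; parameters here are illustrative.
rng = random.Random(42)
amounts = [math.exp(rng.gauss(8.0, 2.5)) for _ in range(50_000)]
freqs = first_digit_frequencies(amounts)
```

A fraud-injection mode would deliberately perturb these frequencies; the labeled deviation then serves as ground truth for anomaly detectors.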


Quick Start

# Build from source
git clone https://github.com/mivertowski/SyntheticData.git
cd SyntheticData
cargo build --release

# Demo mode -- generates a complete dataset with defaults
./target/release/datasynth-data generate --demo --output ./demo-output

# Full audit simulation (113+ output files)
./target/release/datasynth-data generate --demo --preset audit-group --output ./audit-output

# Or configure for your use case
./target/release/datasynth-data init --industry manufacturing --complexity medium -o config.yaml
./target/release/datasynth-data validate --config config.yaml
./target/release/datasynth-data generate --config config.yaml --output ./output

Group Audit Simulation

The audit-group preset generates a complete enterprise group audit dataset following ISA, IFRS, US GAAP, and local regulations:

./target/release/datasynth-data generate --demo --preset audit-group --output ./audit-output

# Export in SAP, French (FEC), or German (GoBD) audit formats
./target/release/datasynth-data generate --config config.yaml --output ./output --export-format sap --export-format fec

This produces 113+ interconnected files:

Category Content
Financial Statements Standalone + consolidated BS/IS/CF with elimination schedules
Audit Lifecycle Engagement, risk assessment, procedures, sampling, findings, opinion
ISA 600 Group Audit Component auditors, materiality allocation, scope, instructions, reports
Risk Assessment Combined Risk Assessment (CRA) per account area and assertion
Audit Methodology Materiality (ISA 320), sampling (ISA 530), analytical procedures (ISA 520)
Accounting Standards Deferred tax, ECL, provisions, pensions, stock comp, business combinations
SOX Compliance Section 302 certifications, Section 404 ICFR assessments
Graph Export 78+ entity types, 39+ edge types for ML training and AI agent interaction

CRA drives sampling, sampling correlates with misstatement rates, misstatements drive findings, findings drive the audit opinion.
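The final link in that chain, findings driving the opinion, can be illustrated with a toy version of the ISA 700/705 decision logic (this is a simplified sketch for intuition, not DataSynth's actual implementation):

```python
def derive_opinion(findings, going_concern_uncertain=False):
    """Simplified ISA 700/705-style logic: material and pervasive
    misstatements yield an adverse opinion, material-only a qualified
    one, otherwise the opinion is unmodified. A going-concern
    uncertainty adds a separate emphasis paragraph flag (ISA 570)."""
    material = [f for f in findings if f.get("material")]
    pervasive = [f for f in material if f.get("pervasive")]
    if pervasive:
        opinion = "adverse"
    elif material:
        opinion = "qualified"
    else:
        opinion = "unmodified"
    return {"opinion": opinion,
            "going_concern_paragraph": going_concern_uncertain}

result = derive_opinion([{"material": True, "pervasive": False}])
```

Because the generated findings are themselves derived from injected misstatements, the resulting opinions stay consistent with the underlying data.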


Key Capabilities

Statistical Foundations

  • Distribution engine -- Log-normal mixtures, Gaussian mixtures, Pareto, Weibull, Beta, and zero-inflated distributions with configurable components
  • Copula correlations -- Cross-field dependency modeling via Gaussian, Clayton, Gumbel, Frank, and Student-t copulas
  • Benford's Law -- First- and second-digit compliance with configurable deviation for anomaly injection
  • Temporal patterns -- Month-end/quarter-end/year-end volume spikes, intraday segments, business day calendars (15 regions), processing lags, and fiscal calendar support
  • Regime changes -- Economic cycles, acquisition effects, and structural breaks in time series
  • Industry profiles -- Pre-configured distributions for Retail, Manufacturing, Financial Services, Healthcare, and Technology
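The idea behind a log-normal mixture is to combine several log-normal components, each with its own weight, so that, for example, many small routine postings coexist with a heavy tail of large ones. A self-contained sketch using only the standard library (component parameters are illustrative, not DataSynth defaults):

```python
import math
import random

def sample_lognormal_mixture(n, components, rng):
    """Sample n values from a log-normal mixture.
    components: list of (weight, mu, sigma) tuples on the log scale."""
    weights = [w for w, _, _ in components]
    out = []
    for _ in range(n):
        # Pick a component by weight, then draw from its log-normal
        _, mu, sigma = rng.choices(components, weights=weights, k=1)[0]
        out.append(math.exp(rng.gauss(mu, sigma)))
    return out

rng = random.Random(7)
# 80% routine postings, 20% large transactions
amounts = sample_lognormal_mixture(10_000, [(0.8, 5.0, 0.8), (0.2, 9.0, 1.2)], rng)
```

The resulting distribution is right-skewed (mean well above median), which matches empirical journal-entry amount distributions far better than a single Gaussian.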

Enterprise Process Simulation

Every process chain generates its own master data, documents, and journal entries -- all cross-referenced:

Process Family Scope
General Ledger Journal entries, chart of accounts (small/medium/large), ACDOCA event logs
Procure-to-Pay Purchase requisitions, POs, goods receipts, vendor invoices, payments, three-way match
Order-to-Cash Sales orders, deliveries, customer invoices, receipts, dunning
Source-to-Contract Spend analysis, sourcing projects, supplier qualification, RFx, bids, contracts, scorecards
Hire-to-Retire Payroll runs, tax/deduction calculations, time & attendance, expense reports, benefit enrollment
Manufacturing Production orders, BOM explosion, routing operations, WIP costing, quality inspections, cycle counts
Financial Reporting Balance sheet, income statement, cash flow, changes in equity, KPIs, budget variance
Tax Accounting Multi-jurisdiction tax (Federal/State/Local), VAT/GST returns, ASC 740/IAS 12 provisions, FIN 48 uncertain positions, withholding
Treasury Cash positioning, probability-weighted forecasts, cash pooling, hedging (ASC 815/IFRS 9), debt covenants, netting
Project Accounting WBS hierarchies, cost lines, percentage-of-completion revenue, earned value (SPI/CPI/EAC), change orders
ESG / Sustainability GHG Scope 1/2/3 emissions, energy/water/waste, workforce diversity, safety metrics, GRI/SASB/TCFD disclosures
Intercompany IC matching, transfer pricing, consolidation eliminations, currency translation
Subledgers AR, AP, Fixed Assets, Inventory -- each with GL reconciliation
Period Close Monthly close engine, depreciation runs, accruals, year-end closing entries
Banking / KYC / AML Customer personas, KYC profiles, AML typologies (structuring, layering, mule, funnel)
Sales Quote-to-order pipeline with win rate modeling and pricing negotiation
Bank Reconciliation Statement matching, outstanding checks, deposits in transit
Audit ISA lifecycle: engagements, workpapers, evidence, risk assessments, findings, opinions (ISA 700), KAMs (ISA 701), SOX 302/404
Group Audit (ISA 600) Component auditors, materiality allocation, scope assignment, component instructions/reports, consolidation

Accounting, Audit & Compliance Standards

  • Accounting frameworks -- US GAAP, IFRS, French GAAP (PCG), German GAAP (HGB/SKR04), and dual reporting
  • Revenue recognition -- ASC 606 / IFRS 15 with contract generation, performance obligations, and SSP allocation
  • Leases -- ASC 842 / IFRS 16 with ROU assets, lease liabilities, and classification
  • Fair value -- ASC 820 / IFRS 13 Level 1/2/3 hierarchy
  • Impairment -- ASC 360 / IAS 36 testing with fair value estimation
  • Audit standards -- ISA (34 standards), PCAOB (19+ standards) with procedure mapping
  • SOX compliance -- Section 302/404 assessments with deficiency classification and material weakness detection
  • COSO 2013 -- 5 components, 17 principles, maturity levels, entity-level and transaction-level controls
  • Compliance regulations -- 45+ built-in standards registry, jurisdiction profiles (10 countries), regulatory filings, audit procedures, and compliance findings with full deficiency classification
  • Cross-domain compliance graph -- Standards linked to GL account types and business processes; full traversal paths (Company -> Jurisdiction -> Standard -> Account -> JournalEntry)
  • Localized exports -- FEC (French) and GoBD (German) audit file formats
  • Enterprise Group Audit (ISA 600) -- Component auditor assignment, group materiality allocation, scope assignment (full/specific/analytical), component instructions and reports
  • Audit Opinion (ISA 700/705/706/701) -- Opinion derived from findings severity and going concern, Key Audit Matters, PCAOB ICFR opinion
  • Audit Methodology -- Combined Risk Assessment (ISA 315), materiality calculations (ISA 320), sampling methodology (ISA 530), SCOTS classification, unusual item detection, analytical relationships (ISA 520)
  • Deferred Tax (IAS 12 / ASC 740) -- Temporary differences, ETR reconciliation, rollforward schedules, valuation allowances
  • Business Combinations (IFRS 3 / ASC 805) -- Purchase price allocation, fair value step-ups, goodwill, contingent consideration
  • Segment Reporting (IFRS 8 / ASC 280) -- Operating segments with reconciliation to consolidated totals
  • Expected Credit Loss (IFRS 9 / ASC 326) -- Provision matrix by aging bucket, forward-looking scenarios, ECL movements
  • Pensions (IAS 19 / ASC 715) -- DBO rollforward, plan assets, pension expense, OCI remeasurements
  • Provisions (IAS 37 / ASC 450) -- Framework-aware recognition thresholds, provision movements
  • Stock Compensation (ASC 718 / IFRS 2) -- Grants, vesting schedules, expense recognition
  • Functional Currency (IAS 21) -- Per-entity functional currency, CTA as OCI
  • Consolidated Financial Statements -- Standalone + consolidated with elimination schedules
  • Going Concern (ISA 570) -- Financial indicator derivation, management mitigation plans
  • Subsequent Events (ISA 560 / IAS 10) -- Adjusting and non-adjusting events
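To make the ECL provision-matrix item above concrete: under the IFRS 9 simplified approach, the allowance is the sum over aging buckets of exposure times an expected loss rate. A minimal sketch with made-up exposures and rates (not DataSynth's calibration):

```python
def ecl_provision_matrix(exposures, loss_rates):
    """Per-bucket allowance and total ECL: allowance[b] = exposure[b] * rate[b]."""
    allowance = {b: exposures[b] * loss_rates[b] for b in exposures}
    return allowance, sum(allowance.values())

# Receivables exposure by aging bucket (illustrative figures)
exposures = {"current": 500_000.0, "1-30": 120_000.0, "31-60": 60_000.0,
             "61-90": 25_000.0, "90+": 10_000.0}
# Historical loss rates, adjusted for forward-looking scenarios
loss_rates = {"current": 0.003, "1-30": 0.016, "31-60": 0.036,
              "61-90": 0.066, "90+": 0.106}

by_bucket, total_ecl = ecl_provision_matrix(exposures, loss_rates)  # 8,290.0
```

In the generated data, the resulting allowance ties back to the ECL journal entries and movement schedules, preserving referential integrity.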

YAML-Driven Audit FSM Engine

The datasynth-audit-fsm crate provides a methodology-agnostic state machine engine that loads audit methodology blueprints from YAML and generates event-sourced audit trails with typed artifacts.

The engine uses a two-layer architecture: blueprints define what happens (procedures, phases, state machines, evidence requirements, standards references), while generation overlays define how it happens (revision probabilities, timing distributions, artifact volumes, anomaly injection rates). The same blueprint can produce a thorough engagement or a rushed engagement by swapping a single overlay file.

Ten built-in blueprints cover the major audit methodologies:

Blueprint Procedures Phases Steps Standards Events Artifacts
Financial Statement Audit (FSA) 9 3 24 14 ISA 51 1,916
Internal Audit (IA) 34 9 82 52 IIA-GIAS 205 3,808
KPMG, PwC, Deloitte, EY GAM Lite -- Firm-specific ISA methodologies
SOC 2 Type II -- Trust Services Criteria
PCAOB Integrated -- AS 2201 integrated audit
Regulatory Examination -- Regulatory examination scope

Additional methodology blueprints are available at SyntheticDataBlueprints.

The StepDispatcher maps all step commands to 14 concrete audit generators, enriched by the analytics inventory (87 data requirements + 71 analytical procedures across FSA, IA, SOC 2, PCAOB, and Regulatory blueprints) and the form ontology (4,437 canonical field categories). Every step carries a judgment_level for risk-based procedure selection. Every artifact is data-driven: findings cite specific journal entries, workpapers reference applicable ISA paragraphs, and evidence descriptions include expected form fields.

14/14 audit data types and 14/14 analytical procedures -- full coverage of all data types required by FSA audit steps:

Category Data Types
Core financial General ledger, journal entries (with ISA 240 flags), financial statements (with comparatives), sub-ledgers
External evidence Bank statements, confirmations, contracts, estimates
Year-over-year Prior-year comparatives, prior-year findings with remediation tracking
Reference data Industry benchmarks (10 metrics/industry), organizational profile (IT systems, regulatory env)
Governance Board minutes (quarterly + audit committee), management reports (KPI/RAG/budget)
IT controls Access logs (business-hour weighting), change management records (approval gap correlation)

Engine features:

  • 8-state C2CE (Condition-Criteria-Cause-Effect) lifecycle for finding development
  • Self-loop handling with configurable max iterations for follow-up procedures
  • Continuous phase support for parallel execution (ethics, governance, quality)
  • Discriminator-based procedure filtering (categories, risk ratings, engagement types)
  • Generation overlay presets: default, thorough, rushed with cost model (base hours + role rates)
  • 6 export formats: JSON, CSV (Disco/Celonis), XES 2.0 (ProM/pm4py), OCEL 2.0, Celonis, Parquet
  • Streaming execution with live anomaly injection
  • Benchmark datasets: simple/medium/complex with configurable anomaly injection
  • ContentGenerator trait with pluggable implementations (--features claude-content for Claude CLI adapter)
  • 284+ tests across audit FSM and optimizer modules
  • CLI: datasynth-data audit validate|info|run|benchmark
Configuration:

# Enable FSM-driven audit generation
audit:
  enabled: true
  fsm:
    enabled: true
    blueprint: builtin:fsa     # builtin:fsa, builtin:ia, builtin:kpmg, builtin:pwc, etc.
    overlay: builtin:default   # builtin:default, builtin:thorough, builtin:rushed

Programmatic usage:

use datasynth_audit_fsm::loader::{BlueprintWithPreconditions, load_overlay, OverlaySource, BuiltinOverlay};
use datasynth_audit_fsm::engine::AuditFsmEngine;
use datasynth_audit_fsm::context::EngagementContext;
use rand::SeedableRng;
use rand_chacha::ChaCha8Rng;

let bwp = BlueprintWithPreconditions::load_builtin_fsa().unwrap();
let overlay = load_overlay(&OverlaySource::Builtin(BuiltinOverlay::Default)).unwrap();
let mut engine = AuditFsmEngine::new(bwp, overlay, ChaCha8Rng::seed_from_u64(42));
let result = engine.run_engagement(&EngagementContext::demo()).unwrap();

println!("Events: {}, Artifacts: {}", result.event_log.len(), result.artifacts.total_artifacts());

The companion datasynth-audit-optimizer crate (16 modules) provides:

  • Graph analysis: blueprint-to-petgraph conversion; shortest-path analysis (minimum transitions: 27 for FSA, 101 for IA)
  • Resource-constrained optimization: Budget/role-aware audit plan selection with coverage reporting
  • Risk-based scoping: Standards/risk coverage analysis, what-if procedure removal impact
  • Portfolio simulation: Multi-engagement with shared resources, scheduling conflicts, systemic findings
  • Conformance metrics: Fitness, precision, anomaly detection statistics
  • Overlay fitting: Iterative parameter search from target engagement profiles
  • Blueprint discovery: Infer methodology from event logs (alpha miner), compare against reference
  • Anomaly calibration: Auto-tune injection rates to target detection difficulty
  • Cross-firm benchmark comparison: Methodology coverage and efficiency across Big 4 firms
  • ISA 600 group audit simulation: Component auditor assignment, materiality allocation, scope
  • Year-over-year engagement chains: Multi-period simulation with carry-forward findings
  • Blueprint testing: Automated blueprint validation and regression testing

For a deep dive, see the Audit FSM Engine documentation.

Interconnectivity & Relationships

  • Multi-tier vendor networks -- Tier 1/2/3 supply chain with behavioral clusters (Strategic, Operational, Transactional, Problematic)
  • Customer segmentation -- Enterprise/MidMarket/SMB/Consumer with Pareto-like revenue distribution and lifecycle stages
  • Relationship strength -- Composite scoring from volume, count, duration, recency, and mutual connections
  • Cross-process links -- P2P and O2C linked via inventory; payments linked to bank reconciliation
  • Entity graphs -- 16 entity types, 26 relationship types with connectivity and clustering metrics
  • Compliance-to-accounting links -- Standards mapped to GL account types and processes; findings linked to controls and affected accounts; filings linked to companies and jurisdictions
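The relationship-strength item above combines several normalized factors into a single score. A hypothetical sketch of such a composite (the weights, saturation scales, and factor forms here are illustrative assumptions, not DataSynth's actual calibration):

```python
import math

def relationship_strength(volume, count, duration_days, recency_days, mutual):
    """Composite relationship score in [0, 1]. Each factor is mapped to
    [0, 1] via saturation, capping, or decay, then combined with weights
    that sum to 1. All constants are illustrative."""
    weights = {"volume": 0.30, "count": 0.20, "duration": 0.20,
               "recency": 0.15, "mutual": 0.15}
    factors = {
        "volume": 1 - math.exp(-volume / 1_000_000),  # saturates in total value
        "count": 1 - math.exp(-count / 100),          # saturates in txn count
        "duration": min(duration_days / 3650, 1.0),   # capped at ~10 years
        "recency": math.exp(-recency_days / 365),     # decays over a year
        "mutual": min(mutual / 20, 1.0),              # capped at 20 shared links
    }
    return sum(weights[k] * factors[k] for k in weights)

strong = relationship_strength(5_000_000, 400, 2500, 10, 12)
weak = relationship_strength(20_000, 3, 90, 300, 0)
```

High-volume, long-standing, recently active relationships score near 1; thin, stale ones near 0, which gives graph exports a meaningful edge-weight feature.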

Fraud, Anomalies & Data Quality

  • ACFE-aligned fraud taxonomy -- Asset misappropriation, corruption, and financial statement fraud with calibrated rates
  • 60+ anomaly types -- Fraud, errors, process issues, statistical outliers, and relational anomalies
  • Collusion modeling -- 9 ring types with role-based conspirators, defection, and escalation dynamics
  • Management override -- Senior-level fraud patterns with fraud triangle modeling
  • Red flag generation -- 40+ probabilistic fraud indicators with Bayesian calibration
  • Industry-specific patterns -- Manufacturing yield manipulation, retail sweethearting, healthcare upcoding
  • Data quality variations -- Missing values (MCAR/MAR/MNAR), format variations, typos (keyboard-aware, OCR), duplicates, encoding issues
  • Full labeling -- Every injected anomaly and quality issue is labeled for supervised ML training
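The labeling principle is straightforward: every injection step emits both the mutated record and a label row describing what was done. A minimal sketch for one quality-issue type, duplicates (field names and the injection API are hypothetical):

```python
import random

def inject_duplicates(records, rate, rng):
    """Duplicate a random fraction of records; emit one label per
    injected copy so the issue is fully supervised."""
    out, labels = [], []
    for rec in records:
        out.append(rec)
        if rng.random() < rate:
            dup = dict(rec, record_id=rec["record_id"] + "_dup")
            out.append(dup)
            labels.append({"record_id": dup["record_id"],
                           "anomaly_type": "duplicate",
                           "source_record": rec["record_id"]})
    return out, labels

rng = random.Random(1)
records = [{"record_id": f"JE{i:05d}", "amount": 100.0 + i} for i in range(1000)]
data, labels = inject_duplicates(records, rate=0.02, rng=rng)
```

Because the generator knows exactly which records it mutated, the label file is exhaustive ground truth rather than a post-hoc heuristic annotation.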

Process & Behavioral Drift

  • Organizational events -- Acquisitions, divestitures, mergers, reorganizations with volume multipliers
  • Process evolution -- S-curve automation rollout, workflow changes, policy updates
  • Technology transitions -- ERP migrations with phased rollout (parallel run, cutover, stabilization)
  • Market drift -- Economic cycles, commodity price shocks, recession modeling
  • Labeled drift events -- Ground truth labels with magnitude and detection difficulty for ML training
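The S-curve rollout mentioned above is a logistic curve: adoption starts near zero, accelerates, crosses 50% at the midpoint, and saturates near 100%. A sketch with illustrative parameters (midpoint and steepness are assumptions, not DataSynth defaults):

```python
import math

def adoption_rate(t_days, midpoint_days=180, steepness=0.03):
    """Logistic S-curve: share of transactions flowing through the
    new automated workflow at day t."""
    return 1.0 / (1.0 + math.exp(-steepness * (t_days - midpoint_days)))

# Sampled monthly over a year: near 0 early, 0.5 at the midpoint,
# near 1 by day 360.
curve = [adoption_rate(t) for t in range(0, 361, 30)]
```

Each transaction can then be routed through the "old" or "new" process variant with probability given by the curve at its posting date, producing realistic, gradual process drift.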

Machine Learning & Graph Export

  • Graph formats -- PyTorch Geometric (.pt), Neo4j (CSV + Cypher), DGL, RustGraph JSON
  • Multi-layer hypergraph -- 3-layer (Governance, Process Events, Accounting Network) with OCPM events as hyperedges and compliance regulation nodes
  • Compliance graph layer -- Standards, findings, filings, and jurisdictions as graph nodes with cross-domain edges to accounts, controls, and companies
  • 28 audit entity types in graph -- CRA, materiality, opinions, sampling plans, SCOTS, unusual items, analytical relationships, group structure, and more
  • 27 cross-entity edge types -- CRA to entity, opinion to engagement, KAM to opinion, sampling to CRA, unusual to JE, and audit lifecycle traversal paths
  • Train/val/test splits -- Configurable data partitioning for ML pipelines
  • Anomaly labels -- Fraud labels, quality issue labels, and drift labels in standardized format
  • Counterfactual pairs -- (original, mutated) journal entry pairs for causal ML training
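For graph data, a stable train/val/test split should assign each entity deterministically so the same node never leaks across splits between runs. One common approach, sketched here as a generic technique (not DataSynth's implementation), hashes the entity ID:

```python
import hashlib

def split_assign(entity_id: str, ratios=(0.8, 0.1, 0.1)) -> str:
    """Deterministic split assignment from a stable hash of the ID.
    Unlike Python's builtin hash(), sha256 is identical across runs
    and processes, so assignments are reproducible."""
    h = int(hashlib.sha256(entity_id.encode()).hexdigest(), 16)
    u = (h % 10_000) / 10_000  # pseudo-uniform value in [0, 1)
    if u < ratios[0]:
        return "train"
    if u < ratios[0] + ratios[1]:
        return "val"
    return "test"

splits = [split_assign(f"node_{i}") for i in range(10_000)]
```

With enough entities the realized proportions converge to the configured ratios while remaining fully reproducible.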

Process Mining

  • OCEL 2.0 -- Object-centric event logs in JSON/XML format
  • XES 2.0 -- XML export compatible with ProM, Celonis, Disco, and pm4py
  • 101+ activity types across 12 process families with 65+ object types
  • 10 OCPM generators -- S2C, H2R, MFG, BANK, AUDIT, Bank Recon, Tax, Treasury, Project Accounting, ESG
  • Process variants -- Happy path (75%), exception path (20%), error path (5%)
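Sampling the documented 75/20/5 variant mix reduces to partitioning the unit interval (variant names here mirror the list above; the sampling code itself is an illustrative sketch):

```python
import random

def sample_variant(rng):
    """Draw a process variant with the documented 75/20/5 mix."""
    u = rng.random()
    if u < 0.75:
        return "happy_path"
    if u < 0.95:       # 0.75 + 0.20
        return "exception_path"
    return "error_path"

rng = random.Random(42)
variants = [sample_variant(rng) for _ in range(20_000)]
```

Exception and error paths then branch into extra activities (rework, escalation, cancellation), which is what gives the event logs realistic variant diversity for conformance checking.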

Advanced Generation

Capability Description
LLM enrichment Pluggable LlmProvider trait (mock/OpenAI-compatible) for vendor names, descriptions, and anomaly explanations
Diffusion models Statistical diffusion with Langevin reverse process; linear/cosine/sigmoid schedules; hybrid blending
Causal models Structural causal models with do-calculus interventions and counterfactual abduction-action-prediction
Natural language config Generate YAML configurations from plain English descriptions
Scenario engine Built-in fraud packs: revenue_fraud, payroll_ghost, vendor_kickback, management_override, comprehensive
Counterfactual simulation 8 intervention types with causal DAG propagation and diff analysis

Production Features

  • REST / gRPC / WebSocket APIs with streaming generation and backpressure handling
  • Authentication -- API key (Argon2id), JWT/OIDC (RS256), role-based access control (Admin/Operator/Viewer)
  • Quality gates -- Configurable pass/fail thresholds (strict/default/lenient) with 8 metrics
  • Plugin SDK -- GeneratorPlugin, SinkPlugin, TransformPlugin traits with thread-safe registry
  • Resource guards -- Memory, disk, and CPU monitoring with graceful degradation (Normal to Reduced to Minimal to Emergency)
  • Deterministic generation -- Seeded ChaCha8 RNG for fully reproducible output
  • Streaming output -- Async generation with configurable backpressure (block/drop_oldest/drop_newest/buffer)
  • Data lineage -- Per-file checksums, lineage graph, W3C PROV-JSON export
  • Country packs -- Pluggable JSON country configuration (US/DE/GB built-in) with holidays, names, tax, addresses
  • Observability -- OpenTelemetry traces, Prometheus metrics, structured JSON logging
  • Docker & Kubernetes -- Multi-stage distroless containers, Helm chart with HPA/PDB, Prometheus ServiceMonitor
  • CI/CD -- 7-job GitHub Actions pipeline (fmt, clippy, cross-platform test, MSRV, security, coverage, benchmarks)
  • EU AI Act -- Article 50 synthetic content marking and Article 10 data governance reports
  • Fuzzing -- cargo-fuzz targets for config parsing, fingerprint loading, and validation
  • Panic-free -- #![deny(clippy::unwrap_used)] enforced across all library crates

Ecosystem Integrations

Integration Capability
Apache Airflow DataSynthOperator, DataSynthSensor, DataSynthValidateOperator for DAG orchestration
dbt Source YAML generation, seed export, project scaffolding
MLflow Generation runs as experiments with parameter, metric, and artifact logging
Apache Spark DataFrames with schema inference and temp view registration

Architecture

DataSynth is a Rust workspace with 18 crates:

datasynth-cli              CLI binary (generate, validate, init, info, fingerprint, scenario)
datasynth-server           REST / gRPC / WebSocket server with auth and rate limiting
datasynth-ui               Tauri + SvelteKit desktop application
                  |
datasynth-runtime          Generation orchestrator (parallel execution, resource guards, streaming)
                  |
datasynth-generators       50+ data generators across all process families
datasynth-banking          KYC / AML banking transaction generator
datasynth-ocpm             OCEL 2.0 / XES 2.0 process mining
datasynth-fingerprint      Privacy-preserving fingerprint extraction and synthesis
datasynth-standards        Accounting and audit standards (IFRS, US GAAP, ISA, SOX, PCAOB)
datasynth-audit-fsm        YAML-driven audit FSM engine (10 builtin blueprints)
datasynth-audit-optimizer  Audit path optimization, Monte Carlo, group audit simulation
                  |
datasynth-graph            Graph export (PyTorch Geometric, Neo4j, DGL, RustGraph, Hypergraph)
datasynth-graph-export     Unified graph export pipeline with 78+ entity types
datasynth-eval             Statistical evaluation, quality gates, auto-tuning
                  |
datasynth-config           Configuration schema, validation, industry presets
                  |
datasynth-core             Domain models, traits, distributions, resource guards
                  |
datasynth-output           Output sinks (CSV, JSON, NDJSON, Parquet + Zstd) with streaming
datasynth-test-utils       Test utilities, fixtures, mocks

Installation

From Source

git clone https://github.com/mivertowski/SyntheticData.git
cd SyntheticData
cargo build --release

The binary is available at target/release/datasynth-data.

Requirements

  • Rust toolchain with cargo -- required to build from source

Configuration

DataSynth uses YAML configuration with 30+ top-level sections. Generate a starter config with init:

datasynth-data init --industry retail --complexity medium -o config.yaml

Minimal configuration:

global:
  seed: 42
  industry: manufacturing
  start_date: 2024-01-01
  period_months: 12
  group_currency: USD

companies:
  - code: "1000"
    name: "Headquarters"
    currency: USD
    country: US

transactions:
  target_count: 100000

output:
  format: csv               # csv, json, parquet

Enable specific modules by adding their sections:

# Fraud detection training data
fraud:
  enabled: true
  fraud_rate: 0.005
anomaly_injection:
  enabled: true
  total_rate: 0.02
  generate_labels: true

# Graph export for GNN training
graph_export:
  enabled: true
  formats: [pytorch_geometric, neo4j]

# Statistical realism
distributions:
  enabled: true
  industry_profile: retail
  amounts:
    distribution_type: lognormal
    benford_compliance: true
  correlations:
    enabled: true
    copula_type: gaussian

# Enterprise process chains
document_flows:
  enabled: true
source_to_pay:
  enabled: true
hr:
  enabled: true
manufacturing:
  enabled: true
financial_reporting:
  enabled: true
esg:
  enabled: true

# Accounting standards
accounting_standards:
  enabled: true
  framework: us_gaap         # us_gaap, ifrs, french_gaap, german_gaap, dual_reporting

# Process mining
ocpm:
  enabled: true
  output:
    ocel_json: true
    xes: true

Industry presets (manufacturing, retail, financial_services, healthcare, technology) and complexity levels (small ~100 accounts, medium ~400, large ~2500) provide sensible defaults.

See the Configuration Guide for the complete reference.


Output Structure

DataSynth generates 100+ interconnected output tables organized by domain:

output/
+-- journal_entries.csv             Flat CSV: one row per JE line item
+-- journal_entries.json            Nested JSON: full JE structure
+-- acdoca.csv                      SAP ACDOCA-style universal journal
|
+-- master_data/
|   +-- vendors.json
|   +-- customers.json
|   +-- materials.json
|   +-- fixed_assets.json
|   +-- employees.json              Includes salary, hire date, department
|   +-- cost_centers.json           Cost center hierarchy
|
+-- document_flows/
|   +-- purchase_orders.json
|   +-- goods_receipts.json
|   +-- vendor_invoices.json
|   +-- payments.json
|   +-- customer_receipts.json
|   +-- sales_orders.json
|   +-- deliveries.json
|   +-- customer_invoices.json
|   +-- document_references.json    Cross-doc links (PO->GR->Invoice->Payment)
|
+-- sourcing/                       S2C pipeline
|   +-- spend_analyses, sourcing_projects, rfx_events, supplier_bids,
|       bid_evaluations, procurement_contracts, catalog_items, supplier_scorecards
|
+-- subledger/
|   +-- ap_invoices.json, ar_invoices.json
|   +-- fa_records.json, inventory_positions.json, inventory_movements.json
|   +-- ar_aging.json, ap_aging.json
|   +-- depreciation_runs.json, inventory_valuation.json
|   +-- dunning_runs.json, dunning_letters.json
|
+-- hr/
|   +-- payroll_runs.json, payroll_line_items.json
|   +-- time_entries.json, expense_reports.json, benefit_enrollments.json
|   +-- pension_plans.json, pension_obligations.json, plan_assets.json, pension_disclosures.json
|   +-- stock_grants.json, stock_comp_expense.json
|   +-- employee_change_history.json
|
+-- manufacturing/
|   +-- production_orders.json, quality_inspections.json, cycle_counts.json,
|       bom_components.json, inventory_movements.json
|
+-- financial_reporting/
|   +-- financial_statements.json   All standalone statements combined
|   +-- bank_reconciliations.json
|   +-- notes_to_financial_statements.json
|   +-- standalone/                 Per-entity: {entity_code}_financial_statements.json
|   +-- consolidated/
|   |   +-- consolidated_financial_statements.json
|   |   +-- consolidation_schedule.json
|   +-- segment_reporting/
|       +-- segment_reports.json
|       +-- segment_reconciliations.json
|
+-- period_close/
|   +-- trial_balances.json
|
+-- balance/
|   +-- opening_balances.json
|   +-- subledger_reconciliation.json
|
+-- intercompany/
|   +-- group_structure.json
|   +-- ic_matched_pairs.json
|   +-- ic_seller_journal_entries.json
|   +-- ic_buyer_journal_entries.json
|   +-- ic_elimination_entries.json
|   +-- nci_measurements.json
|
+-- accounting_standards/
|   +-- customer_contracts.json, impairment_tests.json
|   +-- business_combinations.json, business_combination_journal_entries.json
|   +-- ecl_models.json, ecl_provision_movements.json, ecl_journal_entries.json
|   +-- provisions.json, provision_movements.json, contingent_liabilities.json
|   +-- fx/currency_translation_results.json
|
+-- tax/
|   +-- tax_jurisdictions.json, tax_codes.json, tax_provisions.json
|   +-- tax_lines.json, tax_returns.json, withholding_records.json
|   +-- temporary_differences.json, etr_reconciliation.json,
|       deferred_tax_rollforward.json, deferred_tax_journal_entries.json
|
+-- treasury/
|   +-- cash_positions.json, cash_forecasts.json, cash_pools.json,
|       debt_instruments.json, hedging_instruments.json, hedge_relationships.json,
|       bank_guarantees.json, netting_runs.json
|
+-- project_accounting/
|   +-- projects.json, cost_lines.json, revenue_records.json,
|       earned_value_metrics.json, change_orders.json, milestones.json
|
+-- esg/
|   +-- emission_records.json, energy_consumption.json, water_usage.json, ...
|
+-- internal_controls/              CSV files for BI/analytics
|   +-- internal_controls.csv
|   +-- control_account_mappings.csv, control_process_mappings.csv
|   +-- control_threshold_mappings.csv, control_doctype_mappings.csv
|   +-- sod_conflict_pairs.csv, sod_rules.csv
|   +-- coso_control_mapping.csv
|   +-- internal_controls.json, sod_violations.json
|
+-- audit/                          33+ audit files
|   +-- audit_engagements.json, audit_workpapers.json, audit_evidence.json
|   +-- audit_risk_assessments.json, audit_findings.json, audit_judgments.json
|   +-- audit_confirmations.json, audit_procedure_steps.json, audit_samples.json
|   +-- engagement_letters.json (ISA 210)
|   +-- combined_risk_assessments.json (ISA 315)
|   +-- significant_transaction_classes.json (ISA 315 SCOTS)
|   +-- materiality_calculations.json (ISA 320)
|   +-- service_organizations.json, soc_reports.json, user_entity_controls.json (ISA 402)
|   +-- unusual_items.json, analytical_relationships.json (ISA 520)
|   +-- sampling_plans.json, sampled_items.json (ISA 530)
|   +-- accounting_estimates.json (ISA 540)
|   +-- subsequent_events.json (ISA 560)
|   +-- going_concern_assessments.json (ISA 570)
|   +-- component_auditors.json, group_audit_plan.json,
|   |   component_instructions.json, component_reports.json (ISA 600)
|   +-- audit_opinions.json, key_audit_matters.json (ISA 700/701)
|   +-- sox_302_certifications.json, sox_404_assessments.json
|   +-- isa_mappings.json, isa_pcaob_mappings.json
|
+-- banking/
|   +-- banking_customers.json, banking_accounts.json, banking_transactions.json,
|       aml_transaction_labels.json, aml_customer_labels.json, aml_narratives.json
|
+-- sales_kpi_budgets/
|   +-- sales_quotes.json, management_kpis.json, budgets.json
|
+-- process_mining/                 OCEL 2.0 JSON, XES 2.0, process variants
+-- graphs/                         PyTorch Geometric, Neo4j CSV+Cypher, DGL, RustGraph
+-- labels/                         anomaly_labels, fraud_labels, quality_labels
+-- standards/                      Compliance standards, cross-references, filings
+-- events/                         process_evolution_events, organizational_events

Python SDK

cd python && pip install -e ".[all]"

from datasynth_py import DataSynth
from datasynth_py import to_pandas, to_polars, list_tables
from datasynth_py.config import blueprints

# Generate with a preset blueprint
config = blueprints.retail_small(companies=4, transactions=10000)
result = DataSynth().generate(config=config, output={"format": "csv", "sink": "temp_dir"})

# Load as DataFrames
tables = list_tables(result)                  # ['journal_entries', 'vendors', ...]
df = to_pandas(result, "journal_entries")
pl_df = to_polars(result, "vendors")

# Async generation
from datasynth_py import AsyncDataSynth
async with AsyncDataSynth() as synth:
    result = await synth.generate(config=config)

# Fingerprint operations
synth = DataSynth()
synth.fingerprint.extract("./real_data/", "./fingerprint.dsf", privacy_level="standard")
report = synth.fingerprint.evaluate("./fingerprint.dsf", "./synthetic/")

Available blueprints: retail_small(), banking_medium(), manufacturing_large(), ml_training(), statistical_validation(), with_distributions(), with_llm_enrichment(), with_diffusion(), with_causal()

Optional dependencies: [pandas], [polars], [jupyter], [streaming], [airflow], [dbt], [mlflow], [spark], [all]


Server & Deployment

# Start REST + gRPC server
cargo run -p datasynth-server -- --rest-port 3000 --grpc-port 50051

# With authentication
cargo run -p datasynth-server -- --api-keys "key1,key2"

# With JWT/OIDC (Keycloak, Auth0, Entra ID)
cargo run -p datasynth-server --features jwt -- \
  --jwt-issuer "https://auth.example.com" \
  --jwt-audience "datasynth-api"

API endpoints:

curl http://localhost:3000/health
curl http://localhost:3000/ready
curl http://localhost:3000/metrics
curl -H "Authorization: Bearer <key>" -X POST http://localhost:3000/api/stream/start

WebSocket streaming: ws://localhost:3000/ws/events

Docker:

docker build -t datasynth:latest .
docker run -p 3000:3000 -p 50051:50051 datasynth:latest

# Full stack with Prometheus + Grafana
docker compose up -d

See the Deployment Guide for Docker, Kubernetes Helm chart, systemd, and reverse proxy configuration.


Desktop UI

cd crates/datasynth-ui
npm install
npm run tauri dev

Cross-platform Tauri + SvelteKit application with 40+ configuration pages, real-time streaming visualization, and preset management.


Privacy-Preserving Fingerprinting

Extract statistical fingerprints from real data with formal privacy guarantees, then generate matching synthetic data:

# Extract with differential privacy
datasynth-data fingerprint extract --input ./real_data.csv --output ./fp.dsf --privacy-level standard

# Validate and evaluate
datasynth-data fingerprint validate ./fp.dsf
datasynth-data fingerprint evaluate --fingerprint ./fp.dsf --synthetic ./synthetic/

Privacy Level Epsilon k-Anonymity Description
minimal 5.0 3 Higher utility, lower privacy
standard 1.0 5 Balanced (default)
high 0.5 10 Higher privacy
maximum 0.1 20 Maximum privacy

Includes Renyi DP and zCDP composition accounting, privacy budget management, federated fingerprinting for distributed data, membership inference attack testing, and cryptographic synthetic data certificates (HMAC-SHA256).
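The epsilon values in the table control the scale of the noise a differentially private mechanism adds: for the Laplace mechanism, noise scale is sensitivity / epsilon, so the "maximum" level (epsilon = 0.1) adds ten times more noise than "standard" (epsilon = 1.0). A generic sketch of this trade-off (not DataSynth's fingerprint internals):

```python
import math
import random

def laplace_noise(sensitivity: float, epsilon: float, rng) -> float:
    """Laplace-mechanism noise with scale b = sensitivity / epsilon.
    Smaller epsilon -> larger b -> more noise -> stronger privacy."""
    b = sensitivity / epsilon
    u = rng.random() - 0.5
    u = max(min(u, 0.4999999), -0.4999999)  # keep the log() argument positive
    sign = 1.0 if u >= 0 else -1.0
    return -b * sign * math.log(1.0 - 2.0 * abs(u))  # inverse-CDF sampling

def private_mean(values, lower, upper, epsilon, rng):
    """Differentially private mean of values clipped to [lower, upper]."""
    clipped = [min(max(v, lower), upper) for v in values]
    true_mean = sum(clipped) / len(clipped)
    sensitivity = (upper - lower) / len(clipped)  # L1 sensitivity of a bounded mean
    return true_mean + laplace_noise(sensitivity, epsilon, rng)

rng = random.Random(0)
amounts = [float(i) for i in range(1, 101)]  # true mean = 50.5
noisy = private_mean(amounts, 0.0, 100.0, epsilon=10.0, rng=rng)
```

Composition accounting (Renyi DP, zCDP) then tracks how these per-query epsilons accumulate across the many statistics a fingerprint extracts.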


Use Cases

Domain Application
Fraud Detection Train supervised models with ACFE-aligned labeled fraud patterns and collusion networks
Graph Neural Networks Entity relationship graphs with typed edges for anomaly detection
AML / KYC Testing Banking transactions with structuring, layering, and mule typologies
Audit Analytics Validate audit procedures with known control exceptions and ISA/PCAOB mappings
Process Mining OCEL 2.0 and XES 2.0 event logs for process discovery and conformance checking
ERP Load Testing Realistic transaction volumes with proper document chains
SOX Compliance Internal control monitoring with COSO 2013 mappings and deficiency classification
Causal ML Research Interventional and counterfactual datasets with causal DAG propagation
Data Quality ML Train models to detect missing values, format variations, typos, and duplicates
ESG Reporting GHG emissions, diversity metrics, and GRI/SASB/TCFD disclosure data
Tax Compliance Multi-jurisdiction tax returns, provisions, and withholding records
Treasury Operations Cash positioning, hedging effectiveness, and debt covenant monitoring

Performance

Metric Value
Single-threaded throughput 200,000+ journal entries/second
Parallel scaling Linear with available CPU cores
Memory model Streaming generation with configurable backpressure
Determinism Fully reproducible via seeded ChaCha8 RNG

Documentation


License

Copyright 2024-2026 Michael Ivertowski

Licensed under the Apache License, Version 2.0. See LICENSE for details.


Support

Commercial support, custom development, and enterprise licensing are available. Open an issue on GitHub.


DataSynth is provided "as is" without warranty of any kind. It is intended for testing, development, and research purposes. Generated data should not be used as a substitute for real financial records.
