DataSynth v2.0.0


Synthetic enterprise data generation for ML training, audit analytics, and system testing.

DataSynth generates statistically realistic, fully interconnected enterprise financial data. It produces coherent General Ledger journal entries, document flows, subledger records, banking transactions, process mining event logs, and graph exports across 20+ enterprise process families.

Generated data respects accounting identities (debits = credits, Assets = Liabilities + Equity), follows empirical distributions (Benford's Law, log-normal mixtures), and maintains referential integrity across 100+ output tables.
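Benford's Law predicts that the leading digit d of naturally occurring amounts appears with frequency log10(1 + 1/d), so "1" leads about 30.1% of the time. A minimal, standalone sketch (not DataSynth code) of checking first-digit compliance on log-normally distributed amounts:

```python
import math
import random

def benford_expected(d: int) -> float:
    """Expected first-digit frequency under Benford's Law: log10(1 + 1/d)."""
    return math.log10(1 + 1 / d)

def first_digit(x: float) -> int:
    # Scientific notation puts the leading digit first: "2.98e+03" -> 2
    return int(f"{abs(x):.10e}"[0])

def first_digit_frequencies(amounts):
    counts = {d: 0 for d in range(1, 10)}
    for a in amounts:
        counts[first_digit(a)] += 1
    return {d: c / len(amounts) for d, c in counts.items()}

# Log-normal amounts spanning several orders of magnitude are
# approximately Benford-compliant; parameters here are illustrative.
rng = random.Random(42)
amounts = [math.exp(rng.gauss(8.0, 2.5)) for _ in range(50_000)]
freqs = first_digit_frequencies(amounts)
```

A fraud-injection mode would deliberately perturb these frequencies; the labeled deviation then serves as ground truth for anomaly detectors.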


Quick Start

# Build from source
git clone https://github.com/mivertowski/SyntheticData.git
cd SyntheticData
cargo build --release

# Demo mode -- generates a complete dataset with defaults
./target/release/datasynth-data generate --demo --output ./demo-output

# Full audit simulation (113+ output files)
./target/release/datasynth-data generate --demo --preset audit-group --output ./audit-output

# Or configure for your use case
./target/release/datasynth-data init --industry manufacturing --complexity medium -o config.yaml
./target/release/datasynth-data validate --config config.yaml
./target/release/datasynth-data generate --config config.yaml --output ./output

Group Audit Simulation

The audit-group preset generates a complete enterprise group audit dataset following ISA, IFRS, US GAAP, and local regulations:

./target/release/datasynth-data generate --demo --preset audit-group --output ./audit-output

# Export in SAP, French (FEC), or German (GoBD) audit formats
./target/release/datasynth-data generate --config config.yaml --output ./output --export-format sap --export-format fec

This produces 113+ interconnected files:

Category Content
Financial Statements Standalone + consolidated BS/IS/CF with elimination schedules
Audit Lifecycle Engagement, risk assessment, procedures, sampling, findings, opinion
ISA 600 Group Audit Component auditors, materiality allocation, scope, instructions, reports
Risk Assessment Combined Risk Assessment (CRA) per account area and assertion
Audit Methodology Materiality (ISA 320), sampling (ISA 530), analytical procedures (ISA 520)
Accounting Standards Deferred tax, ECL, provisions, pensions, stock comp, business combinations
SOX Compliance Section 302 certifications, Section 404 ICFR assessments
Graph Export 78+ entity types, 39+ edge types for ML training and AI agent interaction

CRA drives sampling, sampling correlates with misstatement rates, misstatements drive findings, findings drive the audit opinion.
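The final link in that chain, findings driving the opinion, can be illustrated with a toy version of the ISA 700/705 decision logic (this is a simplified sketch for intuition, not DataSynth's actual implementation):

```python
def derive_opinion(findings, going_concern_uncertain=False):
    """Simplified ISA 700/705-style logic: material and pervasive
    misstatements yield an adverse opinion, material-only a qualified
    one, otherwise the opinion is unmodified. A going-concern
    uncertainty adds a separate emphasis paragraph flag (ISA 570)."""
    material = [f for f in findings if f.get("material")]
    pervasive = [f for f in material if f.get("pervasive")]
    if pervasive:
        opinion = "adverse"
    elif material:
        opinion = "qualified"
    else:
        opinion = "unmodified"
    return {"opinion": opinion,
            "going_concern_paragraph": going_concern_uncertain}

result = derive_opinion([{"material": True, "pervasive": False}])
```

Because the generated findings are themselves derived from injected misstatements, the resulting opinions stay consistent with the underlying data.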


Key Capabilities

Statistical Foundations

  • Distribution engine -- Log-normal mixtures, Gaussian mixtures, Pareto, Weibull, Beta, and zero-inflated distributions with configurable components
  • Copula correlations -- Cross-field dependency modeling via Gaussian, Clayton, Gumbel, Frank, and Student-t copulas
  • Benford's Law -- First- and second-digit compliance with configurable deviation for anomaly injection
  • Temporal patterns -- Month-end/quarter-end/year-end volume spikes, intraday segments, business day calendars (15 regions), processing lags, and fiscal calendar support
  • Regime changes -- Economic cycles, acquisition effects, and structural breaks in time series
  • Industry profiles -- Pre-configured distributions for Retail, Manufacturing, Financial Services, Healthcare, and Technology
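The idea behind a log-normal mixture is to combine several log-normal components, each with its own weight, so that, for example, many small routine postings coexist with a heavy tail of large ones. A self-contained sketch using only the standard library (component parameters are illustrative, not DataSynth defaults):

```python
import math
import random

def sample_lognormal_mixture(n, components, rng):
    """Sample n values from a log-normal mixture.
    components: list of (weight, mu, sigma) tuples on the log scale."""
    weights = [w for w, _, _ in components]
    out = []
    for _ in range(n):
        # Pick a component by weight, then draw from its log-normal
        _, mu, sigma = rng.choices(components, weights=weights, k=1)[0]
        out.append(math.exp(rng.gauss(mu, sigma)))
    return out

rng = random.Random(7)
# 80% routine postings, 20% large transactions
amounts = sample_lognormal_mixture(10_000, [(0.8, 5.0, 0.8), (0.2, 9.0, 1.2)], rng)
```

The resulting distribution is right-skewed (mean well above median), which matches empirical journal-entry amount distributions far better than a single Gaussian.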

Enterprise Process Simulation

Every process chain generates its own master data, documents, and journal entries -- all cross-referenced:

Process Family Scope
General Ledger Journal entries, chart of accounts (small/medium/large), ACDOCA event logs
Procure-to-Pay Purchase requisitions, POs, goods receipts, vendor invoices, payments, three-way match
Order-to-Cash Sales orders, deliveries, customer invoices, receipts, dunning
Source-to-Contract Spend analysis, sourcing projects, supplier qualification, RFx, bids, contracts, scorecards
Hire-to-Retire Payroll runs, tax/deduction calculations, time & attendance, expense reports, benefit enrollment
Manufacturing Production orders, BOM explosion, routing operations, WIP costing, quality inspections, cycle counts
Financial Reporting Balance sheet, income statement, cash flow, changes in equity, KPIs, budget variance
Tax Accounting Multi-jurisdiction tax (Federal/State/Local), VAT/GST returns, ASC 740/IAS 12 provisions, FIN 48 uncertain positions, withholding
Treasury Cash positioning, probability-weighted forecasts, cash pooling, hedging (ASC 815/IFRS 9), debt covenants, netting
Project Accounting WBS hierarchies, cost lines, percentage-of-completion revenue, earned value (SPI/CPI/EAC), change orders
ESG / Sustainability GHG Scope 1/2/3 emissions, energy/water/waste, workforce diversity, safety metrics, GRI/SASB/TCFD disclosures
Intercompany IC matching, transfer pricing, consolidation eliminations, currency translation
Subledgers AR, AP, Fixed Assets, Inventory -- each with GL reconciliation
Period Close Monthly close engine, depreciation runs, accruals, year-end closing entries
Banking / KYC / AML Customer personas, KYC profiles, AML typologies (structuring, layering, mule, funnel)
Sales Quote-to-order pipeline with win rate modeling and pricing negotiation
Bank Reconciliation Statement matching, outstanding checks, deposits in transit
Audit ISA lifecycle: engagements, workpapers, evidence, risk assessments, findings, opinions (ISA 700), KAMs (ISA 701), SOX 302/404
Group Audit (ISA 600) Component auditors, materiality allocation, scope assignment, component instructions/reports, consolidation

Accounting, Audit & Compliance Standards

  • Accounting frameworks -- US GAAP, IFRS, French GAAP (PCG), German GAAP (HGB/SKR04), and dual reporting
  • Revenue recognition -- ASC 606 / IFRS 15 with contract generation, performance obligations, and SSP allocation
  • Leases -- ASC 842 / IFRS 16 with ROU assets, lease liabilities, and classification
  • Fair value -- ASC 820 / IFRS 13 Level 1/2/3 hierarchy
  • Impairment -- ASC 360 / IAS 36 testing with fair value estimation
  • Audit standards -- ISA (34 standards), PCAOB (19+ standards) with procedure mapping
  • SOX compliance -- Section 302/404 assessments with deficiency classification and material weakness detection
  • COSO 2013 -- 5 components, 17 principles, maturity levels, entity-level and transaction-level controls
  • Compliance regulations -- 45+ built-in standards registry, jurisdiction profiles (10 countries), regulatory filings, audit procedures, and compliance findings with full deficiency classification
  • Cross-domain compliance graph -- Standards linked to GL account types and business processes; full traversal paths (Company -> Jurisdiction -> Standard -> Account -> JournalEntry)
  • Localized exports -- FEC (French) and GoBD (German) audit file formats
  • Enterprise Group Audit (ISA 600) -- Component auditor assignment, group materiality allocation, scope assignment (full/specific/analytical), component instructions and reports
  • Audit Opinion (ISA 700/705/706/701) -- Opinion derived from findings severity and going concern, Key Audit Matters, PCAOB ICFR opinion
  • Audit Methodology -- Combined Risk Assessment (ISA 315), materiality calculations (ISA 320), sampling methodology (ISA 530), SCOTS classification, unusual item detection, analytical relationships (ISA 520)
  • Deferred Tax (IAS 12 / ASC 740) -- Temporary differences, ETR reconciliation, rollforward schedules, valuation allowances
  • Business Combinations (IFRS 3 / ASC 805) -- Purchase price allocation, fair value step-ups, goodwill, contingent consideration
  • Segment Reporting (IFRS 8 / ASC 280) -- Operating segments with reconciliation to consolidated totals
  • Expected Credit Loss (IFRS 9 / ASC 326) -- Provision matrix by aging bucket, forward-looking scenarios, ECL movements
  • Pensions (IAS 19 / ASC 715) -- DBO rollforward, plan assets, pension expense, OCI remeasurements
  • Provisions (IAS 37 / ASC 450) -- Framework-aware recognition thresholds, provision movements
  • Stock Compensation (ASC 718 / IFRS 2) -- Grants, vesting schedules, expense recognition
  • Functional Currency (IAS 21) -- Per-entity functional currency, CTA as OCI
  • Consolidated Financial Statements -- Standalone + consolidated with elimination schedules
  • Going Concern (ISA 570) -- Financial indicator derivation, management mitigation plans
  • Subsequent Events (ISA 560 / IAS 10) -- Adjusting and non-adjusting events
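To make the ECL provision-matrix item above concrete: under the IFRS 9 simplified approach, the allowance is the sum over aging buckets of exposure times an expected loss rate. A minimal sketch with made-up exposures and rates (not DataSynth's calibration):

```python
def ecl_provision_matrix(exposures, loss_rates):
    """Per-bucket allowance and total ECL: allowance[b] = exposure[b] * rate[b]."""
    allowance = {b: exposures[b] * loss_rates[b] for b in exposures}
    return allowance, sum(allowance.values())

# Receivables exposure by aging bucket (illustrative figures)
exposures = {"current": 500_000.0, "1-30": 120_000.0, "31-60": 60_000.0,
             "61-90": 25_000.0, "90+": 10_000.0}
# Historical loss rates, adjusted for forward-looking scenarios
loss_rates = {"current": 0.003, "1-30": 0.016, "31-60": 0.036,
              "61-90": 0.066, "90+": 0.106}

by_bucket, total_ecl = ecl_provision_matrix(exposures, loss_rates)  # 8,290.0
```

In the generated data, the resulting allowance ties back to the ECL journal entries and movement schedules, preserving referential integrity.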

YAML-Driven Audit FSM Engine

The datasynth-audit-fsm crate provides a methodology-agnostic state machine engine that loads audit methodology blueprints from YAML and generates event-sourced audit trails with typed artifacts.

The engine uses a two-layer architecture: blueprints define what happens (procedures, phases, state machines, evidence requirements, standards references), while generation overlays define how it happens (revision probabilities, timing distributions, artifact volumes, anomaly injection rates). The same blueprint can produce a thorough engagement or a rushed engagement by swapping a single overlay file.

Ten built-in blueprints cover the major audit methodologies:

Blueprint Procedures Phases Steps Standards Events Artifacts
Financial Statement Audit (FSA) 9 3 24 14 ISA 51 1,916
Internal Audit (IA) 34 9 82 52 IIA-GIAS 205 3,808
KPMG, PwC, Deloitte, EY GAM Lite -- Firm-specific ISA methodologies
SOC 2 Type II -- Trust Services Criteria
PCAOB Integrated -- AS 2201 integrated audit
Regulatory Examination -- Regulatory examination scope

Additional methodology blueprints are available at SyntheticDataBlueprints.

The StepDispatcher maps all step commands to 14 concrete audit generators, enriched by the analytics inventory (87 data requirements + 71 analytical procedures across FSA, IA, SOC 2, PCAOB, and Regulatory blueprints) and the form ontology (4,437 canonical field categories). Every step carries a judgment_level for risk-based procedure selection. Every artifact is data-driven: findings cite specific journal entries, workpapers reference applicable ISA paragraphs, and evidence descriptions include expected form fields.

14/14 audit data types and 14/14 analytical procedures -- full coverage of all data types required by FSA audit steps:

Category Data Types
Core financial General ledger, journal entries (with ISA 240 flags), financial statements (with comparatives), sub-ledgers
External evidence Bank statements, confirmations, contracts, estimates
Year-over-year Prior-year comparatives, prior-year findings with remediation tracking
Reference data Industry benchmarks (10 metrics/industry), organizational profile (IT systems, regulatory env)
Governance Board minutes (quarterly + audit committee), management reports (KPI/RAG/budget)
IT controls Access logs (business-hour weighting), change management records (approval gap correlation)

Engine features:

  • 8-state C2CE (Condition-Criteria-Cause-Effect) lifecycle for finding development
  • Self-loop handling with configurable max iterations for follow-up procedures
  • Continuous phase support for parallel execution (ethics, governance, quality)
  • Discriminator-based procedure filtering (categories, risk ratings, engagement types)
  • Generation overlay presets: default, thorough, rushed with cost model (base hours + role rates)
  • 6 export formats: JSON, CSV (Disco/Celonis), XES 2.0 (ProM/pm4py), OCEL 2.0, Celonis, Parquet
  • Streaming execution with live anomaly injection
  • Benchmark datasets: simple/medium/complex with configurable anomaly injection
  • ContentGenerator trait with pluggable implementations (--features claude-content for Claude CLI adapter)
  • 284+ tests across audit FSM and optimizer modules
  • CLI: datasynth-data audit validate|info|run|benchmark
Configuration:

# Enable FSM-driven audit generation
audit:
  enabled: true
  fsm:
    enabled: true
    blueprint: builtin:fsa     # builtin:fsa, builtin:ia, builtin:kpmg, builtin:pwc, etc.
    overlay: builtin:default   # builtin:default, builtin:thorough, builtin:rushed

Programmatic usage:

use datasynth_audit_fsm::loader::{BlueprintWithPreconditions, load_overlay, OverlaySource, BuiltinOverlay};
use datasynth_audit_fsm::engine::AuditFsmEngine;
use datasynth_audit_fsm::context::EngagementContext;
use rand::SeedableRng;
use rand_chacha::ChaCha8Rng;

let bwp = BlueprintWithPreconditions::load_builtin_fsa().unwrap();
let overlay = load_overlay(&OverlaySource::Builtin(BuiltinOverlay::Default)).unwrap();
let mut engine = AuditFsmEngine::new(bwp, overlay, ChaCha8Rng::seed_from_u64(42));
let result = engine.run_engagement(&EngagementContext::demo()).unwrap();

println!("Events: {}, Artifacts: {}", result.event_log.len(), result.artifacts.total_artifacts());

The companion datasynth-audit-optimizer crate (16 modules) provides:

  • Graph analysis: blueprint-to-petgraph conversion; shortest-path analysis (minimum transitions: 27 for FSA, 101 for IA)
  • Resource-constrained optimization: Budget/role-aware audit plan selection with coverage reporting
  • Risk-based scoping: Standards/risk coverage analysis, what-if procedure removal impact
  • Portfolio simulation: Multi-engagement with shared resources, scheduling conflicts, systemic findings
  • Conformance metrics: Fitness, precision, anomaly detection statistics
  • Overlay fitting: Iterative parameter search from target engagement profiles
  • Blueprint discovery: Infer methodology from event logs (alpha miner), compare against reference
  • Anomaly calibration: Auto-tune injection rates to target detection difficulty
  • Cross-firm benchmark comparison: Methodology coverage and efficiency across Big 4 firms
  • ISA 600 group audit simulation: Component auditor assignment, materiality allocation, scope
  • Year-over-year engagement chains: Multi-period simulation with carry-forward findings
  • Blueprint testing: Automated blueprint validation and regression testing

For a deep dive, see the Audit FSM Engine documentation.

Interconnectivity & Relationships

  • Multi-tier vendor networks -- Tier 1/2/3 supply chain with behavioral clusters (Strategic, Operational, Transactional, Problematic)
  • Customer segmentation -- Enterprise/MidMarket/SMB/Consumer with Pareto-like revenue distribution and lifecycle stages
  • Relationship strength -- Composite scoring from volume, count, duration, recency, and mutual connections
  • Cross-process links -- P2P and O2C linked via inventory; payments linked to bank reconciliation
  • Entity graphs -- 16 entity types, 26 relationship types with connectivity and clustering metrics
  • Compliance-to-accounting links -- Standards mapped to GL account types and processes; findings linked to controls and affected accounts; filings linked to companies and jurisdictions
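The relationship-strength item above combines several normalized factors into a single score. A hypothetical sketch of such a composite (the weights, saturation scales, and factor forms here are illustrative assumptions, not DataSynth's actual calibration):

```python
import math

def relationship_strength(volume, count, duration_days, recency_days, mutual):
    """Composite relationship score in [0, 1]. Each factor is mapped to
    [0, 1] via saturation, capping, or decay, then combined with weights
    that sum to 1. All constants are illustrative."""
    weights = {"volume": 0.30, "count": 0.20, "duration": 0.20,
               "recency": 0.15, "mutual": 0.15}
    factors = {
        "volume": 1 - math.exp(-volume / 1_000_000),  # saturates in total value
        "count": 1 - math.exp(-count / 100),          # saturates in txn count
        "duration": min(duration_days / 3650, 1.0),   # capped at ~10 years
        "recency": math.exp(-recency_days / 365),     # decays over a year
        "mutual": min(mutual / 20, 1.0),              # capped at 20 shared links
    }
    return sum(weights[k] * factors[k] for k in weights)

strong = relationship_strength(5_000_000, 400, 2500, 10, 12)
weak = relationship_strength(20_000, 3, 90, 300, 0)
```

High-volume, long-standing, recently active relationships score near 1; thin, stale ones near 0, which gives graph exports a meaningful edge-weight feature.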

Fraud, Anomalies & Data Quality

  • ACFE-aligned fraud taxonomy -- Asset misappropriation, corruption, and financial statement fraud with calibrated rates
  • 60+ anomaly types -- Fraud, errors, process issues, statistical outliers, and relational anomalies
  • Collusion modeling -- 9 ring types with role-based conspirators, defection, and escalation dynamics
  • Management override -- Senior-level fraud patterns with fraud triangle modeling
  • Red flag generation -- 40+ probabilistic fraud indicators with Bayesian calibration
  • Industry-specific patterns -- Manufacturing yield manipulation, retail sweethearting, healthcare upcoding
  • Data quality variations -- Missing values (MCAR/MAR/MNAR), format variations, typos (keyboard-aware, OCR), duplicates, encoding issues
  • Full labeling -- Every injected anomaly and quality issue is labeled for supervised ML training
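The labeling principle is straightforward: every injection step emits both the mutated record and a label row describing what was done. A minimal sketch for one quality-issue type, duplicates (field names and the injection API are hypothetical):

```python
import random

def inject_duplicates(records, rate, rng):
    """Duplicate a random fraction of records; emit one label per
    injected copy so the issue is fully supervised."""
    out, labels = [], []
    for rec in records:
        out.append(rec)
        if rng.random() < rate:
            dup = dict(rec, record_id=rec["record_id"] + "_dup")
            out.append(dup)
            labels.append({"record_id": dup["record_id"],
                           "anomaly_type": "duplicate",
                           "source_record": rec["record_id"]})
    return out, labels

rng = random.Random(1)
records = [{"record_id": f"JE{i:05d}", "amount": 100.0 + i} for i in range(1000)]
data, labels = inject_duplicates(records, rate=0.02, rng=rng)
```

Because the generator knows exactly which records it mutated, the label file is exhaustive ground truth rather than a post-hoc heuristic annotation.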

Process & Behavioral Drift

  • Organizational events -- Acquisitions, divestitures, mergers, reorganizations with volume multipliers
  • Process evolution -- S-curve automation rollout, workflow changes, policy updates
  • Technology transitions -- ERP migrations with phased rollout (parallel run, cutover, stabilization)
  • Market drift -- Economic cycles, commodity price shocks, recession modeling
  • Labeled drift events -- Ground truth labels with magnitude and detection difficulty for ML training
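The S-curve rollout mentioned above is a logistic curve: adoption starts near zero, accelerates, crosses 50% at the midpoint, and saturates near 100%. A sketch with illustrative parameters (midpoint and steepness are assumptions, not DataSynth defaults):

```python
import math

def adoption_rate(t_days, midpoint_days=180, steepness=0.03):
    """Logistic S-curve: share of transactions flowing through the
    new automated workflow at day t."""
    return 1.0 / (1.0 + math.exp(-steepness * (t_days - midpoint_days)))

# Sampled monthly over a year: near 0 early, 0.5 at the midpoint,
# near 1 by day 360.
curve = [adoption_rate(t) for t in range(0, 361, 30)]
```

Each transaction can then be routed through the "old" or "new" process variant with probability given by the curve at its posting date, producing realistic, gradual process drift.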

Machine Learning & Graph Export

  • Graph formats -- PyTorch Geometric (.pt), Neo4j (CSV + Cypher), DGL, RustGraph JSON
  • Multi-layer hypergraph -- 3-layer (Governance, Process Events, Accounting Network) with OCPM events as hyperedges and compliance regulation nodes
  • Compliance graph layer -- Standards, findings, filings, and jurisdictions as graph nodes with cross-domain edges to accounts, controls, and companies
  • 28 audit entity types in graph -- CRA, materiality, opinions, sampling plans, SCOTS, unusual items, analytical relationships, group structure, and more
  • 27 cross-entity edge types -- CRA to entity, opinion to engagement, KAM to opinion, sampling to CRA, unusual to JE, and audit lifecycle traversal paths
  • Train/val/test splits -- Configurable data partitioning for ML pipelines
  • Anomaly labels -- Fraud labels, quality issue labels, and drift labels in standardized format
  • Counterfactual pairs -- (original, mutated) journal entry pairs for causal ML training
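For graph data, a stable train/val/test split should assign each entity deterministically so the same node never leaks across splits between runs. One common approach, sketched here as a generic technique (not DataSynth's implementation), hashes the entity ID:

```python
import hashlib

def split_assign(entity_id: str, ratios=(0.8, 0.1, 0.1)) -> str:
    """Deterministic split assignment from a stable hash of the ID.
    Unlike Python's builtin hash(), sha256 is identical across runs
    and processes, so assignments are reproducible."""
    h = int(hashlib.sha256(entity_id.encode()).hexdigest(), 16)
    u = (h % 10_000) / 10_000  # pseudo-uniform value in [0, 1)
    if u < ratios[0]:
        return "train"
    if u < ratios[0] + ratios[1]:
        return "val"
    return "test"

splits = [split_assign(f"node_{i}") for i in range(10_000)]
```

With enough entities the realized proportions converge to the configured ratios while remaining fully reproducible.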

Process Mining

  • OCEL 2.0 -- Object-centric event logs in JSON/XML format
  • XES 2.0 -- XML export compatible with ProM, Celonis, Disco, and pm4py
  • 101+ activity types across 12 process families with 65+ object types
  • 10 OCPM generators -- S2C, H2R, MFG, BANK, AUDIT, Bank Recon, Tax, Treasury, Project Accounting, ESG
  • Process variants -- Happy path (75%), exception path (20%), error path (5%)
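Sampling the documented 75/20/5 variant mix reduces to partitioning the unit interval (variant names here mirror the list above; the sampling code itself is an illustrative sketch):

```python
import random

def sample_variant(rng):
    """Draw a process variant with the documented 75/20/5 mix."""
    u = rng.random()
    if u < 0.75:
        return "happy_path"
    if u < 0.95:       # 0.75 + 0.20
        return "exception_path"
    return "error_path"

rng = random.Random(42)
variants = [sample_variant(rng) for _ in range(20_000)]
```

Exception and error paths then branch into extra activities (rework, escalation, cancellation), which is what gives the event logs realistic variant diversity for conformance checking.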

Advanced Generation

Capability Description
LLM enrichment Pluggable LlmProvider trait (mock/OpenAI-compatible) for vendor names, descriptions, and anomaly explanations
Diffusion models Statistical diffusion with Langevin reverse process; linear/cosine/sigmoid schedules; hybrid blending
Causal models Structural causal models with do-calculus interventions and counterfactual abduction-action-prediction
Natural language config Generate YAML configurations from plain English descriptions
Scenario engine Built-in fraud packs: revenue_fraud, payroll_ghost, vendor_kickback, management_override, comprehensive
Counterfactual simulation 8 intervention types with causal DAG propagation and diff analysis

Production Features

  • REST / gRPC / WebSocket APIs with streaming generation and backpressure handling
  • Authentication -- API key (Argon2id), JWT/OIDC (RS256), role-based access control (Admin/Operator/Viewer)
  • Quality gates -- Configurable pass/fail thresholds (strict/default/lenient) with 8 metrics
  • Plugin SDK -- GeneratorPlugin, SinkPlugin, TransformPlugin traits with thread-safe registry
  • Resource guards -- Memory, disk, and CPU monitoring with graceful degradation (Normal to Reduced to Minimal to Emergency)
  • Deterministic generation -- Seeded ChaCha8 RNG for fully reproducible output
  • Streaming output -- Async generation with configurable backpressure (block/drop_oldest/drop_newest/buffer)
  • Data lineage -- Per-file checksums, lineage graph, W3C PROV-JSON export
  • Country packs -- Pluggable JSON country configuration (US/DE/GB built-in) with holidays, names, tax, addresses
  • Observability -- OpenTelemetry traces, Prometheus metrics, structured JSON logging
  • Docker & Kubernetes -- Multi-stage distroless containers, Helm chart with HPA/PDB, Prometheus ServiceMonitor
  • CI/CD -- 7-job GitHub Actions pipeline (fmt, clippy, cross-platform test, MSRV, security, coverage, benchmarks)
  • EU AI Act -- Article 50 synthetic content marking and Article 10 data governance reports
  • Fuzzing -- cargo-fuzz targets for config parsing, fingerprint loading, and validation
  • Panic-free -- #![deny(clippy::unwrap_used)] enforced across all library crates

Ecosystem Integrations

Integration Capability
Apache Airflow DataSynthOperator, DataSynthSensor, DataSynthValidateOperator for DAG orchestration
dbt Source YAML generation, seed export, project scaffolding
MLflow Generation runs as experiments with parameter, metric, and artifact logging
Apache Spark DataFrames with schema inference and temp view registration

Architecture

DataSynth is a Rust workspace with 18 crates:

datasynth-cli              CLI binary (generate, validate, init, info, fingerprint, scenario)
datasynth-server           REST / gRPC / WebSocket server with auth and rate limiting
datasynth-ui               Tauri + SvelteKit desktop application
                  |
datasynth-runtime          Generation orchestrator (parallel execution, resource guards, streaming)
                  |
datasynth-generators       50+ data generators across all process families
datasynth-banking          KYC / AML banking transaction generator
datasynth-ocpm             OCEL 2.0 / XES 2.0 process mining
datasynth-fingerprint      Privacy-preserving fingerprint extraction and synthesis
datasynth-standards        Accounting and audit standards (IFRS, US GAAP, ISA, SOX, PCAOB)
datasynth-audit-fsm        YAML-driven audit FSM engine (10 builtin blueprints)
datasynth-audit-optimizer  Audit path optimization, Monte Carlo, group audit simulation
                  |
datasynth-graph            Graph export (PyTorch Geometric, Neo4j, DGL, RustGraph, Hypergraph)
datasynth-graph-export     Unified graph export pipeline with 78+ entity types
datasynth-eval             Statistical evaluation, quality gates, auto-tuning
                  |
datasynth-config           Configuration schema, validation, industry presets
                  |
datasynth-core             Domain models, traits, distributions, resource guards
                  |
datasynth-output           Output sinks (CSV, JSON, NDJSON, Parquet + Zstd) with streaming
datasynth-test-utils       Test utilities, fixtures, mocks

Installation

From Source

git clone https://github.com/mivertowski/SyntheticData.git
cd SyntheticData
cargo build --release

The binary is available at target/release/datasynth-data.

Requirements

  • Rust toolchain with cargo -- required to build from source

Configuration

DataSynth uses YAML configuration with 30+ top-level sections. Generate a starter config with init:

datasynth-data init --industry retail --complexity medium -o config.yaml

Minimal configuration:

global:
  seed: 42
  industry: manufacturing
  start_date: 2024-01-01
  period_months: 12
  group_currency: USD

companies:
  - code: "1000"
    name: "Headquarters"
    currency: USD
    country: US

transactions:
  target_count: 100000

output:
  format: csv               # csv, json, parquet

Enable specific modules by adding their sections:

# Fraud detection training data
fraud:
  enabled: true
  fraud_rate: 0.005
anomaly_injection:
  enabled: true
  total_rate: 0.02
  generate_labels: true

# Graph export for GNN training
graph_export:
  enabled: true
  formats: [pytorch_geometric, neo4j]

# Statistical realism
distributions:
  enabled: true
  industry_profile: retail
  amounts:
    distribution_type: lognormal
    benford_compliance: true
  correlations:
    enabled: true
    copula_type: gaussian

# Enterprise process chains
document_flows:
  enabled: true
source_to_pay:
  enabled: true
hr:
  enabled: true
manufacturing:
  enabled: true
financial_reporting:
  enabled: true
esg:
  enabled: true

# Accounting standards
accounting_standards:
  enabled: true
  framework: us_gaap         # us_gaap, ifrs, french_gaap, german_gaap, dual_reporting

# Process mining
ocpm:
  enabled: true
  output:
    ocel_json: true
    xes: true

Industry presets (manufacturing, retail, financial_services, healthcare, technology) and complexity levels (small ~100 accounts, medium ~400, large ~2500) provide sensible defaults.

See the Configuration Guide for the complete reference.


Output Structure

DataSynth generates 100+ interconnected output tables organized by domain:

output/
+-- journal_entries.csv             Flat CSV: one row per JE line item
+-- journal_entries.json            Nested JSON: full JE structure
+-- acdoca.csv                      SAP ACDOCA-style universal journal
|
+-- master_data/
|   +-- vendors.json
|   +-- customers.json
|   +-- materials.json
|   +-- fixed_assets.json
|   +-- employees.json              Includes salary, hire date, department
|   +-- cost_centers.json           Cost center hierarchy
|
+-- document_flows/
|   +-- purchase_orders.json
|   +-- goods_receipts.json
|   +-- vendor_invoices.json
|   +-- payments.json
|   +-- customer_receipts.json
|   +-- sales_orders.json
|   +-- deliveries.json
|   +-- customer_invoices.json
|   +-- document_references.json    Cross-doc links (PO->GR->Invoice->Payment)
|
+-- sourcing/                       S2C pipeline
|   +-- spend_analyses, sourcing_projects, rfx_events, supplier_bids,
|       bid_evaluations, procurement_contracts, catalog_items, supplier_scorecards
|
+-- subledger/
|   +-- ap_invoices.json, ar_invoices.json
|   +-- fa_records.json, inventory_positions.json, inventory_movements.json
|   +-- ar_aging.json, ap_aging.json
|   +-- depreciation_runs.json, inventory_valuation.json
|   +-- dunning_runs.json, dunning_letters.json
|
+-- hr/
|   +-- payroll_runs.json, payroll_line_items.json
|   +-- time_entries.json, expense_reports.json, benefit_enrollments.json
|   +-- pension_plans.json, pension_obligations.json, plan_assets.json, pension_disclosures.json
|   +-- stock_grants.json, stock_comp_expense.json
|   +-- employee_change_history.json
|
+-- manufacturing/
|   +-- production_orders.json, quality_inspections.json, cycle_counts.json,
|       bom_components.json, inventory_movements.json
|
+-- financial_reporting/
|   +-- financial_statements.json   All standalone statements combined
|   +-- bank_reconciliations.json
|   +-- notes_to_financial_statements.json
|   +-- standalone/                 Per-entity: {entity_code}_financial_statements.json
|   +-- consolidated/
|   |   +-- consolidated_financial_statements.json
|   |   +-- consolidation_schedule.json
|   +-- segment_reporting/
|       +-- segment_reports.json
|       +-- segment_reconciliations.json
|
+-- period_close/
|   +-- trial_balances.json
|
+-- balance/
|   +-- opening_balances.json
|   +-- subledger_reconciliation.json
|
+-- intercompany/
|   +-- group_structure.json
|   +-- ic_matched_pairs.json
|   +-- ic_seller_journal_entries.json
|   +-- ic_buyer_journal_entries.json
|   +-- ic_elimination_entries.json
|   +-- nci_measurements.json
|
+-- accounting_standards/
|   +-- customer_contracts.json, impairment_tests.json
|   +-- business_combinations.json, business_combination_journal_entries.json
|   +-- ecl_models.json, ecl_provision_movements.json, ecl_journal_entries.json
|   +-- provisions.json, provision_movements.json, contingent_liabilities.json
|   +-- fx/currency_translation_results.json
|
+-- tax/
|   +-- tax_jurisdictions.json, tax_codes.json, tax_provisions.json
|   +-- tax_lines.json, tax_returns.json, withholding_records.json
|   +-- temporary_differences.json, etr_reconciliation.json,
|       deferred_tax_rollforward.json, deferred_tax_journal_entries.json
|
+-- treasury/
|   +-- cash_positions.json, cash_forecasts.json, cash_pools.json,
|       debt_instruments.json, hedging_instruments.json, hedge_relationships.json,
|       bank_guarantees.json, netting_runs.json
|
+-- project_accounting/
|   +-- projects.json, cost_lines.json, revenue_records.json,
|       earned_value_metrics.json, change_orders.json, milestones.json
|
+-- esg/
|   +-- emission_records.json, energy_consumption.json, water_usage.json, ...
|
+-- internal_controls/              CSV files for BI/analytics
|   +-- internal_controls.csv
|   +-- control_account_mappings.csv, control_process_mappings.csv
|   +-- control_threshold_mappings.csv, control_doctype_mappings.csv
|   +-- sod_conflict_pairs.csv, sod_rules.csv
|   +-- coso_control_mapping.csv
|   +-- internal_controls.json, sod_violations.json
|
+-- audit/                          33+ audit files
|   +-- audit_engagements.json, audit_workpapers.json, audit_evidence.json
|   +-- audit_risk_assessments.json, audit_findings.json, audit_judgments.json
|   +-- audit_confirmations.json, audit_procedure_steps.json, audit_samples.json
|   +-- engagement_letters.json (ISA 210)
|   +-- combined_risk_assessments.json (ISA 315)
|   +-- significant_transaction_classes.json (ISA 315 SCOTS)
|   +-- materiality_calculations.json (ISA 320)
|   +-- service_organizations.json, soc_reports.json, user_entity_controls.json (ISA 402)
|   +-- unusual_items.json, analytical_relationships.json (ISA 520)
|   +-- sampling_plans.json, sampled_items.json (ISA 530)
|   +-- accounting_estimates.json (ISA 540)
|   +-- subsequent_events.json (ISA 560)
|   +-- going_concern_assessments.json (ISA 570)
|   +-- component_auditors.json, group_audit_plan.json,
|   |   component_instructions.json, component_reports.json (ISA 600)
|   +-- audit_opinions.json, key_audit_matters.json (ISA 700/701)
|   +-- sox_302_certifications.json, sox_404_assessments.json
|   +-- isa_mappings.json, isa_pcaob_mappings.json
|
+-- banking/
|   +-- banking_customers.json, banking_accounts.json, banking_transactions.json,
|       aml_transaction_labels.json, aml_customer_labels.json, aml_narratives.json
|
+-- sales_kpi_budgets/
|   +-- sales_quotes.json, management_kpis.json, budgets.json
|
+-- process_mining/                 OCEL 2.0 JSON, XES 2.0, process variants
+-- graphs/                         PyTorch Geometric, Neo4j CSV+Cypher, DGL, RustGraph
+-- labels/                         anomaly_labels, fraud_labels, quality_labels
+-- standards/                      Compliance standards, cross-references, filings
+-- events/                         process_evolution_events, organizational_events

Python SDK

cd python && pip install -e ".[all]"

from datasynth_py import DataSynth
from datasynth_py import to_pandas, to_polars, list_tables
from datasynth_py.config import blueprints

# Generate with a preset blueprint
config = blueprints.retail_small(companies=4, transactions=10000)
result = DataSynth().generate(config=config, output={"format": "csv", "sink": "temp_dir"})

# Load as DataFrames
tables = list_tables(result)                  # ['journal_entries', 'vendors', ...]
df = to_pandas(result, "journal_entries")
pl_df = to_polars(result, "vendors")

# Async generation
from datasynth_py import AsyncDataSynth
async with AsyncDataSynth() as synth:
    result = await synth.generate(config=config)

# Fingerprint operations
synth = DataSynth()
synth.fingerprint.extract("./real_data/", "./fingerprint.dsf", privacy_level="standard")
report = synth.fingerprint.evaluate("./fingerprint.dsf", "./synthetic/")

Available blueprints: retail_small(), banking_medium(), manufacturing_large(), ml_training(), statistical_validation(), with_distributions(), with_llm_enrichment(), with_diffusion(), with_causal()

Optional dependencies: [pandas], [polars], [jupyter], [streaming], [airflow], [dbt], [mlflow], [spark], [all]


Server & Deployment

# Start REST + gRPC server
cargo run -p datasynth-server -- --rest-port 3000 --grpc-port 50051

# With authentication
cargo run -p datasynth-server -- --api-keys "key1,key2"

# With JWT/OIDC (Keycloak, Auth0, Entra ID)
cargo run -p datasynth-server --features jwt -- \
  --jwt-issuer "https://auth.example.com" \
  --jwt-audience "datasynth-api"

API endpoints:

curl http://localhost:3000/health
curl http://localhost:3000/ready
curl http://localhost:3000/metrics
curl -H "Authorization: Bearer <key>" -X POST http://localhost:3000/api/stream/start

WebSocket streaming: ws://localhost:3000/ws/events

Docker:

docker build -t datasynth:latest .
docker run -p 3000:3000 -p 50051:50051 datasynth:latest

# Full stack with Prometheus + Grafana
docker compose up -d

See the Deployment Guide for Docker, Kubernetes Helm chart, systemd, and reverse proxy configuration.


Desktop UI

cd crates/datasynth-ui
npm install
npm run tauri dev

Cross-platform Tauri + SvelteKit application with 40+ configuration pages, real-time streaming visualization, and preset management.


Privacy-Preserving Fingerprinting

Extract statistical fingerprints from real data with formal privacy guarantees, then generate matching synthetic data:

# Extract with differential privacy
datasynth-data fingerprint extract --input ./real_data.csv --output ./fp.dsf --privacy-level standard

# Validate and evaluate
datasynth-data fingerprint validate ./fp.dsf
datasynth-data fingerprint evaluate --fingerprint ./fp.dsf --synthetic ./synthetic/

Privacy Level Epsilon k-Anonymity Description
minimal 5.0 3 Higher utility, lower privacy
standard 1.0 5 Balanced (default)
high 0.5 10 Higher privacy
maximum 0.1 20 Maximum privacy

Includes Renyi DP and zCDP composition accounting, privacy budget management, federated fingerprinting for distributed data, membership inference attack testing, and cryptographic synthetic data certificates (HMAC-SHA256).
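The epsilon values in the table control the scale of the noise a differentially private mechanism adds: for the Laplace mechanism, noise scale is sensitivity / epsilon, so the "maximum" level (epsilon = 0.1) adds ten times more noise than "standard" (epsilon = 1.0). A generic sketch of this trade-off (not DataSynth's fingerprint internals):

```python
import math
import random

def laplace_noise(sensitivity: float, epsilon: float, rng) -> float:
    """Laplace-mechanism noise with scale b = sensitivity / epsilon.
    Smaller epsilon -> larger b -> more noise -> stronger privacy."""
    b = sensitivity / epsilon
    u = rng.random() - 0.5
    u = max(min(u, 0.4999999), -0.4999999)  # keep the log() argument positive
    sign = 1.0 if u >= 0 else -1.0
    return -b * sign * math.log(1.0 - 2.0 * abs(u))  # inverse-CDF sampling

def private_mean(values, lower, upper, epsilon, rng):
    """Differentially private mean of values clipped to [lower, upper]."""
    clipped = [min(max(v, lower), upper) for v in values]
    true_mean = sum(clipped) / len(clipped)
    sensitivity = (upper - lower) / len(clipped)  # L1 sensitivity of a bounded mean
    return true_mean + laplace_noise(sensitivity, epsilon, rng)

rng = random.Random(0)
amounts = [float(i) for i in range(1, 101)]  # true mean = 50.5
noisy = private_mean(amounts, 0.0, 100.0, epsilon=10.0, rng=rng)
```

Composition accounting (Renyi DP, zCDP) then tracks how these per-query epsilons accumulate across the many statistics a fingerprint extracts.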


Use Cases

Domain Application
Fraud Detection Train supervised models with ACFE-aligned labeled fraud patterns and collusion networks
Graph Neural Networks Entity relationship graphs with typed edges for anomaly detection
AML / KYC Testing Banking transactions with structuring, layering, and mule typologies
Audit Analytics Validate audit procedures with known control exceptions and ISA/PCAOB mappings
Process Mining OCEL 2.0 and XES 2.0 event logs for process discovery and conformance checking
ERP Load Testing Realistic transaction volumes with proper document chains
SOX Compliance Internal control monitoring with COSO 2013 mappings and deficiency classification
Causal ML Research Interventional and counterfactual datasets with causal DAG propagation
Data Quality ML Train models to detect missing values, format variations, typos, and duplicates
ESG Reporting GHG emissions, diversity metrics, and GRI/SASB/TCFD disclosure data
Tax Compliance Multi-jurisdiction tax returns, provisions, and withholding records
Treasury Operations Cash positioning, hedging effectiveness, and debt covenant monitoring

Performance

Metric Value
Single-threaded throughput 200,000+ journal entries/second
Parallel scaling Linear with available CPU cores
Memory model Streaming generation with configurable backpressure
Determinism Fully reproducible via seeded ChaCha8 RNG

Documentation


License

Copyright 2024-2026 Michael Ivertowski

Licensed under the Apache License, Version 2.0. See LICENSE for details.


Support

Commercial support, custom development, and enterprise licensing are available. Open an issue on GitHub.


DataSynth is provided "as is" without warranty of any kind. It is intended for testing, development, and research purposes. Generated data should not be used as a substitute for real financial records.
