Implementing an Advanced Solver for Cross-Column Dependency Resolution #9

@Sauvikn98

The current data generation pipeline operates on a column-isolated execution model. Each column is evaluated independently using predefined generators or inferred semantics. While this model ensures high throughput and simplicity, it breaks down when inter-column relationships exist.

In real-world schemas, columns rarely exist in isolation. Temporal consistency, numeric bounds, and logical dependencies are extremely common. For example:

  • created_at must always be less than or equal to updated_at
  • start_date must be less than end_date
  • discount_price must be less than original_price
  • quantity_available must not exceed total_quantity

Without enforcing these constraints, generated datasets become unrealistic and often invalid for downstream testing.

Deeper Architectural Requirements

1. Dependency Graph Construction

Before row generation begins, the engine must construct a directed graph representing column dependencies. The graph must be acyclic for a valid evaluation order to exist.

  • Nodes represent columns
  • Edges represent dependency relationships
  • Edge direction indicates evaluation order: an edge from A to B means A is generated before B

This graph must be built dynamically using:

  • Explicit constraints defined in schema metadata
  • Inferred constraints derived from semantic analysis
  • Built-in rules for common field pairs
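
As a minimal sketch of the explicit-constraint case only, the graph could be held as an adjacency list keyed by column name. The constraint shape `{ column, op, dependsOn }` and the function name `buildDependencyGraph` are assumptions made for illustration; they are not part of the existing pipeline.

```javascript
// Hypothetical constraint format: `column` must satisfy `op` relative to
// `dependsOn`, so `dependsOn` has to be generated first.
function buildDependencyGraph(columns, constraints) {
  // Adjacency list: upstream column -> columns generated after it.
  const edges = new Map(columns.map((name) => [name, []]));
  for (const { column, dependsOn } of constraints) {
    if (!edges.has(column) || !edges.has(dependsOn)) {
      throw new Error(
        `Constraint references an unknown column: ${dependsOn} -> ${column}`
      );
    }
    // Skip duplicate edges so repeated constraints do not inflate the graph.
    if (!edges.get(dependsOn).includes(column)) {
      edges.get(dependsOn).push(column);
    }
  }
  return edges;
}

const graph = buildDependencyGraph(
  ["created_at", "updated_at", "original_price", "discount_price"],
  [
    { column: "updated_at", op: ">=", dependsOn: "created_at" },
    { column: "discount_price", op: "<", dependsOn: "original_price" },
  ]
);
```

Inferred and built-in rules could feed the same function by emitting constraints in the same shape, which keeps graph construction independent of where a rule came from.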

Topological sorting will be required to determine execution order. If cycles are detected, the system must do one of the following:

  • Break the cycle using heuristic prioritization
  • Defer to a constraint relaxation strategy
  • Fail fast with a descriptive error
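
The ordering step and the fail-fast option could look roughly like the sketch below, which uses Kahn's algorithm. It assumes the graph is a `Map` from each column to the columns that depend on it, with every column present as a key; `topologicalOrder` is a hypothetical name. Heuristic cycle breaking and constraint relaxation are not shown.

```javascript
// Kahn's algorithm: repeatedly emit columns whose dependencies are all resolved.
function topologicalOrder(edges) {
  const nodes = [...edges.keys()];
  const inDegree = new Map(nodes.map((node) => [node, 0]));
  for (const children of edges.values()) {
    for (const child of children) inDegree.set(child, inDegree.get(child) + 1);
  }

  const ready = nodes.filter((node) => inDegree.get(node) === 0);
  const order = [];
  while (ready.length > 0) {
    const node = ready.shift();
    order.push(node);
    for (const child of edges.get(node)) {
      inDegree.set(child, inDegree.get(child) - 1);
      if (inDegree.get(child) === 0) ready.push(child);
    }
  }

  // Columns never emitted are on a cycle or downstream of one.
  if (order.length !== nodes.length) {
    const unresolved = nodes.filter((node) => !order.includes(node));
    throw new Error(`Cyclic dependency; unresolved columns: ${unresolved.join(", ")}`);
  }
  return order;
}

const exampleOrder = topologicalOrder(
  new Map([
    ["created_at", ["updated_at"]],
    ["updated_at", []],
    ["original_price", ["discount_price"]],
    ["discount_price", []],
  ])
);
```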

2. Context-Aware Generation

Each column generator must accept a context object containing already generated values from upstream dependencies.

Example:

const updated_at = generateDate({
  min: context.created_at, // never earlier than the upstream value
});

This implies a shift from stateless generators to stateful evaluators. Generators must now:

  • Validate incoming constraints
  • Adjust generation ranges dynamically
  • Propagate constraints further downstream
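
A minimal sketch of how a row could be produced once an evaluation order is known. The generator signature `(context, random)`, the names `rowGenerators` and `generateRow`, and the date ranges are all assumptions chosen for illustration.

```javascript
const DAY_MS = 24 * 60 * 60 * 1000;

function randomBetween(min, max, random) {
  return min + random() * (max - min);
}

// Hypothetical generators: each one sees the values already generated for the row.
const rowGenerators = {
  created_at: (context, random) =>
    new Date(randomBetween(Date.UTC(2020, 0, 1), Date.UTC(2024, 0, 1), random)),
  updated_at: (context, random) => {
    // Validate the incoming context before narrowing the range.
    if (!(context.created_at instanceof Date)) {
      throw new Error("updated_at requires created_at in the context");
    }
    const min = context.created_at.getTime();
    return new Date(randomBetween(min, min + 30 * DAY_MS, random));
  },
};

// Evaluate columns in dependency order, accumulating values in one context object.
function generateRow(order, generators, random = Math.random) {
  const context = {};
  for (const column of order) {
    context[column] = generators[column](context, random);
  }
  return context;
}

const sampleRow = generateRow(["created_at", "updated_at"], rowGenerators);
```

Because the context doubles as the finished row, no extra copy is needed per column. A generator called before its upstream column has run fails immediately instead of silently producing an unconstrained value.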

3. Constraint Solving Strategies

A naive sequential approach will not be sufficient for complex constraint systems. The engine should support:

  • Deterministic resolution for simple constraints
  • Backtracking when constraints fail
  • Probabilistic sampling when multiple valid solutions exist

For highly constrained schemas, a lightweight constraint solver may be required, potentially inspired by SAT or CSP techniques.
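
To make the backtracking idea concrete, here is a small CSP-style sketch over finite candidate domains. The names `solveWithBacktracking` and `shuffled`, and the constraint shape `{ columns, check }`, are hypothetical. A real solver would likely need range pruning rather than enumerated domains.

```javascript
// Fisher-Yates shuffle on a copy, so candidate values are tried in random order.
function shuffled(values, random) {
  const copy = values.slice();
  for (let i = copy.length - 1; i > 0; i--) {
    const j = Math.floor(random() * (i + 1));
    [copy[i], copy[j]] = [copy[j], copy[i]];
  }
  return copy;
}

// Assign columns one at a time; undo an assignment when no value can be extended.
function solveWithBacktracking(columns, domains, constraints, random = Math.random) {
  const assignment = {};

  // A constraint is only checked once all of its columns have values.
  const consistent = () =>
    constraints.every(
      (c) => c.columns.some((name) => !(name in assignment)) || c.check(assignment)
    );

  function assign(index) {
    if (index === columns.length) return true;
    const column = columns[index];
    for (const value of shuffled(domains[column], random)) {
      assignment[column] = value;
      if (consistent() && assign(index + 1)) return true;
    }
    delete assignment[column]; // backtrack: the caller tries its next value
    return false;
  }

  return assign(0) ? assignment : null;
}

const stockSolution = solveWithBacktracking(
  ["total_quantity", "quantity_available"],
  { total_quantity: [0, 5, 10], quantity_available: [0, 5, 10, 20] },
  [
    {
      columns: ["quantity_available", "total_quantity"],
      check: (row) => row.quantity_available <= row.total_quantity,
    },
  ]
);
```

Shuffling the candidates gives probabilistic sampling among valid solutions, while the recursion covers the backtracking case. A `null` result signals that the constraints cannot be satisfied with the given domains.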

4. Conflict Detection and Recovery

Conflicts may arise due to:

  • Circular dependencies
  • Over-constrained fields
  • Contradictory rules

The system should implement:

  • Early validation during graph construction
  • Runtime detection of invalid states
  • Recovery strategies such as value regeneration or constraint relaxation
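
Runtime detection and the regeneration strategy could be sketched as follows. The names `violatedConstraints` and `generateValidRow`, the `{ name, check }` rule shape, and the default retry limit are assumptions. Early validation and constraint relaxation are not covered here.

```javascript
// Return the names of all rules the row breaks (empty array means valid).
function violatedConstraints(row, constraints) {
  return constraints.filter((c) => !c.check(row)).map((c) => c.name);
}

// Regenerate the row until it is valid, giving up after a bounded number of tries.
function generateValidRow(generate, constraints, maxAttempts = 50) {
  let violations = [];
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const row = generate();
    violations = violatedConstraints(row, constraints);
    if (violations.length === 0) return row;
  }
  throw new Error(
    `No valid row after ${maxAttempts} attempts; last violations: ${violations.join(", ")}`
  );
}

const priceRules = [
  {
    name: "discount_price < original_price",
    check: (row) => row.discount_price < row.original_price,
  },
];

const pricedRow = generateValidRow(
  () => ({ original_price: 100 * Math.random(), discount_price: 100 * Math.random() }),
  priceRules
);
```

Bounding the retries matters for contradictory rules, which would otherwise loop forever. The error names the violated rules so the offending constraint is easy to find.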

Implementation Considerations

  • Performance overhead of graph construction per row versus per batch
  • Memory cost of passing context objects
  • Need for deterministic reproducibility in seeded runs
  • Extensibility for user-defined constraints
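
For the reproducibility point, one option is to pass a seeded random function to every generator instead of calling `Math.random` directly. The sketch below uses the small 32-bit generator commonly known as mulberry32. Choosing this particular generator, and the helper name `seededInt`, are assumptions for illustration, not a requirement of the design.

```javascript
// mulberry32: a compact seeded PRNG returning floats in [0, 1).
function mulberry32(seed) {
  let state = seed >>> 0;
  return function () {
    state = (state + 0x6d2b79f5) >>> 0;
    let t = state;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Hypothetical helper: integer in [min, max] drawn from the supplied generator.
function seededInt(random, min, max) {
  return min + Math.floor(random() * (max - min + 1));
}
```

Two runs that create the generator from the same seed and evaluate columns in the same topological order draw the same sequence of values. Retry and backtracking logic would need to draw from the same generator to stay reproducible.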
