Upgrading the FieldInferenceEngine with semantic regex heuristics #10

@Sauvikn98

Description

The current inference engine relies on direct string comparisons or simple token matching. This approach is brittle and fails when developers use unconventional naming patterns.

Examples of failure cases:

  • usr_mail instead of email
  • ph_no instead of phone_number
  • dt_created instead of created_at

Fields like these fall through to fallback generation, which reduces the realism and usefulness of the generated datasets.

Deeper Architectural Requirements

1. Regex-Driven Semantic Matching

Introduce a regex-based matching system that captures semantic intent rather than exact naming.

Examples:

  • Phone detection:

/(phone|mobile|cell|contact_no|tel)/i

  • Email detection:

/(email|e_mail|mail_id)/i

  • Temporal fields:

/(created|updated|timestamp|date|dt)/i

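This allows broader coverage across diverse schemas. The patterns above are illustrative: as written they match dt_created but not usr_mail or ph_no, so the final set would also need abbreviated tokens such as mail and ph.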
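A minimal sketch of how such patterns might be applied, written in Python purely for illustration (the issue does not name an implementation language). `SEMANTIC_PATTERNS` and `match_categories` are hypothetical names, and the token-anchored `mail` and `ph` alternatives are assumptions added so that the failure cases from the description are covered.

```python
import re

# Patterns from the examples above, plus two assumed abbreviations
# ("ph" and "mail") that only match as whole underscore-delimited tokens.
SEMANTIC_PATTERNS = {
    "phone": re.compile(
        r"(phone|mobile|cell|contact_no|tel|(?:^|_)ph(?:_|$))", re.IGNORECASE
    ),
    "email": re.compile(
        r"(email|e_mail|mail_id|(?:^|_)mail(?:_|$))", re.IGNORECASE
    ),
    "temporal": re.compile(r"(created|updated|timestamp|date|dt)", re.IGNORECASE),
}


def match_categories(field_name):
    """Return every category whose pattern occurs anywhere in the field name."""
    return [
        category
        for category, pattern in SEMANTIC_PATTERNS.items()
        if pattern.search(field_name)
    ]


print(match_categories("usr_mail"))    # ['email']
print(match_categories("ph_no"))       # ['phone']
print(match_categories("dt_created"))  # ['temporal']
```

Unanchored alternatives such as tel also match inside unrelated words (hotel_name would be tagged as phone), which is one reason to prefer token-anchored forms for short abbreviations.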

2. Weighted Scoring Model

Each regex match should contribute a confidence score.

  • Exact matches yield high confidence
  • Partial or ambiguous matches yield lower confidence

Example scoring:

  • exact email match = 1.0
  • partial mail match = 0.6
  • generic text match = 0.2

The final semantic classification is determined by aggregating scores across all matching patterns.
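The scoring step could be sketched roughly as follows. The issue does not say how scores are aggregated, so this sketch assumes the per-category maximum (summing would be another option). `RULES`, `score_field`, and `classify` are hypothetical names, and the catch-all "text" rule is an assumption standing in for the generic text match.

```python
import re

# Hypothetical rules as (category, pattern, weight); the weights mirror
# the example scoring above.
RULES = [
    ("email", re.compile(r"^e_?mail$", re.IGNORECASE), 1.0),  # exact match
    ("email", re.compile(r"mail", re.IGNORECASE), 0.6),       # partial match
    ("text", re.compile(r".+"), 0.2),                         # generic catch-all
]


def score_field(field_name):
    """Aggregate rule weights per category, keeping the highest weight seen."""
    scores = {}
    for category, pattern, weight in RULES:
        if pattern.search(field_name):
            scores[category] = max(scores.get(category, 0.0), weight)
    return scores


def classify(field_name):
    """Return (category, score) for the best category, or ('unknown', 0.0)."""
    scores = score_field(field_name)
    if not scores:
        return ("unknown", 0.0)
    # Highest score wins; ties are broken alphabetically so runs are repeatable.
    return min(scores.items(), key=lambda item: (-item[1], item[0]))


print(classify("email"))     # ('email', 1.0)
print(classify("usr_mail"))  # ('email', 0.6)
print(classify("title"))     # ('text', 0.2)
```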

3. Multi-Label Classification

Some fields may belong to multiple semantic categories. For example:

  • billing_email may match both financial and contact categories

The engine should support:

  • Primary classification based on highest score
  • Secondary tags for contextual enrichment
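One possible shape for the result, assuming per-category scores have already been computed. `Classification`, `rank_labels`, and the `min_secondary` threshold are hypothetical, and the scores for billing_email are made-up illustration values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Classification:
    primary: str
    secondary: tuple


def rank_labels(scores, min_secondary=0.3):
    """Pick the top-scoring category as primary and keep strong runners-up as tags."""
    # Sort by descending score, then by name, so ties resolve deterministically.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    if not ranked:
        return Classification("unknown", ())
    primary = ranked[0][0]
    secondary = tuple(
        category for category, score in ranked[1:] if score >= min_secondary
    )
    return Classification(primary, secondary)


# Illustrative scores for billing_email (values are assumptions).
result = rank_labels({"contact": 1.0, "financial": 0.6, "text": 0.2})
print(result.primary)    # contact
print(result.secondary)  # ('financial',)
```

The threshold keeps weak generic matches out of the secondary tags; whether a cutoff is wanted at all is a design choice the issue leaves open.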

4. Externalized Heuristic Dictionary

All regex patterns and scoring weights should be stored in a configuration layer.

Benefits:

  • Enables community contributions without touching core logic
  • Allows domain-specific extensions
  • Simplifies testing and iteration

The dictionary structure may include:

{
  "email": {
    "patterns": [...],
    "weight": ...
  }
}
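A sketch of how a dictionary in this shape might be loaded and compiled, assuming it is stored as JSON. The concrete pattern lists and weights below are placeholder values chosen for illustration, not the project's actual configuration, and `load_heuristics` is a hypothetical helper.

```python
import json
import re

# Placeholder configuration; the real patterns and weights are not
# specified in the issue.
CONFIG_JSON = """
{
  "email": {"patterns": ["email", "e_mail", "mail_id"], "weight": 1.0},
  "phone": {"patterns": ["phone", "mobile", "cell", "contact_no", "tel"], "weight": 1.0}
}
"""


def load_heuristics(raw):
    """Parse the JSON dictionary and precompile each category's patterns."""
    config = json.loads(raw)
    compiled = {}
    for category, entry in config.items():
        patterns = [re.compile(p, re.IGNORECASE) for p in entry["patterns"]]
        compiled[category] = (patterns, float(entry["weight"]))
    return compiled


heuristics = load_heuristics(CONFIG_JSON)
patterns, weight = heuristics["email"]
print(weight)                                      # 1.0
print(any(p.search("Mail_ID") for p in patterns))  # True
```

Compiling at load time means an invalid contributed pattern fails immediately with `re.error` rather than during inference.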

5. Continuous Learning Loop

Optionally, the system can log fallback cases and allow developers to:

  • Add new regex patterns
  • Adjust weights based on observed failures
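A very small sketch of the logging half of this loop, assuming fallback cases are simply counted in memory. `FallbackLog` is a hypothetical class; a real implementation would likely persist the counts somewhere.

```python
from collections import Counter


class FallbackLog:
    """Counts field names that no pattern matched, so gaps can be reviewed."""

    def __init__(self):
        self._counts = Counter()

    def record(self, field_name):
        # Normalize case so USR_MAIL and usr_mail are counted together.
        self._counts[field_name.lower()] += 1

    def most_common(self, n=10):
        return self._counts.most_common(n)


log = FallbackLog()
for name in ["usr_mail", "ph_no", "USR_MAIL"]:
    log.record(name)

print(log.most_common(1))  # [('usr_mail', 2)]
```

Reviewing the most frequent fallbacks first gives an ordering for which patterns or weights to adjust.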

Implementation Considerations

  • Regex performance impact on large schemas
  • Avoiding overfitting or overly broad matches
  • Balancing precision versus recall
  • Ensuring deterministic inference for repeatable runs
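Two of these points, performance and determinism, can be partly addressed by compiling patterns once, evaluating them in a fixed order, and caching results per field name. The sketch below assumes a first-match-wins policy for brevity; `infer` and `_PATTERNS` are hypothetical names, and the patterns are the examples from section 1.

```python
import re
from functools import lru_cache

# Compiled once at import time; the tuple fixes the evaluation order.
_PATTERNS = (
    ("email", re.compile(r"(email|e_mail|mail_id)", re.IGNORECASE)),
    ("phone", re.compile(r"(phone|mobile|cell|contact_no|tel)", re.IGNORECASE)),
    ("temporal", re.compile(r"(created|updated|timestamp|date|dt)", re.IGNORECASE)),
)


@lru_cache(maxsize=None)
def infer(field_name):
    """Return the first matching category, or 'fallback' if nothing matches."""
    for category, pattern in _PATTERNS:
        if pattern.search(field_name):
            return category
    return "fallback"


print(infer("created_at"))  # temporal
print(infer("created_at"))  # temporal (served from the cache)
print(infer("first_name"))  # fallback
```

Because the function depends only on its argument and a fixed pattern order, repeated runs over the same schema give the same result, and large schemas with repeated column names pay the regex cost once per distinct name.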
