A cozy, comfortable matching library that makes entity resolution and deduplication feel natural.
matcher is a lightweight library for matching records across data sources (entity resolution) and identifying duplicate records within a single source (deduplication). Built on Polars for optimal performance, it provides a cozy, notebook-friendly API that makes matching feel natural and comfortable.
matcher is built on Polars for optimal matching performance. This provides:
- Efficient columnar operations for large datasets
- Automatic parallelization of joins
- Clean, intuitive API that feels natural
- Zero-copy operations where possible
We chose Polars because it provides the best balance of performance, developer experience, and reliability for matching workflows.
Key Design Principles:
- **Comfort Over Complexity**: APIs should feel natural and intuitive
- **Flow Over Force**: Matching should work smoothly between data sources
- **Reliability Over Speed**: Prefer robust, predictable behavior
- **Clarity Over Cleverness**: Simple, clear code over complex optimizations
- **Progress Over Perfection**: Ship working solutions that solve real problems
```bash
uv add matcher
```

Or with pip:

```bash
pip install matcher
```

The examples below use paths from the project's generated test data. Create them with `uv run python scripts/generate_test_data.py`, or use your own Parquet/CSV paths.
```python
import polars as pl
from matcher import Matcher, Deduplicator, FuzzyMatcher

# Load data (you load DataFrames, matcher operates on them)
left_df = pl.read_parquet("data/ExactMatcher/entity_resolution/customers_a.parquet")
right_df = pl.read_parquet("data/ExactMatcher/entity_resolution/customers_b.parquet")

# Entity resolution (using default components)
matcher = Matcher(left=left_df, right=right_df, left_id="id", right_id="id")
results = matcher.match(on="email")
print(f"Found {results.count} matches")
results.matches.head(10)

# Deduplication (single source)
df = pl.read_parquet("data/ExactMatcher/deduplication/customers.parquet")
deduplicator = Deduplicator(source=df, id_col="id")
results = deduplicator.match(on="email")
print(f"Found {results.count} duplicate pairs")

# Multiple rules: cascading (email first, then name for unmatched left rows)
results = matcher.match(on="email").refine(on=["first_name", "last_name"])

# Fuzzy matching (typo-tolerant, single field) and optional blocking
# results = matcher.match(on=["name"], matching_algorithm=FuzzyMatcher(threshold=0.85))
# results = matcher.match(on="email", blocking_key="zip_code")
```

Exact matching uses Polars inner joins. Rows where any join key (e.g. `email`, `first_name`) is null are excluded from matches, including null-to-null. Fill or drop nulls in your match columns beforehand if you need different behavior.
matcher is built for experimentation and comparison. The ability to swap components, test different approaches, and measure results is foundational to matcher's design. This enables data-driven decisions about which matching strategies work best for your specific use case.
Like a cozy experiment, you can try different approaches, see what feels right, and choose what works best for your data.
The component-based architecture (similar to scikit-learn) enables you to:
- Compare approaches: Swap matching algorithms or evaluators to test alternatives
- Measure impact: Use built-in evaluation to quantify which approach performs better
- Make informed decisions: Choose components based on actual results, not assumptions
- Iterate quickly: Test new ideas without rewriting core logic
```python
from matcher import Matcher
import polars as pl

# Load data (generate first: uv run python scripts/generate_test_data.py)
left_df = pl.read_parquet("data/ExactMatcher/entity_resolution/customers_a.parquet")
right_df = pl.read_parquet("data/ExactMatcher/entity_resolution/customers_b.parquet")

# Ground truth: known pairs as DataFrame with left_id, right_id
ground_truth = pl.DataFrame({
    "left_id": ["left_1", "left_2", "left_3"],
    "right_id": ["right_1", "right_2", "right_3"],
})  # or pl.read_parquet("your_ground_truth.parquet")

matcher = Matcher(left=left_df, right=right_df, left_id="id", right_id="id")

# Test email-only matching
results_email = matcher.match(on="email")
metrics_email = results_email.evaluate(ground_truth)

# Test name-only matching
results_name = matcher.match(on=["first_name", "last_name"])
metrics_name = results_name.evaluate(ground_truth)

# Compare results
print(f"Email rule: Precision={metrics_email['precision']:.2%}, Recall={metrics_email['recall']:.2%}")
print(f"Name rule: Precision={metrics_name['precision']:.2%}, Recall={metrics_name['recall']:.2%}")

# You can also swap matching_algorithm (e.g. custom case-insensitive matcher) and compare
```

matcher uses a component-based architecture (similar to scikit-learn), allowing you to customize matching algorithms:
```python
from matcher import Matcher, MatchingAlgorithm
import polars as pl

# Define a custom matching algorithm
class MyCustomMatcher(MatchingAlgorithm):
    def match(self, left, right, rule):
        # Your custom matching logic
        # rule is a list of fields (e.g., ["email"] or ["first_name", "last_name"])
        # Return a DataFrame with matches
        pass

# Load data
left_df = pl.read_parquet("data/ExactMatcher/entity_resolution/customers_a.parquet")
right_df = pl.read_parquet("data/ExactMatcher/entity_resolution/customers_b.parquet")

# Use custom algorithm
matcher = Matcher(
    left=left_df,
    right=right_df,
    left_id="id",
    right_id="id",
    matching_algorithm=MyCustomMatcher(),
)
results = matcher.match(on="email")
```

Additional rules run only on left-side records that didn't match yet (cascading). Chain `match(on=...)` then `refine(on=...)`, with an optional `blocking_key` per step:
```python
results = (
    matcher
    .match(on="email")
    .refine(on=["first_name", "last_name"])
    .refine(on=["address"], blocking_key="zip_code")
)
```

See the test suite (tests/) for more examples of custom algorithms and usage.
matcher includes built-in evaluation so you can measure matching performance and improve over time. Ground truth should be provided as a Polars DataFrame with `left_id` and `right_id` columns listing known true pairs. If your labels are stored on disk (e.g. Parquet or CSV), load them first with `pl.read_parquet` or `pl.read_csv` before calling `evaluate()`.
```python
from matcher import Matcher
import polars as pl

# Load data
left_df = pl.read_parquet("data/ExactMatcher/entity_resolution/customers_a.parquet")
right_df = pl.read_parquet("data/ExactMatcher/entity_resolution/customers_b.parquet")

matcher = Matcher(left=left_df, right=right_df, left_id="id", right_id="id")

# Run matching
results = matcher.match(on="email")

# Evaluate against ground truth (DataFrame with left_id, right_id columns)
ground_truth = pl.DataFrame({
    "left_id": ["left_1", "left_2"],
    "right_id": ["right_1", "right_2"],
})
metrics = results.evaluate(ground_truth)
print(f"Precision: {metrics['precision']:.2%}")
print(f"Recall: {metrics['recall']:.2%}")
print(f"F1 Score: {metrics['f1']:.2%}")
```

Use `evaluate` so you can improve: get ground truth, run a match, evaluate, change something, re-run, and compare metrics until the result is good enough.
- **Get ground truth** — Known pairs (e.g. from a human-reviewed sample or existing labels) as a DataFrame with `left_id` and `right_id`. Load from CSV or Parquet if needed: `ground_truth = pl.read_csv("reviewed.csv")`.
- **Run your matcher** — e.g. `results = matcher.match(on="email")` or `matcher.match(on=["name"], matching_algorithm=FuzzyMatcher(threshold=0.85))`.
- **Evaluate** — `metrics = results.evaluate(ground_truth)`. For deduplication, or when the left and right id columns share the same name (e.g. both `id`), pass `right_id_col="id_right"` so the evaluator can correctly resolve right-side ids.
- **Change something** — Adjust rules, threshold, or `blocking_key`.
- **Re-run and compare** — Run again, call `evaluate(ground_truth)`, and compare precision/recall/F1 to the previous run.
- **Repeat** — Iterate until quality is good enough.
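The metrics in that loop are the standard pairwise precision, recall, and F1 computed over (left_id, right_id) pairs. A conceptual sketch with toy pairs (this illustrates the definitions, not matcher's internal implementation):

```python
# Predicted and true matches as sets of (left_id, right_id) pairs (toy data)
predicted = {("left_1", "right_1"), ("left_2", "right_2"), ("left_4", "right_9")}
truth = {("left_1", "right_1"), ("left_2", "right_2"), ("left_3", "right_3")}

true_positives = len(predicted & truth)       # correctly predicted pairs: 2
precision = true_positives / len(predicted)   # share of predictions that are correct
recall = true_positives / len(truth)          # share of true pairs that were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"precision={precision:.2%} recall={recall:.2%} f1={f1:.2%}")
# → precision=66.67% recall=66.67% f1=66.67%
```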
Example: compare two thresholds by running each and comparing metrics:

```python
from matcher import FuzzyMatcher

# Try threshold 0.85
results_85 = matcher.match(on=["name"], matching_algorithm=FuzzyMatcher(threshold=0.85))
m85 = results_85.evaluate(ground_truth)

# Try threshold 0.82 (more recall, maybe more false positives)
results_82 = matcher.match(on=["name"], matching_algorithm=FuzzyMatcher(threshold=0.82))
m82 = results_82.evaluate(ground_truth)

# Choose based on evidence
print(f"0.85: precision={m85['precision']:.2%}, recall={m85['recall']:.2%}")
print(f"0.82: precision={m82['precision']:.2%}, recall={m82['recall']:.2%}")
```

Use evaluation to:
- Compare approaches: Test different algorithms or thresholds and choose the best performer
- Validate improvements: Measure impact before committing to a new approach
- Track quality: Iterate until precision/recall are good enough for your use case
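The two-threshold comparison above generalizes to a sweep: score a grid of candidate thresholds and keep the one with the best F1. A toy sketch in plain Python (the scored pairs are made-up data standing in for fuzzy match results plus labels, not matcher output):

```python
# Each candidate pair carries a confidence score and a ground-truth label (toy data)
scored_pairs = [(0.95, True), (0.91, True), (0.88, False), (0.84, True), (0.70, False)]

def f1_at(threshold):
    # Keep only pairs at or above the threshold, then compute pairwise F1
    kept = [label for score, label in scored_pairs if score >= threshold]
    tp = sum(kept)
    if tp == 0:
        return 0.0
    precision = tp / len(kept)
    recall = tp / sum(label for _, label in scored_pairs)
    return 2 * precision * recall / (precision + recall)

# Sweep a grid of thresholds and keep the best by F1
candidates = [t / 100 for t in range(70, 100)]
best = max(candidates, key=f1_at)
print(best, round(f1_at(best), 3))  # → 0.71 0.857
```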
For fuzzy matching, use `find_best_threshold()` to pick a confidence threshold from match results and ground truth (it sweeps thresholds and returns the one that maximizes F1). It requires a `confidence` column, so use results from `match(on=[...], matching_algorithm=FuzzyMatcher(...))`:
```python
from matcher import Matcher, FuzzyMatcher, find_best_threshold

results = matcher.match(on=["name"], matching_algorithm=FuzzyMatcher(threshold=0.85))
best = find_best_threshold(results.matches, ground_truth, right_id_col="id_right")
print(f"Best threshold: {best['best_threshold']}, F1: {best['best_f1']:.2%}")
```

Export match results to CSV for human review (the file opens in Excel, Power BI, or any spreadsheet tool). It includes identifiers and joined columns so reviewers have enough context without opening other systems. Use `sample(n=...)` to export a manageable sample for reviewers.
```python
results = matcher.match(on=["name"], matching_algorithm=FuzzyMatcher(threshold=0.85))
results.export_for_review("matches_for_review.csv")

# Export a sample for reviewers
results.sample(n=50, seed=42).export_for_review("sample_for_review.csv")

# Focused export: only selected columns
results.pipe(lambda df: df.select(["id", "id_right", "confidence", "name", "name_right"])).export_for_review("review.csv")
```

```bash
# Install dependencies
uv sync

# Generate test datasets
uv run python scripts/generate_test_data.py

# Run tests
uv run pytest
```

The project includes sample datasets organized by component:
- **ExactMatcher** (`data/ExactMatcher/`):
  - **Entity Resolution** (`entity_resolution/`):
    - `customers_a.parquet` and `customers_b.parquet` - 500 records each
    - 40 known matches with exotic matching scenarios (documented in `ground_truth.md`)
    - Tests various matching rules: email-only, name-only, address+zip, mixed
  - **Deduplication** (`deduplication/`):
    - `customers.parquet` - 1000 records
    - 50 known duplicate pairs (documented in `ground_truth.md`)
  - **Evaluation** (`evaluation/`):
    - `customers_a.parquet` and `customers_b.parquet` - 50 records each
    - 30 simple email matches for stable evaluation testing
- **SimpleEvaluator** (`data/SimpleEvaluator/`):
  - **Evaluation** (`evaluation/`):
    - Test datasets for evaluator component testing

Regenerate test data with: `uv run python scripts/generate_test_data.py`
See `docs/archive/MATCHING_PLAN_V2.md` for the implementation plan and `CLAUDE.md` for development guidelines.
matcher follows hygge philosophy in its design:
- **Comfort Over Complexity**
  - APIs should feel natural and intuitive
  - Configuration should be simple but flexible
  - Defaults should "just work"
- **Flow Over Force**
  - Matching should work smoothly between data sources
  - Results should be immediately explorable
  - Progress should be visible but unobtrusive
- **Reliability Over Speed**
  - Prefer robust, predictable behavior
  - Handle errors gracefully
  - Make recovery simple
- **Clarity Over Cleverness**
  - Simple, clear code over complex optimizations
  - Explicit configuration over implicit behavior
  - Clear error messages and helpful guidance
matcher isn't just about matching records - it's about making entity resolution and deduplication feel natural, comfortable, and reliable. Like a warm blanket for your data matching needs.