Dataline

CJK-native entity resolution engine, built in Rust.

Dataline resolves mixed-script customer records — 陳大文 / Chan Tai Man / CHAN, Tai-Man / 陈大文 — using phonetic, visual, and normalization signals rather than transliterate-to-Latin approaches. It introduces a pool-based architecture that separates records by data completeness before matching, preventing sparse records from generating false merges.

Paper: Pool-Based Entity Resolution for Mixed-Script CJK Records

Quick start

git clone https://github.com/digital-rain-tech/dataline
cd dataline
cargo run --release --bin dataline-demo -- pipeline data/sample_1m.csv data/job_1m

Processes the 1M-person sample (about 2.7 million records) in ~66 seconds on a 20-core machine. Output:

data/job_1m/results.db     — full results in SQLite (query with sqlite3)
data/job_1m/matches.csv    — enriched match pairs (open in Excel / pandas)

The matches.csv columns: left_id, left_source, left_name, right_id, right_source, right_name, phase, confidence, rule, correct

The correct column is the ground-truth indicator: true when two records belong to the same person, false when they do not. In the published 1M benchmark run, the auto-merge tier reached 98.0% precision, while pairwise matching on sparse records reached 0.0% precision; pool separation prevents those sparse-record matches entirely.

Why pool-based?

Standard ER applied to CJK enterprise data produces a false-positive flood on sparse records. The benchmark shows this directly:

Phase	Comparisons	Matches	True Positives	Precision
Phase 1 (hash grouping)	0	—	—	—
Phase 2a (phone-corroborated)	21,176	20,878	20,877	100.0%
Phase 2b (attractor assignment)	1,945,780	14,347	13,636	95.0%
Stage 2 (sparse pairwise)	389,748	112,299	1	0.0%
Total auto-merge (2a+2b)		35,225	34,513	98.0%

Stage 2 is what happens without pool separation: 112,299 matches, 1 true positive. Pool separation routes those records to UNRESOLVED cohorts instead.
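The precision column can be re-derived from the match and true-positive counts in the table. The following is a minimal sketch of that arithmetic; the `PhaseStats` type and the helper names are illustrative and are not part of Dataline's code.

```rust
/// Match counts for one pipeline phase, copied from the benchmark table.
struct PhaseStats {
    name: &'static str,
    matches: u64,
    true_positives: u64,
}

/// Precision = true positives / reported matches (0.0 when nothing matched).
fn precision(matches: u64, true_positives: u64) -> f64 {
    if matches == 0 {
        0.0
    } else {
        true_positives as f64 / matches as f64
    }
}

/// Format one phase as "name: NN.N%".
fn report(p: &PhaseStats) -> String {
    format!("{}: {:.1}%", p.name, 100.0 * precision(p.matches, p.true_positives))
}

/// Sum matches and true positives over several phases.
fn combine(phases: &[PhaseStats]) -> (u64, u64) {
    phases
        .iter()
        .fold((0u64, 0u64), |(m, tp), p| (m + p.matches, tp + p.true_positives))
}

/// Published counts for the two auto-merge phases (2a and 2b).
fn auto_merge_phases() -> Vec<PhaseStats> {
    vec![
        PhaseStats { name: "Phase 2a", matches: 20_878, true_positives: 20_877 },
        PhaseStats { name: "Phase 2b", matches: 14_347, true_positives: 13_636 },
    ]
}
```

Combining the two auto-merge phases gives 34,513 true positives out of 35,225 matches, which rounds to the 98.0% in the table. Phase 2a shows 100.0% only after rounding, since it contains one false positive.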

Why not just transliterate to Latin?

That's the common production pattern: convert CJK to pinyin, run NYSIIS, compare. It loses information at every stage:

Pinyin is many-to-one: multiple unrelated characters romanize identically
NYSIIS collapses Chinese consonant distinctions: zh/z/j, ch/c/q, and sh/s/x are phonemically distinct in Chinese, but NYSIIS collapses them
Tones are discarded: Cantonese has 6–9 tones and Mandarin has 4, and all of that signal is lost
Visual errors are invisible: OCR misreads produce characters that look nearly identical but sound different, so phonetic-only matching scores them as non-matches
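The first and third points can be seen with a handful of characters. The sketch below groups characters by their toneless pinyin. The five readings are hand-written for illustration and do not come from Dataline's dictionaries, and the function names are assumptions.

```rust
use std::collections::BTreeMap;

/// Drop the trailing tone digit from a numbered-pinyin syllable ("zhang1" -> "zhang").
fn strip_tone(syllable: &str) -> &str {
    syllable.trim_end_matches(|c: char| c.is_ascii_digit())
}

/// Group characters by their toneless romanization.
fn group_by_toneless(readings: &[(char, &str)]) -> BTreeMap<String, Vec<char>> {
    let mut groups: BTreeMap<String, Vec<char>> = BTreeMap::new();
    for &(ch, pinyin) in readings {
        groups
            .entry(strip_tone(pinyin).to_string())
            .or_default()
            .push(ch);
    }
    groups
}

/// Hand-written illustration data (Mandarin readings with tone numbers).
const SAMPLE_READINGS: [(char, &str); 5] = [
    ('張', "zhang1"),
    ('章', "zhang1"),
    ('李', "li3"),
    ('黎', "li2"),
    ('利', "li4"),
];
```

Calling `group_by_toneless(&SAMPLE_READINGS)` leaves only two keys, "zhang" and "li", for five distinct characters. 張 and 章 collide even with tones kept, while 李, 黎, and 利 collide only once the tone digit is dropped.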

Architecture

Three matching signals

Signal	Measures	Catches
Phonetic	Jyutping/pinyin coordinate distance	Phone dictation, dialect variants, romanization differences
Visual	Stroke sequence similarity (20,901-char dictionary)	OCR errors, wrong radical, handwriting variants
Normalization	Simplified ↔ Traditional mapping	Cross-system script variants

Signals are combined via a deterministic rule engine (not a continuous threshold), producing traceable decisions such as `R3d: family match + phone match` rather than `score 0.73 > threshold 0.7`.
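To illustrate the difference from threshold scoring, here is a minimal sketch of a rule engine that returns the name of the rule that fired. Only the label "R3d: family match + phone match" comes from this README; the `Signals` and `Decision` types, the rule ordering, and the placeholder rule "X1" are assumptions made for the example, and the real rule conditions may be stricter.

```rust
/// Boolean evidence gathered for one candidate pair.
/// Field names are illustrative, not Dataline's types.
#[derive(Clone, Copy)]
struct Signals {
    family_match: bool,
    given_match: bool,
    phone_match: bool,
}

/// Outcome of rule evaluation: either a named rule fired or nothing did.
#[derive(Debug, PartialEq)]
enum Decision {
    Merge { rule: &'static str },
    NoMerge,
}

/// Evaluate rules in a fixed order; the first rule that fires names the decision.
fn decide(s: Signals) -> Decision {
    // "R3d" and its description are quoted from the README. "X1" is a made-up
    // placeholder rule, included only to show that ordering is deterministic.
    if s.family_match && s.given_match && s.phone_match {
        Decision::Merge { rule: "X1: full name match + phone match" }
    } else if s.family_match && s.phone_match {
        Decision::Merge { rule: "R3d: family match + phone match" }
    } else {
        Decision::NoMerge
    }
}
```

Because every merge carries a rule label, a reviewer can audit a decision by reading the rule rather than reverse-engineering a score.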

Pool-based pipeline

Records are classified by expected collision count before matching:

Pool A (rich): records with a low-collision corroborator (phone, national ID, DOB) — expected < ~1 person per name+field combination → form validated anchor clusters
Pool B (sparse): name-only or name+district records — expected ~6–118 persons per combination → classified as UNRESOLVED cohorts, never merged at low confidence
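A simplified sketch of the pool split follows. It approximates "expected collision count" by checking whether any low-collision corroborator is present, which is an assumption of this example; the `Record` fields and the `classify` function are hypothetical and do not mirror Dataline's internal types.

```rust
/// Minimal record shape; field names are assumptions for illustration.
#[allow(dead_code)] // name and district are carried along but not read by classify
struct Record {
    name: String,
    phone: Option<String>,
    national_id: Option<String>,
    dob: Option<String>,
    district: Option<String>,
}

#[derive(Debug, PartialEq)]
enum Pool {
    /// Has at least one low-collision corroborator (phone, national ID, DOB).
    A,
    /// Name-only or name+district.
    B,
}

/// A field counts as present only if it holds non-whitespace text.
fn has_value(field: &Option<String>) -> bool {
    field.as_deref().map_or(false, |v| !v.trim().is_empty())
}

/// Route a record to Pool A or Pool B before any matching happens.
fn classify(r: &Record) -> Pool {
    if has_value(&r.phone) || has_value(&r.national_id) || has_value(&r.dob) {
        Pool::A
    } else {
        Pool::B
    }
}
```

A district alone does not move a record into Pool A in this sketch, matching the description of name+district records as sparse.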

Pipeline phases:

Phase 1 — Zero-comparison hash grouping across 10 name variant × corroborator combinations
Phase 2a — Rule-engine validation within phone-corroborated clusters (100% precision)
Phase 2b — Attractor assignment: remaining records compared against cluster drivers
Stage 2 — Traditional pairwise on the small residual population
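Phase 1 can be pictured as bucketing records under a composite key, so that equal keys meet through hashing rather than comparison. The sketch below uses a single key, (Simplified-normalized name, phone), as one assumed example of a name variant × corroborator combination. The four-character mapping is hand-picked for illustration and is not the OpenCC table, and the function names are hypothetical.

```rust
use std::collections::HashMap;

/// Map a few Traditional characters to Simplified.
/// Hand-picked illustration entries only.
fn to_simplified(name: &str) -> String {
    name.chars()
        .map(|c| match c {
            '陳' => '陈',
            '張' => '张',
            '歐' => '欧',
            '陽' => '阳',
            other => other,
        })
        .collect()
}

/// Group record IDs by (normalized name, phone). Records sharing a key land in
/// the same bucket via hashing alone, so no pairwise comparison is performed.
fn hash_group(records: &[(u32, &str, &str)]) -> HashMap<(String, String), Vec<u32>> {
    let mut groups: HashMap<(String, String), Vec<u32>> = HashMap::new();
    for &(id, name, phone) in records {
        groups
            .entry((to_simplified(name), phone.to_string()))
            .or_default()
            .push(id);
    }
    groups
}
```

With this key, 陳大文 and 陈大文 on the same phone number fall into one bucket, while the same name on a different number stays separate.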

Output states: MERGED (auto-merge safe), UNRESOLVED (cohort awaiting enrichment), SINGLETON.

Hong Kong name handling

The same person appears as 陳大文先生; CHAN Tai Man; CHAN Tai Man, Peter; Peter Chan; 陈大文; and 阿文 across enterprise systems. The parser handles compound surnames (歐陽, Au-Yeung), honorific stripping (先生, 阿/小/老), HKID format (ALL CAPS, surname first), and comma-separated English aliases, using an 80-entry HK surname dictionary.
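Two of these steps, trailing-honorific stripping and alias splitting, can be sketched in a few lines. This is a simplified illustration and not Dataline's parser: `ParsedName` and `parse_name` are hypothetical names, 小姐 is added as an assumed extra honorific, and the sketch treats any comma as an alias separator, so it would misread a "SURNAME, Given" form such as CHAN, Tai-Man.

```rust
/// Result of splitting one raw name field.
#[derive(Debug, PartialEq)]
struct ParsedName {
    /// Name with honorific and alias removed.
    core: String,
    /// English alias found after a comma, if any.
    alias: Option<String>,
}

/// Trailing honorifics to strip. 先生 appears in the README;
/// 小姐 is an assumption added for this example.
const HONORIFIC_SUFFIXES: [&str; 2] = ["先生", "小姐"];

fn parse_name(raw: &str) -> ParsedName {
    let trimmed = raw.trim();
    // Split "CHAN Tai Man, Peter" into the legal name and the alias.
    let (name_part, alias) = match trimmed.split_once(',') {
        Some((name, alias)) if !alias.trim().is_empty() => {
            (name.trim(), Some(alias.trim().to_string()))
        }
        Some((name, _)) => (name.trim(), None),
        None => (trimmed, None),
    };
    // Strip one trailing honorific, e.g. "陳大文先生" -> "陳大文".
    let core = HONORIFIC_SUFFIXES
        .iter()
        .find_map(|h| name_part.strip_suffix(*h))
        .unwrap_or(name_part);
    ParsedName {
        core: core.trim().to_string(),
        alias,
    }
}
```

Prefix forms such as 阿/小/老 are left out on purpose, because stripping them safely needs a surname dictionary to tell an honorific from a character that is part of the name.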

Commands

# Run full pipeline on included 1M benchmark dataset
cargo run --release --bin dataline-demo -- pipeline data/sample_1m.csv data/job_1m

# Generate your own synthetic dataset
cargo run --release --bin dataline-demo -- generate 50000 data/my_data.csv
cargo run --release --bin dataline-demo -- pipeline data/my_data.csv data/my_job

# Run tests
cargo test

# Run benchmarks
cargo bench

Sample dataset

data/sample_1m.csv (stored via Git LFS) — 1,000,000 synthetic HK persons, 2,691,721 records across four source systems:

Source	Script	Phone coverage	Notes
CRM	Traditional Chinese + honorific	80%	Primary system
Billing	HKID romanization	100%	High completeness
Legacy	Simplified Chinese	40%	50% inclusion rate
English	English name only	70%	19% inclusion rate

Names drawn from gender-stratified bigram pools (87 male classic, 70 female classic, 68 Gen Y/Z) with HK Government Romanization for cross-script consistency.

Browser demo

wasm-pack build --target web --no-default-features --features wasm

Produces a ~1.6 MB .wasm binary (all dictionaries embedded) that runs entirely in the browser. Live demo: dataline.dev

Data sources

File	Source	Contents
`dict_chinese_stroke.txt`	FuzzyChinese	Stroke decompositions, 20,901 chars
`dict_cantonese_jyutping.txt`	cpp-pinyin	Cantonese Jyutping, 19,482 chars
`hk_gov_romanization.json`	cantoroman	HK Government Romanization, 11,612 chars
`STCharacters.txt` / `TSCharacters.txt`	OpenCC	S↔T mappings, 8,093 entries

Architecture decisions

Design rationale in docs/adr/:

ADR-003 — Multi-signal design and why phonetic-only fails
ADR-010 — Rule engine vs threshold scoring
ADR-018 — Phased cluster-first pipeline
ADR-022 — Pool-based separation design
ADR-024 — Driver record selection and cluster merge

Reproducibility

The benchmark uses synthetic data with known ground truth (person_id), enabling exact precision measurement:

# Run the full pipeline
cargo run --release --bin dataline-demo -- pipeline data/sample_1m.csv data/job_1m

# Check results
sqlite3 data/job_1m/results.db "SELECT phase, COUNT(*) as matches, SUM(CASE WHEN correct THEN 1 ELSE 0 END) as true_positives FROM matches GROUP BY phase"

Precision note: The code has been refined since publication. Current results on the 1M benchmark:

Phase 2a: 100% precision
Phase 2b: 100% precision
Stage 2: 100% precision (0 false positives on 748k candidate pairs)

The published 98% figure reflects the original code state. The improvement comes from fixes to specific rule edge cases: exact match requiring same token position (not just same tokens), S↔T only for CJK characters (not Latin), phone requiring 7-8 digit prefix (not last 4), and R3d phone corroboration requiring a given name signal. These fixes are verifiable by running the same command above — the ground truth is in the correct column.

License

Apache-2.0
