NO HARMS Identity Linkage
Methods and Code
This (sub)repository contains the code that implements record linkage for the NO HARMS project.
[Flowchart image] A stylized flowchart of the linkage workflow. See targets::tar_visnetwork(T) for the full workflow; targets::tar_visnetwork(T, exclude = c('theform', 'model_version', 'data_version', 'targetfile', 'tdsets', 'tdsets_files', 'targetfilepath', 'targetfilepath_files')) might work better depending on how finicky the DAG visualizer is feeling.
1. A list of record pairs to compare for a match is generated based on DOB and name characteristics. A record is the unique combination of source system, source id, first name, middle name, last name, DOB, and SSN.
2. A small subset of the generated pairs is manually assessed for matchy-ness. These manually labeled pairs are the training data for the linkage model.
3. A linkage model is constructed using ensemble modeling. The predictors are primarily built from person identifiers and location history source data.
4. All pairs identified in step 1 are evaluated for links.
5. Cross-validation is employed to identify a match score cutoff point.
6. Linkage networks are evaluated and, where relevant, broken up into sub-clusters via a network clustering algorithm.
7. The results of step 6 are aggregated to ensure internal consistency at the source system x source id level.
Review _targets.R and the associated helper functions (e.g. targets::tar_visnetwork(T)) to understand the process flow. Run create_data_duck.R to (re)generate the data.duckdb database, and do any migration of training ids to a new hash set (previous examples can be found in the migrations folder).
The raw data comes from noharms.identifiers on HHSAW. There is a separate ETL process that prepares the data.
Data are downloaded and cleaned on load via the init_duck function, with most of the actual work farmed out to the clean_noharms_ids function (which init_duck calls). The cleaning steps are listed below, followed by a sketch of a few of them.
- Junk names not removed in the preparation of the “raw” data are set to NA.
- Names are limited to alphanumeric characters and cleaned (e.g. removing JRs). Name columns without spaces are also created.
- SSNs are forced into compliance (e.g. 9 digits, less than 900000000).
- ZIP codes are forced into compliance with the schema (5 digit number).
- Junk ZIPs are removed/set to NA.
- NAs for certain identifiers (e.g. names, SSNs) are filled in from valid responses within a given source_id/source_system.
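A minimal sketch of a few of these steps, assuming a data frame with first_name, last_name, ssn, and zip character columns. The function and column names here are illustrative; the real logic lives in clean_noharms_ids.

```r
# Illustrative only: the actual cleaning happens in clean_noharms_ids().
library(dplyr)

clean_ids_sketch <- function(df) {
  df %>%
    mutate(
      # SSNs must be 9 digits and below 900000000 (i.e. first digit 0-8)
      ssn = if_else(grepl("^[0-8][0-9]{8}$", ssn), ssn, NA_character_),
      # ZIPs must be a 5 digit number; junk ZIPs become NA
      zip = if_else(grepl("^[0-9]{5}$", zip), zip, NA_character_),
      # Names restricted to alphanumeric characters (plus spaces)
      first_name = toupper(gsub("[^A-Za-z0-9 ]", "", first_name)),
      last_name  = toupper(gsub("[^A-Za-z0-9 ]", "", last_name)),
      # No-space variants, as described above
      first_name_nospace = gsub(" ", "", first_name),
      last_name_nospace  = gsub(" ", "", last_name)
    )
}
```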
It is computationally ill-advised to compare all people to each other when trying to identify matches. Listed below are the criteria used to decide a priori whether to evaluate two records as a potential match. The blocking/filtering occurs via the make_query_grid, make_block_var_table, make_block, and block functions. Combined, these functions identify and store the pairs of records that meet at least one of the following conditions:
1. Exact match between DOB and last name
2. Exact match between DOB and first name
3. Exact match between valid SSNs
4. Exact match between last 4 of SSN and last name
5. Exact match between last 4 of SSN and first name
6. Exact match between last 4 of SSN and DOB
7. Exact match between DOB, first two letters of first name, and first two letters of last name
8. Exact match on first name, last name, and year of birth
9. Exact match on first name, last name, DOB month, and DOB day
10. Exact match on DOB and Jaccard similarities > .5 for both first name and last name
11. Exact match on DOB and A’s last name wholly contained in B’s last name
12. Same as above, but comparing B to A (instead of A to B)
13. Exact match on DOB and A’s first name wholly contained in B’s first name
14. Same as above, but comparing B to A (instead of A to B)
15. Jaro-Winkler similarity > .7 for first and last name, exact match on year of birth, and exact match on day of birth
16. Jaro-Winkler similarity > .7 for first and last name, exact match on year of birth, exact match on month of birth, and (an absolute difference in day of birth less than 10 OR either record’s day of birth is the first of the month). Note: this condition could probably be faster as three separate AND conditions.
For conditions 10 and up, pairs already captured by conditions 1-9 are screened out to save on computation. make_query_grid and make_block contain the code where this occurs.
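For flavor, condition 1 (exact DOB plus exact last name) could be expressed against the duckdb database roughly as below. The table and column names are assumptions; the real queries are assembled by make_query_grid and friends.

```r
# Sketch of one blocking condition run against data.duckdb;
# table/column names are illustrative.
library(DBI)
library(duckdb)

con <- dbConnect(duckdb::duckdb(), "data.duckdb")

pairs_cond1 <- dbGetQuery(con, "
  SELECT a.record_id AS id1, b.record_id AS id2
  FROM records a
  JOIN records b
    ON a.dob = b.dob
   AND a.last_name = b.last_name
  WHERE a.record_id < b.record_id  -- each unordered pair once, no self-pairs
")

dbDisconnect(con, shutdown = TRUE)
```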
Once the list of pairs to evaluate has been identified, a series of predictor variables (the input to the ML model) is created by make_model_frame, which is mostly a light wrapper around some SQL. predict_links.R and compile_training_data.R are the relevant parts of the targets workflow. As of 08/15/2024 the variable list includes:
- dob_exact: DOB exact match
- dob_year_exact: exact match of year of birth
- mdob_mdham: hamming distance between month and day of birth
- dob0101: either DOB is 01-01-YYYY
- gender_agree: gender explicitly matches
- first_name_jw: Jaro-Winkler distance of first names
- last_name_jw: Jaro-Winkler distance of last names
- name_swap_jw: Jaro-Winkler distance of names with first and last swapped
- complete_name_dl: Damerau-Levenshtein distance between the full names. Full name is either first + last or first + middle + last; the minimum distance is used.
- middle_initial_agree: explicit match of middle initial
- last_in_last: whether either record’s whole last name is contained in the other one
- ssn_full_exact: the full valid SSNs exactly match
- ssn_last4_exact: last 4 digits of SSN exactly match
- ssn_last_4_dl: the Damerau-Levenshtein distance between the last 4 digits of SSN
- first_is_middle: first name exactly matches middle name
- first_name_freq: created by create_name_frequency. Scaled frequency tabulation of first names
- last_name_freq: created by create_name_frequency. Scaled frequency tabulation of last names
- dob_freq: created by create_name_frequency. Scaled frequency tabulation of DOBs
- mn_ziphist_distance: minimum observed distance between ZIP codes
- exact_location: binary flag indicating address histories overlap within 3 meters (location only, not spatio-temporal)
- ssn_any_lastname: exact match in any last name between records with the same valid SSN. For example, the record for person A may have a name of “Jon” while person B is “Jonathan”, but there is a “Jonathan” among records that share an SSN with record A.
- ssn_any_firstname: exact match in any first name between records with the same valid SSN
- ssid_any_firstname: exact match in any first name between records with the same source_id and source_system
- ssid_any_lastname: exact match in any last name between records with the same source_id and source_system
- nickname_match: whether one first name is a nickname of the other. See add_nicknames_to_db.R for sourcing and whatnot.
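For illustration, a few of the string-distance predictors could be computed with the stringdist package as sketched below. The pairwise column names (first_name_1, first_name_2, etc.) are assumptions; make_model_frame is the authoritative implementation.

```r
# Sketch of a few string-distance predictors; column names illustrative.
library(stringdist)

make_predictors_sketch <- function(pairs) {
  within(pairs, {
    dob_exact     <- as.integer(dob_1 == dob_2)
    first_name_jw <- stringdist(first_name_1, first_name_2, method = "jw", p = 0.1)
    last_name_jw  <- stringdist(last_name_1,  last_name_2,  method = "jw", p = 0.1)
    # Swapped-name comparison catches first/last transpositions
    name_swap_jw  <- stringdist(
      paste(first_name_1, last_name_1),
      paste(last_name_2, first_name_2),
      method = "jw", p = 0.1
    )
    # Damerau-Levenshtein distance on the last 4 digits of SSN
    ssn_last_4_dl <- stringdist(
      substr(ssn_1, 6, 9), substr(ssn_2, 6, 9), method = "dl"
    )
  })
}
```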
The linkage model contains two separate sub-models, a screener and a tuned ensemble, as well as some bounds parameters. The screener model is a lasso logistic regression. Pairs with a prediction within the specified bounds (.01 - .99 by default) are passed on to the ML ensemble model. Lasso logistic regression is also used to ensemble the results from the tuned child models. While the child models that don’t get lasso-ed out vary by model version, the candidate pool usually includes boosted MLPs, random forests, boosted regression trees, neural nets, and SVMs. fit_submodel and create_stacked_model implement the model fitting and ensembling, while fit_screening_model fits the screener. The selection of candidate models is handled/specified in _targets.R.
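A minimal sketch of the screener stage, assuming a labeled training frame train with numeric predictor columns and a 0/1 is_match column (see fit_screening_model for the real code):

```r
# Lasso logistic screener whose predictions gate entry to the ensemble.
library(glmnet)

x <- as.matrix(train[, predictor_cols])  # predictor_cols: assumed vector of names
y <- train$is_match

# alpha = 1 gives the lasso penalty; cv.glmnet tunes lambda
screener <- cv.glmnet(x, y, family = "binomial", alpha = 1)

p <- predict(screener,
             newx = as.matrix(new_pairs[, predictor_cols]),
             s = "lambda.min", type = "response")[, 1]

# Only pairs inside the bounds (default .01 - .99) proceed to the ensemble;
# the rest are resolved by the screener alone.
needs_ensemble <- p > 0.01 & p < 0.99
```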
After blocking, filtering, and generating comparisons, pairs are evaluated and given a match score/prediction by the linkage model. The prediction process is parallelized over chunks of pairs to keep the computational flow consistent. Potential pairs with a match score > .05 are saved to a versioned table in HHSAW.
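A sketch of what the chunked, parallelized scoring can look like; the chunk size and the score_pairs function are hypothetical stand-ins for the predict_links.R machinery.

```r
# Score pairs in chunks, in parallel; score_pairs() is a hypothetical scorer.
library(future.apply)
plan(multisession)

chunk_ids <- split(seq_len(nrow(pairs)),
                   ceiling(seq_len(nrow(pairs)) / 1e5))  # ~100k pairs per chunk

scores <- future_lapply(chunk_ids, function(idx) {
  score_pairs(pairs[idx, ])
})
scores <- unlist(scores, use.names = FALSE)

# Keep only pairs worth storing (match score > .05 per the text)
keep <- pairs[scores > 0.05, ]
```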
The prediction step occurs in two main stages (within and between source systems) via predict_links.R. Records with a shared source system and source id are assumed to be matches and are identified via fixed_links.R.
Fit/error/goodness metrics are calculated via 5 rounds of 5-fold cross-validation in cv_refit.R and compiled together in cv_cutoffs.R. The final cutoff (designating a match/non-match) is selected at the point of greatest accuracy. EMS <-> RHINO and MEO <-> Death both have deterministic linkage logic.
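The cutoff search amounts to scanning candidate thresholds and keeping the most accurate one. A minimal sketch, assuming a frame cv_preds of cross-validated scores and 0/1 labels:

```r
# Pick the cutoff that maximizes accuracy over the CV predictions.
cutoffs <- seq(0.01, 0.99, by = 0.01)

accuracy <- vapply(cutoffs, function(ct) {
  mean((cv_preds$score >= ct) == (cv_preds$is_match == 1))
}, numeric(1))

best_cutoff <- cutoffs[which.max(accuracy)]
```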
Pairs with a match score above the cutoff point are assembled into networks with the match score as the edge weight. To reduce the effect of bridging edges and increase network density, the Leiden clustering algorithm (reconcile_links.R) is applied to networks with more than 3 vertices and a density (# of connections / # of possible connections) below .7. The resulting subclusters (and the networks that did not need to be clustered) are aggregated to the source id/source system level (collapse_links.R). The result of the aggregation is a final “study id” that determines which records represent the same person.
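A minimal sketch of the network step using igraph, assuming a links frame of above-cutoff pairs with id1, id2, and score columns (reconcile_links.R holds the real logic):

```r
# Build a weighted graph of linked pairs and split large, sparse
# components with Leiden clustering.
library(igraph)

g <- graph_from_data_frame(links[, c("id1", "id2")], directed = FALSE)
E(g)$weight <- links$score

components <- decompose(g)

clustered <- lapply(components, function(comp) {
  if (gorder(comp) > 3 && edge_density(comp) < 0.7) {
    cluster_leiden(comp, objective_function = "modularity",
                   weights = E(comp)$weight)
  } else {
    NULL  # small/dense networks are kept whole
  }
})
```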
make_new_training_data.R identifies new record pairs for manual review and subsequent inclusion in the training data. A few methods are implemented/regularly used (the first is sketched after this list):
- Stratified sampling based on match score
- Minimum numbers of evaluated pairs based on dataset combinations (e.g. there must be at least 20 pairs between EMS and Medicaid)
- Minimum numbers of evaluated pairs by race/ethnicity
- Top N worst predictions for training pairs (should a pair be re-evaluated? Maybe the human was wrong?)
- Pairs where the ML approach disagrees with the IDH
- Pairs where the previous ML version disagrees with the current one
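A minimal sketch of the first method (stratified sampling by match score), assuming a scored_pairs frame with a score column; the bin edges and per-stratum counts are illustrative.

```r
# Sample up to 20 pairs per score decile for manual review.
library(dplyr)

to_review <- scored_pairs %>%
  mutate(stratum = cut(score, breaks = seq(0, 1, by = 0.1),
                       include.lowest = TRUE)) %>%
  group_by(stratum) %>%
  slice_sample(n = 20) %>%   # silently truncated in small strata
  ungroup()
```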
New pairs are labeled with the assistance of the hyrule::matchmaker shiny app (launched from make_new_training_data.R). For difficult-to-assess cases, the raw data are sometimes explored directly (usually via SQL).
Once new training data/labeled pairs are available, the ensemble is usually refit and the prediction -> assessment -> new pairs process begins anew.
