NO HARMS Identity Linkage
Methods and Code
This (sub)repository contains the code that implements record linkage for the NO HARMS project.
[Flowchart image] A stylized flowchart of the linkage workflow. See targets::tar_visnetwork(T) for the full workflow; targets::tar_visnetwork(T, exclude = c('theform', 'model_version', 'data_version', 'targetfile', 'tdsets', 'tdsets_files', 'targetfilepath', 'targetfilepath_files')) might work better depending on how finicky the DAG visualizer is feeling.
1. A list of record pairs to compare for a match is generated based on DOB and name characteristics. A record is the unique combination of source system, source id, first name, middle name, last name, DOB, and SSN.
2. A small subset of the generated pairs is manually assessed for matchy-ness. These manually labeled pairs are the training data for the linkage model.
3. A linkage model is constructed using ensemble modeling. The predictors are primarily built from person identifiers and location history source data.
4. All pairs identified in step 1 are evaluated for links.
5. Cross-validation is employed to identify a match score cutoff point.
6. Linkage networks are evaluated and, where relevant, broken up into sub-clusters via a network clustering algorithm.
7. The results of step 6 are aggregated to ensure internal consistency at the source system x source id level.
Review _targets.R and the associated helper functions (e.g. targets::tar_visnetwork(T)) to understand the process flow. Run create_data_duck.R to (re)generate the data.duckdb database, and do any migration of training ids to a new hash set (previous examples can be found in the migrations folder).
The raw data comes from noharms.identifiers on HHSAW. There is a separate ETL process that prepares the data.
Data are downloaded and cleaned on load via the init_duck function, with most of the actual work farmed out to the clean_noharms_ids function (which init_duck calls). The cleaning steps are listed below, followed by a sketch of a few of them.
- Junk names not removed in the preparation of the “raw” data are set to NA.
- Names are limited to alphanumeric characters and cleaned (e.g. removing JRs). Name columns without spaces are also created.
- SSNs are forced into compliance (e.g. 9 digits, less than 900000000).
- ZIP codes are forced into compliance with the schema (5 digit number).
- Junk ZIPs are removed/set to NA.
- NAs for certain identifiers (e.g. names, SSNs) are filled in from valid responses within a given source_id/source_system.
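A minimal sketch of a few of these steps, assuming a data frame with first_name, last_name, ssn, and zip character columns. The function and column names here are illustrative; the real logic lives in clean_noharms_ids.

```r
# Illustrative only: the actual cleaning happens in clean_noharms_ids().
library(dplyr)

clean_ids_sketch <- function(df) {
  df %>%
    mutate(
      # SSNs must be 9 digits and below 900000000 (i.e. first digit 0-8)
      ssn = if_else(grepl("^[0-8][0-9]{8}$", ssn), ssn, NA_character_),
      # ZIPs must be a 5 digit number; junk ZIPs become NA
      zip = if_else(grepl("^[0-9]{5}$", zip), zip, NA_character_),
      # Names restricted to alphanumeric characters (plus spaces)
      first_name = toupper(gsub("[^A-Za-z0-9 ]", "", first_name)),
      last_name  = toupper(gsub("[^A-Za-z0-9 ]", "", last_name)),
      # No-space variants, as described above
      first_name_nospace = gsub(" ", "", first_name),
      last_name_nospace  = gsub(" ", "", last_name)
    )
}
```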
It is computationally ill-advised to compare all people to each other when trying to identify matches. Listed below are the criteria used to decide a priori whether to evaluate two records as a potential match. The blocking/filtering occurs via the make_query_grid, make_block_var_table, make_block, and block functions. Combined, these functions identify and store the pairs of records that meet at least one of the following conditions:
1. Exact match between DOB and last name
2. Exact match between DOB and first name
3. Exact match between valid SSNs
4. Exact match between last 4 of SSN and last name
5. Exact match between last 4 of SSN and first name
6. Exact match between last 4 of SSN and DOB
7. Exact match between DOB, first two letters of first name, and first two letters of last name
8. Exact match on first name, last name, and year of birth
9. Exact match on first name, last name, DOB month, and DOB day
10. Exact match on DOB and Jaccard similarities > .5 for both first name and last name
11. Exact match on DOB and A’s last name wholly contained in B’s last name
12. Same as above, but comparing B to A (instead of A to B)
13. Exact match on DOB and A’s first name wholly contained in B’s first name
14. Same as above, but comparing B to A (instead of A to B)
15. Jaro-Winkler similarity > .7 for first and last name, exact match on year of birth, and exact match on day of birth
16. Jaro-Winkler similarity > .7 for first and last name, exact match on year of birth, exact match on month of birth, and (an absolute difference in day of birth less than 10 OR either record’s day of birth is the first of the month). Note: this condition could probably be faster as three separate AND conditions.
For conditions 10 and up, pairs already captured by conditions 1-9 are screened out to save on computation. make_query_grid and make_block contain the code where this occurs.
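For flavor, condition 1 (exact DOB plus exact last name) could be expressed against the duckdb database roughly as below. The table and column names are assumptions; the real queries are assembled by make_query_grid and friends.

```r
# Sketch of one blocking condition run against data.duckdb;
# table/column names are illustrative.
library(DBI)
library(duckdb)

con <- dbConnect(duckdb::duckdb(), "data.duckdb")

pairs_cond1 <- dbGetQuery(con, "
  SELECT a.record_id AS id1, b.record_id AS id2
  FROM records a
  JOIN records b
    ON a.dob = b.dob
   AND a.last_name = b.last_name
  WHERE a.record_id < b.record_id  -- each unordered pair once, no self-pairs
")

dbDisconnect(con, shutdown = TRUE)
```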
Once the list of pairs to evaluate has been identified, a series of predictor variables (the input to the ML model) is created by make_model_frame, which is mostly a light wrapper around some SQL. predict_links.R and compile_training_data.R are the relevant parts of the targets workflow. As of 08/15/2024 the variable list includes:
- dob_exact: DOB exact match
- dob_year_exact: exact match of year of birth
- mdob_mdham: hamming distance between month and day of birth
- dob0101: either DOB is 01-01-YYYY
- gender_agree: gender explicitly matches
- first_name_jw: Jaro-Winkler distance of first names
- last_name_jw: Jaro-Winkler distance of last names
- name_swap_jw: Jaro-Winkler distance of names with first and last swapped
- complete_name_dl: Damerau-Levenshtein distance between the full names. Full name is either first + last or first + middle + last; the minimum distance is used.
- middle_initial_agree: explicit match of middle initial
- last_in_last: whether either record’s whole last name is contained in the other one
- ssn_full_exact: the full valid SSNs exactly match
- ssn_last4_exact: last 4 digits of SSN exactly match
- ssn_last_4_dl: the Damerau-Levenshtein distance between the last 4 digits of SSN
- first_is_middle: first name exactly matches middle name
- first_name_freq: created by create_name_frequency. Scaled frequency tabulation of first names
- last_name_freq: created by create_name_frequency. Scaled frequency tabulation of last names
- dob_freq: created by create_name_frequency. Scaled frequency tabulation of DOBs
- mn_ziphist_distance: minimum observed distance between ZIP codes
- exact_location: binary flag indicating address histories overlap within 3 meters (location only, not spatio-temporal)
- ssn_any_lastname: exact match in any last name between records with the same valid SSN. For example, the record for person A may have a name of “Jon” while person B is “Jonathan”, but there is a “Jonathan” among records that share an SSN with record A.
- ssn_any_firstname: exact match in any first name between records with the same valid SSN
- ssid_any_firstname: exact match in any first name between records with the same source_id and source_system
- ssid_any_lastname: exact match in any last name between records with the same source_id and source_system
- nickname_match: whether one first name is a nickname of the other. See add_nicknames_to_db.R for sourcing and whatnot.
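For illustration, a few of the string-distance predictors could be computed with the stringdist package as sketched below. The pairwise column names (first_name_1, first_name_2, etc.) are assumptions; make_model_frame is the authoritative implementation.

```r
# Sketch of a few string-distance predictors; column names illustrative.
library(stringdist)

make_predictors_sketch <- function(pairs) {
  within(pairs, {
    dob_exact     <- as.integer(dob_1 == dob_2)
    first_name_jw <- stringdist(first_name_1, first_name_2, method = "jw", p = 0.1)
    last_name_jw  <- stringdist(last_name_1,  last_name_2,  method = "jw", p = 0.1)
    # Swapped-name comparison catches first/last transpositions
    name_swap_jw  <- stringdist(
      paste(first_name_1, last_name_1),
      paste(last_name_2, first_name_2),
      method = "jw", p = 0.1
    )
    # Damerau-Levenshtein distance on the last 4 digits of SSN
    ssn_last_4_dl <- stringdist(
      substr(ssn_1, 6, 9), substr(ssn_2, 6, 9), method = "dl"
    )
  })
}
```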
The linkage model contains two separate sub-models, a screener and a tuned ensemble, as well as some bounds parameters. The screener model is a lasso logistic regression. Pairs with a prediction within the specified bounds (.01 - .99 by default) are passed on to the ML ensemble model. Lasso logistic regression is also used to ensemble the results from the tuned child models. While the child models that don’t get lasso-ed out vary by model version, the candidate pool usually includes boosted MLPs, random forests, boosted regression trees, neural nets, and SVMs. fit_submodel and create_stacked_model implement the model fitting and ensembling, while fit_screening_model fits the screener. The selection of candidate models is handled/specified in _targets.R.
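A minimal sketch of the screener stage, assuming a labeled training frame train with numeric predictor columns and a 0/1 is_match column (see fit_screening_model for the real code):

```r
# Lasso logistic screener whose predictions gate entry to the ensemble.
library(glmnet)

x <- as.matrix(train[, predictor_cols])  # predictor_cols: assumed vector of names
y <- train$is_match

# alpha = 1 gives the lasso penalty; cv.glmnet tunes lambda
screener <- cv.glmnet(x, y, family = "binomial", alpha = 1)

p <- predict(screener,
             newx = as.matrix(new_pairs[, predictor_cols]),
             s = "lambda.min", type = "response")[, 1]

# Only pairs inside the bounds (default .01 - .99) proceed to the ensemble;
# the rest are resolved by the screener alone.
needs_ensemble <- p > 0.01 & p < 0.99
```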
After blocking, filtering, and generating comparisons, pairs are evaluated and given a match score/prediction by the linkage model. The prediction process is parallelized over chunks of pairs to keep the computational flow consistent. Potential pairs with a match score > .05 are saved to a versioned table in HHSAW.
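A sketch of what the chunked, parallelized scoring can look like; the chunk size and the score_pairs function are hypothetical stand-ins for the predict_links.R machinery.

```r
# Score pairs in chunks, in parallel; score_pairs() is a hypothetical scorer.
library(future.apply)
plan(multisession)

chunk_ids <- split(seq_len(nrow(pairs)),
                   ceiling(seq_len(nrow(pairs)) / 1e5))  # ~100k pairs per chunk

scores <- future_lapply(chunk_ids, function(idx) {
  score_pairs(pairs[idx, ])
})
scores <- unlist(scores, use.names = FALSE)

# Keep only pairs worth storing (match score > .05 per the text)
keep <- pairs[scores > 0.05, ]
```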
The prediction step occurs in two main stages (within and between source systems) via predict_links.R. Records with a shared source system and source id are assumed to be matches and are identified via fixed_links.R.
Fit/error/goodness metrics are calculated via 5 rounds of 5-fold cross-validation in cv_refit.R and compiled together in cv_cutoffs.R. The final cutoff (designating a match/non-match) is selected at the point of greatest accuracy. EMS <-> RHINO and MEO <-> Death both have deterministic linkage logic.
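The cutoff search amounts to scanning candidate thresholds and keeping the most accurate one. A minimal sketch, assuming a frame cv_preds of cross-validated scores and 0/1 labels:

```r
# Pick the cutoff that maximizes accuracy over the CV predictions.
cutoffs <- seq(0.01, 0.99, by = 0.01)

accuracy <- vapply(cutoffs, function(ct) {
  mean((cv_preds$score >= ct) == (cv_preds$is_match == 1))
}, numeric(1))

best_cutoff <- cutoffs[which.max(accuracy)]
```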
Pairs with a match score above the cutoff point are assembled into networks with the match score as the edge weight. To reduce the effect of bridging edges and increase network density, the Leiden clustering algorithm (reconcile_links.R) is applied to networks with more than 3 vertices and a density (# of connections / # of possible connections) below .7. The resulting subclusters (and the networks that did not need to be clustered) are aggregated to the source id/source system level (collapse_links.R). The result of the aggregation is a final “study id” that determines which records represent the same person.
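A minimal sketch of the network step using igraph, assuming a links frame of above-cutoff pairs with id1, id2, and score columns (reconcile_links.R holds the real logic):

```r
# Build a weighted graph of linked pairs and split large, sparse
# components with Leiden clustering.
library(igraph)

g <- graph_from_data_frame(links[, c("id1", "id2")], directed = FALSE)
E(g)$weight <- links$score

components <- decompose(g)

clustered <- lapply(components, function(comp) {
  if (gorder(comp) > 3 && edge_density(comp) < 0.7) {
    cluster_leiden(comp, objective_function = "modularity",
                   weights = E(comp)$weight)
  } else {
    NULL  # small/dense networks are kept whole
  }
})
```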
make_new_training_data.R identifies new record pairs for manual review and subsequent inclusion in the training data. A few methods are implemented/regularly used (the first is sketched after this list):
- Stratified sampling based on match score
- Minimum numbers of evaluated pairs based on dataset combinations (e.g. there must be at least 20 pairs between EMS and Medicaid)
- Minimum numbers of evaluated pairs by race/ethnicity
- Top N worst predictions for training pairs (should a pair be re-evaluated? Maybe the human was wrong?)
- Pairs where the ML approach disagrees with the IDH
- Pairs where the previous ML version disagrees with the current one
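A minimal sketch of the first method (stratified sampling by match score), assuming a scored_pairs frame with a score column; the bin edges and per-stratum counts are illustrative.

```r
# Sample up to 20 pairs per score decile for manual review.
library(dplyr)

to_review <- scored_pairs %>%
  mutate(stratum = cut(score, breaks = seq(0, 1, by = 0.1),
                       include.lowest = TRUE)) %>%
  group_by(stratum) %>%
  slice_sample(n = 20) %>%   # silently truncated in small strata
  ungroup()
```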
New pairs are labeled with the assistance of the hyrule::matchmaker shiny app (launched from make_new_training_data.R). For difficult-to-assess cases, the raw data are sometimes explored directly (usually via SQL).
Once new training data/labeled pairs are available, the ensemble is usually refit and the prediction -> assessment -> new pairs process begins anew.
