Skip to content

naccdata/UDS4_Cross_Walker

 
 

Repository files navigation

UDS4 Crosswalk

Authors

License

This project is licensed under the MIT License - see LICENSE for details

Part 1 - Generation of JSON Files

UDS_JSON_Maker.py

This is the primary Python script which processes Excel files in a given folder (given as a command line argument), process the relevant mapping rules, and saves the output as JSON.

The goal of the JSON files is to help programmatically apply the conversion of data elements between UDS3 and UDS4 formats. An emphasis is given on UDS3 -> UDS4 although reversible rules are also flagged.

Crosswalk Folder

This folder contains properly formatted Excel workbooks used by the Python script to generate the JSON files containing the following sheets:

  • UDS3_4 Tables Data Dictionary: Data dictionary for the three UDS variable tables and the change indicators between UDS3 and UDS4
  • XX UDS3 DED: Form specific table for UDS3 built from the Data Element Dictionary for UDS3, included for reference
  • XX UDS3 REDCap: Form specific table for UDS3 built from the original Kansas ADRC REDCap data dictionary
  • XX UDS4 REDCap: Form specific table for UDS4
    • Note, Template Mappings are oriented using REDCap-to-REDCap conventions
  • Mappings Data Dictionary: Data dictionary for the template mappings to migrate UDS variables from version 3 to version 4
  • Template Mappings REDCap: The sheet containing all rules processed and stored into JSON format. Modifications to rules for a specific ADRC can be applied here.
  • XX Mappings - ADRC Decisions: Ambiguous rules that should be vetted by the ADRC. Options are flagged as ADRC Decision_A and ADRC Decision_B where applicable to indicate rule sets.

Rules in 'Mappings - ADRC Decisions' must be vetted and added to the 'Template Mappings REDCap' sheet before compiling JSON files!

Modification of Template Mappings

Template mappings can be modified as needed to reflect your specific instance of UDS3 and UDS4. Structure on the Template Mappings REDCap should remain consistent, simply update the entries in the 'UDS3 value' and 'UDS4 value' columns.

As described, be sure to select rules from the ADRC Decision sheet and add them to the Template Mappings prior to generating JSON files.

Application

Requirements

Before you run the script, make sure you have Python installed on your system along with the required libraries. To install the necessary dependencies, follow these steps:

  1. Install Python (if not already installed):

  2. Install required libraries from the command line using pip (or controlling environment e.g. conda)

    pip install -r requirements.txt

How to Use the Script

Step 1: Prepare Your Files Ensure that your Excel files are located in a folder on your computer. Default

Step 2: Run the Script To run the script, open your terminal or command prompt and navigate to the folder where the script is located. Run the following command, replacing <folder_path> with the path to the folder containing your Excel files if needed:

python UDS_JSON_Maker.py <folder_path>

For example:

python UDS_JSON_Maker.py C:/Users/YourUsername/Downloads/crosswalks

If no folder path is provided, this will default to './Crosswalk' located in folder from which the UDS_JSON_Maker.py is run. For convenience, a collection of crosswalk sheets are provided here.

The script will process all Excel files in the provided folder and generate:

  • A JSON file for each Excel file, named <original_filename>_mappings.json which can then be used for a programmatic crosswalk.

Step 3: Check the Output: After running the script, the generated JSON files are in the same folder as the Excel files.

Note: Ensure the Excel files contain the required input in the UDS3/4 REDCap dictionary sheets and Template Mappings sheets. Note, new UDS4 forms have no mappings rules thus some crosswalk files for forms like A1a, A4a, etc., do not have the 'Template Mappings REDCap' sheet . Be sure to remove or delete those Excel crosswalk files from the folder before running the script.

Troubleshooting

If you encounter any issues, ensure that you are using the correct Python version and have installed all dependencies from the requirements.txt file.
Make sure that the folder path provided contains only .xlsx files that match the required format described in the Crosswalk section above.

Part 2 - Crosswlk Parser

Overview

This script is designed to be used in conjunction with a collection of JSON-based crosswalk rules to migrate UDS3 data to UDS4 data. It is the final step in the UDS3-UDS4 Crosswalk process. Three constituent pieces are required:

  • UDS3 source data csv file
  • At least one JSON Crosswalk file
  • A file of UDS4 Data Elements used for re-ordering Both Python and R versions of the Crosswalk Parser exist.

Note a column order text file is provided with this repository and can be used with this Parser. This matches the Data Element Dictionary order provided by NACC. The file name is UDS4Data_ColumnOrder.txt found within this parser's main folder.

Scope

This Crosswalk Parser is designed to work with csv formatted data files e.g. those from REDCap export. It has been validated for mapping of levels (i.e. the numeric representation of responses) although options for label mapping are a potential modification the authors are considering.

JSON Files

For details on the generation of JSON crosswalk mapping files, please see the UDS4 Crosswalk - JSON Maker repository. These scripts presume appropriate JSON files have already been generated. They do not need have been generated from the JSON Maker but a similar output is expected.

UDS_Crosswalk_Parser.py / UDS_Crosswalk_Parser.R

These are the principal processing scripts which take in an appropriately formatted file of UDS3 data, a folder of the JSON files developed by the JSON maker, and a listing of the UDS4 variables in a preferred order. The Crosswalk Parser then applies the JSON rules in order to migrate all UDS3 data to their nearest UDS4 variants where they exist. UDS3 data which do not have a JSON mapping rule (i.e. a nearest UDS4 neighbor) are excluded and will not be returned.

Arguments, Folders, and Required Files

Both UDS_Crosswalk_Parser.py and UDS_Crosswalk_Parser.R are designed to be run from the command line using either of the following:

1. Paths as arguments from the command line

python UDS_Crosswalk_Parser.py <path to UDS3 CSV> <path to folder of JSON files> <path to UDS4 order txt>
Rscript UDS_Crosswalk_Parser.R <path to UDS3 CSV> <path to folder of JSON files> <path to UDS4 order txt>

2. Interactive selection of files and folders

If the OS support GUI based folder selection, the scripts can also be run from the command line without arguments. The system will then prompt the user to select (1) the UDS3 CSV, (2) the folder location of the JSON files, and (3) the text file of UDS4 variables for ordering.

3. Running scripts directly from IDE environment

The two scripts can also be run directly from within a Jupyter notebook or Rstudio environment. Be sure to initialize the string paths to the files / folders within the Main Process sections as needed.

Application

Requirements - Python

Before you run the script, make sure you have Python installed on your system along with the required libraries. To install the necessary dependencies, follow these steps:

  1. Install Python (if not already installed):

  2. Install required libraries from the command line using pip (or controlling environment e.g. conda)

    pip install -r requirements.txt

Requirements - R

  1. Install R (if not already installed)

  2. Necessary packages will be evaluated and downloaded automatically when UDS_Crosswalk_Parser.R is run using primary repository (default CRAN)

How to Use the Script

Step 1: Prepare your UDS3, JSON Crosswalk, and Variable order files. Make note of locations.

Step 2a: Run the Script from the command line Navigate to the folder where the desired script file is located and enter either the Python or R variant as described above

Step 2b: Run the script interactively The .py or .R file can be opened within an IDE such as Jupyter Notebook or RStudio and run interactively. If this step is done, folder paths will need to be provided by the user within the scripts. Sections for these variables can be found at the beginning of the "Main Process" section of each parser.

The script will apply the crosswalk parsing migrating all valid variables from UDS3 to UDS4.

Step 3: Check the Output: After running the script and applying the cross walk to the UDS3 data, a new UDS4 data file will be generated called uds4_redcap_data.csv.

Troubleshooting

If you encounter any issues, ensure that you are using the correct R or Python version and have installed all dependencies as required.
Be sure JSON files only exist within the JSON folder path provided. Make certain the ordered variable list for UDS4 exists and contains all listed variables. Again, use the UDS4Data_ColumnOrder.txt file provided with this repository as desired.

Part 3 - Dictionary Comparison

Overview

This script compares the content of two REDCap dictionaries for discrepancies in their field. Although developed for UDS4 comparisons (e.g. between a local ADRC's version and an update from NACC), it could be applied to any two dictionaries with matched variables. Three constituent pieces are required:

  • A prior version of a REDCap dictionary
  • A new iteration of the REDCap dictionary
  • A curated file listing a specific set of variables for comparison

Note a curated file is provided with this repository and can be used with this Comparison script. It matches the Data Element Dictionary order provided by NACC (May 2025). The file name is UDS4Data_ColumnOrder.txt found within this parser's main folder.

Scope

This Dictionary Comparison tool is designed to work with a set of csv-based data dictionaries, specifically from REDCap export. While developed for UDS4, this script could be used for any version comparisons of a dictionary.

After the comparison is completed, an Excel workbook is generated which can be used to update local REDCap dictionaries or Crosswalk dictionaries as the user sees fit.

UDS_Dictionary_Update.R

This is the primary script which takes in a prior and updated pair of REDCap dictionaries and an optional curated list of variables to explicitly compare. The comparison tool identifies variables which only appear in the prior or new dictionary as well as any dictionary field deviations for any matched variables. After the comparison, an Excel workbook output summarizes all comparisons made for downstream application by the user.

Arguments, Folders, and Required Files

UDS_Dictionary_Update.R is designed to be run from the command line using either of the following:

1. Paths as arguments from the command line

Rscript UDS_Dictionary_Update.R <path to prior dictionary CSV> <path to updated dictionary CSV> <path to curated variable list txt>

2. Interactive selection of files and folders

If the OS support GUI based folder selection, the scripts can also be run from the command line without arguments. The system will then prompt the user to select (1) the prior REDCap dictionary, (2) the updated REDCap dictionary, and (3) the curated list of variables for comparison.

3. Running scripts directly from IDE environment

The script can also be run directly from within an Rstudio environment. Be sure to initialize the string paths to the files / folders within the Main Process sections as needed.

Application

Requirements - R

  1. Install R (if not already installed)

  2. Necessary packages will be evaluated and downloaded automatically when UDS_Dictionary_Update.R is run using primary repository (default CRAN)

How to Use the Script

Step 1: Prepare your prior dictionary, new dictionary, and curated variable list files. Make note of locations.

Step 2a: Run the Script from the command line Navigate to the folder where the desired script file is located and enter either the BASH commands as described above

Step 2b: Run the script interactively The .R file can be opened within the RStudio IDE and run interactively. If this step is done, folder paths will need to be provided by the user within the scripts. Sections for these variables can be found at the beginning of the "Main Process" section.

The script will compare the two dictionaries as described before generating an Excel workbook.

Step 3: Check the Output: After running the script, an Excel workbook output file will be generated titled "UDS_Dictionary_Comparison_YYYYMMDD.xlsx" using today's date. It will have the following tabs:

  • 'Unmatched - Prior' - entries found exclusively in the prior dictionary
  • 'Unmatched - New' - entries found solely in the new dictionary
  • 'Prior Entries' - entries from the prior dictionary where field discrepancies were found
  • 'New Entries' - entries for those same variables from the new dictionary
  • 'Field Differences' - a breakdown for each variable listing which dictionary fields differed and what the field entries were for the prior and new dictionary Note: the 'Prior Entries' and 'New Entries' tab will include a new "Index" column which lists the absolute position of the variable in its respective dictionary

Modifying definitions

A series of variables can be modified as desired within UDS_Dictionary_Update.R

  • var_col - The column containing the variable name in the dictionary, defaults to "Variable / Field Name"
  • ignored_fields - Any column names the user wishes to not be compared, defaults solely the "Index" column generated during processing but other REDCap fields can be added to the vector as desired
  • to_ignore - Any specific variables the user wishes to ignore, set to NULL by default but can be used in tandem with the curated list
  • ignored_regex - A regular expression which can be applied during comparison to flag a field pairing as being the same even if a difference is observed; default is set to ignore any instance of "[current-instance]==1" as a practical example

Troubleshooting

If you encounter any issues, ensure that you are using the correct Python version and have installed all dependencies from the requirements.txt file.
Although columns are checked between the dictionaries, make sure they match appropriately for ease of application Make certain the ordered variable list for UDS4 exists and contains all listed variables. Again, use the UDS4Data_ColumnOrder.txt file provided with this repository as desired.

Part 4 - Crosswalk Field / Level / Label Updater

Overview

This script steps through previously established human readable crosswalk files and compares a subset of their fields (currently limited to the Field Label, Response Levels, and Response labels) against a REDCap dictionary to check for discrepancies on a form-by-form basis which should be updated in the Crosswalk and its rules prior to JSON generation. This is largely an adaptation of the Dictionary Comparison script described in part 3. Three constituent pieces are required:

  • A folder path leading to set of Crosswalk Excel files to be assessed for field updating
  • A new iteration of the REDCap dictionary to be used for updating
  • A curated file listing a specific set of variables for comparison

Note a curated file is provided with this repository and can be used with this Comparison script. It matches the Data Element Dictionary order provided by NACC (May 2025). The file name is UDS4Data_ColumnOrder.txt found within this parser's main folder.

Scope

This Crosswalk Updater is designed to work with a set of human readable Excel Crosswalk matching the format developed for the UDS3-UDS4 Crosswalk family of applications along with a csv-based data dictionary, specifically from a REDCap export.

After the comparison is completed, a collection of Excel workbooks, one for each processed Crosswalk file / UDS4 form, are generated which can be used to update local Crosswalk dictionaries as the user sees fit.

This tool is primarily designed to guide updating of the Excel Crosswalks leading into JSON file regeneration for programmatic crosswalking between UDS3 and UDS4.

IMPORTANT - THIS DOES NOT PROVIDE SPECIFIC UPDATED RULES. THESE STILL NEED TO BE MANUALLY CURATED WITHIN THE CROSSWALK FILES PRIOR TO JSON FILE REGENERATION

UDS_Cross_Update.R

This is the primary script which takes in a folder path leading to a set of Excel based Crosswalk dictionaries, a REDCap dictionary for update referencing, and a curated list of variables to explicitly compare within the Crosswalk i.e. from the UDS4 Data Element Dictionary. The comparison tool identifies variables which only appear in Crosswalk as well as any dictionary field deviations for any matched variables. After the comparison, a series of form specific Excel workbooks summarizes all comparisons made for downstream application by the user.

Arguments, Folders, and Required Files

UDS_Cross_Update.R is designed to be run from the command line using either of the following:

1. Paths as arguments from the command line

Rscript UDS_Cross_Update.R <path to folder of Excel based Crosswalk files> <path to updated dictionary CSV> <path to curated variable list txt>

2. Interactive selection of files and folders

If the OS support GUI based folder selection, the scripts can also be run from the command line without arguments. The system will then prompt the user to select (1) the folder location of the Crosswalk files, (2) the REDCap dictionary to be used for updating, and (3) the curated list of variables for comparison.

3. Running scripts directly from IDE environment

The script can also be run directly from within an Rstudio environment. Be sure to initialize the string paths to the files / folders within the Main Process sections as needed.

Application

Requirements - R

  1. Install R (if not already installed)

  2. Necessary packages will be evaluated and downloaded automatically when UDS_Dictionary_Update.R is run using primary repository (default CRAN)

How to Use the Script

Step 1: Prepare your Crosswalks, updated dictionary, and curated variable list files. Make note of locations.

Step 2a: Run the Script from the command line Navigate to the folder where the desired script file is located and enter either the BASH commands as described above

Step 2b: Run the script interactively The .R file can be opened within the RStudio IDE and run interactively. If this step is done, folder paths will need to be provided by the user within the scripts. Sections for these variables can be found at the beginning of the "Main Process" section.

The script will compare the two dictionaries as described before generating the set of form-specific Excel workbooks.

Step 3: Check the Output: After running the script, a collections of Excel workbook output files, one for each form / Crosswalk file, will be generated titled "UDS_FORMNAME_Dictionary_Comparison_YYYYMMDD.xlsx" using today's date and the appropriate FORMNAME (e.g. A2, D1a_D1b). It will have the following tabs:

  • 'Unmatched - Crosswalk' - entries found exclusively in the Crosswalks UDS4 dictionary, typically these are only form-specific header variables
  • 'Crosswalk Entries' - entries from the Crosswalk where field discrepancies were found
  • 'New Entries' - entries for those same variables from the dictionary used for update referencing
  • 'Field Differences' - a breakdown for each variable listing the fields (limited to the Field Label, Response LEVELS, and Response LABELS) differed and what the field entries were for the Crosswalk in question and the updated dictioanry

Modifying definitions

A series of variables can be modified as desired within UDS_Cross_Update.R

  • var_col - The column containing the variable name in the dictionary, defaults to "Variable / Field Name"
  • cross_col - The column containing the variable name in the Crosswalk, default to 'UDS4 data element name'
  • ignored_fields - Any column names the user wishes to not be compared, defaults to NULL although other fields can be added as a vector if desired
  • to_ignore - Any specific variables the user wishes to ignore, set to NULL by default but can be used in tandem with the curated list
  • ignored_regex - A regular expression which can be applied during comparison to flag a field pairing as being the same even if a difference is observed; default is NULL
  • redcap_dict - The dictionary which defines which fields in the Crosswalk files and dictionary should be compared; if other fields are desired for comparison, add them as appropriate to the list's crosswalk and dictionary entries as well as in both vectors of the comparison entry in a positionally locked fashion e.g. two fields in the 5th element of the two vectors will be compared

Troubleshooting

If you encounter any issues, ensure that you are using the correct Python version and have installed all dependencies from the requirements.txt file.
Although columns are checked between the dictionaries, make sure they match appropriately for ease of application Make certain the ordered variable list for UDS4 exists and contains all listed variables. Again, use the UDS4Data_ColumnOrder.txt file provided with this repository as desired.

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • R 56.6%
  • Python 43.4%