Skip to content

CSVConvert.py produces date interval and validation warnings due to missing reference_date resolution — solved with custom function #112

@dzhao2019

Description

@dzhao2019

About the issue

When running CSVConvert.py with raw_date provided (https://github.com/CanDIG/clinical_ETL_code/tree/develop/tests/raw_data) and the manifest_mohschemav3.yaml configuration, I encountered the following warnings during validation:

WARNING: Date intervals for 7 donor(s) could not be calculated. See raw_data_validation_results.json for full details.

WARNING: There are 83 validation warnings in your data. It can be ingested but will not be considered complete until the warnings in raw_data_validation_results.json are fixed.

The corresponding raw_data_validation_results.json file listed missing or unparseable date fields, such as:

  • date_of_birth required for donors
  • date_of_death required if is_deceased = Yes
  • date_of_diagnosis required for primary_diagnoses
  • treatment_start_date required for treatments
  • date_of_followup required for followups

Despite including reference_date: earliest_date(Donor.date_resolution, PrimaryDiagnosis.date_of_diagnosis) in the manifest file, date interval calculation still failed when date_of_diagnosis was present in the file.

Details

  1. Installed all dependencies from requirements.txt
  2. Place the following files in a working directory:
  • CSVConvert.py
  • manifest_mohschemav3.yaml
  • mappings.py
  • mohschemav3.py
  • schema.py
  • redcap.py
  • moh_v3_template.csv
  1. Command used:
    python CSVConvert.py --input raw_data --manifest manifest_mohschemav3.yaml --index --verbose

  2. date_format was set according to the data in the csv files; all the dates (i.e., '35') not following the minifest format were corrected.

manifest_mohschemav3.yaml:
date_format: DMY

submitter_donor_id, program_id, lost_to_followup_after_clinical_event_identifier, lost_to_followup_reason, date_alive_after_lost_to_followup, is_deceased, cause_of_death, date_of_birth, date_of_death, gender, sex_at_birth,date_resolution
DONOR_1,TEST_1,,,,Yes,Died of cancer,6/1/1954,20/5/2023,Man,Male,month
DONOR_2,TEST_1,,,,Yes,Died of other reasons,12/2/1982,10/6/2024,Woman,Female,day
DONOR_3,TEST_1,PD_3,Lost contact,4/6/2022,No,,7/12/1945,,Non-binary,Other,month
DONOR_4,TEST_1,,,,Yes,Not available,1/6/1984,31/7/2023,Man,Male,month
DONOR_5,TEST_2,,,1/6/2022,Yes,,15/2/1984,,Woman,Female,month
DONOR_6,TEST_2,PD_6,Withdrew from study,1/6/2022,No,,12/9/2024,,Non-binary,Other,month

Observation

Date interval warnings and validation warnings were raised, even though most required fields were present. The root cause was that earliest_date(...) could not resolve a valid date even if date_of_diagnosis was in the csv file.

Solution

I implemented a custom function donor_specific_reference_date() to select the earliest valid date_of_diagnosis based on the current submitter_donor_id. This helped avoid silent failures and improved validation reliability.

1. Custom function in redcap.py


from dateparser import parse as parse_date
import datetime

def donor_specific_reference_date(data_values):
    from mappings import INDEXED_DATA, IDENTIFIER, DATE_FORMAT, _validate_date_format

    donor_id = IDENTIFIER
    try:
        date_resolution = INDEXED_DATA["data"]["DONOR_SHEET"][donor_id]["date_resolution"][0]
    except:
        date_resolution = "day"  # fallback

    try:
        dx_entries = INDEXED_DATA["data"]["PRIMARY_DIAGNOSES_SHEET"]
        dates = []
        for entry in dx_entries.values():
            if entry.get("submitter_donor_id", [None])[0] == donor_id:
                dx_date = entry.get("date_of_diagnosis", [None])[0]
                if dx_date:
                    _validate_date_format(dx_date, DATE_FORMAT)
                    dt = parse_date(dx_date, settings={"DATE_ORDER": DATE_FORMAT})
                    if dt:
                        dates.append(dt)
    except:
        return None

    if not dates:
        return None

    earliest = min(dates)
    return {
        "offset": earliest.strftime("%Y-%m-%d"),
        "period": date_resolution
    }

2. Updated manifest_mohschemav3.yaml

reference_date: redcaprefdate.donor_specific_reference_date(DONOR_SHEET.submitter_donor_id)
functions:
  - redcap

3. Appended to moh_v3_template.csv
Add to the last line of the file:

REFERENCE_DATE,{redcaprefdate.donor_specific_reference_date(DONOR_SHEET.submitter_donor_id)}

Result

After these changes, validation passed without any date interval warnings. The data could be ingested with complete interval metadata.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions