-
Notifications
You must be signed in to change notification settings - Fork 0
Description
About the issue
When running CSVConvert.py with raw_date provided (https://github.com/CanDIG/clinical_ETL_code/tree/develop/tests/raw_data) and the manifest_mohschemav3.yaml configuration, I encountered the following warnings during validation:
WARNING: Date intervals for 7 donor(s) could not be calculated. See raw_data_validation_results.json for full details.
WARNING: There are 83 validation warnings in your data. It can be ingested but will not be considered complete until the warnings in raw_data_validation_results.json are fixed.
The corresponding raw_data_validation_results.json file listed missing or unparseable date fields, such as:
- date_of_birth required for donors
- date_of_death required if is_deceased = Yes
- date_of_diagnosis required for primary_diagnoses
- treatment_start_date required for treatments
- date_of_followup required for followups
Despite including reference_date: earliest_date(Donor.date_resolution, PrimaryDiagnosis.date_of_diagnosis) in the manifest file, date interval calculation still failed when date_of_diagnosis was present in the file.
Details
- Installed all dependencies from requirements.txt
- Place the following files in a working directory:
- CSVConvert.py
- manifest_mohschemav3.yaml
- mappings.py
- mohschemav3.py
- schema.py
- redcap.py
- moh_v3_template.csv
-
Command used:
python CSVConvert.py --input raw_data --manifest manifest_mohschemav3.yaml --index --verbose -
date_format was set according to the data in the csv files; all the dates (i.e., '35') not following the minifest format were corrected.
manifest_mohschemav3.yaml:
date_format: DMY
submitter_donor_id, program_id, lost_to_followup_after_clinical_event_identifier, lost_to_followup_reason, date_alive_after_lost_to_followup, is_deceased, cause_of_death, date_of_birth, date_of_death, gender, sex_at_birth,date_resolution
DONOR_1,TEST_1,,,,Yes,Died of cancer,6/1/1954,20/5/2023,Man,Male,month
DONOR_2,TEST_1,,,,Yes,Died of other reasons,12/2/1982,10/6/2024,Woman,Female,day
DONOR_3,TEST_1,PD_3,Lost contact,4/6/2022,No,,7/12/1945,,Non-binary,Other,month
DONOR_4,TEST_1,,,,Yes,Not available,1/6/1984,31/7/2023,Man,Male,month
DONOR_5,TEST_2,,,1/6/2022,Yes,,15/2/1984,,Woman,Female,month
DONOR_6,TEST_2,PD_6,Withdrew from study,1/6/2022,No,,12/9/2024,,Non-binary,Other,month
Observation
Date interval warnings and validation warnings were raised, even though most required fields were present. The root cause was that earliest_date(...) could not resolve a valid date even if date_of_diagnosis was in the csv file.
Solution
I implemented a custom function donor_specific_reference_date() to select the earliest valid date_of_diagnosis based on the current submitter_donor_id. This helped avoid silent failures and improved validation reliability.
1. Custom function in redcap.py
from dateparser import parse as parse_date
import datetime
def donor_specific_reference_date(data_values):
from mappings import INDEXED_DATA, IDENTIFIER, DATE_FORMAT, _validate_date_format
donor_id = IDENTIFIER
try:
date_resolution = INDEXED_DATA["data"]["DONOR_SHEET"][donor_id]["date_resolution"][0]
except:
date_resolution = "day" # fallback
try:
dx_entries = INDEXED_DATA["data"]["PRIMARY_DIAGNOSES_SHEET"]
dates = []
for entry in dx_entries.values():
if entry.get("submitter_donor_id", [None])[0] == donor_id:
dx_date = entry.get("date_of_diagnosis", [None])[0]
if dx_date:
_validate_date_format(dx_date, DATE_FORMAT)
dt = parse_date(dx_date, settings={"DATE_ORDER": DATE_FORMAT})
if dt:
dates.append(dt)
except:
return None
if not dates:
return None
earliest = min(dates)
return {
"offset": earliest.strftime("%Y-%m-%d"),
"period": date_resolution
}
2. Updated manifest_mohschemav3.yaml
reference_date: redcaprefdate.donor_specific_reference_date(DONOR_SHEET.submitter_donor_id)
functions:
- redcap
3. Appended to moh_v3_template.csv
Add to the last line of the file:
REFERENCE_DATE,{redcaprefdate.donor_specific_reference_date(DONOR_SHEET.submitter_donor_id)}
Result
After these changes, validation passed without any date interval warnings. The data could be ingested with complete interval metadata.