Target: 95.85% → 99%+ recall
| File | Action | Description |
|---|---|---|
| additional_patterns.py | NEW | Drop into detectors/ folder |
| dictionaries.py | REPLACE | Replace existing detectors/dictionaries.py |
| merger_patch.py | PATCH | Apply changes to pipeline/merger.py |
| orchestrator_registration.py | REFERENCE | Instructions for registering new detector |

Copy the new and replacement files into the detectors folder:

    cp additional_patterns.py /path/to/scrubiq/detectors/
    cp dictionaries.py /path/to/scrubiq/detectors/

Open pipeline/merger.py and make these changes:

3a. Add EMPLOYER to COMPATIBLE_TYPE_GROUPS (lines ~20-28):

    COMPATIBLE_TYPE_GROUPS: List[Set[str]] = [
        {"NAME", "NAME_PATIENT", "NAME_PROVIDER", "NAME_RELATIVE", "NAME_FAMILY"},
        {"ADDRESS", "STREET", "STREET_ADDRESS", "CITY", "STATE", "ZIP", "LOCATION"},
        {"DATE", "DOB", "DATE_DOB", "DATE_ADMISSION", "DATE_DISCHARGE"},
        {"PHONE", "FAX", "PHONE_MOBILE", "PHONE_HOME", "PHONE_WORK"},
        {"SSN", "SSN_PARTIAL"},
        {"MRN", "PATIENT_ID", "MEDICAL_RECORD"},
        {"HEALTH_PLAN_ID", "MEMBER_ID", "INSURANCE_ID"},
        {"EMPLOYER", "ORGANIZATION", "COMPANY", "COMPANYNAME"},  # <-- ADD THIS LINE
    ]

3b. Update TYPE_NORMALIZE (lines ~669-676):

Find this section:

    # === CLINICAL (context-only, filtered before output) ===
    "HOSPITAL": "FACILITY",
    "ORG": "FACILITY",
    "ORGANIZATION": "FACILITY",
    "VENDOR": "FACILITY",
    "COMPANYNAME": "FACILITY",
    "COMPANY": "FACILITY",

Replace with:

    # === EMPLOYER (companies/organizations) ===
    "COMPANYNAME": "EMPLOYER",   # CHANGED from FACILITY
    "COMPANY": "EMPLOYER",       # CHANGED from FACILITY
    "ORG": "EMPLOYER",           # CHANGED from FACILITY
    "ORGANIZATION": "EMPLOYER",  # CHANGED from FACILITY
    # === CLINICAL (context-only, filtered before output) ===
    "HOSPITAL": "FACILITY",
    "VENDOR": "FACILITY",
    # === MEDICATION ===
    "DRUG": "MEDICATION",        # NEW

In detectors/orchestrator.py, add:

    from .additional_patterns import AdditionalPatternDetector

Then add to your detector list:

    AdditionalPatternDetector(),

Verify the new detector from the command line:

    python3 -c "
    from scrubiq.detectors.additional_patterns import AdditionalPatternDetector
    d = AdditionalPatternDetector()
    # Test EMPLOYER
    spans = d.detect('I work at ABC Corporation.')
    print('EMPLOYER:', [s.text for s in spans if s.entity_type == 'EMPLOYER'])
    # Test AGE
    spans = d.detect('Patient is 45 years old.')
    print('AGE:', [s.text for s in spans if s.entity_type == 'AGE'])
    # Test HEALTH_PLAN_ID
    spans = d.detect('Member ID: XYZ123456')
    print('HEALTH_PLAN_ID:', [s.text for s in spans if s.entity_type == 'HEALTH_PLAN_ID'])
    "

Expected output:

    EMPLOYER: ['ABC Corporation']
    AGE: ['45 years old']
    HEALTH_PLAN_ID: ['XYZ123456']

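
For readers who want a feel for what a pattern-based detector of this kind does, here is a minimal, self-contained sketch. It is not the code in additional_patterns.py: the three regexes, the `Span` dataclass (including its `start` and `end` fields), and the standalone `detect` function are all assumptions made for illustration. Only the `text` and `entity_type` attributes and the three test sentences come from the verification step above.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class Span:
    # Hypothetical span type; the real class in scrubiq may differ.
    start: int
    end: int
    text: str
    entity_type: str


# Hypothetical patterns for illustration only. Each entry is
# (entity type, compiled regex, capture group holding the entity text).
PATTERNS = [
    (
        "EMPLOYER",
        re.compile(
            r"\b(?:works?|worked|employed)\s+(?:at|for|by)\s+"
            r"((?:[A-Z][\w&]*\s+)+(?:Corporation|Corp|Inc|LLC|Ltd))\b"
        ),
        1,
    ),
    ("AGE", re.compile(r"\b\d{1,3}\s*(?:years?\s+old|y/o|yo)\b"), 0),
    (
        "HEALTH_PLAN_ID",
        re.compile(r"\b(?:Member|Subscriber|Policy)\s+ID\s*[:#]?\s*([A-Z0-9]{6,15})\b"),
        1,
    ),
]


def detect(text: str) -> List[Span]:
    """Run every pattern over the text and collect the matched spans."""
    spans = []
    for entity_type, pattern, group in PATTERNS:
        for m in pattern.finditer(text):
            spans.append(Span(m.start(group), m.end(group), m.group(group), entity_type))
    return spans


if __name__ == "__main__":
    for sentence in (
        "I work at ABC Corporation.",
        "Patient is 45 years old.",
        "Member ID: XYZ123456",
    ):
        print([(s.entity_type, s.text) for s in detect(sentence)])
```

On the three sentences from the verification step, this sketch yields the same entity strings as the expected output, but it covers far fewer phrasings than a production pattern set would need.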
| Entity Type | Missed Count | Fix |
|---|---|---|
| EMPLOYER | 773 | New patterns in additional_patterns.py |
| HEALTH_PLAN_ID | 873 | New patterns in additional_patterns.py |
| AGE | 579 | New patterns in additional_patterns.py |
| MEDICATION | 79 | Dictionary outputs MEDICATION (not DRUG) |
| CITY/STATE | ~80 | Geo folder now loaded in dictionaries.py |
| False positives | - | min_length filter in dictionaries.py |
| Type mismatches | - | TYPE_NORMALIZE updates in merger.py |
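
The "Type mismatches" row can be pictured with a small sketch of how the two tables patched in steps 3a and 3b might interact. The table contents below are abbreviated copies of the patched values; the helper functions `normalize_type` and `types_compatible` are hypothetical names, and how pipeline/merger.py actually consults these tables is an assumption.

```python
from typing import Dict, List, Set

# Abbreviated copies of the patched tables from steps 3a and 3b.
COMPATIBLE_TYPE_GROUPS: List[Set[str]] = [
    {"NAME", "NAME_PATIENT", "NAME_PROVIDER", "NAME_RELATIVE", "NAME_FAMILY"},
    {"EMPLOYER", "ORGANIZATION", "COMPANY", "COMPANYNAME"},
]

TYPE_NORMALIZE: Dict[str, str] = {
    "COMPANYNAME": "EMPLOYER",
    "COMPANY": "EMPLOYER",
    "ORG": "EMPLOYER",
    "ORGANIZATION": "EMPLOYER",
    "HOSPITAL": "FACILITY",
    "VENDOR": "FACILITY",
    "DRUG": "MEDICATION",
}


def normalize_type(entity_type: str) -> str:
    """Hypothetical helper: map a raw detector label to its canonical type."""
    return TYPE_NORMALIZE.get(entity_type, entity_type)


def types_compatible(a: str, b: str) -> bool:
    """Hypothetical helper: True if the labels are equal or share a group."""
    if a == b:
        return True
    return any(a in group and b in group for group in COMPATIBLE_TYPE_GROUPS)


if __name__ == "__main__":
    print(normalize_type("COMPANY"))                     # EMPLOYER
    print(normalize_type("DRUG"))                        # MEDICATION
    print(types_compatible("ORGANIZATION", "EMPLOYER"))  # True
    print(types_compatible("EMPLOYER", "NAME"))          # False
```

One detail this sketch surfaces: "ORG" appears in TYPE_NORMALIZE but not in the EMPLOYER compatibility group, so under these assumptions a raw "ORG" label only matches "EMPLOYER" if it is normalized before the compatibility check.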

Run your recall test:

    pytest tests/test_synthetic_accuracy.py -v -s

Expected improvement: 95.85% → 98-99%+
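
As a reminder of what the recall figure measures, the sketch below computes recall as true positives divided by all gold entities (true positives plus missed entities). The function names and the example counts are made up for illustration; they are not taken from tests/test_synthetic_accuracy.py, and the counts were chosen only so that the arithmetic lands on the 95.85% baseline.

```python
from typing import Dict, Tuple


def recall(true_positives: int, false_negatives: int) -> float:
    """Fraction of gold entities that were detected."""
    total = true_positives + false_negatives
    return true_positives / total if total else 0.0


def per_type_recall(counts: Dict[str, Tuple[int, int]]) -> Dict[str, float]:
    """Map each entity type to its recall, given (found, missed) counts."""
    return {etype: recall(tp, fn) for etype, (tp, fn) in counts.items()}


if __name__ == "__main__":
    # Hypothetical counts: 9,585 entities found and 415 missed.
    print(f"{recall(9585, 415):.2%}")
    # Hypothetical per-type breakdown.
    print(per_type_recall({"AGE": (90, 10), "EMPLOYER": (45, 5)}))
```

Breaking recall down per entity type, as in the missed-count table above, makes it easier to see which detector a regression comes from.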