chillbot-io/scrubiq-dev

Recall Improvement Patch

Target: 95.85% → 99%+ recall

Files Included

| File | Action | Description |
| --- | --- | --- |
| additional_patterns.py | NEW | Drop into detectors/ folder |
| dictionaries.py | REPLACE | Replace existing detectors/dictionaries.py |
| merger_patch.py | PATCH | Apply changes to pipeline/merger.py |
| orchestrator_registration.py | REFERENCE | Instructions for registering new detector |

Installation

Step 1: Add new detector (EMPLOYER, AGE, HEALTH_PLAN_ID)

cp additional_patterns.py /path/to/scrubiq/detectors/

Step 2: Replace dictionaries.py (geo folder + min_length)

cp dictionaries.py /path/to/scrubiq/detectors/

Step 3: Patch merger.py

Open pipeline/merger.py and make these changes:

3a. Add EMPLOYER to COMPATIBLE_TYPE_GROUPS (line ~20-28):

COMPATIBLE_TYPE_GROUPS: List[Set[str]] = [
    {"NAME", "NAME_PATIENT", "NAME_PROVIDER", "NAME_RELATIVE", "NAME_FAMILY"},
    {"ADDRESS", "STREET", "STREET_ADDRESS", "CITY", "STATE", "ZIP", "LOCATION"},
    {"DATE", "DOB", "DATE_DOB", "DATE_ADMISSION", "DATE_DISCHARGE"},
    {"PHONE", "FAX", "PHONE_MOBILE", "PHONE_HOME", "PHONE_WORK"},
    {"SSN", "SSN_PARTIAL"},
    {"MRN", "PATIENT_ID", "MEDICAL_RECORD"},
    {"HEALTH_PLAN_ID", "MEMBER_ID", "INSURANCE_ID"},
    {"EMPLOYER", "ORGANIZATION", "COMPANY", "COMPANYNAME"},  # <-- ADD THIS LINE
]
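
To illustrate why the new group matters, here is a minimal sketch of how a merger might consult COMPATIBLE_TYPE_GROUPS when deciding whether two overlapping spans refer to the same kind of entity. The helper name `types_compatible` is hypothetical and the list is trimmed; the actual lookup logic in pipeline/merger.py may differ.

```python
from typing import List, Set

# Trimmed copy of the groups from Step 3a; the full list lives in pipeline/merger.py.
COMPATIBLE_TYPE_GROUPS: List[Set[str]] = [
    {"NAME", "NAME_PATIENT", "NAME_PROVIDER", "NAME_RELATIVE", "NAME_FAMILY"},
    {"HEALTH_PLAN_ID", "MEMBER_ID", "INSURANCE_ID"},
    {"EMPLOYER", "ORGANIZATION", "COMPANY", "COMPANYNAME"},
]


def types_compatible(a: str, b: str) -> bool:
    """Hypothetical helper: True if the types are equal or share a group."""
    if a == b:
        return True
    return any(a in group and b in group for group in COMPATIBLE_TYPE_GROUPS)


print(types_compatible("EMPLOYER", "COMPANY"))  # True
print(types_compatible("EMPLOYER", "NAME"))     # False
```

Without the added line, an EMPLOYER span and a COMPANY span covering the same text would not be treated as compatible under this kind of check.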

3b. Update TYPE_NORMALIZE (line ~669-676):

Find this section:

    # === CLINICAL (context-only, filtered before output) ===
    "HOSPITAL": "FACILITY",
    "ORG": "FACILITY",
    "ORGANIZATION": "FACILITY",
    "VENDOR": "FACILITY",
    "COMPANYNAME": "FACILITY",
    "COMPANY": "FACILITY",

Replace with:

    # === EMPLOYER (companies/organizations) ===
    "COMPANYNAME": "EMPLOYER",  # CHANGED from FACILITY
    "COMPANY": "EMPLOYER",       # CHANGED from FACILITY
    "ORG": "EMPLOYER",           # CHANGED from FACILITY
    "ORGANIZATION": "EMPLOYER",  # CHANGED from FACILITY
    
    # === CLINICAL (context-only, filtered before output) ===
    "HOSPITAL": "FACILITY",
    "VENDOR": "FACILITY",
    
    # === MEDICATION ===
    "DRUG": "MEDICATION",  # NEW

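As a rough illustration of the effect of the Step 3b mapping, the sketch below applies only the entries shown above to a few raw labels. The helper `normalize_type` is hypothetical, and the real TYPE_NORMALIZE in pipeline/merger.py contains many more entries.

```python
from typing import Dict

# Only the entries touched in Step 3b; not the full mapping from merger.py.
TYPE_NORMALIZE: Dict[str, str] = {
    "COMPANYNAME": "EMPLOYER",
    "COMPANY": "EMPLOYER",
    "ORG": "EMPLOYER",
    "ORGANIZATION": "EMPLOYER",
    "HOSPITAL": "FACILITY",
    "VENDOR": "FACILITY",
    "DRUG": "MEDICATION",
}


def normalize_type(entity_type: str) -> str:
    """Hypothetical helper: map a raw label to its canonical type.

    Labels without an entry are returned unchanged.
    """
    return TYPE_NORMALIZE.get(entity_type, entity_type)


for raw in ("COMPANY", "HOSPITAL", "DRUG", "AGE"):
    print(raw, "->", normalize_type(raw))
```

After the change, company-like labels resolve to EMPLOYER rather than FACILITY, and DRUG resolves to MEDICATION.
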
Step 4: Register the new detector

In detectors/orchestrator.py, add:

from .additional_patterns import AdditionalPatternDetector

Then add to your detector list:

AdditionalPatternDetector(),

Step 5: Verify

python3 -c "
from scrubiq.detectors.additional_patterns import AdditionalPatternDetector

d = AdditionalPatternDetector()

# Test EMPLOYER
spans = d.detect('I work at ABC Corporation.')
print('EMPLOYER:', [s.text for s in spans if s.entity_type == 'EMPLOYER'])

# Test AGE  
spans = d.detect('Patient is 45 years old.')
print('AGE:', [s.text for s in spans if s.entity_type == 'AGE'])

# Test HEALTH_PLAN_ID
spans = d.detect('Member ID: XYZ123456')
print('HEALTH_PLAN_ID:', [s.text for s in spans if s.entity_type == 'HEALTH_PLAN_ID'])
"

Expected output:

EMPLOYER: ['ABC Corporation']
AGE: ['45 years old']
HEALTH_PLAN_ID: ['XYZ123456']
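
For readers who want a feel for what a pattern-based detector does, here is a self-contained sketch that reproduces the AGE line of the expected output with a single regular expression. The pattern and the `Span` stand-in are assumptions made for illustration; they are not copied from additional_patterns.py, whose actual patterns may be broader or stricter.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class Span:
    """Stand-in span type; the verify script only relies on .text and .entity_type."""
    text: str
    entity_type: str
    start: int
    end: int


# Illustrative pattern only: a 1-3 digit number followed by "years old" or "yrs old".
AGE_PATTERN = re.compile(r"\b\d{1,3}\s*(?:years?|yrs?)[\s-]*old\b", re.IGNORECASE)


def detect_age(text: str) -> List[Span]:
    """Return one AGE span per regex match."""
    return [
        Span(m.group(0), "AGE", m.start(), m.end())
        for m in AGE_PATTERN.finditer(text)
    ]


spans = detect_age("Patient is 45 years old.")
print("AGE:", [s.text for s in spans])  # AGE: ['45 years old']
```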

What These Changes Fix

| Entity Type | Missed Count | Fix |
| --- | --- | --- |
| EMPLOYER | 773 | New patterns in additional_patterns.py |
| HEALTH_PLAN_ID | 873 | New patterns in additional_patterns.py |
| AGE | 579 | New patterns in additional_patterns.py |
| MEDICATION | 79 | Dictionary outputs MEDICATION (not DRUG) |
| CITY/STATE | ~80 | Geo folder now loaded in dictionaries.py |
| False positives | - | min_length filter in dictionaries.py |
| Type mismatches | - | TYPE_NORMALIZE updates in merger.py |
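
The min_length filter presumably works by ignoring dictionary terms that are too short to match reliably. The sketch below shows one way such a filter could look; the function name, the default of 4, and the whole-word matching strategy are all assumptions, not the code in dictionaries.py.

```python
import re
from typing import Iterable, List, Tuple


def dictionary_matches(
    text: str, terms: Iterable[str], min_length: int = 4
) -> List[Tuple[int, int, str]]:
    """Return (start, end, matched_text) for whole-word, case-insensitive hits.

    Terms shorter than min_length are skipped, since very short entries
    (for example two-letter state codes) tend to collide with ordinary words.
    """
    hits = []
    for term in terms:
        if len(term) < min_length:
            continue  # too short to trust as a dictionary match
        pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        for m in pattern.finditer(text):
            hits.append((m.start(), m.end(), m.group(0)))
    return sorted(hits)


text = "Pt moved to Reno, NV and is in OR prep."
print(dictionary_matches(text, ["Reno", "OR", "NV"]))
```

With the assumed default, only "Reno" is reported; the two-letter terms are dropped before matching.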

After Installation

Run your recall test:

pytest tests/test_synthetic_accuracy.py -v -s

Expected improvement: recall rises from 95.85% to 98-99%+
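
For reference, recall is the fraction of true entities that the detectors find: true positives divided by true positives plus false negatives (misses). The sketch below shows the arithmetic; the counts are hypothetical, chosen only so the first result equals the 95.85% baseline, and are not taken from the project's test suite.

```python
def recall(true_positives: int, false_negatives: int) -> float:
    """Recall = TP / (TP + FN); returns 0.0 when there are no true entities."""
    total = true_positives + false_negatives
    return true_positives / total if total else 0.0


# Hypothetical counts out of 10,000 labeled entities.
before = recall(true_positives=9585, false_negatives=415)
# If 315 of those misses were detected after the patch:
after = recall(true_positives=9900, false_negatives=100)

print(f"before: {before:.2%}")  # before: 95.85%
print(f"after:  {after:.2%}")   # after:  99.00%
```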
