Resubmission Linking Command Updates #5622
* The command itself uses equivalence to determine the chains. We may come back and use distance to catch stragglers. The reviewable CSVs saw an update, and all the files go into their own subdirectory.
* A new command to undo the linkage after the fact, just in case.
* A new command to wipe all resubmission metadata for an audit year or a set of specific records.
* Some small changes in various places. I will not make such large commits a habit.
The distance and equivalence cluster generation functions are separate, and get their own test classes.
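To make the difference between the two approaches concrete, here is a minimal sketch, not the code in this PR. The field names `uei` and `audit_year`, the function names, and the "minimum distance to any member" rule are all assumptions made for illustration. The distance version follows the comment in the diff: join an existing set if the record is under the threshold, otherwise start a new set.

```python
from collections import defaultdict

# Hypothetical key fields; the real command may compare different attributes.
KEY_FIELDS = ("uei", "audit_year")


def group_by_equivalence(records, key_fields=KEY_FIELDS):
    """Group records whose key fields match exactly."""
    groups = defaultdict(list)
    for record in records:
        key = tuple(record[field] for field in key_fields)
        groups[key].append(record)
    return list(groups.values())


def group_by_distance(records, distance, threshold):
    """Greedy grouping: join the first existing set that is close enough."""
    sets = []
    for record in records:
        for existing in sets:
            # Assumption: distance to a set is the minimum distance to any member.
            if min(distance(record, member) for member in existing) < threshold:
                existing.append(record)
                break
        else:
            # No existing set was within the threshold, so start a new one.
            sets.append([record])
    return sets


records = [
    {"report_id": "r1", "uei": "AAA", "audit_year": 2020},
    {"report_id": "r2", "uei": "AAA", "audit_year": 2020},
    {"report_id": "r3", "uei": "BBB", "audit_year": 2020},
]
chains = group_by_equivalence(records)
numeric_sets = group_by_distance([1, 2, 10, 11, 30], lambda a, b: abs(a - b), 3)
```

Equivalence is a single pass and its result does not depend on input order. The greedy distance version can give different sets for different orderings, which is one reason to keep it as a follow-up for stragglers.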
…metadata, set to v0 rather than NULL. NULL will become v1 at dissemination time.
I've tested pretty extensively in the Preview environment, and things look good. I had to bump the memory to run the linking command - and it still couldn't get to some of the "larger" years. We may want to bump Production a bit before running the commands there. Something to keep in mind.
# For each record, compute its distance to the existing sets.
# If it is below the threshold, insert it into an existing set.
# Otherwise, insert into a new set.
def generate_clusters_from_records_by_equivalence(records, noisy=False):
PEDANTIC ALERT (also, I know you inherited this code)
Some of these names bug me. "Cluster" instead of "chain" (only makes sense if you agree with my other sorting comment, though) and "record" instead of "audit" or "submission". For example, could this be generate_audit_chains_by_equivalence? Probably annoying to swap the verbiage everywhere but I figure it's worth checking and I'm down to discuss.
I fully agree, and I'll take every opportunity to fix up the language around things. I'm trying to use "chain" pretty much everywhere. That may or may not be helpful, since they're not really "chains" until the sort and annotation are made? They are aspiring chains, and that seems... Fine. I've switched to "submission" in most places; I feel like that makes sense. I've updated some variables to use sac or sacs, which matches up with a lot of the backend code. Should be better, but I'm sure I missed stuff. Nothing broke!
…records_postgres`
…usters" and "sets" in most/all places. Also, fixes some imports. I blame PyLance.
… sense to keep it in one spot.
Related tickets
Description of changes
How to test
Switch to this branch, bring everything up normally. Ensure you have some local data to play with. Ideally, both Census and GSA records.
The linking command
Try a few years, perhaps. The more recent years include more data. Those will take a little longer. Run the command with:
python manage.py link_resubmissions --email {ADMIN_EMAIL} --audit_year 2020
Check the .md and .csv files that are produced. Verify they look right. You'll see a lot of duplicate data - that's fine, and ideal, since it means the records are truly in the same resubmission chain. For the 2019-2022 range, verify there are some Census-GSA crossovers.
Press 'c' and 'enter' to continue. You'll see each record get redisseminated quickly.
Verify a few records in search and/or through the API. You may want to re-up the materialized view to use Advanced Search. Also spot check the internal singleauditchecklist table.
The undo linking command
You've just run a few years, probably. Grab a related CSV, and run a command like:
python manage.py undo_link_resubmissions --email {ADMIN_EMAIL} --csv curation/data/{FILENAME}.csv
You'll see a message like "INFO CSV contains XYZ records." Verify the count is right. Press 'c' and 'enter' to continue. You'll see each record get redisseminated quickly.
Verify a few records after the fact. Any record that previously had "NULL" resubmission data will instead have version 0 and status "unknown_resubmission_status". If we were to redisseminate it with NULL, it would be assumed to be version 1. So we explicitly say "unknown" after taking curative action on it.
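The NULL-versus-0 distinction above can be sketched as follows. This is only an illustration of the rule as described, not the dissemination code itself. The field names `resubmission_version` and `resubmission_status` and both function names are assumptions.

```python
UNKNOWN_STATUS = "unknown_resubmission_status"


def effective_version(stored_version):
    """NULL (None) is read as version 1 at dissemination; explicit values are kept."""
    return 1 if stored_version is None else stored_version


def mark_unknown(record):
    """Return a copy with version 0 and the unknown status, as after a curative action."""
    # Field names here are hypothetical stand-ins for the real columns.
    return {**record, "resubmission_version": 0, "resubmission_status": UNKNOWN_STATUS}


untouched = {"report_id": "example-1", "resubmission_version": None}
undone = mark_unknown(untouched)
```

Here `effective_version(untouched["resubmission_version"])` is 1, while `effective_version(undone["resubmission_version"])` stays 0, so an undone record is never mistaken for a first submission.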
The reset resubmission metadata command
This command sets the resubmission version to 0 and the status to "unknown_resubmission_status" for all records it hits. It will either process a full audit year (AY) or accept a list of report_ids. It's very nice for testing the above commands locally. We may eventually use this to reset records that are incorrectly brought together - either by our own actions or by user error.
Run the command like:
python manage.py reset_resubmission_metadata --email {ADMIN_EMAIL} --audit_year 2024
python manage.py reset_resubmission_metadata --email {ADMIN_EMAIL} --report_ids 2024-12-GSAFAC-0000383165 2024-12-GSAFAC-0000387718 2024-12-GSAFAC-0000398121
You'll see something like the following:
"INFO Found XYZ records for AY2024."
"INFO Found 3 records for report IDs: 2024-12-GSAFAC-0000383165, 2024-12-GSAFAC-0000387718, 2024-12-GSAFAC-0000398121."
Press 'c' and 'enter' to continue. You'll see each record get redisseminated quickly. Verify a few records after the fact.
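For reviewers who want to see how the two invocation modes above could be wired up, here is a rough standalone `argparse` sketch. It is not the command's actual argument handling. Treating `--audit_year` and `--report_ids` as mutually exclusive is an assumption, and `build_parser` and `describe_target` are hypothetical names.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the flags shown in the commands above."""
    parser = argparse.ArgumentParser(prog="reset_resubmission_metadata")
    parser.add_argument("--email", required=True)
    # Assumption: exactly one of the two targeting flags must be given.
    target = parser.add_mutually_exclusive_group(required=True)
    target.add_argument("--audit_year", type=int)
    target.add_argument("--report_ids", nargs="+")
    return parser


def describe_target(args, count):
    """Format a summary line similar to the INFO messages shown above."""
    if args.audit_year is not None:
        return f"Found {count} records for AY{args.audit_year}."
    return f"Found {count} records for report IDs: {', '.join(args.report_ids)}."


by_year = build_parser().parse_args(["--email", "admin@example.com", "--audit_year", "2024"])
by_ids = build_parser().parse_args(
    ["--email", "admin@example.com", "--report_ids", "2024-12-GSAFAC-0000383165"]
)
```

In a Django management command, the same `add_argument` calls would go in the command's `add_arguments` method.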
Screenshot