
Resubmission Linking Command Updates#5622

Open

jperson1 wants to merge 19 commits into main from jp/resub-linking-command

Resubmission Linking Command Updates#5622
jperson1 wants to merge 19 commits into
mainfrom
jp/resub-linking-command

Conversation

@jperson1
Contributor

@jperson1 jperson1 commented May 6, 2026

Resubmission Linking Command Updates

I will not make such large commits a habit.

Related tickets

Description of changes

  • The linking command:
    • Now uses equivalence to determine the chains. We may come back and use distance to catch stragglers.
    • The reviewable CSVs were updated to enable undoing, and all output files now go into their own subdirectory.
  • A new command to undo the linkage after the fact, just in case.
    • Uses a CSV generated from the linking command.
  • A new command to wipe all resubmission metadata for an audit year or a set of specific records.
    • Very convenient while testing. Will be useful if any of these linkages proves false way down the line.
  • A little Makefile change from last week.
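As a rough sketch of the equivalence approach (the function name, key fields, and record shape here are illustrative, not the actual curationlib code), chain generation amounts to grouping submissions whose equivalence key matches exactly:

```python
from collections import defaultdict


def generate_chains_by_equivalence(records, key):
    """Group submissions whose equivalence key matches into chains.

    Singleton groups are dropped, since a resubmission chain needs at
    least two members. (Illustrative sketch only.)
    """
    groups = defaultdict(list)
    for record in records:
        groups[key(record)].append(record)
    return [chain for chain in groups.values() if len(chain) > 1]
```

A distance-based pass to "catch stragglers" would relax the exact-match key into a threshold comparison against each existing group.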

How to test

Switch to this branch and bring everything up normally. Ensure you have some local data to play with, ideally both Census and GSA records.

The linking command

Try a few years, perhaps; the more recent years include more data and will take a little longer. Run the command with:
python manage.py link_resubmissions --email {ADMIN_EMAIL} --audit_year 2020

Check the .md and .csv files that are produced. Verify they look right. You'll see a lot of duplicate data - that's fine, and ideal, since it means the records are truly in the same resubmission chain. For the 2019-2022 range, verify there are some Census-GSA crossovers.
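The Census-GSA crossover check can be scripted rather than eyeballed. In this sketch the `chain_id` and `source` column names and the `CENSUS`/`GSAFAC` values are assumptions about the CSV layout, not the command's actual schema:

```python
import csv
from collections import defaultdict


def chains_with_crossover(csv_lines):
    """Return chain ids whose rows mix Census and GSA sources.

    Accepts any iterable of CSV lines (an open file handle works).
    Column names are assumed, not taken from the real output.
    """
    sources = defaultdict(set)
    for row in csv.DictReader(csv_lines):
        sources[row["chain_id"]].add(row["source"])
    return [cid for cid, srcs in sources.items() if {"CENSUS", "GSAFAC"} <= srcs]
```

An open file can be passed directly, e.g. `chains_with_crossover(open(path, newline=""))`.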

Press 'c' and 'enter' to continue. You'll see each record get redisseminated quickly.

Verify a few records in search and/or through the API. You may want to re-up the materialized view to use Advanced Search. Also spot check the internal singleauditchecklist table.

The undo linking command

You've just run a few years, probably. Grab a related CSV, and run a command like:
python manage.py undo_link_resubmissions --email {ADMIN_EMAIL} --csv curation/data/{FILENAME}.csv

You'll see an "INFO CSV contains XYZ records." message. Verify the count is right. Press 'c' and 'enter' to continue. You'll see each record get redisseminated quickly.

Verify a few records after the fact. Any record that previously had "NULL" resubmission data will instead have version 0 and status "unknown_resubmission_status". If we were to redisseminate it with NULL, it would be assumed to be version 1. So we explicitly say "unknown" after taking curative action on it.
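The version semantics described above boil down to a small rule, sketched here as an illustrative helper (not the actual dissemination code):

```python
def effective_resubmission_version(stored_version):
    """Resolve the version a record is assumed to have at dissemination.

    NULL (None) is treated as an original submission, i.e. version 1,
    which is why curative commands write an explicit 0 plus
    "unknown_resubmission_status" instead of NULL. (Sketch only.)
    """
    return 1 if stored_version is None else stored_version
```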

The reset resubmission metadata command

This command sets the resub version to 0 and the status to "unknown_resubmission_status" for all records it hits. It will handle either a full AY or accept a list of report_ids. It's very nice for testing the above commands locally. We may eventually use this to reset records that are incorrectly brought together - either by our own actions or by user error.
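The either/or scoping might look roughly like this argparse sketch; the argument names mirror the invocations below, but the parser itself is an illustration, not the command's actual source:

```python
import argparse


def build_parser():
    # Sketch of the reset command's argument shape: --email is always
    # required, and exactly one of --audit_year / --report_ids selects
    # the records to reset. (Assumed structure, not the real code.)
    parser = argparse.ArgumentParser(prog="reset_resubmission_metadata")
    parser.add_argument("--email", required=True)
    scope = parser.add_mutually_exclusive_group(required=True)
    scope.add_argument("--audit_year", type=int)
    scope.add_argument("--report_ids", nargs="+")
    return parser
```

In a Django management command the same options would be declared in `add_arguments(self, parser)`, but the mutually exclusive group works identically.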

Run the command like:
python manage.py reset_resubmission_metadata --email {ADMIN_EMAIL} --audit_year 2024
python manage.py reset_resubmission_metadata --email {ADMIN_EMAIL} --report_ids 2024-12-GSAFAC-0000383165 2024-12-GSAFAC-0000387718 2024-12-GSAFAC-0000398121

You'll see something like the following:
"INFO Found XYZ records for AY2024."
"INFO Found 3 records for report IDs: 2024-12-GSAFAC-0000383165, 2024-12-GSAFAC-0000387718, 2024-12-GSAFAC-0000398121."

Press 'c' and 'enter' to continue. You'll see each record get redisseminated quickly. Verify a few records after the fact.

Screenshot

(screenshot attached to the PR)

jperson1 added 2 commits May 6, 2026 15:45
* The command itself uses equivalence to determine the chains. We may come back and use distance to catch stragglers. The reviewable CSVs saw an update, and all the files go into their own subdirectory.
* A new command to undo the linkage after the fact, just in case.
* A new command to wipe all resubmission metadata for an audit year or a set of specific records.
* Some small changes in various places.

I will not make such large commits a habit.
@jperson1 jperson1 self-assigned this May 6, 2026
The distance and equivalence cluster generation functions are separate, and get their own test classes.
…metadata, set to v0 rather than NULL.

NULL will become v1 at dissemination time.
@jperson1
Contributor Author

jperson1 commented May 7, 2026

I've tested pretty extensively in the Preview environment, and things look good. I had to bump the memory to run the linking command - and it still couldn't get to some of the "larger" years. We may want to bump Production a bit before running the commands there. Something to keep in mind.

@jperson1 jperson1 marked this pull request as ready for review May 7, 2026 16:41
Comment thread backend/curation/curationlib/sac_resubmission_records_postgres.py Outdated
# For each record, compute its distance to the existing sets.
# If it is below the threshold, insert it into an existing set.
# Otherwise, insert into a new set.
def generate_clusters_from_records_by_equivalence(records, noisy=False):
Contributor


PEDANTIC ALERT (also, I know you inherited this code)

Some of these names bug me. "Cluster" instead of "chain" (only makes sense if you agree with my other sorting comment, though) and "record" instead of "audit" or "submission". For example, could this be generate_audit_chains_by_equivalence? Probably annoying to swap the verbiage everywhere but I figure it's worth checking and I'm down to discuss.

Contributor Author


I fully agree, and I'll take every opportunity to fix up the language around things. I'm trying to use "chain" pretty much everywhere. That may or may not be helpful, since they're not really "chains" until the sort and annotation are made? They are aspiring chains, and that seems... fine. I've switched to "submission" in most places; I feel like that makes sense. I've updated some variables to use sac or sacs, which matches up with a lot of the backend code. Should be better, but I'm sure I missed stuff. Nothing broke!

Comment thread backend/curation/management/commands/link_resubmissions.py Outdated
Comment thread backend/curation/curationlib/generate_resubmission_clusters.py Outdated
Comment thread backend/curation/management/commands/undo_link_resubmissions.py
@github-actions
Contributor

Code Coverage

Package  Line Rate  Branch Rate
. 100% 100%
api 98% 86%
api.serializers 97% 88%
api.views 91% 96%
audit 95% 80%
audit.cross_validation 97% 86%
audit.fixtures 84% 50%
audit.formlib 92% 62%
audit.intakelib 89% 83%
audit.intakelib.checks 92% 86%
audit.intakelib.common 98% 82%
audit.intakelib.transforms 100% 95%
audit.management.commands 78% 17%
audit.migrations 100% 100%
audit.models 91% 69%
audit.templatetags 100% 100%
audit.test_viewlib 100% 100%
audit.views 75% 52%
census_historical_migration 96% 65%
census_historical_migration.migrations 100% 100%
census_historical_migration.sac_general_lib 92% 84%
census_historical_migration.transforms 95% 90%
census_historical_migration.workbooklib 68% 69%
config 78% 37%
curation 94% 90%
curation.curationlib 79% 51%
curation.management.commands 46% 34%
curation.migrations 100% 100%
dissemination 90% 70%
dissemination.analytics 27% 0%
dissemination.forms 80% 30%
dissemination.migrations 97% 25%
dissemination.models 100% 100%
dissemination.report_generation 21% 0%
dissemination.report_generation.excel 32% 0%
dissemination.searchlib 61% 44%
dissemination.templatetags 52% 6%
dissemination.views 67% 47%
djangooidc 53% 38%
djangooidc.tests 100% 94%
report_submission 100% 96%
report_submission.migrations 100% 100%
report_submission.templatetags 74% 100%
report_submission.views 78% 61%
support 94% 75%
support.migrations 100% 100%
support.models 90% 50%
tools 98% 50%
users 95% 86%
users.fixtures 100% 83%
users.management 100% 100%
users.management.commands 100% 100%
users.migrations 100% 100%
Summary 88% (22648 / 25643) 68% (2766 / 4050)

Minimum allowed line rate is 85%

