Skip to content

COMP: Deprecate linearizing ingest helper; add merge-preserving rewriter#6162

Merged
hjmjohnson merged 1 commit intoInsightSoftwareConsortium:mainfrom
hjmjohnson:deprecate-normalize-script
Apr 29, 2026
Merged

COMP: Deprecate linearizing ingest helper; add merge-preserving rewriter#6162
hjmjohnson merged 1 commit intoInsightSoftwareConsortium:mainfrom
hjmjohnson:deprecate-normalize-script

Conversation

@hjmjohnson
Copy link
Copy Markdown
Member

Locks the door on the bug that linearized the topology of the GenericLabelInterpolator (#6135), MGHIO (#6137), FastBilateral (#6159), and MeshNoise (#6161) ingest PRs. The old normalize-ingest-commits.py is now a hard refuse-to-run, and a new merge-preserving rewriter implements the Phase 1 / Phase 2 / Phase 3 / Phase 4 architecture #6160 references.

Why

normalize-ingest-commits.py replayed each upstream non-merge commit via git cherry-pick and explicitly skipped merges ("Skipping N merge commit(s); their content arrives via the non-merge commits"). That trades the upstream merge topology for a linear cherry-pick chain. The strategy doc has always said merge topology is mandatory; the script silently violated that. After repeated reviewer push-back the author asked for a hard guard plus a replacement.

What changed
File Change
Utilities/Maintenance/RemoteModuleIngest/normalize-ingest-commits.py Adds a refuse-to-run guard. Exits 3 with a deprecation notice unless --i-understand-this-linearizes is passed. Notice names the replacement and points at the strategy doc.
Utilities/Maintenance/RemoteModuleIngest/rewrite-history-merge-preserving.py New. Drives git filter-repo --blob-callback to apply uniform text-blob normalization (trailing-whitespace strip + EOF-newline fix) across every blob in <base>..<branch>. Merges are preserved by default. After the rewrite, asserts that the merge count is at least --expected-merges. Phase-1-only mode (--phase-1) clones upstream, applies destination pre-commit hooks once, and emits the reference tree SHA1 for the operator to compare against.
Utilities/Maintenance/RemoteModuleIngest/INGESTION_STRATEGY.md Adds "All upstream-side merges must also survive" requirement with operator-check, "Linearization is forbidden" subsection with allowed-tool table, and an explicit deprecation notice for normalize-ingest-commits.py.
What this intentionally doesn't do
  • Does not apply clang-format / gersemi per blob. Those formatters' output depends on the destination's current config, and applying them retroactively across upstream history would produce spurious historical diffs. The recommended pattern, documented in the strategy table, is a single STYLE: Apply current formatters commit at the rewritten tip after Phase 2 succeeds. PR ENH: Ingest ITKFastBilateral into Modules/Filtering (supersedes #5134) #6159 and ENH: Ingest ITKMeshNoise into Modules/Filtering (closes #5174) #6161 used this pattern successfully.
  • Does not perform the Mode A merge. That stays in ingest-remote-module.sh. This script is a purely-history-rewrite tool that the ingest script (or a manual operator) can invoke.
  • Does not delete normalize-ingest-commits.py. Keeping the file with the deprecation guard preserves backlinks and lets anyone with --i-understand-this-linearizes recover the old behavior in an emergency.

Closes the linearization-at-ingest class of bugs flagged on the prior four ingest PRs. Master tracker: #6160.

@github-actions github-actions Bot added type:Compiler Compiler support or related warnings type:Infrastructure Infrastructure/ecosystem related changes, such as CMake or buildbots area:Python wrapping Python bindings for a class labels Apr 28, 2026
@hjmjohnson hjmjohnson marked this pull request as ready for review April 28, 2026 19:08
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Apr 28, 2026

Greptile Summary

This PR locks down the linearization bug that affected four prior ingest PRs by hard-deprecating normalize-ingest-commits.py (exit 3 unless --i-understand-this-linearizes is passed) and introducing rewrite-history-merge-preserving.py, a git filter-repo-based rewriter that preserves merge topology while normalizing text blobs. All seven issues flagged in the prior round of review have been addressed in this version: git add --all now precedes git write-tree, assert_filter_repo_available is guarded to Phase 2 only, --base is no longer required=True, --expected-merges defaults to 0 with accurate help text, the exec-bit --commit-callback is implemented, dead module-level helpers are absent, and .pre-commit-config.yaml is validated before invoking pre-commit.

Confidence Score: 5/5

Safe to merge; all previously-flagged defects are resolved and only one minor edge-case remains.

All seven P1/P0 issues from the prior review round have been correctly addressed. The single remaining finding (empty-file normalization producing a spurious newline) is P2 and would only surface for repositories with intentionally-empty files, an extremely rare case in ITK remote modules.

rewrite-history-merge-preserving.py — empty-file handling in normalize()

Important Files Changed

Filename Overview
Utilities/Maintenance/RemoteModuleIngest/rewrite-history-merge-preserving.py New merge-preserving rewriter; all 7 previously-flagged issues resolved. One remaining edge case: normalize() converts empty files to a single newline, which may diverge from pre-commit's end-of-file-fixup behavior for intentionally-empty files.
Utilities/Maintenance/RemoteModuleIngest/normalize-ingest-commits.py Deprecation guard added correctly; returns exit code 3 with clear notice unless --i-understand-this-linearizes is passed. Implementation is clean.
Utilities/Maintenance/RemoteModuleIngest/INGESTION_STRATEGY.md Adds mandatory merge-topology requirements, linearization-forbidden section with allowed-tool table, and explicit deprecation notice for normalize-ingest-commits.py.

Reviews (5): Last reviewed commit: "COMP: Deprecate linearizing ingest helpe..." | Re-trigger Greptile

Comment thread Utilities/Maintenance/RemoteModuleIngest/rewrite-history-merge-preserving.py Outdated
@hjmjohnson hjmjohnson force-pushed the deprecate-normalize-script branch from 262ff53 to 8b70516 Compare April 28, 2026 19:57
@hjmjohnson
Copy link
Copy Markdown
Member Author

Addressed greptile P1 in 8b70516 — added git add --all between pre-commit run and git write-tree in run_phase_1_reference so the reference SHA captures the post-hook tree. Phase 3 verification now compares against the correct reference.

@greptileai re-review the merge-preserving rewriter.

Comment thread Utilities/Maintenance/RemoteModuleIngest/rewrite-history-merge-preserving.py Outdated
@hjmjohnson hjmjohnson force-pushed the deprecate-normalize-script branch from 8b70516 to 66267cb Compare April 28, 2026 20:11
@hjmjohnson
Copy link
Copy Markdown
Member Author

Addressed greptile's three new P1s in 66267cb:

  1. Moved assert_filter_repo_available() to after the --phase-1 early return — Phase 1 only needs pre-commit, not git-filter-repo.
  2. Added --commit-callback to run_phase_2_rewrite that strips the executable bit from text-extension files (.cmake, .cxx, .h, .txt, .md, .yaml, etc.). .sh / .py / .pl are intentionally excluded since they may legitimately be runnable upstream.
  3. Changed --expected-merges default from 1 → 0 and rewrote the help text. The Mode A ingest merge is added later by ingest-remote-module.sh and is not in scope for this script's topology assertion.

@greptileai re-review.

Comment thread Utilities/Maintenance/RemoteModuleIngest/rewrite-history-merge-preserving.py Outdated
@hjmjohnson
Copy link
Copy Markdown
Member Author

Addressed greptile's round-3 P1 in cfd667c: dropped required=True on --base and added an explicit p.error() in the Phase 2 path. Phase 1 mode (--phase-1) now parses without needing a dummy --base. Verified both paths locally:

  • --phase-1 --upstream-url X --pre-commit-config-src Y → reaches clone (fails on bogus URL, as expected)
  • bare invocation without --baseerror: --base is required for Phase 2 (omit only with --phase-1)

@greptileai re-review.

@hjmjohnson hjmjohnson force-pushed the deprecate-normalize-script branch from cfd667c to 3345974 Compare April 28, 2026 22:06
@hjmjohnson
Copy link
Copy Markdown
Member Author

Addressed greptile's round-4 P1 + cleanup in 3345974:

  1. Phase 1 silent-failure (P1): now hard-fails with exit 2 if --pre-commit-config-src does not contain .pre-commit-config.yaml — the required hook config can no longer silently fall through to producing an unformatted reference tree. Auxiliary configs (.gersemi.config, .clang-format) emit a WARN: to stderr when missing rather than being silently absent.
  2. Dead module-level helpers: removed TEXT_EXTENSIONS, TEXT_BASENAMES, is_likely_text, normalize_text_blob from module scope. They were duplicated by the inline filter-repo callback strings (which run in a different Python process) and were not referenced anywhere else.

@greptileai re-review.

The normalize-ingest-commits.py script replays each non-merge upstream
commit via cherry-pick and skips merges entirely.  The result loses
the upstream merge topology, which the strategy doc has always
mandated as preserved.  Reviewers flagged this on PRs InsightSoftwareConsortium#6135, InsightSoftwareConsortium#6137,
InsightSoftwareConsortium#6159, and InsightSoftwareConsortium#6161 before it became a rule violation.

This commit:

1. Adds a hard refuse-to-run guard to normalize-ingest-commits.py
   (exit 3 with deprecation notice unless --i-understand-this-linearizes
   is passed).  The guard names the replacement and points at the
   strategy doc.

2. Adds rewrite-history-merge-preserving.py — a git-filter-repo
   driven rewriter that walks every commit (merges included) and
   applies uniform text-blob normalization (trailing-whitespace strip
   + EOF-newline fix).  Implements the user-mandated Phase 1
   reference / Phase 2 traversal / Phase 3 verification architecture.

3. Updates INGESTION_STRATEGY.md with the "Linearization is forbidden"
   subsection, an allowed-tool table for per-commit re-formatting,
   and the new operator check that the merge count satisfies
   1 + UPSTREAM_INTERNAL_MERGES.

Language-specific formatters (clang-format, gersemi) are intentionally
NOT applied per blob — they require the destination's current config,
and applying them retroactively across upstream history would produce
spurious diffs.  The recommended pattern is a single
"STYLE: Apply current formatters" commit at the rewritten tip after
Phase 2 succeeds.
@hjmjohnson hjmjohnson force-pushed the deprecate-normalize-script branch from 3345974 to 1f41862 Compare April 28, 2026 22:27
@hjmjohnson hjmjohnson requested a review from dzenanz April 29, 2026 16:32
@hjmjohnson
Copy link
Copy Markdown
Member Author

@dzenanz This is to be more rigorous about not linearizing the history.

Copy link
Copy Markdown
Member

@dzenanz dzenanz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the intent. I did not look at the new script. As this is maintenance-related code, it does not need to be highly scrutinized. If someone does take a look, they can suggest changes via a PR.

@hjmjohnson hjmjohnson merged commit 4b5eb76 into InsightSoftwareConsortium:main Apr 29, 2026
6 checks passed
@hjmjohnson hjmjohnson added this to the ITK 6.0 Release Candidate 1 milestone May 5, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area:Python wrapping Python bindings for a class type:Compiler Compiler support or related warnings type:Infrastructure Infrastructure/ecosystem related changes, such as CMake or buildbots

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants