Skip to content

fix(dvc): self-heal interrupted MERGE_HEAD on stale staged content#2337

Merged
jirhiker merged 8 commits into
mainfrom
claude/sharp-franklin-e3eb52
May 15, 2026
Merged

fix(dvc): self-heal interrupted MERGE_HEAD on stale staged content#2337
jirhiker merged 8 commits into
mainfrom
claude/sharp-franklin-e3eb52

Conversation

@jirhiker
Copy link
Copy Markdown
Member

Summary

  • Self-heal in GitRepoManager._protected_merge now distinguishes a real conflict (unmerged paths) from an interrupted merge (staged auto-merge content, no conflicts), and finalizes the latter with git commit --no-edit instead of bailing.
  • Previous behavior treated any dirty working tree as unrecoverable, then fell through and re-invoked git merge, producing fatal: You have not concluded your merge (MERGE_HEAD exists) and aborting post_measurement_save -> meta_pull -> smart_pull repeatedly.

Why

Observed in production: dvc_persister.post_measurement_save raised GitCommandError from git merge --no-commit --no-ff -X theirs FETCH_HEAD because the prior run had crashed between git merge --no-commit (which succeeded mechanically and staged the merged content) and the follow-up git commit --no-edit. The self-heal added in 0336d81 / f0964f9 only handled the clean-tree case; staged content tripped is_dirty(untracked_files=False) and was misclassified as a conflict to resolve manually.

Fix

Three explicit branches in the self-heal:

  1. Unmerged paths (git diff --diff-filter=U non-empty): log critical, bail. User must resolve.
  2. Staged content, no conflicts (git diff --cached --name-only non-empty): run git commit --no-edit to finalize the interrupted merge.
  3. Clean: git merge --abort (or fall back to removing MERGE_HEAD/MERGE_MSG/MERGE_MODE).

is_dirty is no longer used — replaced with explicit checks against the index and unmerged paths so the two states cannot be conflated.

Test plan

  • Manual: simulate the crash state by running git merge --no-commit --no-ff <ref> in a MetaData clone, killing before commit, then triggering smart_pull. Self-heal logs finalizing the interrupted merge commit and proceeds.
  • Manual: introduce a real conflict (edit same line on both branches), repeat. Self-heal logs refusing to auto-resolve and bails.
  • Manual: leave only stale MERGE_HEAD marker files with clean tree. Self-heal logs aborting residual merge state and continues.
  • Resume an interrupted experiment in pychron after applying — post_measurement_save succeeds without operator intervention.

🤖 Generated with Claude Code

hornblende and others added 5 commits May 14, 2026 10:24
Apple Silicon (M3) was crashing pychron with SIGBUS/SIGSEGV after many
analyses while Intel survived. Root cause: worker threads (run threads,
_do_collection, _ping_loop, StatusMonitor) scheduled Qt timers via
pyface helpers (do_after_timer, CallbackTimer.single_shot,
QTimer.singleShot). The resulting QObjects bound their thread context
to the worker; when the worker exited and Qt later fired the timer, it
dereferenced a stale pointer. M3 PAC catches this; Intel silently
limps. Same bug also drove the 2+ second main-thread stalls observed
during paint-heavy startup bursts.

Phase 1 - instrumentation:
- pychron/core/helpers/m3_diagnostics.py (new): installs a
  qInstallMessageHandler that promotes only known cross-thread Qt
  warnings (Timers can only be used..., different thread, ...) to
  ERROR with a full Python stack while dropping benign libpng/iCCP
  spam; wraps QTimer.__init__/singleShot to log off-main use; runs a
  main-thread heartbeat QTimer + daemon poller that dumps every
  Python frame when the main thread stalls past stall_threshold.
- launchers/launcher.py: install_early() before any Qt import so the
  message handler is in place when pyface loads.
- pychron/envisage/pychron_run.py: install_late(stall_threshold=5.0)
  after app_factory() and before app.run() so the watchdog timer can
  attach to a real QApplication; 5s threshold suppresses known
  startup-burst false positives.
- M3_INSTRUMENTATION.md: how to read m3_diagnostics.log.

Phase 2 - thread-safe marshalling:
- m3_diagnostics.install_thread_safe_marshalling() patches the four
  Qt-timer entry points so off-main callers re-enter via
  invoke_in_main_thread (GUI.invoke_later) instead of creating
  QTimers in dying worker contexts:
    pyface.timer.do_later.do_after
    pyface.timer.do_later.do_later
    pyface.timer.timer.CallbackTimer.single_shot
    pyface.qt.QtCore.QTimer.singleShot
- pychron/monitors/automated_run_monitor.py: route the device get()
  query in get_isotope_keys through invoke_in_main_thread, since it
  is invoked from the monitor worker thread but touches Qt-owned
  hardware proxies.

Supporting fixes from the same investigation:
- pychron/core/ui/factory.py: revert lazy toolkit_factory to eager
  loading; lazy proxy was the original cause of the missing macOS
  menu bar.
- pychron/core/ui/gui.py: import convert_color and wake_screen
  eagerly from pychron.core.ui.qt.gui so they are available during
  menu initialization on macOS (matches the factory.py revert).
- pychron/core/ui/crash_diagnostics.py (new): standalone helper for
  parsing macOS .ips crash reports against pychron stacks.
- pychron/envisage/tasks/base_tasks_application.py: on macOS, do not
  reload saved task window layouts; auto-opened windows during
  startup interfere with menu bar rendering on Sonoma/Sequoia.
- pychron/core/helpers/archiver.py: add a warning() helper and skip
  the archive directory itself in _clean so the archiver no longer
  crashes on startup with shutil.Error: Cannot move a directory into
  itself plus AttributeError on the missing warning method.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Untracked items that should not enter the repo:
- *.bak files (e.g. pychron/core/ui/gui.py.bak left over from the
  M3 cross-thread investigation).
- .claude/ (Claude Code agent session state).
- launchers-Felix/ (snapshot of the pre-transition Felix launcher
  retained locally for reference).
- launchers/helpers-Jan.py and pychron/extraction_line/switch_manager-Jan.py
  (variant copies kept locally during the Felix -> Jan mass spectrometer
  transition).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pychron crashes (in particular the M3 cross-thread segfaults addressed
in 83f4b6d) sometimes occur partway through DVC's MetaData git pull.
The next launch finds a stale MERGE_HEAD in the MetaData repo and
GitPython aborts with:

    fatal: You have not concluded your merge (MERGE_HEAD exists).
    Please, commit your changes before you merge.

which surfaces as "program stopped" and prevents Pychron from starting
until the user manually clears .git/MERGE_HEAD.

Now _protected_merge inspects the repo on entry:
- If MERGE_HEAD exists AND the working tree has unmerged paths or other
  uncommitted changes, log a critical message and let the original merge
  attempt fail loudly (real conflicts still require human intervention).
- If MERGE_HEAD exists AND the working tree is clean, run
  `git merge --abort` (with a manual MERGE_HEAD/MERGE_MSG/MERGE_MODE
  cleanup fallback) and continue with the new merge.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
pychron.loggable.Loggable.warning/critical take a single message
argument, not stdlib `%s`-style positional formatting.  The previous
commit (f0964f9) passed extra positional args, which raised

    TypeError: Loggable.warning() takes 2 positional arguments but 3 were given

inside the heal path's exception guard, swallowing the abort and
allowing the original `git merge` to run against the stale MERGE_HEAD
- defeating the self-heal entirely.

Pre-format the messages with `str.format` so the single-arg signature
is honoured.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ntent

The previous self-heal in _protected_merge treated MERGE_HEAD + dirty
working tree (via is_dirty) as a hard bail. That conflated two distinct
states:

  1. Real unmerged paths (U entries) -> user must resolve.
  2. Staged auto-merged content with no conflicts -> the merge already
     succeeded mechanically; only the final `commit --no-edit` step was
     interrupted by the crash.

Case 2 was being misclassified as "dirty," logged as critical, then the
function fell through and re-invoked `git merge`, producing:

  fatal: You have not concluded your merge (MERGE_HEAD exists).
  Please, commit your changes before you merge.

This kept aborting experiment runs (post_measurement_save -> meta_pull
-> smart_pull -> _protected_merge) until the user manually committed
the residual merge.

Split the self-heal into three branches: unmerged paths bail, staged
content gets finalized via `git commit --no-edit`, otherwise abort the
residual merge state as before.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ce85a7dcda

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread pychron/git_archive/repo_manager.py
Comment thread pychron/monitors/automated_run_monitor.py Outdated
When pychron segfaults inside QCoreApplication::notifyInternal2 (Qt
trying to deliver a queued Timer event to a freed QObject), the
existing diagnostics tell us *that* a receiver was dangling but not
*which class* of receiver. The macOS crash report shows a raw QObject*
pointer with no Python-side identity.

Add an opt-in QApplication event filter that logs every Timer-event
delivery to a dedicated rotating file (m3_eventtrace.log):

  HH:MM:SS.mmm Timer cls=<ClassName> pyid=0x... qtid=<int>

The filter captures `type(obj).__name__` BEFORE Qt dispatches, so a
successful trace identifies the next receiver and a crash on the next
line localizes the culprit class. Volume is bounded to Timer events
only (Qt event type 1); the full event stream is too noisy to dump.

Opt-in via PYCHRON_M3_EVENT_TRACE=1 so normal runs are not paying the
trace cost. install_late() consults the environment and skips the
filter otherwise.

The tracer logger does not propagate and writes to a separate file so
it can't drown the main diagnostics log or stall the main thread via
the root logger's UI handlers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
)

qInstallMessageHandler(handler)
_QT_MSG_HANDLER_INSTALLED = True
try:
QTimer.__init__ = init_wrapper
QTimer.singleShot = single_shot_wrapper
_QTIMER_GUARD_INSTALLED = True
except Exception as e: # pragma: no cover
_log.error("marshalling: QTimer.singleShot patch failed: %s", e)

_MARSHALLING_INSTALLED = True
_log.error("install_event_tracer: installEventFilter failed: %s", e)
return

_EVENT_TRACER_INSTALLED = True
_watchdog = _Watchdog(stall_threshold=stall_threshold)
_watchdog.start_main_timer()
_watchdog.start_poll_thread()
_WATCHDOG_INSTALLED = True
_attach_file_handler()
install_qt_message_handler()
install_qtimer_thread_guard()
_INSTALLED = True
Comment thread pychron/core/ui/crash_diagnostics.py Outdated
Comment thread pychron/core/ui/gui.py
Comment thread pychron/core/helpers/m3_diagnostics.py
hornblende and others added 2 commits May 15, 2026 14:10
Two correctness fixes from review feedback on #2337:

1. repo_manager._protected_merge: when the self-heal finalizes an
   interrupted merge (MERGE_HEAD + staged content, no conflicts), run
   _ensure_no_protected_deletions first, mirroring the normal merge
   path. Without this guard a crash mid-merge that staged deletions of
   UUID-backed analysis files would be silently committed by the
   self-heal. The outer except now re-raises GitCommandError so the
   protected-deletion bail reaches the caller instead of falling
   through to a retry.

2. automated_run_monitor._get_value: the previous fix routed dev.get()
   through invoke_in_main_thread, which is fire-and-forget and returns
   None. AutomatedRunMonitor._fcheck_conditions evaluates rules like
   "x>10" against this return value, so configured hardware limits
   were no longer enforced. Replace with a synchronous round-trip:
   schedule on the main thread, block on an Event, return the captured
   value. Direct call when already on the main thread to avoid
   self-deadlock. 10s timeout so a wedged main thread surfaces as a
   monitor failure instead of stalling the run thread.

Plus cleanup:
- Run Black across the 9 PR files to clear the formatting CI gate.
- Drop two stray @staticmethod decorators on module-level wrapper
  functions in m3_diagnostics (MyPy "staticmethod used with a
  non-method").
- Match the convert_color fallback signature in core/ui/gui.py to the
  real implementation (color, output="rgbF").
- Drop unused Dict/Optional imports in core/ui/crash_diagnostics.py.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Explain why the eventFilter swallows exceptions silently: a dangling
receiver may itself fault on `type(obj).__name__`, and tracing must
not interfere with Qt event delivery. The truncated log line is the
intended signal in that case.
@jirhiker jirhiker merged commit 3605c09 into main May 15, 2026
5 of 6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant