fix(dvc): self-heal interrupted MERGE_HEAD on stale staged content#2337
Merged
Conversation
Apple Silicon (M3) was crashing pychron with SIGBUS/SIGSEGV after many
analyses while Intel survived. Root cause: worker threads (run threads,
_do_collection, _ping_loop, StatusMonitor) scheduled Qt timers via
pyface helpers (do_after_timer, CallbackTimer.single_shot,
QTimer.singleShot). The resulting QObjects bound their thread context
to the worker; when the worker exited and Qt later fired the timer, it
dereferenced a stale pointer. M3 PAC catches this; Intel silently
limps. Same bug also drove the 2+ second main-thread stalls observed
during paint-heavy startup bursts.
Phase 1 - instrumentation:
- pychron/core/helpers/m3_diagnostics.py (new): installs a
qInstallMessageHandler that promotes only known cross-thread Qt
warnings (Timers can only be used..., different thread, ...) to
ERROR with a full Python stack while dropping benign libpng/iCCP
spam; wraps QTimer.__init__/singleShot to log off-main use; runs a
main-thread heartbeat QTimer + daemon poller that dumps every
Python frame when the main thread stalls past stall_threshold.
- launchers/launcher.py: install_early() before any Qt import so the
message handler is in place when pyface loads.
- pychron/envisage/pychron_run.py: install_late(stall_threshold=5.0)
after app_factory() and before app.run() so the watchdog timer can
attach to a real QApplication; 5s threshold suppresses known
startup-burst false positives.
- M3_INSTRUMENTATION.md: how to read m3_diagnostics.log.
Phase 2 - thread-safe marshalling:
- m3_diagnostics.install_thread_safe_marshalling() patches the four
Qt-timer entry points so off-main callers re-enter via
invoke_in_main_thread (GUI.invoke_later) instead of creating
QTimers in dying worker contexts:
pyface.timer.do_later.do_after
pyface.timer.do_later.do_later
pyface.timer.timer.CallbackTimer.single_shot
pyface.qt.QtCore.QTimer.singleShot
- pychron/monitors/automated_run_monitor.py: route the device get()
query in get_isotope_keys through invoke_in_main_thread, since it
is invoked from the monitor worker thread but touches Qt-owned
hardware proxies.
Supporting fixes from the same investigation:
- pychron/core/ui/factory.py: revert lazy toolkit_factory to eager
loading; lazy proxy was the original cause of the missing macOS
menu bar.
- pychron/core/ui/gui.py: import convert_color and wake_screen
eagerly from pychron.core.ui.qt.gui so they are available during
menu initialization on macOS (matches the factory.py revert).
- pychron/core/ui/crash_diagnostics.py (new): standalone helper for
parsing macOS .ips crash reports against pychron stacks.
- pychron/envisage/tasks/base_tasks_application.py: on macOS, do not
reload saved task window layouts; auto-opened windows during
startup interfere with menu bar rendering on Sonoma/Sequoia.
- pychron/core/helpers/archiver.py: add a warning() helper and skip
the archive directory itself in _clean so the archiver no longer
crashes on startup with shutil.Error: Cannot move a directory into
itself plus AttributeError on the missing warning method.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Untracked items that should not enter the repo: - *.bak files (e.g. pychron/core/ui/gui.py.bak left over from the M3 cross-thread investigation). - .claude/ (Claude Code agent session state). - launchers-Felix/ (snapshot of the pre-transition Felix launcher retained locally for reference). - launchers/helpers-Jan.py and pychron/extraction_line/switch_manager-Jan.py (variant copies kept locally during the Felix -> Jan mass spectrometer transition). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pychron crashes (in particular the M3 cross-thread segfaults addressed in 83f4b6d) sometimes occur partway through DVC's MetaData git pull. The next launch finds a stale MERGE_HEAD in the MetaData repo and GitPython aborts with: fatal: You have not concluded your merge (MERGE_HEAD exists). Please, commit your changes before you merge. which surfaces as "program stopped" and prevents Pychron from starting until the user manually clears .git/MERGE_HEAD. Now _protected_merge inspects the repo on entry: - If MERGE_HEAD exists AND the working tree has unmerged paths or other uncommitted changes, log a critical message and let the original merge attempt fail loudly (real conflicts still require human intervention). - If MERGE_HEAD exists AND the working tree is clean, run `git merge --abort` (with a manual MERGE_HEAD/MERGE_MSG/MERGE_MODE cleanup fallback) and continue with the new merge. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
pychron.loggable.Loggable.warning/critical take a single message argument, not stdlib `%s`-style positional formatting. The previous commit (f0964f9) passed extra positional args, which raised TypeError: Loggable.warning() takes 2 positional arguments but 3 were given inside the heal path's exception guard, swallowing the abort and allowing the original `git merge` to run against the stale MERGE_HEAD - defeating the self-heal entirely. Pre-format the messages with `str.format` so the single-arg signature is honoured. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ntent
The previous self-heal in _protected_merge treated MERGE_HEAD + dirty
working tree (via is_dirty) as a hard bail. That conflated two distinct
states:
1. Real unmerged paths (U entries) -> user must resolve.
2. Staged auto-merged content with no conflicts -> the merge already
succeeded mechanically; only the final `commit --no-edit` step was
interrupted by the crash.
Case 2 was being misclassified as "dirty," logged as critical, then the
function fell through and re-invoked `git merge`, producing:
fatal: You have not concluded your merge (MERGE_HEAD exists).
Please, commit your changes before you merge.
This kept aborting experiment runs (post_measurement_save -> meta_pull
-> smart_pull -> _protected_merge) until the user manually committed
the residual merge.
Split the self-heal into three branches: unmerged paths bail, staged
content gets finalized via `git commit --no-edit`, otherwise abort the
residual merge state as before.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ce85a7dcda
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
When pychron segfaults inside QCoreApplication::notifyInternal2 (Qt trying to deliver a queued Timer event to a freed QObject), the existing diagnostics tell us *that* a receiver was dangling but not *which class* of receiver. The macOS crash report shows a raw QObject* pointer with no Python-side identity. Add an opt-in QApplication event filter that logs every Timer-event delivery to a dedicated rotating file (m3_eventtrace.log): HH:MM:SS.mmm Timer cls=<ClassName> pyid=0x... qtid=<int> The filter captures `type(obj).__name__` BEFORE Qt dispatches, so a successful trace identifies the next receiver and a crash on the next line localizes the culprit class. Volume is bounded to Timer events only (Qt event type 1); the full event stream is too noisy to dump. Opt-in via PYCHRON_M3_EVENT_TRACE=1 so normal runs are not paying the trace cost. install_late() consults the environment and skips the filter otherwise. The tracer logger does not propagate and writes to a separate file so it can't drown the main diagnostics log or stall the main thread via the root logger's UI handlers. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
| ) | ||
|
|
||
| qInstallMessageHandler(handler) | ||
| _QT_MSG_HANDLER_INSTALLED = True |
| try: | ||
| QTimer.__init__ = init_wrapper | ||
| QTimer.singleShot = single_shot_wrapper | ||
| _QTIMER_GUARD_INSTALLED = True |
| except Exception as e: # pragma: no cover | ||
| _log.error("marshalling: QTimer.singleShot patch failed: %s", e) | ||
|
|
||
| _MARSHALLING_INSTALLED = True |
| _log.error("install_event_tracer: installEventFilter failed: %s", e) | ||
| return | ||
|
|
||
| _EVENT_TRACER_INSTALLED = True |
| _watchdog = _Watchdog(stall_threshold=stall_threshold) | ||
| _watchdog.start_main_timer() | ||
| _watchdog.start_poll_thread() | ||
| _WATCHDOG_INSTALLED = True |
| _attach_file_handler() | ||
| install_qt_message_handler() | ||
| install_qtimer_thread_guard() | ||
| _INSTALLED = True |
Two correctness fixes from review feedback on #2337: 1. repo_manager._protected_merge: when the self-heal finalizes an interrupted merge (MERGE_HEAD + staged content, no conflicts), run _ensure_no_protected_deletions first, mirroring the normal merge path. Without this guard a crash mid-merge that staged deletions of UUID-backed analysis files would be silently committed by the self-heal. The outer except now re-raises GitCommandError so the protected-deletion bail reaches the caller instead of falling through to a retry. 2. automated_run_monitor._get_value: the previous fix routed dev.get() through invoke_in_main_thread, which is fire-and-forget and returns None. AutomatedRunMonitor._fcheck_conditions evaluates rules like "x>10" against this return value, so configured hardware limits were no longer enforced. Replace with a synchronous round-trip: schedule on the main thread, block on an Event, return the captured value. Direct call when already on the main thread to avoid self-deadlock. 10s timeout so a wedged main thread surfaces as a monitor failure instead of stalling the run thread. Plus cleanup: - Run Black across the 9 PR files to clear the formatting CI gate. - Drop two stray @staticmethod decorators on module-level wrapper functions in m3_diagnostics (MyPy "staticmethod used with a non-method"). - Match the convert_color fallback signature in core/ui/gui.py to the real implementation (color, output="rgbF"). - Drop unused Dict/Optional imports in core/ui/crash_diagnostics.py. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Explain why the eventFilter swallows exceptions silently: a dangling receiver may itself fault on `type(obj).__name__`, and tracing must not interfere with Qt event delivery. The truncated log line is the intended signal in that case.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
GitRepoManager._protected_mergenow distinguishes a real conflict (unmerged paths) from an interrupted merge (staged auto-merge content, no conflicts), and finalizes the latter withgit commit --no-editinstead of bailing.git merge, producingfatal: You have not concluded your merge (MERGE_HEAD exists)and abortingpost_measurement_save -> meta_pull -> smart_pullrepeatedly.Why
Observed in production:
dvc_persister.post_measurement_saveraisedGitCommandErrorfromgit merge --no-commit --no-ff -X theirs FETCH_HEADbecause the prior run had crashed betweengit merge --no-commit(which succeeded mechanically and staged the merged content) and the follow-upgit commit --no-edit. The self-heal added in 0336d81 / f0964f9 only handled the clean-tree case; staged content trippedis_dirty(untracked_files=False)and was misclassified as a conflict to resolve manually.Fix
Three explicit branches in the self-heal:
git diff --diff-filter=Unon-empty): log critical, bail. User must resolve.git diff --cached --name-onlynon-empty): rungit commit --no-editto finalize the interrupted merge.git merge --abort(or fall back to removing MERGE_HEAD/MERGE_MSG/MERGE_MODE).is_dirtyis no longer used — replaced with explicit checks against the index and unmerged paths so the two states cannot be conflated.Test plan
git merge --no-commit --no-ff <ref>in a MetaData clone, killing before commit, then triggeringsmart_pull. Self-heal logsfinalizing the interrupted merge commitand proceeds.refusing to auto-resolveand bails.MERGE_HEADmarker files with clean tree. Self-heal logsaborting residual merge stateand continues.post_measurement_savesucceeds without operator intervention.🤖 Generated with Claude Code