Skip to content

test(scheduler): repro signal watcher failure during SIGINT cleanup#14023

Draft
anmonteiro wants to merge 2 commits intoocaml:mainfrom
anmonteiro:anmonteiro/fix-signal-watcher-race
Draft

test(scheduler): repro signal watcher failure during SIGINT cleanup#14023
anmonteiro wants to merge 2 commits intoocaml:mainfrom
anmonteiro:anmonteiro/fix-signal-watcher-race

Conversation

@anmonteiro
Copy link
Copy Markdown
Collaborator

@anmonteiro anmonteiro commented Apr 3, 2026

  • Replace the Unix SIGCHLD wakeup path in the scheduler with a self-pipe notifier.
  • Keep child reaping on the scheduler thread, preserving the invariant from Scheduler: fix pid races #13245.
  • Install a minimal SIGCHLD handler in C that performs only an async-signal-safe write to a pipe.
  • On Unix systems, add notifier thread in process watcher that reads pipe events and emits Job_complete_ready.
  • On macOS, fixes a failure mode where interrupted high-concurrency builds could crash with signal watcher received an unexpected signal / Unknown. This was previously thought fixed in fix(signal-watcher): don't assert false on SIGKILL #13370

@anmonteiro anmonteiro force-pushed the anmonteiro/fix-signal-watcher-race branch from b7d25d2 to 2c8e50c Compare April 3, 2026 06:13
@anmonteiro anmonteiro changed the title test(scheduler): repro SIGCHLD wake-up corruption under high concurrency test(scheduler): repro signal watcher failure during SIGINT cleanup Apr 3, 2026
@anmonteiro anmonteiro force-pushed the anmonteiro/fix-signal-watcher-race branch 5 times, most recently from 64ffe05 to c05f2a1 Compare April 3, 2026 07:16
anmonteiro added a commit that referenced this pull request Apr 3, 2026
…" (#14029)

This reverts commit 27e5795.


Turns out that the failure rather happens during SIGINT cleanup, where
the behavior of `SIGCHLD` is not super consistent on macOS. Repro in
#14023

Signed-off-by: Antonio Nuno Monteiro <anmonteiro@gmail.com>
@anmonteiro anmonteiro force-pushed the anmonteiro/fix-signal-watcher-race branch 5 times, most recently from 980fb36 to 614796b Compare April 3, 2026 23:51
@anmonteiro anmonteiro marked this pull request as ready for review April 4, 2026 01:24
@anmonteiro anmonteiro requested review from Alizter and rgrinberg April 4, 2026 01:29
@rgrinberg
Copy link
Copy Markdown
Member

Could you explain the bug that is being fixed here?

@rgrinberg
Copy link
Copy Markdown
Member

Glancing over the fix, the changes don't really make sense as they essentially introduce a separate signal watching thread for SIGCHLD. Why would we want that unless using Thread.wait_signal is buggy? But if it's buggy, why would we use it for the other signals? In any case, I think you should explain the buggy behavior that you're seeing and what makes you suspect our signal watching thread is the cause.

@anmonteiro
Copy link
Copy Markdown
Collaborator Author

For sure, perhaps I haven't gotten to the absolute root cause of the issue, but the 1st commit reproduces the issue in our macOS tests.

I will do more digging to see if I can unearth something, but essentially under high process concurrency / cancelation, I observe the assert false case of the signal pattern match just has garbage (e.g. Unknown 123541251) as the signla.

@rgrinberg
Copy link
Copy Markdown
Member

That's concerning, but I'm not sure I'd suspect the issue is in dune. I would suspect there's something wrong with the OCaml runtime if it's returning us a garbage signal.

Why can't we just ignore such junk signals?

@anmonteiro
Copy link
Copy Markdown
Collaborator Author

we may just be able to ignore the junk signals.

I'll push an alternative fix doing that, instead.

In parallel, I'm trying to track down the specific bug, at this point I agree it's not in Dune.

@rgrinberg
Copy link
Copy Markdown
Member

rgrinberg commented Apr 4, 2026

FWIW, we don't necessarily need to watch signals in a separate thread. If watch signals in the main thread works better on macos, then so be it. But we need to do it for all signals. Moreover, we need to justify why we can't write the signal handlers in OCaml.

Signed-off-by: Antonio Nuno Monteiro <anmonteiro@gmail.com>
@anmonteiro anmonteiro force-pushed the anmonteiro/fix-signal-watcher-race branch 5 times, most recently from 95c7fe6 to 1076fb9 Compare April 5, 2026 06:25
Signed-off-by: Antonio Nuno Monteiro <anmonteiro@gmail.com>
@anmonteiro anmonteiro force-pushed the anmonteiro/fix-signal-watcher-race branch from 1076fb9 to c5c65a8 Compare April 5, 2026 06:35
@anmonteiro
Copy link
Copy Markdown
Collaborator Author

anmonteiro commented Apr 5, 2026

I haven't found a good way to write this:

so far, the only solution that i've seen work well is handling only SIGCHLD via the self-pipe with a sig handler in C stubs, and handling the remaining signals in OCaml like before.

so i'm proceeding with the mitigation in #14047, which at least gets Dune unstuck.

anmonteiro added a commit that referenced this pull request Apr 5, 2026
alternative to #14023

Signed-off-by: Antonio Nuno Monteiro <anmonteiro@gmail.com>
@anmonteiro anmonteiro marked this pull request as draft April 5, 2026 21:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants