feat(prefork): graceful shutdown, leak fixes, hook robustness (re-open of #2180 follow-up) #2199

Open: ReneWerner87 wants to merge 5 commits into valyala:master from ReneWerner87:prefork_graceful_shutdown

Conversation

@ReneWerner87
Contributor

Re-opens the follow-up commit that was applied to master right before #2180 was merged and then reverted in e9208ec because @erikdubbelboer hadn't reviewed it. Splitting it back out so it can be reviewed on its own merits.

Original PR (merged): #2180
Reverted commit: 262ea09 (reverted in e9208ec)


Original summary

Addresses outstanding review concerns from #2180 plus several adjacent issues surfaced during a follow-up review pass.

Lifecycle / supervision

  • Track every per-child Wait goroutine via sync.WaitGroup and unblock pending sigCh sends by cancelling a context (context.WithCancel) so early-return paths (OnChildSpawn / OnMasterReady error, recovery doCommand error, ErrOverRecovery) can no longer leak goroutines or stall children.
  • Install signal.Notify(SIGTERM, SIGINT) in the master so deploy / rolling-restart signals enter the shutdown path instead of killing the master without graceful teardown.
  • Replace the unconditional SIGKILL defer with a SIGTERM-then-SIGKILL sequence gated by a configurable ShutdownGracePeriod (defaults to 5s; the Windows path stays SIGKILL since Process.Signal(SIGTERM) is unsupported there).
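
The goroutine-tracking item above can be illustrated with a minimal sketch. It is not the PR's code: the childExit type, watchChild, and superviseAndAbort are hypothetical names, and the wait callback stands in for a real child's Wait call. It only shows how a WaitGroup plus a cancelled context lets a supervisor return early without stranding senders on an unbuffered channel.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// childExit is a hypothetical message type; the real sigCh payload may differ.
type childExit struct {
	pid int
	err error
}

// watchChild stands in for a per-child Wait goroutine. wait blocks until the
// child exits; the result is delivered on sigCh unless ctx is cancelled first.
func watchChild(ctx context.Context, wg *sync.WaitGroup, sigCh chan<- childExit, pid int, wait func() error) {
	wg.Add(1)
	go func() {
		defer wg.Done()
		err := wait()
		select {
		case sigCh <- childExit{pid: pid, err: err}:
		case <-ctx.Done(): // supervisor returned early; do not block forever
		}
	}()
}

// superviseAndAbort starts n watchers and then takes an early-return path
// without ever reading sigCh. It reports whether every watcher goroutine
// exited within the timeout.
func superviseAndAbort(n int, timeout time.Duration) bool {
	ctx, cancel := context.WithCancel(context.Background())
	var wg sync.WaitGroup
	sigCh := make(chan childExit) // unbuffered: a send blocks without a reader

	for i := 0; i < n; i++ {
		watchChild(ctx, &wg, sigCh, 1000+i, func() error { return nil })
	}

	cancel() // early-return path: release any pending senders

	done := make(chan struct{})
	go func() {
		wg.Wait()
		close(done)
	}()
	select {
	case <-done:
		return true
	case <-time.After(timeout):
		return false
	}
}

func main() {
	fmt.Println("all watchers exited:", superviseAndAbort(4, time.Second))
}
```

Without the ctx.Done() case, each watcher would block on the send once the supervisor stops reading, which is the leak the item describes.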

API

  • OnChildRecover now returns error so callers can implement recovery policies (circuit-breaker etc.); panics in any hook are recovered and surfaced as the returned error, with diagnostic logging.
  • Add RecoverInterval (optional crash-loop backoff) and ShutdownGracePeriod fields with safe zero-value defaults — both are opt-in, existing behaviour is preserved when zero.
  • Export ErrCommandProducerNilCmd and ErrCommandProducerNotStarted sentinel errors so callers can errors.Is them.
  • Rename oldPid/newPid to oldPID/newPID per Go initialism convention.
  • Add an explicit compile-time var _ Logger = fasthttp.Logger(nil) check so the local Logger interface stays in sync with fasthttp.Logger.
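
As a rough illustration of the first API item, the sketch below shows a hook wrapper that turns a panic into a returned error, and a circuit-breaker style recovery callback. The callback signature (oldPID, newPID int) error, the breaker helper, and ErrBreakerOpen are assumptions made for this example; invokeHook is named in this PR, but its body here is a guess, not the actual implementation.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrBreakerOpen is a hypothetical sentinel used only in this sketch.
var ErrBreakerOpen = errors.New("recovery circuit breaker open")

// invokeHook runs hook and converts a panic into a returned error, so a
// misbehaving callback cannot take the supervisor down.
func invokeHook(name string, hook func() error) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("%s hook panicked: %v", name, r)
		}
	}()
	if hook == nil {
		return nil
	}
	return hook()
}

// breaker returns a recovery callback that allows max recoveries and then
// fails with ErrBreakerOpen. The signature is assumed, not taken from the PR.
func breaker(max int) func(oldPID, newPID int) error {
	count := 0
	return func(oldPID, newPID int) error {
		count++
		if count > max {
			return fmt.Errorf("child %d -> %d: %w", oldPID, newPID, ErrBreakerOpen)
		}
		return nil
	}
}

func main() {
	onRecover := breaker(1)
	for i := 0; i < 2; i++ {
		err := invokeHook("OnChildRecover", func() error {
			return onRecover(100+i, 200+i)
		})
		fmt.Println(err, errors.Is(err, ErrBreakerOpen))
	}

	// A panicking hook is reported as an ordinary error.
	err := invokeHook("OnChildSpawn", func() error { panic("boom") })
	fmt.Println(err)
}
```

Wrapping the sentinel with %w keeps it matchable through errors.Is, the same property the exported ErrCommandProducer* errors rely on.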

Resource hygiene

  • Master now closes both the original tcpListener and the duped fd in p.files when prefork() returns; previously the duped fd leaked once per call.
  • doCommand wraps every error path with %w + fmt.Errorf so caller-side diagnostics keep stage context.
  • Strip pre-existing FASTHTTP_PREFORK_CHILD entries before appending so child env never carries duplicate keys.
  • Extract magic numbers as package constants (inheritedListenerFD, masterPollInterval, defaultShutdownGracePeriod, preforkChildEnvValue).
  • Name the inherited listener fd via os.NewFile so net.FileListener errors are diagnosable.
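
Two of the items above are easy to sketch in isolation: de-duplicating the child marker in the environment, and wrapping errors with stage context. The helper names childEnv and wrapStage, the error text, and the marker value "1" are assumptions for illustration; only the FASTHTTP_PREFORK_CHILD key comes from the description.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

const (
	childEnvKey   = "FASTHTTP_PREFORK_CHILD"
	childEnvValue = "1" // assumed value; the PR only names a preforkChildEnvValue constant
)

// childEnv copies parent, drops every existing childEnvKey entry, and appends
// exactly one fresh entry so the child never sees duplicate keys.
func childEnv(parent []string) []string {
	prefix := childEnvKey + "="
	env := make([]string, 0, len(parent)+1)
	for _, kv := range parent {
		if strings.HasPrefix(kv, prefix) {
			continue
		}
		env = append(env, kv)
	}
	return append(env, prefix+childEnvValue)
}

// errBase is a stand-in for a lower-level failure.
var errBase = errors.New("base failure")

// wrapStage adds stage context while keeping the cause matchable via errors.Is.
func wrapStage(stage string, err error) error {
	return fmt.Errorf("prefork: %s: %w", stage, err)
}

func main() {
	env := childEnv([]string{"PATH=/bin", "FASTHTTP_PREFORK_CHILD=1", "FASTHTTP_PREFORK_CHILD=0"})
	fmt.Println(env)

	err := wrapStage("start command", errBase)
	fmt.Println(err, errors.Is(err, errBase))
}
```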

Tests

  • Migrate to t.Setenv (drop the global setUp/tearDown helpers) — fixes the env-mutation-vs-parallel race.
  • Replace the rand.Intn port helper with :0 + Listener.Addr() to remove port-collision flakes under -count / parallel runs.
  • Collapse the three near-identical Test_ListenAndServe* tests into a single table-driven subtest that actually asserts the args forwarded to ServeFunc / ServeTLSFunc / ServeTLSEmbedFunc.
  • Add coverage for the previously untested branches: CommandProducer returning err / nil cmd / unstarted cmd, initial OnChildSpawn error, OnMasterReady error, hook panic surfacing, RecoverInterval enforcement.
  • noopChildProducer helper kills + waits any spawned child binaries during cleanup so failed tests no longer leave subprocesses around.
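
The port-helper change can be sketched as follows. listenEphemeral is a hypothetical name, and binding to 127.0.0.1 is a choice made for this example; the point is only that port 0 lets the kernel pick a free port and Listener.Addr() reports which one it chose.

```go
package main

import (
	"fmt"
	"net"
)

// listenEphemeral binds to port 0 so the OS assigns a free port, then returns
// the listener together with the concrete address it ended up on.
func listenEphemeral() (net.Listener, string, error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return nil, "", err
	}
	return ln, ln.Addr().String(), nil
}

func main() {
	ln, addr, err := listenEphemeral()
	if err != nil {
		fmt.Println("listen failed:", err)
		return
	}
	defer ln.Close()
	fmt.Println("listening on", addr)
}
```

Because the port is never guessed, two parallel tests cannot pick the same one, which is what a random-number helper cannot guarantee.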

Local run on this branch: go test -race ./prefork/... and golangci-lint run ./prefork/... are both clean. Happy to split this into smaller PRs if you'd prefer to land the goroutine-leak fix and the API change separately from the new fields.

Addresses outstanding review concerns and several adjacent issues
surfaced during a follow-up review pass.

Lifecycle / supervision
- Track every per-child Wait goroutine via sync.WaitGroup and unblock
  pending sigCh sends by cancelling a context (context.WithCancel) so
  early-return paths (OnChildSpawn / OnMasterReady error, recovery
  doCommand error, ErrOverRecovery) can no longer leak goroutines or
  stall children.
- Install signal.Notify(SIGTERM, SIGINT) in the master so deploy/
  rolling-restart signals enter the shutdown path instead of killing
  the master without graceful teardown.
- Replace the unconditional SIGKILL defer with a SIGTERM-then-SIGKILL
  sequence gated by a configurable ShutdownGracePeriod (defaults to 5s,
  Windows path stays SIGKILL since Signal(SIGTERM) is unsupported).

API
- OnChildRecover now returns error so callers can implement recovery
  policies (circuit-breaker etc.); panic in any hook is recovered and
  surfaced as the returned error, with diagnostic logging.
- Add RecoverInterval (optional crash-loop backoff) and
  ShutdownGracePeriod fields with safe zero-value defaults.
- Export ErrCommandProducerNilCmd and ErrCommandProducerNotStarted
  sentinel errors so callers can errors.Is them.
- Rename oldPid/newPid to oldPID/newPID per Go initialism convention.
- Logger interface now declares an explicit compile-time compatibility
  check with fasthttp.Logger.

Resource hygiene
- Master closes both the original tcpListener and the duped fd in
  p.files when prefork() returns; previously the duped fd leaked once
  per call.
- doCommand wraps every error path with %w + fmt.Errorf so caller-side
  diagnostics keep stage context.
- Strip pre-existing FASTHTTP_PREFORK_CHILD entries before appending so
  child env never carries duplicate keys.
- Extract magic numbers as package constants
  (inheritedListenerFD, masterPollInterval, defaultShutdownGracePeriod,
  preforkChildEnvValue).
- Name the inherited listener fd via os.NewFile so net.FileListener
  errors are diagnosable.

Tests
- Migrate to t.Setenv (drop the global setUp/tearDown helpers) — fixes
  the env-mutation-vs-parallel race.
- Replace rand.Intn port helper with `:0` + Listener.Addr() to remove
  port-collision flakes under -count and parallel runs.
- Collapse the three near-identical Test_ListenAndServe* tests into a
  single table-driven subtest that actually asserts the args forwarded
  to ServeFunc/ServeTLSFunc/ServeTLSEmbedFunc.
- Add coverage for the previously untested branches:
  CommandProducer returning err / nil cmd / unstarted cmd,
  initial OnChildSpawn error, OnMasterReady error,
  hook panic surfacing, RecoverInterval enforcement.
- noopChildProducer helper kills + waits any spawned child binaries
  during cleanup so failed tests no longer leave subprocesses around.
Copilot AI review requested due to automatic review settings May 2, 2026 14:13

Copilot AI left a comment

Pull request overview

This PR revisits the prefork package lifecycle/supervision logic to make shutdown and recovery more robust (avoid goroutine/fd leaks, improve signal handling), while expanding the hook API and strengthening test coverage around failure modes.

Changes:

  • Add master signal handling plus a structured shutdown path (SIGTERM → grace → SIGKILL) and goroutine-leak prevention for per-child Wait routines.
  • Harden hook execution (panic recovery) and extend the public API (new fields + exported sentinel errors).
  • Refactor and expand tests to reduce flakes (no random ports / env races) and cover new/previously-untested branches.
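
The SIGTERM → grace → SIGKILL path can be sketched without real processes. In the example below, stopWithGrace and its callback parameters are hypothetical: terminate stands in for sending SIGTERM, kill for SIGKILL, and exited for a channel closed when the child's Wait returns. It is an illustration of the escalation logic under those assumptions, not the PR's implementation.

```go
package main

import (
	"fmt"
	"time"
)

// stopWithGrace asks a child to stop via terminate, waits up to grace for
// exited to close, and escalates to kill if it does not. The bool reports
// whether escalation was needed.
func stopWithGrace(terminate, kill func() error, exited <-chan struct{}, grace time.Duration) (bool, error) {
	if err := terminate(); err != nil {
		// Polite signal could not be delivered; fall back to the hard kill.
		return true, kill()
	}
	timer := time.NewTimer(grace)
	defer timer.Stop()
	select {
	case <-exited:
		return false, nil
	case <-timer.C:
		return true, kill()
	}
}

func main() {
	// Cooperative child: exits as soon as it is asked to terminate.
	coop := make(chan struct{})
	killed, err := stopWithGrace(
		func() error { close(coop); return nil },
		func() error { return nil },
		coop, time.Second,
	)
	fmt.Println("cooperative child killed:", killed, "err:", err)

	// Stubborn child: ignores terminate, so the grace period expires.
	stubborn := make(chan struct{})
	killed, err = stopWithGrace(
		func() error { return nil },
		func() error { close(stubborn); return nil },
		stubborn, 10*time.Millisecond,
	)
	fmt.Println("stubborn child killed:", killed, "err:", err)
}
```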

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

File | Description
prefork/prefork.go | Adds shutdown orchestration, hook recovery, new exported errors/config knobs, and resource-hygiene improvements.
prefork/prefork_test.go | Refactors tests for determinism and adds coverage for new error paths, hook behavior, and recovery timing.

- listen(): the *os.File wrapping the inherited fd was never closed.
  net.FileListener dups the fd, so the original was leaking on every
  child startup. Close it explicitly and return the dup'd listener.

- setTCPListenerFiles(): if tcpListener.File() failed, the bound
  net.Listener stayed open and p.ln pointed at it. Close the listener
  on the error path and only assign p.ln after the dup succeeds.

- prefork(): replace time.After in the RecoverInterval branch with a
  time.NewTimer that we Stop+drain when a shutdown signal wins the
  select, so the timer goroutine and channel allocation don't linger
  during crash-loop shutdown.

- invokeHook(): drop the panic log line. The hook caller logs the
  returned error already, so logging in the recover block produced
  duplicate output for the same panic.
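
The timer change in the prefork() item can be sketched like this. waitOrShutdown is a hypothetical helper, not the function in the PR, and the non-blocking drain is a conservative choice made for this example so the code cannot block on any Go version.

```go
package main

import (
	"fmt"
	"time"
)

// waitOrShutdown waits for interval unless shutdown fires first. It returns
// true when the full interval elapsed and false when shutdown won the select.
func waitOrShutdown(interval time.Duration, shutdown <-chan struct{}) bool {
	timer := time.NewTimer(interval)
	select {
	case <-timer.C:
		return true
	case <-shutdown:
		// Release the timer right away instead of letting it run to expiry.
		if !timer.Stop() {
			// Timer already fired: drain without blocking.
			select {
			case <-timer.C:
			default:
			}
		}
		return false
	}
}

func main() {
	shutdown := make(chan struct{})
	close(shutdown) // simulate a shutdown signal that has already arrived
	fmt.Println("interval elapsed:", waitOrShutdown(time.Hour, shutdown))

	fmt.Println("interval elapsed:", waitOrShutdown(time.Millisecond, make(chan struct{})))
}
```

With time.After there is no handle to stop, so a long RecoverInterval would keep its timer alive until expiry even after shutdown has been requested.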