Skip to content

fix: prevent graceful shutdown errors from killing Envoy proxy immediately#1072

Merged
santoshpulluri merged 2 commits into
mainfrom
spulluri/fixing_bugs_dp
Apr 28, 2026
Merged

fix: prevent graceful shutdown errors from killing Envoy proxy immediately#1072
santoshpulluri merged 2 commits into
mainfrom
spulluri/fixing_bugs_dp

Conversation

@santoshpulluri
Copy link
Copy Markdown
Contributor

@santoshpulluri santoshpulluri commented Apr 28, 2026

Problem

During graceful shutdown, any error from DumpConfig(), Drain(), or Quit() calls close(errorExitCh). This channel is monitored by the Run() select loop in consul_dataplane.go:

case <-cdp.lifecycleConfig.lifecycleServerExited():
    // Calls Quit() + Kill() immediately

This means a transient error during shutdown (e.g., Envoy not yet ready, network timeout, IPv6 URL issue) bypasses the entire grace period and kills Envoy at 0ms — exactly the symptom seen in TestConnectInject_ProxyLifecycleShutdown acceptance test failures where curl reports Connection refused immediately.

Sequence of events (before fix)

SIGTERM → gracefulShutdown()
  → DumpConfig() fails (any reason)
  → close(errorExitCh)                    ← fires immediately
  → Run() select: lifecycleServerExited()
  → proxy.Quit() + proxy.Kill()           ← Envoy killed at 0ms
  → all connections refused instantly

Root cause

errorExitCh is intended to signal that the lifecycle server itself has crashed (e.g., ListenAndServe fails). It should NOT be closed for transient errors during an already-in-progress graceful shutdown. The shutdown path should log errors and continue — the grace period must be honored regardless.

Fix

Removed close(errorExitCh) from all three error paths in gracefulShutdown():

  • DumpConfig() error
  • Drain() error
  • Quit() error

Errors are still logged as warnings. The shutdown continues through the full grace period as intended.

Testing

Added 3 unit tests in lifecycle_test.go that reproduce the bug deterministically:

Test Verifies
TestGracefulShutdown_DumpConfigError_DoesNotKillProxy DumpConfig() failure does not close errorExitCh
TestGracefulShutdown_DrainError_DoesNotKillProxy Drain() failure does not close errorExitCh
TestGracefulShutdown_QuitError_DoesNotKillProxy Quit() failure does not close errorExitCh

Also fixed pre-existing issues in the test mock:

  • mockProxy methods now return error fieldsDrain(), Quit(), DumpConfig() previously always returned nil, meaning error paths were never exercised
  • Fixed data race — mock counters changed from int to sync/atomic.Int32 to eliminate races flagged by go test -race

Related PRs

All three are needed together to fully fix TestConnectInject_ProxyLifecycleShutdown.

@santoshpulluri santoshpulluri requested review from a team as code owners April 28, 2026 11:25
@santoshpulluri santoshpulluri force-pushed the spulluri/fixing_bugs_dp branch from 082f860 to 5d14554 Compare April 28, 2026 11:29
@santoshpulluri santoshpulluri added backport/1.8 Changes are backported to 1.8 backport/1.9 Changes are backported to 1.9 backport/2.0 Changes are backported to 2.0 labels Apr 28, 2026
@santoshpulluri santoshpulluri enabled auto-merge (squash) April 28, 2026 12:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backport/1.8 Changes are backported to 1.8 backport/1.9 Changes are backported to 1.9 backport/2.0 Changes are backported to 2.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants