[client] Fix ICE reconnection loop and buffer candidates before agent initialization#5805
MichaelUray wants to merge 4 commits into netbirdio:main from
Conversation
…public IP

When two peers are behind the same NAT (same public IP), ICE P2P connections could never be established because of a reconnection loop:

1. Guard on peer A sees no P2P connection and sends a new offer
2. Peer B receives the offer with a new session ID
3. Peer B cancels its active ICE agent and creates a new one
4. Before the new ICE agent can complete connectivity checks, peer B's guard sends a new offer to peer A
5. Peer A cancels its ICE agent → cycle repeats indefinitely

The fix skips incoming offers with different session IDs while the ICE agent is actively connecting (agentConnecting == true). This lets the current ICE negotiation run to completion (success or timeout) before accepting new offers.

This change is backward compatible: older clients will simply have their offers deferred until the current ICE attempt finishes (max 6 seconds on timeout). P2P between peers at different locations (different public IPs) is unaffected, as ICE succeeds before the guard timer fires.

Fixes: netbirdio#3669
Related: netbirdio#2703
Related: netbirdio#3339
Related: netbirdio#4031
Related: netbirdio#2159
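The guard described above can be sketched as a small predicate. This is a minimal model, not the actual netbird code: the struct fields and method name (`sessionID`, `agentConnecting`, `shouldReplaceAgent`) are hypothetical stand-ins patterned on the PR's `WorkerICE`.

```go
package main

import "fmt"

// workerICE is a reduced model of the PR's WorkerICE; the real struct
// also carries the ICE agent, a mutex, and a logger.
type workerICE struct {
	sessionID       string
	agentConnecting bool
}

// shouldReplaceAgent mirrors the fixed decision: an offer with a
// different session ID replaces the agent only when no connection
// attempt is in flight, which breaks the mutual-cancellation loop.
func (w *workerICE) shouldReplaceAgent(remoteSessionID string) bool {
	if remoteSessionID == w.sessionID {
		return false // same negotiation, nothing to replace
	}
	if w.agentConnecting {
		return false // let the current ICE attempt finish first
	}
	return true
}

func main() {
	w := &workerICE{sessionID: "s1", agentConnecting: true}
	fmt.Println(w.shouldReplaceAgent("s2")) // false: attempt in progress
	w.agentConnecting = false
	fmt.Println(w.shouldReplaceAgent("s2")) // true: safe to replace
}
```

The key property is that the "different session ID" check alone is no longer enough to cancel the running agent; the connecting flag takes priority.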
📝 Walkthrough

Buffer remote ICE candidates in WorkerICE until an ICE agent exists; centralize offer-replacement logic to avoid replacing agents during connection attempts; on successful agent creation, flush buffered candidates into the new agent; clear the buffer on agent close.

Changes
Sequence Diagram

```mermaid
sequenceDiagram
    participant Client
    participant WorkerICE as WorkerICE
    participant PendingBuf as PendingBuffer
    participant Agent as ICE Agent
    Client->>WorkerICE: OnRemoteCandidate (agent = nil)
    WorkerICE->>PendingBuf: Append candidate (bounded)
    Note over WorkerICE,PendingBuf: Buffer candidates until an agent exists
    Client->>WorkerICE: OnNewOffer
    WorkerICE->>WorkerICE: shouldReplaceAgent? / maybe replaceExistingAgent
    WorkerICE->>Agent: Create & initialize agent
    Agent-->>WorkerICE: Agent ready
    WorkerICE->>PendingBuf: Retrieve buffered candidates
    loop For each buffered candidate
        WorkerICE->>Agent: AddRemoteCandidate
    end
    PendingBuf-->>WorkerICE: Buffer cleared
    Client->>WorkerICE: OnRemoteCandidate (agent != nil)
    WorkerICE->>Agent: AddRemoteCandidate (direct)
    Client->>WorkerICE: Close
    WorkerICE->>WorkerICE: closeAgent (close agent, clear pending buffer)
```
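The flow above can be sketched as a reduced model. The type, fields, and cap constant here (`candidateSink`, `maxPending`) are illustrative stand-ins, not the actual WorkerICE API.

```go
package main

import "fmt"

const maxPending = 50 // defensive cap, matching the PR's follow-up commit

// candidateSink models the buffering behavior: candidates arriving
// before the agent exists are parked in pending; once the agent is
// ready, the backlog is flushed in arrival order and then cleared.
type candidateSink struct {
	agentReady bool
	agent      []string // stands in for the real ICE agent's remote candidate set
	pending    []string
}

func (s *candidateSink) onRemoteCandidate(c string) {
	if !s.agentReady {
		if len(s.pending) >= maxPending {
			return // buffer full: drop to bound memory use
		}
		s.pending = append(s.pending, c)
		return
	}
	s.agent = append(s.agent, c) // live path: hand straight to the agent
}

func (s *candidateSink) onAgentCreated() {
	s.agentReady = true
	s.agent = append(s.agent, s.pending...) // flush the backlog
	s.pending = nil                         // clear so stale candidates cannot leak
}

func main() {
	s := &candidateSink{}
	s.onRemoteCandidate("srflx-1") // arrives before OnNewOffer builds the agent
	s.onRemoteCandidate("host-1")
	s.onAgentCreated()
	s.onRemoteCandidate("relay-1") // arrives after: direct path
	fmt.Println(s.agent)           // [srflx-1 host-1 relay-1]
}
```

Clearing `pending` on agent creation (and on close, per the walkthrough) is what keeps candidates from one session out of the next.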
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 3 passed
🧹 Nitpick comments (1)
client/internal/peer/worker_ice.go (1)
184-189: Consider adding a buffer size limit as a defensive measure.

The buffer could theoretically grow unbounded if candidates keep arriving without agent creation (e.g., due to signaling issues). While unlikely in practice, a simple cap would prevent memory accumulation in pathological cases.
♻️ Optional: Add buffer cap
```diff
+const maxPendingCandidates = 50
+
 func (w *WorkerICE) OnRemoteCandidate(candidate ice.Candidate, haRoutes route.HAMap) {
 	w.muxAgent.Lock()
 	defer w.muxAgent.Unlock()
 	w.log.Debugf("OnRemoteCandidate from peer %s -> %s", w.config.Key, candidate.String())
 	if w.agent == nil {
+		if len(w.pendingCandidates) >= maxPendingCandidates {
+			w.log.Warnf("pending candidate buffer full, dropping candidate: %s", candidate.Type())
+			return
+		}
 		w.log.Infof("ICE Agent not ready, buffering remote candidate: %s", candidate.Type())
 		w.pendingCandidates = append(w.pendingCandidates, candidate)
 		return
 	}
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@client/internal/peer/worker_ice.go` around lines 184 - 189, The pendingCandidates slice can grow unbounded when w.agent == nil; add a fixed cap (e.g., MAX_PENDING_CANDIDATES constant) and enforce it in the buffering branch inside the OnNewCandidate / handler where w.pendingCandidates is appended: if len(w.pendingCandidates) >= MAX_PENDING_CANDIDATES, either drop the oldest (shift) or drop the new candidate and log a warning via w.log.Infof/warnf, otherwise append; ensure the constant and behavior are documented in comments and referenced where pendingCandidates is created/cleared (use symbols w.pendingCandidates, w.agent, candidate).
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: aabe6227-0bce-4e6d-abac-3b3c6358dde1
📒 Files selected for processing (1)
client/internal/peer/worker_ice.go
…ation

When both peers exchange offers simultaneously, remote ICE candidates can arrive via signaling before the local ICE agent is created in OnNewOffer(). Previously these candidates were silently dropped with "ICE Agent is not initialized yet", causing ICE to fail because the agent never learned about the remote peer's network addresses.

This adds a pendingCandidates buffer that collects early candidates and flushes them into the agent immediately after creation. The buffer is cleared when the agent is closed to prevent stale candidates from leaking into a new session.

This race condition particularly affects peers behind the same NAT, where the signaling round-trip is very fast.
Force-pushed 28e3a06 to 64e91ae (Compare)
Addresses coderabbitai review on PR netbirdio#5805: the pendingCandidates slice could theoretically grow unbounded if remote candidates keep arriving before the ICE agent is created (e.g. due to signaling issues). Add a fixed cap of 50 candidates and drop further candidates with a warning log to protect against memory accumulation in pathological scenarios.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@client/internal/peer/worker_ice.go`:
- Around line 68-69: Buffered remote candidates stored in pendingCandidates
bypass the normal pipeline (candidateViaRoutes and shouldAddExtraCandidate),
causing different behavior depending on arrival timing; fix by applying the same
routing/filtering before buffering and by replaying buffered entries through the
same helper used by the live path: call candidateViaRoutes() and run
shouldAddExtraCandidate() (or the unified helper that encapsulates both) for
each candidate before pushing to pendingCandidates, and when draining
pendingCandidates later invoke the same AddRemoteCandidate() path used for live
arrivals so routed/filtered/synthetic-related-port candidates are handled
identically (update functions: pendingCandidates, candidateViaRoutes,
shouldAddExtraCandidate, AddRemoteCandidate).
@MichaelUray Thank you for the PR. The race you're protecting against (OnRemoteCandidate arriving before the agent is created) requires very specific timing: the remote peer would need to create its agent, start gathering, discover a candidate, and signal it back, all before our handshaker goroutine reads the offer from its channel and calls OnNewOffer. That's microseconds (channel read) vs tens-to-hundreds of milliseconds (STUN/TURN network I/O for candidate gathering), so in practice this almost never happens. If we ever wanted to truly fix this race, the correct approach would be to remove the `go` from engine.go:1612. But of course it is a risky change.
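pappz's alternative (handling the offer synchronously instead of spawning a goroutine) can be illustrated with a reduced dispatcher. The channel and handler names here are hypothetical; this is not the actual engine.go code.

```go
package main

import "fmt"

// dispatchOffers drains an offer channel. Handling each offer inline
// (no `go handle(o)`) guarantees the handler finishes -- and the ICE
// agent exists -- before the next signaling message is read, closing
// the window in which a remote candidate can arrive agent-less.
func dispatchOffers(offers <-chan string, handle func(string)) {
	for o := range offers {
		handle(o) // synchronous: ordering relative to later messages is preserved
	}
}

func main() {
	offers := make(chan string, 2)
	offers <- "offer-1"
	offers <- "offer-2"
	close(offers)

	var handled []string
	dispatchOffers(offers, func(o string) { handled = append(handled, o) })
	fmt.Println(handled) // [offer-1 offer-2]
}
```

The risk pappz flags is real: a slow handler now blocks the whole signaling loop, which is why the PR opted for buffering instead.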
The session ID mechanism exists for two specific reasons:
By skipping all offers with a different session ID while
…andidates

Address coderabbitai review on PR netbirdio#5805: previously buffered candidates bypassed candidateViaRoutes() (route-loop filter) and shouldAddExtraCandidate() (extra server-reflexive helper), so a candidate arriving before the agent existed could be added later in a state inconsistent with a candidate arriving after the agent existed. Worse, the buffered path missed the extra-srflx-with-related-port candidate that the live path adds.

Refactor OnRemoteCandidate to compute the full list of candidates that should reach the agent (the routed-out filter and the optional extra srflx) before deciding whether to buffer or add immediately. The same filter and extension now apply unconditionally; the only difference between the two paths is whether we hand the result to the agent right now or stash it for the next flush.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
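The shape of that refactor can be sketched as a pure function. `routedOut` and `extraSrflx` are hypothetical stand-ins for candidateViaRoutes() and shouldAddExtraCandidate(), not their real signatures.

```go
package main

import "fmt"

// candidatesToAdd applies the same filter and extension regardless of
// whether an agent exists yet; only the destination (agent vs. buffer)
// differs afterwards.
func candidatesToAdd(c string, routedOut func(string) bool, extraSrflx func(string) (string, bool)) []string {
	if routedOut(c) {
		return nil // filtered: candidate is reachable via an HA route
	}
	toAdd := []string{c}
	if extra, ok := extraSrflx(c); ok {
		toAdd = append(toAdd, extra) // extra srflx built from the related port
	}
	return toAdd
}

func main() {
	routedOut := func(c string) bool { return c == "routed" }
	extraSrflx := func(c string) (string, bool) {
		if c == "srflx" {
			return "srflx+related", true
		}
		return "", false
	}
	fmt.Println(candidatesToAdd("routed", routedOut, extraSrflx)) // []
	fmt.Println(candidatesToAdd("srflx", routedOut, extraSrflx))  // [srflx srflx+related]
}
```

Because filtering happens before the buffer/add branch, a candidate's fate no longer depends on whether it arrived before or after agent creation.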
Actionable comments posted: 2
🧹 Nitpick comments (1)
client/internal/peer/worker_ice.go (1)
238-247: Minor: Log message may undercount dropped candidates.

When the buffer is full, the `return` exits the function immediately. If `toAdd` contains multiple candidates (original + extra srflx), all remaining candidates in the slice are dropped but only one is logged. This is a minor logging clarity issue and doesn't affect correctness.
💡 Optional improvement for logging accuracy
```diff
 for _, c := range toAdd {
 	if len(w.pendingCandidates) >= maxPendingCandidates {
-		w.log.Warnf("pending candidate buffer full (%d), dropping candidate: %s",
-			maxPendingCandidates, c.Type())
+		w.log.Warnf("pending candidate buffer full (%d), dropping %d candidate(s)",
+			maxPendingCandidates, len(toAdd)-len(w.pendingCandidates)+maxPendingCandidates)
 		return
 	}
```

Or simply break out of the loop and log once after counting drops.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@client/internal/peer/worker_ice.go` around lines 238 - 247, The loop over toAdd currently returns on first overflow, underreporting dropped candidates; change the logic in the worker (the loop that appends to w.pendingCandidates) to stop iterating when len(w.pendingCandidates) >= maxPendingCandidates (use break instead of return), compute how many candidates from toAdd were not buffered (e.g., dropped := len(toAdd) - i or maintain a dropped counter), and emit a single w.log.Warnf that reports the total number of dropped candidates along with maxPendingCandidates and candidate types if desired; ensure you still append buffered candidates to w.pendingCandidates and retain the info log w.log.Infof for buffered entries.
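The break-and-count variant the prompt describes could look like the sketch below, written as a pure function (rather than the real method with its logger) so the drop count is easy to check. The function name and signature are hypothetical.

```go
package main

import "fmt"

const maxPendingCandidates = 50

// bufferCandidates appends as many candidates as fit and reports how
// many were dropped, so a single warning can state the true count.
func bufferCandidates(pending, toAdd []string) (buffered []string, dropped int) {
	buffered = pending
	for i, c := range toAdd {
		if len(buffered) >= maxPendingCandidates {
			dropped = len(toAdd) - i // everything from here on is dropped
			break
		}
		buffered = append(buffered, c)
	}
	return buffered, dropped
}

func main() {
	pending := make([]string, maxPendingCandidates-1) // one free slot
	out, dropped := bufferCandidates(pending, []string{"host", "srflx", "relay"})
	fmt.Println(len(out), dropped) // 50 2
}
```

With `break` instead of `return`, a partially-full buffer still accepts what fits, and the caller logs one accurate warning instead of one per overflowing candidate.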
```go
// If ICE agent is actively connecting (gathering candidates, dialing),
// don't cancel it for a new offer - let the current attempt finish.
// This prevents a reconnection loop where guards on both sides keep
// sending new offers that cancel each other's ICE negotiations,
// particularly when both peers are behind the same NAT (same public IP).
if w.agentConnecting {
	w.log.Infof("ICE agent is actively connecting, skipping new offer with different session: %s", remoteOfferAnswer.SessionIDString())
	return false
}
```
Acknowledge the crash-recovery trade-off documented in PR comments.
The agentConnecting guard (lines 162-165) prevents the reconnection loop but can delay recovery when the remote peer crashes and reconnects with a new session ID. The current ICE attempt must complete (success or timeout, potentially 30s+) before processing the new offer.
This is the trade-off discussed by reviewer pappz. Consider whether the ICE timeout duration is acceptable for crash-recovery scenarios in your deployment, or if a shorter timeout/early-abort mechanism could mitigate the delay.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@client/internal/peer/worker_ice.go` around lines 157 - 165, The current guard
w.agentConnecting unconditionally skips new offers while an ICE attempt is in
progress, which avoids reconnection loops but can delay recovery when a remote
peer crashes and reconnects with a new session; change this by making the
blocking behavior conditional or timeboxed: add a configurable short timeout or
an early-abort check that compares the incoming
remoteOfferAnswer.SessionIDString() to the session involved in the current ICE
attempt and, if they differ, either abort the ongoing attempt (invoke the
existing ICE cancel/cleanup routine) or allow the new offer to preempt after a
configurable threshold (e.g., w.iceConnectingTimeout or a method like
abortConnectingIfNewSession(remoteSessionID)); update logs via
w.log.Infof/w.log.Warnf to record preemption/timeout decisions.
```go
sessionID, err := NewICESessionID()
if err != nil {
	w.log.Errorf("failed to create new session ID: %s", err)
}
w.sessionID = sessionID
```
Error handling continues with potentially stale session ID.
If NewICESessionID() fails, the code logs an error but continues, leaving w.sessionID with its previous value rather than being explicitly cleared or set to a known-invalid state. This could lead to session ID collisions or unexpected behavior.
Consider whether to return an error or set w.sessionID to an empty/invalid value on failure.
💡 Suggested improvement
```diff
 sessionID, err := NewICESessionID()
 if err != nil {
 	w.log.Errorf("failed to create new session ID: %s", err)
+	w.sessionID = ""
+} else {
+	w.sessionID = sessionID
 }
-w.sessionID = sessionID
```

📝 Committable suggestion
‼️ IMPORTANT: Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```diff
-sessionID, err := NewICESessionID()
-if err != nil {
-	w.log.Errorf("failed to create new session ID: %s", err)
-}
-w.sessionID = sessionID
+sessionID, err := NewICESessionID()
+if err != nil {
+	w.log.Errorf("failed to create new session ID: %s", err)
+	w.sessionID = ""
+} else {
+	w.sessionID = sessionID
+}
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@client/internal/peer/worker_ice.go` around lines 180 - 184, The current call
to NewICESessionID() logs an error but leaves w.sessionID unchanged on failure;
update the error handling so you do not keep a stale ID: only assign w.sessionID
= sessionID when err == nil, and on err either set w.sessionID = "" (or a
known-invalid value) or return an error from the enclosing function; modify the
block around NewICESessionID() so it checks err, clears w.sessionID on failure
(or returns the error) and logs the failure with context before exiting.
Summary

- Skip offers with a different session ID while `agentConnecting == true` to let the current ICE attempt complete.
- Add a `pendingCandidates` buffer that collects early candidates and flushes them once the agent is ready.

Both fixes are tightly coupled: without the buffer, the guard-loop fix alone would still miss early candidates.
Checklist
By submitting this pull request, I confirm that I have read and agree to the terms of the Contributor License Agreement.