[client] Fix ICE reconnection loop and buffer candidates before agent initialization#5805

Open
MichaelUray wants to merge 4 commits into netbirdio:main from MichaelUray:fix/ice-guard-loop-and-candidate-buffer

Conversation


@MichaelUray MichaelUray commented Apr 6, 2026

Summary

  • Guard Loop Fix: When both peers are behind the same NAT (same public IP), their connectivity guards fire simultaneously every ~12s, each sending a new offer with a new session ID. This cancels the other side's ICE agent before it can finish negotiating, creating an infinite reconnection loop. Fix: skip new offers while agentConnecting == true to let the current ICE attempt complete.
  • Candidate Buffering: Remote ICE candidates that arrive before the local ICE agent is created were silently dropped. Add a pendingCandidates buffer that collects early candidates and flushes them once the agent is ready.

Both fixes are tightly coupled — without the buffer, the guard loop fix alone would still miss early candidates.
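
A minimal sketch of how the two fixes interact, using simplified stand-in types rather than the real WorkerICE (the field and method names mirror the PR, but the logic is condensed for illustration):

```go
package main

import "fmt"

// worker is a simplified stand-in for WorkerICE; only the fields needed
// to illustrate the guard and the buffer are modeled here.
type worker struct {
	agentConnecting   bool     // an ICE attempt is in flight
	agentReady        bool     // the ICE agent has been created
	pendingCandidates []string // candidates that arrived too early
}

// shouldReplaceAgent mirrors the guard: a different-session offer is
// skipped while the current ICE attempt is still connecting, which
// breaks the mutual-cancellation loop between the two guards.
func (w *worker) shouldReplaceAgent(sameSession bool) bool {
	if sameSession {
		return false // duplicate resend of the current offer
	}
	if w.agentConnecting {
		return false // let the in-flight attempt finish first
	}
	return true
}

// onRemoteCandidate buffers early candidates instead of dropping them;
// they are flushed into the agent once it exists.
func (w *worker) onRemoteCandidate(c string) {
	if !w.agentReady {
		w.pendingCandidates = append(w.pendingCandidates, c)
		return
	}
	fmt.Println("added to agent:", c)
}

func main() {
	w := &worker{agentConnecting: true}
	fmt.Println(w.shouldReplaceAgent(false)) // false: attempt in flight
	w.onRemoteCandidate("srflx 203.0.113.7:51820")
	fmt.Println(len(w.pendingCandidates)) // 1: buffered, not dropped
}
```

The coupling is visible here: without the buffer, a candidate arriving while `shouldReplaceAgent` defers a new offer would be lost just the same.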

Checklist

  • Bug fix
  • Create tests that fail without the change: ICE connections between peers behind the same NAT consistently failed before this fix
  • Documentation not needed — internal ICE behavior change, no user-facing API change

By submitting this pull request, I confirm that I have read and agree to the terms of the Contributor License Agreement.

Summary by CodeRabbit

  • Bug Fixes
    • Buffer early remote connection candidates with a bounded cap to avoid lost candidates and limit memory use.
    • Expand and queue applicable server-reflexive candidates for later delivery instead of dropping them.
    • Avoid unnecessary connection replacements during in-progress handshakes or when sessions match to improve stability.
    • Flush buffered candidates into a newly created connection and clear the buffer on close to prevent stale data.

…public IP

When two peers are behind the same NAT (same public IP), ICE P2P
connections could never be established because of a reconnection loop:

1. Guard on peer A sees no P2P connection and sends a new offer
2. Peer B receives the offer with a new session ID
3. Peer B cancels its active ICE agent and creates a new one
4. Before the new ICE agent can complete connectivity checks,
   peer B's guard sends a new offer to peer A
5. Peer A cancels its ICE agent → cycle repeats indefinitely

The fix skips incoming offers with different session IDs when the
ICE agent is actively connecting (agentConnecting == true). This
lets the current ICE negotiation run to completion (success or
timeout) before accepting new offers.

This change is backward compatible - older clients will simply have
their offers deferred until the current ICE attempt finishes (max
6 seconds on timeout). P2P between peers at different locations
(different public IPs) is unaffected as ICE succeeds before the
guard timer fires.

Fixes: netbirdio#3669
Related: netbirdio#2703
Related: netbirdio#3339
Related: netbirdio#4031
Related: netbirdio#2159

CLAassistant commented Apr 6, 2026

CLA assistant check
All committers have signed the CLA.


coderabbitai bot commented Apr 6, 2026

📝 Walkthrough

Buffer remote ICE candidates in WorkerICE until an ICE agent exists; centralize offer-replacement logic to avoid replacing agents during connection attempts; on successful agent creation, flush buffered candidates into the new agent; clear buffer on agent close.

Changes

Cohort: ICE agent & candidate buffering
File(s): client/internal/peer/worker_ice.go
Summary: Add bounded pendingCandidates with maxPendingCandidates; buffer remote candidates when agent is nil (drop beyond cap with warning); introduce shouldReplaceAgent and extract replaceExistingAgent; on new agent creation call flushPendingCandidates and clear buffer in closeAgent; remove prior early-return that dropped candidates when agent was nil.

Sequence Diagram

sequenceDiagram
    participant Client
    participant WorkerICE as WorkerICE
    participant PendingBuf as PendingBuffer
    participant Agent as ICE Agent

    Client->>WorkerICE: OnRemoteCandidate (agent = nil)
    WorkerICE->>PendingBuf: Append candidate (bounded)
    Note over WorkerICE,PendingBuf: Buffer candidates until an agent exists

    Client->>WorkerICE: OnNewOffer
    WorkerICE->>WorkerICE: shouldReplaceAgent? / maybe replaceExistingAgent
    WorkerICE->>Agent: Create & initialize agent
    Agent-->>WorkerICE: Agent ready

    WorkerICE->>PendingBuf: Retrieve buffered candidates
    loop For each buffered candidate
        WorkerICE->>Agent: AddRemoteCandidate
    end
    PendingBuf-->>WorkerICE: Buffer cleared

    Client->>WorkerICE: OnRemoteCandidate (agent != nil)
    WorkerICE->>Agent: AddRemoteCandidate (direct)

    Client->>WorkerICE: Close
    WorkerICE->>WorkerICE: closeAgent (close agent, clear pending buffer)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested reviewers

  • lixmal
  • mlsmaycon

Poem

🐰 I saved the drifted candidates in a cozy heap,
I guarded them gently while the agents sleep,
When a new offer woke the sleepy den,
I sprinkled them in and watched them blend,
Hop—connections stitched, and I thumped my feet.

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed The title directly and clearly summarizes the two main changes: fixing an ICE reconnection loop and buffering candidates before agent initialization.
Description check ✅ Passed The description covers all required template sections: detailed summary of changes, issue context, checklist completion, and CLA confirmation. All critical information is provided.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.


@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (1)
client/internal/peer/worker_ice.go (1)

184-189: Consider adding a buffer size limit as a defensive measure.

The buffer could theoretically grow unbounded if candidates keep arriving without agent creation (e.g., due to signaling issues). While unlikely in practice, a simple cap would prevent memory accumulation in pathological cases.

♻️ Optional: Add buffer cap
+const maxPendingCandidates = 50
+
 func (w *WorkerICE) OnRemoteCandidate(candidate ice.Candidate, haRoutes route.HAMap) {
 	w.muxAgent.Lock()
 	defer w.muxAgent.Unlock()
 	w.log.Debugf("OnRemoteCandidate from peer %s -> %s", w.config.Key, candidate.String())
 	if w.agent == nil {
+		if len(w.pendingCandidates) >= maxPendingCandidates {
+			w.log.Warnf("pending candidate buffer full, dropping candidate: %s", candidate.Type())
+			return
+		}
 		w.log.Infof("ICE Agent not ready, buffering remote candidate: %s", candidate.Type())
 		w.pendingCandidates = append(w.pendingCandidates, candidate)
 		return
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@client/internal/peer/worker_ice.go` around lines 184 - 189, The
pendingCandidates slice can grow unbounded when w.agent == nil; add a fixed cap
(e.g., MAX_PENDING_CANDIDATES constant) and enforce it in the buffering branch
inside the OnNewCandidate / handler where w.pendingCandidates is appended: if
len(w.pendingCandidates) >= MAX_PENDING_CANDIDATES, either drop the oldest
(shift) or drop the new candidate and log a warning via w.log.Infof/warnf,
otherwise append; ensure the constant and behavior are documented in comments
and referenced where pendingCandidates is created/cleared (use symbols
w.pendingCandidates, w.agent, candidate).

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: aabe6227-0bce-4e6d-abac-3b3c6358dde1

📥 Commits

Reviewing files that changed from the base of the PR and between decb5dd and 28e3a06.

📒 Files selected for processing (1)
  • client/internal/peer/worker_ice.go

…ation

When both peers exchange offers simultaneously, remote ICE candidates
can arrive via signaling before the local ICE agent is created in
OnNewOffer(). Previously these candidates were silently dropped with
"ICE Agent is not initialized yet", causing ICE to fail because the
agent never learned about the remote peer's network addresses.

This adds a pendingCandidates buffer that collects early candidates
and flushes them into the agent immediately after creation. The buffer
is cleared when the agent is closed to prevent stale candidates from
leaking into a new session.

This race condition particularly affects peers behind the same NAT
where the signaling round-trip is very fast.
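
The buffer lifecycle this commit describes can be condensed to a small sketch (the string slice below stands in for the pion ICE agent, so these names are illustrative, not the real API):

```go
package main

import "fmt"

// iceWorker models just the buffer lifecycle: collect early candidates,
// flush them when the agent appears, clear everything on close.
type iceWorker struct {
	pending []string
	agent   []string // stand-in for candidates handed to a live agent
}

// flushPendingCandidates hands buffered candidates to the new agent and
// empties the buffer (the real code would call agent.AddRemoteCandidate).
func (w *iceWorker) flushPendingCandidates() {
	w.agent = append(w.agent, w.pending...)
	w.pending = nil
}

// closeAgent clears both the agent and the buffer so candidates from a
// dead session cannot leak into the next one.
func (w *iceWorker) closeAgent() {
	w.agent = nil
	w.pending = nil
}

func main() {
	w := &iceWorker{pending: []string{"host 10.0.0.2", "srflx 198.51.100.4"}}
	w.flushPendingCandidates()
	fmt.Println(len(w.agent), len(w.pending)) // 2 0
	w.closeAgent()
	fmt.Println(w.agent == nil && w.pending == nil) // true
}
```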
@MichaelUray MichaelUray force-pushed the fix/ice-guard-loop-and-candidate-buffer branch from 28e3a06 to 64e91ae on April 6, 2026 16:16
@pappz pappz self-requested a review April 7, 2026 08:16
Addresses coderabbitai review on PR netbirdio#5805: the pendingCandidates slice
could theoretically grow unbounded if remote candidates keep arriving
before the ICE agent is created (e.g. due to signaling issues). Add a
fixed cap of 50 candidates and drop further candidates with a warning
log to protect against memory accumulation in pathological scenarios.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@client/internal/peer/worker_ice.go`:
- Around line 68-69: Buffered remote candidates stored in pendingCandidates
bypass the normal pipeline (candidateViaRoutes and shouldAddExtraCandidate),
causing different behavior depending on arrival timing; fix by applying the same
routing/filtering before buffering and by replaying buffered entries through the
same helper used by the live path: call candidateViaRoutes() and run
shouldAddExtraCandidate() (or the unified helper that encapsulates both) for
each candidate before pushing to pendingCandidates, and when draining
pendingCandidates later invoke the same AddRemoteCandidate() path used for live
arrivals so routed/filtered/synthetic-related-port candidates are handled
identically (update functions: pendingCandidates, candidateViaRoutes,
shouldAddExtraCandidate, AddRemoteCandidate).

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 73288d66-2438-465a-b0b8-92d5cf6eed4e

📥 Commits

Reviewing files that changed from the base of the PR and between 28e3a06 and 387a2e3.

📒 Files selected for processing (1)
  • client/internal/peer/worker_ice.go


pappz commented Apr 7, 2026

@MichaelUray Thank you for the PR.
Regarding the candidate buffering: I think this can be dropped.

The race you're protecting against — OnRemoteCandidate arriving before the agent is created — requires a very specific timing: the remote peer would need to create its agent, start gathering, discover a candidate, and signal it back, all before our handshaker goroutine reads the offer from its channel and calls OnNewOffer. That's microseconds (channel read) vs tens-to-hundreds of milliseconds (STUN/TURN network I/O for candidate gathering), so in practice this almost never happens.

If we ever wanted to truly fix this race, the correct approach would be to remove the go from engine.go:1612:

// current:
go conn.OnRemoteCandidate(candidate, e.routeManager.GetClientRoutes())

// fix:
conn.OnRemoteCandidate(candidate, e.routeManager.GetClientRoutes())

But of course it is a risky change.


pappz commented Apr 7, 2026

The session ID mechanism exists for two specific reasons:

  1. Force renegotiation after silent disconnect — when a peer crashes and reconnects with a new session ID, the remote side must tear down its agent and start fresh.
  2. Duplicate offer suppression — same session ID on resend (ICE + relay paths) gets deduplicated.

By skipping all offers with a different session ID while agentConnecting == true, case 1 regresses. If the remote peer crashed and came back, the local peer ignores its new session ID and keeps trying the current (now dead) ICE attempt until it times out. Recovery gets delayed by the full ICE timeout (30s+).

…andidates

Address coderabbitai review on PR netbirdio#5805: previously buffered candidates
bypassed candidateViaRoutes() (route-loop filter) and shouldAddExtraCandidate()
(extra server-reflexive helper), so a candidate arriving before the agent
existed could be added later in a state inconsistent with a candidate
arriving after the agent existed. Worse, the buffered path missed the
extra-srflx-with-related-port candidate that the live path adds.

Refactor OnRemoteCandidate to compute the full list of candidates that
should reach the agent (the routed-out filter and the optional extra
srflx) before deciding whether to buffer or add immediately. The same
filter and extension now apply unconditionally; the only difference
between the two paths is whether we hand the result to the agent right
now or stash it for the next flush.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
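
The shape of this refactor can be sketched as a pure function: compute the complete list a candidate expands to (route filter plus optional extra srflx) before choosing between buffering and a direct add. The function parameters are condensed stand-ins for candidateViaRoutes and shouldAddExtraCandidate, not the real signatures:

```go
package main

import "fmt"

// candidatesToAdd applies the same filter and expansion regardless of
// whether the agent exists yet; the caller then either buffers the
// result or feeds it straight to the agent, so both paths behave alike.
func candidatesToAdd(
	c string,
	viaRoutes func(string) bool, // true: reachable via a managed route, drop it
	extraSrflx func(string) (string, bool), // optional extra srflx candidate
) []string {
	if viaRoutes(c) {
		return nil
	}
	toAdd := []string{c}
	if extra, ok := extraSrflx(c); ok {
		toAdd = append(toAdd, extra)
	}
	return toAdd
}

func main() {
	noRoutes := func(string) bool { return false }
	withExtra := func(c string) (string, bool) { return c + " (related-port srflx)", true }
	fmt.Println(len(candidatesToAdd("srflx 203.0.113.9:3478", noRoutes, withExtra))) // 2
	routed := func(string) bool { return true }
	fmt.Println(len(candidatesToAdd("host 10.0.0.5", routed, withExtra))) // 0
}
```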

sonarqubecloud bot commented Apr 8, 2026


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (1)
client/internal/peer/worker_ice.go (1)

238-247: Minor: Log message may undercount dropped candidates.

When the buffer is full, the return exits the function immediately. If toAdd contains multiple candidates (original + extra srflx), all remaining candidates in the slice are dropped but only one is logged.

This is a minor logging clarity issue and doesn't affect correctness.

💡 Optional improvement for logging accuracy
 	for _, c := range toAdd {
 		if len(w.pendingCandidates) >= maxPendingCandidates {
-			w.log.Warnf("pending candidate buffer full (%d), dropping candidate: %s",
-				maxPendingCandidates, c.Type())
+			w.log.Warnf("pending candidate buffer full (%d), dropping %d candidate(s)",
+				maxPendingCandidates, len(toAdd)-len(w.pendingCandidates)+maxPendingCandidates)
 			return
 		}

Or simply break out of the loop and log once after counting drops.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@client/internal/peer/worker_ice.go` around lines 238 - 247, The loop over
toAdd currently returns on first overflow, underreporting dropped candidates;
change the logic in the worker (the loop that appends to w.pendingCandidates) to
stop iterating when len(w.pendingCandidates) >= maxPendingCandidates (use break
instead of return), compute how many candidates from toAdd were not buffered
(e.g., dropped := len(toAdd) - i or maintain a dropped counter), and emit a
single w.log.Warnf that reports the total number of dropped candidates along
with maxPendingCandidates and candidate types if desired; ensure you still
append buffered candidates to w.pendingCandidates and retain the info log
w.log.Infof for buffered entries.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 8d2c91f1-409c-45ea-93ba-b1dd7570a51f

📥 Commits

Reviewing files that changed from the base of the PR and between 387a2e3 and c131459.

📒 Files selected for processing (1)
  • client/internal/peer/worker_ice.go

Comment on lines +157 to +165
// If ICE agent is actively connecting (gathering candidates, dialing),
// don't cancel it for a new offer - let the current attempt finish.
// This prevents a reconnection loop where guards on both sides keep
// sending new offers that cancel each other's ICE negotiations,
// particularly when both peers are behind the same NAT (same public IP).
if w.agentConnecting {
	w.log.Infof("ICE agent is actively connecting, skipping new offer with different session: %s", remoteOfferAnswer.SessionIDString())
	return false
}


⚠️ Potential issue | 🟡 Minor

Acknowledge the crash-recovery trade-off documented in PR comments.

The agentConnecting guard (lines 162-165) prevents the reconnection loop but can delay recovery when the remote peer crashes and reconnects with a new session ID. The current ICE attempt must complete (success or timeout, potentially 30s+) before processing the new offer.

This is the trade-off discussed by reviewer pappz. Consider whether the ICE timeout duration is acceptable for crash-recovery scenarios in your deployment, or if a shorter timeout/early-abort mechanism could mitigate the delay.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@client/internal/peer/worker_ice.go` around lines 157 - 165, The current guard
w.agentConnecting unconditionally skips new offers while an ICE attempt is in
progress, which avoids reconnection loops but can delay recovery when a remote
peer crashes and reconnects with a new session; change this by making the
blocking behavior conditional or timeboxed: add a configurable short timeout or
an early-abort check that compares the incoming
remoteOfferAnswer.SessionIDString() to the session involved in the current ICE
attempt and, if they differ, either abort the ongoing attempt (invoke the
existing ICE cancel/cleanup routine) or allow the new offer to preempt after a
configurable threshold (e.g., w.iceConnectingTimeout or a method like
abortConnectingIfNewSession(remoteSessionID)); update logs via
w.log.Infof/w.log.Warnf to record preemption/timeout decisions.

Comment on lines +180 to +184
sessionID, err := NewICESessionID()
if err != nil {
	w.log.Errorf("failed to create new session ID: %s", err)
}
w.sessionID = sessionID


⚠️ Potential issue | 🟡 Minor

Error handling continues with potentially stale session ID.

If NewICESessionID() fails, the code logs an error but continues, leaving w.sessionID with its previous value rather than being explicitly cleared or set to a known-invalid state. This could lead to session ID collisions or unexpected behavior.

Consider whether to return an error or set w.sessionID to an empty/invalid value on failure.

💡 Suggested improvement
 	sessionID, err := NewICESessionID()
 	if err != nil {
 		w.log.Errorf("failed to create new session ID: %s", err)
+		w.sessionID = ""
+	} else {
+		w.sessionID = sessionID
 	}
-	w.sessionID = sessionID
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change:

sessionID, err := NewICESessionID()
if err != nil {
	w.log.Errorf("failed to create new session ID: %s", err)
	w.sessionID = ""
} else {
	w.sessionID = sessionID
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@client/internal/peer/worker_ice.go` around lines 180 - 184, The current call
to NewICESessionID() logs an error but leaves w.sessionID unchanged on failure;
update the error handling so you do not keep a stale ID: only assign w.sessionID
= sessionID when err == nil, and on err either set w.sessionID = "" (or a
known-invalid value) or return an error from the enclosing function; modify the
block around NewICESessionID() so it checks err, clears w.sessionID on failure
(or returns the error) and logs the failure with context before exiting.
