[client] Fix ICE reconnection loop and buffer candidates before agent initialization#5805

Open
MichaelUray wants to merge 4 commits into netbirdio:main from MichaelUray:fix/ice-guard-loop-and-candidate-buffer

Conversation


@MichaelUray MichaelUray commented Apr 6, 2026

Summary

  • Guard Loop Fix: When both peers are behind the same NAT (same public IP), their connectivity guards fire simultaneously every ~12s, each sending a new offer with a new session ID. This cancels the other side's ICE agent before it can finish negotiating, creating an infinite reconnection loop. Fix: skip new offers while agentConnecting == true to let the current ICE attempt complete.
  • Candidate Buffering: Remote ICE candidates that arrive before the local ICE agent is created were silently dropped. Add a pendingCandidates buffer that collects early candidates and flushes them once the agent is ready.

Both fixes are tightly coupled — without the buffer, the guard loop fix alone would still miss early candidates.
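
A minimal sketch of how the two fixes interact, using simplified stand-in types rather than the real WorkerICE (the field and method names mirror the PR, but the logic is condensed for illustration):

```go
package main

import "fmt"

// worker is a simplified stand-in for WorkerICE; only the fields needed
// to illustrate the guard and the buffer are modeled here.
type worker struct {
	agentConnecting   bool     // an ICE attempt is in flight
	agentReady        bool     // the ICE agent has been created
	pendingCandidates []string // candidates that arrived too early
}

// shouldReplaceAgent mirrors the guard: a different-session offer is
// skipped while the current ICE attempt is still connecting, which
// breaks the mutual-cancellation loop between the two guards.
func (w *worker) shouldReplaceAgent(sameSession bool) bool {
	if sameSession {
		return false // duplicate resend of the current offer
	}
	if w.agentConnecting {
		return false // let the in-flight attempt finish first
	}
	return true
}

// onRemoteCandidate buffers early candidates instead of dropping them;
// they are flushed into the agent once it exists.
func (w *worker) onRemoteCandidate(c string) {
	if !w.agentReady {
		w.pendingCandidates = append(w.pendingCandidates, c)
		return
	}
	fmt.Println("added to agent:", c)
}

func main() {
	w := &worker{agentConnecting: true}
	fmt.Println(w.shouldReplaceAgent(false)) // false: attempt in flight
	w.onRemoteCandidate("srflx 203.0.113.7:51820")
	fmt.Println(len(w.pendingCandidates)) // 1: buffered, not dropped
}
```

The coupling is visible here: without the buffer, a candidate arriving while `shouldReplaceAgent` defers a new offer would be lost just the same.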

Checklist

  • Bug fix
  • Create tests that fail without the change: ICE connections between peers behind the same NAT consistently failed before this fix
  • Documentation not needed — internal ICE behavior change, no user-facing API change

By submitting this pull request, I confirm that I have read and agree to the terms of the Contributor License Agreement.

Summary by CodeRabbit

  • Bug Fixes
    • Buffer early remote connection candidates with a bounded cap to avoid lost candidates and limit memory use.
    • Expand and queue applicable server-reflexive candidates for later delivery instead of dropping them.
    • Avoid unnecessary connection replacements during in-progress handshakes or when sessions match to improve stability.
    • Flush buffered candidates into a newly created connection and clear the buffer on close to prevent stale data.

…public IP

When two peers are behind the same NAT (same public IP), ICE P2P
connections could never be established because of a reconnection loop:

1. Guard on peer A sees no P2P connection and sends a new offer
2. Peer B receives the offer with a new session ID
3. Peer B cancels its active ICE agent and creates a new one
4. Before the new ICE agent can complete connectivity checks,
   peer B's guard sends a new offer to peer A
5. Peer A cancels its ICE agent → cycle repeats indefinitely

The fix skips incoming offers with different session IDs when the
ICE agent is actively connecting (agentConnecting == true). This
lets the current ICE negotiation run to completion (success or
timeout) before accepting new offers.

This change is backward compatible - older clients will simply have
their offers deferred until the current ICE attempt finishes (max
6 seconds on timeout). P2P between peers at different locations
(different public IPs) is unaffected as ICE succeeds before the
guard timer fires.

Fixes: netbirdio#3669
Related: netbirdio#2703
Related: netbirdio#3339
Related: netbirdio#4031
Related: netbirdio#2159

CLAassistant commented Apr 6, 2026

CLA assistant check
All committers have signed the CLA.


coderabbitai bot commented Apr 6, 2026

📝 Walkthrough

Buffer remote ICE candidates in WorkerICE until an ICE agent exists; centralize offer-replacement logic to avoid replacing agents during connection attempts; on successful agent creation, flush buffered candidates into the new agent; clear buffer on agent close.

Changes

Cohort: ICE agent & candidate buffering
File(s): client/internal/peer/worker_ice.go
Summary: Add bounded pendingCandidates with maxPendingCandidates; buffer remote candidates when agent is nil (drop beyond cap with warning); introduce shouldReplaceAgent and extract replaceExistingAgent; on new agent creation call flushPendingCandidates and clear buffer in closeAgent; remove prior early-return that dropped candidates when agent was nil.

Sequence Diagram

sequenceDiagram
    participant Client
    participant WorkerICE as WorkerICE
    participant PendingBuf as PendingBuffer
    participant Agent as ICE Agent

    Client->>WorkerICE: OnRemoteCandidate (agent = nil)
    WorkerICE->>PendingBuf: Append candidate (bounded)
    Note over WorkerICE,PendingBuf: Buffer candidates until an agent exists

    Client->>WorkerICE: OnNewOffer
    WorkerICE->>WorkerICE: shouldReplaceAgent? / maybe replaceExistingAgent
    WorkerICE->>Agent: Create & initialize agent
    Agent-->>WorkerICE: Agent ready

    WorkerICE->>PendingBuf: Retrieve buffered candidates
    loop For each buffered candidate
        WorkerICE->>Agent: AddRemoteCandidate
    end
    PendingBuf-->>WorkerICE: Buffer cleared

    Client->>WorkerICE: OnRemoteCandidate (agent != nil)
    WorkerICE->>Agent: AddRemoteCandidate (direct)

    Client->>WorkerICE: Close
    WorkerICE->>WorkerICE: closeAgent (close agent, clear pending buffer)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested reviewers

  • lixmal
  • mlsmaycon

Poem

🐰 I saved the drifted candidates in a cozy heap,
I guarded them gently while the agents sleep,
When a new offer woke the sleepy den,
I sprinkled them in and watched them blend,
Hop—connections stitched, and I thumped my feet.

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed The title directly and clearly summarizes the two main changes: fixing an ICE reconnection loop and buffering candidates before agent initialization.
Description check ✅ Passed The description covers all required template sections: detailed summary of changes, issue context, checklist completion, and CLA confirmation. All critical information is provided.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.


@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (1)
client/internal/peer/worker_ice.go (1)

184-189: Consider adding a buffer size limit as a defensive measure.

The buffer could theoretically grow unbounded if candidates keep arriving without agent creation (e.g., due to signaling issues). While unlikely in practice, a simple cap would prevent memory accumulation in pathological cases.

♻️ Optional: Add buffer cap
+const maxPendingCandidates = 50
+
 func (w *WorkerICE) OnRemoteCandidate(candidate ice.Candidate, haRoutes route.HAMap) {
 	w.muxAgent.Lock()
 	defer w.muxAgent.Unlock()
 	w.log.Debugf("OnRemoteCandidate from peer %s -> %s", w.config.Key, candidate.String())
 	if w.agent == nil {
+		if len(w.pendingCandidates) >= maxPendingCandidates {
+			w.log.Warnf("pending candidate buffer full, dropping candidate: %s", candidate.Type())
+			return
+		}
 		w.log.Infof("ICE Agent not ready, buffering remote candidate: %s", candidate.Type())
 		w.pendingCandidates = append(w.pendingCandidates, candidate)
 		return
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@client/internal/peer/worker_ice.go` around lines 184 - 189, The
pendingCandidates slice can grow unbounded when w.agent == nil; add a fixed cap
(e.g., MAX_PENDING_CANDIDATES constant) and enforce it in the buffering branch
inside the OnNewCandidate / handler where w.pendingCandidates is appended: if
len(w.pendingCandidates) >= MAX_PENDING_CANDIDATES, either drop the oldest
(shift) or drop the new candidate and log a warning via w.log.Infof/warnf,
otherwise append; ensure the constant and behavior are documented in comments
and referenced where pendingCandidates is created/cleared (use symbols
w.pendingCandidates, w.agent, candidate).

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: aabe6227-0bce-4e6d-abac-3b3c6358dde1

📥 Commits

Reviewing files that changed from the base of the PR and between decb5dd and 28e3a06.

📒 Files selected for processing (1)
  • client/internal/peer/worker_ice.go

…ation

When both peers exchange offers simultaneously, remote ICE candidates
can arrive via signaling before the local ICE agent is created in
OnNewOffer(). Previously these candidates were silently dropped with
"ICE Agent is not initialized yet", causing ICE to fail because the
agent never learned about the remote peer's network addresses.

This adds a pendingCandidates buffer that collects early candidates
and flushes them into the agent immediately after creation. The buffer
is cleared when the agent is closed to prevent stale candidates from
leaking into a new session.

This race condition particularly affects peers behind the same NAT
where the signaling round-trip is very fast.
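
The buffer lifecycle this commit describes can be condensed to a small sketch (the string slice below stands in for the pion ICE agent, so these names are illustrative, not the real API):

```go
package main

import "fmt"

// iceWorker models just the buffer lifecycle: collect early candidates,
// flush them when the agent appears, clear everything on close.
type iceWorker struct {
	pending []string
	agent   []string // stand-in for candidates handed to a live agent
}

// flushPendingCandidates hands buffered candidates to the new agent and
// empties the buffer (the real code would call agent.AddRemoteCandidate).
func (w *iceWorker) flushPendingCandidates() {
	w.agent = append(w.agent, w.pending...)
	w.pending = nil
}

// closeAgent clears both the agent and the buffer so candidates from a
// dead session cannot leak into the next one.
func (w *iceWorker) closeAgent() {
	w.agent = nil
	w.pending = nil
}

func main() {
	w := &iceWorker{pending: []string{"host 10.0.0.2", "srflx 198.51.100.4"}}
	w.flushPendingCandidates()
	fmt.Println(len(w.agent), len(w.pending)) // 2 0
	w.closeAgent()
	fmt.Println(w.agent == nil && w.pending == nil) // true
}
```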
@MichaelUray MichaelUray force-pushed the fix/ice-guard-loop-and-candidate-buffer branch from 28e3a06 to 64e91ae on April 6, 2026 16:16
@pappz pappz self-requested a review April 7, 2026 08:16
Addresses coderabbitai review on PR netbirdio#5805: the pendingCandidates slice
could theoretically grow unbounded if remote candidates keep arriving
before the ICE agent is created (e.g. due to signaling issues). Add a
fixed cap of 50 candidates and drop further candidates with a warning
log to protect against memory accumulation in pathological scenarios.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@client/internal/peer/worker_ice.go`:
- Around line 68-69: Buffered remote candidates stored in pendingCandidates
bypass the normal pipeline (candidateViaRoutes and shouldAddExtraCandidate),
causing different behavior depending on arrival timing; fix by applying the same
routing/filtering before buffering and by replaying buffered entries through the
same helper used by the live path: call candidateViaRoutes() and run
shouldAddExtraCandidate() (or the unified helper that encapsulates both) for
each candidate before pushing to pendingCandidates, and when draining
pendingCandidates later invoke the same AddRemoteCandidate() path used for live
arrivals so routed/filtered/synthetic-related-port candidates are handled
identically (update functions: pendingCandidates, candidateViaRoutes,
shouldAddExtraCandidate, AddRemoteCandidate).

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 73288d66-2438-465a-b0b8-92d5cf6eed4e

📥 Commits

Reviewing files that changed from the base of the PR and between 28e3a06 and 387a2e3.

📒 Files selected for processing (1)
  • client/internal/peer/worker_ice.go


pappz commented Apr 7, 2026

@MichaelUray Thank you for the PR.
Regarding the candidate buffering: I think this can be dropped.

The race you're protecting against — OnRemoteCandidate arriving before the agent is created — requires a very specific timing: the remote peer would need to create its agent, start gathering, discover a candidate, and signal it back, all before our handshaker goroutine reads the offer from its channel and calls OnNewOffer. That's microseconds (channel read) vs tens-to-hundreds of milliseconds (STUN/TURN network I/O for candidate gathering), so in practice this almost never happens.

If we ever wanted to truly fix this race, the correct approach would be to remove the go from engine.go:1612:

// current:
go conn.OnRemoteCandidate(candidate, e.routeManager.GetClientRoutes())

// fix:
conn.OnRemoteCandidate(candidate, e.routeManager.GetClientRoutes())

But of course it is a risky change.


pappz commented Apr 7, 2026

The session ID mechanism exists for two specific reasons:

  1. Force renegotiation after silent disconnect — when a peer crashes and reconnects with a new session ID, the remote side must tear down its agent and start fresh.
  2. Duplicate offer suppression — same session ID on resend (ICE + relay paths) gets deduplicated.

By skipping all offers with a different session ID while agentConnecting == true, case 1 regresses. If the remote peer crashed and came back, the local peer ignores its new session ID and keeps trying the current (now dead) ICE attempt until it times out. Recovery gets delayed by the full ICE timeout (30s+).

…andidates

Address coderabbitai review on PR netbirdio#5805: previously buffered candidates
bypassed candidateViaRoutes() (route-loop filter) and shouldAddExtraCandidate()
(extra server-reflexive helper), so a candidate arriving before the agent
existed could be added later in a state inconsistent with a candidate
arriving after the agent existed. Worse, the buffered path missed the
extra-srflx-with-related-port candidate that the live path adds.

Refactor OnRemoteCandidate to compute the full list of candidates that
should reach the agent (the routed-out filter and the optional extra
srflx) before deciding whether to buffer or add immediately. The same
filter and extension now apply unconditionally; the only difference
between the two paths is whether we hand the result to the agent right
now or stash it for the next flush.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
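
The shape of this refactor can be sketched as a pure function: compute the complete list a candidate expands to (route filter plus optional extra srflx) before choosing between buffering and a direct add. The function parameters are condensed stand-ins for candidateViaRoutes and shouldAddExtraCandidate, not the real signatures:

```go
package main

import "fmt"

// candidatesToAdd applies the same filter and expansion regardless of
// whether the agent exists yet; the caller then either buffers the
// result or feeds it straight to the agent, so both paths behave alike.
func candidatesToAdd(
	c string,
	viaRoutes func(string) bool, // true: reachable via a managed route, drop it
	extraSrflx func(string) (string, bool), // optional extra srflx candidate
) []string {
	if viaRoutes(c) {
		return nil
	}
	toAdd := []string{c}
	if extra, ok := extraSrflx(c); ok {
		toAdd = append(toAdd, extra)
	}
	return toAdd
}

func main() {
	noRoutes := func(string) bool { return false }
	withExtra := func(c string) (string, bool) { return c + " (related-port srflx)", true }
	fmt.Println(len(candidatesToAdd("srflx 203.0.113.9:3478", noRoutes, withExtra))) // 2
	routed := func(string) bool { return true }
	fmt.Println(len(candidatesToAdd("host 10.0.0.5", routed, withExtra))) // 0
}
```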

sonarqubecloud bot commented Apr 8, 2026


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (1)
client/internal/peer/worker_ice.go (1)

238-247: Minor: Log message may undercount dropped candidates.

When the buffer is full, the return exits the function immediately. If toAdd contains multiple candidates (original + extra srflx), all remaining candidates in the slice are dropped but only one is logged.

This is a minor logging clarity issue and doesn't affect correctness.

💡 Optional improvement for logging accuracy
 	for _, c := range toAdd {
 		if len(w.pendingCandidates) >= maxPendingCandidates {
-			w.log.Warnf("pending candidate buffer full (%d), dropping candidate: %s",
-				maxPendingCandidates, c.Type())
+			w.log.Warnf("pending candidate buffer full (%d), dropping %d candidate(s)",
+				maxPendingCandidates, len(toAdd)-len(w.pendingCandidates)+maxPendingCandidates)
 			return
 		}

Or simply break out of the loop and log once after counting drops.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@client/internal/peer/worker_ice.go` around lines 238 - 247, The loop over
toAdd currently returns on first overflow, underreporting dropped candidates;
change the logic in the worker (the loop that appends to w.pendingCandidates) to
stop iterating when len(w.pendingCandidates) >= maxPendingCandidates (use break
instead of return), compute how many candidates from toAdd were not buffered
(e.g., dropped := len(toAdd) - i or maintain a dropped counter), and emit a
single w.log.Warnf that reports the total number of dropped candidates along
with maxPendingCandidates and candidate types if desired; ensure you still
append buffered candidates to w.pendingCandidates and retain the info log
w.log.Infof for buffered entries.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 8d2c91f1-409c-45ea-93ba-b1dd7570a51f

📥 Commits

Reviewing files that changed from the base of the PR and between 387a2e3 and c131459.

📒 Files selected for processing (1)
  • client/internal/peer/worker_ice.go

Comment on lines +157 to +165
// If ICE agent is actively connecting (gathering candidates, dialing),
// don't cancel it for a new offer - let the current attempt finish.
// This prevents a reconnection loop where guards on both sides keep
// sending new offers that cancel each other's ICE negotiations,
// particularly when both peers are behind the same NAT (same public IP).
if w.agentConnecting {
	w.log.Infof("ICE agent is actively connecting, skipping new offer with different session: %s", remoteOfferAnswer.SessionIDString())
	return false
}


⚠️ Potential issue | 🟡 Minor

Acknowledge the crash-recovery trade-off documented in PR comments.

The agentConnecting guard (lines 162-165) prevents the reconnection loop but can delay recovery when the remote peer crashes and reconnects with a new session ID. The current ICE attempt must complete (success or timeout, potentially 30s+) before processing the new offer.

This is the trade-off discussed by reviewer pappz. Consider whether the ICE timeout duration is acceptable for crash-recovery scenarios in your deployment, or if a shorter timeout/early-abort mechanism could mitigate the delay.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@client/internal/peer/worker_ice.go` around lines 157 - 165, The current guard
w.agentConnecting unconditionally skips new offers while an ICE attempt is in
progress, which avoids reconnection loops but can delay recovery when a remote
peer crashes and reconnects with a new session; change this by making the
blocking behavior conditional or timeboxed: add a configurable short timeout or
an early-abort check that compares the incoming
remoteOfferAnswer.SessionIDString() to the session involved in the current ICE
attempt and, if they differ, either abort the ongoing attempt (invoke the
existing ICE cancel/cleanup routine) or allow the new offer to preempt after a
configurable threshold (e.g., w.iceConnectingTimeout or a method like
abortConnectingIfNewSession(remoteSessionID)); update logs via
w.log.Infof/w.log.Warnf to record preemption/timeout decisions.

Comment on lines +180 to +184
sessionID, err := NewICESessionID()
if err != nil {
	w.log.Errorf("failed to create new session ID: %s", err)
}
w.sessionID = sessionID


⚠️ Potential issue | 🟡 Minor

Error handling continues with potentially stale session ID.

If NewICESessionID() fails, the code logs an error but continues, leaving w.sessionID with its previous value rather than being explicitly cleared or set to a known-invalid state. This could lead to session ID collisions or unexpected behavior.

Consider whether to return an error or set w.sessionID to an empty/invalid value on failure.

💡 Suggested improvement
 	sessionID, err := NewICESessionID()
 	if err != nil {
 		w.log.Errorf("failed to create new session ID: %s", err)
+		w.sessionID = ""
+	} else {
+		w.sessionID = sessionID
 	}
-	w.sessionID = sessionID
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change:

sessionID, err := NewICESessionID()
if err != nil {
	w.log.Errorf("failed to create new session ID: %s", err)
	w.sessionID = ""
} else {
	w.sessionID = sessionID
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@client/internal/peer/worker_ice.go` around lines 180 - 184, The current call
to NewICESessionID() logs an error but leaves w.sessionID unchanged on failure;
update the error handling so you do not keep a stale ID: only assign w.sessionID
= sessionID when err == nil, and on err either set w.sessionID = "" (or a
known-invalid value) or return an error from the enclosing function; modify the
block around NewICESessionID() so it checks err, clears w.sessionID on failure
(or returns the error) and logs the failure with context before exiting.
