
[client] Add TTL-based cache refresh for management DNS resolver#5794

Open
plusls wants to merge 4 commits into netbirdio:main from plusls:fix_mgnt_dns_cache

Conversation

@plusls
Contributor

@plusls plusls commented Apr 3, 2026

Describe your changes

Add a TTL expiration mechanism to MgmtCacheResolver to handle DDNS IP changes. When a cache entry expires (300 s TTL), it is refreshed synchronously using net.DefaultResolver. If the refresh fails, the stale cache entry is served to maintain connectivity.

Fixes #5113

Issue ticket number and link

#5113

Stack

Checklist

  • Is it a bug fix
  • Is it a typo/documentation fix
  • Is it a feature enhancement
  • Is it a refactor
  • Created tests that fail without the change (if possible)

By submitting this pull request, you confirm that you have read and agree to the terms of the Contributor License Agreement.

Documentation

Select exactly one:

  • I added/updated documentation for this change
  • Documentation is not needed for this change (explain why)

Docs PR URL (required if "docs added" is checked)

Paste the PR link from https://github.com/netbirdio/docs here:

https://github.com/netbirdio/docs/pull/__

Summary by CodeRabbit

  • Bug Fixes
    • Improved DNS resolver caching with per-entry timestamps, TTL-based expiration, and automatic refresh of stale records. Failed refreshes now trigger a backoff while continuing to serve stale records for reliability. New logic also records insertion times and applies consistent TTLs to cached responses.

@coderabbitai
Contributor

coderabbitai bot commented Apr 3, 2026

📝 Walkthrough

Walkthrough

Resolver cache now stores per-question cachedRecord entries with timestamps and failure markers. On each query ServeDNS checks TTL-based staleness, triggers a serialized refreshDomain when expired (with backoff), and otherwise serves cached records; AddDomain records TTLs and insertion time into cachedRecord.

Changes

Cohort / File(s) Summary
DNS cache & refresh logic
client/internal/dns/mgmt/mgmt.go
Replaced map[dns.Question][]dns.RR with map[dns.Question]*cachedRecord (records, cachedAt, lastFailedRefresh). AddDomain wraps records in cachedRecord and sets RR TTLs from defaultTTL. ServeDNS checks staleness vs defaultTTL and either serves cache or calls new refreshDomain. Added refreshDomain implementing mutex-serialized refresh, staleness re-check after lock, failure backoff via lastFailedRefresh, and stale-serving fallback on errors.

Sequence Diagram(s)

sequenceDiagram
  participant Client as Client (DNS request)
  participant Resolver as Resolver Cache
  participant Upstream as Upstream Resolver
  rect rgba(220,220,255,0.5)
    Client->>Resolver: DNS query (question)
    Resolver->>Resolver: Check cachedRecord + defaultTTL
    alt cache fresh
      Resolver-->>Client: return cached records
    else cache stale
      Resolver->>Resolver: acquire refreshMutex
      Resolver->>Upstream: re-resolve domain
      alt refresh success
        Upstream-->>Resolver: new RRs
        Resolver-->>Client: return new records
      else refresh failed (backoff)
        Resolver-->>Client: return stale records (update lastFailedRefresh)
      end
      Resolver->>Resolver: release refreshMutex
    end
  end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

  • #4983: Modifies the same client/internal/dns/mgmt/mgmt.go and AddDomain to enhance DNS lookup timing/timeout handling — closely related to DNS resolution logic.

Suggested reviewers

  • pappz

Poem

🐰 I hopped through caches, timestamp in paw,
I sniffed stale answers and fixed the flaw,
I lock, I refresh, I back off when needed,
Now domains update and connections are seeded. 🥕

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Docstring Coverage — ⚠️ Warning: Docstring coverage is 50.00%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Title check — ✅ Passed: The title clearly and specifically describes the main change: adding a TTL-based cache refresh mechanism to the management DNS resolver, which directly addresses the issue of the DNS cache not being updated.
Description check — ✅ Passed: The description covers all required sections including change description, issue ticket link (#5113), classification as bug fix, and documentation statement. All template requirements are adequately fulfilled.
Linked Issues check — ✅ Passed: The code changes implement the required functionality: TTL expiration mechanism with 300s TTL, synchronous refresh using net.DefaultResolver, and fallback to stale cache on refresh failure, directly addressing issue #5113.
Out of Scope Changes check — ✅ Passed: All changes are scoped to implementing TTL-based cache refresh for the management DNS resolver; no unrelated modifications or scope creep detected in the resolver cache structure updates.



Warning

Review ran into problems

🔥 Problems

Timed out fetching pipeline failures after 30000ms



Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
client/internal/dns/mgmt/mgmt.go (1)

158-183: ⚠️ Potential issue | 🟠 Major

Clear vanished record families during a successful refresh.

Line 161 and Line 173 only overwrite non-empty slices. If a domain stops returning A or AAAA records, the old entry stays in m.records, and Lines 215-217 then keep serving that stale RR set after a “successful” refresh. That leaves deleted addresses cached forever.

🩹 Suggested fix
  now := time.Now()
  m.mutex.Lock()
+ aQuestion := dns.Question{
+     Name:   dnsName,
+     Qtype:  dns.TypeA,
+     Qclass: dns.ClassINET,
+ }
+ aaaaQuestion := dns.Question{
+     Name:   dnsName,
+     Qtype:  dns.TypeAAAA,
+     Qclass: dns.ClassINET,
+ }
+ delete(m.records, aQuestion)
+ delete(m.records, aaaaQuestion)

- if len(aRecords) > 0 {
-     aQuestion := dns.Question{
-         Name:   dnsName,
-         Qtype:  dns.TypeA,
-         Qclass: dns.ClassINET,
-     }
+ if len(aRecords) > 0 {
      m.records[aQuestion] = &cachedRecord{
          records:  aRecords,
          cachedAt: now,
      }
  }

- if len(aaaaRecords) > 0 {
-     aaaaQuestion := dns.Question{
-         Name:   dnsName,
-         Qtype:  dns.TypeAAAA,
-         Qclass: dns.ClassINET,
-     }
+ if len(aaaaRecords) > 0 {
      m.records[aaaaQuestion] = &cachedRecord{
          records:  aaaaRecords,
          cachedAt: now,
      }
  }
- if !found {
-     return stale.records
- }
+ if !found {
+     return nil
+ }

Also applies to: 211-217

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@client/internal/dns/mgmt/mgmt.go` around lines 158 - 183, The refresh logic
currently only writes cache entries when aRecords or aaaaRecords are non-empty,
leaving stale entries in m.records when record families vanish; update the block
that handles aRecords and aaaaRecords (the aQuestion and aaaaQuestion keys in
m.records, and the cachedRecord values) to explicitly remove the corresponding
dns.Question key from m.records when the slice is empty (i.e., delete
m.records[aQuestion] / delete m.records[aaaaQuestion]) while still setting the
new cachedRecord when non-empty, keeping the existing mutex.Lock/unlock and
time.Now usage.
🧹 Nitpick comments (1)
client/internal/dns/mgmt/mgmt.go (1)

37-37: Re-check the current map entry after taking the refresh gate.

Line 200 tests the pre-lock stale pointer. If another request refreshed m.records[question] while this goroutine was waiting, stale.cachedAt is still old and we do a second lookup anyway. Re-reading m.records[question] under m.mutex (or gating per question with singleflight) would make the dedupe actually work.

Also applies to: 199-202
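The re-check this comment suggests amounts to a double-check under the cache mutex after acquiring the refresh gate. A minimal sketch of that pattern (all names are illustrative; the real cache is keyed by dns.Question, and a per-question singleflight is the alternative the comment mentions):

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

const ttl = 300 * time.Second

type rec struct {
	addrs    []string
	cachedAt time.Time
}

type cache struct {
	mu           sync.Mutex // guards m
	refreshMutex sync.Mutex // serializes refreshes
	m            map[string]*rec
	lookups      int // counts upstream lookups, to show the dedupe works
}

// get re-checks freshness after acquiring the refresh gate so a refresh
// completed by another goroutine while we waited is not repeated.
func (c *cache) get(q string) []string {
	c.mu.Lock()
	r := c.m[q]
	c.mu.Unlock()
	if r != nil && time.Since(r.cachedAt) < ttl {
		return r.addrs
	}

	c.refreshMutex.Lock()
	defer c.refreshMutex.Unlock()

	// Double-check: re-read the entry under the cache mutex, since
	// another goroutine may have refreshed it before we got the gate.
	c.mu.Lock()
	r = c.m[q]
	c.mu.Unlock()
	if r != nil && time.Since(r.cachedAt) < ttl {
		return r.addrs
	}

	c.lookups++ // stand-in for the real upstream lookup
	fresh := &rec{addrs: []string{"192.0.2.10"}, cachedAt: time.Now()}
	c.mu.Lock()
	c.m[q] = fresh
	c.mu.Unlock()
	return fresh.addrs
}

func main() {
	c := &cache{m: map[string]*rec{}}
	c.get("mgmt.example.com.")
	c.get("mgmt.example.com.") // second call is served from the fresh cache
	fmt.Println("lookups:", c.lookups)
}
```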

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@client/internal/dns/mgmt/mgmt.go` at line 37, The refresh gate logic checks
the pre-lock stale pointer (stale.cachedAt) and can trigger duplicate refreshes
if another goroutine updated m.records[question] while this goroutine was
waiting on refreshMutex; after acquiring refreshMutex (refreshMutex) re-read
m.records[question] under m.mutex (or use a per-question singleflight) and
re-evaluate stale.cachedAt/freshness before performing the lookup so you skip
the redundant refresh when another request already updated the entry.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@client/internal/dns/mgmt/mgmt.go`:
- Around line 195-208: The refreshDomain function currently retries AddDomain on
every query when a cachedRecord is expired; modify Resolver.refreshDomain to
implement a backoff: add a retry/backoff timestamp field on cachedRecord (e.g.,
lastFailedRefresh or nextRefreshAttempt) and, before attempting AddDomain, check
that timestamp and if the backoff window hasn't elapsed simply return
stale.records immediately; on AddDomain failure set/update the backoff timestamp
(exponential or fixed interval) on the cachedRecord and return stale.records,
and on success clear/reset the failure timestamp and update cachedAt/records as
before. Ensure references to Resolver.refreshDomain, cachedRecord, and AddDomain
are used so the change is made in the right places.

---

Outside diff comments:
In `@client/internal/dns/mgmt/mgmt.go`:
- Around line 158-183: The refresh logic currently only writes cache entries
when aRecords or aaaaRecords are non-empty, leaving stale entries in m.records
when record families vanish; update the block that handles aRecords and
aaaaRecords (the aQuestion and aaaaQuestion keys in m.records, and the
cachedRecord values) to explicitly remove the corresponding dns.Question key
from m.records when the slice is empty (i.e., delete m.records[aQuestion] /
delete m.records[aaaaQuestion]) while still setting the new cachedRecord when
non-empty, keeping the existing mutex.Lock/unlock and time.Now usage.

---

Nitpick comments:
In `@client/internal/dns/mgmt/mgmt.go`:
- Line 37: The refresh gate logic checks the pre-lock stale pointer
(stale.cachedAt) and can trigger duplicate refreshes if another goroutine
updated m.records[question] while this goroutine was waiting on refreshMutex;
after acquiring refreshMutex (refreshMutex) re-read m.records[question] under
m.mutex (or use a per-question singleflight) and re-evaluate
stale.cachedAt/freshness before performing the lookup so you skip the redundant
refresh when another request already updated the entry.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: f8b7e34e-dbd7-4287-82cb-42a603af2769

📥 Commits

Reviewing files that changed from the base of the PR and between 28fbf96 and 4edc62d.

📒 Files selected for processing (1)
  • client/internal/dns/mgmt/mgmt.go

@sonarqubecloud

sonarqubecloud bot commented Apr 3, 2026

Contributor

@coderabbitai coderabbitai bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
client/internal/dns/mgmt/mgmt.go (1)

163-185: ⚠️ Potential issue | 🟠 Major

A successful refresh can still return a removed RR family.

If the fresh lookup only returns one family, AddDomain() leaves the old sibling entry in m.records. The final lookup in refreshDomain() then finds that stale entry and returns it as if the refresh succeeded. A domain that moves from A+AAAA to a single family will keep serving the removed records indefinitely, and every lookup of that family keeps re-entering the blocking refresh path because its cachedAt never advances.

🔧 One way to make the "missing type" path real
-	if len(aRecords) > 0 {
-		aQuestion := dns.Question{
-			Name:   dnsName,
-			Qtype:  dns.TypeA,
-			Qclass: dns.ClassINET,
-		}
+	aQuestion := dns.Question{
+		Name:   dnsName,
+		Qtype:  dns.TypeA,
+		Qclass: dns.ClassINET,
+	}
+	if len(aRecords) > 0 {
 		m.records[aQuestion] = &cachedRecord{
 			records:  aRecords,
 			cachedAt: now,
 		}
+	} else {
+		delete(m.records, aQuestion)
 	}
 
-	if len(aaaaRecords) > 0 {
-		aaaaQuestion := dns.Question{
-			Name:   dnsName,
-			Qtype:  dns.TypeAAAA,
-			Qclass: dns.ClassINET,
-		}
+	aaaaQuestion := dns.Question{
+		Name:   dnsName,
+		Qtype:  dns.TypeAAAA,
+		Qclass: dns.ClassINET,
+	}
+	if len(aaaaRecords) > 0 {
 		m.records[aaaaQuestion] = &cachedRecord{
 			records:  aaaaRecords,
 			cachedAt: now,
 		}
+	} else {
+		delete(m.records, aaaaQuestion)
 	}
@@
 	if !found {
 		// DNS returned no records for this type, preserve stale with backoff
 		now := time.Now()
 		current.lastFailedRefresh = &now
+		m.mutex.Lock()
+		if _, exists := m.records[question]; !exists {
+			m.records[question] = current
+		}
+		m.mutex.Unlock()
 		return current.records
 	}

Also applies to: 230-239

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@client/internal/dns/mgmt/mgmt.go` around lines 163 - 185,
AddDomain/refreshDomain currently only writes entries for record families
present (aRecords or aaaaRecords) which leaves stale sibling entries in
m.records; update AddDomain (where aQuestion/aaaaQuestion are created and
m.records written) to also remove or refresh the opposite family when it is
absent: if aRecords is empty, delete any existing dns.Question with Qtype
dns.TypeA for that dnsName (or update its cachedRecord.cachedAt to now and clear
records), and similarly if aaaaRecords is empty delete/update the dns.TypeAAAA
entry. This ensures refreshDomain will not return removed stale RRs and that
cachedAt advances to avoid re-entering blocking refresh paths.
🧹 Nitpick comments (1)
client/internal/dns/mgmt/mgmt.go (1)

27-31: Refresh coordination is scoped inconsistently.

AddDomain() refreshes a whole domain in one lookup, but the failure marker lives on a single dns.Question while refreshMutex is shared by the entire resolver. After one failed A refresh, the AAAA query for the same name can still trigger its own blocking lookup, and that second lookup also stalls unrelated domains behind the same mutex. A domain-keyed refresh/backoff state would avoid both duplicate retries and cross-domain head-of-line blocking.

Also applies to: 35-39, 216-227
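A domain-keyed lock map of the kind this comment suggests could look like the sketch below, under the assumption that a plain map of mutexes is acceptable (golang.org/x/sync/singleflight would be the ready-made alternative). Because A and AAAA queries for one name map to the same domain key, they share one lock, and a slow refresh of one domain no longer blocks refreshes of unrelated domains.

```go
package main

import (
	"fmt"
	"sync"
)

// domainLocks hands out one mutex per domain. This is a sketch of the
// review suggestion, not code from mgmt.go.
type domainLocks struct {
	mu    sync.Mutex // guards locks
	locks map[string]*sync.Mutex
}

func newDomainLocks() *domainLocks {
	return &domainLocks{locks: map[string]*sync.Mutex{}}
}

// lockFor returns the mutex for a domain, creating it on first use.
// Callers refreshing either record family of the same name therefore
// serialize on the same lock and can share backoff state guarded by it.
func (d *domainLocks) lockFor(domain string) *sync.Mutex {
	d.mu.Lock()
	defer d.mu.Unlock()
	l, ok := d.locks[domain]
	if !ok {
		l = &sync.Mutex{}
		d.locks[domain] = l
	}
	return l
}

func main() {
	d := newDomainLocks()
	a := d.lockFor("mgmt.example.com.")
	b := d.lockFor("mgmt.example.com.")
	c := d.lockFor("other.example.com.")
	fmt.Println(a == b, a == c) // same domain shares a lock; different domains do not
}
```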

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@client/internal/dns/mgmt/mgmt.go` around lines 27 - 31, The refresh/backoff
state is scoped to a single dns.Question while refreshMutex is global, causing
cross-domain blocking and duplicated retries; update the logic to track refresh
state per domain (e.g., use the domain string or fqdn as the key) instead of
per-Question: introduce a domain-keyed refresh/backoff map (or per-domain mutex
map) and move the lastFailedRefresh marker out of the dns.Question-scoped
cachedRecord into that domain-keyed structure, then change AddDomain(), the
refresh path that currently consults cachedRecord.lastFailedRefresh, and any use
of refreshMutex to consult/lock the per-domain entry (or per-domain mutex) so A
and AAAA for the same name share backoff and unrelated domains don't get blocked
by a single global mutex.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In `@client/internal/dns/mgmt/mgmt.go`:
- Around line 163-185: AddDomain/refreshDomain currently only writes entries for
record families present (aRecords or aaaaRecords) which leaves stale sibling
entries in m.records; update AddDomain (where aQuestion/aaaaQuestion are created
and m.records written) to also remove or refresh the opposite family when it is
absent: if aRecords is empty, delete any existing dns.Question with Qtype
dns.TypeA for that dnsName (or update its cachedRecord.cachedAt to now and clear
records), and similarly if aaaaRecords is empty delete/update the dns.TypeAAAA
entry. This ensures refreshDomain will not return removed stale RRs and that
cachedAt advances to avoid re-entering blocking refresh paths.

---

Nitpick comments:
In `@client/internal/dns/mgmt/mgmt.go`:
- Around line 27-31: The refresh/backoff state is scoped to a single
dns.Question while refreshMutex is global, causing cross-domain blocking and
duplicated retries; update the logic to track refresh state per domain (e.g.,
use the domain string or fqdn as the key) instead of per-Question: introduce a
domain-keyed refresh/backoff map (or per-domain mutex map) and move the
lastFailedRefresh marker out of the dns.Question-scoped cachedRecord into that
domain-keyed structure, then change AddDomain(), the refresh path that currently
consults cachedRecord.lastFailedRefresh, and any use of refreshMutex to
consult/lock the per-domain entry (or per-domain mutex) so A and AAAA for the
same name share backoff and unrelated domains don't get blocked by a single
global mutex.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 8a33d67e-d131-4bfc-be80-fcbb66582b75

📥 Commits

Reviewing files that changed from the base of the PR and between 4edc62d and da96f92.

📒 Files selected for processing (1)
  • client/internal/dns/mgmt/mgmt.go



Development

Successfully merging this pull request may close these issues.

Management DNS cache not auto update

1 participant