
[client] Add TTL-based cache refresh for management DNS resolver#5794

Open
plusls wants to merge 4 commits into netbirdio:main from plusls:fix_mgnt_dns_cache

Conversation

@plusls
Contributor

@plusls plusls commented Apr 3, 2026

Describe your changes

Add a TTL expiration mechanism to MgmtCacheResolver to handle DDNS IP changes. When a cache entry expires (300 s TTL), it is refreshed synchronously using net.DefaultResolver. If the refresh fails, the stale cache entry is served to maintain connectivity.

Fixes #5113

Issue ticket number and link

#5113

Stack

Checklist

  • Is it a bug fix
  • Is it a typo/documentation fix
  • Is it a feature enhancement
  • Is it a refactor
  • Created tests that fail without the change (if possible)

By submitting this pull request, you confirm that you have read and agree to the terms of the Contributor License Agreement.

Documentation

Select exactly one:

  • I added/updated documentation for this change
  • Documentation is not needed for this change (explain why)

Docs PR URL (required if "docs added" is checked)

Paste the PR link from https://github.com/netbirdio/docs here:

https://github.com/netbirdio/docs/pull/__

Summary by CodeRabbit

  • Bug Fixes
    • Improved DNS resolver caching with per-entry timestamps, TTL-based expiration, and automatic refresh of stale records. Failed refreshes now trigger a backoff while continuing to serve stale records for reliability. New logic also records insertion times and applies consistent TTLs to cached responses.

@coderabbitai
Contributor

coderabbitai bot commented Apr 3, 2026

📝 Walkthrough

Walkthrough

Resolver cache now stores per-question cachedRecord entries with timestamps and failure markers. On each query ServeDNS checks TTL-based staleness, triggers a serialized refreshDomain when expired (with backoff), and otherwise serves cached records; AddDomain records TTLs and insertion time into cachedRecord.

Changes

Cohort / File(s) Summary
DNS cache & refresh logic
client/internal/dns/mgmt/mgmt.go
Replaced map[dns.Question][]dns.RR with map[dns.Question]*cachedRecord (records, cachedAt, lastFailedRefresh). AddDomain wraps records in cachedRecord and sets RR TTLs from defaultTTL. ServeDNS checks staleness vs defaultTTL and either serves cache or calls new refreshDomain. Added refreshDomain implementing mutex-serialized refresh, staleness re-check after lock, failure backoff via lastFailedRefresh, and stale-serving fallback on errors.

Sequence Diagram(s)

sequenceDiagram
  participant Client as Client (DNS request)
  participant Resolver as Resolver Cache
  participant Upstream as Upstream Resolver
  rect rgba(220,220,255,0.5)
    Client->>Resolver: DNS query (question)
    Resolver->>Resolver: Check cachedRecord + defaultTTL
    alt cache fresh
      Resolver-->>Client: return cached records
    else cache stale
      Resolver->>Resolver: acquire refreshMutex
      Resolver->>Upstream: re-resolve domain
      alt refresh success
        Upstream-->>Resolver: new RRs
        Resolver-->>Client: return new records
      else refresh failed (backoff)
        Resolver-->>Client: return stale records (update lastFailedRefresh)
      end
      Resolver->>Resolver: release refreshMutex
    end
  end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

  • #4983: Modifies the same client/internal/dns/mgmt/mgmt.go and AddDomain to enhance DNS lookup timing/timeout handling — closely related to DNS resolution logic.

Suggested reviewers

  • pappz

Poem

🐰 I hopped through caches, timestamp in paw,
I sniffed stale answers and fixed the flaw,
I lock, I refresh, I back off when needed,
Now domains update and connections are seeded. 🥕

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Docstring Coverage — ⚠️ Warning: Docstring coverage is 50.00%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Title check — ✅ Passed: The title clearly and specifically describes the main change: adding a TTL-based cache refresh mechanism to the management DNS resolver, which directly addresses the issue of the DNS cache not being updated.
Description check — ✅ Passed: The description covers all required sections including change description, issue ticket link (#5113), classification as bug fix, and documentation statement. All template requirements are adequately fulfilled.
Linked Issues check — ✅ Passed: The code changes implement the required functionality: TTL expiration mechanism with 300s TTL, synchronous refresh using net.DefaultResolver, and fallback to stale cache on refresh failure, directly addressing issue #5113.
Out of Scope Changes check — ✅ Passed: All changes are scoped to implementing TTL-based cache refresh for the management DNS resolver; no unrelated modifications or scope creep detected in the resolver cache structure updates.



Warning

Review ran into problems

🔥 Problems

Timed out fetching pipeline failures after 30000ms



Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
client/internal/dns/mgmt/mgmt.go (1)

158-183: ⚠️ Potential issue | 🟠 Major

Clear vanished record families during a successful refresh.

Line 161 and Line 173 only overwrite non-empty slices. If a domain stops returning A or AAAA records, the old entry stays in m.records, and Lines 215-217 then keep serving that stale RR set after a “successful” refresh. That leaves deleted addresses cached forever.

🩹 Suggested fix
  now := time.Now()
  m.mutex.Lock()
+ aQuestion := dns.Question{
+     Name:   dnsName,
+     Qtype:  dns.TypeA,
+     Qclass: dns.ClassINET,
+ }
+ aaaaQuestion := dns.Question{
+     Name:   dnsName,
+     Qtype:  dns.TypeAAAA,
+     Qclass: dns.ClassINET,
+ }
+ delete(m.records, aQuestion)
+ delete(m.records, aaaaQuestion)

- if len(aRecords) > 0 {
-     aQuestion := dns.Question{
-         Name:   dnsName,
-         Qtype:  dns.TypeA,
-         Qclass: dns.ClassINET,
-     }
+ if len(aRecords) > 0 {
      m.records[aQuestion] = &cachedRecord{
          records:  aRecords,
          cachedAt: now,
      }
  }

- if len(aaaaRecords) > 0 {
-     aaaaQuestion := dns.Question{
-         Name:   dnsName,
-         Qtype:  dns.TypeAAAA,
-         Qclass: dns.ClassINET,
-     }
+ if len(aaaaRecords) > 0 {
      m.records[aaaaQuestion] = &cachedRecord{
          records:  aaaaRecords,
          cachedAt: now,
      }
  }
- if !found {
-     return stale.records
- }
+ if !found {
+     return nil
+ }

Also applies to: 211-217

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@client/internal/dns/mgmt/mgmt.go` around lines 158 - 183, The refresh logic
currently only writes cache entries when aRecords or aaaaRecords are non-empty,
leaving stale entries in m.records when record families vanish; update the block
that handles aRecords and aaaaRecords (the aQuestion and aaaaQuestion keys in
m.records, and the cachedRecord values) to explicitly remove the corresponding
dns.Question key from m.records when the slice is empty (i.e., delete
m.records[aQuestion] / delete m.records[aaaaQuestion]) while still setting the
new cachedRecord when non-empty, keeping the existing mutex.Lock/unlock and
time.Now usage.
🧹 Nitpick comments (1)
client/internal/dns/mgmt/mgmt.go (1)

37-37: Re-check the current map entry after taking the refresh gate.

Line 200 tests the pre-lock stale pointer. If another request refreshed m.records[question] while this goroutine was waiting, stale.cachedAt is still old and we do a second lookup anyway. Re-reading m.records[question] under m.mutex (or gating per question with singleflight) would make the dedupe actually work.

Also applies to: 199-202
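The re-check this comment suggests amounts to a double-check under the cache mutex after acquiring the refresh gate. A minimal sketch of that pattern (all names are illustrative; the real cache is keyed by dns.Question, and a per-question singleflight is the alternative the comment mentions):

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

const ttl = 300 * time.Second

type rec struct {
	addrs    []string
	cachedAt time.Time
}

type cache struct {
	mu           sync.Mutex // guards m
	refreshMutex sync.Mutex // serializes refreshes
	m            map[string]*rec
	lookups      int // counts upstream lookups, to show the dedupe works
}

// get re-checks freshness after acquiring the refresh gate so a refresh
// completed by another goroutine while we waited is not repeated.
func (c *cache) get(q string) []string {
	c.mu.Lock()
	r := c.m[q]
	c.mu.Unlock()
	if r != nil && time.Since(r.cachedAt) < ttl {
		return r.addrs
	}

	c.refreshMutex.Lock()
	defer c.refreshMutex.Unlock()

	// Double-check: re-read the entry under the cache mutex, since
	// another goroutine may have refreshed it before we got the gate.
	c.mu.Lock()
	r = c.m[q]
	c.mu.Unlock()
	if r != nil && time.Since(r.cachedAt) < ttl {
		return r.addrs
	}

	c.lookups++ // stand-in for the real upstream lookup
	fresh := &rec{addrs: []string{"192.0.2.10"}, cachedAt: time.Now()}
	c.mu.Lock()
	c.m[q] = fresh
	c.mu.Unlock()
	return fresh.addrs
}

func main() {
	c := &cache{m: map[string]*rec{}}
	c.get("mgmt.example.com.")
	c.get("mgmt.example.com.") // second call is served from the fresh cache
	fmt.Println("lookups:", c.lookups)
}
```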

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@client/internal/dns/mgmt/mgmt.go` at line 37, The refresh gate logic checks
the pre-lock stale pointer (stale.cachedAt) and can trigger duplicate refreshes
if another goroutine updated m.records[question] while this goroutine was
waiting on refreshMutex; after acquiring refreshMutex (refreshMutex) re-read
m.records[question] under m.mutex (or use a per-question singleflight) and
re-evaluate stale.cachedAt/freshness before performing the lookup so you skip
the redundant refresh when another request already updated the entry.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@client/internal/dns/mgmt/mgmt.go`:
- Around line 195-208: The refreshDomain function currently retries AddDomain on
every query when a cachedRecord is expired; modify Resolver.refreshDomain to
implement a backoff: add a retry/backoff timestamp field on cachedRecord (e.g.,
lastFailedRefresh or nextRefreshAttempt) and, before attempting AddDomain, check
that timestamp and if the backoff window hasn't elapsed simply return
stale.records immediately; on AddDomain failure set/update the backoff timestamp
(exponential or fixed interval) on the cachedRecord and return stale.records,
and on success clear/reset the failure timestamp and update cachedAt/records as
before. Ensure references to Resolver.refreshDomain, cachedRecord, and AddDomain
are used so the change is made in the right places.

---

Outside diff comments:
In `@client/internal/dns/mgmt/mgmt.go`:
- Around line 158-183: The refresh logic currently only writes cache entries
when aRecords or aaaaRecords are non-empty, leaving stale entries in m.records
when record families vanish; update the block that handles aRecords and
aaaaRecords (the aQuestion and aaaaQuestion keys in m.records, and the
cachedRecord values) to explicitly remove the corresponding dns.Question key
from m.records when the slice is empty (i.e., delete m.records[aQuestion] /
delete m.records[aaaaQuestion]) while still setting the new cachedRecord when
non-empty, keeping the existing mutex.Lock/unlock and time.Now usage.

---

Nitpick comments:
In `@client/internal/dns/mgmt/mgmt.go`:
- Line 37: The refresh gate logic checks the pre-lock stale pointer
(stale.cachedAt) and can trigger duplicate refreshes if another goroutine
updated m.records[question] while this goroutine was waiting on refreshMutex;
after acquiring refreshMutex (refreshMutex) re-read m.records[question] under
m.mutex (or use a per-question singleflight) and re-evaluate
stale.cachedAt/freshness before performing the lookup so you skip the redundant
refresh when another request already updated the entry.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: f8b7e34e-dbd7-4287-82cb-42a603af2769

📥 Commits

Reviewing files that changed from the base of the PR and between 28fbf96 and 4edc62d.

📒 Files selected for processing (1)
  • client/internal/dns/mgmt/mgmt.go

@sonarqubecloud

sonarqubecloud bot commented Apr 3, 2026

Contributor

@coderabbitai coderabbitai bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
client/internal/dns/mgmt/mgmt.go (1)

163-185: ⚠️ Potential issue | 🟠 Major

A successful refresh can still return a removed RR family.

If the fresh lookup only returns one family, AddDomain() leaves the old sibling entry in m.records. The final lookup in refreshDomain() then finds that stale entry and returns it as if the refresh succeeded. A domain that moves from A+AAAA to a single family will keep serving the removed records indefinitely, and every lookup of that family keeps re-entering the blocking refresh path because its cachedAt never advances.

🔧 One way to make the "missing type" path real
-	if len(aRecords) > 0 {
-		aQuestion := dns.Question{
-			Name:   dnsName,
-			Qtype:  dns.TypeA,
-			Qclass: dns.ClassINET,
-		}
+	aQuestion := dns.Question{
+		Name:   dnsName,
+		Qtype:  dns.TypeA,
+		Qclass: dns.ClassINET,
+	}
+	if len(aRecords) > 0 {
 		m.records[aQuestion] = &cachedRecord{
 			records:  aRecords,
 			cachedAt: now,
 		}
+	} else {
+		delete(m.records, aQuestion)
 	}
 
-	if len(aaaaRecords) > 0 {
-		aaaaQuestion := dns.Question{
-			Name:   dnsName,
-			Qtype:  dns.TypeAAAA,
-			Qclass: dns.ClassINET,
-		}
+	aaaaQuestion := dns.Question{
+		Name:   dnsName,
+		Qtype:  dns.TypeAAAA,
+		Qclass: dns.ClassINET,
+	}
+	if len(aaaaRecords) > 0 {
 		m.records[aaaaQuestion] = &cachedRecord{
 			records:  aaaaRecords,
 			cachedAt: now,
 		}
+	} else {
+		delete(m.records, aaaaQuestion)
 	}
@@
 	if !found {
 		// DNS returned no records for this type, preserve stale with backoff
 		now := time.Now()
 		current.lastFailedRefresh = &now
+		m.mutex.Lock()
+		if _, exists := m.records[question]; !exists {
+			m.records[question] = current
+		}
+		m.mutex.Unlock()
 		return current.records
 	}

Also applies to: 230-239

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@client/internal/dns/mgmt/mgmt.go` around lines 163 - 185,
AddDomain/refreshDomain currently only writes entries for record families
present (aRecords or aaaaRecords) which leaves stale sibling entries in
m.records; update AddDomain (where aQuestion/aaaaQuestion are created and
m.records written) to also remove or refresh the opposite family when it is
absent: if aRecords is empty, delete any existing dns.Question with Qtype
dns.TypeA for that dnsName (or update its cachedRecord.cachedAt to now and clear
records), and similarly if aaaaRecords is empty delete/update the dns.TypeAAAA
entry. This ensures refreshDomain will not return removed stale RRs and that
cachedAt advances to avoid re-entering blocking refresh paths.
🧹 Nitpick comments (1)
client/internal/dns/mgmt/mgmt.go (1)

27-31: Refresh coordination is scoped inconsistently.

AddDomain() refreshes a whole domain in one lookup, but the failure marker lives on a single dns.Question while refreshMutex is shared by the entire resolver. After one failed A refresh, the AAAA query for the same name can still trigger its own blocking lookup, and that second lookup also stalls unrelated domains behind the same mutex. A domain-keyed refresh/backoff state would avoid both duplicate retries and cross-domain head-of-line blocking.

Also applies to: 35-39, 216-227
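A domain-keyed lock map of the kind this comment suggests could look like the sketch below, under the assumption that a plain map of mutexes is acceptable (golang.org/x/sync/singleflight would be the ready-made alternative). Because A and AAAA queries for one name map to the same domain key, they share one lock, and a slow refresh of one domain no longer blocks refreshes of unrelated domains.

```go
package main

import (
	"fmt"
	"sync"
)

// domainLocks hands out one mutex per domain. This is a sketch of the
// review suggestion, not code from mgmt.go.
type domainLocks struct {
	mu    sync.Mutex // guards locks
	locks map[string]*sync.Mutex
}

func newDomainLocks() *domainLocks {
	return &domainLocks{locks: map[string]*sync.Mutex{}}
}

// lockFor returns the mutex for a domain, creating it on first use.
// Callers refreshing either record family of the same name therefore
// serialize on the same lock and can share backoff state guarded by it.
func (d *domainLocks) lockFor(domain string) *sync.Mutex {
	d.mu.Lock()
	defer d.mu.Unlock()
	l, ok := d.locks[domain]
	if !ok {
		l = &sync.Mutex{}
		d.locks[domain] = l
	}
	return l
}

func main() {
	d := newDomainLocks()
	a := d.lockFor("mgmt.example.com.")
	b := d.lockFor("mgmt.example.com.")
	c := d.lockFor("other.example.com.")
	fmt.Println(a == b, a == c) // same domain shares a lock; different domains do not
}
```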

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@client/internal/dns/mgmt/mgmt.go` around lines 27 - 31, The refresh/backoff
state is scoped to a single dns.Question while refreshMutex is global, causing
cross-domain blocking and duplicated retries; update the logic to track refresh
state per domain (e.g., use the domain string or fqdn as the key) instead of
per-Question: introduce a domain-keyed refresh/backoff map (or per-domain mutex
map) and move the lastFailedRefresh marker out of the dns.Question-scoped
cachedRecord into that domain-keyed structure, then change AddDomain(), the
refresh path that currently consults cachedRecord.lastFailedRefresh, and any use
of refreshMutex to consult/lock the per-domain entry (or per-domain mutex) so A
and AAAA for the same name share backoff and unrelated domains don't get blocked
by a single global mutex.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In `@client/internal/dns/mgmt/mgmt.go`:
- Around line 163-185: AddDomain/refreshDomain currently only writes entries for
record families present (aRecords or aaaaRecords) which leaves stale sibling
entries in m.records; update AddDomain (where aQuestion/aaaaQuestion are created
and m.records written) to also remove or refresh the opposite family when it is
absent: if aRecords is empty, delete any existing dns.Question with Qtype
dns.TypeA for that dnsName (or update its cachedRecord.cachedAt to now and clear
records), and similarly if aaaaRecords is empty delete/update the dns.TypeAAAA
entry. This ensures refreshDomain will not return removed stale RRs and that
cachedAt advances to avoid re-entering blocking refresh paths.

---

Nitpick comments:
In `@client/internal/dns/mgmt/mgmt.go`:
- Around line 27-31: The refresh/backoff state is scoped to a single
dns.Question while refreshMutex is global, causing cross-domain blocking and
duplicated retries; update the logic to track refresh state per domain (e.g.,
use the domain string or fqdn as the key) instead of per-Question: introduce a
domain-keyed refresh/backoff map (or per-domain mutex map) and move the
lastFailedRefresh marker out of the dns.Question-scoped cachedRecord into that
domain-keyed structure, then change AddDomain(), the refresh path that currently
consults cachedRecord.lastFailedRefresh, and any use of refreshMutex to
consult/lock the per-domain entry (or per-domain mutex) so A and AAAA for the
same name share backoff and unrelated domains don't get blocked by a single
global mutex.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 8a33d67e-d131-4bfc-be80-fcbb66582b75

📥 Commits

Reviewing files that changed from the base of the PR and between 4edc62d and da96f92.

📒 Files selected for processing (1)
  • client/internal/dns/mgmt/mgmt.go



Development

Successfully merging this pull request may close these issues.

Management DNS cache not auto update

1 participant