[client] Add TTL-based cache refresh for management DNS resolver #5794
plusls wants to merge 4 commits into netbirdio:main
Conversation
Add TTL expiration mechanism to MgmtCacheResolver to handle DDNS IP changes. When cache entries expire (300s TTL), refresh synchronously using net.DefaultResolver. On refresh failure, serve stale cache to maintain connectivity. Fixes netbirdio#5113
📝 Walkthrough
Resolver cache now stores per-question cached records.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client as Client (DNS request)
    participant Resolver as Resolver Cache
    participant Upstream as Upstream Resolver
    rect rgba(220,220,255,0.5)
        Client->>Resolver: DNS query (question)
        Resolver->>Resolver: Check cachedRecord + defaultTTL
        alt cache fresh
            Resolver-->>Client: return cached records
        else cache stale
            Resolver->>Resolver: acquire refreshMutex
            Resolver->>Upstream: re-resolve domain
            alt refresh success
                Upstream-->>Resolver: new RRs
                Resolver-->>Client: return new records
            else refresh failed (backoff)
                Resolver-->>Client: return stale records (update lastFailedRefresh)
            end
            Resolver->>Resolver: release refreshMutex
        end
    end
```
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
client/internal/dns/mgmt/mgmt.go (1)
158-183: ⚠️ Potential issue | 🟠 Major — Clear vanished record families during a successful refresh.

Line 161 and Line 173 only overwrite non-empty slices. If a domain stops returning A or AAAA records, the old entry stays in `m.records`, and Lines 215-217 then keep serving that stale RR set after a "successful" refresh. That leaves deleted addresses cached forever.

🩹 Suggested fix
```diff
 now := time.Now()
 m.mutex.Lock()
+aQuestion := dns.Question{
+	Name:   dnsName,
+	Qtype:  dns.TypeA,
+	Qclass: dns.ClassINET,
+}
+aaaaQuestion := dns.Question{
+	Name:   dnsName,
+	Qtype:  dns.TypeAAAA,
+	Qclass: dns.ClassINET,
+}
+delete(m.records, aQuestion)
+delete(m.records, aaaaQuestion)
-if len(aRecords) > 0 {
-	aQuestion := dns.Question{
-		Name:   dnsName,
-		Qtype:  dns.TypeA,
-		Qclass: dns.ClassINET,
-	}
+if len(aRecords) > 0 {
 	m.records[aQuestion] = &cachedRecord{
 		records:  aRecords,
 		cachedAt: now,
 	}
 }
-if len(aaaaRecords) > 0 {
-	aaaaQuestion := dns.Question{
-		Name:   dnsName,
-		Qtype:  dns.TypeAAAA,
-		Qclass: dns.ClassINET,
-	}
+if len(aaaaRecords) > 0 {
 	m.records[aaaaQuestion] = &cachedRecord{
 		records:  aaaaRecords,
 		cachedAt: now,
 	}
 }
```

```diff
-if !found {
-	return stale.records
-}
+if !found {
+	return nil
+}
```

Also applies to: 211-217
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@client/internal/dns/mgmt/mgmt.go` around lines 158 - 183, The refresh logic currently only writes cache entries when aRecords or aaaaRecords are non-empty, leaving stale entries in m.records when record families vanish; update the block that handles aRecords and aaaaRecords (the aQuestion and aaaaQuestion keys in m.records, and the cachedRecord values) to explicitly remove the corresponding dns.Question key from m.records when the slice is empty (i.e., delete m.records[aQuestion] / delete m.records[aaaaQuestion]) while still setting the new cachedRecord when non-empty, keeping the existing mutex.Lock/unlock and time.Now usage.
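Reduced to its essence, the delete-when-empty pattern this comment asks for looks like the sketch below. The string keys and record values are illustrative stand-ins for the real `dns.Question` keys and `dns.RR` slices.

```go
package main

import "fmt"

// updateFamily stores the fresh records for one RR family and deletes
// the cache entry when the latest lookup no longer returns that family,
// so vanished addresses are not served after a successful refresh.
func updateFamily(cache map[string][]string, key string, fresh []string) {
	if len(fresh) > 0 {
		cache[key] = fresh
		return
	}
	delete(cache, key) // family vanished: drop the stale entry
}

func main() {
	cache := map[string][]string{
		"example.com./A":    {"192.0.2.1"},
		"example.com./AAAA": {"2001:db8::1"},
	}
	// The domain dropped its AAAA record; a refresh returns only A.
	updateFamily(cache, "example.com./A", []string{"192.0.2.2"})
	updateFamily(cache, "example.com./AAAA", nil)
	fmt.Println(cache) // map[example.com./A:[192.0.2.2]]
}
```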
🧹 Nitpick comments (1)
client/internal/dns/mgmt/mgmt.go (1)
37-37: Re-check the current map entry after taking the refresh gate.

Line 200 tests the pre-lock `stale` pointer. If another request refreshed `m.records[question]` while this goroutine was waiting, `stale.cachedAt` is still old and we do a second lookup anyway. Re-reading `m.records[question]` under `m.mutex` (or gating per question with `singleflight`) would make the dedupe actually work. Also applies to: 199-202
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@client/internal/dns/mgmt/mgmt.go` at line 37, The refresh gate logic checks the pre-lock stale pointer (stale.cachedAt) and can trigger duplicate refreshes if another goroutine updated m.records[question] while this goroutine was waiting on refreshMutex; after acquiring refreshMutex (refreshMutex) re-read m.records[question] under m.mutex (or use a per-question singleflight) and re-evaluate stale.cachedAt/freshness before performing the lookup so you skip the redundant refresh when another request already updated the entry.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@client/internal/dns/mgmt/mgmt.go`:
- Around line 195-208: The refreshDomain function currently retries AddDomain on
every query when a cachedRecord is expired; modify Resolver.refreshDomain to
implement a backoff: add a retry/backoff timestamp field on cachedRecord (e.g.,
lastFailedRefresh or nextRefreshAttempt) and, before attempting AddDomain, check
that timestamp and if the backoff window hasn't elapsed simply return
stale.records immediately; on AddDomain failure set/update the backoff timestamp
(exponential or fixed interval) on the cachedRecord and return stale.records,
and on success clear/reset the failure timestamp and update cachedAt/records as
before. Ensure references to Resolver.refreshDomain, cachedRecord, and AddDomain
are used so the change is made in the right places.
---
Outside diff comments:
In `@client/internal/dns/mgmt/mgmt.go`:
- Around line 158-183: The refresh logic currently only writes cache entries
when aRecords or aaaaRecords are non-empty, leaving stale entries in m.records
when record families vanish; update the block that handles aRecords and
aaaaRecords (the aQuestion and aaaaQuestion keys in m.records, and the
cachedRecord values) to explicitly remove the corresponding dns.Question key
from m.records when the slice is empty (i.e., delete m.records[aQuestion] /
delete m.records[aaaaQuestion]) while still setting the new cachedRecord when
non-empty, keeping the existing mutex.Lock/unlock and time.Now usage.
---
Nitpick comments:
In `@client/internal/dns/mgmt/mgmt.go`:
- Line 37: The refresh gate logic checks the pre-lock stale pointer
(stale.cachedAt) and can trigger duplicate refreshes if another goroutine
updated m.records[question] while this goroutine was waiting on refreshMutex;
after acquiring refreshMutex (refreshMutex) re-read m.records[question] under
m.mutex (or use a per-question singleflight) and re-evaluate
stale.cachedAt/freshness before performing the lookup so you skip the redundant
refresh when another request already updated the entry.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: f8b7e34e-dbd7-4287-82cb-42a603af2769
📒 Files selected for processing (1)
client/internal/dns/mgmt/mgmt.go
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
client/internal/dns/mgmt/mgmt.go (1)
163-185: ⚠️ Potential issue | 🟠 Major — A successful refresh can still return a removed RR family.

If the fresh lookup only returns one family, `AddDomain()` leaves the old sibling entry in `m.records`. The final lookup in `refreshDomain()` then finds that stale entry and returns it as if the refresh succeeded. A domain that moves from A+AAAA to a single family will keep serving the removed records indefinitely, and every lookup of that family keeps re-entering the blocking refresh path because its `cachedAt` never advances.

🔧 One way to make the "missing type" path real
```diff
-if len(aRecords) > 0 {
-	aQuestion := dns.Question{
-		Name:   dnsName,
-		Qtype:  dns.TypeA,
-		Qclass: dns.ClassINET,
-	}
+aQuestion := dns.Question{
+	Name:   dnsName,
+	Qtype:  dns.TypeA,
+	Qclass: dns.ClassINET,
+}
+if len(aRecords) > 0 {
 	m.records[aQuestion] = &cachedRecord{
 		records:  aRecords,
 		cachedAt: now,
 	}
+} else {
+	delete(m.records, aQuestion)
 }
-if len(aaaaRecords) > 0 {
-	aaaaQuestion := dns.Question{
-		Name:   dnsName,
-		Qtype:  dns.TypeAAAA,
-		Qclass: dns.ClassINET,
-	}
+aaaaQuestion := dns.Question{
+	Name:   dnsName,
+	Qtype:  dns.TypeAAAA,
+	Qclass: dns.ClassINET,
+}
+if len(aaaaRecords) > 0 {
 	m.records[aaaaQuestion] = &cachedRecord{
 		records:  aaaaRecords,
 		cachedAt: now,
 	}
+} else {
+	delete(m.records, aaaaQuestion)
 }
@@
 if !found {
 	// DNS returned no records for this type, preserve stale with backoff
 	now := time.Now()
 	current.lastFailedRefresh = &now
+	m.mutex.Lock()
+	if _, exists := m.records[question]; !exists {
+		m.records[question] = current
+	}
+	m.mutex.Unlock()
 	return current.records
 }
```

Also applies to: 230-239
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@client/internal/dns/mgmt/mgmt.go` around lines 163 - 185, AddDomain/refreshDomain currently only writes entries for record families present (aRecords or aaaaRecords) which leaves stale sibling entries in m.records; update AddDomain (where aQuestion/aaaaQuestion are created and m.records written) to also remove or refresh the opposite family when it is absent: if aRecords is empty, delete any existing dns.Question with Qtype dns.TypeA for that dnsName (or update its cachedRecord.cachedAt to now and clear records), and similarly if aaaaRecords is empty delete/update the dns.TypeAAAA entry. This ensures refreshDomain will not return removed stale RRs and that cachedAt advances to avoid re-entering blocking refresh paths.
🧹 Nitpick comments (1)
client/internal/dns/mgmt/mgmt.go (1)
27-31: Refresh coordination is scoped inconsistently.

`AddDomain()` refreshes a whole domain in one lookup, but the failure marker lives on a single `dns.Question` while `refreshMutex` is shared by the entire resolver. After one failed A refresh, the AAAA query for the same name can still trigger its own blocking lookup, and that second lookup also stalls unrelated domains behind the same mutex. A domain-keyed refresh/backoff state would avoid both duplicate retries and cross-domain head-of-line blocking. Also applies to: 35-39, 216-227
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@client/internal/dns/mgmt/mgmt.go` around lines 27 - 31, The refresh/backoff state is scoped to a single dns.Question while refreshMutex is global, causing cross-domain blocking and duplicated retries; update the logic to track refresh state per domain (e.g., use the domain string or fqdn as the key) instead of per-Question: introduce a domain-keyed refresh/backoff map (or per-domain mutex map) and move the lastFailedRefresh marker out of the dns.Question-scoped cachedRecord into that domain-keyed structure, then change AddDomain(), the refresh path that currently consults cachedRecord.lastFailedRefresh, and any use of refreshMutex to consult/lock the per-domain entry (or per-domain mutex) so A and AAAA for the same name share backoff and unrelated domains don't get blocked by a single global mutex.
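A domain-keyed structure like the one this prompt suggests could look like the sketch below. `domainState`, `stateFor`, and the field names are hypothetical; the point is that A and AAAA lookups for one name share a single gate and backoff, and one stalled domain never blocks another.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// domainState holds refresh coordination keyed by domain name rather
// than by dns.Question.
type domainState struct {
	mu                sync.Mutex // serializes refreshes of this domain only
	lastFailedRefresh time.Time  // shared backoff marker for A and AAAA
}

type resolver struct {
	mu     sync.Mutex // guards states
	states map[string]*domainState
}

// stateFor returns the per-domain refresh state, creating it on first use.
func (r *resolver) stateFor(domain string) *domainState {
	r.mu.Lock()
	defer r.mu.Unlock()
	st, ok := r.states[domain]
	if !ok {
		st = &domainState{}
		r.states[domain] = st
	}
	return st
}

func main() {
	r := &resolver{states: map[string]*domainState{}}
	a := r.stateFor("mgmt.example.com.")  // A query
	b := r.stateFor("mgmt.example.com.")  // AAAA query, same name
	c := r.stateFor("other.example.com.") // unrelated domain
	fmt.Println(a == b, a == c) // true false
}
```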
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@client/internal/dns/mgmt/mgmt.go`:
- Around line 163-185: AddDomain/refreshDomain currently only writes entries for
record families present (aRecords or aaaaRecords) which leaves stale sibling
entries in m.records; update AddDomain (where aQuestion/aaaaQuestion are created
and m.records written) to also remove or refresh the opposite family when it is
absent: if aRecords is empty, delete any existing dns.Question with Qtype
dns.TypeA for that dnsName (or update its cachedRecord.cachedAt to now and clear
records), and similarly if aaaaRecords is empty delete/update the dns.TypeAAAA
entry. This ensures refreshDomain will not return removed stale RRs and that
cachedAt advances to avoid re-entering blocking refresh paths.
---
Nitpick comments:
In `@client/internal/dns/mgmt/mgmt.go`:
- Around line 27-31: The refresh/backoff state is scoped to a single
dns.Question while refreshMutex is global, causing cross-domain blocking and
duplicated retries; update the logic to track refresh state per domain (e.g.,
use the domain string or fqdn as the key) instead of per-Question: introduce a
domain-keyed refresh/backoff map (or per-domain mutex map) and move the
lastFailedRefresh marker out of the dns.Question-scoped cachedRecord into that
domain-keyed structure, then change AddDomain(), the refresh path that currently
consults cachedRecord.lastFailedRefresh, and any use of refreshMutex to
consult/lock the per-domain entry (or per-domain mutex) so A and AAAA for the
same name share backoff and unrelated domains don't get blocked by a single
global mutex.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 8a33d67e-d131-4bfc-be80-fcbb66582b75
📒 Files selected for processing (1)
client/internal/dns/mgmt/mgmt.go



Describe your changes
Add TTL expiration mechanism to MgmtCacheResolver to handle DDNS IP changes. When cache entries expire (300s TTL), refresh synchronously using net.DefaultResolver. On refresh failure, serve stale cache to maintain connectivity.
Fixes #5113
Issue ticket number and link
#5113
Stack
Checklist
Documentation
Select exactly one:
Docs PR URL (required if "docs added" is checked)
Paste the PR link from https://github.com/netbirdio/docs here:
https://github.com/netbirdio/docs/pull/__