region_cache: return nil reason from GetTiFlashRPCContext#1959
Conversation
Signed-off-by: gengliqi <gengliqiii@gmail.com>
|
Warning Rate limit exceeded
You’ve run out of usage credits. Purchase more in the billing tab. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
📝 WalkthroughWalkthroughGetTiFlashRPCContext now returns structured unavailability details (TiFlashRPCContextUnavailableDetail) plus RPCContext and error. The change introduces typed unavailability reasons/constants, updates selection logic to report specific failure causes, updates call sites/tests to accept the extra return value, and re-exports the types in the public API. ChangesTiFlash RPC Context Unavailability Reason Reporting
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Tip 💬 Introducing Slack Agent: The best way for teams to turn conversations into code.Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.
Built for teams:
One agent for your entire SDLC. Right inside Slack. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 4
🧹 Nitpick comments (1)
internal/locate/region_cache_test.go (1)
602-603: ⚡ Quick winAssert
TiFlashRPCContextUnavailableReasonin updated TiFlash tests.The new return value is currently ignored at Line 602, Line 620, Line 638, and Line 1354. Since this PR’s purpose is exposing why context is unavailable, these tests should assert the reason (e.g.,
TiFlashRPCContextAvailablewhenctx != nil, and specific unavailable reasons wherectx == nil) to lock in behavior.Proposed test assertion pattern
-ctx, _, err := s.cache.GetTiFlashRPCContext(s.bo, loc.Region, true, LabelFilterNoTiFlashWriteNode) +ctx, reason, err := s.cache.GetTiFlashRPCContext(s.bo, loc.Region, true, LabelFilterNoTiFlashWriteNode) s.Nil(err) s.NotNil(ctx) +s.Equal(TiFlashRPCContextAvailable, reason)Also applies to: 620-621, 638-639, 1354-1355
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/locate/region_cache_test.go` around lines 602 - 603, The test currently ignores the second return value of s.cache.GetTiFlashRPCContext; update each call (e.g., the calls with LabelFilterNoTiFlashWriteNode) to receive the TiFlashRPCContextUnavailableReason, then assert that when ctx != nil the reason equals TiFlashRPCContextAvailable, and when ctx == nil assert the reason equals the specific expected unavailable enum for that scenario (use the appropriate TiFlashRPCContextUnavailableReason value corresponding to the test case). Ensure you reference s.cache.GetTiFlashRPCContext, TiFlashRPCContextUnavailableReason, TiFlashRPCContextAvailable, and the LabelFilter* symbols when making the assertions.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@internal/locate/region_cache.go`:
- Around line 92-95: The const block containing TiFlashRPCContextAvailable,
TiFlashRPCContextUnavailableError, and
TiFlashRPCContextUnavailableCachedRegionMissing is mis-aligned and failing
gofmt; fix it by running gofmt (or adjust spacing/tabs) so the identifiers,
types (TiFlashRPCContextUnavailableReason) and assignment values are properly
formatted/aligned (remove the stray tab/whitespace on the
TiFlashRPCContextUnavailableError line) and re-run gofmt to ensure CI passes.
- Around line 1145-1152: The code currently calls cachedRegion.invalidate(Other)
unconditionally which invalidates the cache even for label-filter-only failures;
change the logic so invalidate is only called for cases that imply stale region
metadata: move cachedRegion.invalidate(Other) into the branch that returns
TiFlashRPCContextUnavailableAllStoresEpochStale (and any other branch that truly
indicates stale metadata), but do not invalidate when returning
TiFlashRPCContextUnavailableAllStoresFiltered or
TiFlashRPCContextUnavailableNoAvailableStore; use the existing symbols
labelFilteredCount, epochMismatchCount,
TiFlashRPCContextUnavailableAllStoresFiltered,
TiFlashRPCContextUnavailableAllStoresEpochStale and cachedRegion.invalidate to
locate and update the code.
In `@internal/locate/region_request.go`:
- Around line 855-856: The call to regionCache.GetTiFlashRPCContext currently
drops the returned reason (rpcCtx, _, err :=
s.regionCache.GetTiFlashRPCContext(...)) which causes all nil TiFlash contexts
to be treated as generic retries; modify getRPCContext and sendReqState call
sites to accept and propagate the returned reason value from
regionCache.GetTiFlashRPCContext (preserve the second return value), thread that
reason through getRPCContext/sendReqState signatures, and update sendReqState's
logic to map non-retryable reasons (e.g., all_tiflash_stores_filtered_by_label,
no_tiflash_access_store) to dedicated handling paths instead of treating them as
EpochNotMatch/retry; ensure callers handle the propagated reason to avoid
fabricating stale-related retries.
In `@tikv/region.go`:
- Around line 66-77: The const block for TiFlash reasons is misaligned and
failing gofmt; open the const group containing TiFlashRPCContextAvailable,
TiFlashRPCContextUnavailableError,
TiFlashRPCContextUnavailableCachedRegionMissing, etc., and reformat it (run
gofmt or fix spacing/tabs so all equals signs and values align or simply use
standard single-tab alignment) so the block conforms to gofmt rules; save the
file and re-run `gofmt`/`golangci-lint` to ensure the formatting error around
TiFlashRPCContextUnavailableError is resolved.
---
Nitpick comments:
In `@internal/locate/region_cache_test.go`:
- Around line 602-603: The test currently ignores the second return value of
s.cache.GetTiFlashRPCContext; update each call (e.g., the calls with
LabelFilterNoTiFlashWriteNode) to receive the
TiFlashRPCContextUnavailableReason, then assert that when ctx != nil the reason
equals TiFlashRPCContextAvailable, and when ctx == nil assert the reason equals
the specific expected unavailable enum for that scenario (use the appropriate
TiFlashRPCContextUnavailableReason value corresponding to the test case). Ensure
you reference s.cache.GetTiFlashRPCContext, TiFlashRPCContextUnavailableReason,
TiFlashRPCContextAvailable, and the LabelFilter* symbols when making the
assertions.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: d6868ca4-87b6-4d9f-b62a-eb940e9de80d
📒 Files selected for processing (4)
internal/locate/region_cache.gointernal/locate/region_cache_test.gointernal/locate/region_request.gotikv/region.go
| rpcCtx, _, err := s.regionCache.GetTiFlashRPCContext(bo, regionID, true, LabelFilterNoTiFlashWriteNode) | ||
| return rpcCtx, err |
There was a problem hiding this comment.
Propagate the TiFlash unavailability reason instead of dropping it here.
Discarding reason means every nil TiFlash context still falls into the generic rpcCtx == nil path later, which fabricates EpochNotMatch and retries. That loses the new signal for cases like all_tiflash_stores_filtered_by_label or no_tiflash_access_store, so the main request path still can't react differently to non-stale failures. Please thread the reason through getRPCContext/sendReqState and map the non-retryable cases to dedicated handling.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@internal/locate/region_request.go` around lines 855 - 856, The call to
regionCache.GetTiFlashRPCContext currently drops the returned reason (rpcCtx, _,
err := s.regionCache.GetTiFlashRPCContext(...)) which causes all nil TiFlash
contexts to be treated as generic retries; modify getRPCContext and sendReqState
call sites to accept and propagate the returned reason value from
regionCache.GetTiFlashRPCContext (preserve the second return value), thread that
reason through getRPCContext/sendReqState signatures, and update sendReqState's
logic to map non-retryable reasons (e.g., all_tiflash_stores_filtered_by_label,
no_tiflash_access_store) to dedicated handling paths instead of treating them as
EpochNotMatch/retry; ensure callers handle the propagated reason to avoid
fabricating stale-related retries.
|
@windtalker: adding LGTM is restricted to approvers and reviewers in OWNERS files. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
There was a problem hiding this comment.
♻️ Duplicate comments (1)
internal/locate/region_cache.go (1)
1163-1169:⚠️ Potential issue | 🟠 Major | ⚡ Quick winDon't invalidate the cache for the all-filtered case.
Line 1163 still evicts the region before the code distinguishes
TiFlashRPCContextUnavailableAllStoresFiltered. That turns a caller-side label mismatch into a stale-cache signal and forces an unnecessary reload on the next request.🛠️ Minimal fix
- cachedRegion.invalidate(Other) if labelFilteredCount == accessStoreNum { return nil, TiFlashRPCContextUnavailableDetail{ Reason: TiFlashRPCContextUnavailableAllStoresFiltered, StoreIDs: storeIDs,🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/locate/region_cache.go` around lines 1163 - 1169, The code currently calls cachedRegion.invalidate(Other) before checking the all-filtered condition, which causes label-mismatch to be treated as stale cache; modify the logic in the region handling block (where cachedRegion.invalidate is called, and variables labelFilteredCount and accessStoreNum are used) so that you only call cachedRegion.invalidate(Other) when the failure is truly an unavailable/stale-cache condition and NOT when labelFilteredCount == accessStoreNum (i.e., detect the TiFlashRPCContextUnavailableAllStoresFiltered case first and return the TiFlashRPCContextUnavailableDetail without evicting the cache); adjust control flow so cachedRegion.invalidate is either moved after the label-filter check or wrapped in a conditional that excludes the all-filtered case.
🧹 Nitpick comments (1)
internal/locate/region_cache.go (1)
114-115: ⚡ Quick winInclude
StoreIDsin the detail's string form.Because
TiFlashRPCContextUnavailableDetailimplementsfmt.Stringer, logging it via%v/%s/zap.Stringerwill currently dropStoreIDs, which is the most actionable context beyond the reason.♻️ Proposed change
func (d TiFlashRPCContextUnavailableDetail) String() string { - return d.Reason.String() + if len(d.StoreIDs) == 0 { + return d.Reason.String() + } + return fmt.Sprintf("%s store_ids=%v", d.Reason, d.StoreIDs) }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/locate/region_cache.go` around lines 114 - 115, The String() method for TiFlashRPCContextUnavailableDetail only returns d.Reason.String(), losing the actionable d.StoreIDs; update TiFlashRPCContextUnavailableDetail.String to include the StoreIDs (e.g., append a formatted representation of d.StoreIDs alongside d.Reason.String()) so that logging via fmt/Stringer or zap.Stringer surfaces both Reason and StoreIDs; reference the TiFlashRPCContextUnavailableDetail type and its fields Reason and StoreIDs when making the change.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Duplicate comments:
In `@internal/locate/region_cache.go`:
- Around line 1163-1169: The code currently calls cachedRegion.invalidate(Other)
before checking the all-filtered condition, which causes label-mismatch to be
treated as stale cache; modify the logic in the region handling block (where
cachedRegion.invalidate is called, and variables labelFilteredCount and
accessStoreNum are used) so that you only call cachedRegion.invalidate(Other)
when the failure is truly an unavailable/stale-cache condition and NOT when
labelFilteredCount == accessStoreNum (i.e., detect the
TiFlashRPCContextUnavailableAllStoresFiltered case first and return the
TiFlashRPCContextUnavailableDetail without evicting the cache); adjust control
flow so cachedRegion.invalidate is either moved after the label-filter check or
wrapped in a conditional that excludes the all-filtered case.
---
Nitpick comments:
In `@internal/locate/region_cache.go`:
- Around line 114-115: The String() method for
TiFlashRPCContextUnavailableDetail only returns d.Reason.String(), losing the
actionable d.StoreIDs; update TiFlashRPCContextUnavailableDetail.String to
include the StoreIDs (e.g., append a formatted representation of d.StoreIDs
alongside d.Reason.String()) so that logging via fmt/Stringer or zap.Stringer
surfaces both Reason and StoreIDs; reference the
TiFlashRPCContextUnavailableDetail type and its fields Reason and StoreIDs when
making the change.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 63d5b140-311f-4dbf-bf9a-7a878dc6efd1
📒 Files selected for processing (2)
internal/locate/region_cache.gotikv/region.go
✅ Files skipped from review due to trivial changes (1)
- tikv/region.go
Signed-off-by: gengliqi <gengliqiii@gmail.com>
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: cfzjywxk, lcwangchao, windtalker, zyguan The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
/cherry-pick tidb-8.5 |
|
@gengliqi: new pull request created to branch DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository. |
|
/cherry-pick tidb-8.5-20260113-v8.5.4 |
|
@gengliqi: new pull request created to branch DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository. |
client-go PR tikv/client-go#1959 extended GetTiFlashRPCContext to also return a TiFlashRPCContextUnavailableDetail. Upstream pingcap/tidb master still pins a pre-pingcap#1959 client-go and was not updated in lockstep, so merging the latest client-go into this bundle broke the build at the sole caller. Discard the new reason value here to match the only other caller in client-go itself (internal/locate/region_request.go); reason consumption can be added later if needed. Signed-off-by: Yuhao Zhang <yhzhang00@outlook.com>
Return a reason when GetTiFlashRPCContext returns nil to help callers know why the TiFlash RPC context is unavailable.
Summary by CodeRabbit
New Features
Improvements