region_cache: return nil reason from GetTiFlashRPCContext#1959

Merged
ti-chi-bot[bot] merged 5 commits into tikv:master from gengliqi:add-reason-for-tiflash-rpc-context-master
May 11, 2026

Conversation

@gengliqi
Member

@gengliqi gengliqi commented May 10, 2026

Return a reason when GetTiFlashRPCContext returns nil to help callers know why the TiFlash RPC context is unavailable.

Summary by CodeRabbit

  • New Features

    • Exposed detailed TiFlash RPC-context unavailability reasons and structured details for clearer diagnostics.
  • Improvements

    • More explicit handling and reporting when TiFlash contexts are unavailable (including store lists and specific causes).
    • Improved TiFlash store selection with better load-balancing and label-based filtering for more reliable routing.

Signed-off-by: gengliqi <gengliqiii@gmail.com>
@ti-chi-bot ti-chi-bot Bot added the dco-signoff: yes Indicates the PR's author has signed the dco. label May 10, 2026
@coderabbitai

coderabbitai Bot commented May 10, 2026

Warning

Rate limit exceeded

@gengliqi has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 16 minutes and 7 seconds before requesting another review.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 1facca27-d6ea-4350-8cf1-5b36ac375bc7

📥 Commits

Reviewing files that changed from the base of the PR and between 7398cd7 and 6911619.

📒 Files selected for processing (1)
  • internal/locate/region_cache.go
📝 Walkthrough

GetTiFlashRPCContext now returns structured unavailability details (TiFlashRPCContextUnavailableDetail) plus RPCContext and error. The change introduces typed unavailability reasons/constants, updates selection logic to report specific failure causes, updates call sites/tests to accept the extra return value, and re-exports the types in the public API.

Changes

TiFlash RPC Context Unavailability Reason Reporting

| Layer / File(s) | Summary |
|---|---|
| Type Definition & Unavailability Reasons (internal/locate/region_cache.go) | Adds TiFlashRPCContextUnavailableReason and TiFlashRPCContextUnavailableDetail types with String() methods and exported constants enumerating availability and unavailability categories. |
| GetTiFlashRPCContext Implementation (internal/locate/region_cache.go) | Changes signature to return (*RPCContext, TiFlashRPCContextUnavailableDetail, error); validates cached-region state, handles TTL/reload, treats zero TiFlash access stores, applies labelFilter before address resolution, tracks filtered/epoch-stale candidates, updates workTiFlashIdx, invalidates on epoch mismatch, and returns detailed unavailability or available RPCContext. |
| Call Site Integration (internal/locate/region_request.go) | TiFlash branch in getRPCContext now unpacks rpcCtx, _, err := GetTiFlashRPCContext(...) and returns rpcCtx, err. |
| Test Cases & Signature Validation (internal/locate/region_cache_test.go) | Updates TestTiFlashRecoveredFromDown and TestRegionEpochOnTiFlash to capture and discard the new middle return value from GetTiFlashRPCContext. |
| Public API Re-exports (tikv/region.go) | Adds exported type aliases and re-exports the TiFlash unavailability reason/detail types and constants for external callers. |
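
A minimal sketch of what the reason and detail types summarized above might look like. The constant names and the `Reason`/`StoreIDs` field names are taken from the review comments in this conversation; the int-backed enum, the `[]uint64` element type, and every string label except `all_tiflash_stores_filtered_by_label` are assumptions made for illustration, not the merged code.

```go
package main

import "fmt"

// TiFlashRPCContextUnavailableReason says why no TiFlash RPC context was returned.
// Assumption: an int-backed enum; the PR may use a different underlying type.
type TiFlashRPCContextUnavailableReason int

const (
	TiFlashRPCContextAvailable TiFlashRPCContextUnavailableReason = iota
	TiFlashRPCContextUnavailableError
	TiFlashRPCContextUnavailableCachedRegionMissing
	TiFlashRPCContextUnavailableAllStoresFiltered
	TiFlashRPCContextUnavailableAllStoresEpochStale
	TiFlashRPCContextUnavailableNoAvailableStore
)

// String returns a log-friendly label. Only the "filtered by label" text is
// quoted in the review; the other labels are placeholders.
func (r TiFlashRPCContextUnavailableReason) String() string {
	switch r {
	case TiFlashRPCContextAvailable:
		return "available"
	case TiFlashRPCContextUnavailableError:
		return "error"
	case TiFlashRPCContextUnavailableCachedRegionMissing:
		return "cached_region_missing"
	case TiFlashRPCContextUnavailableAllStoresFiltered:
		return "all_tiflash_stores_filtered_by_label"
	case TiFlashRPCContextUnavailableAllStoresEpochStale:
		return "all_stores_epoch_stale"
	case TiFlashRPCContextUnavailableNoAvailableStore:
		return "no_available_store"
	default:
		return fmt.Sprintf("unknown(%d)", int(r))
	}
}

// TiFlashRPCContextUnavailableDetail pairs the reason with the stores considered.
type TiFlashRPCContextUnavailableDetail struct {
	Reason   TiFlashRPCContextUnavailableReason
	StoreIDs []uint64 // assumed element type
}

func main() {
	d := TiFlashRPCContextUnavailableDetail{
		Reason:   TiFlashRPCContextUnavailableAllStoresFiltered,
		StoreIDs: []uint64{4, 5},
	}
	fmt.Println(d.Reason, d.StoreIDs)
}
```

Making `TiFlashRPCContextAvailable` the zero value means a zero-valued detail reads as "available", which is one plausible reason for listing it first.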

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

  • tikv/client-go#1826: Modifies region_cache.go TiFlash selection logic; closely related changes to TiFlash store access and selection.

Suggested reviewers

  • ekexium

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Title check | ⚠️ Warning | The title states 'return nil reason from GetTiFlashRPCContext', but the actual changes show the method now returns a non-nil TiFlashRPCContextUnavailableDetail struct with reason and store IDs, not a nil reason. | Update the title to accurately reflect that the method now returns structured unavailability details (TiFlashRPCContextUnavailableDetail) instead of nil, such as 'region_cache: return unavailability reason from GetTiFlashRPCContext'. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@ti-chi-bot ti-chi-bot Bot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label May 10, 2026

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 4

🧹 Nitpick comments (1)
internal/locate/region_cache_test.go (1)

602-603: ⚡ Quick win

Assert TiFlashRPCContextUnavailableReason in updated TiFlash tests.

The new return value is currently ignored at Line 602, Line 620, Line 638, and Line 1354. Since this PR’s purpose is exposing why context is unavailable, these tests should assert the reason (e.g., TiFlashRPCContextAvailable when ctx != nil, and specific unavailable reasons where ctx == nil) to lock in behavior.

Proposed test assertion pattern
-ctx, _, err := s.cache.GetTiFlashRPCContext(s.bo, loc.Region, true, LabelFilterNoTiFlashWriteNode)
+ctx, reason, err := s.cache.GetTiFlashRPCContext(s.bo, loc.Region, true, LabelFilterNoTiFlashWriteNode)
 s.Nil(err)
 s.NotNil(ctx)
+s.Equal(TiFlashRPCContextAvailable, reason)

Also applies to: 620-621, 638-639, 1354-1355

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/locate/region_cache_test.go` around lines 602 - 603, The test
currently ignores the second return value of s.cache.GetTiFlashRPCContext;
update each call (e.g., the calls with LabelFilterNoTiFlashWriteNode) to receive
the TiFlashRPCContextUnavailableReason, then assert that when ctx != nil the
reason equals TiFlashRPCContextAvailable, and when ctx == nil assert the reason
equals the specific expected unavailable enum for that scenario (use the
appropriate TiFlashRPCContextUnavailableReason value corresponding to the test
case). Ensure you reference s.cache.GetTiFlashRPCContext,
TiFlashRPCContextUnavailableReason, TiFlashRPCContextAvailable, and the
LabelFilter* symbols when making the assertions.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@internal/locate/region_cache.go`:
- Around line 92-95: The const block containing TiFlashRPCContextAvailable,
TiFlashRPCContextUnavailableError, and
TiFlashRPCContextUnavailableCachedRegionMissing is mis-aligned and failing
gofmt; fix it by running gofmt (or adjust spacing/tabs) so the identifiers,
types (TiFlashRPCContextUnavailableReason) and assignment values are properly
formatted/aligned (remove the stray tab/whitespace on the
TiFlashRPCContextUnavailableError line) and re-run gofmt to ensure CI passes.
- Around line 1145-1152: The code currently calls cachedRegion.invalidate(Other)
unconditionally which invalidates the cache even for label-filter-only failures;
change the logic so invalidate is only called for cases that imply stale region
metadata: move cachedRegion.invalidate(Other) into the branch that returns
TiFlashRPCContextUnavailableAllStoresEpochStale (and any other branch that truly
indicates stale metadata), but do not invalidate when returning
TiFlashRPCContextUnavailableAllStoresFiltered or
TiFlashRPCContextUnavailableNoAvailableStore; use the existing symbols
labelFilteredCount, epochMismatchCount,
TiFlashRPCContextUnavailableAllStoresFiltered,
TiFlashRPCContextUnavailableAllStoresEpochStale and cachedRegion.invalidate to
locate and update the code.

In `@internal/locate/region_request.go`:
- Around line 855-856: The call to regionCache.GetTiFlashRPCContext currently
drops the returned reason (rpcCtx, _, err :=
s.regionCache.GetTiFlashRPCContext(...)) which causes all nil TiFlash contexts
to be treated as generic retries; modify getRPCContext and sendReqState call
sites to accept and propagate the returned reason value from
regionCache.GetTiFlashRPCContext (preserve the second return value), thread that
reason through getRPCContext/sendReqState signatures, and update sendReqState's
logic to map non-retryable reasons (e.g., all_tiflash_stores_filtered_by_label,
no_tiflash_access_store) to dedicated handling paths instead of treating them as
EpochNotMatch/retry; ensure callers handle the propagated reason to avoid
fabricating stale-related retries.

In `@tikv/region.go`:
- Around line 66-77: The const block for TiFlash reasons is misaligned and
failing gofmt; open the const group containing TiFlashRPCContextAvailable,
TiFlashRPCContextUnavailableError,
TiFlashRPCContextUnavailableCachedRegionMissing, etc., and reformat it (run
gofmt or fix spacing/tabs so all equals signs and values align or simply use
standard single-tab alignment) so the block conforms to gofmt rules; save the
file and re-run `gofmt`/`golangci-lint` to ensure the formatting error around
TiFlashRPCContextUnavailableError is resolved.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: d6868ca4-87b6-4d9f-b62a-eb940e9de80d

📥 Commits

Reviewing files that changed from the base of the PR and between 9513b5e and 7eb7352.

📒 Files selected for processing (4)
  • internal/locate/region_cache.go
  • internal/locate/region_cache_test.go
  • internal/locate/region_request.go
  • tikv/region.go

Comment thread internal/locate/region_cache.go
Comment thread internal/locate/region_cache.go Outdated
Comment on lines +855 to +856
rpcCtx, _, err := s.regionCache.GetTiFlashRPCContext(bo, regionID, true, LabelFilterNoTiFlashWriteNode)
return rpcCtx, err
⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Propagate the TiFlash unavailability reason instead of dropping it here.

Discarding reason means every nil TiFlash context still falls into the generic rpcCtx == nil path later, which fabricates EpochNotMatch and retries. That loses the new signal for cases like all_tiflash_stores_filtered_by_label or no_tiflash_access_store, so the main request path still can't react differently to non-stale failures. Please thread the reason through getRPCContext/sendReqState and map the non-retryable cases to dedicated handling.
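
A rough sketch of the mapping this comment asks for: once the reason reaches the request path, each reason could be classified by whether reloading the region and retrying can plausibly help. The names `Reason`, `worthReloading`, and the shortened constants are hypothetical, and which reasons count as non-retryable is an assumption based on the comment above, not behavior confirmed by the PR.

```go
package main

import "fmt"

// Reason is a hypothetical stand-in for the unavailability reason.
type Reason int

const (
	Available           Reason = iota
	AllStoresFiltered          // every TiFlash store was rejected by the caller's label filter
	AllStoresEpochStale        // every candidate store had a stale epoch
	NoAvailableStore           // anything else that left no usable store
)

// worthReloading reports whether reloading the region and retrying could help.
func worthReloading(r Reason) bool {
	switch r {
	case AllStoresEpochStale:
		// Stale metadata: a reload is exactly what fixes this.
		return true
	case AllStoresFiltered:
		// The filter belongs to the caller, so a reload returns the same stores.
		return false
	default:
		// Conservative default: keep the existing retry behavior.
		return true
	}
}

func main() {
	for _, r := range []Reason{AllStoresFiltered, AllStoresEpochStale, NoAvailableStore} {
		fmt.Printf("reason=%d retry=%v\n", r, worthReloading(r))
	}
}
```

Keeping the default branch retryable means a newly added reason would fall back to today's behavior rather than silently becoming a hard failure.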

Comment thread tikv/region.go
Signed-off-by: gengliqi <gengliqiii@gmail.com>
@ti-chi-bot ti-chi-bot Bot added needs-1-more-lgtm Indicates a PR needs 1 more LGTM. approved labels May 10, 2026
Contributor

@windtalker windtalker left a comment

lgtm

@ti-chi-bot

ti-chi-bot Bot commented May 11, 2026

@windtalker: adding LGTM is restricted to approvers and reviewers in OWNERS files.

@ti-chi-bot ti-chi-bot Bot added lgtm and removed needs-1-more-lgtm Indicates a PR needs 1 more LGTM. labels May 11, 2026
@ti-chi-bot

ti-chi-bot Bot commented May 11, 2026

[LGTM Timeline notifier]

Timeline:

  • 2026-05-10 14:21:25.692308283 +0000 UTC m=+16254.225087602: ☑️ agreed by cfzjywxk.
  • 2026-05-11 01:34:20.591285243 +0000 UTC m=+56629.124064563: ☑️ agreed by zyguan.

u
Signed-off-by: gengliqi <gengliqiii@gmail.com>
@ti-chi-bot ti-chi-bot Bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels May 11, 2026

@coderabbitai coderabbitai Bot left a comment

♻️ Duplicate comments (1)
internal/locate/region_cache.go (1)

1163-1169: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don't invalidate the cache for the all-filtered case.

Line 1163 still evicts the region before the code distinguishes TiFlashRPCContextUnavailableAllStoresFiltered. That turns a caller-side label mismatch into a stale-cache signal and forces an unnecessary reload on the next request.

🛠️ Minimal fix
-	cachedRegion.invalidate(Other)
 	if labelFilteredCount == accessStoreNum {
 		return nil, TiFlashRPCContextUnavailableDetail{
 			Reason:   TiFlashRPCContextUnavailableAllStoresFiltered,
 			StoreIDs: storeIDs,
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/locate/region_cache.go` around lines 1163 - 1169, The code currently
calls cachedRegion.invalidate(Other) before checking the all-filtered condition,
which causes label-mismatch to be treated as stale cache; modify the logic in
the region handling block (where cachedRegion.invalidate is called, and
variables labelFilteredCount and accessStoreNum are used) so that you only call
cachedRegion.invalidate(Other) when the failure is truly an
unavailable/stale-cache condition and NOT when labelFilteredCount ==
accessStoreNum (i.e., detect the TiFlashRPCContextUnavailableAllStoresFiltered
case first and return the TiFlashRPCContextUnavailableDetail without evicting
the cache); adjust control flow so cachedRegion.invalidate is either moved after
the label-filter check or wrapped in a conditional that excludes the
all-filtered case.
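
The reordering suggested above can be sketched in isolation. This is a simplified model, not the code in region_cache.go: `fakeRegion`, `classify`, and the string-valued reasons are hypothetical, and the condition used to detect the epoch-stale case is an assumption. It only shows the label-filter check running before any invalidation.

```go
package main

import "fmt"

type reason string

const (
	allStoresFiltered   reason = "all_stores_filtered"
	allStoresEpochStale reason = "all_stores_epoch_stale"
	noAvailableStore    reason = "no_available_store"
)

// fakeRegion stands in for the cached region; it only records invalidation.
type fakeRegion struct{ invalidated bool }

func (r *fakeRegion) invalidate() { r.invalidated = true }

// classify picks a reason after every TiFlash access store was rejected.
// It assumes total > 0; the zero-store case is handled elsewhere.
func classify(r *fakeRegion, labelFiltered, epochMismatch, total int) reason {
	if labelFiltered == total {
		// Caller-side label mismatch: the cached region is still valid.
		return allStoresFiltered
	}
	// Remaining cases may point at stale metadata, so evict the region.
	r.invalidate()
	if epochMismatch > 0 && labelFiltered+epochMismatch == total {
		return allStoresEpochStale
	}
	return noAvailableStore
}

func main() {
	filtered := &fakeRegion{}
	fmt.Println(classify(filtered, 2, 0, 2), filtered.invalidated)

	stale := &fakeRegion{}
	fmt.Println(classify(stale, 0, 2, 2), stale.invalidated)
}
```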
🧹 Nitpick comments (1)
internal/locate/region_cache.go (1)

114-115: ⚡ Quick win

Include StoreIDs in the detail's string form.

Because TiFlashRPCContextUnavailableDetail implements fmt.Stringer, logging it via %v/%s/zap.Stringer will currently drop StoreIDs, which is the most actionable context beyond the reason.

♻️ Proposed change
 func (d TiFlashRPCContextUnavailableDetail) String() string {
-	return d.Reason.String()
+	if len(d.StoreIDs) == 0 {
+		return d.Reason.String()
+	}
+	return fmt.Sprintf("%s store_ids=%v", d.Reason, d.StoreIDs)
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/locate/region_cache.go` around lines 114 - 115, The String() method
for TiFlashRPCContextUnavailableDetail only returns d.Reason.String(), losing
the actionable d.StoreIDs; update TiFlashRPCContextUnavailableDetail.String to
include the StoreIDs (e.g., append a formatted representation of d.StoreIDs
alongside d.Reason.String()) so that logging via fmt/Stringer or zap.Stringer
surfaces both Reason and StoreIDs; reference the
TiFlashRPCContextUnavailableDetail type and its fields Reason and StoreIDs when
making the change.
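
To show what the proposed `String()` change would produce in a log line, here is a self-contained sketch. `Reason` and `Detail` are shortened stand-ins for the PR's types, and the string-backed reason and `[]uint64` store IDs are assumptions; only the body of `Detail.String` follows the proposed diff.

```go
package main

import "fmt"

// Reason is a simplified, string-backed stand-in for the PR's reason type.
type Reason string

func (r Reason) String() string { return string(r) }

// Detail is a stand-in for TiFlashRPCContextUnavailableDetail.
type Detail struct {
	Reason   Reason
	StoreIDs []uint64
}

// String mirrors the proposed change: include StoreIDs when there are any.
func (d Detail) String() string {
	if len(d.StoreIDs) == 0 {
		return d.Reason.String()
	}
	return fmt.Sprintf("%s store_ids=%v", d.Reason, d.StoreIDs)
}

func main() {
	d := Detail{Reason: "all_tiflash_stores_filtered_by_label", StoreIDs: []uint64{11, 12}}
	fmt.Println(d) // prints "all_tiflash_stores_filtered_by_label store_ids=[11 12]"
}
```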
ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 63d5b140-311f-4dbf-bf9a-7a878dc6efd1

📥 Commits

Reviewing files that changed from the base of the PR and between 7eb7352 and 7398cd7.

📒 Files selected for processing (2)
  • internal/locate/region_cache.go
  • tikv/region.go
✅ Files skipped from review due to trivial changes (1)
  • tikv/region.go

gengliqi added 2 commits May 11, 2026 09:39
Signed-off-by: gengliqi <gengliqiii@gmail.com>
u
Signed-off-by: gengliqi <gengliqiii@gmail.com>
@ti-chi-bot

ti-chi-bot Bot commented May 11, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cfzjywxk, lcwangchao, windtalker, zyguan

@ti-chi-bot ti-chi-bot Bot merged commit fb7ee09 into tikv:master May 11, 2026
13 checks passed
@gengliqi
Member Author

/cherry-pick tidb-8.5

@ti-chi-bot
Member

@gengliqi: new pull request created to branch tidb-8.5: #1960.

@gengliqi
Member Author

/cherry-pick tidb-8.5-20260113-v8.5.4

@ti-chi-bot
Member

@gengliqi: new pull request created to branch tidb-8.5-20260113-v8.5.4: #1961.
But this PR has conflicts, please resolve them!

YuhaoZhang00 added a commit to YuhaoZhang00/tidb that referenced this pull request May 15, 2026
client-go PR tikv/client-go#1959 extended GetTiFlashRPCContext to also
return a TiFlashRPCContextUnavailableDetail. Upstream pingcap/tidb
master still pins a client-go version that predates that PR and was not
updated in lockstep, so merging the latest client-go into this bundle
broke the build at the sole caller.

Discard the new reason value here to match the only other caller in
client-go itself (internal/locate/region_request.go); reason consumption
can be added later if needed.

Signed-off-by: Yuhao Zhang <yhzhang00@outlook.com>
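
For downstream callers adapting to the three-value return, a hedged sketch of the alternative to discarding the middle value: surface it when the context is nil. Everything here is hypothetical scaffolding (`rpcContext`, `detail`, `getTiFlashRPCContext`, `resolveAddr`, the address, and the reason strings); only the shape of the return values follows the PR description.

```go
package main

import "fmt"

// rpcContext and detail are simplified stand-ins for the client-go types.
type rpcContext struct{ addr string }

type detail struct {
	reason   string
	storeIDs []uint64
}

// getTiFlashRPCContext is a stub with the three-value shape described in the PR.
func getTiFlashRPCContext(regionCached bool) (*rpcContext, detail, error) {
	if !regionCached {
		return nil, detail{reason: "cached_region_missing"}, nil
	}
	// Placeholder address, for illustration only.
	return &rpcContext{addr: "tiflash-0:3930"}, detail{reason: "available"}, nil
}

// resolveAddr keeps the detail and reports it when no context is available.
func resolveAddr(regionCached bool) (string, error) {
	ctx, d, err := getTiFlashRPCContext(regionCached)
	if err != nil {
		return "", err
	}
	if ctx == nil {
		return "", fmt.Errorf("tiflash rpc context unavailable: %s (stores %v)", d.reason, d.storeIDs)
	}
	return ctx.addr, nil
}

func main() {
	fmt.Println(resolveAddr(true))
	fmt.Println(resolveAddr(false))
}
```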
Labels

approved dco-signoff: yes Indicates the PR's author has signed the dco. lgtm size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

6 participants