client/resource_group: add RC paging pre-charge with PredictedReadBytes hint by YuhaoZhang00 · Pull Request #10611 · tikv/pd

YuhaoZhang00 · 2026-04-21T10:11:21Z

What problem does this PR solve?

Issue Number: close #10612

Under RC (resource control), paging coprocessor requests are today billed only at response settlement. A misbehaving resource group can therefore burst large reads before its token bucket reacts, because no RU is reserved up front. This PR introduces a pre-charge mechanism driven by a caller-supplied read-bytes prediction so the token bucket is pressured before the KV RPC is sent, while keeping the final bill accurate at settlement.

What is changed and how does it work?

Commits inside client/resource_group/controller:

Pre-charge on request — BeforeKVRequest now reads PredictedReadBytes directly from RequestInfo (a required method on the interface; writes always return 0 so pre-charge is gated to read RPCs only). When present and > 0, KVCalculator adds ReadBytesCost * predicted to delta.RRU just like any other read cost — the existing acquireTokens path then reserves that many tokens from the limiter. When the hint is zero (writes, or reads that opt out), behavior is identical to today.
Refund on settlement — AfterKVRequest now settles the pre-charge by subtracting the predicted basis and adding ReadBytesCost * actual. The resulting signed delta can be negative (we over-estimated), so a new Limiter.RefundTokens primitive is introduced as the inverse of RemoveTokens; onResponseImpl / onResponseWaitImpl branch to refund vs. consume based on the sign. Correctness: total RU billed for a pre-charged request is still exactly ReadBytesCost * actual.
Observability — nine per-resource-group counters/histograms, grouped by sampling unit (count → bytes → RU):

Count:
- paging_precharge_total — coprocessor RPCs that triggered pre-charge (PredictedReadBytes hint > 0; self-gated to coprocessor reads because non-cop callers never set the hint)
- paging_nonprecharge_total — coprocessor RPCs that reached the controller without a PredictedReadBytes hint (EMA cold-start); gated on RequestInfo.IsCop() to keep non-cop reads (CmdGet, CmdBatchGet, CmdScan, internal lookups) out of the metric
Bytes:
- paging_precharge_bytes_total — sum of predicted hints used as the pre-charge basis
- paging_actual_bytes_total — actual bytes read by pre-charged RPCs
- paging_nonprecharge_actual_bytes_total — actual bytes read by coprocessor RPCs that reached the controller without a hint (same IsCop gating as above)
- paging_prediction_residual_bytes — distribution of signed actual − predicted for pre-charged RPCs (predictor accuracy)
RU:
- paging_precharge_ru_total — RU pre-acquired at BeforeKVRequest for pre-charged RPCs (read base + ReadBytesCost * predicted)
- paging_settlement_ru_total — total RU finally consumed by pre-charged RPCs (read base + CPUMsCost * kv_cpu_ms + ReadBytesCost * actual); equals precharge_ru + settlement_ru_delta
- paging_settlement_ru_delta — distribution of per-RPC signed settlement_ru − precharge_ru (negative = refund, positive = extra debit)

Pre-charge is strictly opt-in: a client must pass a non-zero PredictedReadBytes value on a read request; passing 0 (or any write request, which always returns 0) preserves the existing settlement-only path. The wire protocol is unchanged.

Related PRs

client-go side: resourcecontrol, tikvrpc: add PredictedReadBytes hint for RC paging pre-charge client-go#1947
TiDB caller side: store/copr: add RC paging pre-charge EMA and PredictedReadBytes hint pingcap/tidb#67941

Check List

Tests

Unit test
- TestPredictedReadBytesPreCharge / TestNoPreChargeWithoutPredictedReadBytes — Before-side pre-charge
- TestPagingPreChargeTokenRefund / TestPagingPreChargeNoRefundWhenActualExceedsEstimate / TestPagingPreChargeZeroDelta / TestOnResponseImplPagingRefund — After-side refund
- TestRefundTokens — limiter primitive

Side effects

Increased code complexity (new required RequestInfo.PredictedReadBytes() method + refund path)

Release note

None.

Summary by CodeRabbit

New Features
- Added Prometheus metrics to surface paging pre-charges, actual paging bytes, non-precharge outcomes, prediction residuals, and RU settlement; added a limiter refund operation to return excess tokens.
Bug Fixes
- Ensure excess pre-charged tokens are refunded during response settlement and correct token accounting for over/under-estimates.
Tests
- Added tests covering predicted-read pre-charges, settlement behavior, paging-token refunds, and limiter refunds.

Introduce an optional predictedReadBytesProvider interface on RequestInfo. When a caller (e.g. TiDB maintaining a per-logical-scan EMA across paging RPCs) supplies a non-zero PredictedReadBytes, BeforeKVRequest pre-charges ReadBytesCost * predictedReadBytes RRU so that concurrent workers are throttled at BeforeKVRequest rather than all hitting AfterKVRequest settlement at once. AfterKVRequest then subtracts the same basis so the net (pre-charge + settlement) equals baseCost + actualCost. Without a hint the request is not pre-charged and is billed in AfterKVRequest by actual read bytes only - this keeps the RU-billing pre-charge decoupled from the protocol-level paging byte cap. The hint is added as an optional interface (not a method on RequestInfo) so existing RequestInfo implementations compile unchanged; they simply skip pre-charge. Signed-off-by: Yuhao Zhang <yhzhang00@outlook.com>

…ment When a request carries a PredictedReadBytes hint, BeforeKVRequest consumes tokens up-front as a pre-charge. If the actual read bytes come back smaller than the estimate, the delta represents tokens that were reserved but never consumed. Previously AfterKVRequest computed a negative delta but called RemoveTokens unconditionally, which further debited the limiter instead of giving tokens back. This commit adds Limiter.RefundTokens as the inverse of RemoveTokens and wires the response-side paths (onResponseImpl, onResponseWaitImpl) to call it whenever the settlement delta is negative, so over-estimated pre-charges are released back to the group's token bucket. Signed-off-by: Yuhao Zhang <yhzhang00@outlook.com>

Add per-resource-group Prometheus metrics so operators can observe how the paging pre-charge path behaves in production and judge EMA prediction accuracy: - paging_precharge_total / paging_precharge_bytes_total: count and byte volume of RPCs that arrived with a PredictedReadBytes hint > 0 and were pre-charged at BeforeKVRequest. - paging_actual_bytes_total: actual read bytes reported by pre-charged RPCs, to compute an over/under-charge ratio against the pre-charge volume. - paging_prediction_residual_bytes: histogram of (actual - predicted) bytes for pre-charged RPCs; shows EMA prediction accuracy directly. - paging_nonprecharge_total / paging_nonprecharge_actual_bytes_total: count and byte volume of read RPCs that implemented the predicted hint interface but reported 0 (e.g. EMA cold-start) and therefore ran without pre-charge. Paired with paging_precharge_total this yields the cold/ready RPC split from the PD client side. Labels are cached per resource group in groupMetricsCollection to keep the hot path out of WithLabelValues. Only RequestInfo implementations that opt into the PredictedReadBytes interface contribute to these series; existing callers are unaffected. Signed-off-by: Yuhao Zhang <yhzhang00@outlook.com>

ti-chi-bot · 2026-04-21T10:11:26Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign husharp for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

ti-chi-bot · 2026-04-21T10:11:32Z

Hi @YuhaoZhang00. Thanks for your PR.

I'm waiting for a tikv member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

coderabbitai · 2026-04-21T10:11:40Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

📝 Walkthrough

Walkthrough

Adds optional predicted-read-bytes pre-charge for read requests, new Prometheus paging metrics, a Limiter.RefundTokens method, and updates request/response token accounting and tests to pre-charge, settle, and refund excess tokens.

Changes

Cohort / File(s)	Summary
Controller — core logic `client/resource_group/controller/group_controller.go`	Record paging pre-charge, actual, non-precharge metrics; compute per-request consumption earlier; branch token settlement to refund excess or remove deficit; adjust wait-path token acquisition and observation.
Metrics & Monitoring `client/resource_group/controller/metrics/metrics.go`	Add Prometheus collectors for paging pre-charge counts/bytes, actual bytes, prediction residual histogram, non-precharge counters/bytes, and RU precharge/settlement/delta; register new metrics.
Model & Estimation `client/resource_group/controller/model.go`	Introduce `predictedReadBytesProvider` + `estimatedReadBytes`; `KVCalculator.BeforeKVRequest` pre-charges RRU for predicted read bytes; `AfterKVRequest` subtracts the pre-charge during settlement.
Token Limiter `client/resource_group/controller/limiter.go`	Add `func (lim *Limiter) RefundTokens(now time.Time, amount float64)` to replenish tokens (respects burst/unlimited modes; does not call maybeNotify).
Tests & Test Utilities `client/resource_group/controller/group_controller_test.go`, `client/resource_group/controller/limiter_test.go`, `client/resource_group/controller/testutil.go`	Add tests validating pre-charge, settlement, and token refund semantics; add `predictedReadBytes` field and `PredictedReadBytes()` on `TestRequestInfo`; add `TestRefundTokens`.

Sequence Diagram(s)

sequenceDiagram
    participant Client as Client
    participant GC as GroupController
    participant TL as TokenLimiter
    participant M as Metrics

    Client->>GC: onRequestWaitImpl(info)
    Note over GC: if estimatedReadBytes(info) > 0
    GC->>TL: acquireTokens(precharge RU)
    TL-->>GC: success / error
    GC->>M: observe pre-charge count & bytes

    Client->>GC: onResponseImpl(req, resp)
    GC->>GC: compute consumption delta (actual - precharge)
    alt delta < 0 (excess precharge)
        GC->>TL: RefundTokens(excess)
        TL-->>GC: refunded
    else delta > 0 (deficit)
        GC->>TL: RemoveTokens(deficit)
        TL-->>GC: removed / error
    end
    GC->>M: observe actual bytes, prediction residual, settlement RU

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related issues

client/resource_group: add RC paging pre-charge controller consuming client-supplied read-bytes hint #10612 — Implements predictedReadBytes hint, pre-charge/settlement flow, RefundTokens, and matching paging metrics as described.

Suggested labels

size/L, type/development, ok-to-test

Suggested reviewers

JmPotato
rleungx
disksing
nolouch

Poem

🐰 I guessed some bytes and tucked them tight,
I pre-charged tokens in the silver night,
If fewer hopped home, I nudged them back,
If more arrived, I let the meter track,
Metrics hum softly — balance kept just right.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 20.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly and accurately summarizes the main change: adding RC paging pre-charge functionality with the PredictedReadBytes hint mechanism.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Description check	✅ Passed	PR description comprehensively addresses the problem, changes, implementation details, metrics, tests, and side effects with proper issue linkage.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

client/resource_group/controller/group_controller.go (1)
602-626: ⚠️ Potential issue | 🟡 Minor

Pre-charge counter increments even when the request is then throttled.

observePagingPrecharge runs before acquireTokens. If acquireTokens fails with ErrClientResourceGroupThrottled, mu.consumption is rolled back via sub(...), but PagingPrechargeCounter / PagingPrechargeBytesCounter are already incremented, and onResponseImpl/onResponseWaitImpl will never run for this request to balance it with a PagingActualBytesCounter entry. Dashboards comparing actual-vs-precharge bytes (the documented use of these metrics) will show a chronic deficit proportional to throttling rate.

Consider moving the observation to after a successful acquire (or skipping it on the throttled-rollback path), or alternatively add a separate PagingPrechargeThrottledCounter so the denominator stays honest.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@client/resource_group/controller/group_controller.go` around lines 602 - 626,
The pre-charge metrics are recorded by observePagingPrecharge before calling
acquireTokens, causing unbalanced precharge counts when acquireTokens fails and
consumption is rolled back (sub on gc.mu.consumption); update the logic in
group_controller.go so that observePagingPrecharge is only called after a
successful acquireTokens (i.e., move the observePagingPrecharge call to after
acquireTokens returns without error), or alternatively decrement or record a
separate PagingPrechargeThrottledCounter in the error branch that handles
errs.ErrClientResourceGroupThrottled (the branch that calls sub and returns the
error) to keep PagingPrechargeCounter/PagingPrechargeBytesCounter consistent
with actual completed requests.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@client/resource_group/controller/group_controller.go`:
- Around line 602-604: The metric observations (calls to
gc.metrics.observePagingPrecharge / observePagingPrechargeBytes /
observePagingActualBytes / observePagingPredictionResidual) should only run for
read requests to match the pre-charge behavior in
BeforeKVRequest/AfterKVRequest; update the branches that call
estimatedReadBytes(info) (notably in onResponseImpl and onResponseWaitImpl and
the other occurrences mentioned) to first check that the request is a read
(e.g., !req.IsWrite()) before observing any paging metrics, so pre-charge and
metric recording are gated symmetrically with the RU accounting in model.go.

---

Outside diff comments:
In `@client/resource_group/controller/group_controller.go`:
- Around line 602-626: The pre-charge metrics are recorded by
observePagingPrecharge before calling acquireTokens, causing unbalanced
precharge counts when acquireTokens fails and consumption is rolled back (sub on
gc.mu.consumption); update the logic in group_controller.go so that
observePagingPrecharge is only called after a successful acquireTokens (i.e.,
move the observePagingPrecharge call to after acquireTokens returns without
error), or alternatively decrement or record a separate
PagingPrechargeThrottledCounter in the error branch that handles
errs.ErrClientResourceGroupThrottled (the branch that calls sub and returns the
error) to keep PagingPrechargeCounter/PagingPrechargeBytesCounter consistent
with actual completed requests.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: f857a943-c822-4ef7-939c-79712b2a1f70

📥 Commits

Reviewing files that changed from the base of the PR and between e430bd0 and 6ebc843.

📒 Files selected for processing (7)

client/resource_group/controller/group_controller.go
client/resource_group/controller/group_controller_test.go
client/resource_group/controller/limiter.go
client/resource_group/controller/limiter_test.go
client/resource_group/controller/metrics/metrics.go
client/resource_group/controller/model.go
client/resource_group/controller/testutil.go

Temporarily replace github.com/tikv/client-go/v2 and github.com/tikv/pd/client with the corresponding forks carrying the PredictedReadBytes hint and controller-side pre-charge logic so CI can build this PR before the two upstream PRs land: - github.com/YuhaoZhang00/client-go/v2 (tikv/client-go#1947) - github.com/YuhaoZhang00/pd/client (tikv/pd#10611) Will be reverted and replaced with a normal require bump once both PRs merge and tag new releases. Signed-off-by: Yuhao Zhang <yhzhang00@outlook.com>

Signed-off-by: Yuhao Zhang <yhzhang00@outlook.com>

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (1)

client/resource_group/controller/limiter.go (1)
328-340: LGTM on the refund primitive; consider symmetry note.

Implementation mirrors RemoveTokens with the opposite sign and skips maybeNotify (refund can't make us low), which is the right call. Since tokens can temporarily exceed burst, the cap is applied implicitly on the next getTokens() call — that's fine but worth a one-line comment so future readers don't mistake it for a leak of the burst cap.
📝 Optional doc tweak
 // RefundTokens adds tokens back to the limiter.
 //
-// No burst cap is applied here
+// No burst cap is applied here; any overshoot is clamped to burst on the
+// next getTokens() call. Does not call maybeNotify since adding tokens
+// cannot put the limiter into the low-token state.
 func (lim *Limiter) RefundTokens(now time.Time, amount float64) {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@client/resource_group/controller/limiter.go` around lines 328 - 340, The
RefundTokens implementation mirrors RemoveTokens but omits maybeNotify and
allows lim.tokens to temporarily exceed lim.burst because the cap is enforced
later in getTokens; add a single-line comment inside RefundTokens (near the
lim.tokens = tokens + amount line) stating that exceeding burst is intentional
and will be capped by getTokens on the next access, and note that maybeNotify is
intentionally not called for refunds to clarify intent for future readers
(reference symbols: RefundTokens, RemoveTokens, maybeNotify, getTokens, burst).

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@client/resource_group/controller/group_controller.go`:
- Around line 689-710: The change skips calling acquireTokens and recording
successfulRequestDuration when getRUValueFromConsumption(delta) returns exactly
0; to preserve prior metrics behavior, add a v == 0 branch after the existing
if/else-if so that when v == 0 you call
gc.metrics.successfulRequestDuration.Observe(0) (and leave waitDuration
unchanged), referencing the existing symbols getRUValueFromConsumption,
acquireTokens, and gc.metrics.successfulRequestDuration.Observe to locate where
to add the single-line observation.

---

Nitpick comments:
In `@client/resource_group/controller/limiter.go`:
- Around line 328-340: The RefundTokens implementation mirrors RemoveTokens but
omits maybeNotify and allows lim.tokens to temporarily exceed lim.burst because
the cap is enforced later in getTokens; add a single-line comment inside
RefundTokens (near the lim.tokens = tokens + amount line) stating that exceeding
burst is intentional and will be capped by getTokens on the next access, and
note that maybeNotify is intentionally not called for refunds to clarify intent
for future readers (reference symbols: RefundTokens, RemoveTokens, maybeNotify,
getTokens, burst).

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 4f54c1ea-7b65-4175-ba34-2b2aaf771bc9

📥 Commits

Reviewing files that changed from the base of the PR and between 6ebc843 and 6e95720.

📒 Files selected for processing (4)

client/resource_group/controller/group_controller.go
client/resource_group/controller/limiter.go
client/resource_group/controller/metrics/metrics.go
client/resource_group/controller/model.go

🚧 Files skipped from review as they are similar to previous changes (1)

client/resource_group/controller/metrics/metrics.go

JmPotato · 2026-04-22T06:30:43Z

+		// Paging settlement
+		if bytesForEst := estimatedReadBytes(req); bytesForEst > 0 {
+			consumption.RRU -= float64(kc.ReadBytesCost) * float64(bytesForEst)
+		}


If I remember correctly, the logic here is only reached when the response is non-empty (meaning no error occurred).

If a request fails, shouldn't the previously deducted RU be refunded?

The refund doesn't happen here. The actual refund lives in onResponse{Wait}Impl, where v < 0 -> RefundTokens. So on failure, the pre-charge is correctly returned via that path.

Added TestPagingPreChargeRefundOnFailedRead unit test to verify.

JmPotato · 2026-04-22T06:33:36Z

+	if bytesForEst := estimatedReadBytes(req); bytesForEst > 0 {
+		gc.metrics.observePagingActual(bytesForEst, resp.ReadBytes())
+	} else if !req.IsWrite() {
+		if _, ok := req.(predictedReadBytesProvider); ok {


Looking at the code, all requests implement this interface, meaning they will all enter this branch under any path.

Given that, what is the purpose of this non-precharge metric?

The previous +/-4MB range clipped large first-page responses on cold copIterators (predicted=0 -> residual == actual, commonly several MB) and workload shifts where EMA sits above the current actual. Extending both ends by two factor-4 steps keeps the same near-zero resolution while making P95/P99 readable up to the TiKV paging cap. Signed-off-by: Yuhao Zhang <yhzhang00@outlook.com>

EMA is a TiDB-side implementation detail; PD's metric Help text should describe what the metric observes in terms of the hint contract. paging_nonprecharge_* also fires when hint is absent entirely (not just when the predictor produced 0), so reword to say so. Signed-off-by: Yuhao Zhang <yhzhang00@outlook.com>

codecov · 2026-04-22T07:38:41Z

Codecov Report

❌ Patch coverage is 91.66667% with 16 lines in your changes missing coverage. Please review.
✅ Project coverage is 79.13%. Comparing base (b21a183) to head (22b311d).
⚠️ Report is 15 commits behind head on master.

Additional details and impacted files

@@            Coverage Diff             @@
##           master   #10611      +/-   ##
==========================================
+ Coverage   78.96%   79.13%   +0.16%     
==========================================
  Files         532      535       +3     
  Lines       71883    73252    +1369     
==========================================
+ Hits        56766    57971    +1205     
- Misses      11093    11219     +126     
- Partials     4024     4062      +38

Flag	Coverage Δ
unittests	`79.13% <91.66%> (+0.16%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Mirror the existing bytes-dimension paging metrics in RU units: - paging_precharge_ru_total: RU pre-acquired at BeforeKVRequest for pre-charged paging read requests. - paging_settlement_ru_total: full RU finally consumed per pre-charged paging read request (base + CPU + ReadBytesCost * actual_bytes). - paging_settlement_ru_delta: histogram of signed per-RPC delta (settlement_ru - precharge_ru); negative bucket = RefundTokens flow, positive = RemoveTokens/acquireTokens flow. The histogram captures the per-RPC settlement magnitude and direction, which cannot be reconstructed from the two aggregate counters alone (sum and max(0,-v) don't commute). Signed-off-by: Yuhao Zhang <yhzhang00@outlook.com>

Order as count -> bytes -> RU, matching the conceptual grouping of sampling units. Add a one-line doc comment to each exported metric to satisfy revive lint. Signed-off-by: Yuhao Zhang <yhzhang00@outlook.com>

Move the !IsWrite() guard into estimatedReadBytes so paging pre-charge, settlement, and metric observations stay symmetric even if a future write type implements the optional predictedReadBytesProvider. Signed-off-by: Yuhao Zhang <yhzhang00@outlook.com>

Failed reads with a paging hint still go through AfterKVRequest (proven by the existing write !res.Succeed() payBackWriteCost branch), so the paging settlement subtracts ReadBytesCost*predicted and the resulting negative delta flows through RefundTokens. ReadBaseCost is intentionally not refunded, matching existing non-paging read failure behavior. Signed-off-by: Yuhao Zhang <yhzhang00@outlook.com>

Drop the predictedReadBytesProvider optional interface and the type assertion guards at the paging metric observation sites. The hint is now part of the RequestInfo contract; client-go and tidb are upgraded simultaneously so no back-compat shim is needed. Signed-off-by: Yuhao Zhang <yhzhang00@outlook.com>

…ucceeds Observing the precharge before acquireTokens lets throttled and ctx-cancelled requests inflate PagingPrechargeCounter / PagingPrechargeBytesCounter / PagingPrechargeRU with no matching settlement sample on the response side (since OnResponse is never called for those requests). Delay the observation until after acquireTokens returns nil, matching the unconditional observation of paging_actual metrics in onResponse{,Wait}Impl. Burstable mode still observes, preserving symmetry with the burstable-agnostic settlement side. Signed-off-by: Yuhao Zhang <yhzhang00@outlook.com>

…echarge

rleungx · 2026-05-18T04:53:04Z

+	if bytesForEst := estimatedReadBytes(req); bytesForEst > 0 {
+		gc.metrics.observePagingActual(bytesForEst, resp.ReadBytes(),
+			getRUValueFromConsumption(count), getRUValueFromConsumption(delta))
+	} else if !req.IsWrite() {


After PredictedReadBytes became part of every RequestInfo, this branch counts every read with a zero prediction, not only paging reads. Can we add an explicit paging/hint-present signal before recording paging_nonprecharge_*?

rleungx · 2026-05-18T04:53:13Z

+	}
+	_, tokens := lim.getTokens(now)
+	lim.last = now
+	lim.tokens = tokens + amount


Refunding tokens updates the bucket, but existing waiters still sleep on their original timer in WaitReservations. Can RefundTokens notify or wake waiters so over-estimated precharges release blocked requests promptly?

rleungx · 2026-05-18T04:53:20Z

 		runningKVRequestCounter:           metrics.GroupRunningKVRequestCounter.WithLabelValues(name),
 		consumeTokenHistogram:             metrics.TokenConsumedHistogram.WithLabelValues(name),
+
+		prechargeCounter:        metrics.PagingPrechargeCounter.WithLabelValues(name),


These new per-resource-group metric series also need cleanup on group deletion/tombstone cleanup. Otherwise deleted groups can leave stale Paging* series behind.

After PredictedReadBytes was promoted to a required RequestInfo method, the paging_nonprecharge_* branch fired for every non-write RPC reaching the resource-control interceptor: point gets, batch gets, scans, and internal lookups all looked identical to a paging coprocessor request whose hint happened to be zero. The "paging cold-start" counter inflated by roughly the full read-RPC volume of the cluster, breaking the precharge-vs-nonprecharge comparison panel in the TiDB dashboard. Expose IsCop() on the RequestInfo interface so the nonprecharge branch can scope itself to coprocessor requests. The hint-present branch is unchanged because hint>0 already implies a coprocessor caller; only the zero-hint branch needs the explicit cmd-type signal, which client-go's implementation will derive from tikvrpc.Request.Type. Signed-off-by: Yuhao Zhang <yhzhang00@outlook.com>

Sync the metric variable docs, Prometheus Help strings, and the observePagingNonprecharge function doc with the new IsCop()-gated scope: every paging_* metric describes coprocessor RPCs, and the nonprecharge branch explicitly calls out the IsCop() precondition and the EMA cold-start meaning. No behavior change — comments only. Signed-off-by: Yuhao Zhang <yhzhang00@outlook.com>

RefundTokens previously updated the bucket and returned without notifying anyone. acquireTokens retry loops select on reconfiguredCh to wake before the next retry-interval tick when fresh tokens arrive (this is how Reconfigure unblocks them); refunded tokens however stayed invisible until the timer fired, costing one retry interval of unnecessary throttling per over-estimated paging settlement. Mirror the Reconfigure close+recreate pattern at the end of RefundTokens so the new tokens become immediately visible to retry waiters. Behavior for in-flight WaitReservations sleepers is deliberately unchanged — reservations carry their pre-computed timeToAct and re-evaluating them on refund is a separate design question. Also expand the RefundTokens doc to document the intentional above-burst overshoot (clamped by getTokens on the next access) and the deliberate omission of maybeNotify. Adds TestRefundTokensWakesAcquireRetry covering both the wake path and the burstable-limiter no-op case. Signed-off-by: Yuhao Zhang <yhzhang00@outlook.com>

cleanUpResourceGroup previously only removed ResourceGroupStatusGauge when a group was tombstoned or went inactive. The nine new paging_* per-resource-group label series introduced earlier in this PR were left behind and lingered in Prometheus until process restart. Add deletePagingLabels on groupMetricsCollection mirroring the initMetrics ordering and call it from cleanUpResourceGroup alongside the existing ResourceGroupStatusGauge deletion. The helper sits next to initMetrics so adding a new paging metric there forces a paired deletion here. Older per-group series (SuccessfulRequestDuration, FailedRequestCounter, TokenConsumedHistogram, ...) have the same leak but predate this PR; they will be addressed in a separate change. Adds TestDeletePagingLabelsResetsSeries covering every paging counter to catch a forgotten DeleteLabelValues in the helper. Signed-off-by: Yuhao Zhang <yhzhang00@outlook.com>

…C failure OnResponseWait is only called on the client-go side when the underlying RPC produced a response. A transport-level failure (connection drop, timeout, context cancellation) leaves the predicted RU pre-charged in BeforeKVRequest permanently debited from the token bucket and from the per-group consumption, because the existing settlement path never runs. Add OnRequestCancel(ctx, name, info) to ResourceGroupKVInterceptor and implement it on ResourceGroupsController. The implementation recomputes the BeforeKVRequest delta with the same RequestInfo, subtracts it from gc.mu.consumption, and refunds the corresponding tokens to the limiter. The per-store snapshot maintained by onRequestWaitImpl is intentionally left in place — it is bookkeeping for penalty distribution and the existing OnRequestWait path already accepts similar imprecision on the rare failure case. Adds TestOnRequestCancelRefundsPreCharge covering the round-trip invariant (tokens and consumption both restored to pre-OnRequestWait values after cancel). Signed-off-by: Yuhao Zhang <yhzhang00@outlook.com>

interceptedClient.SendRequest and SendRequestAsync today call OnResponseWait only when the underlying RPC returns a non-nil response. That made sense in the legacy billing model (only base RRU / WRU at risk on transport failure) but became a real RU leak after the PD-side paging pre-charge: the predicted RU debited in OnRequestWait stayed permanently charged whenever the RPC failed before producing a response (connection drop, timeout, context cancellation). When resp == nil, invoke the new ResourceGroupKVInterceptor.OnRequestCancel so PD refunds the predicted RU to the limiter and rolls back the consumption row it added in OnRequestWait. Cancel and settlement are mutually exclusive on a given request — the if/else structure makes that invariant explicit at every call site. Pin the PD client replace to the matching tikv/pd#10611 commit; the replace will be removed once that PR is merged and tagged. Adds TestSendRequest{Cancels,DoesNotCancel}{,Async}* covering the three combinations (sync failure / sync success / async failure) through a recordingInterceptor mock so the wiring is verified end-to-end without touching a real PD. Signed-off-by: Yuhao Zhang <yhzhang00@outlook.com>

ti-chi-bot · 2026-05-18T13:18:42Z

@YuhaoZhang00: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
pull-unit-test-next-gen-2	`22b311d`	link	true	`/test pull-unit-test-next-gen-2`
pull-unit-test-next-gen-3	`22b311d`	link	true	`/test pull-unit-test-next-gen-3`
pull-unit-test-next-gen-1	`22b311d`	link	true	`/test pull-unit-test-next-gen-1`

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

rleungx · 2026-05-19T03:28:28Z

+		calc.BeforeKVRequest(count, req)
+	}
+	if bytesForEst := estimatedReadBytes(req); bytesForEst > 0 {
+		gc.metrics.observePagingActual(bytesForEst, resp.ReadBytes(),


Can we move this paging settlement metric observation until after the positive-delta acquireTokens path succeeds? In onResponseWaitImpl, acquireTokens below can still return an error and the function returns before updating gc.mu.consumption, storeCounter, or globalCounter. In that case paging_actual_bytes_total / paging_settlement_ru_total would record a request that was not actually settled, which makes the “finally consumed” metrics drift from controller accounting. The OnResponse path is fine because it cannot fail after observation; this risk is specific to OnResponseWait.

YuhaoZhang00 added 3 commits April 21, 2026 16:23

ti-chi-bot Bot added do-not-merge/needs-linked-issue release-note-none Denotes a PR that doesn't merit a release note. dco-signoff: yes Indicates the PR's author has signed the dco. labels Apr 21, 2026

ti-chi-bot Bot added contribution This PR is from a community contributor. needs-ok-to-test Indicates a PR created by contributors and need ORG member send '/ok-to-test' to start testing. labels Apr 21, 2026

ti-chi-bot Bot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Apr 21, 2026

coderabbitai Bot reviewed Apr 21, 2026

View reviewed changes

Comment thread client/resource_group/controller/group_controller.go Outdated

ti-chi-bot Bot added do-not-merge/needs-triage-completed and removed do-not-merge/needs-linked-issue labels Apr 22, 2026

This was referenced Apr 22, 2026

store/copr: add RC paging pre-charge EMA and PredictedReadBytes hint pingcap/tidb#67941

Open

resourcecontrol, tikvrpc: add PredictedReadBytes hint for RC paging pre-charge tikv/client-go#1947

Open

client/resource_group: trim narrative comments

6e95720

Signed-off-by: Yuhao Zhang <yhzhang00@outlook.com>

ti-chi-bot Bot removed the do-not-merge/needs-triage-completed label Apr 22, 2026

coderabbitai Bot reviewed Apr 22, 2026

View reviewed changes

Comment thread client/resource_group/controller/group_controller.go

JmPotato reviewed Apr 22, 2026

View reviewed changes

YuhaoZhang00 added 2 commits April 22, 2026 14:46

ti-chi-bot Bot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Apr 22, 2026

YuhaoZhang00 force-pushed the feat/rc-paging-precharge branch from a8777a9 to 1764549 Compare April 22, 2026 07:26

ti-chi-bot Bot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Apr 22, 2026

ti-chi-bot Bot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Apr 22, 2026

YuhaoZhang00 force-pushed the feat/rc-paging-precharge branch from 209e7ce to 6efae1c Compare April 22, 2026 07:45

metrics: group paging metrics by sampling unit and add doc comments

ded66bf

Order as count -> bytes -> RU, matching the conceptual grouping of sampling units. Add a one-line doc comment to each exported metric to satisfy revive lint. Signed-off-by: Yuhao Zhang <yhzhang00@outlook.com>

YuhaoZhang00 force-pushed the feat/rc-paging-precharge branch from 6efae1c to ded66bf Compare April 22, 2026 07:46

YuhaoZhang00 added 3 commits April 22, 2026 17:58

YuhaoZhang00 requested a review from JmPotato April 22, 2026 10:24

JmPotato added ok-to-test Indicates a PR is ready to be tested. and removed needs-ok-to-test Indicates a PR created by contributors and need ORG member send '/ok-to-test' to start testing. labels Apr 28, 2026

Merge remote-tracking branch 'upstream/master' into feat/rc-paging-pr…

cf22352

…echarge

rleungx reviewed May 18, 2026

View reviewed changes

This was referenced May 18, 2026

client/resource_group: gate paging metrics by IsCop signal YuhaoZhang00/pd#1

Merged

resourcecontrol: expose IsCop on RequestInfo YuhaoZhang00/client-go#1

Merged

go.mod: bump pd and client-go for IsCop gating YuhaoZhang00/tidb#1

Merged

YuhaoZhang00 added 4 commits May 18, 2026 19:32

YuhaoZhang00 requested a review from rleungx May 18, 2026 13:19

rleungx reviewed May 19, 2026

View reviewed changes

Conversation

YuhaoZhang00 commented Apr 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What problem does this PR solve?

What is changed and how does it work?

Related PRs

Check List

Release note

Summary by CodeRabbit

Uh oh!

ti-chi-bot Bot commented Apr 21, 2026

Uh oh!

ti-chi-bot Bot commented Apr 21, 2026

Uh oh!

coderabbitai Bot commented Apr 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related issues

Suggested labels

Suggested reviewers

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

JmPotato Apr 22, 2026

Choose a reason for hiding this comment

Uh oh!

YuhaoZhang00 Apr 22, 2026

Choose a reason for hiding this comment

Uh oh!

JmPotato Apr 22, 2026

Choose a reason for hiding this comment

Uh oh!

codecov Bot commented Apr 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

rleungx May 18, 2026

Choose a reason for hiding this comment

Uh oh!

rleungx May 18, 2026

Choose a reason for hiding this comment

Uh oh!

rleungx May 18, 2026

Choose a reason for hiding this comment

Uh oh!

ti-chi-bot Bot commented May 18, 2026

Uh oh!

rleungx May 19, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

YuhaoZhang00 commented Apr 21, 2026 •

edited

Loading

coderabbitai Bot commented Apr 21, 2026 •

edited

Loading

codecov Bot commented Apr 22, 2026 •

edited

Loading