importsdk, importer, importinto: add import size estimate by GMHDBJD · Pull Request #67241 · pingcap/tidb

GMHDBJD · 2026-03-23T14:55:37Z

What problem does this PR solve?

Issue Number: close #67240

Problem Summary:

Premium needs the final import size estimate before an import starts, so it can expand disk in advance.

What changed and how does it work?

- Add `EstimateImportDataSize` to the import SDK.
- Reuse the nextgen KV-encoding sampling to estimate the final single-replica KV size.
- Pass CSV and charset options through import-into so that the estimate uses consistent parsing settings.
- Decouple file-based size sampling from importer plan construction so that the import SDK can reuse the sampling logic directly.
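
A minimal sketch of how a sampled KV size could be scaled up to a whole table, assuming the estimate is the table's source size multiplied by the sampled KV-to-source ratio. The type and function names (`sampledSize`, `estimateTableKVSize`) are hypothetical and only loosely modeled on the `SourceSize`, `DataKVSize` and `IndexKVSize` fields mentioned in this PR; the fallback to the source size mirrors the check quoted in the review comments and is not a statement about the merged code.

```go
package main

// sampledSize is a hypothetical stand-in for the sampler's result.
type sampledSize struct {
	SourceSize  int64 // bytes of source data read while sampling
	DataKVSize  int64 // encoded row KV bytes produced from the sample
	IndexKVSize int64 // encoded index KV bytes produced from the sample
}

// totalKVSize returns the row plus index KV bytes of the sample.
func (s sampledSize) totalKVSize() int64 { return s.DataKVSize + s.IndexKVSize }

// estimateTableKVSize scales the sampled KV/source ratio to the full source
// size of a table. When the sample is empty or unusable it falls back to the
// source size itself (assumption: same fallback as quoted in the review).
func estimateTableKVSize(totalSourceSize int64, s sampledSize) int64 {
	if s.SourceSize <= 0 || s.totalKVSize() <= 0 {
		return totalSourceSize
	}
	ratio := float64(s.totalKVSize()) / float64(s.SourceSize)
	return int64(float64(totalSourceSize) * ratio)
}
```

For instance, a sample that read 100 source bytes and produced 200 KV bytes gives a ratio of 2, so a 1000-byte table would be estimated at 2000 bytes for a single replica.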

Check List

Tests

Unit test
Integration test
Manual test (add detailed scripts or steps below)
No need to test
- I checked and no code files have been changed.

Side effects

Performance regression: Consumes more CPU
Performance regression: Consumes more Memory
Breaking backward compatibility

Documentation

Release note

Please refer to Release Notes Language Style Guide to write a quality release note.

None

Summary by CodeRabbit

New Features
- CSV configuration and data character-set options for import.
- New API to estimate import data size with per-table source and TiKV size estimates.
Improvements
- More accurate KV-size sampling and index/data-ratio estimation.
- Improved parser/resource lifecycle handling and clearer encoding error reporting.
Tests
- Added tests validating size estimates and parser error/close behavior.
Chores
- Updated mocks and build configuration to support new APIs.

pantheon-ai · 2026-03-23T14:55:44Z

Review failed due to infrastructure/execution failure after retries. Please re-trigger review.

tiprow · 2026-03-23T14:55:57Z

Hi @GMHDBJD. Thanks for your PR.

PRs from untrusted users cannot be marked as trusted with /ok-to-test in this repo, meaning untrusted PR authors can never trigger tests themselves. Collaborators can still trigger tests on the PR using /test all.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

coderabbitai · 2026-03-23T14:56:13Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Walkthrough

Adds import-size estimation: extends the import SDK and FileScanner with an API to estimate per-table and total import TiKV sizes, threads CSV/charset config into the SDK, and refactors executor sampling to compute source, data-KV, and index-KV sizes per file.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| SDK config & models: `pkg/importsdk/config.go`, `pkg/importsdk/model.go`, `pkg/importsdk/BUILD.bazel` | Added CSV and data-character-set fields/options to `SDKConfig` (`WithCSVConfig`, `WithDataCharacterSet`); added `TableDataSizeEstimate` and `ImportDataSizeEstimate` models; expanded importsdk Bazel deps. |
| FileScanner API & tests: `pkg/importsdk/file_scanner.go`, `pkg/importsdk/file_scanner_test.go` | Added `EstimateImportDataSize` to `FileScanner`; implemented per-table estimation with sampling, schema→TableInfo conversion, sampling-config builders, and tests for SQL/CSV/invalid files. |
| Executor KV-size sampler & tests: `pkg/executor/importer/sampler.go`, `pkg/executor/importer/sampler_test.go` | Reworked sampling to file-based pipeline: introduced `SampledKVSizeResult`, `KVSizeSampleConfig`, `SampleFileImportKVSize`; accumulate `SourceSize`/`DataKVSize`/`IndexKVSize`; removed sentinel stop flow; updated tests to validate ratios and parser-close-on-error. |
| Importer wiring: `lightning/pkg/importinto/importer.go` | Passes `cfg.Mydumper.CSV` and `cfg.Mydumper.DataCharacterSet` into SDK via `importsdk.WithCSVConfig(...)` and `importsdk.WithDataCharacterSet(...)`. |
| Import controller & parser/CSV refactors: `pkg/executor/importer/import.go`, `pkg/executor/importer/table_import.go`, `pkg/executor/load_data.go` | Extracted parser/CSV config and column-mapping helpers to package-level functions; added `newLoadDataParser`; introduced `HandleSkipNRows`; improved parser close-on-error handling and adjusted call sites to pass `IgnoreLines`. |
| KV encoder interface tweak: `pkg/executor/importer/kv_encode.go` | Introduced `simpleColAssignExprCreator` interface and updated encoder creation to accept an expr creator instead of controller receiver. |
| Mocks / gomock regen: `pkg/importsdk/mock/sdk_mock.go` | Regenerated mocks: replaced `ISGOMOCK()` methods with `isgomock` fields; standardized ctx/param names; added `EstimateImportDataSize` mock methods and recorders. |
| Build/test deps: `pkg/executor/importer/BUILD.bazel` | Added test deps (`//pkg/objstore/objectio`, `//pkg/objstore/storeapi`) required by new tests. |

Sequence Diagram

sequenceDiagram
    participant Client as SDK Consumer
    participant FS as FileScanner
    participant Loader as Loader/Metadata
    participant Sampler as SampleFileImportKVSize
    participant Result as ImportDataSizeEstimate

    Client->>FS: EstimateImportDataSize(ctx)
    FS->>Loader: Discover databases & tables
    loop per table
        FS->>FS: estimateOneTableSize()
        FS->>FS: build KVSizeSampleConfig & TableInfo
        FS->>Sampler: SampleFileImportKVSize(ctx, cfg, tbl, dataStore, files)
        Sampler->>Sampler: parse files, encode rows, accumulate SourceSize/DataKV/IndexKV
        Sampler-->>FS: SampledKVSizeResult
        FS->>FS: compute table TiKV estimate and append
    end
    FS-->>Client: ImportDataSizeEstimate (tables + totals)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested reviewers

joechenrh
Leavrth
mjonss

Poem

🐰 I nibble rows and count each bite,
Source, data, index — measured light.
I hop through files with tiny paws,
Tallying totals without a pause.
Disk can grow — the rabbit applauds.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 43.33% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly and concisely summarizes the main change: adding import size estimation across three packages (importsdk, importer, importinto). |
| Description check | ✅ Passed | The description follows the template structure with issue reference, problem summary, detailed explanation of changes, checklist completion, and release note. All required sections are addressed. |
| Linked Issues check | ✅ Passed | The PR fully addresses issue `#67240` by adding EstimateImportDataSize API to import SDK, reusing KV encoding sampling, passing CSV/charset options, and supporting aggregated size estimation across tables. |
| Out of Scope Changes check | ✅ Passed | All changes are directly related to implementing import size estimation. Mock updates and refactoring are necessary supporting changes to enable the new API and maintain test coverage. |


Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.11.3)

Command failed

hawkingrei · 2026-03-23T14:56:21Z

/ok-to-test

codecov · 2026-03-23T15:20:16Z

Codecov Report

❌ Patch coverage is 70.81851% with 164 lines in your changes missing coverage. Please review.
✅ Project coverage is 78.6386%. Comparing base (e34d41c) to head (91838e9).
⚠️ Report is 30 commits behind head on master.

Additional details and impacted files

@@               Coverage Diff                @@
##             master     #67241        +/-   ##
================================================
+ Coverage   77.7732%   78.6386%   +0.8653%     
================================================
  Files          2016       1951        -65     
  Lines        552852     551762      -1090     
================================================
+ Hits         429971     433898      +3927     
+ Misses       121139     117410      -3729     
+ Partials       1742        454      -1288

| Flag | Coverage Δ |
| --- | --- |
| integration | `43.8351% <0.6802%> (-4.2950%)` ⬇️ |
| unit | `76.9476% <70.4626%> (+0.6439%)` ⬆️ |

Flags with carried forward coverage won't be shown. Click here to find out more.

| Components | Coverage Δ |
| --- | --- |
| dumpling | `61.5065% <ø> (ø)` |
| parser | `∅ <ø> (∅)` |
| br | `48.9151% <ø> (-11.9852%)` ⬇️ |

coderabbitai

🧹 Nitpick comments (3)

pkg/importsdk/file_scanner.go (2)

341-344: Fallback to source size when sampling yields non-positive results.

The fallback at lines 341-344 returns tblMeta.TotalSize as the TiKV estimate when sampling produces non-positive values. This is a reasonable safeguard, but it may overestimate (source size ≠ KV size) or underestimate depending on the data format. Consider logging a warning when this fallback is triggered to aid debugging.

📝 Suggested logging improvement

 	if sampledSize.SourceSize <= 0 || sampledSize.TotalKVSize() <= 0 {
+		s.logger.Warn("sampling returned non-positive size, using source size as estimate",
+			zap.String("database", dbMeta.Name),
+			zap.String("table", tblMeta.Name),
+			zap.Int64("sourceSize", sampledSize.SourceSize),
+			zap.Int64("kvSize", sampledSize.TotalKVSize()))
 		return tblMeta.TotalSize, nil
 	}

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@pkg/importsdk/file_scanner.go` around lines 341 - 344, The current fallback
returns tblMeta.TotalSize when sampledSize.SourceSize <= 0 or
sampledSize.TotalKVSize() <= 0 without any visibility; update the function
containing this return in file_scanner.go to log a warning when the fallback
path is taken (include sampledSize.SourceSize, sampledSize.TotalKVSize(), and
tblMeta.TotalSize in the message) so callers can debug sampling issues; use the
package logger already used nearby (or the function's contextual logger) and
keep the log level as warning; then return the same tblMeta.TotalSize as before.

47-47: Consider adding a doc comment to the interface method.

The new EstimateImportDataSize method on the FileScanner interface lacks documentation. A brief comment explaining its purpose would help consumers of this API.

📝 Suggested documentation

 	GetTotalSize(ctx context.Context) int64
+	// EstimateImportDataSize samples source data to estimate the final TiKV size
+	// after encoding. Returns per-table and aggregated size estimates.
 	EstimateImportDataSize(ctx context.Context) (*ImportDataSizeEstimate, error)
 	Close() error

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@pkg/importsdk/file_scanner.go` at line 47, Add a short doc comment above the
FileScanner interface method EstimateImportDataSize describing its purpose and
behavior: explain that EstimateImportDataSize(ctx context.Context) returns an
ImportDataSizeEstimate and error, what the estimate represents (e.g., expected
total bytes/objects to be imported), any important semantics (context
cancellation support, when it may return an error or a partial estimate), and
whether the call is expected to be fast or may perform a scan; update the
comment near the FileScanner interface so consumers understand inputs, outputs,
and error conditions.

pkg/executor/importer/sampler.go (1)

117-128: Returning partial results alongside an error may mask sampling failures.

The function accumulates results across files and returns both the accumulated result and firstErr. If sampling fails for one file, callers receive partial data without clear indication of which tables succeeded. Consider whether this "best-effort" behavior is intentional or if you should fail fast on the first error.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/executor/importer/sampler.go` around lines 117 - 128, The current loop in
sampler.go accumulates SampledKVSizeResult across files while returning the
first error (firstErr), which yields partial results when any call to
sampleKVSizeForOneFile fails; decide and implement one behavior: either fail
fast by checking err after each sampleKVSizeForOneFile call and immediately
return nil, err (remove accumulating when err!=nil), or explicitly switch to an
explicit best-effort mode by collecting per-file results and errors (e.g.,
map[file]error or a multi-error) and return both the full per-file result set
and an aggregated error; update the code around SampledKVSizeResult, firstErr,
and the loop over files to follow the chosen approach and ensure callers can
distinguish success vs partial success.
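
The two behaviors described in this comment can be sketched as follows. This is an illustration under stated assumptions, not the code in `sampler.go`: `kvSizes`, `sampleFn`, `sampleFailFast`, `sampleBestEffort` and `errUnreadable` are hypothetical names, and `errors.Join` requires Go 1.20 or later.

```go
package main

import "errors"

// kvSizes is a hypothetical accumulator for per-file sampling results.
type kvSizes struct{ Source, DataKV, IndexKV int64 }

// add accumulates another result into k.
func (k *kvSizes) add(o kvSizes) {
	k.Source += o.Source
	k.DataKV += o.DataKV
	k.IndexKV += o.IndexKV
}

// sampleFn stands in for sampling one file (hypothetical signature).
type sampleFn func(file string) (kvSizes, error)

// errUnreadable is a hypothetical error a sampler might return for a bad file.
var errUnreadable = errors.New("unreadable file")

// sampleFailFast stops at the first error and returns no partial totals.
func sampleFailFast(files []string, sample sampleFn) (kvSizes, error) {
	var total kvSizes
	for _, f := range files {
		r, err := sample(f)
		if err != nil {
			return kvSizes{}, err
		}
		total.add(r)
	}
	return total, nil
}

// sampleBestEffort skips failing files, keeps the totals of the files that
// succeeded, and reports every failure as one joined error (nil if none).
func sampleBestEffort(files []string, sample sampleFn) (kvSizes, error) {
	var (
		total kvSizes
		errs  []error
	)
	for _, f := range files {
		r, err := sample(f)
		if err != nil {
			errs = append(errs, err)
			continue
		}
		total.add(r)
	}
	return total, errors.Join(errs...)
}
```

With the best-effort variant a caller that receives a non-nil error still has to decide whether the partial totals are usable, which is the ambiguity the comment raises; the fail-fast variant avoids that at the cost of discarding good samples.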

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: fa78505a-ef74-42a0-a3ff-91e57d5ccb07

📥 Commits

Reviewing files that changed from the base of the PR and between 27f439f and 5e961a0.

📒 Files selected for processing (8)

lightning/pkg/importinto/importer.go
pkg/executor/importer/sampler.go
pkg/importsdk/BUILD.bazel
pkg/importsdk/config.go
pkg/importsdk/file_scanner.go
pkg/importsdk/file_scanner_test.go
pkg/importsdk/mock/sdk_mock.go
pkg/importsdk/model.go

coderabbitai

🧹 Nitpick comments (2)

pkg/executor/importer/import.go (1)
1104-1122: Edge case: nil vs empty slice semantics.

The function treats columnNames == nil (line 1108) specially by returning columns as-is, but an empty slice []string{} would fail the length check on lines 1105-1107 if cols is non-empty. This appears intentional based on caller usage patterns, but consider adding a brief comment explaining the nil vs empty slice distinction.
📝 Suggested documentation
 func reorderColumnsByNames(cols []*table.Column, columnNames []string) ([]*table.Column, error) {
 	if len(cols) != len(columnNames) {
 		return nil, exeerrors.ErrColumnsNotMatched
 	}
+	// nil columnNames means no reordering needed (preserve original order)
 	if columnNames == nil {
 		return cols, nil
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/executor/importer/import.go` around lines 1104 - 1122, The function
reorderColumnsByNames currently treats columnNames == nil as "no reordering" but
treats an empty slice as a length-mismatch error (ErrColumnsNotMatched); add a
concise comment above reorderColumnsByNames explaining this nil-vs-empty-slice
semantic (mention cols, columnNames and ErrColumnsNotMatched) so future readers
understand that nil means "leave cols as-is" while an empty slice is validated
against cols length and triggers the error.
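
A small sketch of the Go semantics behind this comment, assuming the check order shown in the quoted diff (length check first, nil check second). The `column` type, `errColumnsNotMatched` and `reorderSketch` are hypothetical stand-ins, not the importer's definitions. Because `len` of a nil slice is 0, in this order a nil `names` only reaches the nil branch when `cols` is empty as well; a nil `names` with non-empty `cols` fails the length check just like an empty slice does.

```go
package main

import "errors"

// column is a hypothetical stand-in for *table.Column.
type column struct{ Name string }

// errColumnsNotMatched stands in for exeerrors.ErrColumnsNotMatched.
var errColumnsNotMatched = errors.New("columns not matched")

// reorderSketch follows the check order of the quoted snippet.
func reorderSketch(cols []column, names []string) ([]column, error) {
	// len(nil) == 0, so nil and empty names behave the same here.
	if len(cols) != len(names) {
		return nil, errColumnsNotMatched
	}
	// Only reachable with names == nil when cols is empty too.
	if names == nil {
		return cols, nil
	}
	byName := make(map[string]column, len(cols))
	for _, c := range cols {
		byName[c.Name] = c
	}
	out := make([]column, 0, len(names))
	for _, n := range names {
		c, ok := byName[n]
		if !ok {
			return nil, errColumnsNotMatched
		}
		out = append(out, c)
	}
	return out, nil
}
```

If "nil means keep the original order" is the intended contract for non-empty `cols`, the nil check would have to come before the length check; which order the merged function uses is not confirmed by this thread.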
pkg/importsdk/file_scanner.go (1)
296-339: Consider adding a log for fallback estimation.

When sampledSize.SourceSize <= 0 || sampledSize.TotalKVSize() <= 0 (line 335), the method falls back to using tblMeta.TotalSize directly as the TiKV size estimate. This fallback may produce inaccurate estimates (source size ≠ TiKV size). Consider logging a warning to help users understand when estimates may be less accurate.
💡 Suggested improvement
 	if sampledSize.SourceSize <= 0 || sampledSize.TotalKVSize() <= 0 {
+		s.logger.Warn("sampling returned non-positive sizes, falling back to source size as TiKV estimate",
+			zap.String("table", tblMeta.Name),
+			zap.Int64("sourceSize", sampledSize.SourceSize),
+			zap.Int64("totalKVSize", sampledSize.TotalKVSize()))
 		return tblMeta.TotalSize, nil
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/importsdk/file_scanner.go` around lines 296 - 339, In
estimateOneTableSize, when the sampledSize fallback is used (condition
sampledSize.SourceSize <= 0 || sampledSize.TotalKVSize() <= 0), add a warning
log to indicate the fallback and include identifying details (tblMeta.DB,
tblMeta.Name), the tblMeta.TotalSize used, and the sampledSize values so users
know the estimate may be inaccurate; locate this in function
estimateOneTableSize and use the existing logger (s.logger / s.logger.Logger) to
emit a clear Warn/Warnf message immediately before returning tblMeta.TotalSize.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: fe5118e1-feeb-4805-b6cd-13cf6f800bda

📥 Commits

Reviewing files that changed from the base of the PR and between 5e961a0 and bb1b230.

📒 Files selected for processing (5)

pkg/executor/importer/import.go
pkg/executor/importer/kv_encode.go
pkg/executor/importer/sampler.go
pkg/executor/importer/sampler_test.go
pkg/importsdk/file_scanner.go

ingress-bot · 2026-03-23T16:57:24Z

🔍 Starting code review for this PR...

ingress-bot · 2026-03-23T17:27:24Z

🔍 New commits detected — starting re-review...

ingress-bot · 2026-03-23T17:32:24Z

🔍 New commits detected — starting re-review...

GMHDBJD · 2026-03-24T04:17:05Z

/retest

OliverS929 · 2026-03-24T10:57:37Z

+		return 0, errors.Trace(err)
+	}
+	if sampledSize.SourceSize <= 0 || sampledSize.TotalKVSize() <= 0 {
+		return tblMeta.TotalSize, nil


Should this return 0 when sampling succeeds but there are no data rows?

OliverS929 · 2026-03-24T10:59:19Z

 		err = scanner.CreateSchemaAndTableByName(ctx, "db1", "nonexistent")
 		require.Error(t, err)
 	})
+


Should we add some CSV-based tests here as well?

ingress-bot

This review was generated by AI and should be verified by a human reviewer.
Manual follow-up is recommended before merge.

Summary

Total findings: 6
Inline comments: 6
Summary-only findings (no inline anchor): 0

Findings (highest risk first)

⚠️ [Major] (1)

Production estimation path depends on test-only utilities (pkg/importsdk/file_scanner.go:397, pkg/util/mock/fortest.go:22)

🟡 [Minor] (5)

Dead method reorderColumns left after refactoring (pkg/executor/importer/import.go:1172)
sampleIndexSizeRatio discards valid partial results on sampling error (pkg/executor/importer/sampler.go:130)
Bare boolean literals at generateCSVConfig call site obscure parameter meaning (pkg/executor/importer/sampler.go:217)
Parser (and underlying reader) leaked when handleSkipNRows or SetPos fails after parser creation (pkg/executor/importer/sampler.go:250)
FileScanner and SDK interface expansion breaks external implementors (pkg/importsdk/file_scanner.go:46, pkg/importsdk/sdk.go:23)
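
The parser-leak finding above describes a common Go pattern: once a resource has been created, every later failure path must close it. A minimal sketch, assuming a pared-down `parser` interface; `newPositionedParser`, `memParser` and `errBadOffset` are hypothetical and do not reflect the importer's real parser API. `errors.Join` requires Go 1.20 or later.

```go
package main

import "errors"

// parser is a hypothetical, pared-down stand-in for the importer's parser.
type parser interface {
	SetPos(offset int64) error
	Close() error
}

// errBadOffset is a hypothetical seek failure.
var errBadOffset = errors.New("offset beyond end of file")

// newPositionedParser opens a parser and seeks it to offset. If the seek
// fails, the parser is closed before returning so that it does not leak.
func newPositionedParser(open func() (parser, error), offset int64) (_ parser, err error) {
	p, err := open()
	if err != nil {
		return nil, err
	}
	defer func() {
		if err != nil {
			// Keep the original error and attach any close error.
			err = errors.Join(err, p.Close())
		}
	}()
	if err = p.SetPos(offset); err != nil {
		return nil, err
	}
	return p, nil
}

// memParser is a toy parser used only to exercise the sketch.
type memParser struct {
	size   int64
	closed bool
}

func (m *memParser) SetPos(offset int64) error {
	if offset > m.size {
		return errBadOffset
	}
	return nil
}

func (m *memParser) Close() error {
	m.closed = true
	return nil
}
```

Putting the cleanup in one deferred function keyed on the named error result means that steps added later between creation and return are covered without repeating the close call on each branch.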

OliverS929 · 2026-03-24T11:58:34Z

+	}
+	for _, dbMeta := range dbMetas {
+		for _, tblMeta := range dbMeta.Tables {
+			singleReplicaSize, err := s.estimateOneTableSize(ctx, tblMeta)


AI-assisted review: EstimateImportDataSize currently fails the whole request on the first bad table, while GetTableMetas already honors skipInvalidFiles and can return partial results. Is that difference intentional?

It seems a bit surprising from the SDK caller's perspective: with skipInvalidFiles=true, metadata discovery can still succeed for the valid tables, but size estimation becomes all-or-nothing. If that behavior is intended, it may be worth documenting explicitly; otherwise, this path may want to follow the same partial-result behavior.
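
The difference raised here can be sketched as a single aggregation loop with a flag. This is an illustration only: `tableRef`, `importEstimate`, `estimateAll`, the `Skipped` field and `errInvalidTable` are hypothetical names, and whether the SDK should expose skipped tables this way is exactly the open design question.

```go
package main

import "errors"

// tableRef identifies one table (hypothetical).
type tableRef struct{ Database, Table string }

// tableEstimate is a hypothetical per-table result.
type tableEstimate struct {
	Database, Table string
	TiKVSize        int64
}

// importEstimate is a hypothetical aggregate result.
type importEstimate struct {
	Tables        []tableEstimate
	TotalTiKVSize int64
	Skipped       []string // tables omitted because estimation failed
}

// errInvalidTable is a hypothetical per-table failure.
var errInvalidTable = errors.New("invalid table data")

// estimateAll sums per-table estimates. With skipInvalid=false the first
// failure aborts the whole request; with skipInvalid=true failing tables are
// recorded in Skipped and the remaining tables are still estimated.
func estimateAll(tables []tableRef, estimateOne func(tableRef) (int64, error), skipInvalid bool) (*importEstimate, error) {
	est := &importEstimate{}
	for _, t := range tables {
		size, err := estimateOne(t)
		if err != nil {
			if !skipInvalid {
				return nil, err
			}
			est.Skipped = append(est.Skipped, t.Database+"."+t.Table)
			continue
		}
		est.Tables = append(est.Tables, tableEstimate{t.Database, t.Table, size})
		est.TotalTiKVSize += size
	}
	return est, nil
}
```

A partial total is a lower bound on the real import size, so a caller that uses it to provision disk would need to know which tables were skipped.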

coderabbitai

Actionable comments posted: 1

♻️ Duplicate comments (1)

pkg/executor/importer/sampler.go (1)

129-136: ⚠️ Potential issue | 🟠 Major

Keep usable sample data when sampling hits an error.

This path still throws valid sample data away in two places: sampleIndexSizeRatio returns 0 on any error, and sampleOneFile zeroes out bytes/KVs already collected before a late read/encode/add failure. CalResourceParams can then proceed with indexSizeRatio == 0 even though the sample already contains enough data to produce a usable estimate.

💡 Suggested fix

 func (e *LoadDataController) sampleIndexSizeRatio(
 	ctx context.Context,
 	ksCodec []byte,
 ) (float64, error) {
 	result, err := e.sampleKVSize(ctx, ksCodec)
-	if err != nil {
-		return 0, err
-	}
-	if result.DataKVSize == 0 {
-		return 0, nil
+	if result == nil || result.DataKVSize == 0 {
+		return 0, err
 	}
-	return float64(result.IndexKVSize) / float64(result.DataKVSize), nil
+	return float64(result.IndexKVSize) / float64(result.DataKVSize), err
 }

 	var (
 		count        int
 		readRowCache []types.Datum
 		readFn       = parserEncodeReader(parser, chunk.Chunk.EndOffset, chunk.GetKey())
 		kvBatch      = newEncodedKVGroupBatch(ksCodec, maxRowCount)
 	)
+	finalize := func(retErr error) (int64, uint64, uint64, error) {
+		dataKVSize, indexKVSize := kvBatch.groupChecksum.DataAndIndexSumSize()
+		return sourceSize, dataKVSize, indexKVSize, retErr
+	}
 	for count < maxRowCount {
 		row, closed, readErr := readFn(ctx, readRowCache)
 		if readErr != nil {
-			return 0, 0, 0, readErr
+			return finalize(readErr)
 		}
 		if closed {
 			break
 		}
@@
 		kvs, encodeErr := encoder.Encode(row.row, row.rowID)
 		row.resetFn()
 		if encodeErr != nil {
-			return 0, 0, 0, common.ErrEncodeKV.Wrap(encodeErr).GenWithStackByArgs(chunk.GetKey(), row.startPos)
+			return finalize(common.ErrEncodeKV.Wrap(encodeErr).GenWithStackByArgs(chunk.GetKey(), row.startPos))
 		}
 		if _, err = kvBatch.add(kvs); err != nil {
-			return 0, 0, 0, err
+			return finalize(err)
 		}
 		count++
 	}
-	dataKVSize, indexKVSize = kvBatch.groupChecksum.DataAndIndexSumSize()
-	return sourceSize, dataKVSize, indexKVSize, nil
+	return finalize(nil)
 }

Also applies to: 354-383

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@pkg/executor/importer/sampler.go` around lines 129 - 136, sample data is
being discarded on errors: in sampleIndexSizeRatio and sampleOneFile you should
preserve and use any already-collected sample totals instead of returning 0 or
zeroing totals when a later error occurs; update sampleIndexSizeRatio to check
the returned result (e.g., result.DataKVSize) and if it contains usable data
compute and return the ratio (IndexKVSize/DataKVSize) even if an error occurred
while finishing sampling, and modify sampleOneFile so that on a late
read/encode/add failure you do not reset the aggregated totals (bytes/KVs) —
only discard or reset the temporary batch variables for that file/operation so
the overall sample totals accumulated in e.sample... remain intact for
CalResourceParams to consume.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/executor/importer/import.go`:
- Around line 1717-1722: The switch default currently sets
exeerrors.ErrLoadDataUnsupportedFormat into err but the common post-switch
wrapper always remaps any err to exeerrors.ErrLoadDataWrongFormatConfig; update
the post-switch handling in import.go so that if err is
exeerrors.ErrLoadDataUnsupportedFormat (use errors.Is or direct comparison) you
return that error directly, otherwise keep the existing wrapping into
exeerrors.ErrLoadDataWrongFormatConfig; reference the local variable err and the
two error symbols exeerrors.ErrLoadDataUnsupportedFormat and
exeerrors.ErrLoadDataWrongFormatConfig to locate and implement the conditional
return.
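
The conditional return described in this inline comment can be sketched as below. The sentinel variables are local stand-ins for `exeerrors.ErrLoadDataUnsupportedFormat` and `exeerrors.ErrLoadDataWrongFormatConfig`, and `checkFormat`, its format names and `errBadDelimiter` are hypothetical; the real switch in `import.go` is not reproduced here.

```go
package main

import (
	"errors"
	"fmt"
)

var (
	// Stand-ins for the two exeerrors sentinels named in the comment.
	errUnsupportedFormat = errors.New("unsupported format")
	errWrongFormatConfig = errors.New("wrong format config")
	// errBadDelimiter is a hypothetical configuration error.
	errBadDelimiter = errors.New("bad delimiter")
)

// checkFormat returns the unsupported-format error unchanged and wraps any
// other error as a wrong-format-config error.
func checkFormat(format string, cfgErr error) error {
	var err error
	switch format {
	case "csv", "sql", "parquet":
		err = cfgErr // a per-format configuration problem, if any
	default:
		err = errUnsupportedFormat
	}
	if err == nil {
		return nil
	}
	if errors.Is(err, errUnsupportedFormat) {
		return err // do not remap: the format itself is the problem
	}
	return fmt.Errorf("%w: %v", errWrongFormatConfig, err)
}
```

Callers can then use `errors.Is` to tell an unknown format apart from a misconfigured known one, which a single blanket wrapper would hide.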

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 18a0f6a1-3464-4e96-b23b-547b35233303

📥 Commits

Reviewing files that changed from the base of the PR and between bb1b230 and 23f73bb.

📒 Files selected for processing (8)

pkg/executor/importer/BUILD.bazel
pkg/executor/importer/import.go
pkg/executor/importer/sampler.go
pkg/executor/importer/sampler_test.go
pkg/executor/importer/table_import.go
pkg/importsdk/BUILD.bazel
pkg/importsdk/file_scanner.go
pkg/importsdk/file_scanner_test.go

✅ Files skipped from review due to trivial changes (1)

pkg/importsdk/BUILD.bazel

🚧 Files skipped from review as they are similar to previous changes (3)

pkg/executor/importer/sampler_test.go
pkg/importsdk/file_scanner_test.go
pkg/importsdk/file_scanner.go

ti-chi-bot · 2026-03-25T03:26:38Z

[LGTM Timeline notifier]

Timeline:

2026-03-25 01:52:19.340770321 +0000 UTC m=+319535.376840581: ☑️ agreed by joechenrh.
2026-03-25 03:26:37.010129954 +0000 UTC m=+325193.046200224: ☑️ agreed by OliverS929.

GMHDBJD · 2026-03-25T07:56:54Z

/retest

Benjamin2037

LGTM

ti-chi-bot · 2026-03-25T08:57:35Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Benjamin2037, joechenrh, OliverS929

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [Benjamin2037,OliverS929,joechenrh]
~~lightning/OWNERS~~ [Benjamin2037,OliverS929]
~~pkg/executor/importer/OWNERS~~ [Benjamin2037]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

GMHDBJD · 2026-03-25T11:46:28Z

/retest

pkg/importsdk, pkg/executor/importer, lightning/pkg/importinto: add i…

375736b

…mport size estimate

ti-chi-bot Bot added release-note-none Denotes a PR that doesn't merit a release note. do-not-merge/invalid-title labels Mar 23, 2026

GMHDBJD added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. component/import component/lightning This issue is related to Lightning of TiDB. labels Mar 23, 2026

ti-chi-bot Bot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Mar 23, 2026

ti-chi-bot Bot added the ok-to-test Indicates a PR is ready to be tested. label Mar 23, 2026

GMHDBJD removed the do-not-merge/invalid-title label Mar 23, 2026

GMHDBJD changed the title ~~pkg/importsdk, pkg/executor/importer, lightning/pkg/importinto: add import size estimate~~ importsdk, importer, importinto: add import size estimate Mar 23, 2026

chore: update bazel file

5e961a0

coderabbitai Bot reviewed Mar 23, 2026

pkg/executor/importer, pkg/importsdk: decouple size sampling from plan

bb1b230

coderabbitai Bot reviewed Mar 23, 2026

OliverS929 reviewed Mar 24, 2026

ingress-bot reviewed Mar 24, 2026

joechenrh reviewed Mar 24, 2026

Comment thread pkg/importsdk/model.go

Comment thread pkg/importsdk/model.go

Comment thread pkg/executor/importer/import.go Outdated

Comment thread pkg/executor/importer/import.go

Comment thread pkg/executor/importer/import.go

OliverS929 reviewed Mar 24, 2026

pkg/importsdk, pkg/executor/importer: address import size estimate re…

23f73bb

…views

coderabbitai Bot reviewed Mar 24, 2026

Comment thread pkg/executor/importer/import.go

ti-chi-bot Bot removed the needs-1-more-lgtm Indicates a PR needs 1 more LGTM. label Mar 25, 2026

GMHDBJD mentioned this pull request Mar 25, 2026

lightning_compress local+gzip checkpoint restore can deterministically fail with ErrChecksumMismatch after restart #67293

Open

Benjamin2037 approved these changes Mar 25, 2026

ti-chi-bot Bot added the approved label Mar 25, 2026

ti-chi-bot Bot merged commit c5b7db3 into pingcap:master Mar 25, 2026
36 checks passed

This was referenced Mar 26, 2026

pkg/ddl, pkg/tici: add TiCI pre-split for DDL global sort ingest | tidb-test=13ccf8de48e8db2290ff884598444d0508606bbf tiflash=feature-fts #67313

Merged

importsdk, importer: fix sampled source size in import estimate #67492

Merged

This was referenced May 6, 2026

*: restore views without placeholder tables #68186

Open

importer, mydump: preload small parquet files in a single read #68250

Open

coderabbitai Bot mentioned this pull request May 18, 2026

mydump: move parquet parser and test utils to parquetfile pkg #68460

Merged

13 tasks
