
importsdk, importer, importinto: add import size estimate #67241

Merged
ti-chi-bot[bot] merged 6 commits into pingcap:master from GMHDBJD:import-size-estimate on Mar 25, 2026

Conversation

@GMHDBJD
Collaborator

@GMHDBJD GMHDBJD commented Mar 23, 2026

What problem does this PR solve?

Issue Number: close #67240

Problem Summary:

Premium needs the final import size estimate before an import starts, so it can expand disk in advance.

What changed and how does it work?

  • Add EstimateImportDataSize to import SDK.
  • Reuse nextgen KV encoding sampling to estimate final single-replica KV size.
  • Pass CSV and charset options through import-into so the estimate uses consistent parsing settings.
  • Decouple file-based size sampling from importer plan construction so the sampling logic can be reused directly by import SDK.
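
One way to picture a sampling-based estimate like the one listed above is a ratio scale-up: encode a sample of rows, compare the encoded KV bytes with the source bytes consumed, and apply that ratio to the table's total source size. The Go sketch below is an illustration only, not the PR's code. The type `sampledKVSize` and the function `estimateTiKVSize` are hypothetical names, and the scale-up formula is an assumption; only the fallback to the source size for non-positive samples is taken from the snippet quoted later in this conversation.

```go
package main

import "fmt"

// sampledKVSize is a hypothetical stand-in for the sampling result.
// The field names follow the counters named in the PR walkthrough.
type sampledKVSize struct {
	SourceSize  int64  // source bytes consumed while sampling
	DataKVSize  uint64 // encoded row (data) KV bytes
	IndexKVSize uint64 // encoded index KV bytes
}

func (s sampledKVSize) totalKVSize() int64 {
	return int64(s.DataKVSize + s.IndexKVSize)
}

// estimateTiKVSize scales the table's total source size by the sampled
// KV-to-source ratio (assumed formula). When the sample is unusable it
// falls back to the source size, as the quoted PR snippet does.
func estimateTiKVSize(totalSourceSize int64, s sampledKVSize) int64 {
	if s.SourceSize <= 0 || s.totalKVSize() <= 0 {
		return totalSourceSize
	}
	ratio := float64(s.totalKVSize()) / float64(s.SourceSize)
	return int64(float64(totalSourceSize) * ratio)
}

func main() {
	s := sampledKVSize{SourceSize: 1000, DataKVSize: 1200, IndexKVSize: 300}
	fmt.Println(estimateTiKVSize(10_000, s)) // prints 15000
}
```

The result is a single-replica figure; a caller planning disk capacity would still need to account for replication and compression separately.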

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No need to test
    • I checked and no code files have been changed.

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Documentation

  • Affects user behaviors
  • Contains syntax changes
  • Contains variable changes
  • Contains experimental features
  • Changes MySQL compatibility

Release note

Please refer to Release Notes Language Style Guide to write a quality release note.

None

Summary by CodeRabbit

  • New Features

    • CSV configuration and data character-set options for import.
    • New API to estimate import data size with per-table source and TiKV size estimates.
  • Improvements

    • More accurate KV-size sampling and index/data-ratio estimation.
    • Improved parser/resource lifecycle handling and clearer encoding error reporting.
  • Tests

    • Added tests validating size estimates and parser error/close behavior.
  • Chores

    • Updated mocks and build configuration to support new APIs.

@ti-chi-bot ti-chi-bot Bot added release-note-none Denotes a PR that doesn't merit a release note. do-not-merge/invalid-title labels Mar 23, 2026
@pantheon-ai

pantheon-ai Bot commented Mar 23, 2026

Review failed due to infrastructure/execution failure after retries. Please re-trigger review.

ℹ️ Learn more details on Pantheon AI.

@GMHDBJD GMHDBJD added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. component/import component/lightning This issue is related to Lightning of TiDB. labels Mar 23, 2026
@ti-chi-bot ti-chi-bot Bot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Mar 23, 2026
@tiprow

tiprow Bot commented Mar 23, 2026

Hi @GMHDBJD. Thanks for your PR.

PRs from untrusted users cannot be marked as trusted with /ok-to-test in this repo meaning untrusted PR authors can never trigger tests themselves. Collaborators can still trigger tests on the PR using /test all.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@coderabbitai

coderabbitai Bot commented Mar 23, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


Walkthrough

Adds import-size estimation: extends the import SDK and FileScanner with an API to estimate per-table and total import TiKV sizes, threads CSV/charset config into the SDK, and refactors executor sampling to compute source, data-KV, and index-KV sizes per file.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| SDK config & models: pkg/importsdk/config.go, pkg/importsdk/model.go, pkg/importsdk/BUILD.bazel | Added CSV and data-character-set fields/options to SDKConfig (WithCSVConfig, WithDataCharacterSet); added TableDataSizeEstimate and ImportDataSizeEstimate models; expanded importsdk Bazel deps. |
| FileScanner API & tests: pkg/importsdk/file_scanner.go, pkg/importsdk/file_scanner_test.go | Added EstimateImportDataSize to FileScanner; implemented per-table estimation with sampling, schema→TableInfo conversion, sampling-config builders, and tests for SQL/CSV/invalid files. |
| Executor KV-size sampler & tests: pkg/executor/importer/sampler.go, pkg/executor/importer/sampler_test.go | Reworked sampling to file-based pipeline: introduced SampledKVSizeResult, KVSizeSampleConfig, SampleFileImportKVSize; accumulate SourceSize/DataKVSize/IndexKVSize; removed sentinel stop flow; updated tests to validate ratios and parser-close-on-error. |
| Importer wiring: lightning/pkg/importinto/importer.go | Passes cfg.Mydumper.CSV and cfg.Mydumper.DataCharacterSet into SDK via importsdk.WithCSVConfig(...) and importsdk.WithDataCharacterSet(...). |
| Import controller & parser/CSV refactors: pkg/executor/importer/import.go, pkg/executor/importer/table_import.go, pkg/executor/load_data.go | Extracted parser/CSV config and column-mapping helpers to package-level functions; added newLoadDataParser; introduced HandleSkipNRows; improved parser close-on-error handling and adjusted call sites to pass IgnoreLines. |
| KV encoder interface tweak: pkg/executor/importer/kv_encode.go | Introduced simpleColAssignExprCreator interface and updated encoder creation to accept an expr creator instead of controller receiver. |
| Mocks / gomock regen: pkg/importsdk/mock/sdk_mock.go | Regenerated mocks: replaced ISGOMOCK() methods with isgomock fields; standardized ctx/param names; added EstimateImportDataSize mock methods and recorders. |
| Build/test deps: pkg/executor/importer/BUILD.bazel | Added test deps (//pkg/objstore/objectio, //pkg/objstore/storeapi) required by new tests. |

Sequence Diagram

sequenceDiagram
    participant Client as SDK Consumer
    participant FS as FileScanner
    participant Loader as Loader/Metadata
    participant Sampler as SampleFileImportKVSize
    participant Result as ImportDataSizeEstimate

    Client->>FS: EstimateImportDataSize(ctx)
    FS->>Loader: Discover databases & tables
    loop per table
        FS->>FS: estimateOneTableSize()
        FS->>FS: build KVSizeSampleConfig & TableInfo
        FS->>Sampler: SampleFileImportKVSize(ctx, cfg, tbl, dataStore, files)
        Sampler->>Sampler: parse files, encode rows, accumulate SourceSize/DataKV/IndexKV
        Sampler-->>FS: SampledKVSizeResult
        FS->>FS: compute table TiKV estimate and append
    end
    FS-->>Client: ImportDataSizeEstimate (tables + totals)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested reviewers

  • joechenrh
  • Leavrth
  • mjonss

Poem

🐰 I nibble rows and count each bite,
Source, data, index — measured light.
I hop through files with tiny paws,
Tallying totals without a pause.
Disk can grow — the rabbit applauds.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 43.33% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly and concisely summarizes the main change: adding import size estimation across three packages (importsdk, importer, importinto). |
| Description check | ✅ Passed | The description follows the template structure with issue reference, problem summary, detailed explanation of changes, checklist completion, and release note. All required sections are addressed. |
| Linked Issues check | ✅ Passed | The PR fully addresses issue #67240 by adding EstimateImportDataSize API to import SDK, reusing KV encoding sampling, passing CSV/charset options, and supporting aggregated size estimation across tables. |
| Out of Scope Changes check | ✅ Passed | All changes are directly related to implementing import size estimation. Mock updates and refactoring are necessary supporting changes to enable the new API and maintain test coverage. |

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.11.3)

Command failed



@hawkingrei
Member

/ok-to-test

@ti-chi-bot ti-chi-bot Bot added the ok-to-test Indicates a PR is ready to be tested. label Mar 23, 2026
@GMHDBJD GMHDBJD changed the title from "pkg/importsdk, pkg/executor/importer, lightning/pkg/importinto: add import size estimate" to "importsdk, importer, importinto: add import size estimate" Mar 23, 2026
@codecov

codecov Bot commented Mar 23, 2026

Codecov Report

❌ Patch coverage is 70.81851% with 164 lines in your changes missing coverage. Please review.
✅ Project coverage is 78.6386%. Comparing base (e34d41c) to head (91838e9).
⚠️ Report is 30 commits behind head on master.

Additional details and impacted files
@@               Coverage Diff                @@
##             master     #67241        +/-   ##
================================================
+ Coverage   77.7732%   78.6386%   +0.8653%     
================================================
  Files          2016       1951        -65     
  Lines        552852     551762      -1090     
================================================
+ Hits         429971     433898      +3927     
+ Misses       121139     117410      -3729     
+ Partials       1742        454      -1288     
| Flag | Coverage Δ |
| --- | --- |
| integration | 43.8351% <0.6802%> (-4.2950%) ⬇️ |
| unit | 76.9476% <70.4626%> (+0.6439%) ⬆️ |

Flags with carried forward coverage won't be shown. Click here to find out more.

| Components | Coverage Δ |
| --- | --- |
| dumpling | 61.5065% <ø> (ø) |
| parser | ∅ <ø> (∅) |
| br | 48.9151% <ø> (-11.9852%) ⬇️ |

@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (3)
pkg/importsdk/file_scanner.go (2)

341-344: Fallback to source size when sampling yields non-positive results.

The fallback at lines 341-344 returns tblMeta.TotalSize as the TiKV estimate when sampling produces non-positive values. This is a reasonable safeguard, but it may overestimate (source size ≠ KV size) or underestimate depending on the data format. Consider logging a warning when this fallback is triggered to aid debugging.

📝 Suggested logging improvement
 	if sampledSize.SourceSize <= 0 || sampledSize.TotalKVSize() <= 0 {
+		s.logger.Warn("sampling returned non-positive size, using source size as estimate",
+			zap.String("database", dbMeta.Name),
+			zap.String("table", tblMeta.Name),
+			zap.Int64("sourceSize", sampledSize.SourceSize),
+			zap.Int64("kvSize", sampledSize.TotalKVSize()))
 		return tblMeta.TotalSize, nil
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/importsdk/file_scanner.go` around lines 341 - 344, The current fallback
returns tblMeta.TotalSize when sampledSize.SourceSize <= 0 or
sampledSize.TotalKVSize() <= 0 without any visibility; update the function
containing this return in file_scanner.go to log a warning when the fallback
path is taken (include sampledSize.SourceSize, sampledSize.TotalKVSize(), and
tblMeta.TotalSize in the message) so callers can debug sampling issues; use the
package logger already used nearby (or the function's contextual logger) and
keep the log level as warning; then return the same tblMeta.TotalSize as before.

47-47: Consider adding a doc comment to the interface method.

The new EstimateImportDataSize method on the FileScanner interface lacks documentation. A brief comment explaining its purpose would help consumers of this API.

📝 Suggested documentation
 	GetTotalSize(ctx context.Context) int64
+	// EstimateImportDataSize samples source data to estimate the final TiKV size
+	// after encoding. Returns per-table and aggregated size estimates.
 	EstimateImportDataSize(ctx context.Context) (*ImportDataSizeEstimate, error)
 	Close() error
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/importsdk/file_scanner.go` at line 47, Add a short doc comment above the
FileScanner interface method EstimateImportDataSize describing its purpose and
behavior: explain that EstimateImportDataSize(ctx context.Context) returns an
ImportDataSizeEstimate and error, what the estimate represents (e.g., expected
total bytes/objects to be imported), any important semantics (context
cancellation support, when it may return an error or a partial estimate), and
whether the call is expected to be fast or may perform a scan; update the
comment near the FileScanner interface so consumers understand inputs, outputs,
and error conditions.
pkg/executor/importer/sampler.go (1)

117-128: Returning partial results alongside an error may mask sampling failures.

The function accumulates results across files and returns both the accumulated result and firstErr. If sampling fails for one file, callers receive partial data without clear indication of which tables succeeded. Consider whether this "best-effort" behavior is intentional or if you should fail fast on the first error.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/executor/importer/sampler.go` around lines 117 - 128, The current loop in
sampler.go accumulates SampledKVSizeResult across files while returning the
first error (firstErr), which yields partial results when any call to
sampleKVSizeForOneFile fails; decide and implement one behavior: either fail
fast by checking err after each sampleKVSizeForOneFile call and immediately
return nil, err (stop accumulating when err != nil), or switch to an
explicit best-effort mode by collecting per-file results and errors (e.g.,
map[file]error or a multi-error) and return both the full per-file result set
and an aggregated error; update the code around SampledKVSizeResult, firstErr,
and the loop over files to follow the chosen approach and ensure callers can
distinguish success vs partial success.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: fa78505a-ef74-42a0-a3ff-91e57d5ccb07

📥 Commits

Reviewing files that changed from the base of the PR and between 27f439f and 5e961a0.

📒 Files selected for processing (8)
  • lightning/pkg/importinto/importer.go
  • pkg/executor/importer/sampler.go
  • pkg/importsdk/BUILD.bazel
  • pkg/importsdk/config.go
  • pkg/importsdk/file_scanner.go
  • pkg/importsdk/file_scanner_test.go
  • pkg/importsdk/mock/sdk_mock.go
  • pkg/importsdk/model.go

@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (2)
pkg/executor/importer/import.go (1)

1104-1122: Edge case: nil vs empty slice semantics.

The function treats columnNames == nil (line 1108) specially by returning cols as-is, but an empty slice []string{} would fail the length check on lines 1105-1107 if cols is non-empty. This appears intentional based on caller usage patterns, but consider adding a brief comment explaining the nil vs empty slice distinction.

📝 Suggested documentation
 func reorderColumnsByNames(cols []*table.Column, columnNames []string) ([]*table.Column, error) {
 	if len(cols) != len(columnNames) {
 		return nil, exeerrors.ErrColumnsNotMatched
 	}
+	// nil columnNames means no reordering needed (preserve original order)
 	if columnNames == nil {
 		return cols, nil
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/executor/importer/import.go` around lines 1104 - 1122, The function
reorderColumnsByNames currently treats columnNames == nil as "no reordering" but
treats an empty slice as a length-mismatch error (ErrColumnsNotMatched); add a
concise comment above reorderColumnsByNames explaining this nil-vs-empty-slice
semantic (mention cols, columnNames and ErrColumnsNotMatched) so future readers
understand that nil means "leave cols as-is" while an empty slice is validated
against cols length and triggers the error.
pkg/importsdk/file_scanner.go (1)

296-339: Consider adding a log for fallback estimation.

When sampledSize.SourceSize <= 0 || sampledSize.TotalKVSize() <= 0 (line 335), the method falls back to using tblMeta.TotalSize directly as the TiKV size estimate. This fallback may produce inaccurate estimates (source size ≠ TiKV size). Consider logging a warning to help users understand when estimates may be less accurate.

💡 Suggested improvement
 	if sampledSize.SourceSize <= 0 || sampledSize.TotalKVSize() <= 0 {
+		s.logger.Warn("sampling returned non-positive sizes, falling back to source size as TiKV estimate",
+			zap.String("table", tblMeta.Name),
+			zap.Int64("sourceSize", sampledSize.SourceSize),
+			zap.Int64("totalKVSize", sampledSize.TotalKVSize()))
 		return tblMeta.TotalSize, nil
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/importsdk/file_scanner.go` around lines 296 - 339, In
estimateOneTableSize, when the sampledSize fallback is used (condition
sampledSize.SourceSize <= 0 || sampledSize.TotalKVSize() <= 0), add a warning
log to indicate the fallback and include identifying details (tblMeta.DB,
tblMeta.Name), the tblMeta.TotalSize used, and the sampledSize values so users
know the estimate may be inaccurate; locate this in function
estimateOneTableSize and use the existing logger (s.logger / s.logger.Logger) to
emit a clear Warn/Warnf message immediately before returning tblMeta.TotalSize.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: fe5118e1-feeb-4805-b6cd-13cf6f800bda

📥 Commits

Reviewing files that changed from the base of the PR and between 5e961a0 and bb1b230.

📒 Files selected for processing (5)
  • pkg/executor/importer/import.go
  • pkg/executor/importer/kv_encode.go
  • pkg/executor/importer/sampler.go
  • pkg/executor/importer/sampler_test.go
  • pkg/importsdk/file_scanner.go

@ingress-bot

🔍 Starting code review for this PR...

@ingress-bot

🔍 New commits detected — starting re-review...


@GMHDBJD
Collaborator Author

GMHDBJD commented Mar 24, 2026

/retest

return 0, errors.Trace(err)
}
if sampledSize.SourceSize <= 0 || sampledSize.TotalKVSize() <= 0 {
return tblMeta.TotalSize, nil
Contributor

Should this return 0 when sampling succeeds but there are no data rows?

err = scanner.CreateSchemaAndTableByName(ctx, "db1", "nonexistent")
require.Error(t, err)
})

Contributor

Should we add some CSV-based tests here as well?

@ingress-bot ingress-bot left a comment

This review was generated by AI and should be verified by a human reviewer.
Manual follow-up is recommended before merge.

Summary

  • Total findings: 6
  • Inline comments: 6
  • Summary-only findings (no inline anchor): 0
Findings (highest risk first)

⚠️ [Major] (1)

  1. Production estimation path depends on test-only utilities (pkg/importsdk/file_scanner.go:397, pkg/util/mock/fortest.go:22)

🟡 [Minor] (5)

  1. Dead method reorderColumns left after refactoring (pkg/executor/importer/import.go:1172)
  2. sampleIndexSizeRatio discards valid partial results on sampling error (pkg/executor/importer/sampler.go:130)
  3. Bare boolean literals at generateCSVConfig call site obscure parameter meaning (pkg/executor/importer/sampler.go:217)
  4. Parser (and underlying reader) leaked when handleSkipNRows or SetPos fails after parser creation (pkg/executor/importer/sampler.go:250)
  5. FileScanner and SDK interface expansion breaks external implementors (pkg/importsdk/file_scanner.go:46, pkg/importsdk/sdk.go:23)

}
for _, dbMeta := range dbMetas {
for _, tblMeta := range dbMeta.Tables {
singleReplicaSize, err := s.estimateOneTableSize(ctx, tblMeta)
Contributor

AI-assisted review: EstimateImportDataSize currently fails the whole request on the first bad table, while GetTableMetas already honors skipInvalidFiles and can return partial results. Is that difference intentional?

It seems a bit surprising from the SDK caller's perspective: with skipInvalidFiles=true, metadata discovery can still succeed for the valid tables, but size estimation becomes all-or-nothing. If that behavior is intended, it may be worth documenting explicitly; otherwise, this path may want to follow the same partial-result behavior.

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (1)
pkg/executor/importer/sampler.go (1)

129-136: ⚠️ Potential issue | 🟠 Major

Keep usable sample data when sampling hits an error.

This path still throws valid sample data away in two places: sampleIndexSizeRatio returns 0 on any error, and sampleOneFile zeroes out bytes/KVs already collected before a late read/encode/add failure. CalResourceParams can then proceed with indexSizeRatio == 0 even though the sample already contains enough data to produce a usable estimate.

💡 Suggested fix
 func (e *LoadDataController) sampleIndexSizeRatio(
 	ctx context.Context,
 	ksCodec []byte,
 ) (float64, error) {
 	result, err := e.sampleKVSize(ctx, ksCodec)
-	if err != nil {
-		return 0, err
-	}
-	if result.DataKVSize == 0 {
-		return 0, nil
+	if result == nil || result.DataKVSize == 0 {
+		return 0, err
 	}
-	return float64(result.IndexKVSize) / float64(result.DataKVSize), nil
+	return float64(result.IndexKVSize) / float64(result.DataKVSize), err
 }
 	var (
 		count        int
 		readRowCache []types.Datum
 		readFn       = parserEncodeReader(parser, chunk.Chunk.EndOffset, chunk.GetKey())
 		kvBatch      = newEncodedKVGroupBatch(ksCodec, maxRowCount)
 	)
+	finalize := func(retErr error) (int64, uint64, uint64, error) {
+		dataKVSize, indexKVSize := kvBatch.groupChecksum.DataAndIndexSumSize()
+		return sourceSize, dataKVSize, indexKVSize, retErr
+	}
 	for count < maxRowCount {
 		row, closed, readErr := readFn(ctx, readRowCache)
 		if readErr != nil {
-			return 0, 0, 0, readErr
+			return finalize(readErr)
 		}
 		if closed {
 			break
 		}
@@
 		kvs, encodeErr := encoder.Encode(row.row, row.rowID)
 		row.resetFn()
 		if encodeErr != nil {
-			return 0, 0, 0, common.ErrEncodeKV.Wrap(encodeErr).GenWithStackByArgs(chunk.GetKey(), row.startPos)
+			return finalize(common.ErrEncodeKV.Wrap(encodeErr).GenWithStackByArgs(chunk.GetKey(), row.startPos))
 		}
 		if _, err = kvBatch.add(kvs); err != nil {
-			return 0, 0, 0, err
+			return finalize(err)
 		}
 		count++
 	}
-	dataKVSize, indexKVSize = kvBatch.groupChecksum.DataAndIndexSumSize()
-	return sourceSize, dataKVSize, indexKVSize, nil
+	return finalize(nil)
 }

Also applies to: 354-383

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/executor/importer/sampler.go` around lines 129 - 136, sample data is
being discarded on errors: in sampleIndexSizeRatio and sampleOneFile you should
preserve and use any already-collected sample totals instead of returning 0 or
zeroing totals when a later error occurs; update sampleIndexSizeRatio to check
the returned result (e.g., result.DataKVSize) and if it contains usable data
compute and return the ratio (IndexKVSize/DataKVSize) even if an error occurred
while finishing sampling, and modify sampleOneFile so that on a late
read/encode/add failure you do not reset the aggregated totals (bytes/KVs) —
only discard or reset the temporary batch variables for that file/operation so
the overall sample totals accumulated in e.sample... remain intact for
CalResourceParams to consume.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/executor/importer/import.go`:
- Around line 1717-1722: The switch default currently sets
exeerrors.ErrLoadDataUnsupportedFormat into err but the common post-switch
wrapper always remaps any err to exeerrors.ErrLoadDataWrongFormatConfig; update
the post-switch handling in import.go so that if err is
exeerrors.ErrLoadDataUnsupportedFormat (use errors.Is or direct comparison) you
return that error directly, otherwise keep the existing wrapping into
exeerrors.ErrLoadDataWrongFormatConfig; reference the local variable err and the
two error symbols exeerrors.ErrLoadDataUnsupportedFormat and
exeerrors.ErrLoadDataWrongFormatConfig to locate and implement the conditional
return.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 18a0f6a1-3464-4e96-b23b-547b35233303

📥 Commits

Reviewing files that changed from the base of the PR and between bb1b230 and 23f73bb.

📒 Files selected for processing (8)
  • pkg/executor/importer/BUILD.bazel
  • pkg/executor/importer/import.go
  • pkg/executor/importer/sampler.go
  • pkg/executor/importer/sampler_test.go
  • pkg/executor/importer/table_import.go
  • pkg/importsdk/BUILD.bazel
  • pkg/importsdk/file_scanner.go
  • pkg/importsdk/file_scanner_test.go
✅ Files skipped from review due to trivial changes (1)
  • pkg/importsdk/BUILD.bazel
🚧 Files skipped from review as they are similar to previous changes (3)
  • pkg/executor/importer/sampler_test.go
  • pkg/importsdk/file_scanner_test.go
  • pkg/importsdk/file_scanner.go

Comment thread pkg/executor/importer/import.go
@ti-chi-bot ti-chi-bot Bot removed the needs-1-more-lgtm Indicates a PR needs 1 more LGTM. label Mar 25, 2026
@ti-chi-bot

ti-chi-bot Bot commented Mar 25, 2026

[LGTM Timeline notifier]

Timeline:

  • 2026-03-25 01:52:19.340770321 +0000 UTC: ☑️ agreed by joechenrh.
  • 2026-03-25 03:26:37.010129954 +0000 UTC: ☑️ agreed by OliverS929.

@GMHDBJD
Collaborator Author

GMHDBJD commented Mar 25, 2026

/retest

Collaborator

@Benjamin2037 Benjamin2037 left a comment

LGTM

@ti-chi-bot

ti-chi-bot Bot commented Mar 25, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Benjamin2037, joechenrh, OliverS929

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot Bot added the approved label Mar 25, 2026
@GMHDBJD
Collaborator Author

GMHDBJD commented Mar 25, 2026

/retest

@ti-chi-bot ti-chi-bot Bot merged commit c5b7db3 into pingcap:master Mar 25, 2026
36 checks passed

Labels

approved component/import component/lightning This issue is related to Lightning of TiDB. lgtm ok-to-test Indicates a PR is ready to be tested. release-note-none Denotes a PR that doesn't merit a release note. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

importsdk, importinto: expose estimated import size for premium disk scaling

6 participants