importsdk, importer, importinto: add import size estimate#67241
Conversation
…mport size estimate
|
Review failed due to infrastructure/execution failure after retries. Please re-trigger review. ℹ️ Learn more details on Pantheon AI. |
|
Hi @GMHDBJD. Thanks for your PR. PRs from untrusted users cannot be marked as trusted with I understand the commands that are listed here. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds import-size estimation: extends the import SDK and FileScanner with an API to estimate per-table and total import TiKV sizes, threads CSV/charset config into the SDK, and refactors executor sampling to compute source, data-KV, and index-KV sizes per file. Changes
Sequence DiagramsequenceDiagram
participant Client as SDK Consumer
participant FS as FileScanner
participant Loader as Loader/Metadata
participant Sampler as SampleFileImportKVSize
participant Result as ImportDataSizeEstimate
Client->>FS: EstimateImportDataSize(ctx)
FS->>Loader: Discover databases & tables
loop per table
FS->>FS: estimateOneTableSize()
FS->>FS: build KVSizeSampleConfig & TableInfo
FS->>Sampler: SampleFileImportKVSize(ctx, cfg, tbl, dataStore, files)
Sampler->>Sampler: parse files, encode rows, accumulate SourceSize/DataKV/IndexKV
Sampler-->>FS: SampledKVSizeResult
FS->>FS: compute table TiKV estimate and append
end
FS-->>Client: ImportDataSizeEstimate (tables + totals)
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Warning There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. 🔧 golangci-lint (2.11.3)Command failed Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
/ok-to-test |
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## master #67241 +/- ##
================================================
+ Coverage 77.7732% 78.6386% +0.8653%
================================================
Files 2016 1951 -65
Lines 552852 551762 -1090
================================================
+ Hits 429971 433898 +3927
+ Misses 121139 117410 -3729
+ Partials 1742 454 -1288
Flags with carried forward coverage won't be shown. Click here to find out more.
🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
🧹 Nitpick comments (3)
pkg/importsdk/file_scanner.go (2)
341-344: Fallback to source size when sampling yields non-positive results.The fallback at lines 341-344 returns
tblMeta.TotalSizeas the TiKV estimate when sampling produces non-positive values. This is a reasonable safeguard, but it may overestimate (source size ≠ KV size) or underestimate depending on the data format. Consider logging a warning when this fallback is triggered to aid debugging.📝 Suggested logging improvement
if sampledSize.SourceSize <= 0 || sampledSize.TotalKVSize() <= 0 { + s.logger.Warn("sampling returned non-positive size, using source size as estimate", + zap.String("database", dbMeta.Name), + zap.String("table", tblMeta.Name), + zap.Int64("sourceSize", sampledSize.SourceSize), + zap.Int64("kvSize", sampledSize.TotalKVSize())) return tblMeta.TotalSize, nil }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/importsdk/file_scanner.go` around lines 341 - 344, The current fallback returns tblMeta.TotalSize when sampledSize.SourceSize <= 0 or sampledSize.TotalKVSize() <= 0 without any visibility; update the function containing this return in file_scanner.go to log a warning when the fallback path is taken (include sampledSize.SourceSize, sampledSize.TotalKVSize(), and tblMeta.TotalSize in the message) so callers can debug sampling issues; use the package logger already used nearby (or the function's contextual logger) and keep the log level as warning; then return the same tblMeta.TotalSize as before.
47-47: Consider adding a doc comment to the interface method.The new
EstimateImportDataSizemethod on theFileScannerinterface lacks documentation. A brief comment explaining its purpose would help consumers of this API.📝 Suggested documentation
GetTotalSize(ctx context.Context) int64 + // EstimateImportDataSize samples source data to estimate the final TiKV size + // after encoding. Returns per-table and aggregated size estimates. EstimateImportDataSize(ctx context.Context) (*ImportDataSizeEstimate, error) Close() error🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/importsdk/file_scanner.go` at line 47, Add a short doc comment above the FileScanner interface method EstimateImportDataSize describing its purpose and behavior: explain that EstimateImportDataSize(ctx context.Context) returns an ImportDataSizeEstimate and error, what the estimate represents (e.g., expected total bytes/objects to be imported), any important semantics (context cancellation support, when it may return an error or a partial estimate), and whether the call is expected to be fast or may perform a scan; update the comment near the FileScanner interface so consumers understand inputs, outputs, and error conditions.pkg/executor/importer/sampler.go (1)
117-128: Returning partial results alongside an error may mask sampling failures.The function accumulates results across files and returns both the accumulated
resultandfirstErr. If sampling fails for one file, callers receive partial data without clear indication of which tables succeeded. Consider whether this "best-effort" behavior is intentional or if you should fail fast on the first error.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/executor/importer/sampler.go` around lines 117 - 128, The current loop in sampler.go accumulates SampledKVSizeResult across files while returning the first error (firstErr), which yields partial results when any call to sampleKVSizeForOneFile fails; decide and implement one behavior: either fail fast by checking err after each sampleKVSizeForOneFile call and immediately return nil, err (remove accumulating when err!=nil), or explicitly switch to an explicit best-effort mode by collecting per-file results and errors (e.g., map[file]error or a multi-error) and return both the full per-file result set and an aggregated error; update the code around SampledKVSizeResult, firstErr, and the loop over files to follow the chosen approach and ensure callers can distinguish success vs partial success.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@pkg/executor/importer/sampler.go`:
- Around line 117-128: The current loop in sampler.go accumulates
SampledKVSizeResult across files while returning the first error (firstErr),
which yields partial results when any call to sampleKVSizeForOneFile fails;
decide and implement one behavior: either fail fast by checking err after each
sampleKVSizeForOneFile call and immediately return nil, err (remove accumulating
when err!=nil), or explicitly switch to an explicit best-effort mode by
collecting per-file results and errors (e.g., map[file]error or a multi-error)
and return both the full per-file result set and an aggregated error; update the
code around SampledKVSizeResult, firstErr, and the loop over files to follow the
chosen approach and ensure callers can distinguish success vs partial success.
In `@pkg/importsdk/file_scanner.go`:
- Around line 341-344: The current fallback returns tblMeta.TotalSize when
sampledSize.SourceSize <= 0 or sampledSize.TotalKVSize() <= 0 without any
visibility; update the function containing this return in file_scanner.go to log
a warning when the fallback path is taken (include sampledSize.SourceSize,
sampledSize.TotalKVSize(), and tblMeta.TotalSize in the message) so callers can
debug sampling issues; use the package logger already used nearby (or the
function's contextual logger) and keep the log level as warning; then return the
same tblMeta.TotalSize as before.
- Line 47: Add a short doc comment above the FileScanner interface method
EstimateImportDataSize describing its purpose and behavior: explain that
EstimateImportDataSize(ctx context.Context) returns an ImportDataSizeEstimate
and error, what the estimate represents (e.g., expected total bytes/objects to
be imported), any important semantics (context cancellation support, when it may
return an error or a partial estimate), and whether the call is expected to be
fast or may perform a scan; update the comment near the FileScanner interface so
consumers understand inputs, outputs, and error conditions.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: fa78505a-ef74-42a0-a3ff-91e57d5ccb07
📒 Files selected for processing (8)
lightning/pkg/importinto/importer.gopkg/executor/importer/sampler.gopkg/importsdk/BUILD.bazelpkg/importsdk/config.gopkg/importsdk/file_scanner.gopkg/importsdk/file_scanner_test.gopkg/importsdk/mock/sdk_mock.gopkg/importsdk/model.go
There was a problem hiding this comment.
🧹 Nitpick comments (2)
pkg/executor/importer/import.go (1)
1104-1122: Edge case: nil vs empty slice semantics.The function treats
columnNames == nil(line 1108) specially by returning columns as-is, but an empty slice[]string{}would fail the length check on line 1105-1107 ifcolsis non-empty. This appears intentional based on caller usage patterns, but consider adding a brief comment explaining the nil vs empty slice distinction.📝 Suggested documentation
func reorderColumnsByNames(cols []*table.Column, columnNames []string) ([]*table.Column, error) { if len(cols) != len(columnNames) { return nil, exeerrors.ErrColumnsNotMatched } + // nil columnNames means no reordering needed (preserve original order) if columnNames == nil { return cols, nil }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/executor/importer/import.go` around lines 1104 - 1122, The function reorderColumnsByNames currently treats columnNames == nil as "no reordering" but treats an empty slice as a length-mismatch error (ErrColumnsNotMatched); add a concise comment above reorderColumnsByNames explaining this nil-vs-empty-slice semantic (mention cols, columnNames and ErrColumnsNotMatched) so future readers understand that nil means "leave cols as-is" while an empty slice is validated against cols length and triggers the error.pkg/importsdk/file_scanner.go (1)
296-339: Consider adding a log for fallback estimation.When
sampledSize.SourceSize <= 0 || sampledSize.TotalKVSize() <= 0(line 335), the method falls back to usingtblMeta.TotalSizedirectly as the TiKV size estimate. This fallback may produce inaccurate estimates (source size ≠ TiKV size). Consider logging a warning to help users understand when estimates may be less accurate.💡 Suggested improvement
if sampledSize.SourceSize <= 0 || sampledSize.TotalKVSize() <= 0 { + s.logger.Warn("sampling returned non-positive sizes, falling back to source size as TiKV estimate", + zap.String("table", tblMeta.Name), + zap.Int64("sourceSize", sampledSize.SourceSize), + zap.Int64("totalKVSize", sampledSize.TotalKVSize())) return tblMeta.TotalSize, nil }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/importsdk/file_scanner.go` around lines 296 - 339, In estimateOneTableSize, when the sampledSize fallback is used (condition sampledSize.SourceSize <= 0 || sampledSize.TotalKVSize() <= 0), add a warning log to indicate the fallback and include identifying details (tblMeta.DB, tblMeta.Name), the tblMeta.TotalSize used, and the sampledSize values so users know the estimate may be inaccurate; locate this in function estimateOneTableSize and use the existing logger (s.logger / s.logger.Logger) to emit a clear Warn/Warnf message immediately before returning tblMeta.TotalSize.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@pkg/executor/importer/import.go`:
- Around line 1104-1122: The function reorderColumnsByNames currently treats
columnNames == nil as "no reordering" but treats an empty slice as a
length-mismatch error (ErrColumnsNotMatched); add a concise comment above
reorderColumnsByNames explaining this nil-vs-empty-slice semantic (mention cols,
columnNames and ErrColumnsNotMatched) so future readers understand that nil
means "leave cols as-is" while an empty slice is validated against cols length
and triggers the error.
In `@pkg/importsdk/file_scanner.go`:
- Around line 296-339: In estimateOneTableSize, when the sampledSize fallback is
used (condition sampledSize.SourceSize <= 0 || sampledSize.TotalKVSize() <= 0),
add a warning log to indicate the fallback and include identifying details
(tblMeta.DB, tblMeta.Name), the tblMeta.TotalSize used, and the sampledSize
values so users know the estimate may be inaccurate; locate this in function
estimateOneTableSize and use the existing logger (s.logger / s.logger.Logger) to
emit a clear Warn/Warnf message immediately before returning tblMeta.TotalSize.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: fe5118e1-feeb-4805-b6cd-13cf6f800bda
📒 Files selected for processing (5)
pkg/executor/importer/import.gopkg/executor/importer/kv_encode.gopkg/executor/importer/sampler.gopkg/executor/importer/sampler_test.gopkg/importsdk/file_scanner.go
|
🔍 Starting code review for this PR... |
|
🔍 New commits detected — starting re-review... |
1 similar comment
|
🔍 New commits detected — starting re-review... |
|
/retest |
| return 0, errors.Trace(err) | ||
| } | ||
| if sampledSize.SourceSize <= 0 || sampledSize.TotalKVSize() <= 0 { | ||
| return tblMeta.TotalSize, nil |
There was a problem hiding this comment.
Should this return 0 when sampling succeeds but there are no data rows?
| err = scanner.CreateSchemaAndTableByName(ctx, "db1", "nonexistent") | ||
| require.Error(t, err) | ||
| }) | ||
|
|
There was a problem hiding this comment.
Should we add some CSV-based tests here as well?
ingress-bot
left a comment
There was a problem hiding this comment.
This review was generated by AI and should be verified by a human reviewer.
Manual follow-up is recommended before merge.
Summary
- Total findings: 6
- Inline comments: 6
- Summary-only findings (no inline anchor): 0
Findings (highest risk first)
⚠️ [Major] (1)
- Production estimation path depends on test-only utilities (pkg/importsdk/file_scanner.go:397, pkg/util/mock/fortest.go:22)
🟡 [Minor] (5)
- Dead method
reorderColumnsleft after refactoring (pkg/executor/importer/import.go:1172) sampleIndexSizeRatiodiscards valid partial results on sampling error (pkg/executor/importer/sampler.go:130)- Bare boolean literals at
generateCSVConfigcall site obscure parameter meaning (pkg/executor/importer/sampler.go:217) - Parser (and underlying reader) leaked when handleSkipNRows or SetPos fails after parser creation (pkg/executor/importer/sampler.go:250)
- FileScanner and SDK interface expansion breaks external implementors (pkg/importsdk/file_scanner.go:46, pkg/importsdk/sdk.go:23)
| } | ||
| for _, dbMeta := range dbMetas { | ||
| for _, tblMeta := range dbMeta.Tables { | ||
| singleReplicaSize, err := s.estimateOneTableSize(ctx, tblMeta) |
There was a problem hiding this comment.
AI-assisted review: EstimateImportDataSize currently fails the whole request on the first bad table, while GetTableMetas already honors skipInvalidFiles and can return partial results. Is that difference intentional?
It seems a bit surprising from the SDK caller's perspective: with skipInvalidFiles=true, metadata discovery can still succeed for the valid tables, but size estimation becomes all-or-nothing. If that behavior is intended, it may be worth documenting explicitly; otherwise, this path may want to follow the same partial-result behavior.
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (1)
pkg/executor/importer/sampler.go (1)
129-136:⚠️ Potential issue | 🟠 MajorKeep usable sample data when sampling hits an error.
This path still throws valid sample data away in two places:
sampleIndexSizeRatioreturns0on any error, andsampleOneFilezeroes out bytes/KVs already collected before a late read/encode/add failure.CalResourceParamscan then proceed withindexSizeRatio == 0even though the sample already contains enough data to produce a usable estimate.💡 Suggested fix
func (e *LoadDataController) sampleIndexSizeRatio( ctx context.Context, ksCodec []byte, ) (float64, error) { result, err := e.sampleKVSize(ctx, ksCodec) - if err != nil { - return 0, err - } - if result.DataKVSize == 0 { - return 0, nil + if result == nil || result.DataKVSize == 0 { + return 0, err } - return float64(result.IndexKVSize) / float64(result.DataKVSize), nil + return float64(result.IndexKVSize) / float64(result.DataKVSize), err }var ( count int readRowCache []types.Datum readFn = parserEncodeReader(parser, chunk.Chunk.EndOffset, chunk.GetKey()) kvBatch = newEncodedKVGroupBatch(ksCodec, maxRowCount) ) + finalize := func(retErr error) (int64, uint64, uint64, error) { + dataKVSize, indexKVSize := kvBatch.groupChecksum.DataAndIndexSumSize() + return sourceSize, dataKVSize, indexKVSize, retErr + } for count < maxRowCount { row, closed, readErr := readFn(ctx, readRowCache) if readErr != nil { - return 0, 0, 0, readErr + return finalize(readErr) } if closed { break } @@ kvs, encodeErr := encoder.Encode(row.row, row.rowID) row.resetFn() if encodeErr != nil { - return 0, 0, 0, common.ErrEncodeKV.Wrap(encodeErr).GenWithStackByArgs(chunk.GetKey(), row.startPos) + return finalize(common.ErrEncodeKV.Wrap(encodeErr).GenWithStackByArgs(chunk.GetKey(), row.startPos)) } if _, err = kvBatch.add(kvs); err != nil { - return 0, 0, 0, err + return finalize(err) } count++ } - dataKVSize, indexKVSize = kvBatch.groupChecksum.DataAndIndexSumSize() - return sourceSize, dataKVSize, indexKVSize, nil + return finalize(nil) }Also applies to: 354-383
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/executor/importer/sampler.go` around lines 129 - 136, sample data is being discarded on errors: in sampleIndexSizeRatio and sampleOneFile you should preserve and use any already-collected sample totals instead of returning 0 or zeroing totals when a later error occurs; update sampleIndexSizeRatio to check the returned result (e.g., result.DataKVSize) and if it contains usable data compute and return the ratio (IndexKVSize/DataKVSize) even if an error occurred while finishing sampling, and modify sampleOneFile so that on a late read/encode/add failure you do not reset the aggregated totals (bytes/KVs) — only discard or reset the temporary batch variables for that file/operation so the overall sample totals accumulated in e.sample... remain intact for CalResourceParams to consume.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/executor/importer/import.go`:
- Around line 1717-1722: The switch default currently sets
exeerrors.ErrLoadDataUnsupportedFormat into err but the common post-switch
wrapper always remaps any err to exeerrors.ErrLoadDataWrongFormatConfig; update
the post-switch handling in import.go so that if err is
exeerrors.ErrLoadDataUnsupportedFormat (use errors.Is or direct comparison) you
return that error directly, otherwise keep the existing wrapping into
exeerrors.ErrLoadDataWrongFormatConfig; reference the local variable err and the
two error symbols exeerrors.ErrLoadDataUnsupportedFormat and
exeerrors.ErrLoadDataWrongFormatConfig to locate and implement the conditional
return.
---
Duplicate comments:
In `@pkg/executor/importer/sampler.go`:
- Around line 129-136: sample data is being discarded on errors: in
sampleIndexSizeRatio and sampleOneFile you should preserve and use any
already-collected sample totals instead of returning 0 or zeroing totals when a
later error occurs; update sampleIndexSizeRatio to check the returned result
(e.g., result.DataKVSize) and if it contains usable data compute and return the
ratio (IndexKVSize/DataKVSize) even if an error occurred while finishing
sampling, and modify sampleOneFile so that on a late read/encode/add failure you
do not reset the aggregated totals (bytes/KVs) — only discard or reset the
temporary batch variables for that file/operation so the overall sample totals
accumulated in e.sample... remain intact for CalResourceParams to consume.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 18a0f6a1-3464-4e96-b23b-547b35233303
📒 Files selected for processing (8)
pkg/executor/importer/BUILD.bazelpkg/executor/importer/import.gopkg/executor/importer/sampler.gopkg/executor/importer/sampler_test.gopkg/executor/importer/table_import.gopkg/importsdk/BUILD.bazelpkg/importsdk/file_scanner.gopkg/importsdk/file_scanner_test.go
✅ Files skipped from review due to trivial changes (1)
- pkg/importsdk/BUILD.bazel
🚧 Files skipped from review as they are similar to previous changes (3)
- pkg/executor/importer/sampler_test.go
- pkg/importsdk/file_scanner_test.go
- pkg/importsdk/file_scanner.go
[LGTM Timeline notifier]Timeline:
|
|
/retest |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: Benjamin2037, joechenrh, OliverS929 The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
/retest |
What problem does this PR solve?
Issue Number: close #67240
Problem Summary:
Premium needs the final import size estimate before an import starts, so it can expand disk in advance.
What changed and how does it work?
EstimateImportDataSizeto import SDK.Check List
Tests
Side effects
Documentation
Release note
Please refer to Release Notes Language Style Guide to write a quality release note.
Summary by CodeRabbit
New Features
Improvements
Tests
Chores