br: fix lightning split large csv file error and adjust s3 seek result#27769
Conversation
|
[REVIEW NOTIFICATION] This pull request has been approved by:
To complete the pull request process, please ask the reviewers in the list to review by filling The full list of commands accepted by this bot can be found here. DetailsReviewer can indicate their review by submitting an approval review. |
| } | ||
|
|
||
| r.reader = eofReader{} | ||
| return r.rangeInfo.Size, nil |
There was a problem hiding this comment.
r.pos need to be updated in case subsequent seeks. Also, should we return realOffset rather than r.rangeInfo.Size? I didn't find any specifications for this behavior, but returning realOffset is more consistent with linux filesystem.
There was a problem hiding this comment.
Fixed. I think we should return real position which is r.rangeInfo.Size, since the behavior isn't the same as linux fs. fs implement ReaderWriter and allow to write to a position that is larger that the current end and leave the file with a hole. But here we don't. Though there is no much difference.
| } | ||
|
|
||
| // eofReader is a io.ReaderClose Reader that always return io.EOF | ||
| type eofReader struct{} |
There was a problem hiding this comment.
(optional io.NopCloser(bytes.NewReader(nil))
| // If a csv file is overlarge, we need to split it into multiple regions. | ||
| // Note: We can only split a csv file whose format is strict. | ||
| if isCsvFile && dataFileSize > int64(cfg.Mydumper.MaxRegionSize) && cfg.Mydumper.StrictFormat { | ||
| if isCsvFile && cfg.Mydumper.StrictFormat && dataFileSize > int64(cfg.Mydumper.MaxRegionSize)*11/10 { |
There was a problem hiding this comment.
What happens if the file size if slightly bigger the int64(cfg.Mydumper.MaxRegionSize)*11/10? 🤣
There was a problem hiding this comment.
Then the file will be split by cfg.Mydumper.MaxRegionSize, so the second chunk size is about 1/10 * cfg.Mydumper.MaxRegionSize.
There was a problem hiding this comment.
If the file size is slightly bigger the int64(cfg.Mydumper.MaxRegionSize)* 2, the third chunk size is very small, will this be a problem?
There was a problem hiding this comment.
Not a big problem. The common case is that the data export tool (like dumpling or mydumper) set the exported file size with cfg.Mydumper.MaxRegionSize, but the output file size might be slightly bigger or smaller, so we can avoid split a lot of small chunks.
There was a problem hiding this comment.
make that 11/10 a named constant...
There was a problem hiding this comment.
It will make the code a bit ugly because 11/10 is a float. I add a code comment to explain why the threshold need to be increased😅
There was a problem hiding this comment.
you can make "10" a constant and set the upper limit to MaxRegionSize + MaxRegionSize/10.
|
/run-check_dev_2 |
1 similar comment
|
/run-check_dev_2 |
|
/merge |
|
This pull request has been accepted and is ready to merge. DetailsCommit hash: 69a37a3 |
|
/run-check_dev_2 |
|
/remove-label needs-cherry-pick-5.2 |
|
/label needs-cherry-pick-5.2 |
Signed-off-by: ti-srebot <ti-srebot@pingcap.com>
|
cherry pick to release-5.2 in PR #33883 |
|
/remove-label needs-cherry-pick-5.2 |
What problem does this PR solve?
Fix the bug that lightning split large csv file may failed if the file size if slightly bigger the
region-split-sizeand the csv has header.close #27763
What is changed and how it works?
io.EOFmakeTableRegions, after preprocess the header line, if the remain file is smaller thanregion-split-size, directly return the result with 1 region.Check List
Tests
Side effects
Documentation
Release note