-
Notifications
You must be signed in to change notification settings - Fork 3k
Spark 3.4 : Use correct statistics file in SparkScan::estimateStatistics(Snapshot) #12647
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
This is a clean backport of my fix. |
|
Yes thanks @wypoon for this fix, and for reviewing it. |
| if (!files.isEmpty()) { | ||
| List<BlobMetadata> metadataList = (files.get(0)).blobMetadata(); | ||
| Optional<StatisticsFile> file = | ||
| files.stream().filter(f -> f.snapshotId() == snapshot.snapshotId()).findFirst(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we have multiple files with the same snapshotId? Or findAny is enough?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As per java implementation, it is always a single stats file per snapshot id.
| statisticsFiles.put(statisticsFile.snapshotId(), ImmutableList.of(statisticsFile)); |
I followed similar design while working on partition stats too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So we could have a theoretical performance boost from findAny
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The spec doesn't actually say that there should only be one statistics file per snapshot. This happens to be how it is implemented in Java. The spec simply allows for multiple statistics files.
I was thinking about the problem of tracking orphaned statistics files when they are recomputed. One idea I had was to keep replaced statistics files (for a snapshot) still in the list (as long as the files are tracked in metadata we can clean up unused ones), but to keep the newest one before others. Hence findFirst. It was just an idea (and honestly not one I'm seriously considering).
In any case, I do not think that findAny is faster than findFirst here.
|
Why is this PR against Spark 3.4? |
|
Thanks @ajantha-bhat for catching that this is a backport! @jeesou: Please when backporting tell it in the PR description. Also highlight any changes needed compared to the original PR, so the reviewers could have easier time. |
|
@pvary, @ajantha-bhat I already reviewed this a day before you and if you read my comment, I mentioned that this is a clean backport of my fix. @jeesou as Peter stated, you should state in the description that this is a backport of #12482 (and that I am the author of the fix). I implied it but did not state it explicitly. |
No description provided.