
Conversation

@jeesou
Contributor

@jeesou jeesou commented Mar 26, 2025

No description provided.

@github-actions github-actions bot added the spark label Mar 26, 2025
@jeesou
Contributor Author

jeesou commented Mar 26, 2025

This is a backport of the changes from #12482.

@wypoon @findepi

Kindly help review and merge

@jeesou jeesou changed the title Backporting Correct Stats file fetch fix to Spark 3.4 Spark : Backporting Correct Stats file fetch fix to Spark 3.4 Mar 26, 2025
@wypoon
Contributor

wypoon commented Mar 26, 2025

This is a clean backport of my fix.
@jeesou can you please use the same title as the original fix, prefixed by "Spark 3.4" --
"Spark 3.4: Use correct statistics file in SparkScan::estimateStatistics(Snapshot)".
Thanks.
(Btw, it is customary to acknowledge the author of the original fix if you're doing a backport.)

@jeesou jeesou changed the title Spark : Backporting Correct Stats file fetch fix to Spark 3.4 Spark 3.4 : Use correct statistics file in SparkScan::estimateStatistics(Snapshot) Mar 27, 2025
@jeesou
Contributor Author

jeesou commented Mar 27, 2025

Yes, thanks @wypoon for this fix, and for reviewing it.

if (!files.isEmpty()) {
  List<BlobMetadata> metadataList = (files.get(0)).blobMetadata();
  Optional<StatisticsFile> file =
      files.stream().filter(f -> f.snapshotId() == snapshot.snapshotId()).findFirst();
Contributor

Could we have multiple files with the same snapshotId? Or is findAny enough?

Member

As per the Java implementation, there is always a single stats file per snapshot id.

statisticsFiles.put(statisticsFile.snapshotId(), ImmutableList.of(statisticsFile));

I followed a similar design while working on partition stats too.
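
For context, a minimal sketch of that bookkeeping, assuming a registry map keyed by snapshot id like the quoted put() call; the class and field names here are illustrative, not the actual TableMetadata code:

import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.apache.iceberg.StatisticsFile;

// Illustrative sketch: setting statistics for a snapshot replaces whatever
// file was previously registered for that snapshot id, so readers only ever
// see a single statistics file per snapshot.
class StatisticsRegistry {
  private final Map<Long, List<StatisticsFile>> statisticsFiles = new HashMap<>();

  void setStatistics(StatisticsFile statisticsFile) {
    statisticsFiles.put(statisticsFile.snapshotId(), List.of(statisticsFile));
  }

  List<StatisticsFile> statisticsFiles(long snapshotId) {
    return statisticsFiles.getOrDefault(snapshotId, List.of());
  }
}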

Contributor

So we could have a theoretical performance boost from findAny

Contributor

The spec doesn't actually say that there should only be one statistics file per snapshot. This happens to be how it is implemented in Java. The spec simply allows for multiple statistics files.
I was thinking about the problem of tracking orphaned statistics files when they are recomputed. One idea I had was to keep replaced statistics files (for a snapshot) still in the list (as long as the files are tracked in metadata we can clean up unused ones), but to keep the newest one before others. Hence findFirst. It was just an idea (and honestly not one I'm seriously considering).
In any case, I do not think that findAny is faster than findFirst here.
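
To make the discussed lookup concrete, here is a minimal sketch of the fixed selection, assuming a list of the table's statistics files and the target snapshot; the helper name is illustrative and this is not the exact SparkScan code:

import java.util.Collections;
import java.util.List;
import java.util.Optional;
import org.apache.iceberg.BlobMetadata;
import org.apache.iceberg.Snapshot;
import org.apache.iceberg.StatisticsFile;

class StatisticsLookup {
  // Before the fix, the code effectively did files.get(0).blobMetadata(),
  // which could return blob metadata written for a different snapshot.
  // The fix filters by the snapshot id first.
  static List<BlobMetadata> blobMetadataFor(List<StatisticsFile> files, Snapshot snapshot) {
    Optional<StatisticsFile> statsFile =
        files.stream()
            .filter(f -> f.snapshotId() == snapshot.snapshotId())
            .findFirst();
    return statsFile.map(StatisticsFile::blobMetadata).orElse(Collections.emptyList());
  }
}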

@pvary
Contributor

pvary commented Mar 27, 2025

Why is this PR against Spark 3.4?
In Flink we usually create PRs against the current version.

@ajantha-bhat
Member

ajantha-bhat commented Mar 27, 2025

@pvary: This PR looks like just a backport of #12482.

He has not updated the PR description properly (looks like a new contributor).

@pvary pvary merged commit d54d81e into apache:main Mar 27, 2025
27 checks passed
@pvary
Contributor

pvary commented Mar 27, 2025

Thanks @ajantha-bhat for catching that this is a backport!

@jeesou: Please, when backporting, say so in the PR description. Also highlight any changes needed compared to the original PR, so the reviewers have an easier time.

@wypoon
Contributor

wypoon commented Mar 27, 2025

@pvary, @ajantha-bhat I already reviewed this a day before you, and if you read my comment, I mentioned that this is a clean backport of my fix.

@jeesou as Peter said, you should state in the description that this is a backport of #12482 (and that I am the author of the fix). I implied it but did not state it explicitly.
