Use Snapshot's statistics file in SparkScan #11040

karuppayya · 2024-08-29T05:15:05Z

Use the statistics of the snapshot being scanned, instead of the first statistics file.

@huaxingao @RussellSpitzer @aokolnychyi Please help review
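For readers skimming the thread, the difference between "the first statistics file" and "the statistics file of the snapshot being scanned" can be sketched as below. This is only an illustration: `StatsFile` is a hypothetical stand-in for Iceberg's `StatisticsFile`, and the class and method names are not part of the PR.

```java
import java.util.List;
import java.util.Optional;

final class SnapshotStatsSketch {
  // Hypothetical stand-in for a statistics file entry; not the Iceberg StatisticsFile interface.
  record StatsFile(long snapshotId, String path) {}

  // Behavior being replaced, as described above: take whichever file is listed first.
  static Optional<StatsFile> firstFile(List<StatsFile> files) {
    return files.stream().findFirst();
  }

  // Behavior proposed here: only accept a file written for the snapshot being scanned.
  static Optional<StatsFile> forSnapshot(List<StatsFile> files, long snapshotId) {
    return files.stream().filter(f -> f.snapshotId() == snapshotId).findFirst();
  }
}
```

With files for snapshots 1 and 2, a scan of snapshot 2 gets the snapshot 1 file from `firstFile` but the snapshot 2 file from `forSnapshot`; a scan of a snapshot without statistics gets an empty result from `forSnapshot`.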

amogh-jahagirdar · 2024-08-30T03:52:56Z

spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/source/SparkScan.java

+    long snapshotId = snapshot.snapshotId();
+    return table.statisticsFiles().stream()
+        .filter(statisticsFile -> statisticsFile.snapshotId() == snapshotId)
+        .findFirst();


Nit: I feel like this could just be inlined above instead of having a separate helper method, but I'm not super opinionated.

amogh-jahagirdar · 2024-08-30T04:03:15Z

spark/v3.5/spark/src/test/java/org/apache/iceberg/spark/source/TestSparkScan.java

  }

+  @TestTemplate
+  public void testMultipleSnapshotsWithColStats() throws NoSuchTableException {


I think we're missing a test for the case where a statistics file for the snapshot couldn't be found? Let me know if I just missed it.

+1 agree with it.

The added test case covers it here:

spark/v3.5/spark/src/test/java/org/apache/iceberg/spark/source/TestSparkScan.java

huaxingao

LGTM

karuppayya · 2024-09-06T17:21:06Z

@amogh-jahagirdar Can you please take a look?

jeesou · 2024-09-12T05:33:39Z

LGTM
I have picked up the PR changes and tested them out; it's working fine.

amogh-jahagirdar

Sorry for the late review, please see my comment @karuppayya on why I think a table API for resolving a statistics file for a snapshot makes sense. Let me know what you think!

amogh-jahagirdar · 2024-09-20T19:34:11Z

spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/source/SparkScan.java

+      Optional<StatisticsFile> statisticsFile = statisticsFile(snapshot);
+      if (statisticsFile.isPresent()) {
+        List<BlobMetadata> metadataList = statisticsFile.get().blobMetadata();


@karuppayya Sorry for missing this earlier, I think we may want to consider a table API for resolving a statistics file based on a snapshot, statisticsFileFor. The implementation of that API could just do a best effort search of the statistics file for a given snapshot, and if one cannot be found just return the most recent one.

If an engine integration needs the exact statistics and the API response isn't it, that's OK since the engine can then just ignore the statistics file. But I think in the most common cases, having an out-of-date statistics file is probably acceptable, and so the API should probably default to the best-effort lookup.

This is analogous to what happens in the view.dialectFor API, where a best-effort match for a given dialect is searched for, but if one cannot be found the first representation is returned. Engines like Trino which require the strict dialect can compare the API response against the desired dialect and fail accordingly. Other engines like Spark don't do the strict lookup and just take the response as is.
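A minimal sketch of the best-effort lookup proposed above, together with the strict check an engine could layer on top. It is not the PR's implementation: `StatsFile` is a hypothetical stand-in for `StatisticsFile`, `strict` is an invented helper name, and "most recent" is taken to mean the last entry of the list, which assumes the list is ordered oldest to newest.

```java
import java.util.List;
import java.util.Optional;

final class BestEffortStatsSketch {
  // Hypothetical stand-in for a statistics file entry; not the Iceberg StatisticsFile interface.
  record StatsFile(long snapshotId, String path) {}

  // Best-effort lookup: prefer an exact snapshot match, otherwise fall back to the
  // last listed file. Assumes (for this sketch only) the list is ordered oldest to newest.
  static Optional<StatsFile> statisticsFileFor(List<StatsFile> files, long snapshotId) {
    Optional<StatsFile> exact =
        files.stream().filter(f -> f.snapshotId() == snapshotId).findFirst();
    if (exact.isPresent() || files.isEmpty()) {
      return exact;
    }
    return Optional.of(files.get(files.size() - 1));
  }

  // A strict caller keeps the result only if it was written for the requested snapshot.
  static Optional<StatsFile> strict(List<StatsFile> files, long snapshotId) {
    return statisticsFileFor(files, snapshotId).filter(f -> f.snapshotId() == snapshotId);
  }
}
```

The lenient caller uses `statisticsFileFor` directly and accepts possibly stale statistics; the strict caller discards anything that does not match, which mirrors the Trino versus Spark split described for dialects.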

+1 to introduce the table API for retrieving the stats.
But should we do a best effort here, or just return empty when there aren't stats for the snapshot?
We don't have a means to compare against a baseline to figure out whether it's an approximation, unlike dialects, where the result can be validated.

Hi @karuppayya, I saw the latest changes, but they still take the latest snapshot ID and filter on it. This means that if the Analyze procedure has not been executed for the latest snapshot, no stats file will be found. Hence it is not doing a best-effort search of the statistics file for a given snapshot, right?

Instead, it should pick the last existing statistics file, so that we get at least some benefit out of it in query planning. Could you please help me understand the current behavior?

@jeesou Yes, the current code doesn't return any stats unless they exist for the snapshot being scanned.
I think returning best-effort stats can result in bad decisions by the optimizer, depending on when the stats were computed.
We can introduce a config to let users decide if they are fine with a best-effort search. This way the user is also aware of it, instead of it happening transparently. WDYT? @jeesou @amogh-jahagirdar

Yes @karuppayya, making it config-based seems like a better idea, as it gives the user more control.

karuppayya · 2024-09-30T22:59:03Z

@amogh-jahagirdar I have incorporated the feedback. Can you please take a look. Thank you

github-actions · 2024-11-12T00:15:02Z

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.

jeesou · 2024-11-14T19:29:13Z

Hi @karuppayya , @amogh-jahagirdar as per our discussion to introduce a config to let users decide if they are fine with best effort search, I was thinking of adding a kind of threshold that the user can decide, as per the amount of data change.

I have written some code as example, the diff can be seen here - https://github.com/karuppayya/iceberg/compare/fix_snapshot...jeesou:fix_snapshot_modifications?expand=1

I created the config as OLD_STATISTICS_USAGE_THRESHOLD_PERCENTAGE.

Basically, it finds the last snapshot for which statistics are present, and the amount of data changed since then.

Currently, if any deletion has happened I do not use the old statistics, since deletions can be unpredictable and this needs fine-tuning. For other operations, I record the amount of data changed and check whether the change is within the specified threshold. The default value is 100, which means that by default it will never use the old existing stats.

Kindly check once and please do suggest improvements.
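The approach described in this comment could look roughly like the following sketch. It is not the code from the linked diff: `SnapshotInfo`, `reusableStatsSnapshot`, and `maxChangePercent` are hypothetical names, record counts stand in for "amount of data changed", and a larger value tolerating more change is an assumption of this sketch, since the thread does not spell out how the linked diff compares the configured value.

```java
import java.util.List;
import java.util.Optional;
import java.util.Set;

final class StaleStatsThresholdSketch {
  // Hypothetical, simplified per-snapshot summary; not an Iceberg type.
  record SnapshotInfo(long id, long addedRecords, long deletedRecords, long totalRecords) {}

  // history is ordered oldest to newest and ends with the snapshot being scanned.
  // Returns the id of the snapshot whose statistics may be reused, if any.
  static Optional<Long> reusableStatsSnapshot(
      List<SnapshotInfo> history, Set<Long> snapshotsWithStats, double maxChangePercent) {
    long added = 0;
    for (int i = history.size() - 1; i >= 0; i--) {
      SnapshotInfo snapshot = history.get(i);
      if (snapshotsWithStats.contains(snapshot.id())) {
        if (added == 0) {
          return Optional.of(snapshot.id()); // nothing changed since the stats were computed
        }
        // Records added since the stats snapshot, relative to its size at that time.
        double changePercent = 100.0 * added / snapshot.totalRecords();
        return changePercent <= maxChangePercent ? Optional.of(snapshot.id()) : Optional.empty();
      }
      if (snapshot.deletedRecords() > 0) {
        return Optional.empty(); // any deletion after the stats snapshot disables reuse
      }
      added += snapshot.addedRecords();
    }
    return Optional.empty(); // no snapshot in the history has statistics
  }
}
```

For example, if statistics exist for a 100-record snapshot and two later appends add 15 records in total, the change is 15%, so the old statistics are reused with a limit of 20 but not with a limit of 10.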

jeesou · 2024-11-27T05:32:53Z

Hi @karuppayya , @amogh-jahagirdar kindly check the comment above.

amogh-jahagirdar · 2024-11-27T14:17:57Z

Sorry for the delay @jeesou @karuppayya , this is on my list today for review

amogh-jahagirdar · 2024-11-28T00:07:36Z

api/src/main/java/org/apache/iceberg/Table.java

+   *
+   * @return the {@link StatisticsFile} for the given snapshot id, if available.
+   */
+  default Optional<StatisticsFile> statistics(long snapshotId) {


Thanks for adding this. I think this API is a step in the right direction, but I wonder whether it takes enough information to resolve the right file. The current implementation of this API implicitly assumes that a single Puffin file will have all the blobs needed. I think it's possible that there could be multiple Puffin files for the same snapshot: for example, one Puffin file has blob type NDV and another has blob type SomeFutureIndex, and both are for snapshot 1.

In the current implementation, if someone wants the Puffin file that contains SomeFutureIndex but we happen to return the other one because of the generic "find me a Puffin file for the given snapshot" logic, I feel like the API isn't really doing what it needs to.

Here's the signature I'm thinking at the moment:

default StatisticsFile statisticsFor(long snapshotId, String blobType)

This will attempt to find the statistics file produced at the given snapshot which contains the blob type. I think this future-proofs the API a bit more and makes it more useful when a user only cares about a particular blob type, which may or may not exist in some other Puffin file.

I also think that in case a statistics file for exactly that snapshot couldn't be found, we should return the latest statistics rather than just return nothing. While it's possible that the statistics are not completely accurate in this approach, I think in the average case data distributions wouldn't change so drastically between snapshots that the statistics would work horribly against the query. It's probably better in the average case to have some statistics. If there's no statistics file containing the desired blob, then I think we'd just return null.

WDYT?
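One way to read the proposed `statisticsFor(long snapshotId, String blobType)` contract is sketched below. This is an illustration, not the PR's code: `StatsFile` is a hypothetical stand-in that exposes only a snapshot ID and the blob types in the Puffin file, the list is passed in explicitly instead of coming from the table, and "latest" is assumed to mean the last matching entry in a list ordered oldest to newest.

```java
import java.util.List;

final class BlobTypeStatsSketch {
  // Hypothetical stand-in; blobTypes lists the blob types stored in the Puffin file.
  record StatsFile(long snapshotId, List<String> blobTypes) {}

  // Returns the file for snapshotId that contains blobType; otherwise the last listed
  // file containing blobType; otherwise null.
  static StatsFile statisticsFor(List<StatsFile> files, long snapshotId, String blobType) {
    StatsFile latestWithBlob = null;
    for (StatsFile file : files) {
      if (!file.blobTypes().contains(blobType)) {
        continue; // a Puffin file without the requested blob type is never a candidate
      }
      if (file.snapshotId() == snapshotId) {
        return file; // exact snapshot match wins
      }
      latestWithBlob = file;
    }
    return latestWithBlob;
  }
}
```

With two Puffin files for snapshot 1, one holding an NDV blob and one holding a different blob type, asking for the second blob type returns the second file rather than whichever file for snapshot 1 happens to come first.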

Yes @amogh-jahagirdar, your suggestion is perfect, considering a generic solution where we support multiple blob types. The current implementation assumes that we will only support "apache-datasketches-theta-v1".
We recently faced this when dealing with Presto: both engines were using a common catalog, and the Puffin file created by Presto was not usable because it had a different blob type, "presto-sum-data-size-bytes-v1". This would be more of a forward-looking change, which we may take up.

Regarding the best-effort search of stats, @amogh-jahagirdar, I think we need to reconsider whether we always want to have some statistics, because that would depend on the amount of data added or deleted since the last time we ran Analyze; stale statistics could lead to wrong query plans. What if we let the user configure how much deviation or change they are fine with when continuing to use the older statistics? For this, I had made some changes so that the user can decide the amount of change: https://github.com/karuppayya/iceberg/compare/fix_snapshot...jeesou:fix_snapshot_modifications?expand=1

Kindly have a look at it, @amogh-jahagirdar and @karuppayya, and please share your suggestions.
I have not handled the delete scenario: if I find any deletion happening, I do not use the old stats. That is open to discussion, as deletes are a tricky subject in this case.

Hi @karuppayya, @amogh-jahagirdar, kindly review the change once and suggest any edits if required.

Hi @karuppayya @amogh-jahagirdar @huaxingao, kindly give this a look and share suggestions on the approach mentioned above.

amogh-jahagirdar · 2024-11-28T00:09:18Z

core/src/main/java/org/apache/iceberg/BaseTable.java


+  @Override
+  public Optional<StatisticsFile> statistics(long snapshotId) {
+    return statisticsFiles().stream().filter(file -> file.snapshotId() == snapshotId).findFirst();


See the comment above on a different signature that incorporates the blob type a user is looking for, and on how I think this API should be more best-effort (return the latest statistics file that we can find) rather than returning Optional.empty.

jeesou · 2024-12-10T05:31:56Z

Hi @karuppayya @amogh-jahagirdar could you please have a look at the PR.

jeesou · 2025-01-07T04:53:29Z

Hi @karuppayya @amogh-jahagirdar friendly reminder, please check the comments once.

github-actions · 2025-03-24T00:16:49Z

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.

jeesou · 2025-03-26T09:01:02Z

Hi @karuppayya, is this PR still needed now that #12482 is already in?

That seems to solve the problem, right?

github-actions · 2025-04-26T00:16:16Z

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.

github-actions · 2025-05-03T00:16:27Z

This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

github-actions bot added the spark label Aug 29, 2024

amogh-jahagirdar reviewed Aug 30, 2024

huaxingao reviewed Aug 30, 2024

karuppayya requested review from amogh-jahagirdar and huaxingao August 30, 2024 17:13

huaxingao reviewed Aug 31, 2024

karuppayya force-pushed the fix_snapshot branch from 4567124 to cda422c Compare September 5, 2024 05:10

karuppayya requested a review from huaxingao September 5, 2024 05:13

huaxingao approved these changes Sep 5, 2024

jeesou approved these changes Sep 12, 2024

amogh-jahagirdar reviewed Sep 20, 2024

krajendran4 added 4 commits September 27, 2024 14:55

Use stats from the right snapshot

02ddd74

Address review comments

b51697f

Address review comment

2f57477

Add table API fo stats

c8f664f

karuppayya force-pushed the fix_snapshot branch from cda422c to c8f664f Compare September 27, 2024 23:16

github-actions bot added API core labels Sep 27, 2024

karuppayya requested a review from amogh-jahagirdar September 27, 2024 23:16

github-actions bot added the stale label Nov 12, 2024

github-actions bot removed the stale label Nov 15, 2024

amogh-jahagirdar reviewed Nov 28, 2024

github-actions bot added the stale label Mar 24, 2025

github-actions bot removed the stale label Mar 27, 2025

github-actions bot added the stale label Apr 26, 2025

github-actions bot closed this May 3, 2025
