-
Notifications
You must be signed in to change notification settings - Fork 3k
Use Snapshot's statistics file in SparkScan #11040
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| long snapshotId = snapshot.snapshotId(); | ||
| return table.statisticsFiles().stream() | ||
| .filter(statisticsFile -> statisticsFile.snapshotId() == snapshotId) | ||
| .findFirst(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: I feel like this could just be inlined above instead of having a separate helper method, but not super opinonated.
| } | ||
|
|
||
| @TestTemplate | ||
| public void testMultipleSnapshotsWithColStats() throws NoSuchTableException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we're missing a test for the case where a statistics file for the snapshot couldn't be found? Let me know if I just missed it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 agree with it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The testcase add has it here
spark/v3.5/spark/src/test/java/org/apache/iceberg/spark/source/TestSparkScan.java
Outdated
Show resolved
Hide resolved
spark/v3.5/spark/src/test/java/org/apache/iceberg/spark/source/TestSparkScan.java
Outdated
Show resolved
Hide resolved
4567124 to
cda422c
Compare
huaxingao
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
|
@amogh-jahagirdar Can you please take a look? |
|
LGTM |
amogh-jahagirdar
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the late review, please see my comment @karuppayya on why I think a table API for resolving a statistics file for a snapshot makes sense. Let me know what you think!
| Optional<StatisticsFile> statisticsFile = statisticsFile(snapshot); | ||
| if (statisticsFile.isPresent()) { | ||
| List<BlobMetadata> metadataList = statisticsFile.get().blobMetadata(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@karuppayya Sorry for missing this earlier, I think we may want to consider a table API for resolving a statistics file based on a snapshot, statisticsFileFor. The implementation of that API could just do a best effort search of the statistics file for a given snapshot, and if one cannot be found just return the most recent one.
If an engine integration needs the exact statistics and the API response isn't it, that's OK since the engine can then just ignore the statistics file. But i think in the most common cases, having an out of date statistics file is probably acceptable and so the API should probably default to the best effort lookup.
This is analagous to what happens in view.dialectFor API where a best effort for a given dialect is searched but if one cannot be found the first representation is returned. Engines like Trino which require the strict dialect can use the API response and compare against the desired and fail accordingly. Other engines like Spark don't do the strict lookup and just take the response as is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 to introduce the table API for retrieving the stats.
But should we do a best effort here or jsut return empty when there arent stats for the snasphot?
We dont have a means to compare against a baseline to figure if its an approximation, unlike dialects where it could be validated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @karuppayya , i saw the latest changes, but still as per the latest changes it will take the latest Snapshot Id, and it filters over it. Which means that if the Analyze procedure is not executed for the latest snapshot, it won't find the stat file. Hence it is not doing the best effort search of the statistics file for a given snapshot right?
Instead, it should pick the last existing statistics file, so that we may get some benefit out of it at least in the query planning. Could you please help me understand the current behavior.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jeesou Yes, the current code doesnt return any available stats.
I think returning a best effort stats can result in bad decisions by optimizer based on when the stats were computed.
We can introduce a config to let users decide if they are fine with best effort search. This way the user is also aware of it, instead of doing it transparently. WDYT? @jeesou @amogh-jahagirdar
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes @karuppayya , making it config based seems like a better idea, giving the user more control over it.
cda422c to
c8f664f
Compare
|
@amogh-jahagirdar I have incorporated the feedback. Can you please take a look. Thank you |
|
This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions. |
|
Hi @karuppayya , @amogh-jahagirdar as per our discussion to introduce a config to let users decide if they are fine with best effort search, I was thinking of adding a kind of threshold that the user can decide, as per the amount of data change. I have written some code as example, the diff can be seen here - https://github.com/karuppayya/iceberg/compare/fix_snapshot...jeesou:fix_snapshot_modifications?expand=1 i created the config as OLD_STATISTICS_USAGE_THRESHOLD_PERCENTAGE Basically it tries and finds the last Snapshot for which statistics is present, and the amount of data changed in between. Currently if any deletion is happening I am not using the old statistics, as deletion can be unpredictable, and this needs fine-tuning, for other operations, I make a record of amount of data change and check whether the change is within the specified threshold. Default value is 100 which means by default it will never use the old existing stats. kindly check once and please do suggest improvements. |
|
Hi @karuppayya , @amogh-jahagirdar kindly check the comment above. |
|
Sorry for the delay @jeesou @karuppayya , this is on my list today for review |
| * | ||
| * @return the {@link StatisticsFile} for the given snapshot id, if available. | ||
| */ | ||
| default Optional<StatisticsFile> statistics(long snapshotId) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for adding this, I think this API is in the right direction but I wonder if it's taking enough information to be able to resolve the right file. There's an implicit assumption in the current implementation of this API that a Puffin file will have all the blobs needed. I think it's possible that there could multiple puffins, and one puffin has blob type NDV and the other puffin has blob type SomeFutureIndex. both of these puffin files are for snapshot 1.
In the current implementation if someone wants the Puffin where there's SomeFutureIndex, but we happen to return the Puffin due to the generic "Find me a puffin with the given snapshot logic" I feel like the API isn't really doing what it needs.
Here's the signature I'm thinking at the moment:
default StatisticsFile statisticsFor(long snapshotId, String blobType)
This will attempt to attempt to find the statistics file produced at the given snapshot which contains the blob type. I think this future proofs the API a bit more and makes it more useful in case a user is really only caring about a particular blob type which may or may not exist in some other Puffin file.
I also think that in case a statistics file for exactly that snapshot couldn't be found, we should return the latest statistics rather than just return nothing. While it's possible that the statistics are not completely accurate in this approach, I think in the average case data distributions wouldn't change so drastically between snapshots that the statistics would work horribly against the query. It's probably better in the average case to have some statistics. If there's no there's no statistics file containing the desired blob then I think we'd just return null.
WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes @amogh-jahagirdar your suggestion is perfect, considering a generic solution where we support multiple bolb types. The current implementation is considering that we will only support the "apache-datasketches-theta-v1".
We recently faced this when we were dealing with presto, considering both engines were using a common catalog, and hence the puffin file created by presto was not use-able as it was of a different blob type "presto-sum-data-size-bytes-v1". This change would be a more of a futuristic change which we may take up.
Regarding the best effort search of stats @amogh-jahagirdar, I thing we need to reconsider if we want to have some statistics always, because that would depend on the amount of data added or deleted after the last time we ran and Analyze. Because stale statistics could lead to wrong query plans. And what if we let the user configure how much deviation or change is the user fine with to continue using the older statistics. For the same I had made some changes so that the user may decide the amount of change https://github.com/karuppayya/iceberg/compare/fix_snapshot...jeesou:fix_snapshot_modifications?expand=1.
Kindly have a look at it @amogh-jahagirdar and @karuppayya and share your suggestions please.
I have not considered the delete scenario, if i find any deletion happening I am not using old stats, but that can be up to discussion as delete is a tricky subject in this case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
HI @karuppayya , @amogh-jahagirdar kindly review the change once and suggest any edits if required.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @karuppayya @amogh-jahagirdar @huaxingao kindly, give this a look, and share suggestions on this approach, mentioned above.
|
|
||
| @Override | ||
| public Optional<StatisticsFile> statistics(long snapshotId) { | ||
| return statisticsFiles().stream().filter(file -> file.snapshotId() == snapshotId).findFirst(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See comment above on a different signature to incorporate the desired blob type a user is looking for and how I think this APi should be more best effort (return the latest statistics file that we can find) rather than returning Optional.empty
|
Hi @karuppayya @amogh-jahagirdar could you please have a look at the PR. |
|
Hi @karuppayya @amogh-jahagirdar friendly reminder, please check the comments once. |
|
This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions. |
|
Hi @karuppayya will this PR be needed anymore, now that this PR is already in This seems to solve the problem right? |
|
This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions. |
|
This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time. |
Use the statistics of the snapshot being scanned, instead of the first statistics file.
@huaxingao @RussellSpitzer @aokolnychyi Please help review