Skip to content

Conversation

@karuppayya
Copy link
Contributor

Use the statistics of the snapshot being scanned, instead of the first statistics file.

@huaxingao @RussellSpitzer @aokolnychyi Please help review

@github-actions github-actions bot added the spark label Aug 29, 2024
Comment on lines 250 to 253
long snapshotId = snapshot.snapshotId();
return table.statisticsFiles().stream()
.filter(statisticsFile -> statisticsFile.snapshotId() == snapshotId)
.findFirst();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: I feel like this could just be inlined above instead of having a separate helper method, but not super opinonated.

}

@TestTemplate
public void testMultipleSnapshotsWithColStats() throws NoSuchTableException {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we're missing a test for the case where a statistics file for the snapshot couldn't be found? Let me know if I just missed it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 agree with it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The testcase add has it here

Copy link
Contributor

@huaxingao huaxingao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@karuppayya
Copy link
Contributor Author

@amogh-jahagirdar Can you please take a look?

@jeesou
Copy link
Contributor

jeesou commented Sep 12, 2024

LGTM
I have picked up the PR changes and tested it out, Its working fine.

Copy link
Contributor

@amogh-jahagirdar amogh-jahagirdar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the late review, please see my comment @karuppayya on why I think a table API for resolving a statistics file for a snapshot makes sense. Let me know what you think!

Comment on lines 198 to 200
Optional<StatisticsFile> statisticsFile = statisticsFile(snapshot);
if (statisticsFile.isPresent()) {
List<BlobMetadata> metadataList = statisticsFile.get().blobMetadata();
Copy link
Contributor

@amogh-jahagirdar amogh-jahagirdar Sep 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@karuppayya Sorry for missing this earlier, I think we may want to consider a table API for resolving a statistics file based on a snapshot, statisticsFileFor. The implementation of that API could just do a best effort search of the statistics file for a given snapshot, and if one cannot be found just return the most recent one.

If an engine integration needs the exact statistics and the API response isn't it, that's OK since the engine can then just ignore the statistics file. But i think in the most common cases, having an out of date statistics file is probably acceptable and so the API should probably default to the best effort lookup.

This is analagous to what happens in view.dialectFor API where a best effort for a given dialect is searched but if one cannot be found the first representation is returned. Engines like Trino which require the strict dialect can use the API response and compare against the desired and fail accordingly. Other engines like Spark don't do the strict lookup and just take the response as is.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to introduce the table API for retrieving the stats.
But should we do a best effort here or jsut return empty when there arent stats for the snasphot?
We dont have a means to compare against a baseline to figure if its an approximation, unlike dialects where it could be validated.

Copy link
Contributor

@jeesou jeesou Oct 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @karuppayya , i saw the latest changes, but still as per the latest changes it will take the latest Snapshot Id, and it filters over it. Which means that if the Analyze procedure is not executed for the latest snapshot, it won't find the stat file. Hence it is not doing the best effort search of the statistics file for a given snapshot right?

Instead, it should pick the last existing statistics file, so that we may get some benefit out of it at least in the query planning. Could you please help me understand the current behavior.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jeesou Yes, the current code doesnt return any available stats.
I think returning a best effort stats can result in bad decisions by optimizer based on when the stats were computed.
We can introduce a config to let users decide if they are fine with best effort search. This way the user is also aware of it, instead of doing it transparently. WDYT? @jeesou @amogh-jahagirdar

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes @karuppayya , making it config based seems like a better idea, giving the user more control over it.

@karuppayya
Copy link
Contributor Author

@amogh-jahagirdar I have incorporated the feedback. Can you please take a look. Thank you

@github-actions
Copy link

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.

@github-actions github-actions bot added the stale label Nov 12, 2024
@jeesou
Copy link
Contributor

jeesou commented Nov 14, 2024

Hi @karuppayya , @amogh-jahagirdar as per our discussion to introduce a config to let users decide if they are fine with best effort search, I was thinking of adding a kind of threshold that the user can decide, as per the amount of data change.

I have written some code as example, the diff can be seen here - https://github.com/karuppayya/iceberg/compare/fix_snapshot...jeesou:fix_snapshot_modifications?expand=1

i created the config as OLD_STATISTICS_USAGE_THRESHOLD_PERCENTAGE

Basically it tries and finds the last Snapshot for which statistics is present, and the amount of data changed in between.

Currently if any deletion is happening I am not using the old statistics, as deletion can be unpredictable, and this needs fine-tuning, for other operations, I make a record of amount of data change and check whether the change is within the specified threshold. Default value is 100 which means by default it will never use the old existing stats.

kindly check once and please do suggest improvements.

@github-actions github-actions bot removed the stale label Nov 15, 2024
@jeesou
Copy link
Contributor

jeesou commented Nov 27, 2024

Hi @karuppayya , @amogh-jahagirdar kindly check the comment above.

@amogh-jahagirdar
Copy link
Contributor

Sorry for the delay @jeesou @karuppayya , this is on my list today for review

*
* @return the {@link StatisticsFile} for the given snapshot id, if available.
*/
default Optional<StatisticsFile> statistics(long snapshotId) {
Copy link
Contributor

@amogh-jahagirdar amogh-jahagirdar Nov 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding this, I think this API is in the right direction but I wonder if it's taking enough information to be able to resolve the right file. There's an implicit assumption in the current implementation of this API that a Puffin file will have all the blobs needed. I think it's possible that there could multiple puffins, and one puffin has blob type NDV and the other puffin has blob type SomeFutureIndex. both of these puffin files are for snapshot 1.

In the current implementation if someone wants the Puffin where there's SomeFutureIndex, but we happen to return the Puffin due to the generic "Find me a puffin with the given snapshot logic" I feel like the API isn't really doing what it needs.

Here's the signature I'm thinking at the moment:

default StatisticsFile statisticsFor(long snapshotId, String blobType)

This will attempt to attempt to find the statistics file produced at the given snapshot which contains the blob type. I think this future proofs the API a bit more and makes it more useful in case a user is really only caring about a particular blob type which may or may not exist in some other Puffin file.

I also think that in case a statistics file for exactly that snapshot couldn't be found, we should return the latest statistics rather than just return nothing. While it's possible that the statistics are not completely accurate in this approach, I think in the average case data distributions wouldn't change so drastically between snapshots that the statistics would work horribly against the query. It's probably better in the average case to have some statistics. If there's no there's no statistics file containing the desired blob then I think we'd just return null.

WDYT?

Copy link
Contributor

@jeesou jeesou Nov 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes @amogh-jahagirdar your suggestion is perfect, considering a generic solution where we support multiple bolb types. The current implementation is considering that we will only support the "apache-datasketches-theta-v1".
We recently faced this when we were dealing with presto, considering both engines were using a common catalog, and hence the puffin file created by presto was not use-able as it was of a different blob type "presto-sum-data-size-bytes-v1". This change would be a more of a futuristic change which we may take up.

Regarding the best effort search of stats @amogh-jahagirdar, I thing we need to reconsider if we want to have some statistics always, because that would depend on the amount of data added or deleted after the last time we ran and Analyze. Because stale statistics could lead to wrong query plans. And what if we let the user configure how much deviation or change is the user fine with to continue using the older statistics. For the same I had made some changes so that the user may decide the amount of change https://github.com/karuppayya/iceberg/compare/fix_snapshot...jeesou:fix_snapshot_modifications?expand=1.

Kindly have a look at it @amogh-jahagirdar and @karuppayya and share your suggestions please.
I have not considered the delete scenario, if i find any deletion happening I am not using old stats, but that can be up to discussion as delete is a tricky subject in this case.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

HI @karuppayya , @amogh-jahagirdar kindly review the change once and suggest any edits if required.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @karuppayya @amogh-jahagirdar @huaxingao kindly, give this a look, and share suggestions on this approach, mentioned above.


@Override
public Optional<StatisticsFile> statistics(long snapshotId) {
return statisticsFiles().stream().filter(file -> file.snapshotId() == snapshotId).findFirst();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comment above on a different signature to incorporate the desired blob type a user is looking for and how I think this APi should be more best effort (return the latest statistics file that we can find) rather than returning Optional.empty

@jeesou
Copy link
Contributor

jeesou commented Dec 10, 2024

Hi @karuppayya @amogh-jahagirdar could you please have a look at the PR.

@jeesou
Copy link
Contributor

jeesou commented Jan 7, 2025

Hi @karuppayya @amogh-jahagirdar friendly reminder, please check the comments once.

@github-actions
Copy link

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.

@github-actions github-actions bot added the stale label Mar 24, 2025
@jeesou
Copy link
Contributor

jeesou commented Mar 26, 2025

Hi @karuppayya will this PR be needed anymore, now that this PR is already in
#12482

This seems to solve the problem right?

@github-actions github-actions bot removed the stale label Mar 27, 2025
@github-actions
Copy link

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.

@github-actions github-actions bot added the stale label Apr 26, 2025
@github-actions
Copy link

github-actions bot commented May 3, 2025

This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

@github-actions github-actions bot closed this May 3, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants