Conversation

@cccs-jc
Contributor

@cccs-jc cccs-jc commented Nov 3, 2023

Closes #8902

@singhpk234 I have fixed issue #8902. Could you have a look at it?

```diff
 for (ManifestFile manifest : manifestFiles) {
   manifestIndexes.add(Pair.of(manifest, currentFileIndex));
-  currentFileIndex += manifest.addedFilesCount() + manifest.existingFilesCount();
+  currentFileIndex += manifest.addedFilesCount();
```
Contributor

Can we please add a unit test for this?

Contributor Author

I would, but I don't know how.

Contributor

Ya, we need to think of a case where this would actually result in incorrect results. I am a bit surprised that there is no existing case for testing this code path. When I was adding the rate-limit code, I just refactored this part into a common function.

Contributor Author

Ya, none of the snapshots have an existing file count value.

When I look at production tables, it's actually pretty rare to see that. I don't know what causes the existing file count to be set.

Contributor
@singhpk234 singhpk234 Nov 14, 2023

@cccs-jc IMHO we should not remove this line until we have coverage for it. Can you please revert this change?

Contributor

I think it is relatively easy to reason about this part because it is used only in one path.

Existing entries are entries for data files that are already part of a manifest. They appear, for example, when you rewrite the metadata because you use a lot of fast-appends: with a fast append, a new manifest is written and added to the manifest list. At some point you want to combine these small manifests into a bigger one, to reduce the number of calls to the object store and to make the planning part faster.

We want to skip when only the metadata has been rewritten, so I think this change is correct.
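To make that concrete, here is a minimal sketch (not code from this PR) of how such manifests arise with the core Iceberg API, assuming an existing `Table table` and `DataFile` handles `fileA`/`fileB`/`fileC`:

```java
// Lower the merge threshold so small manifests get combined sooner
// (commit.manifest.min-count-to-merge; the default is 100).
table.updateProperties()
    .set(TableProperties.MANIFEST_MIN_MERGE_COUNT, "3")
    .commit();

// Each fast append writes its own small manifest.
table.newFastAppend().appendFile(fileA).commit();
table.newFastAppend().appendFile(fileB).commit();

// A regular append merges the small manifests once the threshold is reached;
// the merged manifest then carries fileA and fileB as EXISTING entries.
table.newAppend().appendFile(fileC).commit();
```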

Contributor Author

Exactly. Here I show how existingFilesCount is only set when doing rewrites:

#8980 (comment)

However, the micro-batch streaming reader skips over rewrites, so it will never happen that existingFilesCount is set in the above code.

Contributor

It can happen. For example, it is easy to reproduce in TestTransaction::testTransactionRecommit, where a new data file is committed and merged into an existing manifest:

[screenshot: manifest entries from TestTransaction::testTransactionRecommit]

Now I'm questioning skipManifests: is it valid to rely on the counts of manifests? But I have to dig deeper into this code, since I'm not too familiar with it.

Contributor Author

@singhpk234 I'll put the `+ manifest.existingFilesCount()` back.

Now that I know how to create manifests with existing file counts, I will add this setting to the test suite (see the sketch below):

`commit.manifest.min-count-to-merge=3`

This way the test will include some manifest entries with existing file counts. I ran the tests again both with and without the `+ manifest.existingFilesCount()`, and all the tests still pass. I think that makes sense, because skipManifests is an optimization.
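For reference, a hedged sketch of what the verification could look like (this is not the PR's actual test; `table` and AssertJ's `assertThat` are assumed to be in scope):

```java
// After a few fast appends followed by a regular append with
// commit.manifest.min-count-to-merge=3, the merged manifest should carry
// EXISTING entries; check that before exercising the streaming read.
boolean hasExistingEntries =
    table.currentSnapshot().allManifests(table.io()).stream()
        .anyMatch(m -> m.existingFilesCount() != null && m.existingFilesCount() > 0);
assertThat(hasExistingEntries).isTrue();
```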

Have a look and let me know what you think.

@cccs-jc cccs-jc requested review from nastra and singhpk234 November 14, 2023 19:10
Contributor
@singhpk234 singhpk234 left a comment

@cccs-jc I would recommend making the changes one Spark version at a time and then creating back-port PRs. I am not sure what is preferred, but checking one version at a time helps with review and focus.


```java
      shouldContinueReading = false;
      break;
    }
    // we found the next available snapshot, continue from there.
```
Contributor

Is this comment required?

Contributor Author

It makes it explicit that we skipped some snapshots and are continuing from nextValid.


```java
Snapshot nextSnapshot = SnapshotUtil.snapshotAfter(table, curSnapshot.snapshotId());
// skip over rewrite and delete snapshots
while (!shouldProcess(nextSnapshot)) {
```
Contributor

Does shouldProcess handle null?

Contributor Author

I think SnapshotUtil.snapshotAfter never returns null.
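For what it's worth, my reading of core (hedged, worth double-checking) is that snapshotAfter fails loudly rather than returning null, so the loop never sees one:

```java
// SnapshotUtil.snapshotAfter resolves the child of the given snapshot on the
// current branch and throws if none exists, rather than returning null, so
// shouldProcess never receives a null snapshot here.
Snapshot nextSnapshot = SnapshotUtil.snapshotAfter(table, curSnapshot.snapshotId());
while (!shouldProcess(nextSnapshot)) {
  nextSnapshot = SnapshotUtil.snapshotAfter(table, nextSnapshot.snapshotId());
}
```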

@cccs-jc
Contributor Author

cccs-jc commented Nov 22, 2023

> @cccs-jc I would recommend making the changes one Spark version at a time and then creating back-port PRs. I am not sure what is preferred, but checking one version at a time helps with review and focus.

I removed the 3.4 version. The only version in the commit is 3.5.

@cccs-jc cccs-jc requested a review from singhpk234 November 22, 2023 20:59
@cccs-jc
Contributor Author

cccs-jc commented Nov 23, 2023

@singhpk234 As you recommended, I removed the 3.4 implementation and kept only the 3.5 version.

However, now the test cases for 3.4 are failing. Any idea how to fix this? Should I just put back the 3.4 implementation?

@cccs-jc
Contributor Author

cccs-jc commented Nov 28, 2023

> @singhpk234 As you recommended, I removed the 3.4 implementation and kept only the 3.5 version.
>
> However, now the test cases for 3.4 are failing. Any idea how to fix this? Should I just put back the 3.4 implementation?

What do you think, @singhpk234?

@singhpk234
Contributor

@cccs-jc I mean let's have the changes for 3.5, with their tests, only in 3.5; we can then backport the change with its tests to the lower Spark versions, 3.4 and 3.3. The 3.4 test failures are expected, right? We don't have the SparkMicroBatchStream changes for 3.4 in this PR.

Also, I would request reverting the change in core for Microbatch.java if we don't have coverage for it, as I am unsure when it would fail (maybe some legacy handling).

Apologies for being late in getting back on this.

@cccs-jc
Contributor Author

cccs-jc commented Dec 7, 2023

> @cccs-jc I mean let's have the changes for 3.5, with their tests, only in 3.5; we can then backport the change with its tests to the lower Spark versions, 3.4 and 3.3. The 3.4 test failures are expected, right? We don't have the SparkMicroBatchStream changes for 3.4 in this PR.
>
> Also, I would request reverting the change in core for Microbatch.java if we don't have coverage for it, as I am unsure when it would fail (maybe some legacy handling).
>
> Apologies for being late in getting back on this.

Keeping the `+ existingFilesCount()` in SparkMicrobatch.java makes no sense to me.

What is the purpose of adding that to currentFileIndex?

The way I understand it, currentFileIndex is a position over the added files, so we want to count only the added files (addedFilesCount()). Those are the files that a streaming job should consume.

Can you explain the purpose of using existingFilesCount here?
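For reference, this is the shape of the indexing loop after the change (restating the diff above; the surrounding declarations are assumed):

```java
// Each manifest is paired with the index of its first added file; since the
// stream only ever reads ADDED entries, only addedFilesCount() advances it.
List<Pair<ManifestFile, Integer>> manifestIndexes = Lists.newArrayList();
int currentFileIndex = 0;
for (ManifestFile manifest : manifestFiles) {
  manifestIndexes.add(Pair.of(manifest, currentFileIndex));
  currentFileIndex += manifest.addedFilesCount();
}
```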

@singhpk234
Contributor

> Can you explain the purpose of using existingFilesCount here?

I am not fully aware of the logic here. Logically I totally agree with you that it makes no sense to keep it, but what I am skeptical about is whether there is some handling for backward compatibility (I'm not sure about that either). You also suggested that you were not able to populate this field (referring to your comment here: #8980 (comment)). So my rationale was to add this change only once we have some coverage for it, or better, to take this change out of the current PR, involve more folks in the discussion, and let this PR go in as the skip-overwrite-commits functionality only. Please let me know your thoughts.

@cccs-jc
Contributor Author

cccs-jc commented Dec 13, 2023

So I did more digging. On our production tables I searched for all manifests which have existing_data_files_count > 0 and added_data_files_count > 0, and I found none. This leads me to believe that a commit will either be an append with added_data_files_count or a rewrite with existing_data_files_count.

This query returns no results:

```sql
select distinct added_snapshot_id
from catalog1.schema1.table1.manifests
where existing_data_files_count > 0
  and added_data_files_count > 0
```

I can search for manifests which have existing_data_files_count > 0 and join those results to the snapshots.

```sql
select *
from catalog1.schema1.table1.snapshots
where snapshot_id in (
    select distinct added_snapshot_id
    from catalog1.schema1.table1.manifests
    where existing_data_files_count > 0
)
```

Manifests with the snapshot_id they belong to:
[screenshot omitted]

Their corresponding snapshots are all rewrite snapshots:
[screenshot omitted]

When streaming, we skip over rewrite snapshots, so we will never encounter a manifest with existing_data_files_count > 0.

So the `+ existingFilesCount()` in this code does nothing.
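A hedged sketch of why (the exact predicate is assumed from the "skip over rewrite and delete snapshots" comment in the diff, not copied from the PR):

```java
// Snapshots are filtered by operation before any manifest indexing happens:
// REPLACE (rewrite) and DELETE snapshots are skipped entirely, and per the
// query results above those are the snapshots whose manifests carry
// existing_data_files_count > 0.
private boolean shouldProcess(Snapshot snapshot) {
  String op = snapshot.operation();
  return !DataOperations.REPLACE.equals(op) && !DataOperations.DELETE.equals(op);
}
```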

@Fokko Fokko added this to the Iceberg 1.5.0 milestone Jan 4, 2024
Contributor
@Fokko Fokko left a comment

This makes sense to me, thanks @cccs-jc for working on this.

@singhpk234 Do you have any further concerns? I'm also very skeptical of counting the added files, and I think we might want to remove that piece of logic (in a separate PR).

@singhpk234
Contributor

> I'm also very skeptical of counting the added files, and I think we might want to remove that piece of logic (in a separate PR).

+1 on this @Fokko. Other than this, no further concerns from my end!


Development

Successfully merging this pull request may close these issues.

Iceberg streaming streaming-skip-overwrite-snapshots SparkMicroBatchStream only skips over one file per trigger
