feat(core): Add support for replace in incremental scan #8

Merged
gbrgr merged 4 commits into main from feature/gb/support-replace-incremental
Nov 10, 2025

@gbrgr gbrgr commented Nov 6, 2025

Closes RAI-44110

Adds support for replace operations in snapshot histories for incremental scans.

Even though a replace operation leaves the logical data unchanged, we still report file additions and deletions, because the physical layout changes and rows move to different files. Incremental scan users who base change tracking on file identifiers need this.
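
To illustrate the file-level view described above, here is a minimal sketch, not the code in this PR. `FileChanges` and `replace_file_changes` are hypothetical names; the sketch only assumes that a replace snapshot can be summarized by the set of data files before and after it.

```rust
use std::collections::BTreeSet;

/// Hypothetical summary of the file-level changes caused by one replace snapshot.
#[derive(Debug, PartialEq)]
struct FileChanges {
    added: Vec<String>,
    deleted: Vec<String>,
}

/// Compares the data files before and after a replace snapshot.
/// The logical rows may be identical, but the file identifiers differ,
/// so both sides of the change are reported.
fn replace_file_changes(before: &[&str], after: &[&str]) -> FileChanges {
    let before: BTreeSet<&str> = before.iter().copied().collect();
    let after: BTreeSet<&str> = after.iter().copied().collect();
    FileChanges {
        // Files that exist only after the replace (e.g. the compacted file).
        added: after.difference(&before).map(|f| f.to_string()).collect(),
        // Files that existed only before the replace (the compacted inputs).
        deleted: before.difference(&after).map(|f| f.to_string()).collect(),
    }
}
```

Files untouched by the replace appear in neither list, so a consumer tracking changes by file identifier only sees the files whose layout actually changed.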

@gbrgr gbrgr changed the title Add support for replace Add support for replace in incremental scan Nov 6, 2025
@gbrgr gbrgr marked this pull request as ready for review November 6, 2025 12:35
@gbrgr gbrgr requested a review from vustef November 6, 2025 12:35
@vustef vustef left a comment

Just a couple of clarifications.

/// 1. Files to compact: Vec<String> of existing file names that are being compacted
/// 2. Target file: String name of the new compacted file
///
/// Example: `Replace(vec!["file-a.parquet", "file-b.parquet"], "file-a-b-compacted.parquet")`
Collaborator

Is this how Iceberg engines do it too? How do they retarget positional delete files to file-a-b-compacted.parquet?

Collaborator Author

What Spark does is essentially this: file-a-b-compacted.parquet contains the records of file-a plus file-b, minus any rows removed by positional deletes (and equality deletes). The existing delete files for file-a and file-b remain in place, however.
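
A rough sketch of that compaction behaviour, for illustration only. `DataFile` and `compact` are hypothetical names and do not mirror Spark's or this crate's API; equality deletes are left out to keep it short.

```rust
use std::collections::HashSet;

/// Hypothetical in-memory stand-in for a data file: a path plus its records in position order.
#[derive(Debug, Clone, PartialEq)]
struct DataFile {
    path: String,
    records: Vec<i64>,
}

/// Concatenates the input files into one target file, skipping every row
/// that a positional delete (file path, position) has already removed.
/// The delete entries themselves are not rewritten; they still point at the old paths.
fn compact(
    files: &[DataFile],
    positional_deletes: &HashSet<(String, usize)>,
    target: &str,
) -> DataFile {
    let mut records = Vec::new();
    for file in files {
        for (pos, record) in file.records.iter().enumerate() {
            if !positional_deletes.contains(&(file.path.clone(), pos)) {
                records.push(*record);
            }
        }
    }
    DataFile { path: target.to_string(), records }
}
```

Because the deleted rows are dropped while writing the compacted file, no delete file needs to be retargeted: the old delete files only ever reference the old paths.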

Comment thread crates/iceberg/src/scan/incremental/tests.rs
Comment thread crates/iceberg/src/scan/incremental/tests.rs
@gbrgr gbrgr changed the title Add support for replace in incremental scan feat(core): Add support for replace in incremental scan Nov 7, 2025
// Snapshot 6: Delete position 2 from file-ab-compact (record "4" deleted)
// Net result:
// - Additions: compacted records with position 1 filtered from file-a (1,3,5,10,11,12)
// - Deletions: All positions from file-a (0-4) and file-b (0-2) because these files
Collaborator

This would report a duplicate delete for record "2", right?

Collaborator Author

No, not really. The delete for the compacted file is filtered out, and only appends are reported for that file. Even if it were not, it depends on what we mean by a duplicate delete: since the file path is also reported, one of the deletes for record "2" would carry the path file-a, while the other would carry file-ab-compacted.
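
To make the filtering concrete, here is a simplified sketch of the net-change logic as I read it from this thread, not the actual implementation. `RowRef`, `NetChanges` and `net_changes` are hypothetical names, and the row counts are taken from the test comment above (file-a has positions 0-4, file-b has positions 0-2).

```rust
use std::collections::{BTreeMap, BTreeSet};

/// A row is identified by (file path, position), so the same logical record
/// living in two different files yields two distinct references.
type RowRef = (String, usize);

#[derive(Debug)]
struct NetChanges {
    appended: BTreeSet<RowRef>,
    deleted: BTreeSet<RowRef>,
}

/// `added` and `removed` map a file path to its row count for files added or
/// removed within the scan range; `positional_deletes` are the deletes committed in that range.
fn net_changes(
    added: &BTreeMap<String, usize>,
    removed: &BTreeMap<String, usize>,
    positional_deletes: &[RowRef],
) -> NetChanges {
    let mut appended = BTreeSet::new();
    let mut deleted = BTreeSet::new();
    for (path, rows) in added {
        for pos in 0..*rows {
            appended.insert((path.clone(), pos));
        }
    }
    // Every position of a removed file is reported as deleted.
    for (path, rows) in removed {
        for pos in 0..*rows {
            deleted.insert((path.clone(), pos));
        }
    }
    for (path, pos) in positional_deletes {
        if added.contains_key(path) {
            // The delete targets a file added within the range:
            // drop the append instead of reporting a delete.
            appended.remove(&(path.clone(), *pos));
        } else {
            deleted.insert((path.clone(), *pos));
        }
    }
    NetChanges { appended, deleted }
}
```

With file-ab-compact added (7 rows), file-a and file-b removed, and a delete at position 2 of file-ab-compact, this reports 6 appended rows and 8 deleted rows, none of which reference file-ab-compact.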

@gbrgr gbrgr merged commit 37e79b8 into main Nov 10, 2025
24 of 26 checks passed
@gbrgr gbrgr deleted the feature/gb/support-replace-incremental branch November 10, 2025 09:40
2 participants