feat(core): Add support for replace in incremental scan #8
Conversation
vustef
left a comment
Just a couple clarifications
/// 1. Files to compact: Vec<String> of existing file names that are being compacted
/// 2. Target file: String name of the new compacted file
///
/// Example: `Replace(vec!["file-a.parquet", "file-b.parquet"], "file-a-b-compacted.parquet")`
Is this how Iceberg engines do it too? How do they retarget positional delete files to file-a-b-compacted.parquet?
What Spark essentially does is that file-a-b-compacted.parquet will contain the records of file-a + file-b minus the positional deletes (and equality deletes). However, the existing delete files of file-a and file-b remain in place.
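A minimal sketch of that compaction semantics (the function and data shapes here are illustrative, not this crate's API): the compacted file holds the surviving records of all source files, with positions covered by positional delete files filtered out.

```rust
// Hypothetical sketch: build a compacted file's contents from source
// files and their positional deletes. Names are illustrative only.
fn compact(
    files: &[(&str, Vec<i64>)],     // (source file name, its records)
    pos_deletes: &[(&str, usize)],  // (source file name, deleted position)
) -> Vec<i64> {
    let mut out = Vec::new();
    for (name, records) in files {
        for (pos, rec) in records.iter().enumerate() {
            // Skip any position covered by a positional delete for this file.
            let deleted = pos_deletes.iter().any(|(f, p)| f == name && *p == pos);
            if !deleted {
                out.push(*rec);
            }
        }
    }
    out
}

fn main() {
    // file-a holds [1, 2, 3]; position 1 of file-a is deleted;
    // file-b holds [4, 5] with no deletes.
    let compacted = compact(
        &[("file-a.parquet", vec![1, 2, 3]), ("file-b.parquet", vec![4, 5])],
        &[("file-a.parquet", 1)],
    );
    // The compacted file contains the survivors of both inputs.
    assert_eq!(compacted, vec![1, 3, 4, 5]);
    println!("{:?}", compacted);
}
```

Note the delete files themselves are not rewritten here, matching the comment above: they stay in place and simply no longer apply to the new file.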
// Snapshot 6: Delete position 2 from file-ab-compact (record "4" deleted)
// Net result:
// - Additions: compacted records with position 1 filtered from file-a (1,3,5,10,11,12)
// - Deletions: All positions from file-a (0-4) and file-b (0-2) because these files
This would report a duplicate delete for record "2", right?
No, not really. The delete for the compacted file is filtered out, and only appends for that file are reported. Even if it weren't, it depends on what we mean by a duplicated delete: since the file path is also reported, the file path for one of the records "2" would be file-a, while the other would be file-ab-compacted.
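To illustrate the second point, a tiny sketch (file names and the key shape are illustrative): change records are keyed by (file path, record), so two deletes for the same record value are still distinguishable when they reference different files.

```rust
fn main() {
    // Hypothetical delete records as (file_path, record) pairs.
    let deletes = [
        ("file-a.parquet", "2"),
        ("file-ab-compacted.parquet", "2"),
    ];
    // Same record value...
    assert_eq!(deletes[0].1, deletes[1].1);
    // ...but different file paths, so the pairs are not duplicates.
    assert_ne!(deletes[0], deletes[1]);
    println!("distinct deletes: {:?}", deletes);
}
```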
Closes RAI-44110
Adds support for `replace` operations in snapshot histories for incremental scans. Even though replace operations logically keep the data the same, we still report file additions and deletions, since the physical layout changes and the files to which the rows belong change. This is necessary for incremental-scan users who want to base change tracking on file identifiers.
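The reporting rule described above can be sketched as follows (the `Operation` and `FileChange` types are hypothetical stand-ins, not this crate's API): a replace surfaces its source files as deletions and the compacted target as an addition, even though the logical data is unchanged.

```rust
// Hypothetical sketch: a replace op mapped to file-level changes.
enum Operation {
    // (files being compacted, name of the new compacted file)
    Replace(Vec<String>, String),
}

struct FileChange {
    added: Vec<String>,
    deleted: Vec<String>,
}

fn changes(op: &Operation) -> FileChange {
    match op {
        // The physical layout changed, so sources are reported as
        // deletions and the compacted target as an addition.
        Operation::Replace(sources, target) => FileChange {
            added: vec![target.clone()],
            deleted: sources.clone(),
        },
    }
}

fn main() {
    let op = Operation::Replace(
        vec!["file-a.parquet".into(), "file-b.parquet".into()],
        "file-a-b-compacted.parquet".into(),
    );
    let ch = changes(&op);
    assert_eq!(ch.added, vec!["file-a-b-compacted.parquet".to_string()]);
    assert_eq!(ch.deleted.len(), 2);
    println!("added={:?} deleted={:?}", ch.added, ch.deleted);
}
```

A consumer tracking changes by file identifier can then treat the rows of the deleted files as removed and the rows of the added file as new, even though the net logical content is identical.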