feat(core): Support file deletion in incremental scan #4

Merged
gbrgr merged 10 commits into main from feature/gb/support-file-deletion
Nov 4, 2025

Collaborator

@gbrgr gbrgr commented Nov 3, 2025

Closes RAI-43291

Support Deleted Files in Incremental Scans

Summary

This PR implements support for deleted data files in incremental scans by introducing streaming logic for IncrementalFileScanTask::Delete. Previously, encountering a deleted file during an incremental scan would result in a "Feature Unsupported" error. Now, deleted files are properly handled by emitting delete records for all positions in the file (0 to N-1, where N is the total record count).

Changes

Implementation

iceberg-rust/crates/iceberg/src/arrow/incremental.rs

  1. New Helper Function: process_incremental_deleted_file_task (lines 314-353)

    • Generates position values from 0 to total_records - 1 (0-indexed)
    • Creates RecordBatches with pos column (UInt64) and _file column (file path)
    • Does NOT read the actual deleted file - only uses the record count from metadata
    • Returns a stream of batches, similar to positional deletes
  2. Updated IncrementalFileScanTask::Delete Match Arm (lines 126-161)

    • Extracts the file path and the total record count (deleted_file_task.base.record_count) from the task
    • Calls process_incremental_deleted_file_task to generate delete records
    • Sends batches to deletes_tx channel (correct channel for delete records)
    • Handles errors appropriately with descriptive messages
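
The position generation described above can be sketched without any Arrow dependency. This is a minimal, std-only illustration, not the code from the PR: `DeleteBatch`, `deleted_file_batches`, and the `batch_size` parameter are hypothetical names, and a plain struct stands in for a RecordBatch with `pos` and `_file` columns.

```rust
/// One batch of synthesized delete records for a single data file.
/// Hypothetical stand-in for a RecordBatch with `pos` and `_file` columns.
#[derive(Debug, Clone, PartialEq)]
pub struct DeleteBatch {
    pub file_path: String,
    pub positions: Vec<u64>,
}

/// Lazily yields delete batches covering positions 0..total_records
/// (0-indexed, so the last position is total_records - 1).
/// Only the record count is needed; the data file itself is never opened.
pub fn deleted_file_batches(
    file_path: &str,
    total_records: u64,
    batch_size: u64,
) -> impl Iterator<Item = DeleteBatch> {
    assert!(batch_size > 0, "batch_size must be positive");
    let file_path = file_path.to_owned();
    (0..total_records)
        .step_by(batch_size as usize)
        .map(move |start| {
            // The final batch may be shorter than batch_size.
            let end = start.saturating_add(batch_size).min(total_records);
            DeleteBatch {
                file_path: file_path.clone(),
                positions: (start..end).collect(),
            }
        })
}
```

Under these assumptions, a file with 5 records and a batch size of 2 yields three batches holding positions [0, 1], [2, 3], and [4], each tagged with the same file path.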

Tests

iceberg-rust/crates/iceberg/src/scan/incremental/tests.rs

Three comprehensive tests were added/updated to verify the implementation:

  1. test_incremental_scan_append_then_delete_file (lines 1591-1667)

    • Tests the basic scenario: append a file, then delete it
    • Verifies that scanning from empty state to final state yields no net change
    • Verifies that scanning from before deletion to after shows proper delete records
  2. test_incremental_scan_positional_deletes_then_file_delete (lines 1669-1756)

    • Tests the double-delete scenario: positional deletes followed by file deletion
    • Verifies that the system correctly handles redundant deletes
    • Includes a scan from snapshot 3 to 4 that demonstrates delete records are emitted even when records were already deleted by positional deletes
  3. test_incremental_scan_with_deleted_files_cancellation (updated, lines 1758+)

    • Updated from expecting an error to verifying correct delete behavior
    • Tests cancellation logic: files deleted outside the scan range produce Delete tasks
    • Verifies that files added and deleted within the same range are handled correctly

Key Design Decisions

Position Indexing

Delete records use 0-indexed positions (0 to N-1) to match Iceberg's positional delete semantics and ensure consistency with existing positional delete handling.

No File I/O Required

The implementation generates delete records without reading the actual deleted file. It only uses the record_count metadata from the manifest entry, making it efficient.

Interaction with Positional Deletes

The existing cancellation logic in the incremental scan planner (at mod.rs:481) prevents double-deletes by filtering positional deletes when appropriate:

  • When a file is both appended and deleted in the scan range AND has positional deletes, the positional deletes are emitted (not the deleted file task)
  • When a file is only deleted (not appended in the range), the deleted file task is emitted and positional deletes are filtered out
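
The two rules above can be written as a small decision function. This is only a sketch of the described behavior, not the planner code at mod.rs:481; `Emission` and `delete_emission` are hypothetical names. The cases the description does not spell out are marked as assumptions in the comments.

```rust
/// What is emitted on the delete side for one data file in a scan range.
/// Hypothetical type, for illustration only.
#[derive(Debug, PartialEq, Eq)]
pub enum Emission {
    /// Nothing is emitted for this file.
    Nothing,
    /// Only the positional deletes recorded for the file are emitted.
    PositionalDeletes,
    /// A deleted-file task covering positions 0..N-1 is emitted.
    DeletedFileTask,
}

pub fn delete_emission(
    appended_in_range: bool,
    deleted_in_range: bool,
    has_positional_deletes: bool,
) -> Emission {
    match (appended_in_range, deleted_in_range, has_positional_deletes) {
        // Appended and deleted in the range, with positional deletes:
        // the positional deletes are emitted, not the deleted file task.
        (true, true, true) => Emission::PositionalDeletes,
        // Appended and deleted in the range, no positional deletes:
        // ASSUMPTION based on the "no net change" test description.
        (true, true, false) => Emission::Nothing,
        // Only deleted in the range: the deleted file task is emitted and
        // positional deletes are filtered out.
        (false, true, _) => Emission::DeletedFileTask,
        // Not deleted in the range: ASSUMPTION that positional deletes,
        // if any, pass through unchanged.
        (_, false, true) => Emission::PositionalDeletes,
        (_, false, false) => Emission::Nothing,
    }
}
```

Encoding the rules as an exhaustive `match` lets the compiler confirm that every combination of the three flags is handled.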

Double-Delete Behavior

The system may emit "redundant" deletes in certain scenarios. For example:

  • Snapshot 3: Positional deletes remove records at positions 0, 1, 2
  • Snapshot 4: File is deleted entirely
  • Scan from 3→4 will emit positions 0, 1, 2 from the file deletion, even though they were already deleted

This is expected behavior because incremental scans are stateless with respect to individual record states - they only track file-level changes between snapshots.
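
To illustrate why redundant deletes can be harmless, here is a sketch of a hypothetical consumer that tracks live row positions in a set. The PR does not prescribe how consumers apply delete records; `apply_deletes` and the set-based model are assumptions made for this example.

```rust
use std::collections::BTreeSet;

/// Removes the given positions from the set of live row positions and
/// returns how many rows were actually removed. Positions that are
/// already gone are ignored, so applying the same delete twice is a no-op.
pub fn apply_deletes(
    live: &mut BTreeSet<u64>,
    deleted: impl IntoIterator<Item = u64>,
) -> usize {
    deleted.into_iter().filter(|pos| live.remove(pos)).count()
}
```

For a file with 5 rows, applying the positional deletes for positions 0, 1, 2 removes three rows. Applying the file-level delete for positions 0 through 4 afterwards removes only the remaining two, and the set ends up empty either way.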

@gbrgr gbrgr marked this pull request as ready for review November 3, 2025 09:21
Collaborator

@vustef vustef left a comment

Thanks Gerald. A few clarifications. Very nice tests btw!

Collaborator

are row numbers supposed to be 0-based?

Collaborator Author

Hmm good question. In the positional delete files they are, so I did this for consistency.

vec![], // No appends (file was added and fully deleted)
vec![
// Positions from positional deletes (snapshot 3): 0, 1, 2
// The deleted file task (which would emit 0, 1, 2) is NOT emitted
Collaborator

Would it make sense to modify the test so that it would emit different things for positional deletes vs. for the deleted file task?

Collaborator Author

I guess I'd have to expose this in the public API somehow, which I don't really want to do.

Collaborator

You can just change the test so that it fails if it returns positional deletes only (e.g. delete only row 0 with positional deletes)

Collaborator

@vustef vustef left a comment

Thanks Gerald.

@gbrgr gbrgr enabled auto-merge (squash) November 4, 2025 09:39
@gbrgr gbrgr disabled auto-merge November 4, 2025 09:42
@gbrgr gbrgr merged commit 589b57c into main Nov 4, 2025
16 checks passed
@gbrgr gbrgr deleted the feature/gb/support-file-deletion branch November 4, 2025 10:05