Upsert Speedup #1

Draft
EnyMan wants to merge 8 commits into main from upsert-speedup

EnyMan (Owner) commented Jan 21, 2026

This pull request introduces extensive performance instrumentation across critical data read and write paths in PyIceberg, focusing on timing and logging for scan, append, overwrite, and upsert operations. The changes add detailed log messages to help diagnose bottlenecks and optimize performance, and refactor some logic in the upsert operation to improve efficiency for insert filtering. Below are the most important changes grouped by theme.

Performance Instrumentation and Logging

  • Added timing and logging throughout pyiceberg/io/pyarrow.py for file scan tasks, delete file reads, and batch processing. This includes breakdowns for file opening, schema projection, filter preparation, batch reading, and delete file collection. ([1], [2], [3], [4], [5], [6], [7])
  • Added timing and logging to all major write operations in pyiceberg/table/__init__.py, including append, overwrite, and upsert, to track the duration and row counts for each operation. ([1], [2], [3], [4], [5], [6])
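
As an illustration of the kind of instrumentation described above, here is a minimal sketch of a timing wrapper that logs the duration and row count of an operation. The helper name `log_duration` and the message format are hypothetical and are not taken from this PR's diff; the list copy stands in for a real write.

```python
import logging
import time
from contextlib import contextmanager

# Module-level logger, following the standard library convention.
logger = logging.getLogger(__name__)


@contextmanager
def log_duration(operation: str, row_count: int):
    """Log how long the wrapped block took (hypothetical helper)."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Runs even if the wrapped block raises, so failures are timed too.
        elapsed_ms = (time.perf_counter() - start) * 1000
        logger.info("%s: %d rows in %.1f ms", operation, row_count, elapsed_ms)


rows = [{"id": i} for i in range(1000)]
with log_duration("append", len(rows)):
    written = list(rows)  # stand-in for the real write path
```

Using `time.perf_counter()` rather than `time.time()` keeps the measurement monotonic, which matters for short intervals.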

Upsert Logic Improvements

  • Refactored the upsert logic to use a coarse match filter for the initial scan, and replaced per-batch expression filtering for inserts with a more efficient anti-join using PyArrow, improving correctness and performance for large datasets. ([1], [2])

File Planning Instrumentation

  • Added timing and logging to the local file planning method to track manifest scan, delete file matching, and residual evaluation times, with summary statistics for the number of data and delete files processed. ([1], [2])
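
One way to collect per-phase timings together with file counts is a small accumulator, sketched below. The class `PlanningStats`, the phase names, and the sample entries are hypothetical and only illustrate the pattern; they do not mirror the PR's implementation.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class PlanningStats:
    """Accumulates elapsed seconds per phase and simple counters (illustrative)."""

    def __init__(self):
        self.seconds = defaultdict(float)
        self.counts = defaultdict(int)

    @contextmanager
    def phase(self, name: str):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Add to the running total so a phase can be entered many times.
            self.seconds[name] += time.perf_counter() - start

    def summary(self) -> str:
        timings = ", ".join(f"{k}={v * 1000:.1f}ms" for k, v in self.seconds.items())
        counts = ", ".join(f"{k}={v}" for k, v in self.counts.items())
        return f"plan_files: {timings}; {counts}"


stats = PlanningStats()
entries = [("data", "a.parquet"), ("delete", "d.parquet"), ("data", "b.parquet")]
for kind, _path in entries:
    with stats.phase("manifest_scan"):
        stats.counts[f"{kind}_files"] += 1
with stats.phase("delete_matching"):
    pass  # stand-in for matching delete files to data files

summary_line = stats.summary()
```

Emitting one summary line at the end, instead of one log line per file, keeps the log volume bounded for tables with many manifests.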

Infrastructure

  • Introduced a module-level logger for consistent logging throughout pyiceberg/table/__init__.py. ([1], [2])

These changes provide much deeper visibility into the performance characteristics of PyIceberg's critical data paths, which will help with debugging and future optimization efforts.
