feat(reader): Add read_with_metrics() for scan I/O metrics #2349
blackmwk merged 14 commits into apache:main
Conversation
@blackmwk I'm trying really hard not to add more to
CTTY
left a comment
Just took a quick pass, love the direction!
blackmwk
left a comment
Thanks @mbutrovich for this PR!
This looks like a great move in the right direction - as we discussed before, I've been keen to see metrics captured for a long time. I like that this approach is agnostic with regard to the consumption of the resulting metrics themselves; interested consumers can write their own adapters to

Let's hold off for a while to wait for #2358.

Cleaning up the merge conflict after #2358.
Resolve conflict from PR apache#2358 splitting reader.rs into modules.

Port bytes_read/ScanMetrics changes into reader/pipeline.rs:
- FileScanTaskReader struct with ScanMetrics
- CountingFileRead wrapping in open_parquet_file
- ScanResult return type from read()

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
blackmwk
left a comment
Thanks @mbutrovich for this PR! Generally LGTM.
blackmwk
left a comment
Thanks @mbutrovich for this PR!
**Which issue does this PR close?**

- Closes #.

**What changes are included in this PR?**

Add always-on per-scan I/O metrics to `ArrowReader`.

**Motivation:** Downstream engines need per-scan byte counts for their UIs. For example, DataFusion Comet uses this to populate
`bytes_scanned` on its Iceberg scan operator, which flows through to Spark UI via `TaskMetrics.inputMetrics.setBytesRead()`. This must be per-scan, not global. Concurrent scans against the same `FileIO` need independent counters. The approach matches DataFusion's pattern of wrapping `AsyncFileReader` with a counting layer and is storage-backend agnostic.

**`ArrowReader::read()` now returns `ScanResult`**

- `ScanResult` wraps the record batch stream and `ScanMetrics`. Accessors: `stream()`, `metrics()`, `into_parts()`.
- Metrics are always collected. One `fetch_add(Relaxed)` per I/O request, negligible overhead.
- Counter is created fresh per `read()` call, so cloned readers get independent metrics. A consumption sketch follows below.
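A minimal consumption sketch under stated assumptions: only `read()`, `into_parts()`, and `bytes_read()` come from this PR; the `FileScanTaskStream` parameter, the `Result` wrapping, and the `(stream, metrics)` tuple order are assumptions for illustration.

```rust
use futures::TryStreamExt;
use iceberg::arrow::ArrowReader;
use iceberg::scan::FileScanTaskStream;

// Hypothetical consumer: drains a scan and reports total bytes read.
async fn count_scan_bytes(
    reader: ArrowReader,
    tasks: FileScanTaskStream, // assumed parameter type
) -> iceberg::Result<u64> {
    // read() now returns a ScanResult instead of a bare batch stream.
    let result = reader.read(tasks)?;
    let (mut stream, metrics) = result.into_parts(); // assumed tuple order

    // The stream is lazy: no I/O has happened yet, so the counter is 0.
    assert_eq!(metrics.bytes_read(), 0);

    while let Some(_batch) = stream.try_next().await? {
        // Each poll may trigger counted reads of data and delete files.
    }
    Ok(metrics.bytes_read())
}
```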
**New file: `crates/iceberg/src/arrow/scan_metrics.rs`**

- `CountingFileRead<F: FileRead>`: generic wrapper that increments a shared `AtomicU64` on each `read()`.
- `ScanMetrics`: public handle exposing `bytes_read()`.
- `ScanResult`: public struct returned by `ArrowReader::read()`.

**`FileRead` blanket impl for `Box<dyn FileRead>`**

- Enables generic `CountingFileRead<F>` to wrap the boxed reader returned by `FileIO::reader()`. A sketch of the wrapper shape follows below.
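A minimal sketch of the wrapper shape, using a simplified synchronous stand-in for the crate's `FileRead` trait (the real trait is async; the signatures here are illustrative, not the crate's actual API):

```rust
use std::ops::Range;
use std::sync::Arc;
use std::sync::atomic::{AtomicU64, Ordering};

// Simplified stand-in for iceberg's FileRead trait (the real one is async).
trait FileRead {
    fn read(&self, range: Range<u64>) -> Vec<u8>;
}

// Generic counting wrapper: delegates to the inner reader and tallies the
// bytes it served into a shared atomic counter.
struct CountingFileRead<F: FileRead> {
    inner: F,
    bytes_read: Arc<AtomicU64>,
}

impl<F: FileRead> FileRead for CountingFileRead<F> {
    fn read(&self, range: Range<u64>) -> Vec<u8> {
        let buf = self.inner.read(range);
        // One relaxed atomic add per I/O request, as the PR describes.
        self.bytes_read.fetch_add(buf.len() as u64, Ordering::Relaxed);
        buf
    }
}

// Blanket impl so the boxed reader returned by FileIO::reader() can be
// wrapped as CountingFileRead<Box<dyn FileRead>>.
impl FileRead for Box<dyn FileRead> {
    fn read(&self, range: Range<u64>) -> Vec<u8> {
        (**self).read(range)
    }
}
```

In this shape, `ScanMetrics` is just a handle around the same shared `Arc<AtomicU64>`, which is why the counter stays readable after the stream has consumed the reader.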
**Single `open_parquet_file` with counting**

- All Parquet opens (data files and delete files) go through the same `open_parquet_file` wrapped with `CountingFileRead`, so `bytes_read` reflects total scan I/O.
- `build_parquet_reader()`: shared internals for reader construction and metadata loading.

**`FileScanTaskReader` struct (refactor)**

- Extracted `process_file_scan_task`'s parameters into a `Clone` struct with a `process(self, task)` method, resolving a `clippy::too_many_arguments` violation. Struct and impl are co-located. The sketch below shows the general shape.
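A minimal sketch of that parameter-struct refactor; every field name below is a hypothetical stand-in, not the actual layout of `FileScanTaskReader`:

```rust
// Hypothetical stand-in for the real iceberg::scan::FileScanTask.
struct FileScanTask;

// Former function parameters become fields of a Clone-able struct,
// resolving clippy::too_many_arguments on process_file_scan_task.
#[derive(Clone)]
struct FileScanTaskReader {
    batch_size: Option<usize>,         // hypothetical field
    row_group_filtering_enabled: bool, // hypothetical field
    row_selection_enabled: bool,       // hypothetical field
}

impl FileScanTaskReader {
    // process(self, task): each spawned task consumes its own cheap clone
    // of the reader state instead of threading many arguments through.
    fn process(self, _task: FileScanTask) {
        // ...open the Parquet file, apply filters, decode batches...
    }
}
```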
**Re-exports**

- `ScanMetrics` and `ScanResult` re-exported from `iceberg::arrow` and `iceberg::scan`.

**Are these changes tested?**
- `test_scan_metrics_bytes_read` in `reader.rs`: asserts `bytes_read() == 0` before stream consumption (the stream is lazy) and `bytes_read() > 0` after.
- `test_scan_metrics_includes_delete_file_bytes`: reads the same data file with and without a positional delete file and asserts `bytes_read` is strictly greater when deletes are present.

All existing reader and scan tests pass (updated to use `ScanResult::stream()`).

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: blackmwk <liurenjie1024@outlook.com>

(cherry picked from commit 1ad4bfd)