feat: task-level input metrics (bytesRead) for Iceberg native scan #4128

Merged
mbutrovich merged 6 commits into apache:main from mbutrovich:iceberg_metrics
Apr 28, 2026
@mbutrovich
Contributor

Which issue does this PR close?

Closes #4002.

Rationale for this change

Iceberg native scans report zero for task-level input metrics (bytesRead, recordsRead) in Spark UI because iceberg-rust reads files entirely in Rust, bypassing Hadoop's Java I/O counters. Upstream iceberg-rust PR apache/iceberg-rust#2349 added ScanMetrics with a live bytes_read counter. This PR plumbs that through to Spark.

What changes are included in this PR?

  • Bump iceberg-rust dep from a2f067d to 1ad4bfd (adds ScanResult/ScanMetrics)
  • ArrowReader::read() now returns ScanResult; extract the stream from it and clone the metrics handle
  • Add a bytes_scanned Count metric to IcebergScanMetrics, and bridge iceberg-rust's live AtomicU64 counter into the DataFusion metric tree on each poll_next via delta tracking
  • Add bytes_scanned SQLMetric to CometIcebergNativeScanExec
  • Override CometExecRDD.compute() to call reportScanInputMetrics (same pattern as the Parquet path)
  • Remove stale configured_scheme field from OpenDalStorageFactory::S3 (upstream API change)
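
The delta tracking mentioned in the list above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code from this PR: `BytesReadBridge` and `sync` are hypothetical names, and a plain `AtomicU64` stands in for the DataFusion `Count` metric so the example depends only on the standard library. The idea is that the source counter is cumulative, so each poll adds only the growth since the previous poll to the sink.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

/// Hypothetical helper (not from the PR): mirrors a cumulative source counter
/// into a sink counter by adding only the growth seen since the last sync.
pub struct BytesReadBridge {
    /// Cumulative counter owned by the reader (stand-in for iceberg-rust's bytes_read).
    source: Arc<AtomicU64>,
    /// Metric exposed to the engine (stand-in for a DataFusion Count metric).
    sink: Arc<AtomicU64>,
    /// Last source value that has already been forwarded to the sink.
    last_seen: u64,
}

impl BytesReadBridge {
    pub fn new(source: Arc<AtomicU64>, sink: Arc<AtomicU64>) -> Self {
        Self { source, sink, last_seen: 0 }
    }

    /// Intended to be called once per poll of the stream.
    /// Returns the number of bytes newly forwarded to the sink.
    pub fn sync(&mut self) -> u64 {
        let current = self.source.load(Ordering::Relaxed);
        // saturating_sub guards against underflow if the source were ever reset.
        let delta = current.saturating_sub(self.last_seen);
        if delta > 0 {
            self.sink.fetch_add(delta, Ordering::Relaxed);
            self.last_seen = current;
        }
        delta
    }
}
```

Forwarding deltas rather than overwriting the sink keeps the sink additive, which matters if several streams (for example, one per partition) report into the same metric. Calling `sync` on every poll also means the metric is visible while the scan is still running instead of only at completion.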

How are these changes tested?

  • Added bytes_scanned > 0 assertion to existing "verify all Iceberg planning metrics" test
  • New test: "task-level inputMetrics.bytesRead is populated for Iceberg native scan" — uses SparkListener to verify bytesRead > 0, recordsRead == 10000, and cross-checks SQL-level bytes_scanned matches task-level bytesRead

@mbutrovich mbutrovich self-assigned this Apr 28, 2026
Comment thread spark/src/test/scala/org/apache/comet/CometIcebergNativeSuite.scala Outdated
Comment thread native/core/src/execution/operators/iceberg_scan.rs
Comment thread spark/src/test/scala/org/apache/comet/CometIcebergNativeSuite.scala Outdated
Comment thread spark/src/test/scala/org/apache/comet/CometIcebergNativeSuite.scala Outdated
Comment thread spark/src/test/scala/org/apache/comet/CometIcebergNativeSuite.scala
@mbutrovich
Contributor Author

Thanks for the feedback @andygrove, @comphead, and @hsiang-c! I believe I've addressed all comments and clarified this in the Iceberg user guide.

Contributor

@comphead comphead left a comment

Thanks @mbutrovich

Comment thread spark/src/test/scala/org/apache/comet/CometIcebergNativeSuite.scala
Contributor

@hsiang-c hsiang-c left a comment

LGTM, thanks @mbutrovich

@mbutrovich mbutrovich merged commit f7b632c into apache:main Apr 28, 2026
136 of 137 checks passed
@mbutrovich mbutrovich deleted the iceberg_metrics branch April 28, 2026 19:23
Development

Successfully merging this pull request may close these issues.

Task metrics (bytes read) for CometIcebergNativeScan

4 participants