
Add metrics for parquet sink#20307

Merged
xudong963 merged 5 commits into apache:main from xudong963:metric_parquet_sink
Mar 2, 2026

Conversation

@xudong963
Member

@xudong963 xudong963 commented Feb 12, 2026

Which issue does this PR close?

  • Closes #.

Rationale for this change

Before this PR, EXPLAIN ANALYZE for COPY ... TO queries did not report any metrics.

What changes are included in this PR?

Support metrics for ParquetSink

Are these changes tested?

Yes, unit tests + sqllogictest

Are there any user-facing changes?

@xudong963 xudong963 marked this pull request as draft February 12, 2026 10:26
@github-actions github-actions bot added the core (Core DataFusion crate), sqllogictest (SQL Logic Tests (.slt)), and datasource (Changes to the datasource crate) labels Feb 12, 2026
@xudong963 xudong963 marked this pull request as ready for review February 12, 2026 15:15
xudong963 added a commit to massive-com/arrow-datafusion that referenced this pull request Feb 13, 2026
Cherry-pick of apache#20307

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Contributor

@kosiew kosiew left a comment


@xudong963 Thanks for working on this.

.map(|rg| rg.compressed_size() as usize)
.sum();
rows_written_counter.add(file_rows);
bytes_written_counter.add(file_bytes);
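The totals fed into those counters can be sketched as below. `RowGroup` here is a hypothetical stand-in for parquet's row-group metadata (the real code reads the `FileMetaData` returned by the arrow writer); the point is that `file_bytes` is a sum of compressed row-group sizes, not the exact on-disk file size:

```rust
// Hypothetical stand-in for parquet row-group metadata; the real sink reads
// the writer's returned FileMetaData.
struct RowGroup {
    num_rows: i64,
    compressed_size: i64,
}

// Derive (rows, bytes) for one finished file from its row groups. The byte
// total excludes footer/magic bytes, so it may undercount the on-disk size.
fn file_stats(row_groups: &[RowGroup]) -> (usize, usize) {
    let file_rows: usize = row_groups.iter().map(|rg| rg.num_rows as usize).sum();
    let file_bytes: usize = row_groups
        .iter()
        .map(|rg| rg.compressed_size as usize)
        .sum();
    (file_rows, file_bytes)
}

fn main() {
    let rgs = [
        RowGroup { num_rows: 100, compressed_size: 4096 },
        RowGroup { num_rows: 50, compressed_size: 2048 },
    ];
    assert_eq!(file_stats(&rgs), (150, 6144));
}
```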
Contributor


Given that bytes_written is derived from row_groups().compressed_size() and may not reflect the exact on-disk file size, should we clarify this in the documentation or rename it to something like compressed_row_group_bytes?

Member Author


Good point. I think bytes_written is the intuitive name that users expect for a sink metric. I'd like to keep the name bytes_written but add a brief doc comment explaining what it measures.

(Renaming to compressed_row_group_bytes is overly verbose and would surprise users familiar with standard sink metrics.)

Comment on lines +485 to +488
let rows_written = aggregated
.iter()
.find(|m| m.value().name() == "rows_written")
.expect("should have rows_written metric");
Contributor


Since both test_parquet_sink_metrics_parallel and test_parquet_sink_metrics repeat the aggregate_by_name().iter().find(...) pattern, should we add a small helper (e.g., metric_usize(&aggregated, "rows_written")) to simplify the tests and improve readability?
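The suggested helper could look roughly like this; the `(&str, usize)` slice is a simplified stand-in for the metric set that `aggregate_by_name()` actually returns, and the name `metric_usize` is kosiew's proposal, not existing API:

```rust
// Hypothetical test helper: look up a named metric value from an aggregated
// metric set (modeled here as (name, value) pairs) and panic with a clear
// message when the metric is missing.
fn metric_usize(aggregated: &[(&str, usize)], name: &str) -> usize {
    aggregated
        .iter()
        .find(|(n, _)| *n == name)
        .unwrap_or_else(|| panic!("should have {name} metric"))
        .1
}

fn main() {
    let aggregated = [("rows_written", 150usize), ("bytes_written", 6144)];
    assert_eq!(metric_usize(&aggregated, "rows_written"), 150);
    assert_eq!(metric_usize(&aggregated, "bytes_written"), 6144);
}
```

Both tests could then replace their repeated `find(...)` chains with a single call per metric name.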

Member Author


Good point!

Comment on lines +1342 to +1343
let rows_written_counter =
MetricBuilder::new(&self.metrics).global_counter("rows_written");
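The counter the builder hands back behaves, in essence, like a shared atomic that concurrent writer tasks can bump independently. A minimal sketch, using a simplified stand-in for DataFusion's `Count` type (not the real implementation):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

// Simplified stand-in for a global counter metric: cloneable and
// thread-safe, so each writer task can hold its own handle.
#[derive(Clone, Default)]
struct Count(Arc<AtomicUsize>);

impl Count {
    fn add(&self, n: usize) {
        self.0.fetch_add(n, Ordering::Relaxed);
    }
    fn value(&self) -> usize {
        self.0.load(Ordering::Relaxed)
    }
}

fn main() {
    let rows_written = Count::default();
    // A writer task clones the handle and adds per-file row counts;
    // all clones feed the same underlying total.
    let per_task = rows_written.clone();
    per_task.add(150);
    per_task.add(50);
    assert_eq!(rows_written.value(), 200);
}
```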
Contributor


Since Parquet now implements sink metrics but CSV/JSON still use the default DataSink::metrics() -> None, is this divergence intentional? If so, would it make sense to follow up by centralizing common sink metrics wiring (e.g., in FileSink or write orchestration) to reduce duplication and enable consistent opt-in across file sinks?

Member Author


Makes sense, I'll open an issue and make a follow-up PR for this.

Member Author


Opened an issue: #20644

@xudong963 xudong963 force-pushed the metric_parquet_sink branch from b235df2 to 50d3d0f March 1, 2026 21:08
@xudong963
Member Author

@kosiew thanks for the review; I applied your suggestions in the latest commit.

Contributor

@kosiew kosiew left a comment


lgtm

@xudong963 xudong963 added this pull request to the merge queue Mar 2, 2026
Merged via the queue into apache:main with commit 95de1bf Mar 2, 2026
28 checks passed
de-bgunter pushed a commit to de-bgunter/datafusion that referenced this pull request Mar 24, 2026
## Which issue does this PR close?


- Closes #.

## Rationale for this change


Before this PR, EXPLAIN ANALYZE for COPY ... TO queries did not report any metrics.

## What changes are included in this PR?


Support metrics for ParquetSink

## Are these changes tested?

Yes, unit tests + sqllogictest

## Are there any user-facing changes?


---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
natalievolk added a commit to natalievolk/datafusion that referenced this pull request Mar 24, 2026
…Sink and JsonSink

Follows the pattern established in apache#20307 (ParquetSink metrics). Threads
bytes_written tracking through the shared `spawn_writer_tasks_and_join`
orchestration function so both CSV and JSON sinks get consistent metrics
without duplicating logic.

Changes:
- `orchestration.rs`: track bytes per serialized chunk in
  `serialize_rb_stream_to_object_store`; extend the internal oneshot
  channel to carry (rows, bytes); add optional `Count` params to
  `spawn_writer_tasks_and_join`
- `CsvSink` / `JsonSink`: add `ExecutionPlanMetricsSet`, wire
  `rows_written`, `bytes_written`, `elapsed_compute` counters, implement
  `DataSink::metrics()`
- Tests: EXPLAIN ANALYZE sqllogictests in copy.slt; unit tests in
  datasource/file_format/csv.rs and json.rs

Closes apache#20644

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
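The orchestration change this commit describes, extending the internal channel to carry (rows, bytes), can be sketched roughly as follows. A std mpsc channel stands in for the real async channel, and `collect_totals` is a hypothetical name for the per-task accumulation:

```rust
use std::sync::mpsc;
use std::thread;

// Sum (rows, bytes) over the serialized chunks one writer task produced.
fn collect_totals(chunks: &[(usize, usize)]) -> (usize, usize) {
    chunks.iter().fold((0, 0), |(r, b), (cr, cb)| (r + cr, b + cb))
}

fn main() {
    // The channel now carries (rows, bytes) instead of rows alone, so the
    // sink side can feed both rows_written and bytes_written counters.
    let (tx, rx) = mpsc::channel::<(usize, usize)>();

    let writer = thread::spawn(move || {
        // Each tuple models one serialized record batch: (rows, bytes).
        let chunks = [(100usize, 4096usize), (50, 2048)];
        tx.send(collect_totals(&chunks)).unwrap();
    });

    let (rows, bytes) = rx.recv().unwrap();
    writer.join().unwrap();
    assert_eq!((rows, bytes), (150, 6144));
}
```

Routing the totals through the shared orchestration function is what lets CsvSink and JsonSink report the same metrics without duplicating the accounting logic.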