[Proposal] Expose Spilling Progress Interface in DataFusion #19697

@xudong963

Background

Currently:

  1. SpillMetrics (per operator) are updated only at the end of a spill.
  2. DiskManager tracks used_disk_space (current total) but doesn't expose a structured "progress" view.

Proposed Changes

  1. Real-time Metric Updates in SpillMetrics: modify InProgressSpillFile to ensure spilled_bytes and spill_file_count metrics are updated as soon as the data is written to disk.
  • Initial update: In append_batch, when the IPCStreamWriter is first created, immediately call update_disk_usage() on the file and add the size (schema/header) to spilled_bytes.
  • Incremental update: After each writer.write(batch) call, call update_disk_usage() and add the delta size to spilled_bytes.
  • Final update: In finish(), call update_disk_usage() after finishing the writer and add the remaining delta size (footer/metadata) to spilled_bytes.
  2. Spilling Progress Interface in DiskManager: expose the current global state of the disk manager.
  • New SpillingProgress struct
    pub struct SpillingProgress {
        /// Total bytes currently used on disk for spilling
        pub current_bytes: u64,
        /// Total number of active spill files
        pub active_files_count: usize,
    }
  • Implement spilling_progress(&self) -> SpillingProgress
  3. Delegate Interface in RuntimeEnv: provide a convenient entry point for users.
    let progress = ctx.runtime_env().spilling_progress();
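
To make the three update points in item 1 concrete, here is a minimal, self-contained sketch of the delta accounting. It is not DataFusion code: `SpillMetricsSketch` and `InProgressSpillSketch` are hypothetical stand-ins for `SpillMetrics` and `InProgressSpillFile`, a `Vec<u8>` plays the role of the file on disk, and the `SCHEMA` / `EOS` byte strings are made-up placeholders for the IPC header and footer.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Hypothetical stand-in for the per-operator `SpillMetrics`.
#[derive(Default)]
pub struct SpillMetricsSketch {
    pub spilled_bytes: AtomicUsize,
    pub spill_file_count: AtomicUsize,
}

/// Hypothetical stand-in for `InProgressSpillFile`.
/// The `Vec<u8>` simulates the spill file; a real implementation
/// would ask the file system for the current file size instead.
pub struct InProgressSpillSketch {
    file: Vec<u8>,
    writer_started: bool,
    /// Bytes already added to `spilled_bytes` for this file.
    reported_bytes: usize,
    metrics: Arc<SpillMetricsSketch>,
}

impl InProgressSpillSketch {
    pub fn new(metrics: Arc<SpillMetricsSketch>) -> Self {
        Self { file: Vec::new(), writer_started: false, reported_bytes: 0, metrics }
    }

    /// Analogue of calling `update_disk_usage()` and adding the delta:
    /// re-read the file size and publish only the growth since the last call.
    fn publish_delta(&mut self) {
        let current = self.file.len();
        let delta = current - self.reported_bytes; // the file only grows
        self.metrics.spilled_bytes.fetch_add(delta, Ordering::Relaxed);
        self.reported_bytes = current;
    }

    pub fn append_batch(&mut self, batch: &[u8]) {
        if !self.writer_started {
            // Initial update: creating the writer emits the schema/header.
            self.file.extend_from_slice(b"SCHEMA");
            self.writer_started = true;
            self.metrics.spill_file_count.fetch_add(1, Ordering::Relaxed);
            self.publish_delta();
        }
        // Incremental update: publish the bytes written for this batch.
        self.file.extend_from_slice(batch);
        self.publish_delta();
    }

    /// Final update: finishing the writer emits the footer/metadata.
    /// Returns the total number of bytes reported for this file.
    pub fn finish(mut self) -> usize {
        if self.writer_started {
            self.file.extend_from_slice(b"EOS");
            self.publish_delta();
        }
        self.reported_bytes
    }
}
```

Tracking `reported_bytes` per file keeps each update idempotent with respect to the file size: calling `publish_delta` twice without an intervening write adds zero, so the metric cannot double-count.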
    

Users could then call this API to get the real-time spilling progress. For our use case, we want to call it from the SQL UI to give users real-time feedback about their queries.
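
As a rough illustration of items 2 and 3 and of the UI use case, the sketch below shows how a snapshot struct could be served from atomic counters and delegated through a runtime handle. `DiskManagerSketch`, `RuntimeEnvSketch`, `register_file`, `grow`, `release_file`, and `format_progress` are all hypothetical names introduced for this example; only `SpillingProgress`, its two fields, and `spilling_progress()` come from the proposal.

```rust
use std::sync::atomic::{AtomicU64, AtomicUsize, Ordering};
use std::sync::Arc;

/// The snapshot struct from the proposal.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct SpillingProgress {
    /// Total bytes currently used on disk for spilling
    pub current_bytes: u64,
    /// Total number of active spill files
    pub active_files_count: usize,
}

/// Hypothetical stand-in for `DiskManager`.
#[derive(Default)]
pub struct DiskManagerSketch {
    used_disk_space: AtomicU64,
    active_files_count: AtomicUsize,
}

impl DiskManagerSketch {
    /// Called when a new spill file is created.
    pub fn register_file(&self) {
        self.active_files_count.fetch_add(1, Ordering::Relaxed);
    }

    /// Called whenever a spill file grows by `bytes`.
    pub fn grow(&self, bytes: u64) {
        self.used_disk_space.fetch_add(bytes, Ordering::Relaxed);
    }

    /// Called when a spill file of size `bytes` is deleted.
    pub fn release_file(&self, bytes: u64) {
        self.used_disk_space.fetch_sub(bytes, Ordering::Relaxed);
        self.active_files_count.fetch_sub(1, Ordering::Relaxed);
    }

    /// Cheap, lock-free snapshot of the current global state.
    pub fn spilling_progress(&self) -> SpillingProgress {
        SpillingProgress {
            current_bytes: self.used_disk_space.load(Ordering::Relaxed),
            active_files_count: self.active_files_count.load(Ordering::Relaxed),
        }
    }
}

/// Hypothetical stand-in for `RuntimeEnv`: it only delegates.
pub struct RuntimeEnvSketch {
    pub disk_manager: Arc<DiskManagerSketch>,
}

impl RuntimeEnvSketch {
    pub fn spilling_progress(&self) -> SpillingProgress {
        self.disk_manager.spilling_progress()
    }
}

/// What a UI poller might render from one snapshot.
pub fn format_progress(p: SpillingProgress) -> String {
    format!("spilling: {} bytes across {} file(s)", p.current_bytes, p.active_files_count)
}
```

In this sketch the two counters are read separately, so a snapshot taken while a file is being released may briefly show a byte count and file count from slightly different moments. For a progress display that is likely acceptable; a caller needing a strictly consistent pair would have to guard both fields with a lock.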
