Add nested file-scan pipeline (stream of per-file streams)#91
robertbuessow wants to merge 14 commits into main from
Conversation
Replace the to_arrow() code path with a custom file-parallel pipeline that processes N parquet files concurrently while preserving strict file-then-row ordering for Julia consumption.

Architecture:
- plan_files() provides an ordered list of FileScanTasks
- Each file task runs as a background tokio task with its own mpsc channel and a per-file Semaphore(100MB) for backpressure
- FuturesOrdered yields per-file receivers in file order; the consumer drains file 0 batch-by-batch, then file 1, etc.
- Each file uses the same iceberg-rs ArrowReader code path as before

Changes:
- New: ordered_file_pipeline.rs (~420 lines incl. profiling stats)
- full.rs: IcebergScan gains file_io, batch_size, file_concurrency fields; iceberg_arrow_stream uses plan_files() + pipeline
- scan_common.rs / incremental.rs: pass through new fields in macros
- Temporary PipelineStats with FFI print_summary for benchmarking

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Exposes a new iceberg_file_scan_stream / scan_nested! API that yields one IcebergFileScan per file, each carrying the filename, record count, and a prefetched inner IcebergArrowStream of Arrow IPC batches. The existing flat scan! path is kept by re-implementing it as a thin flattening wrapper over the same nested pipeline, so all ordering, backpressure, and stats logic lives in one place.

Key changes:
- ordered_file_pipeline: add FileScan (internal type), make_file_stream, spawn_file_task_with_meta, run_nested / create_nested_pipeline; rewrite create_pipeline to flatten the nested pipeline. Switch run_nested from FuturesOrdered to FuturesUnordered (spawn futures resolve immediately, so ordering is irrelevant here).
- table.rs: add IcebergFileScan (#[repr(C)]), IcebergFileScanStream, IcebergFileScanResponse; add iceberg_file_scan_free, iceberg_file_scan_stream_free, iceberg_file_scan_record_count, iceberg_file_scan_filename (sync getters), iceberg_next_file_scan (async).
- full.rs: add iceberg_file_scan_stream export op; make IcebergScan.file_io non-optional (FileIO instead of Option<FileIO>): it is always set at construction, so use .clone() in stream ops instead of .take().
- Julia: scan_nested!, nested_arrow_stream, next_file_scan, file_scan_record_count, file_scan_filename, file_scan_arrow_stream, free_file_scan!, free_file_scan_stream! exported from RustyIceberg.

Labels: dismiss-release-notes build:benchmark

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Replace all inline iceberg::Error::new(iceberg::ErrorKind::Unexpected, ...) callsites with a crate-wide pub(crate) helper in lib.rs, removing the boilerplate from catalog.rs, lib.rs, and ordered_file_pipeline.rs. Labels: dismiss-release-notes build:benchmark Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Add add_elapsed, store_elapsed, and async timed methods to PipelineStats to eliminate the repeated start/record/elapsed boilerplate. The three async phases (fetch+decode, serialize, semaphore) now read as single timed() expressions; the sync reader-setup phase uses add_elapsed directly. Labels: dismiss-release-notes build:benchmark Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Tests (test/scan_tests.jl):
- Basic iteration: verify each FileScan has a non-empty .parquet filename,
record_count > 0, and the nations table totals exactly 25 records.
- Row count consistency: nested and flat pipelines over the customer table
must produce identical total row counts.
- Correct data: nested scan over nations produces the same rows as the
existing flat-scan reference test.
- Builder methods: select_columns! and with_batch_size! are respected
per-batch in the nested path.
- Safe early drop: abandon a FileScan mid-stream and drop the outer stream
early; background file tasks see the dropped receiver and exit cleanly.
Also fixes the timed! macro call for the serialize phase: the block form
{ expr.await.map_err(...)? } was incorrectly evaluating to () before the
macro's own .await ran. The .map_err chains now appear outside the macro
invocation where they belong.
Labels: dismiss-release-notes build:benchmark
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…ll time

PipelineStats unit tests (12 tests):
- reset_clears_every_field, task_start/end concurrency tracking, peak high-water mark, buffer add/release/peak, add_elapsed accumulation, store_elapsed overwrite, field isolation.

Dispatch backpressure metric:
- Rename dead consumer_wait_ns to file_dispatch_wait_ns. Track the time run_nested spends blocked on the outer channel (slow consumer) using timed!(file_dispatch_wait_ns, tx.send(...)). Shown as "dispatch stall" in print_summary.

Nested pipeline wall time:
- run_nested now records pipeline_wall_ns when it finishes dispatching all FileScan handles. For the flat path, run_flat overwrites it with the full end-to-end time; for the nested path this gives a non-zero throughput in print_summary instead of always showing 0.

Labels: dismiss-release-notes build:benchmark

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Adds a new scan configuration parameter that controls how many completed FileScan items are queued in the outer channel of the nested pipeline, decoupling file-I/O concurrency from how far ahead Rust prefetches for the Julia consumer.

Previously the outer channel capacity was hardcoded to `file_concurrency`, coupling the two. Now they are independent: `file_concurrency` sets how many files are processed in parallel, while `file_prefetch_depth` controls how many ready FileScan items can buffer up before run_nested blocks. When `file_prefetch_depth` is 0 (the default), it falls back to `file_concurrency`, preserving existing behaviour.

Changes:
- IcebergScan/IcebergIncrementalScan: add file_prefetch_depth field
- scan_common.rs: propagate the field through all struct-reconstructing macros; add impl_with_file_prefetch_depth! macro
- full.rs: wire the iceberg_scan_with_file_prefetch_depth FFI function; resolve prefetch_depth in both iceberg_arrow_stream and iceberg_file_scan_stream
- ordered_file_pipeline: add a prefetch_depth parameter to create_nested_pipeline and create_pipeline; use it for the outer FileScan channel capacity
- Julia: with_file_prefetch_depth! was already defined; move the export to the correct line

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
```rust
let scan_ptr = unsafe { &mut *scan };
let scan_ref = &scan_ptr.scan;
if scan_ref.is_none() {
    return Err(anyhow::anyhow!("Scan not initialized"));
}

let concurrency = scan_ptr.file_concurrency;
let concurrency = if concurrency == 0 {
    std::thread::available_parallelism()
        .map(|n| n.get())
        .unwrap_or(1)
} else {
    concurrency
};

let file_io = scan_ptr.file_io.clone();
let batch_size = scan_ptr.batch_size;
let prefetch_depth = if scan_ptr.file_prefetch_depth == 0 {
    concurrency
} else {
    scan_ptr.file_prefetch_depth
};
```
Optional: would it make sense to extract a helper for this to use here and above (lines 151-174)?
Done — extracted into resolve_pipeline_params in the eaa30d4 commit.
```rust
    format!("Serialization task panicked: {}", e),
)),
Ok(Err(e)) => Err(unexpected(e)),
Err(e) => Err(unexpected(format!("Serialization task panicked: {e}"))),
```
So `format!("Serialization task panicked: {e}")` is the same as `format!("Serialization task panicked: {}", e)`?
Yes, they're identical at runtime. `{e}` is Rust 1.58 "captured identifier" syntax: it implicitly captures the local variable `e` as the format argument, so you don't need to pass it explicitly after the format string. Pure style preference; I tend to use it for single-variable cases to keep the string self-contained.
```diff
@@ -0,0 +1,425 @@
+// ===========================================================================
+// Temporary profiling — will be removed before merging to production.
```
Perhaps change this comment. Would be nice to keep this for now at least.
Updated in eaa30d4 — the comment now describes the module as permanent pipeline statistics rather than temporary profiling.
- Extract resolve_pipeline_params() in full.rs to deduplicate the concurrency/prefetch_depth/file_io/batch_size resolution logic shared by the iceberg_arrow_stream and iceberg_file_scan_stream sync blocks
- Update the pipeline_stats.rs header comment to remove the "temporary / will be removed" framing; the module is intended to stay

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
```rust
// starving the tokio executor.
let serialized = timed!(
    serialize_ns,
    tokio::task::spawn_blocking(move || crate::serialize_record_batch(batch))
```
Wondering here whether, as done in the export work, it would make sense to have a thread pool of N workers that do the serialization work idle in the background?
Done. Added a global `rayon::ThreadPool`.
```rust
// ===========================================================================
// Pipeline statistics — timing and throughput counters for the file-parallel
// scan pipeline. Global atomics accumulate data across all file tasks and are
// readable from Julia via `@ccall iceberg_print_pipeline_stats()`.
```
Shall we expose that as a Julia function as well, or do you prefer users to use the ccall macro?
Added print_pipeline_stats() and reset_pipeline_stats() Julia wrappers in full.jl, both exported from the module.
Replace `tokio::task::spawn_blocking` with a process-global `rayon::ThreadPool` (sized to available CPU cores) for Arrow IPC serialization. All concurrent file tasks share the same pool via `SERIALIZE_POOL`, keeping N worker threads warm for the lifetime of the process rather than borrowing from tokio's unbounded blocking pool on each batch. Also add `print_pipeline_stats()` and `reset_pipeline_stats()` Julia wrappers in `full.jl`, both exported from the module. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Codecov Report

❌ Patch coverage is

Additional details and impacted files

```
@@            Coverage Diff             @@
##             main      #91      +/-   ##
==========================================
+ Coverage   83.86%   83.89%   +0.03%
==========================================
  Files           9        9
  Lines         849      913      +64
==========================================
+ Hits          712      766      +54
- Misses        137      147      +10
```

☔ View full report in Codecov by Sentry.
Summary
- `scan_nested!` / `iceberg_file_scan_stream` returns an outer `FileScanStream` that yields one `IcebergFileScan` per file. Each scan carries the filename, record count, and a prefetched inner `IcebergArrowStream` of Arrow IPC batches.
- The existing flat `scan!` is re-implemented as a thin flattening wrapper over the same nested pipeline so all ordering, backpressure, and stats logic lives in one place. `scan!` call sites and tests are unchanged.
- `IcebergScan.file_io` made non-optional: it was always set at construction; removing `Option<>` enforces this statically and eliminates dead `.take().ok_or_else(...)` guards in both stream ops.
- `unexpected()` crate-wide helper: replaces all inline `iceberg::Error::new(iceberg::ErrorKind::Unexpected, ...)` in `catalog.rs`, `lib.rs`, and `ordered_file_pipeline.rs`.
- `timed!` macro: replaces the `let start` / `.await` / `add_elapsed` pattern for the three async phases (fetch+decode, serialize, semaphore). The block form `timed!(field, { expr })` keeps the important computation visually prominent.
- `PipelineStats` improvements: `add_elapsed` / `store_elapsed` helpers; a `file_dispatch_wait_ns` metric tracks outer-channel backpressure (time `run_nested` is blocked waiting for the consumer to call `next_file_scan`); `pipeline_wall_ns` is now also set for the nested path (previously always 0, breaking the throughput display).
- Tests: unit tests for `PipelineStats` methods; `@testset` blocks covering basic iteration, `nested_arrow_stream` called directly, row-count consistency vs the flat scan, correct data, builder-method compatibility, and safe early drop.

New Julia API
Ownership rules
- `file_scan_arrow_stream` returns a borrowed pointer into the `IcebergFileScan` struct. Callers must not call `free_stream` on it; `free_file_scan!` handles cleanup.
- Dropping a `FileScan` without reading all inner batches is safe: the dropped receiver causes the background file task to exit cleanly. Dropping the `FileScanStream` without consuming all files is equally safe.

Test plan
- `make test` passes (all existing tests unchanged)
- `cargo test` passes (12 `PipelineStats` unit tests)

Based on @hall-alex's sorted iterator change.
Overarching task tracking Iceberg-load optimizations: RAI-49519.
🤖 Generated with Claude Code