Conversation
Update iceberg and iceberg-catalog-rest to rev 418213731e91544f5eb31a3efa459e88f599030e, which includes the fix for incremental scans with from=None silently dropping EXISTING manifest entries from expired snapshots. Also pulls in arrow/parquet v58.1.0 via the updated lock file. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
iceberg-rust rev 418213731e91 depends on arrow/parquet v58.1.0; keeping the FFI crate on 57.x caused two incompatible versions of RecordBatch to be linked, failing to compile. Align all arrow-* and parquet pins to "58.1". Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Accepts Vector{String} with optional validity BitVector, handling pointer
extraction and GC preservation internally. The low-level ptr/len overload
is retained for performance-sensitive callers with pre-allocated buffers.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
I am seeing good throughput improvements locally (some results are noisy due to local execution, but all workloads consistently show improvements, more so for numeric types due to batched gathering on the Rust side).
…diate Vec: Build StringArray directly with OffsetBuffer + values Buffer in a single pass, replacing Vec<Option<&str>> + StringArray::from. Use new_unchecked to skip Arrow's UTF-8 re-validation; Julia strings are guaranteed valid UTF-8. For 20M x 32-byte strings this eliminates ~320 MB of intermediate Vec<Option<&str>> storage and ~640 MB of redundant UTF-8 validation reads per column. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
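A minimal sketch of that single-pass construction, assuming the strings arrive from Julia as raw (pointer, byte-length) pairs that are already valid UTF-8; the function name and the omission of validity handling are illustrative, not the actual FFI code.

```rust
use arrow_array::StringArray;
use arrow_buffer::{Buffer, OffsetBuffer, ScalarBuffer};

/// Build a StringArray from raw (pointer, length) pairs in one pass:
/// bytes are appended to a single values buffer while offsets are recorded,
/// so no intermediate Vec<Option<&str>> is materialized.
unsafe fn build_string_array(ptrs: &[*const u8], lens: &[usize]) -> StringArray {
    let total: usize = lens.iter().sum();
    let mut values: Vec<u8> = Vec::with_capacity(total);
    let mut offsets: Vec<i32> = Vec::with_capacity(ptrs.len() + 1);
    offsets.push(0);
    for (&p, &len) in ptrs.iter().zip(lens) {
        values.extend_from_slice(std::slice::from_raw_parts(p, len));
        offsets.push(values.len() as i32);
    }
    // new_unchecked skips Arrow's UTF-8 re-validation; the caller (Julia)
    // guarantees the bytes are valid UTF-8.
    StringArray::new_unchecked(
        OffsetBuffer::new(ScalarBuffer::from(offsets)),
        Buffer::from(values),
        None, // validity/null buffer handled separately when one is supplied
    )
}
```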
robertbuessow
left a comment
Not sure I understood everything, but I think it's good enough. Some nits, plus it would be good to improve error handling between Rust and Julia.
- Extract encode worker loop body into encode_worker_loop()
- Retain panic message: downcast Box<dyn Any> to &str / String before formatting
- Add iceberg_take_gather_error() FFI + thread-local to surface gather errors immediately in Julia exceptions rather than deferring to writer close
- Clarify lengths_ptr doc: array of byte lengths per string
- Merge identical sequential/scattered validity branches in build_null_buffer_scattered
- Add explanatory comments: bitvector merging, re-alignment, all-valid bit-set
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
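A hedged sketch of the two error-handling pieces above: recovering a readable message from a panic payload, and parking gather errors in a thread-local that an FFI getter hands back to Julia. Only the name iceberg_take_gather_error() comes from the commit; the signature, ownership handling, and helper names are assumptions.

```rust
use std::any::Any;
use std::cell::RefCell;
use std::ffi::CString;
use std::os::raw::c_char;

thread_local! {
    // Gather runs on the calling (Julia) thread, so a thread-local slot is
    // enough to hand the error back on the same thread that triggered it.
    static LAST_GATHER_ERROR: RefCell<Option<CString>> = RefCell::new(None);
}

/// Turn the payload of a caught panic back into a human-readable string.
fn panic_message(payload: Box<dyn Any + Send>) -> String {
    if let Some(s) = payload.downcast_ref::<&str>() {
        (*s).to_string()
    } else if let Some(s) = payload.downcast_ref::<String>() {
        s.clone()
    } else {
        "worker panicked with a non-string payload".to_string()
    }
}

fn record_gather_error(msg: String) {
    LAST_GATHER_ERROR.with(|slot| {
        *slot.borrow_mut() = CString::new(msg).ok();
    });
}

/// Return and clear the last gather error for this thread, or null if none.
/// The real binding would also need to define who frees the returned string.
#[no_mangle]
pub extern "C" fn iceberg_take_gather_error() -> *const c_char {
    LAST_GATHER_ERROR.with(|slot| match slot.borrow_mut().take() {
        Some(msg) => msg.into_raw() as *const c_char,
        None => std::ptr::null(),
    })
}
```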
Codecov Report ❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #90 +/- ##
==========================================
- Coverage 84.42% 82.81% -1.61%
==========================================
Files 9 9
Lines 873 966 +93
==========================================
+ Hits 737 800 +63
- Misses 136 166 +30
Export improvements
This PR reworks the Parquet/Iceberg write path for throughput, cleanliness, and configurability.
New write APIs
- WriterConfig struct with configurable Parquet properties: compression codec, dictionary encoding, plain encoding, row group size, page size, write batch size, and statistics (see the sketch after this list)
- ColumnBatch / write_columns: zero-copy write path for flat column buffers (bypasses Arrow IPC serialization)
- GatheredBatch / write_columns: gathered-column write path that assembles columns from scattered slices (selection vectors + validity bitmaps) directly in the calling thread
- set_encode_workers! to configure the encode thread pool size before first use
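On the Rust side, these options plausibly map onto the parquet crate's WriterProperties builder. A rough sketch under that assumption follows; the WriterConfig field names and the mapping function are illustrative, only the set of configurable properties comes from this PR.

```rust
use parquet::basic::{Compression, Encoding};
use parquet::file::properties::{EnabledStatistics, WriterProperties};

// Illustrative Rust-side mirror of the Julia WriterConfig; field names are assumed.
struct WriterConfig {
    compression: Compression,
    dictionary_enabled: bool,
    plain_encoding: bool,
    row_group_size: usize,
    page_size: usize,
    write_batch_size: usize,
    statistics: EnabledStatistics,
}

fn to_writer_properties(cfg: &WriterConfig) -> WriterProperties {
    let mut builder = WriterProperties::builder()
        .set_compression(cfg.compression)
        .set_dictionary_enabled(cfg.dictionary_enabled)
        .set_max_row_group_size(cfg.row_group_size)
        .set_data_page_size_limit(cfg.page_size)
        .set_write_batch_size(cfg.write_batch_size)
        .set_statistics_enabled(cfg.statistics);
    if cfg.plain_encoding {
        builder = builder.set_encoding(Encoding::PLAIN);
    }
    builder.build()
}
```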
Global encode worker pool
A single pool of N OS threads (default: Sys.CPU_THREADS) is shared across all writers. Each write_columns/write call gathers or serializes data in the calling thread, submits a RecordBatch to the pool, and returns immediately; encode and Parquet I/O run on pool threads. Per-writer ordering is preserved via a Mutex<ConcreteDataFileWriter> in the shared WriterState: only one pool thread encodes a given writer at a time. close_writer waits for all in-flight tasks to drain before finalising the file.
This design lets Julia pipeline the gather/serialize step on the main thread with encode work happening concurrently on pool threads, rather than blocking end-to-end on each write.
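A simplified sketch of that pool design, using a plain mpsc channel and stand-in types: the names, the Vec<String> placeholder writer, and the busy-wait drain are illustrative simplifications, not the crate's actual implementation.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::mpsc::{channel, Sender};
use std::sync::{Arc, Mutex, OnceLock};
use std::thread;

/// Stand-in for the shared per-writer state; the Mutex plays the role of
/// Mutex<ConcreteDataFileWriter>: only one pool thread encodes for a given
/// writer at a time.
struct WriterState {
    writer: Mutex<Vec<String>>, // placeholder for the real Parquet writer
    in_flight: AtomicUsize,
}

type Task = (Arc<WriterState>, String); // (writer, already-gathered batch)

fn encode_pool() -> &'static Mutex<Sender<Task>> {
    static POOL: OnceLock<Mutex<Sender<Task>>> = OnceLock::new();
    POOL.get_or_init(|| {
        let (tx, rx) = channel::<Task>();
        let rx = Arc::new(Mutex::new(rx)); // one receiver shared across workers
        let workers = thread::available_parallelism().map(|n| n.get()).unwrap_or(1);
        for _ in 0..workers {
            let rx = Arc::clone(&rx);
            thread::spawn(move || loop {
                // Take the next task; the receiver lock is released before
                // encoding so other workers can keep pulling tasks.
                let task = rx.lock().unwrap().recv();
                let Ok((state, batch)) = task else { return };
                state.writer.lock().unwrap().push(batch); // "encode + write"
                state.in_flight.fetch_sub(1, Ordering::SeqCst);
            });
        }
        Mutex::new(tx)
    })
}

/// Called from the writing thread after gathering/serializing a batch:
/// hand it to the pool and return immediately.
fn submit(state: &Arc<WriterState>, batch: String) {
    state.in_flight.fetch_add(1, Ordering::SeqCst);
    encode_pool().lock().unwrap().send((Arc::clone(state), batch)).unwrap();
}

/// Wait for every in-flight batch of this writer to be encoded before the
/// file is finalised.
fn close_writer(state: &Arc<WriterState>) {
    while state.in_flight.load(Ordering::SeqCst) != 0 {
        thread::yield_now();
    }
}
```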
Tests
Added tests for ColumnBatch, GatheredBatch (including scattered slices, nullable columns, string columns), decimal types (Int32/Int64/Int128 backing), and all WriterConfig Parquet properties.