Export improvements by gbrgr · Pull Request #90 · RelationalAI/RustyIceberg.jl

gbrgr · 2026-04-28T14:58:03Z

Export improvements

This PR reworks the Parquet/Iceberg write path for throughput, cleanliness, and configurability.

New write APIs

WriterConfig struct with configurable Parquet properties: compression codec, dictionary encoding, plain encoding, row group size, page size, write batch size, and statistics
ColumnBatch / write_columns — zero-copy write path for flat column buffers (bypasses Arrow IPC serialization)
GatheredBatch / write_columns — gathered-column write path that assembles columns from scattered slices (selection vectors + validity bitmaps) directly in the calling thread
set_encode_workers! to configure the encode thread pool size before first use

Global encode worker pool
A single pool of N OS threads (default: Sys.CPU_THREADS) is shared across all writers. Each write_columns / write call gathers or serializes data in the calling thread, submits a RecordBatch to the pool, and returns immediately — encode and Parquet I/O run on pool threads. Per-writer ordering is preserved via a Mutex<ConcreteDataFileWriter> in the shared WriterState: only one pool thread encodes a given writer at a time. close_writer waits for all in-flight tasks to drain before finalising the file.

This design lets Julia pipeline the gather/serialize step on the main thread with encode work happening concurrently on pool threads, rather than blocking end-to-end on each write.

Tests
Added tests for ColumnBatch, GatheredBatch (including scattered slices, nullable columns, string columns), decimal types (Int32/Int64/Int128 backing), and all WriterConfig Parquet properties.

Update iceberg and iceberg-catalog-rest to rev 418213731e91544f5eb31a3efa459e88f599030e, which includes the fix for incremental scans with from=None silently dropping EXISTING manifest entries from expired snapshots. Also pulls in arrow/parquet v58.1.0 via the updated lock file. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

iceberg-rust rev 418213731e91 depends on arrow/parquet v58.1.0; keeping the FFI crate on 57.x caused two incompatible versions of RecordBatch to be linked, failing to compile. Align all arrow-* and parquet pins to "58.1". Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Accepts Vector{String} with optional validity BitVector, handling pointer extraction and GC preservation internally. The low-level ptr/len overload is retained for performance-sensitive callers with pre-allocated buffers. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

gbrgr · 2026-04-29T15:16:12Z

I am seeing good throughput improvements locally:

# Run on 2026-04-29 17:13:08.052
# Apple M3 Pro, 8 thread(s), -O2
# ========================================================================================================
# Name                                                    Pager        Mem   Allocs Throughput        Time
# iceberg_export:Int,Int,Int,Int GNF                  172.80MiB   37.22MiB  371.48K 3412.2MB/s     178.9ms
# iceberg_export:Int,Int,Int,Int GNF_Null(30.0%)      144.00MiB   71.56MiB    1.65M 2038.4MB/s     299.4ms
# iceberg_export:Int,Int,Int,Int WIDE                   0.00MiB    6.53MiB   89.52K 5615.9MB/s     108.7ms
# iceberg_export:Float64,Float64,Float64,Float64 GNF  172.80MiB   37.38MiB  371.87K 3240.3MB/s     188.4ms
# iceberg_export:Float64,...,Float64 GNF_Null(30.0%)  144.00MiB   71.37MiB    1.65M 1925.0MB/s     317.1ms
# iceberg_export:Float64,...t64,Float64,Float64 WIDE    0.00MiB    7.25MiB   89.83K 5244.7MB/s     116.4ms
# iceberg_export:Int,Float64,FD{Int64,4},Int128 GNF   172.80MiB   37.91MiB  371.74K 2313.7MB/s     329.7ms
# iceberg_export:Int,Floa...},Int128 GNF_Null(30.0%)  144.00MiB   71.78MiB    1.64M 2058.4MB/s     370.6ms
# iceberg_export:Int,Float64,FD{Int64,4},Int128 WIDE    0.00MiB    6.55MiB   90.52K 3319.6MB/s     229.8ms
# iceberg_export:Int,Float64,vstr,Int128 GNF          172.80MiB    1.17GiB   20.43M 1018.1MB/s    1199.0ms
# iceberg_export:Int,Floa...r,Int128 GNF_Null(30.0%)  144.00MiB  901.95MiB   15.80M 1012.7MB/s    1205.4ms
# iceberg_export:Int,Float64,vstr,Int128 WIDE           0.00MiB    1.27GiB   20.15M 1354.6MB/s     901.2ms
# iceberg_export:vstr,vstr,vstr,vstr GNF              211.20MiB    4.49GiB   80.53M  764.2MB/s    3194.6ms
# iceberg_export:vstr,vstr,vstr,vstr GNF_Null(30.0%)  172.80MiB    3.33GiB   58.08M 1133.3MB/s    2154.2ms
# iceberg_export:vstr,vstr,vstr,vstr WIDE               0.00MiB    3.05GiB   20.23M 1043.5MB/s    2339.7ms

vs.

# Run on 2026-04-29 10:36:06.658
# Apple M3 Pro, 8 thread(s), -O2
# ========================================================================================================
# Name                                                    Pager        Mem   Allocs Throughput        Time
# iceberg_export:Int,Int,Int,Int GNF                  172.80MiB  660.68MiB  539.06K  841.5MB/s     725.3ms
# iceberg_export:Int,Int,Int,Int GNF_Null(30.0%)      144.00MiB   76.06MiB    1.89M  680.1MB/s     897.5ms
# iceberg_export:Int,Int,Int,Int WIDE                   0.00MiB    5.90MiB  118.44K 1128.7MB/s     540.7ms
# iceberg_export:Float64,Float64,Float64,Float64 GNF  172.80MiB  660.68MiB  539.14K  840.6MB/s     726.1ms
# iceberg_export:Float64,...,Float64 GNF_Null(30.0%)  144.00MiB   76.07MiB    1.89M  628.9MB/s     970.5ms
# iceberg_export:Float64,...t64,Float64,Float64 WIDE    0.00MiB    5.90MiB  118.52K 1054.6MB/s     578.8ms
# iceberg_export:Int,Float64,FD{Int64,4},Int128 GNF   172.80MiB  813.85MiB  539.35K  838.1MB/s     910.3ms
# iceberg_export:Int,Floa...},Int128 GNF_Null(30.0%)  144.00MiB   77.22MiB    1.93M  712.7MB/s    1070.6ms
# iceberg_export:Int,Float64,FD{Int64,4},Int128 WIDE    0.00MiB    5.97MiB  119.19K 1125.0MB/s     678.2ms
# iceberg_export:Int,Float64,vstr,Int128 GNF          172.80MiB    2.26GiB   20.57M  942.2MB/s    1295.6ms
# iceberg_export:Int,Floa...r,Int128 GNF_Null(30.0%)  144.00MiB    1.37GiB   15.96M  785.7MB/s    1553.6ms
# iceberg_export:Int,Float64,vstr,Int128 WIDE           0.00MiB    1.70GiB   20.17M  991.6MB/s    1231.0ms
# iceberg_export:vstr,vstr,vstr,vstr GNF              211.20MiB    6.62GiB   80.67M 1025.7MB/s    2380.2ms
# iceberg_export:vstr,vstr,vstr,vstr GNF_Null(30.0%)  172.80MiB    5.31GiB   57.99M  843.2MB/s    2895.2ms
# iceberg_export:vstr,vstr,vstr,vstr WIDE               0.00MiB    9.00GiB  100.31M  310.4MB/s        7.9s

(some results are noisy due to local execution, but all workloads consistently show improvements. More so for numeric types due to batched gathering on the Rust side).

…diate Vec Build StringArray directly with OffsetBuffer + values Buffer in a single pass, replacing Vec<Option<&str>> + StringArray::from. Use new_unchecked to skip Arrow's UTF-8 re-validation — Julia strings are guaranteed valid UTF-8. For 20M x 32-byte strings this eliminates ~320 MB of intermediate Vec<Option<&str>> storage and ~640 MB of redundant UTF-8 validation reads per column. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

robertbuessow

Not sure I understood everything but I think it's good enough. Some nits + it would be good to improve error handling between rust and julia.

- Extract encode worker loop body into encode_worker_loop() - Retain panic message: downcast Box<dyn Any> to &str / String before formatting - Add iceberg_take_gather_error() FFI + thread-local to surface gather errors immediately in Julia exceptions rather than deferring to writer close - Clarify lengths_ptr doc: array of byte lengths per string - Merge identical sequential/scattered validity branches in build_null_buffer_scattered - Add explanatory comments: bitvector merging, re-alignment, all-valid bit-set Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

codecov-commenter · 2026-04-30T11:59:20Z

⚠️ Please install the to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

❌ Patch coverage is 71.42857% with 30 lines in your changes missing coverage. Please review.
✅ Project coverage is 82.81%. Comparing base (cdce472) to head (7a8947a).
⚠️ Report is 2 commits behind head on main.

Files with missing lines	Patch %	Lines
src/writer.jl	71.42%	30 Missing ⚠️
❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files

@@            Coverage Diff             @@
##             main      #90      +/-   ##
==========================================
- Coverage   84.42%   82.81%   -1.61%     
==========================================
  Files           9        9              
  Lines         873      966      +93     
==========================================
+ Hits          737      800      +63     
- Misses        136      166      +30

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

gbrgr and others added 16 commits April 21, 2026 19:45

Bump version

8c730b2

Add optimizations

029fe57

Add tests

2617614

Bump version

a2c0169

Merge branch 'main' into gb/export-improvements

57ab408

Fix test

218f426

format

a1a78ba

Rename fn

3e46312

.

b2309f2

.

8dbd1e6

Fix naming

3d43275

Add test

b2ae197

Rename fn

66880e8

.

d3fe00e

gbrgr marked this pull request as ready for review April 28, 2026 17:07

gbrgr and others added 3 commits April 28, 2026 19:12

Disallow changing pool size

8ce862c

Update comment

bec8fa5

gbrgr requested review from hall-alex and robertbuessow April 29, 2026 11:29

robertbuessow approved these changes Apr 30, 2026

View reviewed changes

gbrgr and others added 2 commits April 30, 2026 13:51

Rename build_null_buffer_scattered → build_null_buffer_gathered

7a8947a

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

gbrgr added 2 commits April 30, 2026 14:48

Remove code dup

7aa1836

Rename fn

b820563

gbrgr added 2 commits April 30, 2026 14:57

.

c1dc692

.

5d525a2

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Export improvements#90

Export improvements#90
gbrgr wants to merge 26 commits intomainfrom
gb/export-improvements

gbrgr commented Apr 28, 2026 •

edited

Loading

Uh oh!

gbrgr commented Apr 29, 2026

Uh oh!

robertbuessow left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

codecov-commenter commented Apr 30, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

gbrgr commented Apr 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!