GH-948: Use buffer indexing for UUID vector #949

jhrotko · 2026-01-07T18:05:15Z

What's Changed

The current UUID vector implementation creates new buffer slices when reading values through holders, which has several drawbacks:

- Memory overhead: Each slice creates a new ArrowBuf object
- Performance impact: Buffer slicing is slower than direct buffer indexing
- Inconsistency: Other fixed-width types (like Decimal) use buffer indexing with a start offset field

Proposed Changes

- Add a start field to UUID holders to track buffer offsets:
  - UuidHolder: Add public int start = 0;
  - NullableUuidHolder: Add public int start = 0;
- Update UuidVector to use buffer indexing
- Update readers and writers
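
As a rough illustration of the buffer-indexing idea in the list above, here is a minimal sketch. It is not the PR's code: `java.nio.ByteBuffer` stands in for ArrowBuf, and `OffsetUuidHolder` is a hypothetical class that only mirrors the proposed `buffer`/`start` fields. The point is that a read stores a shared reference plus an offset instead of allocating a slice object.

```java
import java.nio.ByteBuffer;
import java.util.UUID;

// Hypothetical stand-ins: ByteBuffer plays the role of ArrowBuf, and this
// holder only mirrors the proposed buffer/start fields.
final class OffsetUuidHolder {
    static final int UUID_BYTE_WIDTH = 16;

    int isSet;          // 1 when the holder points at a value
    ByteBuffer buffer;  // shared reference to the data buffer, not a copy
    int start;          // byte offset of this UUID inside buffer

    // Point the holder at element `index` without creating a slice object.
    static void read(ByteBuffer data, int index, OffsetUuidHolder holder) {
        holder.isSet = 1;
        holder.buffer = data;
        holder.start = index * UUID_BYTE_WIDTH;
    }

    // Decode lazily, only when a java.util.UUID is actually needed.
    UUID toUuid() {
        // Absolute reads in ByteBuffer's default big-endian order.
        return new UUID(buffer.getLong(start), buffer.getLong(start + 8));
    }

    public static void main(String[] args) {
        ByteBuffer data = ByteBuffer.allocate(3 * UUID_BYTE_WIDTH);
        UUID second = UUID.fromString("123e4567-e89b-12d3-a456-426614174000");
        data.putLong(16, second.getMostSignificantBits());
        data.putLong(24, second.getLeastSignificantBits());

        OffsetUuidHolder holder = new OffsetUuidHolder();
        read(data, 1, holder);
        System.out.println(holder.start + " " + holder.toUuid());
    }
}
```

Reading element 1 sets `start` to 16 and leaves the 16 value bytes untouched until `toUuid()` is called.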

Related Work

Original UUID extension type implementation: Introduce type support for UUID as a canonical extension type #825 (GH-825: Add UUID canonical extension type #903)

Closes #948

jbonofre · 2026-01-07T18:13:26Z

Should we use a start offset? Why not have different vectors? Just wondering...

lidavidm

This also means that a holder can inadvertently extend the lifetime of a vector's backing buffer, right? (I suppose slicing would as well.) If we care about efficiency maybe storing two longs is better?

jbonofre · 2026-01-08T04:54:28Z

@lidavidm yes, that's my point as well. I wonder if we should not "square" the vector (with two "edges").

gszadovszky · 2026-01-08T10:46:23Z

The current approach just follows what we have for variable-length binaries (e.g. VarChar). I'm not against the two-longs approach, but any argument against this one applies to the var-len holders as well.
Also, DecimalHolder is very similar to this one, and the two-longs approach would work in that case as well.

jhrotko · 2026-01-08T11:38:24Z

Thanks for the feedback. After looking at this more carefully, let me spell out my current understanding of these two options:

The main concern is that when a holder stores an ArrowBuf reference, it keeps the entire buffer alive even after the vector is closed. This is especially problematic if you save holders for later use: you end up keeping large buffers in memory when you only need a few values.

Did you mean something like this? Take this example:

// Scenario: Process UUIDs from a large batch, keep some for later
public class UuidProcessor {
    private List<UuidHolder> importantUuids = new ArrayList<>();
    
    public void processBatch(UuidVector vector) {
        // Vector has 100,000 UUIDs = 1.6 MB of data
        UuidHolder holder = new UuidHolder();
        
        for (int i = 0; i < vector.getValueCount(); i++) {
            vector.get(i, holder);
            
            // Check if this UUID is "important"
            if (isImportant(holder)) {
                
                UuidHolder saved = new UuidHolder();
                saved.buffer = holder.buffer;  // ← References entire 1.6 MB buffer!
                saved.start = holder.start;
                saved.isSet = holder.isSet;
                importantUuids.add(saved);
            }
        }
        
        // Done processing, close the vector
        vector.close();
        
        // Vector is closed, but buffer is not freed
        // Even if we only saved 10 UUIDs (160 bytes), we're keeping 1.6 MB alive
    }
}

sequenceDiagram
    participant App as Application
    participant Vector as UuidVector
    participant Buffer as ArrowBuf (1000 UUIDs)
    participant Holder as UuidHolder
    participant Allocator as BufferAllocator
    
    Note over App,Allocator: Step 1: Create vector with 1000 UUIDs
    App->>Vector: allocateNew(1000)
    Vector->>Allocator: allocate 16,000 bytes
    Allocator->>Buffer: Create buffer (refCount=1)
    Buffer-->>Vector: buffer reference
    
    Note over App,Allocator: Step 2: Read ONE UUID into holder
    App->>Vector: get(0, holder)
    Vector->>Holder: holder.buffer = vector.buffer
    Vector->>Holder: holder.start = 0
    Note over Buffer: refCount=2 (vector + holder)
    
    Note over App,Allocator: Step 3: Close vector (done with it)
    App->>Vector: close()
    Vector->>Buffer: release() - refCount=2→1
    Note over Buffer: ❌ Buffer NOT freed!<br/>Holder still references it
    
    Note over App,Allocator: Step 4: Application keeps holder
    Note over Holder: Holder only needs 16 bytes<br/>but keeps 16,000 bytes alive!
    Note over Buffer: ❌ 15,984 bytes wasted!
    
    Note over App,Allocator: Step 5: Eventually holder goes out of scope
    App->>Holder: (garbage collected)
    Holder->>Buffer: release() - refCount=1→0
    Buffer->>Allocator: free memory
    Note over Buffer: ✅ Finally freed

I initially modeled UUID after VarChar and Decimal. If this is the lifetime risk, then for VarCharHolder we accept it because copying can be very expensive, whereas for a UuidHolder copying is trivial, so there's no reason to take on the buffer lifetime risk.

lidavidm · 2026-01-08T13:41:50Z

I don't think Varchar is an appropriate comparison here since it is variable-length. UUID is a fixed length (known at compile-time, unlike fixed-length binary) so it can be simplified. I assume both memory-wise and performance-wise, storing two longs is preferable to slicing/copying/storing a reference + length, and less prone to issues with forgetting to retain/release the buffer.
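
To make the lifetime coupling concrete, here is a small sketch under simplifying assumptions: `java.nio.ByteBuffer` stands in for ArrowBuf, both holder classes are hypothetical, and overwriting the buffer simulates its memory being reused once the vector is done with it. It is only meant to show why a two-longs holder is independent of the backing buffer while a reference holder is not.

```java
import java.nio.ByteBuffer;
import java.util.UUID;

final class HolderLifetimeSketch {
    // Reference-style holder (hypothetical): valid only while the buffer is.
    static final class RefHolder {
        ByteBuffer buffer;
        int start;

        UUID toUuid() {
            return new UUID(buffer.getLong(start), buffer.getLong(start + 8));
        }
    }

    // Value-style holder (hypothetical): owns its 16 bytes as two longs.
    static final class ValueHolder {
        long mostSigBits;
        long leastSigBits;

        UUID toUuid() {
            return new UUID(mostSigBits, leastSigBits);
        }
    }

    public static void main(String[] args) {
        UUID original = UUID.fromString("123e4567-e89b-12d3-a456-426614174000");
        ByteBuffer data = ByteBuffer.allocate(16);
        data.putLong(0, original.getMostSignificantBits());
        data.putLong(8, original.getLeastSignificantBits());

        RefHolder ref = new RefHolder();
        ref.buffer = data;
        ref.start = 0;

        ValueHolder val = new ValueHolder();
        val.mostSigBits = data.getLong(0);
        val.leastSigBits = data.getLong(8);

        // Simulate the backing memory being reused after the vector is closed.
        data.putLong(0, 0L);
        data.putLong(8, 0L);

        System.out.println("ref still matches: " + ref.toUuid().equals(original));
        System.out.println("val still matches: " + val.toUuid().equals(original));
    }
}
```

The reference holder now decodes whatever the buffer currently contains, while the value holder still returns the original UUID.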

lidavidm · 2026-01-09T01:28:25Z

vector/src/test/java/org/apache/arrow/vector/complex/writer/TestComplexWriter.java

        UuidHolder uuidHolder = new UuidHolder();
        uuidReader.read(uuidHolder);
-        UUID actualUuid = UuidUtility.uuidFromArrowBuf(uuidHolder.buffer, 0);
+        UUID actualUuid = new UUID(uuidHolder.mostSigBits, uuidHolder.leastSigBits);


Is it worthwhile to just have a UUID getUUID() method on the holder?
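
For illustration, such a convenience method could look roughly like the sketch below. The class name `NullableUuidHolderSketch` is hypothetical, and returning `null` for an unset holder is an assumption made here, not something this thread specifies; only the `isSet`, `mostSigBits`, and `leastSigBits` field names come from the diff above.

```java
import java.util.UUID;

// Hypothetical holder; the null-on-unset behavior is an assumption.
final class NullableUuidHolderSketch {
    int isSet;          // 0 = null value, 1 = value present
    long mostSigBits;
    long leastSigBits;

    // Convenience accessor so callers don't rebuild the UUID by hand.
    UUID getUuid() {
        return isSet == 0 ? null : new UUID(mostSigBits, leastSigBits);
    }

    public static void main(String[] args) {
        UUID expected = UUID.fromString("123e4567-e89b-12d3-a456-426614174000");

        NullableUuidHolderSketch holder = new NullableUuidHolderSketch();
        System.out.println("unset holder: " + holder.getUuid());

        holder.isSet = 1;
        holder.mostSigBits = expected.getMostSignificantBits();
        holder.leastSigBits = expected.getLeastSignificantBits();
        System.out.println("set holder matches: " + expected.equals(holder.getUuid()));
    }
}
```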

lidavidm · 2026-01-09T01:30:05Z

vector/src/main/java/org/apache/arrow/vector/UuidVector.java

    } else {
-      return getBufferSlicePostNullCheck(index);
+      holder.isSet = 1;
+      final ByteBuffer bb = ByteBuffer.wrap(getUnderlyingVector().getObject(index));


Is it not possible to directly use ArrowBuf#getLong? (Or even MemoryUtil#getLong + a manual bounds check to avoid two bounds checks, though I'd expect the JIT would optimize that out)

Can you take a look again?

jhrotko · 2026-01-09T11:11:31Z

Let me do a quick benchmark just to compare both versions

jhrotko · 2026-01-09T22:57:22Z

Here are the results of benchmarking the two UUID holder implementations, (v1) buffer indexing from e35ed7c and (v2) two longs (MSB and LSB), across 5 scales (1k, 10k, 100k, 1M, 10M elements) for 7 different operations.

There were no major differences, with the exception of the getWithUuidHolder test, where v2 was roughly 60-67% slower. I can provide full data for all methods if requested.

| Scale | v1 (µs/op) | v2 (µs/op) | Diff (µs) | % Change | v2/v1 ratio |
| --- | --- | --- | --- | --- | --- |
| 1k | 1.000 | 1.666 | +0.666 | +66.6% | 1.67x slower |
| 10k | 9.954 | 16.462 | +6.508 | +65.4% | 1.65x slower |
| 100k | 102.048 | 162.946 | +60.898 | +59.7% | 1.60x slower |
| 1M | 1043.961 | 1669.155 | +625.194 | +59.9% | 1.60x slower |
| 10M | 10121.154 | 16718.845 | +6597.691 | +65.2% | 1.65x slower |
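
For anyone who wants to re-derive the last two columns, they follow from the v1 and v2 timings as (v2 - v1) / v1 and v2 / v1. A minimal sketch, using the numbers from the table and a hypothetical `RegressionMath` helper:

```java
final class RegressionMath {
    // Relative change of v2 against the v1 baseline, in percent.
    static double percentChange(double v1, double v2) {
        return (v2 - v1) / v1 * 100.0;
    }

    // How many times slower v2 is than v1.
    static double slowdownFactor(double v1, double v2) {
        return v2 / v1;
    }

    public static void main(String[] args) {
        String[] scales = {"1k", "10k", "100k", "1M", "10M"};
        double[] v1 = {1.000, 9.954, 102.048, 1043.961, 10121.154};
        double[] v2 = {1.666, 16.462, 162.946, 1669.155, 16718.845};

        for (int i = 0; i < scales.length; i++) {
            System.out.printf("%s: %+.1f%%, %.2fx%n",
                    scales[i],
                    percentChange(v1[i], v2[i]),
                    slowdownFactor(v1[i], v2[i]));
        }
    }
}
```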

The getWithUuidHolder regression is due to different implementation approaches:

V1 (ArrowBuf reference):

holder.buffer = getDataBuffer();      // Copy pointer (8 bytes)
holder.start = getStartOffset(index); // Copy offset (4 bytes)
// Total: 12 bytes copied, NO data reading (deferred work)

V2 (MSB/LSB longs):

holder.mostSigBits = Long.reverseBytes(dataBuffer.getLong(start));      // Read + reverse
holder.leastSigBits = Long.reverseBytes(dataBuffer.getLong(start + 8)); // Read + reverse
// Total: 16 bytes READ + 2 byte reversals (immediate work)

The old version defers work (it just stores references), while the new version does the work upfront: it reads and byte-reverses 16 bytes for every element, which is O(N) over the vector.
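
The read-and-reverse step in V2 can be reproduced with plain JDK classes. The sketch below is an illustration only: a little-endian `java.nio.ByteBuffer` stands in for the data buffer, on the assumption (suggested by the `Long.reverseBytes` calls above) that `getLong` returns little-endian values while the UUID bytes are stored in big-endian order. `UuidByteOrderSketch` and its methods are hypothetical names.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.UUID;

final class UuidByteOrderSketch {
    // Lay the UUID out as 16 bytes, most significant byte first.
    static byte[] toBigEndianBytes(UUID uuid) {
        return ByteBuffer.allocate(16)
                .putLong(uuid.getMostSignificantBits())
                .putLong(uuid.getLeastSignificantBits())
                .array();
    }

    // Read through a little-endian view and reverse each long, as V2 does.
    static UUID readViaLittleEndian(byte[] bytes, int start) {
        ByteBuffer le = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN);
        long mostSigBits = Long.reverseBytes(le.getLong(start));
        long leastSigBits = Long.reverseBytes(le.getLong(start + 8));
        return new UUID(mostSigBits, leastSigBits);
    }

    public static void main(String[] args) {
        UUID original = UUID.fromString("123e4567-e89b-12d3-a456-426614174000");
        UUID roundTripped = readViaLittleEndian(toBigEndianBytes(original), 0);
        System.out.println("round trip matches: " + original.equals(roundTripped));
    }
}
```

Each `get` therefore costs two 8-byte loads and two byte swaps, which is the upfront work the benchmark is measuring.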

My point of view

Given the roughly 65% performance regression and the importance of maintaining Arrow's zero-copy design pattern, I would prefer the buffer reference approach (v1). This aligns with how other holder types handle data larger than 8 bytes: Decimal, VarChar, and VarBinary all use buffer references to enable zero-copy access. At 16 bytes, UUID fits this same category and should follow the established pattern.

GH-948: Use buffer indexing for UUID vector

e35ed7c

github-actions bot added the breaking-change label Jan 7, 2026

jhrotko marked this pull request as ready for review January 7, 2026 18:08

jhrotko requested review from jbonofre, laurentgo, lidavidm and wgtmac as code owners January 7, 2026 18:08

lidavidm reviewed Jan 8, 2026

Use MSB and LSB longs to represent UUID

421b5d4

lidavidm reviewed Jan 9, 2026

jhrotko added 2 commits January 9, 2026 09:15

Add getUuid method in [Nullable]UuidHolder

c9b09b0

use ArrowBuf.getLong

51cf225

jbonofre added the enhancement label Jan 9, 2026

github-actions bot added this to the 19.0.0 milestone Jan 9, 2026

Add Uuid vector benchmarks

18e3354

jhrotko force-pushed the uuid-improvement branch from 80c8f7e to 18e3354 Compare January 9, 2026 23:09
