Skip to content

feat: add an arrow-scalar crate with a scalar definition#5845

Closed
westonpace wants to merge 1 commit intolance-format:mainfrom
westonpace:feat/arrow-scalar
Closed

feat: add an arrow-scalar crate with a scalar definition#5845
westonpace wants to merge 1 commit intolance-format:mainfrom
westonpace:feat/arrow-scalar

Conversation

@westonpace
Copy link
Copy Markdown
Member

There are places in lance-file and lance-index where we pull in datafusion just so that we can have access to ScalarValue (we may use Expr in lance-index too so maybe not the best example). I'd like to not need to pull in Datafusion to get access to scalars.

In addition, two different definitions for binary serialization of arrow scalars has popped up recently. First, in the constant encoding and second in the column statistics. Although in column statistics we might be able to avoid the need for scalar serialization and in the constant encoding we shouldn't need the full set of arrow types (but it wouldn't hurt).

It's enough I toyed around with making a small standalone lightweight crate to provide scalars.

If we use this then I'd also like to make a small statistics crate with a an accumulator that keeps track of min/max/nan count/null count/etc. We could use this in some of the file encoding spots as well as some of the column statistics work (such an accumulator needs a definition for scalar because that is the min/max output)

@github-actions github-actions Bot added the enhancement New feature or request label Jan 29, 2026
@westonpace
Copy link
Copy Markdown
Member Author

westonpace commented Jan 29, 2026

Here I've used an enum approach (similar to ScalarValue) which uses primitive rust types for the simple scalars and Buffer for the string/binary scalars. Using Buffer instead of String or Vec<u8> means it should be possible to zero-copy a scalar from an arrow array (which could be important if the array is full of images or videos)

There is another approach which could result in even less code. Instead of an enum we could say that all scalars are simply Arc<dyn Array> where the array has length of 1. The downside is that I think some of the display / ordering code would be more complex and creating primitive scalars would be slightly more heavyweight. However, I think it is a viable approach as well if others think that might be better.

@github-actions
Copy link
Copy Markdown
Contributor

Code Review

This PR introduces a new arrow-scalar crate with a standalone scalar representation for Arrow types. The motivation to reduce DataFusion dependencies is sound.

P0 Issues

  1. Dictionary key type handling bug (convert.rs:1368-1371): The dictionary extraction always casts keys to UInt32Type, but dictionaries can have various key types (Int8, Int16, Int32, Int64, UInt8, etc.). This will panic or return incorrect values for dictionaries with non-UInt32 keys.

    let key_idx = dict.keys().as_primitive::<UInt32Type>().value(index) as usize;

    This should inspect key_type from the dictionary data type and handle all valid key types.

P1 Issues

  1. Map data type extraction (scalar.rs:2254-2259): The Map data_type implementation takes the first field from entries, but Arrow Map type requires the entries struct field, not just the first field. This may produce incorrect DataType.

  2. Potential semantic issue in Struct::is_null (scalar.rs:2304): Using values.is_empty() to determine if a struct is null is semantically incorrect. An empty struct is valid and not null. Consider tracking nullability explicitly or checking if all field values are null.

  3. Test coverage: The crate lacks tests for:

    • Dictionary round-trips with various key types
    • Map type operations
    • Error paths in byte deserialization (malformed input)

Minor Notes

  • The interval bit-packing representation differs from Arrows native layout. Ensure this is intentional and documented for future maintainers.
  • Consider adding #[must_use] to to_bytes(), to_array(), and similar methods.

@westonpace
Copy link
Copy Markdown
Member Author

Leaving in draft as I still want to do more review (this is mostly vibe coded at the moment)

@codecov
Copy link
Copy Markdown

codecov Bot commented Jan 29, 2026

@westonpace
Copy link
Copy Markdown
Member Author

Superseded by #5955

@westonpace westonpace closed this Mar 20, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant