Skip to content

[Variant] Align cast logic for variant_get to cast kernel for numeric/bool types#9563

Merged
alamb merged 8 commits intoapache:mainfrom
klion26:variant_cast_number_and_bool
Apr 6, 2026
Merged

[Variant] Align cast logic for variant_get to cast kernel for numeric/bool types#9563
alamb merged 8 commits intoapache:mainfrom
klion26:variant_cast_number_and_bool

Conversation

@klion26
Copy link
Copy Markdown
Member

@klion26 klion26 commented Mar 16, 2026

Which issue does this PR close?

What changes are included in this PR?

Align tests with cast kernel

Are these changes tested?

Covered by the existing tests

Are there any user-facing changes?

Yes, changed the logic for `Variant::asboolean/as_int8/as_int16/as_int32/as_int64/as_u8/as_u16/as_u32/as_u64/as_f16/as_f32/as_f64

@klion26 klion26 changed the title WIP [Variant] Align cast logic for variant_get to cast kernel for numeric/bool types Mar 16, 2026
@github-actions github-actions Bot added arrow Changes to the arrow crate parquet-variant parquet-variant* crates labels Mar 16, 2026
Copy link
Copy Markdown
Member Author

@klion26 klion26 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@scovich This is what is in my mind to align cast logic for variant_get with cast kernel, current, only changed int/float/double/bool (no decimal for now).

Currently, this will change some behaviors (details in inline comments). Please help review this when you're in free, thanks.

Comment thread parquet-variant/src/variant.rs Outdated
Variant::Int64(i) if fits_precision::<11>(i) => Some(f16::from_f32(i as _)),
_ => None,
}
self.as_num::<f16>()
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will change the behavior from before. Now 65536 will return Some(inf) instead of None.

Comment thread parquet-variant/src/variant.rs Outdated
Variant::Int64(i) => num_cast::<i64, T>(i),
Variant::Float(f) => num_cast::<f32, T>(f),
Variant::Double(d) => num_cast::<f64, T>(d),
Variant::Decimal4(d) if d.scale() == 0 => num_cast::<_, T>(d.integer()),
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

from/to_decimal needs to be added in a following commit if the direction is right

@sdf-jkl
Copy link
Copy Markdown
Contributor

sdf-jkl commented Mar 18, 2026

I did a quick run, and it looks cool! I'll take a closer look tomorrow

Copy link
Copy Markdown
Contributor

@scovich scovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

High level question: Should we consider keeping the lossless as_xxx methods, with new cast_as_xxx methods that are fully aggressive?

Asking because even lossless casting is complex enough that users could easily get it wrong, which could be a reason for the library to provide it.

Alternatively, we might consider expanding CastOptions to differentiate between lossless casts (e.g. 42i8 -> 42u64) vs. lossy casts (also allows e.g. f64::PI -> 3i32) vs. converting casts (also allows e.g. "true" -> true, 0 -> false, "123" -> 123i8)?

{
if value {
// a workaround to cast a primitive to type O, infallible
num_traits::cast::cast(1)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like T::Native already provides ONE and ZERO as constants?
https://docs.rs/arrow/latest/arrow/array/trait.ArrowNativeTypeOp.html

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess we could have an impedance mismatch between ArrowNativeType (arrow) and NumCast (num_cast) tho?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, ArrowNativeTypeOp contains ONE and ZEOR, but Rust type (i32, etc) does not have ZEOR/ONE.

where
I: Default + PartialEq,
{
value != I::default()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wasn't aware that rust or SQL allows coercing numbers to bool?
But arrow-cast already allows this, I guess?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Meanwhile, this should probably compare directly against I::ZERO?
(to match the other direction, below)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, arrorw-cast already allows this, This is the current equivalent conversion for arrow-cast.

fn numeric_to_bool_cast<T>(from: &PrimitiveArray<T>) -> Result<BooleanArray, ArrowError>
where
T: ArrowPrimitiveType + ArrowPrimitiveType,
{
let mut b = BooleanBuilder::with_capacity(from.len());
for i in 0..from.len() {
if from.is_null(i) {
b.append_null();
} else if from.value(i) != T::default_value() {
b.append_value(true);
} else {
b.append_value(false);
}
}

Comment thread parquet-variant/src/variant.rs Outdated
match self {
Variant::BooleanTrue => Some(true),
Variant::BooleanFalse => Some(false),
Variant::Int8(i) => Some(cast_num_to_bool::<i8>(*i)),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these type annotations actually necessary? Seems like the compiler would infer them easily?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the type annotations can be removed.

Comment thread parquet-variant/src/variant.rs Outdated
Variant::Int64(i) => Some(cast_num_to_bool::<i64>(*i)),
Variant::Float(f) => Some(cast_num_to_bool::<f32>(*f)),
Variant::Double(d) => Some(cast_num_to_bool::<f64>(*d)),
Variant::ShortString(s) => cast_single_string_to_boolean_default(s.0),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ShortString provides Deref?

Suggested change
Variant::ShortString(s) => cast_single_string_to_boolean_default(s.0),
Variant::ShortString(s) => cast_single_string_to_boolean_default(*s),

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s here is &ShortString, *s will give ShortString, we need to use &**s to convert to &str, adjusted to s.as_ref()

@klion26 klion26 force-pushed the variant_cast_number_and_bool branch from 4f7214e to e0ee913 Compare March 19, 2026 08:48
Copy link
Copy Markdown
Member Author

@klion26 klion26 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@scovich Thanks for the review. I've addressed some of the comments and replied to the others.

For the function names Variant::as_xx and Variant::cast_as_xxx, I think as_xx is fine, as conversion name guide gives that conversion prefix like as_/to_, and we can do 3.2f64 as i64 in Rust code, but maybe we can add lossy suffix in the name?

Alternatively, we might consider expanding CastOptions to differentiate between lossless casts (e.g. 42i8 -> 42u64) vs. lossy casts (also allows e.g. f64::PI -> 3i32) vs. converting casts (also allows e.g. "true" -> true, 0 -> false, "123" -> 123i8)?

I've tried adding the lossy flag to CastOptions in #9169; the conclusion is to keep it as it is for now.

where
I: Default + PartialEq,
{
value != I::default()
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, arrorw-cast already allows this, This is the current equivalent conversion for arrow-cast.

fn numeric_to_bool_cast<T>(from: &PrimitiveArray<T>) -> Result<BooleanArray, ArrowError>
where
T: ArrowPrimitiveType + ArrowPrimitiveType,
{
let mut b = BooleanBuilder::with_capacity(from.len());
for i in 0..from.len() {
if from.is_null(i) {
b.append_null();
} else if from.value(i) != T::default_value() {
b.append_value(true);
} else {
b.append_value(false);
}
}

{
if value {
// a workaround to cast a primitive to type O, infallible
num_traits::cast::cast(1)
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, ArrowNativeTypeOp contains ONE and ZEOR, but Rust type (i32, etc) does not have ZEOR/ONE.

Comment thread parquet-variant/src/variant.rs Outdated
match self {
Variant::BooleanTrue => Some(true),
Variant::BooleanFalse => Some(false),
Variant::Int8(i) => Some(cast_num_to_bool::<i8>(*i)),
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the type annotations can be removed.

Comment thread parquet-variant/src/variant.rs Outdated
Variant::Int64(i) => Some(cast_num_to_bool::<i64>(*i)),
Variant::Float(f) => Some(cast_num_to_bool::<f32>(*f)),
Variant::Double(d) => Some(cast_num_to_bool::<f64>(*d)),
Variant::ShortString(s) => cast_single_string_to_boolean_default(s.0),
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s here is &ShortString, *s will give ShortString, we need to use &**s to convert to &str, adjusted to s.as_ref()

@klion26 klion26 force-pushed the variant_cast_number_and_bool branch from e0ee913 to a0f711c Compare March 19, 2026 09:08
Copy link
Copy Markdown
Contributor

@sdf-jkl sdf-jkl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Everything LGMT, left a few nits.

}
}

fn as_num<T>(&self) -> Option<T>
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a doc to this function?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Comment thread parquet-variant/src/variant.rs Outdated
Variant::Decimal16(d) if d.scale() == 0 => d.integer().try_into().ok(),
_ => None,
}
self.as_num::<_>()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why type inference here, but not for other as_int funcs?

+self.as_num::<_>()
-self.as_num::<32>()

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for pointing this out. The inference has been removed now

Copy link
Copy Markdown
Member Author

@klion26 klion26 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sdf-jkl Thanks for the review, comments addressed. Please take another look when you're free.

Comment thread parquet-variant/src/variant.rs Outdated
Variant::Decimal16(d) if d.scale() == 0 => d.integer().try_into().ok(),
_ => None,
}
self.as_num::<_>()
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for pointing this out. The inference has been removed now

}
}

fn as_num<T>(&self) -> Option<T>
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Copy Markdown
Contributor

@sdf-jkl sdf-jkl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @klion26! I only have nits and typos

Comment thread parquet-variant/src/variant.rs Outdated
Comment thread parquet-variant/src/variant.rs Outdated
Comment thread parquet-variant/src/variant.rs Outdated
Comment thread parquet-variant/src/variant.rs Outdated
Comment thread parquet-variant/src/variant.rs Outdated
Comment thread arrow-cast/src/cast/mod.rs Outdated
Comment thread parquet-variant-compute/src/variant_get.rs Outdated
fn test_error_message_numeric_type_display() {
let mut builder = VariantArrayBuilder::new(1);
builder.append_variant(Variant::BooleanTrue);
builder.append_variant(Variant::Null);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using Variant::Null to cause error in strict casting will cause conflict with the PR I'm working on #9599.
Let's see who merges first 😄

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to other variants

Comment thread parquet-variant-compute/src/variant_get.rs Outdated
Copy link
Copy Markdown
Member Author

@klion26 klion26 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sdf-jkl I've addressed the comments, please another look

Comment thread arrow-cast/src/cast/mod.rs Outdated
fn test_error_message_numeric_type_display() {
let mut builder = VariantArrayBuilder::new(1);
builder.append_variant(Variant::BooleanTrue);
builder.append_variant(Variant::Null);
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to other variants

Comment thread parquet-variant-compute/src/variant_get.rs Outdated
Comment thread parquet-variant-compute/src/variant_get.rs Outdated
Comment thread parquet-variant/src/variant.rs Outdated
Comment thread parquet-variant/src/variant.rs Outdated
@klion26 klion26 force-pushed the variant_cast_number_and_bool branch from 522ef0c to 2b33884 Compare March 27, 2026 07:47
@klion26 klion26 force-pushed the variant_cast_number_and_bool branch from 2b33884 to a9214bc Compare March 27, 2026 08:04
Copy link
Copy Markdown
Contributor

@sdf-jkl sdf-jkl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGMT, thank you!

Copy link
Copy Markdown
Contributor

@scovich scovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code seems good.

I don't love silently lossy casts with no option to be strict(er), but that's unfortunately how arrow casts seem to roll, not a decision this PR takes.

Ok(Arc::new(output_array))
}

fn cast_single_string_to_boolean(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this need #[inline] directive to avoid a performance hazard?

Comment thread parquet-variant/src/variant.rs Outdated
match *self {
Variant::BooleanFalse => single_bool_to_numeric::<T>(false),
Variant::BooleanTrue => single_bool_to_numeric::<T>(true),
Variant::Int8(i) => num_cast::<_, T>(i),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm surprised type inference doesn't pick up that T? Does it fail to compile if you do this?

Suggested change
Variant::Int8(i) => num_cast::<_, T>(i),
Variant::Int8(i) => num_cast(i),

Comment thread parquet-variant/src/variant.rs Outdated
/// let v5 = Variant::from("hello!");
/// assert_eq!(v5.as_f32(), None);
/// ```
#[allow(clippy::cast_possible_truncation)]
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably don't need this any more since the truncation is hidden behind too many layers for clippy to "see" now?

Comment thread parquet-variant/src/variant.rs Outdated
Comment on lines +1546 to +1547
// It will always fit in i16 because u8 max is 255 and i16 max is 32767
Variant::Int16(num_cast(value).unwrap())
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Seems simpler to just say

Suggested change
// It will always fit in i16 because u8 max is 255 and i16 max is 32767
Variant::Int16(num_cast(value).unwrap())
Variant::Int16(num_cast(value).unwrap()) // u8 -> i16 is infallible

(more below)

@klion26
Copy link
Copy Markdown
Member Author

klion26 commented Mar 31, 2026

@scovich I've addressed the comments, please take another look when you're fee, thanks.

@scovich
Copy link
Copy Markdown
Contributor

scovich commented Mar 31, 2026

Hmm, what about:

If there are user-facing changes then we may require documentation to be
updated before approving the PR.

Do we have a docs gap to fill?

/// [Parquet Variant]: https://github.com/apache/parquet-format/blob/master/VariantEncoding.md
/// [specification]: https://github.com/apache/parquet-format/blob/master/VariantEncoding.md
/// [Variant Shredding specification]: https://github.com/apache/parquet-format/blob/master/VariantShredding.md
///
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, what about:

If there are user-facing changes then we may require documentation to be
updated before approving the PR.

Do we have a docs gap to fill?

It seems like there is no single place that has the casting logic defined. We could add it to the Variant doc

Suggested change
///
/// # Casting Semantics
///
/// Scalar conversion semantics intentionally follow Arrow cast behavior where applicable.
/// Conversions in this module delegate to Arrow compute cast helpers such as
/// [`num_cast`], [`cast_num_to_bool`], [`single_bool_to_numeric`], and
/// [`cast_single_string_to_boolean_default`].
///
/// - [`Self::as_boolean`] accepts boolean, numeric, and string variants.
/// Numeric zero maps to `false`; non-zero maps to `true`. String parsing follows
/// Arrow UTF8-to-boolean cast rules.
/// - Numeric accessors such as [`Self::as_int8`], [`Self::as_int64`], [`Self::as_u8`],
/// [`Self::as_u64`], [`Self::as_f16`], [`Self::as_f32`], and [`Self::as_f64`] accept
/// boolean and numeric variants (integers, floating-point, and decimals with scale `0`).
/// They return `None` when conversion is not possible.
/// - Decimal accessors such as [`Self::as_decimal4`], [`Self::as_decimal8`], and
/// [`Self::as_decimal16`] accept compatible decimal variants and integer variants.
/// They return `None` when conversion is not possible.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like that would do the job

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to change the PR description as well.

Are there any user-facing changes?

No

@klion26
Copy link
Copy Markdown
Member Author

klion26 commented Apr 1, 2026

If there are user-facing changes then we may require documentation to be
updated before approving the PR.

It's a very good idea to do this, it can keep the doc as fresh as possible.

The doc has been updated.

@scovich
Copy link
Copy Markdown
Contributor

scovich commented Apr 1, 2026

Looks like some kind of CI hiccup on the integration test. I triggered a retry, let's see if it succeeds.

@alamb alamb merged commit 989ac03 into apache:main Apr 6, 2026
34 of 35 checks passed
@alamb
Copy link
Copy Markdown
Contributor

alamb commented Apr 6, 2026

Thanks @klion26 @sdf-jkl and @scovich

@alamb
Copy link
Copy Markdown
Contributor

alamb commented Apr 6, 2026

(BTw @scovich you should feel free to merge these PRs without waiting for me)

@klion26 klion26 deleted the variant_cast_number_and_bool branch April 7, 2026 11:48
@klion26
Copy link
Copy Markdown
Member Author

klion26 commented Apr 7, 2026

@scovich @sdf-jkl @alamb Thanks for the review and merging!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

arrow Changes to the arrow crate parquet-variant parquet-variant* crates

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Variant] Align cast logic for variant_get to cast kernel for numeric/bool types

4 participants