Skip to content

fix: Expose hash to FFI udf/udaf/udwf to fix their Eq#17350

Closed
crystalxyz wants to merge 6 commits intoapache:mainfrom
crystalxyz:fix-17087
Closed

fix: Expose hash to FFI udf/udaf/udwf to fix their Eq#17350
crystalxyz wants to merge 6 commits intoapache:mainfrom
crystalxyz:fix-17087

Conversation

@crystalxyz
Copy link
Copy Markdown
Contributor

@crystalxyz crystalxyz commented Aug 28, 2025

Which issue does this PR close?

Rationale for this change

As described in the issue, the original version of Eq in Foreign UDF/UDAF/UDWF does comparison based on pointer hash, which fails to recognize the same UDFs if their pointers differ. This feature fixes this by exposing hash method in the FFI interface so that Eq will compare the actual hash values of udfs (as well as their signatures and aliases).

What changes are included in this PR?

For FFI UDF, UDAF and UDWF, I have made the following changes:

  • The hash function is exposed in FFI module
  • In Foreign_ module, the PartialEq trait will do comparison on the results of the hash functions
  • In Foreign_ module, the Hash trait will now use the result of hash function

Are these changes tested?

Yes. The added unit tests for UDF, UDAF and UDWF failed before this feature, but passed now.

Are there any user-facing changes?

No.

@github-actions github-actions Bot added the ffi Changes to the ffi crate label Aug 28, 2025
@crystalxyz crystalxyz changed the title fix: Fix Eq by adding hash to FFI udf/udaf/udwf fix: Expose hash to FFI udf/udaf/udwf to fix their Eq Aug 29, 2025
@crystalxyz crystalxyz marked this pull request as ready for review August 29, 2025 02:59
@timsaucer
Copy link
Copy Markdown
Member

I'm wondering if we can save ourselves some effort here and do this:

pub struct FFI_ScalarUDF {
    ...
    /// Internal hash function result
    pub hash_value: u64,
    ...
}

impl From<Arc<ScalarUDF>> for FFI_ScalarUDF {
    fn from(udf: Arc<ScalarUDF>) -> Self {
        let name = udf.name().into();
        let aliases = udf.aliases().iter().map(|a| a.to_owned().into()).collect();
        let volatility = udf.signature().volatility.into();
        let short_circuits = udf.short_circuits();

        let mut state = DefaultHasher::new();
        udf.hash(&mut state);
        let hash_value = state.finish();

        let private_data = Box::new(ScalarUDFPrivateData { udf });

        Self {
            name,
            aliases,
            volatility,
            short_circuits,
            invoke_with_args: invoke_with_args_fn_wrapper,
            return_type: return_type_fn_wrapper,
            return_field_from_args: return_field_from_args_fn_wrapper,
            coerce_types: coerce_types_fn_wrapper,
            hash_value,
            clone: clone_fn_wrapper,
            release: release_fn_wrapper,
            private_data: Box::into_raw(private_data) as *mut c_void,
        }
    }
}

impl PartialEq for ForeignScalarUDF {
    fn eq(&self, other: &Self) -> bool {
        let Self {
            name,
            aliases,
            udf,
            signature,
        } = self;
        name == &other.name
            && aliases == &other.aliases
            && signature == &other.signature
            && udf.hash_value == other.udf.hash_value
    }
}

impl Hash for ForeignScalarUDF {
    fn hash<H: Hasher>(&self, state: &mut H) {
        let Self {
            name,
            aliases,
            udf,
            signature,
        } = self;
        name.hash(state);
        aliases.hash(state);
        // This appears to be a hash of the hash value, but if you review how
        // u64 is hashed, it is just pushing the byte values into state.
        udf.hash_value.hash(state);
        signature.hash(state);
    }
}

And further, I wonder if we even need fn hash() to include the name, aliases, and signature at all. Those all come from the FFI calls, so wouldn't they be included in the internal call to hash?

It seems like we have an opportunity here to have a simpler path, but I'm not 100% confident I haven't overlooked some need to call hash() function every time. I don't think we expect these functions to have a state that is changing and thus hash is updating.

@crystalxyz
Copy link
Copy Markdown
Contributor Author

Thanks for your feedback! Will take a look and think more carefully about the cases

Copy link
Copy Markdown
Member

@timsaucer timsaucer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent! Thank you for diving in!

@findepi
Copy link
Copy Markdown
Member

findepi commented Sep 4, 2025

This feature fixes this by exposing hash method in the FFI interface so that Eq will depend on the actual hash values of udfs.

It may sounds like being dependent on hash not colliding.
It's not the case, right?

@crystalxyz
Copy link
Copy Markdown
Contributor Author

crystalxyz commented Sep 4, 2025

It may sounds like being dependent on hash not colliding. It's not the case, right?

@findepi Sorry for the confusion in wording and I just fixed the feature description. For Eq, we do compare signatures and aliases in addition to hash values.

@findepi
Copy link
Copy Markdown
Member

findepi commented Sep 4, 2025

If two different functions return same hash (e.g. 42), will Eq return incorrect result in such case?

@timsaucer
Copy link
Copy Markdown
Member

If two different functions return same hash (e.g. 42), will Eq return incorrect result in such case?

I must be missing something - how is that any different than any of the UDFs that aren't going through ffi? This is calling the same function during initialization.

@findepi
Copy link
Copy Markdown
Member

findepi commented Sep 5, 2025

For normal functions, the Eq is not based on the Hash and is not susceptible to hash collisions.
The hash can be pre-calculated and if it's different, then Eq can short-circuit before checking other fields.
However, checking pre-calculated hash alone -- or some fields and the pre-calculated hash -- is not a correct Eq implementation.

@findepi
Copy link
Copy Markdown
Member

findepi commented Sep 5, 2025

You can think of this the following way. If I replace hash function with a function that always returns the same thing (e.g. 42), is my code maybe slower, but still correct? If yes, we're good.

However, looking at impl PartialEq for ForeignAggregateUDF, I don't think this is the case. So I don't think we're good.

@timsaucer timsaucer self-requested a review September 5, 2025 10:48
@timsaucer
Copy link
Copy Markdown
Member

Ok, I think @findepi makes a fair point. We can probably just revert the last commits and I'll re-review. Sorry for the extra work.

@alamb alamb requested a review from findepi September 5, 2025 13:26
@crystalxyz
Copy link
Copy Markdown
Contributor Author

crystalxyz commented Sep 5, 2025

Sorry I might be missing some context here. I understand that hash values can only be used to avoid more expensive comparisons if they are unequal. But if hash values are equal, it doesn't guarantee anything about equality, so how should we check equality of ForeignUDFs? I'm not sure if having hash() as a function or hash_value as a constant makes a difference here either.

@findepi
Copy link
Copy Markdown
Member

findepi commented Sep 8, 2025

@timsaucer what if we simply don't do #17087 ?

@findepi findepi marked this pull request as draft September 8, 2025 13:23
@crystalxyz
Copy link
Copy Markdown
Contributor Author

crystalxyz commented Sep 8, 2025

Yeah, I guess it's better to have Eq to return false negative (return false for equal udfs) than false positive (return true for unequal udfs), although it would be good if we can return the result more accurately.

@findepi
Copy link
Copy Markdown
Member

findepi commented Sep 9, 2025

Yeah, I guess it's better to have Eq to return false negative (return false for equal udfs) than false positive (return true for unequal udfs),

Agreed!

although it would be good if we can return the result more accurately.

Agreed too.
Longer answer under the issue: #17087 (comment)

@github-actions
Copy link
Copy Markdown

Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.

@github-actions github-actions Bot added the Stale PR has not had any activity for some time label Nov 26, 2025
@crystalxyz
Copy link
Copy Markdown
Contributor Author

Not a good long-term solution for safety concerns, close and fallback to issue discussion.

@crystalxyz crystalxyz closed this Nov 26, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ffi Changes to the ffi crate Stale PR has not had any activity for some time

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Enable udf hashing across FFI boundary

3 participants