Skip to content

Conversation

@yuancu
Copy link
Collaborator

@yuancu yuancu commented Apr 23, 2025

Description

This PR implements SHA2 with UDFs, and reuse calcite MD5, SHA1 sql operators as available functions.

  • SHA2 family is not fully implemented in Calcite SQL functions - only SHA256 and SHA512 are available, SHA224 and SHA384 are missing. Therefore, we implement our own.

Issues Resolved

Resolves #3573

Check List

  • New functionality includes testing.
    • All tests pass, including unit test, integration test and doctest
  • New functionality has been documented.
    • New functionality has javadoc added
    • New functionality has user manual doc added
  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

yuancu added 4 commits April 23, 2025 15:18
Signed-off-by: Yuanchun Shen <yuanchu@amazon.com>
Signed-off-by: Yuanchun Shen <yuanchu@amazon.com>
Signed-off-by: Yuanchun Shen <yuanchu@amazon.com>
Signed-off-by: Yuanchun Shen <yuanchu@amazon.com>
@LantaoJin
Copy link
Member

LantaoJin commented Apr 24, 2025

A high-level question: could you explain in PR description why choose to self-implement instead of leveraging Calcite builtin functions (This question should be answered for each function in similar PRs)


Example::

os> source=people | eval `SHA1('hello')` = SHA1('hello') | fields `SHA1('hello')`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please apply function on field of people index

Copy link
Collaborator Author

@yuancu yuancu Apr 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tests are corrected. Document kept unchanged.


Example::

os> source=people | eval `SHA2('hello',256)` = SHA2('hello',256) | fields `SHA2('hello',256)`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

| 2cf24dba5fb0a30e26e83b2ac5b9e29e1b161e5c1fa7425e73043362938b9824 |
+------------------------------------------------------------------+

os> source=people | eval `SHA2('hello',512)` = SHA2('hello',512) | fields `SHA2('hello',512)`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

yuancu added 2 commits April 24, 2025 11:08
Signed-off-by: Yuanchun Shen <yuanchu@amazon.com>
Signed-off-by: Yuanchun Shen <yuanchu@amazon.com>
@yuancu
Copy link
Collaborator Author

yuancu commented Apr 24, 2025

A high-level question: could you explain in PR description why choose to self-implement instead of leveraging Calcite builtin functions (This question should be answered for each function in similar PRs)

I found that MD5 and SHA1 are actually implemented in Calcite. Fixed by reusing their implementations.
I'll check other PRs.

LantaoJin
LantaoJin previously approved these changes Apr 24, 2025
Signed-off-by: Yuanchun Shen <yuanchu@amazon.com>
LantaoJin
LantaoJin previously approved these changes May 9, 2025

// Register PPL UDF operator
registerOperator(SPAN, PPLBuiltinOperators.SPAN);
registerOperator(SHA2, PPLBuiltinOperators.SHA2);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems the SHA2 is the only one we built by ourselves. So I checked why PPL Spark supported SHA2, I found there was no requirement for this. This issue mentioned a link from Splunk, but there was no SHA2 in Splunk either. @penghuo how about remove the SHA2 both in Spark and OpenSearch.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm okay with using SHA2, as it provides more functionality.

@LantaoJin LantaoJin dismissed their stale review May 9, 2025 15:10

Is SHA2 required?

Signed-off-by: Yuanchun Shen <yuanchu@amazon.com>
@penghuo penghuo merged commit b784b87 into opensearch-project:main May 12, 2025
22 checks passed
penghuo pushed a commit that referenced this pull request Jun 16, 2025
* Implement cryptographic functions

Signed-off-by: Yuanchun Shen <yuanchu@amazon.com>

* Implement cryptographic UDFs

Signed-off-by: Yuanchun Shen <yuanchu@amazon.com>

* Wrap message digests with ThreadLocal to enforce thread safety

Signed-off-by: Yuanchun Shen <yuanchu@amazon.com>

* Create documentation for cryptographic hash PPL functions

Signed-off-by: Yuanchun Shen <yuanchu@amazon.com>

* Use calcite implementations of md5 and sha1

Signed-off-by: Yuanchun Shen <yuanchu@amazon.com>

* Test field values in cryptographic UDFs

Signed-off-by: Yuanchun Shen <yuanchu@amazon.com>

* Refactor: update sha2 implementations with DigestUtils

Signed-off-by: Yuanchun Shen <yuanchu@amazon.com>

---------

Signed-off-by: Yuanchun Shen <yuanchu@amazon.com>
Signed-off-by: xinyual <xinyual@amazon.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

calcite calcite migration releated

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[FEATURE] Support cryptographic hashing functions

3 participants