Skip to content

feat(java): support json extraction by scanning#5770

Merged
majin1102 merged 2 commits intolance-format:mainfrom
majin1102:jsonb
Jan 27, 2026
Merged

feat(java): support json extraction by scanning#5770
majin1102 merged 2 commits intolance-format:mainfrom
majin1102:jsonb

Conversation

@majin1102
Copy link
Copy Markdown
Contributor

@majin1102 majin1102 commented Jan 21, 2026

Close #5744

For now arrow-java does't provide a convenient way like pa.json(). So we extend the field type in lance java binding to support related json udf

@github-actions github-actions Bot added enhancement New feature or request java labels Jan 21, 2026
@github-actions
Copy link
Copy Markdown
Contributor

Code Review for PR #5770

Summary

This PR adds Java utilities for creating Arrow JSON extension fields and a test to verify JSON extraction via scanning works correctly.

P0 Issues

None identified - The implementation is correct and addresses the issue #5744.

P1 Issues / Suggestions

  1. Missing Javadoc on constants (Minor)
    The class uses EXTENSION_NAME_KEY and ARROW_JSON_EXTENSION_NAME constants but doesn't document their relationship to the Arrow specification. Consider adding a brief class-level Javadoc explaining that these constants align with the Arrow JSON extension type specification (arrow.json).

  2. Consider adding jsonLargeBinary helper
    Looking at rust/lance-arrow/src/json.rs, Lance internally uses LargeBinary with lance.json extension for JSONB storage. The current PR only exposes arrow.json extension (which uses Utf8/LargeUtf8 and gets converted to JSONB on write). This is the correct approach for users writing JSON data, but a comment noting this would help future maintainers understand the data flow:

    • User writes arrow.json (Utf8) → Lance converts to lance.json (LargeBinary JSONB) on write
    • On read, Lance converts back to arrow.json (Utf8)
  3. Test coverage is good - The test properly validates:

    • Creating a dataset with JSON extension field
    • Writing JSON data
    • Querying with json_extract filter
    • Verifying correct row filtering

Positive Aspects

  • Correct use of Collections.unmodifiableMap for immutable metadata
  • Both Utf8 and LargeUtf8 variants are provided
  • Test follows existing patterns in the codebase (uses @TempDir, proper resource cleanup with try-with-resources)
  • The solution aligns with how PyArrow's pa.json_() works, providing parity between Python and Java SDKs

Verdict

LGTM ✓ - The PR correctly solves the issue and the implementation is clean. The test adequately covers the use case.

@majin1102
Copy link
Copy Markdown
Contributor Author

@Xuanwo Could you take a look at this json related feat for Java binding?

Thanks

Copy link
Copy Markdown
Collaborator

@yanghua yanghua left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, LGTM, left a test suggestion.

public class JsonExtractionTest {

@Test
void testJsonExtraction(@TempDir Path tempDir) throws Exception {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add more tests for some corner cases? For example, null field, invalid JSON string, jsonLargeUtf8 .

@majin1102
Copy link
Copy Markdown
Contributor Author

Ready for another look @yanghua

Copy link
Copy Markdown
Collaborator

@yanghua yanghua left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@majin1102 majin1102 merged commit 7f1d16d into lance-format:main Jan 27, 2026
8 checks passed
vivek-bharathan pushed a commit to vivek-bharathan/lance that referenced this pull request Feb 2, 2026
Close lance-format#5744

For now arrow-java does't provide a convenient way like `pa.json()`. So
we extend the field type in lance java binding to support related json
udf

---------

Co-authored-by: majin.nathan <majin.nathan@bytedance.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request java

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug lanceJava Sdk] json_extract fails with InvalidJsonb when querying Utf8 or LargeBinary columns containing raw JSON strings

2 participants