add JSON_QUERY_ARRAY function to pluck ARRAY<COMPLEX<json>> out of COMPLEX<json>#15521

Merged
clintropolis merged 4 commits into apache:master from clintropolis:json-query-array on Dec 8, 2023

@clintropolis
Member

Description

This PR adds JSON_QUERY_ARRAY, which is similar to JSON_QUERY but returns ARRAY<COMPLEX<json>> instead of COMPLEX<json> for the value extracted at a JSON path. It is currently implemented purely with ExpressionVirtualColumn via a DirectOperatorConversion, rather than with the specialized NestedFieldVirtualColumn used by JSON_VALUE and JSON_QUERY. This is mostly because there isn't much room for optimization yet, and I would rather wait until we introduce specialized array column selectors, if we do, than try to extend the existing selectors of that virtual column to also handle arrays of objects.

Similar to other array handling, values that are not arrays are coerced into single-element arrays. I am open to discussion on this, since it would seem equally valid to treat them as null values.

This enables a lot of useful things, such as using UNNEST on arrays of objects to transform an array of JSON objects into rows of JSON objects.

For example, take some data sourced from a discussion in a community Slack thread, which has top-level arrays of objects (this would also work with arrays of objects nested at some path):

Screenshot 2023-12-08 at 12 01 39 AM

We can use JSON_QUERY_ARRAY to do things like translate it into a separate row per object:

Screenshot 2023-12-08 at 12 02 33 AM
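
The query in the screenshot is not reproduced in this text. A minimal sketch of a query with this shape, assuming a hypothetical datasource "events" with a COMPLEX<json> column "payload" whose top-level value is an array of objects (both names are illustrative and do not come from the PR):

```sql
-- Hypothetical names: datasource "events", COMPLEX<json> column "payload".
-- JSON_QUERY_ARRAY returns ARRAY<COMPLEX<json>>, which UNNEST expands
-- into one row per array element.
SELECT
  t.obj
FROM "events"
CROSS JOIN UNNEST(JSON_QUERY_ARRAY("payload", '$')) AS t(obj)
```

For arrays nested under a key, the path would change to something like '$.items', where "items" is again a hypothetical key.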

We can then use JSON_VALUE to extract values from these objects and group or aggregate on them:

Screenshot 2023-12-08 at 12 04 08 AM
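
As a rough sketch of the grouping case (not the query from the screenshot), assuming the same hypothetical datasource "events" and column "payload", and assuming each object carries a "name" key:

```sql
-- Hypothetical names: datasource "events", column "payload", key "name".
-- Each unnested object is a COMPLEX<json> value, so JSON_VALUE can pull
-- a scalar out of it for use in GROUP BY.
SELECT
  JSON_VALUE(t.obj, '$.name') AS "name",
  COUNT(*) AS "cnt"
FROM "events"
CROSS JOIN UNNEST(JSON_QUERY_ARRAY("payload", '$')) AS t(obj)
GROUP BY 1
```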

Will add docs in a follow-up PR.

Release note

Added JSON_QUERY_ARRAY, which is similar to JSON_QUERY except that the return type is always ARRAY<COMPLEX<json>> instead of COMPLEX<json>. This function allows extracting arrays of objects from nested data and performing operations such as UNNEST, ARRAY_LENGTH, ARRAY_SLICE, or any other available ARRAY operations.
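
A small illustration of one of the ARRAY operations named above, assuming a hypothetical datasource "events" whose COMPLEX<json> column "payload" holds an array of objects under a hypothetical key "items":

```sql
-- Hypothetical names: datasource "events", column "payload", key "items".
-- Counts the objects in the extracted array for each row.
SELECT
  ARRAY_LENGTH(JSON_QUERY_ARRAY("payload", '$.items')) AS "item_count"
FROM "events"
```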


This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • been tested in a test Druid cluster.

@clintropolis
Member Author

I imagine this function will actually be most useful with MSQ, to explode arrays of objects into separate rows at ingest time. For example:

Screenshot 2023-12-08 at 12 53 10 AM

or even to extract values from the exploded objects into individual columns:

Screenshot 2023-12-08 at 12 55 28 AM
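
The ingestion queries in the screenshots are not reproduced in this text. The following is only a sketch of what such an MSQ statement might look like, assuming a hypothetical source datasource "raw_events" with a COMPLEX<json> column "payload", a hypothetical target "events_exploded", and hypothetical object keys "name" and "value":

```sql
-- Hypothetical names: "raw_events", "events_exploded", "payload",
-- and the keys "name" and "value". None of these come from the PR.
-- Each array element becomes its own row, and selected fields of the
-- element are written as individual columns.
INSERT INTO "events_exploded"
SELECT
  "__time",
  JSON_VALUE(t.obj, '$.name') AS "name",
  JSON_VALUE(t.obj, '$.value' RETURNING BIGINT) AS "value"
FROM "raw_events"
CROSS JOIN UNNEST(JSON_QUERY_ARRAY("payload", '$')) AS t(obj)
PARTITIONED BY DAY
```

Selecting t.obj itself instead of the JSON_VALUE expressions would store each exploded object as a COMPLEX<json> column.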

Contributor

@abhishekagarwal87 abhishekagarwal87 left a comment

some minor comments. LGTM otherwise. Very useful capability.

@Override
public Expr apply(List<Expr> args)
{
if (args.get(1).isLiteral()) {
Contributor

we should add some validation on args count.

}

@Override
public SqlTypeMappingRule getTypeMappingRule()
Contributor

can you add a comment here as to why this is overridden?

Member Author

oops, didn't mean to commit, was experimenting with CAST and forgot to remove this

@clintropolis clintropolis merged commit e64b92e into apache:master Dec 8, 2023
@clintropolis clintropolis deleted the json-query-array branch December 8, 2023 13:28
@LakshSingla LakshSingla added this to the 29.0.0 milestone Jan 29, 2024