feat: Implement to_json for subset of types #805

Merged

andygrove merged 29 commits into apache:main from andygrove:to-json
Aug 28, 2024

Conversation

@andygrove
Member

@andygrove andygrove commented Aug 10, 2024

Which issue does this PR close?

Closes #631

Rationale for this change

This is part of our effort to support more operations on complex types.

Performance seems ok.

AMD Ryzen 9 7950X3D 16-Core Processor
TPCDS Micro Benchmarks:                   Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
to_json                                             147            152           2          1.4         721.3       1.0X
to_json: Comet (Scan)                               141            145           3          1.4         692.0       1.0X
to_json: Comet (Scan, Exec)                          80             88           5          2.6         391.9       1.8X

What changes are included in this PR?

This PR adds an implementation of to_json that works for a subset of types (structs, primitives, strings). There is no support for date or timestamp yet.
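At its core, converting a struct row to a JSON string amounts to joining quoted field names with serialized values. The following is a hypothetical, simplified sketch of that idea; the actual implementation in native/spark-expr/src/to_json.rs operates on Arrow arrays and is more involved:

```rust
// Illustrative sketch only: serialize (field name, pre-serialized value)
// pairs into a JSON object string. The real code works on Arrow
// StructArray columns rather than a slice of tuples.
fn row_to_json(fields: &[(&str, String)]) -> String {
    let mut json = String::from("{");
    for (i, (name, value)) in fields.iter().enumerate() {
        if i > 0 {
            json.push(',');
        }
        // quoted field name, then the already-serialized value
        json.push('"');
        json.push_str(name);
        json.push_str("\":");
        json.push_str(value);
    }
    json.push('}');
    json
}

fn main() {
    let row = vec![("a", "1".to_string()), ("b", "\"x\"".to_string())];
    println!("{}", row_to_json(&row)); // {"a":1,"b":"x"}
}
```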

How are these changes tested?

New Rust tests and Spark tests.

@andygrove andygrove marked this pull request as draft August 10, 2024 22:42
@codecov-commenter

codecov-commenter commented Aug 11, 2024

Codecov Report

Attention: Patch coverage is 50.00000% with 16 lines in your changes missing coverage. Please review.

Project coverage is 33.82%. Comparing base (4fe43ad) to head (5c2f551).

Files Patch % Lines
.../scala/org/apache/comet/serde/QueryPlanSerde.scala 50.00% 9 Missing and 7 partials ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main     #805      +/-   ##
============================================
- Coverage     33.94%   33.82%   -0.12%     
+ Complexity      874      871       -3     
============================================
  Files           112      112              
  Lines         42916    42887      -29     
  Branches       9464     9456       -8     
============================================
- Hits          14567    14508      -59     
- Misses        25379    25395      +16     
- Partials       2970     2984      +14     

☔ View full report in Codecov by Sentry.

@andygrove andygrove changed the title from "feat: Implement to_json" to "feat: Implement to_json for subset of types" Aug 11, 2024
@andygrove andygrove marked this pull request as ready for review August 11, 2024 14:28
Comment thread docs/source/user-guide/expressions.md
@andygrove
Member Author

@parthchandra could you review?

@dharanad
Contributor

I would also love to review this. Will plan it for tomorrow.

@andygrove
Member Author

@eejbyfeldt @Kimahriman you may also be interested in reviewing this one

@Kimahriman
Contributor

LGTM

val isSupported = child.dataType match {
  case s: StructType =>
    s.fields.forall(f => isSupportedType(f.dataType))
  case _ =>
Contributor

My reading of the Spark code for this expression here (https://github.com/apache/spark/blob/bfddd53d98da866b474464321e5b323a3df32e81/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala#L832-L833) is that, despite the name StructsToJson, it also handles Maps and Arrays.

Nit: should we mention that as a TODO here?

Contributor

I think @eejbyfeldt's point is correct. Maps and Arrays can effectively be serialized to JSON, but they are not StructType, so we would need to extend the pattern matching to support those.

Member Author

I added comments for map/array support
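The Scala check above recurses over struct fields to decide whether a type is supported. A schematic Rust analogue, using a toy DataType enum rather than Comet's actual types, looks like:

```rust
// Toy enum standing in for the real Spark/Arrow data types; the actual
// check in QueryPlanSerde.scala is written in Scala against Spark types.
enum DataType {
    Int,
    Utf8,
    Struct(Vec<DataType>),
    // Per the review: Maps and Lists can also be serialized to JSON,
    // but were left unsupported in this PR (TODO).
    Map,
    List,
}

fn is_supported_type(dt: &DataType) -> bool {
    match dt {
        DataType::Int | DataType::Utf8 => true,
        // a struct is supported only if every field's type is supported
        DataType::Struct(fields) => fields.iter().all(is_supported_type),
        DataType::Map | DataType::List => false,
    }
}

fn main() {
    let nested = DataType::Struct(vec![
        DataType::Int,
        DataType::Struct(vec![DataType::Utf8]),
    ]);
    assert!(is_supported_type(&nested));
    assert!(!is_supported_type(&DataType::Struct(vec![DataType::Map])));
    assert!(!is_supported_type(&DataType::List));
    println!("ok");
}
```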

Comment thread docs/source/user-guide/expressions.md Outdated
Comment on lines +188 to +192
| ----------------- | --------------------------------- |
| CreateNamedStruct | Create a struct |
| GetElementAt | Access a field in a struct |
| StructsToJson | Convert a struct to a JSON string |

Contributor

Nit: in other parts of this document the Notes section is only used to document compatibility issues and/or limitations. Should we follow that here as well? I think the expression names are mostly self-describing and the extra comment does not really add that much. (Maybe for GetElementAt it is a bit unclear what it maps to in Spark/SQL.)

Member Author

Good point. I have removed these comments.

Comment thread native/spark-expr/src/to_json.rs Outdated

fn struct_to_json(array: &StructArray, timezone: &str) -> Result<ArrayRef> {
// get field names
let field_names: Vec<String> = array.fields().iter().map(|f| f.name().clone()).collect();
Contributor

This looks like it creates some unnecessary copies of the field names. Any reason not to change this to

Suggested change
- let field_names: Vec<String> = array.fields().iter().map(|f| f.name().clone()).collect();
+ let fields = array.fields();

and the usage site then becomes

json.push_str(fields[col_index].name());

Or is there some reason to make copies that I am missing?

Member Author

This code now escapes the field names (if needed), so it requires a copy in that case.
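That copy-only-when-needed trade-off can be sketched in plain Rust with Cow: return a borrowed name when no escaping is required and allocate only otherwise. This is illustrative; the merged code in to_json.rs may differ:

```rust
use std::borrow::Cow;

// Return the field name unchanged (borrowed, no allocation) when it
// contains no characters needing escaping, and an escaped copy otherwise.
// Only quotes and backslashes are handled here, for brevity.
fn maybe_escape(name: &str) -> Cow<'_, str> {
    if name.contains('"') || name.contains('\\') {
        // escape backslashes first so we don't double-escape the
        // backslashes introduced for quotes
        Cow::Owned(name.replace('\\', "\\\\").replace('"', "\\\""))
    } else {
        Cow::Borrowed(name)
    }
}

fn main() {
    // common case: no copy is made
    assert!(matches!(maybe_escape("plain"), Cow::Borrowed(_)));
    // rare case: an escaped copy is allocated
    assert_eq!(maybe_escape("a\"b"), "a\\\"b");
    println!("ok");
}
```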

Comment thread native/spark-expr/src/to_json.rs Outdated
Comment on lines +173 to +179
if quotes_needed[col_index] {
json.push('"');
}
json.push_str(string_arrays[col_index].value(row_index));
if quotes_needed[col_index] {
json.push('"');
}
Contributor

I think there is an issue here if the value in string_arrays[col_index].value(row_index) contains a literal " character. I think Spark (or some other JSON library) would escape such characters.

There are probably also other things, like newlines and tabs, that also need to be handled.

Member Author

I have now added escaping for common cases such as \t, \r, \n, \b, \f, and double quotes.
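The escaping of those cases can be sketched with a single character-by-character pass. This is a sketch of the technique, not the code merged in the PR:

```rust
// Illustrative JSON string escaping covering the cases mentioned above:
// double quote, backslash, tab, carriage return, newline, backspace,
// and form feed. Real implementations must also escape other control
// characters (U+0000..U+001F) as \uXXXX per the JSON spec.
fn escape_json(value: &str) -> String {
    let mut out = String::with_capacity(value.len());
    for c in value.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\t' => out.push_str("\\t"),
            '\r' => out.push_str("\\r"),
            '\n' => out.push_str("\\n"),
            '\u{0008}' => out.push_str("\\b"), // backspace
            '\u{000C}' => out.push_str("\\f"), // form feed
            _ => out.push(c),
        }
    }
    out
}

fn main() {
    println!("{}", escape_json("say \"hi\"\tthen newline\n"));
}
```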

Comment thread native/spark-expr/src/to_json.rs Outdated
}
// quoted field name
json.push('"');
json.push_str(&field_names[col_index]);
Contributor

The field_name also needs to be escaped if it contains problematic characters.

@andygrove
Member Author

Thanks for the review @eejbyfeldt! It is really appreciated. Some very good feedback there. I will address the feedback over the next day or two.

@parthchandra
Contributor

Sorry @andygrove for this late review. I don't know if one can improve on @eejbyfeldt's review.
To address some of the handling of escape characters, should we look at using something like serde_json?

@andygrove andygrove mentioned this pull request Aug 19, 2024
5 tasks
andygrove and others added 3 commits August 25, 2024 11:05
Co-authored-by: Emil Ejbyfeldt <emil.ejbyfeldt@gmail.com>
@andygrove andygrove marked this pull request as draft August 25, 2024 17:19
@andygrove andygrove marked this pull request as ready for review August 26, 2024 15:23
@andygrove
Member Author

@parthchandra @eejbyfeldt This is ready for another review

Contributor

@parthchandra parthchandra left a comment

lgtm

@andygrove andygrove requested review from huaxingao and viirya August 28, 2024 20:58
Contributor

@huaxingao huaxingao left a comment

LGTM. Thanks @andygrove


@andygrove andygrove merged commit cd530f8 into apache:main Aug 28, 2024
@andygrove andygrove deleted the to-json branch August 28, 2024 23:39
himadripal pushed a commit to himadripal/datafusion-comet that referenced this pull request Sep 7, 2024
* add skeleton for StructsToJson

* first test passes

* add support for nested structs

* add support for strings and improve test

* clippy

* format

* prepare for review

* remove perf results

* update user guide

* add microbenchmark

* remove comment

* update docs

* reduce size of diff

* add failing test for quotes in field names and values

* test passes

* clippy

* revert a docs change

* Update native/spark-expr/src/to_json.rs

Co-authored-by: Emil Ejbyfeldt <emil.ejbyfeldt@gmail.com>

* address feedback

* support tabs

* newlines

* backspace

* clippy

* fix test regression

* cargo fmt

---------

Co-authored-by: Emil Ejbyfeldt <emil.ejbyfeldt@gmail.com>
(cherry picked from commit cd530f8)

Development

Successfully merging this pull request may close these issues.

Implement initial version of to_json

8 participants