feat: Implement to_json for subset of types #805

Merged

andygrove merged 29 commits into apache:main from andygrove:to-json
Aug 28, 2024

Conversation

@andygrove
Member

@andygrove andygrove commented Aug 10, 2024

Which issue does this PR close?

Closes #631

Rationale for this change

This is part of our effort to support more operations on complex types.

Performance seems ok.

AMD Ryzen 9 7950X3D 16-Core Processor
TPCDS Micro Benchmarks:                   Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
to_json                                             147            152           2          1.4         721.3       1.0X
to_json: Comet (Scan)                               141            145           3          1.4         692.0       1.0X
to_json: Comet (Scan, Exec)                          80             88           5          2.6         391.9       1.8X

What changes are included in this PR?

This PR adds an implementation of to_json that works for a subset of types (structs, primitives, strings). There is no support for date or timestamp yet.
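At its core, converting a struct row to a JSON string amounts to joining quoted field names with serialized values. The following is a hypothetical, simplified sketch of that idea; the actual implementation in native/spark-expr/src/to_json.rs operates on Arrow arrays and is more involved:

```rust
// Illustrative sketch only: serialize (field name, pre-serialized value)
// pairs into a JSON object string. The real code works on Arrow
// StructArray columns rather than a slice of tuples.
fn row_to_json(fields: &[(&str, String)]) -> String {
    let mut json = String::from("{");
    for (i, (name, value)) in fields.iter().enumerate() {
        if i > 0 {
            json.push(',');
        }
        // quoted field name, then the already-serialized value
        json.push('"');
        json.push_str(name);
        json.push_str("\":");
        json.push_str(value);
    }
    json.push('}');
    json
}

fn main() {
    let row = vec![("a", "1".to_string()), ("b", "\"x\"".to_string())];
    println!("{}", row_to_json(&row)); // {"a":1,"b":"x"}
}
```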

How are these changes tested?

New Rust tests and Spark tests.

@andygrove andygrove marked this pull request as draft August 10, 2024 22:42
@codecov-commenter

codecov-commenter commented Aug 11, 2024

Codecov Report

Attention: Patch coverage is 50.00000% with 16 lines in your changes missing coverage. Please review.

Project coverage is 33.82%. Comparing base (4fe43ad) to head (5c2f551).

Files Patch % Lines
.../scala/org/apache/comet/serde/QueryPlanSerde.scala 50.00% 9 Missing and 7 partials ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main     #805      +/-   ##
============================================
- Coverage     33.94%   33.82%   -0.12%     
+ Complexity      874      871       -3     
============================================
  Files           112      112              
  Lines         42916    42887      -29     
  Branches       9464     9456       -8     
============================================
- Hits          14567    14508      -59     
- Misses        25379    25395      +16     
- Partials       2970     2984      +14     

☔ View full report in Codecov by Sentry.

@andygrove andygrove changed the title from "feat: Implement to_json" to "feat: Implement to_json for subset of types" Aug 11, 2024
@andygrove andygrove marked this pull request as ready for review August 11, 2024 14:28
Comment thread docs/source/user-guide/expressions.md
@andygrove
Member Author

@parthchandra could you review?

@dharanad
Contributor

I would also love to review this. Will plan it for tomorrow.

@andygrove
Member Author

@eejbyfeldt @Kimahriman you may also be interested in reviewing this one

@Kimahriman
Contributor

LGTM

val isSupported = child.dataType match {
  case s: StructType =>
    s.fields.forall(f => isSupportedType(f.dataType))
  case _ =>
Contributor

My reading of the Spark code for this expression here (https://github.com/apache/spark/blob/bfddd53d98da866b474464321e5b323a3df32e81/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala#L832-L833) is that, despite the name StructsToJson, it also handles Maps and Arrays.

Nit: should we mention that as a TODO here?

Contributor

I think @eejbyfeldt's point is correct. Maps and Arrays can effectively be serialized to JSON, but they are not StructType, so we would need to extend the pattern matching to support those.

Member Author

I added comments for map/array support
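The Scala check above recurses over struct fields to decide whether a type is supported. A schematic Rust analogue, using a toy DataType enum rather than Comet's actual types, looks like:

```rust
// Toy enum standing in for the real Spark/Arrow data types; the actual
// check in QueryPlanSerde.scala is written in Scala against Spark types.
enum DataType {
    Int,
    Utf8,
    Struct(Vec<DataType>),
    // Per the review: Maps and Lists can also be serialized to JSON,
    // but were left unsupported in this PR (TODO).
    Map,
    List,
}

fn is_supported_type(dt: &DataType) -> bool {
    match dt {
        DataType::Int | DataType::Utf8 => true,
        // a struct is supported only if every field's type is supported
        DataType::Struct(fields) => fields.iter().all(is_supported_type),
        DataType::Map | DataType::List => false,
    }
}

fn main() {
    let nested = DataType::Struct(vec![
        DataType::Int,
        DataType::Struct(vec![DataType::Utf8]),
    ]);
    assert!(is_supported_type(&nested));
    assert!(!is_supported_type(&DataType::Struct(vec![DataType::Map])));
    assert!(!is_supported_type(&DataType::List));
    println!("ok");
}
```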

Comment thread docs/source/user-guide/expressions.md Outdated
Comment on lines +188 to +192
| ----------------- | --------------------------------- |
| CreateNamedStruct | Create a struct |
| GetElementAt | Access a field in a struct |
| StructsToJson | Convert a struct to a JSON string |

Contributor

Nit: in other parts of this document the Notes section is only used to document compatibility issues and/or limitations. Should we follow that here as well? I think the expression names are mostly self-describing and the extra comment does not really add that much. (Maybe for GetElementAt it is a bit unclear what it maps to in Spark/SQL.)

Member Author

Good point. I have removed these comments.

Comment thread native/spark-expr/src/to_json.rs Outdated

fn struct_to_json(array: &StructArray, timezone: &str) -> Result<ArrayRef> {
// get field names
let field_names: Vec<String> = array.fields().iter().map(|f| f.name().clone()).collect();
Contributor

This looks like it creates some unnecessary copies of the field names. Any reason not to change this to

Suggested change
- let field_names: Vec<String> = array.fields().iter().map(|f| f.name().clone()).collect();
+ let fields = array.fields();

and the usage site then becomes

json.push_str(fields[col_index].name());

Or is there some reason to make copies that I am missing?

Member Author

This code now escapes the field names (if needed), so it requires a copy in that case.
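That copy-only-when-needed trade-off can be sketched in plain Rust with Cow: return a borrowed name when no escaping is required and allocate only otherwise. This is illustrative; the merged code in to_json.rs may differ:

```rust
use std::borrow::Cow;

// Return the field name unchanged (borrowed, no allocation) when it
// contains no characters needing escaping, and an escaped copy otherwise.
// Only quotes and backslashes are handled here, for brevity.
fn maybe_escape(name: &str) -> Cow<'_, str> {
    if name.contains('"') || name.contains('\\') {
        // escape backslashes first so we don't double-escape the
        // backslashes introduced for quotes
        Cow::Owned(name.replace('\\', "\\\\").replace('"', "\\\""))
    } else {
        Cow::Borrowed(name)
    }
}

fn main() {
    // common case: no copy is made
    assert!(matches!(maybe_escape("plain"), Cow::Borrowed(_)));
    // rare case: an escaped copy is allocated
    assert_eq!(maybe_escape("a\"b"), "a\\\"b");
    println!("ok");
}
```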

Comment thread native/spark-expr/src/to_json.rs Outdated
Comment on lines +173 to +179
if quotes_needed[col_index] {
json.push('"');
}
json.push_str(string_arrays[col_index].value(row_index));
if quotes_needed[col_index] {
json.push('"');
}
Contributor

I think there is an issue here if the value in string_arrays[col_index].value(row_index) contains a literal " character. I think Spark (or some other JSON library) would escape such characters.

There are probably also other things, like newlines and tabs, that also need to be handled.

Member Author

I have now added escaping for common cases such as \t, \r, \n, \b, \f, and double quotes.
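The escaping of those cases can be sketched with a single character-by-character pass. This is a sketch of the technique, not the code merged in the PR:

```rust
// Illustrative JSON string escaping covering the cases mentioned above:
// double quote, backslash, tab, carriage return, newline, backspace,
// and form feed. Real implementations must also escape other control
// characters (U+0000..U+001F) as \uXXXX per the JSON spec.
fn escape_json(value: &str) -> String {
    let mut out = String::with_capacity(value.len());
    for c in value.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\t' => out.push_str("\\t"),
            '\r' => out.push_str("\\r"),
            '\n' => out.push_str("\\n"),
            '\u{0008}' => out.push_str("\\b"), // backspace
            '\u{000C}' => out.push_str("\\f"), // form feed
            _ => out.push(c),
        }
    }
    out
}

fn main() {
    println!("{}", escape_json("say \"hi\"\tthen newline\n"));
}
```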

Comment thread native/spark-expr/src/to_json.rs Outdated
}
// quoted field name
json.push('"');
json.push_str(&field_names[col_index]);
Contributor

The field_name also needs to be escaped if it contains problematic characters.

@andygrove
Member Author

Thanks for the review @eejbyfeldt! It is really appreciated. Some very good feedback there. I will address the feedback over the next day or two.

@parthchandra
Contributor

Sorry @andygrove for this late review. I don't know if one can improve on @eejbyfeldt's review.
To address some of the handling of escape characters, should we look at using something like serde_json?

@andygrove andygrove mentioned this pull request Aug 19, 2024
5 tasks
andygrove and others added 3 commits August 25, 2024 11:05
Co-authored-by: Emil Ejbyfeldt <emil.ejbyfeldt@gmail.com>
@andygrove andygrove marked this pull request as draft August 25, 2024 17:19
@andygrove andygrove marked this pull request as ready for review August 26, 2024 15:23
@andygrove
Member Author

@parthchandra @eejbyfeldt This is ready for another review

Contributor

@parthchandra parthchandra left a comment

lgtm

@andygrove andygrove requested review from huaxingao and viirya August 28, 2024 20:58
Contributor

@huaxingao huaxingao left a comment

LGTM. Thanks @andygrove


@andygrove andygrove merged commit cd530f8 into apache:main Aug 28, 2024
@andygrove andygrove deleted the to-json branch August 28, 2024 23:39
himadripal pushed a commit to himadripal/datafusion-comet that referenced this pull request Sep 7, 2024
* add skeleton for StructsToJson

* first test passes

* add support for nested structs

* add support for strings and improve test

* clippy

* format

* prepare for review

* remove perf results

* update user guide

* add microbenchmark

* remove comment

* update docs

* reduce size of diff

* add failing test for quotes in field names and values

* test passes

* clippy

* revert a docs change

* Update native/spark-expr/src/to_json.rs

Co-authored-by: Emil Ejbyfeldt <emil.ejbyfeldt@gmail.com>

* address feedback

* support tabs

* newlines

* backspace

* clippy

* fix test regression

* cargo fmt

---------

Co-authored-by: Emil Ejbyfeldt <emil.ejbyfeldt@gmail.com>
(cherry picked from commit cd530f8)

Development

Successfully merging this pull request may close these issues.

Implement initial version of to_json

8 participants