feat: Implement to_json for subset of types #805
Conversation
Codecov Report

Attention: Patch coverage is

Additional details and impacted files

```
@@            Coverage Diff             @@
##             main     #805      +/-   ##
============================================
- Coverage     33.94%   33.82%   -0.12%
+ Complexity      874      871       -3
============================================
  Files           112      112
  Lines         42916    42887      -29
  Branches       9464     9456       -8
============================================
- Hits          14567    14508      -59
- Misses        25379    25395      +16
- Partials       2970     2984      +14
```

☔ View full report in Codecov by Sentry.
@parthchandra could you review?

I would also love to review this. Will plan it for tomorrow.

@eejbyfeldt @Kimahriman you may also be interested in reviewing this one.
LGTM |
```scala
val isSupported = child.dataType match {
  case s: StructType =>
    s.fields.forall(f => isSupportedType(f.dataType))
  case _ =>
```
My reading of the Spark code for this expression is here: https://github.com/apache/spark/blob/bfddd53d98da866b474464321e5b323a3df32e81/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala#L832-L833

Nit: despite the name being `StructsToJson`, it also handles Maps and Arrays. Should we mention that as a TODO here?
I think @eejbyfeldt's point is correct. Maps and Arrays can effectively be jsonified, but they are not `StructType`, so we need to extend the pattern matching to support them.
I added comments for map/array support
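To make the pattern-matching discussion concrete, here is an illustrative sketch only: a toy `DataType` enum standing in for Spark's types (not the actual Comet code), showing how the recursive support check could mark Map and Array as unsupported TODOs.

```rust
// Toy stand-in for Spark's DataType; names here are hypothetical.
enum DataType {
    Int,
    Utf8,
    Struct(Vec<DataType>),
    Array(Box<DataType>),
    Map(Box<DataType>, Box<DataType>),
}

fn is_supported_type(dt: &DataType) -> bool {
    match dt {
        DataType::Int | DataType::Utf8 => true,
        // structs are supported if all of their fields are
        DataType::Struct(fields) => fields.iter().all(is_supported_type),
        // TODO: Spark's StructsToJson also handles these; not supported yet
        DataType::Array(_) | DataType::Map(_, _) => false,
    }
}

fn main() {
    let s = DataType::Struct(vec![DataType::Int, DataType::Utf8]);
    println!("{}", is_supported_type(&s)); // true
    let m = DataType::Map(Box::new(DataType::Utf8), Box::new(DataType::Int));
    println!("{}", is_supported_type(&m)); // false
}
```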
| ----------------- | --------------------------------- |
| CreateNamedStruct | Create a struct                   |
| GetElementAt      | Access a field in a struct        |
| StructsToJson     | Convert a struct to a JSON string |
Nit: In other parts of this document, the Notes section is only used to document compatibility issues and/or limitations. Should we follow that here as well? I think the expression names are mostly self-describing, and the extra comments do not really add that much. (Maybe for GetElementAt it is a bit unclear what it maps to in Spark/SQL.)
Good point. I have removed these comments.
```rust
fn struct_to_json(array: &StructArray, timezone: &str) -> Result<ArrayRef> {
    // get field names
    let field_names: Vec<String> = array.fields().iter().map(|f| f.name().clone()).collect();
```
This looks like it creates some unnecessary copies of the field names. Any reason not to change this to

```diff
-    let field_names: Vec<String> = array.fields().iter().map(|f| f.name().clone()).collect();
+    let fields = array.fields();
```
and the usage site then becomes
```rust
json.push_str(fields[col_index].name());
```
Or is there some reason to make copies that I am missing?
This code is now escaping the field names (if needed), so it requires a copy in that case.
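One way to get the best of both worlds (borrow when no escaping is needed, copy only when it is) is `Cow`. This is a minimal sketch, not the actual Comet code; `maybe_escape` is a hypothetical helper, and for brevity it only escapes quotes and backslashes.

```rust
use std::borrow::Cow;

// Borrow the field name unless it contains characters that need JSON
// escaping, in which case allocate an escaped copy.
fn maybe_escape(name: &str) -> Cow<'_, str> {
    if name.contains('"') || name.contains('\\') {
        // Escape backslashes first so the backslashes added for quotes
        // are not escaped again.
        Cow::Owned(name.replace('\\', "\\\\").replace('"', "\\\""))
    } else {
        Cow::Borrowed(name)
    }
}

fn main() {
    println!("{}", maybe_escape("city"));  // borrowed, no allocation
    println!("{}", maybe_escape("a\"b")); // owned, escaped copy
}
```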
```rust
if quotes_needed[col_index] {
    json.push('"');
}
json.push_str(string_arrays[col_index].value(row_index));
if quotes_needed[col_index] {
    json.push('"');
}
```
I think there is an issue here if the value in `string_arrays[col_index].value(row_index)` contains a literal `"` character. I think Spark (or some other JSON library) would escape such characters.
There are probably also other things, like newlines and tabs, that also need to be handled.
I have now added escaping for common cases such as `\t`, `\r`, `\n`, `\b`, `\f`, and double quotes.
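A sketch of escaping for the cases mentioned above; `escape_json_value` is a hypothetical helper, not the exact Comet implementation.

```rust
// Escape control characters and quotes so the value can be embedded
// in a JSON string literal.
fn escape_json_value(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            '\u{0008}' => out.push_str("\\b"), // backspace
            '\u{000C}' => out.push_str("\\f"), // form feed
            _ => out.push(c),
        }
    }
    out
}

fn main() {
    println!("{}", escape_json_value("line1\nline2\t\"quoted\""));
}
```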
```rust
}
// quoted field name
json.push('"');
json.push_str(&field_names[col_index]);
```
The field name also needs to be escaped if it contains problematic characters.
Thanks for the review @eejbyfeldt! It is really appreciated. Some very good feedback there. I will address the feedback over the next day or two.
Sorry @andygrove for this late review. I don't know if one can improve on @eejbyfeldt's review.
Co-authored-by: Emil Ejbyfeldt <emil.ejbyfeldt@gmail.com>
@parthchandra @eejbyfeldt This is ready for another review |
huaxingao left a comment:

LGTM. Thanks @andygrove
* add skeleton for StructsToJson
* first test passes
* add support for nested structs
* add support for strings and improve test
* clippy
* format
* prepare for review
* remove perf results
* update user guide
* add microbenchmark
* remove comment
* update docs
* reduce size of diff
* add failing test for quotes in field names and values
* test passes
* clippy
* revert a docs change
* Update native/spark-expr/src/to_json.rs

Co-authored-by: Emil Ejbyfeldt <emil.ejbyfeldt@gmail.com>

* address feedback
* support tabs
* newlines
* backspace
* clippy
* fix test regression
* cargo fmt

---------

Co-authored-by: Emil Ejbyfeldt <emil.ejbyfeldt@gmail.com>
(cherry picked from commit cd530f8)
Which issue does this PR close?
Closes #631
Rationale for this change
This is part of our effort to support more operations on complex types.
Performance seems ok.
What changes are included in this PR?
This PR adds an implementation of to_json that works for a subset of types (structs, primitives, strings). There is no support for date or timestamp yet.
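The row-wise conversion the PR describes (building a JSON string per row from struct fields, recursing into nested structs) can be sketched with plain Rust types. The `Value` enum and `to_json` function below are hypothetical stand-ins for the Arrow-based implementation, and string escaping is omitted for brevity.

```rust
// Toy value model: ints, strings, and (possibly nested) structs.
enum Value {
    Int(i64),
    Str(String),
    Struct(Vec<(String, Value)>),
}

// Recursively render a value as a JSON string.
fn to_json(v: &Value) -> String {
    match v {
        Value::Int(i) => i.to_string(),
        Value::Str(s) => format!("\"{}\"", s), // escaping omitted for brevity
        Value::Struct(fields) => {
            let parts: Vec<String> = fields
                .iter()
                .map(|(name, val)| format!("\"{}\":{}", name, to_json(val)))
                .collect();
            format!("{{{}}}", parts.join(","))
        }
    }
}

fn main() {
    let row = Value::Struct(vec![
        ("id".to_string(), Value::Int(1)),
        ("name".to_string(), Value::Str("a".to_string())),
        (
            "nested".to_string(),
            Value::Struct(vec![("x".to_string(), Value::Int(2))]),
        ),
    ]);
    println!("{}", to_json(&row)); // {"id":1,"name":"a","nested":{"x":2}}
}
```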
How are these changes tested?
New Rust tests and Spark tests.