Conversation
Signed-off-by: Xuanwo <github@xuanwo.io>
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #4577 +/- ##
==========================================
- Coverage 80.99% 80.85% -0.15%
==========================================
Files 314 314
Lines 115727 116609 +882
Branches 115727 116609 +882
==========================================
+ Hits 93735 94283 +548
- Misses 18682 18952 +270
- Partials 3310 3374 +64
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
westonpace
left a comment
There was a problem hiding this comment.
Quick initial question. How do these compare to https://github.com/datafusion-contrib/datafusion-functions-json ?
json_get::json_get_udf(),
json_get_bool::json_get_bool_udf(),
json_get_float::json_get_float_udf(),
json_get_int::json_get_int_udf(),
json_get_json::json_get_json_udf(),
json_get_array::json_get_array_udf(),
json_as_text::json_as_text_udf(),
json_get_str::json_get_str_udf(),
json_contains::json_contains_udf(),
json_length::json_length_udf(),
json_object_keys::json_object_keys_udf(),Perhaps it would be better to implement them directly on |
westonpace
left a comment
There was a problem hiding this comment.
Very cool, I have some suggestions but no major concerns.
| .iter() | ||
| .map(|arg| match arg { | ||
| ColumnarValue::Array(arr) => arr.clone(), | ||
| ColumnarValue::Scalar(scalar) => scalar.to_array().unwrap(), |
There was a problem hiding this comment.
It would be great if we could avoid broadcasting the scalars here. Let's create a follow-up?
| match raw_jsonb.get_by_name(key, false) { | ||
| Ok(Some(value)) => return Ok(Some(value.as_raw().as_ref().to_vec())), | ||
| Ok(None) => {} | ||
| Err(e) => { | ||
| return Err(datafusion::error::DataFusionError::Execution(format!( | ||
| "Failed to get field: {}", | ||
| e | ||
| ))) | ||
| } | ||
| } | ||
|
|
||
| // Try as array index | ||
| if let Ok(index) = key.parse::<usize>() { |
There was a problem hiding this comment.
It might be more efficient to determine if the key is a field or an integer once instead of each value? Although I guess it could be an integer-named field and so vary from value to value 😛
Signed-off-by: Xuanwo <github@xuanwo.io>
Signed-off-by: Xuanwo <github@xuanwo.io>
|
Catching some bugs from python side, working on it. |
Signed-off-by: Xuanwo <github@xuanwo.io>
Signed-off-by: Xuanwo <github@xuanwo.io>
|
I really don't like how I'm handling JSON and JSONB conventions in the code. Maybe I over-optimized too early, adding unnecessary complexity just to allow we can perform UDFs on JSONB instead of the entire JSON. I'm starting to reconsider @westonpace's suggestion to keep JSONB as part of the file format instead of treating it as a logical type. While it's working for now, we might as well keep moving forward and see how it holds up in the long run. We can always make adjustments later. I'll merge this PR first and do anyother changes in follow-up PRs. This will also unlock us to implement index over JSON. |
Signed-off-by: Xuanwo <github@xuanwo.io>
Signed-off-by: Xuanwo <github@xuanwo.io>
|
The follow-up tracked at #4595 |
Part of lance-format#4516 This PR will add UDFs for JSON so users can read JSON like: ```python ds_path = Path(tmpdir) / "nested_json_test.lance" lance.write_dataset(table, ds_path, data_storage_version="2.2") dataset = lance.dataset(ds_path) # Access nested fields using json_get recursively # First get user, then profile, then name result = dataset.to_table( filter="json_get_string(json_get(json_get(data, 'user'), 'profile'), 'name') = 'Alice'" ) assert result.num_rows == 1 assert result["id"][0].as_py() == 1 # Or use JSONPath for deep access result = dataset.to_table( filter="json_extract(data, '$.user.profile.settings.theme') = '\"dark\"'" ) assert result.num_rows == 1 assert result["id"][0].as_py() == 1 ``` --- **This PR was primarily authored with Claude Code using Opus 4.1 and then hand-reviewed by me. I AM responsible for every change made in this PR. I aimed to keep it aligned with our goals, though I may have missed minor issues. Please flag anything that feels off, I'll fix it quickly.** --------- Signed-off-by: Xuanwo <github@xuanwo.io>
Part of #4516
This PR will add UDFs for JSON so users can read JSON like:
This PR was primarily authored with Claude Code using Opus 4.1 and then hand-reviewed by me. I AM responsible for every change made in this PR. I aimed to keep it aligned with our goals, though I may have missed minor issues. Please flag anything that feels off, I'll fix it quickly.