@tustvold tustvold commented Dec 20, 2022

Which issue does this PR close?

Relates to #3373

Rationale for this change

JSON data must be UTF-8, so fields annotated as JSON can safely be inferred as containing UTF-8 data. This gives a better experience, because a BinaryArray cannot be output to non-binary formats such as CSV and JSON.
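
As a rough illustration of the inference described above, here is a minimal, self-contained sketch. It is a simplified model and not the parquet crate's actual code: `BinaryAnnotation`, `InferredType` and `infer_binary_column` are hypothetical names introduced only for this example.

```rust
/// Hypothetical, simplified stand-in for the annotation on a Parquet BINARY column.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum BinaryAnnotation {
    /// BINARY with no logical or converted type.
    Unannotated,
    /// BINARY (UTF8)
    Utf8,
    /// BINARY (STRING)
    StringType,
    /// BINARY (JSON)
    Json,
}

/// Hypothetical, simplified stand-in for the inferred Arrow data type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum InferredType {
    Binary,
    Utf8,
}

/// Maps a BINARY column's annotation to the inferred type.
/// JSON is treated like the string annotations because JSON data must be UTF-8.
pub fn infer_binary_column(annotation: BinaryAnnotation) -> InferredType {
    match annotation {
        BinaryAnnotation::Utf8
        | BinaryAnnotation::StringType
        | BinaryAnnotation::Json => InferredType::Utf8,
        BinaryAnnotation::Unannotated => InferredType::Binary,
    }
}
```

Under this model, a column declared as `BINARY json (JSON)` maps to the UTF-8 type, and only an unannotated `BINARY` column stays binary.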

What changes are included in this PR?

Are there any user-facing changes?

@tustvold tustvold added the api-change Changes to the arrow API label Dec 20, 2022
@github-actions github-actions bot added the parquet Changes to the parquet crate label Dec 20, 2022
@tustvold tustvold changed the title from "Infer JSON as UTF-8" to "Infer Parquet JSON Logical and Converted Type as UTF-8" Dec 20, 2022
@tustvold

There is a draft PR in apache/arrow#13901 that adds a canonical arrow extension type for JSON data backed by UTF-8 arrays.

@tustvold tustvold requested a review from alamb December 20, 2022 15:16

alamb commented Dec 20, 2022

Seems reasonable to me, though I think we ought to have a test for it to avoid regressions in the future.

@tustvold

There is a test?

OPTIONAL FLOAT float;
OPTIONAL BINARY string (UTF8);
OPTIONAL BINARY string_2 (STRING);
OPTIONAL BINARY json (JSON);

👍 here is the test

@tustvold tustvold merged commit a8968cd into apache:master Dec 20, 2022

ursabot commented Dec 20, 2022

Benchmark runs are scheduled for baseline = 9cdc1c1 and contender = a8968cd. a8968cd is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
