fix: reading from partitioned json & arrow tables
#9431
Conversation
253492f to a46e0b9
Thank you @korowa. Looks very nice 🙏 @devinjdangelo any chance you have some time to look at this one?
a46e0b9 to d12a355
devinjdangelo
left a comment
Thank you for digging into this @korowa! The changes are clear, well tested, and fixed the issue with reading partitioned JSON files.
A quick test locally verifies this as well:
DataFusion CLI v36.0.0
❯ COPY (values (1, 'a', 'x'), (2, 'b', 'y'), (3, 'c', 'z')) TO 'test_files/scratch/copy/partitioned_table2/'
(format json, partition_by 'column2, column3');
+-------+
| count |
+-------+
| 3 |
+-------+
1 row in set. Query took 0.030 seconds.
❯ CREATE EXTERNAL TABLE validate_partitioned STORED AS JSON
LOCATION 'test_files/scratch/copy/partitioned_table2/' PARTITIONED BY (column2, column3);
0 rows in set. Query took 0.001 seconds.
❯ select * from validate_partitioned;
+---------+---------+---------+
| column1 | column2 | column3 |
+---------+---------+---------+
| 1 | a | x |
| 3 | c | z |
| 2 | b | y |
+---------+---------+---------+
3 rows in set. Query took 0.002 seconds.

I noticed some inconsistencies between how schema projection information is passed into JsonOpener, CsvOpener, and ArrowOpener, which I think would be good to clean up so we aren't duplicating logic unnecessarily. I also think we will be able to fix the errors when trying to read partitioned Arrow IPC file tables very similarly to the fix for JSON in this PR. I can cut a follow-on issue or two for this after we merge this PR.
Thanks again 🙏
}

/// Projects only file schema, ignoring partition columns
pub(crate) fn projected_file_schema(&self) -> SchemaRef {
This method looks good, but it would be nice if we could leverage file_column_projection_indices (which CsvOpener uses) so we aren't duplicating the logic to exclude the partition columns.
I think in general we could make Csv, Json, and Arrow file opening / configuring more consistent. We can cut follow on tickets for this.
Fixed -- now this method returns a schema with fields collected by iterating over the `file_column_projection_indices` result.
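For illustration, here is a minimal standalone sketch of that shape, assuming a free function that is handed the table schema and the already-computed file-column indices (the real method lives on `FileScanConfig` and derives the indices itself):

```rust
use std::sync::Arc;

use arrow_schema::{DataType, Field, Schema, SchemaRef};

/// Hypothetical standalone version of the idea above: build the projected
/// file schema by walking the already-filtered file-column indices instead
/// of re-implementing the partition-column exclusion.
fn projected_file_schema(table_schema: &SchemaRef, file_indices: &[usize]) -> SchemaRef {
    let fields: Vec<Field> = file_indices
        .iter()
        .map(|i| table_schema.field(*i).clone())
        .collect();
    Arc::new(Schema::new(fields))
}

fn main() {
    // `column2` plays the role of a partition column appended after the file fields.
    let table_schema: SchemaRef = Arc::new(Schema::new(vec![
        Field::new("column1", DataType::Int64, true),
        Field::new("column2", DataType::Utf8, false),
    ]));
    // Only index 0 points at a real file column in this toy example.
    let projected = projected_file_schema(&table_schema, &[0]);
    assert_eq!(projected.fields().len(), 1);
    assert_eq!(projected.field(0).name(), "column1");
}
```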
name VARCHAR,
ts TIMESTAMP,
c_date DATE,
name VARCHAR,
Are the changes to this test related / required? I think it would be better to leave this test as-is and add a new one if required so we can validate that we haven't inadvertently changed how CSV partitioned tables are read.
I got a somewhat confusing result for the JSON table and decided to also pin it down in a test for CSV -- you're right, it's unrelated, and I'll check for / file an issue regarding it instead. For now I've removed the modifications to this file.
Definitely feel free to file an issue if you came across an unrelated problem or something confusing where we could improve documentation.
# Issue open for this error: https://github.com/apache/arrow-datafusion/issues/7816
query error DataFusion error: Arrow error: Json error: Encountered unmasked nulls in non\-nullable StructArray child: Field \{ name: "a", data_type: Utf8, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: \{\} \}
query TT
select * from partitioned_insert_test_json order by a,b
🥳
4a13168 to 7be4bf8
@devinjdangelo, thank you for pointing it out -- it's the same issue (even .arrow tests in
devinjdangelo
left a comment
Thanks again @korowa! Glad to see both JSON and arrow partitioned table reads working correctly now 🚀
DataFusion CLI v36.0.0
❯ COPY (values (1, 'a', 'x'), (2, 'b', 'y'), (3, 'c', 'z')) TO 'test_files/scratch/copy/partitioned_table/'
(format arrow, partition_by 'column2, column3');
+-------+
| count |
+-------+
| 3 |
+-------+
1 row in set. Query took 0.022 seconds.
❯ CREATE EXTERNAL TABLE validate_partitioned_arrow STORED AS arrow
LOCATION 'test_files/scratch/copy/partitioned_table/' PARTITIONED BY (column2, column3);
0 rows in set. Query took 0.001 seconds.
❯ select * from validate_partitioned_arrow;
+---------+---------+---------+
| column1 | column2 | column3 |
+---------+---------+---------+
| 1 | a | x |
| 2 | b | y |
| 3 | c | z |
+---------+---------+---------+
3 rows in set. Query took 0.001 seconds.
b

# https://github.com/apache/arrow-datafusion/issues/7816
query error DataFusion error: Arrow error: Schema error: project index 1 out of bounds, max field 1
Very nice! Glad to see the fix was essentially the same for arrow files.
alamb
left a comment
Thank you @korowa and @devinjdangelo for the review. A very nice improvement. 🏆
* fix partitioned table reading for json
* wildcard projection test for csv partitioned table
* review comments
* fix partitioned arrow tables reading
Which issue does this PR close?
Closes #7686.
Closes #7816.
Rationale for this change
Fix failures on reading partition columns from a partitioned JSON table, caused by passing the `FileScanConfig::project()` result schema to `JsonOpener` -- this function returns the union of file columns and partition columns (which are not present in the JSON files, so they are read as NULLs).

Similar for the ARROW format -- partition columns were passed to the file reader, which attempted to read them.
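To make the failure mode concrete, here is a small standalone sketch using the arrow-json reader directly (not DataFusion itself; the column names are invented): giving the reader a schema that includes a partition column makes that column come back as NULLs, while giving it only the file schema avoids the problem.

```rust
use std::io::Cursor;
use std::sync::Arc;

use arrow_array::Array;
use arrow_json::ReaderBuilder;
use arrow_schema::{DataType, Field, Schema};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Two newline-delimited JSON rows; only `column1` exists in the file.
    // `column2` stands in for a hive-style partition column.
    let data = "{\"column1\": 1}\n{\"column1\": 2}\n";

    // Handing the reader the full table schema (file + partition columns)
    // makes it look for `column2` inside the file, so it comes back as NULLs.
    let table_schema = Arc::new(Schema::new(vec![
        Field::new("column1", DataType::Int64, true),
        Field::new("column2", DataType::Utf8, true),
    ]));
    let mut reader = ReaderBuilder::new(table_schema).build(Cursor::new(data))?;
    let batch = reader.next().unwrap()?;
    assert_eq!(batch.column(1).null_count(), 2);

    // Handing it only the file schema avoids that; the partition values are
    // attached afterwards from the file path by the scan, not by the reader.
    let file_schema = Arc::new(Schema::new(vec![Field::new(
        "column1",
        DataType::Int64,
        true,
    )]));
    let mut reader = ReaderBuilder::new(file_schema).build(Cursor::new(data))?;
    let batch = reader.next().unwrap()?;
    assert_eq!(batch.num_columns(), 1);
    Ok(())
}
```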
What changes are included in this PR?
- `FileScanConfig::projected_file_schema` -- returns a schema containing the intersection of fields from `file_schema` and `projection`, and acts like the `project` function with partition columns filtered out
- `NdJsonExec` uses the `FileScanConfig::projected_file_schema` result for creating its file reader.
- `ArrowExec` now uses the `file_column_projection_indices` result for building its file reader (intersection of file columns and scan projection columns; see the sketch below).
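For illustration, a rough standalone sketch of the index filtering that `file_column_projection_indices` performs (the real method is on `FileScanConfig` and returns an `Option`; this free function only shows the idea):

```rust
/// Keep only projection indices that point into the file schema; indices at
/// or beyond `file_field_count` refer to partition columns appended after
/// the file columns, so the file reader must not see them.
fn file_column_projection_indices(projection: &[usize], file_field_count: usize) -> Vec<usize> {
    projection
        .iter()
        .copied()
        .filter(|i| *i < file_field_count)
        .collect()
}

fn main() {
    // Table layout: [column1 (file), column2 (partition), column3 (partition)].
    // A `SELECT *` over the table projects [0, 1, 2], but the Arrow IPC / JSON
    // reader should only be asked for index 0.
    assert_eq!(file_column_projection_indices(&[0, 1, 2], 1), vec![0]);
}
```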
Are these changes tested?

Unit tests for `FileScanConfig` & sqllogictests.

Are there any user-facing changes?
Should fix reading from partitioned JSON & ARROW tables.