fix: reading from partitioned json & arrow tables
#9431
Conversation
253492f to a46e0b9
Thank you @korowa. Looks very nice 🙏 @devinjdangelo any chance you have some time to look at this one?
a46e0b9 to d12a355
devinjdangelo
left a comment
Thank you for digging into this @korowa! The changes are clear, well tested, and fixed the issue with reading partitioned JSON files.
A quick test locally verifies this as well:
DataFusion CLI v36.0.0
❯ COPY (values (1, 'a', 'x'), (2, 'b', 'y'), (3, 'c', 'z')) TO 'test_files/scratch/copy/partitioned_table2/'
(format json, partition_by 'column2, column3');
+-------+
| count |
+-------+
| 3 |
+-------+
1 row in set. Query took 0.030 seconds.
❯ CREATE EXTERNAL TABLE validate_partitioned STORED AS JSON
LOCATION 'test_files/scratch/copy/partitioned_table2/' PARTITIONED BY (column2, column3);
0 rows in set. Query took 0.001 seconds.
❯ select * from validate_partitioned;
+---------+---------+---------+
| column1 | column2 | column3 |
+---------+---------+---------+
| 1 | a | x |
| 3 | c | z |
| 2 | b | y |
+---------+---------+---------+
3 rows in set. Query took 0.002 seconds.

I noticed some inconsistencies between how schema projection information is passed into JsonOpener, CsvOpener, and ArrowOpener, which I think would be good to clean up so we aren't duplicating logic unnecessarily. I also think we will be able to fix the errors when trying to read partitioned Arrow IPC file tables very similarly to the fix for JSON in this PR. I can cut a follow-on issue or two for this after we merge this PR.
Thanks again 🙏
}

/// Projects only file schema, ignoring partition columns
pub(crate) fn projected_file_schema(&self) -> SchemaRef {
This method looks good, but it would be nice if we could leverage file_column_projection_indices (which CsvOpener uses) so we aren't duplicating the logic to exclude the partition columns.
I think in general we could make Csv, Json, and Arrow file opening / configuring more consistent. We can cut follow on tickets for this.
Fixed -- now this method returns a schema with fields collected by iterating over the `file_column_projection_indices` result.
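For illustration, here is a minimal standalone sketch of that shape, assuming a free function that is handed the table schema and the already-computed file-column indices (the real method lives on `FileScanConfig` and derives the indices itself):

```rust
use std::sync::Arc;

use arrow_schema::{DataType, Field, Schema, SchemaRef};

/// Hypothetical standalone version of the idea above: build the projected
/// file schema by walking the already-filtered file-column indices instead
/// of re-implementing the partition-column exclusion.
fn projected_file_schema(table_schema: &SchemaRef, file_indices: &[usize]) -> SchemaRef {
    let fields: Vec<Field> = file_indices
        .iter()
        .map(|i| table_schema.field(*i).clone())
        .collect();
    Arc::new(Schema::new(fields))
}

fn main() {
    // `column2` plays the role of a partition column appended after the file fields.
    let table_schema: SchemaRef = Arc::new(Schema::new(vec![
        Field::new("column1", DataType::Int64, true),
        Field::new("column2", DataType::Utf8, false),
    ]));
    // Only index 0 points at a real file column in this toy example.
    let projected = projected_file_schema(&table_schema, &[0]);
    assert_eq!(projected.fields().len(), 1);
    assert_eq!(projected.field(0).name(), "column1");
}
```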
name VARCHAR,
ts TIMESTAMP,
c_date DATE,
name VARCHAR,
Are the changes to this test related / required? I think it would be better to leave this test as-is and add a new one if required so we can validate that we haven't inadvertently changed how CSV partitioned tables are read.
I got a somewhat confusing result for the JSON table and decided to also pin it down in a test for CSV -- you're right, it's unrelated, and I'll check for / file an issue regarding it instead. For now I've removed the modifications to this file.
Definitely feel free to file an issue if you came across an unrelated problem or something confusing where we could improve documentation.
# Issue open for this error: https://github.com/apache/arrow-datafusion/issues/7816
query error DataFusion error: Arrow error: Json error: Encountered unmasked nulls in non\-nullable StructArray child: Field \{ name: "a", data_type: Utf8, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: \{\} \}
query TT
select * from partitioned_insert_test_json order by a,b
🥳
4a13168 to 7be4bf8
@devinjdangelo, thank you for pointing it out -- it's the same issue (even .arrow tests in
devinjdangelo
left a comment
Thanks again @korowa! Glad to see both JSON and arrow partitioned table reads working correctly now 🚀
DataFusion CLI v36.0.0
❯ COPY (values (1, 'a', 'x'), (2, 'b', 'y'), (3, 'c', 'z')) TO 'test_files/scratch/copy/partitioned_table/'
(format arrow, partition_by 'column2, column3');
+-------+
| count |
+-------+
| 3 |
+-------+
1 row in set. Query took 0.022 seconds.
❯ CREATE EXTERNAL TABLE validate_partitioned_arrow STORED AS arrow
LOCATION 'test_files/scratch/copy/partitioned_table/' PARTITIONED BY (column2, column3);
0 rows in set. Query took 0.001 seconds.
❯ select * from validate_partitioned_arrow;
+---------+---------+---------+
| column1 | column2 | column3 |
+---------+---------+---------+
| 1 | a | x |
| 2 | b | y |
| 3 | c | z |
+---------+---------+---------+
3 rows in set. Query took 0.001 seconds.
b

# https://github.com/apache/arrow-datafusion/issues/7816
query error DataFusion error: Arrow error: Schema error: project index 1 out of bounds, max field 1
Very nice! Glad to see the fix was essentially the same for arrow files.
alamb
left a comment
Thank you @korowa and @devinjdangelo for the review. A very nice improvement. 🏆
* fix partitioned table reading for json
* wildcard projection test for csv partitioned table
* review comments
* fix partitioned arrow tables reading
Which issue does this PR close?
Closes #7686.
Closes #7816.
Rationale for this change
Fix failures on reading partition columns from a partitioned JSON table, caused by passing the `FileScanConfig::project()` result schema to `JsonOpener` -- this function returns the union of file columns and partition columns (which are not present in the JSON files, so they are read as NULLs).

Similar for the ARROW format -- partition columns were passed to the file reader, which attempted to read them.
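To make the failure mode concrete, here is a small standalone sketch using the arrow-json reader directly (not DataFusion itself; the column names are invented): giving the reader a schema that includes a partition column makes that column come back as NULLs, while giving it only the file schema avoids the problem.

```rust
use std::io::Cursor;
use std::sync::Arc;

use arrow_array::Array;
use arrow_json::ReaderBuilder;
use arrow_schema::{DataType, Field, Schema};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Two newline-delimited JSON rows; only `column1` exists in the file.
    // `column2` stands in for a hive-style partition column.
    let data = "{\"column1\": 1}\n{\"column1\": 2}\n";

    // Handing the reader the full table schema (file + partition columns)
    // makes it look for `column2` inside the file, so it comes back as NULLs.
    let table_schema = Arc::new(Schema::new(vec![
        Field::new("column1", DataType::Int64, true),
        Field::new("column2", DataType::Utf8, true),
    ]));
    let mut reader = ReaderBuilder::new(table_schema).build(Cursor::new(data))?;
    let batch = reader.next().unwrap()?;
    assert_eq!(batch.column(1).null_count(), 2);

    // Handing it only the file schema avoids that; the partition values are
    // attached afterwards from the file path by the scan, not by the reader.
    let file_schema = Arc::new(Schema::new(vec![Field::new(
        "column1",
        DataType::Int64,
        true,
    )]));
    let mut reader = ReaderBuilder::new(file_schema).build(Cursor::new(data))?;
    let batch = reader.next().unwrap()?;
    assert_eq!(batch.num_columns(), 1);
    Ok(())
}
```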
What changes are included in this PR?
- `FileScanConfig::projected_file_schema` -- returns a schema containing the intersection of fields from `file_schema` and `projection`, and acts like the `project` function with partition columns filtered out
- `NdJsonExec` uses the `FileScanConfig::projected_file_schema` result for creating its file reader.
- `ArrowExec` now uses the `file_column_projection_indices` result for building its file reader (intersection of file columns and scan projection columns; see the sketch below).
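For illustration, a rough standalone sketch of the index filtering that `file_column_projection_indices` performs (the real method is on `FileScanConfig` and returns an `Option`; this free function only shows the idea):

```rust
/// Keep only projection indices that point into the file schema; indices at
/// or beyond `file_field_count` refer to partition columns appended after
/// the file columns, so the file reader must not see them.
fn file_column_projection_indices(projection: &[usize], file_field_count: usize) -> Vec<usize> {
    projection
        .iter()
        .copied()
        .filter(|i| *i < file_field_count)
        .collect()
}

fn main() {
    // Table layout: [column1 (file), column2 (partition), column3 (partition)].
    // A `SELECT *` over the table projects [0, 1, 2], but the Arrow IPC / JSON
    // reader should only be asked for index 0.
    assert_eq!(file_column_projection_indices(&[0, 1, 2], 1), vec![0]);
}
```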
Are these changes tested?

Unit tests for `FileScanConfig` & sqllogictests.

Are there any user-facing changes?
Should fix reading from partitioned JSON & ARROW tables.