@korowa korowa commented Mar 2, 2024

Which issue does this PR close?

Closes #7686.
Closes #7816.

Rationale for this change

Fixes failures when reading partition columns from a partitioned JSON table. The failures were caused by passing the schema returned by FileScanConfig::project() to JsonOpener: that function returns the union of file columns and partition columns, and the partition columns are not present in the JSON files, so they could be read as NULLs.

The same applies to the ARROW format: partition columns were passed to the file reader, which attempted to read them.
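
To illustrate why partition columns are absent from the files themselves, here is a minimal sketch that recovers partition values from a file path. It assumes a Hive-style `key=value` directory layout; the function name `partition_values` and the file name in the usage note are hypothetical and are not taken from DataFusion.

```rust
/// Extracts `key=value` partition segments from a Hive-style file path.
/// Simplified illustration only; this is not DataFusion's implementation.
fn partition_values(path: &str) -> Vec<(String, String)> {
    path.split('/')
        // Segments without '=' (directories, the file name) are skipped.
        .filter_map(|segment| segment.split_once('='))
        .map(|(key, value)| (key.to_string(), value.to_string()))
        .collect()
}
```

Under that assumption, a path such as `partitioned_table2/column2=a/column3=x/data.json` would yield the pairs `(column2, a)` and `(column3, x)`, so the file reader only needs to produce the remaining columns.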

What changes are included in this PR?

  • FileScanConfig::projected_file_schema -- returns a schema containing the intersection of fields from file_schema and the projection; it acts like the project function, but filters out partition columns.
  • NdJsonExec uses the FileScanConfig::projected_file_schema result to create the file reader.
  • ArrowExec now uses the file_column_projection_indices result to build the file reader (the intersection of file columns and scan projection columns).
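
The changes listed above can be sketched with a simplified, self-contained model. `ScanConfig` is a hypothetical stand-in for `FileScanConfig`, column names are plain strings instead of Arrow fields, and the method bodies are an assumption about the general approach rather than the code in this PR.

```rust
/// Hypothetical, simplified stand-in for a scan configuration.
struct ScanConfig {
    /// Columns physically stored in each file.
    file_columns: Vec<String>,
    /// Columns derived from the directory layout.
    partition_columns: Vec<String>,
    /// Indices into file columns followed by partition columns;
    /// `None` selects every column.
    projection: Option<Vec<usize>>,
}

impl ScanConfig {
    /// Projected table schema: file and partition columns together.
    fn projected_table_schema(&self) -> Vec<String> {
        let all: Vec<&String> = self
            .file_columns
            .iter()
            .chain(self.partition_columns.iter())
            .collect();
        match &self.projection {
            Some(p) => p.iter().map(|&i| all[i].clone()).collect(),
            None => all.into_iter().cloned().collect(),
        }
    }

    /// Projection indices that refer to file columns only.
    fn file_column_projection_indices(&self) -> Option<Vec<usize>> {
        self.projection.as_ref().map(|p| {
            p.iter()
                .copied()
                .filter(|&i| i < self.file_columns.len())
                .collect()
        })
    }

    /// Schema to hand to the file reader: partition columns filtered out.
    fn projected_file_schema(&self) -> Vec<String> {
        match self.file_column_projection_indices() {
            Some(indices) => indices
                .iter()
                .map(|&i| self.file_columns[i].clone())
                .collect(),
            None => self.file_columns.clone(),
        }
    }
}
```

With one file column (`column1`), two partition columns (`column2`, `column3`) and a projection of `[0, 1, 2]`, the table schema in this model contains all three columns, while the schema passed to the reader contains only `column1`.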

Are these changes tested?

Unit tests for FileScanConfig & sqllogictests

Are there any user-facing changes?

Should fix reading from partitioned JSON & ARROW tables.

@github-actions github-actions bot added the core (Core DataFusion crate) and sqllogictest (SQL Logic Tests (.slt)) labels Mar 2, 2024
@korowa korowa force-pushed the fix_partitioned_json branch from 253492f to a46e0b9 Compare March 2, 2024 18:57
alamb commented Mar 2, 2024

Thank you @korowa. Looks very nice 🙏

@devinjdangelo any chance you have some time to look at this one?

@korowa korowa force-pushed the fix_partitioned_json branch from a46e0b9 to d12a355 Compare March 2, 2024 20:47
@devinjdangelo devinjdangelo left a comment

Thank you for digging into this @korowa! The changes are clear, well tested, and fixed the issue with reading partitioned JSON files.

A quick test locally verifies this as well:

DataFusion CLI v36.0.0
❯ COPY (values (1, 'a', 'x'), (2, 'b', 'y'), (3, 'c', 'z')) TO 'test_files/scratch/copy/partitioned_table2/' 
(format json, partition_by 'column2, column3');
+-------+
| count |
+-------+
| 3     |
+-------+
1 row in set. Query took 0.030 seconds.

❯ CREATE EXTERNAL TABLE validate_partitioned STORED AS JSON 
LOCATION 'test_files/scratch/copy/partitioned_table2/' PARTITIONED BY (column2, column3);
0 rows in set. Query took 0.001 seconds.

❯ select * from validate_partitioned;
+---------+---------+---------+
| column1 | column2 | column3 |
+---------+---------+---------+
| 1       | a       | x       |
| 3       | c       | z       |
| 2       | b       | y       |
+---------+---------+---------+
3 rows in set. Query took 0.002 seconds.

I noticed some inconsistencies between how schema projection information is passed into JsonOpener, CsvOpener, and ArrowOpener, which I think would be good to clean up so we aren't duplicating logic unnecessarily. I also think we will be able to fix the errors when trying to read partitioned Arrow IPC file tables very similarly to the fix for JSON in this PR. I can cut a follow-on issue or two for this after we merge this PR.

Thanks again 🙏

}

/// Projects only file schema, ignoring partition columns
pub(crate) fn projected_file_schema(&self) -> SchemaRef {

This method looks good, but it would be nice if we could leverage file_column_projection_indices (which CsvOpener uses) so we aren't duplicating the logic to exclude the partition columns.

I think in general we could make CSV, JSON, and Arrow file opening / configuring more consistent. We can cut follow-on tickets for this.

Contributor Author

Fixed -- this method now returns a schema whose fields are collected by iterating over the file_column_projection_indices result.

name VARCHAR,
ts TIMESTAMP,
c_date DATE,
name VARCHAR,

Are the changes to this test related / required? I think it would be better to leave this test as-is and add a new one if required so we can validate that we haven't inadvertently changed how CSV partitioned tables are read.

@korowa korowa Mar 3, 2024

I got a slightly confusing result for the JSON table and decided to also pin it in a test for CSV. You're correct, it's unrelated, and I'd better check for (or file) an issue regarding it. For now I've removed the modifications to this file.

Definitely feel free to file an issue if you came across an unrelated problem or something confusing where we could improve documentation.

# Issue open for this error: https://github.com/apache/arrow-datafusion/issues/7816
query error DataFusion error: Arrow error: Json error: Encountered unmasked nulls in non\-nullable StructArray child: Field \{ name: "a", data_type: Utf8, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: \{\} \}
query TT
select * from partitioned_insert_test_json order by a,b

🥳
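
The JSON error removed from the test above can be reproduced with a toy model. This is a sketch under stated assumptions: `Field` and `decode_row` are hypothetical names, records are modelled as string maps, and the example treats `a` as the non-nullable partition column (as the error message suggests) and `b` as a column stored in the file. It is not the Arrow JSON decoder.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified field description (not the Arrow `Field` type).
struct Field {
    name: &'static str,
    nullable: bool,
}

/// Decodes one JSON-like record against `schema`. A key that is absent from
/// the record decodes to `None` (null), which is an error for a
/// non-nullable field.
fn decode_row(
    schema: &[Field],
    record: &HashMap<&str, &str>,
) -> Result<Vec<Option<String>>, String> {
    schema
        .iter()
        .map(|field| match record.get(field.name) {
            Some(value) => Ok(Some(value.to_string())),
            None if field.nullable => Ok(None),
            None => Err(format!("null in non-nullable field {}", field.name)),
        })
        .collect()
}
```

In this model, decoding a record that holds only `b` against a schema that also lists the non-nullable `a` fails, while decoding against a schema restricted to `b` succeeds, which is the effect of handing the reader a file-only schema.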

@korowa korowa force-pushed the fix_partitioned_json branch from 4a13168 to 7be4bf8 Compare March 3, 2024 14:45
@korowa korowa changed the title from "fix: reading from partitioned json tables" to "fix: reading from partitioned json & arrow tables" Mar 3, 2024
korowa commented Mar 3, 2024

> I also think we will be able to fix the errors when trying to read partitioned Arrow IPC file tables very similarly to the fix for JSON in this PR.

@devinjdangelo, thank you for pointing it out -- it's the same issue (even the .arrow tests in insert_to_external.slt reference the same issue). I've added another commit with a fix for the .arrow format and sqllogictests coverage for it.

@devinjdangelo devinjdangelo left a comment

Thanks again @korowa! Glad to see both JSON and arrow partitioned table reads working correctly now 🚀

DataFusion CLI v36.0.0
❯ COPY (values (1, 'a', 'x'), (2, 'b', 'y'), (3, 'c', 'z')) TO 'test_files/scratch/copy/partitioned_table/' 
(format arrow, partition_by 'column2, column3');
+-------+
| count |
+-------+
| 3     |
+-------+
1 row in set. Query took 0.022 seconds.
❯ CREATE EXTERNAL TABLE validate_partitioned_arrow STORED AS arrow 
LOCATION 'test_files/scratch/copy/partitioned_table/' PARTITIONED BY (column2, column3);
0 rows in set. Query took 0.001 seconds.
❯ select * from validate_partitioned_arrow;
+---------+---------+---------+
| column1 | column2 | column3 |
+---------+---------+---------+
| 1       | a       | x       |
| 2       | b       | y       |
| 3       | c       | z       |
+---------+---------+---------+
3 rows in set. Query took 0.001 seconds.

b

# https://github.com/apache/arrow-datafusion/issues/7816
query error DataFusion error: Arrow error: Schema error: project index 1 out of bounds, max field 1

Very nice! Glad to see the fix was essentially the same for arrow files.
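
The "out of bounds" error removed from the Arrow test can be modelled in a few lines. This is a hedged sketch: `project` and `file_only` are hypothetical helpers over plain string slices, the single file column `b` is an assumption, and the error text simply mirrors the message quoted in the test. It is not the arrow-rs implementation.

```rust
/// Projects `fields` by index, failing on an index past the end.
fn project(fields: &[&str], indices: &[usize]) -> Result<Vec<String>, String> {
    indices
        .iter()
        .map(|&i| {
            fields.get(i).map(|f| f.to_string()).ok_or_else(|| {
                format!(
                    "project index {} out of bounds, max field {}",
                    i,
                    fields.len()
                )
            })
        })
        .collect()
}

/// Keeps only the indices that point at file columns.
fn file_only(indices: &[usize], file_column_count: usize) -> Vec<usize> {
    indices
        .iter()
        .copied()
        .filter(|&i| i < file_column_count)
        .collect()
}
```

With a file that stores one column and a scan projection of `[0, 1]` (index 1 being a partition column), projecting the file schema directly fails in this model, whereas filtering the indices first leaves `[0]` and the projection succeeds.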

@alamb alamb left a comment

Thank you @korowa and @devinjdangelo for the review. A very nice improvement. 🏆

@alamb alamb merged commit 608b615 into apache:main Mar 4, 2024
wiedld pushed a commit to wiedld/arrow-datafusion that referenced this pull request Mar 21, 2024
* fix partitioned table reading for json

* wildcard projection test for csv partitioned table

* review comments

* fix partitioned arrow tables reading

Development

Successfully merging this pull request may close these issues.

Error When Querying Partitioned JSON Table
NDJsonExec doesn't properly apply predicates on partitioned tables.