ARROW-6700: [Rust] [DataFusion] Use new Arrow Parquet reader #5641
andygrove wants to merge 11 commits into apache:master
Conversation
liurenjie1024 left a comment
Some parts should be replaced by the new ArrowReader.
array_reader is not designed for public use. PR #5523 contains the public API and a doc example. Essentially, you should use ArrowReader.
Please use the new ArrowReader here.
The schema should also take the projection into account.
@liurenjie1024 I updated to use ArrowReader. When I run …
@andygrove I pulled your request and ran the tests. The root cause is that the Arrow reader currently doesn't support some data types (e.g., UTF8), which causes the program to crash.
```rust
fn next(&mut self) -> Result<Option<RecordBatch>> {
    match self.request_tx.send(()) {
        Ok(_) => match self.response_rx.recv() {
```
Why do we need another thread here? This send-request, wait-for-response model still blocks here waiting on IO.
We need the threading because the Parquet structs/traits do not implement Sync + Send and cannot be sent between threads.
I would much prefer it if we could make Parquet safe to use in multi-threaded environments.

We should make the code fail gracefully by returning an …
@andygrove After debugging, I found that this is not caused by the unwrap call, but by an error in UTF8 support. I'll add support for UTF8 and this will be fixed.
Hi @liurenjie1024, is there any update on adding support for UTF8?
@andygrove Sorry, I almost forgot about this. I'll address it this week.
Hi @liurenjie1024, the next issue is that I have a regression due to …
@liurenjie1024 I attempted to add support to the array reader for TimestampNanoseconds. It runs but currently returns the wrong results. Could you take a look?
@andygrove isn't it the same issue, that timestamps are stored as 96-bit values? How are you converting from the 96-bit values?
@nevi-me Yes, I guess I need to write a custom converter here rather than try to use the …
Force-pushed b2973b1 to 5eb4e73
@andygrove I'll take a look this week.
I've created a complex converter for int96, and I've fixed the binary reads. The problem with the binary reads was that when I introduced …

All tests are passing locally for me.
Oops, I forgot examples. @andygrove, the …

Then in DataFusion we could do an implicit cast to string in the SQL code. What do you think about this?
Wow @nevi-me that's awesome! If the columns are stored as binary then DataFusion should also treat them as binary unless the user adds an explicit …

I know there are some things I need to clean up in this PR, so I'll start working on those tomorrow.
```rust
if array.is_null(i) {
    b.append(false)?;
} else {
    b.append_value(str::from_utf8(from.value(i)).unwrap())?;
```
from_utf8 returns an error if the data is not valid UTF-8 bytes, so the unwrap here can panic. Perhaps it's better to convert failures to null values instead? That would behave similarly to other casts, which return nulls on overflowing data.
Why can't we return an error? I don't think it's correct to return null rather than an error.
We've had the broader discussion around casts: whether to return an error on overflows/invalid data, or whether to expose an option letting the user define the desired behaviour. I haven't opened a JIRA for this; perhaps I should, so we can address the overall cast behaviour there?

For now, an expedient would actually be not to downcast to BinaryArray, but to create a StringArray directly from the binary data, though maybe that's a premature optimisation.
Another discussion about configurable error-handling behaviour would be fine. But I still can't understand why we can't just return an error here. Is it inconsistency with existing behavior?
Yes, inconsistency. For example, if you cast an i64 to u8, negative values and values that don't fit into u8 are cast to nulls; it doesn't return an error.
Here's the CPP equivalent: https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/cast.h#L38
For this PR, I agree that we can return null to match existing behavior. But I don't think it's a good idea to make returning null the default behavior; returning an error should be. Please open a JIRA ticket to track this.
```rust
SQLType::Float(_) | SQLType::Real => Ok(DataType::Float64),
SQLType::Double => Ok(DataType::Float64),
SQLType::Char(_) | SQLType::Varchar(_) => Ok(DataType::Utf8),
SQLType::Custom(t) if t.to_lowercase() == "string" => Ok(DataType::Utf8),
```
We probably don't need this just yet, as cast(binary as varchar) works without any changes.
paddyhoran left a comment
I'm a little out of the loop (day job, etc.), but LGTM at a high level. Thanks @andygrove @nevi-me and @liurenjie1024, great progress.
```rust
    );
}

//TODO assertions
```
Are you going to add assertions in this PR?
Yes. I'm away on a business trip today and tomorrow, but I intend to add assertions this weekend.
Replaces the DataFusion Parquet reader with the new Arrow reader in the parquet crate.