
ARROW-6700: [Rust] [DataFusion] Use new Arrow Parquet reader #5641

Closed

andygrove wants to merge 11 commits into apache:master from andygrove:ARROW-6700

Conversation

@andygrove
Member

andygrove commented Oct 13, 2019

Replaces the DataFusion Parquet reader with the new Arrow reader in the parquet crate.

Contributor

liurenjie1024 left a comment


Some parts should be replaced by the new ArrowReader.

Comment thread rust/parquet/src/arrow/mod.rs Outdated
Contributor


array_reader is not designed for public use. PR #5523 contains the public API and a doc example. Essentially, you should use ArrowReader.

Contributor


Please use the new ArrowReader here.

Contributor


The schema should also take the projection into account.

@andygrove
Member Author

@liurenjie1024 I updated to use ArrowReader. When I run cargo test I actually get a SIGSEGV:

error: process didn't exit successfully: `/home/andy/git/andygrove/arrow/rust/target/debug/deps/datafusion-39547aa10aa86781` (signal: 11, SIGSEGV: invalid memory reference)

@liurenjie1024
Contributor

@andygrove I pulled your request and ran the tests. The root cause is that the arrow reader doesn't currently support some data types (e.g., UTF8), which caused the program to crash.


fn next(&mut self) -> Result<Option<RecordBatch>> {
    match self.request_tx.send(()) {
        Ok(_) => match self.response_rx.recv() {
Contributor


Why do we need another thread here? This send-request, wait-for-response model still blocks here waiting for IO.

Member Author


We need the threading because the Parquet structs/traits do not implement Send + Sync and so cannot be sent between threads.

Member Author


I would much prefer it if we could make Parquet safe to use in multi-threaded environments.

@andygrove
Member Author

@andygrove I pulled your request and ran the tests. The root cause is that the arrow reader doesn't currently support some data types (e.g., UTF8), which caused the program to crash.

We should make the code fail gracefully by returning an Err for unsupported types.

@liurenjie1024
Contributor

@andygrove After debugging, I found that this is not caused by an unwrap call, but by an error in UTF8 support. I'll add support for UTF8, and this will be fixed.

andygrove changed the title ARROW-6700: [Rust] [DataFusion] Use new Arrow reader [WIP] → ARROW-6700: [Rust] [DataFusion] Use new Arrow Parquet reader [WIP] Nov 18, 2019
@andygrove
Member Author

Hi @liurenjie1024, is there any update on adding support for UTF8?

@liurenjie1024
Contributor

@andygrove Sorry, I almost forgot about this. I'll address it this week.

@andygrove
Member Author

Hi @liurenjie1024, the next issue is a regression because reading the Timestamp(Nanosecond) type from Parquet is not supported yet. Do you think this is an easy one to add? It would require a new Int96Type and Int96Converter.

@andygrove
Member Author

@liurenjie1024 I attempted to add support to the array reader for TimestampNanoseconds. It runs, but currently returns the wrong results. Could you take a look?

@nevi-me
Contributor

nevi-me commented Dec 9, 2019

@andygrove isn't it the same issue, that timestamps are stored as 96-bit values? How are you converting from the 96-bit values?

@andygrove
Member Author

@nevi-me Yes, I guess I need to write a custom converter here rather than trying to use the CastConverter. I'll have a go at that.

@liurenjie1024
Contributor

@andygrove I'll take a look this week.

@nevi-me
Contributor

nevi-me commented Dec 10, 2019

I've created a complex converter for int96, and I've fixed the binary reads. The problem with the binary reads was that when I introduced StringType, I missed a part in the parquet code where we were supposed to convert a binary physical type with no logical type to BinaryType.

All tests are passing locally for me.

@nevi-me
Contributor

nevi-me commented Dec 10, 2019

Oops, I forgot examples. @andygrove, the alltypes_plain.parquet file has saved the string columns as binary. Instead of a quick fix of converting the binary data to utf8, this might be an opportunity for us to create a cast kernel, or some string kernel that converts BinaryArray to StringArray.

Then in DataFusion we could do an implicit cast to string in the SQL code. What do you think about this?

@andygrove
Member Author

Wow @nevi-me that's awesome! If the columns are stored as binary then DataFusion should also treat them as binary unless the user adds an explicit CAST in the SQL. I pushed a quick change for that and all tests are passing for me now.

I know there are some things I need to clean up in this PR so I'll start working on those tomorrow.

andygrove changed the title ARROW-6700: [Rust] [DataFusion] Use new Arrow Parquet reader [WIP] → ARROW-6700: [Rust] [DataFusion] Use new Arrow Parquet reader Dec 10, 2019
andygrove removed the WIP (PR is work in progress) label Dec 10, 2019
Comment thread rust/arrow/src/compute/kernels/cast.rs Outdated
if array.is_null(i) {
    b.append(false)?;
} else {
    b.append_value(str::from_utf8(from.value(i)).unwrap())?;
Contributor


from_utf8 can panic if the data is not valid UTF-8 bytes. Perhaps it's better to convert failures to null values instead? This would behave similarly to other casts, which return nulls on overflowing data.

Contributor


Why can't we return an error? I don't think it's correct to return null rather than an error.

Contributor

We've had the broader discussion around casts, on whether to return an error on overflows/invalid data, or whether to expose an option letting the user define their desired behaviour. I haven't opened a JIRA for this; perhaps I should, so we can address the overall cast behaviour there?

For now, an expedient alternative would be not to downcast to BinaryArray, but to create a StringArray directly from the binary data, though maybe that's a premature optimisation.

Contributor


A separate discussion about configurable error-handling behavior would be fine. But I still can't understand why we can't just return an error here. Is it inconsistency with existing behavior?

Contributor

nevi-me commented Dec 10, 2019


Yes, inconsistency. For example, if you cast an i64 to u8, negative values and values that don't fit into u8 will be cast to nulls. It doesn't return an error.

Contributor

For this PR, I agree that we can return null to match existing behavior. But I don't think returning null should be the default behavior; returning an error should be. Please open a JIRA ticket to track this.


Comment thread rust/datafusion/src/sql/planner.rs Outdated
SQLType::Float(_) | SQLType::Real => Ok(DataType::Float64),
SQLType::Double => Ok(DataType::Float64),
SQLType::Char(_) | SQLType::Varchar(_) => Ok(DataType::Utf8),
SQLType::Custom(t) if t.to_lowercase() == "string" => Ok(DataType::Utf8),
Contributor


We probably don't need this just yet, as cast(binary as varchar) works without any changes.

andygrove requested a review from paddyhoran December 10, 2019 18:56
Contributor

paddyhoran left a comment


I'm a little out of the loop (day job, etc.), but LGTM at a high level. Thanks @andygrove @nevi-me and @liurenjie1024, great progress.

);
}

//TODO assertions
Contributor


Are you going to add assertions in this PR?

Member Author


Yes. I'm away on a business trip today and tomorrow but intend to add assertions this weekend.

