ARROW-6700: [Rust] [DataFusion] Use new Arrow Parquet reader #5641
andygrove wants to merge 11 commits into apache:master
Conversation
liurenjie1024 left a comment
Some parts should be replaced by the new ArrowReader.
array_reader is not designed for public use. PR #5523 contains the public API and a doc example. Essentially, you should use ArrowReader.
Please use the new ArrowReader here.
The schema should also take the projection into account.
@liurenjie1024 I updated to use ArrowReader. When I run …
@andygrove I pulled your request and ran the tests. The root cause is that the Arrow reader currently doesn't support some data types (e.g., UTF8), which causes the program to crash.
```rust
fn next(&mut self) -> Result<Option<RecordBatch>> {
    match self.request_tx.send(()) {
        Ok(_) => match self.response_rx.recv() {
```
Why do we need another thread here? This send-request, wait-for-response model still blocks here waiting on IO.
We need the threading because the Parquet structs/traits do not implement Sync + Send and cannot be sent between threads.
I would much prefer it if we could make Parquet safe to use in multi-threaded environments.

We should make the code fail gracefully by returning an …
@andygrove After debugging, I found that this is not caused by the unwrap call, but by an error in UTF8 support. I'll add support for UTF8 and this will be fixed.
Hi @liurenjie1024, is there any update on adding support for UTF8?
@andygrove Sorry, I almost forgot about this. I'll address it this week.
Hi @liurenjie1024, the next issue is that I have a regression due to …
@liurenjie1024 I attempted to add support to the array reader for TimestampNanoseconds. It runs but currently returns the wrong results. Could you take a look?
@andygrove isn't it the same issue, that timestamps are stored as 96-bit values? How are you converting from the 96-bit values?
@nevi-me Yes, I guess I need to write a custom converter here rather than try to use the …
Force-pushed b2973b1 to 5eb4e73
@andygrove I'll take a look this week.
I've created a complex converter for int96, and I've fixed the binary reads. The problem with the binary reads was that when I introduced …

All tests are passing locally for me.
Oops, I forgot examples. @andygrove, the …

Then in DataFusion we could do an implicit cast to string in the SQL code. What do you think about this?
Wow @nevi-me that's awesome! If the columns are stored as binary then DataFusion should also treat them as binary unless the user adds an explicit …

I know there are some things I need to clean up in this PR, so I'll start working on those tomorrow.
```rust
if array.is_null(i) {
    b.append(false)?;
} else {
    b.append_value(str::from_utf8(from.value(i)).unwrap())?;
```
from_utf8 returns an error if the data is not valid UTF-8 bytes, so the unwrap here can panic. Perhaps it's better to convert failures to null values instead? That would behave similarly to other casts, which return nulls on overflowing data.
Why can't we return an error? I don't think it's correct to return null rather than an error.
We've had the broader discussion around casts: whether to return an error on overflows/invalid data, or whether to expose an option letting the user define the desired behaviour. I haven't opened a JIRA for this; perhaps I should, so we can address the overall cast behaviour there?

For now, an expedient would actually be not to downcast to BinaryArray, but to create a StringArray directly from the binary data, though maybe that's a premature optimisation.
Another discussion about configurable error-handling behaviour would be fine. But I still can't understand why we can't just return an error here. Is it inconsistency with existing behavior?
Yes, inconsistency. For example, if you cast an i64 to u8, negative values and values that don't fit into u8 are cast to nulls; it doesn't return an error.
Here's the CPP equivalent: https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/cast.h#L38
For this PR, I agree that we can return null to match existing behavior. But I don't think it's a good idea to make returning null the default behavior; returning an error should be. Please open a JIRA ticket to track this.
```rust
SQLType::Float(_) | SQLType::Real => Ok(DataType::Float64),
SQLType::Double => Ok(DataType::Float64),
SQLType::Char(_) | SQLType::Varchar(_) => Ok(DataType::Utf8),
SQLType::Custom(t) if t.to_lowercase() == "string" => Ok(DataType::Utf8),
```
We probably don't need this just yet, as cast(binary as varchar) works without any changes.
paddyhoran left a comment
I'm a little out of the loop (day job, etc.), but LGTM at a high level. Thanks @andygrove @nevi-me and @liurenjie1024, great progress.
```rust
    );
}

//TODO assertions
```
Are you going to add assertions in this PR?
Yes. I'm away on a business trip today and tomorrow, but I intend to add assertions this weekend.
Replaces the DataFusion Parquet reader with the new Arrow reader in the parquet crate.