Skip to content

ARROW-4219: [Rust] [Parquet] Initial support for arrow reader.#5523

Closed
liurenjie1024 wants to merge 3 commits intoapache:masterfrom
liurenjie1024:arrow-4219
Closed

ARROW-4219: [Rust] [Parquet] Initial support for arrow reader.#5523
liurenjie1024 wants to merge 3 commits intoapache:masterfrom
liurenjie1024:arrow-4219

Conversation

@liurenjie1024
Copy link
Copy Markdown
Contributor

Initial support of arrow reader, which reads parquet into arrow record batch.

@liurenjie1024
Copy link
Copy Markdown
Contributor Author

@sunchao @andygrove @nevi-me @paddyhoran Please take a look when you are available.

Copy link
Copy Markdown
Member

@andygrove andygrove left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM so far but I have a couple of questions in this review

Comment thread rust/arrow/src/record_batch.rs Outdated
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have a very similar trait in DataFusion, only it inherits Sync + Send too so that it can be passed between threads. Maybe we could do that here too? See https://github.com/apache/arrow/blob/master/rust/datafusion/src/execution/physical_plan/mod.rs#L44-L50

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it's safe to make this trait as Send+Sync because the reader used some unsafe data(e.g. MutableBuffer).

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, makes sense.

Comment thread rust/arrow/src/record_batch.rs Outdated
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is returned if there are no more batches? Shouldn't this return Result<Option<RecordBatch>> ?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently if we reached EOF, it returns zero length array. But I think it's better to change return type to Result<Option<RecordBatch>> . Will fix it.

@liurenjie1024
Copy link
Copy Markdown
Contributor Author

@andygrove I need to upload some generated files to arrow-testing repo for this integration test. How am I supposed to do that? Should I submit another PR in arrow-testing project?

@github-actions
Copy link
Copy Markdown

@andygrove
Copy link
Copy Markdown
Member

@liurenjie1024 If you are adding Parquet files then you can PR against https://github.com/apache/parquet-testing but please note the comments in the last PR merged there which suggests we look into alternate approaches: apache/parquet-testing#9

@liurenjie1024
Copy link
Copy Markdown
Contributor Author

@andygrove I need to upload not only parquet file, but also json files and proto files. Here is my approach to do integration test: I use protobuf to define schema, then generate both json and parquet file with that schema using java code. Then I use rust code to load parquet and json file to compare them. I guess maybe it's better to upload them into arrow-testing library?

@andygrove
Copy link
Copy Markdown
Member

@liurenjie1024 I think it would be good to start a discussion about this on the mailing list. I'm not sure what advice to give on this.

@alippai
Copy link
Copy Markdown
Contributor

alippai commented Oct 7, 2019

The new ParquetFileArrowReader is faster than the existing row based iteration, looks stable for my dataset (but, the Utf8 datatype is missing)

@liurenjie1024
Copy link
Copy Markdown
Contributor Author

@andygrove Sorry for late reply. I've submitted a PR in arrow-testing repo: apache/arrow-testing#11
Let's discuss there.

@liurenjie1024
Copy link
Copy Markdown
Contributor Author

@andygrove This is ready for review now.

Copy link
Copy Markdown
Contributor

@nevi-me nevi-me left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@nevi-me
Copy link
Copy Markdown
Contributor

nevi-me commented Oct 14, 2019

The new ParquetFileArrowReader is faster than the existing row based iteration, looks stable for my dataset (but, the Utf8 datatype is missing)

@liurenjie1024 is this an intentional limitation for now? (

(arrow_type, _) => Err(general_err!(
"Reading {:?} type from parquet is not supported yet.",
arrow_type
)),
)

Also, my apologies for the late review.

@liurenjie1024
Copy link
Copy Markdown
Contributor Author

liurenjie1024 commented Oct 14, 2019

@nevi-me No, it just takes time to implement. Will implement in future.

Copy link
Copy Markdown
Member

@andygrove andygrove left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks @liurenjie1024

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants