Skip to content

ARROW-6687: Add .parquet file with single np.nan value#9

Merged
wesm merged 1 commit intoapache:masterfrom
alippai:patch-1
Sep 26, 2019
Merged

ARROW-6687: Add .parquet file with single np.nan value#9
wesm merged 1 commit intoapache:masterfrom
alippai:patch-1

Conversation

@alippai
Copy link
Copy Markdown
Contributor

@alippai alippai commented Sep 25, 2019

Testing the new DataFusion .parquet support discovered an error. Adding a simple regression test resource to test it

Copy link
Copy Markdown
Member

@andygrove andygrove left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you just add the nan file in this PR so we can add that regression test?

For the partition issue, we can have unit tests create the partition directories and just copy the all_types parquet files

@alippai alippai changed the title ARROW-6687: Add simple partitioned arrow file based on alltypes_plain… ARROW-6687: Add .parquet file with single np.nan value Sep 26, 2019
@alippai
Copy link
Copy Markdown
Contributor Author

alippai commented Sep 26, 2019

@andygrove removed the partition directory, the commit adds the single NaN value only

Copy link
Copy Markdown
Member

@andygrove andygrove left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@andygrove
Copy link
Copy Markdown
Member

@wesm could you merge this please

@wesm
Copy link
Copy Markdown
Member

wesm commented Sep 26, 2019

I'm worried about going down this route as a testing approach. Do you plan to keep adding more files as you develop the Parquet Rust project?

@andygrove
Copy link
Copy Markdown
Member

Yes, we definitely need more parquet files to test against. The current testing is very limited.

Is you concern about checking in static files versus generating them using them scripts?

@wesm
Copy link
Copy Markdown
Member

wesm commented Sep 26, 2019

Yes, I don't think that having a static corpus is a scalable testing strategy.

@andygrove
Copy link
Copy Markdown
Member

I hear you. On the other hand, if the Rust developers now have to have a C++ and/or Python env set up as well to be able to run tests, that's also not ideal either. I suppose this could be Dockerized though?

@wesm
Copy link
Copy Markdown
Member

wesm commented Sep 26, 2019

The ideal scenario is to generate files endogenously using the Rust library and not to rely on a different project. That's what we do in C++ (and what the Java library does also). I think checking in "problem" files that exhibit issues that you cannot easily generate from a particular library is okay.

@wesm
Copy link
Copy Markdown
Member

wesm commented Sep 26, 2019

Once Rust has a fully capable IPC implementation I'd be supportive of developing some Dockerized automated fuzz/integration testing between the C++/Python/R and Rust libraries. We can have the libraries cross-validate Parquet versus the Arrow protocol "point of truth"

@andygrove
Copy link
Copy Markdown
Member

The ideal scenario is to generate files endogenously using the Rust library

Unfortunately the Rust implementation doesn't yet have support for writing Parquet files.

@wesm
Copy link
Copy Markdown
Member

wesm commented Sep 26, 2019

Okay. I think it's very important for the Rust developers to prioritize this otherwise it will be very difficult for the project to mature into something that people can depend on in production.

@wesm wesm merged commit 46c9e97 into apache:master Sep 26, 2019
@wesm
Copy link
Copy Markdown
Member

wesm commented Sep 26, 2019

Merging this.

Short of writing Parquet files from Rust if this becomes a pattern I would recommend writing a data generation script in Python and providing a Dockerfile to run it as part of the testing process

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants