ARROW-6687: Add .parquet file with single np.nan value #9
andygrove
left a comment
Could you just add the nan file in this PR so we can add that regression test?
For the partition issue, we can have unit tests create the partition directories and just copy the all_types parquet files
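A minimal sketch of that idea, written in Python for brevity; the Rust tests would do the equivalent. The function name `make_partitioned_fixture` and the `year=...` directory names are hypothetical and only illustrate the approach of building the partition layout at test time instead of checking it in:

```python
import shutil
from pathlib import Path
from typing import Iterable, List


def make_partitioned_fixture(
    source_file: Path, root: Path, partitions: Iterable[str]
) -> List[Path]:
    """Copy one existing Parquet file into each partition directory under root.

    The partition names (for example "year=2019") are assumptions for
    illustration; the thread does not specify the layout.
    """
    copies = []
    for name in partitions:
        part_dir = root / name
        # Create the partition directory at test time instead of committing it.
        part_dir.mkdir(parents=True, exist_ok=True)
        # Reuse the same checked-in file for every partition.
        copies.append(Path(shutil.copy(source_file, part_dir / source_file.name)))
    return copies
```

A test would call this with a temporary directory as `root` and point the reader under test at that directory.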
|
@andygrove I removed the partition directory; the commit now adds only the file with the single NaN value.
|
@wesm could you merge this, please?
|
I'm worried about going down this route as a testing approach. Do you plan to keep adding more files as you develop the Parquet Rust project?
|
Yes, we definitely need more Parquet files to test against. The current testing is very limited. Is your concern about checking in static files versus generating them using scripts?
|
Yes, I don't think that having a static corpus is a scalable testing strategy.
|
I hear you. On the other hand, if the Rust developers now need a C++ and/or Python environment set up just to run tests, that's not ideal either. I suppose this could be Dockerized, though?
|
The ideal scenario is to generate files endogenously using the Rust library and not to rely on a different project. That's what we do in C++ (and what the Java library does also). I think checking in "problem" files that exhibit issues that you cannot easily generate from a particular library is okay. |
|
Once Rust has a fully capable IPC implementation I'd be supportive of developing some Dockerized automated fuzz/integration testing between the C++/Python/R and Rust libraries. We can have the libraries cross-validate Parquet versus the Arrow protocol "point of truth" |
Unfortunately the Rust implementation doesn't yet have support for writing Parquet files. |
|
Okay. I think it's very important for the Rust developers to prioritize this; otherwise it will be very difficult for the project to mature into something that people can depend on in production.
|
Merging this. Short of writing Parquet files from Rust, if this becomes a pattern, I would recommend writing a data generation script in Python and providing a Dockerfile to run it as part of the testing process.
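As a rough illustration of what such a generation script might look like for this PR's file, here is a sketch that assumes pyarrow is installed. The column name `value` and the helper names are hypothetical; the thread does not say how the checked-in file was produced:

```python
import math

COLUMN_NAME = "value"  # hypothetical column name, not taken from the PR


def nan_column() -> dict:
    """Return the data for a one-row table whose only value is NaN."""
    return {COLUMN_NAME: [float("nan")]}


def write_nan_file(path: str) -> None:
    """Write the single-NaN table to a Parquet file at path."""
    # Imported lazily so the helpers above work without pyarrow installed.
    import pyarrow as pa
    import pyarrow.parquet as pq

    data = nan_column()
    table = pa.table({COLUMN_NAME: pa.array(data[COLUMN_NAME], type=pa.float64())})
    pq.write_table(table, path)


def is_single_nan(values) -> bool:
    """Check for exactly one NaN; NaN != NaN, so plain equality would fail."""
    return len(values) == 1 and math.isnan(values[0])
```

After `write_nan_file("single_nan.parquet")`, the values from `pq.read_table("single_nan.parquet").column("value").to_pylist()` can be passed to `is_single_nan` as a sanity check before the file is used by the Rust tests.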
Testing the new DataFusion .parquet support uncovered an error. This adds a simple regression test resource to reproduce it.