ARROW-9424: [C++][Parquet] Disable writing files with LZ4 codec #7757
|
Ah I see that you're adding Python changes. I fixed the lint problems here so be sure to rebase your changes |
|
@patrickpai do you expect to complete this today? We are hoping to cut a release candidate tomorrow during the workday, Central European Time, so I can help finish this if needed |
|
@wesm I think it's best if I get help on this. I'm totally new to the python codebase and wasn't expecting to finish today. I can let you take over. |
|
No problem, I can take it from here. |
|
Wasn't it deliberate? IIRC we didn't want to break compatibility with existing files. |
|
In any case, the format used by Hadoop is neither of the two, it's LZ4_RAW with a custom header... |
|
I don't recall but that may have been the case. Either way it's a giant mess since many people use pyarrow to write Parquet files to be consumed by JVM-based systems. I think we can infer that LZ4 is not often used from the fact that we haven't had more bug reports about it. Note that we can provide backward compatibility if needed for existing LZ4-compressed files by looking at the version number in the file footer |
|
OK, writing is disabled but old files can still be read |
wesm
left a comment
+1, will merge on green
Due to ongoing LZ4 problems with Parquet files, this patch disables writing files with the LZ4 codec by throwing a ParquetException.
In progress: adding exceptions for pyarrow when using LZ4 to write files and updating relevant pytests
Mailing list discussion: https://mail-archives.apache.org/mod_mbox/arrow-dev/202007.mbox/%3CCAJPUwMCM4ZaJB720%2BuoM1aSA2oD9jSEnzuwWjJiw6vwXxHk7nw%40mail.gmail.com%3E
Jira ticket: https://issues.apache.org/jira/browse/ARROW-9424
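For readers skimming the thread, here is a minimal sketch of the behavior agreed on above: LZ4 is rejected on the write path while the read path still accepts it. It is not the code from this patch. `ParquetWriteError`, `check_codec`, and both codec sets are hypothetical names used only for illustration; the actual change throws a ParquetException in the C++ writer.

```python
# Illustrative only: these names are hypothetical and are not part of
# Arrow or pyarrow.

class ParquetWriteError(Exception):
    """Hypothetical stand-in for the C++ ParquetException."""


# Assumed codec list for the sketch; the real set lives in the C++ library.
KNOWN_CODECS = {"NONE", "SNAPPY", "GZIP", "BROTLI", "ZSTD", "LZ4"}

# Codecs that may still be read but must no longer be written.
WRITE_DISABLED_CODECS = {"LZ4"}


def check_codec(codec, for_write):
    """Return the normalized codec name, or raise if it is not allowed."""
    name = "NONE" if codec is None else codec.upper()
    if name not in KNOWN_CODECS:
        raise ValueError(f"Unknown codec: {codec!r}")
    if for_write and name in WRITE_DISABLED_CODECS:
        # Writing is refused, so no new files with the ambiguous
        # LZ4 framing are produced.
        raise ParquetWriteError(
            f"Writing Parquet files with the {name} codec is disabled"
        )
    # Reading is unaffected, so existing LZ4 files stay readable.
    return name
```

Keeping the check on the write path only is what preserves compatibility with files that were already written with LZ4.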