@patrickpai
Contributor

@patrickpai patrickpai commented Jul 14, 2020

Due to ongoing LZ4 problems with Parquet files, this patch disables writing files with the LZ4 codec by throwing a ParquetException.

In progress: adding exceptions in pyarrow when LZ4 is used to write files, and updating the relevant pytests.

Mailing list discussion: https://mail-archives.apache.org/mod_mbox/arrow-dev/202007.mbox/%3CCAJPUwMCM4ZaJB720%2BuoM1aSA2oD9jSEnzuwWjJiw6vwXxHk7nw%40mail.gmail.com%3E

Jira ticket: https://issues.apache.org/jira/browse/ARROW-9424

@wesm wesm marked this pull request as ready for review July 14, 2020 20:22
@wesm
Member

wesm commented Jul 14, 2020

Ah, I see that you're adding Python changes. I fixed the lint problems here, so be sure to rebase your changes.

@wesm
Member

wesm commented Jul 14, 2020

@patrickpai do you expect to complete this today? We are hoping to cut a release candidate tomorrow during the workday, Central European Time, so I can help finish this if needed.

@patrickpai
Contributor Author

@wesm I think it's best if I get help on this. I'm totally new to the Python codebase and wasn't expecting to finish today. I can let you take over.

@wesm
Member

wesm commented Jul 14, 2020

No problem, I can take it from here.

@wesm
Member

wesm commented Jul 14, 2020

@pitrou @xhochy It seems that despite adding the LZ4_FRAME format we've been continuing to use LZ4_RAW for Parquet files. Unfortunate that this hasn't seen more compatibility testing.

@pitrou
Member

pitrou commented Jul 14, 2020

Wasn't it deliberate? IIRC we didn't want to break compatibility with existing files.

@pitrou
Member

pitrou commented Jul 14, 2020

In any case, the format used by Hadoop is neither of the two; it's LZ4_RAW with a custom header...
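
To make the "custom header" point concrete, here is a minimal sketch of splitting one such frame. It assumes the prefix is two 4-byte big-endian integers (decompressed size, then compressed size) followed by a raw LZ4 block; that layout is an assumption for illustration and is not confirmed in this thread. The function name `split_hadoop_lz4_frame` is hypothetical, and the sketch does not decompress anything.

```python
import struct


def split_hadoop_lz4_frame(buf: bytes):
    """Split one assumed Hadoop-style LZ4 frame into (decompressed_size, raw_block).

    Assumed layout (not confirmed by the thread):
      bytes 0-3: decompressed size, big-endian uint32
      bytes 4-7: compressed size, big-endian uint32
      bytes 8- : raw LZ4 block of `compressed size` bytes
    """
    if len(buf) < 8:
        raise ValueError("buffer too short for the assumed 8-byte prefix")
    decompressed_size, compressed_size = struct.unpack(">II", buf[:8])
    payload = buf[8:8 + compressed_size]
    if len(payload) != compressed_size:
        raise ValueError("payload is shorter than the size stated in the prefix")
    return decompressed_size, payload
```

A reader that expects plain LZ4_RAW or LZ4_FRAME would treat those first eight bytes as compressed data, which is one way the formats can fail to interoperate.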

@wesm
Member

wesm commented Jul 14, 2020

I don't recall but that may have been the case. Either way it's a giant mess since many people use pyarrow to write Parquet files to be consumed by JVM-based systems. I think we can infer that LZ4 is not often used from the fact that we haven't had more bug reports about it.

Note that we can provide backward compatibility for existing LZ4-compressed files, if needed, by looking at the version number in the file footer.
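
A rough sketch of what such a footer check could look like, assuming the writer identifies itself through the footer's `created_by` string in the form `<application> version <major>.<minor>.<patch>` (pyarrow exposes this string as `pq.ParquetFile(path).metadata.created_by`). The cutoff version and the names `parse_created_by` and `needs_legacy_lz4` are hypothetical; the thread does not say which versions would need the legacy path.

```python
import re

# Matches strings such as "parquet-cpp-arrow version 1.0.0" (assumed format).
_CREATED_BY = re.compile(
    r"^(?P<app>\S+) version (?P<major>\d+)\.(?P<minor>\d+)\.(?P<patch>\d+)"
)


def parse_created_by(created_by):
    """Return (application, (major, minor, patch)), or None if unparseable."""
    match = _CREATED_BY.match(created_by)
    if match is None:
        return None
    version = (int(match["major"]), int(match["minor"]), int(match["patch"]))
    return match["app"], version


def needs_legacy_lz4(created_by, cutoff):
    """True if the file comes from a parquet-cpp writer older than `cutoff`.

    `cutoff` is a hypothetical (major, minor, patch) tuple chosen by the caller.
    """
    parsed = parse_created_by(created_by)
    if parsed is None:
        return False
    app, version = parsed
    return app.startswith("parquet-cpp") and version < cutoff
```

Files written by other implementations, or with an unparseable `created_by`, fall through to the non-legacy path in this sketch; a real reader might want to be more conservative.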

@wesm
Member

wesm commented Jul 14, 2020

OK, writing is disabled, but old files can still be read:

In [2]: pq.write_table(table, 'not_allowed.parquet.lz4', compression='lz4')
---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
<ipython-input-2-597ef4749b0a> in <module>
----> 1 pq.write_table(table, 'not_allowed.parquet.lz4', compression='lz4')

~/code/arrow/python/pyarrow/parquet.py in write_table(table, where, row_group_size, version, use_dictionary, compression, write_statistics, use_deprecated_int96_timestamps, coerce_timestamps, allow_truncated_timestamps, data_page_size, flavor, filesystem, compression_level, use_byte_stream_split, data_page_version, **kwargs)
   1632                 data_page_version=data_page_version,
   1633                 **kwargs) as writer:
-> 1634             writer.write_table(table, row_group_size=row_group_size)
   1635     except Exception:
   1636         if _is_path_like(where):

~/code/arrow/python/pyarrow/parquet.py in write_table(self, table, row_group_size)
    586             raise ValueError(msg)
    587 
--> 588         self.writer.write_table(table, row_group_size=row_group_size)
    589 
    590     def close(self):

~/code/arrow/python/pyarrow/_parquet.pyx in pyarrow._parquet.ParquetWriter.write_table()
   1406 
   1407         with nogil:
-> 1408             check_status(self.writer.get()
   1409                          .WriteTable(deref(ctable), c_row_group_size))
   1410 

~/code/arrow/python/pyarrow/error.pxi in pyarrow.lib.check_status()
     97                 raise IOError(errno, message)
     98             else:
---> 99                 raise IOError(message)
    100         elif status.IsOutOfMemory():
    101             raise ArrowMemoryError(message)

OSError: Per ARROW-9424, writing files with LZ4 compression has been disabled until implementation issues have been resolved. It is recommended to read any existing files and rewrite them using a different compression.
In ../src/parquet/arrow/writer.cc, line 684, code: WriteColumnChunk(table.column(i), offset, size)

In [3]: pq.read_table('example.parquet.lz4').to_pandas()                                                                                                                                       
Out[3]: 
   f0
0   1
1   2
2   3
3   4
4   5
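
The error message recommends reading existing files and rewriting them with a different compression. A minimal sketch of that workflow follows, using the same `pq.read_table` and `pq.write_table` calls shown in the session above. The helper names and the choice of `snappy` as the fallback codec are assumptions for illustration, not part of this patch.

```python
# Codecs whose write path is disabled per ARROW-9424 (compared case-insensitively).
DISABLED_WRITE_CODECS = frozenset({"lz4"})


def choose_write_codec(requested, fallback="snappy"):
    """Return `requested`, unless writing with it is disabled; then return `fallback`."""
    if requested.lower() in DISABLED_WRITE_CODECS:
        return fallback
    return requested


def rewrite_parquet(src, dst, compression="snappy"):
    """Read `src` and write it to `dst` with a codec that can still be written."""
    import pyarrow.parquet as pq  # imported lazily so the helper above has no dependency

    table = pq.read_table(src)
    pq.write_table(table, dst, compression=choose_write_codec(compression))
```

For example, `rewrite_parquet('example.parquet.lz4', 'example.parquet')` would read the old LZ4 file and write a Snappy-compressed copy, assuming the file fits in memory as a single table.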

Member

@wesm wesm left a comment

+1, will merge on green
