ARROW-9424: [C++][Parquet] Disable writing files with LZ4 codec #7757

patrickpai · 2020-07-14T16:04:17Z

Due to ongoing LZ4 problems with Parquet files, this patch disables writing files with the LZ4 codec by throwing a ParquetException.

In progress: adding exceptions for pyarrow when using LZ4 to write files and updating relevant pytests

Mailing list discussion: https://mail-archives.apache.org/mod_mbox/arrow-dev/202007.mbox/%3CCAJPUwMCM4ZaJB720%2BuoM1aSA2oD9jSEnzuwWjJiw6vwXxHk7nw%40mail.gmail.com%3E

Jira ticket: https://issues.apache.org/jira/browse/ARROW-9424

github-actions · 2020-07-14T16:04:24Z

https://issues.apache.org/jira/browse/ARROW-9424

wesm · 2020-07-14T20:22:35Z

Ah, I see that you're adding Python changes. I fixed the lint problems here, so be sure to rebase your changes.

wesm · 2020-07-14T21:45:57Z

@patrickpai do you expect to complete this today? We are hoping to cut a release candidate tomorrow during the workday, Central European Time, so I can help finish this if needed.

patrickpai · 2020-07-14T21:48:51Z

@wesm I think it's best if I get help on this. I'm totally new to the python codebase and wasn't expecting to finish today. I can let you take over.

wesm · 2020-07-14T21:50:37Z

No problem, I can take it from here.

wesm · 2020-07-14T22:07:10Z

@pitrou @xhochy It seems that despite adding the LZ4_FRAME format we've been continuing to use LZ4_RAW for Parquet files. Unfortunate that this hasn't seen more compatibility testing.

pitrou · 2020-07-14T22:08:04Z

Wasn't it deliberate? IIRC we didn't want to break compatibility with existing files.

pitrou · 2020-07-14T22:09:03Z

In any case, the format used by Hadoop is neither of the two; it's LZ4_RAW with a custom header...
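For readers unfamiliar with the "custom header": as far as I understand it, Hadoop's LZ4 codec prefixes each raw LZ4 block with two 4-byte big-endian integers, the uncompressed size followed by the compressed size. The sketch below only splits such a frame into its parts. The layout is an assumption that is not stated in this thread, and the helper name `split_hadoop_lz4_frame` is hypothetical.

```python
import struct

# Assumed layout (not confirmed in this thread): two 4-byte big-endian
# unsigned integers, then the raw LZ4 block.
HEADER = struct.Struct(">II")


def split_hadoop_lz4_frame(buf):
    """Return (uncompressed_size, compressed_block) for one Hadoop-style frame."""
    if len(buf) < HEADER.size:
        raise ValueError("buffer too short for the 8-byte header")
    uncompressed_size, compressed_size = HEADER.unpack_from(buf, 0)
    end = HEADER.size + compressed_size
    if end > len(buf):
        raise ValueError("declared compressed size exceeds buffer length")
    return uncompressed_size, buf[HEADER.size:end]


# Build a fake frame with a 3-byte stand-in payload (not real LZ4 data).
frame = HEADER.pack(10, 3) + b"abc"
size, block = split_hadoop_lz4_frame(frame)
```

A reader that expects plain LZ4_RAW or LZ4_FRAME would try to decode those 8 header bytes as compressed data, which is one way the incompatibility could surface.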

wesm · 2020-07-14T22:09:37Z

I don't recall but that may have been the case. Either way it's a giant mess since many people use pyarrow to write Parquet files to be consumed by JVM-based systems. I think we can infer that LZ4 is not often used from the fact that we haven't had more bug reports about it.

Note that we can provide backward compatibility if needed for existing LZ4-compressed files by looking at the version number in the file footer
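A rough sketch of the footer check described above, assuming the writer version is taken from the footer's `created_by` string (for example `parquet-cpp version 1.5.1-SNAPSHOT`). The function names and the `first_fixed` cutoff are hypothetical placeholders; this thread does not say which release, if any, changes the LZ4 format.

```python
import re

# Matches strings such as "parquet-cpp version 1.5.1-SNAPSHOT" or
# "parquet-mr version 1.10.0 (build ...)".
_CREATED_BY = re.compile(
    r"^(?P<app>.+?) version (?P<major>\d+)\.(?P<minor>\d+)\.(?P<patch>\d+)"
)


def parse_created_by(created_by):
    """Return (application, (major, minor, patch)) or None if unparseable."""
    m = _CREATED_BY.match(created_by)
    if m is None:
        return None
    return m["app"], (int(m["major"]), int(m["minor"]), int(m["patch"]))


def written_by_legacy_lz4_writer(created_by, first_fixed):
    """True if the file came from parquet-cpp older than `first_fixed`.

    `first_fixed` is a hypothetical cutoff supplied by the caller.
    """
    parsed = parse_created_by(created_by)
    if parsed is None:
        return False
    app, version = parsed
    return app.startswith("parquet-cpp") and version < first_fixed
```

In pyarrow the string is available as `pq.ParquetFile(path).metadata.created_by`, so a reader could use a check like this to choose between the legacy and the corrected LZ4 decoding path.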

wesm · 2020-07-14T23:53:57Z

OK, writing is disabled, but old files can still be read:

In [2]: pq.write_table(table, 'not_allowed.parquet.lz4', compression='lz4')
---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
<ipython-input-2-597ef4749b0a> in <module>
----> 1 pq.write_table(table, 'not_allowed.parquet.lz4', compression='lz4')

~/code/arrow/python/pyarrow/parquet.py in write_table(table, where, row_group_size, version, use_dictionary, compression, write_statistics, use_deprecated_int96_timestamps, coerce_timestamps, allow_truncated_timestamps, data_page_size, flavor, filesystem, compression_level, use_byte_stream_split, data_page_version, **kwargs)
   1632                 data_page_version=data_page_version,
   1633                 **kwargs) as writer:
-> 1634             writer.write_table(table, row_group_size=row_group_size)
   1635     except Exception:
   1636         if _is_path_like(where):

~/code/arrow/python/pyarrow/parquet.py in write_table(self, table, row_group_size)
    586             raise ValueError(msg)
    587 
--> 588         self.writer.write_table(table, row_group_size=row_group_size)
    589 
    590     def close(self):

~/code/arrow/python/pyarrow/_parquet.pyx in pyarrow._parquet.ParquetWriter.write_table()
   1406 
   1407         with nogil:
-> 1408             check_status(self.writer.get()
   1409                          .WriteTable(deref(ctable), c_row_group_size))
   1410 

~/code/arrow/python/pyarrow/error.pxi in pyarrow.lib.check_status()
     97                 raise IOError(errno, message)
     98             else:
---> 99                 raise IOError(message)
    100         elif status.IsOutOfMemory():
    101             raise ArrowMemoryError(message)

OSError: Per ARROW-9424, writing files with LZ4 compression has been disabled until implementation issues have been resolved. It is recommended to read any existing files and rewrite them using a different compression.
In ../src/parquet/arrow/writer.cc, line 684, code: WriteColumnChunk(table.column(i), offset, size)

In [3]: pq.read_table('example.parquet.lz4').to_pandas()
Out[3]: 
   f0
0   1
1   2
2   3
3   4
4   5
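The error message recommends reading existing files and rewriting them with a different compression. A minimal sketch of that workflow follows, assuming the whole table fits in memory. The helper names `pick_codec` and `rewrite_parquet` and the default fallback codec are illustrative choices, not part of this patch.

```python
def pick_codec(requested, fallback="snappy"):
    """Map 'lz4' (any case) to the fallback codec; pass other values through."""
    if requested is not None and requested.lower() == "lz4":
        return fallback
    return requested


def rewrite_parquet(src, dst, compression="snappy"):
    """Read `src` fully and write it to `dst` with a non-LZ4 codec."""
    # Imported lazily so pick_codec can be used without pyarrow installed.
    import pyarrow.parquet as pq

    table = pq.read_table(src)
    pq.write_table(table, dst, compression=pick_codec(compression))
```

For example, `rewrite_parquet('example.parquet.lz4', 'example.parquet', compression='zstd')` would read the old file and write a ZSTD-compressed copy, provided the pyarrow build includes that codec.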

wesm

+1, will merge on green

wesm force-pushed the ARROW-9424 branch from c752cc0 to 1e02675 Compare July 14, 2020 20:21

wesm marked this pull request as ready for review July 14, 2020 20:22

patrickpai and others added 6 commits July 14, 2020 18:48

disable writing files with LZ4 codec in c++ library

dce2d1e

fix cpp lint

48c42cb

clang-format

c5dfdd2

Fix up Python unit tests, throw more helpful error message

357b866

Only disallow LZ4 for writing files, not reading

7e51a60

Ensure that existing lz4-compressed files can still be read

a8be0bc

wesm force-pushed the ARROW-9424 branch from 1e02675 to a8be0bc Compare July 14, 2020 23:52

wesm approved these changes Jul 14, 2020

wesm closed this in 3586292 Jul 15, 2020

asfimport mentioned this pull request Jul 15, 2020

[C++][Parquet] Disable writing files with LZ4 codec #25500

Closed

3 participants
