@pitrou pitrou commented Jan 18, 2021

It seems that when the decompressed size exceeds 128 KiB, Hadoop compresses the data in several concatenated "frames".

Data in this file:
```
Version: 1.0
Created By: parquet-mr version 1.11.1 (build 765bd5cd7fdef2af1cecd0755000694b992bfadd)
Total rows: 10000
Number of RowGroups: 1
Number of Real Columns: 1
Number of Columns: 1
Number of Selected Columns: 1
Column 0: a (BYTE_ARRAY/UTF8)
--- Row Group: 0 ---
--- Total Bytes: 400029 ---
--- Rows: 10000 ---
Column 0
  Values: 10000, Null Values: 0, Distinct Values: 0
  Max: ffffe6a0-e0c0-4e65-a9d4-f7f4c176aea2, Min: 00087de7-10df-4979-94cf-79279f9745ce
  Compression: LZ4_HADOOP, Encodings: BIT_PACKED PLAIN
  Uncompressed Size: 400029, Compressed Size: 358351
--- Values ---
a                             |
[ ... ]
```
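
To make the "concatenated frames" observation concrete, here is a rough sketch of how such a buffer could be split. It assumes a layout that the description above does not spell out: each frame starts with two big-endian 32-bit integers (decompressed size, then compressed size), followed by the raw LZ4 block payload. The function name `split_hadoop_lz4_frames` is hypothetical, and the payloads are returned undecoded; each would still have to be passed to an LZ4 block decompressor.

```python
import struct
from typing import Iterator, Tuple

# Assumed frame header: two big-endian uint32 values
# (decompressed size, compressed size).
HEADER = struct.Struct(">II")


def split_hadoop_lz4_frames(buf: bytes) -> Iterator[Tuple[int, bytes]]:
    """Yield (decompressed_size, compressed_payload) for each frame in buf.

    Hypothetical helper: it only walks the assumed framing and does not
    decompress anything.
    """
    pos = 0
    while pos < len(buf):
        if len(buf) - pos < HEADER.size:
            raise ValueError("truncated frame header")
        decompressed_size, compressed_size = HEADER.unpack_from(buf, pos)
        pos += HEADER.size
        if compressed_size > len(buf) - pos:
            raise ValueError("truncated frame payload")
        yield decompressed_size, buf[pos:pos + compressed_size]
        pos += compressed_size
```

Under this assumption, a page like the one above (400029 bytes uncompressed) would yield more than one frame, and the sum of the per-frame decompressed sizes should equal the page's uncompressed size.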