PARQUET-2027: Fix calculating directory offset for merge #896
shangxinli merged 1 commit into apache:master
Conversation
    long origPos = -1;
    try {
      origPos = in.getPos();
      in.seek(chunk.getStartingPos());
Do we assume the dictionary page is always the chunk starting address?
It is not obvious, and one has to search for this statement in the Encoding docs, but it is there:
The dictionary page is written first, before the data pages of the column chunk.
I know it is true today, but what if that assumption is broken as more and more page types are added? Can we add something to the Encoding docs so that people do not change that assumption?
I agree it should be specified more clearly, and maybe not only in the Encoding doc but also somewhere on the "main" page, but I feel it is a separate topic.
Can you create a Jira for it, @gszadovszky, so that we don't lose track of it?
Other than that, LGTM!
Sure, @shangxinli. Check out PARQUET-2034 for details.
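For readers following the thread, here is a minimal, self-contained sketch of the two ideas discussed above: taking the dictionary page offset as the chunk start when it precedes the first data page, and saving and restoring the stream position around the seek. The names `DictionaryOffsetSketch`, `SeekableBytes`, `startingPos`, and `peekAtChunkStart` are hypothetical and are not parquet-mr classes or methods. Treating an offset of 0 or less as "no dictionary page" is an assumption of this sketch, not something stated in the review.

```java
import java.nio.ByteBuffer;

public class DictionaryOffsetSketch {

  /** Hypothetical stand-in for a seekable input stream; not a parquet-mr class. */
  static final class SeekableBytes {
    private final ByteBuffer buf;

    SeekableBytes(byte[] data) {
      this.buf = ByteBuffer.wrap(data);
    }

    long getPos() {
      return buf.position();
    }

    void seek(long pos) {
      buf.position((int) pos);
    }

    /** Returns the next byte as an unsigned value, or -1 at end of input. */
    int read() {
      return buf.hasRemaining() ? (buf.get() & 0xFF) : -1;
    }
  }

  /**
   * Start of a column chunk under the assumption that the dictionary page,
   * if present, is written before the data pages.
   * Assumption of this sketch: an offset <= 0 means "no dictionary page".
   */
  static long startingPos(long dictionaryPageOffset, long firstDataPageOffset) {
    if (dictionaryPageOffset > 0 && dictionaryPageOffset < firstDataPageOffset) {
      return dictionaryPageOffset;
    }
    return firstDataPageOffset;
  }

  /** Reads one byte at the chunk start, then restores the caller's position. */
  static int peekAtChunkStart(SeekableBytes in, long chunkStart) {
    long origPos = -1;
    try {
      origPos = in.getPos();
      in.seek(chunkStart);
      return in.read();
    } finally {
      // Restore only if the original position was actually captured.
      if (origPos != -1) {
        in.seek(origPos);
      }
    }
  }

  public static void main(String[] args) {
    SeekableBytes in = new SeekableBytes(new byte[] {10, 20, 30, 40, 50});
    in.seek(4); // the caller is somewhere else in the file
    long start = startingPos(1, 3); // dictionary page at 1, first data page at 3
    int first = peekAtChunkStart(in, start);
    System.out.println(start + " " + first + " " + in.getPos()); // prints "1 20 4"
  }
}
```

If the "dictionary page first" assumption were ever broken, `startingPos` would fall back to the first data page offset and a reader seeking there would miss the dictionary, which is the concern raised in the thread.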
(cherry picked from commit 2ce35c7)
* 'master' of https://github.com/apache/parquet-mr: (222 commits)
  PARQUET-2052: Integer overflow when writing huge binary using dictionary encoding (apache#910)
  PARQUET-2041: Add zstd to `parquet.compression` description of ParquetOutputFormat Javadoc (apache#899)
  PARQUET-2050: Expose repetition & definition level from ColumnIO (apache#908)
  PARQUET-1761: Lower Logging Level in ParquetOutputFormat (apache#745)
  PARQUET-2046: Upgrade Apache POM to 23 (apache#904)
  PARQUET-2048: Deprecate BaseRecordReader (apache#906)
  PARQUET-1922: Deprecate IOExceptionUtils (apache#825)
  PARQUET-2037: Write INT96 with parquet-avro (apache#901)
  PARQUET-2044: Enable ZSTD buffer pool by default (apache#903)
  PARQUET-2038: Upgrade Jackson version used in parquet encryption. (apache#898)
  Revert "[WIP] Refactor GroupReadSupport to unuse deprecated api (apache#894)"
  PARQUET-2027: Fix calculating directory offset for merge (apache#896)
  [WIP] Refactor GroupReadSupport to unuse deprecated api (apache#894)
  PARQUET-2030: Expose page size row check configurations to ParquetWriter.Builder (apache#895)
  PARQUET-2031: Upgrade to parquet-format 2.9.0 (apache#897)
  PARQUET-1448: Review of ParquetFileReader (apache#892)
  PARQUET-2020: Remove deprecated modules (apache#888)
  PARQUET-2025: Update Snappy version to 1.1.8.3 (apache#893)
  PARQUET-2022: ZstdDecompressorStream should close `zstdInputStream` (apache#889)
  PARQUET-1982: Random access to row groups in ParquetFileReader (apache#871)
  ...

# Conflicts:
#	parquet-column/src/main/java/org/apache/parquet/example/data/simple/SimpleGroup.java
#	parquet-hadoop/pom.xml
#	parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java
#	parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java