Conversation

@arthurpassos
Collaborator

Changelog category (leave one):

  • Improvement

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Set the maximum Thrift message size in the Parquet v3 reader to avoid DB::Exception: apache::thrift::transport::TTransportException: MaxMessageSize reached on files with large metadata footers.

Documentation entry for user-facing changes

...

CI/CD Options

Exclude tests:

  • Fast test
  • Integration Tests
  • Stateless tests
  • Stateful tests
  • Performance tests
  • All with ASAN
  • All with TSAN
  • All with MSAN
  • All with UBSAN
  • All with Coverage
  • All with Aarch64
  • All Regression
  • Disable CI Cache

Regression jobs to run:

  • Fast suites (mostly <1h)
  • Aggregate Functions (2h)
  • Alter (1.5h)
  • Benchmark (30m)
  • ClickHouse Keeper (1h)
  • Iceberg (2h)
  • LDAP (1h)
  • Parquet (1.5h)
  • RBAC (1.5h)
  • SSL Server (1h)
  • S3 (2h)
  • Tiered Storage (2h)

@github-actions

github-actions bot commented Dec 8, 2025

Workflow [PR], commit [8b96e89]

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

@ianton-ru

As I understand it, this PR increased the message size limit from 100 megabytes to 2 gigabytes.
But what is in the message?
As I see in the thrift library, the message size is simply compared against the limit, and both are signed 32-bit integers. Are you sure that with the new limit we can't hit an integer overflow somewhere in the thrift code?

@arthurpassos
Collaborator Author

As I understand, PR increased message size limit from 100 megabytes to 2 gigabytes. But what is in message? As I see in thrift library, some message size just compared with limit. Both are signed 32-bit integers. Are you sure that with new limit we can't catch integer overflow somewhere in thrift code?

I have briefly checked arrow/thrift and did not find problematic occurrences that would lead to integer overflow. I see the limit is used to set TTransport::remainingMessageSize_ and TTransport::knownMessageSize_; neither appears to perform additions that could overflow.
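For concreteness, the headroom can be sanity-checked in a few lines. The exact new limit is not quoted in this thread, so the value below is an assumption based on the "2 gigabytes" figure mentioned above; the 100 MiB old limit is thrift's default MaxMessageSize:

```python
# Sanity check that a "2 GB" limit fits in Thrift's signed 32-bit size fields.
INT32_MAX = 2**31 - 1  # upper bound of a signed 32-bit integer

old_limit = 100 * 1024 * 1024  # thrift's default MaxMessageSize (100 MiB)
new_limit = 2 * 1000**3        # assumed: "2 gigabytes" read as a decimal value

assert old_limit <= INT32_MAX
assert new_limit <= INT32_MAX   # a decimal 2 GB fits in a signed int32...
assert 2 * 1024**3 > INT32_MAX  # ...but a binary 2 GiB (2**31) would not
```

So as long as the configured limit itself fits in int32 and the library only compares sizes against it (rather than adding to it), the comparison-only usage described above should be safe.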

@ilejn
Collaborator

ilejn commented Dec 9, 2025

Was this change tested?

@arthurpassos
Collaborator Author

Was this change tested?

Not with real data. @dima-altinity was going to test it with real customer data, I have not heard from him yet.

On my side, I tried creating a parquet file with a big metadata footer using the following ChatGPT-provided script:

import os, struct
import pyarrow as pa
import pyarrow.parquet as pq

TARGET = 120 * 1024 * 1024  # 120 MiB

table = pa.table({"x": pa.array([1, 2, 3], type=pa.int32())})

payload = b"A" * TARGET
meta = dict(table.schema.metadata or {})
meta[b"giant_meta"] = payload
table2 = table.replace_schema_metadata(meta)

out = "too_big_footer_kv.parquet"
pq.write_table(table2, out, compression="snappy")

# Read parquet footer length (stored in last 8 bytes: <footer_len><PAR1>)
with open(out, "rb") as f:
    f.seek(-8, os.SEEK_END)
    footer_len = struct.unpack("<I", f.read(4))[0]
    magic = f.read(4)

print("file =", out)
print("footer_len =", footer_len, "bytes")
print("magic =", magic)
print("file_size =", os.path.getsize(out), "bytes")
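The footer-reading stanza at the end of the script can be factored into a small helper for asserting that a generated file actually exceeds the old limit. This is a sketch; the 100 MiB constant is thrift's default MaxMessageSize per the discussion above:

```python
import os
import struct

THRIFT_DEFAULT_MAX_MESSAGE_SIZE = 100 * 1024 * 1024  # 100 MiB

def parquet_footer_len(path):
    """Return the Thrift-serialized footer length, stored as a little-endian
    uint32 in the 4 bytes immediately preceding the trailing PAR1 magic."""
    with open(path, "rb") as f:
        f.seek(-8, os.SEEK_END)
        footer_len = struct.unpack("<I", f.read(4))[0]
        assert f.read(4) == b"PAR1", "not a parquet file"
    return footer_len

# e.g.: assert parquet_footer_len(out) > THRIFT_DEFAULT_MAX_MESSAGE_SIZE
```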

Before these changes, I would get:

arthur :) select * from file('too_big_footer_kv.parquet') SETTINGS input_format_parquet_use_native_reader_v3=1;

SELECT *
FROM file('too_big_footer_kv.parquet')
SETTINGS input_format_parquet_use_native_reader_v3 = 1

Query id: 4f316fce-9807-4702-a50b-fc7ef930034b


Elapsed: 0.683 sec. 

Received exception from server (version 25.8.9):
Code: 636. DB::Exception: Received from localhost:9000. DB::Exception: The table structure cannot be extracted from a Parquet format file. Error:
Code: 1001. DB::Exception: apache::thrift::transport::TTransportException: MaxMessageSize reached. (STD_EXCEPTION) (version 25.8.9.20000.altinityantalya).
You can specify the structure manually: (in file/uri /home/laptop/work/altinity/export_replicated_mt_partition/programs/server/user_files/too_big_footer_kv.parquet). (CANNOT_EXTRACT_TABLE_STRUCTURE)

After these changes, I get:

arthur :) select * from file('too_big_footer_kv.parquet') SETTINGS input_format_parquet_use_native_reader_v3=1;

SELECT *
FROM file('too_big_footer_kv.parquet')
SETTINGS input_format_parquet_use_native_reader_v3 = 1

Query id: 386acd35-c2c0-4419-996c-a35a969f4468

   ┌─x─┐
1. │ 1 │
2. │ 2 │
3. │ 3 │
   └───┘

3 rows in set. Elapsed: 4.553 sec. 

I opted not to introduce a test because the .parquet file required would be over 100 MB.

@ilejn
Collaborator

ilejn commented Dec 9, 2025

It may be interesting to know whether this memory is tracked by ClickHouse memory quotas (some libraries have issues with this, others don't), but that is probably out of scope for this PR.

@ilejn ilejn (Collaborator) left a comment

LGTM.

@zvonand zvonand merged commit b69dde6 into antalya-25.8 Dec 10, 2025
126 of 132 checks passed
@Selfeer Selfeer self-requested a review December 12, 2025 11:26
@Selfeer Selfeer added the verified Verified by QA label Dec 16, 2025
@Selfeer
Collaborator

Selfeer commented Dec 16, 2025

Verification test: https://github.com/Altinity/clickhouse-regression/blob/52cf60980b118efa4fe7dc0d790b1d7de2c757a0/parquet/tests/native_reader.py#L237

The test dynamically generates a parquet file, puts it into user_files in ClickHouse, and cleans up after execution so as not to keep a 300 MB parquet file around.

Currently, reading this file is only possible with input_format_parquet_use_native_reader_v3=1.

With input_format_parquet_use_native_reader alone, or with no native reader, the file cannot be read.

7 participants