Conversation

@arthurpassos
Collaborator

Changelog category (leave one):

  • Improvement

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Set the maximum Thrift message size in the Parquet v3 reader to avoid DB::Exception: apache::thrift::transport::TTransportException: MaxMessageSize reached on files with large metadata footers.

Documentation entry for user-facing changes

...

CI/CD Options

Exclude tests:

  • Fast test
  • Integration Tests
  • Stateless tests
  • Stateful tests
  • Performance tests
  • All with ASAN
  • All with TSAN
  • All with MSAN
  • All with UBSAN
  • All with Coverage
  • All with Aarch64
  • All Regression
  • Disable CI Cache

Regression jobs to run:

  • Fast suites (mostly <1h)
  • Aggregate Functions (2h)
  • Alter (1.5h)
  • Benchmark (30m)
  • ClickHouse Keeper (1h)
  • Iceberg (2h)
  • LDAP (1h)
  • Parquet (1.5h)
  • RBAC (1.5h)
  • SSL Server (1h)
  • S3 (2h)
  • Tiered Storage (2h)

@github-actions

github-actions bot commented Dec 8, 2025

Workflow [PR], commit [8b96e89]

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

@ianton-ru

As I understand it, this PR increased the message size limit from 100 megabytes to 2 gigabytes.
But what is in the message?
As I see in the thrift library, the message size is simply compared against the limit, and both are signed 32-bit integers. Are you sure that with the new limit we can't hit an integer overflow somewhere in the thrift code?

@arthurpassos
Collaborator Author

As I understand, PR increased message size limit from 100 megabytes to 2 gigabytes. But what is in message? As I see in thrift library, some message size just compared with limit. Both are signed 32-bit integers. Are you sure that with new limit we can't catch integer overflow somewhere in thrift code?

I have briefly checked arrow/thrift and did not find problematic occurrences that would lead to integer overflow. I see the limit is used to set TTransport::remainingMessageSize_ and TTransport::knownMessageSize_; neither appears to perform additions that could overflow.
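For concreteness, the headroom can be sanity-checked in a few lines. The exact new limit is not quoted in this thread, so the value below is an assumption based on the "2 gigabytes" figure mentioned above; the 100 MiB old limit is thrift's default MaxMessageSize:

```python
# Sanity check that a "2 GB" limit fits in Thrift's signed 32-bit size fields.
INT32_MAX = 2**31 - 1  # upper bound of a signed 32-bit integer

old_limit = 100 * 1024 * 1024  # thrift's default MaxMessageSize (100 MiB)
new_limit = 2 * 1000**3        # assumed: "2 gigabytes" read as a decimal value

assert old_limit <= INT32_MAX
assert new_limit <= INT32_MAX   # a decimal 2 GB fits in a signed int32...
assert 2 * 1024**3 > INT32_MAX  # ...but a binary 2 GiB (2**31) would not
```

So as long as the configured limit itself fits in int32 and the library only compares sizes against it (rather than adding to it), the comparison-only usage described above should be safe.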

@ilejn
Collaborator

ilejn commented Dec 9, 2025

Was this change tested?

@arthurpassos
Collaborator Author

Was this change tested?

Not with real data. @dima-altinity was going to test it with real customer data, I have not heard from him yet.

On my side, I tried creating a parquet file with a big metadata footer using the following ChatGPT-provided script:

import os, struct
import pyarrow as pa
import pyarrow.parquet as pq

TARGET = 120 * 1024 * 1024  # 120 MiB

table = pa.table({"x": pa.array([1, 2, 3], type=pa.int32())})

payload = b"A" * TARGET
meta = dict(table.schema.metadata or {})
meta[b"giant_meta"] = payload
table2 = table.replace_schema_metadata(meta)

out = "too_big_footer_kv.parquet"
pq.write_table(table2, out, compression="snappy")

# Read parquet footer length (stored in last 8 bytes: <footer_len><PAR1>)
with open(out, "rb") as f:
    f.seek(-8, os.SEEK_END)
    footer_len = struct.unpack("<I", f.read(4))[0]
    magic = f.read(4)

print("file =", out)
print("footer_len =", footer_len, "bytes")
print("magic =", magic)
print("file_size =", os.path.getsize(out), "bytes")
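The footer-reading stanza at the end of the script can be factored into a small helper for asserting that a generated file actually exceeds the old limit. This is a sketch; the 100 MiB constant is thrift's default MaxMessageSize per the discussion above:

```python
import os
import struct

THRIFT_DEFAULT_MAX_MESSAGE_SIZE = 100 * 1024 * 1024  # 100 MiB

def parquet_footer_len(path):
    """Return the Thrift-serialized footer length, stored as a little-endian
    uint32 in the 4 bytes immediately preceding the trailing PAR1 magic."""
    with open(path, "rb") as f:
        f.seek(-8, os.SEEK_END)
        footer_len = struct.unpack("<I", f.read(4))[0]
        assert f.read(4) == b"PAR1", "not a parquet file"
    return footer_len

# e.g.: assert parquet_footer_len(out) > THRIFT_DEFAULT_MAX_MESSAGE_SIZE
```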

Before these changes, I would get:

arthur :) select * from file('too_big_footer_kv.parquet') SETTINGS input_format_parquet_use_native_reader_v3=1;

SELECT *
FROM file('too_big_footer_kv.parquet')
SETTINGS input_format_parquet_use_native_reader_v3 = 1

Query id: 4f316fce-9807-4702-a50b-fc7ef930034b


Elapsed: 0.683 sec. 

Received exception from server (version 25.8.9):
Code: 636. DB::Exception: Received from localhost:9000. DB::Exception: The table structure cannot be extracted from a Parquet format file. Error:
Code: 1001. DB::Exception: apache::thrift::transport::TTransportException: MaxMessageSize reached. (STD_EXCEPTION) (version 25.8.9.20000.altinityantalya).
You can specify the structure manually: (in file/uri /home/laptop/work/altinity/export_replicated_mt_partition/programs/server/user_files/too_big_footer_kv.parquet). (CANNOT_EXTRACT_TABLE_STRUCTURE)

After these changes, I get:

arthur :) select * from file('too_big_footer_kv.parquet') SETTINGS input_format_parquet_use_native_reader_v3=1;

SELECT *
FROM file('too_big_footer_kv.parquet')
SETTINGS input_format_parquet_use_native_reader_v3 = 1

Query id: 386acd35-c2c0-4419-996c-a35a969f4468

   ┌─x─┐
1. │ 1 │
2. │ 2 │
3. │ 3 │
   └───┘

3 rows in set. Elapsed: 4.553 sec. 

I opted not to introduce a test because the .parquet file required would be over 100 MB.

@ilejn
Collaborator

ilejn commented Dec 9, 2025

It may be interesting to know whether this memory is tracked by ClickHouse memory quotas (some libraries have issues with this, others don't), but that is probably out of scope for this PR.

@ilejn ilejn (Collaborator) left a comment

LGTM.

@zvonand zvonand merged commit b69dde6 into antalya-25.8 Dec 10, 2025
126 of 132 checks passed
@Selfeer Selfeer self-requested a review December 12, 2025 11:26
@Selfeer Selfeer added the verified Verified by QA label Dec 16, 2025
@Selfeer
Collaborator

Selfeer commented Dec 16, 2025

Verification test: https://github.com/Altinity/clickhouse-regression/blob/52cf60980b118efa4fe7dc0d790b1d7de2c757a0/parquet/tests/native_reader.py#L237

The test dynamically generates a parquet file, puts it into user_files in ClickHouse, and cleans up after execution so as not to keep a 300 MB parquet file around.

Currently, reading this file is only possible with input_format_parquet_use_native_reader_v3=1.

With input_format_parquet_use_native_reader alone, or with no native reader, the file cannot be read.

7 participants