Skip to content

fix: Fix repeatedly url-decode path when reading parquet from s3 using native parquet reader#2138

Merged
mbutrovich merged 4 commits intoapache:mainfrom
Kontinuation:fix-s3-url-decode
Aug 18, 2025
Merged

fix: Fix repeatedly url-decode path when reading parquet from s3 using native parquet reader#2138
mbutrovich merged 4 commits intoapache:mainfrom
Kontinuation:fix-s3-url-decode

Conversation

@Kontinuation
Copy link
Copy Markdown
Member

@Kontinuation Kontinuation commented Aug 13, 2025

Which issue does this PR close?

Closes #2139.

Rationale for this change

The S3 object store support for the native parquet reader incorrectly url-decode the path. The path should already been url-decoded so decoding it again will corrupt the original path. If the path does not contain escape sequences then it is fine. However, if the S3 path has escape sequences, it will corrupt the path and we'll end up getting an error, or silently reading the wrong data.

I found S3 paths containing escape sequences when reading a partitioned table. The partition key contains a '#' character and the S3 paths for files in the partitioned table are something like this:

s3://bucket_name/path/to/data/p_brand=Brand%2321/part-xxxx.parquet

Note that Brand%2321 is part of the original S3 path, not the url-encoded path. The partition key is Brand#21, the directory names of partitioned tables are url-encoded by design to support any character sequences.

If we url-decode this path twice, the resulting path will be s3://bucket_name/path/to/data/p_brand=Brand#21/part-xxxx.parquet, which is different from the original path.

What changes are included in this PR?

This PR fixes the repeated S3 path url-decoding. Now native parquet reader could correctly handle S3 paths containing escape sequences.

How are these changes tested?

Add a new Scala test which writes a partitioned table with partition key containing '#' character and read it back using Comet.

@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented Aug 13, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 58.48%. Comparing base (f09f8af) to head (630ed52).
⚠️ Report is 1167 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main    #2138      +/-   ##
============================================
+ Coverage     56.12%   58.48%   +2.36%     
- Complexity      976     1257     +281     
============================================
  Files           119      143      +24     
  Lines         11743    13204    +1461     
  Branches       2251     2372     +121     
============================================
+ Hits           6591     7723    +1132     
- Misses         4012     4254     +242     
- Partials       1140     1227      +87     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@Kontinuation Kontinuation marked this pull request as ready for review August 13, 2025 11:43
Copy link
Copy Markdown
Contributor

@kazuyukitanimura kazuyukitanimura left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @Kontinuation

CometConf.SCAN_NATIVE_ICEBERG_COMPAT).foreach(scanMode => {
withSQLConf(CometConf.COMET_NATIVE_SCAN_IMPL.key -> scanMode) {
val df =
spark.read.format("parquet").load(testFilePath).agg(sum(col("id")), max(col("val")))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we include a test with encoded url?
E.g. Brand%2321
This is based on your description

 if the S3 path has escape sequences, it will corrupt the path and we'll end up getting an error, or silently reading the wrong data.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have added another test for testing encoded url explicitly.

Copy link
Copy Markdown
Contributor

@kazuyukitanimura kazuyukitanimura left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pending with CI

@mbutrovich mbutrovich merged commit f31cf78 into apache:main Aug 18, 2025
97 of 98 checks passed
coderfender pushed a commit to coderfender/datafusion-comet that referenced this pull request Dec 13, 2025
…g native parquet reader (apache#2138)

* Fix repeatedly url-decode path when reading parquet from s3 using native parquet reader

* Make ParquetReadFromS3Suite runs using all scan impls

* test: support URL escape sequences in write path

* Improve S3 parquet read tests
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Error reading parquet files on S3 using native_iceberg_compat reader when the path contains URL-encode escape sequences

4 participants