fix: Fix repeatedly url-decode path when reading parquet from s3 using native parquet reader by Kontinuation · Pull Request #2138 · apache/datafusion-comet

Kontinuation · 2025-08-13T09:27:51Z

Which issue does this PR close?

Closes #2139.

Rationale for this change

The S3 object store support for the native parquet reader incorrectly url-decode the path. The path should already been url-decoded so decoding it again will corrupt the original path. If the path does not contain escape sequences then it is fine. However, if the S3 path has escape sequences, it will corrupt the path and we'll end up getting an error, or silently reading the wrong data.

I found S3 paths containing escape sequences when reading a partitioned table. The partition key contains a '#' character and the S3 paths for files in the partitioned table are something like this:

s3://bucket_name/path/to/data/p_brand=Brand%2321/part-xxxx.parquet

Note that Brand%2321 is part of the original S3 path, not the url-encoded path. The partition key is Brand#21, the directory names of partitioned tables are url-encoded by design to support any character sequences.

If we url-decode this path twice, the resulting path will be s3://bucket_name/path/to/data/p_brand=Brand#21/part-xxxx.parquet, which is different from the original path.

What changes are included in this PR?

This PR fixes the repeated S3 path url-decoding. Now native parquet reader could correctly handle S3 paths containing escape sequences.

How are these changes tested?

Add a new Scala test which writes a partitioned table with partition key containing '#' character and read it back using Comet.

…ive parquet reader

codecov-commenter · 2025-08-13T09:54:00Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 58.48%. Comparing base (f09f8af) to head (630ed52).
⚠️ Report is 1167 commits behind head on main.

Additional details and impacted files

@@             Coverage Diff              @@
##               main    #2138      +/-   ##
============================================
+ Coverage     56.12%   58.48%   +2.36%     
- Complexity      976     1257     +281     
============================================
  Files           119      143      +24     
  Lines         11743    13204    +1461     
  Branches       2251     2372     +121     
============================================
+ Hits           6591     7723    +1132     
- Misses         4012     4254     +242     
- Partials       1140     1227      +87

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

kazuyukitanimura

Thank you @Kontinuation

kazuyukitanimura · 2025-08-13T23:42:46Z

+      CometConf.SCAN_NATIVE_ICEBERG_COMPAT).foreach(scanMode => {
+      withSQLConf(CometConf.COMET_NATIVE_SCAN_IMPL.key -> scanMode) {
+        val df =
+          spark.read.format("parquet").load(testFilePath).agg(sum(col("id")), max(col("val")))


Should we include a test with encoded url?
E.g. Brand%2321
This is based on your description

if the S3 path has escape sequences, it will corrupt the path and we'll end up getting an error, or silently reading the wrong data.

I have added another test for testing encoded url explicitly.

kazuyukitanimura

pending with CI

…g native parquet reader (apache#2138) * Fix repeatedly url-decode path when reading parquet from s3 using native parquet reader * Make ParquetReadFromS3Suite runs using all scan impls * test: support URL escape sequences in write path * Improve S3 parquet read tests

Fix repeatedly url-decode path when reading parquet from s3 using nat…

20dc9ac

…ive parquet reader

Make ParquetReadFromS3Suite runs using all scan impls

6ab434a

Kontinuation marked this pull request as ready for review August 13, 2025 11:43

kazuyukitanimura reviewed Aug 13, 2025

View reviewed changes

Kontinuation added 2 commits August 14, 2025 15:28

test: support URL escape sequences in write path

0643e41

Improve S3 parquet read tests

630ed52

Kontinuation force-pushed the fix-s3-url-decode branch from 9405aee to 630ed52 Compare August 14, 2025 07:29

kazuyukitanimura approved these changes Aug 14, 2025

View reviewed changes

mbutrovich merged commit f31cf78 into apache:main Aug 18, 2025
97 of 98 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: Fix repeatedly url-decode path when reading parquet from s3 using native parquet reader#2138

fix: Fix repeatedly url-decode path when reading parquet from s3 using native parquet reader#2138
mbutrovich merged 4 commits intoapache:mainfrom
Kontinuation:fix-s3-url-decode

Kontinuation commented Aug 13, 2025 •

edited

Loading

Uh oh!

codecov-commenter commented Aug 13, 2025 •

edited

Loading

Uh oh!

kazuyukitanimura left a comment

Uh oh!

kazuyukitanimura Aug 13, 2025

Uh oh!

Kontinuation Aug 14, 2025

Uh oh!

kazuyukitanimura left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

Kontinuation commented Aug 13, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

How are these changes tested?

Uh oh!

codecov-commenter commented Aug 13, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

kazuyukitanimura left a comment

Choose a reason for hiding this comment

Uh oh!

kazuyukitanimura Aug 13, 2025

Choose a reason for hiding this comment

Uh oh!

Kontinuation Aug 14, 2025

Choose a reason for hiding this comment

Uh oh!

kazuyukitanimura left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Kontinuation commented Aug 13, 2025 •

edited

Loading

codecov-commenter commented Aug 13, 2025 •

edited

Loading