Error reading parquet files on S3 using native_iceberg_compat reader when the path contains URL-encode escape sequences

### Describe the bug

This bug is in a feature introduced by https://github.com/apache/datafusion-comet/pull/1817.

The S3 object store support for the native parquet reader incorrectly url-decodes the path. The path it receives has already been url-decoded, so decoding it again corrupts it. Paths without escape sequences are unaffected, since decoding leaves them unchanged. For an S3 path that does contain escape sequences, the reader ends up requesting a different key, and we either get an error or silently read the wrong data.

I ran into S3 paths containing escape sequences when reading a partitioned table. The partition value contains a '#' character, and the S3 paths for files in the partitioned table look like this:

`s3://bucket_name/path/to/data/p_brand=Brand%2321/part-xxxx.parquet`

Note that `Brand%2321` is part of the original S3 path as stored, not a url-encoded form of it. The partition value is `Brand#21`; directory names of partitioned tables are url-encoded by design so that they can represent any character sequence.

If we url-decode this path once more, the resulting path is `s3://bucket_name/path/to/data/p_brand=Brand#21/part-xxxx.parquet`, which is different from the original path.
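The corruption can be illustrated outside of Comet with Python's standard `urllib.parse`. This is only a sketch of the mechanism, not the reader's actual code: `quote` here merely approximates how the partition directory name is escaped, and the variable names are made up for illustration.

```python
from urllib.parse import quote, unquote

# Approximation (assumption): escape the partition value the way a
# partitioned table writer would, so '#' becomes the literal text "%23".
partition_dir = "p_brand=" + quote("Brand#21", safe="")

# The object key as it is actually stored in S3.
original_key = f"path/to/data/{partition_dir}/part-xxxx.parquet"

# In URI form the '%' itself is escaped as "%25".
uri_form = quote(original_key, safe="/=")

# Decoding the URI form exactly once recovers the stored key.
decoded_once = unquote(uri_form)

# Decoding a second time turns the literal "%23" into '#',
# producing a key that does not exist in the bucket.
decoded_twice = unquote(decoded_once)

print(original_key)
print(uri_form)
print(decoded_once == original_key)
print(decoded_twice)
```

The key produced by the second decode is the one that shows up in the `NoSuchKey` error below.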

### Steps to reproduce

Count the number of rows in a parquet file with Comet enabled. The S3 path must contain an escape sequence:

```python
spark.read.parquet("s3://bucket_name/path/to/data/p_brand=Brand%2321/part-xxxx.parquet").count()
```

This produces an error:

```
Caused by: org.apache.comet.CometNativeException: External: Object at location path/to/data/p_brand=Brand#21/part-xxxx.parquet not found: Error performing GET https://s3.us-west-2.amazonaws.com/.../p_brand%3DBrand%2352/part-xxxx.snappy.parquet in 53.743599ms - Server returned non-2xx status code: 404 Not Found: <?xml version="1.0" encoding="UTF-8"?>
<Error><Code>NoSuchKey</Code><Message>The specified key does not exist.</Message><Key>path/to/data/p_brand=Brand#21/part-xxxx.parquet</Key><RequestId>R05Q6ASV5FECFQGW</RequestId><HostId>sFNJHdsH0it3d0WbQTczSO5wku4zVzEKXgp0d/K4z1Onj/Sy+m18q54xvYzeu2eRhJ8qz+dIBBE=</HostId></Error>
	at org.apache.comet.parquet.Native.readNextRecordBatch(Native Method)
	at org.apache.comet.parquet.NativeBatchReader.loadNextBatch(NativeBatchReader.java:812)
	at org.apache.comet.parquet.NativeBatchReader.nextBatch(NativeBatchReader.java:749)
	at org.apache.comet.parquet.NativeBatchReader.nextKeyValue(NativeBatchReader.java:707)
	at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:131)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:286)
	... 36 more
```

### Expected behavior

The parquet file should be loaded correctly.

### Additional context

_No response_
