Error reading parquet files on S3 using native_iceberg_compat reader when the path contains URL-encode escape sequences #2139

@Kontinuation

Describe the bug

This is a bug in a feature introduced by #1817.

The S3 object store support for the native parquet reader incorrectly URL-decodes the path. The path has already been URL-decoded, so decoding it again corrupts it. Paths without escape sequences are unaffected. However, if the S3 path contains escape sequences, the extra decode changes the path, and we end up either getting an error or silently reading the wrong data.

I found S3 paths containing escape sequences when reading a partitioned table. The partition value contains a '#' character, and the S3 paths for files in the partitioned table look like this:

s3://bucket_name/path/to/data/p_brand=Brand%2321/part-xxxx.parquet

Note that Brand%2321 is part of the original S3 path: it is the literal object key, not a URL-encoded form of it. The partition value is Brand#21, and the directory names of partitioned tables are URL-encoded by design so that they can represent any character sequence.

If we URL-decode this path again, the resulting path is s3://bucket_name/path/to/data/p_brand=Brand#21/part-xxxx.parquet, which is different from the original path.
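
The effect can be reproduced outside Comet. The following is a minimal Python sketch, not Comet's code: `urllib.parse.quote` only approximates the escaping Spark applies to partition directory names, and the variable names are made up for illustration.

```python
from urllib.parse import quote, unquote

# The partition value as stored in the table.
partition_value = "Brand#21"

# Approximation (assumption) of how the partition directory name is escaped
# when the table is written: '#' (0x23) becomes '%23'.
dir_name = "p_brand=" + quote(partition_value, safe="")

# The literal S3 object key. The '%23' here is part of the key itself.
original_key = f"path/to/data/{dir_name}/part-xxxx.parquet"

# Decoding the key once more turns '%23' back into '#', yielding a key
# that no longer matches the object stored on S3.
decoded_again = unquote(original_key)

# A key without escape sequences is unaffected by the extra decode,
# which is why the bug only shows up for some paths.
plain_key = "path/to/data/part-xxxx.parquet"
plain_decoded = unquote(plain_key)

print(original_key)
print(decoded_again)
```

The first printed key is the one that exists on S3; the second is the one that appears in the NoSuchKey error in the report.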

Steps to reproduce

Count the rows of a parquet file with Comet enabled, using an S3 path that contains an escape sequence:

spark.read.parquet("s3://bucket_name/path/to/data/p_brand=Brand%2321/part-xxxx.parquet").count()

This produces an error:

Caused by: org.apache.comet.CometNativeException: External: Object at location path/to/data/p_brand=Brand#21/part-xxxx.parquet not found: Error performing GET https://s3.us-west-2.amazonaws.com/.../p_brand%3DBrand%2352/part-xxxx.snappy.parquet in 53.743599ms - Server returned non-2xx status code: 404 Not Found: <?xml version="1.0" encoding="UTF-8"?>
<Error><Code>NoSuchKey</Code><Message>The specified key does not exist.</Message><Key>path/to/data/p_brand=Brand#21/part-xxxx.parquet</Key><RequestId>R05Q6ASV5FECFQGW</RequestId><HostId>sFNJHdsH0it3d0WbQTczSO5wku4zVzEKXgp0d/K4z1Onj/Sy+m18q54xvYzeu2eRhJ8qz+dIBBE=</HostId></Error>
	at org.apache.comet.parquet.Native.readNextRecordBatch(Native Method)
	at org.apache.comet.parquet.NativeBatchReader.loadNextBatch(NativeBatchReader.java:812)
	at org.apache.comet.parquet.NativeBatchReader.nextBatch(NativeBatchReader.java:749)
	at org.apache.comet.parquet.NativeBatchReader.nextKeyValue(NativeBatchReader.java:707)
	at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:131)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:286)
	... 36 more

Expected behavior

The parquet file should be loaded correctly.
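
In other words, the object key should be taken from the path verbatim. A minimal Python sketch of that behavior follows; `split_s3_uri` is a hypothetical helper written for this illustration and is not part of Comet or of any object store library.

```python
def split_s3_uri(uri: str) -> tuple[str, str]:
    """Split an s3:// URI into (bucket, key) without URL-decoding anything.

    Hypothetical helper for illustration only. Plain string operations are
    used so that '%', '#' and '?' inside the key are left untouched.
    """
    prefix = "s3://"
    if not uri.startswith(prefix):
        raise ValueError(f"not an s3 URI: {uri}")
    # Everything up to the first '/' is the bucket, the rest is the key.
    bucket, _, key = uri[len(prefix):].partition("/")
    return bucket, key


bucket, key = split_s3_uri(
    "s3://bucket_name/path/to/data/p_brand=Brand%2321/part-xxxx.parquet"
)
print(bucket)
print(key)
```

With this approach the key keeps its `%23` sequence, so the request targets the object that actually exists.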

Labels: bug
