Skip to content

Conversation

@slfan1989
Copy link
Contributor

@slfan1989 slfan1989 commented Aug 21, 2025

Summary

When no warehouse location is set, the code previously used getAbsolutePath() to configure the default Iceberg warehouse directory. This change switches to using toURI().toString(), ensuring the warehouse
location is represented as a standardized URI (e.g., file:/.../iceberg_data/).

@slfan1989
Copy link
Contributor Author

@nastra I discovered this issue while running the #13837 unit test: the RestCatalog used inconsistent warehouse path formats, which caused the path prefix validation to fail. The root cause was that the original implementation used getAbsolutePath(), while our unit tests consistently used toURI().toString(). This inconsistency not only led to test failures but also affected cross-platform compatibility.

This change unifies the implementation to use toURI().toString(), ensuring the warehouse path is always represented as a standardized URI. This resolves the unit test validation issue and provides consistent behavior across different operating systems.

@nastra
Copy link
Contributor

nastra commented Aug 21, 2025

which caused the path prefix validation to fail

can you elaborate which path prefix validation you mean here? I think it would be good to have a test here as well to show what was causing the issue that required this fix

while our unit tests consistently used toURI().toString()

I do actually see quite a lot of other places in tests that use getAbsolutePath(), such as https://github.com/apache/iceberg/blob/main/core/src/test/java/org/apache/iceberg/rest/TestRESTCatalog.java#L124.

I'm not saying that this is a good/bad change, I'm just trying to understand why this is all of a sudden needed now

@slfan1989
Copy link
Contributor Author

which caused the path prefix validation to fail

can you elaborate which path prefix validation you mean here? I think it would be good to have a test here as well to show what was causing the issue that required this fix

while our unit tests consistently used toURI().toString()

I do actually see quite a lot of other places in tests that use getAbsolutePath(), such as https://github.com/apache/iceberg/blob/main/core/src/test/java/org/apache/iceberg/rest/TestRESTCatalog.java#L124.

I'm not saying that this is a good/bad change, I'm just trying to understand why this is all of a sudden needed now

@nastra Many thanks for your comment!

In #13837, I added a new unit test to verify that the two newly introduced metrics, rewrittenDeleteFilesCount and rewrittenManifestFilesCount, are greater than 0.

This unit test was added to TestRewriteTablePathProcedure and is executed across multiple catalogs: HIVE, HADOOP, SPARK_SESSION, and REST.

The test passes normally under HIVE, HADOOP, and SPARK_SESSION, but it fails under the REST catalog with the following exception:

Path file:/var/folders/2k/21gv5vmx6z7dlr1g0_jtm8r80000ks/T/iceberg_warehouse10689209309257803697/iceberg_data/default/table/data/deletes.parquet does not start with /var/folders/2k/21gv5vmx6z7dlr1g0_jtm8r80000ks/T/iceberg_warehouse10689209309257803697/iceberg_data/default/table/
java.lang.IllegalArgumentException: Path file:/var/folders/2k/21gv5vmx6z7dlr1g0_jtm8r80000ks/T/iceberg_warehouse10689209309257803697/iceberg_data/default/table/data/deletes.parquet does not start with /var/folders/2k/21gv5vmx6z7dlr1g0_jtm8r80000ks/T/iceberg_warehouse10689209309257803697/iceberg_data/default/table/
	at org.apache.iceberg.RewriteTablePathUtil.relativize(RewriteTablePathUtil.java:686)
	at org.apache.iceberg.RewriteTablePathUtil.newPath(RewriteTablePathUtil.java:663)
	at org.apache.iceberg.RewriteTablePathUtil.writeDeleteFileEntry(RewriteTablePathUtil.java:492)
	at org.apache.iceberg.RewriteTablePathUtil.lambda$rewriteDeleteManifest$5(RewriteTablePathUtil.java:438)
	at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:197)
	at java.base/java.util.Iterator.forEachRemaining(Iterator.java:133)
	at java.base/java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1845)
	at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:509)

The root cause of this issue lies in the different ways the base Hive warehouse path is generated in the test environment:

warehouseLocation = new File(tmp, "iceberg_data").getAbsolutePath();

This approach produces an absolute path without the file: prefix, for example:

/var/folders/.../iceberg_data/default/table/

In contrast, most other unit tests use the following:

String location = targetTableDir.toFile().toURI().toString();

This method generates a path with the file: prefix, for example:

file:/var/folders/.../iceberg_data/default/table/
image

From my perspective, I also believe that using .toURI().toString() provides better compatibility.

@slfan1989
Copy link
Contributor Author

@nastra Would it make sense to modify this PR in this way? Or should we standardize the unit tests to use toURI().toString() ? What do you think?

@nastra
Copy link
Contributor

nastra commented Aug 22, 2025

@nastra Would it make sense to modify this PR in this way? Or should we standardize the unit tests to use toURI().toString() ? What do you think?

I don't have an answer yet as I need to do some research across the codebase whether the proposed change here is safe in all cases and I'll be on vacation starting next week). Maybe other reviewers get a chance to take a look at this before me

@slfan1989
Copy link
Contributor Author

I don't have an answer yet as I need to do some research across the codebase whether the proposed change here is safe in all cases and I'll be on vacation starting next week). Maybe other reviewers get a chance to take a look at this before me

Thank you for your reply, and enjoy your holiday! I will conduct tests locally and sync the progress to this PR.

@slfan1989
Copy link
Contributor Author

After reviewing the latest survey results, I recommend using .toURI().toString() for the replacement here.

Upon analyzing all implementations of CatalogTestBase, I found that it currently supports four Catalog types: HIVE, HADOOP, SPARK_SESSION, and REST. Among them, the default paths for Hive tables in HIVE, HADOOP, and SPARK_SESSION Catalogs are all in URI format, as detailed below:

file:/var/folders/2k/21gv5vmx6z7dlr1g0_jtm8r80000ks/T/hive12918907709677177324/table

When creating a table, the BaseMetastoreCatalog#create method is invoked, but different Catalog types use their own implementations.

public Table create() {
TableOperations ops = newTableOps(identifier);
if (ops.current() != null) {
throw new AlreadyExistsException("Table already exists: %s", identifier);
}
String baseLocation = location != null ? location : defaultWarehouseLocation(identifier);
tableProperties.putAll(tableOverrideProperties());
TableMetadata metadata =
TableMetadata.newTableMetadata(schema, spec, sortOrder, baseLocation, tableProperties);
try {

The key code segment is:

String baseLocation = location != null ? location : defaultWarehouseLocation(identifier);

HIVE Catalog

For the HIVE Catalog, the HiveCatalog is invoked during table creation. HiveCatalog retrieves the table's location via the HMS interface, which returns a path in URI format.

protected String defaultWarehouseLocation(TableIdentifier tableIdentifier) {
// This is a little edgy since we basically duplicate the HMS location generation logic.
// Sadly I do not see a good way around this if we want to keep the order of events, like:
// - Create meta files
// - Create the metadata in HMS, and this way committing the changes
// Create a new location based on the namespace / database if it is set on database level
try {
Database databaseData =
clients.run(client -> client.getDatabase(tableIdentifier.namespace().levels()[0]));
if (databaseData.getLocationUri() != null) {
// If the database location is set use it as a base.
return String.format("%s/%s", databaseData.getLocationUri(), tableIdentifier.name());
}
} catch (NoSuchObjectException e) {
throw new NoSuchNamespaceException(
e, "Namespace does not exist: %s", tableIdentifier.namespace().levels()[0]);
} catch (TException e) {
throw new RuntimeException(
String.format("Metastore operation failed for %s", tableIdentifier), e);

The results of the local variables are as follows:

image

The path is displayed in the form of a URI.

HADOOP Catalog

When creating a table, HADOOP Catalog uses HadoopCatalog and appends a 'file:', which is essentially in the form of a URI.

TestBaseWithCatalog#configureValidationCatalog

case ICEBERG_CATALOG_TYPE_HADOOP:
this.validationCatalog =
new HadoopCatalog(spark.sessionState().newHadoopConf(), "file:" + warehouse);

image

The path is displayed in the form of a URI.

SPARK SESSION Catalog

When creating a table, SPARK_SESSION also uses HiveCatalog, so I won't repeat the related information.

The path is displayed in the form of a URI.

REST CataLog

REST Catalog uses JDBC Catalog, and the relevant code initializes the base path in RESTCatalogServer.

if (warehouseLocation == null) {
File tmp = java.nio.file.Files.createTempDirectory("iceberg_warehouse").toFile();
tmp.deleteOnExit();
warehouseLocation = new File(tmp, "iceberg_data").getAbsolutePath();
catalogProperties.put(CatalogProperties.WAREHOUSE_LOCATION, warehouseLocation);

image

This is displayed in the form of an absolute path, not a URI.

So I still believe that the REST Catalog should maintain consistency with HiveCatalog, HadoopCatalog, and SparkSessionCatalog by using the URI format to represent the location of the database.

@slfan1989
Copy link
Contributor Author

slfan1989 commented Aug 28, 2025

@RussellSpitzer @pvary @huaxingao Could you please take a look at this PR? Thank you very much! I've already outlined the detailed analysis process.

cc: @nastra

@RussellSpitzer
Copy link
Member

@slfan1989 I'm not sure I understand the issue. This PR isn't fixing any tests correct? In the #13837 this return is being changed so that the tests which use URI.toString() don't break? Couldn't we just change those tests? I'm not really opposed to changing this to standardize but it feels like the tests shouldn't be relying on URI output?

I'm more of a +0 here. If there was an obvious test this was fixing I'd be +1 but it doesn't seem like it has been a problem before?

@slfan1989
Copy link
Contributor Author

@slfan1989 I'm not sure I understand the issue. This PR isn't fixing any tests correct? In the #13837 this return is being changed so that the tests which use URI.toString() don't break? Couldn't we just change those tests? I'm not really opposed to changing this to standardize but it feels like the tests shouldn't be relying on URI output?

I'm more of a +0 here. If there was an obvious test this was fixing I'd be +1 but it doesn't seem like it has been a problem before?

@RussellSpitzer Thank you very much for your reply. I am indeed encountering an issue with a unit test, which appears in the newly submitted PR #13837, and this PR has not yet been merged. From my perspective, if we do not align the behavior of the REST Catalog with that of the other catalogs, we may not be able to resolve the issue effectively.

The problem arises in the TestRewriteTablePathProcedure#testRewriteTablePathWithoutFileList unit test. In the REST Catalog, due to the prefix check in RewriteTablePathUtil, the test fails with the following error message:

Path file:/var/folders/2k/21gv5vmx6z7dlr1g0_jtm8r80000ks/T/iceberg_warehouse10689209309257803697/iceberg_data/default/table/data/deletes.parquet does not start with /var/folders/2k/21gv5vmx6z7dlr1g0_jtm8r80000ks/T/iceberg_warehouse10689209309257803697/iceberg_data/default/table/
java.lang.IllegalArgumentException: Path file:/var/folders/2k/21gv5vmx6z7dlr1g0_jtm8r80000ks/T/iceberg_warehouse10689209309257803697/iceberg_data/default/table/data/deletes.parquet does not start with /var/folders/2k/21gv5vmx6z7dlr1g0_jtm8r80000ks/T/iceberg_warehouse10689209309257803697/iceberg_data/default/table/
	at org.apache.iceberg.RewriteTablePathUtil.relativize(RewriteTablePathUtil.java:686)
	at org.apache.iceberg.RewriteTablePathUtil.newPath(RewriteTablePathUtil.java:663)
	at org.apache.iceberg.RewriteTablePathUtil.writeDeleteFileEntry(RewriteTablePathUtil.java:492)
	at org.apache.iceberg.RewriteTablePathUtil.lambda$rewriteDeleteManifest$5(RewriteTablePathUtil.java:438)
	at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:197)
	at java.base/java.util.Iterator.forEachRemaining(Iterator.java:133)
	at java.base/java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1845)
	at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:509)

The purpose of RewriteTablePath is to move or copy an Iceberg table to a new location, updating the paths in the metadata files so the table can be fully or incrementally copied to another storage location. To enhance the functionality of RewriteTablePath, the original author added a check that requires the original table's location and the path in the metadata files to be consistent. The problem we're encountering is that the REST Catalog's library path is not in URI format, triggering this check. In the unit test, we removed the file: prefix from the other variable paths to make REST Catalog pass the check, but this results in the new paths being inconsistent with the paths of other catalogs, causing the check to fail again.

The issue is that this is a Junit5 parameterized test, and it runs four times. If we modify the path to make REST Catalog pass, it will work for REST Catalog, but other catalogs (such as Hive Catalog, Hadoop Catalog, and Spark Session Catalog) will encounter similar errors. Therefore, to effectively resolve this issue, we need to align the behavior of different catalogs.

@slfan1989
Copy link
Contributor Author

In #13837, the logic of the unit tests has been adjusted and is now passing. As for the issue described in #13882, it doesn't really qualify as an actual problem from certain perspectives. However, to maintain consistency, it is recommended that all Catalogs follow a consistent approach when initializing the Hive database path, such as using URI format for the paths. I will close this PR for now to avoid consuming more resources. Thank you all for your attention!

@slfan1989 slfan1989 closed this Sep 4, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants