Use URI string for default Iceberg warehouse location #13882

slfan1989 · 2025-08-21T07:13:30Z

Summary

When no warehouse location is set, the code previously used getAbsolutePath() to configure the default Iceberg warehouse directory. This change switches to using toURI().toString(), ensuring the warehouse
location is represented as a standardized URI (e.g., file:/.../iceberg_data/).

slfan1989 · 2025-08-21T07:18:42Z

@nastra I discovered this issue while running the #13837 unit test: the RestCatalog used inconsistent warehouse path formats, which caused the path prefix validation to fail. The root cause was that the original implementation used getAbsolutePath(), while our unit tests consistently used toURI().toString(). This inconsistency not only led to test failures but also affected cross-platform compatibility.

This change unifies the implementation to use toURI().toString(), ensuring the warehouse path is always represented as a standardized URI. This resolves the unit test validation issue and provides consistent behavior across different operating systems.

nastra · 2025-08-21T08:21:45Z

which caused the path prefix validation to fail

can you elaborate which path prefix validation you mean here? I think it would be good to have a test here as well to show what was causing the issue that required this fix

while our unit tests consistently used toURI().toString()

I do actually see quite a lot of other places in tests that use getAbsolutePath(), such as https://github.com/apache/iceberg/blob/main/core/src/test/java/org/apache/iceberg/rest/TestRESTCatalog.java#L124.

I'm not saying that this is a good/bad change, I'm just trying to understand why this is all of a sudden needed now

slfan1989 · 2025-08-21T09:03:05Z

which caused the path prefix validation to fail

can you elaborate which path prefix validation you mean here? I think it would be good to have a test here as well to show what was causing the issue that required this fix

while our unit tests consistently used toURI().toString()

I do actually see quite a lot of other places in tests that use getAbsolutePath(), such as https://github.com/apache/iceberg/blob/main/core/src/test/java/org/apache/iceberg/rest/TestRESTCatalog.java#L124.

I'm not saying that this is a good/bad change, I'm just trying to understand why this is all of a sudden needed now

@nastra Many thanks for your comment!

In #13837, I added a new unit test to verify that the two newly introduced metrics, rewrittenDeleteFilesCount and rewrittenManifestFilesCount, are greater than 0.

This unit test was added to TestRewriteTablePathProcedure and is executed across multiple catalogs: HIVE, HADOOP, SPARK_SESSION, and REST.

The test passes normally under HIVE, HADOOP, and SPARK_SESSION, but it fails under the REST catalog with the following exception:

Path file:/var/folders/2k/21gv5vmx6z7dlr1g0_jtm8r80000ks/T/iceberg_warehouse10689209309257803697/iceberg_data/default/table/data/deletes.parquet does not start with /var/folders/2k/21gv5vmx6z7dlr1g0_jtm8r80000ks/T/iceberg_warehouse10689209309257803697/iceberg_data/default/table/
java.lang.IllegalArgumentException: Path file:/var/folders/2k/21gv5vmx6z7dlr1g0_jtm8r80000ks/T/iceberg_warehouse10689209309257803697/iceberg_data/default/table/data/deletes.parquet does not start with /var/folders/2k/21gv5vmx6z7dlr1g0_jtm8r80000ks/T/iceberg_warehouse10689209309257803697/iceberg_data/default/table/
	at org.apache.iceberg.RewriteTablePathUtil.relativize(RewriteTablePathUtil.java:686)
	at org.apache.iceberg.RewriteTablePathUtil.newPath(RewriteTablePathUtil.java:663)
	at org.apache.iceberg.RewriteTablePathUtil.writeDeleteFileEntry(RewriteTablePathUtil.java:492)
	at org.apache.iceberg.RewriteTablePathUtil.lambda$rewriteDeleteManifest$5(RewriteTablePathUtil.java:438)
	at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:197)
	at java.base/java.util.Iterator.forEachRemaining(Iterator.java:133)
	at java.base/java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1845)
	at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:509)

The root cause of this issue lies in the different ways the base Hive warehouse path is generated in the test environment:

warehouseLocation = new File(tmp, "iceberg_data").getAbsolutePath();

This approach produces an absolute path without the file: prefix, for example:

/var/folders/.../iceberg_data/default/table/

In contrast, most other unit tests use the following:

String location = targetTableDir.toFile().toURI().toString();

This method generates a path with the file: prefix, for example:

file:/var/folders/.../iceberg_data/default/table/

From my perspective, I also believe that using .toURI().toString() provides better compatibility.

slfan1989 · 2025-08-22T05:39:05Z

@nastra Would it make sense to modify this PR in this way? Or should we standardize the unit tests to use toURI().toString() ? What do you think?

nastra · 2025-08-22T12:55:24Z

@nastra Would it make sense to modify this PR in this way? Or should we standardize the unit tests to use toURI().toString() ? What do you think?

I don't have an answer yet as I need to do some research across the codebase whether the proposed change here is safe in all cases and I'll be on vacation starting next week). Maybe other reviewers get a chance to take a look at this before me

slfan1989 · 2025-08-23T00:02:20Z

I don't have an answer yet as I need to do some research across the codebase whether the proposed change here is safe in all cases and I'll be on vacation starting next week). Maybe other reviewers get a chance to take a look at this before me

Thank you for your reply, and enjoy your holiday! I will conduct tests locally and sync the progress to this PR.

slfan1989 · 2025-08-28T04:56:56Z

After reviewing the latest survey results, I recommend using .toURI().toString() for the replacement here.

Upon analyzing all implementations of CatalogTestBase, I found that it currently supports four Catalog types: HIVE, HADOOP, SPARK_SESSION, and REST. Among them, the default paths for Hive tables in HIVE, HADOOP, and SPARK_SESSION Catalogs are all in URI format, as detailed below:

file:/var/folders/2k/21gv5vmx6z7dlr1g0_jtm8r80000ks/T/hive12918907709677177324/table

When creating a table, the BaseMetastoreCatalog#create method is invoked, but different Catalog types use their own implementations.

iceberg/core/src/main/java/org/apache/iceberg/BaseMetastoreCatalog.java

Lines 189 to 200 in 28555ad

    
           public Table create() { 
        
             TableOperations ops = newTableOps(identifier); 
        
             if (ops.current() != null) { 
        
               throw new AlreadyExistsException("Table already exists: %s", identifier); 
        
             } 
        
             String baseLocation = location != null ? location : defaultWarehouseLocation(identifier); 
        
             tableProperties.putAll(tableOverrideProperties()); 
        
             TableMetadata metadata = 
        
                 TableMetadata.newTableMetadata(schema, spec, sortOrder, baseLocation, tableProperties); 
        
             try {

The key code segment is:

String baseLocation = location != null ? location : defaultWarehouseLocation(identifier);

HIVE Catalog

For the HIVE Catalog, the HiveCatalog is invoked during table creation. HiveCatalog retrieves the table's location via the HMS interface, which returns a path in URI format.

iceberg/hive-metastore/src/main/java/org/apache/iceberg/hive/HiveCatalog.java

Lines 698 to 718 in 28555ad

    
           protected String defaultWarehouseLocation(TableIdentifier tableIdentifier) { 
        
             // This is a little edgy since we basically duplicate the HMS location generation logic. 
        
             // Sadly I do not see a good way around this if we want to keep the order of events, like: 
        
             // - Create meta files 
        
             // - Create the metadata in HMS, and this way committing the changes 
        
             // Create a new location based on the namespace / database if it is set on database level 
        
             try { 
        
               Database databaseData = 
        
                   clients.run(client -> client.getDatabase(tableIdentifier.namespace().levels()[0])); 
        
               if (databaseData.getLocationUri() != null) { 
        
                 // If the database location is set use it as a base. 
        
                 return String.format("%s/%s", databaseData.getLocationUri(), tableIdentifier.name()); 
        
               } 
        
             } catch (NoSuchObjectException e) { 
        
               throw new NoSuchNamespaceException( 
        
                   e, "Namespace does not exist: %s", tableIdentifier.namespace().levels()[0]); 
        
             } catch (TException e) { 
        
               throw new RuntimeException( 
        
                   String.format("Metastore operation failed for %s", tableIdentifier), e);

The results of the local variables are as follows:

The path is displayed in the form of a URI.

HADOOP Catalog

When creating a table, HADOOP Catalog uses HadoopCatalog and appends a 'file:', which is essentially in the form of a URI.

TestBaseWithCatalog#configureValidationCatalog

iceberg/spark/v4.0/spark/src/test/java/org/apache/iceberg/spark/TestBaseWithCatalog.java

Lines 123 to 125 in 28555ad

    
           case ICEBERG_CATALOG_TYPE_HADOOP: 
        
             this.validationCatalog = 
        
                 new HadoopCatalog(spark.sessionState().newHadoopConf(), "file:" + warehouse);

The path is displayed in the form of a URI.

SPARK SESSION Catalog

When creating a table, SPARK_SESSION also uses HiveCatalog, so I won't repeat the related information.

The path is displayed in the form of a URI.

REST CataLog

REST Catalog uses JDBC Catalog, and the relevant code initializes the base path in RESTCatalogServer.

iceberg/open-api/src/testFixtures/java/org/apache/iceberg/rest/RESTCatalogServer.java

Lines 89 to 93 in 28555ad

    
           if (warehouseLocation == null) { 
        
             File tmp = java.nio.file.Files.createTempDirectory("iceberg_warehouse").toFile(); 
        
             tmp.deleteOnExit(); 
        
             warehouseLocation = new File(tmp, "iceberg_data").getAbsolutePath(); 
        
             catalogProperties.put(CatalogProperties.WAREHOUSE_LOCATION, warehouseLocation);

This is displayed in the form of an absolute path, not a URI.

So I still believe that the REST Catalog should maintain consistency with HiveCatalog, HadoopCatalog, and SparkSessionCatalog by using the URI format to represent the location of the database.

slfan1989 · 2025-08-28T05:00:37Z

@RussellSpitzer @pvary @huaxingao Could you please take a look at this PR? Thank you very much! I've already outlined the detailed analysis process.

cc: @nastra

RussellSpitzer · 2025-08-29T19:20:33Z

@slfan1989 I'm not sure I understand the issue. This PR isn't fixing any tests correct? In the #13837 this return is being changed so that the tests which use URI.toString() don't break? Couldn't we just change those tests? I'm not really opposed to changing this to standardize but it feels like the tests shouldn't be relying on URI output?

I'm more of a +0 here. If there was an obvious test this was fixing I'd be +1 but it doesn't seem like it has been a problem before?

slfan1989 · 2025-08-29T22:35:26Z

@slfan1989 I'm not sure I understand the issue. This PR isn't fixing any tests correct? In the #13837 this return is being changed so that the tests which use URI.toString() don't break? Couldn't we just change those tests? I'm not really opposed to changing this to standardize but it feels like the tests shouldn't be relying on URI output?

I'm more of a +0 here. If there was an obvious test this was fixing I'd be +1 but it doesn't seem like it has been a problem before?

@RussellSpitzer Thank you very much for your reply. I am indeed encountering an issue with a unit test, which appears in the newly submitted PR #13837, and this PR has not yet been merged. From my perspective, if we do not align the behavior of the REST Catalog with that of the other catalogs, we may not be able to resolve the issue effectively.

The problem arises in the TestRewriteTablePathProcedure#testRewriteTablePathWithoutFileList unit test. In the REST Catalog, due to the prefix check in RewriteTablePathUtil, the test fails with the following error message:

Path file:/var/folders/2k/21gv5vmx6z7dlr1g0_jtm8r80000ks/T/iceberg_warehouse10689209309257803697/iceberg_data/default/table/data/deletes.parquet does not start with /var/folders/2k/21gv5vmx6z7dlr1g0_jtm8r80000ks/T/iceberg_warehouse10689209309257803697/iceberg_data/default/table/
java.lang.IllegalArgumentException: Path file:/var/folders/2k/21gv5vmx6z7dlr1g0_jtm8r80000ks/T/iceberg_warehouse10689209309257803697/iceberg_data/default/table/data/deletes.parquet does not start with /var/folders/2k/21gv5vmx6z7dlr1g0_jtm8r80000ks/T/iceberg_warehouse10689209309257803697/iceberg_data/default/table/
	at org.apache.iceberg.RewriteTablePathUtil.relativize(RewriteTablePathUtil.java:686)
	at org.apache.iceberg.RewriteTablePathUtil.newPath(RewriteTablePathUtil.java:663)
	at org.apache.iceberg.RewriteTablePathUtil.writeDeleteFileEntry(RewriteTablePathUtil.java:492)
	at org.apache.iceberg.RewriteTablePathUtil.lambda$rewriteDeleteManifest$5(RewriteTablePathUtil.java:438)
	at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:197)
	at java.base/java.util.Iterator.forEachRemaining(Iterator.java:133)
	at java.base/java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1845)
	at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:509)

The purpose of RewriteTablePath is to move or copy an Iceberg table to a new location, updating the paths in the metadata files so the table can be fully or incrementally copied to another storage location. To enhance the functionality of RewriteTablePath, the original author added a check that requires the original table's location and the path in the metadata files to be consistent. The problem we're encountering is that the REST Catalog's library path is not in URI format, triggering this check. In the unit test, we removed the file: prefix from the other variable paths to make REST Catalog pass the check, but this results in the new paths being inconsistent with the paths of other catalogs, causing the check to fail again.

The issue is that this is a Junit5 parameterized test, and it runs four times. If we modify the path to make REST Catalog pass, it will work for REST Catalog, but other catalogs (such as Hive Catalog, Hadoop Catalog, and Spark Session Catalog) will encounter similar errors. Therefore, to effectively resolve this issue, we need to align the behavior of different catalogs.

slfan1989 · 2025-09-04T03:06:59Z

In #13837, the logic of the unit tests has been adjusted and is now passing. As for the issue described in #13882, it doesn't really qualify as an actual problem from certain perspectives. However, to maintain consistency, it is recommended that all Catalogs follow a consistent approach when initializing the Hive database path, such as using URI format for the paths. I will close this PR for now to avoid consuming more resources. Thank you all for your attention!

use URI string for default Iceberg warehouse location

ebd3401

github-actions bot added the OPENAPI label Aug 21, 2025

slfan1989 mentioned this pull request Aug 21, 2025

API, Spark 4.0: Add create_file_list option to RewriteTablePathProcedure. #13837

Merged

slfan1989 closed this Sep 4, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use URI string for default Iceberg warehouse location #13882

Use URI string for default Iceberg warehouse location #13882

Uh oh!

slfan1989 commented Aug 21, 2025 •

edited

Loading

Uh oh!

slfan1989 commented Aug 21, 2025

Uh oh!

nastra commented Aug 21, 2025 •

edited

Loading

Uh oh!

slfan1989 commented Aug 21, 2025

Uh oh!

slfan1989 commented Aug 22, 2025

Uh oh!

nastra commented Aug 22, 2025

Uh oh!

slfan1989 commented Aug 23, 2025

Uh oh!

slfan1989 commented Aug 28, 2025

Uh oh!

slfan1989 commented Aug 28, 2025 •

edited

Loading

Uh oh!

RussellSpitzer commented Aug 29, 2025

Uh oh!

slfan1989 commented Aug 29, 2025

Uh oh!

slfan1989 commented Sep 4, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Use URI string for default Iceberg warehouse location #13882

Use URI string for default Iceberg warehouse location #13882

Uh oh!

Conversation

slfan1989 commented Aug 21, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Uh oh!

slfan1989 commented Aug 21, 2025

Uh oh!

nastra commented Aug 21, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

slfan1989 commented Aug 21, 2025

Uh oh!

slfan1989 commented Aug 22, 2025

Uh oh!

nastra commented Aug 22, 2025

Uh oh!

slfan1989 commented Aug 23, 2025

Uh oh!

slfan1989 commented Aug 28, 2025

Uh oh!

slfan1989 commented Aug 28, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

RussellSpitzer commented Aug 29, 2025

Uh oh!

slfan1989 commented Aug 29, 2025

Uh oh!

slfan1989 commented Sep 4, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

slfan1989 commented Aug 21, 2025 •

edited

Loading

nastra commented Aug 21, 2025 •

edited

Loading

slfan1989 commented Aug 28, 2025 •

edited

Loading