Use URI string for default Iceberg warehouse location #13882
@nastra I discovered this issue while running the #13837 unit test: the RestCatalog used inconsistent warehouse path formats, which caused the path prefix validation to fail. The root cause was that the original implementation used getAbsolutePath(), while our unit tests consistently used toURI().toString(). This inconsistency not only led to test failures but also affected cross-platform compatibility. This change unifies the implementation to use toURI().toString(), ensuring the warehouse path is always represented as a standardized URI. This resolves the unit test validation issue and provides consistent behavior across different operating systems. |
Can you elaborate on which path prefix validation you mean here? I think it would be good to have a test here as well, showing what caused the issue that required this fix.
I do actually see quite a lot of other places in tests that use I'm not saying that this is a good/bad change, I'm just trying to understand why this is all of a sudden needed now |
@nastra Many thanks for your comment! In #13837, I added a new unit test to verify that the two newly introduced metrics, This unit test was added to The test passes normally under The root cause of this issue lies in the different ways the base Hive warehouse path is generated in the test environment: This approach produces an absolute path without the file: prefix, for example: In contrast, most other unit tests use the following: This method generates a path with the file: prefix, for example:
From my perspective, I also believe that using .toURI().toString() provides better compatibility. |
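To make the difference between the two forms concrete, here is a minimal sketch using only java.io.File. The class and method names (WarehousePathFormats, absolutePathForm, uriForm) and the iceberg_data directory name are hypothetical and chosen for illustration; this is not the code from RESTCatalogServer or from the tests.

```java
import java.io.File;

public class WarehousePathFormats {

  // Hypothetical helper: a plain absolute path with no URI scheme.
  static String absolutePathForm(File dir) {
    return dir.getAbsolutePath();
  }

  // Hypothetical helper: the same location rendered as a URI string,
  // which carries the file: scheme.
  static String uriForm(File dir) {
    return dir.toURI().toString();
  }

  public static void main(String[] args) {
    // Assumed location for illustration only.
    File warehouse = new File(System.getProperty("java.io.tmpdir"), "iceberg_data");

    System.out.println("getAbsolutePath():    " + absolutePathForm(warehouse));
    System.out.println("toURI().toString():   " + uriForm(warehouse));
  }
}
```

The exact strings depend on the operating system and the temp directory, so no output is shown here. The only property the sketch relies on is that the URI form starts with file: and the absolute-path form does not.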
|
@nastra Would it make sense to modify this PR in this way? Or should we standardize the unit tests to use |
I don't have an answer yet, as I need to do some research across the codebase to determine whether the proposed change here is safe in all cases (and I'll be on vacation starting next week). Maybe other reviewers will get a chance to take a look at this before me. |
Thank you for your reply, and enjoy your holiday! I will conduct tests locally and sync the progress to this PR. |
|
After reviewing the latest survey results, I recommend using Upon analyzing all implementations of When creating a table, the BaseMetastoreCatalog#create method is invoked, but different Catalog types use their own implementations. iceberg/core/src/main/java/org/apache/iceberg/BaseMetastoreCatalog.java Lines 189 to 200 in 28555ad
The key code segment is:
For the HIVE Catalog, the iceberg/hive-metastore/src/main/java/org/apache/iceberg/hive/HiveCatalog.java Lines 698 to 718 in 28555ad
The results of the local variables are as follows:
The path is displayed in the form of a URI.
When creating a table, HADOOP Catalog uses HadoopCatalog and appends a 'file:', which is essentially in the form of a URI. TestBaseWithCatalog#configureValidationCatalog iceberg/spark/v4.0/spark/src/test/java/org/apache/iceberg/spark/TestBaseWithCatalog.java Lines 123 to 125 in 28555ad
The path is displayed in the form of a URI.
When creating a table, SPARK_SESSION also uses HiveCatalog, so I won't repeat the related information. The path is displayed in the form of a URI.
REST Catalog uses JDBC Catalog, and the relevant code initializes the base path in RESTCatalogServer. iceberg/open-api/src/testFixtures/java/org/apache/iceberg/rest/RESTCatalogServer.java Lines 89 to 93 in 28555ad
This is displayed in the form of an absolute path, not a URI. So I still believe that the |
|
@RussellSpitzer @pvary @huaxingao Could you please take a look at this PR? Thank you very much! I've already outlined the detailed analysis process. cc: @nastra |
|
@slfan1989 I'm not sure I understand the issue. This PR isn't fixing any tests, correct? In #13837 this return is being changed so that the tests which use URI.toString() don't break? Couldn't we just change those tests? I'm not really opposed to changing this to standardize, but it feels like the tests shouldn't be relying on URI output. I'm more of a +0 here. If there were an obvious test this was fixing I'd be +1, but it doesn't seem like it has been a problem before. |
@RussellSpitzer Thank you very much for your reply. I am indeed encountering an issue with a unit test, which appears in the newly submitted PR #13837, and this PR has not yet been merged. From my perspective, if we do not align the behavior of the REST Catalog with that of the other catalogs, we may not be able to resolve the issue effectively. The problem arises in the TestRewriteTablePathProcedure#testRewriteTablePathWithoutFileList unit test. In the REST Catalog, due to the prefix check in RewriteTablePathUtil, the test fails with the following error message: The purpose of The issue is that this is a JUnit 5 parameterized test, and it runs four times. If we modify the path to make |
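For readers following along, the failure mode can be sketched as a plain string prefix comparison. This is an assumption about the general shape of the check, not the actual RewriteTablePathUtil code, which is not quoted in this thread. The class name PrefixCheckSketch, the method hasPrefix, and the warehouse/db/tbl layout are all hypothetical.

```java
import java.io.File;

public class PrefixCheckSketch {

  // Hypothetical stand-in for a source-prefix validation: the path is accepted
  // only if it literally starts with the expected prefix string.
  static boolean hasPrefix(String path, String prefix) {
    return path.startsWith(prefix);
  }

  public static void main(String[] args) {
    // Assumed layout for illustration only.
    File warehouse = new File("warehouse");
    File table = new File(warehouse, "db/tbl");

    // Table location rendered as a URI string (file: scheme).
    String tableLocation = table.toURI().toString();

    // Prefix built from an absolute path (no scheme): the formats differ.
    String absolutePrefix = warehouse.getAbsolutePath();
    // Prefix built the same way as the table location: the formats agree.
    String uriPrefix = warehouse.toURI().toString();

    System.out.println("mixed formats match:      " + hasPrefix(tableLocation, absolutePrefix));
    System.out.println("consistent formats match: " + hasPrefix(tableLocation, uriPrefix));
  }
}
```

Under these assumptions the mixed comparison fails only because one string carries the file: scheme and the other does not, even though both refer to the same directory. Using one format on both sides, whichever it is, avoids the mismatch.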
|
In #13837, the logic of the unit tests has been adjusted and is now passing. As for the issue described in #13882, it doesn't really qualify as an actual problem from certain perspectives. However, to maintain consistency, it is recommended that all Catalogs follow a consistent approach when initializing the Hive database path, such as using URI format for the paths. I will close this PR for now to avoid consuming more resources. Thank you all for your attention! |




Summary
When no warehouse location is set, the code previously used getAbsolutePath() to configure the default Iceberg warehouse directory. This change switches to using toURI().toString(), ensuring the warehouse location is represented as a standardized URI (e.g., file:/.../iceberg_data/).