-
Notifications
You must be signed in to change notification settings - Fork 4.8k
HIVE-29035: Fixing cache handling for REST catalog #6022
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
- add an event listener to invalidate cached tables impervious to source of change (direct HMS or REST); - added configuration option for event class handler; - lengthened default cache TTL;
…ion is latest (Hive DB, get location) on loadTable() ensuring no-stale table is returned;
- Reduced redundant calls (avoid super call); - Call invalidateTable for thorough eviction;
…apache/iceberg/rest/HMSCatalogFactory.java Co-authored-by: Shohei Okumiya <okumin@apache.org>
- made factory & cache extensible; - added specific test to verify cache behavior;
|
Continued from reverted #5882 . @deniskuzZ @okumin |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@henrib
I am experimentally adding caching Table Metadata on apache/iceberg so that the community can reuse the implementation. Do you think if it's reasonable and if Hive REST can cache them through HiveCatalog?
okumin/iceberg@8628795
It is inspired by apache/iceberg#4518
I expect we might potentially get the following feedback.
- Iceberg community rejects it because Table Metadata is not cacheable for our unknown reasons -> In this case, we will likely drop HIVE-29035
- Iceberg community does not want to merge it because RoI is not good -> In this case, we will implement it on our side
- Iceberg community accepts the patch -> Can we implement HIVE-29035 on top of the patch?
| } | ||
| } | ||
|
|
||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The cache in HIVE-29035 is limited to serving loadTable() for REST and resides server-side; the Table objects it serves are marshaled by to a client so there is no 'external' instance sharing. It is dependent upon HMS being the actual catalog implementation to acquire the latest known metadata location for a given table. This makes this PR pretty much tied to Hive; there is no need to involve Iceberg.
If/when the Iceberg community accepts the TableMetadata caching you propose, we can then assess what it means for this cache implementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Without using a REST catalog, a client retrieves table metadata through XYZCatalog -> TableMetadataParser -> S3/HDFS/etc. With a REST catalog, a client does it through RESTCatalog, where REST API(in our case, the servlet) serves metadata via XYZCatalog(in our case, HiveCatalog or HMSCachingCatalog) -> TableMetadataParser -> S3/HDFS/etc. So, TableMetadataParser might be a better place to maintain. It can support our use case, and we can remove HMSCachingCatalog, which uses CachingCatalog introduced for client-side caching and utilized only in SparkCatalog and FlinkCatalog.
If my idea has some defects, I probably won't send the patch
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Without using a REST catalog, a client retrieves table metadata through XYZCatalog -> TableMetadataParser -> S3/HDFS/etc. With a REST catalog, a client does it through RESTCatalog, where REST API(in our case, the servlet) serves metadata via XYZCatalog(in our case, HiveCatalog or HMSCachingCatalog) -> TableMetadataParser -> S3/HDFS/etc. So, TableMetadataParser might be a better place to maintain. It can support our use case, and we can remove HMSCachingCatalog, which uses CachingCatalog introduced for client-side caching and utilized only in SparkCatalog and FlinkCatalog. If my idea has some defects, I probably won't send the patch
@okumin are you saying TableMetadataParser is the common place for both direct and REST catalog invocations?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
some notes from iceberg dev lists:
Caching Metadata on the Client Side: Reloading table metadata for a
particular snapshot could leverage the ETag mechanism to reduce the amount
of network traffic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I asked the Iceberg community with the sample PR.
apache/iceberg#14137
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@okumin, do you know if apache/iceberg@72d5fd6 solves the same problem?
| * @param identifier a table identifier | ||
| * @return the location of the table if it exists, null otherwise | ||
| */ | ||
| public String getTableMetadataLocation(TableIdentifier identifier) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please do not change anything in iceberg-catalog or submit an iceberg PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Iceberg PR: apache/iceberg#13800
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks!
- add an event listener to invalidate cached tables impervious to source of change (direct HMS or REST); - added configuration option for event class handler; - lengthened default cache TTL;
…ion is latest (Hive DB, get location) on loadTable() ensuring no-stale table is returned;
- Reduced redundant calls (avoid super call); - Call invalidateTable for thorough eviction;
…apache/iceberg/rest/HMSCatalogFactory.java Co-authored-by: Shohei Okumiya <okumin@apache.org>
- made factory & cache extensible; - added specific test to verify cache behavior;
@check-spelling-bot Report🔴 Please reviewSee the files view or the action log for details. Unrecognized words (3)bucketedtables Previously acknowledged words that are now absentaarry bytecode cwiki HIVEFETCHOUTPUTSERDE timestamplocal yyyyTo accept these unrecognized words as correct (and remove the previously acknowledged and now absent words), run the following commands... in a clone of the git@github.com:henrib/hive.git repository If the flagged items do not appear to be textIf items relate to a ...
|
@check-spelling-bot Report🔴 Please reviewSee the files view or the action log for details. Unrecognized words (3)bucketedtables Previously acknowledged words that are now absentaarry bytecode cwiki HIVEFETCHOUTPUTSERDE timestamplocal yyyyTo accept these unrecognized words as correct (and remove the previously acknowledged and now absent words), run the following commands... in a clone of the git@github.com:henrib/hive.git repository If the flagged items do not appear to be textIf items relate to a ...
|
|
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. |


What changes were proposed in this pull request?
Fixing the catalog cache behavior (take 2) by checking that a cached table metadata location is the same as the latest known metadata location for that table in HMS database to avoid returning stale entries.
Why are the changes needed?
To ensure the cache does not return stale entries using HA or when direct HMS calls are concurrent with REST Catalog usage.
Does this PR introduce any user-facing change?
No
How was this patch tested?
Unit test