
Conversation


@henrib henrib commented Aug 12, 2025

What changes were proposed in this pull request?

Fixes the catalog cache behavior (take 2) by checking that a cached table's metadata location matches the latest known metadata location for that table in the HMS database, so that stale entries are not returned.
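The staleness check described above can be sketched roughly as follows. This is an illustrative sketch only: `MetadataCache`, `CachedTable`, and the map standing in for the HMS database are invented names, not actual Hive or Iceberg classes.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the check: a cached entry is served only when its
// metadata location still matches the latest location known to HMS.
public class MetadataCache {
    public record CachedTable(String metadataLocation, String payload) {}

    private final Map<String, CachedTable> cache = new ConcurrentHashMap<>();
    // Stands in for a lookup against the HMS database.
    private final Map<String, String> hmsLocations;

    public MetadataCache(Map<String, String> hmsLocations) {
        this.hmsLocations = hmsLocations;
    }

    public void put(String table, String location, String payload) {
        cache.put(table, new CachedTable(location, payload));
    }

    // Returns the cached payload only when its metadata location matches the
    // latest location known to HMS; otherwise the stale entry is evicted and
    // null is returned so the caller reloads from the catalog.
    public String loadTable(String table) {
        CachedTable cached = cache.get(table);
        String latest = hmsLocations.get(table);
        if (cached != null && cached.metadataLocation().equals(latest)) {
            return cached.payload();
        }
        cache.remove(table);
        return null;
    }
}
```

A direct HMS write that moves the table to a new metadata location immediately makes the cached entry unservable, which is the HA/concurrency property the PR is after.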

Why are the changes needed?

To ensure the cache does not return stale entries when HMS runs in an HA configuration, or when direct HMS calls are made concurrently with REST Catalog usage.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Unit test

henrib and others added 18 commits June 30, 2025 17:36
- add an event listener to invalidate cached tables impervious to source of change (direct HMS or REST);
- added configuration option for event class handler;
- lengthened default cache TTL;
…ion is latest (Hive DB, get location) on loadTable() ensuring no-stale table is returned;
- Reduced redundant calls (avoid super call);
- Call invalidateTable for thorough eviction;
…apache/iceberg/rest/HMSCatalogFactory.java

Co-authored-by: Shohei Okumiya <okumin@apache.org>
- made factory & cache extensible;
- added specific test to verify cache behavior;

henrib commented Aug 12, 2025

Continued from reverted #5882. @deniskuzZ @okumin


@okumin okumin left a comment


@henrib
I am experimentally adding Table Metadata caching to apache/iceberg so that the community can reuse the implementation. Do you think it is reasonable, and could Hive REST cache metadata through HiveCatalog?
okumin/iceberg@8628795

It is inspired by apache/iceberg#4518

I expect we might get one of the following outcomes.

  • The Iceberg community rejects it because Table Metadata is not cacheable for reasons unknown to us -> in this case, we will likely drop HIVE-29035
  • The Iceberg community does not want to merge it because the RoI is not good -> in this case, we will implement it on our side
  • The Iceberg community accepts the patch -> can we implement HIVE-29035 on top of it?



Contributor

Suggested change (code suggestion not captured)

Contributor Author

The cache in HIVE-29035 is limited to serving loadTable() for REST and resides server-side; the Table objects it serves are marshaled to a client, so there is no 'external' instance sharing. It depends on HMS being the actual catalog implementation in order to acquire the latest known metadata location for a given table. This makes the PR pretty much tied to Hive; there is no need to involve Iceberg.
If/when the Iceberg community accepts the TableMetadata caching you propose, we can then assess what it means for this cache implementation.

Contributor

@okumin okumin Aug 14, 2025

Without a REST catalog, a client retrieves table metadata through XYZCatalog -> TableMetadataParser -> S3/HDFS/etc. With a REST catalog, the client goes through RESTCatalog, and the REST API (in our case, the servlet) serves metadata via XYZCatalog (in our case, HiveCatalog or HMSCachingCatalog) -> TableMetadataParser -> S3/HDFS/etc. So TableMetadataParser might be a better place to maintain the cache. It can support our use case, and we could remove HMSCachingCatalog, which uses CachingCatalog, a class introduced for client-side caching and utilized only in SparkCatalog and FlinkCatalog.
If my idea has some defects, I probably won't send the patch
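If I read the proposal right, what makes parser-level caching safe is that an Iceberg metadata file at a given location is immutable once written, so the metadata location itself is a valid cache key. A minimal sketch of such location-keyed caching, under that assumption; `ParserCache` and `readFn` are made-up names, and the real change would live in or around TableMetadataParser:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical sketch: metadata files are immutable once written, so the
// parsed result for a given location can be computed once and reused.
public class ParserCache {
    private final Map<String, String> byLocation = new ConcurrentHashMap<>();

    // readFn stands in for the actual parse (e.g. reading JSON from S3/HDFS);
    // computeIfAbsent guarantees a single load per distinct location.
    public String read(String location, Function<String, String> readFn) {
        return byLocation.computeIfAbsent(location, readFn);
    }
}
```

Because the key is the immutable location rather than the table name, no invalidation logic is needed: a table change produces a new metadata location, which is simply a new cache key.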

Member

> Without using a REST catalog, a client retrieves table metadata through XYZCatalog -> TableMetadataParser -> S3/HDFS/etc. With a REST catalog, a client does it through RESTCatalog, where REST API (in our case, the servlet) serves metadata via XYZCatalog (in our case, HiveCatalog or HMSCachingCatalog) -> TableMetadataParser -> S3/HDFS/etc. So, TableMetadataParser might be a better place to maintain. It can support our use case, and we can remove HMSCachingCatalog, which uses CachingCatalog introduced for client-side caching and utilized only in SparkCatalog and FlinkCatalog. If my idea has some defects, I probably won't send the patch

@okumin are you saying TableMetadataParser is the common place for both direct and REST catalog invocations?

Contributor

Yes, I am.

Using HiveCatalog directly, everything works on the client side as follows.
[diagram: client-side flow through HiveCatalog -> TableMetadataParser -> object storage]

Using REST, some steps are delegated to the server side. In either case, TableMetadataParser is what interacts with the object storage.
[diagram: REST flow with TableMetadataParser on the server side]

Member

@deniskuzZ deniskuzZ Aug 26, 2025

Some notes from the Iceberg dev list:

Caching Metadata on the Client Side: reloading table metadata for a particular snapshot could leverage the ETag mechanism to reduce the amount of network traffic.
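As a rough illustration of that ETag note, and not an Iceberg or Hive API (`Server`, `Response`, and the field names are invented): the client replays the ETag it cached in an If-None-Match header, and an unchanged resource comes back as 304 Not Modified with no body, so only changed metadata is re-transferred.

```java
// Hypothetical sketch of HTTP conditional GET with ETags, standard HTTP
// semantics only; none of these names come from Iceberg or Hive.
public class EtagDemo {
    public record Response(int status, String etag, String body) {}

    public static class Server {
        String etag;
        String body;

        public Server(String etag, String body) { this.etag = etag; this.body = body; }

        // ifNoneMatch carries the client's cached ETag (null on first fetch).
        public Response get(String ifNoneMatch) {
            if (etag.equals(ifNoneMatch)) {
                return new Response(304, etag, null); // Not Modified: no body sent
            }
            return new Response(200, etag, body); // full body with a fresh ETag
        }
    }
}
```

The saving is exactly the metadata body on every unchanged reload, which for large table metadata is the bulk of the traffic.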

Contributor

I asked the Iceberg community with the sample PR.
apache/iceberg#14137

Member

@okumin, do you know if apache/iceberg@72d5fd6 solves the same problem?

* @param identifier a table identifier
* @return the location of the table if it exists, null otherwise
*/
public String getTableMetadataLocation(TableIdentifier identifier) {
Member

please do not change anything in iceberg-catalog; submit an Iceberg PR instead

Contributor Author

Iceberg PR: apache/iceberg#13800

Member

thanks!

henrib and others added 15 commits September 15, 2025 19:30
- add an event listener to invalidate cached tables impervious to source of change (direct HMS or REST);
- added configuration option for event class handler;
- lengthened default cache TTL;
…ion is latest (Hive DB, get location) on loadTable() ensuring no-stale table is returned;
- Reduced redundant calls (avoid super call);
- Call invalidateTable for thorough eviction;
…apache/iceberg/rest/HMSCatalogFactory.java

Co-authored-by: Shohei Okumiya <okumin@apache.org>
- made factory & cache extensible;
- added specific test to verify cache behavior;

github-actions bot commented Sep 15, 2025

@check-spelling-bot Report

🔴 Please review

See the files view or the action log for details.

Unrecognized words (3)

bucketedtables
languagemanual
teradatabinaryserde

Previously acknowledged words that are now absent: aarry, bytecode, cwiki, HIVEFETCHOUTPUTSERDE, timestamplocal, yyyy

To accept these unrecognized words as correct (and remove the previously acknowledged, now-absent words), run the following commands

... in a clone of the git@github.com:henrib/hive.git repository
on the HIVE-29035 branch:

update_files() {
perl -e '
my @expect_files=qw('".github/actions/spelling/expect.txt"');
@ARGV=@expect_files;
my @stale=qw('"$patch_remove"');
my $re=join "|", @stale;
my $suffix=".".time();
my $previous="";
sub maybe_unlink { unlink($_[0]) if $_[0]; }
while (<>) {
if ($ARGV ne $old_argv) { maybe_unlink($previous); $previous="$ARGV$suffix"; rename($ARGV, $previous); open(ARGV_OUT, ">$ARGV"); select(ARGV_OUT); $old_argv = $ARGV; }
next if /^(?:$re)(?:(?:\r|\n)*$| .*)/; print;
}; maybe_unlink($previous);'
perl -e '
my $new_expect_file=".github/actions/spelling/expect.txt";
use File::Path qw(make_path);
use File::Basename qw(dirname);
make_path (dirname($new_expect_file));
open FILE, q{<}, $new_expect_file; chomp(my @words = <FILE>); close FILE;
my @add=qw('"$patch_add"');
my %items; @items{@words} = @words x (1); @items{@add} = @add x (1);
@words = sort {lc($a)."-".$a cmp lc($b)."-".$b} keys %items;
open FILE, q{>}, $new_expect_file; for my $word (@words) { print FILE "$word\n" if $word =~ /\w/; };
close FILE;
system("git", "add", $new_expect_file);
'
}

comment_json=$(mktemp)
curl -L -s -S \
-H "Content-Type: application/json" \
"https://api.github.com/repos/apache/hive/issues/comments/3293292263" > "$comment_json"
comment_body=$(mktemp)
jq -r ".body // empty" "$comment_json" > $comment_body
rm $comment_json

patch_remove=$(perl -ne 'next unless s{^</summary>(.*)</details>$}{$1}; print' < "$comment_body")

patch_add=$(perl -e '$/=undef; $_=<>; if (m{Unrecognized words[^<]*</summary>\n*```\n*([^<]*)```\n*</details>$}m) { print "$1" } elsif (m{Unrecognized words[^<]*\n\n((?:\w.*\n)+)\n}m) { print "$1" };' < "$comment_body")

update_files
rm $comment_body
git add -u
If the flagged items do not appear to be text

If items relate to a ...

  • well-formed pattern.

    If you can write a pattern that would match it,
    try adding it to the patterns.txt file.

    Patterns are Perl 5 Regular Expressions - you can test yours before committing to verify it will match your lines.

    Note that patterns can't match multiline strings.

  • binary file.

    Please add a file path to the excludes.txt file matching the containing file.

    File paths are Perl 5 Regular Expressions - you can test yours before committing to verify it will match your files.

    ^ refers to the file's path from the root of the repository, so ^README\.md$ would exclude README.md (on whichever branch you're using).

github-actions bot commented Sep 16, 2025

@check-spelling-bot Report

🔴 Please review

See the files view or the action log for details. The report is identical to the one above: the same three unrecognized words (bucketedtables, languagemanual, teradatabinaryserde), the same now-absent words, and the same remediation script, differing only in the comment id it fetches.

@github-actions

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
Feel free to reach out on the dev@hive.apache.org list if the patch is in need of reviews.

@github-actions github-actions bot added the stale label Nov 23, 2025
@github-actions github-actions bot closed this Nov 30, 2025