Contributor

@ngsg ngsg commented Jul 11, 2025

What changes were proposed in this pull request?

This patch adds support for a configurable IMetaStoreClient implementation.

This patch has been submitted three times previously. Austin originally proposed HIVE-12679 and submitted the initial patch in 2016. @moomindani later resubmitted it in #1402, and then @okumin took it over in #4444.

Why are the changes needed?

In some scenarios, users may want to run Hive as a general-purpose execution engine on top of a third-party data catalog. AWS Glue Data Catalog is one example, and there is also an open ticket to implement an Iceberg REST catalog (HIVE-28658). As noted in the previous PR (#4444), more than 40 people are watching HIVE-12679, which I believe reflects strong demand for using Hive in this more flexible architecture. Therefore, I believe it is worth completing this patch to support external data catalogs in Hive.

Does this PR introduce any user-facing change?

No. Users can keep using the existing HMS client without changing their configuration files.
As for the documentation, I will update hive-site once this patch is accepted and merged.

How was this patch tested?

This patch includes two simple tests that load a MetaStoreClient via HiveMetaStoreClientFactory.


Note for those familiar with the earlier patch:

  • Regarding the design decision about MetaStoreClientFactory:
    This patch does not support a customizable MSC factory as the original patch did; instead, it supports a customizable MSC. In other words, this patch introduces metastore.client.class, not metastore.client.factory.class.
    I made this change not because I preferred this approach, but because I followed the majority opinion in the discussion on #4444. Personally, I think both approaches are valid and have their merits, and I do not have a strong preference between them. I would appreciate it if you could share your thoughts on this decision.

  • Regarding RetryingMetaStoreClient and other classes that instantiate MetaStoreClient (link):
    After resolving HIVE-27473, Hive now creates ThriftHiveMetaStoreClient only in Hive.java and HiveMetaStoreClient. Therefore, I believe it is sufficient to replace these two call sites with the new HiveMetaStoreClientFactory.
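
To make the "customizable MSC" idea above concrete, here is a minimal, self-contained sketch of loading a client class named in configuration. It is not the code in this patch: the interface, the two client classes, the factory, and the use of a plain Map as the configuration are all hypothetical stand-ins, and only the key name metastore.client.class is taken from the description.

```java
import java.util.Map;

// Hypothetical stand-in for Hive's IMetaStoreClient; the real interface is far larger.
interface MetaStoreClientSketch {
    String describe();
}

// Hypothetical default client, standing in for ThriftHiveMetaStoreClient.
class ThriftClientSketch implements MetaStoreClientSketch {
    public ThriftClientSketch(Map<String, String> conf) { }
    @Override public String describe() { return "thrift"; }
}

// Hypothetical third-party client that a user could plug in.
class ExternalCatalogClientSketch implements MetaStoreClientSketch {
    public ExternalCatalogClientSketch(Map<String, String> conf) { }
    @Override public String describe() { return "external"; }
}

// Hypothetical factory: reads a class name from the config and instantiates it.
final class ClientFactorySketch {
    static final String KEY = "metastore.client.class";
    static final String DEFAULT_CLASS = ThriftClientSketch.class.getName();

    static MetaStoreClientSketch newClient(Map<String, String> conf) {
        // Fall back to the default client when the key is not set.
        String className = conf.getOrDefault(KEY, DEFAULT_CLASS);
        try {
            // asSubclass rejects classes that do not implement the client interface.
            Class<? extends MetaStoreClientSketch> cls =
                Class.forName(className).asSubclass(MetaStoreClientSketch.class);
            // Assumption: every client exposes a public constructor taking the config.
            return cls.getConstructor(Map.class).newInstance(conf);
        } catch (ReflectiveOperationException | ClassCastException e) {
            throw new IllegalArgumentException("Cannot load metastore client: " + className, e);
        }
    }
}
```

With an empty config the sketch returns the default client; setting the key to another class name swaps the implementation without touching the call sites, which is the property the two replaced call sites rely on.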

@deniskuzZ
Member

Hi @ngsg, let's get #5924 in first. Please review it when time allows.
Thank you!

private static final Logger LOG = LoggerFactory.getLogger(HiveMetaStoreClientFactory.class);

public static IMetaStoreClient newClient(Configuration conf, boolean allowEmbedded) throws MetaException {
  String mscClassName = MetastoreConf.getVar(conf, MetastoreConf.ConfVars.METASTORE_CLIENT_CLASS);
Contributor

Can we use MetastoreConf.getClass?

Contributor Author

Sure, I'll modify it to use getClass once HIVE-20189 is merged.
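
As a rough illustration of why a getClass-style lookup is attractive compared with reading a raw string, the sketch below resolves a configured class name and checks it against an expected interface in one step. It is a standalone approximation modeled on the shape of Hadoop's Configuration.getClass(name, defaultValue, xface); the helper name getClassVar, the class TypedConfSketch, and the Map-based config are assumptions, not the MetastoreConf API.

```java
import java.util.Map;

// Hypothetical helper; the config is modeled as a plain Map for self-containment.
final class TypedConfSketch {
    // Returns the configured class if it implements xface, else the default.
    static <I> Class<? extends I> getClassVar(Map<String, String> conf, String key,
                                              Class<? extends I> defaultValue, Class<I> xface) {
        String name = conf.get(key);
        if (name == null || name.trim().isEmpty()) {
            return defaultValue;
        }
        try {
            // The interface check happens here, at lookup time, not at first use.
            return Class.forName(name.trim()).asSubclass(xface);
        } catch (ClassNotFoundException | ClassCastException e) {
            throw new IllegalArgumentException(
                key + " must name a class implementing " + xface.getName() + ": " + name, e);
        }
    }
}
```

The caller gets back a typed Class object rather than a string, so a misconfigured class name fails at the lookup with a message naming the offending key.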

METASTORE_CLIENT_CLASS("metastore.client.class",
    "hive.metastore.client.class",
    "org.apache.hadoop.hive.metastore.client.ThriftHiveMetaStoreClient",
    "The name of MetaStoreClient class that implements the IMetaStoreClient interface."),
Contributor

@moomindani Please feel free to give us a suggestion if you have any thoughts.

Regarding the design decision about MetaStoreClientFactory:
This patch does not support a customizable MSC factory as the original patch did; instead, it supports a customizable MSC. In other words, this patch introduces metastore.client.class, not metastore.client.factory.class.
I made this change not because I preferred this approach, but because I followed the majority opinion in the discussion on #4444 (review). Personally, I think both approaches are valid and have their merits, and I do not have a strong preference between them. I would appreciate it if you could share your thoughts on this decision.

Contributor

Thanks, LGTM, I have no concerns around this.

Member

@dengzhhu653 dengzhhu653 Jul 16, 2025

I have a concern here: if we specify the client via the configuration, we can query only one type of metadata at a time. For example, if we set the client to Glue for the tables stored in Glue and later need Hive tables in the same session, we have to reset the client to Hive. And, just thinking out loud, what if a single SQL statement references both Hive and Glue tables?

From my point of view, I would suggest choosing the client used to obtain the metadata by a pattern (on the database or table) or by routing on the catalog.

Contributor Author

From my point of view, I would suggest choosing the client used to obtain the metadata by a pattern (on the database or table) or by routing on the catalog.

I believe this is somewhat similar to the design proposed in HIVE-28879 (Federated Catalog).
From my perspective, I see HIVE-12679 as a milestone for both HIVE-28658 (Iceberg REST Catalog, #5628) and HIVE-28879. So, I'm fine with moving directly to HIVE-28879 or pursuing any other approach, as long as we aim to support third-party data catalogs. However, since this ticket predates my contributions and has many watchers, I want to ask for and respect others' opinions as well.

Member

@ngsg maybe METASTORE_CLIENT_IMPL, similar to RAW_STORE_IMPL?

Member

@dengzhhu653 dengzhhu653 Jul 24, 2025

How about a new property to configure those mappings? E.g., metastore.client.meta.class.mappings=catalog:client_classname,catalog.db(or db_pattern):client_classname,catalog.db.table:client_classname
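
One way to read this proposal is as a "most specific key wins" lookup. The sketch below parses a value in the proposed key:class format and resolves a client class name for a given catalog, database, and table. It is only an illustration under stated assumptions: the class ClientMappingSketch and its methods are hypothetical, pattern matching (the db_pattern variant) is left out in favor of exact names, and no such property exists in this patch.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical parser and resolver for the proposed mapping format.
final class ClientMappingSketch {
    // Parses "key:class,key:class"; keys are catalog, catalog.db, or catalog.db.table.
    static Map<String, String> parse(String value) {
        Map<String, String> mappings = new LinkedHashMap<>();
        if (value == null || value.trim().isEmpty()) {
            return mappings;
        }
        for (String entry : value.split(",")) {
            // Split on the first colon only, so the class name is kept intact.
            String[] kv = entry.trim().split(":", 2);
            if (kv.length != 2 || kv[0].trim().isEmpty() || kv[1].trim().isEmpty()) {
                throw new IllegalArgumentException("Malformed mapping entry: " + entry);
            }
            mappings.put(kv[0].trim(), kv[1].trim());
        }
        return mappings;
    }

    // Tries catalog.db.table, then catalog.db, then catalog, then the default.
    static String resolve(Map<String, String> mappings, String catalog, String db,
                          String table, String defaultClass) {
        String[] candidates = {
            catalog + "." + db + "." + table,
            catalog + "." + db,
            catalog
        };
        for (String candidate : candidates) {
            String cls = mappings.get(candidate);
            if (cls != null) {
                return cls;
            }
        }
        return defaultClass;
    }
}
```

A query that touches tables from two catalogs would call resolve once per table, which is what lets Hive and Glue tables coexist in one statement under this scheme.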

Member

I don't think we need multiple client implementations for the same catalog (e.g., Snowflake, Glue, HMS, REST).
Instead, I’d suggest using a simple comma-separated list. Later, that would be persisted to the backend database during catalog registration (HIVE-26227).

METASTORE_CLIENT_IMPL("metastore.client.impl", catalog:client_classname)

Member

Maybe we should address this in HIVE-28879, since we also need to provide connection details per catalog

Contributor Author

@ngsg ngsg Jul 25, 2025

@ngsg maybe METASTORE_CLIENT_IMPL, similar to RAW_STORE_IMPL?

I changed METASTORE_CLIENT_CLASS to METASTORE_CLIENT_IMPL and metastore.client.class to metastore.client.impl.


How about a new property to configure those mappings? E.g., metastore.client.meta.class.mappings=catalog:client_classname,catalog.db(or db_pattern):client_classname,catalog.db.table:client_classname

Regarding the mapping, if there were no ThriftHiveMetaStoreClient-related logic in HiveMetaStoreClient, I think I could implement a mapping from a catalog to the actual client. However, due to that logic, I don't currently have a clear design to support it. Let me think more about a better design and share my thoughts once I’ve summarized them.


Another thought I had is that if we eventually implement HIVE-28879, we may not need this configuration key to specify either a single class or a mapping, since the relevant information would be stored in the MetaStore backend database. Given that we likely have enough time before the next release, I'm unsure about the feasibility of introducing a temporary configuration key that won't appear in any release. What do you think about keeping this patch as a fallback option in case HIVE-28879 isn't ready in time for the next release, and proceeding directly with HIVE-28879 for now?

Member

Another thought I had is that if we eventually implement HIVE-28879, we may not need this configuration key to specify either a single class or a mapping, since the relevant information would be stored in the MetaStore backend database.

Totally agree. In addition, the proposed mapping might require connection details as well. I think we should go ahead and merge this PR, and then start thinking about a more generic solution like HIVE-28879.

@deniskuzZ
Member

HIVE-20189 is merged. @ngsg, could you please rebase?

Member

@deniskuzZ deniskuzZ left a comment

+1

@ayushtkn
Member

This seems good to go? If yes, @ngsg, mind hitting the merge button?

@ngsg
Contributor Author

ngsg commented Jul 31, 2025

@ayushtkn, thanks for the reminder. I'll merge it shortly.

@ngsg ngsg merged commit bcfa755 into apache:master Jul 31, 2025
4 checks passed
@ngsg
Contributor Author

ngsg commented Jul 31, 2025

Thanks all for the review!


I have read the guide and tried to follow the documented process for merging. Since this is my first time merging, I would appreciate it if someone could check whether I've made any mistakes I might have overlooked.

@ayushtkn
Member

ayushtkn commented Jul 31, 2025

I have read the guide and tried to follow the documented process for merging. Since this is my first time merging, I would appreciate it if someone could check whether I've made any mistakes I might have overlooked.

@ngsg It is all good
