@ahmedabu98
Contributor

Addresses #35637

Refactors the Calcite Schema hierarchy in Beam SQL, introducing two new types: CatalogManagerSchema and CatalogSchema. This change addresses the limitation of the previous flat schema and enables true cross-database and cross-catalog operations.

This builds upon the existing BeamCalciteSchema and improves user experience by allowing for more complex and flexible SQL queries. This is especially useful for users working with external metastores and tables that need to be referenced across different catalogs or databases.

Enhanced Schema Hierarchy

The following hierarchy is implemented (taken from https://s.apache.org/beam-catalogs)

[Image: schema hierarchy diagram]

Previously, all schemas were represented by the same BeamCalciteSchema, which made for a flat schema structure. This PR introduces a new hierarchy that more accurately reflects standard SQL organization:

  • CatalogManagerSchema: the root of the hierarchy
  • CatalogSchema: child nodes representing Catalogs
  • BeamCalciteSchema (existing): child nodes of CatalogSchema that represent Databases

This new structure unlocks the ability to use SQL commands like USE CATALOG, USE DATABASE, and fully qualified table names, for example: catalog.database.table.
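
To make the three levels concrete, here is a minimal sketch of such a tree built from plain maps instead of the actual Calcite and Beam classes. The class `SchemaTreeSketch` and its methods are hypothetical and are not part of this PR; they only illustrate how a fully qualified `catalog.database.table` path walks from the root, through a catalog, to a database.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical stand-in for the hierarchy: root -> catalogs -> databases -> tables. */
public class SchemaTreeSketch {
  // catalog name -> (database name -> (table name -> table description))
  private final Map<String, Map<String, Map<String, String>>> catalogs = new LinkedHashMap<>();

  /** Adds a table, creating the catalog and database levels on demand. */
  public void addTable(String catalog, String database, String table, String description) {
    catalogs
        .computeIfAbsent(catalog, c -> new LinkedHashMap<>())
        .computeIfAbsent(database, d -> new LinkedHashMap<>())
        .put(table, description);
  }

  /** Walks root -> catalog -> database -> table for a path such as "c.d.t". */
  public Optional<String> lookup(String fullyQualifiedName) {
    String[] parts = fullyQualifiedName.split("\\.");
    if (parts.length != 3) {
      throw new IllegalArgumentException(
          "Expected catalog.database.table but got: " + fullyQualifiedName);
    }
    // Each map() step yields an empty Optional if that level is missing.
    return Optional.ofNullable(catalogs.get(parts[0]))
        .map(databases -> databases.get(parts[1]))
        .map(tables -> tables.get(parts[2]));
  }
}
```

In this sketch the outer map plays the role of the root, each second-level map plays the role of a catalog, and each innermost map plays the role of a database.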

Support for cross-catalog and cross-database queries

This is the core benefit of this PR. Users can now perform operations that span multiple catalogs and databases, such as:

INSERT INTO catalog_1.database_1.table_1 SELECT * FROM catalog_2.database_2.table_2;

-- or
USE CATALOG catalog_1;
INSERT INTO database_1.table_1 SELECT * FROM catalog_2.database_2.table_2;

-- or
USE CATALOG catalog_1;
USE DATABASE database_1;
INSERT INTO table_1 SELECT * FROM catalog_2.database_2.table_2;

-- or
USE DATABASE catalog_1.database_1;
INSERT INTO table_1 SELECT * FROM catalog_2.database_2.table_2;
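
The four variants above all target the same table because shorter names are completed from the session's current catalog and database. The following is a rough sketch of that completion logic, not the PR's implementation: `NameResolverSketch` and its methods are hypothetical names, and leaving the current database unchanged on `useCatalog` is a simplification that the PR description does not specify.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical session state that expands 1-, 2-, or 3-part table names. */
public class NameResolverSketch {
  private String currentCatalog;
  private String currentDatabase;

  public NameResolverSketch(String catalog, String database) {
    this.currentCatalog = catalog;
    this.currentDatabase = database;
  }

  /** Mirrors USE CATALOG; keeping the current database is an assumption of this sketch. */
  public void useCatalog(String catalog) {
    this.currentCatalog = catalog;
  }

  /** Mirrors USE DATABASE with either "database" or "catalog.database". */
  public void useDatabase(String identifier) {
    String[] parts = identifier.split("\\.");
    if (parts.length == 1) {
      this.currentDatabase = parts[0];
    } else if (parts.length == 2) {
      this.currentCatalog = parts[0];
      this.currentDatabase = parts[1];
    } else {
      throw new IllegalArgumentException("Expected [catalog.]database but got: " + identifier);
    }
  }

  /** Expands a table identifier to [catalog, database, table]. */
  public List<String> qualify(String identifier) {
    String[] parts = identifier.split("\\.");
    switch (parts.length) {
      case 1:
        return Arrays.asList(currentCatalog, currentDatabase, parts[0]);
      case 2:
        return Arrays.asList(currentCatalog, parts[0], parts[1]);
      case 3:
        return Arrays.asList(parts);
      default:
        throw new IllegalArgumentException("Too many name parts: " + identifier);
    }
  }
}
```

A three-part name such as `catalog_2.database_2.table_2` is returned unchanged regardless of the session state, which is what lets a single statement read from one catalog and write to another.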

Improved usability for external tables

Ease of use is significantly improved for external tables (like Iceberg). Previously, users had to manually register existing tables and databases with CREATE EXTERNAL TABLE or CREATE DATABASE. Now that we have abstractions for external metastore entities, the following commands are possible:

  • INSERT INTO <table> ... on an existing Table without prior registration
  • SELECT * FROM <table> ... on an existing Table without prior registration
  • DROP TABLE <table> ... on an existing Table without prior registration
  • USE DATABASE <database> on an existing Database without prior registration
  • DROP DATABASE <database> on an existing Database without prior registration

This is significant because it eliminates boilerplate code for users who manage external tables/databases using Beam SQL.
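
One way to picture "without prior registration" is a lookup that falls back to the external metastore when a table has not been registered in the session. The sketch below illustrates that idea only; `LazyMetastoreSketch` and `ExternalStore` are hypothetical names, and caching the loaded table is an assumption of the sketch rather than something the PR description states.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical metastore that loads existing external tables on first use. */
public class LazyMetastoreSketch {
  /** Hypothetical view of an external metastore, for example an Iceberg catalog. */
  public interface ExternalStore {
    Optional<String> loadTable(String name);
  }

  // Tables known to the session, either registered explicitly or loaded lazily.
  private final Map<String, String> known = new HashMap<>();
  private final ExternalStore external;

  public LazyMetastoreSketch(ExternalStore external) {
    this.external = external;
  }

  /** Explicit registration, comparable to CREATE EXTERNAL TABLE. */
  public void register(String name, String description) {
    known.put(name, description);
  }

  /** Returns a known table, or asks the external store and remembers the answer. */
  public Optional<String> getTable(String name) {
    String cached = known.get(name);
    if (cached != null) {
      return Optional.of(cached);
    }
    Optional<String> loaded = external.loadTable(name);
    loaded.ifPresent(description -> known.put(name, description)); // caching is an assumption
    return loaded;
  }
}
```

With a fallback like this, a query against a table that already exists in the external store resolves on first use, and explicit registration remains available for tables that do not exist there yet.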

Backwards compatibility

To ensure a smooth transition, a "default" catalog with a "default" database is provided. For users who do not require cross-catalog or cross-database features, existing SQL commands will continue to function as before. For example, the command CREATE EXTERNAL TABLE my_table ... will create a table with path default.default.my_table.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Summary of Changes

Hello @ahmedabu98, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly refactors the Beam SQL Calcite Schema hierarchy to enable true cross-database and cross-catalog operations. It introduces new schema types (CatalogManagerSchema and CatalogSchema) that build upon the existing BeamCalciteSchema, allowing users to perform complex SQL queries with fully qualified names and interact more seamlessly with external metastores. This change enhances usability by reducing boilerplate for managing external tables and databases.

Highlights

  • Enhanced Schema Hierarchy: Introduced CatalogManagerSchema (root) and CatalogSchema (catalogs) on top of BeamCalciteSchema (databases) to support a multi-level SQL namespace, enabling USE CATALOG, USE DATABASE, and fully qualified table names (catalog.database.table).
  • Cross-Catalog/Database Operations: Enabled performing SQL queries and DDL operations (e.g., INSERT INTO, SELECT FROM, DROP TABLE/DATABASE) across different catalogs and databases, significantly improving flexibility for users.
  • Improved External Table Usability: Simplified interaction with external tables (e.g., Iceberg) by allowing direct operations without prior manual registration, reducing boilerplate code for managing existing tables and databases.
  • Calcite Parser and Internal Refactoring: Updated the Calcite parser to support compound identifiers for DDL statements and refactored internal schema implementations (BeamCalciteSchema, Catalog, CatalogManager, MetaStore) to align with the new hierarchical structure.
  • Iceberg Integration Updates: Replaced IcebergTableProvider with a new IcebergMetastore to provide more granular, database-aware interaction with Iceberg catalogs, including support for loading existing tables.

@github-actions
Contributor

github-actions bot commented Aug 5, 2025

Checks are failing. Will not request review until checks are succeeding. If you'd like to override that behavior, comment "assign set of reviewers".

@ahmedabu98
Contributor Author

cc @kennknowles, @talatuyarer

…les; fix setOption gap; maybe register table providers from top-level CatalogManager cache
@liferoad liferoad requested a review from kennknowles August 13, 2025 20:08
@github-actions
Contributor

Assigning reviewers:

R: @kennknowles for label java.

Note: If you would like to opt out of this review, comment "assign to next reviewer".

Available commands:

  • stop reviewer notifications - opt out of the automated review tooling
  • remind me after tests pass - tag the comment author after tests pass
  • waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

The PR bot will only process comments in the main thread (not review comments).

@ahmedabu98
Contributor Author

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a hierarchical schema structure for Beam SQL, enabling cross-catalog and cross-database queries. It refactors the Calcite Schema hierarchy by introducing CatalogManagerSchema and CatalogSchema, which is a significant improvement over the previous flat structure. The changes are extensive, touching many parts of the SQL extension, but they are well-organized and consistent. The introduction of IcebergMetastore and the removal of IcebergTableProvider simplify the Iceberg integration. The addition of new integration tests for cross-catalog operations is also a great addition. My feedback includes one minor suggestion to remove a debug statement.

catalogManager.registerTableProvider(testTableProvider);
cli.execute(
    "CREATE EXTERNAL TABLE catalog_1.db_1.person(id int, name varchar, age int) TYPE 'test'");
System.out.println("xxx metastoreDb1 tables: " + metastoreDb1.getTables());

medium

This System.out.println seems to be a leftover from debugging. It should be removed before merging.

@github-actions
Contributor

Reminder, please take a look at this pr: @kennknowles

@github-actions
Contributor

github-actions bot commented Sep 1, 2025

Assigning new set of reviewers because the PR has gone too long without review. If you would like to opt out of this review, comment "assign to next reviewer":

R: @chamikaramj for label java.

Available commands:

  • stop reviewer notifications - opt out of the automated review tooling
  • remind me after tests pass - tag the comment author after tests pass
  • waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

Contributor

@talatuyarer talatuyarer left a comment

Thank you @ahmedabu98, this is an awesome change. I left a few comments. Tomorrow I will go through the PR a second time.

I tried this code with the Iceberg REST Catalog on Beam SQL. It is very promising.

@talatuyarer
Contributor

talatuyarer commented Sep 3, 2025

Can you also create another PR for user-facing Catalog documentation, similar to the External Table page for Beam SQL?

https://beam.apache.org/documentation/dsls/sql/extensions/create-external-table/

@github-actions
Contributor

Reminder, please take a look at this pr: @chamikaramj

@github-actions
Contributor

Assigning new set of reviewers because the PR has gone too long without review. If you would like to opt out of this review, comment "assign to next reviewer":

R: @robertwb for label java.

Available commands:

  • stop reviewer notifications - opt out of the automated review tooling
  • remind me after tests pass - tag the comment author after tests pass
  • waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

@github-actions
Contributor

Reminder, please take a look at this pr: @robertwb

@github-actions
Contributor

Assigning new set of reviewers because the PR has gone too long without review. If you would like to opt out of this review, comment "assign to next reviewer":

R: @Abacn for label java.

Available commands:

  • stop reviewer notifications - opt out of the automated review tooling
  • remind me after tests pass - tag the comment author after tests pass
  • waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

@codecov

codecov bot commented Sep 29, 2025

Codecov Report

❌ Patch coverage is 69.07216% with 120 lines in your changes missing coverage. Please review.
✅ Project coverage is 54.91%. Comparing base (5565f38) to head (2f9d75d).
⚠️ Report is 141 commits behind head on master.

Files with missing lines | Patch % | Lines
.../sdk/extensions/sql/impl/CatalogManagerSchema.java | 71.02% | 26 Missing and 5 partials ⚠️
...he/beam/sdk/extensions/sql/impl/CatalogSchema.java | 67.46% | 18 Missing and 9 partials ⚠️
...m/sdk/extensions/sql/impl/parser/SqlDropTable.java | 36.84% | 10 Missing and 2 partials ⚠️
...k/extensions/sql/impl/parser/SqlSetOptionBeam.java | 35.29% | 8 Missing and 3 partials ⚠️
...k/extensions/sql/meta/store/InMemoryMetaStore.java | 65.00% | 5 Missing and 2 partials ⚠️
...k/extensions/sql/meta/catalog/InMemoryCatalog.java | 72.72% | 6 Missing ⚠️
...k/extensions/sql/impl/parser/SqlCreateCatalog.java | 40.00% | 2 Missing and 1 partial ⚠️
...nsions/sql/impl/parser/SqlCreateExternalTable.java | 80.00% | 2 Missing and 1 partial ⚠️
...sdk/extensions/sql/impl/parser/SqlDropCatalog.java | 25.00% | 2 Missing and 1 partial ⚠️
...sdk/extensions/sql/impl/parser/SqlUseDatabase.java | 80.00% | 1 Missing and 2 partials ⚠️
... and 9 more
Additional details and impacted files
@@            Coverage Diff             @@
##             master   #35787    +/-   ##
==========================================
  Coverage     54.90%   54.91%            
- Complexity     1618     1666    +48     
==========================================
  Files          1057     1058     +1     
  Lines        164311   164483   +172     
  Branches       1165     1190    +25     
==========================================
+ Hits          90219    90325   +106     
- Misses        71939    72003    +64     
- Partials       2153     2155     +2     
Flag Coverage Δ
java 68.23% <69.07%> (-0.13%) ⬇️

@ahmedabu98
Contributor Author

Failures are unrelated. Merging

@ahmedabu98 ahmedabu98 merged commit 2f9a910 into apache:master Oct 1, 2025
28 of 34 checks passed
@kennknowles
Member

Please do get tests green, and disable/address flakes. It isn't enough to analyze failures to believe they are unrelated.
