-
Notifications
You must be signed in to change notification settings - Fork 4.5k
Implement a hierarchy for Beam SQL Calcite Schemas #35787
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Summary of Changes
Hello @ahmedabu98, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request significantly refactors the Beam SQL Calcite Schema hierarchy to enable true cross-database and cross-catalog operations. It introduces new schema types (CatalogManagerSchema and CatalogSchema) that build upon the existing BeamCalciteSchema, allowing users to perform complex SQL queries with fully qualified names and interact more seamlessly with external metastores. This change enhances usability by reducing boilerplate for managing external tables and databases.
Highlights
- Enhanced Schema Hierarchy: Introduced
CatalogManagerSchema(root) andCatalogSchema(catalogs) on top ofBeamCalciteSchema(databases) to support a multi-level SQL namespace, enablingUSE CATALOG,USE DATABASE, and fully qualified table names (catalog.database.table). - Cross-Catalog/Database Operations: Enabled performing SQL queries and DDL operations (e.g.,
INSERT INTO,SELECT FROM,DROP TABLE/DATABASE) across different catalogs and databases, significantly improving flexibility for users. - Improved External Table Usability: Simplified interaction with external tables (e.g., Iceberg) by allowing direct operations without prior manual registration, reducing boilerplate code for managing existing tables and databases.
- Calcite Parser and Internal Refactoring: Updated the Calcite parser to support compound identifiers for DDL statements and refactored internal schema implementations (
BeamCalciteSchema,Catalog,CatalogManager,MetaStore) to align with the new hierarchical structure. - Iceberg Integration Updates: Replaced
IcebergTableProviderwith a newIcebergMetastoreto provide more granular, database-aware interaction with Iceberg catalogs, including support for loading existing tables.
Using Gemini Code Assist
The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.
Invoking Gemini
You can request assistance from Gemini at any point in your pull request via creating an issue comment (i.e. comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.
| Feature | Command | Description |
|---|---|---|
| Code Review | /gemini review |
Performs a code review for the current pull request in its current state. |
| Pull Request Summary | /gemini summary |
Provides a summary of the current pull request in its current state. |
| Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in issue comments and review comments. |
| Help | /gemini help |
Displays a list of available commands. |
Customization
To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.
Limitations & Feedback
Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments or fill out our survey to provide feedback.
You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
Footnotes
-
Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩
|
Checks are failing. Will not request review until checks are succeeding. If you'd like to override that behavior, comment |
…les; fix setOption gap; maybe register table providers from top-level CatalogManager cache
|
Assigning reviewers: R: @kennknowles for label java. Note: If you would like to opt out of this review, comment Available commands:
The PR bot will only process comments in the main thread (not review comments). |
|
/gemini review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces a hierarchical schema structure for Beam SQL, enabling cross-catalog and cross-database queries. It refactors the Calcite Schema hierarchy by introducing CatalogManagerSchema and CatalogSchema, which is a significant improvement over the previous flat structure. The changes are extensive, touching many parts of the SQL extension, but they are well-organized and consistent. The introduction of IcebergMetastore and the removal of IcebergTableProvider simplify the Iceberg integration. The addition of new integration tests for cross-catalog operations is also a great addition. My feedback includes one minor suggestion to remove a debug statement.
| catalogManager.registerTableProvider(testTableProvider); | ||
| cli.execute( | ||
| "CREATE EXTERNAL TABLE catalog_1.db_1.person(id int, name varchar, age int) TYPE 'test'"); | ||
| System.out.println("xxx metastoreDb1 tables: " + metastoreDb1.getTables()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|
Reminder, please take a look at this pr: @kennknowles |
|
Assigning new set of reviewers because Pr has gone too long without review. If you would like to opt out of this review, comment R: @chamikaramj for label java. Available commands:
|
talatuyarer
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @ahmedabu98 This is awesome change. I dropped few comments. Tomorrow I will also go thru second time from PR.
I used this code with Iceberg Rest Catalog on Beam SQl. It is very promising.
...tensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/CatalogManagerSchema.java
Outdated
Show resolved
Hide resolved
...ensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/parser/SqlUseDatabase.java
Show resolved
Hide resolved
...tensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/CatalogManagerSchema.java
Outdated
Show resolved
Hide resolved
...src/main/java/org/apache/beam/sdk/extensions/sql/meta/provider/iceberg/IcebergMetastore.java
Show resolved
Hide resolved
...sions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/meta/store/InMemoryMetaStore.java
Outdated
Show resolved
Hide resolved
|
Can you also create another PR for user facing Catalog documentation like External Table on beam SQL ? https://beam.apache.org/documentation/dsls/sql/extensions/create-external-table/ |
|
Reminder, please take a look at this pr: @chamikaramj |
|
Assigning new set of reviewers because Pr has gone too long without review. If you would like to opt out of this review, comment R: @robertwb for label java. Available commands:
|
|
Reminder, please take a look at this pr: @robertwb |
18f4cf2 to
40ee916
Compare
|
Assigning new set of reviewers because Pr has gone too long without review. If you would like to opt out of this review, comment R: @Abacn for label java. Available commands:
|
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## master #35787 +/- ##
==========================================
Coverage 54.90% 54.91%
- Complexity 1618 1666 +48
==========================================
Files 1057 1058 +1
Lines 164311 164483 +172
Branches 1165 1190 +25
==========================================
+ Hits 90219 90325 +106
- Misses 71939 72003 +64
- Partials 2153 2155 +2
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
Failures are unrelated. Merging |
|
Please do get tests green, and disable/address flakes. It isn't enough to analyze failures to believe they are unrelated. |
Addresses #35637
Refactors the Calcite Schema hierarchy in Beam SQL, introducing two new types:
CatalogManagerSchemaandCatalogSchema. This change addresses the limitation of the previous flat schema and enables true cross-database and cross-catalog operations.This builds upon the existing
BeamCalciteSchemaand improves user experience by allowing for more complex and flexible SQL queries. This is especially useful for users working with external metastores and tables that need to be referenced across different catalogs or databases.Enhanced Schema Hierarchy
The following hierarchy is implemented (taken from https://s.apache.org/beam-catalogs)
Previously, all schemas were represented by the same
BeamCalciteSchema, which made for a flat schema structure. This PR introduces a new hierarchy that more accurately reflects standard SQL organization:CatalogManagerSchema: the root of the hierarchyCatalogSchema: child nodes representing CatalogsBeamCalciteSchema(existing): child nodes ofCatalogSchemathat represent DatabasesThis new structure unlocks the ability to use SQL commands like
USE CATALOG,USE DATABASE, and fully qualified table names, for example:catalog.database.table.Support for cross-catalog and cross-database queries
This is the core benefit of this PR. Users can now perform operations that span multiple catalogs and databases, such as:
Improved usability for external tables
Ease of use is significantly improved for external tables (like Iceberg). Previously, users had to manually register existing tables and databases with
CREATE EXTERNAL TABLEorCREATE DATABASE. Now that we have abstractions for external metastore entities, the following commands are possible:INSERT INTO <table> ...on an existing Table without prior registerationSELECT * FROM <table> ...on an existing Table without prior registerationDROP TABLE <table> ...on an existing Table without prior registerationUSE DATABASE <database>on an existing Database without prior registerationDROP DATABASE <database>on an existing Database without prior registerationThis is significant because it eliminates boilerplate code for users who manage external tables/databases using Beam SQL.
Backwards compatibility
To ensure a smooth transition, a "default" catalog with a "default" database is provided. For users who do not require cross-catalog or cross-database features, existing SQL commands will continue to function as before. For example, the command
CREATE EXTERNAL TABLE my_table ...will create a table with pathdefault.default.my_table.