Skip to content

INSERT/REPLACE can omit clustering when catalog has default#16260

Merged
zachjsh merged 26 commits intoapache:masterfrom
zachjsh:use-catalog-clustering-columns
Apr 26, 2024
Merged

INSERT/REPLACE can omit clustering when catalog has default#16260
zachjsh merged 26 commits intoapache:masterfrom
zachjsh:use-catalog-clustering-columns

Conversation

@zachjsh
Copy link
Copy Markdown
Contributor

@zachjsh zachjsh commented Apr 10, 2024

Description

This PR contains a portion of the changes from the inactive draft PR for integrating the catalog with the Calcite planner #13686 from @paul-rogers, allowing for tables that are defined in the catalog to have any defined clustering columns used in DML INSERT/REPLACE operations without needing to be specified at query time. If the user specified a clustering columns at query time, these columns are preferred to the catalog defined clustering columns.

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

Comment thread sql/src/main/java/org/apache/druid/sql/calcite/planner/DruidSqlValidator.java Outdated
@zachjsh zachjsh requested a review from kgyrtkirk April 17, 2024 19:10
Copy link
Copy Markdown
Member

@kgyrtkirk kgyrtkirk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 ; just left some questions :)

Comment on lines +474 to +478
final SqlIdentifier colIdent = new SqlIdentifier(
Collections.singletonList(keyCol.expr()),
null, SqlParserPos.ZERO,
Collections.singletonList(SqlParserPos.ZERO)
);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was wondering what will happen in the following case:

  • say colunmn c is a clusterKey
  • we are selecting from a join which has column c on both sides

but it seems like the column in the select list will take precedence.

one more thing I was wondering about: do we have a check that all keyCols are present in the selected column list?

Copy link
Copy Markdown
Contributor Author

@zachjsh zachjsh Apr 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

About whether there is a check that al keyCols are present in the selected column list, see the following tests:

testInsertTableWithClusteringWithClusteringOnNewColumnFromQuery
testInsertTableWithClusteringWithClusteringOnBadColumn

Do these cover the cases you are talking about?

About the join issue, do you have a concrete query in example, just to clarify?

final IdentifierNamespace insertNs = (IdentifierNamespace) targetNamespace;
SqlIdentifier identifier = insertNs.getId();
SqlValidatorTable catalogTable = getCatalogReader().getTable(identifier.names);
if (catalogTable != null) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wouldn't the fall-thru from this condtional will cause that the CLUSTER BY on the ingestNode will not be applied (line399 right now); even if its there - is that okay?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if the ingestNode already has the clustering columns, they will be used. There are existing tests which test that the clustering columns are used in the plan returned from dml query, when clustering is defined at query time, and the table is / it not in catalog. Let me know if this covers the issue that think could occur.

@zachjsh zachjsh merged commit 365cd7e into apache:master Apr 26, 2024
@zachjsh zachjsh deleted the use-catalog-clustering-columns branch April 26, 2024 14:19
@kfaraz kfaraz added this to the 31.0.0 milestone Oct 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants