Move INSERT & REPLACE validation to the Calcite validator #15908
zachjsh merged 14 commits into apache:master from
Conversation
…ct source node. Implemented the validateInsert method in DruidSqlValidator to validate the respective node, and moved much of the validation previously done in the ingest handlers into this overridden method. In the next commit I will try to pull some of this validation back out, to make this code change smaller and easier to review.
kgyrtkirk
left a comment
left some comments; I'm not sure if I understand every bit of it...
is this code covered by tests?
there are quite a few new incorrect-state checks; it would be nice to cover at least some of them with directed tests
        (SqlNodeList) operands[5],
        // Must match DruidSqlReplace.getOperandList()
        operands[6],
        null // fix this
what needs to be fixed? I would recommend either using FIXME and addressing it before merging the patch, or removing this comment and making sure the bad case can't happen by providing some reasonable exception
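If the second option is taken, a minimal sketch (the operand name and message here are assumptions, not the patch's code):

// Illustrative only: fail fast with a clear message instead of passing
// a null literal that downstream code may not expect.
throw InvalidSqlInput.exception(
    "Cannot copy REPLACE node: operand [%s] is not available", "exportFileFormat");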
I think that in order for this to work properly, the exportFileFormat needs to be changed into an SqlNode and added as an operand. Without this, I don't think that parameterized queries using export capabilities will work. cc @adarshsanjeev
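A rough sketch of that change, with assumed field names and order; Calcite's SqlShuttle-based rewrites (including parameter substitution) only visit nodes reachable through getOperandList(), so the format would have to be an operand:

// Hypothetical operand list for DruidSqlReplace; field order and names are
// assumed. Exposing exportFileFormat here lets parameter substitution
// visit and rewrite it like any other operand.
@Override
public List<SqlNode> getOperandList()
{
  return ImmutableNullableList.of(
      getTargetTable(),
      getSource(),
      getTargetColumnList(),
      exportFileFormat
  );
}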
  {
    return new DruidSqlReplace(
        pos,
        // Must match SqlInsert.getOperandList()
I don't really understand these comments - they don't help me understand... do we need them?
you could create a bunch of local variables with the casted types and name them accordingly - that might help, or even provide a place to add comments...
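For instance (the operand meanings here are assumed from the quoted indices):

// Named locals make the constructor call self-documenting.
final SqlNodeList clusteredBy = (SqlNodeList) operands[5];
final SqlNode replaceTimeQuery = operands[6];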
Removed the comments; let me know if it's OK now.
if (!query.isA(SqlKind.QUERY)) {
  throw InvalidSqlInput.exception("Unexpected SQL statement type [%s], expected it to be a QUERY", query.getKind());
}
return DruidSqlInsert.create(new SqlInsert(
I think the contents of this method could just be pushed into DruidSqlInsert; it seems like all of this is supposed to belong there...
...or there is something I've missed...
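A sketch of that move; the constructor signature is an assumption:

// Hypothetical factory on DruidSqlInsert owning the QUERY-kind check,
// so callers no longer repeat it.
public static DruidSqlInsert create(SqlInsert insert, @Nullable Granularity granularity)
{
  final SqlNode query = insert.getSource();
  if (!query.isA(SqlKind.QUERY)) {
    throw InvalidSqlInput.exception(
        "Unexpected SQL statement type [%s], expected it to be a QUERY",
        query.getKind());
  }
  return new DruidSqlInsert(insert, granularity);
}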
  }
}

private void validateSegmentGranularity(
seeing a method that is void and named validateSegmentGranularity, I have a feeling the method has lost its context;
meaning: ideally this should be named something like getEffectiveGranularity and return a Granularity; but in the process of getting the actual granularity it must also identify invalid cases...
do you see any possibly better place for it?
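A sketch of that shape, with assumed parameter names (the query's PARTITIONED BY grain versus the catalog's grain) and an illustrative error message:

// Hypothetical signature: resolve the effective granularity, rejecting
// invalid combinations along the way instead of validating as a side effect.
private Granularity getEffectiveGranularity(
    @Nullable final Granularity queryGranularity,   // from PARTITIONED BY
    @Nullable final Granularity catalogGranularity  // from table metadata
)
{
  final Granularity effective =
      queryGranularity != null ? queryGranularity : catalogGranularity;
  if (effective == null) {
    throw InvalidSqlInput.exception(
        "Operation requires a PARTITIONED BY clause or a catalog-defined segment granularity");
  }
  return effective;
}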
Good point! Fixed; let me know if it's better now.
final List<RelDataTypeField> sourceFields = sourceType.getFieldList();
for (final RelDataTypeField sourceField : sourceFields) {
  // Check that there are no unnamed columns in the insert.
  if (UNNAMED_COLUMN_PATTERN.matcher(sourceField.getName()).matches()) {
are there any tests for this exception? what if the pattern doesn't match anymore?
Yes! There are tests in CalciteInsertDmlTest, for example org.apache.druid.sql.calcite.CalciteInsertDmlTest#testInsertWithUnnamedColumnInSelectStatement
if (query instanceof SqlOrderBy) {
  SqlOrderBy sqlOrderBy = (SqlOrderBy) query;
  SqlNodeList orderByList = sqlOrderBy.orderList;
  if (!(orderByList == null || orderByList.equals(SqlNodeList.EMPTY))) {
why have all this here? isn't DruidSqlIngest subclassed to have DruidSqlInsert?
shouldn't this happen in DruidSqlInsert when it gets created?
Doing it in the DruidSqlInsert constructor changes the output of unparse. Not sure if this is OK. There are a few tests in DruidSqlUnparseTest.java that check the expected output of unparse. I can update these tests, but I'm just not sure whether doing this has other ramifications?
* allow granularity mismatch
* remove duplicate validation around unnamed columns
// Copied here from MSQE since that extension is not visible here.
public static final String CTX_ROWS_PER_SEGMENT = "msqRowsPerSegment";

public interface ValidatorContext
}

// The target namespace is both the target table ID and the row type for that table.
final SqlValidatorNamespace targetNamespace = getNamespace(insert);
could we have a copy of getNamespaceOrThrow over from SqlValidatorImpl, or use requireNonNull (just to avoid a possible issue if it ends up being null)?
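The requireNonNull option is a one-line guard; a minimal sketch using java.util.Objects:

// Fail fast with a descriptive message if no namespace was registered.
final SqlValidatorNamespace targetNamespace = Objects.requireNonNull(
    getNamespace(insert),
    () -> "no namespace registered for " + insert
);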
// know names and we match by name.) Thus, we'd have to validate (to know names and types)
// to get the target types, but we need the target types to validate. Catch-22. So, we punt.
final SqlValidatorScope scope;
if (source instanceof SqlSelect) {
note: it seems like most of this was copied over from SqlValidatorImpl#validateInsert, then extended / refactored / had methods extracted / commented on...
I think it might be challenging to maintain this in the long run; however, validateInsert doesn't seem to change very often.
I think it would be useful to leave some comments about the origins of this method in the javadoc of validateInsert.
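For example, a provenance note of roughly this shape (the wording is only a suggestion):

/**
 * Validates an INSERT/REPLACE statement.
 *
 * <p>Largely adapted from Calcite's SqlValidatorImpl#validateInsert, with
 * Druid-specific checks added and several methods extracted. Compare against
 * upstream when upgrading the Calcite version.
 */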
for (final RelDataTypeField sourceField : sourceFields) {
  // Check that there are no unnamed columns in the insert.
  if (UNNAMED_COLUMN_PATTERN.matcher(sourceField.getName()).matches()) {
    throw InvalidSqlInput.exception(
could you replace these exceptions with ones communicating the SqlNode, if that's interesting/valuable...
if it's a general error, that doesn't matter... but this error is specific to a selected column:

throw buildCalciteContextException(
    "Insertion requires columns to be named....",
    getSqlNodeFor(insert, sourceFields.indexOf(sourceField)));

rough getSqlNodeFor method:

SqlNode getSqlNodeFor(SqlInsert insert, int idx)
{
  SqlNode src = insert.getSource();
  if (src instanceof SqlSelect) {
    SqlSelect sqlSelect = (SqlSelect) src;
    SqlNodeList selectList = sqlSelect.getSelectList();
    if (idx < selectList.size()) {
      return selectList.get(idx);
    }
  }
  return src;
}
@@ -1765,7 +1762,6 @@ public void testErrorWhenInputSourceInvalid()
    + "partitioned by DAY\n"
    + "clustered by channel";
HashMap<String, Object> context = new HashMap<>(DEFAULT_CONTEXT);
I can't comment on the testInsertWithInvalidColumnNameInIngest test case; but the intention of the check is to ensure that something like:
INSERT INTO t SELECT __time, 1+1 FROM foo PARTITIONED BY ALL
is caught; could you change or add something like this as a test case?
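A rough sketch of such a directed test, reusing the testIngestionQuery() helper quoted elsewhere in this PR; the expectValidationError/invalidSqlContains helpers and the exact error text are assumptions:

@Test
public void testInsertWithUnnamedExpressionIsRejected()
{
  // 1+1 yields an unnamed column, which source validation should reject;
  // the matcher text below is illustrative, not the patch's actual message.
  testIngestionQuery()
      .sql("INSERT INTO t SELECT __time, 1+1 FROM foo PARTITIONED BY ALL")
      .expectValidationError(invalidSqlContains("columns to be named"))
      .verify();
}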
@zachjsh - This was merged without a committer approval. Can you revert it and open another PR so that you can get a committer approval?
public void testInsertHourGrain()
{
  testIngestionQuery()
      .sql("INSERT INTO hourDs\n" +
In a follow-up, could you also please add a test for a REPLACE query where the PARTITIONED BY clause is omitted?
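Something along these lines, perhaps (the table name, helpers, and expected message are assumptions):

@Test
public void testReplaceWithoutPartitionedBy()
{
  // With no catalog-supplied grain, omitting PARTITIONED BY should fail.
  testIngestionQuery()
      .sql("REPLACE INTO dst OVERWRITE ALL SELECT * FROM foo")
      .expectValidationError(invalidSqlContains("PARTITIONED BY"))
      .verify();
}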
}

/**
 * If the segment grain is given in the catalog then use this value is used.
Suggested change:
- * If the segment grain is given in the catalog then use this value is used.
+ * If the segment grain is given in the catalog and absent in the PARTITIONED BY clause in the query, then use the value from the catalog.
@Nullable SqlIdentifier exportFileFormat,
@Nullable SqlNode replaceTimeQuery
curious, is the order of args swapped for a reason?
| + "partitioned by DAY\n" | ||
| + "clustered by channel"; | ||
| HashMap<String, Object> context = new HashMap<>(DEFAULT_CONTEXT); | ||
| context.put(PlannerContext.CTX_SQL_OUTER_LIMIT, 100); |
Curious, are these context parameters not required by these tests anymore?
…urce input expressions (#15962)
* address remaining comments from #15836
* address remaining comments from #15908
* add test that exposes relational algebra issue
* simplify test exposing issue
* fix
* add tests for sealed / non-sealed
* update test descriptions
* fix test failure when -Ddruid.generic.useDefaultValueForNull=true
* check type assignment based on native Druid types
* add tests that cover missing jacoco coverage
* add replace tests
* add more tests and comments about column ordering
* simplify tests
* review comments
* remove commented line
* STRING family types should be validated as non-null
Description
This PR contains a portion of the changes from the inactive draft PR for integrating the catalog with the Calcite planner (#13686) from @paul-rogers. It refactors the IngestHandler and subclasses to produce a validated SqlInsert instance node instead of the previous Insert source node; the SqlInsert node is then validated in the Calcite validator. The validation implemented as part of this PR covers only the source node, plus some of the validation previously done in the ingest handlers. As part of this change, the PARTITIONED BY clause can be supplied by the table catalog metadata if it exists, and can then be omitted from the ingest-time query.
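As a hypothetical illustration of the new behavior (the table and its grain are assumed to be declared in the catalog):

// Illustrative only: hourDs is assumed to be defined in the catalog with an
// HOUR segment grain, so the ingest query may omit PARTITIONED BY entirely
// and inherit the grain from the catalog.
final String sql = "INSERT INTO hourDs\n"
    + "SELECT * FROM foo";  // no PARTITIONED BY: grain comes from the catalog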
This PR has: