DruidInputSource: Fix issues in column projection, timestamp handling. #10267
gianm merged 37 commits into apache:master
Conversation
DruidInputSource, DruidSegmentReader changes:
1) Remove "dimensions" and "metrics". They are not necessary, because we
can compute which columns we need to read based on what is going to
be used by the timestamp, transform, dimensions, and metrics.
2) Start using ColumnsFilter (see below) to decide which columns we need
to read.
3) Actually respect the "timestampSpec". Previously, it was ignored, and
the timestamp of the returned InputRows was set to the `__time` column
of the input datasource.
(1) and (2) together fix a bug in which the DruidInputSource would not
properly read columns that are used as inputs to a transformSpec.
(3) fixes a bug where the timestampSpec would be ignored if you attempted
to set the column to something other than `__time`.
(1) and (3) are breaking changes.
Web console changes:
1) Remove "Dimensions" and "Metrics" from the Druid input source.
2) Set timestampSpec to `{"column": "__time", "format": "millis"}` for
compatibility with the new behavior.
Other changes:
1) Add ColumnsFilter, a new class that allows input readers to determine
which columns they need to read. Currently, it's only used by the
DruidInputSource, but it could be used by other columnar input sources
in the future.
2) Add a ColumnsFilter to InputRowSchema.
3) Remove the metric names from InputRowSchema (they were unused).
4) Add InputRowSchemas.fromDataSchema method that computes the proper
ColumnsFilter for given timestamp, dimensions, transform, and metrics.
5) Add "getRequiredColumns" method to TransformSpec to support the above.
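The required-columns computation described in (4) and (5) can be sketched in miniature. This is a hypothetical, simplified illustration of the idea behind `InputRowSchemas.fromDataSchema` (the class and method names below are invented for the sketch, not the actual Druid code): the set of columns to read is the union of everything referenced by the timestamp, the dimensions, the transform inputs, and the metric inputs.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class RequiredColumns
{
  public static Set<String> compute(
      String timestampColumn,
      List<String> dimensions,
      Set<String> transformInputs,  // analogous to TransformSpec#getRequiredColumns
      Set<String> metricInputs
  )
  {
    // Union of every column any downstream consumer will read.
    Set<String> required = new HashSet<>();
    required.add(timestampColumn);
    required.addAll(dimensions);
    required.addAll(transformInputs);
    required.addAll(metricInputs);
    return required;
  }

  public static void main(String[] args)
  {
    Set<String> required = compute(
        "__time",
        Arrays.asList("country", "page"),
        new HashSet<>(Arrays.asList("added", "deleted")),  // inputs to a transform
        new HashSet<>(Arrays.asList("added"))              // inputs to aggregators
    );
    // "added" appears twice but is only read once; 5 distinct columns total.
    System.out.println(required);
  }
}
```

Because the set is derived rather than user-supplied, a column used only as a transform input can no longer be accidentally dropped — which is exactly the bug (1) and (2) fix.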
Fwiw, on this one, I think the likelihood of (1) causing problems is low. It's a breaking change because if you were previously specifying a column as an input to one of your dimensionsSpec or aggregators, but then explicitly not including it in the input source's "dimensions" or "metrics" list, it'll now actually get read. Previously it'd be treated as null. The new behavior is better & less brittle, but it is different.
(3) is still fairly low risk, but somewhat more likely. It's possible that someone had their timestampSpec set to something like
This patch as written just lets these things break, but we could cushion the fall, potentially:
What do people think?
Btw, this fixes #10266 too.
@@ -87,13 +91,21 @@ public class DruidInputSource extends AbstractInputSource implements SplittableI
@Nullable
private final List<WindowedSegmentId> segmentIds;
private final DimFilter dimFilter;
will it make sense to move DimFilter outside the InputSource in the task json? It seems more natural to me to put the filters alongside transforms, dimensions, and metrics and leave only the data source properties inside the InputSource section. On the flip side, it could make the compatibility situation more complicated than it is.
It's possible to specify a filter alongside transforms today! You can do it in two places:
- In the `transformSpec` (this works with any input source / format, see https://druid.apache.org/docs/latest/ingestion/index.html#filter)
- In the druid `inputSource` itself (of course, only works with this input source)
It's a little silly to have both, perhaps, but there's a practical reason: specifying a filter in the druid inputSource is faster, because it is applied while creating the cursor that reads the data, and therefore it can use indexes, etc. The filter in the transformSpec is applied after the cursor generates rows.
But I think in the future, it'd be better to support pushing down the transformSpec filter into the cursor, and then we could deprecate the filter parameter in the inputSource, because it wouldn't be useful anymore.
For now, I suggest we leave it as-is.
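The performance point above — an inputSource filter can use indexes at cursor-creation time, while a transformSpec filter runs per row — can be shown with a toy example. This is not Druid code; the inverted index below is a stand-in for a segment's bitmap indexes, and the names are invented for illustration.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class FilterPlacement
{
  // Build an inverted index: column value -> ids of the rows containing it
  // (a rough analogue of a segment's bitmap index on a string column).
  static Map<String, BitSet> buildIndex(List<Map<String, String>> rows, String column)
  {
    Map<String, BitSet> index = new HashMap<>();
    for (int i = 0; i < rows.size(); i++) {
      index.computeIfAbsent(rows.get(i).get(column), k -> new BitSet()).set(i);
    }
    return index;
  }

  public static void main(String[] args)
  {
    List<Map<String, String>> rows = List.of(
        Map.of("country", "US"), Map.of("country", "FR"), Map.of("country", "US")
    );

    // "inputSource filter" analogue: consult the prebuilt index and visit
    // only the matching rows.
    Map<String, BitSet> index = buildIndex(rows, "country");
    System.out.println("index-backed matches: " + index.get("US").cardinality());

    // "transformSpec filter" analogue: the cursor yields every row, and the
    // predicate is evaluated against each one after the fact.
    long scanned = rows.stream().filter(r -> "US".equals(r.get("country"))).count();
    System.out.println("full-scan matches: " + scanned);
  }
}
```

Both paths select the same rows; the difference is how much data gets touched along the way, which is why pushing the transformSpec filter down into the cursor would make the inputSource filter redundant.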
(1) sounds good to me.
I think I'm a little unclear on (3): is the feature flag controlling whether to use the old/new behavior, or is it just for whether the auto/millis check is executed?
jon-wei
left a comment
LGTM. For the (3) backwards-compatibility item, if we decide to feature-flag this (I'm not sure that's worth doing; maybe it's enough to just show a clear error message and call this out in the release notes), what makes sense to me would be to have the flag control whether we ignore (old mode) or respect (new mode) the timestampSpec, and in the new mode we could have that `__time` with auto/millis format check.
* @see InputRowSchema#getColumnsFilter()
*/
@VisibleForTesting
static ColumnsFilter createColumnsFilter(
(What's the thumbs up for?)
thought it was a good method 👍
Hmm, thinking about this some more, it may be best to not have it be an error. The risk of doing a sanity-check error is that something that worked before will stop working once you upgrade. It seems to me that any time someone had been specifying
I think it makes sense to add a flag, because of the potential for confusion around why something suddenly broke in a subtle way after an upgrade. Usually (hopefully) cluster operators read the release notes, but not all users will read them, and the people submitting indexing tasks might not be the same people that operate the cluster.
In new mode, I suggest we skip the check, because that will enable the full power of timestampSpec to be used (you could use it to switch a secondary timestamp to primary, for example). IMO the default should be new mode, but we should put something in the release notes that says if you have a bunch of users that might be relying on the old behavior, you can set this flag and get the old behavior back. I'd also consider adding a logged warning if the timestampSpec is anything other than
What do you think?
Separately — as a reviewer — would you prefer these changes to be made in this patch, or in a follow-up? I'm OK either way…
That sounds fine to me too, I don't really have a strong preference on that.
Ah, I was thinking that the auto/millis check in the new mode would only apply if the column being referenced was
The warning log sounds fine to me.
Let's do it in this patch; I'm guessing the additions won't be too large, and we can have everything related in one place.
@jon-wei I've pushed a new commit.
For this one, I left it as not-an-error.
For this one, I added a config
Either way, a warning is logged if you read from the
The new config LGTM; there are some CI failures.
@jon-wei thanks; I can't see the CI failures right now due to merge conflicts, so I just fixed them and pushed them up. I'll take another look once CI runs again.
I'm having a tough time figuring out why the integration tests aren't passing. I suppose it's related to the fact that I added testing there for the new
@gianm I saw some messages like this in the perfect rollup test: https://travis-ci.org/github/apache/druid/jobs/729522884
I didn't see the ClassCastException in the log for the failed batch test though.
OK, got it all sorted out. The tests are passing now.
|`druid.indexer.task.gracefulShutdownTimeout`|Wait this long on middleManager restart for restorable tasks to gracefully exit.|PT5M|
|`druid.indexer.task.hadoopWorkingPath`|Temporary working directory for Hadoop tasks.|`/tmp/druid-indexing`|
|`druid.indexer.task.restoreTasksOnRestart`|If true, MiddleManagers will attempt to stop tasks gracefully on shutdown and restore them on restart.|false|
|`druid.indexer.task.ignoreTimestampSpecForDruidInputSource`|If true, tasks using the [Druid input source](../ingestion/native-batch.md#druid-input-source) will ignore the provided timestampSpec, and will use the `__time` column of the input datasource. This option is provided for compatibility with ingestion specs written before Druid 0.20.0.|false|
should be 0.20.1 at this point
I'll change it to 0.21.0, in the guess that this will be the next release.
|`druid.indexer.task.gracefulShutdownTimeout`|Wait this long on Indexer restart for restorable tasks to gracefully exit.|PT5M|
|`druid.indexer.task.hadoopWorkingPath`|Temporary working directory for Hadoop tasks.|`/tmp/druid-indexing`|
|`druid.indexer.task.restoreTasksOnRestart`|If true, the Indexer will attempt to stop tasks gracefully on shutdown and restore them on restart.|false|
|`druid.indexer.task.ignoreTimestampSpecForDruidInputSource`|If true, tasks using the [Druid input source](../ingestion/native-batch.md#druid-input-source) will ignore the provided timestampSpec, and will use the `__time` column of the input datasource. This option is provided for compatibility with ingestion specs written before Druid 0.20.0.|false|
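The behavior the `druid.indexer.task.ignoreTimestampSpecForDruidInputSource` flag toggles can be sketched as follows. This is a hypothetical simplification, not the actual Druid implementation; the method and parameter names are invented for illustration.

```java
public class TimestampSpecCompat
{
  static String resolveTimestampColumn(boolean ignoreTimestampSpec, String specColumn)
  {
    if (ignoreTimestampSpec) {
      // Old (pre-0.20.0-spec) behavior: the provided timestampSpec is ignored
      // and the __time column of the input datasource is always used.
      return "__time";
    }
    // New behavior: the timestampSpec is respected, so a secondary timestamp
    // column can be promoted to primary.
    return specColumn;
  }

  public static void main(String[] args)
  {
    System.out.println(resolveTimestampColumn(true, "eventTime"));
    System.out.println(resolveTimestampColumn(false, "eventTime"));
  }
}
```

With the flag off (the default), specs that previously relied on the timestampSpec being ignored will start reading the column they actually name, which is the compatibility concern the release notes call out.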
The web console changes (*.tsx) look good to me.
Thank you for updating the docs per my ask.
Fixed up some merge conflicts.
Fixed a conflict and updated the docs to say "before Druid 0.22.0" instead of "before Druid 0.21.0".
jihoonson
left a comment
@gianm sorry for the delayed review. LGTM overall, but I left some minor comments. Please address them. Also the build is failing because of the signature change of the constructor of InputRowSchema().
[ERROR] /home/jihoonson/Projects/druid/core/src/test/java/org/apache/druid/data/input/impl/JsonReaderTest.java:[382,13] cannot find symbol
[ERROR] symbol: variable Collections
[ERROR] location: class org.apache.druid.data.input.impl.JsonReaderTest
[ERROR] /home/jihoonson/Projects/druid/indexing-service/src/test/java/org/apache/druid/indexing/seekablestream/StreamChunkParserTest.java:[183,87] incompatible types: no instance(s) of type variable(s) T exist so that java.util.List<T> conforms to org.apache.druid.data.input.ColumnsFilter
[ERROR] /home/jihoonson/Projects/druid/indexing-service/src/test/java/org/apache/druid/indexing/seekablestream/StreamChunkParserTest.java:[206,87] incompatible types: no instance(s) of type variable(s) T exist so that java.util.List<T> conforms to org.apache.druid.data.input.ColumnsFilter
A spec that applies a filter and reads a subset of the original datasource's columns is shown below.
It is OK for the input and output datasources to be the same. In this case, newly generated data will overwrite the
previous data for the intervals specified in the `granularitySpec`. Generally, if you are going to do this, it is a
good idea to test out your reindexing by writing to a separate datasource before overwriting your main one.
Maybe good to suggest using auto compaction here instead of writing an ingestion spec.
Good idea. I added this.
Alternatively, if your goals can be satisfied by [compaction](compaction.md),
consider that instead as a simpler approach.
| "partitionsSpec": { | ||
| "type": "hashed", | ||
| "numShards": 1 | ||
| }, | ||
| "forceGuaranteedRollup": true, | ||
| "maxNumConcurrentSubTasks": 1 |
This part was not in the previous example. Was it intentional to use hashed partitionsSpec here? Seems unnecessary to me.
What I was thinking was:
- I want to include a full ingest spec, not just the inputSource part, so people have a full example.
- This spec uses rollup, so for a reindexing spec, it'd be good to use a partitionsSpec that guarantees rollup too.
Do you have a better suggestion for what to put in the example?
I see. It makes sense. If that's the case, I would suggest simply removing numShards from the spec. The parallel task will find the numShards automatically based on targetRowsPerSegment which is 5 million by default.
@Nullable
@JsonProperty
@JsonInclude(Include.NON_NULL)
Is it better to add this annotation at the class-level? Seems reasonable to not include any fields in JSON if they are null.
Interesting question. To answer it, I had to add some tests to make sure it worked properly. The answer is yes, it does work. I'll make the change and keep the new tests (look for them in DruidInputSourceTest).
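The class-level `@JsonInclude` behavior being discussed can be demonstrated with a small Jackson example. The class and field names here are invented for illustration (not the actual DruidInputSource fields); the point is that a single class-level annotation suppresses every null field during serialization, so per-field annotations become unnecessary.

```java
import com.fasterxml.jackson.annotation.JsonInclude;
import com.fasterxml.jackson.annotation.JsonProperty;
import com.fasterxml.jackson.databind.ObjectMapper;

@JsonInclude(JsonInclude.Include.NON_NULL)
class ExampleSource
{
  @JsonProperty
  String dataSource = "wikipedia";

  @JsonProperty
  String interval = null; // omitted from the JSON because it is null
}

public class JsonIncludeDemo
{
  public static void main(String[] args) throws Exception
  {
    // Only non-null fields appear in the output.
    String json = new ObjectMapper().writeValueAsString(new ExampleSource());
    System.out.println(json);
  }
}
```

One caveat worth testing for (as was done in DruidInputSourceTest): round-tripping must still work, since a field absent from the JSON has to deserialize back to null rather than fail.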
return Collections.singletonList(new MapBasedInputRow(timestamp.getMillis(), dimensions, intermediateRow));
return Collections.singletonList(
    MapInputRowParser.parse(
        new InputRowSchema(
Would be better to cache inputRowSchema since this function is called per row.
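The suggested fix is to hoist the schema construction out of the per-row path. A minimal sketch of the pattern, using invented stand-in names rather than the actual Druid classes:

```java
import java.util.List;
import java.util.Map;

public class CachedSchemaReader
{
  static final class Schema
  {
    final List<String> dimensions;

    Schema(List<String> dimensions)
    {
      this.dimensions = dimensions;
    }
  }

  // Built once when the reader is created, not once per row. In the PR's code,
  // the analogous object is the InputRowSchema passed to MapInputRowParser.parse.
  final Schema schema = new Schema(List.of("country", "page"));

  Map<String, Object> parseRow(Map<String, Object> raw)
  {
    // Reuses this.schema on every call; no per-row allocation.
    return raw; // stand-in for the actual parse(schema, raw) call
  }

  public static void main(String[] args)
  {
    CachedSchemaReader reader = new CachedSchemaReader();
    for (int i = 0; i < 3; i++) {
      reader.parseRow(Map.of("country", "US"));
    }
    System.out.println(reader.schema.dimensions);
  }
}
```

Since the schema is immutable for the lifetime of the reader, caching it is safe, and it avoids allocating an identical object for every row of a potentially large segment.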
AutoCompactionSnapshot.AutoCompactionScheduleStatus.RUNNING,
0,
22482,
22481,
Do you know why this changed?
I guessed it was because the metadata changed. By that, I mean the org.apache.druid.segment.Metadata object stored in the segment, which contains the TimestampSpec.
It adds up, I think, since the −1 character is the difference between the old default `timestamp` + `auto` (13 chars) and the new default `__time` + `millis` (12 chars).
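The arithmetic checks out:

```java
// Verify the one-character difference between the serialized defaults:
// the old default timestampSpec column/format vs. the new one.
public class MetadataSizeCheck
{
  public static void main(String[] args)
  {
    int oldLen = "timestamp".length() + "auto".length();  // 9 + 4 = 13
    int newLen = "__time".length() + "millis".length();   // 6 + 6 = 12
    System.out.println(oldLen - newLen); // prints 1
  }
}
```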
Ah, that seems likely the reason. Thanks 👍
Thanks for the review @jihoonson. I've pushed up the changes.
thanks!