Introduce "transformSpec" at ingest-time.#4890

Merged
gianm merged 27 commits into apache:master from gianm:ingest-expressions
Oct 31, 2017

Conversation

@gianm
Contributor

gianm commented Oct 2, 2017

It accepts a "filter" (standard query filter object) and "transforms" (a
list of objects with "name" and "expression"). These can be used to do
filtering and single-row transforms without need for a separate data
processing job.

The "expression" fields use the same expression language as other
expression-based feature.

Not yet documented for two reasons:

  1. Expression-based features are generally not yet documented (although I think the time is soon arriving where we should document them)
  2. There is not really a clear place to put the docs, since dataSchema docs are strewn about the documentation landscape. This is probably a sign that the docs need some refactoring.

I'm hoping to address both of these in a later patch, but for now, the docs are unchanged.
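For illustration, a transformSpec of the shape described above might look like the following. This is a hypothetical sketch based only on the field names given in this description ("filter", "transforms", "name", "expression"); the exact placement within the ingestion spec and the expression syntax should be checked against the docs once they land:

```json
{
  "transformSpec": {
    "filter": {
      "type": "selector",
      "dimension": "eventType",
      "value": "beepBoop"
    },
    "transforms": [
      {
        "name": "normalizedField",
        "expression": "lower(someField)"
      }
    ]
  }
}
```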

It accepts a "filter" (standard query filter object) and "transforms" (a
list of objects with "name" and "expression"). These can be used to do
filtering and single-row transforms without need for a separate data
processing job.

The "expression" fields use the same expression language as other
expression-based feature.
@gianm gianm added the Feature label Oct 2, 2017
@drcrallen
Contributor

@gianm can you go in a little deeper on the motivation here?

Previously the druid stuff was solely dedicated to indexing. Is the purpose here to add in more advanced ETL because people don't want to run more than one type of "worker"? Or is it because there are common, very primitive ETLs that will help 85% of the use cases you see?

As an alternative, what if we had more "indexing" extensions that plugged into different data writers or data syncs of different systems. Like a spark writer / hadoop output format. Or stuff specialized for certain other data processing frameworks?

@gianm
Contributor Author

gianm commented Oct 2, 2017

@drcrallen The motivation is to add some very basic stateless transforms to the Druid indexer to save people from the cost and hassle of running extra jobs when they want to do something basic. It's the same idea as the flattenSpec that exists for JSON and Avro.

Some examples of problems that this solves:

  • I have a field called "url" but I only want to index the domain -> use a regexp_extract expression
  • I have two fields, "firstName" and "lastName" but I want to index them as a pair -> use a concat expression
  • I have two time columns (perhaps start time and settlement time for a transaction), I want one to be the primary timestamp but I want to store the other as an int64 dimension -> use a timestamp_parse expression
  • I want to normalize some field to lowercase -> use a lower expression
  • I have a field called "eventType" and I only want to index type "beepBoop" -> use a selector filter

For people that are just trying to plug Kafka streams into Druid and don't have stream processing infrastructure set up, this capability is huge. (A lot of people have pretty bare bones setups)

Even for people that do have a stream processor set up, this approach is cheaper, since it doesn't require setting aside capacity in Kafka for processing and retention of the transformed topic.

@fjy
Contributor

fjy commented Oct 3, 2017

👍

@gianm
Contributor Author

gianm commented Oct 4, 2017

@drcrallen Does that rationale seem reasonable to you?

          if (valueMatcher != null) {
            rowSupplierForValueMatcher.set(transformedRow);
            if (!valueMatcher.matches()) {
              return null;
            }
          }
Contributor


Most indexers don't have the ability to filter null rows and instead fail with an NPE. Does this filtering work with either the Hadoop indexing job or the Kafka indexing service?

Contributor Author


@himanshug that's a good point, I tested it with IndexTask but not every form. IndexTask works since it has this code:

          // The null inputRow means the caller must skip this row.
          if (inputRow == null) {
            continue;
          }

I'll add some tests for kafka, hadoop, and realtime tasks as well.
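For context, the null-row contract being discussed can be sketched like this (a minimal, self-contained illustration with hypothetical simplified types, not the actual Druid Firehose/InputRow interfaces):

```java
// Hypothetical sketch (not the real Druid interfaces): a transformer returns
// null for rows removed by the filter, and the caller must skip null rows
// instead of hitting an NPE.
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;

public class TransformSketch {
  // Apply a transform, then a filter; a null return means "row filtered out".
  static String transform(String row, Function<String, String> fn, Predicate<String> filter) {
    String transformed = fn.apply(row);
    return filter.test(transformed) ? transformed : null;
  }

  static List<String> run() {
    List<String> out = new ArrayList<>();
    for (String row : List.of("beepBoop:1", "other:2", "beepBoop:3")) {
      String transformed = transform(row, String::toLowerCase, r -> r.startsWith("beepboop"));
      // The null return means the caller must skip this row, as IndexTask does above.
      if (transformed == null) {
        continue;
      }
      out.add(transformed);
    }
    return out;
  }

  public static void main(String[] args) {
    System.out.println(run()); // prints [beepboop:1, beepboop:3]
  }
}
```

The point of the check is that every consumer of a possibly-transforming parser has to treat null as "skip", which is why the Kafka, Hadoop, and realtime paths each need a guard like the one IndexTask already has.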

          if (transforms.isEmpty()) {
            transformedRow = row;
          } else {
            transformedRow = new TransformedInputRow(row, transforms);
          }
Contributor


also, can you adjust https://github.com/druid-io/druid/blob/master/indexing-hadoop/src/main/java/io/druid/indexer/hadoop/DatasourceRecordReader.java#L145 for hadoop re-indexing where rows don't go through the configured parser.

Contributor Author


Hmm, I think IngestSegmentFirehoseFactory doesn't work right either; it also bypasses the parser. Will try to do something to make this work for both of them and add some tests.

Contributor


Yeah, it's even better to handle this in IngestSegmentFirehoseFactory; then DatasourceRecordReader will get it automatically.

@himanshug
Contributor

@drcrallen @gianm Sounds good as a feature; I have also seen the need for simple transformations at ingestion time.
Does "undocumented" in this case mean that you don't want users to use this feature just yet because things may change very rapidly, or that you want to do more testing and ensure it works?

@gianm
Contributor Author

gianm commented Oct 4, 2017

@himanshug just undocumented because nothing expression-related is documented yet. I'm intending to do another patch before too long that will add documentation for all of them, linking through to the existing math-expr.md.

Also for the reason in the original comment -- there is not really a clear place to put docs for stuff in dataSchema, since the docs probably need a refactor.

@gianm gianm added the WIP label Oct 13, 2017
@gianm
Contributor Author

gianm commented Oct 13, 2017

Marking WIP until I can add more tests.

gianm added 2 commits October 16, 2017 11:33
- Add nullable annotation to Firehose.nextRow.
- Add tests for index task, realtime task, kafka task, hadoop mapper,
  and ingestSegment firehose.
@gianm gianm force-pushed the ingest-expressions branch from b44e6df to 7da33a1 Compare October 16, 2017 18:36
@gianm gianm closed this Oct 16, 2017
@gianm gianm reopened this Oct 16, 2017
@gianm gianm mentioned this pull request Oct 16, 2017
@gianm
Contributor Author

gianm commented Oct 20, 2017

@himanshug I would say a transform can be used to add columns (as long as those columns are functions of existing ones…). And I guess you could use it to sort of "remove" a column in a sense, by overwriting it with one that is just all nulls? Anyway, I just pushed a new javadoc that is more clear.

I also took your suggestion in #4890 (comment) and added the sanity check.

@himanshug
Contributor

@gianm I didn't realize all the possibilities, so that doc helps. Thanks.

@gianm gianm merged commit 0ce406b into apache:master Oct 31, 2017
@gianm gianm deleted the ingest-expressions branch October 31, 2017 00:38
gianm added a commit to gianm/druid that referenced this pull request Oct 31, 2017
- Uses the technique from apache#4883 on DimFilterHavingSpec too.
- Also uses Transformers from apache#4890, necessitating a move of that and other
  related classes from druid-server to druid-processing. They probably make
  more sense there anyway.
- Adds a SQL query test.

Fixes apache#4957.
fjy pushed a commit that referenced this pull request Nov 1, 2017
* Fix havingSpec on complex aggregators.

- Uses the technique from #4883 on DimFilterHavingSpec too.
- Also uses Transformers from #4890, necessitating a move of that and other
  related classes from druid-server to druid-processing. They probably make
  more sense there anyway.
- Adds a SQL query test.

Fixes #4957.

* Remove unused import.
gianm added a commit to implydata/druid-public that referenced this pull request Nov 3, 2017
* Introduce "transformSpec" at ingest-time.

It accepts a "filter" (standard query filter object) and "transforms" (a
list of objects with "name" and "expression"). These can be used to do
filtering and single-row transforms without need for a separate data
processing job.

The "expression" fields use the same expression language as other
expression-based feature.
gianm added a commit to implydata/druid-public that referenced this pull request Nov 14, 2017
* Introduce "transformSpec" at ingest-time.

It accepts a "filter" (standard query filter object) and "transforms" (a
list of objects with "name" and "expression"). These can be used to do
filtering and single-row transforms without need for a separate data
processing job.

The "expression" fields use the same expression language as other
expression-based feature.
gianm added a commit to implydata/druid-public that referenced this pull request Nov 15, 2017
* Fix havingSpec on complex aggregators.

- Uses the technique from apache#4883 on DimFilterHavingSpec too.
- Also uses Transformers from apache#4890, necessitating a move of that and other
  related classes from druid-server to druid-processing. They probably make
  more sense there anyway.
- Adds a SQL query test.

Fixes apache#4957.

* Remove unused import.
gianm added a commit to implydata/druid-public that referenced this pull request Dec 5, 2017
* Introduce "transformSpec" at ingest-time.

It accepts a "filter" (standard query filter object) and "transforms" (a
list of objects with "name" and "expression"). These can be used to do
filtering and single-row transforms without need for a separate data
processing job.

The "expression" fields use the same expression language as other
expression-based feature.
gianm added a commit to implydata/druid-public that referenced this pull request Dec 6, 2017
* Fix havingSpec on complex aggregators.

- Uses the technique from apache#4883 on DimFilterHavingSpec too.
- Also uses Transformers from apache#4890, necessitating a move of that and other
  related classes from druid-server to druid-processing. They probably make
  more sense there anyway.
- Adds a SQL query test.

Fixes apache#4957.

* Remove unused import.
@jon-wei jon-wei added this to the 0.12.0 milestone Jan 5, 2018