Introduce "transformSpec" at ingest-time.#4890

Merged
gianm merged 27 commits into apache:master from gianm:ingest-expressions
Oct 31, 2017

Conversation

@gianm
Contributor

gianm commented Oct 2, 2017

It accepts a "filter" (standard query filter object) and "transforms" (a
list of objects with "name" and "expression"). These can be used to do
filtering and single-row transforms without need for a separate data
processing job.

The "expression" fields use the same expression language as other
expression-based feature.

Not yet documented for two reasons:

  1. Expression-based features are generally not yet documented (although I think the time is soon arriving where we should document them)
  2. There is not really a clear place to put the docs, since dataSchema docs are strewn about the documentation landscape. This is probably a sign that the docs need some refactoring.

I'm hoping to address both of these in a later patch, but for now, the docs are unchanged.
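For illustration, a transformSpec of the shape described above might look like the following. This is a hypothetical sketch based only on the field names given in this description ("filter", "transforms", "name", "expression"); the exact placement within the ingestion spec and the expression syntax should be checked against the docs once they land:

```json
{
  "transformSpec": {
    "filter": {
      "type": "selector",
      "dimension": "eventType",
      "value": "beepBoop"
    },
    "transforms": [
      {
        "name": "normalizedField",
        "expression": "lower(someField)"
      }
    ]
  }
}
```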

It accepts a "filter" (standard query filter object) and "transforms" (a
list of objects with "name" and "expression"). These can be used to do
filtering and single-row transforms without need for a separate data
processing job.

The "expression" fields use the same expression language as other
expression-based feature.
@gianm gianm added the Feature label Oct 2, 2017
@drcrallen
Contributor

@gianm can you go in a little deeper on the motivation here?

Previously the druid stuff was solely dedicated to indexing. Is the purpose here to add in more advanced ETL because people don't want to run more than one type of "worker"? Or is it because there are common, very primitive ETLs that will help 85% of the use cases you see?

As an alternative, what if we had more "indexing" extensions that plugged into different data writers or data syncs of different systems. Like a spark writer / hadoop output format. Or stuff specialized for certain other data processing frameworks?

@gianm
Contributor Author

gianm commented Oct 2, 2017

@drcrallen The motivation is to add some very basic stateless transforms to the Druid indexer to save people from the cost and hassle of running extra jobs when they want to do something basic. It's the same idea as the flattenSpec that exists for JSON and Avro.

Some examples of problems that this solves:

  • I have a field called "url" but I only want to index the domain -> use a regexp_extract expression
  • I have two fields, "firstName" and "lastName" but I want to index them as a pair -> use a concat expression
  • I have two time columns (perhaps start time and settlement time for a transaction), I want one to be the primary timestamp but I want to store the other as an int64 dimension -> use a timestamp_parse expression
  • I want to normalize some field to lowercase -> use a lower expression
  • I have a field called "eventType" and I only want to index type "beepBoop" -> use a selector filter

For people that are just trying to plug Kafka streams into Druid and don't have stream processing infrastructure set up, this capability is huge. (A lot of people have pretty bare bones setups)

Even for people that do have a stream processor set up, this approach is cheaper, since it doesn't require setting aside capacity in Kafka for processing and retention of the transformed topic.

@fjy
Contributor

fjy commented Oct 3, 2017

👍

@gianm
Contributor Author

gianm commented Oct 4, 2017

@drcrallen Does that rationale seem reasonable to you?

          if (valueMatcher != null) {
            rowSupplierForValueMatcher.set(transformedRow);
            if (!valueMatcher.matches()) {
              return null;
            }
          }
Contributor


Most indexers don't have the ability to filter null rows and instead fail with an NPE. Does this filtering work with either the Hadoop indexing job or the Kafka indexing service?

Contributor Author


@himanshug that's a good point, I tested it with IndexTask but not every form. IndexTask works since it has this code:

          // The null inputRow means the caller must skip this row.
          if (inputRow == null) {
            continue;
          }

I'll add some tests for kafka, hadoop, and realtime tasks as well.
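For context, the null-row contract being discussed can be sketched like this (a minimal, self-contained illustration with hypothetical simplified types, not the actual Druid Firehose/InputRow interfaces):

```java
// Hypothetical sketch (not the real Druid interfaces): a transformer returns
// null for rows removed by the filter, and the caller must skip null rows
// instead of hitting an NPE.
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;

public class TransformSketch {
  // Apply a transform, then a filter; a null return means "row filtered out".
  static String transform(String row, Function<String, String> fn, Predicate<String> filter) {
    String transformed = fn.apply(row);
    return filter.test(transformed) ? transformed : null;
  }

  static List<String> run() {
    List<String> out = new ArrayList<>();
    for (String row : List.of("beepBoop:1", "other:2", "beepBoop:3")) {
      String transformed = transform(row, String::toLowerCase, r -> r.startsWith("beepboop"));
      // The null return means the caller must skip this row, as IndexTask does above.
      if (transformed == null) {
        continue;
      }
      out.add(transformed);
    }
    return out;
  }

  public static void main(String[] args) {
    System.out.println(run()); // prints [beepboop:1, beepboop:3]
  }
}
```

The point of the check is that every consumer of a possibly-transforming parser has to treat null as "skip", which is why the Kafka, Hadoop, and realtime paths each need a guard like the one IndexTask already has.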

          if (transforms.isEmpty()) {
            transformedRow = row;
          } else {
            transformedRow = new TransformedInputRow(row, transforms);
          }
Contributor


also, can you adjust https://github.com/druid-io/druid/blob/master/indexing-hadoop/src/main/java/io/druid/indexer/hadoop/DatasourceRecordReader.java#L145 for hadoop re-indexing where rows don't go through the configured parser.

Contributor Author


Hmm, I think IngestSegmentFirehoseFactory doesn't work right either; it also bypasses the parser. Will try to do something to make this work for both of them and add some tests.

Contributor


Yeah, it's even better to handle this in IngestSegmentFirehoseFactory; then DatasourceRecordReader will get it automatically.

@himanshug
Contributor

@drcrallen @gianm Sounds good as a feature; I have also seen the need for simple transformations at ingestion time.
Does "undocumented" in this case mean that you don't want users to use this feature just yet because things may change very rapidly, or that you want to do more testing and ensure it works?

@gianm
Contributor Author

gianm commented Oct 4, 2017

@himanshug just undocumented because nothing expression-related is documented yet. I'm intending to do another patch before too long that will add documentation for all of them, linking through to the existing math-expr.md.

Also for the reason in the original comment -- there is not really a clear place to put docs for stuff in dataSchema, since the docs probably need a refactor.

@gianm gianm added the WIP label Oct 13, 2017
@gianm
Contributor Author

gianm commented Oct 13, 2017

Marking WIP until I can add more tests.

gianm added 2 commits October 16, 2017 11:33
- Add nullable annotation to Firehose.nextRow.
- Add tests for index task, realtime task, kafka task, hadoop mapper,
  and ingestSegment firehose.
@gianm gianm force-pushed the ingest-expressions branch from b44e6df to 7da33a1 Compare October 16, 2017 18:36
@gianm gianm closed this Oct 16, 2017
@gianm gianm reopened this Oct 16, 2017
@gianm gianm mentioned this pull request Oct 16, 2017
@gianm
Contributor Author

gianm commented Oct 20, 2017

@himanshug I would say a transform can be used to add columns (as long as those columns are functions of existing ones…). And I guess you could use it to sort of "remove" a column in a sense, by overwriting it with one that is just all nulls? Anyway, I just pushed a new javadoc that is more clear.

I also took your suggestion in #4890 (comment) and added the sanity check.

@himanshug
Contributor

@gianm I didn't realize all the possibilities, so that doc helps. Thanks.

@gianm gianm merged commit 0ce406b into apache:master Oct 31, 2017
@gianm gianm deleted the ingest-expressions branch October 31, 2017 00:38
gianm added a commit to gianm/druid that referenced this pull request Oct 31, 2017
- Uses the technique from apache#4883 on DimFilterHavingSpec too.
- Also uses Transformers from apache#4890, necessitating a move of that and other
  related classes from druid-server to druid-processing. They probably make
  more sense there anyway.
- Adds a SQL query test.

Fixes apache#4957.
fjy pushed a commit that referenced this pull request Nov 1, 2017
* Fix havingSpec on complex aggregators.

- Uses the technique from #4883 on DimFilterHavingSpec too.
- Also uses Transformers from #4890, necessitating a move of that and other
  related classes from druid-server to druid-processing. They probably make
  more sense there anyway.
- Adds a SQL query test.

Fixes #4957.

* Remove unused import.
gianm added a commit to implydata/druid-public that referenced this pull request Nov 3, 2017
* Introduce "transformSpec" at ingest-time.

It accepts a "filter" (standard query filter object) and "transforms" (a
list of objects with "name" and "expression"). These can be used to do
filtering and single-row transforms without need for a separate data
processing job.

The "expression" fields use the same expression language as other
expression-based feature.
gianm added a commit to implydata/druid-public that referenced this pull request Nov 14, 2017
* Introduce "transformSpec" at ingest-time.

It accepts a "filter" (standard query filter object) and "transforms" (a
list of objects with "name" and "expression"). These can be used to do
filtering and single-row transforms without need for a separate data
processing job.

The "expression" fields use the same expression language as other
expression-based feature.
gianm added a commit to implydata/druid-public that referenced this pull request Nov 15, 2017
* Fix havingSpec on complex aggregators.

- Uses the technique from apache#4883 on DimFilterHavingSpec too.
- Also uses Transformers from apache#4890, necessitating a move of that and other
  related classes from druid-server to druid-processing. They probably make
  more sense there anyway.
- Adds a SQL query test.

Fixes apache#4957.

* Remove unused import.
gianm added a commit to implydata/druid-public that referenced this pull request Dec 5, 2017
* Introduce "transformSpec" at ingest-time.

It accepts a "filter" (standard query filter object) and "transforms" (a
list of objects with "name" and "expression"). These can be used to do
filtering and single-row transforms without need for a separate data
processing job.

The "expression" fields use the same expression language as other
expression-based feature.
gianm added a commit to implydata/druid-public that referenced this pull request Dec 6, 2017
* Fix havingSpec on complex aggregators.

- Uses the technique from apache#4883 on DimFilterHavingSpec too.
- Also uses Transformers from apache#4890, necessitating a move of that and other
  related classes from druid-server to druid-processing. They probably make
  more sense there anyway.
- Adds a SQL query test.

Fixes apache#4957.

* Remove unused import.
@jon-wei jon-wei added this to the 0.12.0 milestone Jan 5, 2018