Query time lookup (In Code Review) #1093
Conversation
we probably don't need this anymore, do we?
Technically no because the ObjectMapper::copy() was deleted. IMHO it makes the later part of that file cleaner at the expense of having a very confusing method here. But please limit that discussion to #1092
I will before merge, but this PR won't work at all without #1092.
TravisCI failed on unrelated code.
I would prefer if we did not add @JacksonInject annotations to objects that do the query serialization and de-serialization. This prevents someone from using the DimExtractionFn classes directly to construct Druid queries programmatically. Instead I think we should use an approach similar to what is done with DimFilters, where the DimFilter classes are just simple POJOs for serde purposes, and the actual behavior is implemented as Filter classes.
Another option is to have two constructors: one that has the @JsonCreator and @JacksonInject annotations, and another that doesn't and just delegates, passing null for the @JacksonInject parameters?
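A minimal sketch of that two-constructor idea (class and field names here are hypothetical, not the PR's actual code): Jackson picks the @JsonCreator constructor during deserialization, so the manager gets injected, while programmatic query builders use the plain constructor, which delegates with null.

```java
import com.fasterxml.jackson.annotation.JacksonInject;
import com.fasterxml.jackson.annotation.JsonCreator;
import com.fasterxml.jackson.annotation.JsonProperty;

// Stub standing in for the manager discussed in this thread
interface RenameManager {}

class RenameDimExtractionFn
{
  private final String namespace;
  private final RenameManager manager; // null when built programmatically

  @JsonCreator
  public RenameDimExtractionFn(
      @JsonProperty("namespace") String namespace,
      @JacksonInject RenameManager manager
  )
  {
    this.namespace = namespace;
    this.manager = manager;
  }

  // For constructing Druid queries programmatically, no injection needed
  public RenameDimExtractionFn(String namespace)
  {
    this(namespace, null);
  }

  public String getNamespace()
  {
    return namespace;
  }
}
```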
Either way is fine, but I would like us to move in a direction where query level objects could be moved into druid-api for serde purposes and not depend on other druid modules.
My other concern would be that by leaving @JacksonInject parameters, it may require unnecessary binding to dummy RenameManager instances for node types that don't need to understand what those objects are, but do need to be able to deserialize / serialize or re-write parts of queries. One example is the router nodes, which don't require that knowledge.
That's a good point as well... Fwiw, I think that these pains are just pointing out that the initial "do the entire query API through JSON/Jackson" idea might not be right anymore.
Perhaps we should think about actually separating the query layer from the implementation layer a bit more? We could explore more expressive or common languages (SQL, or whatever others are doing). Then we could make a distinction between the things that just pass queries around from the things that actually process them? (That's much bigger than this PR, but perhaps we should start thinking about it?)
That was my thought when referring to the DimFilter / Filter separation. Agree we should think about having different query APIs as well as higher level query languages. I think JSON / Jackson could be one of those, being our low-level API, with the implementation separate.
I like the ability to have various underlying implementations, but in terms of usability I don't think the user necessarily needs to know or have to decide what the underlying implementation should be. That also makes it much easier to configure, so configuration parameters don't have to leak into the query. Instead, it would be nice if we just had a concept of "lookup datasources" the user can decide to use. However, how those lookup datasources are implemented should be configurable at deployment time. If we need to support multiple implementations of lookups simultaneously, it might be useful to namespace those lookup datasources in a way that a user can choose which type to use. For example, at configuration time we may define several namespaces:
Assuming the database contains two tables:
Using Kafka, the datasource name could correspond to topic names. Imagine there is a topic called
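If multiple lookup implementations must coexist, the namespacing idea above could be as simple as a prefix convention; the `jdbc:`/`kafka:` prefixes and the helper names below are invented for illustration and are not part of this PR.

```java
// Hypothetical illustration of namespaced lookup datasources: a prefix chosen
// at deployment time selects the implementation, and the remainder names the
// table (JDBC) or topic (Kafka).
class LookupNamespaces
{
  // "jdbc:countries" -> "jdbc"; names without a prefix use the default impl
  static String implementationOf(String lookupDataSource)
  {
    int i = lookupDataSource.indexOf(':');
    return i < 0 ? "default" : lookupDataSource.substring(0, i);
  }

  // "kafka:user_renames" -> "user_renames" (the topic name)
  static String tableOf(String lookupDataSource)
  {
    int i = lookupDataSource.indexOf(':');
    return i < 0 ? lookupDataSource : lookupDataSource.substring(i + 1);
  }
}
```

The user only ever writes the namespaced name in the query; whether it resolves to a database table or a Kafka topic is a deployment-time decision.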
Get the Properties injected, not System.getProperty(). The bootstrap doesn't guarantee that the property will be set as a System property, but it does guarantee that wherever it got the property from, it will be in the Properties object.
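A sketch of the injected-Properties approach (the class name is hypothetical; the property key follows Druid's `druid.zk.paths.base` convention): the constructor receives whatever Properties object the bootstrap assembled, rather than reading JVM system properties directly.

```java
import java.util.Properties;

class ZkPathConfig
{
  private final String basePath;

  // In Druid this would arrive via the injection framework; a plain
  // constructor parameter shows the same idea.
  ZkPathConfig(Properties props)
  {
    // Works regardless of whether the value came from a -D flag, a
    // runtime.properties file, or anywhere else the bootstrap looked.
    this.basePath = props.getProperty("druid.zk.paths.base", "/druid");
  }

  String getBasePath()
  {
    return basePath;
  }
}
```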
Very good point, fixing.
I'm in agreement with @xvrl. I think this is an awesome first implementation, but would like to see it a little more decoupled. I think we can achieve that by making the "RenameManager" a bit less tied to Kafka topics and specific implementations. That is, what if RenameManager was

Then at bootstrap time, we could have something that actually loads up dim extractors into a map and passes that in via DI. The dim extraction function could refer to the "table" by name and just do the join through that. Each implementation of the extraction functions could have its own logic and just be registered with the RenameManager. So, JDBC and Kafka would actually be behind the same interface and the question of what topic, etc. would be something pre-determined. I can imagine bootstrapping this with a big json object like

Does that seem reasonable/make sense/all that good stuff?
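A minimal sketch of the decoupled design proposed here, with hypothetical names: each implementation (Kafka consumer, JDBC poller, ...) registers its lookup under a name at bootstrap, and the dim extraction function refers to that name only. Per the follow-up comment, the registered keys could just as well be namespaces rather than table names.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

class RenameManager
{
  private final Map<String, Function<String, String>> lookups = new ConcurrentHashMap<>();

  // Called at bootstrap by whatever loads the dim extractors via DI
  void register(String name, Function<String, String> lookup)
  {
    lookups.put(name, lookup);
  }

  // Called at query time; unknown names or missing keys fall back to the
  // original value
  String rename(String name, String value)
  {
    Function<String, String> lookup = lookups.get(name);
    String renamed = lookup == null ? null : lookup.apply(value);
    return renamed == null ? value : renamed;
  }
}
```

JDBC and Kafka implementations would sit behind the same `Function<String, String>` interface, so the question of which topic or table backs a given name is pre-determined at bootstrap.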
can we call this something more specific to Kafka? it seems like that's what it is doing
@cheddar makes sense, although I would prefer to register namespaces as opposed to table names, in order to allow joining on new Kafka topics and table names without having to restart servers.
I'm actually finding a bunch of tests which violate the specified interface on DimExtractionFn. Specifically, that results are never to be returned as empty strings, but should instead be
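The contract referenced here can be enforced with a tiny normalization helper. Assuming the replacement value (elided in the comment above) is null, a sketch:

```java
// Sketch of the DimExtractionFn contract discussed above: extraction results
// must never be returned as empty strings. The null replacement is an
// assumption here; the helper name is hypothetical.
class ExtractionResults
{
  static String emptyToNull(String result)
  {
    return (result == null || result.isEmpty()) ? null : result;
  }
}
```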
I agree on needing a more dynamic method of specifying the tables. I kinda think we should move towards being able to configure them on the coordinator console in a dynamic, administrative fashion instead of making it purely query-driven. This is to help protect the cluster in environments where there are significantly more people with the ability to query than the ability/know-how to administer. I'm willing to be convinced otherwise, but this generally seems like a place where latency concerns would mean that bootstrapping this stuff would be better done proactively (enabled by central configuration) rather than reactively to a query.
@cheddar / @xvrl : added a new type "namespace" which has the namespace mapping injected, but it is a concurrent map rather than some complex object and can take a null value just fine.

I'm closing this down temporarily while I address some comments.
Opening so I can push updates, please ignore for now |
Closing for internal banging around before going open for review.
Encountered an error with topNs:
Here is a topN query with a threshold of 1 and a dimension extraction set as per
This is the EXACT SAME query but also with an extraction filter
This is the result of a groupBy with the same dimension extraction filter and dimension extraction definition:
So it seems there may be some nastiness in
EDIT: determined to be an existing and known issue with topN
opening to merge some changes |
Closing again until it's ready. Have some more tests to run.
This discussion is useful for context here: https://groups.google.com/forum/#!searchin/druid-development/query$20time$20lookups/druid-development/udiivvgoMO8/prEEamAxd3sJ
Had a request to move this to the main docs instead of a readme here
* Add ability to explicitly rename using a "namespace", which is a particular data collection that is loaded on all realtime and historical nodes
* Add namespace caching and populating (can be on heap or off heap)
* Add NamespaceExtractionCacheManager for handling caches
* Added ExtractionNamespace for handling metadata on the extraction namespaces
* Added ExtractionNamespaceUpdate for handling metadata related to updates
* Updates can be one-off or regularly scheduled. There is a Zookeeper path to handle the namespace update requests (defaults to `namespaces`)
* Add ability to explicitly rename with a map passed at query time
* Add ability to rename using a key/value lookup in a database
* Add extension which caches renames from a kafka stream (requires kafka8)
* Added README.md for the dim-rename kafka extension
* Added docs to Historical-Config.md and DimensionSpecs.md
* Added context flag `topNFastRename` for when the user knows the data does not need rebucketing and wants to run the more optimized pool strategy. Otherwise an explicit dim extraction strategy is used which handles rebucketing of data.
* Added tests to show why PooledTopnAlgorithm cannot be used for dim extraction fn topNs
* Added endpoint for namespaces for coordinator at `/druid/coordinator/v1/namespaces`
There's a bunch of extra gunk on this PR because it was used for discussion early on. Can you close and re-open?

@cheddar : sure
Making fresh PR for code review. |
This PR is at the code review stage. Comments on either the high-level overview or the low-level implementations are welcome. This master comment will be updated with pertinent discussion points if they become major topics in the thread below.
Add query time lookups for renames via query "namespace" (may end up renaming this to something with a more natural description)
* Add a new factory type `io.druid.query.extraction.namespace.ExtractionNamespaceFunctionFactory`, which is used for runtime wiring of `io.druid.query.extraction.namespace.ExtractionNamespace` to their caching and functionality. The wiring of a namespace to its factory is accomplished at guice binding via `@Named` implementations of the `ExtractionNamespace`'s canonical class name.
* Add ability to explicitly rename with a map passed at query time
* Add ability to rename using a key/value lookup in a database
* Add extension which caches renames from a kafka stream
* Added README.md for the dim-rename extension
* Changed DimExtractionFn to be more of a factory rather than an implementation
* Updates are pushed via a zookeeper path (defaults to `${base}/namespaces`) and take the form of `io.druid.query.extraction.namespace.ExtractionNamespaceUpdate`. They can be one-off (`updateMs` set to 0 or null) or regularly scheduled. Regularly scheduled updates make an attempt to only update if the source has been modified since the cache refresh.
* Added coordinator endpoint `/druid/coordinator/v1/namespaces` for items that need "load everything everywhere" kind of logic.

I'll do a rebase/squash before real merge
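The factory wiring described in the first bullet can be pictured with a plain map standing in for the Guice injector. The interfaces below are simplified stand-ins for illustration, not the PR's actual signatures; only the canonical-class-name keying follows the description above.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Simplified stand-ins for the PR's interfaces
interface ExtractionNamespace {}

interface ExtractionNamespaceFunctionFactory<T extends ExtractionNamespace>
{
  // Builds the key -> renamed-value function for one namespace instance
  Function<String, String> build(T namespace);
}

// Example namespace type (standing in for a JDBC- or Kafka-backed one)
class StaticMapNamespace implements ExtractionNamespace
{
  final Map<String, String> renames;

  StaticMapNamespace(Map<String, String> renames)
  {
    this.renames = renames;
  }
}

// A plain map keyed by the namespace type's canonical class name, standing in
// for the @Named guice bindings described above
class NamespaceWiring
{
  private final Map<String, ExtractionNamespaceFunctionFactory<?>> factories = new HashMap<>();

  <T extends ExtractionNamespace> void register(Class<T> clazz, ExtractionNamespaceFunctionFactory<T> factory)
  {
    factories.put(clazz.getCanonicalName(), factory);
  }

  @SuppressWarnings("unchecked")
  <T extends ExtractionNamespace> Function<String, String> functionFor(T namespace)
  {
    ExtractionNamespaceFunctionFactory<T> factory = (ExtractionNamespaceFunctionFactory<T>)
        factories.get(namespace.getClass().getCanonicalName());
    return factory.build(namespace);
  }
}
```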
What currently works:
TODO:
* Better cluster-level data distribution capacity (ex: piggy-backing omniLoader) (Done)
* Have Rename Query Results and a Rebucket Query Results be two separate use cases. This one largely affects topN, and is not unique to this PR. See TopN Aliasing #1134 and Add more TopN documentation regarding Aliasing #1135
* Better polymorphism on loadSpec (see Overhaul of SegmentPullers to add consistency and retries #1132, waiting on code review) (Merged)

Current performance comparison for TopN:
