Query time lookup (In Code Review) #1093
Conversation
we probably don't need this anymore, do we?
Technically no because the ObjectMapper::copy() was deleted. IMHO it makes the later part of that file cleaner at the expense of having a very confusing method here. But please limit that discussion to #1092
I will before merge, but this PR won't work at all without #1092.
TravisCI failed on unrelated code.
I would prefer if we did not add @JacksonInject annotations to objects that do the query serialization and de-serialization. This prevents someone from using the DimExtractionFn classes directly to construct Druid queries programmatically. Instead I think we should use an approach similar to what is done with DimFilters, where the DimFilter classes are just simple POJOs for serde purposes, and the actual behavior is implemented as Filter classes.
Another option is to have two constructors: one that has the @JsonCreator and @JacksonInject annotations, and another that doesn't and just delegates, passing null for the @JacksonInject parameters?
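A minimal sketch of that two-constructor idea (class and field names here are hypothetical, not the PR's actual code): Jackson picks the @JsonCreator constructor during deserialization, so the manager gets injected, while programmatic query builders use the plain constructor, which delegates with null.

```java
import com.fasterxml.jackson.annotation.JacksonInject;
import com.fasterxml.jackson.annotation.JsonCreator;
import com.fasterxml.jackson.annotation.JsonProperty;

// Stub standing in for the manager discussed in this thread
interface RenameManager {}

class RenameDimExtractionFn
{
  private final String namespace;
  private final RenameManager manager; // null when built programmatically

  @JsonCreator
  public RenameDimExtractionFn(
      @JsonProperty("namespace") String namespace,
      @JacksonInject RenameManager manager
  )
  {
    this.namespace = namespace;
    this.manager = manager;
  }

  // For constructing Druid queries programmatically, no injection needed
  public RenameDimExtractionFn(String namespace)
  {
    this(namespace, null);
  }

  public String getNamespace()
  {
    return namespace;
  }
}
```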
Either way is fine, but I would like us to move in a direction where query level objects could be moved into druid-api for serde purposes and not depend on other druid modules.
My other concern would be that by leaving @JacksonInject parameters, it may require unnecessary binding to dummy RenameManager instances for node types that don't need to understand what those objects are, but do need to be able to deserialize / serialize or re-write parts of queries. One example is the router nodes, which don't require that knowledge.
That's a good point as well... Fwiw, I think that these pains are just pointing out that the initial "do the entire query API through JSON/Jackson" idea might not be right anymore.
Perhaps we should think about actually separating the query layer from the implementation layer a bit more? We could explore more expressive or common languages (SQL, or whatever others are doing). Then we could make a distinction between the things that just pass queries around from the things that actually process them? (That's much bigger than this PR, but perhaps we should start thinking about it?)
That was my thought when referring to the DimFilter / Filter separation. Agree we should think about having different query APIs as well as higher level query languages. I think JSON / Jackson could be one of those, being our low-level API, with the implementation separate.
I like the ability to have various underlying implementations, but in terms of usability I don't think the user necessarily needs to know or have to decide what the underlying implementation should be. That also makes it much easier to configure, so configuration parameters don't have to leak into the query. Instead, it would be nice if we just had a concept of "lookup datasources" the user can decide to use. However, how those lookup datasources are implemented should be configurable at deployment time. If we need to support multiple implementations of lookups simultaneously, it might be useful to namespace those lookup datasources in a way that a user can choose which type to use. For example, at configuration time we may define several namespaces:
Assuming the database contains two tables:
Using Kafka, the datasource name could correspond to topic names. Imagine there is a topic called
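If multiple lookup implementations must coexist, the namespacing idea above could be as simple as a prefix convention; the `jdbc:`/`kafka:` prefixes and the helper names below are invented for illustration and are not part of this PR.

```java
// Hypothetical illustration of namespaced lookup datasources: a prefix chosen
// at deployment time selects the implementation, and the remainder names the
// table (JDBC) or topic (Kafka).
class LookupNamespaces
{
  // "jdbc:countries" -> "jdbc"; names without a prefix use the default impl
  static String implementationOf(String lookupDataSource)
  {
    int i = lookupDataSource.indexOf(':');
    return i < 0 ? "default" : lookupDataSource.substring(0, i);
  }

  // "kafka:user_renames" -> "user_renames" (the topic name)
  static String tableOf(String lookupDataSource)
  {
    int i = lookupDataSource.indexOf(':');
    return i < 0 ? lookupDataSource : lookupDataSource.substring(i + 1);
  }
}
```

The user only ever writes the namespaced name in the query; whether it resolves to a database table or a Kafka topic is a deployment-time decision.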
Get the Properties injected, not System.getProperty(). The bootstrap doesn't guarantee that the property will be set as a System property, but it does guarantee that wherever it got the property from, it will be in the Properties object.
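A sketch of the injected-Properties approach (the class name is hypothetical; the property key follows Druid's `druid.zk.paths.base` convention): the constructor receives whatever Properties object the bootstrap assembled, rather than reading JVM system properties directly.

```java
import java.util.Properties;

class ZkPathConfig
{
  private final String basePath;

  // In Druid this would arrive via the injection framework; a plain
  // constructor parameter shows the same idea.
  ZkPathConfig(Properties props)
  {
    // Works regardless of whether the value came from a -D flag, a
    // runtime.properties file, or anywhere else the bootstrap looked.
    this.basePath = props.getProperty("druid.zk.paths.base", "/druid");
  }

  String getBasePath()
  {
    return basePath;
  }
}
```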
Very good point, fixing.
I'm in agreement with @xvrl. I think this is an awesome first implementation, but would like to see it a little more decoupled. I think we can achieve that by making the "RenameManager" a bit less tied to Kafka topics and specific implementations. That is, what if RenameManager was

Then at bootstrap time, we could have something that actually loads up dim extractors into a map and passes that in via DI. The dim extraction function could refer to the "table" by name and just do the join through that. Each implementation of the extraction functions could have its own logic and just be registered with the RenameManager. So, JDBC and Kafka would actually be behind the same interface and the question of what topic, etc. would be something pre-determined. I can imagine bootstrapping this with a big json object like

Does that seem reasonable/make sense/all that good stuff?
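A minimal sketch of the decoupled design proposed here, with hypothetical names: each implementation (Kafka consumer, JDBC poller, ...) registers its lookup under a name at bootstrap, and the dim extraction function refers to that name only. Per the follow-up comment, the registered keys could just as well be namespaces rather than table names.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

class RenameManager
{
  private final Map<String, Function<String, String>> lookups = new ConcurrentHashMap<>();

  // Called at bootstrap by whatever loads the dim extractors via DI
  void register(String name, Function<String, String> lookup)
  {
    lookups.put(name, lookup);
  }

  // Called at query time; unknown names or missing keys fall back to the
  // original value
  String rename(String name, String value)
  {
    Function<String, String> lookup = lookups.get(name);
    String renamed = lookup == null ? null : lookup.apply(value);
    return renamed == null ? value : renamed;
  }
}
```

JDBC and Kafka implementations would sit behind the same `Function<String, String>` interface, so the question of which topic or table backs a given name is pre-determined at bootstrap.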
can we call this something more specific to Kafka? it seems like that's what it is doing
@cheddar makes sense, although I would prefer to register namespaces as opposed to table names, in order to allow joining on new Kafka topics and table names without having to restart servers.
I'm actually finding a bunch of tests which violate the specified interface on DimExtractionFn. Specifically, that results are never to be returned as empty strings, but should instead be
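The contract referenced here can be enforced with a tiny normalization helper. Assuming the replacement value (elided in the comment above) is null, a sketch:

```java
// Sketch of the DimExtractionFn contract discussed above: extraction results
// must never be returned as empty strings. The null replacement is an
// assumption here; the helper name is hypothetical.
class ExtractionResults
{
  static String emptyToNull(String result)
  {
    return (result == null || result.isEmpty()) ? null : result;
  }
}
```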
I agree on needing a more dynamic method of specifying the tables. I kinda think we should move towards being able to configure them on the coordinator console in a dynamic, administrative fashion instead of making it purely query-driven. This is to help protect the cluster in environments where there are significantly more people with the ability to query than the ability/know-how to administer. I'm willing to be convinced otherwise, but this generally seems like a place where latency concerns would mean that bootstrapping this stuff would be better done proactively (enabled by central configuration) rather than reactively to a query.
@cheddar / @xvrl : added a new type "namespace" which has the namespace mapping injected, but it is a concurrent map rather than some complex object and can take a null value just fine.

I'm closing this down temporarily while I address some comments.
Opening so I can push updates, please ignore for now |
Closing for internal banging around before going open for review.
Encountered an error with topNs:
Here is a topN query with a threshold of 1 and a dimension extraction set as per
This is the EXACT SAME query but also with an extraction filter
This is the result of a groupBy with the same dimension extraction filter and dimension extraction definition:
So it seems there may be some nastiness in
EDIT: determined to be an existing and known issue with topN
opening to merge some changes |
Closing again until it's ready. Have some more tests to run.
This discussion is useful for context here: https://groups.google.com/forum/#!searchin/druid-development/query$20time$20lookups/druid-development/udiivvgoMO8/prEEamAxd3sJ
Had a request to move this to the main docs instead of a readme here
* Add ability to explicitly rename using a "namespace", which is a particular data collection that is loaded on all realtime and historical nodes
* Add namespace caching and populating (can be on heap or off heap)
* Add NamespaceExtractionCacheManager for handling caches
* Added ExtractionNamespace for handling metadata on the extraction namespaces
* Added ExtractionNamespaceUpdate for handling metadata related to updates
* Updates can be one-off or regularly scheduled. There is a Zookeeper path to handle the namespace update requests (defaults to `namespaces`)
* Add ability to explicitly rename with a map passed at query time
* Add ability to rename using a key/value lookup in a database
* Add extension which caches renames from a kafka stream (requires kafka8)
* Added README.md for the dim-rename kafka extension
* Added docs to Historical-Config.md and DimensionSpecs.md
* Added context flag `topNFastRename` for when the user knows the data does not need rebucketing and wants to run the more optimized pool strategy. Otherwise an explicit dim extraction strategy is used which handles rebucketing of data.
* Added tests to show why PooledTopnAlgorithm cannot be used for dim extraction fn topNs
* Added endpoint for namespaces for coordinator at `/druid/coordinator/v1/namespaces`
There's a bunch of extra gunk on this PR because it was used for discussion early on. Can you close and re-open?

@cheddar : sure
Making fresh PR for code review. |
This PR is at the code review stage. Comments on either the high-level overview or the low-level implementations are welcome. This master comment will be updated with pertinent discussion points if they become major topics in the thread below.
Add query time lookups for renames via query "namespace" (may end up renaming this to something with a more natural description)
* Add a new factory type `io.druid.query.extraction.namespace.ExtractionNamespaceFunctionFactory`, which is used for runtime wiring of `io.druid.query.extraction.namespace.ExtractionNamespace` to their caching and functionality. The wiring of a namespace to its factory is accomplished at guice binding via `@Named` implementations of the `ExtractionNamespace`'s canonical class name.
* Add ability to explicitly rename with a map passed at query time
* Add ability to rename using a key/value lookup in a database
* Add extension which caches renames from a kafka stream
* Added README.md for the dim-rename extension
* Changed DimExtractionFn to be more of a factory rather than an implementation
* Updates are pushed via a zookeeper path (defaults to `${base}/namespaces`) and take the form of `io.druid.query.extraction.namespace.ExtractionNamespaceUpdate`. They can be one-off (`updateMs` set to 0 or null) or regularly scheduled. Regularly scheduled updates make an attempt to only update if the source has been modified since the cache refresh.
* Added coordinator endpoint `/druid/coordinator/v1/namespaces` for items that need "load everything everywhere" kind of logic.

I'll do a rebase/squash before real merge
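The factory wiring described in the first bullet can be pictured with a plain map standing in for the Guice injector. The interfaces below are simplified stand-ins for illustration, not the PR's actual signatures; only the canonical-class-name keying follows the description above.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Simplified stand-ins for the PR's interfaces
interface ExtractionNamespace {}

interface ExtractionNamespaceFunctionFactory<T extends ExtractionNamespace>
{
  // Builds the key -> renamed-value function for one namespace instance
  Function<String, String> build(T namespace);
}

// Example namespace type (standing in for a JDBC- or Kafka-backed one)
class StaticMapNamespace implements ExtractionNamespace
{
  final Map<String, String> renames;

  StaticMapNamespace(Map<String, String> renames)
  {
    this.renames = renames;
  }
}

// A plain map keyed by the namespace type's canonical class name, standing in
// for the @Named guice bindings described above
class NamespaceWiring
{
  private final Map<String, ExtractionNamespaceFunctionFactory<?>> factories = new HashMap<>();

  <T extends ExtractionNamespace> void register(Class<T> clazz, ExtractionNamespaceFunctionFactory<T> factory)
  {
    factories.put(clazz.getCanonicalName(), factory);
  }

  @SuppressWarnings("unchecked")
  <T extends ExtractionNamespace> Function<String, String> functionFor(T namespace)
  {
    ExtractionNamespaceFunctionFactory<T> factory = (ExtractionNamespaceFunctionFactory<T>)
        factories.get(namespace.getClass().getCanonicalName());
    return factory.build(namespace);
  }
}
```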
What currently works:
TODO:
* Better cluster-level data distribution capacity (ex: piggy-backing omniLoader) (Done)
* Have Rename Query Results and a Rebucket Query Results be two separate use cases. This one largely affects topN, and is not unique to this PR. See TopN Aliasing #1134 and Add more TopN documentation regarding Aliasing #1135
* Better polymorphism on loadSpec (see Overhaul of SegmentPullers to add consistency and retries #1132, waiting on code review) (Merged)

Current performance comparison for TopN:
