Query time lookup (In Code Review)#1093

Closed
drcrallen wants to merge 1 commit intoapache:masterfrom
metamx:queryTimeLookup

Conversation

@drcrallen
Contributor

This PR is at the code review stage. Comments on either the high-level overview or the low-level implementation are welcome. This top-level comment will be updated with pertinent discussion points if they become major topics in the thread below.

Add query-time lookups for renames via a query "namespace" (this may end up being renamed to something with a more natural description)
* Add a new factory type io.druid.query.extraction.namespace.ExtractionNamespaceFunctionFactory which is used for runtime wiring of io.druid.query.extraction.namespace.ExtractionNamespace to its caching and functionality. The wiring of a namespace to its factory is accomplished at Guice binding time via @Named bindings keyed on the ExtractionNamespace implementation's canonical class name.
* Add ability to explicitly rename with a map passed at query time
* Add ability to rename using a key/value lookup in a database
* Add extension which caches renames from a kafka stream
* Added README.md for the dim-rename extension
* Changed DimExtractionFn to be more of a factory rather than an implementation
* Updates are pushed via a ZooKeeper path (defaults to ${base}/namespaces) and take the form of io.druid.query.extraction.namespace.ExtractionNamespaceUpdate. They can be one-off (updateMs set to 0 or null) or regularly scheduled; regularly scheduled updates attempt to refresh only if the source has been modified since the last cache refresh.
* Added coordinator endpoint /druid/coordinator/v1/namespaces for items that need "load everything everywhere" kind of logic.
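
The "rename with a map passed at query time" item above might look roughly like the following minimal sketch. The class name, the pass-through behavior for unmapped values, and the empty-to-null normalization are illustrative assumptions, not code from this PR:

```java
import java.util.Map;

// Hypothetical sketch of a map-based query-time rename function.
public class MapRenameFn {
  private final Map<String, String> renames;

  public MapRenameFn(Map<String, String> renames) {
    this.renames = renames;
  }

  // Values with no mapping are assumed to pass through unchanged; empty
  // results are normalized to null, matching the DimExtractionFn contract
  // discussed further down in this thread.
  public String apply(String value) {
    String out = renames.getOrDefault(value, value);
    return (out == null || out.isEmpty()) ? null : out;
  }
}
```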

I'll do a rebase/squash before the real merge.

What currently works:

  • Looking up values at query time (renames and rebucketing)
  • Choosing optimization path
  • Kafka powered cluster-wide instant renames
  • Pulling configurations from zookeeper
  • Pulling from a DB table via JDBC
  • Pulling rename information from a file location on: S3, HDFS, Cassandra, or locally
  • Refreshing rename information

TODO:

Current performance comparison for TopN:

[image: namespaces]

Member

we probably don't need this anymore, do we?

Contributor Author

Technically no because the ObjectMapper::copy() was deleted. IMHO it makes the later part of that file cleaner at the expense of having a very confusing method here. But please limit that discussion to #1092

Member

Ah ok, well if we are making this change as part of #1092 already, then let's remove those changes completely and let #1092 deal with it.

Contributor Author

I will before merge, but this PR won't work at all without #1092.

@drcrallen
Contributor Author

TravisCI failed on unrelated code.

Member

I would prefer if we did not add @JacksonInject annotations to objects that do the query serialization and deserialization. This prevents someone from using the DimExtractionFn classes directly to construct Druid queries programmatically. Instead I think we should use an approach similar to what is done with DimFilters, where the DimFilter classes are just simple POJOs for serde purposes and the actual implementation lives in separate Filter classes.
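
The DimFilter / Filter style split suggested here could be sketched roughly as follows. All names below are illustrative, not the real Druid classes: the serde object is a plain POJO, and the runtime behavior lives in a separate object that only query-executing nodes construct.

```java
// Hypothetical sketch of the serde/runtime split described above.
public class SerdeSplitSketch {
  // Serde-only POJO: just query JSON fields, no injected services, so it
  // can be constructed directly when building queries programmatically.
  public static class NamespaceExtractionSpec {
    public final String namespace;

    public NamespaceExtractionSpec(String namespace) {
      this.namespace = namespace;
    }
  }

  // Runtime dependency, bound only on nodes that execute queries.
  public interface RenameManager {
    String lookup(String namespace, String key);
  }

  // Runtime counterpart, built from the spec where a manager exists.
  public static class NamespaceExtractionFn {
    private final NamespaceExtractionSpec spec;
    private final RenameManager manager;

    public NamespaceExtractionFn(NamespaceExtractionSpec spec, RenameManager manager) {
      this.spec = spec;
      this.manager = manager;
    }

    public String apply(String value) {
      return manager.lookup(spec.namespace, value);
    }
  }
}
```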

Contributor

Another option is to have two constructors: one that has the @JsonCreator and @JacksonInject annotations, and another that doesn't and just delegates, passing null for the @JacksonInject parameters.
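
A rough sketch of that two-constructor idea, with illustrative names; the Jackson annotations are indicated in comments to keep the example dependency-free:

```java
// Hypothetical two-constructor pattern as suggested above.
public class TwoCtorSketch {
  public interface RenameManager {
    String lookup(String key);
  }

  public static class NamespaceExtractionFn {
    private final String namespace;
    private final RenameManager manager; // may be null when built programmatically

    // In real code: @JsonCreator, with `manager` marked @JacksonInject.
    public NamespaceExtractionFn(String namespace, RenameManager manager) {
      this.namespace = namespace;
      this.manager = manager;
    }

    // Programmatic construction: delegates, passing null for the injected part.
    public NamespaceExtractionFn(String namespace) {
      this(namespace, null);
    }

    public String getNamespace() {
      return namespace;
    }

    public boolean hasManager() {
      return manager != null;
    }
  }
}
```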

Member

Either way is fine, but I would like us to move in a direction where query level objects could be moved into druid-api for serde purposes and not depend on other druid modules.

My other concern is that leaving the @JacksonInject parameters in place may require unnecessary bindings to dummy RenameManager instances on node types that don't need to understand what those objects are, but do need to serialize, deserialize, or rewrite parts of queries. One example is the router nodes, which don't require that knowledge.

Contributor

That's a good point as well... Fwiw, I think that these pains are just pointing out that the initial "do the entire query API through JSON/Jackson" idea might not be right anymore.

Perhaps we should think about actually separating the query layer from the implementation layer a bit more? We could explore more expressive or common languages (SQL, or whatever others are doing). Then we could make a distinction between the things that just pass queries around from the things that actually process them? (That's much bigger than this PR, but perhaps we should start thinking about it?)

Member

That was my thought when referring to the DimFilter / Filter separation. Agree we should think about having different query APIs as well as higher level query languages. I think JSON / Jackson could be one of those, being our low-level API, with the implementation separate.

@xvrl
Member

xvrl commented Feb 5, 2015

I like the ability to have various underlying implementations, but in terms of usability I don't think the user necessarily needs to know or have to decide what the underlying implementation should be. That also makes it much easier to configure, so configuration parameters don't have to leak into the query.

Instead, it would be nice if we just had a concept of "lookup datasources" the user can decide to use. However, how those lookup datasources are implemented should be configurable at deployment time.

If we need to support multiple implementations of lookups simultaneously, it might be useful to namespace those lookup datasources in a way that a user can choose which type to use.

For example, at configuration time we may define several namespaces:

  • namespace userinfo maps to a SQL database
  • namespace userlog maps to a kafka cluster

Assuming the database contains two tables: login and geo, each with an id column and a few other associated lookup columns, then a user can:

  • use the lookup datasource userinfo:login.email to map user ids to their email
  • and geo:countries.english to map ISO country codes to their actual English names.

Using Kafka, the datasource name could correspond to topic names. Imagine there is a topic called userstatus whose key is the user id and whose value is the user's status; then we can map user id to status using the lookup datasource userlog:userstatus.
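
Parsing the proposed `namespace:table.column` style names could be as simple as the sketch below. The scheme itself is only a proposal at this point, and the class is illustrative; the column part is treated as optional, as in the `userlog:userstatus` example.

```java
// Hypothetical parser for the proposed "namespace:table[.column]" lookup
// datasource names, e.g. "userinfo:login.email" or "userlog:userstatus".
public class LookupName {
  public final String namespace;
  public final String table;
  public final String column; // null when no column is given

  private LookupName(String namespace, String table, String column) {
    this.namespace = namespace;
    this.table = table;
    this.column = column;
  }

  public static LookupName parse(String name) {
    int colon = name.indexOf(':');
    if (colon < 0) {
      throw new IllegalArgumentException("expected namespace:table[.column], got " + name);
    }
    String rest = name.substring(colon + 1);
    int dot = rest.indexOf('.');
    return dot < 0
        ? new LookupName(name.substring(0, colon), rest, null)
        : new LookupName(name.substring(0, colon), rest.substring(0, dot), rest.substring(dot + 1));
  }
}
```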

Contributor

Get the Properties injected, not System.getProperty(). The bootstrap doesn't guarantee that the property will be set as a System property, but it does guarantee that wherever it got the property from, it will be in the Properties object.
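
The distinction being made here, sketched with a hypothetical property name (the real property key and defaults are not from this PR):

```java
import java.util.Properties;

// Read configuration from an injected Properties object rather than
// System.getProperty(): the bootstrap only guarantees the value is
// present in the former, wherever it originally came from.
public class PropsSketch {
  // In real code `props` would be provided by the injector (e.g. Guice
  // @Inject Properties); the property name here is illustrative.
  public static String zkNamespacePath(Properties props) {
    return props.getProperty("druid.zk.paths.namespacePath", "namespaces");
  }
}
```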

Contributor Author

Very good point, fixing.

@cheddar
Contributor

cheddar commented Feb 5, 2015

I'm in agreement with @xvrl. I think this is an awesome first implementation, but would like to see it a little more decoupled.

I think we can achieve that by making the "RenameManager" a bit less tied to Kafka topics and specific implementations. That is, what if RenameManager was

public class RenameManager {
  public RenameManager(Map<String, DimRenameFn> renameFns) {...}

  public DimRenameFn find(String name) {...}
}

Then at bootstrap time, we could have something that actually loads up dim extractors into a map and passes that in via DI. The dim extraction function could refer to the "table" by name and just do the join through that.

Each implementation of the extraction functions could have its own logic and just be registered with the RenameManager. So, JDBC and Kafka would actually be behind the same interface and the question of what topic, etc. would be something pre-determined.

I can imagine bootstrapping this with a big json object like

{
   "publisher": { "type": "kafka", "topic": "xyz" ... },
   "advertiser": { "type": "jdbc", "url": "jdbc://blah/blah" ... }
}

Does that seem reasonable/make sense/all that good stuff?

Contributor

Can we call this something more specific to Kafka? It seems like that's what it is doing.

@xvrl
Member

xvrl commented Feb 5, 2015

@cheddar makes sense, although I would prefer to register namespaces as opposed to table names, in order to allow joining on new Kafka topics and table names without having to restart servers.
Alternatively, we could make the configuration dynamic.

@drcrallen
Contributor Author

@cheddar : Yes that's pretty much what I was thinking after talking with @xvrl as well.

@drcrallen
Contributor Author

I'm actually finding a bunch of tests which violate the specified interface on DimExtractionFn. Specifically, results are never to be returned as empty strings; they should be null instead.
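
That contract boils down to something like the following normalization; the helper below is an illustrative sketch, not code from this PR:

```java
// Sketch of the DimExtractionFn contract referenced above: extraction
// results are never returned as empty strings, only as null.
public class EmptyToNull {
  public static String normalize(String extracted) {
    return (extracted == null || extracted.isEmpty()) ? null : extracted;
  }
}
```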

@cheddar
Contributor

cheddar commented Feb 6, 2015

@xvrl

I agree on needing a more dynamic method of specifying the tables. I kinda think we should move towards being able to configure them on the coordinator console in a dynamic, administrative fashion instead of making it purely query-driven. This is to help protect the cluster in environments where there are significantly more people with the ability to query than the ability/know-how to administer.

I'm willing to be convinced otherwise, but this generally seems like a place where latency concerns would mean that bootstrapping this stuff would be better done proactively (enabled by central configuration) rather than reactively to a query.

@drcrallen
Contributor Author

@cheddar / @xvrl : added a new type "namespace" which has the namespace mapping injected, but it is a concurrent map rather than some complex object and can take a null value just fine.
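
A guess at what using that injected concurrent map might look like. The names and the pass-through behavior for an unknown namespace are assumptions, not taken from the PR:

```java
import java.util.concurrent.ConcurrentMap;
import java.util.function.Function;

// Hypothetical sketch of a lookup against the injected "namespace" map
// described above.
public class NamespaceMapSketch {
  public static String rename(
      ConcurrentMap<String, Function<String, String>> namespaces,
      String namespace,
      String value
  ) {
    Function<String, String> fn = namespaces.get(namespace);
    // Unknown namespace: assumed here to pass the value through unchanged
    // rather than failing the query.
    return fn == null ? value : fn.apply(value);
  }
}
```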

@drcrallen
Contributor Author

I'm closing this down temporarily while I address some comments.

@drcrallen drcrallen closed this Feb 9, 2015
@drcrallen
Contributor Author

Opening so I can push updates, please ignore for now

@drcrallen
Contributor Author

Closing for internal banging around before going open for review.

@drcrallen drcrallen closed this Feb 10, 2015
@drcrallen
Contributor Author

Encountered an error with topNs:

Here is a topN query with a threshold of 1 and a dimension extraction set as:

"dimension" : {
  "type" : "extraction",
  "dimension" : "l_orderkey",
  "outputName" : "l_orderkey",
  "dimExtractionFn" : {
    "type" : "namespace",
    "namespace" : "testTopic0"
  }
}

[ {
  "timestamp" : "1992-01-02T00:00:00.000Z",
  "result" : [ {
    "l_orderkey" : "King",
    "L_DISCOUNT_" : 0.5200000051409006,
    "L_QUANTITY_" : 417,
    "L_TAX_" : 0.4299999922513962,
    "count" : 10,
    "L_EXTENDEDPRICE_" : 676338.4765625
  } ]
} ]

This is the EXACT SAME query, but with an extraction filter added:

[ {
  "timestamp" : "1992-01-02T00:00:00.000Z",
  "result" : [ {
    "l_orderkey" : "King",
    "L_DISCOUNT_" : 0.6500000022351742,
    "L_QUANTITY_" : 530,
    "L_TAX_" : 0.5999999921768904,
    "count" : 14,
    "L_EXTENDEDPRICE_" : 846855.53125
  } ]
} ]

This is the result of a groupBy with the same dimension extraction filter and dimension extraction definition:

[ {
  "version" : "v1",
  "timestamp" : "1900-01-09T00:00:00.000Z",
  "event" : {
    "L_DISCOUNT_" : 0.6500000357627869,
    "l_orderkey" : "King",
    "count" : 14,
    "L_EXTENDEDPRICE_" : 846855.5,
    "L_QUANTITY_" : 530,
    "L_TAX_" : 0.5999999642372131
  }
} ]

So it seems there may be some nastiness in DimExtractionTopNAlgorithm that will need to be resolved before this PR can be ready.

EDIT: determined to be an existing and known issue with topN

@drcrallen
Contributor Author

Opening to merge some changes.

@drcrallen
Contributor Author

Closing again until it's ready. Have some more tests to run.

@drcrallen drcrallen closed this Feb 10, 2015
@drcrallen drcrallen force-pushed the queryTimeLookup branch 6 times, most recently from a29f814 to ca53f26 Compare February 20, 2015 22:48
@drcrallen drcrallen force-pushed the queryTimeLookup branch 2 times, most recently from a6df072 to 4fbc099 Compare March 2, 2015 18:29
@xvrl xvrl added the Discuss label Mar 3, 2015
@drcrallen drcrallen force-pushed the queryTimeLookup branch 2 times, most recently from 2757d8b to 86926a7 Compare March 20, 2015 00:33
@milimetric

@drcrallen drcrallen force-pushed the queryTimeLookup branch 5 times, most recently from 5596757 to 955a550 Compare March 31, 2015 19:35
@drcrallen drcrallen changed the title Query time lookup Query time lookup (In Code Review) Mar 31, 2015
Comment thread extensions/dim-rename-kafka8/README.md Outdated
Contributor Author

Had a request to move this to the main docs instead of a readme here

* Add ability to explicitly rename using a "namespace", which is a particular data collection that is loaded on all realtime and historical nodes
* Add namespace caching and populating (can be on heap or off heap)
* Add NamespaceExtractionCacheManager for handling caches
* Added ExtractionNamespace for handling metadata on the extraction namespaces
* Added ExtractionNamespaceUpdate for handling metadata related to updates
* Updates can be one-off or regularly scheduled. There is a Zookeeper path to handle the namespace update requests (defaults to `namespaces`)
* Add ability to explicitly rename with a map passed at query time
* Add ability to rename using a key/value lookup in a database
* Add extension which caches renames from a kafka stream (requires kafka8)
* Added README.md for the dim-rename kafka extension
* Added docs to Historical-Config.md and DimensionSpecs.md
* Added context flag `topNFastRename` for if the user knows the data does not need rebucketing and wants to run the more optimized pool strategy. Otherwise an explicit dim extraction strategy is used which handles rebucketing of data.
* Added tests to show why PooledTopNAlgorithm cannot be used for dim extraction fn topNs
* Added endpoint for namespaces for coordinator at `/druid/coordinator/v1/namespaces`
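
The `topNFastRename` flag above is a query context flag, so presumably it would be supplied like any other context key. A sketch, not verbatim from the docs, with all other query fields omitted:

```json
{
  "queryType": "topN",
  "context": {
    "topNFastRename": true
  }
}
```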
@xvrl xvrl removed the Discuss label Apr 1, 2015
@cheddar
Contributor

cheddar commented Apr 1, 2015

There's a bunch of extra gunk on this PR because it was used for discussion early on. Can you close and re-open?

@drcrallen
Contributor Author

@cheddar : sure

@drcrallen
Contributor Author

Making fresh PR for code review.

@drcrallen drcrallen closed this Apr 1, 2015
@drcrallen drcrallen mentioned this pull request Apr 1, 2015
@drcrallen
Contributor Author

@cheddar : #1259
