[QTL] Support multiple lookup maps within one namespace#2524

Closed
sirpkt wants to merge 3 commits into apache:master from sirpkt:multi-column-lookup

Conversation

@sirpkt
Contributor

@sirpkt sirpkt commented Feb 23, 2016

This PR is related to #2523.

  • For a URI or JDBC source with multiple columns, the user can define all the needed (key column, value column) mappings within one namespace configuration.
  • Impact on existing namespace implementations (like druid-Kafka-extraction-namespace) is minimized: just change implements ExtractionNamespaceFunctionFactory to extends ExtractionNamespaceFunctionFactory.
  • NamespaceExtractor now has one more parameter, mapName, which indicates the lookup map name in the given namespace. Without that parameter it works as before, for backward compatibility.

@fjy
Contributor

fjy commented Feb 23, 2016

@sirpkt awesome!

@fjy fjy added this to the 0.9.1 milestone Feb 23, 2016
@b-slim
Contributor

b-slim commented Feb 23, 2016

@sirpkt I am not sure how this can work with the actual LookupDimensionSpec. I am wondering how you will call it at query time.

@sirpkt
Contributor Author

sirpkt commented Feb 24, 2016

As I replied on the issue page, the user can specify the lookup map name within the namespace in the lookup, like:

{
  "type":"namespace",
  "namespace":"DB1",
  "mapName":"CtoD"
}

I added unit test code for explicit JSON usage of NamespacedExtraction.
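For readers following the thread, here is a minimal sketch of how a namespace holding several inner maps might resolve a query-time lookup such as the "DB1" / "CtoD" spec above. The class `MultiMapNamespace`, its methods, and the `"__default"` constant are hypothetical names used for illustration only; they are not taken from the PR's code, and the default name is simply the value proposed later in this review.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for one namespace that holds several inner lookup maps.
final class MultiMapNamespace {
    // Assumed name of the map used when the spec omits "mapName".
    static final String DEFAULT_MAP_NAME = "__default";

    // mapName -> (key -> value)
    private final Map<String, Map<String, String>> maps = new HashMap<>();

    void put(String mapName, String key, String value) {
        maps.computeIfAbsent(mapName, n -> new HashMap<>()).put(key, value);
    }

    // mapName may be null, mirroring a lookup spec that has no "mapName" field.
    String apply(String mapName, String key) {
        String resolved = (mapName == null) ? DEFAULT_MAP_NAME : mapName;
        Map<String, String> inner = maps.get(resolved);
        return inner == null ? null : inner.get(key);
    }

    public static void main(String[] args) {
        MultiMapNamespace db1 = new MultiMapNamespace();
        db1.put("CtoD", "c1", "d1");
        db1.put(DEFAULT_MAP_NAME, "a1", "b1");

        System.out.println(db1.apply("CtoD", "c1")); // explicit mapName
        System.out.println(db1.apply(null, "a1"));   // omitted mapName falls back to the default map
    }
}
```

Under these assumptions, an existing spec without mapName keeps resolving against a single map, which is the backward-compatible behaviour the PR description aims for.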

@fjy fjy changed the title Support multiple lookup maps within one namespace [QTL] Support multiple lookup maps within one namespace Feb 24, 2016
@fjy
Contributor

fjy commented Feb 24, 2016

@drcrallen @b-slim @cheddar can u guys review this and coordinate over development?

@fjy
Contributor

fjy commented Mar 28, 2016

@drcrallen @b-slim

Contributor

This violates offheap caching

@drcrallen
Contributor

@sirpkt I think this one needs more discussion among the community to make sure it fits overall expectations. As such I'm proposing punting it out of 0.9.1.

0.9.1 is slated for a major overhaul of Lookups to essentially be the first (hopefully) production-ready version for lookups.

This is an (important) feature addition for lookups, but it is outside the scope of "required for MVP".

@fjy
Contributor

fjy commented Jun 15, 2016

@drcrallen @b-slim what's going on with this PR?

@b-slim
Contributor

b-slim commented Jun 16, 2016

Same opinion as @drcrallen: the feature needs more discussion, plus major changes to be compatible with the new lookup implementations. In addition, we have a pretty busy roadmap and I guess this feature is not a top priority, IMHO. The author can always start working on making it work with the new lookup module.

@sirpkt
Contributor Author

sirpkt commented Jun 16, 2016

I'll try to make this work with the new lookup module.
I think it will take some time because lookups have changed considerably.

@fjy fjy modified the milestones: 0.9.3, 0.9.2 Jun 16, 2016
@fjy
Contributor

fjy commented Aug 26, 2016

@sirpkt @b-slim @drcrallen can we submit an issue or a proposal for the list of changes described in this PR and discuss changes there?

@sirpkt sirpkt force-pushed the multi-column-lookup branch 2 times, most recently from 7258abb to 254fe3d Compare September 1, 2016 04:29
@sirpkt
Contributor Author

sirpkt commented Sep 1, 2016

I added a mapName field to LookupDimensionSpec and RegisteredLookupExtractionFn
and updated the docs to reflect the change to Globally Cached Lookups.
I also modified the description of the Lookup Extraction Function
because LookupExtractor no longer supports the "namespace" type.

@sirpkt sirpkt force-pushed the multi-column-lookup branch from 254fe3d to 79ef586 Compare September 21, 2016 05:27
Contributor

@jon-wei jon-wei left a comment

I had some comments, but I'm generally on board with the goal and approach taken by this PR.

Contributor

@jon-wei jon-wei Oct 3, 2016

let's rename this to something like getCacheInnerMap() to differentiate the two functions, and note in the javadocs that this retrieves an inner map

Contributor Author

done

Contributor

For thought, would it be easier/better to use a MultiKey for the composite namespace:ID key?
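As an illustration of why a structured composite key can be preferable to a concatenated string, here is a small sketch. `CompositeKey` is a hypothetical hand-written class that plays the role of the MultiKey mentioned in the comment; it is not the Commons Collections class and not the PR's code.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical two-part key standing in for a MultiKey-style composite key.
final class CompositeKey {
    private final String namespace;
    private final String id;

    CompositeKey(String namespace, String id) {
        this.namespace = Objects.requireNonNull(namespace, "namespace");
        this.id = Objects.requireNonNull(id, "id");
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) {
            return true;
        }
        if (!(o instanceof CompositeKey)) {
            return false;
        }
        CompositeKey other = (CompositeKey) o;
        return namespace.equals(other.namespace) && id.equals(other.id);
    }

    @Override
    public int hashCode() {
        return Objects.hash(namespace, id);
    }

    // String concatenation with a delimiter, shown only for contrast.
    static String joined(String namespace, String id) {
        return namespace + ":" + id;
    }

    public static void main(String[] args) {
        // Two different (namespace, id) pairs that concatenate to the same string "a:b:c".
        System.out.println(joined("a:b", "c").equals(joined("a", "b:c")));                     // true
        System.out.println(new CompositeKey("a:b", "c").equals(new CompositeKey("a", "b:c"))); // false

        Map<CompositeKey, String> cache = new HashMap<>();
        cache.put(new CompositeKey("DB1", "CtoD"), "cache-1");
        System.out.println(cache.get(new CompositeKey("DB1", "CtoD")));                        // cache-1
    }
}
```

The structured key compares each part separately, so a delimiter character inside a namespace or ID cannot make two distinct keys collide.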

Contributor Author

replaced by MultiKey

Contributor

Suggest adding a bit more documentation detail along the lines of:

  • Key/Value column refer to columns within the lookup source; "columns" field refers to Druid columns whose values will be used as filtering criteria for retrieving the mapping row from the lookup source

Contributor Author

done

Contributor

typo: "kayValueMaps" -> "keyValueMaps"

Contributor Author

fixed as "maps"

Contributor

typo: "kayValueMaps"

Contributor Author

fixed as maps

Contributor

Let's add a note in the docs about how KafkaLookupExtractor only uses the default mapname

Contributor Author

done

Contributor

Can we add some javadocs explaining how this function differs from getMapCachePopulator()?

Contributor Author

done

Contributor

spelling: "swap" -> "swaps", "leave" -> "leaves"

Contributor Author

done

Contributor

Let's add javadocs for these two methods

Contributor Author

done

Contributor

Let's add a note on why the delete can be a no-op here (GC?)

Contributor Author

done

Contributor

@gianm gianm left a comment

Looks like a useful feature to reduce memory use and simplify management of lookups - although I have some concerns about the API. Specifically, we need to try to retain backwards compatibility.

Contributor

@gianm gianm Oct 3, 2016

I agree just maps is clearer.

Contributor

@gianm gianm Oct 3, 2016

Suggest:

  • keyName -> keyColumn
  • valueName -> valueColumn

(like the old configs)

Contributor Author

done

Contributor

@gianm gianm Oct 3, 2016

We need to retain backwards compatibility.

Contributor

Perhaps we should have a "default map" name that the keyColumn/valueColumn go into if you don't specify a maps list. And then that one also gets used at query time if you don't specify a mapName.

Contributor

@gianm gianm Oct 4, 2016

Ah, I see we already have this in DEFAULT_MAPNAME. Let's use that for this purpose.
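The fallback discussed in these comments could look roughly like the sketch below: when a config has no "maps" list, the legacy keyColumn/valueColumn pair is wrapped into a single entry under the default map name. `KeyValueMapSpec` and `normalize` are hypothetical names for illustration, and the value `"__default"` is the one suggested later in this review rather than something confirmed from the PR's code.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical config holder; field names follow the thread (mapName, keyColumn, valueColumn).
final class KeyValueMapSpec {
    // Assumed value of DEFAULT_MAPNAME, as suggested later in the review.
    static final String DEFAULT_MAPNAME = "__default";

    final String mapName;
    final String keyColumn;
    final String valueColumn;

    KeyValueMapSpec(String mapName, String keyColumn, String valueColumn) {
        this.mapName = mapName;
        this.keyColumn = keyColumn;
        this.valueColumn = valueColumn;
    }

    // Prefer an explicit "maps" list; otherwise wrap the legacy columns as the default map.
    static List<KeyValueMapSpec> normalize(
        List<KeyValueMapSpec> maps,
        String legacyKeyColumn,
        String legacyValueColumn
    ) {
        if (maps != null && !maps.isEmpty()) {
            return Collections.unmodifiableList(new ArrayList<>(maps));
        }
        if (legacyKeyColumn == null || legacyValueColumn == null) {
            throw new IllegalArgumentException("either maps or keyColumn/valueColumn must be set");
        }
        return Collections.singletonList(
            new KeyValueMapSpec(DEFAULT_MAPNAME, legacyKeyColumn, legacyValueColumn)
        );
    }

    public static void main(String[] args) {
        // Old-style config: no "maps" list, only keyColumn/valueColumn.
        List<KeyValueMapSpec> legacy = normalize(null, "key", "value");
        System.out.println(legacy.get(0).mapName + ": " + legacy.get(0).keyColumn + " -> " + legacy.get(0).valueColumn);
    }
}
```

With this shape, old configurations and new multi-map configurations end up in the same list form, so downstream code only has to handle one case.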

Contributor Author

done

Contributor

Would prefer List<KeyValueMap> here, it's generally easier to work with.

Contributor Author

done

Contributor

Any reason not to use the auto generated IntelliJ style?

Contributor Author

changed to use auto generated one

Contributor

These should be escaped; field names can have funny characters in them.

Contributor Author

done

Contributor

__default is more consistent with defaults in other Druid areas.

Contributor Author

done

@sirpkt sirpkt force-pushed the multi-column-lookup branch from 79ef586 to 739290a Compare October 10, 2016 01:54
Contributor

@gianm gianm left a comment

I didn't totally review the cache manager code or jdbc namespace yet. But I looked at the main apis, http stuff, parser stuff, and uri namespace code so far. Will look at the rest soon but I just wanted to get at least this part of the review out.

The biggest question for me at this point is, does it make sense to move "maps" into the parse spec? I think it does but would appreciate a second opinion.

Contributor

missing "

Contributor

keyValueMap is maps now

Contributor

Suggest using __default in these examples as it is the actual default map name.

Contributor

Please also document this as the default map name.

Contributor

Hmm I wonder about backwards compatibility here. Will take a closer look at the actual http code.

Contributor

mapName (spelling)

Contributor

Is this necessary? I understand having a default mapName, but it seems strange to have a default key/value name (especially undocumented).

Contributor

Changing String to Object means this is no longer a "flat" data parser! Maybe that's okay, but if it is okay, the name should definitely change.

Contributor

When reading through the URIExtractionNamespace changes I now wonder if having the "maps" with their keyColumn / valueColumn out here is causing the dizziness and weirdness with simpleJson… because it has no keyColumn!

I wonder if it makes more sense to move "maps" into the namespaceParseSpec. That way the parser is in charge of what the map names and k/vs it returns are, and that should remove some of the weirdness in URIExtractionNamespace.

Contributor

that seems like a reasonable change, it would express more directly that simpleJson parser doesn't use the "maps" field unlike the other parser types

I suppose the logic for map building from a set of KeyValueMaps in the URIExtractionNamespace's delegate parser could be moved to something shared by the CSV/TSV/customJson "FlatDataParsers" in URIExtractionNamespace

Contributor

See comment above… I wonder if it'd be less dizzy to move "maps" into the namespaceParseSpec.

Contributor

I don't think this is enough escaping. There could be backslashes and quotes and stuff in the field names. Maybe. Does JDBC/JDBI have a utility function to help with escaping?

Either that, or let's check the requiredFields against a whitelist of characters.
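A whitelist check of the kind suggested here might be sketched as follows. The allowed character set (ASCII letters, digits, and underscore, not starting with a digit) and the names `IdentifierWhitelist`, `requireSafe`, and `selectSql` are assumptions made for illustration; the PR itself may validate or escape differently.

```java
import java.util.regex.Pattern;

// Hypothetical helper that rejects column/table names outside a conservative whitelist.
final class IdentifierWhitelist {
    // Assumed whitelist: a letter or underscore first, then letters, digits, or underscores.
    private static final Pattern SAFE = Pattern.compile("[A-Za-z_][A-Za-z0-9_]*");

    static String requireSafe(String identifier) {
        if (identifier == null || !SAFE.matcher(identifier).matches()) {
            throw new IllegalArgumentException("unsafe column or table name: " + identifier);
        }
        return identifier;
    }

    // Builds the lookup query only from identifiers that passed the whitelist.
    static String selectSql(String table, String keyColumn, String valueColumn) {
        return "SELECT " + requireSafe(keyColumn) + ", " + requireSafe(valueColumn)
            + " FROM " + requireSafe(table);
    }

    public static void main(String[] args) {
        System.out.println(selectSql("lookupTable", "keyCol", "valueCol"));
        try {
            selectSql("lookupTable", "keyCol\"; DROP TABLE x; --", "valueCol");
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Rejecting anything outside the whitelist is stricter than escaping, but it avoids depending on database-specific quoting rules.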

@gianm
Contributor

gianm commented Oct 19, 2016

@drcrallen @b-slim any thoughts on the general idea & API here?

Broadly: looks like the changes are all centered around having more than one lookup map per thing-we-load. So we may load a single json file that has many logical lookups in it. IMO the nice thing about doing it this way is we only have to poll and parse the file one time. It's also easier to configure loading multiple lookups from one file. I'm on board with the general idea and attempting to work out whether the API needs adjustments or not.
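The "poll and parse one time" idea can be sketched as a single pass over already-parsed rows that fills several lookup maps at once. `MultiMapLoader`, the `{mapName, keyColumn, valueColumn}` array layout, and the column names A to D are hypothetical and chosen only for this example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical loader: builds several key -> value maps from one pass over the rows.
final class MultiMapLoader {
    // Each spec is {mapName, keyColumn, valueColumn}.
    static Map<String, Map<String, String>> load(List<Map<String, String>> rows, List<String[]> specs) {
        Map<String, Map<String, String>> result = new LinkedHashMap<>();
        for (String[] spec : specs) {
            result.put(spec[0], new HashMap<>());
        }
        // The source is iterated once; every row feeds all configured maps.
        for (Map<String, String> row : rows) {
            for (String[] spec : specs) {
                String key = row.get(spec[1]);
                String value = row.get(spec[2]);
                if (key != null && value != null) {
                    result.get(spec[0]).put(key, value);
                }
            }
        }
        return result;
    }

    // Convenience for the example: builds a row from alternating column names and values.
    static Map<String, String> row(String... columnsAndValues) {
        Map<String, String> r = new HashMap<>();
        for (int i = 0; i + 1 < columnsAndValues.length; i += 2) {
            r.put(columnsAndValues[i], columnsAndValues[i + 1]);
        }
        return r;
    }

    public static void main(String[] args) {
        List<Map<String, String>> rows = new ArrayList<>();
        rows.add(row("A", "a1", "B", "b1", "C", "c1", "D", "d1"));
        rows.add(row("A", "a2", "B", "b2", "C", "c2", "D", "d2"));

        List<String[]> specs = new ArrayList<>();
        specs.add(new String[] {"AtoB", "A", "B"});
        specs.add(new String[] {"CtoD", "C", "D"});

        Map<String, Map<String, String>> maps = load(rows, specs);
        System.out.println(maps.get("AtoB").get("a2")); // b2
        System.out.println(maps.get("CtoD").get("c1")); // d1
    }
}
```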

Contributor

Why is this changing the API? Not all the lookups will have a map name?

@b-slim
Contributor

b-slim commented Oct 19, 2016

I like the idea of minimizing the number of fetches a lookup has to make, but the current API change makes it backward incompatible, and it is unclear what mapName really means. I would highly recommend working on that, for instance by using a namespacing convention like name.subname, where name is used to match the registered lookup and subname is the equivalent of mapName.

@gianm gianm assigned fjy and gianm Nov 22, 2016
@sirpkt sirpkt force-pushed the multi-column-lookup branch from 739290a to dceb5f8 Compare November 24, 2016 05:37
@sirpkt
Contributor Author

sirpkt commented Nov 24, 2016

Sorry for the late response.
I updated the code based on the review comments.

@b-slim I don't understand your point about backward compatibility: mapName is an optional argument, and users always build LookupDimensionSpec from JSON, so they can simply omit mapName from their spec when their lookups do not have multiple maps.
I still think a separate mapName is better than combining name and mapName, because users may already use the combining delimiter (e.g. .) in name or mapName.
However, I agree that the name mapName is unclear, so I changed it to innerMapName. Any suggestions are welcome.
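The delimiter concern can be made concrete with a small sketch that lists every way a combined name.subname string could be split back into two parts. `DottedLookupName` and `candidateSplits` are hypothetical names for this illustration, and the example lookup names are invented.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: enumerates every (name, subname) reading of "name.subname".
final class DottedLookupName {
    static List<String[]> candidateSplits(String combined) {
        List<String[]> splits = new ArrayList<>();
        for (int i = combined.indexOf('.'); i >= 0; i = combined.indexOf('.', i + 1)) {
            // Skip splits that would leave an empty name or subname.
            if (i > 0 && i < combined.length() - 1) {
                splits.add(new String[] {combined.substring(0, i), combined.substring(i + 1)});
            }
        }
        return splits;
    }

    public static void main(String[] args) {
        // One dot: a single, unambiguous reading.
        System.out.println(candidateSplits("DB1.CtoD").size());   // 1

        // A dot inside either part yields two readings: (db, v1.CtoD) and (db.v1, CtoD).
        for (String[] split : candidateSplits("db.v1.CtoD")) {
            System.out.println(split[0] + " | " + split[1]);
        }
    }
}
```

A separate field avoids having to pick one of these readings or to reserve the delimiter character.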

@gianm For escaping column and table names at SQL query creation, I use the escape and quote methods of SQLTemplate in Querydsl. As I'm not familiar with SQL querying in Java, I'm not sure that this makes sense.

Other updates:

  • KeyValueMap is moved to namespaceParseSpec from URIExtractionNamespace.
  • Dependency on commons-collections is removed by using Pair
  • FlatDataParser is refactored
  • NamespaceLookupIntrospectHandler is modified as suggested by @gianm

@fjy
Contributor

fjy commented Dec 9, 2016

@b-slim @gianm can we finish this up?

@fjy fjy assigned b-slim and unassigned fjy Dec 19, 2016
@gianm
Contributor

gianm commented Feb 28, 2017

Moving to 0.10.1 as review is not complete. @sirpkt please let us know if you're still interested and we will endeavor to take another look.

@gianm gianm modified the milestones: 0.10.1, 0.10.0 Feb 28, 2017
@gianm gianm removed this from the 0.10.1 milestone May 16, 2017
@stale

stale Bot commented Feb 28, 2019

This pull request has been marked as stale due to 60 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@druid.apache.org list. Thank you for your contributions.

@stale stale Bot added the stale label Feb 28, 2019
@stale

stale Bot commented Mar 7, 2019

This pull request has been closed due to lack of activity. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

@stale stale Bot closed this Mar 7, 2019
seoeun25 added a commit to seoeun25/incubator-druid that referenced this pull request Jan 10, 2020
* Refactoring Appendertor Driver (apache#4292)

* Rename FiniteAppenderatorDriver to AppenderatorDriver (apache#4356)

* Add totalRowCount to appenderator

* add localhost as advertised hostname (apache#4689)

* kafkaIndexTask unannounce service in final block (apache#4736)

* warn if topic not found (apache#4834)

* Kafka: Fixes needlessly low interpretation of maxRowsInMemory. (apache#5034)
6 participants