Unsigned integer druid complex column#13370
churromorales wants to merge 10 commits into apache:master
|
@churromorales, could you please share some more details on the kind of savings you see with using this column? |
|
How does this compare with the 'auto' |
|
@clintropolis I did test it out without any encoding and it does save space. This extension could very well be used with the long encoding feature; it would need some small changes to the long encoder, because the encoder relies on having a block of a certain size and then figures out how many long values it can stuff in there. I could add another encoder, or better yet modify the existing one (since it works) and have it work for both. That would require a core change, but it is very possible. I think this + encoding could add much more value, but for now I don't think I have justification to change the encoder in druid-core until I show a valid reason. Let me know what you think, or if you have any other thoughts. |
22a7fd0 to 67755ba
…t the branch so had to rebase and force push, lets see if this works
67755ba to fe74cc9
|
@clintropolis I looked at the long encoding work some more and think it handles everything for us. If we take a look at |
|
oops, sorry for the delay
yes, it does bit-packing so should effectively achieve the same thing
I think it would be reasonable to turn on by default. I used to have some worries about the performance, since the abstraction seems to cause some overhead, especially in the non-vectorized engine, at least the last time I measured it as part of the ancient #6016 (which maybe someday I will get back to...): https://user-images.githubusercontent.com/1577461/42849379-d1483132-89d7-11e8-8cdd-2382690d70b6.gif. In that chart, scan time for the 'auto' encoded data grew at a faster rate than for the 'longs' encoding (the top chart is basically segment scan time from 0 to 100% selection). But the difference wasn't huge, and I think the vectorization improvements done as part of #11004 probably make up for this, so I'm OK with switching the default. |
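For readers unfamiliar with the term, here is a minimal, generic sketch of bit-packing in Java. It is not Druid's actual 'auto' encoder; the class name `BitPackSketch` and its methods are hypothetical, and it assumes every value is non-negative and fits in the chosen bit width. The idea is that if the largest value needs only 17 bits, each value can occupy 17 bits instead of a full 64-bit long.

```java
public class BitPackSketch {
    // Number of bits needed to represent any value in [0, maxValue].
    static int bitsNeeded(long maxValue) {
        return Math.max(1, 64 - Long.numberOfLeadingZeros(maxValue));
    }

    // Packs values back to back, `bits` bits each, into 64-bit words.
    // Assumes 1 <= bits <= 64 and that each value fits in `bits` bits.
    static long[] pack(long[] values, int bits) {
        long[] words = new long[(values.length * bits + 63) / 64];
        for (int i = 0; i < values.length; i++) {
            int bitPos = i * bits;
            int word = bitPos >>> 6;   // bitPos / 64
            int offset = bitPos & 63;  // bitPos % 64
            words[word] |= values[i] << offset;
            if (offset + bits > 64) {
                // The value straddles two words; spill the high bits over.
                words[word + 1] |= values[i] >>> (64 - offset);
            }
        }
        return words;
    }

    // Reads the value at `index` back out of the packed words.
    static long get(long[] words, int bits, int index) {
        int bitPos = index * bits;
        int word = bitPos >>> 6;
        int offset = bitPos & 63;
        long v = words[word] >>> offset;
        if (offset + bits > 64) {
            v |= words[word + 1] << (64 - offset);
        }
        long mask = (bits == 64) ? -1L : (1L << bits) - 1;
        return v & mask;
    }

    public static void main(String[] args) {
        long[] values = {3, 0, 1000, 70000, 5};
        int bits = bitsNeeded(70000);          // 17
        long[] packed = pack(values, bits);    // 5 * 17 = 85 bits, so 2 words
        System.out.println(bits + " bits per value, " + packed.length + " words");
        for (int i = 0; i < values.length; i++) {
            System.out.println(get(packed, bits, i));
        }
    }
}
```

In this sketch five values take two 64-bit words instead of five, which is the kind of saving being compared against storing a fixed 4-byte unsigned int.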
|
I am also supportive of doing |
|
This pull request has been marked as stale due to 60 days of inactivity. |
|
This pull request/issue has been closed due to lack of activity. If you think that |
This adds the ability to store an unsigned integer as a complex column type. Note that the value is stored as an unsigned int only on disk; when it is deserialized it is a Long, so aggregators will not overflow. This should help save a little space for those who rely on metric columns as counts.
Docs show how to use it: basically add the extension, then your spec would have something like this:
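To illustrate the storage idea only (this is not the ingestion spec and not the extension's actual code), here is a minimal Java sketch of writing a count as a 4-byte unsigned int and reading it back as a long. The class `UnsignedIntSketch` and its method names are hypothetical; the sketch assumes values lie in the range 0 to 4294967295.

```java
import java.nio.ByteBuffer;

public class UnsignedIntSketch {
    static final long MAX_UNSIGNED_INT = 0xFFFFFFFFL; // 4294967295

    // Serializes a value in [0, 2^32 - 1] into 4 bytes instead of 8.
    static byte[] write(long value) {
        if (value < 0 || value > MAX_UNSIGNED_INT) {
            throw new IllegalArgumentException("out of unsigned int range: " + value);
        }
        // The narrowing cast keeps the low 32 bits; the sign is reinterpreted on read.
        return ByteBuffer.allocate(Integer.BYTES).putInt((int) value).array();
    }

    // Deserializes the 4 bytes into a long, so later sums happen in 64-bit arithmetic.
    static long read(byte[] bytes) {
        return Integer.toUnsignedLong(ByteBuffer.wrap(bytes).getInt());
    }

    public static void main(String[] args) {
        long original = 4_000_000_000L;        // larger than Integer.MAX_VALUE
        byte[] onDisk = write(original);
        long restored = read(onDisk);
        System.out.println(onDisk.length + " bytes -> " + restored); // 4 bytes -> 4000000000
        System.out.println(restored + restored);                     // 8000000000, no int overflow
    }
}
```

Because the in-memory value is a long, adding two such counts cannot wrap around at 32 bits, which matches the point above about aggregators not overflowing.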