
Unsigned integer druid complex column #13370

Closed
churromorales wants to merge 10 commits into apache:master from churromorales:unsigned_integer

Conversation

@churromorales
Contributor

This adds the ability to store an unsigned integer as a complex column type. Note that only the on-disk value is stored as an unsigned int; when it is deserialized it becomes a Long, so aggregators will not overflow. This should help save a little space for those who rely on metric columns as counts.

The docs show how to use it: basically, add the extension, and then your spec would include something like this:

```json
"metricsSpec": [
        {
          "type": "unsigned_int",
          "name": "value",
          "fieldName": "value"
        },
```
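To illustrate the storage idea described above, here is a toy sketch in Python (not the PR's actual Java implementation): each count is persisted as a 4-byte unsigned int, but widened to a 64-bit value on read so downstream aggregation cannot overflow.

```python
import struct

def write_unsigned_int(value: int) -> bytes:
    """Serialize a non-negative count into 4 bytes, as the column would on disk."""
    if not 0 <= value <= 0xFFFFFFFF:
        raise ValueError("value does not fit in an unsigned 32-bit int")
    return struct.pack(">I", value)

def read_as_long(buf: bytes) -> int:
    """Deserialize the 4 bytes back into a wide integer (a Long in Java terms),
    so summing many of these values cannot overflow 32 bits."""
    (value,) = struct.unpack(">I", buf)
    return value

stored = write_unsigned_int(4_000_000_000)   # larger than a signed 32-bit int can hold
assert len(stored) == 4                      # half the footprint of an 8-byte long
assert read_as_long(stored) == 4_000_000_000
```

The trade-off is the one debated below: a fixed 4-byte layout halves the size of an 8-byte long column, but a bit-packed encoding can do even better for small values.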

@kfaraz
Contributor

kfaraz commented Nov 17, 2022

@churromorales , could you please share some more details on the kind of savings you see with using this column?

@clintropolis
Member

clintropolis commented Nov 17, 2022

How does this compare with the 'auto' longEncoding you can specify on IndexSpec? https://druid.apache.org/docs/latest/ingestion/ingestion-spec.html#indexspec. I'll admit this isn't terribly well documented, but it was added in #3148; basically, it provides table and delta encoding with bit-packing for long-typed columns, which can save a fair bit of size. #11004 might also have some useful stuff to reference if you're looking to do any benchmarking.
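For reference, enabling that encoding is a one-line change in the ingestion spec's tuningConfig (a minimal fragment, per the linked docs):

```json
"tuningConfig": {
  "indexSpec": {
    "longEncoding": "auto"
  }
}
```

The default is "longs", which stores every value as a full 8 bytes; "auto" chooses table or delta encoding with bit-packing based on the column's values.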

@churromorales
Contributor Author

@clintropolis I did test it out without any encoding, and it does save space. This extension could also be used with the long encoding feature, with some small changes to the long encoder, since the encoder relies on having a block of a certain size and then figures out how many long values it can fit in there. I could add another encoder, or better yet modify the existing one (since it works) so that it handles both. That would require a core change, but it is very possible. I think this plus encoding could add much more value, but for now I don't think I have justification to change the encoder in druid-core until I can show a valid reason. Let me know what you think, or if you have any other thoughts.

@churromorales
Contributor Author

@clintropolis I looked at the long encoding work some more and think it handles everything for us. If we take a look at VSizeLongSerde.getSerializer(), it looks like it only stores the bits it needs. I think this PR is not necessary given that feature, but I do have a suggestion. This patch has been in for a while, and we should turn long encoding on by default. When I tested the encoding against the unsigned int column, the encoding did better because it rarely stored leading 0's. What do you think about me closing this PR and opening another one to turn long encoding on by default in Druid?

@clintropolis
Member

oops, sorry for the delay

> @clintropolis i looked at the long encoding work some more and think it handles everything for us. If we take a look at VSizeLongSerde.getSerializer() it looks like it only stores the bits it needs.

yes, it does bit-packing, so it should effectively achieve the same thing
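To make the size argument concrete, here is a toy sketch in Python (not Druid's actual VSizeLongSerde implementation) of how bit-packing shrinks a column of small non-negative values relative to plain 8-byte longs:

```python
def bits_needed(max_value: int) -> int:
    """Smallest bit width that can represent every value up to max_value."""
    return max(1, max_value.bit_length())

def packed_size_bytes(values: list[int]) -> int:
    """Total bytes when each value is stored with just enough bits,
    rounded up to whole bytes."""
    width = bits_needed(max(values))
    total_bits = width * len(values)
    return (total_bits + 7) // 8

values = [3, 17, 255, 1024]           # every value fits in 11 bits
plain_longs = 8 * len(values)         # 'longs' encoding: 32 bytes
packed = packed_size_bytes(values)    # 44 bits -> 6 bytes
print(plain_longs, packed)            # prints: 32 6
```

This is also why the encoding beat a fixed 4-byte unsigned int in the test above: the unsigned int column always spends 32 bits per value, while bit-packing skips the leading zeros entirely.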

> I think this PR is not necessary with this feature, but I do have a suggestion. I believe this patch has been in for a while and we should make long encoding on by default. When I tested the encoding vs unsigned int, it encoded better because it rarely stored leading 0's. What do you think about me closing this PR and having another one to have long encoding on by default in druid?

I think it would be reasonable to turn it on by default. I used to have some worries about the performance, since the abstraction seems to cause some overhead, especially in the non-vectorized engine. At least the last time I measured it, as seen in this chart (https://user-images.githubusercontent.com/1577461/42849379-d1483132-89d7-11e8-8cdd-2382690d70b6.gif), the 'auto' encoded data grew at a faster rate than the 'longs' encoding (the top chart is basically segment scan time from 0 to 100% selection); I collected that as part of the ancient #6016 (which maybe someday I will get back to...).

But the difference wasn't huge, and I think the vectorization improvements done as part of #11004 probably make up for it, so I'm ok with switching the default.

@gianm
Contributor

gianm commented Dec 9, 2022

I am also supportive of doing auto long encoding by default.

@github-actions

This pull request has been marked as stale due to 60 days of inactivity.
It will be closed in 4 weeks if no further activity occurs. If you think
that's incorrect or this pull request should instead be reviewed, please simply
write any comment. Even if closed, you can still revive the PR at any time or
discuss it on the dev@druid.apache.org list.
Thank you for your contributions.

github-actions bot added the stale label Jan 12, 2024
@github-actions

This pull request/issue has been closed due to lack of activity. If you think that
is incorrect, or the pull request requires review, you can revive the PR at any time.

github-actions bot closed this Feb 10, 2024