[Feature] Add Topn udaf #4803

Youngwb · 2020-10-27T07:29:40Z

Proposed changes

For #4674
This PR adds a UDAF for approximate top-N based on the Space-Saving algorithm. At present it can only compute the most frequent items in a single column and their frequencies; on top of this we can later implement topN functions similar to those supported by Kylin.

I have also added a test that measures the accuracy of the algorithm. Rough results are shown below. The input is 1 million rows following a Zipfian distribution. "Element cardinality" is the number of distinct values, and the 20X, 50X, and 100X columns correspond to space_expand_rate values of 20, 50, and 100, which set the number of counters used by the Space-Saving algorithm.

zf exponent = 0.5
Element cardinality    20X     50X     100X
1000                   100%    100%    100%
10000                  100%    100%    100%
100000                 100%    100%    100%
500000                  94%     98%     99%

zf exponent = 0.6, 1
Element cardinality    20X     50X     100X
1000                   100%    100%    100%
10000                  100%    100%    100%
100000                 100%    100%    100%
500000                 100%    100%    100%
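
For readers unfamiliar with Space-Saving, the following is a minimal sketch of the idea, not the TopNCounter code from this PR. The class name SpaceSavingSketch, its methods, and the rule "number of counters = top_n * space_expand_rate" are assumptions made for illustration. The linear scan for the minimum counter keeps the example short; a production implementation would normally use a structure that finds the minimum faster.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical illustration of Space-Saving; not the TopNCounter from this PR.
class SpaceSavingSketch {
public:
    // Assumption: the counter budget is top_n * space_expand_rate.
    SpaceSavingSketch(std::size_t top_n, std::size_t space_expand_rate)
            : _top_n(top_n), _capacity(top_n * space_expand_rate) {
        assert(_capacity > 0);
    }

    void add(const std::string& item) {
        auto it = _counts.find(item);
        if (it != _counts.end()) {
            ++it->second;  // Item is already monitored: bump its counter.
            return;
        }
        if (_counts.size() < _capacity) {
            _counts.emplace(item, 1);  // Free counter available.
            return;
        }
        // All counters are in use: evict the item with the smallest count.
        // The newcomer inherits that count plus one, so counts may be
        // overestimated but are never underestimated.
        auto min_it = std::min_element(
                _counts.begin(), _counts.end(),
                [](const auto& a, const auto& b) { return a.second < b.second; });
        const std::uint64_t inherited = min_it->second;
        _counts.erase(min_it);
        _counts.emplace(item, inherited + 1);
    }

    // Returns at most top_n (item, estimated count) pairs, highest count first.
    std::vector<std::pair<std::string, std::uint64_t>> top() const {
        std::vector<std::pair<std::string, std::uint64_t>> result(_counts.begin(),
                                                                  _counts.end());
        std::sort(result.begin(), result.end(), [](const auto& a, const auto& b) {
            return a.second != b.second ? a.second > b.second : a.first < b.first;
        });
        if (result.size() > _top_n) {
            result.resize(_top_n);
        }
        return result;
    }

private:
    std::size_t _top_n;
    std::size_t _capacity;
    std::unordered_map<std::string, std::uint64_t> _counts;
};
```

A larger space_expand_rate means more counters and fewer evictions, which is consistent with accuracy rising from 20X to 100X in the 500000-cardinality row above.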

Types of changes

What types of changes does your code introduce to Doris?
Put an x in the boxes that apply

[] Bugfix (non-breaking change which fixes an issue)
[x] New feature (non-breaking change which adds functionality)
[] Breaking change (fix or feature that would cause existing functionality to not work as expected)
[] Documentation Update (if none of the other choices apply)
[] Code refactor (Modify the code structure, format the code, etc...)

Checklist

Put an x in the boxes that apply. You can also fill these out after creating the PR. If you're unsure about any of them, don't hesitate to ask. We're here to help! This is simply a reminder of what we are going to look for before merging your code.

[x] I have created an issue on (Fix #ISSUE), and have described the bug/feature there in detail
[x] Compiling and unit tests pass locally with my changes
[x] I have added tests that prove my fix is effective or that my feature works
[x] If this change needs a document change, I have updated the document
[] Any dependent changes have been merged

Further comments

If this is a relatively large or complex change, kick off the discussion at dev@doris.apache.org by explaining why you chose the solution you did and what alternatives you considered, etc...

gensrc/proto/olap_common.proto

morningman · 2020-11-21T14:47:31Z

be/src/util/topn_counter.cpp

+    sort_retain(_capacity);
+}
+
+void TopNCounter::finalize(std::string& finalize_str) {


How about using JSON format output?
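
To make the suggestion concrete, here is a rough sketch of serializing (item, count) pairs as a JSON object. The function name counters_to_json and the output shape {"item":count,...} are assumptions for illustration; the format actually adopted in this PR is not shown in this thread.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: renders counters as {"item":count,...} in input order.
std::string counters_to_json(
        const std::vector<std::pair<std::string, std::uint64_t>>& counters) {
    std::string out = "{";
    bool first = true;
    for (const auto& kv : counters) {
        if (!first) {
            out += ',';
        }
        first = false;
        out += '"';
        for (char c : kv.first) {
            const unsigned char uc = static_cast<unsigned char>(c);
            if (c == '"' || c == '\\') {
                out += '\\';  // Escape quotes and backslashes.
                out += c;
            } else if (uc < 0x20) {
                char buf[7];  // Control characters become \u00XX.
                std::snprintf(buf, sizeof(buf), "\\u%04x", static_cast<unsigned>(uc));
                out += buf;
            } else {
                out += c;
            }
        }
        out += "\":";
        out += std::to_string(kv.second);
    }
    out += '}';
    return out;
}
```

Keys are escaped so that items containing quotes or control characters still produce valid JSON.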

Co-authored-by: Mingyu Chen <morningman.cmy@gmail.com>

morningman

LGTM


morningman added the kind/feature Categorizes issue or PR as related to a new feature. label Nov 3, 2020

morningman reviewed Nov 21, 2020


yangwenbo6 and others added 3 commits December 15, 2020 11:04

topn

0c55b7c

add doc

55c79cc

Update gensrc/proto/olap_common.proto

1a5ec06

Co-authored-by: Mingyu Chen <morningman.cmy@gmail.com>

Youngwb force-pushed the topn branch from 65154a3 to 1a5ec06 Compare December 15, 2020 03:05

yangwenbo6 and others added 2 commits December 15, 2020 14:21

json result

1f187d5

fix test

0ab589d

morningman approved these changes Dec 15, 2020


morningman merged commit 650536d into apache:master Dec 16, 2020

yangzhg mentioned this pull request Feb 9, 2021

Release Notes 0.14.0 #5374

Closed
