-
Notifications
You must be signed in to change notification settings - Fork 3.7k
[feature](function) add approx_top_k aggregation function #40813
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Thank you for your contribution to Apache Doris. Since 2024-03-18, the Document has been moved to doris-website. |
|
run buildall |
15aef5f to
fe04eee
Compare
morrySnow
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add test case please
| agg(VarianceSamp.class, "var_samp", "variance_samp"), | ||
| agg(WindowFunnel.class, "window_funnel") | ||
| agg(WindowFunnel.class, "window_funnel"), | ||
| agg(MultiTopN.class, "multi_topn") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sort by lexicographical order
|
add desc, and add doc to https://github.com/apache/doris-website |
fe04eee to
5a515dc
Compare
|
run buildall |
|
TeamCity be ut coverage result: |
75d9fad to
3381976
Compare
3381976 to
641e14b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
clang-tidy made some suggestions
|
|
||
| #pragma once | ||
|
|
||
| #include <rapidjson/encodings.h> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
warning: 'rapidjson/encodings.h' file not found [clang-diagnostic-error]
#include <rapidjson/encodings.h>
^| namespace doris::vectorized { | ||
|
|
||
| template <typename T> | ||
| inline uint32_t get_leading_zero_bits_unsafe(T x) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
warning: unknown type name 'uint32_t' [clang-diagnostic-error]
inline uint32_t get_leading_zero_bits_unsafe(T x) {
^| } | ||
|
|
||
| template <typename T> | ||
| inline uint32_t bit_scan_reverse(T x) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
warning: unknown type name 'uint32_t' [clang-diagnostic-error]
inline uint32_t bit_scan_reverse(T x) {
^|
|
||
| template <typename T> | ||
| inline uint32_t bit_scan_reverse(T x) { | ||
| return (std::max<size_t>(sizeof(T), sizeof(unsigned int))) * 8 - 1 - |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
warning: use of undeclared identifier 'std' [clang-diagnostic-error]
return (std::max<size_t>(sizeof(T), sizeof(unsigned int))) * 8 - 1 -
^|
|
||
| template <typename T> | ||
| inline uint32_t bit_scan_reverse(T x) { | ||
| return (std::max<size_t>(sizeof(T), sizeof(unsigned int))) * 8 - 1 - |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
warning: use of undeclared identifier 'size_t'; did you mean 'sizeof'? [clang-diagnostic-error]
| return (std::max<size_t>(sizeof(T), sizeof(unsigned int))) * 8 - 1 - | |
| return (std::max<sizeof>(sizeof(T), sizeof(unsigned int))) * 8 - 1 - |
|
|
||
| template <typename T> | ||
| inline uint32_t bit_scan_reverse(T x) { | ||
| return (std::max<size_t>(sizeof(T), sizeof(unsigned int))) * 8 - 1 - |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
warning: expected expression [clang-diagnostic-error]
return (std::max<size_t>(sizeof(T), sizeof(unsigned int))) * 8 - 1 -
^|
|
||
| template <typename T> | ||
| inline uint32_t bit_scan_reverse(T x) { | ||
| return (std::max<size_t>(sizeof(T), sizeof(unsigned int))) * 8 - 1 - |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
warning: left operand of comma operator has no effect [clang-diagnostic-unused-value]
return (std::max<size_t>(sizeof(T), sizeof(unsigned int))) * 8 - 1 -
^|
|
||
| #pragma once | ||
|
|
||
| #include <boost/range/adaptor/reversed.hpp> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
warning: 'boost/range/adaptor/reversed.hpp' file not found [clang-diagnostic-error]
#include <boost/range/adaptor/reversed.hpp>
^| // specific language governing permissions and limitations | ||
| // under the License. | ||
|
|
||
| #include "vec/common/space_saving.h" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
warning: 'vec/common/space_saving.h' file not found [clang-diagnostic-error]
#include "vec/common/space_saving.h"
^641e14b to
4dd3060
Compare
|
run buildall |
1 similar comment
|
run buildall |
|
need detail description for this PR |
|
|
||
| try { | ||
| create_table(indexTbNameV1, 'V1') | ||
| create_table(indexTbNameV2, 'V2') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the difference between the two tables and why you need two tables?
| for (size_t i = 0; i < batch_size; ++i) { | ||
| derived->add(place, columns, i, arena); | ||
|
|
||
| if constexpr (is_aggregate_function_multi_top<Derived>::value || |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
trick
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Refer to the implementation in add_match
| template <typename T> \ | ||
| struct is_aggregate_function_##Name : std::is_base_of<AggregateFunctionMultiTop, T> {} | ||
|
|
||
| REGISTER_AGGREGATE_FUNCTION(MultiTopN); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use consistent lower case name style
| AggregateFunctionMultiTopN>(argument_types_), | ||
| column_size(argument_types_.size() - 2) {} | ||
|
|
||
| String get_name() const override { return "multi_topn"; } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
approx_topn is better since the key semantics is approximate and multi filed is not necessary.
| register_aggregate_function_linear_histogram(instance); | ||
| register_aggregate_function_map_agg(instance); | ||
| register_aggregate_function_bitmap_agg(instance); | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unnecessary change
|
|
||
| if constexpr (is_aggregate_function_multi_top<Derived>::value || | ||
| is_aggregate_function_multi_top_with_null_variadic_inline<Derived>::value) { | ||
| derived->add_range(place, columns, 0, batch_size, arena); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add_range() calls add() in a loop just like the else branch. So is it necessary to add a branch for is_aggregate_function_multi_top?
void add_range(AggregateDataPtr __restrict place, const IColumn** columns, ssize_t min,
ssize_t max, Arena* arena) const {
for (ssize_t row_num = min; row_num < max; ++row_num) {
add(place, columns, row_num, arena);
}
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Calling add directly on nullable types has poor performance. Using add_range uniformly can prevent multiple calls.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's possible could override add_batch_single_place function directly? Maybe code more clearly
| read_var_uint(reserved, buf); | ||
| } | ||
|
|
||
| void add(AggregateDataPtr __restrict place, const IColumn** columns, ssize_t row_num, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add comment
| write_var_uint(reserved, buf); | ||
| } | ||
|
|
||
| void deserialize(AggregateDataPtr __restrict place, BufferReadable& buf, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add comment
|
|
||
| @Override | ||
| public void checkLegalityBeforeTypeCoercion() { | ||
| if (arity() < 3) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should add a function without argument reserved and use a default value to be more easy to use for general users.
4dd3060 to
e61f898
Compare
|
run buildall |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
clang-tidy made some suggestions
|
|
||
| #pragma once | ||
|
|
||
| #include <rapidjson/encodings.h> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
warning: 'rapidjson/encodings.h' file not found [clang-diagnostic-error]
#include <rapidjson/encodings.h>
^e61f898 to
1d82a27
Compare
1bf0f6c to
fda6f21
Compare
|
run buildall |
fda6f21 to
941ff08
Compare
|
run buildall |
941ff08 to
937b20e
Compare
|
PR approved by at least one committer and no changes requested. |
|
PR approved by anyone and no changes requested. |
b2695da to
4c844f9
Compare
4c844f9 to
26e7b8a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
clang-tidy made some suggestions
|
|
||
| #pragma once | ||
|
|
||
| #include <rapidjson/encodings.h> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
warning: 'rapidjson/encodings.h' file not found [clang-diagnostic-error]
#include <rapidjson/encodings.h>
^|
run buildall |
26e7b8a to
4436e85
Compare
|
run buildall |
|
TeamCity be ut coverage result: |
xiaokang
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
qidaye
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
## Proposed changes
1. select approx_top_k(clientip, status, size, 10, 300) from tbl;
This code implements an approximate Top-N query function based on the
SpaceSaving algorithm. SpaceSaving is an efficient streaming algorithm
commonly used to handle frequent element query problems in large
datasets. Below is a description of the main functionalities:
(1) Data Structures and Memory Management:
The SpaceSavingArena class provides a memory pool to manage memory
allocation and deallocation. For keys of type StringRef, it handles the
memory by copying the string into the memory pool.
The Counter struct stores the key, count, and error for each element,
and provides serialization and deserialization functions.
(2) Insertion and Updates:
The insert method is used to insert new elements or update the count of
existing elements. If the current capacity is not full, it inserts the
new element; if it is full, it replaces the element with the smallest
count based on the element's count and error.
(3) Merge Operation:
The merge method allows merging two SpaceSaving objects. During the
merge, it adjusts the counts and errors of the existing elements,
ensuring that the result maintains the correct order.
(4) Top-K Query:
The top_k method returns the current Top-K most frequent elements,
sorted by their count and error.
(5) Capacity Expansion and Shrinking:
The resize method allows adjusting the storage capacity, and it
recalculates the size of the alpha_map accordingly.
(6) Serialization and Deserialization:
The write and read methods are provided for serializing the SpaceSaving
structure to disk or reading data from disk.
(7) Optimization and Performance:
The code uses a hash table-based approach for lookup and storage, and
dynamically adjusts the alpha_map size to optimize performance and
reduce memory waste.
In summary, the SpaceSaving class efficiently implements Top-N queries
for large data streams within limited memory, with efficient insertion,
updating, and merging mechanisms.
Co-authored-by: zzzxl1993 <yangsiyu@selectdb.com>
fix macos compile failed, introduced by apache#40813, apache#42930, apache#43218, apache#43289 (cherry picked from commit ded2190)
Proposed changes
select approx_top_k(clientip, status, size, 10, 300) from tbl;
This code implements an approximate Top-N query function based on the SpaceSaving algorithm. SpaceSaving is an efficient streaming algorithm commonly used to handle frequent element query problems in large datasets. Below is a description of the main functionalities:
(1) Data Structures and Memory Management:
The SpaceSavingArena class provides a memory pool to manage memory allocation and deallocation. For keys of type StringRef, it handles the memory by copying the string into the memory pool.
The Counter struct stores the key, count, and error for each element, and provides serialization and deserialization functions.
(2) Insertion and Updates:
The insert method is used to insert new elements or update the count of existing elements. If the current capacity is not full, it inserts the new element; if it is full, it replaces the element with the smallest count based on the element's count and error.
(3) Merge Operation:
The merge method allows merging two SpaceSaving objects. During the merge, it adjusts the counts and errors of the existing elements, ensuring that the result maintains the correct order.
(4) Top-K Query:
The top_k method returns the current Top-K most frequent elements, sorted by their count and error.
(5) Capacity Expansion and Shrinking:
The resize method allows adjusting the storage capacity, and it recalculates the size of the alpha_map accordingly.
(6) Serialization and Deserialization:
The write and read methods are provided for serializing the SpaceSaving structure to disk or reading data from disk.
(7) Optimization and Performance:
The code uses a hash table-based approach for lookup and storage, and dynamically adjusts the alpha_map size to optimize performance and reduce memory waste.
In summary, the SpaceSaving class efficiently implements Top-N queries for large data streams within limited memory, with efficient insertion, updating, and merging mechanisms.