[feature](function) add approx_top_k aggregation function #40813

zzzxl1993 · 2024-09-13T10:34:38Z

Proposed changes

select approx_top_k(clientip, status, size, 10, 300) from tbl;

This PR implements an approximate Top-N aggregate function based on the SpaceSaving algorithm, a streaming algorithm for finding the most frequent elements of a large dataset in bounded memory. The main pieces are described below:

(1) Data Structures and Memory Management:
The SpaceSavingArena class provides a memory pool to manage memory allocation and deallocation. For keys of type StringRef, it handles the memory by copying the string into the memory pool.
The Counter struct stores the key, count, and error for each element, and provides serialization and deserialization functions.

(2) Insertion and Updates:
The insert method adds a new element or increments the count of an existing one. While the structure is below capacity, a new element is simply added; once it is full, the new element replaces the element that ranks lowest by count and error (the one with the smallest count).
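To make the insertion rule concrete, here is a minimal sketch of the textbook SpaceSaving update. It is not the code in this PR: `ToySpaceSaving` and its `Counter` are hypothetical stand-ins, and the linear scan for the minimum is an assumption made for brevity where the real class uses a hash table and alpha_map.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>

// Hypothetical, simplified stand-in for the Counter struct described above.
struct Counter {
    uint64_t count; // estimated frequency (may overestimate the true one)
    uint64_t error; // upper bound on that overestimate
};

// Hypothetical toy summary illustrating the textbook algorithm only.
class ToySpaceSaving {
public:
    explicit ToySpaceSaving(size_t capacity) : _capacity(capacity) {
        assert(capacity > 0);
    }

    void insert(const std::string& key, uint64_t increment = 1) {
        auto it = _counters.find(key);
        if (it != _counters.end()) {
            // Key is already tracked: just bump its count.
            it->second.count += increment;
            return;
        }
        if (_counters.size() < _capacity) {
            // Room left: start tracking the key exactly (error 0).
            _counters.emplace(key, Counter{increment, 0});
            return;
        }
        // Full: find the counter with the smallest count (linear scan here).
        auto victim = _counters.begin();
        for (auto cur = _counters.begin(); cur != _counters.end(); ++cur) {
            if (cur->second.count < victim->second.count) {
                victim = cur;
            }
        }
        // The newcomer inherits the evicted count as its possible overestimate.
        const uint64_t min_count = victim->second.count;
        _counters.erase(victim);
        _counters.emplace(key, Counter{min_count + increment, min_count});
    }

    // Returns nullptr when the key is not currently tracked.
    const Counter* find(const std::string& key) const {
        auto it = _counters.find(key);
        return it == _counters.end() ? nullptr : &it->second;
    }

    size_t size() const { return _counters.size(); }

private:
    size_t _capacity;
    std::unordered_map<std::string, Counter> _counters;
};
```

With capacity 2, inserting "a" three times, "b" once, and then "c" evicts "b" and leaves "c" with count 2 and error 1: its true frequency lies somewhere between 1 and 2.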

(3) Merge Operation:
The merge method allows merging two SpaceSaving objects. During the merge, it adjusts the counts and errors of the existing elements, ensuring that the result maintains the correct order.
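One common way to merge two SpaceSaving summaries can be sketched as follows. This is an illustration under stated assumptions, not the PR's merge method: `MergeCounter`, `Summary`, `floor_count`, and `merge_summaries` are hypothetical names, and both inputs are assumed to share the same capacity. The idea is that a key missing from a full summary may still have occurred up to that summary's smallest count, so that amount is added to both count and error.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical types for illustration only.
struct MergeCounter {
    uint64_t count;
    uint64_t error;
};
using Summary = std::unordered_map<std::string, MergeCounter>;

// Smallest count if the summary is full, else 0. A summary below capacity
// has tracked every key exactly, so an absent key truly has count 0.
uint64_t floor_count(const Summary& s, size_t capacity) {
    if (s.empty() || s.size() < capacity) {
        return 0;
    }
    uint64_t m = std::numeric_limits<uint64_t>::max();
    for (const auto& kv : s) {
        m = std::min(m, kv.second.count);
    }
    return m;
}

Summary merge_summaries(const Summary& lhs, const Summary& rhs, size_t capacity) {
    const uint64_t m1 = floor_count(lhs, capacity);
    const uint64_t m2 = floor_count(rhs, capacity);

    std::vector<std::pair<std::string, MergeCounter>> all;
    for (const auto& kv : lhs) {
        auto it = rhs.find(kv.first);
        if (it != rhs.end()) {
            // Tracked on both sides: counts and errors simply add up.
            all.emplace_back(kv.first, MergeCounter{kv.second.count + it->second.count,
                                                    kv.second.error + it->second.error});
        } else {
            // Missing from rhs: it may have occurred up to m2 times there.
            all.emplace_back(kv.first,
                             MergeCounter{kv.second.count + m2, kv.second.error + m2});
        }
    }
    for (const auto& kv : rhs) {
        if (lhs.find(kv.first) == lhs.end()) {
            // Missing from lhs: it may have occurred up to m1 times there.
            all.emplace_back(kv.first,
                             MergeCounter{kv.second.count + m1, kv.second.error + m1});
        }
    }

    // Restore the ordering (higher count first, lower error breaks ties)
    // and keep only `capacity` counters.
    std::sort(all.begin(), all.end(), [](const auto& a, const auto& b) {
        if (a.second.count != b.second.count) {
            return a.second.count > b.second.count;
        }
        return a.second.error < b.second.error;
    });
    if (all.size() > capacity) {
        all.resize(capacity);
    }
    return Summary(all.begin(), all.end());
}
```

For example, with capacity 2, merging {a: 5, b: 2} with {a: 3, c: 4 (error 1)} gives a = 8, c = 6 (error 3), and b = 5 (error 3); after truncation only a and c remain.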

(4) Top-K Query:
The top_k method returns the current Top-K most frequent elements, sorted by their count and error.
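The ordering "by count and error" can be illustrated with a small sketch. The tie-breaking rule used here (higher count first, then lower error) is an assumption about what that ordering means, and `RankedCounter`, `ranks_before`, and this free-standing `top_k` are hypothetical names rather than the PR's API.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical flat counter used only for this illustration.
struct RankedCounter {
    std::string key;
    uint64_t count;
    uint64_t error;
};

// Assumed ordering: a higher count ranks first; on equal counts, the
// counter with the smaller error (the more trustworthy one) ranks first.
bool ranks_before(const RankedCounter& a, const RankedCounter& b) {
    if (a.count != b.count) {
        return a.count > b.count;
    }
    return a.error < b.error;
}

// Returns the k best-ranked counters in rank order (fewer if k > size).
std::vector<RankedCounter> top_k(std::vector<RankedCounter> counters, size_t k) {
    k = std::min(k, counters.size());
    const auto middle = counters.begin() + static_cast<std::ptrdiff_t>(k);
    // Only the first k positions need to be fully ordered.
    std::partial_sort(counters.begin(), middle, counters.end(), ranks_before);
    counters.erase(middle, counters.end());
    return counters;
}
```

Taking the counters by value keeps the caller's data untouched, and `std::partial_sort` avoids ordering elements that will be discarded anyway.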

(5) Capacity Expansion and Shrinking:
The resize method allows adjusting the storage capacity, and it recalculates the size of the alpha_map accordingly.

(6) Serialization and Deserialization:
The write and read methods are provided for serializing the SpaceSaving structure to disk or reading data from disk.
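As a rough illustration of what serializing one counter involves, the sketch below writes a length-prefixed key followed by the count and error. The wire layout, the host-endian encoding, and the names `WireCounter`, `write_counter`, and `read_counter` are all assumptions made for this example; the format actually used by the PR's write and read methods may differ.

```cpp
#include <cstdint>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical counter layout for this illustration.
struct WireCounter {
    std::string key;
    uint64_t count;
    uint64_t error;
};

// Assumed layout: [key length: u64][key bytes][count: u64][error: u64],
// all in host byte order (so not portable across endianness).
void write_counter(const WireCounter& c, std::ostream& out) {
    const uint64_t len = c.key.size();
    out.write(reinterpret_cast<const char*>(&len), sizeof(len));
    out.write(c.key.data(), static_cast<std::streamsize>(len));
    out.write(reinterpret_cast<const char*>(&c.count), sizeof(c.count));
    out.write(reinterpret_cast<const char*>(&c.error), sizeof(c.error));
}

// Returns false if the stream ends before a full counter has been read.
// A production reader would also bound `len` before resizing the key.
bool read_counter(std::istream& in, WireCounter& c) {
    uint64_t len = 0;
    if (!in.read(reinterpret_cast<char*>(&len), sizeof(len))) {
        return false;
    }
    c.key.resize(len);
    if (len > 0 && !in.read(&c.key[0], static_cast<std::streamsize>(len))) {
        return false;
    }
    if (!in.read(reinterpret_cast<char*>(&c.count), sizeof(c.count))) {
        return false;
    }
    if (!in.read(reinterpret_cast<char*>(&c.error), sizeof(c.error))) {
        return false;
    }
    return true;
}
```

A whole summary could then be written as a counter total followed by each counter in turn; round-tripping through a `std::stringstream` is an easy way to unit-test such a pair of functions.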

(7) Optimization and Performance:
The code uses a hash table-based approach for lookup and storage, and dynamically adjusts the alpha_map size to optimize performance and reduce memory waste.

In summary, the SpaceSaving class answers Top-N queries over large data streams within limited memory, with efficient insertion, update, and merge operations.

doris-robot · 2024-09-13T10:34:42Z

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR

Since 2024-03-18, the Document has been moved to doris-website.
See Doris Document.

zzzxl1993 · 2024-09-13T10:35:08Z

run buildall

morrySnow

Please add a test case.

morrySnow · 2024-09-18T04:02:26Z

fe/fe-core/src/main/java/org/apache/doris/catalog/BuiltinAggregateFunctions.java

            agg(VarianceSamp.class, "var_samp", "variance_samp"),
-            agg(WindowFunnel.class, "window_funnel")
+            agg(WindowFunnel.class, "window_funnel"),
+            agg(MultiTopN.class, "multi_topn")


Please keep these entries sorted in lexicographical order.

morrySnow · 2024-09-18T04:03:46Z

Please add a description, and add documentation to https://github.com/apache/doris-website

zzzxl1993 · 2024-09-18T10:44:59Z

run buildall

doris-robot · 2024-09-18T12:27:33Z

TeamCity be ut coverage result:
Function Coverage: 37.36% (9610/25724)
Line Coverage: 28.74% (79409/276344)
Region Coverage: 28.21% (41106/145717)
Branch Coverage: 24.82% (20943/84368)
Coverage Report: http://coverage.selectdb-in.cc/coverage/5a515dcbdd11f618088f01edaa8105ef798a79f8_5a515dcbdd11f618088f01edaa8105ef798a79f8/report/index.html

github-actions

clang-tidy made some suggestions

github-actions · 2024-10-22T07:40:16Z

be/src/vec/aggregate_functions/aggregate_function_multi_topn.h

+
+#pragma once
+
+#include <rapidjson/encodings.h>


warning: 'rapidjson/encodings.h' file not found [clang-diagnostic-error]

#include <rapidjson/encodings.h>
         ^

github-actions · 2024-10-22T07:40:16Z