@zzzxl1993
Contributor

@zzzxl1993 zzzxl1993 commented Sep 13, 2024

Proposed changes

  1. select approx_top_k(clientip, status, size, 10, 300) from tbl;

    This PR implements an approximate Top-K aggregate function based on the SpaceSaving algorithm, a streaming algorithm for finding the most frequent elements in large datasets with a bounded number of counters. The main pieces are described below:

    (1) Data Structures and Memory Management:
    The SpaceSavingArena class provides a memory pool that manages allocation and deallocation. For keys of type StringRef, it copies the string into the memory pool.
    The Counter struct stores the key, count, and error for each element, and provides serialization and deserialization functions.

    (2) Insertion and Updates:
    The insert method inserts a new element or updates the count of an existing one. If the structure is below capacity, the new element is simply added; if it is full, the new element replaces the element with the smallest count, chosen by comparing count and error.

    (3) Merge Operation:
    The merge method merges two SpaceSaving objects. During the merge, it adjusts the counts and errors of the existing elements so that the result stays correctly ordered.

    (4) Top-K Query:
    The top_k method returns the current K most frequent elements, sorted by count and error.

    (5) Capacity Expansion and Shrinking:
    The resize method adjusts the storage capacity and recalculates the size of the alpha_map accordingly.

    (6) Serialization and Deserialization:
    The write and read methods serialize the SpaceSaving structure to disk and read it back.

    (7) Optimization and Performance:
    The code uses a hash table for lookup and storage, and dynamically adjusts the alpha_map size to improve performance and reduce memory waste.

    In summary, the SpaceSaving class implements approximate Top-K queries over large data streams within limited memory, with efficient insertion, update, and merge operations.
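
    For readers unfamiliar with SpaceSaving, the insert, merge, and top_k steps described above can be sketched roughly as follows. This is a minimal illustration and not the code in this PR: the class name SpaceSavingSketch, the std::string keys, and the linear scan for the smallest counter are all assumptions made to keep the example short (the PR uses its own arena, hash table, and alpha_map), and the merge shown here simply replays the other side's counters.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical, simplified counter; not the Counter struct from the PR.
struct Counter {
    std::string key;
    uint64_t count;  // estimated frequency (may overestimate the true one)
    uint64_t error;  // upper bound on that overestimation
};

class SpaceSavingSketch {
public:
    explicit SpaceSavingSketch(size_t capacity) : capacity_(capacity) {}

    void insert(const std::string& key, uint64_t increment = 1, uint64_t error = 0) {
        if (capacity_ == 0) return;
        auto it = counters_.find(key);
        if (it != counters_.end()) {  // known key: just bump its counter
            it->second.count += increment;
            it->second.error += error;
            return;
        }
        if (counters_.size() < capacity_) {  // room left: track the key exactly
            counters_.emplace(key, Counter{key, increment, error});
            return;
        }
        // Full: evict the smallest counter. The new key inherits the evicted
        // count, which is recorded as its possible overestimation.
        auto min_it = std::min_element(
                counters_.begin(), counters_.end(),
                [](const auto& a, const auto& b) { return a.second.count < b.second.count; });
        const uint64_t min_count = min_it->second.count;
        counters_.erase(min_it);
        counters_.emplace(key, Counter{key, min_count + increment, min_count + error});
    }

    // Simplified merge (assumption): replay the other sketch's counters.
    void merge(const SpaceSavingSketch& other) {
        for (const auto& kv : other.counters_) {
            insert(kv.first, kv.second.count, kv.second.error);
        }
    }

    // Return up to k counters, highest count first; ties prefer lower error.
    std::vector<Counter> top_k(size_t k) const {
        std::vector<Counter> result;
        result.reserve(counters_.size());
        for (const auto& kv : counters_) result.push_back(kv.second);
        std::sort(result.begin(), result.end(), [](const Counter& a, const Counter& b) {
            if (a.count != b.count) return a.count > b.count;
            return a.error < b.error;
        });
        if (result.size() > k) result.erase(result.begin() + k, result.end());
        return result;
    }

private:
    size_t capacity_;
    std::unordered_map<std::string, Counter> counters_;
};
```

    The sketch keeps at most capacity counters, so memory stays bounded regardless of how many distinct keys the stream contains; the price is that a key admitted by eviction may be overcounted by up to its error value.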

@doris-robot

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR

Since 2024-03-18, the Document has been moved to doris-website.
See Doris Document.

@zzzxl1993
Contributor Author

run buildall

@zzzxl1993 zzzxl1993 force-pushed the 202409131832 branch 2 times, most recently from 15aef5f to fe04eee Compare September 13, 2024 13:20
Contributor

@morrySnow morrySnow left a comment

Please add a test case.

  agg(VarianceSamp.class, "var_samp", "variance_samp"),
- agg(WindowFunnel.class, "window_funnel")
+ agg(WindowFunnel.class, "window_funnel"),
+ agg(MultiTopN.class, "multi_topn")
Contributor

sort by lexicographical order

@morrySnow
Contributor

Add a description, and add documentation to https://github.com/apache/doris-website

@zzzxl1993
Contributor Author

run buildall

@doris-robot

TeamCity be ut coverage result:
Function Coverage: 37.36% (9610/25724)
Line Coverage: 28.74% (79409/276344)
Region Coverage: 28.21% (41106/145717)
Branch Coverage: 24.82% (20943/84368)
Coverage Report: http://coverage.selectdb-in.cc/coverage/5a515dcbdd11f618088f01edaa8105ef798a79f8_5a515dcbdd11f618088f01edaa8105ef798a79f8/report/index.html

@zzzxl1993 zzzxl1993 force-pushed the 202409131832 branch 2 times, most recently from 75d9fad to 3381976 Compare September 19, 2024 05:06
@xiaokang xiaokang changed the title [feature](inverted index) multi_topn function add [feature](function) add multi_topn aggregation function Sep 22, 2024
Contributor

@github-actions github-actions bot left a comment

clang-tidy made some suggestions


#pragma once

#include <rapidjson/encodings.h>
Contributor

warning: 'rapidjson/encodings.h' file not found [clang-diagnostic-error]

#include <rapidjson/encodings.h>
         ^

namespace doris::vectorized {

template <typename T>
inline uint32_t get_leading_zero_bits_unsafe(T x) {
Contributor

warning: unknown type name 'uint32_t' [clang-diagnostic-error]

inline uint32_t get_leading_zero_bits_unsafe(T x) {
       ^

}

template <typename T>
inline uint32_t bit_scan_reverse(T x) {
Contributor

warning: unknown type name 'uint32_t' [clang-diagnostic-error]

inline uint32_t bit_scan_reverse(T x) {
       ^


template <typename T>
inline uint32_t bit_scan_reverse(T x) {
return (std::max<size_t>(sizeof(T), sizeof(unsigned int))) * 8 - 1 -
Contributor

warning: use of undeclared identifier 'std' [clang-diagnostic-error]

    return (std::max<size_t>(sizeof(T), sizeof(unsigned int))) * 8 - 1 -
            ^


template <typename T>
inline uint32_t bit_scan_reverse(T x) {
return (std::max<size_t>(sizeof(T), sizeof(unsigned int))) * 8 - 1 -
Contributor

warning: use of undeclared identifier 'size_t'; did you mean 'sizeof'? [clang-diagnostic-error]

Suggested change
-    return (std::max<size_t>(sizeof(T), sizeof(unsigned int))) * 8 - 1 -
+    return (std::max<sizeof>(sizeof(T), sizeof(unsigned int))) * 8 - 1 -


template <typename T>
inline uint32_t bit_scan_reverse(T x) {
return (std::max<size_t>(sizeof(T), sizeof(unsigned int))) * 8 - 1 -
Contributor

warning: expected expression [clang-diagnostic-error]

    return (std::max<size_t>(sizeof(T), sizeof(unsigned int))) * 8 - 1 -
                           ^


template <typename T>
inline uint32_t bit_scan_reverse(T x) {
return (std::max<size_t>(sizeof(T), sizeof(unsigned int))) * 8 - 1 -
Contributor

warning: left operand of comma operator has no effect [clang-diagnostic-unused-value]

    return (std::max<size_t>(sizeof(T), sizeof(unsigned int))) * 8 - 1 -
                             ^


#pragma once

#include <boost/range/adaptor/reversed.hpp>
Contributor

warning: 'boost/range/adaptor/reversed.hpp' file not found [clang-diagnostic-error]

#include <boost/range/adaptor/reversed.hpp>
         ^

// specific language governing permissions and limitations
// under the License.

#include "vec/common/space_saving.h"
Contributor

warning: 'vec/common/space_saving.h' file not found [clang-diagnostic-error]

#include "vec/common/space_saving.h"
         ^

@zzzxl1993
Contributor Author

run buildall

@zzzxl1993
Contributor Author

run buildall

@airborne12
Member

This PR needs a detailed description.


try {
create_table(indexTbNameV1, 'V1')
create_table(indexTbNameV2, 'V2')
Contributor

What's the difference between the two tables, and why do you need two tables?

for (size_t i = 0; i < batch_size; ++i) {
derived->add(place, columns, i, arena);

if constexpr (is_aggregate_function_multi_top<Derived>::value ||
Contributor

trick

Contributor Author

Refer to the implementation in add_match

template <typename T> \
struct is_aggregate_function_##Name : std::is_base_of<AggregateFunctionMultiTop, T> {}

REGISTER_AGGREGATE_FUNCTION(MultiTopN);
Contributor

use consistent lower case name style

AggregateFunctionMultiTopN>(argument_types_),
column_size(argument_types_.size() - 2) {}

String get_name() const override { return "multi_topn"; }
Contributor

approx_topn is better, since the key semantic is approximation and the multi-field aspect is not essential.

register_aggregate_function_linear_histogram(instance);
register_aggregate_function_map_agg(instance);
register_aggregate_function_bitmap_agg(instance);

Contributor

unnecessary change


if constexpr (is_aggregate_function_multi_top<Derived>::value ||
is_aggregate_function_multi_top_with_null_variadic_inline<Derived>::value) {
derived->add_range(place, columns, 0, batch_size, arena);
Contributor

add_range() calls add() in a loop just like the else branch. So is it necessary to add a branch for is_aggregate_function_multi_top?

    void add_range(AggregateDataPtr __restrict place, const IColumn** columns, ssize_t min,
                   ssize_t max, Arena* arena) const {
        for (ssize_t row_num = min; row_num < max; ++row_num) {
            add(place, columns, row_num, arena);
        }
    }

Contributor Author

Calling add directly on nullable types performs poorly. Using add_range uniformly avoids the repeated per-row calls.

Contributor

Would it be possible to override the add_batch_single_place function directly? That might make the code clearer.
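
To make the trade-off in this thread concrete, here is a rough, self-contained sketch of the trait-plus-`if constexpr` dispatch being discussed. It is an illustration only: PrefersRangeAdd, prefers_range_add, AggHelper, SumPerRow, and SumRange are hypothetical stand-ins, the state is a plain long, and the column is a plain int array rather than Doris's AggregateDataPtr and IColumn types.

```cpp
#include <cstddef>
#include <type_traits>

// Hypothetical marker base: functions that want batched updates derive from it.
struct PrefersRangeAdd {};

// Trait in the spirit of is_aggregate_function_##Name from the diff.
template <typename T>
struct prefers_range_add : std::is_base_of<PrefersRangeAdd, T> {};

// CRTP helper standing in for the shared aggregate-function base class.
template <typename Derived>
struct AggHelper {
    void add_batch_single_place(long* state, const int* column, size_t batch_size) const {
        const auto* derived = static_cast<const Derived*>(this);
        if constexpr (prefers_range_add<Derived>::value) {
            // One call for the whole batch; the discarded branch below is never
            // instantiated, so Derived does not need an add() method.
            derived->add_range(state, column, 0, batch_size);
        } else {
            for (size_t i = 0; i < batch_size; ++i) {
                derived->add(state, column, i);
            }
        }
    }
};

// Per-row function: only implements add().
struct SumPerRow : AggHelper<SumPerRow> {
    void add(long* state, const int* column, size_t row) const { *state += column[row]; }
};

// Batched function: opts in through the marker base and implements add_range().
struct SumRange : AggHelper<SumRange>, PrefersRangeAdd {
    void add_range(long* state, const int* column, size_t begin, size_t end) const {
        long local = 0;
        for (size_t i = begin; i < end; ++i) local += column[i];
        *state += local;
    }
};
```

The alternative raised above would drop the trait and have the batched function override add_batch_single_place itself, which keeps the special case out of the shared helper at the cost of repeating the batch entry point in each such function.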

read_var_uint(reserved, buf);
}

void add(AggregateDataPtr __restrict place, const IColumn** columns, ssize_t row_num,
Contributor

add comment

write_var_uint(reserved, buf);
}

void deserialize(AggregateDataPtr __restrict place, BufferReadable& buf,
Contributor

add comment


@Override
public void checkLegalityBeforeTypeCoercion() {
if (arity() < 3) {
Contributor

We should add a variant of the function without the reserved argument that uses a default value, so that it is easier for general users.

@zzzxl1993
Contributor Author

run buildall

@zzzxl1993 zzzxl1993 changed the title [feature](function) add multi_topn aggregation function [feature](function) add approx_topn aggregation function Nov 4, 2024
@zzzxl1993 zzzxl1993 force-pushed the 202409131832 branch 2 times, most recently from 1bf0f6c to fda6f21 Compare November 5, 2024 06:28
@zzzxl1993
Contributor Author

run buildall

@zzzxl1993
Contributor Author

run buildall

@github-actions github-actions bot added the approved Indicates a PR has been approved by one committer. label Nov 7, 2024
@github-actions
Contributor

github-actions bot commented Nov 7, 2024

PR approved by at least one committer and no changes requested.

@github-actions
Contributor

github-actions bot commented Nov 7, 2024

PR approved by anyone and no changes requested.

@github-actions github-actions bot removed the approved Indicates a PR has been approved by one committer. label Nov 8, 2024
@zzzxl1993 zzzxl1993 changed the title [feature](function) add approx_topn aggregation function [feature](function) add approx_top_n aggregation function Nov 8, 2024
@zzzxl1993 zzzxl1993 changed the title [feature](function) add approx_top_n aggregation function [feature](function) add approx_top_k aggregation function Nov 8, 2024
@zzzxl1993
Contributor Author

run buildall

@zzzxl1993
Contributor Author

run buildall

@doris-robot

TeamCity be ut coverage result:
Function Coverage: 37.95% (9896/26078)
Line Coverage: 29.11% (82457/283294)
Region Coverage: 28.27% (42451/150153)
Branch Coverage: 24.81% (21494/86624)
Coverage Report: http://coverage.selectdb-in.cc/coverage/4436e850dfd85ca06e67352ffa455d71a0c8ca64_4436e850dfd85ca06e67352ffa455d71a0c8ca64/report/index.html

Contributor

@xiaokang xiaokang left a comment

LGTM

Contributor

@qidaye qidaye left a comment

LGTM

@qidaye qidaye merged commit cd514b6 into apache:master Nov 11, 2024
zzzxl1993 added a commit to zzzxl1993/doris that referenced this pull request Nov 17, 2024
Co-authored-by: zzzxl1993 <yangsiyu@selectdb.com>
924060929 added a commit that referenced this pull request Nov 18, 2024
924060929 added a commit to 924060929/incubator-doris that referenced this pull request Nov 18, 2024
fix macos compile failed, introduced by apache#40813, apache#42930, apache#43218, apache#43289

(cherry picked from commit ded2190)
924060929 added a commit that referenced this pull request Nov 18, 2024