@Jibing-Li
Contributor

backport: #46534

…ey column. (apache#46534)

When running sample analyze on a partition column or a key column, BE
may run out of memory (OOM). For a partition column, the sample must
include at least one tablet from each partition to calculate the NDV, and
the SQL cannot use a limit. So when the table has a large number of
partitions and the tablets in each partition are large, the sample SQL may
try to read too much data and cause BE OOM.
Similarly, a key column cannot use a limit either, so a single very large
tablet can also cause OOM.

This PR tries to solve this problem.
For partition columns, when the selected tablets contain more than
1000000000 (one billion) rows, we use the ndv() function to read up to 5
partitions and get the NDV of those partitions, say n.
If the row count of those partitions is r and the row count of the
table is R, the table NDV is estimated as n * R / r.
The ndv() function uses HLL, so it only uses a small amount of memory.
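
The extrapolation above can be sketched as follows. This is a minimal illustration of the n * R / r formula only, not the code in this PR; the function and parameter names are hypothetical, and the zero-row guard is an added assumption.

```python
def estimate_table_ndv(sample_ndv: int, sample_rows: int, table_rows: int) -> float:
    """Scale the NDV of the sampled partitions up to the whole table.

    sample_ndv  -- n, the NDV returned by ndv() over the sampled partitions
    sample_rows -- r, the row count of the sampled partitions
    table_rows  -- R, the row count of the whole table
    """
    if sample_rows <= 0:
        # Assumption for this sketch: an empty sample cannot be extrapolated.
        raise ValueError("sample_rows must be positive")
    return sample_ndv * table_rows / sample_rows


# 50 distinct values in 1,000 sampled rows, table of 10,000 rows.
print(estimate_table_ndv(50, 1_000, 10_000))
```

The scaling is linear in the row count, so the estimate is only as good as the assumption that the sampled partitions are representative of the rest of the table.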

For key columns, when the selected tablets contain more than 1000000000
rows, we add limit 1000000000 to the sample SQL to cap the rows read.
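
The two rules can be summarized in a small decision sketch. The strategy labels, the function name, and the column-kind strings are hypothetical and chosen only for illustration; the threshold is the one stated in this description.

```python
ROW_THRESHOLD = 1_000_000_000  # one billion rows, as stated in the description


def choose_sample_strategy(column_kind: str, selected_rows: int) -> str:
    """Pick how to sample a column, given the rows in the selected tablets.

    column_kind is "partition", "key", or anything else (hypothetical labels).
    """
    if selected_rows <= ROW_THRESHOLD:
        # At or below the threshold the existing sample SQL is kept.
        return "default_sample"
    if column_kind == "partition":
        # Run ndv() over up to 5 partitions, then scale by n * R / r.
        return "ndv_over_partitions"
    if column_kind == "key":
        # Cap the rows read with a limit equal to the threshold.
        return "limit_rows"
    return "default_sample"


print(choose_sample_strategy("partition", 2_000_000_000))
print(choose_sample_strategy("key", 2_000_000_000))
```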

Reading 1000000000 rows uses at most 8GB of memory in BE, which is
acceptable.
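
As a quick check of the arithmetic, the stated bound works out to a per-row budget of about 8 bytes. The sketch below assumes decimal gigabytes (10^9 bytes); the description does not say how the 8GB figure was derived, and the helper name is hypothetical.

```python
def implied_bytes_per_row(memory_budget_bytes: int, rows: int) -> float:
    """Return the average per-row memory implied by a total budget."""
    return memory_budget_bytes / rows


GB = 10**9  # assumption: decimal gigabytes
print(implied_bytes_per_row(8 * GB, 1_000_000_000))
```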

Issue Number: close #xxx

Related PR: #xxx

Problem Summary:

None
@Jibing-Li
Contributor Author

run buildall

@hello-stephen
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@Jibing-Li Jibing-Li marked this pull request as ready for review March 12, 2025 01:56
@Jibing-Li Jibing-Li requested a review from yiguolei as a code owner March 12, 2025 01:56
@yiguolei yiguolei merged commit 5988495 into apache:branch-2.1 Mar 14, 2025
21 checks passed
@Jibing-Li Jibing-Li deleted the oom2.1 branch March 14, 2025 09:59
3 participants