-
Notifications
You must be signed in to change notification settings - Fork 3.7k
Description
Is your feature request related to a problem? Please describe.
Doris currently supports approximate percentage calculations, but there are some business scenarios that require accurate percentage calculation. Hive, Spark and Alicloud MaxCompute all support exact percentile aggregate.
https://spark.apache.org/docs/latest/sql-ref-functions-builtin.html
https://help.aliyun.com/document_detail/48975.html#title-x4d-jao-van
Describe the solution you'd like
refer to: https://github.com/apache/hive/blob/7b3ecf617a6d46f48a3b6f77e0339fd4ad95a420/ql/src/java/org/apache/hadoop/hive/ql/udf/UDAFPercentile.java
- calculate the cumulative number of occurrences of each value.
<Value, count>
19,2,1,1,7,5,7,9,9,1 => <1,3> <2,1> <5,1> <7,2> <9,2> <19,1>
- sort by value and calculate cumulative rank
<1,3> <2,4> <5,5> <7,7> <9,9> <19,10>
- Linear exploration to calculate the exact percentile (linear interpolation calculation if necessary)
percentile(value, 0.25) = (3-2.25)*1 + (2.25 - 2)*2 = 1.25
import numpy as np
a = np.array([1,1,1,2,5,7,7,9,9,19])
print(np.percentile(a, 25))
1.25Describe alternatives you've considered
A clear and concise description of any alternative solutions or features you've considered.