We used DataSketches to compute quantiles and got very strange query results.
Affected Version
0.12.2
Description
Metrics spec at ingestion time:
"metricsSpec": [
  {
    "type": "count",
    "name": "count"
  },
  {
    "type": "doubleSum",
    "name": "cm_value",
    "fieldName": "cm_value",
    "expression": null
  },
  {
    "type": "quantilesDoublesSketch",
    "name": "cm_value_sketch",
    "fieldName": "cm_value",
    "k": 128
  }
]
My query:
"aggregations": [
  {
    "type": "quantilesDoublesSketch",
    "name": "custom_value_sketch",
    "fieldName": "cm_value"
  },
  {
    "type": "doubleSum",
    "name": "count",
    "fieldName": "count"
  },
  {
    "type": "doubleSum",
    "name": "cm_value_sum",
    "fieldName": "cm_value"
  }
],
"postAggregations": [
  {
    "type": "quantilesDoublesSketchToQuantiles",
    "name": "quantiles",
    "fractions": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1],
    "field": {
      "type": "fieldAccess",
      "fieldName": "custom_value_sketch"
    }
  }
]
The query result:
"result" : {
  "count" : 4223.0,
  "cm_value_sum" : 667109.0,
  "quantiles" : [ 52.0, 179.0, 515.0, 929.0, 1185.0, 1426.0, 1680.0, 2047.0, 2601.0, 6000.0 ],
  "custom_value_sketch" : 529
}
As we can see, the 0.5-quantile is 1185.0, so nearly half of the cm_value values must be greater than or equal to 1185.0. However, if cm_value is never negative, multiplying 1185 by 2111 (half of the count) gives 2501535, which is much greater than the cm_value sum of 667109. That is impossible and should not happen. We loaded the same data into Hive and ran the equivalent query there, which gave this result:
"result" : {
  "count" : 4223.0,
  "cm_value_sum" : 667109.0,
  "quantiles" : [ 70.0, 82.0, 96.0, 112.0, 136.0, 160.0, 189.0, 229.0, 274.8000000000002, 3368.0 ]
}
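To make the consistency check above reproducible, here is a minimal Python sketch of the same arithmetic. It assumes every cm_value is non-negative, so the values at or above a quantile give a lower bound on the total sum. The function name `min_sum_implied_by_quantile` is made up for illustration and is not part of Druid or DataSketches.

```python
import math


def min_sum_implied_by_quantile(count, fraction, quantile_value):
    """Lower bound on the sum of `count` non-negative values whose
    `fraction`-quantile is `quantile_value` (hypothetical helper)."""
    # About (1 - fraction) of the values are >= quantile_value; the rest
    # are assumed to contribute at least 0 (the non-negativity assumption).
    at_or_above = math.floor(count * (1 - fraction))
    return at_or_above * quantile_value


count = 4223
reported_sum = 667109.0

# Medians taken from the two query results in this report.
druid_bound = min_sum_implied_by_quantile(count, 0.5, 1185.0)
hive_bound = min_sum_implied_by_quantile(count, 0.5, 136.0)

print("Druid median implies sum >=", druid_bound, "consistent:", druid_bound <= reported_sum)
print("Hive median implies sum >=", hive_bound, "consistent:", hive_bound <= reported_sum)
```

With the numbers above, the Druid median implies a sum of at least 2501535.0, which exceeds the reported 667109.0, while the Hive median implies at least 287096.0, which is consistent with it.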
@AlexanderSaydakov is there a bug in the DataSketches Quantiles Sketch, or am I using it the wrong way?