Merged
2 changes: 2 additions & 0 deletions distribution/pom.xml
@@ -93,6 +93,8 @@
<argument>-c</argument>
<argument>io.druid.extensions:druid-s3-extensions</argument>
<argument>-c</argument>
<argument>io.druid.extensions:druid-stats</argument>
<argument>-c</argument>
<argument>io.druid.extensions:mysql-metadata-storage</argument>
<argument>-c</argument>
<argument>io.druid.extensions:postgresql-metadata-storage</argument>
152 changes: 152 additions & 0 deletions docs/content/development/extensions-core/stats.md
@@ -0,0 +1,152 @@
---
layout: doc_page
---

# Stats aggregator

Includes statistics-related aggregators, such as variance and standard deviation. Make sure to [include](../../operations/including-extensions.html) `druid-stats` as an extension.

## Variance aggregator

The aggregator uses the same algorithm as Apache Hive. The following description is taken from `GenericUDAFVariance` in Hive.

Evaluate the variance using the algorithm described by Chan, Golub, and LeVeque in
"Algorithms for computing the sample variance: analysis and recommendations",
The American Statistician, 37 (1983) pp. 242--247.

`variance = variance1 + variance2 + n/(m*(m+n)) * pow(((m/n)*t1 - t2),2)`

where:

- `variance` is sum[(x-avg)^2] (this is actually n times the variance) and is updated at every step.
- `n` is the count of elements in chunk1.
- `m` is the count of elements in chunk2.
- `t1` is the sum of elements in chunk1 and `t2` is the sum of elements in chunk2.

This algorithm was proven to be numerically stable by J.L. Barlow in
"Error analysis of a pairwise summation algorithm to compute sample variance"
Numer. Math, 58 (1991) pp. 583--590
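
The merge step can be sketched in plain Java as follows. This is an illustrative sketch rather than the extension's own code: the class `VarianceMergeSketch`, its nested `Partial` type, and the `of` and `merge` helpers are hypothetical names. Each chunk is summarized by its count, its sum, and its sum of squared deviations, and two summaries are combined with the formula above.

```java
// Hypothetical sketch of the pairwise merge formula; not part of druid-stats.
final class VarianceMergeSketch
{
  /** Summary of one chunk: count, sum, and sum[(x-avg)^2] (n times the variance). */
  static final class Partial
  {
    final long count;
    final double sum;
    final double nvariance;

    Partial(long count, double sum, double nvariance)
    {
      this.count = count;
      this.sum = sum;
      this.nvariance = nvariance;
    }
  }

  /** Builds a summary directly from raw values using a simple two-pass computation. */
  static Partial of(double... values)
  {
    double sum = 0;
    for (double v : values) {
      sum += v;
    }
    double mean = values.length == 0 ? 0 : sum / values.length;
    double nvariance = 0;
    for (double v : values) {
      nvariance += (v - mean) * (v - mean);
    }
    return new Partial(values.length, sum, nvariance);
  }

  /** Combines chunk1 (a) and chunk2 (b) without revisiting the raw values. */
  static Partial merge(Partial a, Partial b)
  {
    if (a.count == 0) {
      return b;
    }
    if (b.count == 0) {
      return a;
    }
    double n = a.count;                       // count of elements in chunk1
    double m = b.count;                       // count of elements in chunk2
    double t = (m / n) * a.sum - b.sum;       // (m/n)*t1 - t2
    double nvariance = a.nvariance + b.nvariance + n / (m * (m + n)) * t * t;
    return new Partial(a.count + b.count, a.sum + b.sum, nvariance);
  }

  public static void main(String[] args)
  {
    Partial merged = merge(of(1, 2, 3), of(4, 5, 6, 7));
    Partial direct = of(1, 2, 3, 4, 5, 6, 7);
    // Both values are approximately 28.
    System.out.println(merged.nvariance + " vs " + direct.nvariance);
  }
}
```

Because the merged summary has the same shape as its inputs, partial results can be combined in any order, up to floating-point rounding.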

### Pre-aggregating variance at ingestion time

To use this feature, a "variance" aggregator must be included at indexing time.
The ingestion aggregator can only be applied to numeric values. If you use "variance",
any input row that is missing the value is treated as having a value of 0.

You can specify the expected input type for ingestion as one of "float", "long", or "variance"; the default is "float".

> Review comment (Contributor): weird, GitHub seems to have reverted to Comic Sans for me; that looks like a capital L but is indeed a lowercase l.


```json
{
  "type" : "variance",
  "name" : <output_name>,
  "fieldName" : <metric_name>,
  "inputType" : <input_type>,
  "estimator" : <string>
}
```
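
Treating a missing value as 0 can change the result noticeably. The following plain-Java sketch (not Druid code; `MissingAsZeroSketch`, `fillMissing`, and `sampleVariance` are hypothetical names) compares the sample variance when a missing row is counted as 0 with the sample variance when that row is left out.

```java
// Hypothetical illustration of the "missing value counts as 0" rule; not part of druid-stats.
final class MissingAsZeroSketch
{
  /** Replaces missing (null) rows with 0, mirroring the ingestion behavior described above. */
  static double[] fillMissing(Double[] rows)
  {
    double[] filled = new double[rows.length];
    for (int i = 0; i < rows.length; i++) {
      filled[i] = rows[i] == null ? 0.0 : rows[i];
    }
    return filled;
  }

  /** Two-pass sample variance; expects at least two values. */
  static double sampleVariance(double[] values)
  {
    double sum = 0;
    for (double v : values) {
      sum += v;
    }
    double mean = sum / values.length;
    double squares = 0;
    for (double v : values) {
      squares += (v - mean) * (v - mean);
    }
    return squares / (values.length - 1);
  }

  public static void main(String[] args)
  {
    Double[] rows = {10.0, null, 14.0};
    System.out.println(sampleVariance(fillMissing(rows)));        // prints 52.0 (rows 10, 0, 14)
    System.out.println(sampleVariance(new double[]{10.0, 14.0})); // prints 8.0 (missing row left out)
  }
}
```

If a 0 would distort the statistic for your data, it may be worth filtering or filling such rows before ingestion.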

To query for results, the query must include either a "variance" aggregator with the "variance" input type or, more simply, a "varianceFold" aggregator.

```json
{
  "type" : "varianceFold",
  "name" : <output_name>,
  "fieldName" : <metric_name>,
  "estimator" : <string>
}
```

|Property |Description |Default |
|-------------------------|------------------------------|----------------------------------|
|`estimator`|Set to "population" to compute the population variance (variance_pop) instead of the sample variance (variance_sample), which is the default.|null|


### Standard Deviation post-aggregator

To obtain the standard deviation from a variance, use the "stddev" post-aggregator.

```json
{
  "type": "stddev",
  "name": "<output_name>",
  "fieldName": "<aggregator_name>",
  "estimator": <string>
}
```
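
The effect of `estimator` on the standard deviation can be sketched in plain Java. This is a minimal sketch under stated assumptions, not the extension's implementation: `StddevSketch` and its methods are hypothetical names, the case-insensitive match on "population" is an assumption, and returning NaN when there are too few rows is a choice made for this sketch only.

```java
// Hypothetical sketch of population vs. sample estimators; not part of druid-stats.
final class StddevSketch
{
  /** Assumption: only the string "population" (in any case) selects the population estimator. */
  static boolean isPopulation(String estimator)
  {
    return "population".equalsIgnoreCase(estimator);
  }

  /**
   * Turns the aggregated state (count and sum[(x-avg)^2]) into a variance.
   * Population: divide by count. Sample (default): divide by count - 1.
   */
  static double variance(long count, double nvariance, String estimator)
  {
    long divisor = isPopulation(estimator) ? count : count - 1;
    if (divisor <= 0) {
      return Double.NaN; // assumption of this sketch: too few rows yields NaN
    }
    return nvariance / divisor;
  }

  /** Standard deviation is the square root of the chosen variance. */
  static double stddev(long count, double nvariance, String estimator)
  {
    return Math.sqrt(variance(count, nvariance, estimator));
  }

  public static void main(String[] args)
  {
    // Values 1..7 have count 7 and sum of squared deviations 28.
    System.out.println(stddev(7, 28.0, "population")); // prints 2.0
    System.out.println(stddev(7, 28.0, null));         // sample estimator, slightly larger than 2.0
  }
}
```

The sample estimator always gives a slightly larger value than the population estimator for the same data, and the gap shrinks as the row count grows.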

## Query Examples:

### Timeseries Query

```json
{
  "queryType": "timeseries",
  "dataSource": "testing",
  "granularity": "day",
  "aggregations": [
    {
      "type": "variance",
      "name": "index_var",
      "fieldName": "index_var"
    }
  ],
  "intervals": [
    "2016-03-01T00:00:00.000/2016-03-20T00:00:00.000"
  ]
}
```

### TopN Query

```json
{
  "queryType": "topN",
  "dataSource": "testing",
  "dimension": "alias",
  "metric": "index_stddev",
  "threshold": 5,
  "granularity": "all",
  "aggregations": [
    {
      "type": "variance",
      "name": "index_var",
      "fieldName": "index"
    }
  ],
  "postAggregations": [
    {
      "type": "stddev",
      "name": "index_stddev",
      "fieldName": "index_var"
    }
  ],
  "intervals": [
    "2016-03-06T00:00:00/2016-03-06T23:59:59"
  ]
}
```

### GroupBy Query

```json
{
  "queryType": "groupBy",
  "dataSource": "testing",
  "dimensions": ["alias"],
  "granularity": "all",
  "aggregations": [
    {
      "type": "variance",
      "name": "index_var",
      "fieldName": "index"
    }
  ],
  "postAggregations": [
    {
      "type": "stddev",
      "name": "index_stddev",
      "fieldName": "index_var"
    }
  ],
  "intervals": [
    "2016-03-06T00:00:00/2016-03-06T23:59:59"
  ]
}
```
1 change: 1 addition & 0 deletions docs/content/development/extensions.md
@@ -30,6 +30,7 @@ Core extensions are maintained by Druid committers.
|druid-kafka-extraction-namespace|Kafka-based namespaced lookup. Requires namespace lookup extension.|[link](../development/extensions-core/kafka-extraction-namespace.html)|
|druid-lookups-cached-global|A module for [lookups](../querying/lookups.html) providing a jvm-global eager caching for lookups. It provides JDBC and URI implementations for fetching lookup data.|[link](../development/extensions-core/lookups-cached-global.html)|
|druid-s3-extensions|Interfacing with data in AWS S3, and using S3 as deep storage.|[link](../development/extensions-core/s3.html)|
|druid-stats|Statistics related module including variance and standard deviation.|[link](../development/extensions-core/stats.html)|
|mysql-metadata-storage|MySQL metadata store.|[link](../development/extensions-core/mysql.html)|
|postgresql-metadata-storage|PostgreSQL metadata store.|[link](../development/extensions-core/postgresql.html)|

64 changes: 64 additions & 0 deletions extensions-core/stats/pom.xml
@@ -0,0 +1,64 @@
<?xml version="1.0" encoding="UTF-8"?>
<!--
~ Licensed to Metamarkets Group Inc. (Metamarkets) under one
~ or more contributor license agreements. See the NOTICE file
~ distributed with this work for additional information
~ regarding copyright ownership. Metamarkets licenses this file
~ to you under the Apache License, Version 2.0 (the
~ "License"); you may not use this file except in compliance
~ with the License. You may obtain a copy of the License at
~
~ http://www.apache.org/licenses/LICENSE-2.0
~
~ Unless required by applicable law or agreed to in writing,
~ software distributed under the License is distributed on an
~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
~ KIND, either express or implied. See the License for the
~ specific language governing permissions and limitations
~ under the License.
-->

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
  <modelVersion>4.0.0</modelVersion>

  <groupId>io.druid.extensions</groupId>
  <artifactId>druid-stats</artifactId>
  <name>druid-stats</name>
  <description>druid-stats</description>

  <parent>
    <groupId>io.druid</groupId>
    <artifactId>druid</artifactId>
    <version>0.9.2-SNAPSHOT</version>
    <relativePath>../../pom.xml</relativePath>
  </parent>

  <dependencies>
    <dependency>
      <groupId>io.druid</groupId>
      <artifactId>druid-processing</artifactId>
      <version>${project.parent.version}</version>
      <scope>provided</scope>
    </dependency>

    <!-- Tests -->
    <dependency>
      <groupId>io.druid</groupId>
      <artifactId>druid-processing</artifactId>
      <version>${project.parent.version}</version>
      <scope>test</scope>
      <type>test-jar</type>
    </dependency>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>org.easymock</groupId>
      <artifactId>easymock</artifactId>
      <scope>test</scope>
    </dependency>
  </dependencies>

</project>
@@ -0,0 +1,58 @@
/*
* Licensed to Metamarkets Group Inc. (Metamarkets) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. Metamarkets licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing,
* software distributed under the License is distributed on an
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
* KIND, either express or implied. See the License for the
* specific language governing permissions and limitations
* under the License.
*/

package io.druid.query.aggregation.stats;

import com.fasterxml.jackson.databind.Module;
import com.fasterxml.jackson.databind.module.SimpleModule;
import com.google.common.collect.ImmutableList;
import com.google.inject.Binder;
import io.druid.initialization.DruidModule;
import io.druid.query.aggregation.variance.StandardDeviationPostAggregator;
import io.druid.query.aggregation.variance.VarianceAggregatorFactory;
import io.druid.query.aggregation.variance.VarianceFoldingAggregatorFactory;
import io.druid.query.aggregation.variance.VarianceSerde;
import io.druid.segment.serde.ComplexMetrics;

import java.util.List;

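/**
 * Druid module for the druid-stats extension. Registers the variance aggregator factories and the
 * "stddev" post-aggregator with Jackson, and registers the serde for the "variance" complex metric.
 */
public class DruidStatsModule implements DruidModule
{
  @Override
  public List<? extends Module> getJacksonModules()
  {
    return ImmutableList.of(
        new SimpleModule().registerSubtypes(
            VarianceAggregatorFactory.class,
            VarianceFoldingAggregatorFactory.class,
            StandardDeviationPostAggregator.class
        )
    );
  }

  @Override
  public void configure(Binder binder)
  {
    // Guard against registering the "variance" serde more than once.
    if (ComplexMetrics.getSerdeForType("variance") == null) {
      ComplexMetrics.registerSerde("variance", new VarianceSerde());
    }
  }
}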
@@ -0,0 +1,104 @@
/*
* Licensed to Metamarkets Group Inc. (Metamarkets) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. Metamarkets licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing,
* software distributed under the License is distributed on an
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
* KIND, either express or implied. See the License for the
* specific language governing permissions and limitations
* under the License.
*/

package io.druid.query.aggregation.variance;

import com.fasterxml.jackson.annotation.JsonCreator;
import com.fasterxml.jackson.annotation.JsonProperty;
import com.fasterxml.jackson.annotation.JsonTypeName;
import com.google.common.base.Preconditions;
import com.google.common.collect.Sets;
import io.druid.query.aggregation.PostAggregator;
import io.druid.query.aggregation.post.ArithmeticPostAggregator;

import java.util.Comparator;
import java.util.Map;
import java.util.Set;

/**
 * Post-aggregator that returns the square root of the variance held by the aggregator named by
 * {@code fieldName}. The {@code estimator} selects the population or the sample variance.
 */
@JsonTypeName("stddev")
public class StandardDeviationPostAggregator implements PostAggregator
{
  protected final String name;
  protected final String fieldName;
  protected final String estimator;

  protected final boolean isVariancePop;

  @JsonCreator
  public StandardDeviationPostAggregator(
      @JsonProperty("name") String name,
      @JsonProperty("fieldName") String fieldName,
      @JsonProperty("estimator") String estimator
  )
  {
    this.fieldName = Preconditions.checkNotNull(fieldName, "fieldName is null");
    this.name = Preconditions.checkNotNull(name, "name is null");
    this.estimator = estimator;
    this.isVariancePop = VarianceAggregatorCollector.isVariancePop(estimator);
  }

  @Override
  public Set<String> getDependentFields()
  {
    return Sets.newHashSet(fieldName);
  }

  @Override
  public Comparator<Double> getComparator()
  {
    return ArithmeticPostAggregator.DEFAULT_COMPARATOR;
  }

  @Override
  public Object compute(Map<String, Object> combinedAggregators)
  {
    // The value stored under fieldName is expected to be a VarianceAggregatorCollector.
    return Math.sqrt(((VarianceAggregatorCollector) combinedAggregators.get(fieldName)).getVariance(isVariancePop));
  }

  @Override
  @JsonProperty("name")
  public String getName()
  {
    return name;
  }

  @JsonProperty("fieldName")
  public String getFieldName()
  {
    return fieldName;
  }

  @JsonProperty("estimator")
  public String getEstimator()
  {
    return estimator;
  }

  @Override
  public String toString()
  {
    return "StandardDeviationPostAggregator{" +
           "name='" + name + '\'' +
           ", fieldName='" + fieldName + '\'' +
           ", isVariancePop='" + isVariancePop + '\'' +
           '}';
  }
}