[SPARK-18853][SQL] Project (UnaryNode) is way too aggressive in estimating statistics #16274
Conversation
lgtm

Test build #70113 has finished for PR 16274 at commit

Test build #70123 has finished for PR 16274 at commit
- * The default size of a value of the ArrayType is 100 * the default size of the element type.
- * (We assume that there are 100 elements).
+ * The default size of a value of the ArrayType is 1 * the default size of the element type.
+ * (We assume that there are 1 elements).
Language? (We assume that there is 1 element)
  * (We assume that there are 1 elements).
  */
- override def defaultSize: Int = 100 * elementType.defaultSize
+ override def defaultSize: Int = 1 * elementType.defaultSize
Why multiply by 1?
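To make the numbers under discussion concrete, here is a toy model of the collection-size estimation (the classes below are simplified stand-ins for Catalyst's types, not Spark's actual code; the `assumedElems`/`assumedEntries` parameters illustrate the constant this PR changes from 100 to 1):

```scala
// Toy model of collection defaultSize estimation; these are simplified
// stand-ins for Catalyst's DataType hierarchy, not Spark's actual classes.
sealed trait DataType { def defaultSize: Long }
case object IntegerType extends DataType { val defaultSize = 4L }

// assumedElems / assumedEntries model the per-collection element-count
// constant that this PR changes from 100 to 1.
case class ArrayType(elem: DataType, assumedElems: Long) extends DataType {
  def defaultSize: Long = assumedElems * elem.defaultSize
}
case class MapType(key: DataType, value: DataType, assumedEntries: Long) extends DataType {
  def defaultSize: Long = assumedEntries * (key.defaultSize + value.defaultSize)
}

// An array of maps of ints: the constants compound multiplicatively.
val oldSize = ArrayType(MapType(IntegerType, IntegerType, 100L), 100L).defaultSize
val newSize = ArrayType(MapType(IntegerType, IntegerType, 1L), 1L).defaultSize
println(oldSize) // 80000 = 100 * 100 * (4 + 4)
println(newSize) // 8
```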
Two nits, otherwise LGTM.
  /**
-  * The default size of a value of the ArrayType is 100 * the default size of the element type.
-  * (We assume that there are 100 elements).
+  * The default size of a value of the ArrayType is 1 * the default size of the element type.
Suggest:
The default size of a value of the ArrayType is the default size of the element type.
Outside of some comment grooming, LGTM.
LGTM
Test build #70141 has finished for PR 16274 at commit
Merging to master/2.1. Thanks!
…ating statistics

## What changes were proposed in this pull request?

This patch reduces the default number-of-elements estimate for arrays and maps from 100 to 1. The problem with 100 is that the estimates compound when types are nested (e.g. for an array of maps, 100 * 100 would be used as the default size). That sounds like mere overestimation, which doesn't seem so bad, since it is usually better to overestimate than to underestimate. However, because of the way we estimate the output size for Project (new estimated column size / old estimated column size), this overestimation can turn into an underestimation. In this case it is generally safer to assume 1 default element.

## How was this patch tested?

This should be covered by existing tests.

Author: Reynold Xin <rxin@databricks.com>

Closes #16274 from rxin/SPARK-18853.

(cherry picked from commit 5d79947)
Signed-off-by: Herman van Hovell <hvanhovell@databricks.com>
What changes were proposed in this pull request?
This patch reduces the default number-of-elements estimate for arrays and maps from 100 to 1. The problem with 100 is that the estimates compound when types are nested (e.g. for an array of maps, 100 * 100 would be used as the default size). That sounds like mere overestimation, which doesn't seem so bad, since it is usually better to overestimate than to underestimate. However, because of the way we estimate the output size for Project (new estimated column size / old estimated column size), this overestimation can turn into an underestimation. In this case it is generally safer to assume 1 default element.
How was this patch tested?
This should be covered by existing tests.
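The overestimation-becomes-underestimation effect can be sketched numerically. The helper below is a hypothetical simplification of how Project scales the child's size estimate by the ratio of row sizes; it is an illustration, not Spark's actual implementation:

```scala
// Hypothetical sketch of the Project size estimate described above:
// scale the child's size by (projected row size / child row size).
// This is a simplification, not Spark's actual code.
def projectSize(childSizeInBytes: Long, childRowSize: Long, outRowSize: Long): Long =
  childSizeInBytes * outRowSize / childRowSize

// Child relation: a 4-byte int column plus an array-of-map column, with an
// estimated child output size of, say, 1 MB. Project keeps only the int column.
val childSize = 1000000L

// Old default: the array of maps is estimated at 100 * 100 * (4 + 4) = 80000
// bytes, so the row "shrinks" from 80004 to 4 bytes and the output is
// estimated at ~49 bytes: the overestimated column becomes a huge underestimate.
println(projectSize(childSize, 80004L, 4L)) // 49

// New default: 1 * 1 * (4 + 4) = 8 bytes for the nested column, so the
// ratio stays sane and the estimate does too.
println(projectSize(childSize, 12L, 4L)) // 333333
```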