
Conversation

Contributor

@rxin rxin commented Dec 14, 2016

What changes were proposed in this pull request?

This patch reduces the default element count estimate for arrays and maps from 100 to 1. The problem with 100 is that, when types are nested (e.g. an array of maps), 100 * 100 is used as the default size. That sounds like a mere overestimation, which doesn't seem so bad (overestimating is usually safer than underestimating). However, because of the way the output size of a Project is estimated (scaling by new estimated column size / old estimated column size), this overestimation can turn into an underestimation. In this case it is generally safer to assume a default of 1 element.
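To make the effect concrete, here is a minimal standalone sketch (not Spark code; the sizes, the `childSizeInBytes` value, and the column layout are all hypothetical) of how the old 100-element default compounds for a nested column and how the Project size ratio can then flip an overestimate into an underestimate:

```scala
// Standalone sketch of the estimation problem; all names and numbers are hypothetical.
object DefaultSizeSketch extends App {
  val elementSize = 4 // assumed default size (bytes) of the innermost element type

  // Old assumption: 100 elements per collection, so an array-of-map column
  // compounds to 100 * 100 * elementSize.
  val oldNestedColumnSize = 100 * (100 * elementSize) // 40000 bytes per row
  // New assumption: 1 element per collection.
  val newNestedColumnSize = 1 * (1 * elementSize)     // 4 bytes per row

  // Project scales the child's size by (new estimated column sizes / old estimated
  // column sizes). If the child's estimated row size is dominated by the inflated
  // nested column and the Project drops that column, the ratio collapses and the
  // size of the remaining columns is badly underestimated.
  val childSizeInBytes = 1000000L                  // hypothetical child relation size
  val oldChildRowSize  = oldNestedColumnSize + 100 // nested column + other columns
  val projectRowSize   = 100                       // Project keeps only the other columns
  val projectEstimate  = childSizeInBytes * projectRowSize / oldChildRowSize
  println(s"nested column default size: old=$oldNestedColumnSize B, new=$newNestedColumnSize B")
  println(s"Project output estimate under the old default: $projectEstimate B") // ~2493 B
}
```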

How was this patch tested?

This should be covered by existing tests.

Contributor Author

rxin commented Dec 14, 2016

cc @mallman @davies @srinathshankar

Contributor

davies commented Dec 14, 2016

lgtm


SparkQA commented Dec 14, 2016

Test build #70113 has finished for PR 16274 at commit 4d33dd8.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Dec 14, 2016

Test build #70123 has finished for PR 16274 at commit 21570a7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

- * The default size of a value of the ArrayType is 100 * the default size of the element type.
- * (We assume that there are 100 elements).
+ * The default size of a value of the ArrayType is 1 * the default size of the element type.
+ * (We assume that there are 1 elements).

Language? (We assume that there is 1 element)

+ * (We assume that there are 1 elements).
  */
- override def defaultSize: Int = 100 * elementType.defaultSize
+ override def defaultSize: Int = 1 * elementType.defaultSize

Why multiply by 1?

@hvanhovell
Contributor

Two nits, otherwise LGTM.

  /**
- * The default size of a value of the ArrayType is 100 * the default size of the element type.
- * (We assume that there are 100 elements).
+ * The default size of a value of the ArrayType is 1 * the default size of the element type.

Suggest:

The default size of a value of the ArrayType is the default size of the element type.
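With both wording nits addressed, the comment and override would presumably read roughly as follows (a sketch of the suggested wording, not necessarily the exact merged text):

```scala
/**
 * The default size of a value of the ArrayType is the default size of the element type.
 * (We assume that there is 1 element.)
 */
override def defaultSize: Int = 1 * elementType.defaultSize
```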

@mallman
Contributor

mallman commented Dec 14, 2016

Outside of some comment grooming, LGTM.

@gatorsmile
Member

LGTM


SparkQA commented Dec 14, 2016

Test build #70141 has finished for PR 16274 at commit 86a2b94.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@hvanhovell
Contributor

Merging to master/2.1. Thanks!

asfgit pushed a commit that referenced this pull request Dec 14, 2016
…ating statistics


Author: Reynold Xin <rxin@databricks.com>

Closes #16274 from rxin/SPARK-18853.

(cherry picked from commit 5d79947)
Signed-off-by: Herman van Hovell <hvanhovell@databricks.com>
asfgit pushed a commit that referenced this pull request Dec 14, 2016
…ating statistics


Author: Reynold Xin <rxin@databricks.com>

Closes #16274 from rxin/SPARK-18853.

(cherry picked from commit 5d79947)
Signed-off-by: Herman van Hovell <hvanhovell@databricks.com>
(cherry picked from commit e8866f9)
Signed-off-by: Herman van Hovell <hvanhovell@databricks.com>
@asfgit asfgit closed this in 5d79947 Dec 14, 2016
robert3005 pushed a commit to palantir/spark that referenced this pull request Dec 15, 2016
…ating statistics


Author: Reynold Xin <rxin@databricks.com>

Closes apache#16274 from rxin/SPARK-18853.
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
…ating statistics


Author: Reynold Xin <rxin@databricks.com>

Closes apache#16274 from rxin/SPARK-18853.
