[SPARK-18853][SQL] Project (UnaryNode) is way too aggressive in estimating statistics #16274
Conversation
lgtm

Test build #70113 has finished for PR 16274 at commit

Test build #70123 has finished for PR 16274 at commit
- * The default size of a value of the ArrayType is 100 * the default size of the element type.
- * (We assume that there are 100 elements).
+ * The default size of a value of the ArrayType is 1 * the default size of the element type.
+ * (We assume that there are 1 elements).
Language? (We assume that there is 1 element)
  * (We assume that there are 1 elements).
  */
- override def defaultSize: Int = 100 * elementType.defaultSize
+ override def defaultSize: Int = 1 * elementType.defaultSize
Why multiply by 1?
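To make the numbers under discussion concrete, here is a toy model of the collection-size estimation (the classes below are simplified stand-ins for Catalyst's types, not Spark's actual code; the `assumedElems`/`assumedEntries` parameters illustrate the constant this PR changes from 100 to 1):

```scala
// Toy model of collection defaultSize estimation; these are simplified
// stand-ins for Catalyst's DataType hierarchy, not Spark's actual classes.
sealed trait DataType { def defaultSize: Long }
case object IntegerType extends DataType { val defaultSize = 4L }

// assumedElems / assumedEntries model the per-collection element-count
// constant that this PR changes from 100 to 1.
case class ArrayType(elem: DataType, assumedElems: Long) extends DataType {
  def defaultSize: Long = assumedElems * elem.defaultSize
}
case class MapType(key: DataType, value: DataType, assumedEntries: Long) extends DataType {
  def defaultSize: Long = assumedEntries * (key.defaultSize + value.defaultSize)
}

// An array of maps of ints: the constants compound multiplicatively.
val oldSize = ArrayType(MapType(IntegerType, IntegerType, 100L), 100L).defaultSize
val newSize = ArrayType(MapType(IntegerType, IntegerType, 1L), 1L).defaultSize
println(oldSize) // 80000 = 100 * 100 * (4 + 4)
println(newSize) // 8
```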
Two nits, otherwise LGTM.
  /**
-  * The default size of a value of the ArrayType is 100 * the default size of the element type.
-  * (We assume that there are 100 elements).
+  * The default size of a value of the ArrayType is 1 * the default size of the element type.
Suggest:
The default size of a value of the ArrayType is the default size of the element type.
Outside of some comment grooming, LGTM.
LGTM
Test build #70141 has finished for PR 16274 at commit
Merging to master/2.1. Thanks!
…ating statistics

## What changes were proposed in this pull request?

This patch reduces the default number-of-elements estimate for arrays and maps from 100 to 1. The problem with 100 is that the estimates compound when types are nested (e.g. for an array of maps, 100 * 100 would be used as the default size). That sounds like mere overestimation, which doesn't seem so bad, since it is usually better to overestimate than to underestimate. However, because of the way we estimate the output size for Project (new estimated column size / old estimated column size), this overestimation can turn into an underestimation. In this case it is generally safer to assume 1 default element.

## How was this patch tested?

This should be covered by existing tests.

Author: Reynold Xin <rxin@databricks.com>

Closes #16274 from rxin/SPARK-18853.

(cherry picked from commit 5d79947)
Signed-off-by: Herman van Hovell <hvanhovell@databricks.com>
What changes were proposed in this pull request?
This patch reduces the default number-of-elements estimate for arrays and maps from 100 to 1. The problem with 100 is that the estimates compound when types are nested (e.g. for an array of maps, 100 * 100 would be used as the default size). That sounds like mere overestimation, which doesn't seem so bad, since it is usually better to overestimate than to underestimate. However, because of the way we estimate the output size for Project (new estimated column size / old estimated column size), this overestimation can turn into an underestimation. In this case it is generally safer to assume 1 default element.
How was this patch tested?
This should be covered by existing tests.
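The overestimation-becomes-underestimation effect can be sketched numerically. The helper below is a hypothetical simplification of how Project scales the child's size estimate by the ratio of row sizes; it is an illustration, not Spark's actual implementation:

```scala
// Hypothetical sketch of the Project size estimate described above:
// scale the child's size by (projected row size / child row size).
// This is a simplification, not Spark's actual code.
def projectSize(childSizeInBytes: Long, childRowSize: Long, outRowSize: Long): Long =
  childSizeInBytes * outRowSize / childRowSize

// Child relation: a 4-byte int column plus an array-of-map column, with an
// estimated child output size of, say, 1 MB. Project keeps only the int column.
val childSize = 1000000L

// Old default: the array of maps is estimated at 100 * 100 * (4 + 4) = 80000
// bytes, so the row "shrinks" from 80004 to 4 bytes and the output is
// estimated at ~49 bytes: the overestimated column becomes a huge underestimate.
println(projectSize(childSize, 80004L, 4L)) // 49

// New default: 1 * 1 * (4 + 4) = 8 bytes for the nested column, so the
// ratio stays sane and the estimate does too.
println(projectSize(childSize, 12L, 4L)) // 333333
```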