[SPARK-18457][SQL] ORC and other columnar formats using HiveShim read all columns when doing a simple count #15898
Conversation
cc @piaozhexiu I think you changed this last?

Can we add a unit test for this function?

I was just watching the JIRA out of curiosity. Actually, couldn't we add a test like this?

```
test("Empty schema does not read data from ORC file") {
  val data = (1 to 2).zip(1 to 2)
  withOrcFile(data) { path =>
    val requrestedSchema = StructType(Nil)
    val conf = new Configuration()
    val physicalSchema = OrcFileOperator.readSchema(Seq(path), Some(conf)).get
    OrcRelation.setRequiredColumns(conf, physicalSchema, requrestedSchema)
    val maybeOrcReader = OrcFileOperator.getFileReader(path, Some(conf))
    assert(maybeOrcReader.isDefined)
    val orcRecordReader = new SparkOrcNewRecordReader(
      maybeOrcReader.get, conf, 0, maybeOrcReader.get.getContentLength)
    val recordsIterator = new RecordReaderIterator[OrcStruct](orcRecordReader)
    assert(recordsIterator.next().toString == "{null, null}")
    recordsIterator.close()
  }
}
```

Test build #68693 has finished for PR 15898 at commit

The code being changed originated two years ago with the addition of Hive 0.13 support by @zhzhan. I will add the test you are suggesting, @HyukjinKwon.

Review comment on the diff:

```
val recordsIterator = new RecordReaderIterator[OrcStruct](orcRecordReader)
assert(recordsIterator.next().toString == "{null, null}")

recordsIterator.close()
```

Maybe we should put some sugar on this, something like `try {..} finally {recordsIterator.close()}`, if this test looks reasonable. I just wanted to show that it might be able to be tested :).
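
For illustration, a minimal sketch of that suggested shape, reusing the names from the proposed test above:

```
// Close the iterator even if the assertion throws.
val recordsIterator = new RecordReaderIterator[OrcStruct](orcRecordReader)
try {
  assert(recordsIterator.next().toString == "{null, null}")
} finally {
  recordsIterator.close()
}
```
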
Review comment on the diff:

```
test("Empty schema does not read data from ORC file") {
  val data = (1 to 2).zip(1 to 2)
  withOrcFile(data) { path =>
    val requrestedSchema = StructType(Nil)
```

Typo in the variable name: `requrestedSchema` should be `requestedSchema`.

Review comment on the diff:

```
test("Empty schema does not read data from ORC file") {
  val data = (1 to 2).zip(1 to 2)
```

Feels weird to read. I would rather prefer a simple `Seq((1, 1), (2, 2))`.

Feels like this will only help for queries that read zero columns, like a simple count. Which version of ORC have you tested against? In the past I have seen PRs which did not work over older versions of ORC and needed some fixes, so it would be good to check.

@tejasapatil yes, that is the use case where this applies. It's only tested against whatever version is included in the hadoop2.7+hive build configuration listed above. Is there anything in particular you were concerned about compatibility-wise?

Test build #68698 has finished for PR 15898 at commit

Test build #68701 has finished for PR 15898 at commit

Test build #68702 has finished for PR 15898 at commit

@tejasapatil does this lgtu?

Review comment on the diff:

```
try {
  assert(recordsIterator.next().toString == "{null, null}")
} catch {
  case e: Exception => fail(e)
```

Why bother catching? The test case will fail anyway, wouldn't it?

Test build #68853 has finished for PR 15898 at commit

Thanks - merging in master/branch-2.1.

[SPARK-18457][SQL] ORC and other columnar formats using HiveShim read all columns when doing a simple count

## What changes were proposed in this pull request?
When reading zero columns (e.g., count(*)) from ORC or any other format that uses HiveShim, actually set the read column list to empty for Hive to use.
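
To illustrate the idea (a hedged sketch, not the exact diff from this PR), the empty projection has to be written into the Hadoop conf explicitly via Hive's `ColumnProjectionUtils`, since an absent setting means "read all columns" to Hive:

```
import scala.collection.JavaConverters._
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.hive.serde2.ColumnProjectionUtils

// Sketch: an empty column list must still be recorded in the conf;
// otherwise Hive falls back to reading every column.
def setReadColumns(conf: Configuration, ids: Seq[Int]): Unit = {
  if (ids.isEmpty) {
    conf.setBoolean(ColumnProjectionUtils.READ_ALL_COLUMNS, false)
    conf.set(ColumnProjectionUtils.READ_COLUMN_IDS_CONF_STR, "")
  } else {
    ColumnProjectionUtils.appendReadColumns(conf, ids.map(Int.box).asJava)
  }
}
```
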
## How was this patch tested?
Query correctness is handled by existing unit tests. I'm happy to add more if anyone can point out some case that is not covered.
Reduction in data read can be verified in the UI when built with a recent version of Hadoop, say:
```
build/mvn -Pyarn -Phadoop-2.7 -Dhadoop.version=2.7.0 -Phive -DskipTests clean package
```
However, the default Hadoop 2.2 that is used for unit tests does not report actual bytes read, only full file sizes (see FileScanRDD.scala line 80). Therefore I don't think there is a good way to add a unit test for this.
I tested with the following setup using the above build options:
```
case class OrcData(intField: Long, stringField: String)
spark.range(1,1000000).map(i => OrcData(i, s"part-$i")).toDF().write.format("orc").save("orc_test")
sql(
s"""CREATE EXTERNAL TABLE orc_test(
| intField LONG,
| stringField STRING
|)
|STORED AS ORC
|LOCATION '${System.getProperty("user.dir") + "/orc_test"}'
""".stripMargin)
```
## Results
query | Spark 2.0.2 | this PR
---|---|---
`sql("select count(*) from orc_test").collect`|4.4 MB|199.4 KB
`sql("select intField from orc_test").collect`|743.4 KB|743.4 KB
`sql("select * from orc_test").collect`|4.4 MB|4.4 MB
Author: Andrew Ray <ray.andrew@gmail.com>
Closes #15898 from aray/sql-orc-no-col.
(cherry picked from commit 795e9fc)
Signed-off-by: Reynold Xin <rxin@databricks.com>