Use FileFormat-based data source instead of HadoopRDD for reads #289
Conversation
    sparkSession: SparkSession,
    options: Map[String, String],
    path: Path): Boolean = {
  // Redshift unload files are not splittable because records containing newline characters may
Per discussion with @mengxr, it sounds like the existing InputFormat already does support splitting, so this comment is incorrect. I'll go ahead and update this.
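For readers following along, here is a minimal sketch of what an `isSplitable` override on a Spark 2.0 `FileFormat` implementation could look like if splitting is enabled. The class name is hypothetical and the return value is illustrative only, since whether splitting is actually safe for Redshift unload output is exactly what this thread is discussing.

```scala
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.datasources.FileFormat

// Hypothetical skeleton (not the class added by this PR). Only the isSplitable
// hook is shown; inferSchema / prepareWrite / buildReader are left abstract.
abstract class ExampleUnloadFileFormat extends FileFormat {

  override def isSplitable(
      sparkSession: SparkSession,
      options: Map[String, String],
      path: Path): Boolean = {
    // If the underlying InputFormat already handles records that cross split
    // boundaries (as suggested in the discussion above), splits can be allowed.
    true
  }
}
```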
Current coverage is 88.25% (diff: 89.70%)

@@            master    #289   diff @@
==========================================
  Files           12      15     +3
  Lines          702     732    +30
  Methods        568     591    +23
  Messages         0       0
  Branches       134     141     +7
==========================================
+ Hits           622     646    +24
- Misses          80      86     +6
  Partials         0       0
 * An adaptor from a Hadoop [[RecordReader]] to an [[Iterator]] over the values returned.
 *
 * Note that this returns [[Object]]s instead of [[InternalRow]] because we rely on erasure to pass
 * column batches by pretending they are rows.
This paragraph comes from Spark source code and doesn't really apply here. It confused me for a while.
Yeah, this is confusing to me as well, especially since the return type is generic. I'll just remove this comment.
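For context, here is a rough sketch of the adaptor pattern being discussed, assuming the Hadoop `mapreduce` `RecordReader` API. The class name and details are illustrative, not the exact helper inlined from Spark in this PR.

```scala
import java.io.Closeable
import org.apache.hadoop.mapreduce.RecordReader

// Illustrative sketch: expose the values produced by a Hadoop RecordReader as an Iterator.
class RecordReaderValueIterator[T](reader: RecordReader[_, T])
    extends Iterator[T] with Closeable {

  private[this] var havePair = false
  private[this] var finished = false
  private[this] var closed = false

  override def hasNext: Boolean = {
    if (!finished && !havePair) {
      finished = !reader.nextKeyValue()
      if (finished) {
        // Release the underlying reader as soon as the input is exhausted.
        close()
      }
      havePair = !finished
    }
    !finished
  }

  override def next(): T = {
    if (!hasNext) {
      throw new NoSuchElementException("End of stream")
    }
    havePair = false
    reader.getCurrentValue
  }

  override def close(): Unit = {
    if (!closed) {
      closed = true
      reader.close()
    }
  }
}
```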
LGTM!

Great, merging to master!
This patch refactors this library's read path to use Spark 2.0's `FileFormat`-based data source to read unloaded Redshift output from S3. This approach has a few advantages over our existing `HadoopRDD`-based approach:

- It benefits from performance improvements in `FileScanRDD` and `HadoopFsRelation`, including automatic coalescing.
- We don't have to create a separate RDD per partition and union them together, making the RDD DAG smaller.

The bulk of the diff consists of helper classes copied from Spark and `spark-avro` and inlined here for API compatibility / stability purposes. Some of the new classes implemented here are likely to become incompatible with new releases of Spark, but note that `spark-avro` itself relies on similar unstable / experimental APIs, so this library is already vulnerable to changes to those APIs (in other words, this change does not make our compatibility story significantly worse).

Author: Josh Rosen <joshrosen@databricks.com>
Author: Josh Rosen <rosenville@gmail.com>

Closes #289 from JoshRosen/use-fileformat-for-reads.
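Since the `FileFormat`-based read path is an internal detail, the user-facing API is unchanged. A typical read still looks roughly like the following sketch, assuming the library's usual `url` / `dbtable` / `tempdir` options; the connection values are placeholders.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("redshift-read-example").getOrCreate()

// Placeholder connection details; the FileFormat refactor only changes how the
// unloaded S3 files are read back into a DataFrame, not this API.
val df = spark.read
  .format("com.databricks.spark.redshift")
  .option("url", "jdbc:redshift://host:5439/db?user=u&password=p")
  .option("dbtable", "my_table")
  .option("tempdir", "s3n://bucket/temp-dir/")
  .load()
```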