Conversation

@JoshRosen
Contributor

This patch refactors this library's read path to use Spark 2.0's `FileFormat`-based data source API to read unloaded Redshift output from S3. This approach has a few advantages over our existing `HadoopRDD`-based approach:

  • It will benefit from performance improvements in `FileScanRDD` and `HadoopFsRelation`, including automatic coalescing.
  • We don't have to create a separate RDD per partition and union them together, so the RDD DAG stays smaller.

The bulk of the diff consists of helper classes copied from Spark and `spark-avro` and inlined here for API compatibility / stability purposes. Some of the new classes implemented here are likely to break with future Spark releases, but note that `spark-avro` itself relies on similarly unstable / experimental APIs, so this library is already vulnerable to changes in those APIs (in other words, this change does not make our compatibility story significantly worse).
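For reference, the Spark 2.0 `FileFormat` contract that such a data source implements looks roughly like this. This is a simplified sketch paraphrased from Spark's `org.apache.spark.sql.execution.datasources.FileFormat`, not the code added in this PR (the trait name `SketchFileFormat` is just for illustration):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileStatus, Path}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.execution.datasources.PartitionedFile
import org.apache.spark.sql.sources.Filter
import org.apache.spark.sql.types.StructType

// Simplified sketch of the Spark 2.0 FileFormat surface.
trait SketchFileFormat {
  // Infer the schema of the data from the files about to be read.
  def inferSchema(
      sparkSession: SparkSession,
      options: Map[String, String],
      files: Seq[FileStatus]): Option[StructType]

  // Whether a single file may be split across multiple read tasks.
  def isSplitable(
      sparkSession: SparkSession,
      options: Map[String, String],
      path: Path): Boolean

  // Returns a function that opens one file and yields its rows;
  // Spark's FileScanRDD drives these per-file readers, which is where
  // the coalescing and scheduling improvements come from.
  def buildReader(
      sparkSession: SparkSession,
      dataSchema: StructType,
      partitionSchema: StructType,
      requiredSchema: StructType,
      filters: Seq[Filter],
      options: Map[String, String],
      hadoopConf: Configuration): PartitionedFile => Iterator[InternalRow]
}
```

Implementing this interface lets Spark plan a single scan over all of the unloaded files instead of a union of per-partition RDDs.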

```scala
override def isSplitable(
    sparkSession: SparkSession,
    options: Map[String, String],
    path: Path): Boolean = {
  // Redshift unload files are not splittable because records containing newline characters may
```
Contributor Author


Per discussion with @mengxr, it sounds like the existing InputFormat already does support splitting, so this comment is incorrect. I'll go ahead and update this.
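If the underlying `InputFormat` already resynchronizes on record boundaries, the update presumably amounts to reporting the files as splittable, along these lines (a hypothetical sketch of the fix, not the actual follow-up commit):

```scala
// Hypothetical corrected version: since the existing InputFormat
// handles records that straddle split boundaries, unloaded files can
// safely be read by multiple split tasks.
override def isSplitable(
    sparkSession: SparkSession,
    options: Map[String, String],
    path: Path): Boolean = true
```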

@codecov-io

codecov-io commented Oct 25, 2016

Current coverage is 88.25% (diff: 89.70%)

Merging #289 into master will decrease coverage by 0.35%

@@             master       #289   diff @@
==========================================
  Files            12         15     +3   
  Lines           702        732    +30   
  Methods         568        591    +23   
  Messages          0          0          
  Branches        134        141     +7   
==========================================
+ Hits            622        646    +24   
- Misses           80         86     +6   
  Partials          0          0          

Powered by Codecov. Last update 6cc49da...4041989

```scala
 * An adaptor from a Hadoop [[RecordReader]] to an [[Iterator]] over the values returned.
 *
 * Note that this returns [[Object]]s instead of [[InternalRow]] because we rely on erasure to pass
 * column batches by pretending they are rows.
```
Contributor


This paragraph comes from Spark source code and doesn't really apply here. It confused me for a while.

Contributor Author


Yeah, this is confusing to me as well, especially since the return type is generic. I'll just remove this comment.
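For context, the adaptor in question is roughly the following shape. This is a simplified sketch paraphrased from Spark's `RecordReaderIterator`, not the exact code inlined in this PR:

```scala
import java.util.NoSuchElementException
import org.apache.hadoop.mapreduce.RecordReader

// Simplified sketch of an adaptor exposing a Hadoop RecordReader's
// values as a Scala Iterator (names paraphrased from Spark's
// RecordReaderIterator).
class RecordReaderIterator[T](reader: RecordReader[_, T]) extends Iterator[T] {
  private var havePair = false
  private var finished = false

  override def hasNext: Boolean = {
    if (!finished && !havePair) {
      finished = !reader.nextKeyValue()
      if (finished) {
        // Close eagerly so the underlying file handle is released as
        // soon as the last record has been read.
        reader.close()
      }
      havePair = !finished
    }
    !finished
  }

  override def next(): T = {
    if (!hasNext) {
      throw new NoSuchElementException("End of stream")
    }
    havePair = false
    reader.getCurrentValue
  }
}
```

With the generic type parameter `T` there is no need for the "returns `Object`s / column batches" caveat, which only applies to Spark's internal vectorized-read trick.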

@liancheng
Contributor

LGTM!

@JoshRosen
Contributor Author

Great, merging to master!

JoshRosen added a commit that referenced this pull request Oct 25, 2016
Author: Josh Rosen <joshrosen@databricks.com>
Author: Josh Rosen <rosenville@gmail.com>

Closes #289 from JoshRosen/use-fileformat-for-reads.
@JoshRosen JoshRosen closed this Oct 25, 2016
@JoshRosen JoshRosen deleted the use-fileformat-for-reads branch October 25, 2016 20:45
