Conversation

@JoshRosen
Contributor

This patch refactors this library's read path to use Spark 2.0's `FileFormat`-based data source API to read unloaded Redshift output from S3. This approach has a few advantages over our existing `HadoopRDD`-based approach:

  • It will benefit from performance improvements in `FileScanRDD` and `HadoopFsRelation`, including automatic coalescing.
  • We don't have to create a separate RDD per partition and union them together, so the RDD DAG stays smaller.

The bulk of the diff consists of helper classes copied from Spark and `spark-avro` and inlined here for API compatibility / stability purposes. Some of the new classes implemented here are likely to break with future Spark releases, but note that `spark-avro` itself relies on similarly unstable / experimental APIs, so this library is already vulnerable to changes in those APIs (in other words, this change does not make our compatibility story significantly worse).
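For reference, the Spark 2.0 `FileFormat` contract that such a data source implements looks roughly like this. This is a simplified sketch paraphrased from Spark's `org.apache.spark.sql.execution.datasources.FileFormat`, not the code added in this PR (the trait name `SketchFileFormat` is just for illustration):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileStatus, Path}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.execution.datasources.PartitionedFile
import org.apache.spark.sql.sources.Filter
import org.apache.spark.sql.types.StructType

// Simplified sketch of the Spark 2.0 FileFormat surface.
trait SketchFileFormat {
  // Infer the schema of the data from the files about to be read.
  def inferSchema(
      sparkSession: SparkSession,
      options: Map[String, String],
      files: Seq[FileStatus]): Option[StructType]

  // Whether a single file may be split across multiple read tasks.
  def isSplitable(
      sparkSession: SparkSession,
      options: Map[String, String],
      path: Path): Boolean

  // Returns a function that opens one file and yields its rows;
  // Spark's FileScanRDD drives these per-file readers, which is where
  // the coalescing and scheduling improvements come from.
  def buildReader(
      sparkSession: SparkSession,
      dataSchema: StructType,
      partitionSchema: StructType,
      requiredSchema: StructType,
      filters: Seq[Filter],
      options: Map[String, String],
      hadoopConf: Configuration): PartitionedFile => Iterator[InternalRow]
}
```

Implementing this interface lets Spark plan a single scan over all of the unloaded files instead of a union of per-partition RDDs.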

```scala
override def isSplitable(
    sparkSession: SparkSession,
    options: Map[String, String],
    path: Path): Boolean = {
  // Redshift unload files are not splittable because records containing newline characters may
```
Contributor Author


Per discussion with @mengxr, it sounds like the existing InputFormat already does support splitting, so this comment is incorrect. I'll go ahead and update this.
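If the underlying `InputFormat` already resynchronizes on record boundaries, the update presumably amounts to reporting the files as splittable, along these lines (a hypothetical sketch of the fix, not the actual follow-up commit):

```scala
// Hypothetical corrected version: since the existing InputFormat
// handles records that straddle split boundaries, unloaded files can
// safely be read by multiple split tasks.
override def isSplitable(
    sparkSession: SparkSession,
    options: Map[String, String],
    path: Path): Boolean = true
```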

@codecov-io

codecov-io commented Oct 25, 2016

Current coverage is 88.25% (diff: 89.70%)

Merging #289 into master will decrease coverage by 0.35%

@@             master       #289   diff @@
==========================================
  Files            12         15     +3   
  Lines           702        732    +30   
  Methods         568        591    +23   
  Messages          0          0          
  Branches        134        141     +7   
==========================================
+ Hits            622        646    +24   
- Misses           80         86     +6   
  Partials          0          0          

Powered by Codecov. Last update 6cc49da...4041989

```scala
 * An adaptor from a Hadoop [[RecordReader]] to an [[Iterator]] over the values returned.
 *
 * Note that this returns [[Object]]s instead of [[InternalRow]] because we rely on erasure to pass
 * column batches by pretending they are rows.
```
Contributor


This paragraph comes from Spark source code and doesn't really apply here. It confused me for a while.

Contributor Author


Yeah, this is confusing to me as well, especially since the return type is generic. I'll just remove this comment.
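For context, the adaptor in question is roughly the following shape. This is a simplified sketch paraphrased from Spark's `RecordReaderIterator`, not the exact code inlined in this PR:

```scala
import java.util.NoSuchElementException
import org.apache.hadoop.mapreduce.RecordReader

// Simplified sketch of an adaptor exposing a Hadoop RecordReader's
// values as a Scala Iterator (names paraphrased from Spark's
// RecordReaderIterator).
class RecordReaderIterator[T](reader: RecordReader[_, T]) extends Iterator[T] {
  private var havePair = false
  private var finished = false

  override def hasNext: Boolean = {
    if (!finished && !havePair) {
      finished = !reader.nextKeyValue()
      if (finished) {
        // Close eagerly so the underlying file handle is released as
        // soon as the last record has been read.
        reader.close()
      }
      havePair = !finished
    }
    !finished
  }

  override def next(): T = {
    if (!hasNext) {
      throw new NoSuchElementException("End of stream")
    }
    havePair = false
    reader.getCurrentValue
  }
}
```

With the generic type parameter `T` there is no need for the "returns `Object`s / column batches" caveat, which only applies to Spark's internal vectorized-read trick.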

@liancheng
Contributor

LGTM!

@JoshRosen
Contributor Author

Great, merging to master!

JoshRosen added a commit that referenced this pull request Oct 25, 2016
Author: Josh Rosen <joshrosen@databricks.com>
Author: Josh Rosen <rosenville@gmail.com>

Closes #289 from JoshRosen/use-fileformat-for-reads.
@JoshRosen JoshRosen closed this Oct 25, 2016
@JoshRosen JoshRosen deleted the use-fileformat-for-reads branch October 25, 2016 20:45
