
Conversation

@JoshRosen (Contributor) commented Oct 20, 2016

This patch adds new options that allow CSV to be used as the intermediate data format when writing data to Redshift. This can offer large performance benefits because Redshift's Avro reader can be very slow. The patch is based on #165 by @emlyn and incorporates changes from me to add documentation, make the new option case-insensitive, improve some error messages, and add tests.

Using CSV for writes also allows us to write to tables whose column names are unsupported by Avro, so #84 is partially addressed by this patch.
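
For illustration, here's a minimal sketch of what a CSV-backed write might look like. The option name `tempformat`, its `CSV` value, and the connection details below are assumptions for the example (the text above doesn't spell them out), so check the README for the exact spelling:

```scala
import org.apache.spark.sql.SaveMode

// Sketch: write an existing DataFrame `df` to Redshift, staging the data in S3
// as CSV instead of Avro. The "tempformat" option name and "CSV" value are
// assumptions based on this PR's description, not a confirmed API.
df.write
  .format("com.databricks.spark.redshift")
  .option("url", "jdbc:redshift://examplecluster:5439/dev?user=user&password=pass")
  .option("dbtable", "my_table")
  .option("tempdir", "s3n://my-bucket/tmp/")
  .option("tempformat", "CSV") // hypothetical; the default would remain AVRO
  .mode(SaveMode.ErrorIfExists)
  .save()
```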

As a hedge, I've marked this feature as "Experimental" and I'll remove that label after it's been tested in the wild a bit more.

Fixes #73.

JoshRosen added this to the 2.1.0 milestone Oct 20, 2016
JoshRosen changed the title from "Use csv for writes" to "Add option to use CSV as an intermediate data format during writes" Oct 20, 2016
codecov-io commented Oct 20, 2016

Current coverage is 88.60% (diff: 100%)

Merging #288 into master will increase coverage by 0.36%

@@             master       #288   diff @@
==========================================
  Files            12         12          
  Lines           680        702    +22   
  Methods         545        568    +23   
  Messages          0          0          
  Branches        135        134     -1   
==========================================
+ Hits            600        622    +22   
  Misses           80         80          
  Partials          0          0          


yhuai commented Oct 24, 2016

Looking now.

  // path uses SparkContext.saveAsHadoopFile(), which produces filenames of the form
  // part-XXXXX.avro. In spark-avro 2.0.0+, the partition filenames are of the form
- // part-r-XXXXX-UUID.avro.
+ // part-r-XXXXX-UUID.avro. In spark-csv, the partition filenames are of the form part-XXXXX.
yhuai commented:

For Spark 2.0, the file name format should be the same as Avro's, right?

JoshRosen (Contributor Author) replied:

Good point. I think @jaley wrote this comment back when this code used the third-party spark-csv library with Spark 1.x. I'll go ahead and update the comment to remove the 1.x discussion.

yhuai commented Oct 24, 2016

LGTM

JoshRosen (Contributor Author) commented:
Great! Merging to master.

JoshRosen added a commit that referenced this pull request Oct 25, 2016

Author: Josh Rosen <joshrosen@databricks.com>
Author: Josh Rosen <rosenville@gmail.com>
Author: Emlyn Corrin <Emlyn.Corrin@microsoft.com>
Author: Emlyn Corrin <emlyn@swiftkey.com>

Closes #288 from JoshRosen/use-csv-for-writes.
JoshRosen closed this Oct 25, 2016
JoshRosen deleted the use-csv-for-writes branch Oct 25, 2016
emlyn (Contributor) commented Oct 27, 2016

What's the ETA on this appearing in a release? It would be nice to move back to plain spark-redshift from our private fork.

JoshRosen (Contributor Author) commented:
@emlyn, I'm aiming to cut a 3.0.0-preview1 preview release today or tomorrow (this release will be published to Maven Central just like a non-preview release).
