Add option to use CSV as an intermediate data format during writes #288
Conversation
Current coverage is 88.60% (diff: 100%)

@@            master    #288    diff @@
=======================================
  Files           12      12
  Lines          680     702     +22
  Methods        545     568     +23
  Messages         0       0
  Branches       135     134      -1
=======================================
+ Hits           600     622     +22
  Misses          80      80
  Partials         0       0
Looking now.
// path uses SparkContext.saveAsHadoopFile(), which produces filenames of the form
// part-XXXXX.avro. In spark-avro 2.0.0+, the partition filenames are of the form
// part-r-XXXXX-UUID.avro. In spark-csv, the partition filenames are of the form part-XXXXX.
For Spark 2.0, the filename format should be the same as Avro's, right?
Good point, actually: I think that @jaley wrote this comment back when this was using the third-party spark-csv library in Spark 1.x. I can go ahead and update this comment to remove the 1.x discussion.
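
Since this thread turns on the exact partition filenames, a minimal sketch may help; the regex and helper below are hypothetical illustrations of a predicate covering the naming schemes discussed here, not the project's actual code:

```scala
// Hypothetical sketch: a predicate matching the partition-file naming
// schemes discussed in this thread:
//   spark-avro pre-2.0.0: part-XXXXX.avro
//   spark-avro 2.0.0+:    part-r-XXXXX-UUID.avro
//   Spark 1.x spark-csv:  part-XXXXX
val partitionFilePattern = """^part-(?:r-)?\d+.*$""".r

def isPartitionFile(name: String): Boolean =
  partitionFilePattern.pattern.matcher(name).matches()

assert(isPartitionFile("part-00000.avro"))          // spark-avro pre-2.0.0
assert(isPartitionFile("part-r-00000-abc123.avro")) // spark-avro 2.0.0+
assert(isPartitionFile("part-00000"))               // Spark 1.x spark-csv
assert(!isPartitionFile("_SUCCESS"))                // job marker, not data
```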
Looks good. Left a question at https://github.com/databricks/spark-redshift/pull/288/files#r84764318.
LGTM |
Great! Merging to master.
This patch adds new options to allow CSV to be used as the intermediate data format when writing data to Redshift. This can offer large performance benefits because Redshift's Avro reader can be very slow. This patch is based on #165 by @emlyn and incorporates changes from me in order to add documentation, make the new option case-insensitive, improve some error messages, and add tests.

Using CSV for writes also allows us to write to tables whose column names are unsupported by Avro, so #84 is partially addressed by this patch.

As a hedge, I've marked this feature as "Experimental" and I'll remove that label after it's been tested in the wild a bit more.

Fixes #73.

Author: Josh Rosen <joshrosen@databricks.com>
Author: Josh Rosen <rosenville@gmail.com>
Author: Emlyn Corrin <Emlyn.Corrin@microsoft.com>
Author: Emlyn Corrin <emlyn@swiftkey.com>

Closes #288 from JoshRosen/use-csv-for-writes.
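
As a quick illustration, here is a hedged usage sketch in Scala, assuming the new write option is named `tempformat` (accepting `AVRO`, `CSV`, or `CSV GZIP`, with `AVRO` as the default); the bucket, table name, and JDBC credentials below are placeholders:

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder().appName("csv-write-example").getOrCreate()
val df = spark.read.parquet("s3n://my-bucket/input") // placeholder input

// Stage the write through S3 using CSV instead of Avro as the
// intermediate format by setting the new "tempformat" option.
df.write
  .format("com.databricks.spark.redshift")
  .option("url", "jdbc:redshift://redshifthost:5439/db?user=u&password=p") // placeholder
  .option("dbtable", "my_table")            // placeholder target table
  .option("tempdir", "s3n://my-bucket/tmp") // placeholder staging dir
  .option("tempformat", "CSV")              // new option added by this patch
  .mode(SaveMode.Append)
  .save()
```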
What's the ETA on this appearing in a release? It would be nice to move back to plain spark-redshift from our private fork.
@emlyn, I'm aiming to cut a |