
Conversation

@JoshRosen (Contributor) commented Oct 20, 2016

This patch adds new options that allow CSV to be used as the intermediate data format when writing data to Redshift. This can offer large performance benefits because Redshift's Avro reader can be very slow. The patch is based on #165 by @emlyn and incorporates changes from me to add documentation, make the new option case-insensitive, improve some error messages, and add tests.

Using CSV for writes also allows us to write to tables whose column names are unsupported by Avro, so #84 is partially addressed by this patch.
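
For illustration, here's a minimal sketch of what a CSV-backed write might look like. The option name `tempformat`, its `CSV` value, and the connection details below are assumptions for the example (the text above doesn't spell them out), so check the README for the exact spelling:

```scala
import org.apache.spark.sql.SaveMode

// Sketch: write an existing DataFrame `df` to Redshift, staging the data in S3
// as CSV instead of Avro. The "tempformat" option name and "CSV" value are
// assumptions based on this PR's description, not a confirmed API.
df.write
  .format("com.databricks.spark.redshift")
  .option("url", "jdbc:redshift://examplecluster:5439/dev?user=user&password=pass")
  .option("dbtable", "my_table")
  .option("tempdir", "s3n://my-bucket/tmp/")
  .option("tempformat", "CSV") // hypothetical; the default would remain AVRO
  .mode(SaveMode.ErrorIfExists)
  .save()
```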

As a hedge, I've marked this feature as "Experimental" and I'll remove that label after it's been tested in the wild a bit more.

Fixes #73.

JoshRosen added this to the 2.1.0 milestone Oct 20, 2016
JoshRosen changed the title from "Use csv for writes" to "Add option to use CSV as an intermediate data format during writes" Oct 20, 2016
codecov-io commented Oct 20, 2016

Current coverage is 88.60% (diff: 100%)

Merging #288 into master will increase coverage by 0.36%

@@             master       #288   diff @@
==========================================
  Files            12         12          
  Lines           680        702    +22   
  Methods         545        568    +23   
  Messages          0          0          
  Branches        135        134     -1   
==========================================
+ Hits            600        622    +22   
  Misses           80         80          
  Partials          0          0          


yhuai commented Oct 24, 2016

Looking now.

  // path uses SparkContext.saveAsHadoopFile(), which produces filenames of the form
  // part-XXXXX.avro. In spark-avro 2.0.0+, the partition filenames are of the form
- // part-r-XXXXX-UUID.avro.
+ // part-r-XXXXX-UUID.avro. In spark-csv, the partition filenames are of the form part-XXXXX.
yhuai commented:

For Spark 2.0, the file name format should be the same as Avro's, right?

JoshRosen (Contributor Author) replied:

Good point. I think @jaley wrote this comment back when this code used the third-party spark-csv library with Spark 1.x. I'll go ahead and update the comment to remove the 1.x discussion.

yhuai commented Oct 24, 2016

LGTM

JoshRosen (Contributor Author) commented:
Great! Merging to master.

JoshRosen added a commit that referenced this pull request Oct 25, 2016

Author: Josh Rosen <joshrosen@databricks.com>
Author: Josh Rosen <rosenville@gmail.com>
Author: Emlyn Corrin <Emlyn.Corrin@microsoft.com>
Author: Emlyn Corrin <emlyn@swiftkey.com>

Closes #288 from JoshRosen/use-csv-for-writes.
JoshRosen closed this Oct 25, 2016
JoshRosen deleted the use-csv-for-writes branch Oct 25, 2016
emlyn (Contributor) commented Oct 27, 2016

What's the ETA on this appearing in a release? It would be nice to move back to plain spark-redshift from our private fork.

JoshRosen (Contributor Author) commented:
@emlyn, I'm aiming to cut a 3.0.0-preview1 preview release today or tomorrow (this release will be published to Maven Central just like a non-preview release).
