Use metadata to set column comments and encoding #178
Conversation
Force-pushed from 8be8ae5 to ddafe4c
Current coverage is **75.93%** (diff against `master`):

|          | master | #178 | diff |
|----------|-------:|-----:|-----:|
| Files    | 13     | 13   |      |
| Lines    | 684    | 694  | +10  |
| Methods  | 589    | 609  | +20  |
| Messages | 0      | 0    |      |
| Branches | 95     | 85   | -10  |
| Hits     | 617    | 527  | -90  |
| Misses   | 67     | 167  | +100 |
| Partials | 0      | 0    |      |
Force-pushed from ddafe4c to 0066e7e
@JoshRosen:

Very cool; glad to see that this was pretty simple to implement. I agree that we should get #157 in first, so I'll try to take a look at that patch and bring it up to date (sorry for the lag in addressing feedback on that). Do we need to add any roundtrip integration tests for this? The unit tests to ensure that we issue the expected …
README.md (outdated):

> ### Configuring column encoding
>
> When creating a table, `spark-redshift` can be configured to use a specific compression encoding on individual columns. You can use the `encoding` column metadata field to specify a compression encoding for each column (see [Amazon docs](http://docs.aws.amazon.com/redshift/latest/dg/c_Compression_encodings.html) for available encodings).
This library is undergoing a slight naming / branding change in the README and documentation, so spark-redshift needs to be replaced with "this library" (see #179).
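(For context, using the metadata fields this PR introduces might look roughly like the sketch below. The `description` key for column comments is my assumption from the PR title rather than something quoted in this thread, and `jdbcUrl`, `tempDir`, and the table name are placeholders.)

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// Sketch: attach a comment ("description" is assumed to be the metadata
// key for column comments) and a compression encoding to a column.
val metadata = new MetadataBuilder()
  .putString("description", "Customer surname")
  .putString("encoding", "LZO")
  .build()

val schema = StructType(
  StructField("surname", StringType, metadata = metadata) :: Nil)

// jdbcUrl and tempDir are placeholders for a real cluster and S3 bucket.
sqlContext.createDataFrame(sc.parallelize(Seq(Row("Smith"))), schema)
  .write
  .format("com.databricks.spark.redshift")
  .option("url", jdbcUrl)
  .option("tempdir", tempDir)
  .option("dbtable", "customers")
  .save()
```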
It's probably a good idea to have a roundtrip test, just to be sure the syntax is correct and works with Redshift, so I've added a couple of integration tests. The query to get the column comments is a bit convoluted, but it was the best I could find.
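(The exact query isn't quoted in this thread; one plausible shape for reading column comments back out of the catalog, joining `pg_attribute` to `pg_description`, would be something like this hypothetical reconstruction, where `$tableName` is the table under test.)

```scala
// Hypothetical reconstruction of a "column comments" query: pg_description
// stores comments keyed by the table's oid and the column's attnum.
val commentsQuery =
  s"""SELECT a.attname AS column_name, d.description
     |FROM pg_class c
     |JOIN pg_attribute a ON a.attrelid = c.oid AND a.attnum > 0
     |JOIN pg_description d ON d.objoid = c.oid AND d.objsubid = a.attnum
     |WHERE c.relname = '$tableName'""".stripMargin
```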
```scala
val metadata = new MetadataBuilder().putString("encoding", "LZO").build()
val schema = StructType(
  StructField("x", StringType, metadata = metadata) :: Nil)
sqlContext.createDataFrame(sc.parallelize(Seq(Row("a" * 512))), schema).write
```
This test seems to fail with a `java.sql.SQLException: Error (code 1204) while loading data into Redshift: "String length exceeds DDL length"` exception. My hunch is that compression doesn't affect data type constraints, such as maximum field length, so I think you'll have to either reduce the string size or increase the column's string length.
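(If widening the column is the route taken, a sketch assuming the library's documented `maxlength` string-column metadata field would combine it with the encoding in the same metadata:)

```scala
// Sketch: widen the column so the 512-character test value fits.
// Redshift treats TEXT as VARCHAR(256), hence the code-1204 error;
// "maxlength" is the library's metadata field for the VARCHAR length.
val metadata = new MetadataBuilder()
  .putString("encoding", "LZO")
  .putLong("maxlength", 1024)
  .build()
```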
Force-pushed from 68a8ccf to 5e42923
```scala
.format("com.databricks.spark.redshift")
.option("url", jdbcUrl)
.option("dbtable",
  s"""(SELECT "column", encoding FROM pg_table_def WHERE tablename='$tableName')""")
```
Interestingly, this query is now failing for me with the following error:
```
[info] Caused by: com.amazon.support.exceptions.ErrorException: [Amazon](500310) Invalid operation: Specified types or functions (one per INFO message) not supported on Redshift tables.;
[info] ... 28 more (QueryTest.scala:60)
[info] org.scalatest.exceptions.TestFailedException:
[info] at org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:495)
[info] at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555)
[info] at org.scalatest.Assertions$class.fail(Assertions.scala:1328)
[info] at org.scalatest.FunSuite.fail(FunSuite.scala:1555)
[info] at com.databricks.spark.redshift.QueryTest$class.checkAnswer(QueryTest.scala:60)
[info] at com.databricks.spark.redshift.RedshiftIntegrationSuite.checkAnswer(RedshiftIntegrationSuite.scala:27)
[info] at com.databricks.spark.redshift.RedshiftIntegrationSuite$$anonfun$17.apply$mcV$sp(RedshiftIntegrationSuite.scala:417)
[info] at com.databricks.spark.redshift.RedshiftIntegrationSuite$$anonfun$17.apply(RedshiftIntegrationSuite.scala:389)
[info] at com.databricks.spark.redshift.RedshiftIntegrationSuite$$anonfun$17.apply(RedshiftIntegrationSuite.scala:389)
[info] at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
[info] at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
[info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
[info] at org.scalatest.Transformer.apply(Transformer.scala:22)
[info] at org.scalatest.Transformer.apply(Transformer.scala:20)
[info] at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
[info] at org.scalatest.Suite$class.withFixture(Suite.scala:1122)
[info] at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555)
[info] at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
[info] at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
[info] at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
[info] at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
[info] at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
[info] at com.databricks.spark.redshift.RedshiftIntegrationSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(RedshiftIntegrationSuite.scala:27)
[...]
```
I received a bug report from a user who hit this problem, but I was never able to get to the bottom of it since I didn't have access to any logs or queries to know which query was triggering that error. Now that I've stumbled across a reproduction, I'm going to see if I can add some additional logging to help figure out what's going on.
Err, here's a better copy of the stacktrace:
```
[info] java.sql.SQLException: [Amazon](500310) Invalid operation: Specified types or functions (one per INFO message) not supported on Redshift tables.;
[info] at com.amazon.redshift.client.messages.inbound.ErrorResponse.toErrorException(Unknown Source)
[info] at com.amazon.redshift.client.PGMessagingContext.handleErrorResponse(Unknown Source)
[info] at com.amazon.redshift.client.PGMessagingContext.getOperationMetadata(Unknown Source)
[info] at com.amazon.redshift.client.PGMessagingContext.getOperationMetadata(Unknown Source)
[info] at com.amazon.redshift.client.PGMessagingContext.handleMessage(Unknown Source)
[info] at com.amazon.jdbc.communications.InboundMessagesPipeline.getNextMessageOfClass(Unknown Source)
[info] at com.amazon.redshift.client.PGMessagingContext.doMoveToNextClass(Unknown Source)
[info] at com.amazon.redshift.client.PGMessagingContext.getReadyForQuery(Unknown Source)
[info] at com.amazon.redshift.client.PGMessagingContext.getOperationMetadata(Unknown Source)
[info] at com.amazon.redshift.client.PGMessagingContext.getOperationMetadata(Unknown Source)
[info] at com.amazon.redshift.client.PGMessagingContext.handleMessage(Unknown Source)
[info] at com.amazon.jdbc.communications.InboundMessagesPipeline.getNextMessageOfClass(Unknown Source)
[info] at com.amazon.redshift.client.PGMessagingContext.doMoveToNextClass(Unknown Source)
[info] at com.amazon.redshift.client.PGMessagingContext.getReadyForQuery(Unknown Source)
[info] at com.amazon.redshift.client.PGMessagingContext.getOperationMetadata(Unknown Source)
[info] at com.amazon.redshift.client.PGMessagingContext.getOperationMetadata(Unknown Source)
[info] at com.amazon.redshift.dataengine.PGExecutionResults.<init>(Unknown Source)
[info] at com.amazon.redshift.dataengine.PGAbstractQueryExecutor.createExecutionResults(Unknown Source)
[info] at com.amazon.redshift.dataengine.PGAbstractQueryExecutor.getResults(Unknown Source)
[info] at com.amazon.jdbc.common.SPreparedStatement.executeWithParams(Unknown Source)
[info] at com.amazon.jdbc.common.SPreparedStatement.execute(Unknown Source)
[info] at com.databricks.spark.redshift.JDBCWrapper$$anonfun$executeInterruptibly$1.apply(RedshiftJDBCWrapper.scala:122)
[info] at com.databricks.spark.redshift.JDBCWrapper$$anonfun$executeInterruptibly$1.apply(RedshiftJDBCWrapper.scala:122)
[info] at com.databricks.spark.redshift.JDBCWrapper$$anonfun$2.apply(RedshiftJDBCWrapper.scala:140)
[info] at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
[info] at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
[info] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
[info] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
```
This is a client-side error; I added `?loglevel=1` to the end of my JDBC URL in order to get the Redshift JDBC driver to perform more logging, and used `DriverManager.setLogWriter(new PrintWriter(System.out))` to direct those logs to stdout.

This produced the following output:

```
SQLWarning: reason(Function "format_type(oid,integer)" not supported.) SQLState(01000) vendor code(0)
SQLWarning: reason(Function "pg_table_is_visible(oid)" not supported.) SQLState(01000) vendor code(0)
```
It looks like what's happening here is that we're trying to perform an UNLOAD on a leader-only table, which is unsupported: https://stackoverflow.com/questions/28719808/how-to-unload-pg-table-def-table-to-s3
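(For anyone hitting similar driver-side errors, the logging setup described above amounts to something like this; the URL and credentials are placeholders:)

```scala
import java.io.PrintWriter
import java.sql.DriverManager

// Route JDBC DriverManager logging to stdout.
DriverManager.setLogWriter(new PrintWriter(System.out))

// loglevel=1 makes the Redshift JDBC driver emit the SQLWarnings
// shown above. Placeholder cluster URL and credentials.
val jdbcUrl =
  "jdbc:redshift://examplecluster.example.com:5439/dev" +
    "?user=<user>&password=<password>&loglevel=1"
```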
I think what you'll have to do here is to use the regular JDBC data source to query these Redshift system tables rather than using `spark-redshift`.
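(A minimal sketch of that suggestion, using Spark's built-in `jdbc` data source so the query runs over the JDBC connection directly instead of going through an UNLOAD:)

```scala
// Query the leader-node-only pg_table_def through Spark's plain JDBC
// source; derived tables in the dbtable option need an alias.
val encodings = sqlContext.read
  .format("jdbc")
  .option("url", jdbcUrl)
  .option("dbtable",
    s"""(SELECT "column", encoding FROM pg_table_def
       |WHERE tablename = '$tableName') AS t""".stripMargin)
  .load()
```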
I think that I've managed to fix the integration tests with the changes in 119d950.
Thanks, I've merged in those changes.
LGTM. Integration tests pass in my branch, so I'm going to merge this into `master`.
Thanks for merging. No worries about the delay; we're actually running this from a fork at the moment so that we can stage the data in CSV format to speed up the Redshift import.
Fixes #164 and #172.