
Conversation

@emlyn (Contributor) commented Feb 12, 2016

Fixes #164 and #172.

@emlyn force-pushed the table-comments-from-metadata branch 2 times, most recently from 8be8ae5 to ddafe4c on February 12, 2016 at 11:48
@codecov-io commented Feb 12, 2016

Current coverage is 75.93%

Merging #178 into master will decrease coverage by 14.26%

@@             master       #178   diff @@
==========================================
  Files            13         13          
  Lines           684        694    +10   
  Methods         589        609    +20   
  Messages          0          0          
  Branches         95         85    -10   
==========================================
- Hits            617        527    -90   
- Misses           67        167   +100   
  Partials          0          0          

Powered by Codecov. Last updated by 9ac340e...88ba25a

@emlyn force-pushed the table-comments-from-metadata branch from ddafe4c to 0066e7e on February 12, 2016 at 15:35
@emlyn (Contributor, Author) commented Feb 12, 2016

@JoshRosen
I've just run it and the compression encoding setting seems to work fine. The comments don't work at present due to the use of a staging table: I think because the comments are stored not on the table itself but in a separate system table, they are not carried over when the staging table is renamed. I've tried merging in #157, and with that it works, so it's probably worth waiting for that before merging this one.

@JoshRosen (Contributor)

Very cool; glad to see that this was pretty simple to implement.

I agree that we should get #157 in first, so I'll try to take a look at that patch and bring it up to date (sorry for the lag in addressing feedback on that).

Do we need to add any roundtrip integration tests for this? The unit tests ensuring that we issue the expected CREATE TABLE SQL are probably sufficient, since a roundtrip test would need to check that the comments were actually stored properly and would have to run some sort of table metadata query to check the column compression options.

README.md Outdated

### Configuring column encoding

When creating a table, `spark-redshift` can be configured to use a specific compression encoding on individual columns. You can use the `encoding` column metadata field to specify a compression encoding for each column (see [Amazon docs](http://docs.aws.amazon.com/redshift/latest/dg/c_Compression_encodings.html) for available encodings).
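For example, a minimal sketch of setting this from the Scala API (the column name and encoding here are placeholders, not taken from this PR):

import org.apache.spark.sql.types.{MetadataBuilder, StringType, StructField, StructType}

// Tag the column with an `encoding` metadata entry; when the table is created,
// this value is emitted as the column's compression encoding.
val metadata = new MetadataBuilder().putString("encoding", "LZO").build()
val schema = StructType(
  StructField("city", StringType, metadata = metadata) :: Nil)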
Contributor

This library is undergoing a slight naming / branding change in the README and documentation, so spark-redshift needs to be replaced with "this library" (see #179).

@emlyn (Contributor, Author) commented Feb 15, 2016

It's probably a good idea to have a roundtrip test, just to be sure the syntax is correct and works with Redshift, so I've added a couple of integration tests. The query to get the column comments is a bit convoluted, but it was the best I could find.
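For reference, the comment lookup boils down to a catalog join along these lines (a sketch of the general shape, not necessarily the exact query used in the test):

// Column comments live in pg_description, keyed by the table's oid and the
// column's attribute number, so retrieving them means joining three catalogs.
val commentsQuery =
  s"""SELECT a.attname AS column_name, d.description AS comment
     |FROM pg_class c
     |JOIN pg_attribute a ON a.attrelid = c.oid AND a.attnum > 0
     |JOIN pg_description d ON d.objoid = c.oid AND d.objsubid = a.attnum
     |WHERE c.relname = '$tableName'""".stripMargin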

val metadata = new MetadataBuilder().putString("encoding", "LZO").build()
val schema = StructType(
  StructField("x", StringType, metadata = metadata) :: Nil)
sqlContext.createDataFrame(sc.parallelize(Seq(Row("a" * 512))), schema).write
Contributor

This test seems to fail with a `java.sql.SQLException: Error (code 1204) while loading data into Redshift: "String length exceeds DDL length"` exception. My hunch is that compression doesn't affect data type constraints, such as maximum field length, and thus I think that you'll have to either reduce the string size or increase the column's string length.
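A sketch of the two fixes suggested above (values are placeholders; this assumes the library's `maxlength` column metadata field for widening string columns):

// Either widen the column so a 512-character string fits...
val metadata = new MetadataBuilder()
  .putString("encoding", "LZO")
  .putLong("maxlength", 1024) // assumed `maxlength` metadata key
  .build()

// ...or keep the test value within the default 256-character TEXT limit.
val row = Row("a" * 128)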

@emlyn force-pushed the table-comments-from-metadata branch from 68a8ccf to 5e42923 on July 6, 2016 at 20:38
.format("com.databricks.spark.redshift")
.option("url", jdbcUrl)
.option("dbtable",
  s"""(SELECT "column", encoding FROM pg_table_def WHERE tablename='$tableName')""")
Contributor

Interestingly, this query is now failing for me with the following error:

[info]   Caused by: com.amazon.support.exceptions.ErrorException: [Amazon](500310) Invalid operation: Specified types or functions (one per INFO message) not supported on Redshift tables.;
[info]      ... 28 more (QueryTest.scala:60)
[info]   org.scalatest.exceptions.TestFailedException:
[info]   at org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:495)
[info]   at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555)
[info]   at org.scalatest.Assertions$class.fail(Assertions.scala:1328)
[info]   at org.scalatest.FunSuite.fail(FunSuite.scala:1555)
[info]   at com.databricks.spark.redshift.QueryTest$class.checkAnswer(QueryTest.scala:60)
[info]   at com.databricks.spark.redshift.RedshiftIntegrationSuite.checkAnswer(RedshiftIntegrationSuite.scala:27)
[info]   at com.databricks.spark.redshift.RedshiftIntegrationSuite$$anonfun$17.apply$mcV$sp(RedshiftIntegrationSuite.scala:417)
[info]   at com.databricks.spark.redshift.RedshiftIntegrationSuite$$anonfun$17.apply(RedshiftIntegrationSuite.scala:389)
[info]   at com.databricks.spark.redshift.RedshiftIntegrationSuite$$anonfun$17.apply(RedshiftIntegrationSuite.scala:389)
[info]   at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
[info]   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
[info]   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
[info]   at org.scalatest.Transformer.apply(Transformer.scala:22)
[info]   at org.scalatest.Transformer.apply(Transformer.scala:20)
[info]   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
[info]   at org.scalatest.Suite$class.withFixture(Suite.scala:1122)
[info]   at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555)
[info]   at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
[info]   at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
[info]   at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
[info]   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
[info]   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
[info]   at com.databricks.spark.redshift.RedshiftIntegrationSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(RedshiftIntegrationSuite.scala:27)
[...]

I received a bug report from a user who hit this problem, but I was never able to get to the bottom of it (since I didn't have access to any logs or queries to even know which query was triggering the error). Now that I've stumbled across a reproduction, I'm going to see if I can add some additional logging to help figure out what's going on.

Contributor

Err, here's a better copy of the stacktrace:

[info]   java.sql.SQLException: [Amazon](500310) Invalid operation: Specified types or functions (one per INFO message) not supported on Redshift tables.;
[info]      at com.amazon.redshift.client.messages.inbound.ErrorResponse.toErrorException(Unknown Source)
[info]      at com.amazon.redshift.client.PGMessagingContext.handleErrorResponse(Unknown Source)
[info]      at com.amazon.redshift.client.PGMessagingContext.getOperationMetadata(Unknown Source)
[info]      at com.amazon.redshift.client.PGMessagingContext.getOperationMetadata(Unknown Source)
[info]      at com.amazon.redshift.client.PGMessagingContext.handleMessage(Unknown Source)
[info]      at com.amazon.jdbc.communications.InboundMessagesPipeline.getNextMessageOfClass(Unknown Source)
[info]      at com.amazon.redshift.client.PGMessagingContext.doMoveToNextClass(Unknown Source)
[info]      at com.amazon.redshift.client.PGMessagingContext.getReadyForQuery(Unknown Source)
[info]      at com.amazon.redshift.client.PGMessagingContext.getOperationMetadata(Unknown Source)
[info]      at com.amazon.redshift.client.PGMessagingContext.getOperationMetadata(Unknown Source)
[info]      at com.amazon.redshift.client.PGMessagingContext.handleMessage(Unknown Source)
[info]      at com.amazon.jdbc.communications.InboundMessagesPipeline.getNextMessageOfClass(Unknown Source)
[info]      at com.amazon.redshift.client.PGMessagingContext.doMoveToNextClass(Unknown Source)
[info]      at com.amazon.redshift.client.PGMessagingContext.getReadyForQuery(Unknown Source)
[info]      at com.amazon.redshift.client.PGMessagingContext.getOperationMetadata(Unknown Source)
[info]      at com.amazon.redshift.client.PGMessagingContext.getOperationMetadata(Unknown Source)
[info]      at com.amazon.redshift.dataengine.PGExecutionResults.<init>(Unknown Source)
[info]      at com.amazon.redshift.dataengine.PGAbstractQueryExecutor.createExecutionResults(Unknown Source)
[info]      at com.amazon.redshift.dataengine.PGAbstractQueryExecutor.getResults(Unknown Source)
[info]      at com.amazon.jdbc.common.SPreparedStatement.executeWithParams(Unknown Source)
[info]      at com.amazon.jdbc.common.SPreparedStatement.execute(Unknown Source)
[info]      at com.databricks.spark.redshift.JDBCWrapper$$anonfun$executeInterruptibly$1.apply(RedshiftJDBCWrapper.scala:122)
[info]      at com.databricks.spark.redshift.JDBCWrapper$$anonfun$executeInterruptibly$1.apply(RedshiftJDBCWrapper.scala:122)
[info]      at com.databricks.spark.redshift.JDBCWrapper$$anonfun$2.apply(RedshiftJDBCWrapper.scala:140)
[info]      at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
[info]      at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
[info]      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
[info]      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)

Contributor

This is a client-side error; I added `?loglevel=1` to the end of my JDBC URL in order to get the Redshift JDBC driver to perform more logging and used `DriverManager.setLogWriter(new PrintWriter(System.out))` to direct those logs to stdout.
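For anyone reproducing this, the setup described above amounts to something like the following (assuming the JDBC URL has no existing query parameters):

import java.io.PrintWriter
import java.sql.DriverManager

// Send the Redshift JDBC driver's logs to stdout and increase its verbosity.
DriverManager.setLogWriter(new PrintWriter(System.out))
val verboseJdbcUrl = jdbcUrl + "?loglevel=1"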

This produced the following output:

SQLWarning: reason(Function "format_type(oid,integer)" not supported.) SQLState(01000) vendor code(0)
SQLWarning: reason(Function "pg_table_is_visible(oid)" not supported.) SQLState(01000) vendor code(0)

It looks like what's happening here is that we're trying to perform an UNLOAD on a leader-only table, which is unsupported: https://stackoverflow.com/questions/28719808/how-to-unload-pg-table-def-table-to-s3

Contributor

I think what you'll have to do here is to use the regular JDBC data source to query these Redshift system tables rather than using spark-redshift.
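A rough sketch of that approach, reading the system table through Spark's built-in JDBC data source so that no UNLOAD is issued (the subquery alias and options are illustrative):

// pg_table_def is a leader-node-only table, so read it over plain JDBC
// instead of going through spark-redshift's UNLOAD-to-S3 path.
val encodings = sqlContext.read
  .format("jdbc")
  .option("url", jdbcUrl)
  .option("dbtable",
    s"""(SELECT "column", encoding FROM pg_table_def WHERE tablename = '$tableName') AS t""")
  .load()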

@JoshRosen (Contributor)

I think that I've managed to fix the integration tests with the changes in 119d950.

@emlyn (Contributor, Author) commented Jul 6, 2016

Thanks, I've merged in those changes.

@JoshRosen (Contributor)

LGTM. Integration tests pass in my branch, so I'm going to merge this into master. I'm planning to merge #157 soon so these patches should end up being shipped together as part of the same release. Thanks and sorry for the review delay!

@JoshRosen JoshRosen closed this in 5bc5fab Jul 6, 2016
@emlyn (Contributor, Author) commented Jul 6, 2016

Thanks for merging. No worries about the delay; we're actually running this from a fork at the moment so that we can stage the data in CSV format to speed up the Redshift import.

@JoshRosen JoshRosen added this to the 0.6.1 milestone Jul 6, 2016