Skip to content

Conversation

@liancheng
Copy link
Contributor

@liancheng liancheng commented Jul 17, 2016

This PR branch is based on #221 and contains its changes. For easier review, please refer to liancheng#1.

This PR cleans up all the deprecated APIs in 1.x:

  • Parameters.overwrite

    Users should use SaveMode instead.

  • SchemaParser

    Removed since the only API method uses this class is removed.

  • RedshiftDataFrame.saveAsRedshiftTable(parameters: Map[String, String])

    Users should use DataFrameWriter methods instead.

  • RedshiftContext.redshiftTable(parameters: Map[String, String])

    Users should use DataFrameReader methods instead.

  • RedshiftContext.redshiftFile(path: String, schema: String)

    Removed since we'd like to stop using string schema parsing. The following API method is added to replace it:

    def redshiftFile(path: String, schema: StructType): DataFrame

Fixes #66.

@liancheng
Copy link
Contributor Author

@JoshRosen Since the following public method in RedshiftContext isn't marked as deprecated

def redshiftFile(path: String, columns: Seq[String]): DataFrame

I added a new version of redshiftFile()

def redshiftFile(path: String, schema: StructType): DataFrame

to replace the deprecated one:

def redshiftFile(path: String, schema: String): DataFrame

These methods are used to read data files unloaded from a Redshift table using RedshiftInputFormat, and can't be replaced by existing DataFrame reader/writer API. Another choice can be to make RedshiftRelation support reading unloaded files, so that we can have a more unified API. What do you think?

@codecov-io
Copy link

codecov-io commented Jul 17, 2016

Current coverage is 89.07%

Merging #239 into master will decrease coverage by <.01%

@@             master       #239   diff @@
==========================================
  Files            13         12     -1   
  Lines           669        641    -28   
  Methods         589        562    -27   
  Messages          0          0          
  Branches         80         79     -1   
==========================================
- Hits            596        571    -25   
+ Misses           73         70     -3   
  Partials          0          0          

Powered by Codecov. Last updated by c2dd7bb...77c2838

@JoshRosen JoshRosen added this to the 2.0.0 milestone Jul 17, 2016
@JoshRosen
Copy link
Contributor

@liancheng, in the long run I think it would make sense to open up more modularized APIs to the spark-redshift internals in order to address advanced use-cases (such as the one in #197), so at some point I think we'll want to have a separate component for reading unloaded files. I don't think that we necessarily want to expose that functionality publically through RedshiftRelation since it's not the easiest class to construct AFAIK.

Regarding the individual deprecations:

  • +1 on removing Parameters.overwrite; this is an obvious good chose.
  • +1 on removing saveAsRedshiftTable and redshiftTable, since these are subsumed by the data sources API.
  • +1 on replacing redshiftFile(path: String, schema: String) by redshiftFile(path: String, schema: StructType), but I think that this introduces a small gap in functionality w.r.t the old code since I don't think spark-redshift exposes a great API to retrieve the schema of a Redshift table independent of unloading / reading that table, so it might be a little tricky for the user to actually obtain the schema object. We could try to address this by opening up an API for calling the schema-retrieval code over JDBC, but I don't think that is quite a sufficient solution since the old schemaString approach could be used in cases where the Spark cluster could not connect to Redshift / didn't require a JDBC connection. For this reason it might be nice to expose the SchemaParser object so that a user can call that as a shortcut helper method for constructing the Spark Schema object. However, the current SchemaParser is slightly painful to use from Java. Therefore, I think it might make sense to expose a public SchemaParser class in Java with static method and keep the parser implementation in a private Scala helper class that it calls.

I don't want to overengineer things now, though, so I'd be fine with rebasing and merging these changes as-is and dealing with feature-gaps / requests in followups or in a followup to the 2.x release.

@JoshRosen
Copy link
Contributor

Yeah, so TL;DR: rebase this and I'll merge it.

@liancheng liancheng force-pushed the delete-deprecated-api branch from 54000ec to 77c2838 Compare July 18, 2016 04:18
@liancheng
Copy link
Contributor Author

@JoshRosen Thanks! Having a Java version of SchemaParser sounds good to me. Finished rebasing.

@JoshRosen
Copy link
Contributor

LGTM so I'm going to merge this to master. Thanks @liancheng!

@JoshRosen JoshRosen self-assigned this Jul 18, 2016
@JoshRosen JoshRosen closed this in d50ab75 Jul 18, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants