Use MANIFEST in unload to guard against eventually-consistent S3 bucket listing #151

JoshRosen · 2016-01-12T03:49:40Z

This patch modifies spark-redshift's read path in order to guard against a potential source of missing data / inconsistent reads.

According to the Amazon S3 Data Consistency Model documentation, S3 bucket listing is eventually-consistent. In spark-redshift 1.6.0 and earlier, the S3 read path performs bucket-listing and thus may be impacted by this eventual-consistency, meaning that in rare circumstances reads may see only a subset of the unloaded data.

This patch fixes the issue by using the UNLOAD command's MANIFEST parameter, which causes UNLOAD to write a JSON manifest listing the paths of all unload output files. I modified the UNLOAD query to produce this manifest and added code to the reader to consume it. To simplify parsing of the JSON manifest, I used the minimal-json library, chosen because it's small and has no external dependencies (and is thus unlikely to cause the sorts of dependency conflicts that something like Jackson would lead to).
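To illustrate the idea, here is a minimal sketch of reading file paths from an UNLOAD manifest rather than from a bucket listing. It is not the code in this patch: the patch is written in Scala and uses minimal-json, whereas this sketch uses Python's standard `json` module so it stays self-contained. The function name `manifest_paths`, the bucket name, and the file names are all hypothetical; the `entries` / `url` layout follows the manifest format that Redshift documents for UNLOAD.

```python
import json


def manifest_paths(manifest_text):
    """Return the S3 URLs listed in an UNLOAD manifest, in manifest order."""
    manifest = json.loads(manifest_text)
    return [entry["url"] for entry in manifest["entries"]]


# Hypothetical manifest contents; the bucket and file names are made up.
example_manifest = """
{
  "entries": [
    {"url": "s3://example-bucket/unload/0000_part_00"},
    {"url": "s3://example-bucket/unload/0001_part_00"}
  ]
}
"""

# Hypothetical stale bucket listing that has not yet caught up with the
# second output file (the failure mode described above).
stale_listing = ["s3://example-bucket/unload/0000_part_00"]

paths = manifest_paths(example_manifest)

# Files the manifest knows about but the stale listing would have skipped.
missing_from_listing = [p for p in paths if p not in stale_listing]
```

Because the reader takes its file list from the manifest, a listing that lags behind the unload cannot silently drop output files.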

codecov-io · 2016-01-12T03:49:41Z

Current coverage is `89.27%`

Merging #151 into master will increase coverage by +0.21% as of 2b64a94

@@            master    #151   diff @@
======================================
  Files           13      13       
  Stmts          649     662    +13
  Branches       144     146     +2
  Methods          0       0       
======================================
+ Hit            578     591    +13
  Partial          0       0       
  Missed          71      71

Review entire Coverage Diff as of 2b64a94

Powered by Codecov. Updated on successful CI builds.

JoshRosen · 2016-01-12T03:51:39Z

src/test/scala/com/databricks/spark/redshift/RedshiftSourceSuite.scala

This messy mocking is only to get the unit tests to pass. It's not the most high-fidelity mocking in the world, but that's okay because this code path is well-exercised by the integration tests.

JoshRosen · 2016-01-25T02:45:54Z

Ping @liancheng, could you help review this?

liancheng · 2016-01-28T07:32:10Z

LGTM

JoshRosen · 2016-01-28T22:37:59Z

Thanks for reviewing; I've updated the README to reflect the new semantics and have fixed the merge conflicts, so I'll merge this as soon as tests pass.

JoshRosen added 2 commits January 11, 2016 18:41

Use manifest to guard against eventually-consistent reads.

ed6d8e6

Add mocking to fix unit tests.

a2c1bb2

JoshRosen added the bug label Jan 12, 2016

JoshRosen assigned rxin Jan 12, 2016

JoshRosen added this to the 0.6.1 milestone Jan 12, 2016

JoshRosen mentioned this pull request Jan 12, 2016

Add documentation on transactional guarantees #150

Closed

JoshRosen reviewed Jan 12, 2016

JoshRosen unassigned rxin Jan 14, 2016

JoshRosen added 2 commits January 28, 2016 14:33

Merge remote-tracking branch 'origin/master' into unload-manifest

4a179ef

Update README.

2b64a94

JoshRosen closed this in d97ffb0 Jan 30, 2016

JoshRosen deleted the unload-manifest branch January 30, 2016 00:35
