Skip to content

Conversation

@JoshRosen
Copy link
Contributor

This patch modifies spark-redshift's read path in order to guard against a potential source of missing data / inconsistent reads.

According to the Amazon S3 Data Consistency Model documentation, S3 bucket listing is eventually-consistent. In spark-redshift 1.6.0 and earlier, the S3 read path performs bucket-listing and thus may be impacted by this eventual-consistency, meaning that in rare circumstances reads may see only a subset of the unloaded data.

This patch fixes this issue by using UNLOAD command's the MANIFEST parameter, which causes it to write a JSON manifest which lists the paths of all unload output files. I modified the UNLOAD query to produce this manifest and added some code to the reader to consume it. In order to simplify parsing of the JSON manifest, I used the minimal-json JSON library, chosen because it's small and has no external dependencies (and thus is unlikely to cause the sorts of dependency conflicts that something like Jackson would lead to).

@JoshRosen JoshRosen added the bug label Jan 12, 2016
@JoshRosen JoshRosen added this to the 0.6.1 milestone Jan 12, 2016
@codecov-io
Copy link

Current coverage is 89.27%

Merging #151 into master will increase coverage by +0.21% as of 2b64a94

@@            master    #151   diff @@
======================================
  Files           13      13       
  Stmts          649     662    +13
  Branches       144     146     +2
  Methods          0       0       
======================================
+ Hit            578     591    +13
  Partial          0       0       
  Missed          71      71       

Review entire Coverage Diff as of 2b64a94

Powered by Codecov. Updated on successful CI builds.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This messy mocking is only to get the unit tests to pass. It's not the most high-fidelity mocking in the world, but that's okay because this code path is well-exercised by the integration tests.

@JoshRosen
Copy link
Contributor Author

Ping @liancheng, could you help me to review this?

@liancheng
Copy link
Contributor

LGTM

@JoshRosen
Copy link
Contributor Author

Thanks for reviewing; I've updated the README to reflect the new semantics and have fixed the merge conflicts, so I'll merge this as soon as tests pass.

@JoshRosen JoshRosen closed this in d97ffb0 Jan 30, 2016
@JoshRosen JoshRosen deleted the unload-manifest branch January 30, 2016 00:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants