-
Notifications
You must be signed in to change notification settings - Fork 4k
PARQUET-1300: [C++] Implement encrypted Parquet read and write support #4826
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
cpp/src/parquet/file_reader.cc
Outdated
| std::shared_ptr<ArrowInputStream> stream = | ||
| properties_.GetStream(source_, col_start, col_length); | ||
|
|
||
| #ifdef PARQUET_ENCRYPTION |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we remove these ifdef and use conditionals based on the presence of crypto variables? In this case we can check based off col->crypto_metadata()?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, looks like col->crypto_metadata() check does the job here. We'll also handle all other similar ifdefs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have removed some but not completely. I'm not sure if it matches your expectation. Can you please take a look again?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
missed this question, please see the comment below.
Codecov Report
@@ Coverage Diff @@
## master #4826 +/- ##
==========================================
+ Coverage 88.93% 88.98% +0.04%
==========================================
Files 988 1006 +18
Lines 134262 137127 +2865
Branches 1501 1501
==========================================
+ Hits 119412 122025 +2613
- Misses 14485 14737 +252
Partials 365 365
Continue to review full report at Codecov.
|
|
@wesm , @majetideepak fyi, all CI issues are now resolved. |
|
I won't be able to do a thorough review but would like to get this merged soon. Does the PR into parquet-testing need to be merged before this is? |
Not strictly necessary, but allows for better testing. We've expanded the decryption tests to read two sets of files - one produced in real time by the encryption tests; another stored in the parquet-testing repo. The former can miss problems caused by matching bugs in writers and readers. |
cpp/src/parquet/column_reader.cc
Outdated
| meta_decryptor_ = ctx->meta_decryptor; | ||
| data_decryptor_ = ctx->data_decryptor; | ||
| if (data_decryptor_ != NULLPTR || meta_decryptor_ != NULLPTR) { | ||
| #ifdef PARQUET_ENCRYPTION |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi Tham, I believe we need to try to get rid of all #ifdef PARQUET_ENCRYPTION statements (if possible). Lets go over the code early next week (our weekend starts today..).
|
OK. I'll wait for you to address the ifdefs |
|
@wesm @majetideepak we've figured out a way to remove all #ifdef PARQUET_ENCRYPTION. To be able to compile without openssl, we create a encryption_internal-nossl.cc instead of encryption_internal.cc with openssl. I'll update the fix, after internal team review, I will push to this pull. |
|
Note that "internal review" is not really consistent with the Apache Way |
|
Ok @wesm. So I will push the change directly to this pull. |
|
@wesm @majetideepak We are done with all todo items on our current list. This PR should be able to pass the CI now (at least the encryption unitests), due to availability of the new parquet-testing files (thanks Wes for merging them). Is there a way to trigger the CI here directly? Or we should add a new commit / rebase the PR? |
|
In parallel, @thamht4190 and @revit13 , if you're online now, can you check if this PR can be rebased (a good chance the Arrow master has moved..) |
|
@majetideepak CI tests have passed except for R in AppVeyor that needs OpenSSL. It seems we should link openssl libs at line: https://ci.appveyor.com/project/ApacheSoftwareFoundation/arrow/builds/26093493/job/9yen567yr3pw6fh8#L3403. However, I'm not sure how to do it. Can you please take a look? |
|
We got this issue when testing our product on Windows with MacAfee (it may happen with MacOS as well): |
|
adding @wesm to the loop on the last comment. My personal take on that - technically, you can use files without .encrypted extension in your settings; neither encryption spec nor the code depend on file names (and there are no plans today to introduce such dependency), so parquet tools should be able to read these files in the future. |
ed180da to
85fe336
Compare
|
@wesm @majetideepak We are done, let us know what's next. All comments have been addressed; the CI tests pass (except for one, R/libarrow, that lacks OpenSSL - Deepak, your help with that will be appreciated). |
|
@ggershinsky I will make a final pass today and tomorrow. I will also look into the R issue. |
|
To fix the R failure, we have to link the OpenSSL library here https://github.com/apache/arrow/blob/48ee38f833a51bb46a404da10d89c4b687e952c9/r/src/Makevars.win |
|
I decided to leave the current behavior of building encryption support if the system has OpenSSL libraries for now. This was discussed in #4494 |
|
@thamht4190 can you rebase with master again. The failing rust tests have been fixed on master. |
ci/windows-pkg-arrow-for-r.sh
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CC: @nealrichardson
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, that seems ok (though you need to remove the extra : in the middle of the URLs).
Since you had to make this change, I suspect that this means we'll need to update the Homebrew formula too, but unfortunately we don't have that under CI.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed the URLs. What are your thoughts on updating the Homebrew formula?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO we should put the C++ homebrew formula in our ci/ directory and have a Travis job, perhaps only run nightly/on-demand if we're concerned about build time, that builds from it and then runs some test suite that requires it (e.g. R, Ruby). https://github.com/Homebrew/homebrew-core/blob/master/CONTRIBUTING.md suggests we can brew install --build-from-source so IIUC we'd use something like that.
Related, I previously tried to add brew install --HEAD to CI but that didn't go anywhere: #4576
I don't think this needs to block this patch, but it would be great to do something with this sooner rather than later. Otherwise we'll keep making releases unnecessarily expensive and that will keep us from doing them as frequently as we should.
|
The CI is green now. Thanks @majetideepak for handling the OpenSSL. |
|
@wesm I am done with my refactoring for now. We can merge this PR if there are no other comments. |
|
All tests are passing. @thamht4190, @revit1976 @ggershinsky do you have any comments on this PR? Is this good to be merged? |
|
no additional comments, looks good to me. |
|
LGTM. Thanks Deepak for your change. |
|
LGTM also. Thanks. |
|
@wesm Do you plan to make a pass before we merge this? |
|
@majetideepak sorry for the delay. I am going to take a brief look through this to check for anything that seems obviously off |
|
I apologize, I've been out of pocket the last couple weeks. Taking a look through and if I don't any anything troubling I'll merge |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1. I'm going to push some stylistic fixes. It's a bummer that this branch is so messed up. I'm going to create a squashed version to merge to master that doesn't have all the unrelated commits in it. I'll have to manually list out the co-authors...
cpp/src/parquet/column_reader.cc
Outdated
| total_num_rows_(total_num_rows) { | ||
| total_num_rows_(total_num_rows), | ||
| decryption_buffer_(AllocateBuffer(pool, 0)) { | ||
| if (crypto_ctx != NULLPTR) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: it's not necessary to use NULLPTR in .cc files, only in headers.
cpp/src/parquet/column_reader.cc
Outdated
|
|
||
| private: | ||
| void UpdateDecryption(const std::shared_ptr<Decryptor>& decryptor, int8_t module_type, | ||
| const std::string& pageAAD); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: page_aad would probably be a better variable name
cpp/src/parquet/column_reader.cc
Outdated
|
|
||
| void SerializedPageReader::InitDecryption() { | ||
| // Prepare the AAD for quick update later. | ||
| if (crypto_ctx_.data_decryptor != NULLPTR) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nullptr
cpp/src/parquet/column_reader.cc
Outdated
| data_pageAAD_ = | ||
| encryption::CreateModuleAad(crypto_ctx_.data_decryptor->file_aad(), | ||
| encryption::kDataPage, crypto_ctx_.row_group_ordinal, | ||
| crypto_ctx_.column_ordinal, static_cast<int16_t>(-1)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this static_cast needed?
cpp/src/parquet/column_reader.cc
Outdated
| void SerializedPageReader::UpdateDecryption(const std::shared_ptr<Decryptor>& decryptor, | ||
| int8_t module_type, | ||
| const std::string& pageAAD) { | ||
| DCHECK(decryptor != NULLPTR); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nullptr
cpp/src/parquet/column_writer.cc
Outdated
| int16_t column_chunk_ordinal, | ||
| MemoryPool* pool = arrow::default_memory_pool(), | ||
| std::shared_ptr<Encryptor> meta_encryptor = NULLPTR, | ||
| std::shared_ptr<Encryptor> data_encryptor = NULLPTR) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NULLPTR -> nullptr
cpp/src/parquet/column_writer.cc
Outdated
| const uint8_t* output_data_buffer = compressed_data->data(); | ||
| int32_t output_data_len = static_cast<int32_t>(compressed_data->size()); | ||
|
|
||
| std::shared_ptr<Buffer> encrypted_data_buffer = nullptr; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Setting = nullptr is redundant
| } | ||
|
|
||
| INSTANTIATE_TEST_CASE_P( | ||
| decryptionTests, TestDecryptionConfiguration, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A couple non-conforming names in this file
| } | ||
| } | ||
| } | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suspect that some code in these methods could be simplified or reused
| -DARROW_PYTHON=ON | ||
| -DARROW_GANDIVA=ON | ||
| -DARROW_PARQUET=ON | ||
| -DPARQUET_REQUIRE_ENCRYPTION=ON |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This may make it harder for people to verify releases. We can see for the next major release
|
Here's the merge-ready commit, I'm going to wait a bit for CI to make progress Do NOT merge this PR as is, please |
|
Merging this. Thanks all |
|
@majetideepak @wesm While Parquet encryption feature is a part of the Arrow 0.16.0 release, it doesn't appear in https://github.com/apache/arrow/blob/master/CHANGELOG.md. Also, the https://issues.apache.org/jira/browse/PARQUET-1300 is still open. Should it be closed, with setting the "Affects Version/s:" field? Can the 0.16.0 changelog be updated? |
|
@ggershinsky I think if needed we could do a second minor release ARROW 0.16.1 if there are no objections. Do you want to e-mail the Arrow/Parquet mailing lists to discuss this? |
|
@majetideepak We've submitted a paper where we've referenced Arrow 0.16.0 for parquet encryption. So having it in 0.16.1 doesn't solve our immediate (though very minor) problem. I wouldn't suggest running a new release just for adding this to the log. I believe we can think of 0.16.0 as a "silent release" of this feature. I've created an Arrow jira (8018) for adding this feature explicitly in 1.0.0, and closed the PARQUET-1300 as fixed - but if I've done something wrong there, please change as needed. |
|
@wesm what should be done for packaging issues for encryption? I would handle this, but not 100% sure about the details. |
|
I just opened ARROW-8038 and ARROW-8040 to cover some of the basic packaging stuff. Let's move the discussion there (and to child issues) |
I open a fresh pull request from the pull #2555 since pull #2555 is too big now. Please let your comments here. Thanks!