Skip to content

Conversation

@thamht4190
Copy link
Contributor

I open a fresh pull request from the pull #2555 since pull #2555 is too big now. Please let your comments here. Thanks!

std::shared_ptr<ArrowInputStream> stream =
properties_.GetStream(source_, col_start, col_length);

#ifdef PARQUET_ENCRYPTION
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we remove these ifdef and use conditionals based on the presence of crypto variables? In this case we can check based off col->crypto_metadata()?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, looks like col->crypto_metadata() check does the job here. We'll also handle all other similar ifdefs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have removed some but not completely. I'm not sure if it matches your expectation. Can you please take a look again?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

missed this question, please see the comment below.

@codecov-io
Copy link

codecov-io commented Jul 10, 2019

Codecov Report

Merging #4826 into master will increase coverage by 0.04%.
The diff coverage is 92.74%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #4826      +/-   ##
==========================================
+ Coverage   88.93%   88.98%   +0.04%     
==========================================
  Files         988     1006      +18     
  Lines      134262   137127    +2865     
  Branches     1501     1501              
==========================================
+ Hits       119412   122025    +2613     
- Misses      14485    14737     +252     
  Partials      365      365
Impacted Files Coverage Δ
cpp/src/parquet/column_writer.h 95.65% <ø> (ø) ⬆️
cpp/src/parquet/metadata.h 100% <ø> (ø) ⬆️
cpp/src/parquet/file_writer.h 81.81% <ø> (ø) ⬆️
cpp/src/parquet/column_reader.cc 89.4% <100%> (+0.52%) ⬆️
cpp/src/parquet/internal_file_encryptor.h 100% <100%> (ø)
cpp/src/parquet/types.h 100% <100%> (ø) ⬆️
cpp/src/parquet/column_reader.h 98% <100%> (+0.22%) ⬆️
cpp/src/parquet/encryption_properties_test.cc 100% <100%> (ø)
cpp/src/parquet/properties.h 97.47% <100%> (+0.2%) ⬆️
...rc/parquet/encryption_write_configurations_test.cc 100% <100%> (ø)
... and 146 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2e53c00...4e9b9af. Read the comment docs.

@revit1976
Copy link

@wesm , @majetideepak fyi, all CI issues are now resolved.

@wesm
Copy link
Member

wesm commented Jul 11, 2019

I won't be able to do a thorough review but would like to get this merged soon. Does the PR into parquet-testing need to be merged before this is?

@ggershinsky
Copy link
Contributor

Does the PR into parquet-testing need to be merged before this is?

Not strictly necessary, but allows for better testing. We've expanded the decryption tests to read two sets of files - one produced in real time by the encryption tests; another stored in the parquet-testing repo. The former can miss problems caused by matching bugs in writers and readers.

meta_decryptor_ = ctx->meta_decryptor;
data_decryptor_ = ctx->data_decryptor;
if (data_decryptor_ != NULLPTR || meta_decryptor_ != NULLPTR) {
#ifdef PARQUET_ENCRYPTION
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Tham, I believe we need to try to get rid of all #ifdef PARQUET_ENCRYPTION statements (if possible). Lets go over the code early next week (our weekend starts today..).

@wesm
Copy link
Member

wesm commented Jul 11, 2019

OK. I'll wait for you to address the ifdefs

@thamht4190
Copy link
Contributor Author

@wesm @majetideepak we've figured out a way to remove all #ifdef PARQUET_ENCRYPTION. To be able to compile without openssl, we create a encryption_internal-nossl.cc instead of encryption_internal.cc with openssl. I'll update the fix, after internal team review, I will push to this pull.

@wesm
Copy link
Member

wesm commented Jul 16, 2019

Note that "internal review" is not really consistent with the Apache Way

@thamht4190
Copy link
Contributor Author

Ok @wesm. So I will push the change directly to this pull.

@ggershinsky
Copy link
Contributor

@wesm @majetideepak We are done with all todo items on our current list. This PR should be able to pass the CI now (at least the encryption unitests), due to availability of the new parquet-testing files (thanks Wes for merging them). Is there a way to trigger the CI here directly? Or we should add a new commit / rebase the PR?

@ggershinsky
Copy link
Contributor

In parallel, @thamht4190 and @revit13 , if you're online now, can you check if this PR can be rebased (a good chance the Arrow master has moved..)

@thamht4190
Copy link
Contributor Author

@majetideepak CI tests have passed except for R in AppVeyor that needs OpenSSL. It seems we should link openssl libs at line: https://ci.appveyor.com/project/ApacheSoftwareFoundation/arrow/builds/26093493/job/9yen567yr3pw6fh8#L3403. However, I'm not sure how to do it. Can you please take a look?

@thamht4190
Copy link
Contributor Author

We got this issue when testing our product on Windows with MacAfee (it may happen with MacOS as well):
"Failed to open local file: <path_to_folder>/<file_name>.parquet.encrypted , error: Permission denied"
Then when checking log of MacAfee, I see:
NT AUTHORITY\SYSTEM ran CORTEXSERVICE.EXE, which tried to access <PATH_TO_FOLDER>\<FILE_NAME>.PARQUET.ENCRYPTED, violating the rule "*.encrypted Blocker", and was blocked. For information about how to respond to this event, see KB85494.
It seems .encrypted extension is not so friendly with antivirus softwares.
As I know, parquet tools should not base on file extension, then it's quite safe to use another file extension (without .encrypted) in my application when writing parquet file, however, it seems tricky. Otherwise, it's inconvenient for our customers if keeping .encrypted extension. We are discussing to trade off. Any thoughts? @majetideepak

@ggershinsky
Copy link
Contributor

adding @wesm to the loop on the last comment. My personal take on that - technically, you can use files without .encrypted extension in your settings; neither encryption spec nor the code depend on file names (and there are no plans today to introduce such dependency), so parquet tools should be able to read these files in the future.
However, I believe the right thing to do is to re-configure your anti-virus system (to enable say "*.parquet.encrypted" files). These files are encrypted after all.., so that should be handled at a policy level. But, eventually, this is up to the organizational security admin decision; naming these files without ".encrypted" extension can be also an option.

@kszucs kszucs force-pushed the master branch 2 times, most recently from ed180da to 85fe336 Compare July 22, 2019 19:29
@ggershinsky
Copy link
Contributor

@wesm @majetideepak We are done, let us know what's next. All comments have been addressed; the CI tests pass (except for one, R/libarrow, that lacks OpenSSL - Deepak, your help with that will be appreciated).

@majetideepak
Copy link
Contributor

@ggershinsky I will make a final pass today and tomorrow. I will also look into the R issue.

@majetideepak
Copy link
Contributor

To fix the R failure, we have to link the OpenSSL library here https://github.com/apache/arrow/blob/48ee38f833a51bb46a404da10d89c4b687e952c9/r/src/Makevars.win
or disable Parquet Encryption in this test.
An issue in the current form is that there is no way to disable Parquet Encryption if the system has OpenSSL. I think we have to switch the default PARQUET_REQUIRE_ENCRYPTION to ON and support OFF. I will open a separate JIRA and make that change.

@majetideepak
Copy link
Contributor

I decided to leave the current behavior of building encryption support if the system has OpenSSL libraries for now. This was discussed in #4494
We can always change this behavior in the future.

@majetideepak
Copy link
Contributor

majetideepak commented Jul 26, 2019

@thamht4190 can you rebase with master again. The failing rust tests have been fixed on master.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, that seems ok (though you need to remove the extra : in the middle of the URLs).

Since you had to make this change, I suspect that this means we'll need to update the Homebrew formula too, but unfortunately we don't have that under CI.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed the URLs. What are your thoughts on updating the Homebrew formula?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO we should put the C++ homebrew formula in our ci/ directory and have a Travis job, perhaps only run nightly/on-demand if we're concerned about build time, that builds from it and then runs some test suite that requires it (e.g. R, Ruby). https://github.com/Homebrew/homebrew-core/blob/master/CONTRIBUTING.md suggests we can brew install --build-from-source so IIUC we'd use something like that.

Related, I previously tried to add brew install --HEAD to CI but that didn't go anywhere: #4576

I don't think this needs to block this patch, but it would be great to do something with this sooner rather than later. Otherwise we'll keep making releases unnecessarily expensive and that will keep us from doing them as frequently as we should.

@thamht4190
Copy link
Contributor Author

The CI is green now. Thanks @majetideepak for handling the OpenSSL.
@majetideepak @wesm The pull request is complete and ready for merge, can you please have a look?

@majetideepak
Copy link
Contributor

@wesm I am done with my refactoring for now. We can merge this PR if there are no other comments.

@majetideepak
Copy link
Contributor

All tests are passing. @thamht4190, @revit1976 @ggershinsky do you have any comments on this PR? Is this good to be merged?

@ggershinsky
Copy link
Contributor

no additional comments, looks good to me.

@thamht4190
Copy link
Contributor Author

LGTM. Thanks Deepak for your change.

@revit1976
Copy link

LGTM also. Thanks.

@majetideepak
Copy link
Contributor

@wesm Do you plan to make a pass before we merge this?

@wesm
Copy link
Member

wesm commented Oct 17, 2019

@majetideepak sorry for the delay. I am going to take a brief look through this to check for anything that seems obviously off

@wesm
Copy link
Member

wesm commented Oct 30, 2019

I apologize, I've been out of pocket the last couple weeks. Taking a look through and if I don't any anything troubling I'll merge

Copy link
Member

@wesm wesm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1. I'm going to push some stylistic fixes. It's a bummer that this branch is so messed up. I'm going to create a squashed version to merge to master that doesn't have all the unrelated commits in it. I'll have to manually list out the co-authors...

total_num_rows_(total_num_rows) {
total_num_rows_(total_num_rows),
decryption_buffer_(AllocateBuffer(pool, 0)) {
if (crypto_ctx != NULLPTR) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: it's not necessary to use NULLPTR in .cc files, only in headers.


private:
void UpdateDecryption(const std::shared_ptr<Decryptor>& decryptor, int8_t module_type,
const std::string& pageAAD);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: page_aad would probably be a better variable name


void SerializedPageReader::InitDecryption() {
// Prepare the AAD for quick update later.
if (crypto_ctx_.data_decryptor != NULLPTR) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nullptr

data_pageAAD_ =
encryption::CreateModuleAad(crypto_ctx_.data_decryptor->file_aad(),
encryption::kDataPage, crypto_ctx_.row_group_ordinal,
crypto_ctx_.column_ordinal, static_cast<int16_t>(-1));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this static_cast needed?

void SerializedPageReader::UpdateDecryption(const std::shared_ptr<Decryptor>& decryptor,
int8_t module_type,
const std::string& pageAAD) {
DCHECK(decryptor != NULLPTR);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nullptr

int16_t column_chunk_ordinal,
MemoryPool* pool = arrow::default_memory_pool(),
std::shared_ptr<Encryptor> meta_encryptor = NULLPTR,
std::shared_ptr<Encryptor> data_encryptor = NULLPTR)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NULLPTR -> nullptr

const uint8_t* output_data_buffer = compressed_data->data();
int32_t output_data_len = static_cast<int32_t>(compressed_data->size());

std::shared_ptr<Buffer> encrypted_data_buffer = nullptr;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Setting = nullptr is redundant

}

INSTANTIATE_TEST_CASE_P(
decryptionTests, TestDecryptionConfiguration,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple non-conforming names in this file

}
}
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suspect that some code in these methods could be simplified or reused

-DARROW_PYTHON=ON
-DARROW_GANDIVA=ON
-DARROW_PARQUET=ON
-DPARQUET_REQUIRE_ENCRYPTION=ON
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This may make it harder for people to verify releases. We can see for the next major release

@wesm
Copy link
Member

wesm commented Oct 30, 2019

Here's the merge-ready commit, I'm going to wait a bit for CI to make progress

wesm@69db453

Do NOT merge this PR as is, please

@wesm
Copy link
Member

wesm commented Oct 30, 2019

Merging this. Thanks all

@wesm wesm closed this in 41753ac Oct 30, 2019
@ggershinsky
Copy link
Contributor

@majetideepak @wesm While Parquet encryption feature is a part of the Arrow 0.16.0 release, it doesn't appear in https://github.com/apache/arrow/blob/master/CHANGELOG.md. Also, the https://issues.apache.org/jira/browse/PARQUET-1300 is still open. Should it be closed, with setting the "Affects Version/s:" field? Can the 0.16.0 changelog be updated?

@majetideepak
Copy link
Contributor

@ggershinsky I think if needed we could do a second minor release ARROW 0.16.1 if there are no objections. Do you want to e-mail the Arrow/Parquet mailing lists to discuss this?

@ggershinsky
Copy link
Contributor

@majetideepak We've submitted a paper where we've referenced Arrow 0.16.0 for parquet encryption. So having it in 0.16.1 doesn't solve our immediate (though very minor) problem. I wouldn't suggest running a new release just for adding this to the log. I believe we can think of 0.16.0 as a "silent release" of this feature. I've created an Arrow jira (8018) for adding this feature explicitly in 1.0.0, and closed the PARQUET-1300 as fixed - but if I've done something wrong there, please change as needed.

@ggershinsky
Copy link
Contributor

@wesm what should be done for packaging issues for encryption? I would handle this, but not 100% sure about the details.

@wesm
Copy link
Member

wesm commented Mar 9, 2020

I just opened ARROW-8038 and ARROW-8040 to cover some of the basic packaging stuff. Let's move the discussion there (and to child issues)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.