Describe the bug, including details regarding any error messages, version, and platform.
@pitrou pointed out that InternalFileDecryptor reusing the footer_data_decryptor_ could be problematic for multi-threaded Parquet reads: #43057 (comment)
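For context, the concern is the usual shared-mutable-state hazard: a single cached decryptor object handed to concurrently running column readers. Below is a minimal standalone sketch of that pattern only; the SharedDecryptor/scratch names are hypothetical and this is not Arrow code.

#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical stand-in for a cached decryptor whose cipher context is not
// thread-safe (illustrative only, not Arrow code).
struct SharedDecryptor {
  std::vector<uint8_t> scratch;  // per-call working state, like a cipher context

  void Decrypt(const std::vector<uint8_t>& ciphertext) {
    // Unsynchronized write to shared state: two threads interleaving here can
    // leave the "context" half-updated, analogous to the decryption
    // finalization failures shown below.
    scratch.assign(ciphertext.begin(), ciphertext.end());
  }
};

int main() {
  SharedDecryptor cached_footer_decryptor;  // shared across all column readers
  std::vector<std::thread> column_readers;
  for (int i = 0; i < 4; ++i) {
    column_readers.emplace_back([&cached_footer_decryptor] {
      cached_footer_decryptor.Decrypt(std::vector<uint8_t>(1024, 0xAB));  // data race
    });
  }
  for (auto& reader : column_readers) reader.join();
  return 0;
}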
I confirmed that this does lead to decryptor errors when scanning a Dataset of Parquet files that use uniform encryption. To reproduce it, I modified the existing Parquet Dataset encryption tests:
diff --git a/cpp/src/arrow/dataset/file_parquet_encryption_test.cc b/cpp/src/arrow/dataset/file_parquet_encryption_test.cc
index 0287d593d1..6a13b1ee37 100644
--- a/cpp/src/arrow/dataset/file_parquet_encryption_test.cc
+++ b/cpp/src/arrow/dataset/file_parquet_encryption_test.cc
@@ -90,7 +90,7 @@ class DatasetEncryptionTestBase : public ::testing::Test {
auto encryption_config =
std::make_shared<parquet::encryption::EncryptionConfiguration>(
std::string(kFooterKeyName));
- encryption_config->column_keys = kColumnKeyMapping;
+ encryption_config->uniform_encryption = true;
auto parquet_encryption_config = std::make_shared<ParquetEncryptionConfig>();
// Directly assign shared_ptr objects to ParquetEncryptionConfig members
parquet_encryption_config->crypto_factory = crypto_factory_;
This causes DatasetEncryptionTest::WriteReadDatasetWithEncryption to fail with an error like:
/home/adam/dev/arrow/cpp/src/arrow/dataset/file_parquet_encryption_test.cc:159: Failure
Failed
'_error_or_value28.status()' failed with IOError: AesDecryptor was wiped outDeserializing page header failed.
/home/adam/dev/arrow/cpp/src/parquet/arrow/reader.cc:109 LoadBatch(batch_size)
/home/adam/dev/arrow/cpp/src/parquet/arrow/reader.cc:1263 ReadColumn(static_cast<int>(i), row_groups, reader.get(), &column)
/home/adam/dev/arrow/cpp/src/arrow/util/parallel.h:95 func(i, inputs[i])
/home/adam/dev/arrow/cpp/src/arrow/dataset/file_parquet_encryption_test.cc:208: Failure
Expected: TestScanDataset() doesn't generate new fatal failures in the current thread.
Actual: it does.
For LargeRowEncryptionTest::ReadEncryptLargeRows, I sometimes get the same "AesDecryptor was wiped out" error, but also see errors like:
/home/adam/dev/arrow/cpp/src/arrow/dataset/file_parquet_encryption_test.cc:159: Failure
Failed
'_error_or_value28.status()' failed with IOError: Failed decryption finalization
/home/adam/dev/arrow/cpp/src/parquet/arrow/reader.cc:109 LoadBatch(batch_size)
/home/adam/dev/arrow/cpp/src/parquet/arrow/reader.cc:1263 ReadColumn(static_cast<int>(i), row_groups, reader.get(), &column)
/home/adam/dev/arrow/cpp/src/arrow/util/parallel.h:95 func(i, inputs[i])
/home/adam/dev/arrow/cpp/src/arrow/dataset/file_parquet_encryption_test.cc:265: Failure
Expected: TestScanDataset() doesn't generate new fatal failures in the current thread.
Actual: it does.
I don't think it's possible to reproduce this from PyArrow alone, as the uniform_encryption setting isn't exposed there.
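For completeness, here is a hedged C++ sketch of how the uniform-encryption write configuration is assembled through the public API, mirroring what the test diff above does; the MakeUniformEncryptionConfig name is hypothetical, and the crypto_factory / kms_connection_config arguments are assumed to be set up as in the test fixture.

#include <memory>
#include <string>
#include <utility>

#include "arrow/dataset/parquet_encryption_config.h"
#include "parquet/encryption/crypto_factory.h"
#include "parquet/encryption/kms_client.h"

// Builds a dataset-level encryption config that encrypts all columns with the
// footer key (uniform encryption) instead of using a per-column key mapping.
std::shared_ptr<arrow::dataset::ParquetEncryptionConfig> MakeUniformEncryptionConfig(
    std::shared_ptr<parquet::encryption::CryptoFactory> crypto_factory,
    std::shared_ptr<parquet::encryption::KmsConnectionConfig> kms_connection_config,
    const std::string& footer_key_name) {
  auto encryption_config =
      std::make_shared<parquet::encryption::EncryptionConfiguration>(footer_key_name);
  encryption_config->uniform_encryption = true;  // replaces the column_keys mapping

  auto parquet_encryption_config =
      std::make_shared<arrow::dataset::ParquetEncryptionConfig>();
  parquet_encryption_config->crypto_factory = std::move(crypto_factory);
  parquet_encryption_config->kms_connection_config = std::move(kms_connection_config);
  parquet_encryption_config->encryption_config = std::move(encryption_config);
  return parquet_encryption_config;
}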
Component(s)
C++, Parquet