
[C++][Parquet] Decryptor errors when scanning a dataset that uses uniform encryption #44852

@adamreeve

Description


Describe the bug, including details regarding any error messages, version, and platform.

@pitrou pointed out that InternalFileDecryptor reusing the footer_data_decryptor_ could be problematic for multi-threaded Parquet reads: #43057 (comment)

I confirmed that this does lead to decryptor errors when scanning a Dataset with Parquet files that use uniform encryption by modifying the existing Parquet Dataset encryption tests:

diff --git a/cpp/src/arrow/dataset/file_parquet_encryption_test.cc b/cpp/src/arrow/dataset/file_parquet_encryption_test.cc
index 0287d593d1..6a13b1ee37 100644
--- a/cpp/src/arrow/dataset/file_parquet_encryption_test.cc
+++ b/cpp/src/arrow/dataset/file_parquet_encryption_test.cc
@@ -90,7 +90,7 @@ class DatasetEncryptionTestBase : public ::testing::Test {
     auto encryption_config =
         std::make_shared<parquet::encryption::EncryptionConfiguration>(
             std::string(kFooterKeyName));
-    encryption_config->column_keys = kColumnKeyMapping;
+    encryption_config->uniform_encryption = true;
     auto parquet_encryption_config = std::make_shared<ParquetEncryptionConfig>();
     // Directly assign shared_ptr objects to ParquetEncryptionConfig members
     parquet_encryption_config->crypto_factory = crypto_factory_;

This causes DatasetEncryptionTest::WriteReadDatasetWithEncryption to fail with an error like:

/home/adam/dev/arrow/cpp/src/arrow/dataset/file_parquet_encryption_test.cc:159: Failure
Failed
'_error_or_value28.status()' failed with IOError: AesDecryptor was wiped outDeserializing page header failed.

/home/adam/dev/arrow/cpp/src/parquet/arrow/reader.cc:109  LoadBatch(batch_size)
/home/adam/dev/arrow/cpp/src/parquet/arrow/reader.cc:1263  ReadColumn(static_cast<int>(i), row_groups, reader.get(), &column)
/home/adam/dev/arrow/cpp/src/arrow/util/parallel.h:95  func(i, inputs[i])
/home/adam/dev/arrow/cpp/src/arrow/dataset/file_parquet_encryption_test.cc:208: Failure
Expected: TestScanDataset() doesn't generate new fatal failures in the current thread.
  Actual: it does.

For LargeRowEncryptionTest::ReadEncryptLargeRows, I sometimes get the same AesDecryptor was wiped out error, but also see errors like:

/home/adam/dev/arrow/cpp/src/arrow/dataset/file_parquet_encryption_test.cc:159: Failure
Failed
'_error_or_value28.status()' failed with IOError: Failed decryption finalization
/home/adam/dev/arrow/cpp/src/parquet/arrow/reader.cc:109  LoadBatch(batch_size)
/home/adam/dev/arrow/cpp/src/parquet/arrow/reader.cc:1263  ReadColumn(static_cast<int>(i), row_groups, reader.get(), &column)
/home/adam/dev/arrow/cpp/src/arrow/util/parallel.h:95  func(i, inputs[i])
/home/adam/dev/arrow/cpp/src/arrow/dataset/file_parquet_encryption_test.cc:265: Failure
Expected: TestScanDataset() doesn't generate new fatal failures in the current thread.
  Actual: it does.

I don't think this can be reproduced from PyArrow alone, as the uniform_encryption setting isn't exposed there.

Component(s)

C++, Parquet
