[GLUTEN-8455][VL] Fallback Scan for Encrypted Parquet Files#8456
[GLUTEN-8455][VL] Fallback Scan for Encrypted Parquet Files#8456Yohahaha merged 7 commits intoapache:mainfrom
Conversation
|
Run Gluten Clickhouse CI on x86 |
dab4674 to
e375c06
Compare
|
Run Gluten Clickhouse CI on x86 |
|
Run Gluten Clickhouse CI on x86 |
|
Run Gluten Clickhouse CI on x86 |
Yohahaha
left a comment
There was a problem hiding this comment.
LGTM! just one comments
| .createWithDefault(false) | ||
|
|
||
| val ENCRYPTED_PARQUET_FALLBACK_ENABLED = | ||
| buildStaticConf("spark.gluten.sql.fallbackEncryptedParquet") |
There was a problem hiding this comment.
updated, this config changing at runtime is an unlikely event, kept removed static to allow any such usecases
| case _ => ValidationResult.succeeded | ||
| def validateEncryption(): Option[String] = { | ||
|
|
||
| val encryptionValidationEnabled = GlutenConfig.getConf.enableEncryptedParquetFallback |
There was a problem hiding this comment.
The config name is a little wird to me. maybe parquetEncryptionValidationEnabled?
| val fileStatus = filesIterator.next() | ||
| checkedFileCount += 1 | ||
| try { | ||
| ParquetFileReader.readFooter(conf, fileStatus.getPath).toString |
There was a problem hiding this comment.
Is it a better way to check encrypted metadata than use the Exception?
There was a problem hiding this comment.
This is done to keep it backward compatible, spark 33 uses parquet 1.12, support is not there yet
There was a problem hiding this comment.
Spark 3.5 uses parquet 1.13, and I found it defined in thrift. This could work for Spark 3.5?
|
Run Gluten Clickhouse CI on x86 |
|
Thanks for the review @Yohahaha addressed comments, could you please take a look |
|
CI fails. can you fix format? |
|
Run Gluten Clickhouse CI on x86 |
please rebase and fix conflict. |
|
Run Gluten Clickhouse CI on x86 |
|
@Yohahaha The way to check encrypted file haven't been confirmed, I'm afraid this is not work on Spark 3.5 |
Hi @jackylee-ch, spark 3.5 uses parquet 1.13. After parquet 1.13, there is a new field added to check for encryption, which can provide if the file is encrypted. However if we try to read the encrypted file footer, it throws ParquetCryptoRuntimeException. Could you please elaborate on why it may not work on 3.5? Thanks Alternatively, I was planning to shade parquet 1.14 and bring the shaded parquet as a packaged dependency in Gluten (to check the encryption), which will be a more elegant solution and work with multiple spark version. If there are no concerns with bringing in shaded parquet inside gluten, I'm happy to work on that implementation as well. I could not see such implementations inside Gluten so was a bit hesitant, althought parquet encryption seems to be a special case and could benefit from such a solution. Let me know what you think thanks! |
@jackylee-ch sorry, I missed your comments. the check is disabled by default, it's safe to merge to main branch. if we verified it's not work in Spark 3.5, we can fix with a followup PR. |
|
@ArnavBalyan thanks for the update! would you add UT in different spark shim module with a minimal encrypted parquet file? we do not need package parquet deps in Gluten I think, just do verification in shim module. |
|
@jackylee-ch I'm ok if you want to revert this PR with Spark 3.5 compatibility concern. |
@ArnavBalyan You mean that no matter which Spark version we use, we can get BTW what would happen if the footer is not encrypted but the column is encrypted? |
Current PR should work fine for Spark 3.2 and 3.3 for encrypted footer, not sure for other cases. If it doesn't support those cases, we need move Since it is merged and not work by default, a followed PR is good for me. |
Sure let me add a follow up UT for 3.5, this feature is behind feature flag and verified for parquet 1.13. @jackylee-ch, it would depend how you are doing encryption in your setup. Typically the footer metadata will indicate encryption for newer versions of parquet.
@Yohahaha, if we want to keep it backward compatible for parquet 1.12, and not use exception, then we will need a newer parquet inside Gluten regardless of the spark version, in that case, the check can hold true if there are future parquet upgrades inside spark. Using the shim layer we can probably do validation but still will use exception checking for 1.12, I was wondering how feasible it would be to bring in shaded parquet to do this? Thanks |
|
We can use different checker for different Spark version in shims, and for parquet 1.12, the exception is good to me if there is no other better idea. |
spark.gluten.sql.fallbackEncryptedParquettrue.Fixes: #8455