[GLUTEN-8455][VL] Port encrypted file checks to shim layer #8501
Yohahaha merged 10 commits into apache:main
Run Gluten Clickhouse CI on x86
cc @Yohahaha, @jackylee-ch, can you please take a look? The exception check also works for Spark 3.5, but for 3.5 we will use the footer metadata since it's more efficient.
Yohahaha left a comment
Thank you for the work! I left some comments.
 */
package org.apache.gluten.utils

object ExceptionUtils {

I think this util could be added in gluten-core.
Yes, makes sense. Do you think we can do this along with the 3.5 shim? I just wanted to be sure the logic is settled first, thanks.
yes, let's move this util to common modules.
override def isParquetFileEncrypted(
    fileStatus: LocatedFileStatus,
    conf: Configuration): Boolean = {
  return false

nit: add TODO with current PR link.
 *   - Ensures the file is still detected as encrypted despite the plaintext footer.
 */

class ParquetEncryptionDetectionSuite extends AnyFunSuite {

I suppose this suite can be moved to the backends-velox test module once the Spark 3.5 shim check is done, right?
This test can be moved to backends-velox and tested with testWithSpecifiedSparkVersion.
And also, we can use existing parquet files instead of writing new ones each time.
Yes, I was planning to move it, but testWithSpecifiedSparkVersion works as well. Added support, thanks for the tip.
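To illustrate the idea of gating a test on the running Spark version, here is a minimal, self-contained sketch. It is not Gluten's `testWithSpecifiedSparkVersion`; the names `VersionGatedTests`, `matchesVersion`, and `runForSparkVersions` are hypothetical, and the sketch assumes the Spark version is available as a `major.minor.patch` string.

```scala
// Hypothetical stand-in for a version-gated test helper (not Gluten's implementation).
object VersionGatedTests {

  // True when sparkVersion (e.g. "3.3.1") falls under one of the listed
  // "major.minor" prefixes. Appending "." prevents "3.2" from matching "3.20.0".
  def matchesVersion(sparkVersion: String, versions: Array[String]): Boolean =
    versions.exists(v => sparkVersion == v || sparkVersion.startsWith(v + "."))

  // Runs the test body only on the listed versions; returns whether it ran.
  def runForSparkVersions(sparkVersion: String, testName: String, versions: Array[String])(
      testFun: => Any): Boolean =
    if (matchesVersion(sparkVersion, versions)) {
      testFun
      true
    } else {
      println(s"Skipping '$testName' on Spark $sparkVersion")
      false
    }
}
```

With a gate like this, one suite in a shared test module can cover several shims, and the version list on each test documents which shims are expected to pass.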
  plan.unsetTagValue(QueryPlan.OP_ID_TAG)
}

override def isParquetFileEncrypted(

Can this function be implemented in its parent class, shims/common/src/main/scala/org/apache/gluten/sql/shims/SparkShims.scala? If a shim needs a different implementation, it can just override it.
The logic will be different for the 3.5 shim. I will check whether consolidation can be done when that is added.
This is a common issue in our shim code. Perhaps we should develop a better way than the current one to manage this code duplication in the shim layer in the future.
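The suggestion above (a shared default in the parent, overridden only where a shim differs) could look roughly like the following sketch. Everything here is hypothetical and simplified: `FileModel` stands in for the real `LocatedFileStatus` and `Configuration` arguments, and the two boolean fields stand in for "reading the footer throws a crypto exception" and "footer metadata reports encryption".

```scala
// Hypothetical model of a Parquet file, used only to keep the sketch dependency-free.
final case class FileModel(footerReadThrowsCrypto: Boolean, metadataSaysEncrypted: Boolean)

// Simplified stand-in for the common shim trait.
trait ShimsSketch {
  // Shared default: treat a crypto failure while reading the footer as "encrypted".
  def isParquetFileEncrypted(file: FileModel): Boolean = file.footerReadThrowsCrypto
}

// Shims that share the default logic inherit it without repeating code.
class Spark32ShimsSketch extends ShimsSketch
class Spark33ShimsSketch extends ShimsSketch

// A shim with different logic overrides only this one method.
class Spark35ShimsSketch extends ShimsSketch {
  override def isParquetFileEncrypted(file: FileModel): Boolean = file.metadataSaysEncrypted
}
```

The trade-off is that the default lives in the common module, so it can only use APIs available in every supported Spark version.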
 */
package org.apache.gluten.utils

object ExceptionUtils {

Put it in a common class? The code is largely redundant across spark32/spark33/spark34/spark35.
There is minor code that needs to change with 3.5; updated for now.
    ParquetFileReader.readFooter(new Configuration(), fileStatus.getPath).toString
    false
  } catch {
    case e: Exception if ExceptionUtils.hasCause(e, classOf[ParquetCryptoRuntimeException]) =>

Why use ExceptionUtils.hasCause instead of case _: ParquetCryptoRuntimeException?
The thrown exception may wrap ParquetCryptoRuntimeException rather than expose it directly. This handles all cases, thanks.
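For readers unfamiliar with the pattern, a cause-chain check along these lines might look like the sketch below. This is not the PR's `ExceptionUtils`; `CauseChain`, `CryptoFailure` (a stand-in for `ParquetCryptoRuntimeException`), and `DetectSketch` are hypothetical names chosen so the example has no Parquet dependency.

```scala
import scala.annotation.tailrec

// Hypothetical stand-in for ParquetCryptoRuntimeException.
class CryptoFailure(msg: String) extends RuntimeException(msg)

object CauseChain {
  // True if `t` itself, or any exception in its cause chain, is an instance of `cls`.
  def hasCause(t: Throwable, cls: Class[_ <: Throwable]): Boolean = {
    @tailrec
    def loop(cur: Throwable, depth: Int): Boolean =
      if (cur == null || depth > 64) false // depth cap guards against cyclic cause chains
      else if (cls.isInstance(cur)) true
      else loop(cur.getCause, depth + 1)
    loop(t, 0)
  }
}

object DetectSketch {
  // `read` models the footer read; a crypto failure anywhere in the chain means "encrypted".
  // Any other exception is rethrown because the guard does not match it.
  def isEncrypted(read: () => Unit): Boolean =
    try {
      read()
      false
    } catch {
      case e: Exception if CauseChain.hasCause(e, classOf[CryptoFailure]) => true
    }
}
```

A plain `case _: CryptoFailure` would match only when the crypto exception is thrown directly; the guard form also matches when a reader rethrows it wrapped in another exception.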
  fs.listFiles(new Path(path), false).next()
}

// private def withTempDir(testCode: File => Any): Unit = {

remove this since we don't need it.
testWithSpecifiedSparkVersion(
  "Detect encrypted Parquet without encrypted footer (plaintext footer)",
  Array("3.2", "3.3")) {

Why does this test skip 3.4?
  }
}

testWithSpecifiedSparkVersion("Detect plain (unencrypted) Parquet file", Array("3.3", "3.4")) {

Is it also needed for 3.2?
Basically LGTM, left a few comments.