Skip to content

[GLUTEN-8455][VL] Port encrypted file checks to shim layer#8501

Merged
Yohahaha merged 10 commits intoapache:mainfrom
ArnavBalyan:arnavb/parquet-112-fb
Jan 16, 2025
Merged

[GLUTEN-8455][VL] Port encrypted file checks to shim layer#8501
Yohahaha merged 10 commits intoapache:mainfrom
ArnavBalyan:arnavb/parquet-112-fb

Conversation

@ArnavBalyan
Copy link
Copy Markdown
Member

@ArnavBalyan ArnavBalyan commented Jan 10, 2025

  • Moved encrypted parquet file checks to shim, since parquet version changes across spark versions.
  • Added unit tests to cover the following cases:
    • No encrytpion.
    • Footer encrypted, and file encrypted
    • Footer plaintext, file encrypted
  • The behaviour continues to remain the same for any other case, and assume native scan is safe if we are unable to validate.
  • 3.5 can be checked in similar way, but provides better metadata for encryption checking, will add the support in next PR. Currently 3.5 retains the previous behaviour of always offloading.

@github-actions github-actions bot added CORE works for Gluten Core VELOX labels Jan 10, 2025
@github-actions
Copy link
Copy Markdown

#8455

@github-actions
Copy link
Copy Markdown

Run Gluten Clickhouse CI on x86

@ArnavBalyan
Copy link
Copy Markdown
Member Author

cc @Yohahaha, @jackylee-ch, can you please take a look. The exception checks also works for spark 3.5 but will use the footer metadata since it's more efficient

@github-actions
Copy link
Copy Markdown

Run Gluten Clickhouse CI on x86

@github-actions
Copy link
Copy Markdown

Run Gluten Clickhouse CI on x86

update
@github-actions
Copy link
Copy Markdown

Run Gluten Clickhouse CI on x86

Copy link
Copy Markdown
Contributor

@Yohahaha Yohahaha left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thank you for the work! left some comments.

*/
package org.apache.gluten.utils

object ExceptionUtils {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this util could be added in gluten-core

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes makes sense, do you think we can do this with 35, I just wanted to be sure once the logic is set in thanks

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, let's move this util to common modules.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

override def isParquetFileEncrypted(
fileStatus: LocatedFileStatus,
conf: Configuration): Boolean = {
return false
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: add TODO with current PR link.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

* - Ensures the file is still detected as encrypted despite the plaintext footer.
*/

class ParquetEncryptionDetectionSuite extends AnyFunSuite {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose this suite can be moved to backends-velox test module after Spark35 shims check was done, right?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test can be moved to backends-velox and tested with testWithSpecifiedSparkVersion.

And also, we can use existing parquet files instead of writing new ones each time.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes was planning to move it, but testWithSpecifiedSparkVersion works as well, added support thanks for the tip

plan.unsetTagValue(QueryPlan.OP_ID_TAG)
}

override def isParquetFileEncrypted(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this function be implemented in it's parent class: shims/common/src/main/scala/org/apache/gluten/sql/shims/SparkShims.scala
?
If it has a different implement, just to override it.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logic will be different for 3.5 shim, will check if consolidation can be done when that is added

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a common issue in our shim code. Perhaps we should develop a better way than the current one to manage these code duplications in shim layer in future.

*/
package org.apache.gluten.utils

object ExceptionUtils {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Put it in a common class? As the code is much redundant in spark32/spark33/34/35

Copy link
Copy Markdown
Member Author

@ArnavBalyan ArnavBalyan Jan 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor code which needs to change with 35, updated for now

ParquetFileReader.readFooter(new Configuration(), fileStatus.getPath).toString
false
} catch {
case e: Exception if ExceptionUtils.hasCause(e, classOf[ParquetCryptoRuntimeException]) =>
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why use ExceptionUtils.hasCause instead of case _: ParquetCryptoRuntimeException ?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the exception may wrap ParquetCryptoRuntimeException, and may not directly expose it. This handles all cases thanks

ParquetFileReader.readFooter(new Configuration(), fileStatus.getPath).toString
false
} catch {
case e: Exception if ExceptionUtils.hasCause(e, classOf[ParquetCryptoRuntimeException]) =>
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

addressed above

* - Ensures the file is still detected as encrypted despite the plaintext footer.
*/

class ParquetEncryptionDetectionSuite extends AnyFunSuite {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test can be moved to backends-velox and tested with testWithSpecifiedSparkVersion.

And also, we can use existing parquet files instead of writing new ones each time.

@github-actions
Copy link
Copy Markdown

Run Gluten Clickhouse CI on x86

@github-actions
Copy link
Copy Markdown

Run Gluten Clickhouse CI on x86

Copy link
Copy Markdown
Contributor

@Yohahaha Yohahaha left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

fs.listFiles(new Path(path), false).next()
}

// private def withTempDir(testCode: File => Any): Unit = {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove this since we don't need it.


testWithSpecifiedSparkVersion(
"Detect encrypted Parquet without encrypted footer (plaintext footer)",
Array("3.2", "3.3")) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why this test skip 3.4?

}
}

testWithSpecifiedSparkVersion("Detect plain (unencrypted) Parquet file", Array("3.3", "3.4")) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is also needed for 3.2?

@jackylee-ch
Copy link
Copy Markdown
Contributor

Basically LGTM, left few comments

@github-actions
Copy link
Copy Markdown

Run Gluten Clickhouse CI on x86

@github-actions
Copy link
Copy Markdown

Run Gluten Clickhouse CI on x86

@Yohahaha Yohahaha merged commit 67ebbc8 into apache:main Jan 16, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CORE works for Gluten Core VELOX

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants