What is the problem the feature request solves?
Background
The main goal of Comet is to accelerate Spark jobs and produce the same results that Spark would have produced.
In some cases, there will always be known differences, and in those cases we should disable Comet by default but allow users to opt in and make sure that we document how Comet differs from Spark. One example is regular expressions. Rust and Java have different regular expression engines and there will be differences in behavior in some cases.
Limitations of Current Testing Approaches
Too much use of makeParquetFileAllPrimitiveTypes
Early versions of Comet only accelerated Parquet reads. CometTestBase has a method makeParquetFileAllPrimitiveTypes that is very specific to testing Parquet (using different combinations of logical and physical type, for example) but does not generate values that would be edge cases for testing expressions. For example, it does not generate negative floating point zero or min/max values for each numeric type. Over time, many tests for expressions continued to use makeParquetFileAllPrimitiveTypes, resulting in limited testing for edge cases.
Hand Written Tests
We write tests manually, with numerous testing styles. These tests often do not attempt to cover edge cases but just test for basic use cases.
Limited Testing for Exceptions
We generally evaluate expressions against multiple rows at a time. If one value is invalid then an exception will be thrown, but we are not testing that all invalid values would have been caught, or that the valid values would have returned the correct results. For expressions that are fallible, we should be testing one value at a time for correct handling of invalid values (in addition to testing multiple rows for valid cases).
No code coverage or test coverage reporting
How do we know if we have tests for all expressions? We would currently have to manually audit the code base.
Proposed Improvements
Now that we have mostly moved expression serde to the new framework, we have a list of supported expressions:
val exprSerdeMap: Map[Class[_ <: Expression], CometExpressionSerde[_]] =
mathExpressions ++ hashExpressions ++ stringExpressions ++
conditionalExpressions ++ mapExpressions ++ predicateExpressions ++
structExpressions ++ bitwiseExpressions ++ miscExpressions ++ arrayExpressions ++
temporalExpressions ++ conversionExpressions
I would like to propose that we take some of the ideas from CometFuzz and the current tests that extend CometFuzzTestBase and generate tests for each supported expression.
We may need to add new methods to the CometExpressionSerde trait to help with testing. For example, we may need to add information about expression signatures and the types of inputs they can accept so that we can generate tests for all cases.
edit: We can infer the signature from the Spark class in many cases. For example, expressions may extend UnaryExpression or BinaryExpression, and there is the inputTypes method.
There are multiple approaches to how we could implement this, and we will need to try some different ideas out, so I haven't added a specific proposal here. I have created a Google Doc where we can collaborate. We can also discuss in the comments in this issue.
https://docs.google.com/document/d/1hZ7osyfALVdjxL-3UzzhdTclM1j972bYSK7M6CuEFwg/edit?usp=sharing
Describe the potential solution
No response
Additional context
No response
What is the problem the feature request solves?
Background
The main goal of Comet is to accelerate Spark jobs and produce the same results that Spark would have produced.
In some cases, there will always be known differences, and in those cases we should disable Comet by default but allow users to opt in and make sure that we document how Comet differs from Spark. One example is regular expressions. Rust and Java have different regular expression engines and there will be differences in behavior in some cases.
Limitations of Current Testing Approaches
Too much use of makeParquetFileAllPrimitiveTypes
Early versions of Comet only accelerated Parquet reads.
CometTestBasehas a methodmakeParquetFileAllPrimitiveTypesthat is very specific to testing Parquet (using different combinations of logical and physical type, for example) but does not generate values that would be edge cases for testing expressions. For example, it does not generate negative floating point zero or min/max values for each numeric type. Over time, many tests for expressions continued to usemakeParquetFileAllPrimitiveTypes, resulting in limited testing for edge cases.Hand Written Tests
We write tests manually, with numerous testing styles. These tests often do not attempt to cover edge cases but just test for basic use cases.
Limited Testing for Exceptions
We generally evaluate expressions against multiple rows at a time. If one value is invalid then an exception will be thrown, but we are not testing that all invalid values would have been caught, or that the valid values would have returned the correct results. For expressions that are fallible, we should be testing one value at a time for correct handling of invalid values (in addition to testing multiple rows for valid cases).
No code coverage or test coverage reporting
How do we know if we have tests for all expressions? We would currently have to manually audit the code base.
Proposed Improvements
Now that we have mostly moved expression serde to the new framework, we have a list of supported expressions:
I would like to propose that we take some of the ideas from CometFuzz and the current tests that extend
CometFuzzTestBaseand generate tests for each supported expression.We may need to add new methods to the
CometExpressionSerdetrait to help with testing. For example, we may need to add information about expression signatures and the types of inputs they can accept so that we can generate tests for all cases.edit: We can infer the signature from the Spark class in many cases. For example, expressions may extend
UnaryExpressionorBinaryExpression, and there is theinputTypesmethod.There are multiple approaches to how we could implement this, and we will need to try some different ideas out, so I haven't added a specific proposal here. I have created a Google Doc where we can collaborate. We can also discuss in the comments in this issue.
https://docs.google.com/document/d/1hZ7osyfALVdjxL-3UzzhdTclM1j972bYSK7M6CuEFwg/edit?usp=sharing
Describe the potential solution
No response
Additional context
No response