test: interval analysis unit tests #14189
datafusion/physical-expr/Cargo.toml
Outdated
```toml
[dev-dependencies]
arrow = { workspace = true, features = ["test_utils"] }
criterion = "0.5"
datafusion = { workspace = true }
```
I suspect this is the line that caused the circular dependencies check to fail. I added it so I could get the SessionContext in my test for parsing the SQL expressions - perhaps there is a better way to do that...
I think the way to do so would be to put the test into one of the core integration suites: https://github.com/apache/datafusion/blob/main/datafusion/core/tests/core_integration.rs
Then you run it like
cargo test --test core_integration
Basically SessionContext is in a different crate that depends on this crate (but not the other way around)
alamb left a comment
Thanks @hiltontj -- this looks quite cool
Maybe @berkaysynnada knows if there are existing tests and/or where the tests could go
```rust
let schema = Arc::new(Schema::new(vec![make_field("a", DataType::Int64)]));
type TestCase = (&'static str, Option<i64>, Option<i64>);
let test_cases: Vec<TestCase> = vec![
    ("a > 10", Some(11), None),
```
Another approach is to avoid parsing SQL and instead build these expressions programmatically, like:

```diff
- ("a > 10", Some(11), None),
+ (col("a").gt(lit(10)), Some(11), None),
```
There are some other examples here: https://docs.rs/datafusion/latest/datafusion/logical_expr/enum.Expr.html#column-references-and-literals
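In miniature, the fluent style looks like this (a toy `Expr` standing in for DataFusion's real `Expr`, `col`, and `lit`, just to show the shape of the pattern; the names mirror the real API but this is not it):

```rust
// Toy stand-ins for DataFusion's expression builders, only to illustrate
// how col("a").gt(lit(10)) builds the same tree that parsing "a > 10" would.
#[derive(Debug, PartialEq)]
enum Expr {
    Column(String),
    Literal(i64),
    Gt(Box<Expr>, Box<Expr>),
}

fn col(name: &str) -> Expr {
    Expr::Column(name.to_string())
}

fn lit(v: i64) -> Expr {
    Expr::Literal(v)
}

impl Expr {
    // Fluent comparison builder, mirroring the shape of Expr::gt
    fn gt(self, other: Expr) -> Expr {
        Expr::Gt(Box::new(self), Box::new(other))
    }
}

fn main() {
    let expr = col("a").gt(lit(10));
    assert_eq!(
        expr,
        Expr::Gt(
            Box::new(Expr::Column("a".to_string())),
            Box::new(Expr::Literal(10)),
        )
    );
}
```

No SQL parsing (and hence no `SessionContext`) is involved, which is what removes the circular dev-dependency.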
I agree, I think it makes more sense to use the expression helpers for two reasons:
- It brings these tests closer to unit tests and further from integration tests, i.e., no need for `SessionContext` and SQL parsing
- It removes the need for the circular dependency

I don't mind changing them over to use the helpers.
656f4fd to a43bf00
berkaysynnada left a comment
Thanks @hiltontj for increasing the coverage.

There are many interval_arithmetic tests elsewhere, but since these tests are within the scope of AnalysisContext, I think it's fine to keep them here. This PR has also shown something I don't like: infinite bounds and empty-set results both come back as None.
```rust
fn test_analyze_invalid_boundary_exprs() {
    let schema = Arc::new(Schema::new(vec![make_field("a", DataType::Int32)]));
    type TestCase = (Expr, &'static str);
    let test_cases: Vec<TestCase> = vec![(
```
Do you plan to extend this vec?
Originally I set it up this way because I thought the empty-set result would be an error, e.g., the test case in the other test in this PR for:

`(a > 10 AND a < 20) AND (a > 20 AND a < 30)`

results in None/None instead of an error. (I believe that is what you are referring to by empty-set?)
There are other things I have noticed while using this analyze method in my work at Influx (see influxdata/influxdb#25866, and specifically here). The function produces an error for various expression types. For example IsNotNull.
So, I could extend this test to check a wider range of expression types if that would be helpful - either on this or a following PR. If not, I can trim this test down to that singular case.
I'm open to your recommendation.
To avoid any misunderstanding, I'd like to elaborate on what is in my mind. There are 3 types of analyze results:

- The analysis result shrinks the bounds (a successful evaluation).
- The analysis is successful, but the bounds could not be shrunk, e.g. `a < 0 || a >= 0`. That is represented as [None, None], and means a range of [-∞, ∞].
- The analysis result is [None, None], just like in this example, but it is actually an empty set. If you evaluate `a < 0 AND a > 0` in interval arithmetic tools, you will get the result [false, false] (CERTAINLY_FALSE). So there is nothing wrong there. However, in the scope of AnalysisContext, these two are mixed up. The solution would be another `Option<>` wrapper, representing whether the evaluation comes up with a valid range or the result is an empty set.
- As you mention, there are unsupported types. We are (or should be) giving internal errors for them.

As you're increasing the test coverage, and there is not a bug which should be resolved immediately, we can continue with the current state of this PR. However, if you are willing to resolve that possible confusion, I can provide further recommendations in detail (related to the 2nd and 3rd items).
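The ambiguity between the unbounded and empty-set cases can be shown with a minimal self-contained sketch (plain `Option<i64>` bounds, not DataFusion's actual `Interval` type, and a hypothetical `intersect` helper): intersecting two disjoint ranges yields an empty set, while an unconstrained range is genuinely [-∞, ∞], yet flattening both to [None, None] loses the distinction. An outer `Option` keeps them apart:

```rust
// Bounds as (lower, upper), where None stands for -inf / +inf,
// mirroring the Option<i64> bounds used in the tests in this PR.
type Bounds = (Option<i64>, Option<i64>);

// Intersect two closed integer intervals.
// Returns None for an empty set, Some(bounds) for a valid (possibly unbounded) range.
fn intersect(a: Bounds, b: Bounds) -> Option<Bounds> {
    let lo = match (a.0, b.0) {
        (Some(x), Some(y)) => Some(x.max(y)),
        (x, y) => x.or(y), // None is -inf, so the finite bound (if any) wins
    };
    let hi = match (a.1, b.1) {
        (Some(x), Some(y)) => Some(x.min(y)),
        (x, y) => x.or(y), // None is +inf
    };
    match (lo, hi) {
        (Some(l), Some(h)) if l > h => None, // empty set: not the same as unbounded
        _ => Some((lo, hi)),
    }
}

fn main() {
    // (a > 10 AND a < 20) AND (a > 20 AND a < 30): disjoint, so empty set
    assert_eq!(intersect((Some(11), Some(19)), (Some(21), Some(29))), None);
    // a > 10 against a fully unconstrained range: still a valid [11, +inf)
    assert_eq!(intersect((Some(11), None), (None, None)), Some((Some(11), None)));
    // Without the outer Option, both results would flatten to (None, None).
}
```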
BTW, what I wanted to emphasize in this thread is that you have defined a vector with a single element:

```rust
let test_cases: Vec<TestCase> = vec![(
    col("a").lt(lit(10)).or(col("a").gt(lit(20))),
    "Interval arithmetic does not support the operator OR",
)];
```

I just wanted to point that out.
> The analysis results will be [None, None], just like in this example, which is actually an empty set. If you evaluate `a < 0 AND a > 0` in interval arithmetic tools, you will get the result [false, false] (CERTAINLY_FALSE). So, there is nothing wrong there. However, in the scope of AnalysisContext, these are mixed up. The solution would be another `Option<>` wrapper, representing whether the evaluation comes up with a valid range or the result is an empty set.
Perhaps it would be best to log an issue for this. I don't know if this is something I can focus on at the moment, but perhaps another contributor would like to pick it up.
Otherwise, I can simplify that test case to remove the vector wrapping the single element as that is confusing.
@berkaysynnada - I have made an attempt at summarizing your above comment into an issue here: #14226
Otherwise, I simplified that test case in the PR.
Added unit tests to interval analysis method which converts Expr tree to a set of Intervals for columns in a given schema.
a43bf00 to 8e4f193
berkaysynnada left a comment
LGTM, thank you @hiltontj
Which issue does this PR close?
I did not have an issue for this, but I was experimenting with the `analyze` method for converting `Expr` trees into sets of `Interval`s and pushed the resulting tests up, since I did not see any that directly tested the `analyze` method.

Rationale for this change

I did not see any unit tests for the `analyze` method. In addition to this example, these provide a bit more of a sense of how the method works for different boundary expressions in queries.

What changes are included in this PR?

Two new unit tests in the `datafusion/physical-expr/src/analysis.rs` module.

Are these changes tested?

Yes.

Are there any user-facing changes?

No.