test: interval analysis unit tests #14189
datafusion/physical-expr/Cargo.toml
Outdated
```toml
[dev-dependencies]
arrow = { workspace = true, features = ["test_utils"] }
criterion = "0.5"
datafusion = { workspace = true }
```
I suspect this is the line that caused the circular dependencies check to fail. I added it so I could get the SessionContext in my test for parsing the SQL expressions - perhaps there is a better way to do that...
I think the way to do so would be to put the test into one of the core integration suites: https://github.com/apache/datafusion/blob/main/datafusion/core/tests/core_integration.rs
Then you run it like
cargo test --test core_integration
Basically SessionContext is in a different crate that depends on this crate (but not the other way around)
alamb left a comment
Thanks @hiltontj -- this looks quite cool
Maybe @berkaysynnada knows if there are existing tests and/or where the tests could go
```rust
let schema = Arc::new(Schema::new(vec![make_field("a", DataType::Int64)]));
type TestCase = (&'static str, Option<i64>, Option<i64>);
let test_cases: Vec<TestCase> = vec![
    ("a > 10", Some(11), None),
```
Another approach is to avoid parsing SQL and instead build these expressions programmatically, like:

```diff
- ("a > 10", Some(11), None),
+ (col("a").gt(lit(10)), Some(11), None),
```
There are some other examples here: https://docs.rs/datafusion/latest/datafusion/logical_expr/enum.Expr.html#column-references-and-literals
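In miniature, the fluent style looks like this (a toy `Expr` standing in for DataFusion's real `Expr`, `col`, and `lit`, just to show the shape of the pattern; the names mirror the real API but this is not it):

```rust
// Toy stand-ins for DataFusion's expression builders, only to illustrate
// how col("a").gt(lit(10)) builds the same tree that parsing "a > 10" would.
#[derive(Debug, PartialEq)]
enum Expr {
    Column(String),
    Literal(i64),
    Gt(Box<Expr>, Box<Expr>),
}

fn col(name: &str) -> Expr {
    Expr::Column(name.to_string())
}

fn lit(v: i64) -> Expr {
    Expr::Literal(v)
}

impl Expr {
    // Fluent comparison builder, mirroring the shape of Expr::gt
    fn gt(self, other: Expr) -> Expr {
        Expr::Gt(Box::new(self), Box::new(other))
    }
}

fn main() {
    let expr = col("a").gt(lit(10));
    assert_eq!(
        expr,
        Expr::Gt(
            Box::new(Expr::Column("a".to_string())),
            Box::new(Expr::Literal(10)),
        )
    );
}
```

No SQL parsing (and hence no `SessionContext`) is involved, which is what removes the circular dev-dependency.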
I agree, I think it makes more sense to use the expression helpers for two reasons:
- It brings these tests closer to unit tests and further from integration tests, i.e., no need for `SessionContext` and SQL parsing
- It removes the need for the circular dependency

I don't mind changing them over to use the helpers.
656f4fd to a43bf00
berkaysynnada left a comment
Thanks @hiltontj for increasing the coverage.

There are many interval_arithmetic tests elsewhere, but since these tests are within the scope of AnalysisContext, I think it's fine to keep them here. This PR has also shown something I don't like: infinite bounds and empty-set results both come back as None.
```rust
fn test_analyze_invalid_boundary_exprs() {
    let schema = Arc::new(Schema::new(vec![make_field("a", DataType::Int32)]));
    type TestCase = (Expr, &'static str);
    let test_cases: Vec<TestCase> = vec![(
```
Do you plan to extend this vec?
Originally I set it up this way because I thought the empty-set result would be an error, e.g., the test case in the other test in this PR for:

`(a > 10 AND a < 20) AND (a > 20 AND a < 30)`

results in None/None instead of an error. (I believe that is what you are referring to by empty-set?)
There are other things I have noticed while using this analyze method in my work at Influx (see influxdata/influxdb#25866, and specifically here). The function produces an error for various expression types. For example IsNotNull.
So, I could extend this test to check a wider range of expression types if that would be helpful - either on this or a following PR. If not, I can trim this test down to that singular case.
I'm open to your recommendation.
To avoid any misunderstanding, I'd like to elaborate on what is in my mind. There are 3 types of analyze results:

- The analysis result shrinks the bounds (a successful evaluation).
- The analysis is successful, but the bounds could not be shrunk, e.g. `a < 0 || a >= 0`. That is represented as [None, None], and means a range of [-∞, ∞].
- The analysis result is [None, None], just like in this example, but it is actually an empty set. If you evaluate `a < 0 AND a > 0` in interval arithmetic tools, you will get the result [false, false] (CERTAINLY_FALSE). So there is nothing wrong there. However, in the scope of AnalysisContext, these two are mixed up. The solution would be another `Option<>` wrapper, representing whether the evaluation comes up with a valid range or the result is an empty set.
- As you mention, there are unsupported types. We are (or should be) giving internal errors for them.

As you're increasing the test coverage, and there is not a bug which should be resolved immediately, we can continue with the current state of this PR. However, if you are willing to resolve that possible confusion, I can provide further recommendations in detail (related to the 2nd and 3rd items).
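The ambiguity between the unbounded and empty-set cases can be shown with a minimal self-contained sketch (plain `Option<i64>` bounds, not DataFusion's actual `Interval` type, and a hypothetical `intersect` helper): intersecting two disjoint ranges yields an empty set, while an unconstrained range is genuinely [-∞, ∞], yet flattening both to [None, None] loses the distinction. An outer `Option` keeps them apart:

```rust
// Bounds as (lower, upper), where None stands for -inf / +inf,
// mirroring the Option<i64> bounds used in the tests in this PR.
type Bounds = (Option<i64>, Option<i64>);

// Intersect two closed integer intervals.
// Returns None for an empty set, Some(bounds) for a valid (possibly unbounded) range.
fn intersect(a: Bounds, b: Bounds) -> Option<Bounds> {
    let lo = match (a.0, b.0) {
        (Some(x), Some(y)) => Some(x.max(y)),
        (x, y) => x.or(y), // None is -inf, so the finite bound (if any) wins
    };
    let hi = match (a.1, b.1) {
        (Some(x), Some(y)) => Some(x.min(y)),
        (x, y) => x.or(y), // None is +inf
    };
    match (lo, hi) {
        (Some(l), Some(h)) if l > h => None, // empty set: not the same as unbounded
        _ => Some((lo, hi)),
    }
}

fn main() {
    // (a > 10 AND a < 20) AND (a > 20 AND a < 30): disjoint, so empty set
    assert_eq!(intersect((Some(11), Some(19)), (Some(21), Some(29))), None);
    // a > 10 against a fully unconstrained range: still a valid [11, +inf)
    assert_eq!(intersect((Some(11), None), (None, None)), Some((Some(11), None)));
    // Without the outer Option, both results would flatten to (None, None).
}
```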
BTW, what I wanted to emphasize in this thread is that you have defined a vector with a single element:

```rust
let test_cases: Vec<TestCase> = vec![(
    col("a").lt(lit(10)).or(col("a").gt(lit(20))),
    "Interval arithmetic does not support the operator OR",
)];
```

I just wanted to point that out.
> The analysis results will be [None, None], just like in this example, which is actually an empty set. If you evaluate `a < 0 AND a > 0` in interval arithmetic tools, you will get the result [false, false] (CERTAINLY_FALSE). So, there is nothing wrong there. However, in the scope of AnalysisContext, these are mixed up. The solution would be another `Option<>` wrapper, representing whether the evaluation comes up with a valid range or the result is an empty set.
Perhaps it would be best to log an issue for this. I don't know if this is something I can focus on at the moment, but perhaps another contributor would like to pick it up.
Otherwise, I can simplify that test case to remove the vector wrapping the single element as that is confusing.
@berkaysynnada - I have made an attempt at summarizing your above comment into an issue here: #14226
Otherwise, I simplified that test case in the PR.
Added unit tests to interval analysis method which converts Expr tree to a set of Intervals for columns in a given schema.
a43bf00 to 8e4f193
berkaysynnada left a comment
LGTM, thank you @hiltontj
Which issue does this PR close?
I did not have an issue for this, but I was experimenting with the `analyze` method for converting `Expr` trees into sets of `Interval`s and pushed the resulting tests up, since I did not see any that directly tested the `analyze` method.

Rationale for this change

I did not see any unit tests for the `analyze` method. In addition to this example, these provide a bit more of a sense of how the method works for different boundary expressions in queries.

What changes are included in this PR?

Two new unit tests in the `datafusion/physical-expr/src/analysis.rs` module.

Are these changes tested?

Yes.

Are there any user-facing changes?

No.