-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Description
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
I would like to ensure that DataFusion gets the correct answers for SQL queries (especially in tricky corner cases like the one described in #4211)
From experience, both in DataFusion and in prior jobs, the effort required to maintain tests (both to add new tests as well as update existing tests) is substantial. Making it easier to add new tests and maintain existing ones will help us keep up velocity.
Right now, we have two sql integration style tests:
- the
integrationtest from @jimexist 🦾 https://github.com/apache/arrow-datafusion/tree/master/integration-tests: Runs a limited number of queries against data in both postgres and datafusion and compares the results
- The
sql_integrationtest https://github.com/apache/arrow-datafusion/tree/master/datafusion/core/tests/sql: test setup, execution, and verification is written in rust.
The challenge with sql_integration test is that to add new tests or update existing ones, we need to change rust code and recompile, which takes a loong time
Likewise, the integration test requires that the results are exactly the same as postgres which is not possible in all cases (like when testing for unsigned types, which postgres doesn't support, or testing some DataFusion specific thing)
Describe the solution you'd like
I would like some sort of data driven test to replace sql_integration
You can see this style of test in duckdb: https://github.com/duckdb/duckdb/blob/master/test/sql/join/empty_joins.test
My ideal solution would be to implement a runner (ideally the same as SQLLogicTests from DuckDB)
2. Using the same data file format as duckdb (will mean we could reuse their tests without much modification)
3. Start porting as many of the tests in sql_integration over to this new format as possible)
I implemented a impler version of this approach in https://github.com/influxdata/influxdb_iox/blob/main/query_tests/README.md which runs sql queries from a file and compares the result to known output. I think the duckdb way is superior
Describe alternatives you've considered
Leave things the same
Additional context
Add any other context or screenshots about the feature request here.