Skip to content

Make a data driven SQL testing tool (so we can reuse duckdb test suite, example) #4248

@alamb

Description

@alamb

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
I would like to ensure that DataFusion gets the correct answers for SQL queries (especially in tricky corner cases like the one described in #4211)

From experience, both in DataFusion and in prior jobs, the effort required to maintain tests (both to add new tests as well as update existing tests) is substantial. Making it easier to add new tests and maintain existing ones will help us keep up velocity.

Right now, we have two sql integration style tests:

  1. the integration test from @jimexist 🦾 https://github.com/apache/arrow-datafusion/tree/master/integration-tests: Runs a limited number of queries against data in both postgres and datafusion and compares the results

The challenge with sql_integration test is that to add new tests or update existing ones, we need to change rust code and recompile, which takes a loong time

Likewise, the integration test requires that the results are exactly the same as postgres which is not possible in all cases (like when testing for unsigned types, which postgres doesn't support, or testing some DataFusion specific thing)

Describe the solution you'd like
I would like some sort of data driven test to replace sql_integration

You can see this style of test in duckdb: https://github.com/duckdb/duckdb/blob/master/test/sql/join/empty_joins.test

My ideal solution would be to implement a runner (ideally the same as SQLLogicTests from DuckDB)
2. Using the same data file format as duckdb (will mean we could reuse their tests without much modification)
3. Start porting as many of the tests in sql_integration over to this new format as possible)

I implemented a impler version of this approach in https://github.com/influxdata/influxdb_iox/blob/main/query_tests/README.md which runs sql queries from a file and compares the result to known output. I think the duckdb way is superior

Describe alternatives you've considered
Leave things the same

Additional context
Add any other context or screenshots about the feature request here.

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions