Skip to content

Conversation

@liurenjie1024
Copy link
Contributor

@liurenjie1024 liurenjie1024 commented May 24, 2023

Which issue does this PR close?

Closes #6405

Rationale for this change

Move tpch verification/plan test into sqllogictest.

What changes are included in this PR?

Are these changes tested?

Yes, tested with cargo test -p datafusion --test sqllogictests

Are there any user-facing changes?

No.

@github-actions github-actions bot added core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) labels May 24, 2023
jackwener
jackwener previously approved these changes May 25, 2023
Copy link
Member

@jackwener jackwener left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @liurenjie1024 .
cc @alamb

Copy link
Contributor

@comphead comphead left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure we need to have all that inserts
Imho its better to use existing approach with sqllogictest like

CREATE EXTERNAL TABLE aggregate_test_100_by_sql (
  c1  VARCHAR NOT NULL,
  c2  TINYINT NOT NULL,
  c3  SMALLINT NOT NULL,
  c4  SMALLINT,
  c5  INT,
  c6  BIGINT NOT NULL,
  c7  SMALLINT NOT NULL,
  c8  INT NOT NULL,
  c9  BIGINT UNSIGNED NOT NULL,
  c10 VARCHAR NOT NULL,
  c11 FLOAT NOT NULL,
  c12 DOUBLE NOT NULL,
  c13 VARCHAR NOT NULL
)
STORED AS CSV
WITH HEADER ROW
LOCATION '../../testing/data/csv/aggregate_test_100.csv'

SELECT avg(c12) FROM aggregate_test_100

@alamb

@jackwener jackwener dismissed their stale review May 25, 2023 16:38

ddl by file

@jackwener
Copy link
Member

Not sure we need to have all that inserts Imho its better to use existing approach with sqllogictest like

agree with you, it's easier to maintain.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @liurenjie1024 . This is going to be awesome.

I was looking at what gets run on master for the tpch queries.

https://github.com/apache/arrow-datafusion/blob/444f0c42c4f37103c51eb8e0e92ad140727ca16f/.github/workflows/rust.yml#L198-L213

My reading is that the CI actually runs the tpch data generator and generates the SF1 data (about 1GB in total size)

Thus in terms of this test I suggest:

  1. remove all the data from the files
  2. use the CREATE EXTERNAL TABLE command as suggested by @comphead and @jackwener to create table that point at the data generated by the tpch data generator
  3. Figure out some way to run the tpch tests conditionally in CI (maybe an environment variable or a flag 🤔 ) so they still pass even when the data generator hasn't been run.

What do you think?

@alamb
Copy link
Contributor

alamb commented May 25, 2023

In terms of conditionally running the tests, perhaps we can special case tests that start with tpch_ the way we do with pg_compat tests: https://github.com/apache/arrow-datafusion/tree/main/datafusion/core/tests/sqllogictests#running-tests-postgres-compatibility

@liurenjie1024
Copy link
Contributor Author

liurenjie1024 commented May 26, 2023

Oh, I didn't notice that we already have tpch verification in ci, so another round of tpch verification would not be necessary. The main goal of this pr is to move tpch test in benchmark into sqllogictest to make it easier to maintain, so I will remove unnecessary changes to avoid.

@liurenjie1024
Copy link
Contributor Author

I think this is ready for review. cc @comphead @alamb @jackwener

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @liurenjie1024 -- I think this looks great.

One improvement I would like to look into is pre-creating the tpch data (perhaps we can create it in a docker image and then just copy it down). This would increase the CI speed greatly I think

./dbgen -f -s 1
mv *.tbl ../benchmarks/data
mv ./answers/* ../benchmarks/data/answers/
./dbgen -f -s 0.1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looked into this a little more -- and note we are running this CI check sf 0.1 (rather than SF1) It seems like a good idea and means it takes only 8 seconds to make the data 👍

https://github.com/apache/arrow-datafusion/actions/runs/5108217213/jobs/9181887855?pr=6435

@alamb
Copy link
Contributor

alamb commented May 30, 2023

I was able to run these tests locally following the directions 👍

@alamb alamb merged commit ff2fd62 into apache:main May 30, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

core Core DataFusion crate sqllogictest SQL Logic Tests (.slt)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Move tpch tests into sqllogictest.

4 participants