
Fix run_tpcds data dir #19771

Merged
Dandandan merged 1 commit into apache:main from gabotechs:fix-run-tpcds-benchmarks
Jan 13, 2026

@gabotechs
Contributor

Which issue does this PR close?

  • Closes #.

Rationale for this change

Running ./bench.sh run tpcds with a freshly created ./bench.sh data tpcds fails with the following error:

Please prepare TPC-DS data first by following instructions:
  ./bench.sh data tpcds

This PR fixes it

What changes are included in this PR?

Fixes the TPCDS_DIR variable in run_tpcds

Are these changes tested?

just benchmark scripts

Are there any user-facing changes?

no need

Contributor

@comphead comphead left a comment

Thanks @gabotechs, I think it shouldn't be there. By default, the script checks the datafusion-benchmarks repo at https://github.com/apache/datafusion-benchmarks/tree/main/tpcds/data/sf1, and there is no tpcds-sf1 there.

You can specify your own DATA_DIR, for example:

export DATA_DIR=../../datafusion-benchmarks/tpcds/data/sf1/

and then run the TPC-DS benchmarks.

@gabotechs
Contributor Author

gabotechs commented Jan 12, 2026

🤔 Are you sure? I get the impression that this is why the benchmark run commands are failing:

#19761 (comment)

Also, note how the data_tpcds() function counterpart actually has this same line:

https://github.com/apache/datafusion/blob/main/benchmarks/bench.sh#L633

# Downloads TPC-DS data
data_tpcds() {
    TPCDS_DIR="${DATA_DIR}/tpcds_sf1"
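To illustrate why the two functions need to agree on the path, here is a minimal sketch. It is not the actual bench.sh code: `tpcds_dir`, `data_tpcds_sketch`, and `run_tpcds_sketch` are hypothetical stand-ins, the `./data` default for DATA_DIR is an assumption, and the directory-existence check is only a guess at how the "Please prepare TPC-DS data first" message gets triggered.

```shell
#!/usr/bin/env bash
# Hypothetical, simplified stand-ins for the data/run pair in bench.sh.
# Assumption: DATA_DIR falls back to ./data when it is not set.
DATA_DIR="${DATA_DIR:-./data}"

# Single place that decides where the TPC-DS data lives, so the
# download step and the run step cannot drift apart.
tpcds_dir() {
    echo "${DATA_DIR}/tpcds_sf1"
}

data_tpcds_sketch() {
    local dir
    dir="$(tpcds_dir)"
    mkdir -p "${dir}"
    # Placeholder for the real download and extraction step.
    touch "${dir}/placeholder.parquet"
}

run_tpcds_sketch() {
    local dir
    dir="$(tpcds_dir)"
    # Assumed check: refuse to run when the data directory is missing.
    if [ ! -d "${dir}" ]; then
        echo "Please prepare TPC-DS data first by following instructions:"
        echo "  ./bench.sh data tpcds"
        return 1
    fi
    echo "Running TPC-DS benchmark against ${dir}"
}
```

If the run step looked anywhere other than the directory the data step wrote to, the check would fail even right after a successful data step, which matches the symptom described in this PR.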

@comphead
Contributor

comphead commented Jan 13, 2026

I just ran the TPC-DS benchmark for #19635 using the commands provided in
https://github.com/apache/datafusion/blob/main/benchmarks/README.md#comparing-performance-of-main-and-a-pr
and it worked fine.

@gabotechs
Contributor Author

gabotechs commented Jan 13, 2026

Ok, I see what happened here:

IMO it would be nicer if ./benchmarks/bench.sh data tpcds && ./benchmarks/bench.sh run tpcds worked out of the box, without requiring users to set the DATA_DIR env variable, the same way it works for the TPC-H benchmark.

In fact, I'd bet the intention behind this code here https://github.com/apache/datafusion/blob/main/benchmarks/bench.sh#L644-L646 is that it works that way, as it's explicitly extracting the contents to "${DATA_DIR}/tpcds_sf1":

        echo "Extracting TPC-DS parquet data to ${TPCDS_DIR}..."
        unzip -o -j -d "${TPCDS_DIR}" "${DATA_DIR}/datafusion-benchmarks.zip" datafusion-benchmarks-main/tpcds/data/sf1/*
        echo "TPC-DS data extracted."

However, I'm happy to follow your lead here; I can survive setting up an extra env variable.
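For context on the extraction command quoted above: with Info-ZIP's unzip, `-o` overwrites existing files without prompting, `-j` discards the directory structure stored in the archive, and `-d` sets the destination directory. The parquet files are therefore expected to land directly inside "${DATA_DIR}/tpcds_sf1" rather than under a nested tpcds/data/sf1/ path. A minimal sketch of a sanity check for that layout follows; `has_flat_parquet_files` is a hypothetical helper and not part of bench.sh.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds when at least one .parquet file sits
# directly inside the given directory (the layout `unzip -j` produces).
has_flat_parquet_files() {
    local dir="$1"
    local f
    for f in "${dir}"/*.parquet; do
        # An unmatched glob stays a literal pattern, so test for a real file.
        if [ -e "${f}" ]; then
            return 0
        fi
    done
    return 1
}
```

A possible use would be calling `has_flat_parquet_files "${DATA_DIR}/tpcds_sf1"` after the data step and before the run step, to confirm the files ended up where the run step will look.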

@Dandandan Dandandan added this pull request to the merge queue Jan 13, 2026
Merged via the queue into apache:main with commit 36880d8 Jan 13, 2026
28 checks passed
@comphead
Contributor

> IMO it would be nicer if ./benchmarks/bench.sh data tpcds && ./benchmarks/bench.sh run tpcds worked out of the box, without requiring users to set the DATA_DIR env variable, the same way it works for the TPC-H benchmark.

That would be really nice, but data generation is tricky for TPC-DS, so we rely on pregenerated data in the datafusion-benchmarks repo.

@alamb
Contributor

alamb commented Jan 13, 2026

> IMO it would be nicer if ./benchmarks/bench.sh data tpcds && ./benchmarks/bench.sh run tpcds worked out of the box, without requiring users to set the DATA_DIR env variable, the same way it works for the TPC-H benchmark.

> That would be really nice, but data generation is tricky for TPC-DS, so we rely on pregenerated data in the datafusion-benchmarks repo.

BTW @clflushopt and I are working on a solution to that -- sneak peek:

de-bgunter pushed a commit to de-bgunter/datafusion that referenced this pull request Mar 24, 2026