
ci: update tpc-h dataset generation to use less ram #5477

Merged
westonpace merged 1 commit into lance-format:main from westonpace:ci/tpch-gen-less-ram
Dec 15, 2025

@westonpace
Member

The TPC-H dataset generation currently materializes the table in RAM at least twice. Since this table is close to 10GB, generation can require a lot of memory, and our CI benchmarking jobs are failing.

This PR changes the dataset generation to use disk and streaming so that it uses much less RAM. There is little change to the overall data generation time.
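As a rough illustration of the "disk and streaming" idea (not the code from this PR), the sketch below uses only the Python standard library: rows are produced lazily, written to disk in fixed-size batches, and read back in batches, so at most one batch is held in memory at a time. The function names, the column names, and the CSV file format are all assumptions made for the example; the real job writes a Lance dataset.

```python
import csv
import itertools
import os
import tempfile
from typing import Iterable, Iterator, List, Tuple, TypeVar

T = TypeVar("T")
Row = Tuple[int, int, float]


def generate_rows(num_rows: int) -> Iterator[Row]:
    """Yield synthetic rows one at a time instead of building a full list."""
    for i in range(num_rows):
        yield (i, i % 7, i * 1.5)


def batched(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Group a stream into lists of at most batch_size items."""
    it = iter(items)
    while True:
        batch = list(itertools.islice(it, batch_size))
        if not batch:
            return
        yield batch


def write_streaming(path: str, rows: Iterable[Row], batch_size: int) -> int:
    """Write rows to disk batch by batch; return the number of rows written."""
    written = 0
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        # Hypothetical column names, chosen only for illustration.
        writer.writerow(["orderkey", "partkey", "price"])
        for batch in batched(rows, batch_size):
            writer.writerows(batch)
            written += len(batch)
    return written


def read_streaming(path: str, batch_size: int) -> Iterator[List[List[str]]]:
    """Read the file back in batches rather than loading it all at once."""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row
        yield from batched(reader, batch_size)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "lineitem.csv")
        total = write_streaming(path, generate_rows(10_000), batch_size=1_000)
        num_batches = sum(1 for _ in read_streaming(path, batch_size=1_000))
        print(total, num_batches)
```

In this sketch, peak memory is bounded by `batch_size` rows rather than by the full table size, at the cost of one extra pass through disk.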

@chatgpt-codex-connector

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Credits must be used to enable repository wide code reviews.

github-actions bot added the python and ci (Github Action or Test issues) labels on Dec 15, 2025
@westonpace westonpace merged commit 176a75a into lance-format:main Dec 15, 2025
12 checks passed
jackye1995 pushed a commit to jackye1995/lance that referenced this pull request Jan 21, 2026