fix: generate integer keys instead of floats in TPC-DS data #31

Open
Dandandan wants to merge 1 commit into apache:main from Dandandan:fix-tpcds-integer-keys

@Dandandan Dandandan commented Apr 9, 2026

Summary

  • Fix tpcdsgen.py trailing pipe detection to work with dsdgen v4.0.0 (which no longer adds a trailing | as field terminator)
  • Switch from snappy to zstd compression
  • Regenerate all SF1 parquet data with dsdgen v4.0.0 and current datafusion-python, fixing nullable integer columns (surrogate keys, quantities) that were incorrectly stored as double/float64 instead of int32
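To illustrate the trailing-pipe point above, here is a minimal sketch of one way to parse dsdgen rows whether or not each row ends with a `|` terminator. It is not the code in tpcdsgen.py; the function name `split_row` and the `expected_columns` parameter are hypothetical. It assumes the table's column count is known from the schema, which matters because a row that ends in `|` is ambiguous on its own: it could be a terminated row, or an unterminated row whose last column is null.

```python
def split_row(line: str, expected_columns: int) -> list[str]:
    """Split one pipe-delimited row into exactly `expected_columns` fields.

    Hypothetical helper: accepts rows with a trailing '|' terminator
    as well as rows without one.
    """
    fields = line.rstrip("\r\n").split("|")

    # With a trailing terminator, split() yields one extra empty string
    # at the end. Drop it only when the count confirms it is a terminator
    # and not a genuinely empty (null) last column.
    if len(fields) == expected_columns + 1 and fields[-1] == "":
        fields.pop()

    if len(fields) != expected_columns:
        raise ValueError(
            f"expected {expected_columns} fields, got {len(fields)}: {line!r}"
        )
    return fields
```

Comparing against the expected column count, rather than only checking whether the line ends with `|`, keeps a null last column from being mistaken for a terminator.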

Background

The pre-existing SF1 parquet data had ~100 columns across 15 tables stored as double that should be int32 (e.g. ss_sold_date_sk, cs_bill_customer_sk, ss_quantity). This was likely caused by an older version of datafusion-python or dsdgen. The current datafusion-python correctly writes int32 columns when an explicit schema is provided.

Test plan

  • Verified the old parquet files on main have double types for key columns
  • Verified all 24 regenerated parquet files have zero double columns
  • Verified null counts match between old and new data
  • All files under GitHub's 100 MiB file size limit

🤖 Generated with Claude Code

@Dandandan
Author

@comphead FYI

The pre-existing SF1 parquet data had nullable integer columns (surrogate
keys, quantities, etc.) incorrectly stored as double (float64). This was
likely caused by an older version of datafusion-python or dsdgen.

Regenerated all SF1 parquet data with dsdgen v4.0.0 and the current
datafusion-python, which correctly writes int32 columns.

Also fixed trailing pipe detection in tpcdsgen.py to work with dsdgen
v4.0.0 (which no longer adds a trailing | as field terminator), and
switched to zstd compression.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@Dandandan Dandandan force-pushed the fix-tpcds-integer-keys branch from 97563d3 to f1560cf on April 9, 2026 15:16
