feat: use ColumnEncoding_Kind_DIRECT_DELTA as default in offset stream #1337
Merged
gongxun0928 merged 2 commits into apache:main on Sep 3, 2025
Just formatted the table for better viewing, and the compression result looks very good.
my-ship-it reviewed on Aug 29, 2025
my-ship-it reviewed on Aug 29, 2025
761e6d8 to 970e0ea (Compare)
my-ship-it reviewed on Sep 1, 2025
my-ship-it reviewed on Sep 1, 2025
Optimize performance of variable-length column offsets by switching from Zstd to delta encoding. This approach better compresses incremental integer sequences, cutting disk space by more than half while maintaining performance.

The following is a comparison of file sizes for different encoding methods on TPC-DS 20G:

| Name | PAX(ZSTD) | AOCS_SIZE | PAX(Delta) | PAX SIZE / AOCS * 100% |
|---|---|---|---|---|
| call_center | 12 kB | 231 kB | 10185 bytes | 4.31% |
| catalog_page | 499 kB | 653 kB | 393 kB | 60.18% |
| catalog_returns | 240 MB | 171 MB | 178 MB | 104.09% |
| catalog_sales | 3033 MB | 1837 MB | 1977 MB | 107.63% |
| customer | 16 MB | 12 MB | 12 MB | 100.00% |
| customer_address | 7008 kB | 3161 kB | 3115 kB | 98.54% |
| customer_demographics | 28 MB | 8164 kB | 9292 kB | 113.82% |
| date_dim | 3193 kB | 1406 kB | 1249 kB | 88.85% |
| household_demographics | 42 kB | 248 kB | 28 kB | 11.29% |
| income_band | 1239 bytes | 225 kB | 1239 bytes | 0.54% |
| inventory | 36 MB | 71 MB | 36 MB | 50.70% |
| item | 3084 kB | 2479 kB | 2227 kB | 89.84% |
| promotion | 27 kB | 239 kB | 18 kB | 7.53% |
| reason | 2730 bytes | 226 kB | 2280 bytes | 0.99% |
| ship_mode | 3894 bytes | 227 kB | 3315 bytes | 1.43% |
| store | 23 kB | 239 kB | 18 kB | 7.53% |
| store_returns | 400 MB | 265 MB | 277 MB | 104.53% |
| store_sales | 4173 MB | 2384 MB | 2554 MB | 107.12% |
| time_dim | 1702 kB | 819 kB | 627 kB | 76.56% |
| warehouse | 5394 bytes | 227 kB | 4698 bytes | 2.02% |
| web_page | 21 kB | 236 kB | 14 kB | 5.93% |
| web_returns | 116 MB | 83 MB | 85 MB | 102.41% |
| web_sales | 1513 MB | 908 MB | 982 MB | 108.15% |
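The commit message attributes the saving to delta encoding of incremental integer sequences. The following is a minimal Python sketch of that general idea, not the PAX implementation of ColumnEncoding_Kind_DIRECT_DELTA: the function names, the sample offsets, and the fixed-width bit-packing estimate are all assumptions made for illustration.

```python
from typing import List


def delta_encode(offsets: List[int]) -> List[int]:
    """Keep the first value, then store each value minus its predecessor."""
    if not offsets:
        return []
    return [offsets[0]] + [b - a for a, b in zip(offsets, offsets[1:])]


def delta_decode(deltas: List[int]) -> List[int]:
    """Rebuild the original offsets with a running sum."""
    out: List[int] = []
    total = 0
    for d in deltas:
        total += d
        out.append(total)
    return out


def bits_needed(values: List[int]) -> int:
    """Bits per value if every value is packed at the width of the largest one."""
    return max(1, max(values, default=0).bit_length())


# Hypothetical offsets for variable-length values of lengths 7, 5, 9, 6, 8, 7.
offsets = [0, 7, 12, 21, 27, 35, 42]
deltas = delta_encode(offsets)

print("deltas:", deltas)
print("bits per raw offset:", bits_needed(offsets))
print("bits per delta:", bits_needed(deltas))
```

In this sketch the largest offset grows with the amount of data stored, while the largest delta is bounded by the longest single value, so the per-value width of the deltas stays small as the column grows. How the real encoder packs the deltas is not shown in this PR page.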
Stop recording payload_size to reduce offset disk usage by at least 5%. Add tests for delta encoding.
867999a to 0374762 (Compare)
gfphoenix78 approved these changes on Sep 3, 2025
hw118118 pushed a commit to hw118118/cloudberrydb that referenced this pull request on Jan 27, 2026
hw118118 pushed a commit to hw118118/cloudberrydb that referenced this pull request on Jan 27, 2026
oppenheimer01 pushed a commit to oppenheimer01/cloudberrydb that referenced this pull request on Apr 12, 2026
oppenheimer01 pushed a commit to oppenheimer01/cloudberrydb that referenced this pull request on Apr 21, 2026
The offsets array for the benchmark was generated by GenerateMonotonicOffsets(1024*1024, 0x12345).
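The body of GenerateMonotonicOffsets is not shown on this page. As a rough illustration of what such a helper might do, the Python sketch below assumes the two arguments are an element count and a random seed, and that each offset advances by a random positive step. The name generate_monotonic_offsets, the max_step parameter, and the step distribution are hypothetical and are not taken from the PR.

```python
import random
from typing import List


def generate_monotonic_offsets(count: int, seed: int, max_step: int = 64) -> List[int]:
    """Return `count` strictly increasing offsets, reproducible for a given seed.

    Assumption: each step is a uniform random length in [1, max_step],
    standing in for the byte length of one variable-length value.
    """
    rng = random.Random(seed)  # private generator, so the global RNG state is untouched
    offsets: List[int] = []
    current = 0
    for _ in range(count):
        offsets.append(current)
        current += rng.randint(1, max_step)
    return offsets


# A smaller count than the 1024*1024 used in the PR keeps the example quick.
sample = generate_monotonic_offsets(1024, 0x12345)
print(len(sample), sample[:5])
```

Fixing the seed makes the input reproducible across runs, which is presumably why the benchmark call passes a constant such as 0x12345.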
What does this PR do?
Type of Change
Breaking Changes
Test Plan
make installcheck
make -C src/test installcheck-cbdb-parallel
Impact
Performance:
User-facing changes:
Dependencies:
Checklist
Additional Context
CI Skip Instructions