feat: use ColumnEncoding_Kind_DIRECT_DELTA as default in offset stream #1337

Merged
gongxun0928 merged 2 commits into apache:main from gongxun0928:feature/use-delta-encoding-instead-of-zstd-for-offset-stream on Sep 3, 2025

gongxun0928 commented Aug 29, 2025

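Reduce the disk footprint of variable-length column offsets by switching the offset stream's default from Zstd compression to delta encoding (ColumnEncoding_Kind_DIRECT_DELTA). Offsets form an incrementing integer sequence, which delta encoding compresses better than Zstd, cutting their disk space by more than half while maintaining performance.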
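
As a rough illustration of why delta encoding suits an offset stream, the sketch below stores the first offset, the smallest gap, and each gap minus that minimum, then reports how many bytes the gaps would need if bit-packed. It is a minimal sketch under stated assumptions, not the PAX implementation: `DeltaEncoded`, `DeltaEncode`, `DeltaDecode`, and `PackedPayloadBytes` are hypothetical names, and the actual DIRECT_DELTA layout may differ.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical container for this sketch; not the PAX on-disk layout.
struct DeltaEncoded {
  size_t count = 0;              // number of original offsets
  uint32_t first = 0;            // first offset, stored verbatim
  uint32_t min_delta = 0;        // smallest gap, subtracted from every gap
  uint8_t bit_width = 0;         // bits needed for the largest adjusted gap
  std::vector<uint32_t> deltas;  // adjusted gaps (gap - min_delta), left unpacked here
};

// Number of bits needed to represent v (0 when v == 0).
static uint8_t BitsNeeded(uint32_t v) {
  uint8_t bits = 0;
  while (v != 0) {
    ++bits;
    v >>= 1;
  }
  return bits;
}

static DeltaEncoded DeltaEncode(const std::vector<uint32_t> &offsets) {
  // Offsets are expected to be non-decreasing, so every gap is >= 0.
  assert(std::is_sorted(offsets.begin(), offsets.end()));
  DeltaEncoded out;
  out.count = offsets.size();
  if (offsets.empty()) return out;
  out.first = offsets[0];
  if (offsets.size() == 1) return out;

  out.deltas.reserve(offsets.size() - 1);
  for (size_t i = 1; i < offsets.size(); ++i) {
    out.deltas.push_back(offsets[i] - offsets[i - 1]);
  }
  out.min_delta = *std::min_element(out.deltas.begin(), out.deltas.end());

  uint32_t max_adjusted = 0;
  for (uint32_t &d : out.deltas) {
    d -= out.min_delta;
    max_adjusted = std::max(max_adjusted, d);
  }
  out.bit_width = BitsNeeded(max_adjusted);
  return out;
}

static std::vector<uint32_t> DeltaDecode(const DeltaEncoded &enc) {
  std::vector<uint32_t> offsets;
  if (enc.count == 0) return offsets;
  offsets.reserve(enc.count);
  offsets.push_back(enc.first);
  for (uint32_t d : enc.deltas) {
    offsets.push_back(offsets.back() + d + enc.min_delta);
  }
  return offsets;
}

// Payload size if the adjusted gaps were bit-packed at bit_width bits each
// (header fields excluded).
static size_t PackedPayloadBytes(const DeltaEncoded &enc) {
  return (enc.deltas.size() * enc.bit_width + 7) / 8;
}
```

For the offsets {0, 5, 9, 20, 21} the gaps are 5, 4, 11, 1. After subtracting the minimum gap of 1, the largest value is 10, so each gap fits in 4 bits and the four gaps pack into 2 bytes (plus a small header), versus 20 bytes for the five raw uint32_t offsets.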

The following is a comparison of file sizes for different encoding methods on TPC-DS 20G:

Name                   PAX(ZSTD)    AOCS_SIZE    PAX(Delta)    PAX(Delta) / AOCS_SIZE * 100%
call_center               12 kB       231 kB      10185 bytes        4.31%
catalog_page             499 kB       653 kB       393 kB           60.18%
catalog_returns          240 MB       171 MB       178 MB          104.09%
catalog_sales           3033 MB      1837 MB      1977 MB          107.63%
customer                  16 MB        12 MB        12 MB          100.00%
customer_address        7008 kB      3161 kB      3115 kB           98.54%
customer_demographics     28 MB      8164 kB      9292 kB          113.82%
date_dim                3193 kB      1406 kB      1249 kB           88.85%
household_demographics    42 kB       248 kB        28 kB           11.29%
income_band            1239 bytes     225 kB      1239 bytes         0.54%
inventory                 36 MB        71 MB        36 MB           50.70%
item                    3084 kB      2479 kB      2227 kB           89.84%
promotion                 27 kB       239 kB        18 kB            7.53%
reason                 2730 bytes     226 kB      2280 bytes         0.99%
ship_mode              3894 bytes     227 kB      3315 bytes         1.43%
store                     23 kB       239 kB        18 kB            7.53%
store_returns            400 MB       265 MB       277 MB          104.53%
store_sales             4173 MB      2384 MB      2554 MB          107.12%
time_dim                1702 kB       819 kB       627 kB           76.56%
warehouse              5394 bytes     227 kB      4698 bytes         2.02%
web_page                  21 kB       236 kB        14 kB            5.93%
web_returns              116 MB        83 MB        85 MB          102.41%
web_sales               1513 MB       908 MB       982 MB          108.15%
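
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Builds n strictly increasing offsets starting at 0; each step is uniform in [1, 256].
static std::vector<uint32_t> GenerateMonotonicOffsets(size_t n, uint32_t seed) {
  std::vector<uint32_t> offsets(n);
  if (n == 0) return offsets;  // avoid writing offsets[0] on an empty vector
  offsets[0] = 0;
  std::mt19937 rng(seed);
  std::uniform_int_distribution<int> step_dist(1, 256);
  for (size_t i = 1; i < n; ++i) {
    offsets[i] = offsets[i - 1] + static_cast<uint32_t>(step_dist(rng));
  }
  return offsets;
}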

The offsets array used in the benchmark was generated by GenerateMonotonicOffsets(1024 * 1024, 0x12345).

Benchmark Name        Time (ns)    Speed            Additional Info
BM_RLEV2_Encode        11531021    346.89 Mi/s      raw_kb=4.096k, rle_kb=1.91948k
BM_RLEV2_Decode          485947    8.03843 Gi/s
BM_Delta_Encode         3560027    1.09725 Gi/s     delta_kb=1.12001k
BM_Delta_Decode         1523382    2.56423 Gi/s
BM_ZSTD_Compress       17258801    231.765 Mi/s     zstd_kb=3.42082k
BM_ZSTD_Decompress      4416240    905.747 Mi/s
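
To put the size counters above in relative terms, the following sketch computes each encoded size as a percentage of a reference size. It assumes the counters use SI-suffix formatting, so that raw_kb=4.096k means 4096 kB; `PercentOf` and `PrintOffsetRatios` are hypothetical helper names introduced only for this illustration.

```cpp
#include <cassert>
#include <cstdio>

// Encoded size as a percentage of a reference size (both in the same unit).
static double PercentOf(double encoded_kb, double reference_kb) {
  assert(reference_kb > 0.0);
  return 100.0 * encoded_kb / reference_kb;
}

// Prints the ratios for the counters in the benchmark table.
// Assumption: "4.096k" is read as 4096 kB, "1.12001k" as 1120.01 kB, and so on.
static void PrintOffsetRatios() {
  const double raw_kb = 4096.0;
  const double rle_kb = 1919.48;
  const double delta_kb = 1120.01;
  const double zstd_kb = 3420.82;
  std::printf("RLEv2 / raw : %.1f%%\n", PercentOf(rle_kb, raw_kb));
  std::printf("Delta / raw : %.1f%%\n", PercentOf(delta_kb, raw_kb));
  std::printf("ZSTD  / raw : %.1f%%\n", PercentOf(zstd_kb, raw_kb));
  std::printf("Delta / ZSTD: %.1f%%\n", PercentOf(delta_kb, zstd_kb));
}
```

Under that reading, the delta-encoded offsets come to about 27% of the raw size and roughly a third of the Zstd output for this synthetic input.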

my-ship-it requested a review from gfphoenix78 on August 29, 2025, 05:49

my-ship-it commented Aug 29, 2025

Just formatted the table for better viewing, and the compression result looks very good.

Comment thread contrib/pax_storage/src/cpp/storage/columns/pax_encoding_non_fixed_column.cc Outdated
Comment thread contrib/pax_storage/src/cpp/storage/columns/pax_delta_encoding.cc Outdated
Comment thread contrib/pax_storage/src/cpp/storage/columns/pax_delta_encoding.cc Outdated
Comment thread contrib/pax_storage/src/cpp/storage/columns/pax_delta_encoding.cc Outdated
Comment thread contrib/pax_storage/src/cpp/storage/columns/pax_delta_encoding.h Outdated
Comment thread contrib/pax_storage/src/cpp/storage/columns/pax_delta_encoding.cc Outdated
gongxun0928 force-pushed the feature/use-delta-encoding-instead-of-zstd-for-offset-stream branch from 761e6d8 to 970e0ea on August 29, 2025, 16:50
Comment thread contrib/pax_storage/src/cpp/storage/columns/pax_delta_encoding.cc
Comment thread contrib/pax_storage/src/cpp/storage/columns/pax_delta_encoding.cc
Stop recording payload_size to reduce offset disk usage by at least 5%.
Add tests for delta encoding
gongxun0928 force-pushed the feature/use-delta-encoding-instead-of-zstd-for-offset-stream branch from 867999a to 0374762 on September 1, 2025, 18:12
my-ship-it left a comment

LGTM

gongxun0928 merged commit 29d2a2a into apache:main on Sep 3, 2025
27 checks passed
hw118118 pushed a commit to hw118118/cloudberrydb that referenced this pull request Jan 27, 2026
apache#1337)

hw118118 pushed a commit to hw118118/cloudberrydb that referenced this pull request Jan 27, 2026
apache#1337)

oppenheimer01 pushed a commit to oppenheimer01/cloudberrydb that referenced this pull request Apr 12, 2026
apache#1337)

oppenheimer01 pushed a commit to oppenheimer01/cloudberrydb that referenced this pull request Apr 21, 2026
apache#1337)
