[enhancement](parquet)Optimize the performance of parquet reader when decode RLE_DICTIONARY encoding #57208

Merged
morningman merged 3 commits into apache:master from hubgeter:improve_rle_encoding
Oct 31, 2025

@hubgeter
Contributor

@hubgeter hubgeter commented Oct 21, 2025

What problem does this PR solve?

Problem Summary:
When decoding RLE_DICTIONARY-encoded data, the parquet reader copies every value with memcpy, regardless of its type. For fixed-width primitive types such as INT32 and INT64, however, direct assignment is faster than memcpy.

In Parquet dictionary encoding, values are fetched from the dictionary one index at a time, so the source data is not contiguous and each memcpy copies only a few bytes. Looking at the implementation of memcpy, we can see that for such small sizes __builtin_memcpy is used instead, and __builtin_memcpy essentially behaves like a series of simple assignments. The corresponding assembly code can be seen here: https://godbolt.org/z/r9Ma1ozvd.
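
The difference between the two copy styles can be sketched roughly as follows. This is only an illustration, not the code changed in this PR: the function names `decode_with_memcpy` and `decode_with_assignment` are hypothetical, and it assumes the generic path only knows the value width at run time while the typed path fixes it through a template parameter.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical generic path: the value width is a run-time parameter, so each
// dictionary lookup issues one tiny memcpy of `type_length` bytes.
std::vector<uint8_t> decode_with_memcpy(const std::vector<uint8_t>& dict,
                                        std::size_t type_length,
                                        const std::vector<uint32_t>& indices) {
    std::vector<uint8_t> out(indices.size() * type_length);
    for (std::size_t i = 0; i < indices.size(); ++i) {
        std::memcpy(out.data() + i * type_length,
                    dict.data() + indices[i] * type_length,
                    type_length);
    }
    return out;
}

// Hypothetical typed path: the width is fixed by T at compile time, so each
// dictionary lookup is a plain load followed by a store.
template <typename T>
std::vector<T> decode_with_assignment(const std::vector<T>& dict,
                                      const std::vector<uint32_t>& indices) {
    std::vector<T> out(indices.size());
    for (std::size_t i = 0; i < indices.size(); ++i) {
        out[i] = dict[indices[i]];
    }
    return out;
}
```

Both functions produce the same bytes for a fixed-width type; the typed version simply gives the compiler the element size up front.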

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@Thearas
Contributor

Thearas commented Oct 21, 2025

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@hubgeter
Contributor Author

run buildall

@doris-robot

ClickBench: Total hot run time: 28.68 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 7beb05d58e98a4f1f520775ef892cc1bb043a28e, data reload: false

query	cold (s)	hot 1 (s)	hot 2 (s)
query1	0.06	0.05	0.05
query2	0.09	0.06	0.04
query3	0.25	0.09	0.09
query4	1.61	0.12	0.12
query5	0.29	0.26	0.25
query6	1.20	0.67	0.66
query7	0.03	0.02	0.03
query8	0.05	0.04	0.04
query9	0.62	0.53	0.53
query10	0.60	0.59	0.58
query11	0.17	0.12	0.12
query12	0.16	0.13	0.13
query13	0.64	0.64	0.61
query14	1.03	1.03	1.03
query15	0.89	0.86	0.86
query16	0.42	0.42	0.41
query17	1.06	1.11	1.09
query18	0.22	0.20	0.21
query19	1.96	1.92	1.89
query20	0.02	0.01	0.02
query21	15.42	0.19	0.13
query22	5.09	0.08	0.04
query23	15.64	0.27	0.11
query24	2.45	1.69	0.31
query25	0.09	0.06	0.07
query26	0.14	0.14	0.15
query27	0.06	0.06	0.06
query28	5.04	1.18	0.93
query29	12.58	4.13	3.39
query30	0.28	0.14	0.11
query31	2.83	0.63	0.40
query32	3.24	0.56	0.48
query33	3.19	3.10	3.17
query34	16.15	5.55	4.87
query35	4.92	4.93	4.93
query36	0.70	0.52	0.51
query37	0.10	0.07	0.07
query38	0.07	0.04	0.04
query39	0.04	0.04	0.03
query40	0.19	0.16	0.14
query41	0.09	0.03	0.03
query42	0.04	0.03	0.04
query43	0.05	0.04	0.04
Total cold run time: 99.77 s
Total hot run time: 28.68 s

@hubgeter
Contributor Author

run buildall

@hubgeter hubgeter changed the title from "[enhancement](parquet)improve parquet fixedLengthDict decode performance" to "[enhancement](parquet)Optimize the performance of parquet reader when decode RLE_DICTIONARY encoding" on Oct 22, 2025
@hubgeter
Contributor Author

run buildall

@doris-robot

TPC-DS: Total hot run time: 189993 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 1ab29c9b2087d900de21f0abd487788275dad991, data reload: false

query	cold (ms)	hot 1 (ms)	hot 2 (ms)	best hot (ms)
query1	1083	425	407	407
query2	6560	1758	1736	1736
query3	6766	227	226	226
query4	26243	23469	23147	23147
query5	4425	671	482	482
query6	345	261	232	232
query7	4667	503	300	300
query8	310	286	272	272
query9	8756	2615	2585	2585
query10	515	368	285	285
query11	15701	15019	14839	14839
query12	200	122	115	115
query13	1676	567	439	439
query14	12631	9300	9360	9300
query15	213	195	180	180
query16	7694	684	490	490
query17	1616	793	613	613
query18	2736	444	344	344
query19	224	243	191	191
query20	145	133	145	133
query21	232	137	118	118
query22	4617	4635	4683	4635
query23	35451	33663	34100	33663
query24	8387	2485	2500	2485
query25	603	541	476	476
query26	1367	293	164	164
query27	2931	542	377	377
query28	4418	2234	2179	2179
query29	828	824	550	550
query30	304	233	210	210
query31	946	842	786	786
query32	85	78	73	73
query33	583	399	346	346
query34	900	913	548	548
query35	879	873	824	824
query36	1008	1082	960	960
query37	138	109	87	87
query38	3535	3612	3515	3515
query39	1492	1414	1425	1414
query40	218	124	121	121
query41	63	58	59	58
query42	123	114	151	114
query43	498	496	476	476
query44	1224	733	744	733
query45	183	181	177	177
query46	916	1021	633	633
query47	1764	1811	1713	1713
query48	403	411	319	319
query49	787	498	419	419
query50	660	707	412	412
query51	3908	3886	3937	3886
query52	108	107	101	101
query53	250	275	201	201
query54	599	600	527	527
query55	87	80	86	80
query56	336	335	312	312
query57	1158	1197	1155	1155
query58	294	281	287	281
query59	2539	2677	2563	2563
query60	368	368	361	361
query61	189	184	204	184
query62	798	723	680	680
query63	233	195	197	195
query64	4416	1162	867	867
query65	4024	3952	3980	3952
query66	1111	434	345	345
query67	15417	15206	14900	14900
query68	8254	885	595	595
query69	488	327	293	293
query70	1390	1333	1239	1239
query71	429	355	329	329
query72	5871	4863	4810	4810
query73	649	591	352	352
query74	8909	9094	8674	8674
query75	3379	3346	2928	2928
query76	3288	1204	741	741
query77	512	394	315	315
query78	9526	9672	8961	8961
query79	2127	811	636	636
query80	711	563	516	516
query81	516	270	227	227
query82	223	170	143	143
query83	280	265	263	263
query84	259	111	94	94
query85	867	537	422	422
query86	381	307	294	294
query87	3693	3746	3660	3660
query88	3272	2299	2320	2299
query89	390	325	313	313
query90	2029	223	225	223
query91	170	164	141	141
query92	89	79	69	69
query93	2218	976	644	644
query94	708	459	347	347
query95	420	330	329	329
query96	496	614	293	293
query97	2933	2979	2869	2869
query98	245	211	237	211
query99	1357	1390	1272	1272
Total cold run time: 279963 ms
Total hot run time: 189993 ms

@doris-robot

ClickBench: Total hot run time: 27.59 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 1ab29c9b2087d900de21f0abd487788275dad991, data reload: false

query	cold (s)	hot 1 (s)	hot 2 (s)
query1	0.06	0.06	0.05
query2	0.10	0.06	0.06
query3	0.27	0.08	0.08
query4	1.60	0.12	0.12
query5	0.27	0.28	0.26
query6	1.18	0.66	0.66
query7	0.03	0.03	0.03
query8	0.06	0.04	0.04
query9	0.63	0.54	0.51
query10	0.59	0.58	0.58
query11	0.17	0.12	0.11
query12	0.15	0.12	0.12
query13	0.62	0.60	0.60
query14	1.05	1.01	1.01
query15	0.85	0.83	0.84
query16	0.40	0.40	0.39
query17	1.04	1.04	1.02
query18	0.22	0.21	0.20
query19	2.08	1.88	1.76
query20	0.01	0.01	0.02
query21	15.43	0.20	0.13
query22	4.92	0.07	0.05
query23	15.67	0.26	0.10
query24	2.48	0.51	0.38
query25	0.07	0.06	0.06
query26	0.14	0.13	0.14
query27	0.07	0.05	0.07
query28	4.41	1.18	0.93
query29	12.65	3.98	3.33
query30	0.29	0.15	0.11
query31	2.81	0.62	0.39
query32	3.26	0.55	0.48
query33	3.02	3.12	3.02
query34	15.76	5.20	4.56
query35	4.62	4.54	4.58
query36	0.67	0.52	0.51
query37	0.10	0.07	0.07
query38	0.06	0.04	0.04
query39	0.04	0.04	0.03
query40	0.18	0.14	0.14
query41	0.10	0.03	0.04
query42	0.04	0.04	0.04
query43	0.04	0.03	0.04
Total cold run time: 98.21 s
Total hot run time: 27.59 s

@hello-stephen
Contributor

BE UT Coverage Report

Increment line coverage 90.00% (27/30) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 52.64% (17933/34065)
Line Coverage 37.88% (162710/429497)
Region Coverage 32.30% (124156/384326)
Branch Coverage 33.70% (54372/161344)

@github-actions
Contributor

PR approved by at least one committer and no changes requested.

@github-actions github-actions bot added the approved Indicates a PR has been approved by one committer. label Oct 28, 2025
@github-actions
Contributor

PR approved by anyone and no changes requested.

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 100.00% (30/30) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 71.38% (23873/33447)
Line Coverage 57.78% (248338/429828)
Region Coverage 52.84% (205861/389589)
Branch Coverage 54.62% (88662/162329)

Contributor

@kaka11chen kaka11chen left a comment

LGTM

@morningman morningman merged commit 6fe6656 into apache:master Oct 31, 2025
29 of 31 checks passed
github-actions bot pushed a commit that referenced this pull request Oct 31, 2025
… decode RLE_DICTIONARY encoding (#57208)
dwdwqfwe pushed a commit to dwdwqfwe/doris that referenced this pull request Oct 31, 2025
… decode RLE_DICTIONARY encoding (apache#57208)
hubgeter added a commit to hubgeter/doris that referenced this pull request Nov 3, 2025
… decode RLE_DICTIONARY encoding (apache#57208)
morningman pushed a commit that referenced this pull request Nov 4, 2025
yiguolei pushed a commit that referenced this pull request Nov 10, 2025
… reader when decode RLE_DICTIONARY encoding #57208 (#57563)

Cherry-picked from #57208

Co-authored-by: daidai <changyuwei@selectdb.com>
@yiguolei yiguolei mentioned this pull request Dec 2, 2025
Labels

approved Indicates a PR has been approved by one committer. dev/3.1.3-merged dev/4.0.2-merged reviewed

7 participants