@liutang123
Contributor

What problem does this PR solve?

When a cluster has many tablets, the cache warmup process during cluster scaling (both scale-up and scale-down) takes an extremely long time.
With an 8-node cluster (each node hosting approximately 400,000 tablets), taking 2 nodes offline takes roughly 1.5 hours.
Clipboard_Screenshot_1766060797

I found that a large portion of the time is spent on the FE sending RPCs to the BEs. Although the latency of each RPC is short, the time consumed by serially executing hundreds of thousands of RPCs is still quite considerable.

I implemented batching and delayed RPC sending, which reduced the overall time by roughly a factor of 5, bringing it down to about 15 minutes.
Clipboard_Screenshot_1766061206
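
The batching idea can be illustrated with a minimal sketch. This is not the code from this PR: `WarmupBatcher`, `WarmupRequest`, and `batchSize` are hypothetical names, and the sketch assumes that each tablet would otherwise cost one RPC. It groups pending warmup requests by destination BE and splits each group into chunks, so that one RPC carries a whole chunk instead of a single tablet.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: groups per-tablet warmup requests by destination BE and chunks them. */
final class WarmupBatcher {

    /** Hypothetical request: one tablet to warm up on one destination backend. */
    static final class WarmupRequest {
        final long tabletId;
        final long destBackendId;

        WarmupRequest(long tabletId, long destBackendId) {
            this.tabletId = tabletId;
            this.destBackendId = destBackendId;
        }
    }

    /** Returns, per destination backend, the list of batches (each with at most batchSize requests). */
    static Map<Long, List<List<WarmupRequest>>> batchByBackend(List<WarmupRequest> requests, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        // Step 1: collect all pending requests per destination backend.
        Map<Long, List<WarmupRequest>> byBackend = new LinkedHashMap<>();
        for (WarmupRequest r : requests) {
            byBackend.computeIfAbsent(r.destBackendId, k -> new ArrayList<>()).add(r);
        }
        // Step 2: split each backend's requests into chunks; one chunk would map to one RPC.
        Map<Long, List<List<WarmupRequest>>> batches = new LinkedHashMap<>();
        for (Map.Entry<Long, List<WarmupRequest>> e : byBackend.entrySet()) {
            List<WarmupRequest> all = e.getValue();
            List<List<WarmupRequest>> chunks = new ArrayList<>();
            for (int i = 0; i < all.size(); i += batchSize) {
                chunks.add(new ArrayList<>(all.subList(i, Math.min(i + batchSize, all.size()))));
            }
            batches.put(e.getKey(), chunks);
        }
        return batches;
    }

    /** Number of RPCs needed if each batch is sent as a single RPC. */
    static int rpcCount(Map<Long, List<List<WarmupRequest>>> batches) {
        int n = 0;
        for (List<List<WarmupRequest>> chunks : batches.values()) {
            n += chunks.size();
        }
        return n;
    }
}
```

Under these assumptions, 10 requests spread over 2 backends (6 and 4 tablets) with `batchSize = 4` need 3 RPCs instead of 10. "Delayed sending" would then mean accumulating requests for a short interval before calling `batchByBackend`, so that the batches are reasonably full.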

I will add unit tests (UT) and regression tests (RT) later.

Issue Number: close #xxx

Related PR: #xxx

Problem Summary:

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@hello-stephen
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@liutang123 liutang123 force-pushed the opt-cloud-balance-warmup-master branch 3 times, most recently from 18f574d to 8be3b78 Compare December 18, 2025 13:01
@liutang123 liutang123 force-pushed the opt-cloud-balance-warmup-master branch from 8be3b78 to acb754a Compare December 18, 2025 13:02
@liutang123
Contributor Author

run buildall

@doris-robot

TPC-H: Total hot run time: 34835 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit acb754a31e20f724c4d006616c487230db4ba3f6, data reload: false

------ Round 1 ----------------------------------
query	cold (ms)	hot run 1 (ms)	hot run 2 (ms)	best hot (ms)
q1	17621	4178	4055	4055
q2	2005	352	237	237
q3	10173	1322	736	736
q4	10219	833	312	312
q5	7551	2134	1884	1884
q6	190	175	139	139
q7	999	863	706	706
q8	9352	1426	1149	1149
q9	6997	5326	5316	5316
q10	6827	2399	1980	1980
q11	526	327	303	303
q12	658	720	583	583
q13	17770	3691	3031	3031
q14	283	299	272	272
q15	606	528	522	522
q16	686	679	628	628
q17	709	865	425	425
q18	7639	7016	6970	6970
q19	1092	959	603	603
q20	394	348	245	245
q21	4156	3982	3793	3793
q22	1036	999	946	946
Total cold run time: 107489 ms
Total hot run time: 34835 ms

----- Round 2, with runtime_filter_mode=off -----
q1	4087	4010	4018	4010
q2	334	413	314	314
q3	2161	2643	2314	2314
q4	1320	1725	1281	1281
q5	4205	4459	4721	4459
q6	243	184	134	134
q7	2035	2004	1816	1816
q8	2710	2519	2536	2519
q9	7614	7494	7449	7449
q10	3061	3207	2916	2916
q11	601	494	504	494
q12	718	851	642	642
q13	3514	3896	3360	3360
q14	284	310	296	296
q15	544	535	509	509
q16	661	697	624	624
q17	1192	1724	1406	1406
q18	7794	7663	7603	7603
q19	880	873	866	866
q20	1984	2134	2037	2037
q21	4783	4305	4162	4162
q22	1059	999	984	984
Total cold run time: 51784 ms
Total hot run time: 50195 ms
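
The totals in the report above can be reproduced from the rows. The sketch below assumes that the columns are the cold run, two hot runs, and the better of the two hot runs; `TpchTotals` is a hypothetical helper, not part of the benchmark scripts. It sums the cold column and the per-query minimum of the two hot runs for Round 1.

```java
/** Hypothetical helper that recomputes the Round 1 totals from the per-query rows. */
final class TpchTotals {

    // Round 1 rows copied from the report: {cold, hot run 1, hot run 2} in ms, q1..q22.
    static final int[][] ROUND1 = {
        {17621, 4178, 4055}, {2005, 352, 237},   {10173, 1322, 736}, {10219, 833, 312},
        {7551, 2134, 1884},  {190, 175, 139},    {999, 863, 706},    {9352, 1426, 1149},
        {6997, 5326, 5316},  {6827, 2399, 1980}, {526, 327, 303},    {658, 720, 583},
        {17770, 3691, 3031}, {283, 299, 272},    {606, 528, 522},    {686, 679, 628},
        {709, 865, 425},     {7639, 7016, 6970}, {1092, 959, 603},   {394, 348, 245},
        {4156, 3982, 3793},  {1036, 999, 946},
    };

    /** Sum of the cold-run column. */
    static int totalCold(int[][] rows) {
        int sum = 0;
        for (int[] r : rows) {
            sum += r[0];
        }
        return sum;
    }

    /** Sum of the faster of the two hot runs for each query. */
    static int totalHot(int[][] rows) {
        int sum = 0;
        for (int[] r : rows) {
            sum += Math.min(r[1], r[2]);
        }
        return sum;
    }
}
```

With the Round 1 rows, `totalCold` gives 107489 and `totalHot` gives 34835, matching the reported "Total cold run time" and "Total hot run time".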

@doris-robot

TPC-DS: Total hot run time: 178294 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit acb754a31e20f724c4d006616c487230db4ba3f6, data reload: false

query	cold (ms)	hot run 1 (ms)	hot run 2 (ms)	best hot (ms)
query5	5072	596	428	428
query6	336	232	233	232
query7	4226	461	269	269
query8	320	268	247	247
query9	8775	2523	2545	2523
query10	575	363	343	343
query11	15241	14763	15008	14763
query12	194	115	118	115
query13	1265	509	397	397
query14	6698	2995	2756	2756
query14_1	2666	2635	2655	2635
query15	217	204	177	177
query16	915	481	452	452
query17	1132	707	616	616
query18	2713	441	355	355
query19	243	236	207	207
query20	128	119	115	115
query21	223	143	121	121
query22	4038	4071	3962	3962
query23	16611	16214	15899	15899
query23_1	16170	16186	15999	15999
query24	7312	1628	1249	1249
query24_1	1240	1226	1249	1226
query25	591	509	459	459
query26	1265	270	166	166
query27	2743	469	313	313
query28	4466	2126	2135	2126
query29	818	587	467	467
query30	313	247	215	215
query31	806	706	647	647
query32	82	73	73	73
query33	556	348	307	307
query34	897	903	560	560
query35	766	814	726	726
query36	862	946	809	809
query37	129	90	74	74
query38	2893	2852	2815	2815
query39	784	737	704	704
query39_1	705	684	685	684
query40	224	137	119	119
query41	67	67	60	60
query42	105	102	106	102
query43	432	418	398	398
query44	1314	753	724	724
query45	195	191	183	183
query46	873	977	608	608
query47	1674	1696	1604	1604
query48	314	321	254	254
query49	649	430	346	346
query50	652	289	220	220
query51	3835	3791	3824	3791
query52	104	109	97	97
query53	318	348	287	287
query54	289	253	252	252
query55	77	76	76	76
query56	300	299	300	299
query57	1149	1176	1074	1074
query58	272	249	252	249
query59	2462	2547	2393	2393
query60	309	313	295	295
query61	203	157	149	149
query62	699	666	622	622
query63	321	295	296	295
query64	4866	1292	996	996
query65	4029	3969	3918	3918
query66	1380	447	315	315
query67	15450	14852	15014	14852
query68	8359	996	718	718
query69	484	349	310	310
query70	1080	1020	991	991
query71	374	306	278	278
query72	6070	5010	5099	5010
query73	716	640	310	310
query74	8878	8702	8555	8555
query75	3179	3181	2823	2823
query76	3934	1127	721	721
query77	582	391	278	278
query78	9442	9524	8910	8910
query79	1536	936	628	628
query80	722	661	554	554
query81	500	272	233	233
query82	215	136	104	104
query83	261	254	241	241
query84	265	125	108	108
query85	921	526	476	476
query86	374	294	279	279
query87	3018	3113	2918	2918
query88	3261	2282	2284	2282
query89	465	412	387	387
query90	2026	157	154	154
query91	181	161	143	143
query92	77	64	65	64
query93	1140	921	564	564
query94	476	286	271	271
query95	592	323	367	323
query96	607	469	211	211
query97	2257	2292	2263	2263
query98	217	194	192	192
query99	1251	1295	1224	1224
Total cold run time: 260777 ms
Total hot run time: 178294 ms

@doris-robot

ClickBench: Total hot run time: 27.69 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit acb754a31e20f724c4d006616c487230db4ba3f6, data reload: false

query	cold (s)	hot run 1 (s)	hot run 2 (s)
query1	0.05	0.05	0.05
query2	0.10	0.05	0.04
query3	0.25	0.09	0.09
query4	1.61	0.11	0.11
query5	0.26	0.25	0.26
query6	1.17	0.67	0.64
query7	0.03	0.03	0.03
query8	0.05	0.05	0.04
query9	0.57	0.51	0.50
query10	0.56	0.55	0.56
query11	0.16	0.11	0.12
query12	0.15	0.13	0.12
query13	0.61	0.60	0.61
query14	0.99	0.98	0.99
query15	0.82	0.80	0.80
query16	0.39	0.40	0.41
query17	0.99	0.99	1.00
query18	0.23	0.21	0.21
query19	1.87	1.89	1.75
query20	0.02	0.01	0.01
query21	15.43	0.28	0.15
query22	4.75	0.05	0.05
query23	15.92	0.28	0.11
query24	1.15	0.47	0.56
query25	0.06	0.11	0.06
query26	0.13	0.13	0.14
query27	0.07	0.06	0.06
query28	4.70	1.23	1.03
query29	12.58	4.16	3.43
query30	0.29	0.14	0.11
query31	2.81	0.61	0.40
query32	3.23	0.56	0.47
query33	2.94	3.05	3.09
query34	16.95	5.18	4.47
query35	4.58	4.58	4.61
query36	0.65	0.49	0.50
query37	0.10	0.07	0.07
query38	0.07	0.04	0.04
query39	0.04	0.03	0.03
query40	0.17	0.15	0.13
query41	0.08	0.03	0.03
query42	0.05	0.03	0.03
query43	0.04	0.03	0.03
Total cold run time: 97.67 s
Total hot run time: 27.69 s

@liutang123
Contributor Author

@deardeng Hi, do you have time to review this PR?

@deardeng
Contributor

deardeng commented Dec 24, 2025

I have a fix here that addresses this more thoroughly; the BE side also needs to be fixed.
#58962

@liutang123 could you help review it?

- private Map<InfightTablet, InfightTask> tabletToInfightTask = new HashMap<>();
+ private Map<InfightTablet, InfightTask> tabletToInfightTask = new ConcurrentHashMap<>();

+ private ForkJoinPool warmUpSendRpcPool = new ForkJoinPool(Runtime.getRuntime().availableProcessors());
Contributor

Using ForkJoinPool can lead to strange bugs; see https://github.com/apache/doris/pull/57382.
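
The specific bugs are not described in this thread. One commonly used alternative to a `ForkJoinPool` is a fixed-size `ThreadPoolExecutor`, sketched below under stated assumptions: `WarmupRpcSender`, `submit`, and `shutdownAndCount` are hypothetical names, and a map write stands in for the real RPC. The sketch also shows why the diff above switches to `ConcurrentHashMap`: once RPCs are sent from pool threads, the in-flight map is written by several threads at the same time.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch: sends warmup work on a dedicated fixed-size pool. */
final class WarmupRpcSender {

    // Written by pool workers concurrently, so a plain HashMap would not be safe here.
    private final Map<Long, String> tabletToInflightTask = new ConcurrentHashMap<>();

    // Dedicated fixed-size pool with an unbounded FIFO queue, instead of a ForkJoinPool.
    private final ExecutorService pool;

    WarmupRpcSender(int threads) {
        this.pool = new ThreadPoolExecutor(
                threads, threads, 0L, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<>());
    }

    /** Queues one tablet; the map write is a placeholder for sending the RPC. */
    void submit(long tabletId) {
        pool.execute(() -> tabletToInflightTask.put(tabletId, "sent"));
    }

    /** Stops accepting work, waits for queued tasks, and returns the number of tracked tablets. */
    int shutdownAndCount() {
        pool.shutdown();
        try {
            if (!pool.awaitTermination(30, TimeUnit.SECONDS)) {
                throw new IllegalStateException("warmup tasks did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for warmup tasks", e);
        }
        return tabletToInflightTask.size();
    }
}
```

Submitting 1,000 distinct tablet IDs to a 4-thread sender and then calling `shutdownAndCount()` should report 1,000 tracked entries, since every queued task runs before the pool terminates.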

@liutang123 liutang123 closed this Dec 29, 2025


4 participants