
[feature](cloud) Support balance sync warm up #56164

Merged
gavinchou merged 6 commits into apache:master from deardeng:balance-warm-up-sync on Oct 30, 2025


@deardeng
Contributor

What problem does this PR solve?

  1. Adds a new balance type for cloud mode: during balancing, the FE changes the mapping of which BE serves a tablet only after all file cache data on the old BE has been migrated to the new BE.
  2. Fixes the destination BE not syncing rowsets in time: if the source BE has new rowset meta that the destination BE has not yet synced, download_file_cache_block returns an error.
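The ordering these two changes rely on can be sketched as a small model. This is a minimal Python sketch under stated assumptions, not Doris code: `Backend`, `warm_up_then_switch`, and the `mapping` dict standing in for the FE's tablet-to-BE mapping are all hypothetical. Only the name `download_file_cache_block` comes from the PR text, and its behavior here is assumed. The sketch syncs rowset meta to the destination BE first, migrates the cached data, and only then repoints the tablet.

```python
from dataclasses import dataclass, field


@dataclass
class Backend:
    """Hypothetical stand-in for a cloud-mode BE (not a Doris class)."""
    name: str
    rowsets: set = field(default_factory=set)     # rowset ids whose meta this BE has synced
    file_cache: set = field(default_factory=set)  # rowset ids whose data is in the local file cache

    def download_file_cache_block(self, rowset_id):
        # Assumed behavior modeling item 2: downloading cache data for a rowset
        # whose meta has not been synced to this BE is an error.
        if rowset_id not in self.rowsets:
            raise LookupError(f"{self.name}: rowset {rowset_id} meta not synced")
        self.file_cache.add(rowset_id)


def warm_up_then_switch(mapping, tablet, src, dest):
    """Hypothetical helper: move `tablet` from src to dest, switching only after warm-up."""
    # Item 2: sync rowset meta first, so rowsets created on src are known to dest.
    dest.rowsets |= src.rowsets
    # Item 1: migrate every rowset that is cached on src.
    for rowset_id in sorted(src.file_cache):
        dest.download_file_cache_block(rowset_id)
    # Switch the serving BE only if nothing cached on src is missing on dest.
    if not (src.file_cache - dest.file_cache):
        mapping[tablet] = dest.name
    return mapping[tablet]


if __name__ == "__main__":
    src = Backend("be1", rowsets={1, 2, 3}, file_cache={1, 2, 3})
    dest = Backend("be2")
    mapping = {"t1": "be1"}
    print(warm_up_then_switch(mapping, "t1", src, dest))  # prints "be2"
```

If the meta-sync line is removed, the first `download_file_cache_block` call raises and the tablet stays mapped to `be1` in this model, which is the failure mode item 2 addresses.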

Issue Number: close #xxx

Related PR: #xxx

Problem Summary:

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes. later

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@hello-stephen
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@deardeng
Contributor Author

run buildall

@doris-robot

TPC-H: Total hot run time: 35364 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 2fcbc6a50186ffdbe6ab9d26e1547012d4ef270b, data reload: false

------ Round 1 ----------------------------------
query	cold (ms)	hot 1 (ms)	hot 2 (ms)	best hot (ms)
q1	17623	5289	5197	5197
q2	2054	325	212	212
q3	10218	1329	745	745
q4	10241	1003	556	556
q5	7533	2437	2409	2409
q6	193	173	142	142
q7	1022	776	672	672
q8	9378	1349	1243	1243
q9	7077	5183	5233	5183
q10	7026	2445	2006	2006
q11	501	320	294	294
q12	390	364	249	249
q13	17802	3693	3055	3055
q14	242	240	221	221
q15	596	492	499	492
q16	1036	1029	960	960
q17	624	907	383	383
q18	7558	7078	7020	7020
q19	1091	958	602	602
q20	361	358	258	258
q21	4155	3282	2492	2492
q22	1059	1038	973	973
Total cold run time: 107780 ms
Total hot run time: 35364 ms

----- Round 2, with runtime_filter_mode=off -----
query	cold (ms)	hot 1 (ms)	hot 2 (ms)	best hot (ms)
q1	5225	5185	5216	5185
q2	254	348	233	233
q3	2212	2684	2333	2333
q4	1403	1805	1380	1380
q5	4414	4610	4413	4413
q6	238	180	144	144
q7	2093	2016	1859	1859
q8	2666	2771	2620	2620
q9	7577	7264	7438	7264
q10	3166	3356	2918	2918
q11	583	557	511	511
q12	721	785	640	640
q13	3510	3995	3367	3367
q14	308	304	284	284
q15	549	480	479	479
q16	1043	1133	1072	1072
q17	1198	1561	1463	1463
q18	7914	7851	7290	7290
q19	829	863	906	863
q20	1914	2021	1888	1888
q21	5131	4375	4398	4375
q22	1087	1054	998	998
Total cold run time: 54035 ms
Total hot run time: 51579 ms

@doris-robot

TPC-DS: Total hot run time: 188661 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 2fcbc6a50186ffdbe6ab9d26e1547012d4ef270b, data reload: false

query	cold (ms)	hot 1 (ms)	hot 2 (ms)	best hot (ms)
query1	1070	435	401	401
query2	6557	1700	1706	1700
query3	6748	223	225	223
query4	26266	23498	23001	23001
query5	4434	645	475	475
query6	341	242	232	232
query7	4654	522	301	301
query8	341	278	250	250
query9	8691	2638	2675	2638
query10	475	386	306	306
query11	15695	15137	14783	14783
query12	181	119	116	116
query13	1734	580	440	440
query14	11306	9383	9444	9383
query15	211	194	180	180
query16	7692	691	525	525
query17	1250	781	673	673
query18	2063	494	334	334
query19	216	196	168	168
query20	136	120	121	120
query21	209	131	114	114
query22	4331	4320	3953	3953
query23	33900	32935	33169	32935
query24	8495	2458	2416	2416
query25	560	512	450	450
query26	1235	279	162	162
query27	2727	519	353	353
query28	4348	2238	2213	2213
query29	801	631	497	497
query30	294	225	198	198
query31	931	820	738	738
query32	80	75	69	69
query33	596	378	334	334
query34	811	866	552	552
query35	820	869	751	751
query36	983	992	911	911
query37	121	111	89	89
query38	3502	3530	3499	3499
query39	1510	1450	1429	1429
query40	218	129	122	122
query41	69	66	62	62
query42	128	117	119	117
query43	516	529	474	474
query44	1386	887	847	847
query45	196	177	171	171
query46	887	1060	652	652
query47	1759	1813	1717	1717
query48	408	435	319	319
query49	770	517	431	431
query50	684	702	414	414
query51	3878	3966	3953	3953
query52	113	110	106	106
query53	247	274	200	200
query54	617	599	540	540
query55	87	88	84	84
query56	321	320	323	320
query57	1213	1207	1117	1117
query58	276	270	270	270
query59	2592	2716	2608	2608
query60	346	343	329	329
query61	163	157	181	157
query62	839	715	667	667
query63	232	194	195	194
query64	4483	1152	832	832
query65	4050	3932	3967	3932
query66	1081	460	347	347
query67	15354	15236	15008	15008
query68	9071	954	591	591
query69	495	322	273	273
query70	1405	1314	1333	1314
query71	569	341	312	312
query72	5970	5056	5281	5056
query73	780	713	375	375
query74	9382	8834	8874	8834
query75	4371	3324	2871	2871
query76	3697	1180	752	752
query77	821	434	318	318
query78	9565	9782	8920	8920
query79	1824	815	599	599
query80	698	578	503	503
query81	485	264	233	233
query82	332	165	135	135
query83	294	273	253	253
query84	303	114	106	106
query85	856	478	428	428
query86	363	314	314	314
query87	3799	3709	3607	3607
query88	2979	2254	2246	2246
query89	408	340	307	307
query90	2082	224	228	224
query91	168	168	137	137
query92	84	68	61	61
query93	1336	1058	655	655
query94	686	440	335	335
query95	445	324	315	315
query96	483	599	282	282
query97	2954	3026	2865	2865
query98	239	211	211	211
query99	1443	1423	1327	1327
Total cold run time: 277615 ms
Total hot run time: 188661 ms

@doris-robot

ClickBench: Total hot run time: 29.68 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 2fcbc6a50186ffdbe6ab9d26e1547012d4ef270b, data reload: false

query	cold (s)	hot 1 (s)	hot 2 (s)
query1	0.06	0.05	0.04
query2	0.11	0.06	0.05
query3	0.25	0.09	0.08
query4	1.61	0.12	0.12
query5	0.28	0.26	0.25
query6	1.18	0.67	0.64
query7	0.03	0.03	0.02
query8	0.06	0.05	0.05
query9	0.63	0.52	0.51
query10	0.58	0.56	0.57
query11	0.16	0.11	0.11
query12	0.16	0.12	0.14
query13	0.63	0.63	0.63
query14	1.02	1.05	1.04
query15	0.88	0.87	0.86
query16	0.40	0.40	0.40
query17	1.07	1.07	1.05
query18	0.22	0.20	0.21
query19	2.00	1.79	1.89
query20	0.02	0.02	0.02
query21	15.40	0.95	0.59
query22	0.78	1.19	0.78
query23	14.74	1.36	0.71
query24	7.31	1.01	0.36
query25	0.33	0.38	0.07
query26	0.66	0.17	0.12
query27	0.07	0.06	0.07
query28	9.06	0.96	0.44
query29	12.56	3.95	3.26
query30	0.29	0.14	0.11
query31	2.82	0.62	0.39
query32	3.26	0.56	0.48
query33	3.01	3.16	3.14
query34	16.19	5.49	4.85
query35	4.93	4.89	4.90
query36	0.71	0.53	0.51
query37	0.11	0.07	0.07
query38	0.07	0.05	0.04
query39	0.03	0.03	0.03
query40	0.18	0.15	0.15
query41	0.09	0.03	0.02
query42	0.04	0.03	0.03
query43	0.04	0.04	0.04
Total cold run time: 104.03 s
Total hot run time: 29.68 s

@hello-stephen
Contributor

BE UT Coverage Report

Increment line coverage 0.00% (0/10) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 52.31% (17548/33543)
Line Coverage 37.45% (159030/424634)
Region Coverage 32.06% (121418/378726)
Branch Coverage 33.38% (53215/159415)

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 0.00% (0/10) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 71.05% (23293/32783)
Line Coverage 57.40% (243117/423553)
Region Coverage 52.98% (203241/383645)
Branch Coverage 54.48% (87231/160104)

@hello-stephen
Contributor

FE Regression Coverage Report

Increment line coverage 0.00% (0/38) 🎉
Increment coverage report
Complete coverage report

deardeng force-pushed the balance-warm-up-sync branch 2 times, most recently from 9995b58 to 495fc49 on September 25, 2025 at 13:11
deardeng force-pushed the balance-warm-up-sync branch 7 times, most recently from 11af924 to 76bb003 on September 29, 2025 at 11:48
@deardeng
Contributor Author

run buildall

deardeng force-pushed the balance-warm-up-sync branch from 76bb003 to 67fe918 on September 29, 2025 at 11:53
@deardeng
Contributor Author

run buildall

@hello-stephen
Contributor

Cloud UT Coverage Report

Increment line coverage 9.52% (2/21) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 83.61% (1632/1952)
Line Coverage 68.10% (28834/42340)
Region Coverage 68.33% (14209/20794)
Branch Coverage 58.66% (7570/12904)

@doris-robot

ClickBench: Total hot run time: 30.31 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 67fe91811a647bf4d1a11ce32f6f693c96862bb6, data reload: false

query	cold (s)	hot 1 (s)	hot 2 (s)
query1	0.06	0.05	0.05
query2	0.09	0.06	0.05
query3	0.25	0.09	0.09
query4	1.63	0.11	0.12
query5	0.29	0.27	0.26
query6	1.22	0.65	0.64
query7	0.03	0.02	0.02
query8	0.06	0.04	0.05
query9	0.62	0.53	0.52
query10	0.58	0.58	0.58
query11	0.17	0.11	0.12
query12	0.15	0.12	0.12
query13	0.63	0.64	0.62
query14	1.04	1.03	1.07
query15	0.87	0.85	0.87
query16	0.43	0.41	0.40
query17	1.09	1.08	1.06
query18	0.22	0.23	0.21
query19	1.95	1.90	1.86
query20	0.02	0.02	0.01
query21	15.42	0.92	0.59
query22	0.77	1.24	0.65
query23	14.92	1.39	0.70
query24	7.60	1.18	0.53
query25	0.51	0.29	0.07
query26	0.69	0.16	0.15
query27	0.07	0.06	0.05
query28	9.12	1.41	0.95
query29	12.61	3.95	3.25
query30	0.28	0.13	0.12
query31	2.84	0.61	0.40
query32	3.27	0.59	0.48
query33	3.14	3.15	3.12
query34	16.14	5.54	4.83
query35	4.90	4.90	4.92
query36	0.71	0.52	0.50
query37	0.10	0.07	0.07
query38	0.07	0.05	0.05
query39	0.04	0.03	0.03
query40	0.17	0.16	0.15
query41	0.08	0.04	0.03
query42	0.04	0.03	0.03
query43	0.05	0.04	0.03
Total cold run time: 104.94 s
Total hot run time: 30.31 s

@hello-stephen
Contributor

BE UT Coverage Report

Increment line coverage 0.00% (0/10) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 52.44% (17650/33659)
Line Coverage 37.64% (160304/425886)
Region Coverage 32.16% (122368/380483)
Branch Coverage 33.53% (53637/159967)

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 0.00% (0/10) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 71.16% (23382/32857)
Line Coverage 57.60% (244631/424715)
Region Coverage 52.78% (203129/384890)
Branch Coverage 54.56% (87597/160560)

@hello-stephen
Contributor

run buildall

@deardeng
Contributor Author

run buildall

@doris-robot

Cloud UT Coverage Report

Increment line coverage 9.52% (2/21) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 80.63% (1673/2075)
Line Coverage 66.97% (29503/44052)
Region Coverage 67.44% (14651/21725)
Branch Coverage 57.73% (7791/13496)

@doris-robot

ClickBench: Total hot run time: 28.77 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit ce5c6547ddab7efcf590c23632723a5f973a108f, data reload: false

query	cold (s)	hot 1 (s)	hot 2 (s)
query1	0.05	0.04	0.04
query2	0.13	0.07	0.07
query3	0.30	0.07	0.07
query4	1.61	0.09	0.08
query5	0.27	0.25	0.25
query6	1.16	0.66	0.65
query7	0.04	0.03	0.02
query8	0.07	0.06	0.06
query9	0.69	0.54	0.54
query10	0.61	0.60	0.60
query11	0.26	0.14	0.14
query12	0.27	0.14	0.14
query13	0.65	0.64	0.64
query14	1.04	1.02	1.02
query15	0.94	0.87	0.86
query16	0.40	0.39	0.40
query17	1.05	1.10	1.06
query18	0.24	0.24	0.23
query19	1.99	1.87	1.79
query20	0.02	0.02	0.02
query21	15.39	0.28	0.26
query22	5.01	0.10	0.10
query23	15.36	0.39	0.23
query24	2.94	0.48	0.32
query25	0.10	0.09	0.10
query26	0.18	0.18	0.18
query27	0.10	0.10	0.09
query28	3.72	1.27	1.08
query29	12.55	4.12	3.39
query30	0.34	0.12	0.11
query31	2.82	0.64	0.45
query32	3.25	0.61	0.52
query33	3.07	3.17	3.18
query34	16.57	5.31	4.54
query35	4.52	4.56	4.53
query36	0.66	0.53	0.52
query37	0.24	0.09	0.09
query38	0.20	0.06	0.07
query39	0.06	0.05	0.05
query40	0.21	0.17	0.18
query41	0.11	0.06	0.05
query42	0.07	0.05	0.05
query43	0.06	0.06	0.05
Total cold run time: 99.32 s
Total hot run time: 28.77 s

@doris-robot

BE UT Coverage Report

Increment line coverage 0.00% (0/10) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 52.72% (18042/34224)
Line Coverage 37.97% (163730/431261)
Region Coverage 32.34% (124867/386118)
Branch Coverage 33.70% (54584/161955)

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 0.00% (0/10) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 72.34% (24327/33628)
Line Coverage 59.15% (255400/431804)
Region Coverage 54.99% (215298/391522)
Branch Coverage 56.33% (91821/163004)

@hello-stephen
Contributor

FE Regression Coverage Report

Increment line coverage 20.91% (60/287) 🎉
Increment coverage report
Complete coverage report

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 0.00% (0/10) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 71.47% (24033/33628)
Line Coverage 57.87% (249887/431804)
Region Coverage 53.00% (207525/391522)
Branch Coverage 54.65% (89087/163004)

@hello-stephen
Contributor

FE Regression Coverage Report

Increment line coverage 20.91% (60/287) 🎉
Increment coverage report
Complete coverage report

gavinchou merged commit 7e0503b into apache:master on Oct 30, 2025
25 of 26 checks passed
github-actions bot pushed a commit that referenced this pull request Oct 30, 2025
dwdwqfwe pushed a commit to dwdwqfwe/doris that referenced this pull request Oct 31, 2025
deardeng added 5 commits to deardeng/incubator-doris that referenced this pull request on Nov 4, 2025
yiguolei pushed a commit that referenced this pull request Nov 6, 2025

Cherry-picked from #56164

Co-authored-by: deardeng <565620795@qq.com>
morrySnow pushed a commit that referenced this pull request Nov 7, 2025

Labels

approved (indicates a PR has been approved by one committer), dev/3.1.3-merged, dev/4.0.2-merged, reviewed


8 participants