@sollhui
Contributor

@sollhui sollhui commented Jul 2, 2025

What problem does this PR solve?

Routine load task scheduling can block in the following case:

  1. A user creates a job as the admin user of clusterA. Later, clusterA is deleted and clusterB is renamed to clusterA.
  2. The cluster ID saved in the job is now invalid, so no BE can be found for it.
  3. The job's task is repeatedly taken off the queue and put back because there is no BE to execute it, which causes the other tasks to get stuck.
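
To make the failure mode concrete, here is a minimal sketch in Python (not the FE code, and not the change made in this PR, whose diff is not shown in this conversation). It assumes a single scheduler loop that puts an unplaceable task back where it will be picked up again first; `BACKENDS`, `find_backend`, `schedule_blocking`, and `schedule_skipping` are hypothetical names. The second function shows one possible mitigation: setting the unplaceable task aside so that the remaining tasks still run.

```python
from collections import deque

# Hypothetical toy model: cluster ID -> list of backend (BE) names.
# The stale cluster ID saved in the old job is deliberately absent.
BACKENDS = {"clusterB-id": ["be1", "be2"]}


def find_backend(cluster_id):
    """Return a BE for the cluster, or None if the cluster ID is unknown."""
    bes = BACKENDS.get(cluster_id, [])
    return bes[0] if bes else None


def schedule_blocking(tasks, max_rounds):
    """Retry the head task until it can be placed (head-of-line blocking)."""
    queue = deque(tasks)
    done = []
    for _ in range(max_rounds):
        if not queue:
            break
        name, cluster_id = queue.popleft()
        if find_backend(cluster_id) is None:
            # Assumption: the task goes back and is retried before the others.
            queue.appendleft((name, cluster_id))
            continue
        done.append(name)
    return done


def schedule_skipping(tasks, max_rounds):
    """Set aside tasks that cannot be placed so later tasks still run."""
    queue = deque(tasks)
    done, unschedulable = [], []
    for _ in range(max_rounds):
        if not queue:
            break
        name, cluster_id = queue.popleft()
        if find_backend(cluster_id) is None:
            unschedulable.append(name)
            continue
        done.append(name)
    return done, unschedulable


TASKS = [
    ("stale_job_task", "deleted-clusterA-id"),
    ("task2", "clusterB-id"),
    ("task3", "clusterB-id"),
]
```

With the sample `TASKS`, `schedule_blocking(TASKS, 10)` completes nothing, while `schedule_skipping(TASKS, 10)` completes `task2` and `task3` and reports `stale_job_task` as unschedulable.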

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@Thearas
Contributor

Thearas commented Jul 2, 2025

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@sollhui
Contributor Author

sollhui commented Jul 2, 2025

run buildall

@sollhui sollhui force-pushed the fix_rl_job_schedule branch from a9d926e to ed83b14 Compare July 2, 2025 08:31
@sollhui
Contributor Author

sollhui commented Jul 2, 2025

run buildall

liaoxin01
liaoxin01 previously approved these changes Jul 2, 2025
Contributor

@liaoxin01 liaoxin01 left a comment

LGTM

@github-actions github-actions bot added the approved label (indicates a PR has been approved by one committer) Jul 2, 2025
@github-actions
Contributor

github-actions bot commented Jul 2, 2025

PR approved by at least one committer and no changes requested.

@github-actions
Contributor

github-actions bot commented Jul 2, 2025

PR approved by anyone and no changes requested.

@doris-robot

TPC-H: Total hot run time: 34881 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit ed83b147d4c5897efc76affd90fbd27ac7140109, data reload: false

------ Round 1 ----------------------------------
q1	17667	5436	5260	5260
q2	1944	296	186	186
q3	10571	1410	728	728
q4	10257	1090	517	517
q5	8091	2647	2570	2570
q6	204	175	131	131
q7	969	767	606	606
q8	9322	1490	1202	1202
q9	7024	5346	5359	5346
q10	6951	2445	1976	1976
q11	489	280	273	273
q12	353	395	220	220
q13	17759	3824	3140	3140
q14	232	228	221	221
q15	560	486	476	476
q16	436	426	376	376
q17	589	924	360	360
q18	7885	7109	7159	7109
q19	1222	1094	600	600
q20	334	370	217	217
q21	3848	3241	2431	2431
q22	1085	1026	936	936
Total cold run time: 107792 ms
Total hot run time: 34881 ms

----- Round 2, with runtime_filter_mode=off -----
q1	5440	5461	5388	5388
q2	273	344	226	226
q3	2209	2763	2321	2321
q4	1432	1896	1390	1390
q5	4798	4569	4546	4546
q6	254	181	126	126
q7	2135	2009	1789	1789
q8	2983	2880	2891	2880
q9	7298	7326	7170	7170
q10	3277	3348	2935	2935
q11	622	515	483	483
q12	716	807	630	630
q13	3785	4093	3362	3362
q14	292	286	286	286
q15	564	469	491	469
q16	459	505	467	467
q17	1280	1843	1460	1460
q18	7932	7757	7602	7602
q19	892	937	1230	937
q20	2127	2140	1932	1932
q21	5219	4661	4625	4625
q22	1185	1111	1029	1029
Total cold run time: 55172 ms
Total hot run time: 52053 ms
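
The columns in the tables above are unlabeled. A reading consistent with the Round 1 numbers is: query, cold run, two hot runs, and the faster of the two hot runs, with each total being a column sum. This is inferred from the data, not stated by the report. A small Python sketch of that arithmetic on three of the Round 1 rows (`ROWS` and `totals` are hypothetical names):

```python
# Three rows copied from Round 1 above: (query, cold, hot run 1, hot run 2).
ROWS = [
    ("q1", 17667, 5436, 5260),
    ("q9", 7024, 5346, 5359),
    ("q18", 7885, 7109, 7159),
]


def totals(rows):
    """Return (total cold ms, total hot ms), using the faster hot run per query."""
    cold = sum(r[1] for r in rows)
    hot = sum(min(r[2], r[3]) for r in rows)
    return cold, hot


print(totals(ROWS))
```

Applied to all 22 rows of Round 1, the same sums give the reported 107792 ms cold and 34881 ms hot.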

@hello-stephen
Contributor

FE UT Coverage Report

Increment line coverage 57.14% (4/7) 🎉
Increment coverage report
Complete coverage report

@doris-robot

TPC-DS: Total hot run time: 185697 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit ed83b147d4c5897efc76affd90fbd27ac7140109, data reload: false

query1	1017	383	396	383
query2	6524	1635	1639	1635
query3	6743	212	212	212
query4	26452	23780	23548	23548
query5	4309	602	448	448
query6	304	225	199	199
query7	4625	520	290	290
query8	284	229	217	217
query9	8629	2658	2657	2657
query10	474	334	282	282
query11	15234	15048	14763	14763
query12	156	104	102	102
query13	1659	551	416	416
query14	8658	5767	5868	5767
query15	212	202	183	183
query16	7361	629	509	509
query17	1207	727	622	622
query18	2034	408	317	317
query19	197	195	168	168
query20	129	117	115	115
query21	212	124	114	114
query22	4440	4641	4448	4448
query23	34900	33983	33133	33133
query24	8452	2383	2401	2383
query25	533	473	394	394
query26	1221	262	142	142
query27	2721	515	345	345
query28	4294	2151	2119	2119
query29	756	563	432	432
query30	279	216	185	185
query31	917	818	753	753
query32	71	59	63	59
query33	547	413	302	302
query34	807	833	525	525
query35	770	847	749	749
query36	940	956	881	881
query37	112	95	75	75
query38	4130	4105	4061	4061
query39	1453	1407	1414	1407
query40	200	115	98	98
query41	53	50	49	49
query42	119	106	109	106
query43	507	519	461	461
query44	1341	835	820	820
query45	188	167	165	165
query46	861	1032	641	641
query47	1736	1813	1702	1702
query48	379	432	304	304
query49	742	492	391	391
query50	654	688	404	404
query51	4183	4137	4096	4096
query52	109	106	97	97
query53	225	251	185	185
query54	586	571	503	503
query55	84	80	79	79
query56	313	297	282	282
query57	1149	1194	1127	1127
query58	268	252	247	247
query59	2570	2601	2473	2473
query60	327	329	314	314
query61	129	123	120	120
query62	786	715	649	649
query63	226	187	188	187
query64	4312	1011	669	669
query65	4334	4184	4165	4165
query66	1150	405	302	302
query67	15832	15532	15288	15288
query68	8808	889	583	583
query69	477	309	273	273
query70	1206	1101	1115	1101
query71	503	338	296	296
query72	5499	4691	4734	4691
query73	708	588	356	356
query74	9097	9144	8930	8930
query75	4076	3193	2698	2698
query76	3736	1152	740	740
query77	780	401	291	291
query78	10001	10200	9369	9369
query79	2730	840	588	588
query80	683	510	440	440
query81	462	265	228	228
query82	457	123	91	91
query83	272	253	241	241
query84	294	107	84	84
query85	772	348	388	348
query86	340	298	290	290
query87	4450	4374	4293	4293
query88	3313	2297	2269	2269
query89	408	317	283	283
query90	1967	212	212	212
query91	139	140	106	106
query92	72	58	54	54
query93	1769	963	580	580
query94	690	408	296	296
query95	375	291	289	289
query96	495	581	286	286
query97	2749	2792	2634	2634
query98	236	212	211	211
query99	1427	1370	1282	1282
Total cold run time: 276117 ms
Total hot run time: 185697 ms

@doris-robot

ClickBench: Total hot run time: 29.89 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit ed83b147d4c5897efc76affd90fbd27ac7140109, data reload: false

query1	0.04	0.04	0.04
query2	0.08	0.04	0.04
query3	0.24	0.07	0.07
query4	1.60	0.11	0.11
query5	0.45	0.43	0.41
query6	1.18	0.66	0.66
query7	0.03	0.02	0.02
query8	0.04	0.03	0.04
query9	0.61	0.51	0.53
query10	0.58	0.58	0.58
query11	0.16	0.11	0.12
query12	0.16	0.12	0.12
query13	0.64	0.62	0.61
query14	0.80	0.81	0.82
query15	0.92	0.87	0.88
query16	0.39	0.38	0.38
query17	1.06	1.10	1.05
query18	0.24	0.22	0.22
query19	2.02	1.84	1.87
query20	0.01	0.02	0.01
query21	15.38	0.95	0.56
query22	0.77	1.27	0.76
query23	14.72	1.47	0.68
query24	7.94	1.14	0.75
query25	0.47	0.12	0.06
query26	0.66	0.17	0.13
query27	0.06	0.06	0.05
query28	8.86	0.94	0.46
query29	12.53	4.02	3.37
query30	0.26	0.10	0.06
query31	2.84	0.62	0.39
query32	3.24	0.56	0.49
query33	3.10	3.13	3.19
query34	16.02	5.43	4.82
query35	4.83	4.86	4.90
query36	0.68	0.50	0.49
query37	0.09	0.07	0.07
query38	0.05	0.04	0.04
query39	0.04	0.03	0.02
query40	0.17	0.14	0.14
query41	0.08	0.02	0.02
query42	0.03	0.02	0.02
query43	0.03	0.03	0.03
Total cold run time: 104.1 s
Total hot run time: 29.89 s

@sollhui sollhui changed the title [fix](load) fix routine load task scheduler block for one job has no load privilege [fix](job) fix routine load task scheduler block for one job can not found any be Jul 3, 2025
@sollhui sollhui changed the title [fix](job) fix routine load task scheduler block for one job can not found any be [fix](job) fix routine load task scheduler block for one job can not found any BE Jul 3, 2025
@sollhui sollhui force-pushed the fix_rl_job_schedule branch from ed83b14 to ce01328 Compare July 3, 2025 06:11
@sollhui
Contributor Author

sollhui commented Jul 3, 2025

run buildall

@github-actions github-actions bot removed the approved label (indicates a PR has been approved by one committer) Jul 3, 2025
@sollhui sollhui force-pushed the fix_rl_job_schedule branch from ce01328 to f6f4eb2 Compare July 3, 2025 06:34
@sollhui
Contributor Author

sollhui commented Jul 3, 2025

run buildall

@sollhui sollhui force-pushed the fix_rl_job_schedule branch from f6f4eb2 to b1115a3 Compare July 3, 2025 06:38
@sollhui
Contributor Author

sollhui commented Jul 3, 2025

run buildall

@sollhui sollhui changed the title [fix](job) fix routine load task scheduler block for one job can not found any BE [fix](job) fix routine load task scheduler block for one job can not find any BE Jul 3, 2025
@github-actions github-actions bot added the approved label (indicates a PR has been approved by one committer) Jul 3, 2025
@github-actions
Contributor

github-actions bot commented Jul 3, 2025

PR approved by at least one committer and no changes requested.

@doris-robot

TPC-H: Total hot run time: 33788 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit b1115a38488cc62199fa6e40b4bdf43ebd794003, data reload: false

------ Round 1 ----------------------------------
q1	17610	5180	5006	5006
q2	1925	286	184	184
q3	10315	1313	707	707
q4	10222	1017	503	503
q5	7521	2300	2387	2300
q6	180	155	125	125
q7	870	724	588	588
q8	9294	1291	1161	1161
q9	6827	5150	5070	5070
q10	6901	2365	1940	1940
q11	478	288	283	283
q12	337	350	208	208
q13	17764	3665	3091	3091
q14	225	225	214	214
q15	558	485	477	477
q16	419	416	384	384
q17	625	854	393	393
q18	7520	7108	7205	7108
q19	1701	952	544	544
q20	325	343	219	219
q21	3700	3065	2309	2309
q22	1025	1013	974	974
Total cold run time: 106342 ms
Total hot run time: 33788 ms

----- Round 2, with runtime_filter_mode=off -----
q1	5179	5449	5083	5083
q2	235	331	217	217
q3	2156	2650	2277	2277
q4	1363	1772	1331	1331
q5	4209	4472	4533	4472
q6	220	167	128	128
q7	2010	1992	1875	1875
q8	2638	2629	2505	2505
q9	7317	7282	7289	7282
q10	3113	3271	2893	2893
q11	583	501	522	501
q12	670	789	623	623
q13	3641	3966	3296	3296
q14	284	297	272	272
q15	507	476	470	470
q16	459	491	443	443
q17	1157	1560	1371	1371
q18	7889	7839	7484	7484
q19	800	741	816	741
q20	1925	1971	1843	1843
q21	4722	4334	4284	4284
q22	1067	979	990	979
Total cold run time: 52144 ms
Total hot run time: 50370 ms

@doris-robot

TPC-DS: Total hot run time: 184778 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit b1115a38488cc62199fa6e40b4bdf43ebd794003, data reload: false

query1	1007	379	378	378
query2	6570	1732	1685	1685
query3	6734	213	208	208
query4	26431	23581	23114	23114
query5	4330	571	425	425
query6	318	219	194	194
query7	4612	494	275	275
query8	267	212	209	209
query9	8582	2595	2603	2595
query10	467	339	266	266
query11	15297	15150	14767	14767
query12	153	107	104	104
query13	1651	561	406	406
query14	8526	5677	5586	5586
query15	203	188	176	176
query16	7180	613	486	486
query17	1228	721	589	589
query18	1984	414	313	313
query19	196	185	156	156
query20	115	118	122	118
query21	213	124	103	103
query22	4170	4148	3999	3999
query23	33989	33222	33258	33222
query24	8425	2375	2372	2372
query25	574	499	462	462
query26	1230	259	148	148
query27	2752	501	337	337
query28	4320	2103	2079	2079
query29	758	563	419	419
query30	289	224	189	189
query31	904	837	747	747
query32	68	62	57	57
query33	545	354	328	328
query34	801	824	506	506
query35	786	835	769	769
query36	943	977	887	887
query37	109	100	74	74
query38	4130	4058	4087	4058
query39	1491	1392	1427	1392
query40	209	122	120	120
query41	54	56	52	52
query42	119	105	101	101
query43	499	517	475	475
query44	1275	826	807	807
query45	181	174	160	160
query46	873	1001	616	616
query47	1759	1832	1708	1708
query48	370	408	315	315
query49	747	474	402	402
query50	631	690	401	401
query51	4141	4173	4156	4156
query52	115	102	99	99
query53	221	245	193	193
query54	575	558	494	494
query55	83	81	84	81
query56	293	286	274	274
query57	1207	1182	1130	1130
query58	257	267	256	256
query59	2681	2717	2672	2672
query60	323	322	304	304
query61	125	116	120	116
query62	817	708	632	632
query63	225	182	189	182
query64	4353	981	655	655
query65	4268	4123	4169	4123
query66	1196	410	337	337
query67	15512	15766	15445	15445
query68	8699	903	560	560
query69	467	300	265	265
query70	1257	1125	1089	1089
query71	446	317	312	312
query72	5792	4760	4821	4760
query73	742	622	347	347
query74	8884	9120	8605	8605
query75	3925	3185	2688	2688
query76	3622	1168	712	712
query77	789	357	291	291
query78	9998	10490	9346	9346
query79	2413	784	584	584
query80	620	566	445	445
query81	476	257	228	228
query82	443	129	95	95
query83	279	251	235	235
query84	300	104	80	80
query85	794	355	325	325
query86	377	315	284	284
query87	4454	4405	4396	4396
query88	3221	2286	2290	2286
query89	400	319	278	278
query90	1903	204	208	204
query91	142	140	110	110
query92	74	64	53	53
query93	1181	921	576	576
query94	670	408	314	314
query95	374	285	285	285
query96	502	575	284	284
query97	2786	2760	2734	2734
query98	227	203	208	203
query99	1458	1390	1300	1300
Total cold run time: 273172 ms
Total hot run time: 184778 ms

@doris-robot

ClickBench: Total hot run time: 29.63 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit b1115a38488cc62199fa6e40b4bdf43ebd794003, data reload: false

query1	0.04	0.03	0.03
query2	0.08	0.04	0.04
query3	0.25	0.08	0.08
query4	1.62	0.11	0.11
query5	0.43	0.44	0.42
query6	1.15	0.66	0.65
query7	0.03	0.02	0.02
query8	0.05	0.04	0.04
query9	0.61	0.51	0.52
query10	0.57	0.57	0.57
query11	0.16	0.11	0.11
query12	0.15	0.11	0.12
query13	0.62	0.61	0.61
query14	0.80	0.82	0.84
query15	0.89	0.88	0.86
query16	0.39	0.39	0.39
query17	1.09	1.08	1.07
query18	0.22	0.21	0.21
query19	1.93	1.82	1.79
query20	0.02	0.01	0.02
query21	15.38	0.87	0.53
query22	0.74	1.01	0.65
query23	15.13	1.37	0.69
query24	6.71	2.16	0.74
query25	0.52	0.16	0.07
query26	0.58	0.17	0.15
query27	0.07	0.05	0.05
query28	8.78	0.85	0.45
query29	12.53	4.05	3.43
query30	0.25	0.09	0.07
query31	2.83	0.59	0.39
query32	3.25	0.56	0.47
query33	3.05	3.07	3.09
query34	16.06	5.42	4.77
query35	4.83	4.82	4.87
query36	0.70	0.51	0.49
query37	0.09	0.07	0.06
query38	0.05	0.04	0.04
query39	0.03	0.02	0.03
query40	0.17	0.15	0.14
query41	0.08	0.02	0.03
query42	0.04	0.03	0.03
query43	0.04	0.03	0.03
Total cold run time: 103.01 s
Total hot run time: 29.63 s

@hello-stephen
Contributor

FE UT Coverage Report

Increment line coverage 57.14% (4/7) 🎉
Increment coverage report
Complete coverage report

Contributor

@MoanasDaddyXu MoanasDaddyXu left a comment

LGTM

@liaoxin01 liaoxin01 merged commit 46cbe76 into apache:master Jul 4, 2025
29 of 30 checks passed
github-actions bot pushed a commit that referenced this pull request Jul 4, 2025
…find any BE (#52654)

koarz pushed a commit to koarz/doris that referenced this pull request Jul 4, 2025
…find any BE (apache#52654)

sollhui added a commit to sollhui/doris that referenced this pull request Jul 4, 2025
… find any BE (apache#52654)

seawinde pushed a commit to seawinde/doris that referenced this pull request Jul 4, 2025
…find any BE (apache#52654)

dataroaring pushed a commit that referenced this pull request Jul 8, 2025
…job can not find any BE (#52654) (#52791)

pick (#52654)

morrySnow pushed a commit that referenced this pull request Jul 8, 2025
Labels

approved (indicates a PR has been approved by one committer), dev/3.0.7-merged, dev/3.1.0-merged, reviewed

8 participants