Conversation

@sollhui
Contributor

@sollhui sollhui commented Jul 7, 2025

What problem does this PR solve?

A routine load job could not transition from RUNNING to NEED_SCHEDULE. When the number of partitions increases and the job is rescheduled, the scheduler throws an exception, so the new partitions cannot be consumed:

2025-07-07 14:35:39,847 WARN (Routine load scheduler|41) [RoutineLoadScheduler.runAfterCatalogReady():59] Failed to process one round of RoutineLoadScheduler
org.apache.doris.common.DdlException: errCode = 2, detailMessage = Could not transform RUNNING to NEED_SCHEDULE
        at org.apache.doris.load.routineload.RoutineLoadJob.checkStateTransform(RoutineLoadJob.java:788) ~[doris-fe.jar:1.2-SNAPSHOT]
        at org.apache.doris.load.routineload.RoutineLoadJob.unprotectUpdateState(RoutineLoadJob.java:1366) ~[doris-fe.jar:1.2-SNAPSHOT]
        at org.apache.doris.load.routineload.RoutineLoadJob.update(RoutineLoadJob.java:1483) ~[doris-fe.jar:1.2-SNAPSHOT]
        at org.apache.doris.load.routineload.RoutineLoadManager.updateRoutineLoadJob(RoutineLoadManager.java:839) ~[doris-fe.jar:1.2-SNAPSHOT]
        at org.apache.doris.load.routineload.RoutineLoadScheduler.process(RoutineLoadScheduler.java:65) ~[doris-fe.jar:1.2-SNAPSHOT]
        at org.apache.doris.load.routineload.RoutineLoadScheduler.runAfterCatalogReady(RoutineLoadScheduler.java:57) ~[doris-fe.jar:1.2-SNAPSHOT]
        at org.apache.doris.common.util.MasterDaemon.runOneCycle(MasterDaemon.java:58) ~[doris-fe.jar:1.2-SNAPSHOT]
        at org.apache.doris.common.util.Daemon.run(Daemon.java:116) ~[doris-fe.jar:1.2-SNAPSHOT]

This restriction was introduced by #40728 and should be removed.
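
For illustration only, a minimal sketch of a state-transition check that no longer rejects RUNNING to NEED_SCHEDULE. This is not the actual Doris code: the class name `StateTransitionSketch`, the helper `canTransform`, the states other than RUNNING and NEED_SCHEDULE, and the rule that terminal states reject every transition are all assumptions made for the example.

```java
import java.util.EnumSet;
import java.util.Set;

public class StateTransitionSketch {
    // Hypothetical state set; only RUNNING and NEED_SCHEDULE appear in the log above.
    enum JobState { NEED_SCHEDULE, RUNNING, PAUSED, STOPPED, CANCELLED }

    // Assumption for this sketch: terminal states accept no further transitions.
    private static final Set<JobState> TERMINAL =
            EnumSet.of(JobState.STOPPED, JobState.CANCELLED);

    /** Throws if moving from current to desired is not allowed in this sketch. */
    static void checkStateTransform(JobState current, JobState desired) {
        if (TERMINAL.contains(current)) {
            throw new IllegalStateException(
                    "Could not transform " + current + " to " + desired);
        }
        // There is deliberately no branch rejecting RUNNING -> NEED_SCHEDULE:
        // a running job may be handed back to the scheduler, for example
        // after new partitions appear.
    }

    /** Convenience wrapper that reports the result as a boolean. */
    static boolean canTransform(JobState current, JobState desired) {
        try {
            checkStateTransform(current, desired);
            return true;
        } catch (IllegalStateException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(canTransform(JobState.RUNNING, JobState.NEED_SCHEDULE));
        System.out.println(canTransform(JobState.STOPPED, JobState.NEED_SCHEDULE));
    }
}
```

With a check shaped like this, a reschedule triggered by a partition-count change would pass validation instead of failing the scheduler round.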

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@hello-stephen
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@sollhui
Contributor Author

sollhui commented Jul 7, 2025

run buildall

@sollhui sollhui changed the title from "[fix](job) fix routine load job can not reschedule" to "[fix](job) remove routine load job can not transform RUNNING to NEED_SCHEDULE limit" Jul 7, 2025
@doris-robot

TPC-H: Total hot run time: 33453 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 533a7e2e46f1263b9020293a4db02a961ec0ef16, data reload: false

------ Round 1 ----------------------------------
q1	17579	5186	5071	5071
q2	1954	282	188	188
q3	10550	1305	732	732
q4	10308	1037	563	563
q5	8751	2459	2314	2314
q6	206	164	129	129
q7	895	735	597	597
q8	9334	1325	1110	1110
q9	7208	5216	5167	5167
q10	6973	2399	1960	1960
q11	480	287	260	260
q12	369	357	218	218
q13	17788	3663	3074	3074
q14	227	229	215	215
q15	548	482	477	477
q16	422	426	385	385
q17	616	869	360	360
q18	7504	7178	7240	7178
q19	1417	964	541	541
q20	320	358	214	214
q21	3712	3180	2402	2402
q22	361	320	298	298
Total cold run time: 107522 ms
Total hot run time: 33453 ms

----- Round 2, with runtime_filter_mode=off -----
q1	5190	5076	5100	5076
q2	238	325	218	218
q3	2203	2658	2244	2244
q4	1419	1763	1325	1325
q5	4191	4574	4586	4574
q6	215	165	129	129
q7	2050	1940	1744	1744
q8	2641	2673	2569	2569
q9	7320	7267	7252	7252
q10	3123	3354	2894	2894
q11	574	489	498	489
q12	726	772	601	601
q13	3634	3978	3362	3362
q14	303	311	289	289
q15	527	487	478	478
q16	452	517	453	453
q17	1259	1498	1387	1387
q18	8047	7805	7656	7656
q19	807	809	906	809
q20	2036	2115	1881	1881
q21	5065	4572	4518	4518
q22	677	618	591	591
Total cold run time: 52697 ms
Total hot run time: 50539 ms

@doris-robot

TPC-DS: Total hot run time: 186621 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 533a7e2e46f1263b9020293a4db02a961ec0ef16, data reload: false

query1	1001	387	389	387
query2	6526	1635	1610	1610
query3	6749	211	212	211
query4	26127	23361	23361	23361
query5	4320	567	438	438
query6	317	206	198	198
query7	4622	480	282	282
query8	281	224	205	205
query9	8591	2607	2611	2607
query10	472	340	267	267
query11	15675	14971	14748	14748
query12	154	107	106	106
query13	1632	510	390	390
query14	8421	5853	5811	5811
query15	199	202	181	181
query16	7431	435	266	266
query17	1334	705	630	630
query18	1999	413	309	309
query19	196	186	157	157
query20	126	124	116	116
query21	217	124	112	112
query22	4103	4288	4382	4288
query23	35066	34027	33685	33685
query24	8500	2387	2369	2369
query25	539	465	384	384
query26	1229	261	144	144
query27	2777	496	346	346
query28	4288	2130	2106	2106
query29	732	547	424	424
query30	282	265	184	184
query31	905	841	766	766
query32	70	55	56	55
query33	558	351	266	266
query34	825	838	525	525
query35	586	646	569	569
query36	941	990	911	911
query37	108	96	74	74
query38	4157	4140	4150	4140
query39	1467	1398	1423	1398
query40	206	116	103	103
query41	53	55	49	49
query42	121	101	107	101
query43	485	483	467	467
query44	1289	810	826	810
query45	179	166	165	165
query46	829	1021	630	630
query47	1740	1816	1716	1716
query48	390	421	310	310
query49	732	490	403	403
query50	623	686	414	414
query51	4138	4260	4188	4188
query52	104	106	90	90
query53	218	250	179	179
query54	594	574	509	509
query55	83	82	83	82
query56	305	286	281	281
query57	1160	1167	1108	1108
query58	260	265	245	245
query59	2562	2633	2513	2513
query60	327	303	299	299
query61	128	121	120	120
query62	823	726	650	650
query63	221	189	182	182
query64	4427	1266	831	831
query65	4248	4174	4142	4142
query66	1117	408	306	306
query67	15833	15539	15473	15473
query68	8381	901	531	531
query69	503	299	265	265
query70	1206	1126	1063	1063
query71	494	316	293	293
query72	5555	4781	4816	4781
query73	685	593	345	345
query74	9158	9263	8950	8950
query75	3943	3223	2679	2679
query76	3748	1165	735	735
query77	789	366	307	307
query78	11066	11152	10258	10258
query79	1951	832	587	587
query80	573	510	419	419
query81	492	258	219	219
query82	423	118	92	92
query83	250	259	233	233
query84	241	105	83	83
query85	783	367	319	319
query86	336	307	299	299
query87	4465	4412	4420	4412
query88	3479	2265	2271	2265
query89	376	315	278	278
query90	1950	207	201	201
query91	136	142	167	142
query92	65	60	52	52
query93	1444	950	587	587
query94	685	317	195	195
query95	374	288	287	287
query96	488	569	289	289
query97	2656	2741	2617	2617
query98	229	205	206	205
query99	1343	1400	1241	1241
Total cold run time: 274985 ms
Total hot run time: 186621 ms

@doris-robot

ClickBench: Total hot run time: 29.2 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 533a7e2e46f1263b9020293a4db02a961ec0ef16, data reload: false

query1	0.05	0.04	0.03
query2	0.08	0.04	0.04
query3	0.24	0.08	0.07
query4	1.61	0.12	0.11
query5	0.43	0.42	0.42
query6	1.19	0.66	0.64
query7	0.02	0.02	0.02
query8	0.04	0.04	0.04
query9	0.61	0.51	0.51
query10	0.55	0.56	0.56
query11	0.16	0.11	0.12
query12	0.16	0.11	0.12
query13	0.62	0.62	0.61
query14	0.79	0.81	0.81
query15	0.89	0.89	0.86
query16	0.38	0.37	0.39
query17	1.09	1.06	1.09
query18	0.23	0.22	0.21
query19	1.93	1.82	1.86
query20	0.01	0.01	0.01
query21	15.40	0.90	0.54
query22	0.75	1.06	0.63
query23	15.10	1.36	0.66
query24	7.34	1.59	0.48
query25	0.49	0.17	0.16
query26	0.68	0.16	0.15
query27	0.07	0.05	0.05
query28	9.27	0.82	0.44
query29	12.54	3.92	3.27
query30	0.25	0.09	0.07
query31	2.83	0.58	0.38
query32	3.22	0.56	0.49
query33	3.06	3.06	3.17
query34	16.06	5.42	4.79
query35	4.86	4.84	4.80
query36	0.69	0.51	0.48
query37	0.10	0.07	0.07
query38	0.05	0.04	0.04
query39	0.03	0.02	0.03
query40	0.18	0.14	0.13
query41	0.08	0.02	0.02
query42	0.03	0.03	0.03
query43	0.03	0.04	0.03
Total cold run time: 104.19 s
Total hot run time: 29.2 s

@sollhui sollhui changed the title from "[fix](job) remove routine load job can not transform RUNNING to NEED_SCHEDULE limit" to "[fix](job) remove can not transform RUNNING to NEED_SCHEDULE limit" Jul 7, 2025
Contributor

@liaoxin01 liaoxin01 left a comment

LGTM

@github-actions github-actions bot added the approved Indicates a PR has been approved by one committer. label Jul 7, 2025
@github-actions
Contributor

github-actions bot commented Jul 7, 2025

PR approved by at least one committer and no changes requested.

@github-actions
Contributor

github-actions bot commented Jul 7, 2025

PR approved by anyone and no changes requested.

Contributor

@dataroaring dataroaring left a comment

LGTM

@dataroaring dataroaring merged commit 527a714 into apache:master Jul 8, 2025
33 of 35 checks passed
github-actions bot pushed a commit that referenced this pull request Jul 8, 2025
…52887)

### What problem does this PR solve?

A routine load job could not transition from RUNNING to NEED_SCHEDULE.
When the number of partitions increases and the job is rescheduled, the
scheduler throws an exception, so the new partitions cannot be consumed:
```
2025-07-07 14:35:39,847 WARN (Routine load scheduler|41) [RoutineLoadScheduler.runAfterCatalogReady():59] Failed to process one round of RoutineLoadScheduler
org.apache.doris.common.DdlException: errCode = 2, detailMessage = Could not transform RUNNING to NEED_SCHEDULE
        at org.apache.doris.load.routineload.RoutineLoadJob.checkStateTransform(RoutineLoadJob.java:788) ~[doris-fe.jar:1.2-SNAPSHOT]
        at org.apache.doris.load.routineload.RoutineLoadJob.unprotectUpdateState(RoutineLoadJob.java:1366) ~[doris-fe.jar:1.2-SNAPSHOT]
        at org.apache.doris.load.routineload.RoutineLoadJob.update(RoutineLoadJob.java:1483) ~[doris-fe.jar:1.2-SNAPSHOT]
        at org.apache.doris.load.routineload.RoutineLoadManager.updateRoutineLoadJob(RoutineLoadManager.java:839) ~[doris-fe.jar:1.2-SNAPSHOT]
        at org.apache.doris.load.routineload.RoutineLoadScheduler.process(RoutineLoadScheduler.java:65) ~[doris-fe.jar:1.2-SNAPSHOT]
        at org.apache.doris.load.routineload.RoutineLoadScheduler.runAfterCatalogReady(RoutineLoadScheduler.java:57) ~[doris-fe.jar:1.2-SNAPSHOT]
        at org.apache.doris.common.util.MasterDaemon.runOneCycle(MasterDaemon.java:58) ~[doris-fe.jar:1.2-SNAPSHOT]
        at org.apache.doris.common.util.Daemon.run(Daemon.java:116) ~[doris-fe.jar:1.2-SNAPSHOT]
```

This restriction was introduced by #40728 and should be removed.
morrySnow pushed a commit that referenced this pull request Jul 8, 2025
…ULE limit #52887 (#52910)

Cherry-picked from #52887

Co-authored-by: hui lai <laihui@selectdb.com>
dataroaring pushed a commit that referenced this pull request Jul 9, 2025
…ULE limit #52887 (#52908)

Cherry-picked from #52887

Co-authored-by: hui lai <laihui@selectdb.com>
Labels

approved Indicates a PR has been approved by one committer. dev/2.1.x dev/3.0.7-merged dev/3.1.0-merged reviewed

6 participants