@bobhan1
Contributor

@bobhan1 bobhan1 commented Mar 17, 2025

What problem does this PR solve?

Fix for #48400. When FE sends a GetDeleteBitmapUpdateLock RPC to a lower-version MS, the MS does not set the tablet states field in its response. FE then reads that empty field and hits an IndexOutOfBoundsException:

2025-03-17 18:05:35,224 WARN (thrift-server-pool-77|200) [FrontendServiceImpl.loadTxnCommit():1676] catch unknown result.
java.lang.IndexOutOfBoundsException: Index:0, Size:0
        at com.google.protobuf.LongArrayList.ensureIndexInRange(LongArrayList.java:288) ~[protobuf-java-3.24.3.jar:?]
        at com.google.protobuf.LongArrayList.getLong(LongArrayList.java:136) ~[protobuf-java-3.24.3.jar:?]
        at com.google.protobuf.LongArrayList.get(LongArrayList.java:131) ~[protobuf-java-3.24.3.jar:?]
        at com.google.protobuf.LongArrayList.get(LongArrayList.java:45) ~[protobuf-java-3.24.3.jar:?]
        at org.apache.doris.cloud.transaction.CloudGlobalTransactionMgr.getDeleteBitmapUpdateLock(CloudGlobalTransactionMgr.java:949) ~[doris-fe.jar:1.2-SNAPSHOT]
        at org.apache.doris.cloud.transaction.CloudGlobalTransactionMgr.commitTransaction(CloudGlobalTransactionMgr.java:361) ~[doris-fe.jar:1.2-SNAPSHOT]
        at org.apache.doris.cloud.transaction.CloudGlobalTransactionMgr.commitAndPublishTransaction(CloudGlobalTransactionMgr.java:1203) ~[doris-fe.jar:1.2-SNAPSHOT]
        at org.apache.doris.service.FrontendServiceImpl.loadTxnCommitImpl(FrontendServiceImpl.java:1730) ~[doris-fe.jar:1.2-SNAPSHOT]
        at org.apache.doris.service.FrontendServiceImpl.loadTxnCommit(FrontendServiceImpl.java:1660) ~[doris-fe.jar:1.2-SNAPSHOT]
        at jdk.internal.reflect.GeneratedMethodAccessor121.invoke(Unknown Source) ~[?:?]
        at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:?]
        at java.lang.reflect.Method.invoke(Method.java:568) ~[?:?]
        at org.apache.doris.service.FeServer.lambda$start$0(FeServer.java:60) ~[doris-fe.jar:1.2-SNAPSHOT]
        at jdk.proxy2.$Proxy45.loadTxnCommit(Unknown Source) ~[?:?]
        at org.apache.doris.thrift.FrontendService$Processor$loadTxnCommit.getResult(FrontendService.java:4282) ~[fe-common-1.2-SNAPSHOT.jar:1.2-SNAPSHOT]
        at org.apache.doris.thrift.FrontendService$Processor$loadTxnCommit.getResult(FrontendService.java:4262) ~[fe-common-1.2-SNAPSHOT.jar:1.2-SNAPSHOT]
        at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:38) ~[libthrift-0.16.0.jar:0.16.0]
        at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:38) ~[libthrift-0.16.0.jar:0.16.0]
        at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:250) ~[libthrift-0.16.0.jar:0.16.0]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[?:?]
        at java.lang.Thread.run(Thread.java:833) ~[?:?] 
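
The failure mode can be illustrated with a minimal sketch. It assumes the response carries tablet states as a repeated int64 field that arrives empty from an older MS, which is consistent with the `LongArrayList.get` frame in the trace above. `TabletStateGuard`, `tabletStateAt`, and `UNKNOWN_STATE` are hypothetical names used only for illustration; they are not taken from the Doris code base or from this PR's diff, and a plain `List<Long>` stands in for the protobuf list.

```java
import java.util.List;

final class TabletStateGuard {
    // Hypothetical sentinel meaning "the MS did not report a state"; not a Doris constant.
    static final long UNKNOWN_STATE = -1L;

    // Returns the state at `index`, or UNKNOWN_STATE when the list is missing,
    // empty, or shorter than expected, instead of letting get() throw
    // IndexOutOfBoundsException as in the trace above.
    static long tabletStateAt(List<Long> tabletStates, int index) {
        if (tabletStates == null || index < 0 || index >= tabletStates.size()) {
            return UNKNOWN_STATE;
        }
        return tabletStates.get(index);
    }

    public static void main(String[] args) {
        // An older MS leaves the repeated field unset, so the list is empty.
        System.out.println(tabletStateAt(List.of(), 0));       // prints -1
        // A newer MS fills one entry per tablet.
        System.out.println(tabletStateAt(List.of(1L, 0L), 1)); // prints 0
    }
}
```

Checking the list size before indexing lets the caller treat a missing state as "not reported" rather than failing the whole commit, which is one plausible way to stay compatible with an MS that predates the field.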

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@Thearas
Contributor

Thearas commented Mar 17, 2025

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@bobhan1
Contributor Author

bobhan1 commented Mar 17, 2025

run buildall

Contributor

@zhannngchen zhannngchen left a comment

LGTM

@github-actions github-actions bot added the "approved" label (Indicates a PR has been approved by one committer.) Mar 17, 2025
@github-actions
Contributor

PR approved by at least one committer and no changes requested.

@github-actions
Contributor

PR approved by anyone and no changes requested.

Contributor

@dataroaring dataroaring left a comment

LGTM

@bobhan1 bobhan1 changed the title from "[Fix](cloud-mow) Fix wrong handling when MS don't set tablet states for GetDeleteBitmapUpdateLockResponse" to "[Fix](cloud-mow) Fix FE's wrong handling when low version MS don't set tablet states for GetDeleteBitmapUpdateLockResponse" Mar 17, 2025
@bobhan1
Contributor Author

bobhan1 commented Mar 17, 2025

run p0

@bobhan1
Contributor Author

bobhan1 commented Mar 18, 2025

run performance

@doris-robot

TPC-H: Total hot run time: 32263 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 63b2e91b4d03681f9417b3dcee476a5801aee366, data reload: false

------ Round 1 ----------------------------------
query	cold (ms)	hot run 1 (ms)	hot run 2 (ms)	min hot (ms)
q1	24242	5041	5002	5002
q2	2045	302	173	173
q3	10393	1219	683	683
q4	10242	999	547	547
q5	7532	2299	2382	2299
q6	184	170	130	130
q7	893	756	619	619
q8	9327	1260	1043	1043
q9	5051	4947	4848	4848
q10	6870	2307	1885	1885
q11	482	290	268	268
q12	352	357	219	219
q13	17757	3689	3058	3058
q14	236	234	211	211
q15	537	476	490	476
q16	630	601	582	582
q17	596	866	350	350
q18	6977	6469	6192	6192
q19	2033	937	564	564
q20	307	310	202	202
q21	2797	2217	1914	1914
q22	1040	1001	998	998
Total cold run time: 110523 ms
Total hot run time: 32263 ms

----- Round 2, with runtime_filter_mode=off -----
query	cold (ms)	hot run 1 (ms)	hot run 2 (ms)	min hot (ms)
q1	5286	5126	5131	5126
q2	241	329	237	237
q3	2146	2688	2300	2300
q4	1403	1784	1405	1405
q5	4216	4117	4423	4117
q6	220	171	128	128
q7	1968	1920	1801	1801
q8	2636	2618	2557	2557
q9	7292	7260	7224	7224
q10	3019	3194	2659	2659
q11	575	507	487	487
q12	671	771	628	628
q13	3555	3920	3373	3373
q14	278	319	271	271
q15	547	472	469	469
q16	663	690	675	675
q17	1159	1646	1315	1315
q18	7753	7565	7523	7523
q19	840	777	896	777
q20	1992	2010	1883	1883
q21	5446	4703	4808	4703
q22	1119	1076	1001	1001
Total cold run time: 53025 ms
Total hot run time: 50659 ms

@doris-robot

TPC-DS: Total hot run time: 192468 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 63b2e91b4d03681f9417b3dcee476a5801aee366, data reload: false

query	cold (ms)	hot run 1 (ms)	hot run 2 (ms)	min hot (ms)
query1	1460	1039	1015	1015
query2	6244	1944	1941	1941
query3	11050	4489	4763	4489
query4	26122	23393	23413	23393
query5	4481	667	500	500
query6	309	202	209	202
query7	3990	520	291	291
query8	323	251	238	238
query9	8530	2519	2504	2504
query10	478	320	256	256
query11	15388	15419	15054	15054
query12	161	109	105	105
query13	1711	516	401	401
query14	9940	6641	6342	6342
query15	209	188	178	178
query16	7633	658	476	476
query17	1284	775	595	595
query18	2048	450	330	330
query19	204	203	178	178
query20	130	131	120	120
query21	216	128	106	106
query22	4686	4828	4542	4542
query23	34322	33941	33539	33539
query24	7189	2479	2405	2405
query25	509	467	400	400
query26	1205	272	153	153
query27	2696	479	333	333
query28	4674	2435	2404	2404
query29	715	563	452	452
query30	275	258	197	197
query31	925	873	787	787
query32	79	66	62	62
query33	548	376	318	318
query34	801	871	501	501
query35	805	837	773	773
query36	968	993	891	891
query37	115	102	75	75
query38	4133	4231	4117	4117
query39	1495	1448	1451	1448
query40	227	114	102	102
query41	53	51	52	51
query42	126	108	110	108
query43	510	495	485	485
query44	1310	789	788	788
query45	182	175	170	170
query46	861	1044	658	658
query47	1880	1918	1809	1809
query48	398	418	306	306
query49	766	513	434	434
query50	706	747	430	430
query51	4276	4391	4223	4223
query52	114	111	100	100
query53	243	265	193	193
query54	493	506	424	424
query55	85	84	88	84
query56	294	271	281	271
query57	1168	1168	1135	1135
query58	241	248	251	248
query59	2817	2808	2727	2727
query60	281	274	267	267
query61	129	120	122	120
query62	790	817	664	664
query63	233	194	187	187
query64	4291	1115	696	696
query65	4592	4428	4474	4428
query66	970	414	320	320
query67	15976	15424	15358	15358
query68	8384	875	498	498
query69	473	302	264	264
query70	1216	1139	1088	1088
query71	480	306	270	270
query72	5614	3582	3846	3582
query73	786	736	357	357
query74	9073	9013	8694	8694
query75	3851	3147	2728	2728
query76	3818	1196	750	750
query77	851	373	281	281
query78	10121	10122	9392	9392
query79	2951	815	582	582
query80	658	514	449	449
query81	481	255	215	215
query82	687	121	97	97
query83	209	169	152	152
query84	296	88	76	76
query85	769	349	310	310
query86	380	296	281	281
query87	4423	4496	4536	4496
query88	3478	2162	2176	2162
query89	405	310	279	279
query90	1925	207	205	205
query91	139	140	112	112
query92	74	57	56	56
query93	1983	1061	576	576
query94	650	404	300	300
query95	354	258	258	258
query96	480	548	273	273
query97	3327	3395	3339	3339
query98	214	213	198	198
query99	1450	1399	1251	1251
Total cold run time: 282063 ms
Total hot run time: 192468 ms

@doris-robot

ClickBench: Total hot run time: 31.35 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 63b2e91b4d03681f9417b3dcee476a5801aee366, data reload: false

query	cold (s)	hot run 1 (s)	hot run 2 (s)
query1	0.05	0.04	0.04
query2	0.12	0.10	0.10
query3	0.24	0.20	0.19
query4	1.59	0.19	0.19
query5	0.59	0.59	0.58
query6	1.18	0.71	0.72
query7	0.02	0.02	0.02
query8	0.04	0.04	0.03
query9	0.59	0.54	0.53
query10	0.57	0.58	0.56
query11	0.16	0.11	0.11
query12	0.15	0.11	0.11
query13	0.61	0.61	0.60
query14	2.73	2.67	2.69
query15	0.94	0.84	0.86
query16	0.38	0.38	0.39
query17	1.02	0.99	1.09
query18	0.21	0.20	0.19
query19	1.90	1.95	1.87
query20	0.01	0.01	0.01
query21	15.38	0.88	0.55
query22	0.75	1.16	0.66
query23	14.97	1.36	0.62
query24	6.62	2.44	1.07
query25	0.51	0.35	0.08
query26	0.45	0.16	0.13
query27	0.05	0.05	0.05
query28	9.73	0.82	0.42
query29	12.55	4.04	3.35
query30	0.25	0.09	0.07
query31	2.82	0.59	0.39
query32	3.22	0.54	0.47
query33	2.98	3.07	3.00
query34	15.69	5.08	4.47
query35	4.53	4.53	4.49
query36	0.65	0.50	0.49
query37	0.08	0.06	0.06
query38	0.06	0.04	0.04
query39	0.03	0.02	0.02
query40	0.18	0.13	0.13
query41	0.09	0.02	0.02
query42	0.04	0.03	0.02
query43	0.03	0.03	0.03
Total cold run time: 104.76 s
Total hot run time: 31.35 s

Contributor

@dataroaring dataroaring left a comment

LGTM

@dataroaring dataroaring merged commit 323a2dd into apache:master Mar 18, 2025
36 checks passed
github-actions bot pushed a commit that referenced this pull request Mar 18, 2025
…t tablet states for `GetDeleteBitmapUpdateLockResponse` (#49165)
dataroaring pushed a commit that referenced this pull request Mar 18, 2025
… MS don't set tablet states for `GetDeleteBitmapUpdateLockResponse` #49165 (#49187)

Cherry-picked from #49165

Co-authored-by: bobhan1 <baohan@selectdb.com>
@gavinchou gavinchou mentioned this pull request Apr 23, 2025
koarz pushed a commit to koarz/doris that referenced this pull request Jun 4, 2025
…t tablet states for `GetDeleteBitmapUpdateLockResponse` (apache#49165)
Labels

approved (Indicates a PR has been approved by one committer.), dev/3.0.5-merged, p0_b, reviewed

5 participants