@bobhan1
Contributor

@bobhan1 bobhan1 commented Mar 31, 2025

What problem does this PR solve?

Consider the following scenario:

  1. Transaction X acquires the lock and attempts to publish with version a. The task is sent to the BE. At this point the tablet's maximum version is a-1, and task (1) starts computing.
  2. Transaction X fails on the FE due to a timeout and releases the lock.
  3. Transaction Y acquires the lock, attempts to publish with version a, and succeeds.
  4. Transaction X retries, acquires the lock again, and attempts to publish with version b.
  5. Meanwhile, task (1) from Transaction X finishes its computation on the BE and writes the generated delete bitmap to the MS with version a. Because Transaction X currently holds the lock, this write succeeds and overwrites the delete bitmaps that Transaction Y wrote for the actual version a.
  6. Subsequent transactions on the tablet will use the pending delete bitmap to delete the version a delete bitmap written by task (1) in the MS.

The root cause is that when a load txn retries in the publish phase, the lock it acquires on each attempt should be treated as a different lock, but the current implementation treats them as the same lock because they share the same lock_id and initiator.

This PR checks the target partition's version when updating delete bitmaps to avoid this problem.
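
The scenario and the added check can be replayed with a small in-memory model. This is only a sketch in Python, not the actual meta-service code: `MetaService`, `update_delete_bitmap`, `check_version`, the lock ids, and the rule "the written version must equal the partition's visible version plus one" are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


class StaleTaskError(Exception):
    """Raised when a delete-bitmap write targets an already visible version."""


@dataclass
class MetaService:
    """Toy stand-in for the MS; every name here is hypothetical."""
    visible_version: int                                 # partition's visible version
    lock_holder: Optional[tuple] = None                  # (lock_id, initiator)
    delete_bitmaps: dict = field(default_factory=dict)   # version -> writer tag

    def acquire_lock(self, lock_id, initiator):
        if self.lock_holder is not None:
            raise RuntimeError("lock is held")
        self.lock_holder = (lock_id, initiator)

    def release_lock(self):
        self.lock_holder = None

    def update_delete_bitmap(self, lock_id, initiator, version, writer, check_version):
        # Lock guard: a retried txn reuses the same (lock_id, initiator),
        # so a stale task from the first attempt still passes this test.
        if self.lock_holder != (lock_id, initiator):
            raise RuntimeError("lock not held")
        # Assumed form of the new guard: only the next visible version may be written.
        if check_version and version != self.visible_version + 1:
            raise StaleTaskError(f"version {version} is not {self.visible_version + 1}")
        self.delete_bitmaps[version] = writer

    def publish(self, version):
        self.visible_version = version


def replay(check_version):
    """Replay steps 1-5 and return who owns the delete bitmap of version a."""
    a = 10
    ms = MetaService(visible_version=a - 1)
    ms.acquire_lock(1001, -1)      # step 1: X starts task (1) for version a
    ms.release_lock()              # step 2: X times out on the FE
    ms.acquire_lock(1002, -1)      # step 3: Y publishes version a
    ms.update_delete_bitmap(1002, -1, a, "Y", check_version)
    ms.publish(a)
    ms.release_lock()
    ms.acquire_lock(1001, -1)      # step 4: X retries, now targeting version b
    try:                           # step 5: stale task (1) still writes version a
        ms.update_delete_bitmap(1001, -1, a, "X-stale", check_version)
    except StaleTaskError:
        pass
    return ms.delete_bitmaps[a]
```

In this model, `replay(False)` ends with the stale task owning version a's delete bitmap, while `replay(True)` keeps the bitmap written by Transaction Y because the stale write is rejected.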

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@hello-stephen
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@bobhan1 bobhan1 force-pushed the check-partition-ver-when-update-delete-bitmap-o-ms branch 3 times, most recently from 63e9286 to 7b7c1f6 Compare March 31, 2025 11:33
@bobhan1 bobhan1 force-pushed the check-partition-ver-when-update-delete-bitmap-o-ms branch from 72706d6 to e748596 Compare April 1, 2025 03:51
@bobhan1 bobhan1 marked this pull request as ready for review April 1, 2025 03:51
@bobhan1
Contributor Author

bobhan1 commented Apr 1, 2025

run buildall

1 similar comment
@bobhan1
Contributor Author

bobhan1 commented Apr 1, 2025

run buildall

@doris-robot

TeamCity cloud ut coverage result:
Function Coverage: 83.07% (1089/1311)
Line Coverage: 66.14% (18168/27470)
Region Coverage: 65.52% (8939/13643)
Branch Coverage: 55.37% (4816/8698)
Coverage Report: http://coverage.selectdb-in.cc/coverage/1c945bb46538c430e018f9ac70a6d8fbda3761bc_1c945bb46538c430e018f9ac70a6d8fbda3761bc_cloud/report/index.html

@doris-robot

TPC-H: Total hot run time: 35348 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 1c945bb46538c430e018f9ac70a6d8fbda3761bc, data reload: false

------ Round 1 ----------------------------------
query	cold run (ms)	hot run 1 (ms)	hot run 2 (ms)	best hot run (ms)
q1	26140	5080	5057	5057
q2	2069	273	194	194
q3	10446	1242	707	707
q4	10250	1020	554	554
q5	7615	2441	2370	2370
q6	190	166	135	135
q7	941	771	617	617
q8	9328	1308	1154	1154
q9	6864	5142	5159	5142
q10	6880	2310	1929	1929
q11	486	303	283	283
q12	368	365	224	224
q13	17781	3749	3104	3104
q14	242	246	211	211
q15	531	484	477	477
q16	626	653	586	586
q17	624	858	406	406
q18	7654	7136	7139	7136
q19	2269	1089	568	568
q20	339	335	242	242
q21	4426	3559	3271	3271
q22	1090	1031	981	981
Total cold run time: 117159 ms
Total hot run time: 35348 ms

----- Round 2, with runtime_filter_mode=off -----
query	cold run (ms)	hot run 1 (ms)	hot run 2 (ms)	best hot run (ms)
q1	5355	5124	5182	5124
q2	252	326	233	233
q3	2186	2684	2271	2271
q4	1537	1950	1516	1516
q5	4507	4455	4401	4401
q6	218	172	128	128
q7	1998	1921	1747	1747
q8	2655	2520	2575	2520
q9	7357	7100	7324	7100
q10	3035	3221	2775	2775
q11	570	508	499	499
q12	704	745	631	631
q13	3543	3868	3383	3383
q14	279	299	272	272
q15	541	507	483	483
q16	672	681	634	634
q17	1197	1489	1395	1395
q18	7901	7572	7569	7569
q19	808	831	863	831
q20	1915	1964	1829	1829
q21	5481	5004	4983	4983
q22	1150	1094	1019	1019
Total cold run time: 53861 ms
Total hot run time: 51343 ms

@bobhan1
Contributor Author

bobhan1 commented Apr 1, 2025

run buildall

@doris-robot

TeamCity cloud ut coverage result:
Function Coverage: 83.07% (1089/1311)
Line Coverage: 66.13% (18167/27470)
Region Coverage: 65.54% (8941/13643)
Branch Coverage: 55.36% (4815/8698)
Coverage Report: http://coverage.selectdb-in.cc/coverage/6ce48bdf635c88a0999ab15d86a912fd5b5330d6_6ce48bdf635c88a0999ab15d86a912fd5b5330d6_cloud/report/index.html

@bobhan1
Contributor Author

bobhan1 commented Apr 1, 2025

run cloud_p0

@doris-robot

BE UT Coverage Report

Increment line coverage 25.00% (5/20) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 51.33% (13737/26760)
Line Coverage 40.69% (119619/293944)
Region Coverage 39.42% (60940/154605)
Branch Coverage 34.21% (30585/89396)

@bobhan1 bobhan1 force-pushed the check-partition-ver-when-update-delete-bitmap-o-ms branch 3 times, most recently from 0fd617e to c196ca2 Compare April 1, 2025 08:46
@bobhan1
Contributor Author

bobhan1 commented Apr 1, 2025

run buildall

@doris-robot

TeamCity cloud ut coverage result:
Function Coverage: 83.07% (1089/1311)
Line Coverage: 66.15% (18174/27474)
Region Coverage: 65.53% (8941/13645)
Branch Coverage: 55.33% (4813/8698)
Coverage Report: http://coverage.selectdb-in.cc/coverage/c196ca2ec89ae2ed46dc2edc960f253a430ea52c_c196ca2ec89ae2ed46dc2edc960f253a430ea52c_cloud/report/index.html

@doris-robot

TPC-H: Total hot run time: 34223 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit c196ca2ec89ae2ed46dc2edc960f253a430ea52c, data reload: false

------ Round 1 ----------------------------------
query	cold run (ms)	hot run 1 (ms)	hot run 2 (ms)	best hot run (ms)
q1	26051	5160	5032	5032
q2	2059	300	198	198
q3	10727	1257	697	697
q4	10300	1035	549	549
q5	9193	2438	2319	2319
q6	205	162	133	133
q7	930	764	615	615
q8	9318	1300	1087	1087
q9	6808	5158	5087	5087
q10	6812	2312	1896	1896
q11	468	293	274	274
q12	342	356	235	235
q13	17799	3698	3067	3067
q14	237	226	211	211
q15	530	470	475	470
q16	634	614	619	614
q17	629	860	394	394
q18	7588	7167	7101	7101
q19	1648	972	593	593
q20	348	339	234	234
q21	3946	3515	2469	2469
q22	1041	1005	948	948
Total cold run time: 117613 ms
Total hot run time: 34223 ms

----- Round 2, with runtime_filter_mode=off -----
query	cold run (ms)	hot run 1 (ms)	hot run 2 (ms)	best hot run (ms)
q1	5226	5113	5089	5089
q2	246	325	233	233
q3	2124	2647	2255	2255
q4	1453	1962	1462	1462
q5	4457	4485	4372	4372
q6	225	170	134	134
q7	1988	1934	1743	1743
q8	2602	2495	2554	2495
q9	7341	7036	7125	7036
q10	2998	3202	2755	2755
q11	607	550	509	509
q12	673	748	604	604
q13	3452	3904	3353	3353
q14	280	296	272	272
q15	520	469	481	469
q16	653	686	646	646
q17	1167	1569	1389	1389
q18	7796	7538	7482	7482
q19	797	812	939	812
q20	1952	1968	1926	1926
q21	5365	4819	4849	4819
q22	1065	1086	1002	1002
Total cold run time: 52987 ms
Total hot run time: 50857 ms

@doris-robot

TPC-DS: Total hot run time: 194016 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit c196ca2ec89ae2ed46dc2edc960f253a430ea52c, data reload: false

query	cold run (ms)	hot run 1 (ms)	hot run 2 (ms)	best hot run (ms)
query1	1430	1080	1064	1064
query2	5999	1962	1930	1930
query3	11002	4529	4493	4493
query4	52711	25159	23380	23380
query5	4882	549	455	455
query6	359	216	196	196
query7	4896	488	287	287
query8	297	254	233	233
query9	5255	2591	2606	2591
query10	449	339	266	266
query11	15066	15167	15125	15125
query12	164	111	102	102
query13	1020	508	384	384
query14	10215	6495	6576	6495
query15	207	196	183	183
query16	7100	676	478	478
query17	1081	731	600	600
query18	1600	418	321	321
query19	193	188	167	167
query20	137	133	124	124
query21	209	125	107	107
query22	4542	4645	4246	4246
query23	33892	33116	33317	33116
query24	6696	2487	2421	2421
query25	467	504	453	453
query26	682	289	164	164
query27	2188	520	351	351
query28	2996	2468	2477	2468
query29	621	592	492	492
query30	290	227	195	195
query31	874	889	821	821
query32	75	65	65	65
query33	460	410	329	329
query34	765	871	554	554
query35	803	847	765	765
query36	955	1000	917	917
query37	125	107	104	104
query38	4289	4178	4040	4040
query39	1495	1425	1452	1425
query40	216	120	112	112
query41	55	57	53	53
query42	125	113	111	111
query43	513	509	498	498
query44	1316	840	833	833
query45	183	174	165	165
query46	849	1052	651	651
query47	1884	1860	1804	1804
query48	395	421	312	312
query49	690	531	486	486
query50	685	708	409	409
query51	4271	4308	4281	4281
query52	120	107	105	105
query53	229	264	185	185
query54	579	597	515	515
query55	88	83	87	83
query56	309	306	322	306
query57	1188	1193	1109	1109
query58	281	285	268	268
query59	2915	2830	2896	2830
query60	343	329	307	307
query61	135	138	130	130
query62	776	749	683	683
query63	228	184	192	184
query64	1500	1072	770	770
query65	4461	4336	4209	4209
query66	742	400	308	308
query67	16052	15510	15238	15238
query68	7182	894	517	517
query69	541	307	264	264
query70	1212	1172	1089	1089
query71	483	325	337	325
query72	5969	4966	5004	4966
query73	1534	725	347	347
query74	8958	9068	8818	8818
query75	3849	3207	2746	2746
query76	4201	1201	741	741
query77	638	382	288	288
query78	10192	10205	9351	9351
query79	2226	832	586	586
query80	592	515	517	515
query81	501	256	228	228
query82	451	127	99	99
query83	259	266	236	236
query84	302	114	91	91
query85	756	360	327	327
query86	367	313	291	291
query87	4681	4433	4333	4333
query88	3355	2294	2240	2240
query89	415	311	286	286
query90	1832	216	221	216
query91	145	147	118	118
query92	77	62	58	58
query93	1792	967	584	584
query94	657	467	324	324
query95	377	308	293	293
query96	488	589	288	288
query97	3235	3256	3122	3122
query98	230	216	200	200
query99	1426	1409	1277	1277
Total cold run time: 296538 ms
Total hot run time: 194016 ms

@doris-robot

ClickBench: Total hot run time: 31.8 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit c196ca2ec89ae2ed46dc2edc960f253a430ea52c, data reload: false

query	cold run (s)	hot run 1 (s)	hot run 2 (s)
query1	0.04	0.04	0.04
query2	0.14	0.11	0.11
query3	0.35	0.20	0.20
query4	1.59	0.20	0.19
query5	0.62	0.62	0.61
query6	1.16	0.72	0.73
query7	0.02	0.02	0.01
query8	0.05	0.05	0.05
query9	0.61	0.52	0.52
query10	0.58	0.58	0.58
query11	0.26	0.12	0.13
query12	0.25	0.13	0.13
query13	0.63	0.62	0.62
query14	2.65	2.69	2.84
query15	1.00	0.88	0.88
query16	0.37	0.36	0.37
query17	1.00	1.05	1.05
query18	0.19	0.18	0.18
query19	1.96	2.01	1.85
query20	0.02	0.01	0.01
query21	15.36	0.96	0.66
query22	0.92	0.98	0.78
query23	14.73	1.52	0.76
query24	5.51	0.54	0.27
query25	0.17	0.08	0.08
query26	0.57	0.22	0.17
query27	0.08	0.08	0.08
query28	11.01	1.15	0.57
query29	12.53	4.03	3.39
query30	0.27	0.08	0.06
query31	2.85	0.64	0.43
query32	3.24	0.60	0.51
query33	3.12	3.05	3.12
query34	16.80	5.15	4.44
query35	4.45	4.48	4.47
query36	0.64	0.51	0.50
query37	0.19	0.17	0.17
query38	0.16	0.16	0.17
query39	0.05	0.05	0.04
query40	0.22	0.16	0.15
query41	0.11	0.06	0.05
query42	0.06	0.05	0.05
query43	0.05	0.05	0.04
Total cold run time: 106.58 s
Total hot run time: 31.8 s

@bobhan1
Contributor Author

bobhan1 commented Apr 1, 2025

run buildall

1 similar comment
@bobhan1
Contributor Author

bobhan1 commented Apr 1, 2025

run buildall

@doris-robot

TeamCity cloud ut coverage result:
Function Coverage: 83.07% (1089/1311)
Line Coverage: 66.09% (18152/27465)
Region Coverage: 65.48% (8934/13644)
Branch Coverage: 55.31% (4811/8698)
Coverage Report: http://coverage.selectdb-in.cc/coverage/6f88a49fdda0e8efcc05890055f74f184a870058_6f88a49fdda0e8efcc05890055f74f184a870058_cloud/report/index.html

@doris-robot

TeamCity cloud ut coverage result:
Function Coverage: 83.07% (1089/1311)
Line Coverage: 66.09% (18168/27488)
Region Coverage: 65.53% (8950/13657)
Branch Coverage: 55.33% (4818/8708)
Coverage Report: http://coverage.selectdb-in.cc/coverage/efbaf9ec1fa5cd833c7d077c6196ae250619b4fa_efbaf9ec1fa5cd833c7d077c6196ae250619b4fa_cloud/report/index.html

@doris-robot

TPC-H: Total hot run time: 34201 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit efbaf9ec1fa5cd833c7d077c6196ae250619b4fa, data reload: false

------ Round 1 ----------------------------------
query	cold run (ms)	hot run 1 (ms)	hot run 2 (ms)	best hot run (ms)
q1	26244	5086	5071	5071
q2	2076	283	180	180
q3	10475	1263	701	701
q4	10244	1010	513	513
q5	7554	2328	2377	2328
q6	185	164	131	131
q7	919	718	617	617
q8	9327	1310	1106	1106
q9	6712	5178	5143	5143
q10	6866	2315	1879	1879
q11	503	289	272	272
q12	352	356	226	226
q13	17783	3675	3091	3091
q14	224	245	210	210
q15	533	487	497	487
q16	629	600	599	599
q17	604	869	377	377
q18	7847	7191	7062	7062
q19	1934	965	559	559
q20	336	348	230	230
q21	3935	3390	2417	2417
q22	1086	1007	1002	1002
Total cold run time: 116368 ms
Total hot run time: 34201 ms

----- Round 2, with runtime_filter_mode=off -----
query	cold run (ms)	hot run 1 (ms)	hot run 2 (ms)	best hot run (ms)
q1	5290	5166	5099	5099
q2	236	331	221	221
q3	2173	2653	2319	2319
q4	1438	1896	1513	1513
q5	4445	4372	4364	4364
q6	212	166	126	126
q7	2040	1913	1760	1760
q8	2595	2579	2542	2542
q9	7213	7152	7183	7152
q10	3003	3169	2749	2749
q11	602	500	483	483
q12	714	792	624	624
q13	3459	3909	3387	3387
q14	277	298	255	255
q15	527	475	459	459
q16	664	679	671	671
q17	1161	1613	1401	1401
q18	7790	7702	7409	7409
q19	826	838	953	838
q20	1931	1972	1837	1837
q21	5228	4937	4861	4861
q22	1128	1091	1014	1014
Total cold run time: 52952 ms
Total hot run time: 51084 ms

@doris-robot

TPC-DS: Total hot run time: 193840 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit efbaf9ec1fa5cd833c7d077c6196ae250619b4fa, data reload: false

query	cold run (ms)	hot run 1 (ms)	hot run 2 (ms)	best hot run (ms)
query1	1382	1107	1032	1032
query2	6112	1963	1985	1963
query3	10998	4493	4613	4493
query4	25552	23961	23354	23354
query5	4775	634	463	463
query6	309	215	198	198
query7	3995	491	280	280
query8	303	250	237	237
query9	8500	2592	2579	2579
query10	464	319	279	279
query11	15496	15534	14805	14805
query12	166	115	114	114
query13	1579	523	400	400
query14	9130	6169	6123	6123
query15	203	199	173	173
query16	7661	648	445	445
query17	1163	804	581	581
query18	2029	406	308	308
query19	219	191	165	165
query20	129	120	131	120
query21	215	135	109	109
query22	4650	4665	4484	4484
query23	34417	33541	33665	33541
query24	8781	2421	2444	2421
query25	524	489	416	416
query26	1193	284	154	154
query27	2750	527	337	337
query28	4600	2431	2433	2431
query29	726	586	455	455
query30	280	225	200	200
query31	919	902	827	827
query32	75	63	63	63
query33	537	364	311	311
query34	799	914	546	546
query35	866	848	780	780
query36	984	993	915	915
query37	124	104	78	78
query38	4330	4234	4271	4234
query39	1493	1474	1447	1447
query40	226	120	111	111
query41	61	57	96	57
query42	120	103	111	103
query43	523	510	477	477
query44	1357	829	828	828
query45	175	174	166	166
query46	857	1028	647	647
query47	1873	1850	1838	1838
query48	379	420	325	325
query49	781	532	419	419
query50	660	696	405	405
query51	4261	4231	4229	4229
query52	112	113	97	97
query53	231	267	176	176
query54	592	629	502	502
query55	80	80	84	80
query56	310	315	334	315
query57	1175	1204	1160	1160
query58	273	259	258	258
query59	2762	2985	2819	2819
query60	341	318	299	299
query61	128	133	137	133
query62	774	737	669	669
query63	217	192	188	188
query64	4120	1035	703	703
query65	4469	4343	4401	4343
query66	950	413	314	314
query67	16439	15634	15539	15539
query68	8883	890	530	530
query69	483	296	264	264
query70	1205	1169	1086	1086
query71	478	309	283	283
query72	5579	4709	4855	4709
query73	747	653	353	353
query74	8900	9004	8953	8953
query75	4018	3206	2710	2710
query76	3774	1199	759	759
query77	782	374	282	282
query78	9928	10150	9259	9259
query79	3187	811	553	553
query80	668	523	458	458
query81	476	255	226	226
query82	454	126	95	95
query83	292	256	234	234
query84	292	103	94	94
query85	804	359	321	321
query86	335	297	292	292
query87	4414	4566	4404	4404
query88	3065	2211	2258	2211
query89	432	303	279	279
query90	2012	206	209	206
query91	140	153	109	109
query92	80	60	55	55
query93	2134	935	576	576
query94	763	411	308	308
query95	373	284	288	284
query96	482	562	275	275
query97	3204	3253	3130	3130
query98	227	215	200	200
query99	1468	1432	1281	1281
Total cold run time: 282753 ms
Total hot run time: 193840 ms

@doris-robot

ClickBench: Total hot run time: 31.64 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit efbaf9ec1fa5cd833c7d077c6196ae250619b4fa, data reload: false

query	cold run (s)	hot run 1 (s)	hot run 2 (s)
query1	0.03	0.03	0.03
query2	0.15	0.11	0.11
query3	0.35	0.21	0.21
query4	1.59	0.20	0.20
query5	0.61	0.62	0.62
query6	1.20	0.72	0.72
query7	0.01	0.02	0.02
query8	0.05	0.05	0.05
query9	0.63	0.51	0.51
query10	0.56	0.58	0.58
query11	0.25	0.13	0.12
query12	0.25	0.13	0.12
query13	0.63	0.61	0.61
query14	2.69	2.73	2.68
query15	1.00	0.88	0.87
query16	0.36	0.37	0.37
query17	1.00	1.01	1.04
query18	0.19	0.18	0.18
query19	1.88	1.97	1.84
query20	0.01	0.01	0.02
query21	15.35	0.97	0.67
query22	0.93	1.03	0.80
query23	14.69	1.50	0.75
query24	5.45	0.58	0.29
query25	0.16	0.09	0.08
query26	0.54	0.22	0.17
query27	0.08	0.08	0.08
query28	11.02	1.14	0.57
query29	12.55	4.11	3.41
query30	0.27	0.08	0.06
query31	2.81	0.60	0.43
query32	3.23	0.60	0.49
query33	3.08	3.08	3.10
query34	16.60	5.12	4.35
query35	4.47	4.41	4.48
query36	0.64	0.50	0.50
query37	0.20	0.18	0.17
query38	0.17	0.15	0.15
query39	0.05	0.05	0.04
query40	0.22	0.18	0.15
query41	0.10	0.05	0.06
query42	0.07	0.05	0.04
query43	0.05	0.05	0.04
Total cold run time: 106.17 s
Total hot run time: 31.64 s

@doris-robot

BE UT Coverage Report

Increment line coverage 25.00% (5/20) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 52.14% (13968/26790)
Line Coverage 40.81% (120194/294489)
Region Coverage 39.59% (61289/154821)
Branch Coverage 34.23% (30645/89514)

@bobhan1
Contributor Author

bobhan1 commented Apr 3, 2025

run feut

Contributor

@zhannngchen zhannngchen left a comment

LGTM

@github-actions github-actions bot added the approved Indicates a PR has been approved by one committer. label Apr 3, 2025
@github-actions
Contributor

github-actions bot commented Apr 3, 2025

PR approved by at least one committer and no changes requested.

@github-actions
Contributor

github-actions bot commented Apr 3, 2025

PR approved by anyone and no changes requested.

@zhannngchen zhannngchen merged commit a2979e6 into apache:master Apr 3, 2025
24 of 26 checks passed
bobhan1 added a commit to bobhan1/doris that referenced this pull request Apr 3, 2025
…sible versions' delete bitmaps (apache#49710)

Consider the following scenario:
1. Transaction X acquires the lock and attempts to publish with version
a. The task is sent to the BE. At this point the tablet's maximum
version is a-1, and task (1) starts computing.
2. Transaction X fails on the FE due to a timeout and releases the lock.
3. Transaction Y acquires the lock, attempts to publish with version a,
and succeeds.
4. Transaction X retries, acquires the lock again, and attempts to
publish with version b.
5. Meanwhile, task (1) from Transaction X finishes its computation on
the BE and writes the generated delete bitmap to the MS with version a.
**Because Transaction X currently holds the lock, this write succeeds
and overwrites the delete bitmaps that Transaction Y wrote for the
actual version a.**
6. Subsequent transactions on the tablet will use the pending delete
bitmap to delete the version a delete bitmap written by task (1) in the
MS.

The root cause is that when a load txn retries in the publish phase, the
lock it acquires on each attempt should be treated as a different lock,
but the current implementation treats them as the same lock because they
share the same lock_id and initiator.

This PR checks the target partition's version when updating delete
bitmaps to avoid this problem.
dataroaring pushed a commit that referenced this pull request Apr 7, 2025
…wrongly update visible versions' delete bitmaps (#49710) (#49796)

pick #49710
@gavinchou gavinchou mentioned this pull request Apr 23, 2025
koarz pushed a commit to koarz/doris that referenced this pull request Jun 4, 2025
…sible versions' delete bitmaps (apache#49710)
dataroaring pushed a commit that referenced this pull request Jul 7, 2025
… bitmap response regardless of status code (#52547)

### What problem does this PR solve?

#49710 added a check in the MS to forbid
stale calc delete bitmap tasks from wrongly updating delete bitmaps in the MS.
But this may cause loads to fail due to the check on the FE.
This PR lets the FE retry committing the txn when it encounters a stale calc delete
bitmap response, regardless of the task's status code, to avoid the problem.
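
The retry rule described above can be sketched as a small decision function. This is an illustration in Python, not the FE's Java code: `CalcDeleteBitmapResponse`, its `stale` flag, `Action`, and `decide` are hypothetical names, and how the FE actually detects a stale response is not stated here.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    COMMIT = "commit"
    RETRY = "retry"
    FAIL = "fail"


@dataclass
class CalcDeleteBitmapResponse:
    ok: bool      # status reported by the BE task
    stale: bool   # assumed flag: response belongs to a superseded attempt


def decide(resp):
    # A stale response carries no reliable information about the current
    # attempt, so its status code is ignored and the commit is retried.
    if resp.stale:
        return Action.RETRY
    # A fresh response is handled according to its status code.
    return Action.COMMIT if resp.ok else Action.FAIL
```

Under these assumptions a stale response leads to a retry whether the task reported success or failure, so a task rejected by the MS check no longer fails the load outright.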
bobhan1 added a commit to bobhan1/doris that referenced this pull request Jul 7, 2025
… bitmap response regardless of status code (apache#52547)

apache#49710 added a check in the MS to forbid
stale calc delete bitmap tasks from wrongly updating delete bitmaps in the MS.
But this may cause loads to fail due to the check on the FE.
This PR lets the FE retry committing the txn when it encounters a stale calc delete
bitmap response, regardless of the task's status code, to avoid the problem.
bobhan1 added a commit to bobhan1/doris that referenced this pull request Jul 8, 2025
… bitmap response regardless of status code (apache#52547)

apache#49710 added a check in the MS to forbid
stale calc delete bitmap tasks from wrongly updating delete bitmaps in the MS.
But this may cause loads to fail due to the check on the FE.
This PR lets the FE retry committing the txn when it encounters a stale calc delete
bitmap response, regardless of the task's status code, to avoid the problem.

Labels

approved Indicates a PR has been approved by one committer. dev/3.0.5-merged p0_w reviewed

6 participants