@bobhan1
Contributor

@bobhan1 bobhan1 commented Apr 3, 2025

pick #49710

@bobhan1 bobhan1 requested a review from dataroaring as a code owner April 3, 2025 09:12
@Thearas
Contributor

Thearas commented Apr 3, 2025

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@bobhan1 bobhan1 force-pushed the branch-3.0-pick-49710 branch 2 times, most recently from acce601 to 64bac5d Compare April 3, 2025 09:17
…sible versions' delete bitmaps (apache#49710)

Consider the following problem:
1. Transaction X acquires the lock and attempts to publish with version
a. This task is sent to the BE. At this point, the tablet's maximum
version is a-1, and task (1) starts computation.
2. Transaction X fails on the FE due to a timeout and releases the lock.
3. Transaction Y acquires the lock, attempts to publish with version a,
and succeeds.
4. Transaction X retries, acquires the lock again, and attempts to
publish with version b.
5. Meanwhile, task (1) from Transaction X completes its computation on
the BE and writes the generated delete bitmap to the MS with version a.
**Since Transaction X currently holds the lock, this write operation
succeeds, overwriting the delete bitmaps that Transaction Y wrote for
the actual version a.**
6. Subsequent transactions on the tablet will use the pending delete
bitmap to delete the version a delete bitmap written by task (1) in the
MS.
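
The sequence above can be replayed with a toy in-memory model. This is only a sketch for illustration: `ToyMetaService`, its methods, and the `(lock_id, initiator)` values are hypothetical and are not the Doris meta-service API. It assumes the MS validates only lock ownership on a delete bitmap write, which is the behavior the steps describe.

```python
# Toy model of the race described above. All names here are hypothetical;
# this is not the Doris meta-service API.

class ToyMetaService:
    def __init__(self):
        self.lock_holder = None   # (lock_id, initiator) of the current holder
        self.delete_bitmaps = {}  # version -> label of the writer

    def acquire_lock(self, lock_id, initiator):
        # A retry by the same txn presents the same (lock_id, initiator),
        # so the MS cannot tell the new lock from the old one.
        if self.lock_holder in (None, (lock_id, initiator)):
            self.lock_holder = (lock_id, initiator)
            return True
        return False

    def release_lock(self, lock_id, initiator):
        if self.lock_holder == (lock_id, initiator):
            self.lock_holder = None

    def update_delete_bitmap(self, lock_id, initiator, version, writer):
        # Only lock ownership is checked; the version is not validated.
        if self.lock_holder != (lock_id, initiator):
            return False
        self.delete_bitmaps[version] = writer
        return True


def run_scenario(a=5):
    ms = ToyMetaService()
    x, y = (100, 1), (200, 2)           # made-up (lock_id, initiator) pairs

    assert ms.acquire_lock(*x)          # 1. X locks; task (1) starts on the BE
    ms.release_lock(*x)                 # 2. X times out on the FE
    assert ms.acquire_lock(*y)          # 3. Y locks and publishes version a
    assert ms.update_delete_bitmap(*y, a, "Y")
    ms.release_lock(*y)
    assert ms.acquire_lock(*x)          # 4. X retries with a later version b
    # 5. Stale task (1) finishes and writes with the old version a.
    stale_ok = ms.update_delete_bitmap(*x, a, "X task (1)")
    return stale_ok, ms.delete_bitmaps[a]
```

In this model the stale write is accepted and replaces Transaction Y's entry for version a, because the retried lock is indistinguishable from the original one.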

The root cause is that when a load txn retries in the publish phase, the
lock it acquires on each attempt is logically a different lock, but the
current implementation treats them as the same lock because they share
the same lock_id and initiator.

This PR checks the target partition's version when updating delete
bitmaps to avoid this problem.
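
A minimal sketch of what such a version check could look like, under stated assumptions: `ToyPartition` and its methods are hypothetical, and the exact rule used here (reject a write whose version is not greater than the partition's visible version) is an assumption for illustration, not the code in this PR.

```python
# Sketch of a version check on delete bitmap updates. Names and the exact
# rule are assumptions for illustration, not the code in this PR.

class ToyPartition:
    def __init__(self, visible_version):
        self.visible_version = visible_version
        self.delete_bitmaps = {}  # version -> label of the writer

    def update_delete_bitmap(self, version, writer):
        # Reject writes that target an already visible version.
        if version <= self.visible_version:
            return False
        self.delete_bitmaps[version] = writer
        return True

    def publish(self, version, writer):
        # Write the delete bitmap, then make the version visible.
        if not self.update_delete_bitmap(version, writer):
            return False
        self.visible_version = version
        return True


def run_fixed_scenario(a=5, b=6):
    part = ToyPartition(visible_version=a - 1)
    assert part.publish(a, "Y")                            # Y publishes version a
    stale_ok = part.update_delete_bitmap(a, "X task (1)")  # stale write, version a
    retry_ok = part.update_delete_bitmap(b, "X retry")     # retried write, version b
    return stale_ok, retry_ok, part.delete_bitmaps[a]
```

With the check in place, the stale write for version a is refused while the retried write for version b still goes through, so Transaction Y's delete bitmap for version a survives.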
@bobhan1 bobhan1 force-pushed the branch-3.0-pick-49710 branch from 64bac5d to d31be44 Compare April 3, 2025 09:19
@bobhan1
Contributor Author

bobhan1 commented Apr 3, 2025

run buildall

@doris-robot

TeamCity cloud ut coverage result:
Function Coverage: 82.99% (1083/1305)
Line Coverage: 65.94% (17924/27182)
Region Coverage: 65.48% (8842/13504)
Branch Coverage: 55.36% (4770/8616)
Coverage Report: http://coverage.selectdb-in.cc/coverage/ba37a5ff0991bb2aad76a780ebf38544d09bfdd5_ba37a5ff0991bb2aad76a780ebf38544d09bfdd5_cloud/report/index.html

@bobhan1 bobhan1 changed the title from "[Fix](cloud-mow) Check partition's version to avoid wrongly update visible versions' delete bitmaps (#49710)" to "branch-3.0-pick: [Fix](cloud-mow) Check partition's version to avoid wrongly update visible versions' delete bitmaps (#49710)" Apr 3, 2025
@bobhan1
Contributor Author

bobhan1 commented Apr 3, 2025

run buildall

@doris-robot

TeamCity cloud ut coverage result:
Function Coverage: 82.99% (1083/1305)
Line Coverage: 65.97% (17931/27182)
Region Coverage: 65.49% (8844/13504)
Branch Coverage: 55.43% (4776/8616)
Coverage Report: http://coverage.selectdb-in.cc/coverage/092c710fbb7a389bd68d3ea58291c0a603b3e1ee_092c710fbb7a389bd68d3ea58291c0a603b3e1ee_cloud/report/index.html

@doris-robot

TPC-H: Total hot run time: 40506 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 092c710fbb7a389bd68d3ea58291c0a603b3e1ee, data reload: false

------ Round 1 ----------------------------------
q1	17639	7147	6662	6662
q2	2071	173	181	173
q3	10651	1107	1193	1107
q4	10576	769	788	769
q5	7748	2972	2911	2911
q6	227	138	142	138
q7	966	618	614	614
q8	9375	2007	2077	2007
q9	6626	6426	6395	6395
q10	6998	2256	2336	2256
q11	472	261	264	261
q12	424	218	206	206
q13	17778	2997	2980	2980
q14	232	218	206	206
q15	501	456	468	456
q16	668	585	580	580
q17	1020	618	573	573
q18	7371	6765	6675	6675
q19	1398	1114	1172	1114
q20	479	201	203	201
q21	4020	3249	3347	3249
q22	1071	973	990	973
Total cold run time: 108311 ms
Total hot run time: 40506 ms

----- Round 2, with runtime_filter_mode=off -----
q1	6586	6609	6569	6569
q2	334	241	237	237
q3	2905	2764	2976	2764
q4	2023	1811	1838	1811
q5	5729	5796	5747	5747
q6	209	128	129	128
q7	2236	1820	1821	1820
q8	3405	3611	3591	3591
q9	8705	8965	8869	8869
q10	3588	3533	3541	3533
q11	604	498	491	491
q12	809	620	616	616
q13	9974	3255	3189	3189
q14	307	270	270	270
q15	521	472	461	461
q16	674	655	640	640
q17	1862	1641	1640	1640
q18	8378	7968	7670	7670
q19	1668	1546	1606	1546
q20	2120	1893	1888	1888
q21	5438	5354	5355	5354
q22	1111	1016	975	975
Total cold run time: 69186 ms
Total hot run time: 59809 ms

@hello-stephen
Contributor

BE UT Coverage Report

Increment line coverage 15.79% (3/19) 🎉


Category Coverage
Function Coverage 39.11% (10243/26191)
Line Coverage 30.44% (87314/286809)
Region Coverage 29.51% (44913/152206)
Branch Coverage 26.04% (22871/87844)

@doris-robot

TPC-DS: Total hot run time: 197617 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 092c710fbb7a389bd68d3ea58291c0a603b3e1ee, data reload: false

query1	1317	962	924	924
query2	6223	2089	2037	2037
query3	10826	4321	4271	4271
query4	65790	28703	23327	23327
query5	4956	454	449	449
query6	398	173	182	173
query7	5629	319	307	307
query8	306	225	229	225
query9	9501	2634	2612	2612
query10	457	273	266	266
query11	17774	15167	16139	15167
query12	166	104	118	104
query13	1569	457	435	435
query14	10766	7916	7436	7436
query15	210	188	186	186
query16	7082	496	484	484
query17	1091	615	628	615
query18	2027	343	324	324
query19	239	168	162	162
query20	122	111	113	111
query21	214	107	107	107
query22	4635	4331	4363	4331
query23	35683	33946	34639	33946
query24	6509	2943	2908	2908
query25	508	421	414	414
query26	670	180	171	171
query27	2179	359	345	345
query28	4396	2455	2420	2420
query29	671	442	451	442
query30	249	164	161	161
query31	1013	847	837	837
query32	67	65	57	57
query33	450	295	309	295
query34	918	550	522	522
query35	872	758	755	755
query36	1127	970	990	970
query37	119	73	73	73
query38	4198	4056	4041	4041
query39	1486	1483	1506	1483
query40	208	105	103	103
query41	48	54	47	47
query42	122	117	105	105
query43	563	501	496	496
query44	1220	831	823	823
query45	189	174	174	174
query46	1173	756	730	730
query47	2008	1880	1935	1880
query48	478	379	416	379
query49	768	431	433	431
query50	873	462	460	460
query51	7352	7290	7129	7129
query52	103	92	96	92
query53	273	189	191	189
query54	577	466	463	463
query55	81	78	77	77
query56	272	262	269	262
query57	1243	1149	1120	1120
query58	223	221	242	221
query59	3068	2990	2822	2822
query60	295	274	281	274
query61	135	133	135	133
query62	769	698	679	679
query63	223	198	215	198
query64	1384	684	663	663
query65	3294	3210	3232	3210
query66	627	302	310	302
query67	15641	15757	15607	15607
query68	4089	572	558	558
query69	420	269	274	269
query70	1215	1158	1094	1094
query71	340	264	256	256
query72	6058	4115	4004	4004
query73	797	352	354	352
query74	10099	9240	9002	9002
query75	3351	2662	2687	2662
query76	1852	1154	1175	1154
query77	519	287	294	287
query78	10531	9664	9542	9542
query79	1459	601	622	601
query80	888	437	431	431
query81	513	243	236	236
query82	1279	96	88	88
query83	168	151	159	151
query84	285	86	88	86
query85	933	324	299	299
query86	356	296	309	296
query87	4514	4183	4267	4183
query88	3789	2391	2369	2369
query89	433	299	291	291
query90	1905	190	189	189
query91	187	151	147	147
query92	67	51	54	51
query93	1871	555	556	555
query94	746	297	288	288
query95	358	262	263	262
query96	641	280	283	280
query97	3324	3166	3190	3166
query98	213	210	202	202
query99	1630	1304	1336	1304
Total cold run time: 320453 ms
Total hot run time: 197617 ms

@doris-robot

ClickBench: Total hot run time: 33.17 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 092c710fbb7a389bd68d3ea58291c0a603b3e1ee, data reload: false

query1	0.04	0.03	0.03
query2	0.08	0.04	0.04
query3	0.23	0.05	0.06
query4	1.65	0.07	0.08
query5	0.51	0.50	0.51
query6	1.12	0.74	0.74
query7	0.02	0.01	0.03
query8	0.05	0.04	0.05
query9	0.56	0.50	0.50
query10	0.56	0.57	0.54
query11	0.16	0.12	0.13
query12	0.17	0.13	0.13
query13	0.60	0.60	0.59
query14	2.75	2.71	2.84
query15	0.92	0.85	0.83
query16	0.36	0.36	0.38
query17	1.04	1.08	1.08
query18	0.19	0.19	0.19
query19	1.99	1.89	2.03
query20	0.01	0.01	0.01
query21	15.35	0.66	0.65
query22	5.14	6.30	1.95
query23	18.25	1.43	1.37
query24	2.22	0.23	0.23
query25	0.15	0.09	0.09
query26	0.28	0.18	0.19
query27	0.08	0.08	0.08
query28	13.20	0.62	0.57
query29	12.67	3.45	3.43
query30	0.25	0.06	0.06
query31	2.84	0.40	0.40
query32	3.23	0.49	0.49
query33	2.95	3.07	3.05
query34	17.00	4.52	4.50
query35	4.60	4.59	4.58
query36	0.65	0.49	0.48
query37	0.20	0.17	0.17
query38	0.18	0.15	0.16
query39	0.06	0.05	0.04
query40	0.17	0.13	0.13
query41	0.10	0.05	0.06
query42	0.07	0.05	0.06
query43	0.05	0.04	0.04
Total cold run time: 112.7 s
Total hot run time: 33.17 s

Contributor

@dataroaring dataroaring left a comment

LGTM

@dataroaring dataroaring merged commit 0983526 into apache:branch-3.0 Apr 7, 2025
20 of 21 checks passed
5 participants