@yujun777 yujun777 commented Aug 1, 2024

BUG:

  1. BE begins collecting a tablet report;
  2. BE clones a new replica A;
  3. FE handles the tablet report that this BE started collecting in step 1. The report is stale: it does not include replica A, so FE marks replica A as bad;

Only about 1 minute later, when BE reports its tablets again, does the new report contain replica A. Only then does FE change replica A from bad back to good.

Fix:
When BE clones a new replica, it should increase its report version and tell FE to update it. Then, when FE handles the stale tablet report, it compares the report against the BE's report version, finds that the tablet report is stale, and discards it.
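
The race and the fix described above can be sketched roughly as follows. This is only an illustrative model, not the code in this PR: `Backend`, `Frontend`, `TabletReport`, and all method names are hypothetical, and the real BE/FE exchange goes over RPC rather than direct calls. The assumption is that each tablet report carries the BE report version captured when collection began, and that FE drops any report whose version is older than the latest version it has recorded for that BE.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <set>
#include <string>

// Hypothetical types for illustration; these are not the actual Doris classes.
struct TabletReport {
    int64_t report_version;          // BE report version when collection began
    std::set<std::string> replicas;  // replicas visible on the BE at that moment
};

class Backend {
public:
    // Step 1: snapshot the current version and replica set for a report.
    TabletReport begin_collect_report() const {
        return TabletReport{report_version_.load(), replicas_};
    }

    // Step 2: cloning adds a replica and bumps the report version.
    // The returned version is what the BE would send to FE with the clone result.
    int64_t clone_replica(const std::string& replica) {
        replicas_.insert(replica);
        return ++report_version_;
    }

private:
    std::atomic<int64_t> report_version_{0};
    std::set<std::string> replicas_;
};

class Frontend {
public:
    // Called when the BE reports a finished clone together with its new version.
    void on_clone_finished(const std::string& replica, int64_t be_version) {
        assert(be_version > 0);  // a clone always bumps the version past 0
        known_replicas_.insert(replica);
        if (be_version > be_report_version_) be_report_version_ = be_version;
    }

    // Step 3: returns false if the report is stale and was discarded.
    bool handle_tablet_report(const TabletReport& report) {
        if (report.report_version < be_report_version_) {
            return false;  // collected before the latest clone: ignore it
        }
        be_report_version_ = report.report_version;
        // Only a report that is not stale may mark a missing replica as bad.
        for (const auto& r : known_replicas_) {
            if (report.replicas.count(r) == 0) {
                bad_replicas_.insert(r);
            } else {
                bad_replicas_.erase(r);
            }
        }
        return true;
    }

    bool is_bad(const std::string& replica) const {
        return bad_replicas_.count(replica) > 0;
    }

private:
    int64_t be_report_version_ = 0;
    std::set<std::string> known_replicas_;
    std::set<std::string> bad_replicas_;
};
```

In this model, a report snapshotted before `clone_replica("A")` carries the old version, so `handle_tablet_report` rejects it and replica A is never marked bad. A report collected after the clone carries the bumped version and is processed normally.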

@doris-robot

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR

Since 2024-03-18, the Document has been moved to doris-website.
See Doris Document.

yujun777 commented Aug 1, 2024

run buildall

@github-actions github-actions bot added the doing label Aug 1, 2024
@github-actions github-actions bot left a comment

clang-tidy made some suggestions

std::vector<TTabletInfo>* tablet_infos);
~EngineCloneTask() override = default;

public:
warning: redundant access specifier has the same accessibility as the previous access specifier [readability-redundant-access-specifiers]

Suggested change (remove this line):
public:

Additional context

be/src/olap/task/engine_clone_task.h:49: previously declared here

public:
^

@doris-robot

TPC-H: Total hot run time: 41337 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit dce7914e4c626cbb85277f094c2ba35d1035cd82, data reload: false

------ Round 1 ----------------------------------
q1	17633	4033	4021	4021
q2	2023	196	193	193
q3	10478	1305	1371	1305
q4	10162	839	923	839
q5	7622	2936	2956	2936
q6	217	134	136	134
q7	1039	618	612	612
q8	9428	1926	1915	1915
q9	8433	6576	6576	6576
q10	8763	3827	3838	3827
q11	429	243	241	241
q12	409	228	224	224
q13	17765	2912	2946	2912
q14	267	248	252	248
q15	520	491	491	491
q16	536	398	385	385
q17	963	943	909	909
q18	7904	7181	7240	7181
q19	1410	1207	1215	1207
q20	567	326	343	326
q21	5324	4719	4572	4572
q22	342	283	285	283
Total cold run time: 112234 ms
Total hot run time: 41337 ms

----- Round 2, with runtime_filter_mode=off -----
q1	4076	4024	4034	4024
q2	337	224	215	215
q3	3003	2974	3068	2974
q4	1975	1992	1998	1992
q5	5588	5469	5434	5434
q6	218	140	137	137
q7	2116	1740	1831	1740
q8	3295	3333	3311	3311
q9	8628	8591	8730	8591
q10	3955	4046	3885	3885
q11	552	461	477	461
q12	774	599	609	599
q13	16486	3104	3134	3104
q14	298	278	268	268
q15	542	474	507	474
q16	459	407	410	407
q17	1761	1746	1683	1683
q18	8242	7624	7599	7599
q19	1697	1731	1733	1731
q20	2048	1835	1832	1832
q21	5627	5468	5315	5315
q22	526	465	465	465
Total cold run time: 72203 ms
Total hot run time: 56241 ms

dataroaring previously approved these changes Aug 1, 2024
@dataroaring dataroaring left a comment

LGTM

@github-actions github-actions bot added the approved Indicates a PR has been approved by one committer. label Aug 1, 2024
github-actions bot commented Aug 1, 2024

PR approved by at least one committer and no changes requested.

github-actions bot commented Aug 1, 2024

PR approved by anyone and no changes requested.

@doris-robot

TPC-DS: Total hot run time: 169973 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit dce7914e4c626cbb85277f094c2ba35d1035cd82, data reload: false

query1	914	370	363	363
query2	6455	1739	1730	1730
query3	6675	214	228	214
query4	19859	17634	17223	17223
query5	3614	513	508	508
query6	262	181	149	149
query7	4587	300	298	298
query8	255	194	219	194
query9	8511	2396	2393	2393
query10	429	283	277	277
query11	10600	10041	10307	10041
query12	123	91	90	90
query13	1666	380	375	375
query14	10038	7617	7552	7552
query15	204	164	169	164
query16	6986	433	438	433
query17	970	593	564	564
query18	1940	299	311	299
query19	200	157	155	155
query20	94	86	85	85
query21	209	97	99	97
query22	4321	4110	4143	4110
query23	33874	33712	33452	33452
query24	9445	3097	3080	3080
query25	652	419	429	419
query26	728	154	154	154
query27	2420	293	284	284
query28	6170	2025	1995	1995
query29	911	437	456	437
query30	241	155	149	149
query31	947	783	807	783
query32	105	58	56	56
query33	669	325	347	325
query34	921	526	529	526
query35	877	776	745	745
query36	1054	886	889	886
query37	135	86	83	83
query38	2950	2857	2761	2761
query39	864	807	813	807
query40	199	110	111	110
query41	45	43	44	43
query42	125	106	101	101
query43	462	419	429	419
query44	1184	718	735	718
query45	214	177	179	177
query46	1100	807	775	775
query47	1794	1724	1733	1724
query48	369	296	289	289
query49	846	418	424	418
query50	897	437	424	424
query51	6763	6631	6699	6631
query52	106	90	87	87
query53	257	184	186	184
query54	619	462	449	449
query55	76	74	76	74
query56	289	265	261	261
query57	1130	1017	1033	1017
query58	272	255	283	255
query59	2724	2355	2428	2355
query60	301	278	282	278
query61	100	95	94	94
query62	882	687	671	671
query63	212	192	186	186
query64	4721	1910	1872	1872
query65	3171	3109	3122	3109
query66	971	342	333	333
query67	15296	14650	14922	14650
query68	4507	576	586	576
query69	715	363	309	309
query70	1121	1041	1082	1041
query71	459	332	284	284
query72	7771	2697	2511	2511
query73	788	339	334	334
query74	6044	5669	5719	5669
query75	3612	2711	2752	2711
query76	3086	1169	1266	1169
query77	603	316	327	316
query78	9425	8899	8741	8741
query79	2402	533	530	530
query80	2035	505	502	502
query81	561	227	227	227
query82	830	133	136	133
query83	280	178	171	171
query84	270	80	81	80
query85	1462	321	306	306
query86	464	283	315	283
query87	3235	3109	3099	3099
query88	3793	2418	2402	2402
query89	407	289	291	289
query90	1956	193	195	193
query91	130	105	104	104
query92	63	53	51	51
query93	2314	613	611	611
query94	931	308	333	308
query95	380	273	273	273
query96	616	284	282	282
query97	3251	3103	3048	3048
query98	225	203	194	194
query99	1637	1266	1288	1266
Total cold run time: 262373 ms
Total hot run time: 169973 ms

@github-actions github-actions bot removed the approved Indicates a PR has been approved by one committer. label Aug 1, 2024
@doris-robot

ClickBench: Total hot run time: 30.05 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit dce7914e4c626cbb85277f094c2ba35d1035cd82, data reload: false

query1	0.05	0.03	0.03
query2	0.07	0.04	0.04
query3	0.22	0.05	0.05
query4	1.72	0.07	0.07
query5	0.48	0.48	0.49
query6	1.14	0.72	0.71
query7	0.02	0.01	0.01
query8	0.05	0.04	0.04
query9	0.57	0.53	0.52
query10	0.56	0.57	0.56
query11	0.15	0.11	0.12
query12	0.15	0.12	0.12
query13	0.61	0.60	0.60
query14	0.78	0.80	0.78
query15	0.91	0.86	0.87
query16	0.35	0.36	0.36
query17	0.96	0.99	0.99
query18	0.22	0.21	0.21
query19	1.83	1.76	1.76
query20	0.02	0.01	0.01
query21	15.40	0.79	0.67
query22	3.58	7.43	1.27
query23	18.17	1.33	1.29
query24	2.25	0.23	0.22
query25	0.18	0.09	0.08
query26	0.32	0.22	0.21
query27	0.45	0.23	0.24
query28	13.16	1.01	0.97
query29	12.60	3.29	3.27
query30	0.24	0.05	0.06
query31	2.87	0.40	0.39
query32	3.26	0.47	0.50
query33	2.95	2.98	2.98
query34	15.44	4.25	4.26
query35	4.32	4.30	4.31
query36	0.67	0.48	0.49
query37	0.19	0.18	0.16
query38	0.16	0.16	0.15
query39	0.04	0.04	0.04
query40	0.16	0.14	0.14
query41	0.11	0.04	0.05
query42	0.06	0.05	0.04
query43	0.04	0.04	0.04
Total cold run time: 107.48 s
Total hot run time: 30.05 s

@deardeng deardeng left a comment

LGTM

yujun777 commented Aug 1, 2024

run external

yujun777 commented Aug 1, 2024

run buildall

@yujun777 yujun777 force-pushed the fix-clone-new-tablet-report-version branch from eb8087f to 3fc4f8e Compare August 1, 2024 09:37
yujun777 commented Aug 1, 2024

run buildall

yujun777 commented Aug 1, 2024

run buildall

yujun777 commented Aug 1, 2024

run buildall

github-actions bot commented Aug 1, 2024

clang-tidy review says "All clean, LGTM! 👍"

yujun777 commented Aug 1, 2024

run performance

@doris-robot

TPC-H: Total hot run time: 41295 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 3ac7b89f62ce1e99b61ca3891ebbb6833d23ba65, data reload: false

------ Round 1 ----------------------------------
q1	17725	4169	4054	4054
q2	2020	193	203	193
q3	10458	1254	1330	1254
q4	10166	775	911	775
q5	7671	3000	2959	2959
q6	217	134	135	134
q7	1022	613	605	605
q8	9427	1823	1917	1823
q9	8456	6572	6550	6550
q10	8765	3888	3844	3844
q11	433	241	247	241
q12	425	225	224	224
q13	17765	2962	2928	2928
q14	264	244	244	244
q15	529	485	500	485
q16	517	392	383	383
q17	969	860	819	819
q18	8087	7251	7291	7251
q19	1490	1212	1212	1212
q20	568	331	341	331
q21	5314	4709	4732	4709
q22	356	277	282	277
Total cold run time: 112644 ms
Total hot run time: 41295 ms

----- Round 2, with runtime_filter_mode=off -----
q1	4074	3969	4010	3969
q2	329	224	218	218
q3	2978	2969	3151	2969
q4	2000	2036	2000	2000
q5	5528	5453	5500	5453
q6	220	136	131	131
q7	2116	1767	1826	1767
q8	3329	3348	3319	3319
q9	8658	8585	8652	8585
q10	3993	4009	3977	3977
q11	548	445	467	445
q12	726	616	593	593
q13	14553	3104	3101	3101
q14	308	279	266	266
q15	543	492	493	492
q16	453	410	416	410
q17	1747	1732	1757	1732
q18	8253	7856	7772	7772
q19	1713	1718	1702	1702
q20	2049	1885	1804	1804
q21	5706	5551	5390	5390
q22	531	463	472	463
Total cold run time: 70355 ms
Total hot run time: 56558 ms

@doris-robot

TPC-DS: Total hot run time: 169677 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 3ac7b89f62ce1e99b61ca3891ebbb6833d23ba65, data reload: false

query1	911	383	363	363
query2	6492	1747	1690	1690
query3	6662	218	226	218
query4	19641	17159	17281	17159
query5	3683	497	510	497
query6	294	181	178	178
query7	4595	305	289	289
query8	249	196	192	192
query9	8531	2368	2358	2358
query10	433	280	273	273
query11	10344	10005	10146	10005
query12	116	89	87	87
query13	1637	369	381	369
query14	9365	6976	6895	6895
query15	203	155	167	155
query16	6834	451	465	451
query17	927	557	553	553
query18	2169	279	274	274
query19	198	144	144	144
query20	91	89	88	88
query21	220	98	100	98
query22	4272	4356	4172	4172
query23	33710	33751	33910	33751
query24	9454	3092	3074	3074
query25	655	396	417	396
query26	1270	157	154	154
query27	2314	272	283	272
query28	7100	2005	1996	1996
query29	1004	448	428	428
query30	243	155	153	153
query31	963	783	793	783
query32	102	55	55	55
query33	670	310	320	310
query34	917	482	504	482
query35	901	786	756	756
query36	1034	921	893	893
query37	201	95	84	84
query38	2959	2843	2834	2834
query39	884	818	816	816
query40	205	114	117	114
query41	52	47	48	47
query42	125	100	106	100
query43	462	428	437	428
query44	1160	735	725	725
query45	207	176	185	176
query46	1090	826	782	782
query47	1835	1715	1721	1715
query48	359	297	298	297
query49	894	418	433	418
query50	893	437	437	437
query51	6739	6701	6646	6646
query52	102	86	92	86
query53	254	183	178	178
query54	611	470	447	447
query55	75	76	74	74
query56	294	240	251	240
query57	1121	1036	1052	1036
query58	286	267	268	267
query59	2490	2441	2374	2374
query60	288	276	275	275
query61	97	92	92	92
query62	878	666	659	659
query63	220	178	180	178
query64	4605	1871	1862	1862
query65	3140	3096	3103	3096
query66	924	321	330	321
query67	15216	14962	14761	14761
query68	5322	571	583	571
query69	689	414	316	316
query70	1088	1056	1037	1037
query71	431	275	275	275
query72	7633	2727	2479	2479
query73	909	336	327	327
query74	5981	5567	5564	5564
query75	3512	2711	2735	2711
query76	3086	1266	1269	1266
query77	551	307	319	307
query78	9441	8895	8880	8880
query79	1380	543	599	543
query80	947	511	501	501
query81	579	233	224	224
query82	750	130	129	129
query83	260	168	179	168
query84	267	78	80	78
query85	1254	310	302	302
query86	470	279	313	279
query87	3336	3136	3115	3115
query88	3679	2407	2420	2407
query89	393	288	291	288
query90	1757	195	196	195
query91	130	102	99	99
query92	58	50	49	49
query93	1974	617	611	611
query94	771	294	303	294
query95	379	263	278	263
query96	596	282	291	282
query97	3225	3061	3061	3061
query98	221	202	195	195
query99	1666	1285	1310	1285
Total cold run time: 259605 ms
Total hot run time: 169677 ms

@doris-robot

ClickBench: Total hot run time: 29.61 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 3ac7b89f62ce1e99b61ca3891ebbb6833d23ba65, data reload: false

query1	0.04	0.03	0.03
query2	0.07	0.04	0.04
query3	0.23	0.05	0.05
query4	1.68	0.08	0.07
query5	0.50	0.49	0.48
query6	1.14	0.71	0.72
query7	0.03	0.01	0.01
query8	0.05	0.04	0.04
query9	0.57	0.51	0.51
query10	0.56	0.58	0.57
query11	0.16	0.11	0.11
query12	0.15	0.12	0.12
query13	0.62	0.60	0.60
query14	0.76	0.80	0.78
query15	0.92	0.86	0.85
query16	0.36	0.36	0.36
query17	1.02	0.98	1.03
query18	0.22	0.21	0.20
query19	1.82	1.80	1.73
query20	0.01	0.00	0.00
query21	15.38	0.76	0.65
query22	4.06	7.47	1.07
query23	17.87	1.21	1.19
query24	2.23	0.22	0.21
query25	0.18	0.08	0.08
query26	0.32	0.21	0.21
query27	0.45	0.23	0.23
query28	13.18	0.99	0.97
query29	12.55	3.30	3.27
query30	0.25	0.06	0.05
query31	2.91	0.40	0.41
query32	3.22	0.50	0.48
query33	2.93	2.91	2.97
query34	15.47	4.29	4.32
query35	4.32	4.30	4.34
query36	0.68	0.48	0.48
query37	0.19	0.16	0.16
query38	0.17	0.14	0.16
query39	0.04	0.03	0.04
query40	0.17	0.13	0.13
query41	0.10	0.04	0.05
query42	0.05	0.05	0.04
query43	0.04	0.04	0.04
Total cold run time: 107.67 s
Total hot run time: 29.61 s

@dataroaring dataroaring left a comment

LGTM

@dataroaring dataroaring merged commit 73c8dbc into apache:master Aug 4, 2024
yiguolei pushed a commit that referenced this pull request Aug 5, 2024
dataroaring pushed a commit that referenced this pull request Aug 11, 2024
…8695)

dataroaring pushed a commit that referenced this pull request Aug 16, 2024
…8695)

5 participants