kaijchen (Member) commented Feb 25, 2025

What problem does this PR solve?

Issue Number: DORIS-17659 DORIS-18383

Problem Summary:

In a 3-replica setup, when the BE detects a write error on one replica, it still submits the commit information. If the FE then checks the commit status and finds that one of the two successful replicas is missing the previous version, it returns an error. However, because the BE does not report the failure reason, the FE's error message gives no details about the underlying cause.

This PR improves the error message for this scenario: it now advises users to check the BE logs and SHOW TABLET for the detailed error reason, and suggests retrying the operation later.
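To make the failure mode concrete, here is a minimal Java sketch of a majority-quorum commit check that fails with an actionable message. Everything in it is hypothetical: the class QuorumCommitSketch, the Replica record, the checkCommit method, the rule that a replica must already hold version commitVersion - 1, and the message wording are assumptions for illustration, not the actual Doris FE code or the exact string this PR introduces. It only models the scenario above: with 3 replicas and a quorum of 2, one failed write plus one replica missing the previous version leaves just 1 replica able to commit (requires Java 16+ for the record).

```java
import java.util.List;

/**
 * Hypothetical sketch, NOT the Doris FE implementation: a majority-quorum
 * commit check for one tablet that produces an actionable error message.
 */
public class QuorumCommitSketch {

    /** Hypothetical per-replica view held by the FE for one load transaction. */
    public record Replica(long backendId, boolean writeSucceeded, long version) {}

    /** Majority quorum, e.g. 2 of 3 replicas. */
    public static int quorum(int replicaNum) {
        return replicaNum / 2 + 1;
    }

    /**
     * Throws if fewer than a quorum of replicas both wrote successfully and
     * already hold the previous version (assumed here to be commitVersion - 1).
     */
    public static void checkCommit(long tabletId, List<Replica> replicas, long commitVersion) {
        int ready = 0;
        StringBuilder details = new StringBuilder();
        for (Replica r : replicas) {
            if (!r.writeSucceeded()) {
                // The write failed on this replica, but the BE did not report why,
                // so the FE has no reason to include here.
                continue;
            }
            if (r.version() >= commitVersion - 1) {
                ready++;
            } else {
                // A successful replica that lacks the previous version cannot commit.
                details.append(" backend ").append(r.backendId())
                        .append(" is missing version ").append(commitVersion - 1).append(';');
            }
        }
        int required = quorum(replicas.size());
        if (ready < required) {
            // Point the user at where the real reason can be found, and suggest a retry.
            throw new IllegalStateException(String.format(
                    "tablet %d: %d of %d replicas can commit version %d, need %d;%s"
                            + " check the BE logs and SHOW TABLET %d for the detailed reason,"
                            + " then retry later",
                    tabletId, ready, replicas.size(), commitVersion, required, details, tabletId));
        }
    }

    public static void main(String[] args) {
        // Scenario from the PR description: one replica failed to write, and one of
        // the two successful replicas lacks the previous version (it is still at 3).
        List<Replica> replicas = List.of(
                new Replica(10001, true, 4),
                new Replica(10002, true, 3),
                new Replica(10003, false, 4));
        try {
            checkCommit(12345, replicas, 5);
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Because the BE does not tell the FE why a write failed, the best such a check can do is name the lagging replicas and point to SHOW TABLET and the BE logs, which is the direction this PR takes.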

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason: only error strings are changed.
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

Thearas (Contributor) commented Feb 25, 2025

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (ideally with the specific error message), and how it was fixed.
  2. Which behaviors were modified: what the previous behavior was, what it is now, why it was changed, and what impact it may have.
  3. What features were added, and why.
  4. Which code was refactored, and why.
  5. Which functions were optimized, and what the difference is before and after the optimization.

kaijchen (Member, Author) commented:

run buildall

doris-robot commented:

TPC-H: Total hot run time: 31816 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
TPC-H sf100 test result on commit 74b007a630083e7afbe98113020c34115075796a, data reload: false

------ Round 1 ----------------------------------
query	cold run (ms)	hot run 1 (ms)	hot run 2 (ms)	best hot run (ms)
q1	17597	5328	5163	5163
q2	2048	308	173	173
q3	10396	1411	723	723
q4	10270	1004	574	574
q5	7990	2417	2322	2322
q6	194	169	134	134
q7	904	767	630	630
q8	9322	1296	1111	1111
q9	4951	5016	4584	4584
q10	6846	2294	1882	1882
q11	504	284	270	270
q12	352	362	227	227
q13	17781	3660	3092	3092
q14	226	228	206	206
q15	517	475	465	465
q16	633	624	587	587
q17	573	863	344	344
q18	6660	6293	6243	6243
q19	1209	952	555	555
q20	333	339	198	198
q21	2896	2230	2030	2030
q22	364	343	303	303
Total cold run time: 102566 ms
Total hot run time: 31816 ms

----- Round 2, with runtime_filter_mode=off -----
query	cold run (ms)	hot run 1 (ms)	hot run 2 (ms)	best hot run (ms)
q1	5174	5184	5138	5138
q2	237	331	232	232
q3	2174	2735	2299	2299
q4	1453	1855	1455	1455
q5	4231	4072	4130	4072
q6	202	168	125	125
q7	1885	1968	1821	1821
q8	2699	2657	2589	2589
q9	7221	7119	7072	7072
q10	3010	3187	2749	2749
q11	575	509	495	495
q12	696	783	650	650
q13	3612	3910	3315	3315
q14	287	297	266	266
q15	508	472	461	461
q16	646	704	640	640
q17	1144	1575	1353	1353
q18	7589	7316	7252	7252
q19	817	890	1003	890
q20	1956	2024	1869	1869
q21	5569	5075	4817	4817
q22	643	565	554	554
Total cold run time: 52328 ms
Total hot run time: 50114 ms

doris-robot commented:

TPC-DS: Total hot run time: 184058 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 74b007a630083e7afbe98113020c34115075796a, data reload: false

query	cold run (ms)	hot run 1 (ms)	hot run 2 (ms)	best hot run (ms)
query1	995	430	390	390
query2	6535	1886	1858	1858
query3	6792	215	209	209
query4	26161	23839	23441	23441
query5	4382	677	490	490
query6	306	231	190	190
query7	4615	489	292	292
query8	294	248	234	234
query9	8596	2516	2532	2516
query10	463	320	261	261
query11	15606	15484	15080	15080
query12	166	112	108	108
query13	1661	519	385	385
query14	9872	6793	7028	6793
query15	215	200	182	182
query16	7654	645	462	462
query17	1426	705	555	555
query18	1988	400	305	305
query19	201	186	158	158
query20	129	125	119	119
query21	206	121	107	107
query22	4074	4276	3977	3977
query23	33893	33018	33055	33018
query24	7738	2332	2402	2332
query25	484	449	375	375
query26	1193	260	156	156
query27	2098	499	324	324
query28	3916	2419	2382	2382
query29	583	519	433	433
query30	228	188	160	160
query31	936	837	751	751
query32	78	64	69	64
query33	561	365	294	294
query34	782	881	502	502
query35	803	809	722	722
query36	967	1003	882	882
query37	123	95	71	71
query38	4113	4220	4043	4043
query39	1432	1427	1408	1408
query40	210	114	106	106
query41	54	51	51	51
query42	123	105	104	104
query43	482	499	478	478
query44	1299	799	787	787
query45	176	165	156	156
query46	852	1030	639	639
query47	1730	1773	1689	1689
query48	377	395	298	298
query49	775	501	408	408
query50	678	721	417	417
query51	4236	4179	4158	4158
query52	107	113	95	95
query53	227	258	181	181
query54	496	494	407	407
query55	82	77	83	77
query56	272	250	271	250
query57	1106	1173	1024	1024
query58	242	238	237	237
query59	2736	2783	2557	2557
query60	289	265	251	251
query61	126	125	118	118
query62	804	744	659	659
query63	237	186	183	183
query64	4247	1000	646	646
query65	3238	3147	3167	3147
query66	1062	397	299	299
query67	15504	15580	15305	15305
query68	8818	878	518	518
query69	470	290	279	279
query70	1197	1138	1081	1081
query71	450	291	271	271
query72	5260	3514	3717	3514
query73	813	747	356	356
query74	9242	9093	8707	8707
query75	3899	3206	2677	2677
query76	3579	1158	729	729
query77	773	359	279	279
query78	10015	10202	9293	9293
query79	2331	844	593	593
query80	666	517	441	441
query81	481	286	235	235
query82	676	130	97	97
query83	187	168	149	149
query84	240	156	76	76
query85	772	341	348	341
query86	340	294	287	287
query87	4397	4384	4361	4361
query88	3218	2208	2211	2208
query89	392	306	291	291
query90	1945	198	197	197
query91	136	137	116	116
query92	76	58	58	58
query93	1134	1030	587	587
query94	674	397	290	290
query95	354	263	257	257
query96	480	548	268	268
query97	3307	3421	3245	3245
query98	240	206	196	196
query99	1634	1406	1305	1305
Total cold run time: 272228 ms
Total hot run time: 184058 ms

doris-robot commented:

ClickBench: Total hot run time: 30.8 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 74b007a630083e7afbe98113020c34115075796a, data reload: false

query	cold run (s)	hot run 1 (s)	hot run 2 (s)
query1	0.04	0.04	0.03
query2	0.07	0.04	0.03
query3	0.23	0.07	0.07
query4	1.60	0.10	0.11
query5	0.56	0.56	0.55
query6	1.20	0.71	0.73
query7	0.02	0.01	0.02
query8	0.03	0.04	0.03
query9	0.57	0.55	0.50
query10	0.56	0.57	0.56
query11	0.16	0.10	0.10
query12	0.14	0.11	0.11
query13	0.61	0.60	0.60
query14	2.79	2.79	2.79
query15	0.92	0.86	0.86
query16	0.38	0.39	0.38
query17	1.03	1.03	1.05
query18	0.21	0.19	0.20
query19	1.86	1.79	1.93
query20	0.02	0.01	0.02
query21	15.36	0.90	0.53
query22	0.76	1.18	0.77
query23	14.83	1.37	0.63
query24	7.19	1.32	0.69
query25	0.51	0.14	0.13
query26	0.53	0.17	0.14
query27	0.06	0.05	0.04
query28	9.53	0.92	0.46
query29	12.56	4.00	3.29
query30	0.26	0.08	0.06
query31	2.83	0.61	0.38
query32	3.22	0.55	0.47
query33	2.97	3.00	3.06
query34	15.80	5.14	4.48
query35	4.50	4.49	4.48
query36	0.67	0.51	0.48
query37	0.09	0.06	0.06
query38	0.05	0.04	0.04
query39	0.04	0.02	0.02
query40	0.17	0.13	0.13
query41	0.08	0.02	0.02
query42	0.04	0.03	0.02
query43	0.03	0.03	0.03
Total cold run time: 105.08 s
Total hot run time: 30.8 s

liaoxin01 (Contributor) left a comment:

LGTM

github-actions (Contributor) commented:

PR approved by at least one committer and no changes requested.

github-actions bot added the approved and reviewed labels Feb 27, 2025
github-actions (Contributor) commented:

PR approved by anyone and no changes requested.

dataroaring (Contributor) left a comment:

LGTM

liaoxin01 merged commit f79292f into apache:master Feb 27, 2025
37 of 38 checks passed
github-actions bot pushed a commit that referenced this pull request Feb 27, 2025
seawinde pushed a commit to seawinde/doris that referenced this pull request Feb 28, 2025
dataroaring pushed a commit that referenced this pull request Feb 28, 2025
…48436)

Cherry-picked from #48316

Co-authored-by: Kaijie Chen <chenkaijie@selectdb.com>
mymeiyi pushed a commit to mymeiyi/doris that referenced this pull request Mar 4, 2025
github-actions bot pushed a commit that referenced this pull request Mar 11, 2025
yiguolei pushed a commit that referenced this pull request Mar 11, 2025
…48891)

Cherry-picked from #48316

Co-authored-by: Kaijie Chen <chenkaijie@selectdb.com>
koarz pushed a commit to koarz/doris that referenced this pull request Jun 4, 2025

Labels

approved, dev/2.1.9-merged, dev/3.0.5-merged, reviewed

6 participants