Conversation

@Gabriel39
Contributor

@Gabriel39 Gabriel39 commented Jun 26, 2025

Cherry-pick from #51492

Issue Number: close #51491

Problem Summary:
When the queue of the FragmentMgrAsync thread pool is full, a newly submitted task is rejected and the submitting code returns early. However, tasks that were submitted earlier may still be scheduled and executed later. Because the caller has already returned, objects that those tasks reference, such as PipelineFragmentContext and TPipelineFragmentParams, can be destroyed before the tasks run. The tasks then access invalid memory (for example through a null or dangling pointer), which ends in a SIGSEGV and a core dump.

This PR fixes the problem by waiting until all previously submitted tasks have completed before returning.
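
A minimal sketch of the wait-before-return idea, not the actual Doris change. `BoundedExecutor`, `PrepareContext`, and `prepare_all` are hypothetical names invented for illustration, and the toy executor just rejects submissions beyond a fixed capacity to mimic a full queue.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical stand-in for a bounded thread pool: it accepts at most
// `capacity` tasks in total and rejects the rest, mimicking a full queue.
// Assumes a single submitting thread.
class BoundedExecutor {
public:
    explicit BoundedExecutor(std::size_t capacity) : _capacity(capacity) {}
    ~BoundedExecutor() {
        for (auto& t : _threads) t.join();
    }

    // Returns false when the "queue" is full. Accepted tasks run on their
    // own thread at some later point.
    bool submit(std::function<void()> task) {
        if (_threads.size() >= _capacity) return false;
        _threads.emplace_back(std::move(task));
        return true;
    }

private:
    std::size_t _capacity;
    std::vector<std::thread> _threads;
};

// Hypothetical stand-in for state that the submitted tasks reference.
struct PrepareContext {
    std::atomic<int> prepared{0};
};

// Submits `num_tasks` tasks that all reference `ctx`. If a submission is
// rejected, no further tasks are submitted, but the function still waits for
// every task that was already accepted. Only then is it safe for the caller
// to destroy `ctx`.
bool prepare_all(BoundedExecutor& pool, PrepareContext& ctx, int num_tasks) {
    std::mutex mu;
    std::condition_variable cv;
    int submitted = 0;  // touched only by this (submitting) thread
    int finished = 0;   // guarded by mu
    bool ok = true;

    for (int i = 0; i < num_tasks; ++i) {
        bool accepted = pool.submit([&] {
            ctx.prepared.fetch_add(1);
            // Notify while holding the lock: mu and cv live on the caller's
            // stack, so the waiter must not get past wait() until this task
            // has finished using them.
            std::lock_guard<std::mutex> lock(mu);
            ++finished;
            cv.notify_one();
        });
        if (!accepted) {
            ok = false;  // queue full: stop submitting, but do not return yet
            break;
        }
        ++submitted;
    }

    // The key step: wait for all previously accepted tasks before returning.
    std::unique_lock<std::mutex> lock(mu);
    cv.wait(lock, [&] { return finished == submitted; });
    assert(finished == submitted);
    return ok;
}
```

With a capacity of 2 and 5 requested tasks, `prepare_all` reports failure, yet both accepted tasks have already run by the time it returns, so the caller can release `ctx` without racing against them. Returning immediately on the first rejection would reintroduce the use-after-free described above.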

*** SIGSEGV address not mapped to object (@0x1c8) received by PID 3941201 (TID 2115617 OR 0xfe1685bb97f0) from PID 456; stack trace: ***
 0# doris::signal::(anonymous namespace)::FailureSignalHandler(int, siginfo_t*, void*) at /home/jenkins_agent/workspace/BigDataComponent_doris-unified-arm-release/be/src/common/signal_handler.h:421
 1# os::Linux::chained_handler(int, siginfo_t*, void*) in /usr/jdk64/current/jre/lib/aarch64/server/libjvm.so
 2# JVM_handle_linux_signal in /usr/jdk64/current/jre/lib/aarch64/server/libjvm.so
 3# signalHandler(int, siginfo_t*, void*) in /usr/jdk64/current/jre/lib/aarch64/server/libjvm.so
 4# 0x0000FFFF6B2A07C0 in linux-vdso.so.1
 5# doris::TUniqueId::TUniqueId(doris::TUniqueId const&) at /home/jenkins_agent/workspace/BigDataComponent_doris-unified-arm-release/gensrc/build/gen_cpp/Types_types.cpp:2354
 6# doris::AttachTask::AttachTask(doris::QueryContext*) at /home/jenkins_agent/workspace/BigDataComponent_doris-unified-arm-release/be/src/runtime/thread_context.cpp:60
 7# std::_Function_handler<void (), doris::pipeline::PipelineXFragmentContext::_build_pipeline_x_tasks(doris::TPipelineFragmentParams const&, doris::ThreadPool*)::$_0>::_M_invoke(std::_Any_data const&) at /usr/lib/gcc/aarch64-linux-gnu/13/../../../../include/c++/13/bits/std_function.h:290
 8# doris::ThreadPool::dispatch_thread() at /home/jenkins_agent/workspace/BigDataComponent_doris-unified-arm-release/be/src/util/threadpool.cpp:552
 9# doris::Thread::supervise_thread(void*) at /home/jenkins_agent/workspace/BigDataComponent_doris-unified-arm-release/be/src/util/thread.cpp:499
10# 0x0000FFFF6AF187AC in /lib64/libpthread.so.0
11# 0x0000FFFF6B16548C in /lib64/libc.so.6

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

…epare execution (apache#51492)

Co-authored-by: XLPE <weiwh1@chinatelecom.cn>
@Gabriel39 Gabriel39 requested a review from morrySnow as a code owner June 26, 2025 09:17
@hello-stephen
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@Gabriel39
Contributor Author

run buildall

@doris-robot
TPC-H: Total hot run time: 39784 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit f872eddb1f2e8d5f0ff2f5f5bba11b76289a31eb, data reload: false

------ Round 1 ----------------------------------
q1	17591	6770	6617	6617
q2	2082	176	185	176
q3	10524	1108	1169	1108
q4	10219	801	778	778
q5	7754	2915	2831	2831
q6	217	132	130	130
q7	981	630	613	613
q8	9366	1977	2029	1977
q9	6638	6409	6481	6409
q10	6960	2292	2299	2292
q11	455	258	265	258
q12	420	219	208	208
q13	17804	2987	3030	2987
q14	241	207	204	204
q15	519	466	466	466
q16	473	379	372	372
q17	976	607	543	543
q18	7527	6701	6738	6701
q19	1320	964	1036	964
q20	504	199	199	199
q21	3869	3092	2978	2978
q22	1086	973	975	973
Total cold run time: 107526 ms
Total hot run time: 39784 ms

----- Round 2, with runtime_filter_mode=off -----
q1	6630	6566	6737	6566
q2	330	234	236	234
q3	2955	2772	2903	2772
q4	2006	1756	1794	1756
q5	5750	5781	5765	5765
q6	211	130	125	125
q7	2282	1776	1808	1776
q8	3402	3631	3608	3608
q9	9004	8804	9069	8804
q10	3555	3525	3533	3525
q11	586	502	499	499
q12	781	627	646	627
q13	9242	3153	3239	3153
q14	313	264	276	264
q15	507	462	469	462
q16	508	433	457	433
q17	1854	1646	1613	1613
q18	8322	7697	7886	7697
q19	1873	1758	1788	1758
q20	2183	1830	1808	1808
q21	5328	5171	5348	5171
q22	1198	1000	1024	1000
Total cold run time: 68820 ms
Total hot run time: 59416 ms

@doris-robot
TPC-DS: Total hot run time: 196522 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit f872eddb1f2e8d5f0ff2f5f5bba11b76289a31eb, data reload: false

query1	1282	946	896	896
query2	6375	1891	1909	1891
query3	10801	4290	4240	4240
query4	61452	30612	23649	23649
query5	5224	460	462	460
query6	424	188	184	184
query7	5454	317	320	317
query8	331	239	236	236
query9	8731	2568	2554	2554
query10	470	277	263	263
query11	17461	15234	16266	15234
query12	161	105	109	105
query13	1462	447	434	434
query14	9934	7590	6944	6944
query15	206	188	192	188
query16	6935	477	499	477
query17	1217	603	597	597
query18	1557	320	331	320
query19	211	193	162	162
query20	120	112	114	112
query21	203	105	103	103
query22	4507	4363	4545	4363
query23	34825	34023	33970	33970
query24	6135	2880	2928	2880
query25	539	419	416	416
query26	656	168	171	168
query27	1834	356	360	356
query28	4040	2198	2112	2112
query29	702	445	424	424
query30	244	160	156	156
query31	978	822	856	822
query32	67	54	57	54
query33	434	313	307	307
query34	942	516	525	516
query35	869	744	733	733
query36	1080	950	970	950
query37	118	68	77	68
query38	4046	4039	3982	3982
query39	1485	1451	1472	1451
query40	211	103	101	101
query41	48	45	47	45
query42	114	99	109	99
query43	512	480	499	480
query44	1192	810	796	796
query45	192	163	170	163
query46	1165	745	721	721
query47	2020	1899	1929	1899
query48	447	351	345	345
query49	760	420	405	405
query50	833	439	433	433
query51	7323	7331	7233	7233
query52	103	91	92	91
query53	271	180	184	180
query54	593	477	487	477
query55	79	85	84	84
query56	267	268	245	245
query57	1333	1210	1224	1210
query58	234	216	224	216
query59	3287	3089	2932	2932
query60	298	271	279	271
query61	176	121	109	109
query62	775	690	668	668
query63	212	190	193	190
query64	1295	654	619	619
query65	3236	3155	3219	3155
query66	682	294	295	294
query67	15874	15474	15400	15400
query68	4050	593	568	568
query69	429	272	268	268
query70	1207	1108	1128	1108
query71	354	254	260	254
query72	6352	4047	4032	4032
query73	752	349	361	349
query74	10671	9306	8951	8951
query75	3393	2682	2689	2682
query76	1919	1074	1023	1023
query77	508	283	270	270
query78	10574	9532	9638	9532
query79	1295	588	599	588
query80	851	431	436	431
query81	489	217	225	217
query82	1316	94	91	91
query83	245	153	162	153
query84	276	80	81	80
query85	946	384	370	370
query86	341	311	309	309
query87	4411	4301	4269	4269
query88	3728	2428	2368	2368
query89	417	299	296	296
query90	1998	230	189	189
query91	148	109	111	109
query92	68	50	54	50
query93	1330	542	545	542
query94	745	293	300	293
query95	353	260	268	260
query96	616	280	289	280
query97	3307	3152	3145	3145
query98	217	205	197	197
query99	1535	1293	1298	1293
Total cold run time: 311885 ms
Total hot run time: 196522 ms

@doris-robot
ClickBench: Total hot run time: 29.86 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit f872eddb1f2e8d5f0ff2f5f5bba11b76289a31eb, data reload: false

query1	0.04	0.03	0.03
query2	0.07	0.03	0.03
query3	0.22	0.07	0.06
query4	1.62	0.10	0.10
query5	0.50	0.49	0.50
query6	1.14	0.76	0.72
query7	0.02	0.01	0.01
query8	0.05	0.02	0.03
query9	0.56	0.49	0.50
query10	0.56	0.55	0.56
query11	0.14	0.10	0.10
query12	0.15	0.12	0.11
query13	0.60	0.60	0.60
query14	0.77	0.79	0.82
query15	0.84	0.83	0.83
query16	0.41	0.38	0.37
query17	1.03	1.03	1.01
query18	0.23	0.21	0.22
query19	1.93	1.82	1.87
query20	0.01	0.02	0.01
query21	15.39	0.58	0.57
query22	2.59	1.93	1.75
query23	16.76	1.02	0.73
query24	3.40	1.17	0.86
query25	0.25	0.13	0.26
query26	0.37	0.15	0.14
query27	0.04	0.05	0.04
query28	10.32	0.54	0.46
query29	12.60	3.22	3.21
query30	0.25	0.06	0.06
query31	2.84	0.37	0.38
query32	3.28	0.45	0.45
query33	2.99	3.01	3.04
query34	16.85	4.45	4.48
query35	4.56	4.59	4.51
query36	0.68	0.47	0.48
query37	0.08	0.06	0.06
query38	0.04	0.04	0.03
query39	0.03	0.02	0.02
query40	0.16	0.12	0.13
query41	0.08	0.02	0.02
query42	0.03	0.02	0.02
query43	0.03	0.03	0.03
Total cold run time: 104.51 s
Total hot run time: 29.86 s

@hello-stephen
Contributor

BE UT Coverage Report

Increment line coverage 0.00% (0/24) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 42.38% (11534/27213)
Line Coverage 33.38% (100248/300295)
Region Coverage 32.67% (52194/159759)
Branch Coverage 29.74% (28030/94262)

@morrySnow morrySnow changed the title [fix](pipeline) premature exit causing core dump during concurrent pr… branch-3.1: [fix](pipeline) premature exit causing core dump during concurrent prepare execution #51492 Jun 27, 2025
@morrySnow morrySnow merged commit 0e5fc37 into apache:branch-3.1 Jun 27, 2025
24 of 25 checks passed
5 participants