Skip to content

Conversation

@ghkang98
Copy link
Contributor

@ghkang98 ghkang98 commented Mar 12, 2025

…en importing data from streamload

What problem does this PR solve?

This is mainly to solve the multithreading problem caused by inconsistent visible order of EOS and data_queue variables in doris's streamload function and asyn_result_writer in the process_block process due to the compilation reordering of the ARM system or the weak memory order problem, which leads to data loss.

Problem Summary:

Mainly in the arm architecture, streamload has data loss problems. The transaction of importing data can be executed and submitted normally, but the NumberTotalRowshe NumberFilterRows in the returned load result are both zero
Uploading stream_load_lost_data.docx…
((https://github.com/user-attachments/files/19201955/stream_load.docx))

@Thearas
Copy link
Contributor

Thearas commented Mar 12, 2025

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@wm1581066 wm1581066 requested a review from liaoxin01 March 12, 2025 03:22
@ghkang98
Copy link
Contributor Author

rebuid all

@ghkang98 ghkang98 force-pushed the streamload-lost-data branch from f23c3c8 to f9ea572 Compare March 12, 2025 05:00
@ghkang98
Copy link
Contributor Author

rebuid all

_data_queue.clear();
break;
//1) wait scan operator write data
if (!_eos && _data_queue.empty() && _writer_status.ok()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (!_eos && _data_queue.empty() && _writer_status.ok()) {
{
std::unique_lock l(_m);
while (!_eos && _data_queue.empty() && _writer_status.ok()) {
// Add 1s to check to avoid lost signal
_cv.wait_for(l, std::chrono::seconds(1));
}
}

I think we can solve this problem by simply deleting this if condition. The issue of out-of-order execution will be synchronized through the lock later.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The if condition is removed. At the same time, for the rigor and readability of the code, the data is taken from the queue or moved down.

@ghkang98 ghkang98 force-pushed the streamload-lost-data branch from 68856cc to f84f912 Compare March 12, 2025 10:26
@liaoxin01 liaoxin01 requested a review from HappenLee March 12, 2025 12:02
@ghkang98
Copy link
Contributor Author

rebuid all

@ghkang98 ghkang98 force-pushed the streamload-lost-data branch 3 times, most recently from a19d050 to 705d30c Compare March 13, 2025 03:33
if ((_eos && _data_queue.empty()) || !_writer_status.ok()) {
_data_queue.clear();
break;
//1) wait scan operator write data
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

better do the change:

 {
            std::unique_lock l(_m);
            while (!_eos && _data_queue.empty() && _writer_status.ok()) {
                // Add 1s to check to avoid lost signal
                _cv.wait_for(l, std::chrono::seconds(1));
            }

            if ((_eos && _data_queue.empty()) || !_writer_status.ok()) {
                   _data_queue.clear();
                    break;
             }
        }

@ghkang98 ghkang98 force-pushed the streamload-lost-data branch from 705d30c to f1b0bfe Compare March 14, 2025 12:12
Copy link
Contributor

@HappenLee HappenLee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@github-actions github-actions bot added the approved Indicates a PR has been approved by one committer. label Mar 15, 2025
@github-actions
Copy link
Contributor

PR approved by at least one committer and no changes requested.

@github-actions
Copy link
Contributor

PR approved by anyone and no changes requested.

@HappenLee
Copy link
Contributor

run buildall

@doris-robot
Copy link

TPC-H: Total hot run time: 32784 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit f1b0bfe1faccb3939ec977c727e640ccd777ad95, data reload: false

------ Round 1 ----------------------------------
q1	17600	5214	5089	5089
q2	2040	293	178	178
q3	10415	1285	739	739
q4	10205	1015	535	535
q5	7563	2448	2328	2328
q6	193	162	134	134
q7	929	759	628	628
q8	9302	1266	1122	1122
q9	4960	4928	4810	4810
q10	6845	2340	1914	1914
q11	471	269	263	263
q12	340	347	220	220
q13	18689	3712	3134	3134
q14	222	230	205	205
q15	530	499	477	477
q16	635	625	573	573
q17	570	852	347	347
q18	6760	6592	6327	6327
q19	1211	944	569	569
q20	332	334	202	202
q21	2911	2232	1958	1958
q22	1103	1088	1032	1032
Total cold run time: 103826 ms
Total hot run time: 32784 ms

----- Round 2, with runtime_filter_mode=off -----
q1	5108	5119	5167	5119
q2	251	345	247	247
q3	2312	2783	2481	2481
q4	1449	1926	1422	1422
q5	4366	4167	4246	4167
q6	203	164	126	126
q7	1959	1922	1764	1764
q8	2651	2691	2668	2668
q9	7331	7225	7118	7118
q10	3045	3258	2754	2754
q11	601	505	484	484
q12	689	763	604	604
q13	3501	3868	3257	3257
q14	282	293	281	281
q15	525	484	485	484
q16	646	700	654	654
q17	1143	1637	1345	1345
q18	7810	7655	7480	7480
q19	851	867	980	867
q20	1947	2024	1876	1876
q21	5467	5168	4641	4641
q22	1109	1076	1040	1040
Total cold run time: 53246 ms
Total hot run time: 50879 ms

@doris-robot
Copy link

TPC-DS: Total hot run time: 192129 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit f1b0bfe1faccb3939ec977c727e640ccd777ad95, data reload: false

query1	1408	1036	996	996
query2	6305	1950	1942	1942
query3	11014	4376	4523	4376
query4	55714	24907	23220	23220
query5	5007	649	477	477
query6	336	203	199	199
query7	4867	525	299	299
query8	316	262	244	244
query9	5321	2622	2627	2622
query10	433	311	254	254
query11	15105	15133	14922	14922
query12	166	118	104	104
query13	1035	526	400	400
query14	10221	6629	6260	6260
query15	205	197	173	173
query16	7091	659	481	481
query17	1073	713	555	555
query18	1526	429	322	322
query19	200	185	174	174
query20	143	127	118	118
query21	217	125	107	107
query22	4496	4492	4413	4413
query23	33997	33204	33251	33204
query24	6378	2399	2459	2399
query25	451	474	405	405
query26	721	283	161	161
query27	2060	503	331	331
query28	3064	2463	2479	2463
query29	569	576	427	427
query30	272	216	186	186
query31	868	908	760	760
query32	77	59	63	59
query33	451	363	301	301
query34	769	855	518	518
query35	816	849	773	773
query36	949	998	939	939
query37	121	100	86	86
query38	4161	4172	4097	4097
query39	1486	1420	1449	1420
query40	212	114	101	101
query41	52	51	52	51
query42	123	107	105	105
query43	513	504	489	489
query44	1361	816	820	816
query45	179	172	170	170
query46	858	1043	643	643
query47	1836	1926	1820	1820
query48	404	438	310	310
query49	710	532	407	407
query50	770	757	430	430
query51	4312	4331	4242	4242
query52	108	107	96	96
query53	247	266	193	193
query54	481	496	414	414
query55	81	87	84	84
query56	281	280	275	275
query57	1152	1196	1109	1109
query58	255	247	237	237
query59	2851	2999	2795	2795
query60	294	302	266	266
query61	126	125	119	119
query62	747	705	702	702
query63	229	192	191	191
query64	1681	1080	678	678
query65	4471	4568	4449	4449
query66	719	430	293	293
query67	15748	15501	15321	15321
query68	6338	803	499	499
query69	545	303	263	263
query70	1182	1133	1123	1123
query71	443	303	263	263
query72	6139	3873	3790	3790
query73	1184	757	347	347
query74	8977	9207	8957	8957
query75	3275	3159	2757	2757
query76	3789	1202	759	759
query77	524	364	280	280
query78	10106	10227	9338	9338
query79	2421	825	575	575
query80	618	514	458	458
query81	509	259	229	229
query82	459	125	98	98
query83	180	168	146	146
query84	292	92	76	76
query85	761	352	309	309
query86	349	301	290	290
query87	4458	4659	4392	4392
query88	3583	2296	2365	2296
query89	413	319	281	281
query90	1654	215	216	215
query91	134	136	106	106
query92	72	57	60	57
query93	1850	1074	588	588
query94	746	406	310	310
query95	354	280	259	259
query96	484	568	278	278
query97	3368	3506	3338	3338
query98	228	215	200	200
query99	1453	1410	1259	1259
Total cold run time: 297098 ms
Total hot run time: 192129 ms

@doris-robot
Copy link

ClickBench: Total hot run time: 31.38 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit f1b0bfe1faccb3939ec977c727e640ccd777ad95, data reload: false

query1	0.03	0.04	0.03
query2	0.07	0.03	0.04
query3	0.23	0.06	0.07
query4	1.63	0.11	0.11
query5	0.57	0.55	0.54
query6	1.19	0.73	0.72
query7	0.02	0.02	0.02
query8	0.04	0.02	0.04
query9	0.59	0.52	0.53
query10	0.59	0.60	0.59
query11	0.16	0.10	0.11
query12	0.14	0.11	0.12
query13	0.63	0.61	0.60
query14	2.68	2.68	2.80
query15	0.93	0.86	0.84
query16	0.38	0.37	0.37
query17	1.06	1.04	1.06
query18	0.21	0.20	0.20
query19	1.91	1.87	2.00
query20	0.02	0.01	0.01
query21	15.36	0.93	0.55
query22	0.75	1.11	0.76
query23	14.91	1.40	0.60
query24	6.72	1.39	1.58
query25	0.55	0.29	0.05
query26	0.48	0.16	0.13
query27	0.06	0.05	0.04
query28	10.94	0.86	0.42
query29	12.54	4.02	3.26
query30	0.25	0.11	0.06
query31	2.81	0.61	0.38
query32	3.23	0.56	0.47
query33	3.01	3.01	3.03
query34	15.84	5.13	4.49
query35	4.51	4.51	4.58
query36	0.66	0.49	0.50
query37	0.09	0.07	0.06
query38	0.04	0.04	0.03
query39	0.02	0.02	0.02
query40	0.17	0.14	0.13
query41	0.08	0.02	0.02
query42	0.03	0.02	0.02
query43	0.04	0.03	0.03
Total cold run time: 106.17 s
Total hot run time: 31.38 s

@hello-stephen
Copy link
Contributor

BE UT Coverage Report

Increment line coverage 0.00% (0/23) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 48.33% (12928/26750)
Line Coverage 37.83% (110885/293123)
Region Coverage 36.79% (56573/153784)
Branch Coverage 32.01% (28485/89000)

Copy link
Contributor

@liaoxin01 liaoxin01 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@liaoxin01 liaoxin01 changed the title [fix] (straemload) fixed the issue of data loss due to concurrency wh… [fix] (streamload) fixed the issue of data loss due to concurrency wh… Mar 16, 2025
@morningman morningman merged commit 2a2b229 into apache:master Mar 17, 2025
28 of 30 checks passed
@morningman morningman added the usercase Important user case type label label Mar 17, 2025
dataroaring pushed a commit that referenced this pull request Mar 31, 2025
…en importing data from streamload (#48948) (#49666)

Cherry-picked from #48948

Co-authored-by: kang <35803862+ghkang98@users.noreply.github.com>
Co-authored-by: lik40 <lik40@chinatelecom.cn>
liaoxin01 pushed a commit to liaoxin01/doris that referenced this pull request Apr 10, 2025
…en importing data from streamload (apache#48948)

This is mainly to solve the multithreading problem caused by
inconsistent visible order of EOS and data_queue variables in doris's
streamload function and asyn_result_writer in the process_block process
due to the compilation reordering of the ARM system or the weak memory
order problem, which leads to data loss.

Problem Summary:

Mainly in the arm architecture, streamload has data loss problems. The
transaction of importing data can be executed and submitted normally,
but the NumberTotalRowshe NumberFilterRows in the returned load result
are both zero
[Uploading stream_load_lost_data.docx…]()
((https://github.com/user-attachments/files/19201955/stream_load.docx))

Co-authored-by: lik40 <lik40@chinatelecom.cn>
liaoxin01 pushed a commit to liaoxin01/doris that referenced this pull request Apr 10, 2025
…en importing data from streamload (apache#48948)

This is mainly to solve the multithreading problem caused by
inconsistent visible order of EOS and data_queue variables in doris's
streamload function and asyn_result_writer in the process_block process
due to the compilation reordering of the ARM system or the weak memory
order problem, which leads to data loss.

Problem Summary:

Mainly in the arm architecture, streamload has data loss problems. The
transaction of importing data can be executed and submitted normally,
but the NumberTotalRowshe NumberFilterRows in the returned load result
are both zero
[Uploading stream_load_lost_data.docx…]()
((https://github.com/user-attachments/files/19201955/stream_load.docx))

Co-authored-by: lik40 <lik40@chinatelecom.cn>
yiguolei pushed a commit that referenced this pull request Apr 11, 2025
…ncurrency wh… #48948 (#49937)

cherry pick from #48948

Co-authored-by: kang <35803862+ghkang98@users.noreply.github.com>
Co-authored-by: lik40 <lik40@chinatelecom.cn>
starocean999 pushed a commit to starocean999/incubator-doris that referenced this pull request Apr 27, 2025
@yiguolei yiguolei mentioned this pull request May 13, 2025
koarz pushed a commit to koarz/doris that referenced this pull request Jun 4, 2025
…en importing data from streamload (apache#48948)

### What problem does this PR solve?

This is mainly to solve the multithreading problem caused by
inconsistent visible order of EOS and data_queue variables in doris's
streamload function and asyn_result_writer in the process_block process
due to the compilation reordering of the ARM system or the weak memory
order problem, which leads to data loss.

Problem Summary:

Mainly in the arm architecture, streamload has data loss problems. The
transaction of importing data can be executed and submitted normally,
but the NumberTotalRowshe NumberFilterRows in the returned load result
are both zero
[Uploading stream_load_lost_data.docx…]()
((https://github.com/user-attachments/files/19201955/stream_load.docx))

Co-authored-by: lik40 <lik40@chinatelecom.cn>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by one committer. dev/2.1.10-merged dev/3.0.5-merged p0_w reviewed usercase Important user case type label

Projects

None yet

Development

Successfully merging this pull request may close these issues.

9 participants