Skip to content

Conversation

@morningman
Copy link
Contributor

@morningman morningman commented Apr 23, 2025

What problem does this PR solve?

Problem Summary:

After query finished, the audit event will be created and:

  1. Try to put into queryAuditEventList in WorkloadRuntimeStatusMgr

    WorkloadRuntimeStatusMgr has a background thread to visit queryAuditEventList, try to wait for
    the statistic info of the event(such as scan rows, cpu time, etc.) before logging this event.

    The size of queryAuditEventList is very large, default is 250000, which is configured by FE config
    audit_event_log_queue_size. So that it can always hold all events within a certain period (typically, 5 seconds).

  2. If queryAuditEventList is full, the statistic info of this audit event will be ignored, and event will be logged directly.

Either the event has the statistic info or not, it will be put into a eventQueue in AuditEventProcessor.
But this queue only has capacity of 10000, which may not enough to cache all events in a certain periods.
So some of audit event will be discarded and lost.

This PR mainly changes:

  1. Use the same config audit_event_log_queue_size as the max size of eventQueue.
  2. Change some log message for easy debugging in future.

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@Thearas
Copy link
Contributor

Thearas commented Apr 23, 2025

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@morningman morningman requested a review from Copilot April 25, 2025 22:45
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR introduces logging adjustments to help debug missing audit logs. The key changes include:

  • Removing the boolean flag from handleAuditEvent calls to adjust audit event processing.
  • Updating log messages and their levels to provide more informative warnings.
  • Correcting a comment typo regarding “nereids” in the query processor.

Reviewed Changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated no comments.

Show a summary per file
File Description
fe/fe-core/src/main/java/org/apache/doris/resource/workloadschedpolicy/WorkloadRuntimeStatusMgr.java Updated audit event handling and log warn message when queue is full.
fe/fe-core/src/main/java/org/apache/doris/qe/MysqlConnectProcessor.java Fixed typo in log comment.
fe/fe-core/src/main/java/org/apache/doris/qe/ConnectProcessor.java Changed log level to warn in exception handling for query execution.
fe/fe-core/src/main/java/org/apache/doris/qe/AuditLogHelper.java Refactored audit event submission and added debug logging for event query.
fe/fe-core/src/main/java/org/apache/doris/qe/AuditEventProcessor.java Added a queue threshold check and updated exception log levels.
fe/fe-core/src/main/java/org/apache/doris/plugin/audit/AuditLogBuilder.java Updated log level from debug to warn on audit event process failures.
Comments suppressed due to low confidence (5)

fe/fe-core/src/main/java/org/apache/doris/resource/workloadschedpolicy/WorkloadRuntimeStatusMgr.java:92

  • Removal of the boolean flag in the handleAuditEvent call changes the event processing logic. Confirm that this behavior change is intentional and does not lead to unintended audit logging issues.
boolean ret = Env.getCurrentAuditEventProcessor().handleAuditEvent(auditEvent);

fe/fe-core/src/main/java/org/apache/doris/qe/MysqlConnectProcessor.java:212

  • Corrected the spelling of 'nereids'.
// nereids

fe/fe-core/src/main/java/org/apache/doris/qe/ConnectProcessor.java:384

  • [nitpick] Changing the log level from debug to warn may generate excessive logs in production. Verify that warn is appropriate for this exception handling context.
LOG.warn("throw exception for query: {}", DebugUtil.printId(ctx.queryId), throwable);

fe/fe-core/src/main/java/org/apache/doris/qe/AuditEventProcessor.java:98

  • The manual check for queue size to discard audit events may inadvertently drop important events. Confirm that this threshold logic aligns with the intended audit event processing behavior.
if (eventQueue.size() >= Config.audit_event_log_queue_size) {

fe/fe-core/src/main/java/org/apache/doris/plugin/audit/AuditLogBuilder.java:96

  • [nitpick] Ensure that changing the log level to warn adequately signals failures without creating excessive noise in production logs.
LOG.warn("failed to process audit event: {}", event.queryId, e);

@morningman morningman changed the title [tmp] add log for debugging audit log missing [fix](audit) fix potential audit log missing issue Apr 25, 2025
@morningman
Copy link
Contributor Author

run buildall

@doris-robot
Copy link

TPC-H: Total hot run time: 34094 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit b32c93ed075d9295d646392068f3631947007d39, data reload: false

------ Round 1 ----------------------------------
q1	25921	5066	5041	5041
q2	2072	284	195	195
q3	10370	1212	696	696
q4	10241	972	519	519
q5	7507	2359	2305	2305
q6	183	166	133	133
q7	914	764	615	615
q8	9335	1216	1019	1019
q9	6834	5091	5119	5091
q10	6921	2309	1904	1904
q11	498	291	277	277
q12	354	348	215	215
q13	17826	3736	3100	3100
q14	221	228	217	217
q15	536	509	488	488
q16	456	456	408	408
q17	591	849	387	387
q18	7416	7298	7138	7138
q19	1776	953	560	560
q20	338	339	228	228
q21	4345	3265	2576	2576
q22	1045	1011	982	982
Total cold run time: 115700 ms
Total hot run time: 34094 ms

----- Round 2, with runtime_filter_mode=off -----
q1	5373	5195	5204	5195
q2	235	332	228	228
q3	2155	2656	2313	2313
q4	1398	1802	1461	1461
q5	4562	4507	4422	4422
q6	220	171	124	124
q7	2032	2053	1859	1859
q8	2701	2669	2643	2643
q9	7609	7468	7554	7468
q10	3100	3217	2809	2809
q11	595	530	504	504
q12	691	755	599	599
q13	3625	3899	3384	3384
q14	276	285	277	277
q15	537	497	496	496
q16	468	518	468	468
q17	1175	1571	1383	1383
q18	7669	7688	7428	7428
q19	816	841	942	841
q20	1982	1933	1821	1821
q21	5154	4892	4686	4686
q22	1088	1038	995	995
Total cold run time: 53461 ms
Total hot run time: 51404 ms

@doris-robot
Copy link

TPC-DS: Total hot run time: 192735 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit b32c93ed075d9295d646392068f3631947007d39, data reload: false

query1	1419	1059	1064	1059
query2	6227	1921	1900	1900
query3	10991	4502	4493	4493
query4	57812	24837	23082	23082
query5	5179	475	450	450
query6	373	204	214	204
query7	4990	505	281	281
query8	322	237	230	230
query9	6276	2566	2587	2566
query10	451	326	255	255
query11	15091	15066	14787	14787
query12	156	109	99	99
query13	1095	498	397	397
query14	10023	6316	6355	6316
query15	213	194	178	178
query16	7219	675	500	500
query17	1119	747	614	614
query18	1793	408	320	320
query19	208	198	194	194
query20	141	126	127	126
query21	211	124	110	110
query22	4308	4344	4335	4335
query23	34078	33579	33463	33463
query24	6612	2402	2400	2400
query25	465	468	457	457
query26	706	269	157	157
query27	2303	512	338	338
query28	3199	2155	2169	2155
query29	573	565	434	434
query30	272	218	189	189
query31	879	884	763	763
query32	76	66	61	61
query33	444	374	341	341
query34	761	875	526	526
query35	822	851	737	737
query36	957	1002	934	934
query37	119	99	77	77
query38	4225	4328	4328	4328
query39	1514	1428	1457	1428
query40	212	123	108	108
query41	56	59	52	52
query42	126	114	119	114
query43	522	534	486	486
query44	1351	821	817	817
query45	182	172	168	168
query46	840	1030	649	649
query47	1846	1849	1799	1799
query48	380	408	299	299
query49	684	506	436	436
query50	670	689	407	407
query51	4220	4403	4174	4174
query52	111	109	101	101
query53	231	252	191	191
query54	579	578	525	525
query55	83	84	84	84
query56	317	308	291	291
query57	1179	1175	1148	1148
query58	264	259	253	253
query59	2769	2922	2827	2827
query60	327	318	296	296
query61	130	127	125	125
query62	745	731	679	679
query63	216	190	190	190
query64	1944	1036	681	681
query65	4345	4271	4182	4182
query66	716	392	355	355
query67	15704	15661	15404	15404
query68	7838	878	519	519
query69	539	303	265	265
query70	1205	1135	1072	1072
query71	495	323	294	294
query72	5507	4808	4903	4808
query73	1388	646	349	349
query74	8961	8812	9042	8812
query75	3769	3217	2678	2678
query76	4246	1207	749	749
query77	614	377	291	291
query78	9938	10040	9276	9276
query79	2502	801	572	572
query80	657	516	434	434
query81	498	252	217	217
query82	453	124	97	97
query83	379	250	238	238
query84	285	113	96	96
query85	842	349	311	311
query86	426	301	268	268
query87	4399	4477	4274	4274
query88	3488	2226	2210	2210
query89	399	322	292	292
query90	1908	214	214	214
query91	141	140	112	112
query92	82	65	56	56
query93	1860	962	579	579
query94	678	419	297	297
query95	368	303	351	303
query96	482	563	276	276
query97	3164	3251	3180	3180
query98	219	206	219	206
query99	1443	1427	1265	1265
Total cold run time: 303544 ms
Total hot run time: 192735 ms

@doris-robot
Copy link

ClickBench: Total hot run time: 29.5 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit b32c93ed075d9295d646392068f3631947007d39, data reload: false

query1	0.04	0.04	0.03
query2	0.12	0.11	0.12
query3	0.25	0.20	0.19
query4	1.59	0.19	0.19
query5	0.59	0.61	0.62
query6	1.20	0.72	0.72
query7	0.02	0.02	0.02
query8	0.04	0.04	0.03
query9	0.58	0.52	0.53
query10	0.57	0.56	0.56
query11	0.15	0.10	0.11
query12	0.15	0.11	0.11
query13	0.63	0.60	0.60
query14	1.18	1.21	1.20
query15	0.88	0.87	0.84
query16	0.39	0.37	0.38
query17	1.03	1.03	1.04
query18	0.21	0.19	0.19
query19	1.89	1.80	1.84
query20	0.02	0.01	0.01
query21	15.39	0.87	0.54
query22	0.76	1.09	0.68
query23	14.98	1.36	0.60
query24	6.87	1.29	0.76
query25	0.51	0.20	0.06
query26	0.58	0.16	0.13
query27	0.05	0.06	0.05
query28	9.73	0.88	0.43
query29	12.53	3.92	3.31
query30	0.25	0.10	0.06
query31	2.81	0.59	0.39
query32	3.24	0.54	0.46
query33	3.11	3.14	3.08
query34	15.75	5.05	4.48
query35	4.53	4.55	4.46
query36	0.66	0.50	0.48
query37	0.09	0.06	0.06
query38	0.05	0.03	0.04
query39	0.04	0.02	0.03
query40	0.17	0.14	0.12
query41	0.08	0.03	0.02
query42	0.03	0.02	0.02
query43	0.03	0.03	0.03
Total cold run time: 103.77 s
Total hot run time: 29.5 s

@morningman morningman marked this pull request as ready for review April 26, 2025 04:47
@morningman morningman closed this Apr 26, 2025
@morningman morningman reopened this Apr 26, 2025
Copy link
Contributor

@wangbo wangbo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@github-actions github-actions bot added the approved Indicates a PR has been approved by one committer. label Apr 26, 2025
@github-actions
Copy link
Contributor

PR approved by at least one committer and no changes requested.

@github-actions
Copy link
Contributor

PR approved by anyone and no changes requested.

@morningman morningman merged commit 0e92232 into apache:master Apr 27, 2025
31 of 32 checks passed
morningman added a commit to morningman/doris that referenced this pull request Apr 28, 2025
Problem Summary:

After query finished, the audit event will be created and:
1. Try to put into `queryAuditEventList` in `WorkloadRuntimeStatusMgr`

`WorkloadRuntimeStatusMgr` has a background thread to visit
`queryAuditEventList`, try to wait for
the statistic info of the event(such as scan rows, cpu time, etc.)
before logging this event.

The size of `queryAuditEventList` is very large, default is `250000`,
which is configured by FE config
`audit_event_log_queue_size`. So that it can always hold all events
within a certain period (typically, 5 seconds).

2. If `queryAuditEventList` is full, the statistic info of this audit
event will be ignored, and event will be logged directly.

Either the event has the statistic info or not, it will be put into a
`eventQueue` in `AuditEventProcessor`.
But this queue only has capacity of `10000`, which may not enough to
cache all events in a certain periods.
So some of audit event will be discarded and lost.

This PR mainly changes:
1. Use the same config `audit_event_log_queue_size` as the max size of
`eventQueue`.
2. Change some log message for easy debugging in future.
morningman added a commit to morningman/doris that referenced this pull request Apr 29, 2025
Problem Summary:

After query finished, the audit event will be created and:
1. Try to put into `queryAuditEventList` in `WorkloadRuntimeStatusMgr`

`WorkloadRuntimeStatusMgr` has a background thread to visit
`queryAuditEventList`, try to wait for
the statistic info of the event(such as scan rows, cpu time, etc.)
before logging this event.

The size of `queryAuditEventList` is very large, default is `250000`,
which is configured by FE config
`audit_event_log_queue_size`. So that it can always hold all events
within a certain period (typically, 5 seconds).

2. If `queryAuditEventList` is full, the statistic info of this audit
event will be ignored, and event will be logged directly.

Either the event has the statistic info or not, it will be put into a
`eventQueue` in `AuditEventProcessor`.
But this queue only has capacity of `10000`, which may not enough to
cache all events in a certain periods.
So some of audit event will be discarded and lost.

This PR mainly changes:
1. Use the same config `audit_event_log_queue_size` as the max size of
`eventQueue`.
2. Change some log message for easy debugging in future.
@yiguolei yiguolei mentioned this pull request May 13, 2025
koarz pushed a commit to koarz/doris that referenced this pull request Jun 4, 2025
### What problem does this PR solve?

Problem Summary:

After query finished, the audit event will be created and:
1. Try to put into `queryAuditEventList` in `WorkloadRuntimeStatusMgr`

`WorkloadRuntimeStatusMgr` has a background thread to visit
`queryAuditEventList`, try to wait for
the statistic info of the event(such as scan rows, cpu time, etc.)
before logging this event.

The size of `queryAuditEventList` is very large, default is `250000`,
which is configured by FE config
`audit_event_log_queue_size`. So that it can always hold all events
within a certain period (typically, 5 seconds).

2. If `queryAuditEventList` is full, the statistic info of this audit
event will be ignored, and event will be logged directly.

Either the event has the statistic info or not, it will be put into a
`eventQueue` in `AuditEventProcessor`.
But this queue only has capacity of `10000`, which may not enough to
cache all events in a certain periods.
So some of audit event will be discarded and lost.

This PR mainly changes:
1. Use the same config `audit_event_log_queue_size` as the max size of
`eventQueue`.
2. Change some log message for easy debugging in future.
@gavinchou gavinchou mentioned this pull request Jun 11, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by one committer. dev/2.0.x dev/2.1.10-merged dev/3.0.6-merged reviewed usercase Important user case type label

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants