@Baymine Baymine commented Dec 27, 2024

What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary:

  • Currently, the Doris SQL cache can miss hits because semantically identical queries are treated as different when they differ only in:
    • Extra whitespace characters in the SQL text
    • SQL comments that do not affect query execution
  • For example, these queries are semantically identical but would generate different cache keys:
    SELECT * FROM table;
    -- Same query with comments and extra spaces
    /* Comment */  SELECT   *   FROM   table  ;
  • This PR improves the SQL cache hit rate by:
    • Trimming whitespace from SQL queries
    • Removing SQL comments before calculating the MD5 cache key
  • As a result, queries that differ only in whitespace or comments now hit the same cache entry, which improves cache efficiency and avoids unnecessary query executions.
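
The normalization described above can be sketched roughly as follows. This is a minimal illustration, not the code merged in this PR: the class `SqlCacheKeySketch` and the methods `stripCommentsAndTrim`, `md5Hex`, and `cacheKey` are hypothetical names. The sketch assumes that `--` always starts a line comment, that quoted literals must be copied untouched, and that `/*+ ... */` blocks are optimizer hints worth preserving. Interior whitespace is left as written.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical sketch; names and rules are assumptions, not the Doris implementation.
class SqlCacheKeySketch {

    /** Drops comments found outside quoted literals, then trims the result. */
    static String stripCommentsAndTrim(String sql) {
        StringBuilder out = new StringBuilder(sql.length());
        int n = sql.length();
        int i = 0;
        while (i < n) {
            char c = sql.charAt(i);
            if (c == '\'' || c == '"' || c == '`') {
                // Copy a quoted literal or identifier verbatim, skipping backslash escapes.
                int j = i + 1;
                while (j < n && sql.charAt(j) != c) {
                    j += (sql.charAt(j) == '\\' && j + 1 < n) ? 2 : 1;
                }
                j = Math.min(j + 1, n);
                out.append(sql, i, j);
                i = j;
            } else if (c == '-' && i + 1 < n && sql.charAt(i + 1) == '-') {
                // Line comment: skip up to (not including) the newline.
                while (i < n && sql.charAt(i) != '\n') {
                    i++;
                }
            } else if (c == '/' && i + 1 < n && sql.charAt(i + 1) == '*') {
                int end = sql.indexOf("*/", i + 2);
                int next = (end < 0) ? n : end + 2;
                boolean isHint = i + 2 < n && sql.charAt(i + 2) == '+';
                // Assumption: keep hints; replace ordinary block comments with one
                // space so that neighbouring tokens do not merge.
                out.append(isHint ? sql.substring(i, next) : " ");
                i = next;
            } else {
                out.append(c);
                i++;
            }
        }
        return out.toString().trim();
    }

    /** Lower-case hexadecimal MD5 of the UTF-8 bytes of s. */
    static String md5Hex(String s) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(s.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 digest unavailable", e);
        }
    }

    /** Cache key computed from the normalized text rather than the raw text. */
    static String cacheKey(String sql) {
        return md5Hex(stripCommentsAndTrim(sql));
    }

    public static void main(String[] args) {
        String plain = "SELECT * FROM t";
        String noisy = "  /* comment */ SELECT * FROM t -- trailing note\n";
        System.out.println(cacheKey(plain).equals(cacheKey(noisy)));
        System.out.println(stripCommentsAndTrim("SELECT '-- not a comment' FROM t"));
    }
}
```

Replacing a block comment with a space instead of deleting it is a deliberately conservative choice in this sketch: it can cost a cache hit, but it never lets two different token sequences collapse into the same key.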

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

Thearas commented Dec 27, 2024

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@924060929

run buildall

@doris-robot

TPC-H: Total hot run time: 32710 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 3f42e6e093c52cfedf2675419941951dc5d46d96, data reload: false

------ Round 1 ----------------------------------
query	cold (ms)	hot 1 (ms)	hot 2 (ms)	best hot (ms)
q1	17578	6207	6003	6003
q2	2037	304	176	176
q3	10402	1226	760	760
q4	10209	862	438	438
q5	7516	2178	1960	1960
q6	202	179	147	147
q7	895	746	593	593
q8	9236	1326	1124	1124
q9	5243	4952	5013	4952
q10	6763	2301	1876	1876
q11	459	271	253	253
q12	342	358	228	228
q13	17755	3677	3019	3019
q14	229	229	227	227
q15	570	504	519	504
q16	612	604	583	583
q17	547	856	330	330
q18	6952	6555	6527	6527
q19	1660	984	550	550
q20	321	316	194	194
q21	2792	2133	1952	1952
q22	359	332	314	314
Total cold run time: 102679 ms
Total hot run time: 32710 ms

----- Round 2, with runtime_filter_mode=off -----
query	cold (ms)	hot 1 (ms)	hot 2 (ms)	best hot (ms)
q1	6227	6183	6220	6183
q2	242	322	230	230
q3	2277	2625	2341	2341
q4	1390	1835	1394	1394
q5	4339	4745	4792	4745
q6	181	178	147	147
q7	2088	1915	1873	1873
q8	2628	2837	2704	2704
q9	7298	7260	7315	7260
q10	3079	3365	2857	2857
q11	587	518	507	507
q12	638	706	567	567
q13	3482	3898	3259	3259
q14	281	299	289	289
q15	561	531	501	501
q16	667	663	675	663
q17	1233	1762	1264	1264
q18	7635	7447	7423	7423
q19	836	1099	1119	1099
q20	1986	2060	1964	1964
q21	5694	5288	5027	5027
q22	624	588	570	570
Total cold run time: 53973 ms
Total hot run time: 52867 ms

@doris-robot

TPC-DS: Total hot run time: 196492 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 3f42e6e093c52cfedf2675419941951dc5d46d96, data reload: false

query	cold (ms)	hot 1 (ms)	hot 2 (ms)	best hot (ms)
query1	1289	968	942	942
query2	6493	2364	2331	2331
query3	10978	4603	4689	4603
query4	32946	23686	23365	23365
query5	3666	583	480	480
query6	275	203	187	187
query7	3984	494	306	306
query8	309	252	259	252
query9	9403	2658	2629	2629
query10	474	309	251	251
query11	17962	15733	15283	15283
query12	160	111	109	109
query13	1685	533	417	417
query14	10990	6904	7805	6904
query15	246	215	193	193
query16	8048	650	465	465
query17	1570	786	603	603
query18	1962	418	323	323
query19	264	179	165	165
query20	118	123	115	115
query21	208	127	106	106
query22	4709	4694	4574	4574
query23	34481	33641	33221	33221
query24	6092	2329	2400	2329
query25	457	468	403	403
query26	896	280	154	154
query27	2662	443	340	340
query28	5605	2508	2468	2468
query29	564	582	453	453
query30	209	184	162	162
query31	1005	945	881	881
query32	72	63	57	57
query33	481	361	292	292
query34	785	851	517	517
query35	871	850	800	800
query36	1038	1077	969	969
query37	117	103	78	78
query38	4382	4333	4175	4175
query39	1524	1470	1449	1449
query40	204	117	104	104
query41	46	43	43	43
query42	120	102	109	102
query43	532	549	509	509
query44	1325	820	830	820
query45	201	180	176	176
query46	897	1074	655	655
query47	2050	2017	1940	1940
query48	411	411	325	325
query49	754	504	405	405
query50	634	690	406	406
query51	7330	7151	7217	7151
query52	100	103	92	92
query53	232	266	185	185
query54	496	505	422	422
query55	88	80	83	80
query56	284	263	280	263
query57	1274	1226	1160	1160
query58	241	227	227	227
query59	3366	3553	3379	3379
query60	270	264	247	247
query61	120	136	110	110
query62	891	815	768	768
query63	237	195	197	195
query64	3045	1071	690	690
query65	3351	3250	3237	3237
query66	794	419	316	316
query67	16449	15862	15610	15610
query68	7613	698	514	514
query69	478	293	253	253
query70	1254	1147	1120	1120
query71	439	293	251	251
query72	6581	3944	3820	3820
query73	643	752	360	360
query74	10486	9197	8912	8912
query75	4055	3156	2682	2682
query76	3738	1204	775	775
query77	762	362	287	287
query78	10320	10223	9368	9368
query79	3261	827	580	580
query80	575	517	442	442
query81	498	267	229	229
query82	630	149	123	123
query83	162	172	140	140
query84	237	83	84	83
query85	788	386	308	308
query86	359	313	312	312
query87	4466	4446	4358	4358
query88	4658	2168	2174	2168
query89	425	346	296	296
query90	1906	188	180	180
query91	131	133	105	105
query92	67	56	52	52
query93	2095	889	533	533
query94	646	411	290	290
query95	339	265	255	255
query96	477	611	276	276
query97	2894	2919	2821	2821
query98	227	205	195	195
query99	1490	1555	1436	1436
Total cold run time: 297422 ms
Total hot run time: 196492 ms

@doris-robot

ClickBench: Total hot run time: 31.58 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 3f42e6e093c52cfedf2675419941951dc5d46d96, data reload: false

query	cold (s)	hot 1 (s)	hot 2 (s)
query1	0.03	0.04	0.03
query2	0.07	0.04	0.03
query3	0.24	0.08	0.07
query4	1.62	0.11	0.10
query5	0.41	0.41	0.41
query6	1.15	0.64	0.65
query7	0.02	0.02	0.01
query8	0.04	0.04	0.03
query9	0.58	0.50	0.51
query10	0.54	0.57	0.54
query11	0.14	0.10	0.10
query12	0.15	0.11	0.11
query13	0.60	0.61	0.61
query14	2.73	2.86	2.71
query15	0.90	0.82	0.82
query16	0.39	0.38	0.38
query17	1.03	1.06	1.06
query18	0.22	0.20	0.22
query19	1.86	2.02	1.91
query20	0.01	0.01	0.02
query21	15.35	0.92	0.59
query22	0.76	0.78	0.76
query23	15.22	1.37	0.56
query24	3.41	1.59	0.98
query25	0.19	0.20	0.11
query26	0.32	0.16	0.14
query27	0.06	0.05	0.04
query28	13.60	1.52	1.05
query29	12.59	3.97	3.29
query30	0.26	0.10	0.07
query31	2.81	0.58	0.38
query32	3.23	0.52	0.48
query33	3.06	3.21	3.14
query34	16.78	5.07	4.46
query35	4.49	4.46	4.50
query36	0.66	0.53	0.49
query37	0.10	0.07	0.06
query38	0.04	0.03	0.04
query39	0.03	0.02	0.02
query40	0.17	0.15	0.13
query41	0.08	0.02	0.02
query42	0.03	0.02	0.02
query43	0.04	0.03	0.03
Total cold run time: 106.01 s
Total hot run time: 31.58 s

@924060929 924060929 left a comment

Why not cache the normalized SQL in SqlCacheContext, so that the SQL does not have to be normalized every time?
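
One way to read this suggestion is lazy memoization: normalize once, on first use, and keep the result on the context object. The sketch below only illustrates that idea. `SqlCacheContextSketch`, its fields, and its methods are hypothetical and are not taken from the real `SqlCacheContext`; `String::trim` stands in for whatever normalizer the PR actually uses.

```java
import java.util.function.UnaryOperator;

// Hypothetical stand-in for a per-query cache context; not the Doris class.
class SqlCacheContextSketch {
    private final String originSql;
    private final UnaryOperator<String> normalizer;
    private String normalizedSql;   // null until first requested
    private int normalizeCalls;     // kept only to make the memoization observable

    SqlCacheContextSketch(String originSql, UnaryOperator<String> normalizer) {
        this.originSql = originSql;
        this.normalizer = normalizer;
    }

    /** Normalizes on the first call and returns the stored value afterwards. */
    synchronized String getNormalizedSql() {
        if (normalizedSql == null) {
            normalizedSql = normalizer.apply(originSql);
            normalizeCalls++;
        }
        return normalizedSql;
    }

    synchronized int normalizeCalls() {
        return normalizeCalls;
    }

    public static void main(String[] args) {
        SqlCacheContextSketch ctx =
                new SqlCacheContextSketch("  SELECT 1  ", String::trim);
        ctx.getNormalizedSql();
        ctx.getNormalizedSql();
        System.out.println(ctx.getNormalizedSql() + " / normalize calls: " + ctx.normalizeCalls());
    }
}
```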

@924060929 changed the title from "[fix](cache) enhance cache key computation by removing comments and trimming SQL input" to "[opt](cache) enhance cache key computation by removing comments and trimming SQL input" on Jan 6, 2025
@github-actions bot added the "approved" label (Indicates a PR has been approved by one committer.) on Jan 6, 2025

github-actions bot commented Jan 6, 2025

PR approved by at least one committer and no changes requested.

github-actions bot commented Jan 6, 2025

PR approved by anyone and no changes requested.

@924060929 924060929 merged commit 6225732 into apache:master Jan 6, 2025
34 of 37 checks passed
github-actions bot pushed a commit that referenced this pull request Jan 6, 2025
…rimming SQL input (#46099)

- Currently, the SQL cache system in Doris may miss cache hits due to
semantically identical queries being treated as different because of:
  - Extra whitespace characters in the SQL query
  - SQL comments that don't affect the query execution
- For example, these queries are semantically identical but would
generate different cache keys:
  ```sql
  SELECT * FROM table;
  -- Same query with comments and extra spaces
  /* Comment */  SELECT   *   FROM   table  ;
  ```
- This PR improves the SQL cache hit rate by:
  - Trimming whitespace from SQL queries
  - Removing SQL comments before calculating the cache key MD5
- This ensures that queries that are semantically identical but differ
only in whitespace or comments will now hit the same cache entry,
improving cache efficiency and reducing unnecessary query executions
github-actions bot pushed a commit that referenced this pull request Jan 6, 2025
…rimming SQL input (#46099)

- Currently, the SQL cache system in Doris may miss cache hits due to
semantically identical queries being treated as different because of:
  - Extra whitespace characters in the SQL query
  - SQL comments that don't affect the query execution
- For example, these queries are semantically identical but would
generate different cache keys:
  ```sql
  SELECT * FROM table;
  -- Same query with comments and extra spaces
  /* Comment */  SELECT   *   FROM   table  ;
  ```
- This PR improves the SQL cache hit rate by:
  - Trimming whitespace from SQL queries
  - Removing SQL comments before calculating the cache key MD5
- This ensures that queries that are semantically identical but differ
only in whitespace or comments will now hit the same cache entry,
improving cache efficiency and reducing unnecessary query executions
yiguolei pushed a commit that referenced this pull request Jan 6, 2025
…rimming SQL input (#46099)

- Currently, the SQL cache system in Doris may miss cache hits due to
semantically identical queries being treated as different because of:
  - Extra whitespace characters in the SQL query
  - SQL comments that don't affect the query execution
- For example, these queries are semantically identical but would
generate different cache keys:
  ```sql
  SELECT * FROM table;
  -- Same query with comments and extra spaces
  /* Comment */  SELECT   *   FROM   table  ;
  ```
- This PR improves the SQL cache hit rate by:
  - Trimming whitespace from SQL queries
  - Removing SQL comments before calculating the cache key MD5
- This ensures that queries that are semantically identical but differ
only in whitespace or comments will now hit the same cache entry,
improving cache efficiency and reducing unnecessary query executions
yiguolei pushed a commit that referenced this pull request Jan 7, 2025
…mments and trimming SQL input #46099 (#46472)

Cherry-picked from #46099

Co-authored-by: York Cao <52438394+Baymine@users.noreply.github.com>
dataroaring pushed a commit that referenced this pull request Mar 20, 2025
…mments and trimming SQL input #46099 (#46471)

Cherry-picked from #46099

Co-authored-by: York Cao <52438394+Baymine@users.noreply.github.com>
@924060929 924060929 mentioned this pull request Mar 24, 2025
16 tasks
924060929 added a commit that referenced this pull request Mar 26, 2025
github-actions bot pushed a commit that referenced this pull request Mar 26, 2025
github-actions bot pushed a commit that referenced this pull request Mar 26, 2025
yiguolei pushed a commit that referenced this pull request Mar 29, 2025
koarz pushed a commit to koarz/doris that referenced this pull request Jun 4, 2025

Labels

approved Indicates a PR has been approved by one committer. dev/2.1.8-merged dev/3.0.5-merged reviewed

7 participants