@morningman
Contributor

@morningman morningman commented Jun 24, 2025

morningman and others added 3 commits June 24, 2025 14:39
…de (apache#47299)

Previously, the BE node used a principal and keytab for Kerberos
authentication.
However, only the modified Hadoop libhdfs supports authenticating this way;
the original libhdfs only supports setting a Kerberos ticket cache path or
using the system-level Kerberos authentication context.

This pull request introduces a comprehensive Kerberos authentication
module for the BE.
The module is designed to handle Kerberos ticket management, including
initialization, authentication, and periodic ticket refresh.
It provides a robust interface for integrating Kerberos authentication,
ensuring secure and efficient credential management.

1. **KerberosConfig** (`kerberos_config.h` and `kerberos_config.cpp`):
   - Encapsulates the configuration settings required for Kerberos
     authentication, such as principal, keytab path, and refresh intervals.
   - Provides methods to set and retrieve configuration parameters.

2. **KerberosTicketCache** (`kerberos_ticket_cache.h` and
   `kerberos_ticket_cache.cpp`):
   - Manages the Kerberos ticket cache, including initialization, login,
     and periodic refresh of tickets.
   - Supports operations like writing to the ticket cache and checking
     whether a refresh is needed.
   - Uses a background thread to periodically refresh tickets based on
     configured intervals.
   - By default the cache file is written to the `/tmp` directory; this can
     be changed with `kerberos_ccache_path` in be.conf.

3. **KerberosTicketMgr** (`kerberos_ticket_mgr.h` and
   `kerberos_ticket_mgr.cpp`):
   - Acts as a manager for multiple Kerberos ticket caches, handling their
     lifecycle, including creation, access, and cleanup.
   - Provides methods to get or set ticket caches and retrieve cache file
     paths.
   - Includes a background thread that cleans up expired ticket caches
     every hour. If a cache is no longer referenced, it is removed.

4. **HdfsMgr**
   - A new, simple class that manages the HDFS file system handlers.
   - It replaces the old `HdfsHandlerCache`.
   - It checks each `HdfsHandler` every hour and removes handlers that have
     been unused for 24 hours.
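
The "remove a cache once it is no longer referenced" rule described for `KerberosTicketMgr` can be sketched roughly as follows. This is a minimal, single-threaded illustration and not the code in this PR: `TicketCacheSketch`, `TicketCacheRegistrySketch`, and the principal-plus-keytab key are hypothetical names and assumptions, and a real manager would need a mutex plus a background thread that calls the cleanup pass every hour.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for a per-principal ticket cache; not the Doris class.
struct TicketCacheSketch {
    std::string principal;
    std::string keytab_path;
};

// Hypothetical registry: hands out shared caches keyed by principal + keytab
// and drops entries that only the registry itself still references.
class TicketCacheRegistrySketch {
public:
    std::shared_ptr<TicketCacheSketch> get_or_create(const std::string& principal,
                                                     const std::string& keytab_path) {
        assert(!principal.empty());
        const std::string key = principal + "|" + keytab_path;
        auto it = _caches.find(key);
        if (it != _caches.end()) {
            return it->second; // reuse the existing cache for this identity
        }
        auto cache = std::make_shared<TicketCacheSketch>(
                TicketCacheSketch{principal, keytab_path});
        _caches.emplace(key, cache);
        return cache;
    }

    // One cleanup pass; a background thread would call this periodically.
    // Returns how many entries were removed.
    std::size_t cleanup_unreferenced() {
        std::size_t removed = 0;
        for (auto it = _caches.begin(); it != _caches.end();) {
            if (it->second.use_count() == 1) { // only the registry holds it
                it = _caches.erase(it);
                ++removed;
            } else {
                ++it;
            }
        }
        return removed;
    }

    std::size_t size() const { return _caches.size(); }

private:
    std::unordered_map<std::string, std::shared_ptr<TicketCacheSketch>> _caches;
};
```

Two callers asking for the same principal and keytab get the same shared object; once both release it, the next `cleanup_unreferenced()` pass erases the entry.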

1. Introduce comprehensive Kerberos ticket cache management on the BE side.
2. Use the ticket cache path instead of principal and keytab for the
Kerberos authentication of libhdfs.
3. Fix the issue that `kerberos_krb5_conf_path` in be.conf does not take
effect.
4. Add a new system table, `backend_kerberos_ticket_cache`, to view the
Kerberos ticket cache of each backend:

```
Doris > select * from information_schema.backend_kerberos_ticket_cache\G
*************************** 1. row ***************************
                  BE_ID: 1738304534666
                  BE_IP: 172.20.32.136
              PRINCIPAL: hdfs/master-1-1@EMR.C-0596176698BD4D17.COM
                 KEYTAB: /path/to/hdfs.keytab
      SERVICE_PRINCIPAL: krbtgt/EMR@EMR.C-0596176698BD4D17.COM
      TICKET_CACHE_PATH: /tmp/doris_krb_ce93d5ebb2a6554c7ba9f43aee3a9e6c
              HASH_CODE: ce93d5ebb2a6554c7ba9f43aee3a9e6c
             START_TIME: 2025-02-01 00:08:26
            EXPIRE_TIME: 2025-02-01 00:09:26
              AUTH_TIME: 2025-02-01 00:08:26
              REF_COUNT: 1
REFRESH_INTERVAL_SECOND: 3600
```

The user interface remains unchanged.
1. Set the krb5.conf path in be.conf with `kerberos_krb5_conf_path`; the
default is `/etc/krb5.conf`.
2. Provide the Kerberos principal and the keytab path as usual.

be.conf

1. `kerberos_ccache_path`
   The directory where Kerberos ticket cache files are saved. File names
   follow the format `doris_krb_xxxx`.

2. `kerberos_krb5_conf_path`
   The path of the krb5.conf file.

3. `kerberos_refresh_interval_second`
   The minimum interval for refreshing a Kerberos ticket cache file. The
   default is 1h.

4. Cleanup logic

   If a ticket cache is not used for 1 day, it is deleted.
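
As a rough illustration of how these settings fit together, the sketch below builds a cache file path of the form `doris_krb_<hash>` (matching the `TICKET_CACHE_PATH` and `HASH_CODE` columns in the sample row above) and checks the refresh and idle rules. The helper names are hypothetical, the hash is taken as an input because this description does not say how it is computed, and the exact comparison semantics (at least the interval for refresh, strictly more than one day for cleanup) are assumptions.

```cpp
#include <cassert>
#include <chrono>
#include <string>

using Clock = std::chrono::system_clock;

// Builds "<ccache_dir>/doris_krb_<hash_code>". The hash is supplied by the
// caller; how Doris derives it is not shown here.
std::string ticket_cache_path(const std::string& ccache_dir,
                              const std::string& hash_code) {
    assert(!hash_code.empty());
    std::string path = ccache_dir;
    if (!path.empty() && path.back() != '/') {
        path += '/';
    }
    return path + "doris_krb_" + hash_code;
}

// Assumed rule: a refresh is due once at least min_interval has passed
// since the last refresh (kerberos_refresh_interval_second).
bool refresh_due(Clock::time_point last_refresh, Clock::time_point now,
                 std::chrono::seconds min_interval) {
    return now - last_refresh >= min_interval;
}

// Assumed rule: a cache is eligible for deletion when it has been idle for
// longer than max_idle (1 day by default).
bool idle_expired(Clock::time_point last_used, Clock::time_point now,
                  std::chrono::hours max_idle = std::chrono::hours(24)) {
    return now - last_used > max_idle;
}
```

Passing the time points in explicitly keeps the checks deterministic, which makes them easy to unit test without sleeping.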
…kerberos ticket. (apache#47826)

### What problem does this PR solve?

Related PR: apache#47299

Problem Summary:
Fix the initialization of `KerberosTicketEntry entry` so that compilation passes.
…stead of using kerberos ticket cache. (apache#48655)

Related PR: apache#47299, apache#49181

This PR mainly changes:

1. Go back to using principal and keytab to log in to Kerberos instead of
using the Kerberos ticket cache.
    This discards what I did in apache#47299. Using the ticket cache appears
    to cause many issues in multi-Kerberos environments, so I abandoned
    that logic.

2. Config default values
    Change the default values of the configs related to the HDFS file handle cache:

    1. `max_hdfs_file_handle_cache_num`: from 1000 to 20000
    2. `max_hdfs_file_handle_cache_time_sec`: from 3600 to 28800

3. Fix a bug where the cleanup thread of `FileHandleCache` was not working.
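
To make the two limits concrete, here is a minimal sketch of a cache bounded by both an entry count and an idle time, which is how `max_hdfs_file_handle_cache_num` and `max_hdfs_file_handle_cache_time_sec` are presumably meant to interact. `HandleCacheSketch` is a hypothetical class, not Doris's `FileHandleCache`; it stores only last-access times, assumes timestamps never go backwards, and leaves out locking and the periodic thread that would call the cleanup pass.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <list>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical handle cache keyed by file name; each entry keeps only its
// last-access time in seconds.
class HandleCacheSketch {
public:
    HandleCacheSketch(std::size_t max_num, std::int64_t max_age_sec)
            : _max_num(max_num), _max_age_sec(max_age_sec) {
        assert(max_num > 0 && max_age_sec > 0);
    }

    // Records an access at now_sec and moves the entry to the front.
    // Assumes now_sec never decreases between calls.
    void touch(const std::string& file, std::int64_t now_sec) {
        auto it = _index.find(file);
        if (it != _index.end()) {
            _lru.erase(it->second);
        }
        _lru.emplace_front(file, now_sec);
        _index[file] = _lru.begin();
        while (_lru.size() > _max_num) { // count limit: drop least recently used
            _index.erase(_lru.back().first);
            _lru.pop_back();
        }
    }

    // One cleanup pass: drops entries idle for longer than the age limit.
    // Returns the number of entries dropped.
    std::size_t evict_expired(std::int64_t now_sec) {
        std::size_t dropped = 0;
        while (!_lru.empty() && now_sec - _lru.back().second > _max_age_sec) {
            _index.erase(_lru.back().first);
            _lru.pop_back();
            ++dropped;
        }
        return dropped;
    }

    std::size_t size() const { return _lru.size(); }
    bool contains(const std::string& file) const { return _index.count(file) != 0; }

private:
    using Entry = std::pair<std::string, std::int64_t>;
    std::size_t _max_num;
    std::int64_t _max_age_sec;
    std::list<Entry> _lru; // front = most recently used
    std::unordered_map<std::string, std::list<Entry>::iterator> _index;
};
```

With the new defaults such a cache would be constructed as `HandleCacheSketch(20000, 28800)`. If the cleanup pass never runs, which is the kind of bug item 3 describes, only the count limit bounds the cache and idle handles linger.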
@morningman morningman requested a review from morrySnow as a code owner June 24, 2025 06:43
@Thearas
Contributor

Thearas commented Jun 24, 2025

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@morningman morningman changed the title from "31 bp48655" to "branch-3.1: [opt](kerberos) opt hdfs kerberos logic (#47299 #47826 #48655)" Jun 24, 2025
@morningman
Contributor Author

run buildall

@doris-robot

Cloud UT Coverage Report

Increment line coverage 🎉

Increment coverage report
Complete coverage report

| Category | Coverage |
|---|---|
| Function Coverage | 83.20% (1119/1345) |
| Line Coverage | 66.47% (19001/28586) |
| Region Coverage | 66.21% (9428/14239) |
| Branch Coverage | 56.18% (5101/9080) |

@hello-stephen
Contributor

BE UT Coverage Report

Increment line coverage 43.32% (334/771) 🎉

Increment coverage report
Complete coverage report

| Category | Coverage |
|---|---|
| Function Coverage | 41.40% (11013/26600) |
| Line Coverage | 32.26% (94572/293192) |
| Region Coverage | 31.33% (48725/155545) |
| Branch Coverage | 27.99% (25142/89810) |

@doris-robot

TPC-H: Total hot run time: 39834 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit bb9c5e6a0d87e5d1a2a6c01de30ca15c0151008e, data reload: false

------ Round 1 ----------------------------------
q1	17751	6805	6573	6573
q2	2069	172	187	172
q3	10862	1103	1175	1103
q4	10262	752	697	697
q5	7772	2906	2864	2864
q6	215	131	132	131
q7	972	632	606	606
q8	9366	1977	2030	1977
q9	6666	6424	6441	6424
q10	7054	2310	2342	2310
q11	467	268	262	262
q12	400	211	206	206
q13	17787	2964	2977	2964
q14	238	205	205	205
q15	500	466	471	466
q16	482	372	386	372
q17	978	585	522	522
q18	7302	6672	6727	6672
q19	1324	993	986	986
q20	501	202	200	200
q21	4183	3133	3178	3133
q22	1082	989	1024	989
Total cold run time: 108233 ms
Total hot run time: 39834 ms

----- Round 2, with runtime_filter_mode=off -----
q1	6606	6562	6564	6562
q2	332	229	228	228
q3	2920	2777	2865	2777
q4	2013	1808	1827	1808
q5	5733	5754	5724	5724
q6	210	131	129	129
q7	2223	1817	1805	1805
q8	3372	3585	3566	3566
q9	9000	8833	8963	8833
q10	3621	3524	3459	3459
q11	600	484	505	484
q12	808	620	611	611
q13	10572	3122	3224	3122
q14	309	278	264	264
q15	513	475	481	475
q16	491	448	435	435
q17	1841	1637	1596	1596
q18	8186	7825	7775	7775
q19	1698	1596	1604	1596
q20	2156	1801	1800	1800
q21	5036	4962	4975	4962
q22	1109	1009	1053	1009
Total cold run time: 69349 ms
Total hot run time: 59020 ms

@doris-robot

TPC-DS: Total hot run time: 195535 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit bb9c5e6a0d87e5d1a2a6c01de30ca15c0151008e, data reload: false

query1	1320	924	884	884
query2	6383	1944	1925	1925
query3	10843	4389	4258	4258
query4	61570	28580	23158	23158
query5	5238	461	452	452
query6	409	186	191	186
query7	5485	310	307	307
query8	320	230	227	227
query9	8693	2584	2573	2573
query10	474	287	255	255
query11	18112	15250	15709	15250
query12	161	106	104	104
query13	1452	448	436	436
query14	10515	6652	6652	6652
query15	206	175	168	168
query16	7110	502	475	475
query17	1216	588	630	588
query18	1886	323	312	312
query19	205	175	160	160
query20	120	106	111	106
query21	203	105	101	101
query22	4800	4515	4652	4515
query23	34931	33708	34182	33708
query24	6136	2843	2925	2843
query25	541	415	417	415
query26	673	165	167	165
query27	2029	364	351	351
query28	4193	2229	2162	2162
query29	706	469	456	456
query30	238	160	161	160
query31	1008	870	815	815
query32	68	57	56	56
query33	430	298	304	298
query34	905	510	504	504
query35	830	749	735	735
query36	1109	972	973	972
query37	112	72	69	69
query38	4107	3957	3992	3957
query39	1487	1478	1443	1443
query40	206	98	103	98
query41	46	46	46	46
query42	117	99	98	98
query43	536	478	471	471
query44	1169	823	822	822
query45	187	172	173	172
query46	1170	732	724	724
query47	2055	1912	1904	1904
query48	443	346	348	346
query49	737	390	390	390
query50	832	445	420	420
query51	7351	7169	7227	7169
query52	97	90	94	90
query53	258	182	181	181
query54	549	464	469	464
query55	75	78	78	78
query56	260	261	244	244
query57	1302	1209	1167	1167
query58	228	220	219	219
query59	3258	3084	2995	2995
query60	291	266	271	266
query61	112	107	122	107
query62	770	690	690	690
query63	212	188	190	188
query64	1410	663	622	622
query65	3238	3242	3206	3206
query66	689	315	294	294
query67	16102	15533	15476	15476
query68	4294	596	583	583
query69	428	257	264	257
query70	1188	1082	1069	1069
query71	352	253	251	251
query72	6327	4032	3999	3999
query73	740	343	348	343
query74	10526	9217	9094	9094
query75	3348	2623	2632	2623
query76	1920	1153	1054	1054
query77	549	273	271	271
query78	10721	9643	9539	9539
query79	2029	581	596	581
query80	1397	449	426	426
query81	521	222	218	218
query82	1195	86	87	86
query83	279	142	138	138
query84	281	77	76	76
query85	1044	303	292	292
query86	374	302	301	301
query87	4433	4215	4258	4215
query88	3972	2391	2369	2369
query89	417	289	291	289
query90	1995	179	183	179
query91	139	109	109	109
query92	63	50	49	49
query93	2586	565	568	565
query94	792	294	294	294
query95	359	255	253	253
query96	615	277	279	277
query97	3296	3124	3126	3124
query98	219	210	191	191
query99	1576	1272	1298	1272
Total cold run time: 317899 ms
Total hot run time: 195535 ms

@doris-robot

ClickBench: Total hot run time: 31.68 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit bb9c5e6a0d87e5d1a2a6c01de30ca15c0151008e, data reload: false

query1	0.04	0.03	0.02
query2	0.08	0.04	0.05
query3	0.23	0.05	0.05
query4	1.64	0.09	0.09
query5	0.52	0.50	0.51
query6	1.12	0.75	0.76
query7	0.02	0.02	0.02
query8	0.05	0.04	0.04
query9	0.55	0.49	0.49
query10	0.56	0.56	0.55
query11	0.16	0.12	0.12
query12	0.16	0.13	0.13
query13	0.61	0.60	0.60
query14	0.78	0.81	0.83
query15	0.85	0.84	0.84
query16	0.37	0.38	0.38
query17	1.05	1.05	1.06
query18	0.19	0.17	0.19
query19	1.90	1.86	1.75
query20	0.02	0.01	0.02
query21	15.39	0.67	0.66
query22	3.96	5.42	2.82
query23	18.33	1.34	1.30
query24	2.31	0.21	0.22
query25	0.16	0.08	0.09
query26	0.27	0.18	0.18
query27	0.08	0.09	0.08
query28	13.20	0.62	0.55
query29	12.68	3.39	3.35
query30	0.24	0.06	0.06
query31	2.85	0.40	0.41
query32	3.22	0.48	0.48
query33	3.00	3.01	3.07
query34	16.77	4.57	4.47
query35	4.58	4.57	4.56
query36	0.66	0.46	0.48
query37	0.21	0.17	0.16
query38	0.17	0.16	0.15
query39	0.05	0.04	0.04
query40	0.17	0.12	0.13
query41	0.09	0.04	0.05
query42	0.06	0.05	0.05
query43	0.05	0.04	0.04
Total cold run time: 109.4 s
Total hot run time: 31.68 s

@morrySnow morrySnow merged commit 324f787 into apache:branch-3.1 Jun 24, 2025
21 of 24 checks passed