@zddr zddr commented Apr 26, 2023

Proposed changes

Issue Number: close #xxx

Problem summary

Main changes:

  1. If FQDN is enabled in the configuration file, the FE obtains its FQDN instead of its IP for localAddr at startup, and priority_networks no longer takes effect.
  2. The ip and hostname fields of Backend and Frontend are merged into a single field, host. It holds the hostname when FQDN is enabled and the IP address when it is not.
  3. Communication between cluster nodes uses the FQDN directly, and the various connection pools add a validation mechanism so that a change in the IP behind a domain name does not break connections between nodes.
  4. Polling to check whether the IP has changed is no longer needed, so fqdnManager is deleted.
  5. The way FEs verify the legitimacy of a node changes: instead of reading the client IP, the node explicitly passes its own identity in the HTTP request header or in the Thrift message body.
  6. When processing a heartbeat, if a BE finds that the host it stores differs from the host stored by the master, it verifies that the host is legitimate and then updates its own host instead of reporting an error directly.
  7. Simplify the generation logic of the FE name.
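
As an illustration of item 3, here is a minimal sketch of what "validate before reuse" could look like for a connection pool keyed by hostname. It is not the code in this PR: `ValidatingPool`, `Conn`, and the injected `resolver` function are hypothetical names chosen for the example. A real pool would hold sockets or Thrift clients and would likely resolve names with something like `InetAddress.getByName`.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch: none of these names come from the Doris code base.
final class ValidatingPool {
    // A pooled "connection" that remembers which IP the host resolved to when it was opened.
    static final class Conn {
        final String host;
        final String resolvedIp;

        Conn(String host, String resolvedIp) {
            this.host = host;
            this.resolvedIp = resolvedIp;
        }
    }

    private final Map<String, Conn> pool = new HashMap<>();
    private final Function<String, String> resolver; // host -> current IP (stand-in for a DNS lookup)

    ValidatingPool(Function<String, String> resolver) {
        this.resolver = resolver;
    }

    // Reuse the cached connection only if the host still resolves to the same IP;
    // otherwise replace it with a connection opened against the new address.
    Conn borrow(String host) {
        String currentIp = resolver.apply(host);
        Conn cached = pool.get(host);
        if (cached != null && cached.resolvedIp.equals(currentIp)) {
            return cached;
        }
        Conn fresh = new Conn(host, currentIp);
        pool.put(host, fresh);
        return fresh;
    }

    public static void main(String[] args) {
        Map<String, String> dns = new HashMap<>();
        dns.put("fe-0.example.internal", "10.0.0.1");
        ValidatingPool p = new ValidatingPool(dns::get);

        Conn a = p.borrow("fe-0.example.internal");
        dns.put("fe-0.example.internal", "10.0.0.2"); // simulate the node getting a new IP
        Conn b = p.borrow("fe-0.example.internal");

        System.out.println(a == b);       // false: the stale connection was replaced
        System.out.println(b.resolvedIp); // 10.0.0.2
    }
}
```

Checking on borrow rather than on a timer matches the spirit of item 4: no background polling is needed because staleness is detected at the point of use.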

Scope of influence:

  1. Establishing connections for communication between cluster nodes
  2. Determining whether two entries refer to the same node by attributes such as IP
  3. Log output
  4. Information display
  5. Address concatenation
  6. k8s deployment
  7. Upgrade compatibility
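
To make items 2 and 5 concrete, the sketch below shows one way a single host field can drive both node identity and address concatenation. `NodeAddress` is a hypothetical type for illustration, not a class from this PR. The case-insensitive comparison and the bracket handling for IPv6 literals are assumptions about what a careful implementation would want, not behavior confirmed by the source.

```java
import java.util.Locale;
import java.util.Objects;

// Hypothetical value type; the name and methods are illustrative, not Doris API.
final class NodeAddress {
    final String host; // FQDN when FQDN mode is on, otherwise an IP literal
    final int port;

    NodeAddress(String host, int port) {
        this.host = Objects.requireNonNull(host);
        this.port = port;
    }

    // Two entries refer to the same node when host and port both match.
    // Host names are compared case-insensitively, as DNS names are.
    @Override
    public boolean equals(Object o) {
        if (!(o instanceof NodeAddress)) {
            return false;
        }
        NodeAddress other = (NodeAddress) o;
        return port == other.port && host.equalsIgnoreCase(other.host);
    }

    @Override
    public int hashCode() {
        return Objects.hash(host.toLowerCase(Locale.ROOT), port);
    }

    // Join host and port; a bare IPv6 literal contains ':' and must be bracketed.
    String hostPort() {
        boolean bareIpv6 = host.indexOf(':') >= 0 && !host.startsWith("[");
        return (bareIpv6 ? "[" + host + "]" : host) + ":" + port;
    }

    public static void main(String[] args) {
        System.out.println(new NodeAddress("fe-1.example.internal", 9010).hostPort());
        System.out.println(new NodeAddress("fd00::1", 9010).hostPort());
    }
}
```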

Test plan:

  1. Node IP change: with the FQDN kept unchanged, change the IP addresses of the FE and BE nodes and verify that the cluster can still read and write data normally.
  2. Generate metadata with the code on master, then start the code from this PR on that metadata to verify compatibility with the old version (clusters that already had FQDN enabled are no longer supported for upgrade).
  3. Deploy FE and BE clusters with k8s and verify that the cluster can read and write data normally.
  4. Upgrade an old cluster by following https://doris.apache.org/zh-CN/docs/dev/admin-manual/cluster-management/fqdn?_highlight=fqdn#%E6%97%A7%E9%9B%86%E7%BE%A4%E5%90%AF%E7%94%A8fqdn
  5. Use Stream Load to import data, specifying the FQDN of an FE and of a BE respectively.
  6. Use different users to open transactions and write data with INSERT statements.
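
For item 5, the two requests differ only in the target host and port. The sketch below builds the Stream Load URL for an FE and for a BE addressed by FQDN. The hostnames, database, and table are made up for the example, and the ports (8030 for the FE HTTP port, 8040 for the BE web server port) are the defaults as I understand them, so check `fe.conf` and `be.conf` before relying on them. The helper `StreamLoadUrl.of` is hypothetical.

```java
import java.net.URI;

// Hypothetical helper for illustration; only the /api/{db}/{table}/_stream_load path
// is taken from the Doris Stream Load interface.
final class StreamLoadUrl {
    static URI of(String host, int httpPort, String db, String table) {
        return URI.create("http://" + host + ":" + httpPort
                + "/api/" + db + "/" + table + "/_stream_load");
    }

    public static void main(String[] args) {
        // Assumed default ports: 8030 (FE http_port), 8040 (BE webserver_port).
        System.out.println(of("fe-0.example.internal", 8030, "demo_db", "demo_tbl"));
        System.out.println(of("be-0.example.internal", 8040, "demo_db", "demo_tbl"));
    }
}
```

Sending the load to the FE normally results in a redirect to a BE, so testing both targets exercises the FQDN path on the redirect as well as on the direct request.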

Checklist(Required)

  • Does it affect the original behavior
  • Have unit tests been added
  • Has documentation been added or modified
  • Does it need to update dependencies
  • Does this PR support rollback (If NO, please explain WHY)

Further comments

If this is a relatively large or complex change, kick off the discussion at dev@doris.apache.org by explaining why you chose the solution you did and what alternatives you considered, etc...

@github-actions github-actions bot added the area/load, area/nereids, area/planner, area/routine load, and area/spark-load labels Apr 26, 2023
@github-actions

clang-tidy review says "All clean, LGTM! 👍"

@zddr

zddr commented Apr 27, 2023

run buildall

@hello-stephen

hello-stephen commented Apr 27, 2023

TeamCity pipeline, clickbench performance test result:
the sum of best hot time: 33.87 seconds
stream load tsv: 427 seconds loaded 74807831229 Bytes, about 167 MB/s
stream load json: 24 seconds loaded 2358488459 Bytes, about 93 MB/s
stream load orc: 59 seconds loaded 1101869774 Bytes, about 17 MB/s
stream load parquet: 31 seconds loaded 861443392 Bytes, about 26 MB/s
https://doris-community-test-1308700295.cos.ap-hongkong.myqcloud.com/tmp/20230510141716_clickbench_pr_141867.html
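
The reported rates can be cross-checked from the byte counts and durations above. The sketch below is my own arithmetic, not the pipeline's script: it assumes "MB" here means 1024 × 1024 bytes and that the result is truncated to a whole number, which reproduces all four figures. `Throughput` is a hypothetical helper name.

```java
// Hypothetical helper; reproduces the reported rates under the assumptions stated above.
final class Throughput {
    // Whole mebibytes per second, truncated (integer division).
    static long mibPerSecond(long bytes, long seconds) {
        return bytes / seconds / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println(mibPerSecond(74807831229L, 427)); // tsv:     167
        System.out.println(mibPerSecond(2358488459L, 24));   // json:    93
        System.out.println(mibPerSecond(1101869774L, 59));   // orc:     17
        System.out.println(mibPerSecond(861443392L, 31));    // parquet: 26
    }
}
```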

@morningman

run buildall

@morningman

run buildall

@zddr

zddr commented May 8, 2023

run buildall

@zddr

zddr commented May 9, 2023

run buildall

@zddr

zddr commented May 9, 2023

run buildall

@zddr

zddr commented May 10, 2023

run buildall

@zddr

zddr commented May 10, 2023

run buildall

@morningman morningman left a comment

LGTM

@morningman morningman merged commit b129c99 into apache:master May 10, 2023
@zddr zddr mentioned this pull request May 15, 2023
5 tasks
@zddr zddr mentioned this pull request May 23, 2023
3 tasks
@zddr zddr deleted the pn_domain branch March 28, 2024 02:34
@LGDHuaOPER

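Hello!
I am deploying a Doris cluster with Doris v3.0.4 and doris-operator v25 on a dual-stack K8S cluster (IPv6 by default). The Pod IPs are IPv6 addresses, and enable_fqdn_mode = true is set in the configuration file. The following problems occur at startup:

  • If priority_networks is set, fe-0 is running while fe-1 and fe-2 crash, with the following errors in the log: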
RuntimeLogger 2025-08-13 04:03:32,999 INFO (main|1) [DorisFE.start():165] Doris FE starting...
RuntimeLogger 2025-08-13 04:03:33,004 INFO (main|1) [FrontendOptions.initAddrUsingFqdn():168] Use FQDN init local addr, FQDN: doriscluster-cloudlog-fe-1.doriscluster-cloudlog-fe-internal.deepwatch.svc.cluster.local, IP: 10.222.47.183
RuntimeLogger 2025-08-13 04:03:33,285 ERROR (main|1) [LogLog.warn():157] log4j:WARN No appenders could be found for logger (io.netty.util.internal.InternalThreadLocalMap).
RuntimeLogger 2025-08-13 04:03:33,286 ERROR (main|1) [LogLog.warn():157] log4j:WARN Please initialize the log4j system properly.
RuntimeLogger 2025-08-13 04:03:33,287 ERROR (main|1) [LogLog.warn():157] log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
RuntimeLogger 2025-08-13 04:03:33,336 INFO (main|1) [ConsistencyChecker.initWorkTime():105] consistency checker will work from 23:00 to 23:00
RuntimeLogger 2025-08-13 04:03:34,485 INFO (main|1) [PrivTable.addEntry():89] add priv entry: Node_priv,Admin_priv
RuntimeLogger 2025-08-13 04:03:34,486 INFO (main|1) [PrivTable.addEntry():89] add priv entry: Admin_priv
RuntimeLogger 2025-08-13 04:03:34,503 INFO (main|1) [PrivTable.addEntry():89] add priv entry: database privilege.ctl: internal, db: information_schema, priv: Select_priv
RuntimeLogger 2025-08-13 04:03:34,503 INFO (main|1) [PrivTable.addEntry():89] add priv entry: database privilege.ctl: internal, db: mysql, priv: Select_priv
RuntimeLogger 2025-08-13 04:03:34,504 INFO (main|1) [PrivTable.addEntry():89] add priv entry: origWorkloadGroup:normal, priv:Usage_priv
RuntimeLogger 2025-08-13 04:03:34,505 INFO (main|1) [Auth.createUserInternal():551] finished to create user: 'root'@'%', is replay: true
RuntimeLogger 2025-08-13 04:03:34,505 INFO (main|1) [PrivTable.addEntry():89] add priv entry: database privilege.ctl: internal, db: information_schema, priv: Select_priv
RuntimeLogger 2025-08-13 04:03:34,506 INFO (main|1) [PrivTable.addEntry():89] add priv entry: database privilege.ctl: internal, db: mysql, priv: Select_priv
RuntimeLogger 2025-08-13 04:03:34,506 INFO (main|1) [PrivTable.addEntry():89] add priv entry: origWorkloadGroup:normal, priv:Usage_priv
RuntimeLogger 2025-08-13 04:03:34,507 INFO (main|1) [Auth.createUserInternal():551] finished to create user: 'admin'@'%', is replay: true
RuntimeLogger 2025-08-13 04:03:34,509 INFO (main|1) [AuthenticatorManager.<init>():42] authenticate type: DEFAULT
RuntimeLogger 2025-08-13 04:03:34,648 INFO (main|1) [MTMVService.registerHook():69] registerHook: MTMVJobManager
RuntimeLogger 2025-08-13 04:03:34,648 INFO (main|1) [MTMVService.registerHook():69] registerHook: MTMVRelationManager
RuntimeLogger 2025-08-13 04:03:34,667 INFO (main|1) [Env.getSelfHostPort():1470] get self node: HostInfo{host='doriscluster-cloudlog-fe-1.doriscluster-cloudlog-fe-internal.deepwatch.svc.cluster.local', port=9010}
RuntimeLogger 2025-08-13 04:03:34,667 INFO (main|1) [Env.getHelperNodes():1524] get helper nodes: [HostInfo{host='doriscluster-cloudlog-fe-1.doriscluster-cloudlog-fe-internal.deepwatch.svc.cluster.local', port=9010}]
RuntimeLogger 2025-08-13 04:03:34,681 INFO (main|1) [Env.checkDeployMode():1386] The current deployment mode is local.
RuntimeLogger 2025-08-13 04:03:34,681 INFO (main|1) [Env.getClusterIdAndRole():1370] finished to get cluster id: 195318165, isElectable: true, role: FOLLOWER and node name: fe_7a59cb79_af4c_43ad_8b4e_87e4b65a51a5
RuntimeLogger 2025-08-13 04:03:34,693 INFO (main|1) [Env.loadImage():2114] image does not exist: /opt/apache-doris/fe/doris-meta/image/image.0
RuntimeLogger 2025-08-13 04:03:35,277 WARN (UNKNOWN fe_7a59cb79_af4c_43ad_8b4e_87e4b65a51a5(-1)|1) [Env.notifyNewFETypeTransfer():2838] notify new FE type transfer: MASTER
RuntimeLogger 2025-08-13 04:03:35,278 INFO (UNKNOWN fe_7a59cb79_af4c_43ad_8b4e_87e4b65a51a5(-1)|1) [LogUtils.stdout():50] StdoutLogger 2025-08-13 04:03:35,278 notify new FE type transfer: MASTER
RuntimeLogger 2025-08-13 04:03:35,292 INFO (stateListener|84) [Env$5.runOneCycle():2861] begin to transfer FE type from INIT to MASTER
RuntimeLogger 2025-08-13 04:03:35,294 INFO (stateListener|84) [BDBHA.fencing():79] start fencing, epoch number is 24
RuntimeLogger 2025-08-13 04:03:35,298 INFO (stateListener|84) [Env.transferToMaster():1589] finish replay in 1 msec
RuntimeLogger 2025-08-13 04:03:35,300 INFO (stateListener|84) [BDBEnvironment.getReplicationGroupAdmin():231] addresses is empty
RuntimeLogger 2025-08-13 04:03:35,300 ERROR (stateListener|84) [Env.checkCurrentNodeExist():2009] current node doriscluster-cloudlog-fe-1.doriscluster-cloudlog-fe-internal.deepwatch.svc.cluster.local:9010 is not added to the cluster, will exit. Your FE IP maybe changed, please set 'priority_networks' config in fe.conf properly.
  • If priority_networks is not set, fe-0, fe-1, and fe-2 all crash. The error log (ERROR level) is as follows:
    9BF2DE0D916868B76221020049D58CE1

What is causing this, and what do I need to do to get the cluster running normally? Is it because FQDN mode does not support IPv6?

Thank you!

Labels

  • api-review: Categorizes an issue or PR as actively needing an API review.
  • area/load: Issues or PRs related to all kinds of load
  • area/nereids
  • area/planner: Issues or PRs related to the query planner
  • area/routine load
  • area/spark-load: Issues or PRs related to the spark load
  • kind/api-change: Categorizes issue or PR as related to adding, removing, or otherwise changing an API
  • kind/behavior-changed
  • kind/test
  • reviewed

5 participants