diff --git a/TOC.md b/TOC.md index 5f5698e860abc..25cd148f4a160 100644 --- a/TOC.md +++ b/TOC.md @@ -110,7 +110,7 @@ + [SQL Plan Management](/sql-plan-management.md) + [Access Tables Using `IndexMerge`](/index-merge.md) + Tutorials - + [Geo-Redundant Deployment](/geo-redundancy-deployment.md) + + [Multiple Data Centers in One City Deployment](/multi-data-centers-in-one-city-deployment.md) + Best Practices + [Use TiDB](/tidb-best-practices.md) + [Java Application Development](/best-practices/java-app-best-practices.md) diff --git a/faq/tidb-faq.md b/faq/tidb-faq.md index d75dcff5e6e30..06a5bd226c55a 100644 --- a/faq/tidb-faq.md +++ b/faq/tidb-faq.md @@ -53,7 +53,7 @@ At the bottom layer, TiKV uses a model of replication log + State Machine to rep #### Does TiDB support distributed transactions? -Yes. TiDB distributes transactions across your cluster, whether it is a few nodes in a single location or many [nodes across multiple datacenters](/geo-redundancy-deployment.md). +Yes. TiDB distributes transactions across your cluster, whether it is a few nodes in a single location or many [nodes across multiple data centers](/multi-data-centers-in-one-city-deployment.md). Inspired by Google's Percolator, the transaction model in TiDB is mainly a two-phase commit protocol with some practical optimizations. This model relies on a timestamp allocator to assign the monotone increasing timestamp for each transaction, so conflicts can be detected. [PD](/architecture.md#placement-driver-server) works as the timestamp allocator in a TiDB cluster. diff --git a/geo-redundancy-deployment.md b/geo-redundancy-deployment.md deleted file mode 100644 index 5474365d402f1..0000000000000 --- a/geo-redundancy-deployment.md +++ /dev/null @@ -1,72 +0,0 @@ ---- -title: Cross-DC Deployment Solutions -category: how-to -aliases: ['/docs/dev/geo-redundancy-deployment/','/docs/dev/how-to/deploy/geographic-redundancy/overview/'] ---- - -# Cross-DC Deployment Solutions - -As a NewSQL database, TiDB excels in the best features of the traditional relational database and the scalability of the NoSQL database and is of course, highly available across data centers (hereinafter referred to as DC). This document is to introduce different deployment solutions in cross-DC environment. - -## 3-DC deployment solution - -TiDB, TiKV and PD are distributed among 3 DCs, which is the most common deployment solution with the highest availability. - -![3-DC Deployment Architecture](/media/deploy-3dc.png) - -### Advantages - -All the replicas are distributed among 3 DCs. Even if one DC is down, the other 2 DCs will initiate leader election and resume service within a reasonable amount of time (within 20s in most cases) and no data is lost. See the following diagram for more information: - -![Disaster Recovery for 3-DC Deployment](/media/deploy-3dc-dr.png) - -### Disadvantages - -The performance is greatly limited by the network latency. - -- For writes, all the data has to be replicated to at least 2 DCs. Because TiDB uses 2-phase commit for writes, the write latency is at least twice the latency of the network between two DCs. -- The read performance will also suffer if the leader is not in the same DC as the TiDB node with the read request. -- Each TiDB transaction needs to obtain TimeStamp Oracle (TSO) from the PD leader. So if TiDB and PD leader are not in the same DC, the performance of the transactions will also be impacted by the network latency because each transaction with write request will have to get TSO twice. - -### Optimizations - -If not all of the three DCs need to provide service to the applications, you can dispatch all the requests to one DC and configure the scheduling policy to migrate all the TiKV Region leader and PD leader to the same DC, as what we have done in the following test. In this way, neither obtaining TSO or reading TiKV Regions will be impacted by the network latency between DCs. If this DC is down, the PD leader and Region leader will be automatically elected in other surviving DCs, and you just need to switch the requests to the DC that are still online. - -![Read Performance Optimized 3-DC Deployment](/media/deploy-3dc-optimize.png) - -## 3-DC in 2 cities deployment solution - -This solution is similar to the previous 3-DC deployment solution and can be considered as an optimization based on the business scenario. The difference is that the distance between the 2 DCs within the same city is short and thus the latency is very low. In this case, we can dispatch the requests to the two DCs within the same city and configure the TiKV leader and PD leader to be in the 2 DCs in the same city. - -![2-DC in 2 Cities Deployment Architecture](/media/deploy-2city3dc.png) - -Compared with the 3-DC deployment, the 3-DC in 2 cities deployment has the following advantages: - -1. Better write performance. -2. Better usage of the resources because 2 DCs can provide services to the applications. -3. Even if one DC is down, the TiDB cluster will be still available and no data is lost. - -However, the disadvantage is that if the 2 DCs within the same city goes down, whose probability is higher than that of the outage of 2 DCs in 2 cities, the TiDB cluster will not be available and some of the data will be lost. - -## 2-DC + Binlog replication deployment solution - -The 2-DC + Binlog replication is similar to the MySQL Master-Slave solution. 2 complete sets of TiDB clusters (each complete set of the TiDB cluster includes TiDB, PD and TiKV) are deployed in 2 DCs, one acts as the Master and one as the Slave. Under normal circumstances, the Master DC handles all the requests and the data written to the Master DC is asynchronously written to the Slave DC via Binlog. - -![Data Replication in 2-DC in 2 Cities Deployment](/media/deploy-binlog.png) - -If the Master DC goes down, the requests can be switched to the slave cluster. Similar to MySQL, some data might be lost. But different from MySQL, this solution can ensure the high availability within the same DC: if some nodes within the DC are down, the online workloads won’t be impacted and no manual efforts are needed because the cluster will automatically re-elect leaders to provide services. - -![2-DC as a Mutual Backup Deployment](/media/deploy-backup.png) - -Some of our production users also adopt the 2-DC multi-active solution, which means: - -1. The application requests are separated and dispatched into 2 DCs. -2. Each DC has 1 cluster and each cluster has two databases: A Master database to serve part of the application requests and a Slave database to act as the backup of the other DC’s Master database. Data written into the Master database is replicated via Binlog to the Slave database in the other DC, forming a loop of backup. - -Please be noted that for the 2-DC + Binlog replication solution, data is asynchronously replicated via Binlog. If the network latency between 2 DCs is too high, the data in the Slave cluster will fall much behind of the Master cluster. If the Master cluster goes down, some data will be lost and it cannot be guaranteed the lost data is within 5 minutes. - -## Overall analysis for HA and DR - -For the 3-DC deployment solution and 3-DC in 2 cities solution, we can guarantee that the cluster will automatically recover, no human interference is needed and that the data is strongly consistent even if any one of the 3 DCs goes down. All the scheduling policies are to tune the performance, but availability is the top 1 priority instead of performance in case of an outage. - -For 2-DC + Binlog replication solution, we can guarantee that the cluster will automatically recover, no human interference is needed and that the data is strongly consistent even if any some of the nodes within the Master cluster go down. When the entire Master cluster goes down, manual efforts will be needed to switch to the Slave and some data will be lost. The amount of the lost data depends on the network latency and is decided by the network condition. diff --git a/media/multi-data-centers-in-one-city-deployment-sample.png b/media/multi-data-centers-in-one-city-deployment-sample.png new file mode 100644 index 0000000000000..f5dd7eb8f873b Binary files /dev/null and b/media/multi-data-centers-in-one-city-deployment-sample.png differ diff --git a/multi-data-centers-in-one-city-deployment.md b/multi-data-centers-in-one-city-deployment.md new file mode 100644 index 0000000000000..bb69d9a9769df --- /dev/null +++ b/multi-data-centers-in-one-city-deployment.md @@ -0,0 +1,134 @@ +--- +title: Multiple Data Centers in One City Deployment +summary: Learn the deployment solution to multi-data centers in one city. +category: tutorials +aliases: ['/docs/dev/how-to/deploy/geographic-redundancy/overview/','/docs/dev/geo-redundancy-deployment/','/tidb/dev/geo-redundancy-deployment'] +--- + +# Multiple Data Centers in One City Deployment + +As a NewSQL database, TiDB combines the best features of the traditional relational database and the scalability of the NoSQL database, and is highly available across data centers (DC). This document introduces the deployment of multiple DCs in one city. + +## Raft protocol + +Raft is a distributed consensus algorithm. Using this algorithm, both PD and TiKV, among components of the TiDB cluster, achieve disaster recovery of data, which is implemented through the following mechanisms: + +- The essential role of Raft members is to perform log replication and act as a state machine. Among Raft members, data replication is implemented by replicating logs. Raft members change their own states in different conditions to elect a leader to provide services. +- Raft is a voting system that follows the majority protocol. In a Raft group, if a member gets the majority of votes, its membership changes to leader. In other words, when the majority of nodes remain in the Raft group, a leader can be elected to provide services. + +To take advantage of Raft's reliability, the following conditions must be met in a real deployment scenario: + +- Use at least three servers in case one server fails. +- Use at least three racks in case one rack fails. +- Use at least three DCs in case one DC fails. +- Deploy TiDB in at least three cities in case data safety issue occurs in one city. + +The native Raft protocol does not have a good support for an even number of replicas. Considering the impact of cross-city network latency, three DCs in the same city might be the most suitable solution to a highly available and disaster tolerant Raft deployment. + +## Three DCs in one city deployment + +TiDB clusters can be deployed in three DCs in the same city. In this solution, data replication across the three DCs is implemented using the Raft protocol within the cluster. These three DCs can provide read and write services at the same time. Data consistency is not affected even if one DC fails. + +### Simple architecture + +TiDB, TiKV and PD are distributed among three DCs, which is the most common deployment with the highest availability. + +![3-DC Deployment Architecture](/media/deploy-3dc.png) + +**Advantages:** + +- All replicas are distributed among three DCs, with high availability and disaster recovery capability. +- No data will be lost if one DC is down (RPO = 0). +- Even if one DC is down, the other two DCs will automatically start leader election and automatically resume services within a reasonable amount of time (within 20 seconds in most cases, RTO <= 20s). See the following diagram for more information: + +![Disaster Recovery for 3-DC Deployment](/media/deploy-3dc-dr.png) + +**Disadvantages:** + +The performance can be affected by the network latency. + +- For writes, all the data has to be replicated to at least 2 DCs. Because TiDB uses 2-phase commit for writes, the write latency is at least twice the latency of the network between two DCs. +- The read performance will also be affected by the network latency if the leader is not in the same DC with the TiDB node that sends the read request. +- Each TiDB transaction needs to obtain TimeStamp Oracle (TSO) from the PD leader. So if the TiDB and PD leaders are not in the same DC, the performance of the transactions will also be affected by the network latency because each transaction with the write request has to obtain TSO twice. + +### Optimized architecture + +If not all of the three DCs need to provide services to the applications, you can dispatch all the requests to one DC and configure the scheduling policy to migrate all the TiKV Region leader and PD leader to the same DC. In this way, neither obtaining TSO nor reading TiKV Regions will be impacted by the network latency across DCs. If this DC is down, the PD leader and TiKV Region leader will be automatically elected in other surviving DCs, and you just need to switch the requests to the DCs that are still alive. + +![Read Performance Optimized 3-DC Deployment](/media/deploy-3dc-optimize.png) + +**Advantages:** + +The cluster's read performance and the capability to get TSO are improved. A configuration template of scheduling policy is as follows: + +```shell +-- Evicts all leaders of other DCs to the DC that provides services to the application. +config set label-property reject-leader LabelName labelValue + +-- Migrates PD leaders and sets priority. +member leader transfer pdName1 +member leader_priority pdName1 5 +member leader_priority pdName2 4 +member leader_priority pdName3 3 +``` + +**Disadvantages:** + +- Write scenarios are still affected by network latency across DCs. This is because Raft follows the majority protocol and all written data must be replicated to at least two DCs. +- The TiDB server that provides services is only in one DC. +- All application traffic is processed by one DC and the performance is limited by the network bandwidth pressure of that DC. +- The capability to get TSO and the read performance are affected by whether the PD server and TiKV server are up in the DC that processes application traffic. If these servers are down, the application is still affected by the cross-center network latency. + +### Deployment example + +This section provides a topology example, and introduces TiKV labels and TiKV labels planning. + +#### Topology example + +The following example assumes that three DCs (IDC1, IDC2, and IDC3) are located in one city; each IDC has two sets of racks and each rack has three servers. The example ignores the hybrid deployment or the scenario where multiple instances are deployed on one machine. The deployment of a TiDB cluster (three replicas) on three DCs in one city is as follows: + +![3-DC in One City](/media/multi-data-centers-in-one-city-deployment-sample.png) + +#### TiKV labels + +TiKV is a Multi-Raft system where data is divided into Regions and the size of each Region is 96 MB by default. Three replicas of each Region form a Raft group. For a TiDB cluster of three replicas, because the number of Region replicas is independent of the TiKV instance numbers, three replicas of a Region are only scheduled to three TiKV instances. This means that even if the cluster is scaled out to have N TiKV instances, it is still a cluster of three replicas. + +Because a Raft group of three replicas tolerates only one replica failure, even if the cluster is scaled out to have N TiKV instances, this cluster still tolerates only one replica failure. Two failed TiKV instances might cause some Regions to lose replicas and the data in this cluster is no longer complete. SQL requests that access data from these Regions will fail. The probability of two simultaneous failures among N TiKV instances is much higher than the probability of two simultaneous failures among three TiKV instances. This means that the more TiKV instances the Multi-Raft system is scaled out to have, the less the availability of the system. + +Because of the limitation described above, `label` is used to describe the location information of TiKV. The label information is refreshed to the TiKV startup configuration file with deployment or rolling upgrade operations. The started TiKV reports its latest label information to PD. Based on the user-registered label name (the label metadata) and the TiKV topology, PD optimally schedules Region replicas and improves the system availability. + +#### TiKV labels planning example + +To improve the availability and disaster recovery of the system, you need to design and plan TiKV labels according to your existing physical resources and the disaster recovery capability. You also need to configure the relevant `tidb-ansible inventory.ini` file according to the planned topology: + +```ini +[tikv_servers] +TiKV-30 ansible_host=10.63.10.30 deploy_dir=/data/tidb_cluster/tikv tikv_port=20170 tikv_status_port=20180 labels="zone=z1,dc=d1,rack=r1,host=30" +TiKV-31 ansible_host=10.63.10.31 deploy_dir=/data/tidb_cluster/tikv tikv_port=20170 tikv_status_port=20180 labels="zone=z1,dc=d1,rack=r1,host=31" +TiKV-32 ansible_host=10.63.10.32 deploy_dir=/data/tidb_cluster/tikv tikv_port=20170 tikv_status_port=20180 labels="zone=z1,dc=d1,rack=r2,host=30" +TiKV-33 ansible_host=10.63.10.33 deploy_dir=/data/tidb_cluster/tikv tikv_port=20170 tikv_status_port=20180 labels="zone=z1,dc=d1,rack=r2,host=30" + +TiKV-34 ansible_host=10.63.10.34 deploy_dir=/data/tidb_cluster/tikv tikv_port=20170 tikv_status_port=20180 labels="zone=z2,dc=d1,rack=r1,host=34" +TiKV-35 ansible_host=10.63.10.35 deploy_dir=/data/tidb_cluster/tikv tikv_port=20170 tikv_status_port=20180 labels="zone=z2,dc=d1,rack=r1,host=35" +TiKV-36 ansible_host=10.63.10.36 deploy_dir=/data/tidb_cluster/tikv tikv_port=20170 tikv_status_port=20180 labels="zone=z2,dc=d1,rack=r2,host=36" +TiKV-37 ansible_host=10.63.10.36 deploy_dir=/data/tidb_cluster/tikv tikv_port=20170 tikv_status_port=20180 labels="zone=z2,dc=d1,rack=r2,host=37" + +TiKV-38 ansible_host=10.63.10.38 deploy_dir=/data/tidb_cluster/tikv tikv_port=20170 tikv_status_port=20180 labels="zone=z3,dc=d1,rack=r1,host=38" +TiKV-39 ansible_host=10.63.10.39 deploy_dir=/data/tidb_cluster/tikv tikv_port=20170 tikv_status_port=20180 labels="zone=z3,dc=d1,rack=r1,host=39" +TiKV-40 ansible_host=10.63.10.40 deploy_dir=/data/tidb_cluster/tikv tikv_port=20170 tikv_status_port=20180 labels="zone=z3,dc=d1,rack=r2,host=40" +TiKV-41 ansible_host=10.63.10.41 deploy_dir=/data/tidb_cluster/tikv tikv_port=20170 tikv_status_port=20180 labels="zone=z3,dc=d1,rack=r2,host=41" + +## Group variables +[pd_servers:vars] +location_labels = ["zone","dc","rack","host"] +``` + +In the example above, `zone` is the logical availability zone layer that controls the isolation of replicas (three replicas in the example cluster). + +Considering that the DC might be scaled out in the future, the three-layer label structure (`dc`, `rack`, `host`) is not directly adopted. Assuming that `d2`, `d3`, and `d4` are to be scaled out, you only need to scale out the DCs in the corresponding availability zone and scale out the racks in the corresponding DC. + +If this three-layer label structure is directly adopted, after scaling out a DC, you might need to apply new labels and the data in TiKV needs to be rebalanced. + +### High availability and disaster recovery analysis + +The multiple DCs in one city deployment can guarantee that if one DC fails, the cluster can automatically recover services without manual intervention. Data consistency is also guaranteed. Note that scheduling policies are used to optimize performance, but when failure occurs, these policies prioritize availability over performance.