50 changes: 33 additions & 17 deletions how-to/monitor/monitor-a-cluster.md
@@ -96,22 +96,30 @@ Assume that the TiDB cluster topology is as follows:

#### Step 1: Download the binary package

{{< copyable "shell-regular" >}}

```bash
# Downloads the package.
$ wget https://github.com/prometheus/prometheus/releases/download/v2.2.1/prometheus-2.2.1.linux-amd64.tar.gz
$ wget https://github.com/prometheus/node_exporter/releases/download/v0.15.2/node_exporter-0.15.2.linux-amd64.tar.gz
$ wget https://s3-us-west-2.amazonaws.com/grafana-releases/release/grafana-4.6.3.linux-x64.tar.gz
wget https://download.pingcap.org/prometheus-2.8.1.linux-amd64.tar.gz
wget https://download.pingcap.org/node_exporter-0.17.0.linux-amd64.tar.gz
wget https://download.pingcap.org/grafana-6.1.6.linux-amd64.tar.gz
```

{{< copyable "shell-regular" >}}

```bash
# Extracts the package.
$ tar -xzf prometheus-2.2.1.linux-amd64.tar.gz
$ tar -xzf node_exporter-0.15.2.linux-amd64.tar.gz
$ tar -xzf grafana-4.6.3.linux-x64.tar.gz
tar -xzf prometheus-2.8.1.linux-amd64.tar.gz
tar -xzf node_exporter-0.17.0.linux-amd64.tar.gz
tar -xzf grafana-6.1.6.linux-amd64.tar.gz
```
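
The download and extraction commands above can also be combined into one loop. The following is a minimal sketch, not part of the documented procedure: `BASE_URL` and `fetch_all` are hypothetical names introduced here, and the package file names are the ones used in this step.

```bash
#!/usr/bin/env bash
# Sketch: download and extract the three monitoring packages in one pass.
# BASE_URL and fetch_all are illustrative names, not part of the TiDB docs.
BASE_URL="https://download.pingcap.org"

fetch_all() {
  local pkg
  for pkg in \
    prometheus-2.8.1.linux-amd64.tar.gz \
    node_exporter-0.17.0.linux-amd64.tar.gz \
    grafana-6.1.6.linux-amd64.tar.gz
  do
    # Stop at the first package that fails to download or extract.
    wget "${BASE_URL}/${pkg}" && tar -xzf "${pkg}" || return 1
  done
}
```

Call `fetch_all` from the directory where the packages should be unpacked. It returns a non-zero status as soon as one download or extraction fails.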

#### Step 2: Start `node_exporter` on Node1, Node2, Node3, and Node4

{{< copyable "shell-regular" >}}

```bash
$ cd node_exporter-0.15.2.linux-amd64
cd node_exporter-0.17.0.linux-amd64

# Starts the node_exporter service.
$ ./node_exporter --web.listen-address=":9100" \
@@ -122,10 +130,14 @@ $ ./node_exporter --web.listen-address=":9100" \

Edit the Prometheus configuration file:

```yml
$ cd prometheus-2.2.1.linux-amd64
$ vi prometheus.yml
{{< copyable "shell-regular" >}}

```bash
cd prometheus-2.8.1.linux-amd64 &&
vi prometheus.yml
```

```ini
...

global:
@@ -191,9 +203,11 @@ $ ./prometheus \

Edit the Grafana configuration file:

{{< copyable "shell-regular" >}}

```ini
$ cd grafana-4.6.3
$ vi conf/grafana.ini
cd grafana-6.1.6 &&
vi conf/grafana.ini

...

@@ -256,20 +270,22 @@ This section describes how to configure Grafana.
- Default account: admin
- Default password: admin

2. Click the Grafana logo to open the sidebar menu.
> **Note:**
>
> For the **Change Password** step, you can choose **Skip**.

3. In the sidebar menu, click **Data Source**.
2. In the Grafana sidebar menu, click **Data Source** under **Configuration**.

4. Click **Add data source**.
3. Click **Add data source**.

5. Specify the data source information.
4. Specify the data source information.

- Specify a **Name** for the data source.
- For **Type**, select **Prometheus**.
- For **URL**, specify the Prometheus address.
- Specify other fields as needed.

6. Click **Add** to save the new data source.
5. Click **Add** to save the new data source.
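
As an alternative to the UI steps above, the same data source can be registered through Grafana's HTTP API (`POST /api/datasources`). The sketch below makes several assumptions: the helper names `datasource_payload` and `add_datasource` are hypothetical, the default `admin`/`admin` account is still in use, `"access":"proxy"` is chosen as the access mode, and the name and URL contain no characters that need JSON escaping.

```bash
#!/usr/bin/env bash
# Sketch: add a Prometheus data source through the Grafana HTTP API.
# Helper names are illustrative; name and URL must not contain quotes.

# Builds the JSON body for POST /api/datasources.
datasource_payload() {
  local name="$1" url="$2"
  printf '{"name":"%s","type":"prometheus","url":"%s","access":"proxy"}\n' "$name" "$url"
}

# Sends the payload to Grafana using the default admin account.
add_datasource() {
  local grafana="$1" name="$2" url="$3"
  curl -sS -X POST "${grafana}/api/datasources" \
    -u admin:admin \
    -H "Content-Type: application/json" \
    -d "$(datasource_payload "$name" "$url")"
}
```

For example, `add_datasource http://<grafana-address>:3000 tidb-cluster http://<prometheus-address>:9090`, where both addresses are placeholders for the values in your deployment.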

#### Step 2: Import a Grafana dashboard

41 changes: 30 additions & 11 deletions how-to/scale/with-ansible.md
@@ -200,7 +200,8 @@ For example, if you want to add a PD node (node103) with the IP address `172.16.
> Do not add the `#` character at the beginning of the line. Otherwise, the configuration that follows does not take effect.

2. Add `--join="http://172.16.10.1:2379" \`. The IP address (`172.16.10.1`) can be that of any existing PD node in the cluster.
3. Manually start the PD service in the newly added PD node:

3. Start the PD service in the newly added PD node:

```
{deploy_dir}/scripts/start_pd.sh
@@ -220,26 +221,35 @@ For example, if you want to add a PD node (node103) with the IP address `172.16.
>
> `pd-ctl` is a command used to check the number of PD nodes.

5. Apply a rolling update to the entire cluster:
5. Start the monitoring service:

```
ansible-playbook rolling_update.yml
ansible-playbook start.yml -l 172.16.10.103
```

6. Start the monitor service:
> **Note:**
>
> If you use an alias (inventory_name), use the `-l` option to specify the alias.

6. Update the cluster configuration:

```
ansible-playbook start.yml -l 172.16.10.103
ansible-playbook deploy.yml
```

7. Update the Prometheus configuration and restart the cluster:
7. Restart Prometheus so that it starts monitoring the newly added PD node:

```
ansible-playbook rolling_update_monitor.yml --tags=prometheus
ansible-playbook stop.yml --tags=prometheus
ansible-playbook start.yml --tags=prometheus
```

8. Monitor the status of the entire cluster and the newly added node by opening a browser to access the monitoring platform: `http://172.16.10.3:3000`.

> **Note:**
>
> The PD Client in TiKV caches the list of PD nodes. Currently, the list is updated only if the PD leader is switched or the TiKV server is restarted to load the latest configuration. To avoid TiKV caching an outdated list, there should be at least two existing PD members in the PD cluster after increasing or decreasing the capacity of a PD node. If this condition is not met, transfer the PD leader manually to update the list of PD nodes.
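
The playbook commands from steps 5 to 7 above can be chained so that a failure stops the sequence. This is a minimal sketch under stated assumptions: `run`, `refresh_monitoring`, and the `DRY_RUN` variable are hypothetical names introduced here, and the script is run from the directory that contains the playbooks.

```bash
#!/usr/bin/env bash
# Sketch: run the monitoring-related playbooks for a newly added PD node in order.
# run, refresh_monitoring, and DRY_RUN are illustrative, not part of tidb-ansible.

# With DRY_RUN=1 the command is printed instead of executed.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# $1 is the IP address (or inventory alias) of the new node.
refresh_monitoring() {
  local new_node="$1"
  run ansible-playbook start.yml -l "${new_node}" &&
    run ansible-playbook deploy.yml &&
    run ansible-playbook stop.yml --tags=prometheus &&
    run ansible-playbook start.yml --tags=prometheus
}
```

Running `DRY_RUN=1 refresh_monitoring 172.16.10.103` prints the four commands without executing them, which is a quick way to review the sequence first.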

## Decrease the capacity of a TiDB node

For example, if you want to remove a TiDB node (node5) with the IP address `172.16.10.5`, take the following steps:
@@ -430,6 +440,10 @@ For example, if you want to remove a PD node (node2) with the IP address `172.16
ansible-playbook stop.yml -l 172.16.10.2
```

> **Note:**
>
> In this example, only the PD service on node2 is stopped. If other services are also deployed at the IP address `172.16.10.2`, use the `-t` option to specify the service (such as `-t tidb`).

4. Edit the `inventory.ini` file and remove the node information:

```ini
@@ -480,16 +494,21 @@ For example, if you want to remove a PD node (node2) with the IP address `172.16
| node8 | 172.16.10.8 | TiKV3 |
| node9 | 172.16.10.9 | TiKV4 |

5. Perform a rolling update to the entire TiDB cluster:
5. Update the cluster configuration:

```
ansible-playbook rolling_update.yml
ansible-playbook deploy.yml
```

6. Update the Prometheus configuration and restart the cluster:
6. Restart Prometheus so that it stops monitoring the removed PD node:

```
ansible-playbook rolling_update_monitor.yml --tags=prometheus
ansible-playbook stop.yml --tags=prometheus
ansible-playbook start.yml --tags=prometheus
```

7. To monitor the status of the entire cluster, open a browser to access the monitoring platform: `http://172.16.10.3:3000`.

> **Note:**
>
> The PD Client in TiKV caches the list of PD nodes. Currently, the list is updated only if the PD leader is switched or the TiKV server is restarted to load the latest configuration. To avoid TiKV caching an outdated list, there should be at least two existing PD members in the PD cluster after increasing or decreasing the capacity of a PD node. If this condition is not met, transfer the PD leader manually to update the list of PD nodes.
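
Besides checking the dashboards, you can confirm that Prometheus no longer scrapes the removed node by querying its `/api/v1/targets` endpoint. The sketch below is an illustration only: `target_listed` is a hypothetical helper, and it assumes that the `instance` label of each target has the form `IP:port` and that the response is compact JSON.

```bash
#!/usr/bin/env bash
# Sketch: check whether a host still appears among the Prometheus scrape targets.
# target_listed is an illustrative helper; it assumes instance labels look like "IP:port".

# Reads a /api/v1/targets JSON response on stdin.
# Succeeds (exit 0) if any target's instance label belongs to the given host.
target_listed() {
  local host="$1"
  # -F matches the string literally, so dots in the IP are not wildcards.
  grep -qF "\"instance\":\"${host}:"
}
```

For example, `curl -s http://<prometheus-address>:9090/api/v1/targets | target_listed 172.16.10.2 && echo "still monitored"`, where `<prometheus-address>` is a placeholder for the Prometheus host in your deployment.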