NFS server's disconnection caused the host high load and high VIRT size #1121

@wfhu

Host operating system: output of uname -a

Linux 3.10.0-862.6.3.el7.x86_64 #1 SMP Tue Jun 26 16:32:21 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

node_exporter version: output of node_exporter --version

node_exporter, version 0.16.0 (branch: HEAD, revision: d42bd70)
build user: root@a67a9bc13a69
build date: 20180515-15:52:42
go version: go1.9.6

node_exporter command line flags

/usr/local/prometheus/node_exporter-0.16.0.linux-amd64/node_exporter

Are you running node_exporter in Docker?

No, not in Docker

What did you do that produced an error?

The host has an NFS volume mounted on a local directory. When the NFS server went down, the host's load average rose to about 330.
After I restarted the node_exporter process, everything came back to normal.

What did you expect to see?

This should not happen; node_exporter should detect that the NFS server is down.
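
To illustrate the kind of detection meant here, the following is a minimal sketch of a mount probe with a time limit. It is not node_exporter's implementation. The function name `probe_mount` and its default limit of 5 seconds are hypothetical, and it assumes the `timeout` utility from GNU coreutils is installed. It also assumes the blocked `df` can be interrupted by SIGKILL, which is usually but not always the case for a hung NFS mount.

```shell
# probe_mount PATH [SECONDS]
# Hypothetical helper: returns 0 if filesystem statistics for PATH can be
# read within the time limit, and non-zero if the call fails or times out
# (a missing path also counts as a failure).
probe_mount() {
    path=$1
    limit=${2:-5}   # assumed default of 5 seconds
    # df asks the kernel for filesystem statistics, which is the kind of
    # call that blocks on an unreachable NFS server. SIGKILL is used
    # because a process blocked on NFS I/O typically does not react to
    # SIGTERM until the I/O completes.
    if timeout -s KILL "$limit" df -P "$path" >/dev/null 2>&1; then
        return 0
    fi
    return 1
}
```

For example, `probe_mount /mnt/nfs 3 || echo "mount not responding"` would report a mount that does not answer within 3 seconds, where `/mnt/nfs` stands in for the real mount point.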

What did you see instead?

top - 16:41:00 up 103 days, 20:04, 2 users, load average: 331.40, 330.96, 302.72
Tasks: 222 total, 1 running, 221 sleeping, 0 stopped, 0 zombie
%Cpu(s): 1.4 us, 1.1 sy, 0.0 ni, 96.8 id, 0.7 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 65975208 total, 487048 free, 33177676 used, 32310484 buff/cache
KiB Swap: 16777212 total, 16654076 free, 123136 used. 31770544 avail Mem

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
19263 root 20 0 0 0 0 D 0.0 0.0 0:00.00 172.16.89.197-m
21651 nobody 20 0 10.924g 200524 5236 D 0.0 0.3 0:25.59 node_exporter

As the output above shows, node_exporter's VIRT is about 10.9 GiB and the host's load average has risen to 331, both of which are very high.
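
The Linux load average counts every task (thread) that is runnable or in uninterruptible sleep (state D), while `top` lists one row per process by default. A load of about 330 with only two D-state rows would therefore be consistent with many threads of a single process being blocked. This is an inference and was not verified on this host. A minimal sketch for checking it is shown below. The function name `d_state_by_command` is hypothetical, and the input is assumed to be one "STATE COMMAND" line per thread.

```shell
# d_state_by_command
# Hypothetical helper: reads "STATE COMMAND" lines on stdin (one line per
# thread) and prints "COUNT COMMAND" for every command that has threads
# in state D, highest count first.
d_state_by_command() {
    awk '$1 ~ /^D/ {
             $1 = ""              # drop the state column
             sub(/^ +/, "")       # trim the separator left behind
             n[$0]++              # count threads per command name
         }
         END { for (c in n) print n[c], c }' | sort -rn
}
```

With procps `ps`, per-thread input can be produced with `ps -eLo state=,comm= | d_state_by_command`.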

After I ran kill -9 on the node_exporter process, the load dropped quickly:

top - 16:41:06 up 103 days, 20:04, 2 users, load average: 305.02, 325.50, 301.11
Tasks: 223 total, 2 running, 221 sleeping, 0 stopped, 0 zombie
%Cpu(s): 2.2 us, 1.2 sy, 0.0 ni, 94.8 id, 1.8 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 65975208 total, 919604 free, 32944664 used, 32110940 buff/cache
KiB Swap: 16777212 total, 16654076 free, 123136 used. 32004128 avail Mem

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
19263 root 20 0 0 0 0 D 0.0 0.0 0:00.00 172.16.89.197-m

After just 4 minutes, everything was back to normal:

top -b -n 1 | awk '{if (NR <= 7) print; else if ($8 == "D") {print; count++} } END {print "Total status D: "count}'

top - 16:45:40 up 103 days, 20:09, 2 users, load average: 5.06, 132.79, 225.54
Tasks: 221 total, 1 running, 220 sleeping, 0 stopped, 0 zombie
%Cpu(s): 1.4 us, 1.2 sy, 0.0 ni, 95.4 id, 1.9 wa, 0.0 hi, 0.0 si, 0.1 st
KiB Mem : 65975208 total, 841576 free, 32951132 used, 32182500 buff/cache
KiB Swap: 16777212 total, 16654076 free, 123136 used. 31997280 avail Mem

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
19263 root 20 0 0 0 0 D 0.0 0.0 0:00.00 172.16.89.197-m
Total status D: 1

The virtual memory size of the restarted node_exporter (the VSZ column, in KiB) is also back to normal:

USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
nobody 25360 0.4 0.0 1263600 12488 ? Ssl 16:41 0:46 /usr/local/prometheus/node_exporter-0.16.0.linux-amd64/node_exporter
