[HOW TO] Linux Server High Load

Hello,

I haven't written an article in a long time. I won't go into the details, but I have been busy and I changed companies. Today, I'm back. Roughly two days ago, I ran into a high-load incident. A colleague looked into it, but he couldn't find anything.

He told me: Tien, I don't see anything related to the high load. As you can see, the load of the top processes is okay.

Tien: Okay, I will take care of this.


And then, I started investigating.


top - 09:55:39 up 63 days, 21:42,  4 users,  load average: 14.26, 14.27, 14.25
Tasks: 168 total,   1 running, 167 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.7 us,  0.1 sy,  0.0 ni, 99.2 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem :  7747272 total,  1385396 free,  1415356 used,  4946520 buff/cache
KiB Swap:        0 total,        0 free,        0 used.  5655232 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
    1 root      20   0  190720   3760   2428 S   0.0  0.0   6:09.07 systemd
    2 root      20   0       0      0      0 S   0.0  0.0   0:00.29 kthreadd
    3 root      20   0       0      0      0 S   0.0  0.0   0:03.07 ksoftirqd/0


At first glance, the processes are running normally and using almost no CPU (99.2% idle). But the load average stays high, around 14, as shown above.
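To put a number on "high", I like to compare the load average with the CPU count. Here is a minimal sketch, assuming a Linux host where /proc/loadavg is readable; the helper names parse_loadavg and load_per_cpu are my own, not a standard tool.

```python
import os


def parse_loadavg(text):
    """Return the 1-, 5- and 15-minute load averages from /proc/loadavg text."""
    fields = text.split()
    return tuple(float(value) for value in fields[:3])


def load_per_cpu(loadavg_text, cpu_count):
    """Divide each load average by the CPU count (hypothetical helper)."""
    return tuple(value / cpu_count for value in parse_loadavg(loadavg_text))


# Only read the live values when the file exists (i.e. on Linux).
if __name__ == "__main__" and os.path.exists("/proc/loadavg"):
    with open("/proc/loadavg") as handle:
        text = handle.read()
    cpus = os.cpu_count() or 1
    print("load averages:", parse_loadavg(text))
    print("per CPU      :", load_per_cpu(text, cpus))
```

If the per-CPU value stays well above 1 while top reports the CPUs as almost idle, the tasks counted in the load are probably waiting on something other than CPU.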

So, what happened?
After looking around, I noticed php-fpm as a potential culprit. It always sat at the top of the process list. So, I used htop to check the state of the processes, and I saw 14 "php-fpm: pool www" processes in uninterruptible sleep (state D).

What is the D (uninterruptible sleep) state?
An uninterruptible process is a process which happens to be in a system call (kernel function) that cannot be interrupted by a signal. Unlike interruptible sleep, you cannot wake up this process with a signal. That is why many people dread seeing this state. You can't kill such processes, because killing means sending a SIGKILL signal to the process, and the signal is not acted on while the process is in this state. So the process simply stays there.
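If you don't have htop at hand, the same information is available under /proc. Below is a rough sketch, assuming a Linux /proc layout; parse_stat and find_uninterruptible are names I made up for illustration.

```python
import os


def parse_stat(stat_line):
    """Split one /proc/<pid>/stat line into (pid, comm, state).

    The command name is wrapped in parentheses and may itself contain
    spaces or parentheses, so we split on the LAST ')' in the line.
    """
    left, _, rest = stat_line.rpartition(")")
    pid_text, _, comm = left.partition("(")
    state = rest.split()[0]
    return int(pid_text), comm, state


def find_uninterruptible(proc_root="/proc"):
    """Return (pid, comm) for every process currently in state D."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as handle:
                pid, comm, state = parse_stat(handle.read())
        except (OSError, ValueError, IndexError):
            continue  # the process exited, or the line was unreadable
        if state == "D":
            found.append((pid, comm))
    return found


# Only scan when /proc exists (i.e. on Linux).
if __name__ == "__main__" and os.path.isdir("/proc"):
    for pid, comm in find_uninterruptible():
        print(pid, comm)
```

This is only a snapshot. A process can pass through state D briefly during normal disk I/O, so I would run it a few times and look for PIDs that stay in the list.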

What was happening in the uninterruptible php-fpm processes?
I used the strace command to see what was going on:
AWS:[root@71 ~]# strace -p 5087
strace: Process 5087 attached
flock(10, LOCK_EX) = 0
gettimeofday({1510306801, 695208}, NULL) = 0
gettimeofday({1510306801, 695321}, NULL) = 0
open("/data/shared/partners/typo3temp/var/locks/flock_cc5e752af9d3afa9e93ad2244046b482", O_WRONLY|O_CREAT, 0666) = 11
fstat(11, {st_mode=S_IFREG|0664, st_size=0, ...}) = 0
...
gettimeofday({1510306801, 702748}, NULL) = 0
flock(11, LOCK_EX|LOCK_NB) = -1 EAGAIN (Resource temporarily unavailable)
gettimeofday({1510306801, 702897}, NULL) = 0
gettimeofday({1510306801, 703041}, NULL) = 0
...
gettimeofday({1510306801, 833556}, NULL) = 0
chmod("/data/partners/www/typo3temp/var/locks/flock_cc5e752af9d3afa9e93ad2244046b482", 0664) = 0
gettimeofday({1510306801, 837579}, NULL) = 0
flock(12, LOCK_EX|LOCK_NB) = -1 EAGAIN (Resource temporarily unavailable)
(and more if you use strace -p 5087)

It means that this php-fpm process, although it shows up in uninterruptible sleep, keeps trying to acquire the lock files in /data/partners/www/typo3temp/var/locks/flock_*. This is what pushed the system load average up over time.
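The EAGAIN result of flock(..., LOCK_EX|LOCK_NB) is easy to reproduce on a local file. The sketch below is my own illustration, not the TYPO3 locking code: it assumes a Linux (or other Unix) system and a local temporary directory, and the file name flock_demo is made up.

```python
import fcntl
import os
import tempfile


def try_lock(fd):
    """Attempt a non-blocking exclusive flock; return True if we got it."""
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return True
    except BlockingIOError:
        # This is what strace prints as
        # "-1 EAGAIN (Resource temporarily unavailable)".
        return False


def demo():
    """Return the results of three lock attempts on the same file."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "flock_demo")  # hypothetical lock file
        # Two separate open() calls are treated independently by flock().
        holder = os.open(path, os.O_WRONLY | os.O_CREAT, 0o666)
        contender = os.open(path, os.O_WRONLY | os.O_CREAT, 0o666)
        try:
            first = try_lock(holder)      # lock is free
            second = try_lock(contender)  # lock is held by "holder"
            fcntl.flock(holder, fcntl.LOCK_UN)
            third = try_lock(contender)   # lock is free again
        finally:
            os.close(holder)
            os.close(contender)
        return first, second, third


if __name__ == "__main__":
    print(demo())
```

On a local filesystem the failed attempt returns immediately. On a network mount every such call also has to talk to the server, which is where I would expect the waiting to come from in my case.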

Interestingly, /data/partners/www/ is on a network mount:
e-----.amazonaws.com:/ 8.0E 994M 8.0E 1% /data/shared
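To check quickly whether a path lives on a network filesystem, you can look it up in /proc/mounts. This is a minimal sketch, assuming the /proc/mounts format (device, mount point, type, ...); find_mount and NETWORK_FS_TYPES are my own names, and the list of types is an assumption, not exhaustive. Note that /proc/mounts escapes spaces in mount points as \040, which this sketch does not decode.

```python
import os

# Assumption: a few common network filesystem types, not a complete list.
NETWORK_FS_TYPES = {"nfs", "nfs4", "cifs", "smb3"}


def find_mount(path, mounts_text):
    """Return (mount_point, fs_type) of the longest mount point containing path."""
    best = None
    wanted = path.rstrip("/") + "/"
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        mount_point, fs_type = fields[1], fields[2]
        prefix = mount_point.rstrip("/") + "/"
        if wanted.startswith(prefix):
            if best is None or len(mount_point) > len(best[0]):
                best = (mount_point, fs_type)
    return best


def is_network_path(path, mounts_text):
    """True if the path sits on one of the filesystem types listed above."""
    mount = find_mount(path, mounts_text)
    return mount is not None and mount[1] in NETWORK_FS_TYPES


# Only read the live mount table when it exists (i.e. on Linux).
if __name__ == "__main__" and os.path.exists("/proc/mounts"):
    with open("/proc/mounts") as handle:
        mounts = handle.read()
    # Resolve symlinks first, because the directory may be a link into the mount.
    target = os.path.realpath("/data/partners/www")
    print(target, find_mount(target, mounts), is_network_path(target, mounts))
```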

So, I think that the Linux load average increased because of a disk (or network mount) I/O workload, not just CPU demand. As I understand it, the load average is meant to reflect demand on the system in a more general sense, rather than just CPU demand (e.g. disk read/write performance). That is also why Linux moved from "CPU load averages" to what one might call "System Load Averages".
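To convince myself that 14 stuck processes alone can explain a load of about 14, here is a small simulation. It is a simplified floating-point model, assuming the load average is an exponentially damped average of the number of runnable plus uninterruptible tasks, sampled roughly every 5 seconds. The real kernel uses fixed-point arithmetic, and simulate_load is my own name.

```python
import math

SAMPLE_INTERVAL = 5.0  # seconds between samples (approximation)


def simulate_load(active_counts, window_seconds=60.0, start=0.0):
    """Feed a series of 'running + uninterruptible' task counts into the average."""
    decay = math.exp(-SAMPLE_INTERVAL / window_seconds)
    load = start
    for active in active_counts:
        load = load * decay + active * (1.0 - decay)
    return load


if __name__ == "__main__":
    # 14 tasks stuck in state D, no CPU usage at all.
    print("after 1 minute:", simulate_load([14] * 12))
    print("after 1 hour  :", simulate_load([14] * 720))
```

In this model the 1-minute average climbs toward 14 and stays there for as long as the 14 tasks remain stuck, even though no CPU is used, which matches what top showed me.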

Finally, I am not sure whether processes in uninterruptible sleep can be killed directly, so I suggest restarting the PHP-FPM service to get rid of them.

While investigating this problem, I read some useful links. You can refer to them here & here.

Tiến Phan - R0039

Knowledge is Endless

Sharing for Success