Your Guide to Troubleshooting Slow Linux Servers

You know the dreaded message: "Slow website, please fix." It arrives every so often, usually with no further information from the user. While it's easy to blame the user's network connection or device, we first need to know whether something is wrong with the servers.

In this guide, we will walk through troubleshooting steps for server performance and stability issues, even when the user or client provides little to no information.

Try accessing it yourself.

The easiest way to confirm a slow-server complaint is to try accessing the site yourself. This tells you whether the server is unresponsive for everyone or the problem only affects certain users.

Sometimes the slowness only affects certain functionality on your site, such as logging in or opening the dashboard, while the front page is still relatively quick. This is worth confirming, especially if page caching is enabled, because it suggests the problem comes from accessing or processing uncached content or data.
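One way to put numbers on this comparison is to time the same request against a cached and an uncached page. The sketch below assumes curl is installed. time_url is a hypothetical helper name, and the URLs in the comments are placeholders for your own pages.

```shell
# time_url URL
# Prints how long each phase of a single request took, using curl's
# --write-out timing variables. The response body is discarded.
time_url() {
    curl -o /dev/null -s -w 'dns=%{time_namelookup}s connect=%{time_connect}s first_byte=%{time_starttransfer}s total=%{time_total}s\n' "$1"
}

# Example calls (placeholder URLs, replace with your own pages):
#   time_url https://example.com/
#   time_url https://example.com/dashboard
```

A first_byte value that is much larger on the uncached page than on the front page suggests the time is being spent in the application or database rather than on the network.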

Also, if SSH is enabled on your server, try logging in and check how responsive the session is, or whether you can log in at all.

The ssh command has a verbose mode that prints debug messages, which helps you diagnose timeout issues and is a useful first step when debugging server problems.

$ ssh -vvv username@hostname

The magical power of rebooting.

Sometimes the easiest fix is the best solution. Rebooting solves a lot of problems with minimal effort, and it is also a useful diagnostic step: if performance problems persist after you reboot the server, the cause is more likely a persistent application or hardware issue than a transient one.

Check the hardware usage.

Compute, memory, storage, and networking are the backbone of a working server, and one or more of them could be contributing to the issues affecting your server.

  • A slow or overwhelmed CPU could cause a process to wait a long time before it finishes a task.
  • Full memory could cause a process to be killed before it finishes a task or, if swap is enabled, force the task to finish through slow disk I/O.
  • Slow or overwhelmed storage could impact database read/write performance and slow down data access considerably.
  • Significant network latency or slow throughput could cause a considerable delay for the end user and impact overall responsiveness.

Start checking usage with top.

To monitor CPU and memory usage we will use top, which is included by default in most Linux distros.

Screenshot of top, running on Ubuntu-based server.

At a glance, a few things stand out after executing top:

  • Server uptime
  • Load average in 1, 5, and 15 minutes
  • Number of tasks running including idle and zombie processes
  • CPU usage in percentage for user, system, nice, idle, waiting i/o, hardware interrupts, software interrupts, and steal
  • Memory and swap usage
  • List of currently running tasks and processes
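The load average is easier to judge when it is compared with the number of CPU cores. The sketch below reads the same 1-minute figure that top shows from /proc/loadavg and divides it by the core count. load_per_core is a hypothetical helper name, and the optional arguments exist only so the function can be tried against a saved file.

```shell
# load_per_core [LOADAVG_FILE] [CPU_COUNT]
# Prints the 1-minute load average divided by the number of CPUs.
load_per_core() {
    file=${1:-/proc/loadavg}
    cpus=${2:-$(nproc)}
    awk -v cpus="$cpus" '{ printf "%.2f\n", $1 / cpus }' "$file"
}

# Live reading on a Linux host.
if [ -r /proc/loadavg ]; then
    load_per_core
fi
```

As a rough rule of thumb, a value that stays above 1.00 means more tasks are runnable or blocked on I/O than there are cores to serve them. For a one-off snapshot of the full top screen without the interactive display, top -b -n 1 runs it once in batch mode.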

Understanding CPU usage.

  • User (us) is the CPU time spent running user-space programs such as web servers and databases. High utilization is often completely typical; however, if user usage stays high for an extended time, the CPU is likely the bottleneck slowing your application down.
  • System (sy) is the CPU time spent in the kernel on behalf of user processes, for example when allocating memory or starting a child process. This value should be as low as possible. If it spikes frequently during a certain period, evaluate how many processes are being spawned and how memory and storage are being used, to see whether the server needs more CPU resources.
  • Nice (ni) is the CPU time spent running user processes that have been niced, meaning they run at a lowered priority. If this value is high because your production application runs at a lower priority, you may need to raise its priority or run it at the default priority.
  • Idle (id) is simply the time the CPU spends idle. On a server with little load this value should be high.
  • Iowait (wa) is the time the CPU spends idle while waiting for input and output, such as reading or writing data to storage. A high value indicates that the CPU is waiting for the disk to complete its work. If the server still uses a conventional hard drive, you may need more disk IOPS or a faster storage option such as an SSD.
  • Hardware and software interrupts (hi and si) show how much time the processor has spent servicing interrupts. Hardware interrupts are signals sent to the CPU by peripherals such as disks and network interfaces. Software interrupts are deferred work that the kernel runs afterwards, for example processing network packets. High hardware interrupt time could be caused by faulty hardware or by a device, such as a busy network card, that raises a lot of interrupts.
  • Steal (st) is the time a virtual CPU waits while the hypervisor serves another virtual CPU. It only applies to virtual machines. When the steal value is too high, the host system running the hypervisor is overloaded. Check the other virtual machines on the same hypervisor, or move your virtual machine to a different host if at all possible.
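The percentages in this list come from counters the kernel exposes on the first line of /proc/stat. As a rough sketch of how they are derived, the snippet below takes two samples one second apart and turns the differences into percentages. cpu_breakdown is a hypothetical helper name, and the output labels simply mirror the abbreviations used by top.

```shell
# cpu_breakdown "FIRST_CPU_LINE" "SECOND_CPU_LINE"
# Each argument is the first line of /proc/stat, whose fields after "cpu"
# are: user nice system idle iowait irq softirq steal (guest fields ignored).
cpu_breakdown() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        NR == 1 { for (i = 2; i <= 9; i++) a[i] = $i }
        NR == 2 {
            total = 0
            for (i = 2; i <= 9; i++) { d[i] = $i - a[i]; total += d[i] }
            if (total == 0) { print "no CPU time elapsed between samples"; exit 1 }
            printf "us=%.1f ni=%.1f sy=%.1f id=%.1f wa=%.1f hi=%.1f si=%.1f st=%.1f\n",
                100 * d[2] / total, 100 * d[3] / total, 100 * d[4] / total,
                100 * d[5] / total, 100 * d[6] / total, 100 * d[7] / total,
                100 * d[8] / total, 100 * d[9] / total
        }'
}

# Take two samples one second apart on a Linux host.
if [ -r /proc/stat ]; then
    before=$(head -n 1 /proc/stat)
    sleep 1
    after=$(head -n 1 /proc/stat)
    cpu_breakdown "$before" "$after"
fi
```

Sampling twice matters because the counters are cumulative since boot; a single reading would only show the long-term average, not what the CPU is doing right now.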

Memory

After CPU, the next obvious thing to troubleshoot is memory use. A server whose memory is completely full becomes unresponsive and can even become inaccessible.

free -h

Execute the command above to view usage for both memory and swap.

Screenshot of memory and swap usage

If you look at the screenshot above, you may think that only around 212 MB is left of our 2 GB of memory, but this is quite normal. The columns to watch are buff/cache and available.

First, the buff/cache column is the memory used by kernel buffers and the page cache, which the kernel uses when it performs I/O operations on the disk. A higher value means more files and metadata are cached in RAM.

Second, the available column is an estimate of how much memory applications can use without resorting to swap. If available memory is low, applications and the server as a whole can slow down considerably.
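If you want the available figure as a single number, for example to log it over time, it can be read from /proc/meminfo, which is where free gets its data. This is a minimal sketch; mem_available_pct is a hypothetical helper name, and the optional file argument exists only so it can be tried against a saved copy.

```shell
# mem_available_pct [MEMINFO_FILE]
# Prints MemAvailable as a percentage of MemTotal.
mem_available_pct() {
    awk '/^MemTotal:/     { total = $2 }
         /^MemAvailable:/ { avail = $2 }
         END { if (total > 0) printf "%.1f\n", 100 * avail / total; else exit 1 }' \
        "${1:-/proc/meminfo}"
}

# Live reading on a Linux host.
if [ -r /proc/meminfo ]; then
    mem_available_pct
fi
```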

Memory (Swap)

While it's tempting to disable and remove swap, in production it is useful as protection against out-of-memory (OOM) issues.

Swap means applications are using disk as an alternative to RAM, which is much slower, so performance degrades as swap usage increases. Keep this value as low as possible and make sure the server has enough memory for your applications.
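Swap that was filled long ago and is now sitting untouched hurts far less than swap that is being read and written right now. As a rough check, the sketch below samples the kernel's swap-in and swap-out page counters from /proc/vmstat twice. swap_counters is a hypothetical helper name, and the two-second gap is an arbitrary choice.

```shell
# swap_counters [VMSTAT_FILE]
# Prints the cumulative pages swapped in and out since boot: "<in> <out>".
swap_counters() {
    awk '$1 == "pswpin"  { i = $2 }
         $1 == "pswpout" { o = $2 }
         END { print i + 0, o + 0 }' "${1:-/proc/vmstat}"
}

# Sample twice on a Linux host; numbers that grow between the two lines
# mean the server is actively swapping.
if [ -r /proc/vmstat ]; then
    swap_counters
    sleep 2
    swap_counters
fi
```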

There's a useful command that you can use to check which applications are using swap.

for file in /proc/[0-9]*/status; do awk '/^Name:/{name=$2} /^VmSwap:/{print name, $2, $3}' "$file" 2>/dev/null; done | sort -k 2 -n -r | less
Application swap usage

The screenshot shows that MariaDB's swap usage is quite high, and databases are especially prone to timeouts and poor performance when they are swapped out. Ensure your important applications have adequate memory, especially in an environment with high transaction activity.

Storage

Storage performance is paramount: slow storage can cause a lot of issues, including freezes, lag, and long load times. When troubleshooting storage problems, the first thing to check is how much storage is used. Sometimes the main culprit behind a slow or unstable server is that the system or an app cannot create a new file because the storage is full.

error: No space left on device

The usual culprit of storage issues.

That message could mean multiple things: a full storage partition, insufficient inodes, or disk corruption. To check how much storage is used, the simple thing is to execute the command below:

df -h

df stands for disk free, and it is one of the commands you should remember as a sysadmin.

Screenshot of df, showing 70% of disk usage.

In most Linux installations, especially on VMs, you only need to look at how much of the root partition is used. In this case, it's using 70% of the total available storage, with 18GB left. When it's full, the server still runs, but you cannot create files, and even simple things like tab autocompletion in bash can fail.
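On servers with more than one partition, it can help to list only the filesystems that are close to full. The sketch below filters the POSIX-format output of df -P. fs_over_threshold is a hypothetical helper name, the 80% limit is an arbitrary example, and mount points containing spaces are not handled.

```shell
# fs_over_threshold PERCENT
# Reads `df -P` output on stdin and prints the mount point and usage of
# every filesystem at or above PERCENT.
fs_over_threshold() {
    awk -v limit="$1" 'NR > 1 {
        use = $5
        sub(/%/, "", use)
        if (use + 0 >= limit) print $6, use "%"
    }'
}

# List filesystems that are at least 80% full (prints nothing if none are).
df -P | fs_over_threshold 80
```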

The next topic is inodes. An inode, short for index node, is the filesystem structure that stores the metadata of a file or folder, and each one has a unique number. A filesystem has a limited number of inodes, which on many filesystems is fixed when the filesystem is created, and every file and folder uses one. If your server generates a lot of small files over time, it may eventually use up all the inodes.

If you find that disk space is still sufficient but you are encountering a no space left error, it's best to check whether the inodes are used up.

df -ih

Command for checking inodes, just add -i to the df command arguments

Inodes usage, showing 3% usage on root directory
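When inode usage does turn out to be high, the next question is usually which directory holds all the small files. Since every file and folder uses one inode, counting entries per directory gives a reasonable approximation. This is a sketch only; count_entries_per_dir is a hypothetical helper name, and /var in the usage comment is just a common place to start looking.

```shell
# count_entries_per_dir [DIR]
# For each immediate subdirectory of DIR, prints the number of files and
# folders inside it (including itself), largest first.
count_entries_per_dir() {
    for dir in "${1:-.}"/*/; do
        [ -d "$dir" ] || continue
        # -xdev keeps find on the same filesystem as the starting directory
        printf '%s %s\n' "$(find "$dir" -xdev | wc -l | tr -d ' ')" "$dir"
    done | sort -rn
}

# Example (can take a while on large trees, and may need root):
#   count_entries_per_dir /var | head
```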

After determining that storage and inode usage are fine, the next thing to check is which applications are stressing the storage and causing heavy I/O load. You can do this with iotop.

iotop
Screenshot of iotop

With iotop, you can see which processes are currently reading and writing to the disk, with the total disk read/write shown at the top of the column list. When applications are using storage heavily, it could be the cause of slow performance, especially when the server is still running on a spinning hard disk.

Network Latency

The last hardware topic we are going to discuss is networking. A slow or congested network can create a bad experience for the end user, especially if the server is exposed to the internet.

There are many ways to check network performance. The first and best-known step is to ping to and from the server to determine network latency.

# ping to server ip address
ping 192.168.0.133

# ping from the server to global internet or other local server
ping google.com
Screenshot of ping command on google.com

The value you need to watch is the response time. As shown in the screenshot above, it stays at a consistently good 29-30ms. This value should be as low and consistent as possible; huge spikes or inconsistent response times point to network instability.
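Rather than reading every line by eye, you can reduce a ping run to the three numbers that matter here: packet loss, average response time, and how much the response time varies. The sketch below assumes the summary format printed by the Linux iputils version of ping; ping_summary is a hypothetical helper name.

```shell
# ping_summary
# Reads ping output on stdin and prints packet loss, average round-trip
# time, and mdev (the deviation between replies) on one line.
ping_summary() {
    awk '
        /packet loss/ {
            for (i = 1; i <= NF; i++) if ($i ~ /%$/) loss = $i
        }
        /min\/avg\/max/ {
            split($0, halves, " = ")
            split(halves[2], v, "/")
            sub(/ *ms.*$/, "", v[4])
            avg = v[2]; mdev = v[4]
        }
        END { printf "loss=%s avg=%sms mdev=%sms\n", loss, avg, mdev }'
}

# Example: send 10 packets and summarise the result.
#   ping -c 10 google.com | ping_summary
```

A low avg with a high mdev is the "inconsistent response time" case described above: most replies are fast, but some are not.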

The next useful tool for investigating slow network response times is the traceroute command, which shows in more granular detail the path that traffic takes to its destination.

traceroute google.com
Screenshot of traceroute command on google.com

Network Utilization

We can monitor network bandwidth usage with the iftop utility, which measures the traffic passing through a network interface. With this tool, you can see current network traffic and how much of the bandwidth is being used.

Screenshot of iftop command tool
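If iftop is not installed, a rough throughput figure can still be worked out from the byte counters in /proc/net/dev. The sketch below samples them one second apart. iface_bytes and iface_rate are hypothetical helper names, and eth0 in the usage comment is an assumption; substitute your own interface name.

```shell
# iface_bytes IFACE [NET_DEV_FILE]
# Prints the cumulative received and transmitted bytes: "<rx> <tx>".
iface_bytes() {
    # Turn "eth0:" into "eth0 " so the name is always its own field;
    # rx bytes is then field 2 and tx bytes is field 10.
    awk -v iface="$1" '{ sub(/:/, " ") } $1 == iface { print $2, $10 }' \
        "${2:-/proc/net/dev}"
}

# iface_rate IFACE
# Prints the receive and transmit rate over a one-second window.
iface_rate() {
    first=$(iface_bytes "$1")
    [ -n "$first" ] || { echo "interface $1 not found" >&2; return 1; }
    sleep 1
    second=$(iface_bytes "$1")
    echo "$first $second" | awk '{
        printf "rx=%.1f KiB/s tx=%.1f KiB/s\n", ($3 - $1) / 1024, ($4 - $2) / 1024
    }'
}

# Example (assumed interface name):
#   iface_rate eth0
```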

Wrapping up

Being an administrator, particularly one working on servers, is exhausting and requires a lot of knowledge and hands-on experience. Troubleshooting is part of the job, and it isn't always fun and is often frustrating, but I hope these tips and pointers help you solve problems and improve your server performance.