We have a number of our physical Linux servers set up to use Linux MD RAID to provide either RAID 1 or RAID 5 fault tolerance on our disks. This is all great so long as it is working as expected! I came into work to find that, after a reboot following a kernel update, one of our servers could not bring up its swap space. The swap partition was on a RAID 1 array made up of two mirrored disks.
I began to look at mdadm to find out what was wrong. Running:
# cat /proc/mdstat
revealed that one of the drives had failed, putting both arrays into degraded mode, and to make matters worse the only remaining good disk had now developed errors in the partition used for swap! Thankfully the second array/partition, which contained the system files, was still online, albeit in a degraded state.
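For reference, a mirror with a failed member shows up in /proc/mdstat with an (F) flag against the failed device and a missing slot in the status. Ours looked roughly like this (an illustrative reconstruction rather than a copy of the real output):

md0 : active raid1 sdb1[1](F) sda1[0]
      24418688 blocks [2/1] [U_]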
So the first thing to do was to get a new disk into the array and synchronise the data onto it. After that I needed to remove the other original disk and replace that too. Once all that was done and the data re-synchronised onto both new disks I wanted to look at how we can increase our monitoring of disks so that we don’t get in this situation again!
Removing and adding new disks into a Linux RAID Array
In our case it was /dev/sdb which was the completely failed disk and /dev/sda which was about to fail. Obviously you should adjust this for your own set up before copying anything here!
First we need to mark the disk as failed and then remove it from the array:
# mdadm /dev/md0 --fail /dev/sdb1 --remove /dev/sdb1
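In our set up /dev/sdb also had its second partition, sdb2, as a member of /dev/md1, so the equivalent command for that array was:

# mdadm /dev/md1 --fail /dev/sdb2 --remove /dev/sdb2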
Repeat this for any other arrays which the failed disk is a member of. You have now removed the disk from all of its arrays. Run:
# cat /proc/mdstat
which should show something like this:
server:~# cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid5] [raid4] [raid6] [raid10]
md0 : active raid1 sda1[0]
      24418688 blocks [2/1] [U_]

md1 : active raid1 sda2[0]
      24418688 blocks [2/1] [U_]

unused devices: <none>
Now we need to get the details of the disk we have just removed from the array so we can ensure we remove the correct disk from the server! If you do not already have smartmontools installed, install it by typing:
# apt-get install smartmontools
Now we can get the details of our disk, in this case we are looking for the details of /dev/sdb:
# smartctl /dev/sdb --info
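The lines we are interested in are near the top of the output and look something like this (the model and serial number here are invented; yours will differ):

Device Model:     WDC WD5000AAKX-001CA0
Serial Number:    WD-WCAYU1234567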
Copy the device model and serial number down so you can refer to them when you shut the server down. We also need to ensure that we can boot from the remaining drive:
# grub-install /dev/sda
Now shut down your system:
# shutdown -hP now
Now remove the failed drive, verifying you have the right one by cross-referencing the serial number you noted earlier! Then install your replacement drive. It needs to be at least the same size as the original; mine was larger, but I will cover expanding the size later. With the new drive installed, start your server back up, log in and elevate yourself to root. You may need to adjust your hard drive boot priority in the BIOS at this point, as the new drive does not have grub installed on it yet!
We are now going to add the drive back into the array. First off we need to copy the partitioning from our other disk on to the new disk:
# sfdisk -d /dev/sda | sfdisk /dev/sdb
If you run:
# fdisk -l
You should be able to confirm that both partition tables are the same. Now we add the new partitions back into the arrays:
# mdadm --manage /dev/md0 --add /dev/sdb1
# mdadm --manage /dev/md1 --add /dev/sdb2
Both arrays will now synchronise. Run:
# cat /proc/mdstat
and wait until you see something like this:
server:~# cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid5] [raid4] [raid6] [raid10]
md0 : active raid1 sda1[0] sdb1[1]
      24418688 blocks [2/2] [UU]

md1 : active raid1 sda2[0] sdb2[1]
      24418688 blocks [2/2] [UU]

unused devices: <none>
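Rather than re-running cat over and over, you can leave something like this running to keep an eye on the rebuild progress (Ctrl+C to exit):

# watch -n 10 cat /proc/mdstat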
You can now repeat these steps to replace and synchronise /dev/sda.
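Once both replacement disks are in place and fully synchronised, it is also worth installing grub onto the second new disk, just as we did for /dev/sda earlier, so the system can boot from either drive (adjust the device name for your own set up):

# grub-install /dev/sdb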
Expanding your RAID Array Partitions
Increase the component partition sizes
As I mentioned earlier, the replacement disks were larger than the originals, so we obviously wanted to make use of the extra capacity in our server.
Firstly ensure your array is fully synchronised before following these steps:
# cat /proc/mdstat
If it is not synchronised then wait until it finishes! The next thing to do is to remove the first partition you want to expand from the array. In our case this was /dev/sda2 in the array /dev/md1:
# mdadm /dev/md1 --fail /dev/sda2 --remove /dev/sda2
Now we will use the parted program to resize our partition. The print command in parted will show a list of the partitions on our disk. The one we want to work with (sda2) is listed as partition number 2, and we want to expand it to the end of the disk (500GB).
# parted /dev/sda
(parted) print
(parted) resizepart 2 500GB
(parted) quit
Now we add this disk back into the array and let it re-synchronise:
# mdadm -a /dev/md1 /dev/sda2
and wait for synchronisation to finish:
# cat /proc/mdstat
Once mdstat shows synchronisation is complete, follow these steps for your other partition, /dev/sdb2 in our case. Again wait for full synchronisation before proceeding!
Increase the RAID array partition size
First check the size of the array by issuing:
# mdadm -D /dev/md1 | grep -e "Array Size"
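The output is a single line; using the block count from our example arrays earlier it would look something like this (your numbers will differ):

     Array Size : 24418688 (23.29 GiB 25.00 GB)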
Now we are going to grow the array to use all of the available space on its devices. Enter:
# mdadm --grow /dev/md1 -z max
Recheck the size of your array with:
# mdadm -D /dev/md1 | grep -e "Array Size"
Increase the size of the file system
After increasing the array size we now need to increase the size of the file system. First check that your arrays are in sync:
# cat /proc/mdstat
We are using ext4 as our file system; you can also use this command for ext2 and ext3. If you are using any other file system then research the appropriate tool for the job!
# resize2fs /dev/md1
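If you happen to be running a different file system the tool will differ; on XFS, for example, the equivalent is xfs_growfs, run against the mount point rather than the device (the path below is just a placeholder):

# xfs_growfs /mnt/yourfilesystem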
Now we can check that the file system has been expanded by issuing:
# df -h
You should now be presented with your expanded file system!
How to not get here again…
… or at least catch the issue earlier!
So there were two parts of our server hardware we were not monitoring effectively, which got us into this situation. We did not know that the array was in a degraded state, and we also had no notification that the hard drives were, or had been, in a pending failure state.
Fortunately there are a couple of Nagios plug-ins we can use to monitor the status of our arrays and also the SMART status of the drives inside the array.
Monitoring SMART data in Nagios
There are many plug-ins for Nagios available which will monitor the SMART status of a disk, including “check_ide_smart” from the standard Nagios Plug-ins. However, many of the plug-ins I looked at were not capable of checking the SMART status of drives behind hardware RAID controllers. Whilst this was not essential for this server, it would certainly prove useful for others. I came across the “check_smart” plug-in, a fork of the 2009 check_smart plug-in by Kurt Yoder. Importantly, this plug-in allows us to query the SMART status of disks behind hardware RAID controllers, and as it is just a Perl script it should not prove problematic to run through NRPE on remote servers.
As it was our main Nagios server which had the disk failures, we are installing these plug-ins locally. I will assume that people have enough knowledge to get these working over NRPE… if not, there are plenty of resources out there on the web to help you!
Install and configure SMARTMonTools
To use any of the SMART plug-ins for Nagios we are first going to need SMARTMonTools to be installed and configured. If you did not install SMARTMonTools earlier, do so now by typing:
# apt-get install smartmontools
Now we need to configure SMARTMonTools to run as a daemon at system start-up and also to schedule some SMART tests to run on our disks. To configure SMARTMonTools to run as a daemon type:
# nano /etc/default/smartmontools
Find the line which contains “start_smartd=yes” and remove the comment at the start of the line. Save the file and close nano. Now we need to set up our SMART test schedules. Make a backup of the smartd.conf file by typing:
# cp /etc/smartd.conf /etc/smartd.conf.orig
# nano /etc/smartd.conf
Now delete all lines in this file and paste the following single line in:
DEVICESCAN -H -l error -l selftest -f -s (S/../.././11|L/../../6/22)
This will scan all devices in the server, reporting results to the selftest and error logs at the following times:
- Short Test – Daily at 11am
- Long Test – Every Saturday at 10pm
If this schedule does not suit your needs or you want to know more, have a read of the smartd.conf man page (the -s pattern is T/MM/DD/d/HH: test type, month, day of month, day of week and hour). Save the file and close nano. Now we can restart the smartmontools daemon to get everything running. Type:
# /etc/init.d/smartmontools restart
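To confirm the daemon actually came back up, a quick process check should list a running smartd process:

# pgrep -l smartd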
Set up plug-in pre-requisites
Move into your Nagios plug-ins folder, download the plug-in and make it executable by typing:
# cd /usr/local/nagios/libexec
# wget http://www.claudiokuenzler.com/nagios-plugins/check_smart.pl
# chmod +x ./check_smart.pl
We can then check that this plug-in returns our SMART data by executing the following inside the Nagios libexec folder:
# ./check_smart.pl -d /dev/sda -i ata
All being well you should see something like this:
root@server:/usr/local/nagios/libexec# ./check_smart.pl -d /dev/sda -i ata
OK: no SMART errors detected. |Raw_Read_Error_Rate=0 Spin_Up_Time=0 Start_Stop_Count=1 Reallocated_Sector_Ct=0 Seek_Error_Rate=0 Power_On_Hours=23 Spin_Retry_Count=0 Calibration_Retry_Count=0 Power_Cycle_Count=1 Power-Off_Retract_Count=0 Load_Cycle_Count=14 Temperature_Celsius=18 Reallocated_Event_Count=0 Current_Pending_Sector=0 Offline_Uncorrectable=0 UDMA_CRC_Error_Count=0 Multi_Zone_Error_Rate=0
Now before we go ahead and set up our Nagios checks and commands we need to allow the Nagios user to run this plug-in as root. The documentation provides two ways to do this. I went for option 2 and allowed the Nagios user to run the smartctl program with root privileges. Type the following to open your sudoers file for editing:
# visudo
and then add the following line to it and save and close the file:
nagios ALL = NOPASSWD: /usr/sbin/smartctl
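If you want to double-check the rule before involving Nagios, you can (as root) try running smartctl through sudo as the nagios user. With the rule in place it should print the SMART information without prompting for a password (the -s flag forces a shell in case the nagios user does not have one):

# su -s /bin/bash nagios -c 'sudo /usr/sbin/smartctl --info /dev/sda'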
Now that the Nagios user is allowed to run the smartctl program we can begin to set up our Nagios checks.
Set up Nagios checks
Open your Nagios commands.cfg file for editing and enter the following check command:
# Improved disk SMART status
define command{
        command_name    check_smart
        command_line    $USER1$/check_smart.pl -d $ARG1$ -i $ARG2$ $ARG3$
        }
Now open up your Nagios config file which contains your server service checks for editing and create a new service definition:
define service{
        use                     disk-check-service
        host_name               localhost
        service_description     Check SMART status on /dev/sda
        check_command           check_smart!/dev/sda!ata
        }
The $ARG3$ on the check command will let you pass any extra parameters to the check_smart command you want. Verify your Nagios config files and then if they pass with no errors or warnings go ahead and restart Nagios.
You should now have a service check for the SMART status of /dev/sda on your Nagios server. Go and run the check through the web browser and confirm that it returns the same data as you saw when you ran it from the command line earlier. If all is well add service definitions for all local disks in the server.
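For example, the definition for the second disk in our mirror is identical apart from the service description and the device passed to the check command:

define service{
        use                     disk-check-service
        host_name               localhost
        service_description     Check SMART status on /dev/sdb
        check_command           check_smart!/dev/sdb!ata
        }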
Monitoring RAID Statuses in Nagios
After looking around quite a bit I came across Elan Ruusamäe’s “check_raid.pl” plug-in for Nagios. This plug-in supports a whole load of different RAID controllers including all the ones we use.
Installing and configuring the plug-in
First I downloaded the latest version of the plug-in file to my Nagios libexec directory and made it executable:
# cd /usr/local/nagios/libexec
# wget https://raw.github.com/glensc/nagios-plugin-check_raid/master/check_raid.pl -O check_raid.pl
# chmod +x ./check_raid.pl
The documentation now states that we should set up our system's sudo rules for using the plug-in. When I ran this step I was told that no sudo rules were needed, but I think that is because this system is just using Linux software RAID and not a hardware controller. Run the following:
# ./check_raid.pl -S
You should now be able to run the plug-in; no extra parameters needed!
# ./check_raid.pl
Set up Nagios checks
Open your Nagios commands.cfg file for editing and enter the following check command:
# 'check_raid' command definition
define command{
        command_name    check_raid
        command_line    $USER1$/check_raid.pl $ARG1$
        }
The $ARG1$ will let you pass any additional parameters to the script, but you may well find that you do not need to use it. Now open your server service checks file again and enter the following service check definition:
define service{
        use                     disk-check-service
        host_name               localhost
        service_description     Check RAID Array Status
        check_command           check_raid
        }
Now verify your Nagios configuration and restart Nagios if all is clear. You are now monitoring the SMART status of your physical disks and the status of your RAID arrays!
Now I am off to go and get this working via NRPE on all our other physical Linux servers!
Credits / Sources
Sources of Information
These sites were of tremendous use in pulling all this together:
- Replacing A Failed Hard Drive In A Software RAID1 Array
- SUSE Doc: Increasing the Size of a Software RAID
- Smartmontools Ubuntu Documentation
- Fork of 2009’s check_smart Nagios plugin by Kurt Yoder
- Nagios/Icinga plugin to check current server’s RAID status
Image Credit
- Boyan Yurukov (with some minor edits by me!)