We have a number of our physical Linux servers set up to use Linux MD RAID to provide either RAID 1 or RAID 5 fault tolerance on our disks. This is all great so long as it is working as expected! I came into work to find that, after a reboot following a kernel update, one of our servers could not bring up its swap device. The swap partition was a RAID 1 array made up of two mirrored disks.
I began to look at mdadm to find out what was wrong. Running:
# cat /proc/mdstat
revealed that one of the drives had failed, putting both arrays into degraded mode. To make matters worse, the only remaining good disk had now developed errors in the partition used for swap! Thankfully, the second array, which held the system files, was still online, albeit in a degraded state.
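As a rough illustration of what to look for in that output, here is a minimal sketch that lists arrays whose member status shows a missing disk (an underscore, as in `[U_]`). The function name `list_degraded_arrays` is my own invention, and the sketch assumes the usual /proc/mdstat layout where each array stanza starts with a line such as `md1 : active raid1 ...`.

```shell
#!/bin/sh
# Hypothetical helper (not part of mdadm): print the name of every MD array
# whose status field contains "_", which marks a missing or failed member.
# Reads /proc/mdstat by default, or a file given as the first argument.
list_degraded_arrays() {
    awk '
        # A stanza starts with the array name, e.g. "md1 : active raid1 ..."
        /^md[^ ]* :/ { name = $1 }
        # The status field looks like [UU] when healthy and [U_] when degraded
        /\[[U_]*_[U_]*\]/ && name != "" { print name; name = "" }
    ' "${1:-/proc/mdstat}"
}
```

Running `list_degraded_arrays` with no arguments on an affected server should print one array name per line; no output means no array reported a missing member.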
So the first thing to do was to get a new disk into the array and synchronise the data onto it. After that, I needed to remove the remaining original disk and replace that too. Once all that was done and the data had re-synchronised onto both new disks, I wanted to look at how we could increase our monitoring of disks so that we don’t get into this situation again!
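The general shape of that plan can be sketched as a dry run that only prints the mdadm commands rather than executing them. This is a generic outline and not necessarily the exact commands used on this server: the function name `print_disk_swap_plan` and the device names in the usage line are hypothetical, and the partitions on the new disk are assumed to exist already.

```shell
#!/bin/sh
# Hypothetical dry-run helper: prints (does not run) the mdadm steps for
# adding a new partition to an array and retiring an old one.
# Usage: print_disk_swap_plan ARRAY NEW_PARTITION OLD_PARTITION
print_disk_swap_plan() {
    array=$1
    new_part=$2
    old_part=$3
    # 1. Add the new partition so the array starts rebuilding onto it
    echo "mdadm --manage $array --add $new_part"
    # 2. Block until the resync/recovery has finished
    echo "mdadm --wait $array"
    # 3. Mark the old partition as failed, then take it out of the array
    echo "mdadm --manage $array --fail $old_part"
    echo "mdadm --manage $array --remove $old_part"
}
```

For example, `print_disk_swap_plan /dev/md1 /dev/sdc2 /dev/sda2` prints the four commands for a made-up array and pair of partitions, which can be reviewed before anything is run as root. Printing first is a deliberate choice here, since failing the wrong member of a degraded mirror would take the array offline.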