It’s been a while… So, for Day 61 of #100daysofhomelab, I thought I should write up how to swap a disk in a Hetzner Dedicated Machine.
I have a dedicated server I rent from Hetzner in Germany. It has a Xeon E5-1650 v2 processor (6 cores, 12 threads, 3.5GHz base, 3.9GHz turbo), 128GB RAM, and a pretty impressive fifteen 6TB HDDs. All drives are hooked to a MegaRAID controller, but because I am running Proxmox, I left it in JBOD mode and set up the 15 drives in RAIDZ-2. All 15 drives are in a single pool (probably not ideal, but it works for me). Now and again, I get a message from Proxmox telling me about bad blocks… and every time it happens, I have to remember what to do to find the bad drive, report it to Hetzner, wait for them to replace the drive and then add it back to the pool… Today, it happened again, so I thought I'd better document it, to help future me, and hopefully someone else out there…
First, we need to find the drive in question. Usually, in my alerts, I get the Serial Number of the drive causing problems. So, I ran the following command:
megacli -PDList -aAll | egrep "Enclosure Device ID:|Slot Number:|Inquiry Data:|Error Count:|state"
This gives me a full list of drives along with the Slot Number (needed when sending to Hetzner) and the Serial Number. Each drive's block of output starts with “Enclosure Device ID:”, so when you find the Serial Number, look above it for the Slot Number… In my case, the issue is with the disk in Slot 10. I opened a support ticket with Hetzner requesting a replacement disk. It can take an hour or more, but sometimes it's faster. It depends on their load…
Once you get confirmation that the disk has been replaced, you need to swap it into the zpool.
First, we must check if the new drive is set up correctly. Run the following:
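The "find the Serial Number, then look above it for the Slot Number" step can be scripted. Here is a minimal sketch, assuming the `megacli -PDList` output uses the `Enclosure Device ID:`, `Slot Number:` and `Inquiry Data:` field names grepped for above. The function name `find_slot_by_serial` is made up for this post, not part of megacli.

```shell
# Hypothetical helper: reads `megacli -PDList` output on stdin and prints
# "<enclosure>:<slot>" for every drive whose Inquiry Data line contains
# the given serial number (or any other substring of that line).
find_slot_by_serial() {
  awk -v serial="$1" '
    /^Enclosure Device ID:/ { enc = $NF }    # remember the current enclosure
    /^Slot Number:/         { slot = $NF }   # remember the current slot
    /^Inquiry Data:/ && index($0, serial) { print enc ":" slot }
  '
}
```

You would use it like `megacli -PDList -aAll | find_slot_by_serial YOURSERIAL`, where `YOURSERIAL` is the serial from your alert. The result is in the same `enclosure:slot` form that the later `-PhysDrv` argument expects.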
megacli -PDList -a0 | grep Firmware
We are looking for “Firmware state: Online, Spun Up”. If we have anything marked as configured, we need to run the following:
megacli -CfgForeign -Scan -a0
This shows us any foreign configurations. If that’s more than 0, we run:
megacli -CfgForeign -Clear -a0
This clears out that configuration. Next, we need the Enclosure ID and Slot number for the new drive from:
megacli -PDList -aAll | egrep "Enclosure Device ID:|Slot Number:|Inquiry Data:|Error Count:|state"
We need those because the next command is:
megacli -PDMakeGood -PhysDrv [<enclosure>:<slot>] -a0
Finally, run:
megacli -CfgEachDskRaid0 WB RA Direct CachedBadBBU -a0
Note: If that fails with a message about cache data, you may need to run:
megacli -DiscardPreservedCache -L"10" -a0
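This will clear the cache, and then you can run the CfgEachDskRaid0 command again. That command sets up each unconfigured disk as its own single-drive RAID 0 volume, which is how the disks get passed through JBOD-style for ZFS. If you have something different, check the docs from Hetzner below.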
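To avoid retyping the placeholders each time, here is a small sketch that only prints the megacli commands from the steps above with the enclosure and slot filled in. It does not run anything. The function name `print_prep_commands` is made up for this post, and adapter 0 is assumed as the default because that is what my server uses.

```shell
# Hypothetical helper: prints (does NOT execute) the controller prep commands
# for one replaced drive. Arguments: enclosure, slot, optional adapter (default 0).
print_prep_commands() (
  enc="$1"; slot="$2"; adapter="${3:-0}"
  printf 'megacli -CfgForeign -Scan -a%s\n' "$adapter"
  printf 'megacli -CfgForeign -Clear -a%s   # only if the scan found a foreign config\n' "$adapter"
  # The brackets are single-quoted so the shell never treats them as a glob pattern.
  printf "megacli -PDMakeGood -PhysDrv '[%s:%s]' -a%s\n" "$enc" "$slot" "$adapter"
  printf 'megacli -CfgEachDskRaid0 WB RA Direct CachedBadBBU -a%s\n' "$adapter"
)
```

For example, `print_prep_commands 252 10` prints the four commands for enclosure 252, slot 10 (example values, use your own). Review the output, then paste the lines one at a time so you can stop if any of them reports something unexpected.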
Next, we need to swap disks in ZFS. Run
zpool status
to get the info about the missing disk. The missing disk will show as unavailable (UNAVAIL). Next, find the ID of the disk that was added.
cd /dev/disk/by-id/
ls
Find the new disk (it usually won't have any partitions on it). Now, it's a matter of running the following:
zpool replace rpool /dev/disk/by-id/scsi-3600605b008f498802aa37da51674ea7e-part3 /dev/disk/by-id/wwn-0x600605b008f498802b2a3a683752e088
Swap the scsi-36xxx and wwn-0x6xxx parts for the ones you found, and rpool for your ZFS pool name.
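Spotting the new disk in a long `ls` listing by eye is easy to get wrong. Here is a minimal sketch that lists the entries in /dev/disk/by-id that have no `-partN` siblings, on the assumption that a freshly swapped disk has no partitions yet. The function name `disks_without_partitions` is made up for this post.

```shell
# Hypothetical helper: prints every entry in the given directory
# (default /dev/disk/by-id) that has no matching "<name>-part*" entries.
disks_without_partitions() (
  dir="${1:-/dev/disk/by-id}"
  for path in "$dir"/*; do
    [ -e "$path" ] || continue                      # empty directory: nothing to do
    name=${path##*/}
    case "$name" in *-part[0-9]*) continue ;; esac  # skip the partition entries themselves
    set -- "$dir/$name"-part*                       # expands only if partitions exist
    [ -e "$1" ] || printf '%s\n' "$path"
  done
)
```

The same disk usually shows up under more than one name (for example a scsi-… and a wwn-… entry), so expect several lines per disk. Treat the output as a shortlist to check against `zpool status`, not as the final answer.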
Finally, run
zpool status
to see the status. To keep an eye on it, run:
zpool status -v 1
This shows you the status with more info and refreshes every second. ZFS is now resilvering in the background, rebuilding the data onto the new drive. Since the old drive is missing, ZFS waits until the new drive has finished resilvering and then removes the old entry. This can take some time, depending on your disks and data size.
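If you would rather not sit and watch the screen, here is a minimal sketch of a check you can loop on. It assumes `zpool status` prints the phrase "resilver in progress" on its scan line while the resilver is running. The function name `resilver_in_progress` is made up for this post.

```shell
# Hypothetical helper: reads `zpool status` output on stdin and succeeds
# (exit code 0) only while a resilver is still reported as running.
resilver_in_progress() {
  grep -q 'resilver in progress'
}

# Example loop, left commented out so nothing runs by accident.
# Swap rpool for your pool name:
#   while zpool status rpool | resilver_in_progress; do sleep 60; done
#   zpool status rpool
```

The loop checks once a minute and prints the final status once the resilver line is gone. Read that last output properly before assuming all is well, since the loop would also end if the resilver stopped for some other reason.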
Hopefully, this helps someone!
Some links for info:
LSI RAID Controller – Hetzner Docs