Replacing a Failed Disk in a Proxmox VE Hyper-Converged Cluster Without Service Interruption


Situation Description

A Proxmox VE hyper-converged cluster of four nodes has been running continuously for over 500 days. In addition to its system disk, each node has four 2.4 TB 10,000 RPM SAS disks, each configured as a Ceph OSD.

Monitoring revealed that one OSD on one node was in a down state, and attempts to repair it failed. Logging into the system showed numerous I/O errors in the system logs, indicating a likely physical disk failure. The technician on duty at the data center was notified and confirmed the diagnosis: the fault light on one of the disks was solid red.

Failure Repair Plan

The fundamental requirement is that services stay online; nothing can be stopped. Fortunately, Proxmox VE’s decentralized hyper-converged architecture allows any one or more physical servers to be shut down while the cluster keeps running (unlike hyper-converged platforms with a dedicated control node that cannot be shut down).

With downtime no longer a concern, the following plan was made:

1. Add the virtual machines running on the faulty physical machine to HA.

2. Shut down and replace the hard disk.

3. Make sure the system recognizes the new hard disk.

4. Create the OSD.

5. Migrate some virtual machines back to the repaired physical node.

Failure Repair Implementation

The following steps were carried out according to the plan:

1. In the web management interface, record the IDs of the virtual machines running on the faulty machine, then add them to Proxmox VE’s HA (HA is a separate layer built on top of the PVE cluster, not the cluster itself).
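Step 1 can also be scripted from any cluster node. A minimal dry-run sketch, assuming the recorded VM IDs were 101, 102, and 103 (hypothetical values): it only prints the `ha-manager` commands rather than executing them, so it is safe to review first.

```shell
# VM IDs recorded from the web UI for the faulty node (hypothetical example values)
vmids="101 102 103"

# Dry run: build the ha-manager command for each VM instead of executing it.
# On a real cluster, `ha-manager add vm:<vmid>` registers the VM as an HA resource.
cmds=$(for id in $vmids; do echo "ha-manager add vm:$id"; done)
echo "$cmds"
```

Removing the `echo` wrapper and running each command directly performs the actual registration.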

2. Shut down the faulty machine and check if all the virtual machines have automatically migrated (compare with the recorded VM IDs).

3. Notify the standby technician at the data center to remove the faulty disk and insert the new one. Start the system and check whether it recognizes the new disk. As expected, it was not recognized, so it was necessary to enter the RAID controller interface, configure the new disk as a single-disk RAID 0 (RAID 5 is strongly discouraged for Ceph OSDs), and restart. The disk should now be recognized; confirm with `lsblk` (a new, unformatted disk will not show up in `df -h`, which only lists mounted filesystems).

4. Execute the following commands to initialize the newly replaced disk:

    wipefs -af /dev/sdc  # 'sdc' is the device name of the new disk
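To see what `wipefs` actually does before pointing it at a real disk, the same sequence can be rehearsed safely on a disposable image file; this sketch plants a fake DOS boot signature (as a previously used disk might carry) and then erases it. No root privileges or real device are needed.

```shell
# Safe demonstration of step 4 on a throwaway image file instead of /dev/sdc
img=$(mktemp)
truncate -s 1M "$img"
# Plant a fake DOS boot signature (0x55 0xAA at byte offset 510)
printf '\125\252' | dd of="$img" bs=1 seek=510 conv=notrunc 2>/dev/null
before=$(wipefs "$img")          # lists the detected signature
wipefs -af "$img" >/dev/null     # erases all signatures, exactly as on the real disk
after=$(wipefs "$img")           # prints nothing once the device is clean
echo "before: $before"
echo "after: ${after:-<clean>}"
rm -f "$img"
```

Running plain `wipefs /dev/sdc` (without `-af`) first is a good habit: it reports what would be erased without touching the disk.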

5. In the web management interface, create the OSD. If the drop-down list indicates “no unused disk,” repeat step 4.
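The same step is available from the shell via Proxmox VE's `pveceph` wrapper around ceph-volume. A dry-run sketch (it only prints the command; `/dev/sdc` is the example device from step 4, and the real command must be run as root on the repaired node):

```shell
# Dry-run sketch of the CLI equivalent of the web UI's "Create: OSD" dialog
DEV=/dev/sdc                     # device name of the new disk, as in step 4
cmd="pveceph osd create $DEV"
echo "$cmd"
```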

6. Refresh the page to check if the new OSD has been correctly added. Additionally, verify synchronization by executing the following command:

    ceph osd tree
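In a healthy tree, every OSD row shows STATUS `up`. The check can be scripted; the sample output below is illustrative (IDs, hosts, and weights are made up for this sketch), but the same filter works on real `ceph osd tree` output.

```shell
# Illustrative `ceph osd tree` output (made-up sample values)
tree='ID CLASS WEIGHT  TYPE NAME      STATUS REWEIGHT PRI-AFF
-1       34.93  root default
-3        8.73      host pve1
 0  hdd   2.18          osd.0      up   1.00000  1.00000
 1  hdd   2.18          osd.1      up   1.00000  1.00000'
# Flag any OSD whose STATUS column is not "up"; empty output means all is well
down=$(echo "$tree" | awk '$4 ~ /^osd\./ && $5 != "up" { print $4 }')
echo "${down:-all OSDs up}"
```

On a live cluster, replace the sample variable with `tree=$(ceph osd tree)`. Note that the new OSD may show `up` while backfill is still in progress; `ceph -s` reports when the data has fully rebalanced.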

7. Migrate some virtual machines back to the repaired physical node. This can be done with a few clicks and does not require further elaboration.

