How to Handle NODE FAILURE in a Proxmox Cluster?


PVE1, PVE2, and PVE3 form a cluster using Ceph storage. PVE1 has failed and now needs to be removed from the cluster.

Preparatory Steps

1. Ensure that all virtual machines on PVE1 have been migrated to other nodes.

2. Ensure that the data on PVE1 has been backed up and can be deleted completely.
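
If PVE1 is still reachable, the following is a minimal sketch of doing this from the command line (the guest IDs 100 and 101 and the target node pve2 are placeholders; adjust them to your setup):

qm list # List the VMs still registered on PVE1

qm migrate 100 pve2 --online # Live-migrate VM 100 to PVE2

pct migrate 101 pve2 # Migrate container 101 to PVE2

vzdump 100 --storage local --mode snapshot # Optionally back up VM 100 before touching the node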

Remove the PVE1 Node

# First, run the following on PVE1, the node being removed, to stop its cluster services:

systemctl stop pve-cluster.service

systemctl stop corosync.service

pmxcfs -l # Force it into local mode

cd /etc/pve/

rm corosync.conf

rm -rf /etc/corosync/*

killall pmxcfs

systemctl start pve-cluster.service

rm -rf /etc/pve/nodes/pve1

# Refresh the browser.

# Then, on one of the remaining nodes (e.g. PVE2), run the following to remove PVE1 from the cluster:

pvecm delnode pve1

After refreshing the browser, PVE1 will have left the cluster.  

Log in to the PVE1 web interface to confirm that it is no longer part of the cluster.
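
To double-check from the command line, a quick sketch (run on any remaining node, e.g. PVE2):

pvecm nodes # PVE1 should no longer appear in the membership list

pvecm status # Quorum should still be reached with the remaining nodes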

# The following is for an old version of the tutorial and can be ignored:  

Log in to PVE2 (for example via the shell in its web interface) and run:

pvecm delnode pve1 # Force PVE1 out

# Proxmox VE clusters use a majority (quorum) voting mechanism. With three nodes, for example, the two healthy nodes form a majority and can therefore force out the failed one.

# If the cluster has only two nodes, one node cannot force out the other due to a lack of majority; it will result in an error.  

# In such a case, you can reduce the quorum threshold with:

pvecm expected 1

Then:

pvecm delnode pve1
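
As a sketch of verifying the result at the corosync level (corosync-quorumtool ships with corosync on PVE):

corosync-quorumtool -s # Shows expected votes, total votes, and whether the remaining partition is quorate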

# Refresh the browser:

cd /etc/pve/priv

vi authorized_keys # Remove PVE1's public key entry

vi known_hosts # Remove PVE1's entries; note that both the entry keyed by its hostname and the one keyed by its IP need to be deleted.
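
Instead of editing known_hosts by hand, the entries can also be removed with ssh-keygen; a sketch, where 192.168.1.11 is a placeholder for PVE1's IP:

ssh-keygen -R pve1 -f /etc/pve/priv/known_hosts # Remove the entry keyed by hostname

ssh-keygen -R 192.168.1.11 -f /etc/pve/priv/known_hosts # Remove the entry keyed by IP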

ls -l /etc/pve/nodes/ # Check if the directory for PVE1 still exists

mv /etc/pve/nodes/pve1 /root/pve1 # Remove PVE1 from the web panel

# Refresh the browser; PVE1 will have disappeared from the cluster, and cluster health will be maintained.  

# Next, you may find that PVE1 still appears in Proxmox Ceph.

Remove PVE1 from Ceph

1. Remove the PVE1 metadata server from CephFS.

2. Delete the OSD nodes associated with the PVE1 node.

In the OSD panel, you can see the relationship between PVE nodes and OSD disks.  

To remove an OSD, e.g., `osd.0`, proceed with the following:

ceph osd out osd.0

ceph osd down osd.0

systemctl stop ceph-osd@0

ceph osd tree

ceph osd crush remove osd.0

ceph auth del osd.0

ceph osd rm osd.0

ceph osd tree

# Similarly, replace `0` with `1`, `2`, `3`, etc., to delete other OSD disks.  
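
As a sketch, assuming osd.1 and osd.2 were the other OSDs on PVE1, the same sequence can be wrapped in a small shell loop:

for id in 1 2; do
    ceph osd out osd.$id
    systemctl stop ceph-osd@$id
    ceph osd crush remove osd.$id
    ceph auth del osd.$id
    ceph osd rm osd.$id
done

ceph osd tree # Confirm that none of PVE1's OSDs are listed any more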

# After deletion, it’s best to run:

wipefs -af /dev/sdb # Replace /dev/sdb with the actual disk that backed the OSD

This wipes the filesystem and partition signatures from the disk, so that Ceph can later reuse it as a fresh OSD.
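
If it is unclear which device backed a given OSD, a sketch for looking it up before wiping (run on the node that hosted the OSD):

ceph-volume lvm list # Maps each OSD ID to its underlying block device

lsblk -f /dev/sdb # After wiping, the FSTYPE column for the disk should be empty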

Note: This only removes OSD disks associated with the PVE1 node, but PVE1 may still appear in the panel.

3. Remove the PVE1 node from Ceph monitors:

ceph mon dump # View Ceph monitor information

systemctl stop ceph-mon@pve1.service # This stops the PVE1 monitor service process.

systemctl status ceph-mon@pve1.service # Verify that it has stopped

ceph mon remove pve1 # Remove the PVE1 monitor

ceph mon dump # Confirm the monitor was successfully removed; there should be no info displayed for PVE1.

ceph osd crush remove pve1 # Remove the PVE1 host bucket from the CRUSH map; this clears the leftover entry from step 2.

On other nodes, remove any remnants of the PVE1 node from `ceph.conf`:

vim /etc/ceph/ceph.conf
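
What exactly needs to be removed depends on the setup, but as a sketch (the 192.168.1.x addresses are placeholders): drop PVE1's IP from the mon_host line in the [global] section and delete PVE1's per-monitor section if one exists.

# Before: mon_host = 192.168.1.11 192.168.1.12 192.168.1.13
# After:  mon_host = 192.168.1.12 192.168.1.13
# Also delete the whole [mon.pve1] section (including its public_addr line) if it is present.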

4. Wait a few minutes.

PVE1 should automatically disappear from the Ceph OSD view.  

However, in some experiments, PVE1 did not automatically disappear, suggesting additional operations may be needed. Further inquiries can be directed to ChatGPT.

Bug observed: the CephFS metadata server (MDS) entry for PVE1 could not be removed; if PVE1 is repaired and rejoins the cluster, adding it back to CephFS then fails with errors.

Solution:

pveceph mds destroy pve1 # This allows PVE1 to rejoin CephFS.
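
To verify, a quick sketch of checking the metadata server state:

ceph mds stat # The active/standby list should no longer reference pve1

ceph fs status # Overall CephFS status, including which MDS daemons are serving it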

5. Reinstall the PVE1 node system, rejoin the Proxmox cluster, and reconfigure Ceph. Test afterward.

On the physical machine that hosted the PVE1 node, boot a USB PE (preinstallation environment) tool and delete all existing disk partitions with partitioning software.

Then, install Proxmox from a USB for a clean installation, avoiding any errors.  
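
A minimal sketch of rejoining the reinstalled node and rebuilding its Ceph services, assuming 192.168.1.12 is the IP of an existing cluster member and /dev/sdb is the wiped OSD disk (adjust both to your environment); run these on the new PVE1:

pvecm add 192.168.1.12 # Join the existing cluster via one of its members

pveceph install # Install the Ceph packages on the new node

pveceph mon create # Recreate the PVE1 monitor

pveceph osd create /dev/sdb # Recreate the OSD on the wiped disk

pveceph mds create # Recreate the metadata server if CephFS is used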

Testing confirmed that reinstalling and rejoining the cluster initially produced errors, but after rebooting all nodes and waiting for 10 minutes, health was restored.

Finally, add the reinstalled node to the High Availability (HA) group. Testing with fault scenarios confirmed HA functionality was normal.
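
A sketch of re-adding guests to HA from the command line (the group name ha-group1 and VM ID 100 are placeholders):

ha-manager groupadd ha-group1 --nodes pve1,pve2,pve3 # Create (or reuse) an HA group covering all nodes

ha-manager add vm:100 --group ha-group1 # Put VM 100 under HA management in that group

ha-manager status # Confirm the resources are started and being managed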

This concludes the fault repair process.

