    proxmox
    high-availability
    clustering
    maintenance
    corosync

    Put Your Cluster on Ice: The One Step You Can't Forget in Proxmox HA

    January 17, 2026
    7 min read
There's a special kind of panic that only comes when your Proxmox cluster spontaneously reboots in the middle of what was supposed to be a routine hardware upgrade. One moment, everything's humming along. The next, it's chaos, and you're left trying to piece together what happened while your pride quietly crawls away into a corner.

That's exactly what went down for one Proxmox user when they decided to overhaul the power delivery in their server rack. They had a solid plan. The goal? Swap out some gear, power cycle a 48-port Dell switch, and get back to business. But they forgot one crucial step: putting High Availability (HA) into maintenance mode.

## The Chain Reaction No One Wants

Let's unpack what went wrong. In this setup, all Corosync cluster communication was flowing through that single 1G Dell switch. When it rebooted, the Corosync communication dropped like a rock, and Proxmox did exactly what it was designed to do: it triggered HA, assumed nodes were down, and started migrating or rebooting VMs in a frantic attempt to keep the services alive. The result? A full cluster-wide reboot event that no one saw coming.

To their credit, the OP wasn't blaming the system. They knew about HA maintenance mode. They just... forgot. One tiny omission in an otherwise well-thought-out plan turned into a not-so-fun lesson in how brittle a misconfigured or incomplete setup can be.

## So, What's HA Maintenance Mode and Why Does It Matter?

If you're managing a Proxmox cluster and using HA, this is the part you can't skip. Maintenance mode in this context means telling the cluster to chill, to not react if it thinks a node has gone down. You're intentionally silencing the watchdog so it doesn't try to save you from yourself while you're doing planned work.

Without enabling this, Proxmox's HA manager sees the node as failing when it loses network contact and starts trying to "help" by restarting critical VMs on other nodes, sometimes with unintended and disastrous results.

The correct move here is simple: before any operation that could disrupt cluster communication, especially anything touching switches or power to networking gear, flip those HA resources into "ignored" mode.

## Real-World Wisdom From the Trenches

This wasn't an isolated incident. In fact, it lit up a lively discussion among other Proxmox admins, many of whom had the exact same story. A few shared their own experiences with unexpected cluster behavior when they didn't disable HA properly, especially during upgrades, power swaps, or network changes.

And it gets more frustrating: there's no GUI option to put HA into maintenance mode. As one user put it, "From 2023, how is this still not a feature?" Fair point. Proxmox has made strides in bringing more features into the UI, but this corner of HA management is still CLI-only, at least for now.

The command that would've saved the day?

```bash
ha-manager set vm:<VMID> --state ignored
```

You can run the same command for every HA-managed VM to keep Proxmox from trying to migrate or restart them.

Some confusion popped up in the thread too. One user tried putting the node itself into maintenance mode, thinking that was enough. It's not. You need to disable HA at the resource level, not just the node level. Putting a node into maintenance mode drains its VMs, sure, but it doesn't stop HA from trying to move them around. What the OP needed was to stop HA from doing anything.
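If you have more than a couple of HA-managed guests, running that command by hand for each one is exactly the kind of step that gets forgotten. Here's a minimal wrapper sketch, assuming hypothetical VMIDs 101-103 that are already registered as HA resources; adjust the list and the post-maintenance state to your own setup:

```bash
#!/usr/bin/env bash
# Hypothetical list of HA-managed VMIDs -- replace with your own.
VMIDS="101 102 103"

case "${1:-}" in
  freeze)
    # Before maintenance: tell the HA manager to ignore these resources.
    for id in $VMIDS; do
      ha-manager set "vm:$id" --state ignored
    done
    ;;
  thaw)
    # After maintenance: hand the resources back to the HA manager.
    for id in $VMIDS; do
      ha-manager set "vm:$id" --state started
    done
    ;;
  *)
    echo "Usage: $0 {freeze|thaw}" >&2
    exit 1
    ;;
esac
```

Run `freeze` before the window, confirm with `ha-manager status`, and `thaw` only once Corosync is healthy again. The `--state started` on thaw assumes you want those guests running afterward; pick whatever state actually matches your cluster.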
## Designing for Resilience (aka "Don't Be Me")

There's a bigger lesson here beyond just flipping HA settings. A number of users chimed in with infrastructure tips that can help you prevent this kind of cluster freakout in the first place:

- **Use multiple Corosync interfaces.** Proxmox supports this, and having redundant paths (e.g., bonded interfaces across different switches) can mean the difference between seamless uptime and total cluster failure.
- **Avoid single points of failure.** One switch running all your Corosync traffic is asking for trouble. Dual-fabric setups are recommended, with VLAN separation for Corosync ring 0 and ring 1 traffic.
- **Active-backup bonding is a popular choice** for many because it avoids the complexity of LACP and doesn't require stacking or advanced switch features.
- **Avoid routing Corosync through firewalls.** When that firewall reboots or hiccups, your cluster could lose quorum or worse. Keep Corosync traffic on isolated, non-routed VLANs.

One user laid out a rock-solid example: two 10G switches, two interfaces per node bonded in active/backup, and dedicated, isolated VLANs for each ring. When they accidentally dropped both switches during a UPS migration, the cluster held up: no split brain, no reboot storms, just a single VM that had to be restarted manually.

## So, What Should You Do Before Maintenance?

Here's the mental checklist (a rough pre-flight script is sketched at the end of this post):

1. **Put HA-managed VMs into "ignored" mode.** This tells the cluster to leave them alone, no matter what happens.
2. **If you must reboot switches or anything Corosync touches,** spread that traffic across multiple interfaces/switches first.
3. **Don't assume "maintenance mode" means the same thing everywhere.** Node maintenance ≠ HA maintenance.
4. **Document your plan.** Seriously. Don't trust your memory. It's not about age, it's about margin for error.

## The Bottom Line

Maintenance mode isn't just a button (though it should be one in the GUI). It's a safety net, a reminder that your cluster will only be as reliable as the assumptions you build into it.

So before your next maintenance window, double-check your checklist. Put HA on ice when it needs to be. Because no matter how solid your plan is, if you forget that one toggle, your Proxmox cluster might just pull the rug out from under you, and take your pride with it.
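And because checklists are easier to trust when something walks through them with you, here's the rough pre-flight sketch promised above. It only uses standard Proxmox and Corosync CLI tools (`pvecm`, `corosync-cfgtool`, `ha-manager`); treat it as a starting point, not a guarantee, and read the output yourself before touching anything:

```bash
#!/usr/bin/env bash
# Rough pre-maintenance sanity check -- run on a cluster node before
# touching switches, cabling, or power feeding the Corosync network.

echo "== Quorum and cluster membership =="
pvecm status            # confirm the cluster is quorate before you start

echo "== Corosync link health =="
corosync-cfgtool -s     # every configured link should show as connected

echo "== HA resource states =="
ha-manager status       # anything not set to 'ignored' will still be
                        # acted on by the HA manager during the outage
```

If `ha-manager status` still shows resources under active management, freeze them first. And if `corosync-cfgtool -s` shows only a single link, you're one switch reboot away from repeating the OP's story.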