tl;dr - Many people love 2-node clusters because they seem conceptually simpler and 33%
cheaper, but while it’s possible to construct good ones, most will have subtle
The first step towards creating any HA system is to look for and try to
eliminate single points of failure, often abbreviated as
It is impossible to eliminate all risk of downtime and especially when one
considers the additional complexity that comes with introducing additional
redunancy, concentrating on single (rather than chains of related and therefor
decreasingly probable) points of failure is widely accepted as a suitable
The natural starting point then is to have more than one node. However before
the system can move services to the surviving node after a failure, in
general, it needs to be sure that they are not still active elsewhere.
So not only are we looking for SPoFs, but we are also looking to balance risks
and consequences and the calculus will be different for every deployment 
There is no downside if a failure causes both members of a two node cluster to
serve up the same static website. However its a very different story if it
results in both sides independently managing a shared job queue or
providing uncoordinated write access to a replicated database or shared
So in order to prevent a single node failure from corrupting your data or
blocking recovery, we rely on something called fencing.
At it s heart, fencing turns a question Can our peer cause data
corruption? into an answer no by isolating it both from incoming
requests and persistent storage. The most common approach to fencing is to
power off failed nodes.
There are two categories of fencing which I will call direct and indirect
but could equally be called active and passive. Direct methods involve
action on the part of surviving peers, such interacting with an IPMI or iLO
device, whereas indirect relies on the failed node to somehow recognise it is
in an unhealthy state (or is at least preventing remaining members from
recovering) and signal a hardware
watchdog to panic the machine.
Quorum helps in both these scenarios.
In the case of direct fencing, we can use it to prevent fencing races when the
network fails. By including the concept of quorum, there is enough
information in the system (even without connectivity to their peers) for nodes
to automatically know whether they should initiate fencing and/or recovery.
Without quorum, both sides of a network split will rightly assume the other is
dead and rush to fence the other. In the worst case, both sides succeed
leaving the entire cluster offline. The next worse is a death match , a
never ending cycle of nodes coming up, not seeing their peers, rebooting them
and initiating recovery only to be rebooted when their peer goes through the
The problem with fencing is that the most commonly used devices become
inaccessible due to the same failure events we want to use them to recover
from. Most IPMI and iLO cards both loose power with the hosts they control and
by default use the same network that is causing the peers to believe the
others are offline.
Sadly the intricacies of IPMI and iLo devices is rarely a consideration at the
point hardware is being purchased.
Quorum is also crucial for driving indirect fencing and, when done right, can
allow survivors to safely assume that missing nodes have entered a safe state
after a defined period of time.
In such a setup, the watchdog’s timer is reset every N seconds unless quorum
is lost. If the timer (usually some multiple of N) expires, then the machine
performs an ungraceful power off (not shutdown).
This is very effective but without quorum to drive it, there is insufficient
information from within the cluster to determine the difference between a
network outage and the failure of your peer. The reason this matters is that
without a way to differentiate between the two cases, you are forced to choose
a single behaviour mode for both.
The problem with choosing a single response is that there is no course of
action that both maximises availability and prevents corruption.
If you choose to assume the peer is alive but it actually failed, then the
cluster has unnecessarily stopped services.
If you choose to assume the peer is dead but it was just a network outage,
then the best case scenario is that you have signed up for some manual
reconciliation of the resulting datasets.
No matter what heuristics you use, it is trivial to construct a single failure
that either leaves both sides running or where the cluster unnecessarily shuts
down the surviving peer(s). Taking quorum away really does deprive the cluster
of one of the most powerful tools in its arsenal.
Given no other alternative, the best approach is normally to sacrificing
availability. Making corrupted data highly available does no-one any good and
manually reconciling diverant datasets is no fun either.
Quorum sounds great right?
The only drawback is that in order to have it in a cluster with N members, you
need to be able to see
N/2 + 1 of your peers. Which is impossible in a two
node cluster after one node has failed.
Which finally brings us to the fundamental issue with two-nodes:
quorum does not make sense in two node clusters, and
without it there is no way to reliably determine a course of action that
both maximises availability and prevents corruption
Even in a system of two nodes connected by a crossover cable, there is no way
to conclusively differentiate between a network outage and a failure of the
other node. Unplugging one end (who’s likelihood is surely proportional to the
distance between the nodes) would be enough to invalidate any assumption that
link health equals peer node health.
Making Two Nodes Work
Sometimes the client can’t or wont make the additional purchase of a third
node and we need to look for alternatives.
Option 1 - Add a Backup Fencing Method
A node’s iLO or IPMI device represents a SPoF because, by definition, if it
fails the survivors cannot use it to put the node into a safe state. In a
cluster of 3 nodes or more, we can mitigate this a quorum calculation and a
hardware watchdog (an indirect fencing mechanism as previously discussed). In
a two node case we must instead use network power switches (aka. power
distribution units or PDUs).
After a failure, the survivor first attempts to contact the primary (the
built-in iLO or IPMI) fencing device. If that succeeds, recovery proceeds as
normal. Only if the iLO/IPMI device fails is the PDU invoked and assuming it
succeeds, recovery can again continue.
Be sure to place the PDU on a different network to the cluster traffic,
otherwise a single network failure will prevent access to both fencing devices
and block service recovery.
You might be wondering at this point… doesn’t the PDU represent a single
point of failure? To which the answer is “definitely“.
If that risk concerns you, and you would not be alone, connect both peers to
two PDUs and tell your cluster software to use both when powering peers on and
off. Now the cluster remains active if one PDU dies, and would require a
second fencing failure of either the other PDU or an IPMI device in order to
Option 2 - Add an Arbitrator
In some scenarios, although a backup fencing method would be technically
possible, it is politically challenging. Many companies like to have a degree
of separation between the admin and application folks, and security conscious
network admins are not always enthusiastic about handing over the usernames
and passwords to the PDUs.
In this case, the recommended alternative is to create a neutral third-party
that can supplement the quorum calculation.
In the event of a failure, a node needs to be able to see ether its peer or
the arbitrator in order to recover services. The arbitrator also includes to
act as a tie-breaker if both nodes can see the arbitrator but not each other.
This option needs to be paired with an indirect fencing method, such as a
watchdog that is configured to panic the machine if it looses connection to
its peer and the arbitrator. In this way, the survivor is able to assume with
reasonable confidence that its peer will be in a safe state after the watchdog
The practical difference between an arbitrator and a third node is that the
arbitrator has a much lower footprint and can act as a tie-breaker for more
than one cluster.
Option 3 - More Human Than Human
The final approach is for survivors to continue hosting whatever services they
were already running, but not start any new ones until either the problem
resolves itself (network heals, node reboots) or a human takes on the
responsibility of manually confirming that the other side is dead.
Did I already mention you could add a third node?
We test those a lot :-)
For the sake of argument, lets imagine I’ve convinced you the reader on the
merits of a third node, we must now consider the physical arrangement of the
nodes. If they are placed in (and obtain power from), the same rack, that too
represents a SPoF and one that cannot be resolved by adding a second rack.
If this is surprising, consider what happens when the rack with two nodes
fails and how the surviving node would differentiate between this case and a
The short answer is that it can’t and we’re back to having all the problems of
the two-node case. Either the survivor:
ignores quorum and incorrectly tries to initiate recovery during network
outages (whether fencing is able to complete is a different story and depends
on whether PDU is involved and if they share power with any of the racks), or
respects quorum and unnecessarily shuts itself down when its peer fails
Either way, two racks is no better than one and the nodes must either be given
independant supplies of power or be distributed accross three (or more
depending on how many nodes you have) racks.
By this point the more risk averse readers might be thinking about disaster
recovery. What happens when an asteroid hits the one datacenter with our three
nodes distributed across three different racks? Obviously Bad Things(tm) but
depending on your needs, adding a second datacenter might not be enough.
Done properly, a second datacenter gives you a (reasonably) up-to-date and
consistent copy of your services and their data. However just like the two-
node and two-rack scenarios, there is not enough information in the system to
both maximise availability and prevent corruption (or diverging datasets).
Even with three nodes (or racks), distributing them across only two
datacenters leaves the system unable to reliably make the correct decision in
the (now far more likely) event that the two sides cannot communicate.
Which is not to say that a two datacenters solution is never appropriate. It
is not uncommon for companies to want a human in the loop before taking the
extraordinary step of failing over to a backup datacenter. Just be aware that
if you want automated failure, you’re either going to need a third datacenter
in order for quorum to make sense (either directly or via an arbitrator) or
find a way to reliably power fence an entire datacenter.
 Not everyone needs redundant power companies with
independent transmission lines. Although the paranoia paid off for at least
one customer when their monitoring detected a failing transformer. The customer
was on the phone trying to warn the power company when it finally blew.