Split-Brain Syndrome and cluster fracturing FAQs
This article is one in the series of articles that includes the following companion articles:
Managing clusters with Hazelcast (prerequisite)
Troubleshooting Hazelcast cluster management [Restricted Access - Not Available Publicly]
In highly-available clustered environments, you might notice that certain nodes in your cluster cannot see one another. One node appears as if it is the one and only active node. The Cluster Management page does not show all the nodes in your Pega deployment. Upon inspection, you determine that some nodes are in a separate cluster. This error condition is sometimes referred to as Split-Brain Syndrome or cluster fracturing. What causes the cluster to split like this? How can this condition be fixed?
Read the following questions and answers to understand what causes Split-Brain Syndrome and what you can do to fix this problematic condition.
What is Split-Brain Syndrome?
How do you detect Split-Brain Syndrome?
How do you detect actions that cause Split-Brain Syndrome?
Common root causes for cluster fracturing
Under-allocation of resources
System management issues
How do you prevent cluster fracturing?
What steps can be taken to resolve Split-Brain Syndrome?
Split-Brain is a state in which a single cluster of nodes has separated into multiple independent clusters, each operating as if the others no longer exist.
Cluster fracturing is the process by which nodes end up in a Split-Brain state.
In most cases, you should not need to detect or watch for a Split-Brain state. Hazelcast detects these situations and attempts to heal the cluster automatically. In situations where Hazelcast recovers on its own, you might first notice that some remote operations fail; however, service is restored shortly afterward.
In situations where Hazelcast is unable to automatically recover, a Split-Brain merge failure message is reported in the logs. A PEGA0108 critical alert is also sent.
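As a quick first check, you can scan the alert logs for the PEGA0108 alert mentioned above. The function below is a minimal sketch; the log directory and file-name pattern are assumptions for illustration, so adjust them to your environment.

```shell
# scan_for_pega0108: list log files that contain the PEGA0108 alert, then
# print each occurrence with two lines of context for triage.
# The directory argument and the *.log pattern are assumptions; point them
# at wherever your deployment writes its alert logs.
scan_for_pega0108() {
  dir="$1"
  grep -l "PEGA0108" "$dir"/*.log 2>/dev/null
  grep -n -A 2 "PEGA0108" "$dir"/*.log 2>/dev/null
}
```

For example, `scan_for_pega0108 /usr/pega/logs` (the path is a placeholder, not a Pega default).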
There are numerous actions that can cause Split-Brain, including, but not limited to, network outages, GC thrashing, and the over-allocation of hardware resources. Hazelcast provides several APIs for monitoring conditions that might lead to a Split-Brain state, including lost partitions, failed merges, and dropped nodes. Other events that you should consider for monitoring and notification include high memory or CPU usage (or both), GC events, and known network failures.
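As one illustration of those listener APIs, the sketch below registers handlers for dropped members, lost partitions, and failed merges. It is written against Hazelcast 3.x package names (they moved in Hazelcast 4.x), and in a Pega deployment the `HazelcastInstance` is managed by the platform, so treat this as a sketch of the Hazelcast API rather than a supported Pega extension point.

```java
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.LifecycleEvent;
import com.hazelcast.core.LifecycleEvent.LifecycleState;
import com.hazelcast.core.MembershipAdapter;
import com.hazelcast.core.MembershipEvent;
import com.hazelcast.partition.PartitionLostEvent;

public class SplitBrainMonitor {

    /** Registers listeners that surface conditions leading to Split-Brain. */
    public static void register(HazelcastInstance hz) {
        // Detect dropped nodes.
        hz.getCluster().addMembershipListener(new MembershipAdapter() {
            @Override
            public void memberRemoved(MembershipEvent event) {
                System.err.println("Member removed: " + event.getMember());
            }
        });

        // Detect lost partitions.
        hz.getPartitionService().addPartitionLostListener((PartitionLostEvent event) ->
                System.err.println("Partition lost: " + event.getPartitionId()));

        // Detect failed Split-Brain merges.
        hz.getLifecycleService().addLifecycleListener((LifecycleEvent event) -> {
            if (event.getState() == LifecycleState.MERGE_FAILED) {
                System.err.println("Split-brain merge failed");
            }
        });
    }
}
```

Forwarding these events to your monitoring system gives you earlier warning than waiting for the PEGA0108 alert.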
Know how to recognize and isolate the following common causes of cluster fracturing:
- Insufficient disk space: Many operations require disk space, and a node might crash when it runs out.
- High CPU usage: Hazelcast requests must be processed in a reasonable amount of time; otherwise, other nodes in the cluster might incorrectly think that a node has crashed or stopped responding.
- Out of Memory (OOM): Lack of available memory leads to garbage collection (GC), which in turn can cause thrashing, which leads to high CPU usage.
- Over-allocated systems: Even on systems with ample resources, resource spikes between Pega nodes and other applications sharing the same VM might lead to OOM, high CPU usage, and other negative conditions.
- High latency: If the latency between nodes and data centers is too high, requests are not processed in sufficient time.
- Network outage: If the connection between two nodes is severed, Hazelcast cannot resume proper communication.
- Domain or firewall issues: Nodes might also fracture because of incorrect firewall or DMZ settings.
- Cycling nodes: Frequent restarts of nodes cause excess partition migrations and merges, leading to excess memory and CPU usage. Data might also be lost if nodes are shut down ungracefully.
- Long-running processes: Long-running processes on a JVM prevent Hazelcast from processing requests in a reasonable amount of time.
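The first two causes above are quick to rule out on a given node. The helper below is a minimal sketch of such a check; the 90% threshold and the checked path are illustrative assumptions, not Pega recommendations.

```shell
# check_disk: warn when the filesystem holding the given path is nearly full.
# The 90% threshold is an arbitrary example value.
check_disk() {
  path="${1:-.}"
  # POSIX df -P: line 2, field 5 is the capacity percentage (e.g. "43%").
  used=$(df -P "$path" | awk 'NR==2 { gsub(/%/, "", $5); print $5 }')
  if [ "$used" -ge 90 ]; then
    echo "WARN: $path is ${used}% full"
  else
    echo "OK: $path is ${used}% full"
  fi
}
```

Run it against the filesystem that holds the Pega temporary and log directories on each node.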
An event or chain of events has left the cluster in a Split-Brain state. How do you prevent this from occurring again? For each of the causal actions described in the previous section, you need to determine the root cause. In most cases, you need to investigate a number of factors to understand what caused the cluster to deteriorate and fracture, because a single event will not explain the whole picture. You need to examine both the PegaRULES and PegaCLUSTER logs for this work.
For example, one exception you might see is a TargetNotMemberException. This exception indicates that a request was sent to a node that is no longer known to the node raising the exception.
- Why is this node no longer a part of my node's cluster?
- Was the node purposely killed from the cluster? This exception occurs when a remote request is issued, but the target node dropped out of the cluster prior to processing.
- Was the node kicked out of the cluster? To keep the cluster running smoothly, healthy nodes will kick unhealthy nodes out of the cluster.
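One way to work through these questions is to pull the membership changes and the exceptions out of a log file in time order so they can be correlated. In the sketch below, both search patterns are assumptions; the exact wording of membership-removal messages varies by Hazelcast version, so adjust the patterns to what your logs actually contain.

```shell
# correlate_membership: list, in file order, membership-removal messages and
# TargetNotMemberException occurrences from a log file, with line numbers,
# so you can see which removal preceded which exception.
# Both patterns are assumptions; adjust to your Hazelcast version's wording.
correlate_membership() {
  grep -n -E "Removing Member|TargetNotMemberException" "$1"
}
```

If the exceptions cluster shortly after one node's removal, focus the investigation on why that node left.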
As you can see, understanding why a cluster fractured is not a simple task. A node might have been terminated, causing repartitioning to occur. A distressed node might not be responding in a timely manner to other nodes and, consequently, is kicked out of the cluster. A node might also be operating in a healthy manner but became separated from the rest of the cluster because of network issues. Altogether, it is important to look at all the possibilities because cluster fracturing is usually the result of several issues.
For a list of root causes, read the section How do you detect actions that cause Split-Brain Syndrome?.
First and foremost, if you are running a cluster in Pega 7.x, you should upgrade to the latest Hazelcast Edition that is available for the Pega Platform release that you are using. The latest versions of Hazelcast include many stability improvements and bug fixes. Pega has also provided hotfixes for Hazelcast stability with the Pega Platform. See Managing clusters with Hazelcast, Hazelcast Editions supported.
Applying hotfixes will considerably improve stability. However, there are still many factors that can cause a cluster to fracture. Further investigation might be necessary.
To determine the root cause of a fractured cluster, ask yourself the following sequence of questions:
- Do the root causes of previous Split-Brain scenarios apply to this issue?
- While examining the logs, can you tell when the cluster began experiencing issues? Does the time factor correlate with other problems (CPU usage, memory, network issues, node cycling, and so on)?
- Using JVM inspection tools, can you tell if there are issues with the nodes themselves?
- Are there multiple nodes experiencing issues? Or do the logs point to just one or two nodes that are causing instability?
- What educated actions can you take for each scenario? Is manual or automated intervention required?
- Are the correct Hazelcast settings and JVM arguments in use?
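For the last question, the right values depend on your sizing, but the fragment below illustrates the kinds of JVM settings to verify. The heap sizes are placeholders, not Pega recommendations: a fixed min/max heap avoids resize pauses, and GC logging lets you correlate pause times with the cluster drops discussed above.

```shell
# Illustrative JVM arguments only; the 8g heap size is a placeholder for
# your own sizing. Fixed -Xms/-Xmx avoids heap-resize pauses; GC logging
# (JDK 9+ syntax shown; use -verbose:gc on JDK 8) records pause times so
# they can be lined up against node-drop timestamps in the cluster logs.
JAVA_OPTS="$JAVA_OPTS -Xms8g -Xmx8g -Xlog:gc*:file=gc.log"
```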