Impact of failing nodes on system stability

Learn how the number of functional nodes and the current replication factor affect system stability when some of the Cassandra nodes are down.

The replication factor indicates the number of existing copies of each record. The default replication factor is 3, which means that if three or more nodes fail, some data becomes unavailable. At the time of a write operation on a record, Cassandra determines which node will own the record. If all three nodes are unavailable, the write operation fails and writes the Unable to achieve consistency level ONE error to the Cassandra logs.

When three or more nodes are unavailable, some write operations succeed and some fail after a period of several seconds. This causes an increased write time and is the root cause of multiple failures. If an application that performs write operations to a Decision Data Store (DDS) data set does not handle write failures, the system might seem to be functioning correctly, only with a prolonged response time.

Therefore, activities that perform write operations to DDS through the DataSet-Execute method must include the StepStatusFail check-in transition step. The number of failed nodes should then never exceed the replication_factor value, minus 1. Otherwise, the system might behave incorrectly, for example, some write or read operations might fail. If the failed nodes do not become functional, then data might be permanently lost.

You can prevent data loss by determining the maximum affordable number of nodes that can be down at the same time (N), and configuring the replication factor to N+1.

Note: Increasing the replication factor impacts the response times for read and write operations.