
Support Article

Impact on DNodes when multiple nodes are failing

SA-32588

Summary



An environment has 18 PRPC nodes (DNodes). The application is headless (no users) and mostly runs data flows. A server went down due to a mechanical failure and 6 nodes were stopped, including the ADM server nodes. On the other nodes, much higher response times were observed. The response times went back to normal when the 6 nodes were restarted. This led to the following questions:

As the replication factor used is 3 and 6 nodes went down, can it be assumed that some data in the Cassandra cluster was completely unavailable?

Could it be clarified what happens then?

Is the data simply not available, or does PRPC go back to the source, which could explain the high response times since it is slower?

Is there some recommendation for this scenario, such as increasing the replication factor or decreasing the number of nodes per server?

Error Messages



2016-12-09 07:39:02,661 [erClockSynchDaemon-0] [ STANDARD] [ ] [ ] (nternal.PRClusterHazelcastImpl) ERROR - com.hazelcast.core.MemberLeftException: Member xxxx has left cluster!

This was followed soon after by Cassandra errors (extract from vmmocfw08p):
2016-12-09 07:41:39,016 [ WebContainer : 27] [ STANDARD] [ ] [ xxxx:01.01.01] ( impl.cassandra.CassandraDao) ERROR |xxxx - Some cassandra nodes appear to be unavailable.
2016-12-09 07:41:39,067 [ WebContainer : 27] [ STANDARD] [ ] [ xxxx:01.01.01] ( impl.cassandra.CassandraDao) ERROR xxxx - Cassandra members state…..

In addition to this, the following JMS exception was reported on vmgocfw08p:
2016-12-09 07:37:29,820 [ WebContainer : 4] [ STANDARD] [ ] [ xxxx:01.01.01] ( client.jms.JMSHelper) ERROR xxxx - Failed to publish message
javax.jms.JMSException: CWSIA0067E: An exception was received during the call to the method JmsMsgProducerImpl.<constructor>: com.ibm.ws.sib.jfapchannel.JFapConnectionBrokenException: CWSIJ0056E: An unexpected condition caused a network connection to host xxxx on port xx using chain chain_1 to close..
at com.ibm.ws.sib.api.jms.impl.JmsMsgProducerImpl.<init>(JmsMsgProducerImpl.java:456)

Batch node vmmocfw03p reported that it was unable to contact the ADM server (now known to be due to the failure of the ESX host hosting the ADM VM):
2016-12-09 07:36:10,280 [.PRPCWorkManager : 3] [ STANDARD] [ ] [ xxx:07.10] ( client.impl.ClientImpl) ERROR - Exception thrown while trying to update ADM models (ignored)
org.springframework.remoting.RemoteAccessException: Could not access HTTP invoker remote service at [xxxx/adm7/ADMServer]; nested exception is java.net.SocketTimeoutException: Read timed out
at org.springframework.remoting.httpinvoker.HttpInvokerClientInterceptor.convertHttpInvokerAccessException(HttpInvokerClientInterceptor.java:216)


Steps to Reproduce



Have an environment with 18 DNodes running data flows, and have 6 of them become unavailable due to mechanical failure.

Root Cause



Not Applicable

Resolution



Replication factor describes how many copies of the data exist. By default, PRPC uses a replication factor of three, which means that if 3 or more nodes fail, some data won't be available.
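As an illustration only (this is not part of the product's code; it assumes direct use of the DataStax Java driver 3.x and a hypothetical keyspace name "data"), the replication settings of the DDS keyspace could be inspected as follows to confirm the replication factor in use:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.KeyspaceMetadata;

public class ShowReplication {
    public static void main(String[] args) {
        // Connect to any reachable Cassandra node and read the keyspace metadata.
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build()) {
            KeyspaceMetadata ks = cluster.getMetadata().getKeyspace("data");
            // Prints the replication map, e.g. {class=..., replication_factor=3}
            System.out.println(ks.getReplication());
        }
    }
}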

When PRPC writes a record, Cassandra determines which nodes own that record. If all 3 replica nodes are unavailable, the write fails with an "Unable to achieve consistency level ONE" error. PRPC tries to account for temporary failures and retries write operations after a short period of time, eventually failing. When 3 or more nodes are down, some writes go through and some fail after a delay of several seconds. This leads to increased write times and multiple failures. If an application writing to a DDS data set doesn't handle write failures, this may give the false impression that the system behaves correctly, only with higher response times. Activities writing to DDS via the DataSet-Execute method must have a proper StepStatusFail check in the transition step. A minimal sketch of this write-path behavior follows below.
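The sketch below is not the PRPC implementation; it is a minimal example, assuming the DataStax Java driver 3.x and a hypothetical table "records" in a keyspace "data", of a write at consistency level ONE that is retried a few times and ultimately fails with an UnavailableException when no replica owning the partition is reachable:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;
import com.datastax.driver.core.Statement;
import com.datastax.driver.core.exceptions.UnavailableException;

public class DdsWriteSketch {
    public static void main(String[] args) throws InterruptedException {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("data")) {
            Statement write = new SimpleStatement(
                    "INSERT INTO records (id, payload) VALUES ('rec-1', 'value')")
                    .setConsistencyLevel(ConsistencyLevel.ONE);
            // Retry a few times to ride out a transient failure, then give up,
            // roughly mirroring the retry-then-fail behavior described above.
            for (int attempt = 1; attempt <= 3; attempt++) {
                try {
                    session.execute(write);
                    return; // write succeeded
                } catch (UnavailableException e) {
                    System.err.println("Attempt " + attempt + " failed: " + e.getMessage());
                    Thread.sleep(1000L * attempt);
                }
            }
            throw new IllegalStateException("Write failed: no replica available");
        }
    }
}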

Similar behavior is observed when reading data from a DDS data set. The number of failed nodes should never exceed replication_factor - 1; otherwise the system will behave incorrectly: some data cannot be read and some new data cannot be written. If the failed nodes never come back, some data will be permanently lost. If the maximum affordable number of nodes down at the same time is N, then the replication factor must be set to N+1. Keep in mind that increasing the replication factor will increase read/write response times.
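If the decision is to raise the replication factor, the change itself is a keyspace alteration, after which a full repair must be run so the additional replicas are actually populated. The sketch below is illustrative only (the keyspace name "data", SimpleStrategy, and direct driver usage are assumptions, not taken from this article) and shows what setting the factor to N+1 could look like:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class RaiseReplicationFactor {
    public static void main(String[] args) {
        int maxNodesDown = 3;                      // N: nodes that may be down at the same time
        int replicationFactor = maxNodesDown + 1;  // N + 1, per the rule above
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect()) {
            // Alter the keyspace; a repair is still required afterwards to populate the new replicas.
            session.execute("ALTER KEYSPACE data WITH replication = "
                    + "{'class': 'SimpleStrategy', 'replication_factor': " + replicationFactor + "}");
        }
    }
}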

Published February 21, 2017 - Updated October 8, 2020
