
Support Article

Impact on DNodes when multiple nodes are failing

SA-32588

Summary



An environment has 18 PRPC nodes (DNodes). The application is headless (no users) and mostly runs data flows. A server went down due to a mechanical failure and 6 nodes were stopped, including the ADM server nodes. On the other nodes, much higher response times were observed. The response times went back to normal when the 6 nodes were restarted. This led to the following questions:

As the replication factor used is 3 and 6 nodes went down, can it be assumed that some data in the Cassandra cluster was completely unavailable?

Could it be clarified what happens then?

Is the data simply not available, or does PRPC go back to the source, which could explain the high response times since it is slower?

Is there some recommendation for this scenario, such as increasing the replication factor or decreasing the number of nodes per server?

Error Messages



2016-12-09 07:39:02,661 [erClockSynchDaemon-0] [ STANDARD] [ ] [ ] (nternal.PRClusterHazelcastImpl) ERROR - com.hazelcast.core.MemberLeftException: Member xxxx has left cluster!

This was followed soon after by Cassandra errors (extract from vmmocfw08p):
2016-12-09 07:41:39,016 [ WebContainer : 27] [ STANDARD] [ ] [ xxxx:01.01.01] ( impl.cassandra.CassandraDao) ERROR |xxxx - Some cassandra nodes appear to be unavailable.
2016-12-09 07:41:39,067 [ WebContainer : 27] [ STANDARD] [ ] [ xxxx:01.01.01] ( impl.cassandra.CassandraDao) ERROR xxxx - Cassandra members state…..

In addition to this, the following JMS exception was reported on vmgocfw08p:
2016-12-09 07:37:29,820 [ WebContainer : 4] [ STANDARD] [ ] [ xxxx:01.01.01] ( client.jms.JMSHelper) ERROR xxxx - Failed to publish message
javax.jms.JMSException: CWSIA0067E: An exception was received during the call to the method JmsMsgProducerImpl.<constructor>: com.ibm.ws.sib.jfapchannel.JFapConnectionBrokenException: CWSIJ0056E: An unexpected condition caused a network connection to host xxxx on port xx using chain chain_1 to close..
at com.ibm.ws.sib.api.jms.impl.JmsMsgProducerImpl.<init>(JmsMsgProducerImpl.java:456)

Batch node vmmocfw03p reported that it was unable to contact the ADM server (now known to be due to the failure of the ESX host hosting the ADM VM):
2016-12-09 07:36:10,280 [.PRPCWorkManager : 3] [ STANDARD] [ ] [ xxx:07.10] ( client.impl.ClientImpl) ERROR - Exception thrown while trying to update ADM models (ignored)
org.springframework.remoting.RemoteAccessException: Could not access HTTP invoker remote service at [xxxx/adm7/ADMServer]; nested exception is java.net.SocketTimeoutException: Read timed out
at org.springframework.remoting.httpinvoker.HttpInvokerClientInterceptor.convertHttpInvokerAccessException(HttpInvokerClientInterceptor.java:216)


Steps to Reproduce



Have an environment with 18 DNodes running data flows, and have 6 of them become unavailable due to mechanical failure.

Root Cause



Not Applicable

Resolution



Replication factor describes how many copies of the data exist. By default, PRPC uses a replication factor of three, which means that if 3 or more nodes fail, some data won't be available.
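As an illustration only (this is not part of the product's code; it assumes direct use of the DataStax Java driver 3.x and a hypothetical keyspace name "data"), the replication settings of the DDS keyspace could be inspected as follows to confirm the replication factor in use:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.KeyspaceMetadata;

public class ShowReplication {
    public static void main(String[] args) {
        // Connect to any reachable Cassandra node and read the keyspace metadata.
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build()) {
            KeyspaceMetadata ks = cluster.getMetadata().getKeyspace("data");
            // Prints the replication map, e.g. {class=..., replication_factor=3}
            System.out.println(ks.getReplication());
        }
    }
}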

When PRPC writes a record, Cassandra determines which nodes own that record. If all 3 replica nodes are unavailable, the write fails with an "Unable to achieve consistency level ONE" error. PRPC tries to account for temporary failures and retries write operations after a short period of time, eventually failing. When 3 or more nodes are down, some writes go through and some fail after a delay of several seconds. This leads to increased write times and multiple failures. If an application writing to a DDS data set doesn't handle write failures, this may give the false impression that the system behaves correctly, only with higher response times. Activities writing to DDS via the DataSet-Execute method must have a proper StepStatusFail check in the transition step. A minimal sketch of this write-path behavior follows below.
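The sketch below is not the PRPC implementation; it is a minimal example, assuming the DataStax Java driver 3.x and a hypothetical table "records" in a keyspace "data", of a write at consistency level ONE that is retried a few times and ultimately fails with an UnavailableException when no replica owning the partition is reachable:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;
import com.datastax.driver.core.Statement;
import com.datastax.driver.core.exceptions.UnavailableException;

public class DdsWriteSketch {
    public static void main(String[] args) throws InterruptedException {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("data")) {
            Statement write = new SimpleStatement(
                    "INSERT INTO records (id, payload) VALUES ('rec-1', 'value')")
                    .setConsistencyLevel(ConsistencyLevel.ONE);
            // Retry a few times to ride out a transient failure, then give up,
            // roughly mirroring the retry-then-fail behavior described above.
            for (int attempt = 1; attempt <= 3; attempt++) {
                try {
                    session.execute(write);
                    return; // write succeeded
                } catch (UnavailableException e) {
                    System.err.println("Attempt " + attempt + " failed: " + e.getMessage());
                    Thread.sleep(1000L * attempt);
                }
            }
            throw new IllegalStateException("Write failed: no replica available");
        }
    }
}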

Similar behavior is observed when reading data from a DDS data set. The number of failed nodes should never exceed replication_factor - 1; otherwise the system will behave incorrectly: some data cannot be read and some new data cannot be written. If the failed nodes never come back, some data will be permanently lost. If the maximum affordable number of nodes down at the same time is N, then the replication factor must be set to N+1. Keep in mind that increasing the replication factor will increase read/write response times.
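If the decision is to raise the replication factor, the change itself is a keyspace alteration, after which a full repair must be run so the additional replicas are actually populated. The sketch below is illustrative only (the keyspace name "data", SimpleStrategy, and direct driver usage are assumptions, not taken from this article) and shows what setting the factor to N+1 could look like:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class RaiseReplicationFactor {
    public static void main(String[] args) {
        int maxNodesDown = 3;                      // N: nodes that may be down at the same time
        int replicationFactor = maxNodesDown + 1;  // N + 1, per the rule above
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect()) {
            // Alter the keyspace; a repair is still required afterwards to populate the new replicas.
            session.execute("ALTER KEYSPACE data WITH replication = "
                    + "{'class': 'SimpleStrategy', 'replication_factor': " + replicationFactor + "}");
        }
    }
}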

Published February 21, 2017 - Updated October 8, 2020
