Support Article
Hazelcast Operation Timeout Exception
SA-21698
Summary
In a multi-node environment, some nodes fail to start. The issue is not specific to any particular node. Restarting an affected node fails with the Hazelcast error shown under Error Messages, which reports that a particular node or server cannot be reached even though that node or server is up and running.
Error Messages
[12/15/15 15:46:47:249 CST] 00000082 SystemOut O 2015-12-15 15:46:47,247 [ apsrs2714] [ STANDARD] [ ] ( internal.mgmt.PREnvironment) ERROR - com.hazelcast.core.OperationTimeoutException: No response for 120000 ms. Aborting invocation! BasicInvocationFuture{invocation=BasicInvocation{ serviceName='hz:impl:mapService', op=PutOperation{/pega/system/mgmt/nodeidUUID}, partitionId=84, replicaIndex=0, tryCount=250, tryPauseMillis=500, invokeCount=1, callTimeout=60000, target=Address[<ip>]:8060, backupsExpected=0, backupsCompleted=0}, response=null, done=false} No response has been received! backups-expected:0 backups-completed: 0
[12/15/15 15:46:47:325 CST] 00000082 SystemOut O 2015-12-15 15:46:47,253 [ apsrs2714] [ STANDARD] [ ] ( etier.impl.EngineStartup) ERROR - PegaRULES initialization failed. Server: apsrs2714
com.pega.pegarules.pub.context.InitializationFailedError: PRNodeImpl init failed
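When comparing this error across several nodes, it can help to pull the invocation fields (target address, partition, retry settings) out of the log line. The following is a minimal sketch, not a Pega or Hazelcast utility: the function name parse_invocation is hypothetical, and the regular expressions assume the BasicInvocation text looks like the sample above.

```python
import re

# Sample copied from the error message above; the address is redacted in the source.
LOG_LINE = (
    "com.hazelcast.core.OperationTimeoutException: No response for 120000 ms. "
    "Aborting invocation! BasicInvocationFuture{invocation=BasicInvocation{ "
    "serviceName='hz:impl:mapService', op=PutOperation{/pega/system/mgmt/nodeidUUID}, "
    "partitionId=84, replicaIndex=0, tryCount=250, tryPauseMillis=500, invokeCount=1, "
    "callTimeout=60000, target=Address[<ip>]:8060, backupsExpected=0, "
    "backupsCompleted=0}, response=null, done=false}"
)


def parse_invocation(line):
    """Hypothetical helper: extract numeric fields and the target address."""
    fields = {}
    # Integer-valued key=value pairs such as partitionId=84 or callTimeout=60000.
    for key, value in re.findall(r"(\w+)=(\d+)\b", line):
        fields[key] = int(value)
    # The target is printed as Address[host]:port.
    target = re.search(r"target=Address\[([^\]]*)\]:(\d+)", line)
    if target:
        fields["target_host"] = target.group(1)
        fields["target_port"] = int(target.group(2))
    return fields


info = parse_invocation(LOG_LINE)
print(info["target_host"], info["target_port"], info["partitionId"])
```

Collecting target_host and target_port from each failing node shows whether the timeouts all point at the same member or at different ones.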
Steps to Reproduce
Restart all the nodes with Hazelcast enabled at once through a cluster-level startup.
Root Cause
A defect in Pegasystems’ code. When multiple nodes are spread across multiple clusters and a cluster-level startup brings all the nodes up at once, a race condition sometimes occurs: some nodes fail to establish a connection with other nodes in the cluster and ultimately shut down after a default number of attempts.
Resolution
Perform the following local change to avoid the race condition.
Add the following prconfig settings as Dynamic System Settings (DSS):
prconfig /cluster/consistency/lockattemptdelayms/default value=5000
prconfig /cluster/consistency/maxlockattempts/default value=150
Shut down all the nodes and truncate the PR_SYS_STATUSNODES database table.
Bring up all the nodes by starting the clusters one at a time.
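To get a feel for how long a node will keep retrying with the values above, the two settings can be multiplied. This is a rough sketch only: it assumes the lock attempts are spaced by a fixed delay with no backoff, which this article does not confirm, and the function name is hypothetical.

```python
def lock_retry_window_seconds(delay_ms, max_attempts):
    """Upper bound on retry time, assuming a fixed delay between attempts."""
    return delay_ms * max_attempts / 1000.0


# Values from the resolution: lockattemptdelayms=5000, maxlockattempts=150.
window = lock_retry_window_seconds(5000, 150)
print(f"{window:.0f} s (~{window / 60:.1f} min)")  # prints "750 s (~12.5 min)"
```

Under that assumption, each node has roughly 12.5 minutes to obtain the lock, which gives the other clusters time to come up when they are started one at a time.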
Published April 7, 2016 - Updated October 8, 2020