Support Article

Hazelcast Operation Timeout Exception

SA-21698

Summary

In a multi-node environment, some nodes fail to start. The issue is not specific to any particular node. Restarting an affected node fails with the Hazelcast error shown below. The error reports that a particular node or server cannot be reached, even though that node or server is up and running.

Error Messages

[12/15/15 15:46:47:249 CST] 00000082 SystemOut O 2015-12-15 15:46:47,247 [ apsrs2714] [ STANDARD] [ ] ( internal.mgmt.PREnvironment) ERROR - com.hazelcast.core.OperationTimeoutException: No response for 120000 ms. Aborting invocation! BasicInvocationFuture{invocation=BasicInvocation{ serviceName='hz:impl:mapService', op=PutOperation{/pega/system/mgmt/nodeidUUID}, partitionId=84, replicaIndex=0, tryCount=250, tryPauseMillis=500, invokeCount=1, callTimeout=60000, target=Address[<ip>]:8060, backupsExpected=0, backupsCompleted=0}, response=null, done=false} No response has been received! backups-expected:0 backups-completed: 0
[12/15/15 15:46:47:325 CST] 00000082 SystemOut O 2015-12-15 15:46:47,253 [ apsrs2714] [ STANDARD] [ ] ( etier.impl.EngineStartup) ERROR - PegaRULES initialization failed. Server: apsrs2714
com.pega.pegarules.pub.context.InitializationFailedError: PRNodeImpl init failed
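
When several nodes fail at once, it can help to pull the invocation details (target address, partition, timeouts) out of each log line so that they can be compared across nodes. The following is a minimal sketch, not a Pega or Hazelcast utility: the function name `parse_invocation`, the key `waitedMillis`, and the regular expressions are assumptions made for illustration, and the sample line is copied from the error above with the address left redacted.

```python
import re

# Sample copied from the error message above; the IP stays redacted as in the log.
SAMPLE = (
    "com.hazelcast.core.OperationTimeoutException: No response for 120000 ms. "
    "Aborting invocation! BasicInvocationFuture{invocation=BasicInvocation{ "
    "serviceName='hz:impl:mapService', op=PutOperation{/pega/system/mgmt/nodeidUUID}, "
    "partitionId=84, replicaIndex=0, tryCount=250, tryPauseMillis=500, invokeCount=1, "
    "callTimeout=60000, target=Address[<ip>]:8060, backupsExpected=0, backupsCompleted=0}, "
    "response=null, done=false}"
)

# Matches key=value pairs whose value is an integer, a quoted string,
# or an Address[...]:port token. Other values (nested objects, null, booleans)
# are deliberately skipped.
FIELD = re.compile(r"(\w+)=(\d+|'[^']*'|Address\[[^\]]*\]:\d+)")


def parse_invocation(line: str) -> dict:
    """Extract simple fields from a Hazelcast OperationTimeoutException line."""
    fields = {}
    for key, raw in FIELD.findall(line):
        raw = raw.strip("'")
        fields[key] = int(raw) if raw.isdigit() else raw
    # The total wait reported at the start of the message (hypothetical key name).
    waited = re.search(r"No response for (\d+) ms", line)
    if waited:
        fields["waitedMillis"] = int(waited.group(1))
    return fields


if __name__ == "__main__":
    info = parse_invocation(SAMPLE)
    print(info["target"], info["partitionId"], info["callTimeout"], info["waitedMillis"])
```

Running the parser over the startup logs of every failed node shows whether they all time out against the same target address or against different ones.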

Steps to Reproduce

Restart all the nodes at once, with Hazelcast enabled, through a cluster-level startup.

Root Cause

A defect in Pegasystems’ code. When there are multiple nodes within multiple clusters and a cluster-level startup brings up all the nodes at once, a race condition sometimes occurs: some of the nodes fail to establish a connection with the other nodes in the cluster and ultimately shut down after a default number of tries.

Resolution

Perform the following local change to avoid the race condition.

Add the following prconfig settings as Dynamic System Settings (DSS):

prconfig/cluster/consistency/lockattemptdelayms/default value=5000
prconfig/cluster/consistency/maxlockattempts/default value=150

Shut down all the nodes and truncate the PR_SYS_STATUSUNODES database table.

Bring up all the nodes by starting the clusters one at a time.
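
To get a feel for what the two prconfig values above allow, the arithmetic below estimates the longest time a node could keep retrying before giving up. This is a rough sketch under a stated assumption: that a node waits lockattemptdelayms between attempts and stops after maxlockattempts attempts, so the upper bound is simply their product. The article does not spell out the exact retry semantics, and the function name `max_lock_wait_seconds` is hypothetical.

```python
def max_lock_wait_seconds(delay_ms: int, max_attempts: int) -> float:
    """Upper bound on total retry time, assuming one fixed delay per attempt."""
    return delay_ms * max_attempts / 1000.0


# Values from the Resolution section of this article.
LOCK_ATTEMPT_DELAY_MS = 5000
MAX_LOCK_ATTEMPTS = 150

window = max_lock_wait_seconds(LOCK_ATTEMPT_DELAY_MS, MAX_LOCK_ATTEMPTS)
print(f"{window:.0f} s (~{window / 60:.1f} min)")  # prints "750 s (~12.5 min)"
```

Under that assumption, the settings give each node up to about 12.5 minutes to establish its connection with the rest of the cluster, which is the window the remaining nodes have to finish starting.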

Tags:

Pega Platform

Pega Platform 7.1.8

Published April 7, 2016 - Updated October 8, 2020
