Support Article
Dataflow failure due to Cassandra Timeout
Summary
Dataflow failures started in the production environment with the exception shown under Error Messages.
Recent changes to the environment include:
- A new application was added that does not use any DDS facilities.
- A Visual Business Director (VBD) node was added.
- HFix-31496, which upgrades the Cassandra driver version, was also installed recently.
Error Messages
Caused by: java.util.concurrent.ExecutionException: com.datastax.driver.core.exceptions.OperationTimedOutException: [/<your IP>:9042] Timed out waiting for server response
at com.pega.google.common.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:299)
at com.pega.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:272)
at com.pega.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:96)
at com.pega.dsm.dnode.impl.dataset.cassandra.CassandraSaveWithTTLOperation$SaveFuture.get(CassandraSaveWithTTLOperation.java:350)
at com.pega.dsm.dnode.impl.dataset.cassandra.CassandraSaveWithTTLOperation$4.emit(CassandraSaveWithTTLOperation.java:205)
... 32 more
Caused by: com.datastax.driver.core.exceptions.OperationTimedOutException: [/<your IP>:9042] Timed out waiting for server response
at com.datastax.driver.core.RequestHandler$SpeculativeExecution.onTimeout(RequestHandler.java:770)
at com.datastax.driver.core.Connection$ResponseHandler$1.run(Connection.java:1374)
at io.netty.util.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:581)
at io.netty.util.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:655)
at io.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:367)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)
at java.lang.Thread.run(Thread.java:745)
Steps to Reproduce
Run a data flow
Root Cause
A defect in Pegasystems' code or rules:
These hotfixes improved the performance of data flow execution, which in turn overloaded the Cassandra servers.
The overload caused Cassandra to store hinted handoffs locally.
Saving and compacting the hinted handoffs triggered garbage collection, and the resulting JVM pauses produced the timeout errors.
Resolution
Apply HFix-37220. The following configuration changes are also required:
1. To increase the timeout values, update prconfig.xml on the nodes listed below.
a. On all DDS nodes:
<env name="dnode/yaml/write_request_timeout_in_ms" value="60000" />
b. On all Data Flow (DF) nodes:
<env name="dnode/cassandra_read_timeout_millis" value="50000" />
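For context, the two settings above can be sketched inside a minimal prconfig.xml. This sketch assumes the standard <pegarules> root element; note that each setting belongs only on its respective node type (DDS or DF), and the two are shown together here purely for illustration:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<pegarules>
  <!-- DDS nodes only: allow Cassandra up to 60 s to acknowledge writes -->
  <env name="dnode/yaml/write_request_timeout_in_ms" value="60000" />
  <!-- DF nodes only: allow the client driver to wait up to 50 s for a read response -->
  <env name="dnode/cassandra_read_timeout_millis" value="50000" />
</pegarules>
```

A node restart is typically needed before prconfig.xml changes take effect.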
2. Reduce the DDS read load on the Cassandra servers by running the data flow from an activity and setting the following properties on the RunOptions page:
- Tune the values below according to whether the data flow run is Cassandra intensive or strategy intensive.
- You can determine this from the data flow run item progress page by checking the '% of total time' column.
- Higher values for DDS data set components indicate that the run is Cassandra intensive.
- Reduce pyBatchSize and pyNumberOfRequestors for data flow runs that are Cassandra intensive.
.pyNumberOfRequestors = 8
.pyBatchSize = 50
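As a rough sketch, an activity implementing step 2 could be structured as follows. The RunOptions class name and the final activity call shown here are assumptions for illustration; verify the exact rule names available in your Pega version before use:

```
Step 1  Page-New       Create page "RunOptions"
                       (assumed class: Data-Decision-DDF-RunOptions)
Step 2  Property-Set   RunOptions.pyNumberOfRequestors = 8
                       RunOptions.pyBatchSize          = 50
Step 3  Call <run activity, e.g. pxRunDDFWithOptions - name assumed>
                       Pass the data flow name and the RunOptions page
```

Lowering pyNumberOfRequestors limits how many threads issue DDS reads concurrently, and lowering pyBatchSize shrinks each read burst, so together they throttle the pressure the run puts on Cassandra.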
Published January 24, 2018 - Updated October 8, 2020