Support Article

Dataflow failure due to Cassandra Timeout

SA-44137

Summary



Dataflow failures started in the production environment with the exception below.
Recent changes to the environment include:

  • A new application was added that does not use any DDS facilities.
  • A Visual Business Director (VBD) node was added.
  • HFix-31496, which upgrades the Cassandra driver version, was recently installed.



Error Messages

Caused by: java.util.concurrent.ExecutionException: com.datastax.driver.core.exceptions.OperationTimedOutException: [/<your IP>:9042] Timed out waiting for server response
at com.pega.google.common.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:299)
at com.pega.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:272)
at com.pega.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:96)
at com.pega.dsm.dnode.impl.dataset.cassandra.CassandraSaveWithTTLOperation$SaveFuture.get(CassandraSaveWithTTLOperation.java:350)
at com.pega.dsm.dnode.impl.dataset.cassandra.CassandraSaveWithTTLOperation$4.emit(CassandraSaveWithTTLOperation.java:205)
... 32 more
Caused by: com.datastax.driver.core.exceptions.OperationTimedOutException: [/<your IP>:9042] Timed out waiting for server response
at com.datastax.driver.core.RequestHandler$SpeculativeExecution.onTimeout(RequestHandler.java:770)
at com.datastax.driver.core.Connection$ResponseHandler$1.run(Connection.java:1374)
at io.netty.util.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:581)
at io.netty.util.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:655)
at io.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:367)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)
at java.lang.Thread.run(Thread.java:745)

Steps to Reproduce



Run a data flow.


Root Cause



A defect in Pegasystems’ code or rules:
The issue is caused by two sets of scenarios introduced by two separate hotfixes that were deployed as dependencies of another hotfix.

These hotfixes improved the performance of dataflow execution and thereby overloaded the Cassandra servers.

Issue 1: No retries are performed after a timeout, so an OperationTimedOutException is thrown. HFix-31496 upgrades the Cassandra driver from 2.1.9 to 3.1.2. This upgrade changes the default behavior of the driver: it no longer retries a request unless the CQL statement is marked as idempotent.

Issue 2: Too much read load on the Cassandra server causes it to lose visibility of other nodes in the cluster. This causes Cassandra to store hinted handoffs locally. Saving and compacting hinted handoffs triggers garbage collection, which in turn causes JVM pauses and hence timeout errors. HFix-35785 includes changes that improve the performance of the DDS dataset so that DDS reads occur in parallel instead of one by one.
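The driver behavior described in Issue 1 can be sketched as follows. This is an illustrative simplification in Python, not the actual DataStax driver code: it shows only the rule that a timed-out request is retried when the statement is flagged idempotent, and surfaced immediately otherwise.

```python
class OperationTimedOutError(Exception):
    """Stand-in for the driver's OperationTimedOutException."""

def execute_with_retry(run_query, idempotent, max_retries=1):
    """Sketch of the driver-3.x default: retry a timed-out request
    only when the statement is marked idempotent."""
    attempts = 0
    while True:
        try:
            return run_query()
        except OperationTimedOutError:
            if not idempotent or attempts >= max_retries:
                raise  # non-idempotent: surface the timeout immediately
            attempts += 1

# Simulated query that times out once, then succeeds.
calls = {"n": 0}
def flaky_query():
    calls["n"] += 1
    if calls["n"] == 1:
        raise OperationTimedOutError()
    return "row"

print(execute_with_retry(flaky_query, idempotent=True))  # retried once, prints "row"
```

With `idempotent=False`, the same timeout would propagate on the first attempt, which matches the exception seen in this article.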


Resolution



Apply HFix-37220. The following configuration changes are also required:

1. To increase the timeout values, update prconfig.xml on the following nodes.
a. On all DDS nodes:

<env name="dnode/yaml/write_request_timeout_in_ms" value="60000" />

b. On all DF nodes:

<env name="dnode/cassandra_read_timeout_millis" value="50000" />

2. Reduce the DDS read load on the Cassandra servers by executing the data flow from an activity and setting the following properties on the RunOptions page.

  • Tune the values below based on whether your data flow execution is Cassandra intensive or strategy intensive.
  • You can determine this from the data flow run item progress page by checking the '% of total time'.
  • Higher values for DDS dataset components mean the run is Cassandra intensive.
  • Reduce pyBatchSize and pyNumberOfRequestors for those data flow runs that are Cassandra intensive.

Set .pyNumberOfRequestors to 8.
Set .pyBatchSize to 50.
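The tuning guidance above can be sketched as a small helper. This is a hypothetical illustration: the 50% threshold and the non-Cassandra-intensive fallback values are assumptions for the sketch, not values from the hotfix; only the reduced values (8 requestors, batch size 50) come from this article.

```python
def run_options(dds_time_pct, cassandra_threshold=50.0):
    """Choose RunOptions values from the share of total run time spent
    in DDS dataset components (the '% of total time' column on the
    data flow run item progress page).

    A run at or above the threshold is treated as Cassandra intensive
    and gets the reduced values recommended in this article. The
    threshold and the fallback values are illustrative assumptions.
    """
    if dds_time_pct >= cassandra_threshold:
        return {"pyNumberOfRequestors": 8, "pyBatchSize": 50}
    # Strategy-intensive runs can keep larger values (placeholders).
    return {"pyNumberOfRequestors": 16, "pyBatchSize": 100}

print(run_options(70.0))  # Cassandra intensive: reduced values
```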


Published January 24, 2018 - Updated October 8, 2020

