Support Article

Dataflow failure due to Cassandra Timeout

SA-44137

Summary



Dataflow failures started in the production environment with the exception below.
Recent changes to the environment include:

  • A new application was added that does not use any DDS facilities.
  • A Visual Business Director (VBD) node was added.
  • HFix-31496, which upgrades the Cassandra driver version, was recently installed.



Error Messages

Caused by: java.util.concurrent.ExecutionException: com.datastax.driver.core.exceptions.OperationTimedOutException: [/<your IP>:9042] Timed out waiting for server response
at com.pega.google.common.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:299)
at com.pega.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:272)
at com.pega.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:96)
at com.pega.dsm.dnode.impl.dataset.cassandra.CassandraSaveWithTTLOperation$SaveFuture.get(CassandraSaveWithTTLOperation.java:350)
at com.pega.dsm.dnode.impl.dataset.cassandra.CassandraSaveWithTTLOperation$4.emit(CassandraSaveWithTTLOperation.java:205)
... 32 more
Caused by: com.datastax.driver.core.exceptions.OperationTimedOutException: [/<your IP>:9042] Timed out waiting for server response
at com.datastax.driver.core.RequestHandler$SpeculativeExecution.onTimeout(RequestHandler.java:770)
at com.datastax.driver.core.Connection$ResponseHandler$1.run(Connection.java:1374)
at io.netty.util.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:581)
at io.netty.util.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:655)
at io.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:367)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)
at java.lang.Thread.run(Thread.java:745)

Steps to Reproduce



Run a data flow.


Root Cause



A defect in Pegasystems’ code or rules:
The issue is caused by two sets of scenarios introduced by two separate hotfixes that were deployed as dependencies of another hotfix.

These hotfixes improved the performance of dataflow execution and thereby overloaded the Cassandra servers.

Issue 1: No retries are performed after a timeout, so an OperationTimedOutException is thrown. HFix-31496 upgrades the Cassandra driver from 2.1.9 to 3.1.2. This upgrade changes the default behavior of the driver: it no longer retries a request unless the CQL statement is marked as idempotent.

Issue 2: Too much read load on the Cassandra server causes it to lose visibility of other nodes in the cluster. This causes Cassandra to store hinted handoffs locally. Saving and compacting hinted handoffs triggers garbage collection, which in turn causes JVM pauses and hence timeout errors. HFix-35785 includes changes that improve the performance of the DDS dataset so that DDS reads occur in parallel instead of one by one.
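The driver behavior described in Issue 1 can be sketched as follows. This is an illustrative simplification in Python, not the actual DataStax driver code: it shows only the rule that a timed-out request is retried when the statement is flagged idempotent, and surfaced immediately otherwise.

```python
class OperationTimedOutError(Exception):
    """Stand-in for the driver's OperationTimedOutException."""

def execute_with_retry(run_query, idempotent, max_retries=1):
    """Sketch of the driver-3.x default: retry a timed-out request
    only when the statement is marked idempotent."""
    attempts = 0
    while True:
        try:
            return run_query()
        except OperationTimedOutError:
            if not idempotent or attempts >= max_retries:
                raise  # non-idempotent: surface the timeout immediately
            attempts += 1

# Simulated query that times out once, then succeeds.
calls = {"n": 0}
def flaky_query():
    calls["n"] += 1
    if calls["n"] == 1:
        raise OperationTimedOutError()
    return "row"

print(execute_with_retry(flaky_query, idempotent=True))  # retried once, prints "row"
```

With `idempotent=False`, the same timeout would propagate on the first attempt, which matches the exception seen in this article.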


Resolution



Apply HFix-37220. The following configuration changes are also required:

1. To increase the timeout values, update prconfig.xml on the following nodes.
a. On all DDS nodes:

<env name="dnode/yaml/write_request_timeout_in_ms" value="60000" />

b. On all DF nodes:

<env name="dnode/cassandra_read_timeout_millis" value="50000" />

2. Reduce the DDS read load on the Cassandra servers by executing the data flow from an activity and setting the following properties on the RunOptions page.

  • Tune the values below based on whether your data flow execution is Cassandra intensive or strategy intensive.
  • You can determine this from the data flow run item progress page by checking the '% of total time'.
  • Higher values for DDS dataset components mean the run is Cassandra intensive.
  • Reduce pyBatchSize and pyNumberOfRequestors for those data flow runs that are Cassandra intensive.

Set .pyNumberOfRequestors to 8.
Set .pyBatchSize to 50.
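The tuning guidance above can be sketched as a small helper. This is a hypothetical illustration: the 50% threshold and the non-Cassandra-intensive fallback values are assumptions for the sketch, not values from the hotfix; only the reduced values (8 requestors, batch size 50) come from this article.

```python
def run_options(dds_time_pct, cassandra_threshold=50.0):
    """Choose RunOptions values from the share of total run time spent
    in DDS dataset components (the '% of total time' column on the
    data flow run item progress page).

    A run at or above the threshold is treated as Cassandra intensive
    and gets the reduced values recommended in this article. The
    threshold and the fallback values are illustrative assumptions.
    """
    if dds_time_pct >= cassandra_threshold:
        return {"pyNumberOfRequestors": 8, "pyBatchSize": 50}
    # Strategy-intensive runs can keep larger values (placeholders).
    return {"pyNumberOfRequestors": 16, "pyBatchSize": 100}

print(run_options(70.0))  # Cassandra intensive: reduced values
```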


Published January 24, 2018 - Updated October 8, 2020

