Support Article
VBD status is "JOINING_FAILED" and real time data flows are down
Summary
The Visual Business Director (VBD) node status is JOINING_FAILED, and real-time data flows fail when they are reactivated.
Error Messages
The Pega logs repeatedly display messages such as the following:
[XXXX]] [STANDARD] [ ] [ ] (tor$QueueBasedDataFlowExecutor) ERROR - Unexpected error occurred during event processing: null
com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (no host was tried)
[...]
INFO [PERIODIC-COMMIT-LOG-SYNCER] Server.java:225 - Stop listening for CQL clients
ERROR [PERIODIC-COMMIT-LOG-SYNCER] CommitLog.java:398 - Failed to persist commits to disk. Commit disk failure policy is stop; terminating thread
org.apache.cassandra.io.FSWriteError: java.io.IOException: No space left on device
at org.apache.cassandra.db.commitlog.CommitLogSegment.sync(CommitLogSegment.java:329) ~[apache-cassandra-2.1.14.jar:2.1.14]
at org.apache.cassandra.db.commitlog.CommitLog.sync(CommitLog.java:195) ~[apache-cassandra-2.1.14.jar:2.1.14]
at org.apache.cassandra.db.commitlog.AbstractCommitLogService$1.run(AbstractCommitLogService.java:81) ~[apache-cassandra-2.1.14.jar:2.1.14]
at java.lang.Thread.run(Thread.java:745) [na:1.8.0_73]
Caused by: java.io.IOException: No space left on device
at java.nio.MappedByteBuffer.force0(Native Method) ~[na:1.8.0_73]
at java.nio.MappedByteBuffer.force(MappedByteBuffer.java:203) ~[na:1.8.0_73]
at org.apache.cassandra.db.commitlog.CommitLogSegment.sync(CommitLogSegment.java:315) ~[apache-cassandra-2.1.14.jar:2.1.14]
... 3 common frames omitted
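To confirm that a log file shows the same failure pattern, the three signatures in the excerpt above can be counted with a short script. This is a minimal sketch: the function name count_signatures and the signature labels are hypothetical, and the patterns are copied only from the messages shown in this article.

```python
import re

# Patterns taken from the log excerpt in this article; labels are illustrative.
SIGNATURES = {
    "no_host_available": re.compile(r"NoHostAvailableException"),
    "disk_full": re.compile(r"No space left on device"),
    "commit_log_stopped": re.compile(r"Failed to persist commits to disk"),
}


def count_signatures(lines):
    """Count how many lines match each known error signature."""
    counts = {name: 0 for name in SIGNATURES}
    for line in lines:
        for name, pattern in SIGNATURES.items():
            if pattern.search(line):
                counts[name] += 1
    return counts
```

The function accepts any iterable of lines, so an open log file can be passed directly, for example count_signatures(open(path, errors="replace")) where path is the location of the log file. A non-zero disk_full count together with many no_host_available entries would be consistent with the scenario described here.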
Steps to Reproduce
Add the VBD node. The node enters the JOINING_FAILED status, and real-time data flows enter the FAILED status when they are reactivated.
Root Cause
A defect or configuration issue in the operating environment:
The issue occurred while Cassandra was compacting the data. Compaction exhausted the available disk space, so commits could not be persisted to disk. Because the commit disk failure policy is "stop", Cassandra shut down as a safety mechanism to protect data integrity.
The disk space used on the filesystem '/appvol/pega', where the Cassandra data was stored, was ~50 GB. The size of the filesystem was 100 GB, so ~50% of the filesystem was in use.
The 'wasadm' user (that runs Pega or Cassandra) did not have a 'ulimit' imposed, so no quota was enforced. Cassandra compaction therefore increased disk usage, because the existing data is copied during compaction.
Based on the error message:
- Cassandra had insufficient disk space to compact the data.
- Cassandra shut down because it could not compact the data.
- The additional Pega logging 'spike' occurred because Cassandra had shut down, which caused the error messages to be repeated in the Pega log files.
- The filesystem that Cassandra used to store data was ~50% full (~50 GB of a 100 GB total).
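The disk space arithmetic behind this conclusion can be sketched as follows. The sketch assumes the worst case implied above, namely that compaction temporarily writes a full copy of the existing data, and that all ~50 GB in use on the filesystem is Cassandra data. The function name compaction_headroom is hypothetical, and actual space requirements depend on the compaction strategy and data layout.

```python
GB = 1024 ** 3


def compaction_headroom(total_bytes, used_bytes, data_bytes):
    """Free space remaining if compaction temporarily copies data_bytes."""
    free_bytes = total_bytes - used_bytes
    return free_bytes - data_bytes


# Figures from this article: 100 GB filesystem, ~50 GB used.
# Assumption: all of the used space is Cassandra data that may be copied.
before = compaction_headroom(100 * GB, 50 * GB, 50 * GB)
# Same data on a 200 GB filesystem.
after = compaction_headroom(200 * GB, 50 * GB, 50 * GB)

print(before / GB, after / GB)  # 0.0 100.0
```

Under these assumptions a 100 GB filesystem leaves no margin for the commit log or other writes during compaction, whereas a 200 GB filesystem leaves roughly 100 GB.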
Resolution
Perform the following local change:
- Increase the disk space from 100 GB (of which ~50 GB is used) to 200 GB.
- Restart the application.
- Compact the data.
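Before compacting, it may be worth confirming that the resized filesystem has enough free space. The following is a minimal sketch, not a Pega or Cassandra utility: the function names directory_size and safe_to_compact are hypothetical, and the rule "free space must be at least the current size of the data" is an assumption based on the root cause above. The path to check would be the Cassandra data location, '/appvol/pega' in this case.

```python
import os
import shutil


def directory_size(path):
    """Sum the sizes of all regular files under path, in bytes."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.path.getsize(os.path.join(root, name))
            except OSError:
                # A file may disappear while the tree is being walked.
                pass
    return total


def safe_to_compact(data_dir, usage=shutil.disk_usage, size=directory_size):
    """Return True if free space is at least the current size of the data.

    usage and size are injectable so the check can be tested without
    touching a real filesystem.
    """
    free_bytes = usage(data_dir).free
    return free_bytes >= size(data_dir)
```

For example, safe_to_compact("/appvol/pega") returning False would suggest that compaction could fill the disk again. The compaction itself is typically started with the Cassandra nodetool compact command; where that tool is located in a Pega deployment is not covered by this article.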
Published October 20, 2018 - Updated October 8, 2020