Cassandra errors occurring during multi-node restart
Summary
During the standard weekly recycle of the servers, we see the following situation. There are two data centres: the first is recycled at 2 am, the second at 3 am.
During the recycle of the realtime nodes, errors relating to D-Nodes appear at system startup. The D-Node does start, but takes about 20 seconds; on servers without the error, the restart completes in under 10 seconds.
After the restart, all nodes show as available, with correct percentages, on the D-Node cluster management page.
Error Messages
2016-10-16 02:20:50,438 [server.net] [ STANDARD] [ ] [ ] ( dnode.api.DNodeBootstrap) INFO - Starting D-Node service
2016-10-16 02:21:11,794 [server.net] [ STANDARD] [ ] [ ] (l.cassandra.CassandraUserUtils) ERROR - Unable to estabish if user cassandra exists
org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 0 responses.
2016-10-16 02:21:11,801 [server.net] [ STANDARD] [ ] [ ] ( dnode.api.DNodeBootstrap) INFO - D-Node started in 21363 ms
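To check several nodes for the same symptom, the log can be scanned for the startup duration and the CassandraUserUtils error. The following is a minimal sketch, not a Pega tool: the function name `summarize` and the sample constant `LOG` are hypothetical, and the patterns are based only on the excerpt above.

```python
import re

# Sample lines copied from the excerpt above (stack trace line omitted).
LOG = """\
2016-10-16 02:20:50,438 [server.net] [ STANDARD] [ ] [ ] ( dnode.api.DNodeBootstrap) INFO - Starting D-Node service
2016-10-16 02:21:11,794 [server.net] [ STANDARD] [ ] [ ] (l.cassandra.CassandraUserUtils) ERROR - Unable to estabish if user cassandra exists
2016-10-16 02:21:11,801 [server.net] [ STANDARD] [ ] [ ] ( dnode.api.DNodeBootstrap) INFO - D-Node started in 21363 ms
"""

# Matches the startup summary line and captures the duration in milliseconds.
STARTED = re.compile(r"D-Node started in (\d+) ms")


def summarize(log_text):
    """Return (startup_ms, error_seen) for one node's log text."""
    startup_ms = None
    error_seen = False
    for line in log_text.splitlines():
        match = STARTED.search(line)
        if match:
            startup_ms = int(match.group(1))
        # The error is identified by its logger name and level only.
        if "CassandraUserUtils" in line and "ERROR" in line:
            error_seen = True
    return startup_ms, error_seen


if __name__ == "__main__":
    print(summarize(LOG))
```

Running the same function over each node's log makes it easy to see which nodes logged the error and how long each took to start.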
Steps to Reproduce
Restart the nodes in a multi-node system.
The error message is caused by a timing issue when the Cassandra nodes come up.
Not every Cassandra node holds all of the available data, and this applies to the user definitions as well. When a node that does not hold the data itself comes up, it has to ask a node that does. If no node with that information is available at that moment, the observed exception is thrown, but startup continues. Once all nodes are up and operational, all the required information is online, including the user data.
Therefore, the reported exception should not appear in the logs of every node; the logs of at least two nodes should be free of it.
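The timing issue can be illustrated with a small simulation. This is only a sketch under stated assumptions: the four node names, the choice of two replica holders, and the functions `user_exists_check` and `start_node` are hypothetical, and `ReadTimeout` merely stands in for Cassandra's `ReadTimeoutException`.

```python
class ReadTimeout(Exception):
    """Illustrative stand-in for Cassandra's ReadTimeoutException."""


def user_exists_check(node, replica_holders, reachable_peers):
    """Look up the user data locally, or from a reachable replica holder."""
    if node in replica_holders:
        return True  # the starting node holds the data itself
    if any(peer in replica_holders for peer in reachable_peers):
        return True  # another node that holds the data answered
    # No replica holder is up yet, so the read gets zero responses.
    raise ReadTimeout("Operation timed out - received only 0 responses.")


def start_node(node, replica_holders, reachable_peers):
    """Startup continues either way; only the log output differs."""
    try:
        user_exists_check(node, replica_holders, reachable_peers)
        return "clean log"
    except ReadTimeout:
        return "error logged, startup continues"


# Hypothetical cluster: the user data lives on two of the four nodes.
NODES = ["node-a", "node-b", "node-c", "node-d"]
REPLICA_HOLDERS = {"node-a", "node-b"}

# Worst case: every node restarts at once and cannot reach any peer yet.
RESULTS = {n: start_node(n, REPLICA_HOLDERS, reachable_peers=[]) for n in NODES}

if __name__ == "__main__":
    for name, outcome in RESULTS.items():
        print(name, "->", outcome)
```

In this sketch the two replica holders start with a clean log even in the worst case, which matches the expectation that the logs of at least two nodes are free of the exception.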
Resolution
This has no functional effect on the product; it is only a cosmetic bug.
Nevertheless, as of Pega 7.2.1, the way Cassandra is integrated within PRPC has changed completely.
One of the fundamental areas of this work was to standardize on a services infrastructure that ensures the dependencies between services are met before each service is started. In the course of this work, we changed the user and password management functionality so that our users are created when the first Decision Data Store (DDS) node is started. We also added code so that, when new DDS nodes are added to the cluster, the Cassandra user IDs are replicated to them as part of startup. We now enforce that DDS nodes are started sequentially and that services that require DDS nodes (VBD, ADM, etc.) wait until a working DDS cluster exists before attempting startup.
The CassandraUserUtils class that generates the error mentioned above was removed from the source as part of this work. The exception shown in the provided logs no longer exists in the codebase.
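The startup ordering described above can be sketched as follows. This is not the Pega implementation: the function `start_cluster`, the node names, and the event strings are hypothetical and show only the ordering rules stated in this article.

```python
def start_cluster(dds_nodes, dependent_services):
    """Return the ordered startup events for a hypothetical cluster."""
    events = []
    started = []  # DDS nodes that are already up, in start order

    # DDS nodes are started one at a time, never in parallel.
    for node in dds_nodes:
        if not started:
            # Users are created when the first DDS node starts.
            events.append(f"{node}: create users")
        else:
            # Later nodes receive the existing user IDs during startup.
            events.append(f"{node}: replicate users from {started[0]}")
        started.append(node)
        events.append(f"{node}: up")

    # Dependent services start only once a working DDS cluster exists.
    for service in dependent_services:
        if not started:
            raise RuntimeError(f"{service} requires a working DDS cluster")
        events.append(f"{service}: up")
    return events


EVENTS = start_cluster(["dds-1", "dds-2"], ["VBD", "ADM"])

if __name__ == "__main__":
    for event in EVENTS:
        print(event)
```

Because each new node receives the user data before it is reported as up, no node has to query a peer for it during startup, which is the situation that produced the original timeout.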
Published November 14, 2016 - Updated November 20, 2016