One server cannot connect to DNode
Summary
On a multinode PRPC environment with six nodes, adding each server as a DNode succeeds for all but one server. As a result, the DNode for this cluster is stuck in the Joining state.
Error Messages
java.lang.RuntimeException: java.io.IOException: Cannot proceed on repair because a neighbor (-ip address) is dead: session failed
    at com.google.common.base.Throwables.propagate(Throwables.java:160)
    at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:32)
Steps to Reproduce
On a multinode system, add each server as a DNode.
Root Cause
A defect or configuration issue in the operating environment. Hotfix HFix-28585 was recently released for DSM. This hotfix adds logic that performs a repair whenever the cluster size changes.
After the hotfix is installed, a specific restart procedure must be followed. Restarting all nodes at once produces this error.
For a fresh setup, follow the steps below.
- Decommission all DNodes from the DNode Cluster management page (skip this step if the decommission action is unavailable).
- Delete dynamic system settings such as "dnode/<NodeID>/enableAtStartup".
- Shut down all servers.
- Delete the PegaTempDir of each server. (This also deletes the 'prpc' directory that holds the Cassandra/DNode data.)
- Start the servers one at a time and add each one to the DNode cluster. Start the next server only after the previous server has been added as a DNode.
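The ordering constraint in the last step can be sketched as a small driver loop. This is only an illustration, not a Pega-supplied tool: the callables `start_server`, `add_as_dnode`, and `dnode_status` are hypothetical hooks that you would have to implement for your own environment (for example, by wrapping your application server's start script and reading the status shown on the DNode Cluster management page). The status value "NORMAL" is likewise an assumption for "fully joined"; only the Joining state is named in this article.

```python
import time
from typing import Callable, Iterable, List


def start_and_join_sequentially(
    servers: Iterable[str],
    start_server: Callable[[str], None],   # hypothetical hook: boots one server
    add_as_dnode: Callable[[str], None],   # hypothetical hook: adds it to the DNode cluster
    dnode_status: Callable[[str], str],    # hypothetical hook: returns the current DNode status
    joined_status: str = "NORMAL",         # assumed name of the "fully joined" status
    poll_seconds: float = 10.0,
    timeout_seconds: float = 900.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Start servers one at a time, waiting for each DNode to join before the next."""
    joined: List[str] = []
    for server in servers:
        start_server(server)
        add_as_dnode(server)

        # Poll until this server has left the Joining state; never start
        # the next server while the current one is still joining.
        waited = 0.0
        while dnode_status(server) != joined_status:
            if waited >= timeout_seconds:
                raise TimeoutError(
                    f"{server} did not reach status {joined_status!r} "
                    f"within {timeout_seconds} seconds"
                )
            sleep(poll_seconds)
            waited += poll_seconds

        joined.append(server)
    return joined
```

The loop stops with a `TimeoutError` rather than moving on, because continuing to start servers while one DNode is still joining is the situation this procedure is meant to avoid.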
Published August 19, 2016