Campaign enters an inconsistent state after node failure
When performing a Failover test for Pega Marketing, the following issues occur:
- The data flow is Paused.
- The Run schedule (PegaMKT-Work-ProgramRun instance) remains in the Running state (visible in Pega Marketing portal > Campaigns > Run schedule).
- The corresponding System-Queue-ProgramRun instance remains in the Processing state (it does not change to Success or Broken).
Steps to Reproduce
- Set up a server with two data nodes.
- Start a scheduled Campaign run.
- Wait until the data flow starts.
- Stop the Java process that executes the Pega application on the second node (kill -9).
- Restart the web server.
The ProcessProgramRun agent executes the Campaigns. When the node on which the agent is running is killed, control over the Campaign run process is lost. Therefore, user intervention is required.
Here’s the explanation for the reported behavior:
When a node crash involves Pega Marketing (PM) agents, the Campaign run continues to be in the Running state.
The PR-xxx data flow run is paused.
After the server is restarted, or once the Pega Marketing portal is available, the user must manually click the Stop button on the program run affected by the system crash.
When the Stop action is submitted:
- The program run item is set to Stopped.
- The corresponding PR-xxx data flow run also moves from the Paused state to the Stopped state.
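The state combination described in this article can be summarized as a small model. The following Python sketch is illustrative only: the class `CampaignRun` and the functions `is_orphaned` and `stop` are hypothetical names, not part of any Pega API, and the status strings are simply the ones quoted above. It shows how the post-crash combination could be recognized and what the manual Stop action changes.

```python
from dataclasses import dataclass


@dataclass
class CampaignRun:
    """Hypothetical snapshot of the three statuses discussed in this article."""
    program_run: str  # Run schedule (PegaMKT-Work-ProgramRun) status
    queue_item: str   # System-Queue-ProgramRun status
    data_flow: str    # PR-xxx data flow run status


def is_orphaned(run: CampaignRun) -> bool:
    """Return True for the post-crash combination: Running / Processing / Paused."""
    return (
        run.program_run == "Running"
        and run.queue_item == "Processing"
        and run.data_flow == "Paused"
    )


def stop(run: CampaignRun) -> CampaignRun:
    """Model the manual Stop action.

    The program run and the data flow run both become Stopped. The article
    does not say what happens to the queue item, so it is left unchanged here.
    """
    return CampaignRun(
        program_run="Stopped",
        queue_item=run.queue_item,
        data_flow="Stopped",
    )


# Example: a run left behind by a killed node, then stopped by the user.
crashed = CampaignRun(program_run="Running", queue_item="Processing", data_flow="Paused")
stopped = stop(crashed)
```

In this sketch, `is_orphaned(crashed)` is true before the Stop action and `is_orphaned(stopped)` is false afterwards, which mirrors the manual recovery described above.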