Event details in data flow runs

The lifecycle of a data flow run consists of multiple events of various types, for example, Status changed or Run finished. You can analyze those events in a lifecycle report to verify whether the run transitions correctly through its stages. By monitoring the run for anomalies, you can identify potential issues, gather the relevant data, and contact Global Customer Support if you require advanced assistance.

The lifecycle report lists reasons for the following types of events:

  • PartitionStatusTransitionMessage
  • RunStatusTransitionMessage
  • RebalanceRunMessage

Topology changes are the predominant factor that affects data flow runs. Other triggers include user interactions, failures, and maintenance activities. You can examine the relations among events to trace their consequences and to pinpoint inconsistencies.
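
As a minimal sketch of tracing events in order, the following Python snippet groups event reasons by event type and sorts them by time. The record layout (dictionaries with "type", "timestamp", and "reason" keys) and the sample values are assumptions made for illustration; the lifecycle report itself may present the data differently.

```python
from collections import defaultdict

# The three event types for which the lifecycle report lists reasons.
REPORTED_TYPES = {
    "PartitionStatusTransitionMessage",
    "RunStatusTransitionMessage",
    "RebalanceRunMessage",
}


def group_reasons_by_type(events):
    """Return {event type: [reasons in time order]} for the reported types.

    `events` is a hypothetical list of dicts with "type", "timestamp",
    and "reason" keys; this is not an official export format.
    """
    grouped = defaultdict(list)
    for event in sorted(events, key=lambda e: e["timestamp"]):
        if event["type"] in REPORTED_TYPES:
            grouped[event["type"]].append(event["reason"])
    return dict(grouped)


# Hypothetical sample data; timestamps are plain integers for simplicity.
sample_events = [
    {"type": "RunStatusTransitionMessage", "timestamp": 2,
     "reason": "Pausing run while rebalancing"},
    {"type": "TopologyChangeEvent", "timestamp": 1, "reason": ""},
    {"type": "RebalanceRunMessage", "timestamp": 3,
     "reason": "Joining run due to server startup"},
]

grouped_reasons = group_reasons_by_type(sample_events)
```

Events of other types, such as TopologyChangeEvent, are skipped here only because the report does not list reasons for them; you can still check them for additional context.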

The following reasons appear in the Event details section of the report:

 
| Reason | Description |
| --- | --- |
| Applying node fail policy after node <node> became unreachable | The data flow engine received a notification of a change in the network topology. The engine applies the fail policy that you configured. |
| Applying node fail policy for unreachable nodes in run with status | A node failed and the data flow engine applies the node failure policy that you configured. |
| Assigned to recover a rebalance | The rebalancing process could not finish and will restart. Changes in the topology are among the most common reasons why rebalancing fails. For additional data, check the TopologyChangeEvent event. |
| Clear state for new run | A run starts or restarts and the previous state is cleared. |
| Data flow execution failed | The data flow engine could not start or continue the run. For additional data, check the Progress section of the data flow run. |
| Detected active run with unreachable nodes. Applying node fail policy for nodes | When a node leaves the cluster, the data flow engine applies the fail policy and, depending on the run configuration, can trigger rebalancing. |
| Detected unreachable node <node> will try to rebalance run <run ID> | When a node leaves the cluster, the data flow engine applies the fail policy and, depending on the run configuration, can trigger rebalancing. |
| Encountered error while processing run: <run ID> | The run encountered an exception and did not recover. Partitions move to the Failed state and the event includes the details of the exception. For additional data, check the Progress section of the data flow run. |
| Ensure finalization | A run that failed or completed with failures has been resubmitted but no partitions require processing. The run transitions to the In progress state and then finishes. |
| Finalizing run since service <X> does not exist | The service instance to which the run was updated does not exist in the system. This reason is typical of runs that you trigger through the API or import into the system. |
| First node joined. Stabilizing status | All nodes in a cluster failed. The first node that reactivates ensures that all runs are in a stable state so that they can start. |
| Force status transition which hasn't updated for more than <X> millis | Nodes failed while the run was changing state and the data flow engine forces the run to move to a stable state. |
| Forcefully pausing run | If the state of the run does not match the actual state of the system, the first node that detects the inconsistency forces the run to transition to a stable state. For example, this reason can apply when node services are stopped but the run is in the In progress state. |
| Forcing status transition <status_A -> status_B> as run has unknown service | The service instance in which you configured the run is no longer available in the system. This event is likely to occur during import operations. |
| Found run with unassigned partitions. Node <node> is not processing the run | The data flow engine detected that some partitions of a run are not processed and it initiates the rebalancing process. This reason is typical of streamable runs that require all partitions to run all the time. |
| Found unreachable nodes <nodes> while rebalancing | The data flow engine detects changes in the topology and attempts to recover the process. |
| Interrupted run belonging to unknown service | The data flow engine interrupted the current run to transition it to a stable state because the service instance is no longer available. |
| Interrupting run without present run configuration | The data flow engine interrupted the current run to transition it to a stable state because the engine did not locate the configuration for the run. |
| Joining run due to server startup | A node joined the service cluster and the data flow engine rebalances the run so that the node can process partitions. This reason is specific to the RebalanceRunMessage event. |
| Last node left. Stabilizing status | The data flow engine receives a notification that the last node leaves the service cluster and the engine moves the run to a stable state. |
| Node <node> joined data flow service <instance> | The data flow engine rebalances all runs that run on the specified service instance. |
| Paused run | The run paused, for example, because a node failed when it resumed the run. For additional data, check the TopologyChangeEvent event. |
| Pausing run because it's no longer managed or no nodes are available | The run paused and the run configuration is updated. For additional data, check the ProcessedRunConfigUpdateMessage event. |
| Pausing run because node <X> is stopping | The run paused because a TopologyChangeEvent event affected the run. As a result, the run goes through the failure policy that is part of the run configuration. This event can happen when you remove the node from the cluster through the Services landing page. |
| Pausing run since service <X> does not exist | The run paused because the run was configured to start in a service instance that does not exist. This event can occur during import operations or when you trigger the run through the API. For additional data, check the ProcessedRunConfigUpdateMessage event. |
| Pausing run while rebalancing | The run paused and the rebalancing process is running. |
| Pre-activity failed | The pre-processing activity failed. The run moves to the Failed state and does not start. |
| Pre-activity skipped the run | If you design the pre-processing activity to skip the run based on certain pre-conditions, the run moves to the Stopped state and then triggers the post-processing activity. |
| Resumable failed runs: Prepare to resume failed partitions for resubmit; Prepare for resubmit of completed with failures run. Non-resumable failed runs: Prepare resubmit for non-resumable completed with failures run; Prepare resubmit for not completed partitions | The data flow engine prepares the partitions of the run for resubmission. Each case is handled differently, depending on the state and the type of the run and the partitions. |
| Rebalancing run due to run config update | The data flow engine detected an update to the configuration of the run and rebalances the run to accommodate the change. |
| Recovering rebalance of an in progress run | The node that initiated the rebalancing process could not finish it. A different node resumes the process. |
| Reset partitions belonging to unreachable node <node> | The data flow engine detected a failed node and the run will restart the partition. |
| Resetting partitions of dead nodes <nodes> as run is resumable | The data flow engine prepares the partitions for a reset. |
| Resuming a system-paused resumable run; Resuming a paused streamable run | The data flow engine pauses the run. This event can happen mainly due to changes in the topology that affect the run or due to a user action. If the action is not performed through an external API, make sure that the run is active. |
| Resuming managed run due to service startup | Managed runs start automatically when the first service node joins the cluster. |
| Resuming newly managed run | A managed run resumes. For additional data, check the ProcessedRunConfigUpdateMessage events. |
| Resuming run while rebalancing | The rebalancing process is running. The run moves into the Resuming state and then into the In progress state. |
| Retrying to finalize run | The node that initiates the run finalization is disabled and a different node attempts to finalize the run. For additional data, check the TopologyChangeEvent event. |
| Run finished but not all partitions were in end state | The data flow run finished but some threads did not update their partitions. The data flow engine updates the partitions. |
| Run is in progress | The run starts processing records from this partition. |
| Run was pausing while it actually completed. Will ignore the pause and finalize it | The run was pausing at the same time as all the records finished processing. |
| Stopping/Pausing/Resuming run through API call | Either a user or an automated external mechanism triggered a run stop, pause, or resume activity outside of the data flow engine. |
| There are no service nodes present to pause the run. Forcing transition | No nodes are available in the service cluster. For additional data, check the TopologyChangeEvent event. |
| There are no service nodes while the run is pausing | The data flow engine does not receive a notification that the last node leaves the service cluster and moves the run to the Paused state. |
| Updating partition state because run reached unexpected status: <status> | The data flow engine could not determine the status of the run. To troubleshoot the issue, analyze the previous status, topology change events, and any errors that occurred. |
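
Several of the reasons above contain placeholders such as <node> or <run ID>. When you review many events, it can help to match a concrete reason string against its template and extract the placeholder values. The following Python sketch shows one possible approach; the helper names and the sample node and run identifiers are hypothetical, and the sketch assumes simple placeholders that do not contain angle brackets themselves (so it does not handle the <status_A -> status_B> form).

```python
import re

# Matches a simple placeholder such as <node> or <run ID>.
PLACEHOLDER = re.compile(r"<([^<>]+)>")


def template_to_regex(template):
    """Compile a reason template into a regex with one group per placeholder."""
    # re.split with a capturing group alternates literal text (even indexes)
    # and placeholder names (odd indexes).
    parts = PLACEHOLDER.split(template)
    pattern = ""
    for index, part in enumerate(parts):
        if index % 2 == 0:
            pattern += re.escape(part)  # literal text, matched verbatim
        else:
            pattern += "(.+?)"          # placeholder, matched lazily
    return re.compile("^" + pattern + "$")


def match_reason(template, reason):
    """Return {placeholder name: value}, or None if the reason does not match."""
    names = PLACEHOLDER.findall(template)
    match = template_to_regex(template).match(reason)
    if match is None:
        return None
    return dict(zip(names, match.groups()))


TEMPLATE = "Detected unreachable node <node> will try to rebalance run <run ID>"

# "node-2" and "DF-17" are made-up values used only for illustration.
extracted = match_reason(
    TEMPLATE,
    "Detected unreachable node node-2 will try to rebalance run DF-17",
)
```

Lazy groups keep each placeholder as short as possible, which works when the literal text between placeholders is distinctive. Reasons whose values can contain that literal text would need a stricter pattern.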

For more information about data flows and services, see Data Flows landing page and Data Flow service.

For more information about accessing event details, see Creating a real-time run for data flows and Creating a batch run for data flows.

Glossary

The following list details some of the terms that you can find in event reasons: 

  • Finalizing a run: This term refers to the post-processing tasks that the data flow engine executes when a run transitions to its final state. 
  • Forced transition: The data flow engine forces a transition to a stable state so that you can take action on the run, for example, when a node fails during transition. 
  • Rebalancing: When a node joins or leaves the cluster, the data flow engine analyzes the associations between partitions and nodes and starts the run so that the run operates in the appropriate cluster topology. The rebalancing process reassigns partitions to a different set of nodes.
  • Resubmission: When a run finishes in the Failed state or the Completed with failures state, you can continue the processing of the run or restart the processing of the partitions that caused issues.
  • Service cluster: The set of nodes that you configured in a data flow service type.
  • Service instance: One of the data flow service types.
  • Stable state: One of the following data flow run states:
    • New
    • Completed
    • Paused
    • Completed with failures
    • Stopped
    • Failed
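
To make the stable state and rebalancing terms more concrete, the following Python sketch checks whether a state is one of the stable states listed above and reassigns partitions across a set of nodes. The round-robin strategy, the function names, and the node names are assumptions chosen for illustration; the source does not describe how the data flow engine actually distributes partitions.

```python
# Stable states, as listed in the glossary above.
STABLE_STATES = frozenset({
    "New",
    "Completed",
    "Paused",
    "Completed with failures",
    "Stopped",
    "Failed",
})


def is_stable(state):
    """Return True if the run state is one of the listed stable states."""
    return state in STABLE_STATES


def rebalance(partitions, nodes):
    """Assign partitions to nodes in round-robin order.

    This is a hypothetical strategy, not the engine's algorithm. It only
    illustrates that the assignment changes when the node set changes.
    """
    if not nodes:
        return {}  # no nodes available, so nothing can be assigned
    assignment = {node: [] for node in nodes}
    for index, partition in enumerate(partitions):
        assignment[nodes[index % len(nodes)]].append(partition)
    return assignment


# Hypothetical cluster with two nodes, then one node leaves.
before = rebalance([0, 1, 2, 3], ["node-a", "node-b"])
after = rebalance([0, 1, 2, 3], ["node-a"])
```

In this sketch, the partitions that node-b held move to node-a after node-b leaves, which mirrors the glossary description of reassigning partitions to a different set of nodes.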

Published March 19, 2018 — Updated March 22, 2019

