Data flow run updates

Data flow runs on Pega® Platform have been enhanced to increase the resilience, usability, and flexibility with which you run Data Flow rules in your application.

The following sections describe these enhancements.

Resilience for data flow runs

You can determine the behavior of a data flow run when logic errors or system failures occur. For logic errors, you can configure the maximum number of failed records that the run tolerates. If the number of failed records is less than or equal to this maximum, the run continues; if it is greater, the run automatically fails. You can inspect the failed records and the root cause of each failure by clicking the link with the number of errors, which is available for each data flow component on the Component statistics tab.
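
Conceptually, the error threshold behaves like a counter of failed records that is compared with the configured maximum after every failure. The following minimal Java sketch illustrates that logic; the class and method names are illustrative assumptions and are not part of the Pega Platform API.

    // Illustrative sketch only, not the Pega Platform API. Counts failed records
    // against a configured maximum and signals when the run must fail.
    public class ErrorThresholdTracker {

        private final long maxFailedRecords;   // maximum configured for the run
        private long failedRecords;

        public ErrorThresholdTracker(long maxFailedRecords) {
            this.maxFailedRecords = maxFailedRecords;
        }

        // Call whenever a record fails with a logic error.
        // Returns true if the run can continue, false if it must be marked as failed.
        public synchronized boolean recordFailure() {
            failedRecords++;
            // The run keeps going while the failure count stays at or below the maximum.
            return failedRecords <= maxFailedRecords;
        }

        public synchronized long getFailedRecords() {
            return failedRecords;
        }
    }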

You can take the following actions for data flow runs if a node fails (a conceptual sketch of these options follows the list):

  • You can continue the data flow run on the remaining functional nodes, starting either from the last snapshot (for resumable data flow runs) or from the beginning of the data partition (for non-resumable data flow runs). When this behavior type is enabled, each record can be processed more than once.
  • You can skip the data partitions that were processed on the node that failed. This option is available only for batch mode data flow runs. When this behavior is enabled, each record can be processed at most once. This means that some records might not be processed at all.
    On each Data Flow service node, at any point during a data flow run, there can be as many data partitions as are configured on the Data Flow service tab.
  • You can mark the data flow run as failed.
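
The following minimal Java sketch shows how these three behaviors could be applied when a node fails. The class, enum, and method names are illustrative assumptions and are not part of the Pega Platform API.

    import java.util.List;

    // Illustrative sketch only, not the Pega Platform API. Applies one of the three
    // configurable reactions to the failure of a Data Flow service node.
    public class NodeFailureHandler {

        enum FailureBehavior { CONTINUE_ON_REMAINING_NODES, SKIP_PARTITIONS, FAIL_RUN }

        interface Partition {
            long lastSnapshotId();   // last record captured in a snapshot
            void markSkipped();
        }

        private final FailureBehavior behavior;

        public NodeFailureHandler(FailureBehavior behavior) {
            this.behavior = behavior;
        }

        // Reacts to a node failure; 'partitions' are the partitions that node was processing.
        public void onNodeFailure(List<Partition> partitions, boolean resumableRun) {
            switch (behavior) {
                case CONTINUE_ON_REMAINING_NODES:
                    for (Partition p : partitions) {
                        // Resumable runs restart from the last snapshot; non-resumable runs
                        // restart from the beginning of the partition, so records can be
                        // processed more than once.
                        long startRecordId = resumableRun ? p.lastSnapshotId() : 0L;
                        reassignToHealthyNode(p, startRecordId);
                    }
                    break;
                case SKIP_PARTITIONS:
                    // Batch runs only: each record is processed at most once, and the
                    // records of the skipped partitions are not processed at all.
                    partitions.forEach(Partition::markSkipped);
                    break;
                case FAIL_RUN:
                    throw new IllegalStateException("Data flow run marked as failed");
            }
        }

        private void reassignToHealthyNode(Partition p, long startRecordId) {
            // Omitted: hand the partition over to one of the remaining functional nodes.
        }
    }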

Batch data flow run resilience settings for resumable and non-resumable data flow runs

Kafka source data flow

Real-time data flow run resilience settings

Resumability of data flow runs

Beginning with Pega 7.3, data flow runs are characterized as resumable or non-resumable. A data flow run is resumable if the primary source of the referenced Data Flow rule is a resumable data set, that is, a data set that contains an ordered collection of records.

The following data set types are resumable:

  • In Pega 7.3 - Database Table, Kafka, and Stream
  • In Pega 7.3.1 - Database Table, Decision Data Store (DDS), HDFS, Kafka, and Stream
  • In Pega 7.4 - Database Table, Decision Data Store (DDS), HBase, HDFS, Kafka, and Stream

Resumable runs can be paused and resumed. In the case of node failure, the active data partitions are transferred to the remaining functional nodes and resume from the last correctly processed record, which is captured periodically as a snapshot. For non-resumable runs, no snapshots are taken because the order of incoming records cannot be guaranteed. Therefore, the starting point for non-resumable data flow runs is the first record in each partition.

Pause and resume

Pausing and resuming a data flow run

Error threshold

You can control the acceptable number of errors per run for both real-time and batch data flow runs by specifying the error threshold. If this error threshold is exceeded, the data flow run is terminated and marked as failed. You can decide to continue such data flow runs. For more information, see Failed data flow runs and failed records.

State snapshots

You can specify how often the state of a data flow run is saved, expressed as a number of processed records. Each time the data flow run processes the specified number of records, its state is saved in the form of the ID of the last successfully processed record. When processing resumes, the starting point is the record that immediately follows the last processed record.
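
The following minimal Java sketch illustrates this snapshot mechanism. The names and the in-memory bookkeeping are illustrative assumptions, not the Pega Platform implementation, which persists the run state.

    // Illustrative sketch only, not the Pega Platform implementation. Saves the run
    // state every N successfully processed records as the ID of the last processed
    // record; on resume, processing restarts at the record that follows that ID.
    public class SnapshotTracker {

        private final long snapshotEveryNRecords;  // configured snapshot frequency
        private long processedSinceSnapshot;
        private long lastSavedRecordId = -1;       // -1 means that no snapshot exists yet

        public SnapshotTracker(long snapshotEveryNRecords) {
            this.snapshotEveryNRecords = snapshotEveryNRecords;
        }

        // Call after each record has been processed successfully.
        public void onRecordProcessed(long recordId) {
            processedSinceSnapshot++;
            if (processedSinceSnapshot >= snapshotEveryNRecords) {
                lastSavedRecordId = recordId;      // a real implementation persists this value
                processedSinceSnapshot = 0;
            }
        }

        // The record to resume from: the one immediately after the last saved record.
        public long resumeFromRecordId() {
            return lastSavedRecordId + 1;          // 0, the first record, if no snapshot exists
        }
    }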

For more information, see Creating a batch run for data flows and Creating a real-time run for data flows.

Automatic rescaling for data flow runs

When a Data Flow service node fails or is removed from the cluster, the data partitions that were being processed on that node are automatically transferred to the remaining functional nodes in the Data Flow service. Similarly, if you add a Pega Platform node to the Data Flow service while a data flow run is active, that data flow run automatically accommodates the additional processing capacity that is provided by the newly added node. This way, you can automatically adjust the data flow processing to the amount of available hardware and the current node topology.
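
A simple way to picture the rescaling is as a redistribution of partitions over whichever nodes are currently available. The following minimal Java sketch uses a round-robin assignment; it is an illustrative assumption, not the Pega Platform implementation.

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Illustrative sketch only, not the Pega Platform implementation. Reassigns data
    // partitions evenly over the nodes that currently belong to the Data Flow service.
    public class PartitionRebalancer {

        public static Map<Integer, String> rebalance(List<Integer> partitionIds, List<String> nodeIds) {
            Map<Integer, String> assignment = new HashMap<>();
            for (int i = 0; i < partitionIds.size(); i++) {
                // Round-robin assignment spreads the partitions over the available nodes.
                assignment.put(partitionIds.get(i), nodeIds.get(i % nodeIds.size()));
            }
            return assignment;
        }

        public static void main(String[] args) {
            List<Integer> partitions = List.of(0, 1, 2, 3, 4, 5);
            // One node left the cluster, so only two nodes remain to share the six partitions.
            System.out.println(rebalance(partitions, List.of("node-A", "node-B")));
        }
    }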

You can view the processing status for each data partition that was assigned to process the data flow. The status information includes the node ID, the number of processed records, the number of values, and the last saved record ID. You can access this information by clicking the partition's number on the Distribution details tab of the data flow run work item.

Failed data flow runs and failed records

Beginning with Pega 7.3.1, you can continue processing batch data flow runs that failed and reprocess only the partitions that contain failed records.

You now have an option to continue batch data flow runs from the record where they failed.

A batch data flow run can finish with one of the following states:

  • Completed – Data flow run finished and processed all its records successfully.

  • Completed with failures – Data flow run finished and processed all its records, but some of them failed. You need to troubleshoot the failed records and reprocess the partitions that contain them.

  • Failed – Data flow run stopped and did not process all its records. A data flow run can fail, for example, when the error threshold is exceeded.

Continuing a failed data flow run

When the run finishes with failures, you can identify all the records that failed during the run by clicking the number in the # Failed records column. After you troubleshoot the run and the failed records, you can reprocess just the partitions that contain the failed records to see if failures still occur.

Reprocessing failed records in a data flow run

When you reprocess failures, you resubmit all the partitions that contain failed records to reprocess all the records that are on these partitions, whether or not they failed during the run. If, for example, processing a record successfully triggers an action such as sending an email, resubmitting a partition with failed records results in sending the emails again.
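
The following minimal Java sketch shows the selection of partitions to resubmit: every partition with at least one failed record is reprocessed in full. The names are illustrative assumptions, not the Pega Platform API.

    import java.util.List;
    import java.util.stream.Collectors;

    // Illustrative sketch only, not the Pega Platform API. Selects the partitions to
    // resubmit when failures are reprocessed; all records on those partitions run
    // again, including the ones that previously succeeded.
    public class FailureReprocessor {

        record PartitionStatus(String partitionId, long failedRecords) {}

        public static List<String> partitionsToResubmit(List<PartitionStatus> statuses) {
            return statuses.stream()
                    .filter(status -> status.failedRecords() > 0)
                    .map(PartitionStatus::partitionId)
                    .collect(Collectors.toList());
        }

        public static void main(String[] args) {
            List<PartitionStatus> run = List.of(
                    new PartitionStatus("P1", 6), new PartitionStatus("P2", 1),
                    new PartitionStatus("P3", 7), new PartitionStatus("P4", 2),
                    new PartitionStatus("P5", 0), new PartitionStatus("P6", 0));
            // Prints [P1, P2, P3, P4]; side effects such as sending emails happen again
            // for every record on these partitions, not only for the failed ones.
            System.out.println(partitionsToResubmit(run));
        }
    }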

With the option to continue failed batch data flow runs and to reprocess failures in the runs, you do not need to repeat full data flow runs, which saves time when your data flow runs contain millions of records.

For more information, see Reprocessing failed records in batch data flow runs.

Failed records count in batch data flow runs

Failed records in a batch data flow run are tracked for each partition that is used in the run. On the Data Flows landing page, you can view the total number of failed records for each data flow run. In the following example, a run fails because it exceeds its threshold of 20 failed records:

Run ID | Partition | Last ID | Status    | Failed records
DF-1   | P1        | 100     | Completed | 6
DF-1   | P2        | 85      | Stopped   | 6
DF-1   | P3        | 100     | Completed | 7
DF-1   | P4        | 100     | Completed | 0
DF-1   | P5        | 11      | Stopped   | 2
DF-1   | P6        | 6       | Stopped   | 0
Total number of failed records: 21
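
The failed records of all partitions add up to 21, which is greater than the threshold of 20, so the run fails. The following minimal Java sketch illustrates that check; the names are illustrative assumptions, not the Pega Platform API.

    import java.util.Map;

    // Illustrative sketch only, not the Pega Platform API. Sums the failed records
    // reported by each partition and compares the total with the run's error threshold.
    public class RunFailureCheck {

        public static boolean runFails(Map<String, Long> failedRecordsByPartition, long threshold) {
            long total = failedRecordsByPartition.values().stream().mapToLong(Long::longValue).sum();
            return total > threshold;
        }

        public static void main(String[] args) {
            Map<String, Long> failedByPartition =
                    Map.of("P1", 6L, "P2", 6L, "P3", 7L, "P4", 0L, "P5", 2L, "P6", 0L);
            System.out.println(runFails(failedByPartition, 20));  // true: 21 > 20
        }
    }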

Partitions P1, P3, and P4 were completed before the data flow run failed. If you continue the failed run, you continue processing only the stopped partitions (P2, P5, and P6). The status of the stopped partitions changes to New and their failed record counts are reset.

Run ID | Partition | Last ID | Status    | Failed records
DF-1   | P1        | 100     | Completed | 6
DF-1   | P2        | 85      | New       | 0
DF-1   | P3        | 100     | Completed | 7
DF-1   | P4        | 100     | Completed | 0
DF-1   | P5        | 11      | New       | 0
DF-1   | P6        | 6       | New       | 0
Total number of failed records: 13

When you continue the run, the count of failed records includes only the records that failed in the completed partitions (6 from P1 and 7 from P3). The run can finish processing partitions P2, P5, and P6 if it does not exceed the threshold again.

If the run fails repeatedly, you might not be able to complete it; in that case, consider troubleshooting it right away.

After the run completes processing partitions P2, P5, and P6, it is completed with failures.

Run ID | Partition | Last ID | Status    | Failed records
DF-1   | P1        | 100     | Completed | 6
DF-1   | P2        | 100     | Completed | 1
DF-1   | P3        | 100     | Completed | 7
DF-1   | P4        | 100     | Completed | 2
DF-1   | P5        | 100     | Completed | 0
DF-1   | P6        | 100     | Completed | 0
Total number of failed records: 16

Troubleshoot the run and the failed records before you reprocess failures. In these examples, when failures are reprocessed, partitions P1, P2, P3, and P4 are resubmitted for processing because they contained some failed records.

Run ID | Partition | Last ID | Status    | Failed records
DF-1   | P1        | -       | New       | 0
DF-1   | P2        | -       | New       | 0
DF-1   | P3        | -       | New       | 0
DF-1   | P4        | -       | New       | 0
DF-1   | P5        | 100     | Completed | 0
DF-1   | P6        | 100     | Completed | 0
Total number of failed records: 0

In the best-case scenario, after reprocessing failures, the run finishes with no failed records and its status is completed.

Published April 25, 2017 — Updated August 23, 2018

