Creating a batch run for data flows

Create batch runs for your data flows to make simultaneous decisions for large groups of customers. You can also create a batch run for data flows with a non-streamable primary input, for example, a Facebook data set.

Before you begin:

  1. Start the Data Flow service.

    For more information, see Configuring the Data Flow service.

  2. Check in the data flow that you want to run.

    For more information, see Rule check-in process.

  1. In the header of Dev Studio, click Configure > Decisioning > Decisions > Data Flows > Batch Processing.

  2. On the Batch processing tab, click New.

  3. On the New: Data Flow Work Item tab, associate a Data Flow rule with the data flow run:

    1. In the Applies to field, press the Down arrow key, and then select the class to which the Data Flow rule applies.

    2. In the Access group field, press the Down arrow key, and then select an access group context for the data flow run.

    3. In the Data flow field, press the Down arrow key, and then select the Data Flow rule that you want to run.

      The class that you select in the Applies to field limits the available rules.
    4. In the Service instance name field, select Batch.

  4. Optional:

    To run activities before the data flow run starts and after it completes, in the Additional processing section, specify the pre-processing and post-processing activities.

  5. Specify the error threshold for the data flow run:

    1. Expand the Resilience section.

    2. In the Fail the run after more than x failed records field, enter an integer greater than 0.

      After the number of failed records reaches or exceeds the threshold that you specify, the run stops processing data and the run status changes to Failed. If the number of failed records does not reach or exceed the threshold, the run continues to process data, and the run status then changes to Completed with failures.
  6. In the Node failure section, specify how you want the run to proceed in case the node becomes unreachable:

    • To resume processing records on the remaining active nodes, from the last processed record that is captured by a snapshot, select Resume on other nodes from the last snapshot. If you enable this option, the run can process each record more than once.

      This option is available only for resumable data flow runs. For more information about resumable and non-resumable data flow runs and their resilience, see the Data flow service overview article on Pega Community.

    • To resume processing records on the remaining active nodes from the first record in the data partition, select Restart the partitions on other nodes. If you enable this option, the run can process each record more than once.

      This option is available only for non-resumable data flow runs. For more information about resumable and non-resumable data flow runs and their resilience, see the Data flow service overview article on Pega Community.

    • To skip processing the data on the failed node, select Skip partitions on the failed node. If you enable this option, the run completes without processing all records. Records that are processed successfully are processed only once.
    • To terminate the data flow run and change the run status to Failed, select Fail the entire run.

      This option provides backward compatibility with previous versions of Pega Platform.

  7. For resumable data flow runs, in the Snapshot management section, specify how often you want the Data Flow service to take snapshots of the last processed record from the data flow source.

    If you configure the Data Flow service to take snapshots more frequently, you reduce the chance of processing records more than once after a failure, but you might also lower system performance.
  8. If your data flow references an Event Strategy rule, configure the state management settings:

    1. Expand the Event strategy section.

    2. Optional:

      To specify how you want the incomplete tumbling windows to act when the data flow run stops, in the Event emitting section, select one of the available options.

      By default, when the data flow run stops, all incomplete tumbling windows in the Event Strategy rule emit the collected events. For more information, see Event Strategy rule form - Completing the Event Strategy tab.
    3. In the State management section, specify how you want the Data Flow service to process data from event strategies:

      • To keep the event strategy state in running memory and write the output to a destination when the data flow finishes its run, select Memory.

        If you select this option, the Data Flow service processes records faster, but you can lose data in the event of a system failure.

      • To periodically replicate the state of an event strategy in the form of key values to the Cassandra database that is located in the Decision Data Store, select Database.

        If you select this option, you can fully restore the state of an event strategy after a system failure, and continue processing data.

    4. In the Target cache size field, specify the maximum size of the cache for state management data.

      The default value is 10 megabytes.
  9. Click Done.

    The system creates a batch run for your data flow and opens a new tab with details about the run. The run does not start yet.
  10. Click Start.

    The batch data flow run starts.
  11. Optional:

    To analyze the life cycle during or after a run and troubleshoot potential issues, review the life cycle events:

    1. On the Data flow run tab, click Run details.

    2. On the Run details tab, click View Lifecycle Events.

      The system opens a new window with a list of life cycle events. Each event has a list of assigned details, for example, reason. For more information, see Event details in data flow runs on Pega Community.

      By default, Pega Platform displays events from the last 10 days. You can change this value by editing the dataflow/run/lifecycleEventsRetentionDays dynamic system setting.
    3. Optional:

      To export the life cycle events to a single file, click Actions, and then select a file type.
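
The error threshold behavior described for the Resilience section can be illustrated with a short sketch. This is not Pega Platform code. The function name, the `RunStatus` enumeration, and the `Completed` status for a run with no failures are assumptions made for illustration; only the `Failed` and `Completed with failures` statuses come from the procedure above. The sketch follows the description that the run stops once the number of failed records reaches or exceeds the threshold.

```python
from enum import Enum


class RunStatus(Enum):
    COMPLETED = "Completed"  # assumed status for a run with no failed records
    COMPLETED_WITH_FAILURES = "Completed with failures"
    FAILED = "Failed"


def final_run_status(record_results, threshold):
    """Return (status, processed_count) for a simulated batch run.

    record_results: iterable of booleans, True when a record succeeds.
    threshold: value of the failed-records field, an integer greater than 0.
    """
    if threshold <= 0:
        raise ValueError("threshold must be an integer greater than 0")

    failed = 0
    processed = 0
    for succeeded in record_results:
        processed += 1
        if not succeeded:
            failed += 1
            # The run stops processing data as soon as the number of
            # failed records reaches or exceeds the threshold.
            if failed >= threshold:
                return RunStatus.FAILED, processed

    if failed > 0:
        return RunStatus.COMPLETED_WITH_FAILURES, processed
    return RunStatus.COMPLETED, processed
```

For example, with a threshold of 2, a run whose second and fourth records fail stops after the fourth record with the status Failed, and the fifth record is never processed.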

  • Reprocessing failed records in batch data flow runs

    When a batch data flow run finishes with failures, you can identify all the records that failed during the run. After you fix all the issues that are related to the failed records, you can reprocess the failures to complete the run by resubmitting the partitions with failed records. This option saves time when your data flow run processes millions of records and you do not want to start the run from the beginning.

  • Data Flows landing page

    This landing page provides facilities for managing data flows in your application. Data flows allow you to sequence and combine data based on various sources, and write the results to a destination. Data flow runs that are initiated through this landing page run in the access group context. They always use the checked-in instance of the Data Flow rule and the referenced rules.
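
The idea behind reprocessing failed records, resubmitting only the partitions that contain failures instead of restarting the whole run, can be sketched as follows. This is a conceptual illustration, not a Pega Platform API. The function name and the `(partition_id, record_id)` tuple layout are hypothetical.

```python
from collections import Counter


def partitions_to_resubmit(failed_records):
    """Group failed records by partition.

    failed_records: iterable of (partition_id, record_id) tuples
    (a hypothetical layout chosen for this sketch).
    Returns a dict that maps each partition with at least one failed
    record to its failure count, ordered by partition ID.
    """
    counts = Counter(partition_id for partition_id, _ in failed_records)
    return dict(sorted(counts.items()))
```

Only the partitions that appear in the result need to be resubmitted. Partitions without failed records are left alone, which is what saves time on runs that process millions of records.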
