This documentation site is for previous versions.

Managing big data to make informed business decisions

Updated on September 10, 2021

Create big data sets and configure a Hadoop connection to process high-volume and complex data on Pega Platform™. Move data between the Apache Hadoop cluster and Pega Platform by performing read-write operations on data sets.

You can work with big data and external environments by using the following Pega Platform tools:

Hadoop records that configure a connection with an Apache Hadoop cluster.
HBase and HDFS data sets that connect to an external system so that you can perform read-write operations on high-volume data.
Monte Carlo data sets that generate realistic synthetic data for simulations.
External data flows that process data on an external environment.

Apache Cassandra

You can create Decision Data Store (DDS) data sets in Pega Platform to store your data in the internal Cassandra storage that is part of the platform. Cassandra is a structured, distributed database that is scalable, highly available, and designed to manage very large amounts of data. If you already store your data in an external Cassandra cluster, you can integrate it with Pega Platform by creating Connect Cassandra rules.

Apache Hadoop

Pega Platform provides access to data that is stored on an external Apache Hadoop cluster. You can read data from and write data to either an Apache HBase big data store or the Hadoop Distributed File System (HDFS) by using the appropriate data sets. You define the connection details for the Hadoop host and configure the connection settings for HBase and HDFS data sets by using Hadoop records.

Monte Carlo data set

You can test the strategies or data flows in your application in the absence of real data by using a Monte Carlo data set, which generates random, realistic-looking data. Create a Monte Carlo data set that you can later use as a source in Data Flow rules to run simulations.
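To make the idea of random, realistic-looking data concrete, the following Python sketch generates synthetic customer records from a few probability distributions. It is an illustration only and does not show how Pega Platform implements Monte Carlo data sets; the function name, field names, and distribution parameters are all hypothetical.

```python
import random


def generate_customers(count, seed=None):
    """Return `count` synthetic customer records as dictionaries.

    All field names and distributions here are hypothetical and were
    chosen only for illustration.
    """
    rng = random.Random(seed)  # a private generator keeps runs reproducible
    segments = ["bronze", "silver", "gold"]
    records = []
    for i in range(count):
        records.append({
            "customer_id": f"C{i:05d}",
            # Age from a normal distribution, clamped to a plausible range.
            "age": min(90, max(18, round(rng.gauss(42, 12)))),
            # Segment drawn with unequal weights so the mix is not uniform.
            "segment": rng.choices(segments, weights=[6, 3, 1])[0],
            # Monthly spend from a log-normal distribution (never negative).
            "monthly_spend": round(rng.lognormvariate(4.0, 0.5), 2),
        })
    return records


if __name__ == "__main__":
    for record in generate_customers(3, seed=7):
        print(record)
```

Passing a fixed seed makes a simulation repeatable, which helps when you compare two versions of a strategy against the same synthetic input.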

Creating a Monte Carlo data set

HBase data set

You can retrieve data that is in a high-volume data source or save a large number of records to it by connecting to the external Apache HBase big data store. Create an HBase data set that is specifically designed for this purpose and that you can use later in Data Flow rules as either a source or destination.

HDFS data set

You can retrieve data that is in a high-volume data source or save a large number of records to it by connecting to the Apache Hadoop Distributed File System (HDFS). Create an HDFS data set that is specifically designed for this purpose and that you can use later in Data Flow rules as either a source or destination.

Kafka

Apache Kafka is a fault-tolerant and scalable platform that you can use as a data source for real-time analysis of customer records, such as messages and calls, as they occur. The most efficient way to use Kafka data sets in your application is through Data Flow rules that include event strategies. Create a Kafka data set that you can later use as a source in Data Flow rules to run simulations.
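The kind of real-time analysis described above often comes down to aggregating events over a moving time window. The Python sketch below counts events per customer within a sliding window. It is a conceptual illustration under stated assumptions: it uses no Kafka client and is not a Pega event strategy, the class name `SlidingWindowCounter` is hypothetical, events are fed in by hand, and timestamps are assumed to arrive in non-decreasing order.

```python
from collections import deque


class SlidingWindowCounter:
    """Count events per key within the last `window_seconds` seconds.

    Hypothetical helper for illustration; assumes timestamps are
    non-decreasing, as they would be for an ordered event stream.
    """

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.events = deque()  # (timestamp, key) pairs in arrival order
        self.counts = {}       # key -> number of events still in the window

    def add(self, timestamp, key):
        """Record one event and return the current count for its key."""
        self.events.append((timestamp, key))
        self.counts[key] = self.counts.get(key, 0) + 1

        # Evict events that have fallen out of the window.
        cutoff = timestamp - self.window_seconds
        while self.events and self.events[0][0] <= cutoff:
            _, old_key = self.events.popleft()
            self.counts[old_key] -= 1
            if self.counts[old_key] == 0:
                del self.counts[old_key]
        return self.counts.get(key, 0)


if __name__ == "__main__":
    counter = SlidingWindowCounter(window_seconds=60)
    for ts, customer in [(0, "a"), (30, "a"), (59, "b"), (61, "a")]:
        print(ts, customer, counter.add(ts, customer))
```

A rule such as "three dropped calls within one minute" could be expressed by comparing the returned count against a threshold each time an event arrives.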

File data set

You can access and process data that is stored in CSV and JSON files by using a File data set.
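As a rough picture of what turning CSV and JSON files into records involves, the following Python sketch parses both formats into lists of dictionaries by using only the standard library. It does not represent the File data set implementation; the function names are hypothetical, and the JSON input is assumed to hold one object per line.

```python
import csv
import io
import json


def read_csv_records(text):
    """Parse CSV text with a header row into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def read_json_records(text):
    """Parse text holding one JSON object per line (an assumed layout)."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


if __name__ == "__main__":
    csv_text = "id,name\n1,Ada\n2,Grace\n"
    json_text = '{"id": 1, "name": "Ada"}\n{"id": 2, "name": "Grace"}\n'
    print(read_csv_records(csv_text))
    print(read_json_records(json_text))
```

Note that the CSV reader returns every value as a string, while the JSON parser preserves numbers and booleans, so records from the two formats may need type conversion before they are processed together.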

Creating a File data set

External data flows

You can run predictive analytics models and process high-volume data without overloading the connection between the Apache Hadoop cluster and Pega Platform by using external data flows, which run on the external system. An external data flow can use only HDFS data sets as its source and destination.
