Data Set rule form Completing Data Sets
About New Data Set History

The way a data set is configured to represent data depends on the data set type.

Database Table

Define the keys.

The Database Table section displays the database table name the class is mapped to.
In the Selectable Keys section, add as many keys entries as necessary, and map each key to a property.
In the Partitioning key section, select the property used to split the data into as many equal segments as possible across DNodes.
- To ensure a balanced distribution, select a property that is suitable for partitioning. For example, assuming a table containing customer information, country information is a suitable property for partitioning since it contains enough shared distinct values, but email address is not as it typically has as many distinct values as customer entries.
- Another consideration is the correlation between amount of segments (the grouped distinct values delivered by the property) and number of DNodes. The ideal distribution is considered to be as many segments as DNodes.

Decision Data Store

Define the keys.

The keys that you specify in a data set define the data records managed in the Cassandra internal storage. Add as many keys entries as necessary and map each key to a property.
The first property in the list of keys is the partitioning key used to distribute data across different decision nodes. To keep the decision nodes balanced, make sure you use a partitioning key property with many distinct values.
Changing keys in an existing data set is not supported. You have to create another instance.

HBase Data Set

This data set is designed to read and save data from an external HBase storage. The set is also an extension to the HBase connector.

Connection

In order to connect to this data set, go to the HBase Data Set tab. In the HBase connector field, you need to reference the Rule-Connect-HBase connector rule which applies to the same class. Click Test connectivity to check the file system availability.

The Rule-Connect-HBase connector rule references the Data-Admin-Hadoop configuration rule, this configuration rule should contain the host and port of the HBase master node. For details, see About Data-Admin-Hadoop configuration.

Note: Remember to map fields stored in the data source to the Pega 7 properties. For details, see Completing the Mappings tab.

HDFS Data Set

This data set is designed to read and save data from an external, distributed Apache Hadoop File System (HDFS).

Connection

In order to connect to this data set, go to the HDFS Data Set tab. In the Hadoop configuration instance field, you need to reference the Data-Admin-Hadoop configuration rule which should contain host and port of the NameNode of the HDFS. Click Test connectivity to check the file system availability.

Note: The HDFS data set is optimized to support connections to one Apache Hadoop environment. When, in one data flow, you use HDFS data sets connecting to different Apache Hadoop environments, the data sets cannot use authentication.

File system configuration

In the File path field, set the path to the sources and output file(s).

The file path specifies the group of files that the data set represents. This group is based on a file within the original path but contains also all the files that apply to the following pattern: fileName-XXXXX, where XXXXX are sequence numbers starting from 00000. This is a result of data flows saving records in batches. The save operation appends data to the existing HDFS data set without overwriting it.

When you connect to the data set and set the file path, click Preview File to check the beginning of the selected file. You can view the first 100 KB of the file.

Parser configuration

In the Parser configuration section, you can choose the file format (CSV or JSON - one JSON object per file row) that the data set will represent. After choosing the file format, additional configuration sections are displayed:

For CSV, you specify what kind of delimiter character you want to use (Comma, Colon, Semicolon, TAB, Vertical bar) and what quotes (Double or Single) are supported.

Properties mappings for the CSV format are based on the columns' order. In this section you can add and reorder properties. The first property is used to populate or read data from the first column, and so on.

For JSON, no additional configuration is required.

There are two properties mapping modes supported for the JSON file format:

By default, the Use property auto mapping check box is selected. In this mode the names of properties are directly used as column names. This mode supports nested JSON structures, which are directly mapped to the page and page list properties in the data model of the class where the data set instance applies to.
If you deselect the Use property auto mapping check box, you need to specify the link between JSON columns and properties. This mode does not support nested structures.

Stream Data Set

This type of data set allows you to process continuous data stream of events (records).

Stream tab

The Stream tab contains details about the exposed services (REST and WebSocket). These exposed services handle stream data set as a resource located at http://<HOST>:7003/stream/<DATA_SET_NAME>, for example: http://10.30.27.102:7003/stream/MyEventStream

Settings tab

The Settings tab allows you to set additional options for your stream data set. After saving the rule instance, you cannot change the settings.

Authentication

The REST and WebSockets endpoints are secured by using the Pega 7 common authentication scheme. Each post to the stream requires authenticating with Pega 7 username and password. By default the Enable basic authentication check box is selected.

In the Retention period field, you specify how long the data set keeps the records. The default value is 1 day.

In the Log file size field, you specify the size of the log files. You need to specify the value between 10 MB and 50 MB; the default value is 10MB.

Visual Business Director

No configuration required. The data set instance is automatically configured with the Visual Business Director server location as defined by the Visual Business Director connection.

Open topic with navigation

Data Set rule formCompleting Data Sets