Data Set rule form
The way a data set is configured to represent data depends on the data set type. The Pega 7 Platform allows you to create the following types of data sets: HBase, HDFS, Stream, Visual Business Director, Facebook, Twitter, and YouTube.
HBase data set
This data set is designed to read data from and save data to external HBase storage. It is an extension of the HBase connector.
Connection
To connect to this data set, go to the HBase Data Set tab. In the HBase connector field, reference the Rule-Connect-HBase connector rule that applies to the same class. Click Test connectivity to check the availability of the HBase storage.
The Rule-Connect-HBase connector rule references the Data-Admin-Hadoop configuration rule, which must contain the host and port of the HBase master node. For details, see About Data-Admin-Hadoop configuration.
Note: Remember to map the fields stored in the data source to Pega 7 Platform properties. For details, see Completing the Mappings tab.
HDFS data set
This data set is designed to read data from and save data to an external Apache Hadoop Distributed File System (HDFS).
Connection
To connect to this data set, go to the HDFS Data Set tab. In the Hadoop configuration instance field, reference the Data-Admin-Hadoop configuration rule, which must contain the host and port of the HDFS NameNode. Click Test connectivity to check the file system availability.
Note: The HDFS data set is optimized to support connections to one Apache Hadoop environment. If, in one data flow, you use HDFS data sets that connect to different Apache Hadoop environments, those data sets cannot use authentication.
File system configuration
In the File path field, set the path to the source and output files.
The file path specifies the group of files that the data set represents. This group is based on a file within the original path, but also contains all files that match the pattern fileName-XXXXX, where XXXXX is a sequence number starting from 00000. These files result from data flows saving records in batches. The save operation appends data to the existing HDFS data set without overwriting it.
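For illustration, the following minimal Python sketch collects the files that belong to one HDFS data set according to the pattern described above. The file names, the directory listing, and the helper function are hypothetical, not part of the product.

import re

# Pattern described above: the original file plus batch files named
# fileName-XXXXX, where XXXXX is a zero-padded sequence number starting at 00000.
def files_for_data_set(file_path, directory_listing):
    base = file_path.rsplit("/", 1)[-1]
    batch = re.compile(re.escape(base) + r"-\d{5}$")
    return [name for name in directory_listing if name == base or batch.match(name)]

# Hypothetical directory listing:
listing = ["customers.csv", "customers.csv-00000", "customers.csv-00001", "other.csv"]
print(files_for_data_set("/data/customers.csv", listing))
# ['customers.csv', 'customers.csv-00000', 'customers.csv-00001']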
After you connect to the data set and set the file path, click Preview File to check the beginning of the selected file. You can view the first 100 KB of the file.
Parser configuration
In the Parser configuration section, you can choose the file format that the data set represents: CSV, or JSON with one JSON object per file row. After you choose the file format, additional configuration sections are displayed:
Property mappings for the CSV format are based on the order of the columns. In this section, you can add and reorder properties. The first property is used to populate or read data from the first column, and so on.
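As an illustration of order-based mapping, the following Python sketch pairs CSV columns with properties by position. The property names and CSV content are hypothetical.

import csv
import io

# The first configured property maps to the first CSV column,
# the second property to the second column, and so on.
properties = [".FirstName", ".LastName", ".Age"]  # hypothetical property list

csv_content = "John,Smith,42\nJane,Doe,37\n"
for row in csv.reader(io.StringIO(csv_content)):
    record = dict(zip(properties, row))
    print(record)
# {'.FirstName': 'John', '.LastName': 'Smith', '.Age': '42'}
# {'.FirstName': 'Jane', '.LastName': 'Doe', '.Age': '37'}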
Two property mapping modes are supported for the JSON file format:
Stream data set
This type of data set allows you to process a continuous stream of events (records).
Stream tab
The Stream tab contains details about the exposed services (REST and WebSocket). These services expose the stream data set as a resource located at http://<HOST>:7003/stream/<DATA_SET_NAME>, for example: http://10.30.27.102:7003/stream/MyEventStream
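For example, a record can be posted to the REST endpoint as in the following Python sketch. The host, data set name, credentials, and record fields are placeholders, and basic authentication is assumed to be enabled (see the Settings tab below).

import requests

# Hypothetical stream data set resource, following the URL scheme above.
url = "http://10.30.27.102:7003/stream/MyEventStream"

# One event (record); the field names are placeholders.
event = {"customerId": "C-1001", "action": "click"}

# Each post requires your user name and password (basic authentication).
response = requests.post(url, json=event, auth=("operator@pega.com", "password"))
response.raise_for_status()
print(response.status_code)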
Settings tab
The Settings tab allows you to set additional options for your stream data set. After you save the rule instance, you cannot change these settings.
Authentication
The REST and WebSocket endpoints are secured through the Pega 7 Platform common authentication scheme. Each post to the stream requires authentication with your user name and password. By default, the Enable basic authentication check box is selected.
In the Retention period field, you specify how long the data set keeps the records. The default value is 1 day.
In the Log file size field, you specify the size of the log files. Specify a value between 10 MB and 50 MB; the default value is 10 MB.
Visual Business Director data set
No configuration is required. The data set instance is automatically configured with the Visual Business Director server location, as defined by the Visual Business Director connection.
Note: The Facebook, Twitter, and YouTube data sets are available when your application has access to the Pega-NLP ruleset.
Facebook data set
Create this data set when you want to connect to the Facebook API. Reference the data set from a data flow and use the Free Text Model rule to analyze the text-based content of Facebook posts. The Facebook data set allows you to filter Facebook posts according to the keywords that you specify in it.
Creating an instance of the Facebook data set
Prerequisites:
Register on the Facebook developers website and create a Facebook app. The app is necessary to obtain the App ID and App secret details to be used with the Facebook data set.
Note: Do not use one instance of the Facebook data set in multiple data flows. Stopping one of the data flows stops the Facebook data set in the other data flows.
In the Facebook page URLs section, click Add URL and type the name of the Facebook page or pages whose text-based content you want to analyze.
Optional: In the Authors section, click Add author and type the names of the users whose posts you want to ignore.
Note: When specifying numerous keywords and authors, take the Facebook Graph API limitations into consideration. For more information, see the Graph API documentation.
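For reference, the kind of page-posts query that the Graph API provides looks roughly like the following Python sketch. The page name and credentials are placeholders, and the sketch illustrates the underlying API, not the data set's internal implementation.

import requests

# App access token built from the App ID and App secret mentioned above (placeholders).
app_id = "YOUR_APP_ID"
app_secret = "YOUR_APP_SECRET"
access_token = app_id + "|" + app_secret

# Fetch recent posts of a Facebook page through the Graph API.
page = "SomePublicPage"  # hypothetical page name
url = "https://graph.facebook.com/" + page + "/posts"
response = requests.get(url, params={"access_token": access_token})
response.raise_for_status()
for post in response.json().get("data", []):
    print(post.get("message", ""))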
Twitter data set
Create this data set when you want to connect to the Twitter API. Reference the data set from a data flow and use the Free Text Model rule to analyze the text-based content of tweets. The Twitter data set allows you to filter tweets according to the keywords that you specify in it.
Note: Do not use one instance of the Twitter data set in multiple data flows. Stopping one of the data flows stops the Twitter data set in the other data flows.
Creating an instance of the Twitter data set
Prerequisites:
Optional: Provide a Klout score API key.
Optional: In the Keywords section, click Add keyword and type the words that you want to find in the tweets.
In the Keywords section, you can also type Twitter authors (for example, @JohnSmith) that you want to find in tweets.
Optional: In the Timeline section, click Add author and type the names of the users whose tweets you want to analyze.
Note: It is recommended that you complete the Keywords or Timeline section. If you leave both empty, you analyze all tweets on the platform.
Optional: In the Authors section, click Add author and type the names of the users whose tweets you want to ignore.
Note: When specifying numerous keywords and authors, take the Twitter REST API limitations into consideration. For more information, see the Twitter REST API documentation.
Click Save.
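For reference, a keyword search against the Twitter REST API looks roughly like the following Python sketch. It uses the requests_oauthlib package; all keys and the keyword are placeholders, and the sketch illustrates the underlying API rather than the data set's internals.

import requests
from requests_oauthlib import OAuth1

# Placeholder credentials from your Twitter app.
auth = OAuth1("CONSUMER_KEY", "CONSUMER_SECRET",
              "ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")

# Search for tweets that contain a keyword (Twitter REST API v1.1).
url = "https://api.twitter.com/1.1/search/tweets.json"
response = requests.get(url, params={"q": "pega"}, auth=auth)
response.raise_for_status()
for tweet in response.json().get("statuses", []):
    print(tweet["text"])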
YouTube data set
Create this data set when you want to connect to the YouTube Data API. Reference the data set from a data flow and use the Free Text Model rule to analyze the metadata of YouTube videos. The YouTube data set allows you to filter the metadata of YouTube videos according to the keywords that you specify in it.
Note: Do not use one instance of the YouTube data set in multiple data flows. Stopping one of the data flows stops the YouTube data set in the other data flows.
Creating an instance of the YouTube data set
Prerequisites:
Obtain a Google API key from the Google developers website. This key is necessary to configure the YouTube data set and get access to YouTube data.
Optional: Select the Retrieve video URL check box.
If the metadata of a particular YouTube video contains the keywords that you specify, this option retrieves the URL of that video.
Optional: Select the Retrieve comments check box.
If the metadata of a particular YouTube video contains the keywords that you specify, this option retrieves all user comments about that video.
In the Keywords section, click Add keyword and type the keyword or keywords that you want to find in the video metadata. Metadata that contains the keywords undergoes text analysis.
Optional: In the Authors section, click Add author and type the names of the users whose videos you want to ignore.
Note: When specifying numerous keywords and authors, take the YouTube Data API limitations into consideration. For more information, see the YouTube Data API documentation.
Click Save.
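For reference, the kind of keyword search that the YouTube Data API provides looks roughly like the following Python sketch. The API key and keyword are placeholders; the sketch illustrates the underlying API, not the data set's internals.

import requests

# Google API key obtained from the Google developers website (placeholder).
api_key = "YOUR_GOOGLE_API_KEY"

# Search video metadata for a keyword (YouTube Data API v3).
url = "https://www.googleapis.com/youtube/v3/search"
params = {"part": "snippet", "q": "pega", "type": "video", "key": api_key}
response = requests.get(url, params=params)
response.raise_for_status()
for item in response.json().get("items", []):
    snippet = item["snippet"]
    print(snippet["title"], "-", snippet["channelTitle"])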