Creating an HDFS data set record
You must configure each instance of the HDFS data set rule before the data set can read data from, and save data to, an external Apache Hadoop Distributed File System (HDFS).
- Create the HDFS data set record as described in Data Set rules - Completing the Create, Save As, or Specialization form.
- Connect to an instance of the Data-Admin-Hadoop configuration rule.
- In the Hadoop configuration instance field, reference the Data-Admin-Hadoop record that contains the HDFS storage configuration.
- Click Test connectivity to verify that Pega Platform can connect to the HDFS data set. The HDFS data set is optimized to support connections to a single Apache Hadoop environment. When HDFS data sets in a single instance of a data flow rule connect to different Apache Hadoop environments, those data sets cannot use authenticated connections concurrently. To use authenticated and non-authenticated connections at the same time, the HDFS data sets must all connect to the same Hadoop environment.
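Outside of Pega Platform, you can approximate what such a connectivity test does by opening a client connection with the standard Apache Hadoop client API. The following is a minimal sketch, not Pega's implementation; the NameNode URI is a placeholder assumption:

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsConnectivityCheck {
    public static void main(String[] args) throws Exception {
        // Placeholder NameNode URI; replace with your cluster's fs.defaultFS value.
        URI hdfsUri = URI.create("hdfs://namenode.example.com:8020");

        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", hdfsUri.toString());

        // FileSystem.get opens a client connection to the NameNode.
        try (FileSystem fs = FileSystem.get(hdfsUri, conf)) {
            // Checking the root directory confirms that the cluster is
            // reachable and that the current user may read from it.
            boolean reachable = fs.exists(new Path("/"));
            System.out.println("HDFS reachable: " + reachable);
        }
    }
}
```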
- In the File path field, specify a file path to the group of source and output files that the data set represents. The group is based on the file at the specified path, but also contains all files that follow the pattern fileName-XXXXX, where XXXXX is a sequence number starting from 00000. These files exist because data flows save records in batches. The save operation appends data to the existing HDFS data set without overwriting it. You can use * to match multiple files in a folder (for example, /folder/part-r-*).
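Because of this batching convention, one logical data set is spread across many part files. As a rough illustration of how such a group can be enumerated with a glob pattern (using the Hadoop client API directly, not Pega's internal code; the path is the example pattern from above):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListDataSetParts {
    public static void main(String[] args) throws Exception {
        // Assumes the Hadoop client configuration (fs.defaultFS) is on the classpath.
        try (FileSystem fs = FileSystem.get(new Configuration())) {
            // Matches part-r-00000, part-r-00001, and so on under /folder,
            // mirroring the /folder/part-r-* example from the File path field.
            FileStatus[] parts = fs.globStatus(new Path("/folder/part-r-*"));
            if (parts == null) {
                System.out.println("No matching part files");
                return;
            }
            for (FileStatus part : parts) {
                System.out.println(part.getPath() + " (" + part.getLen() + " bytes)");
            }
        }
    }
}
```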
- Optional: Click Preview file to view the first 100 KB of records in the selected file.
- In the File format section, select the file type that is used within the selected data set.
- CSV
  If your HDFS data set uses the CSV file format, you must specify the following properties for content parsing within Pega Platform:
  - The delimiter character for separating properties
  - The supported quotation marks
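The delimiter and quotation-mark settings determine how each line of the file is split into values. A minimal sketch of the same parsing idea, using the open-source Apache Commons CSV library rather than Pega's parser; the semicolon delimiter and sample data are assumptions:

```java
import java.io.Reader;
import java.io.StringReader;

import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVRecord;

public class CsvParsingSketch {
    public static void main(String[] args) throws Exception {
        // Assumed settings: semicolon delimiter, double quotation marks.
        CSVFormat format = CSVFormat.DEFAULT.builder()
                .setDelimiter(';')
                .setQuote('"')
                .build();

        Reader input = new StringReader("\"Smith\";42\n\"O'Brien\";37\n");
        for (CSVRecord record : format.parse(input)) {
            // Columns are addressed by position, matching the order-based
            // property mapping described later in this procedure.
            System.out.println(record.get(0) + " -> " + record.get(1));
        }
    }
}
```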
- JSON
- Parquet
  For data set write operations, specify the algorithm that is used for file compression in the data set (see the writer sketch after this list):
  - Uncompressed - Select this option if you do not use a file compression method in the data set.
  - Gzip - Select this option if you use the GZIP file compression algorithm in your data set.
  - Snappy - Select this option if you use the SNAPPY file compression algorithm in your data set.
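For comparison, this is how the three compression options look when writing a Parquet file with the open-source parquet-avro library. This is an illustrative sketch, not Pega's implementation; the schema, record, and output path are placeholder assumptions:

```java
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;

public class ParquetCompressionSketch {
    public static void main(String[] args) throws Exception {
        // Minimal one-field schema for the illustration.
        Schema schema = SchemaBuilder.record("Customer").fields()
                .requiredString("name")
                .endRecord();

        GenericRecord row = new GenericData.Record(schema);
        row.put("name", "Smith");

        // CompressionCodecName.UNCOMPRESSED, GZIP, and SNAPPY correspond
        // to the three options on the rule form.
        try (ParquetWriter<GenericRecord> writer =
                AvroParquetWriter.<GenericRecord>builder(new Path("/tmp/sample.parquet"))
                        .withSchema(schema)
                        .withCompressionCodec(CompressionCodecName.SNAPPY)
                        .build()) {
            writer.write(row);
        }
    }
}
```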
- In the Properties mapping section, map the properties from the HDFS data set to the corresponding Pega Platform properties, depending on your parser configuration.
- CSV
  - Click Add Property.
  - In the numbered field that is displayed, specify the property that corresponds to a column in the CSV file.
  Property mapping for the CSV format is based on the order of columns in the CSV file. For that reason, the order of the properties in the Properties mapping section must correspond to the order of columns in the CSV file.
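Order-based mapping means that column i always feeds property i. A toy sketch of this rule, with hypothetical property names:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class OrderBasedMapping {
    // Hypothetical property names, listed in the same order as the CSV columns.
    private static final String[] PROPERTIES = {"CustomerName", "Age", "City"};

    public static Map<String, String> mapRow(String[] csvColumns) {
        Map<String, String> mapped = new LinkedHashMap<>();
        for (int i = 0; i < PROPERTIES.length && i < csvColumns.length; i++) {
            // Column i always feeds property i; reordering either side
            // silently shifts every value, which is why the order in the
            // Properties mapping section must match the file exactly.
            mapped.put(PROPERTIES[i], csvColumns[i]);
        }
        return mapped;
    }

    public static void main(String[] args) {
        System.out.println(mapRow(new String[] {"Smith", "42", "Boston"}));
    }
}
```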
- JSON
  - To use the auto-mapping mode, select the Use property auto mapping check box. This mode is enabled by default.
  - To map properties manually:
    - Clear the Use property auto mapping check box.
    - In the JSON column, enter the name of the column that you want to map to a Pega Platform property.
    - In the Property name field, specify the Pega Platform property that you want to map to the JSON column.
  In auto-mapping mode, the column names from the JSON data file are used as Pega Platform property names. This mode supports nested JSON structures that are mapped directly to Page and Page List properties in the data model of the class that the data set applies to.
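Name-based auto-mapping, including the nested Page and Page List case, behaves much like standard JSON data binding. A rough analogy using the Jackson library (not Pega's implementation); the classes are hypothetical stand-ins for a data set class with a Page property and a Page List property:

```java
import java.util.List;

import com.fasterxml.jackson.databind.ObjectMapper;

public class JsonAutoMappingSketch {
    // Stand-ins for Pega properties: Address maps like a Page,
    // List<Order> maps like a Page List.
    public static class Address { public String city; }
    public static class Order { public String id; }
    public static class Customer {
        public String name;
        public Address address;
        public List<Order> orders;
    }

    public static void main(String[] args) throws Exception {
        String json = "{\"name\":\"Smith\","
                + "\"address\":{\"city\":\"Boston\"},"
                + "\"orders\":[{\"id\":\"A-1\"},{\"id\":\"A-2\"}]}";

        // Fields are matched by name, the same principle as property
        // auto-mapping; nothing positional is involved.
        Customer c = new ObjectMapper().readValue(json, Customer.class);
        System.out.println(c.name + " / " + c.address.city + " / " + c.orders.size());
    }
}
```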
- Parquet
  Parquet creates the mapping from the properties that are defined in the data set class. You can map only scalar properties that are not inherited. If a property name matches a field name in the Parquet file, the property is populated with the corresponding data from the Parquet file.
  You can generate properties that exist in the Parquet file but are missing from Pega Platform. When you generate missing properties, Pega Platform checks the data set for unmapped columns and creates the missing properties in the data set class for those columns.
  To generate missing properties:
  - Click Generate missing properties.
  - Examine the Properties generation dialog, which shows both mapped and unmapped properties.
  - Click Submit to generate the unmapped properties.
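The comparison behind Generate missing properties depends on the field names stored in the Parquet file itself. Outside of Pega, you can inspect those names with the open-source parquet-hadoop library, as in this sketch; the part-file path is a placeholder:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.util.HadoopInputFile;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.Type;

public class ParquetSchemaInspector {
    public static void main(String[] args) throws Exception {
        // Placeholder path to one part file of the data set.
        Path file = new Path("/folder/part-r-00000");

        try (ParquetFileReader reader =
                ParquetFileReader.open(HadoopInputFile.fromPath(file, new Configuration()))) {
            // The schema is stored in the file footer.
            MessageType schema = reader.getFooter().getFileMetaData().getSchema();
            // Each top-level field is a candidate for mapping; a field with
            // no matching scalar property in the data set class is what the
            // Properties generation dialog reports as unmapped.
            for (Type field : schema.getFields()) {
                System.out.println(field.getName());
            }
        }
    }
}
```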
- Click Save.
- Configuring Hadoop settings for an HDFS connection
Use the HDFS settings in the Hadoop data instance to configure connection details for the HDFS data sets.
- About Hadoop host configuration (Data-Admin-Hadoop)
You can use this configuration to define all of the connection details for a Hadoop host in one place, including connection details for data sets and connectors.
- JCA Resource Adapter form – Completing the Connection tab
Complete the Connection tab to identify the resource adapter's Connection Factory and to provide information about how the resource adapter connects to the back-end enterprise information system (EIS).
- Types of Data Set rules
Learn about the types of data set rules that you can create in Pega Platform.
- About Data Set rules
Data sets define collections of records, allowing you to set up instances that use data abstraction to represent data stored in different sources and formats. Depending on the type that you select when creating a new instance, data sets represent Visual Business Director (VBD) data sources, data in database tables, or data in decision data stores. Through the data management operations for each data set type, you can read, insert, and remove records.
- Data Set rules - Completing the Create, Save As, or Specialization form