Creating an HDFS data set record

You must configure each instance of the HDFS data set rule before it can read data from and save it to an external Apache Hadoop Distributed File System (HDFS).

  1. Create an instance of the HDFS data set rule.
  2. Connect to an instance of the Data-Admin-Hadoop configuration rule.
    1. In the Hadoop configuration instance field, reference the Data-Admin-Hadoop configuration rule that contains the HDFS storage configuration.
    2. Click Test connectivity to test whether Pega Platform can connect to the HDFS data set.

      The HDFS data set is optimized for connections to a single Apache Hadoop environment. When HDFS data sets in a single instance of a data flow rule connect to different Apache Hadoop environments, the data sets cannot use authenticated connections concurrently. To use authenticated and non-authenticated connections at the same time, all of the HDFS data sets must point to the same Hadoop environment.

  3. In the File path field, specify a file path to the group of source and output files that the data set represents.

    This group is based on the file at the specified path, and also includes all files that match the pattern fileName-XXXXX, where XXXXX is a sequence number starting at 00000. These numbered files are created because data flows save records in batches; each save operation appends data to the existing HDFS data set without overwriting it. You can use an asterisk (*) to match multiple files in a folder (for example, /folder/part-r-*).
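    The batch naming and wildcard matching described above can be sketched outside the platform. This is an illustrative example only, not Pega code; the base file name part-r is a placeholder:

```python
from fnmatch import fnmatch

# Hypothetical part-file names as a data flow might produce them:
# the base file name plus a five-digit, zero-padded sequence number.
base = "part-r"
part_files = [f"{base}-{i:05d}" for i in range(3)]
# -> ['part-r-00000', 'part-r-00001', 'part-r-00002']

# A wildcard such as /folder/part-r-* matches every file in the batch.
matches = [name for name in part_files if fnmatch(name, "part-r-*")]
```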

  4. Optional: Click Preview file to view the first 100 KB of records in the selected file.
  5. In the File format section, select the format of the files in the data set.
  6. In the Properties mapping section, map the properties from the HDFS data set to the corresponding Pega Platform properties, depending on your parser configuration.
  7. Click Save.
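The Test connectivity button in step 2 verifies that Pega Platform can reach the cluster for you. For troubleshooting outside the platform, a minimal reachability probe against the NameNode might look like the following sketch; the host name is a placeholder, and 8020 is only a common default for the NameNode RPC port, so substitute your cluster's actual values:

```python
import socket

def can_reach_hdfs(host: str, port: int = 8020, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to the HDFS NameNode succeeds.

    This checks network reachability only; it does not validate
    Hadoop authentication or HDFS permissions.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

A check like this can help distinguish a network or firewall problem from a misconfigured Data-Admin-Hadoop record when Test connectivity fails.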