Processing CSV files in sequential batches

Correlate comma-separated values (CSV) files that have different schemas by processing them in sequential batches and performing post-processing actions after all file data is gathered. To set up sequential processing, implement a set of file listeners that each process one file from a sequential batch of CSV files, and then perform post-processing.

The following procedure assumes that the file names in a specific batch of input files contain a timestamp or GUID that can be used as a correlation ID for all the files in the batch. If this correlation ID is stored in an exposed column of each record that is persisted from a row in a CSV file, you can retrieve all the data from one batch of files by using a parameterized data page that is backed by a report definition. You can then use an agent or queued process to post-process the data by using this data page. For information about exposing columns, see Property optimization.

Perform the following tasks:

  1. Configure the listeners for each CSV file that you want to map to case types.

    For instructions, see Configuring a file service and file listener to process data in files.

  2. Create a utility function that extracts the correlation ID from the CSV input file name.

    Use a regular expression or other string manipulation code to find and return the correlation ID in the name of the CSV file. The input file name is stored on the clipboard on a page that is typically named LogServiceFile in the Log-Service-File class, in the pyOriginalFileName property.

    For more information, see Creating a function.
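
    The extraction logic might look like the following minimal sketch in plain Java. The class name CorrelationIdExtractor, the 14-digit timestamp format, and the sample file names are assumptions made for illustration; adjust the pattern to the naming convention that your files actually use.

    ```java
    import java.util.Optional;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Hypothetical helper: the file-naming convention below is an assumption.
    final class CorrelationIdExtractor {

        // Matches a 14-digit timestamp (yyyyMMddHHmmss) or a GUID in 8-4-4-4-12 hex form.
        private static final Pattern CORRELATION_ID = Pattern.compile(
            "(\\d{14}|[0-9a-fA-F]{8}(?:-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12})");

        private CorrelationIdExtractor() {}

        // Returns the first timestamp or GUID found in the file name, if any.
        static Optional<String> extract(String fileName) {
            if (fileName == null) {
                return Optional.empty();
            }
            Matcher matcher = CORRELATION_ID.matcher(fileName);
            return matcher.find() ? Optional.of(matcher.group(1)) : Optional.empty();
        }
    }
    ```

    For example, under these assumptions, extract("orders_20240131120000.csv") yields 20240131120000, and a file name without a timestamp or GUID yields an empty result that the caller can treat as an error.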

  3. Add a text property for storing the correlation ID to the case types to which the CSV files are mapped.

    Ensure that the property is either in the class that will process the records or a parent of the class.

  4. Store the correlation ID value in each record that is persisted from the CSV file data.

    Add a step to the record-level processing activity in the file service that sets the correlation ID property value on the step page. Use the value that is returned by the utility function that you created in step 2.

    To manage and audit the processing of the file batches, create a data table:

    • Create a data type that uses the correlation ID as its primary key and has a processing status property with the following states, at minimum:
      • New – Data is actively being gathered by the set of file listeners for this correlation ID.
      • Ready – All the files from the batch have been loaded and the data is ready for post-processing.
      • Completed – Post-processing of the batch has been successfully completed.
      • Failed – An error prevented successful processing of a batch of file data.
    • Optionally, add processing states and audit data for other reporting requirements and error scenarios, such as record-level mapping or processing failures, files that are still missing from the batch after a specified period, and post-processing failures.

    For more information about creating a data type, see Creating a new data type.
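
    As a rough model of the processing status property, the following plain Java sketch represents the four minimum states and guards the transitions between them. The names BatchStatus and BatchRecord are hypothetical, and the allowed transitions are an assumption derived from the state descriptions above, not a documented rule.

    ```java
    import java.util.EnumSet;
    import java.util.Set;

    // The four minimum states named in this procedure.
    enum BatchStatus {
        NEW, READY, COMPLETED, FAILED;

        // Assumed transitions: New -> Ready or Failed, Ready -> Completed or Failed.
        Set<BatchStatus> allowedNext() {
            switch (this) {
                case NEW:
                    return EnumSet.of(READY, FAILED);
                case READY:
                    return EnumSet.of(COMPLETED, FAILED);
                default:
                    return EnumSet.noneOf(BatchStatus.class);
            }
        }
    }

    // Hypothetical stand-in for one row of the data table, keyed by correlation ID.
    final class BatchRecord {
        private final String correlationId;
        private BatchStatus status = BatchStatus.NEW;

        BatchRecord(String correlationId) {
            this.correlationId = correlationId;
        }

        String correlationId() {
            return correlationId;
        }

        BatchStatus status() {
            return status;
        }

        // Rejects transitions that the assumed state model does not allow.
        void transitionTo(BatchStatus next) {
            if (!status.allowedNext().contains(next)) {
                throw new IllegalStateException(status + " -> " + next + " is not allowed");
            }
            status = next;
        }
    }
    ```

    Rejecting unexpected transitions makes it easier to audit batches that were, for example, marked as completed without ever being ready.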

  5. Add a new row to the data table each time a unique correlation ID is detected.

    Add a step to the prolog activity in the file service that checks whether the correlation ID is in the data table, and adds a new row if needed.

    For more information, see Service File form - Completing the Request tab.
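
    The check-then-insert logic of the prolog step can be sketched as follows, with an in-memory map standing in for the data table. The class BatchRegistry and its methods are hypothetical. Because several listeners can encounter the same correlation ID at about the same time, the sketch uses an atomic putIfAbsent call instead of a separate lookup followed by an insert.

    ```java
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Hypothetical in-memory stand-in for the batch data table.
    final class BatchRegistry {
        // Maps each correlation ID to its processing status.
        private final Map<String, String> statusByCorrelationId = new ConcurrentHashMap<>();

        // Adds a row with status "New" only if the correlation ID is not yet known.
        // Returns true when this call created the row.
        boolean registerIfAbsent(String correlationId) {
            return statusByCorrelationId.putIfAbsent(correlationId, "New") == null;
        }

        String statusOf(String correlationId) {
            return statusByCorrelationId.get(correlationId);
        }

        int size() {
            return statusByCorrelationId.size();
        }
    }
    ```

    In a real data table, the primary key on the correlation ID plays the role that the atomic map operation plays here.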

  6. When all the data from the batch of CSV files is gathered and persisted, update the batch processing status in the data table.

    Add a step to the epilog activity in the file service that checks whether the current file is the last file from the batch to be processed and, if it is, sets the processing status of the batch to Ready. This step assumes that the data table tracks the CSV file names, and that each file listener can determine which files make up a full batch.

    For more information, see Service File form - Completing the Request tab.
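
    One way to decide whether the current file is the last one in its batch is to compare the files processed so far against the set of files that a full batch is expected to contain. The following plain Java sketch illustrates the idea for a single batch. The class BatchCompletionTracker and the file labels used with it are hypothetical, and the sketch assumes that the expected set is known in advance.

    ```java
    import java.util.HashSet;
    import java.util.Set;

    // Hypothetical tracker for one batch, that is, one correlation ID.
    final class BatchCompletionTracker {
        private final Set<String> expectedFiles;
        private final Set<String> processedFiles = new HashSet<>();

        BatchCompletionTracker(Set<String> expectedFiles) {
            // Defensive copy so that later changes by the caller do not affect the tracker.
            this.expectedFiles = new HashSet<>(expectedFiles);
        }

        // Records that a file was processed and returns the resulting batch status:
        // "Ready" once every expected file has been processed, otherwise "New".
        synchronized String recordProcessed(String fileLabel) {
            if (!expectedFiles.contains(fileLabel)) {
                throw new IllegalArgumentException("Unexpected file for this batch: " + fileLabel);
            }
            processedFiles.add(fileLabel);
            return processedFiles.containsAll(expectedFiles) ? "Ready" : "New";
        }
    }
    ```

    Because the check does not depend on the order in which files arrive, whichever listener finishes last moves the batch to Ready.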

  7. Implement a post-processing activity that runs after all the batch data is gathered. For best results, run the activity as an independent process, such as an agent process.

    To look up all records that have a specific correlation ID, use a report-backed data page in the post-processing activity.

    For more information, see Define the contents of a data page using a report definition.
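
    Conceptually, the parameterized lookup returns every persisted record whose correlation ID matches the parameter, regardless of which CSV file the record came from. The following plain Java sketch shows that filter over an in-memory list. PersistedRow and BatchLookup are hypothetical names, and the sketch stands in for the report definition rather than reproducing it.

    ```java
    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical persisted record: one row from one CSV file in a batch.
    final class PersistedRow {
        final String correlationId;
        final String sourceFile;
        final String payload;

        PersistedRow(String correlationId, String sourceFile, String payload) {
            this.correlationId = correlationId;
            this.sourceFile = sourceFile;
            this.payload = payload;
        }
    }

    final class BatchLookup {
        private BatchLookup() {}

        // Returns all rows that belong to the batch with the given correlation ID.
        static List<PersistedRow> findByCorrelationId(List<PersistedRow> allRows, String correlationId) {
            List<PersistedRow> matches = new ArrayList<>();
            for (PersistedRow row : allRows) {
                if (correlationId.equals(row.correlationId)) {
                    matches.add(row);
                }
            }
            return matches;
        }
    }
    ```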
