Creating entity extraction rules for text analytics

In Pega® Platform, you can use the default Decision Data rules that contain entity extraction scripts as a starting point for custom rules that extract entities from text.

With the natural language processing capabilities of Pega Platform, you can extract structured data from unstructured text. Structured data is any entity in the text that has a regular and predictable form, such as an email address, account number, time, or monetary amount. By extracting structured information from sources such as emails or text messages, you can automate your response to customers. For example, when a customer sends a complaint about missing luggage, you can automatically create a case that maps the detected entities to properties such as customer ID, flight number, or airport code.

To extract structured entities from text, Pega Platform integrates scripts that are based on the Apache UIMA Ruta annotation language. Pega Platform provides several entity extraction rules that you can use as part of the text analyzer rules in your application, for example, pyAccountNumber, pyCaseID, and pyRelationship.

Tutorial

The following example use case explains how to detect unintentionally disclosed account numbers in tweets so that the numbers can be replaced with X characters before they are persisted. The example assumes that each account number consists of four 4-digit numbers that can be delimited by any character, for example, 1234-4567-8901-2345.
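As a rough illustration of the goal, the following Java sketch detects numbers of the assumed form and replaces their digits with X characters. It is a plain regular-expression approximation that runs outside Pega Platform, not the product's implementation; the class name `AccountNumberMasker` and the method `mask` are hypothetical names chosen for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of Pega Platform.
final class AccountNumberMasker {
    // Four 4-digit groups, each pair optionally separated by one non-digit character.
    // The lookarounds prevent a match inside a longer run of digits.
    private static final Pattern ACCOUNT_NUMBER =
            Pattern.compile("(?<!\\d)\\d{4}(?:\\D?\\d{4}){3}(?!\\d)");

    /** Replaces every digit of each detected account number with 'X' and keeps the delimiters. */
    static String mask(String text) {
        Matcher matcher = ACCOUNT_NUMBER.matcher(text);
        StringBuffer result = new StringBuffer();
        while (matcher.find()) {
            String masked = matcher.group().replaceAll("\\d", "X");
            matcher.appendReplacement(result, Matcher.quoteReplacement(masked));
        }
        matcher.appendTail(result);
        return result.toString();
    }

    public static void main(String[] args) {
        // prints "My account XXXX-XXXX-XXXX-XXXX was charged twice"
        System.out.println(mask("My account 1234-4567-8901-2345 was charged twice"));
    }
}
```

Keeping the delimiters in the masked text is a design choice of this sketch; the tutorial only requires that the digits are not persisted.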

Prerequisites

Before completing this tutorial, make sure that you understand the following concepts:

  • The components and functionality of the Pega Platform text analytics feature. For more information, see Text Analytics.
  • Basic Java regular expressions
  • Development of Apache UIMA Ruta-based applications

Creating Decision Data rules that contain scripts for entity extraction

You create a Decision Data rule for entity extraction by saving a copy of an existing entity extraction rule under a new name.

  1. Open any Decision Data rule that contains an Apache Ruta script for extracting structured entities:
    1. In the Explorer panel of Designer Studio, click Records > Decision > Decision Data.
    2. From the list of Decision Data rules in your application, select a rule that contains the script to use as the basis for developing the new entity extraction rule.
  2. Save a copy of the rule to contain your custom Apache Ruta script:
    1. On the Decision Data rule form, click Save as.
    2. Specify a new label and identifier.
    3. Optional: To save the rule as part of a different ruleset, in the Context section, specify a new ruleset and Applies to class.
    4. Click Create and open.

You created a Decision Data rule for extracting structured entities. Next, customize the existing Apache Ruta script to meet your business needs.

Decision Data rule that contains an Apache Ruta script for entity extraction

Modifying Apache Ruta scripts to extract custom structured entities

After you create a Decision Data rule for entity extraction, customize the existing Apache Ruta script to adjust it to your business needs.

  1. On the Data tab of the Decision Data rule, click Open rule to access the Apache Ruta script.
  2. Modify the existing script based on the following example:
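    <code>PACKAGE uima.ruta.example;

    // One annotation type for each 4-digit group of the account number.
    DECLARE VarA;
    DECLARE VarB;
    DECLARE VarC;
    DECLARE VarD;

    // Rule elements 1-7: four 4-digit numbers with an optional token between each pair.
    // The last element marks the whole span as one entity and removes the helper annotations.
    NUM{REGEXP("(^[0-9]{4})") -> MARK(VarA)}
      ANY?
      NUM{REGEXP("([0-9]{4})") -> MARK(VarB)}
      ANY?
      NUM{REGEXP("([0-9]{4})") -> MARK(VarC)}
      ANY?
      NUM{REGEXP("([0-9]{4})") -> MARK(VarD), MARK(EntityType,1,7),
          UNMARK(VarA), UNMARK(VarB), UNMARK(VarC), UNMARK(VarD)};</code>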
    Key points from the preceding code example:
    • DECLARE VarA; – Declares an annotation type. In this use case, an account number consists of four groups of digits that are separated by a delimiter character; therefore, the script includes four DECLARE statements.
    • NUM{REGEXP("([0-9]{4})") -> MARK(VarB)} – Matches a number that consists of four digits, each between 0 and 9. When a match is found, the number is marked with the declared annotation. Note that the caret character (^) in the first regular expression asserts that the match starts at the beginning of the string.
    • ANY? – Optionally matches a delimiter between two groups of digits, for example, a hyphen (-) or a semicolon (;).
    • MARK(EntityType,1,7) – Creates a single entity that spans rule elements 1 through 7 (VarA, ANY?, VarB, ANY?, VarC, ANY?, VarD). For an entity to be detected, matches must be found for all four regular expressions.
    • UNMARK(VarD) – Removes the intermediate annotation to prevent an overlap with the entity that resulted from the merged annotations.

      For more information about regular expressions and the basic token hierarchy in Apache Ruta scripts, see the Apache UIMA Ruta Guide and Reference.

  3. Click Save.
  4. Test whether the script that you entered produces the expected results:
    1. On the Data tab of the Decision Data rule, click Test.
    2. In the Test window, enter or paste your sample text and click Test. If your custom script is correct, the detected entity is displayed in the Entity extraction section at the bottom of the Test window.

Testing entity extraction
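Before you paste sample text into the Test window, it can help to reason about which inputs the rule should and should not match. The following Java sketch mirrors the structure of the script with one named group for each declared annotation (VarA to VarD) and an optional non-digit character in place of ANY?. It is an approximation under stated assumptions: it uses plain Java regular expressions, it does not reproduce Ruta tokenization, so results can differ from the script in edge cases, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch for illustration only; not the Ruta engine and not a Pega API.
final class AccountNumberRuleSketch {
    // One named group per declared annotation; "\\D?" stands in for the optional ANY? element.
    private static final Pattern RULE = Pattern.compile(
            "(?<!\\d)(?<varA>\\d{4})\\D?(?<varB>\\d{4})\\D?(?<varC>\\d{4})\\D?(?<varD>\\d{4})(?!\\d)");

    /** Returns the text of every full match, the analogue of the merged entity. */
    static List<String> extractEntities(String text) {
        List<String> entities = new ArrayList<>();
        Matcher matcher = RULE.matcher(text);
        while (matcher.find()) {
            entities.add(matcher.group());
        }
        return entities;
    }

    /** Returns the four digit groups of the first match, or an empty array if nothing matches. */
    static String[] parts(String text) {
        Matcher matcher = RULE.matcher(text);
        if (!matcher.find()) {
            return new String[0];
        }
        return new String[] {
            matcher.group("varA"), matcher.group("varB"),
            matcher.group("varC"), matcher.group("varD")
        };
    }

    public static void main(String[] args) {
        String[] samples = {
            "Please check 1234-4567-8901-2345 today",   // one entity expected
            "Card 1234;4567;8901;2345 and order 9876",  // one entity; 9876 alone is ignored
            "Only three groups: 1234-4567-8901"         // no entity: the fourth group is missing
        };
        for (String sample : samples) {
            System.out.println(sample + " -> " + extractEntities(sample));
        }
    }
}
```

The third sample illustrates the point that all four groups must match before an entity is reported.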

In this tutorial, you created a Decision Data rule from an existing one. You also edited the attached Apache Ruta script to extract entities of a specific type to satisfy your business need of finding account numbers in the analyzed text.

Published August 2, 2017 — Updated August 29, 2018

