Best practices for pattern extraction in text analytics


Apache UIMA Ruta (Rule-based Text Annotation) is a rule-based language that you can use to detect keywords and phrases that follow certain patterns, for example, to populate a case or route an assignment. Example patterns include:

  • Lists of possible words, such as product or country names
  • Patterns that you can detect through regular expressions, for example, flight numbers, phone numbers, or email addresses
  • Patterns that you can recognize across multiple tokens, for example, hotel or street names

Learn about rule-based text extraction in Pega Platform through the following topics:

Introduction to Ruta

The Ruta language classifies entities based on rules that are combinations of annotation patterns, optional quantifiers, conditions for matching, and actions to perform. Consider the following example:

    W{REGEXP("\bdogs?\b|\bwhales?\b|\bsparrows?\b") -> MARK(EntityType)};

Figure: Configuring and testing a Ruta-based entity type

In this example:

  • W is an annotation of the type normal word.
  • REGEXP("\bdogs?\b|\bwhales?\b|\bsparrows?\b") is a condition that is fulfilled when a normal word matches the regular expression.
  • -> separates conditions from actions.
  • MARK(EntityType) is the action that is taken when a piece of text fulfills the condition. In this case, the entity extraction model marks the matching text as an entity of the type Animal.
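
To see what such a rule does outside of Pega Platform, the following Python sketch imitates it: it splits text into word tokens (a rough stand-in for the W annotation), keeps the tokens that satisfy the regular expression, and records their spans. The function name mark_animals and the letter-run tokenizer are assumptions made for illustration only; they are not part of Ruta or Pega Platform, and real Ruta tokenization differs in detail.

```python
import re

# Same pattern as in the REGEXP condition described above.
ANIMAL_PATTERN = re.compile(r"\bdogs?\b|\bwhales?\b|\bsparrows?\b")


def mark_animals(text):
    """Return (start, end, word) for each word token that matches the pattern.

    Assumption: a "normal word" (W) is approximated as a run of ASCII letters.
    """
    marked = []
    for token in re.finditer(r"[A-Za-z]+", text):
        # Condition: the whole token must satisfy the regular expression.
        if ANIMAL_PATTERN.fullmatch(token.group()):
            # Action: record the span, similar in spirit to MARK(EntityType).
            marked.append((token.start(), token.end(), token.group()))
    return marked


if __name__ == "__main__":
    print(mark_animals("Two dogs chased a sparrow past the whale."))
```

Note that the pattern is case-sensitive, so a capitalized word such as Dogs is not marked in this sketch.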

For more information about the Ruta language, including a full list of annotation types, quantifiers, and examples, see the Apache UIMA Ruta documentation.

Ruta requirements and considerations for Pega applications

Consider the following points when creating Ruta-based entity types in Pega Platform:

  • To store annotation results, mark them in the Ruta script. You can use the VarA, VarB, VarC, VarD, and VarE variables to store intermediate annotation results. Pega Platform stores the final annotation results in the EntityType annotation, the name of which is equal to the Entity type name property.
  • Always clear the declared variables, such as VarA, at the end of your script, so that they do not interfere with the execution of the next script.
  • Pega Platform does not support WORDLIST and WORDTABLE annotations. Starting from Pega Platform 8.3, you can define wordlists as keywords and refer to them in the Ruta script.
  • Starting from Pega Platform 8.3, you can reference other entity types through the Ruta script by using the following command: EntityType{FEATURE("entityType", "<EntityTypeNameInLowerCase>")}. For an example of a use case, see Improve the management of text extraction models through entity types. Reference entity types in lowercase, irrespective of the case in which you defined them.
  • In Pega Platform, a Ruta script can detect only a single entity type.
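
As a small illustration of the lowercase requirement, the following Python sketch builds the reference expression from an entity type name. The helper entity_reference and the sample name FlightNumber are hypothetical; the sketch only lowercases the name and assumes that no other normalization is needed.

```python
def entity_reference(entity_type_name):
    """Build the Ruta expression that references another entity type.

    Hypothetical helper: it only lowercases the name, as the guidance above
    requires, and performs no other validation or escaping.
    """
    return 'EntityType{FEATURE("entityType", "%s")}' % entity_type_name.lower()


print(entity_reference("FlightNumber"))
# prints: EntityType{FEATURE("entityType", "flightnumber")}
```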

For more information, see Creating entity extraction models.

Examples of Ruta scripts

To create custom Ruta-based entity types, you can use any of the provided default templates. The following list also provides a number of simple but efficient Ruta scripts that you can use to recognize basic entity types in your application.

  • To detect only words, for example, telephone, enter:
    W {-> MARK(EntityType)};
  • To detect letters or words that are followed by numbers, for example, SK123, enter:
    W NUM {-> MARK(EntityType,1,2)};
  • To detect numbers that are surrounded by letters or words, for example, IFSC000ABC, enter:
    W NUM W {-> MARK(EntityType,1,3)};
    The indexes 1 and 3 specify the range of rule elements to mark as EntityType (from element 1 to element 3).
  • To detect a string of specific length, for example, six digits, enter:
    NUM{REGEXP("......") -> MARK(EntityType)};
    or
    NUM{REGEXP(".{6}") -> MARK(EntityType)};
  • To detect case-insensitive strings, for example, USA or India, enter:
    W{REGEXP("(?i)(usa|india)") -> MARK(EntityType)};
    In the example above, (?i) indicates that the script ignores the case.
  • To detect a specific word or phrase that is followed by a space and then a number, for example, INTL 1001, enter:
    Document{-> RETAINTYPE(SPACE)};
    W{REGEXP("INTL")} SPACE NUM {-> MARK(EntityType,1,3)};
    In the example above, the Ruta script does not detect INTL1001 because the string does not contain a space. By default, Ruta ignores spaces, unless you specify otherwise (for example, through the RETAINTYPE(SPACE) command).
  • To detect alphanumeric patterns through token annotation, for example, A1B23C, enter:
    Token{ AND(CONTAINS(NUM),CONTAINS(W)) -> MARK(EntityType)};
    The AND condition is fulfilled only when the token fulfills all of the conditions that it contains.
  • To detect patterns through a regular expression, for example, two-digit hexadecimal numbers, enter:
    ((W|NUM) (NUM|W)*){REGEXP("[a-fA-F0-9]{2}") -> MARK(EntityType)};
  • To declare a temporary variable, for example, VarA that represents the name of the month, enter:
    DECLARE VarA; //Month name
  • To clear a declared variable, for example, VarA, enter:
    VarA{-> UNMARK(VarA)};

For an example use case, see Detecting transaction details with Ruta.
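
The regular expressions in the examples above can be checked outside of Pega Platform before you paste them into a script. The following Python sketch assumes that a REGEXP condition must match the entire covered text, which is why it uses re.fullmatch. The helper name fulfills is hypothetical, and Python regular expression syntax, although equivalent for these patterns, is not identical to the Java syntax that Ruta uses.

```python
import re


def fulfills(pattern, covered_text):
    """Return True if the pattern matches the whole covered text.

    Assumption: REGEXP is fulfilled only by a full match of the annotation text.
    """
    return re.fullmatch(pattern, covered_text) is not None


# Exactly six characters; on a NUM annotation this means six digits.
assert fulfills(r".{6}", "123456")
assert not fulfills(r".{6}", "12345")

# (?i) makes the match case-insensitive.
assert fulfills(r"(?i)(usa|india)", "India")
# Spaces inside the group are literal, so "usa " (with a space) would be required.
assert not fulfills(r"(?i)(usa | india)", "USA")

# Two-digit hexadecimal numbers.
assert fulfills(r"[a-fA-F0-9]{2}", "3F")
assert not fulfills(r"[a-fA-F0-9]{2}", "3G")
```

The fourth check shows why alternatives inside a group should be written without surrounding spaces.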
