Detecting transaction details with Ruta

Locate and extract keywords and phrases through pattern matching by using the Apache Rule-based Text Annotation (Ruta) language.

For example, you can use a Ruta script to detect strings that contain the @ symbol and the .com sub-string, such as email addresses. In addition, you can use this detection method to extract entity types by token length (for example, zip codes or telephone numbers), or to extract entities from a word or token. This data can help you automate case field population, or process customer requests faster.

Use case

As an application developer, you want to extract such details as account numbers, case IDs, and customer information from transaction documents, as in the following example:

  • Account Number: 765422
  • CaseID: AB-1124
  • Customer Name: Miss Sarah Connor
  • Email: sconnor@pega.com
  • Mobile number: +11 1234568901
  • SSN: 345-99-7654
  • Transaction number: ABC123PQ9Z
  • Transaction Date: 22/10/2018
  • Transaction amount: Euros 200

Learn about extracting transaction details by completing the following tasks:

  1. Detecting account numbers
  2. Detecting case IDs
  3. Detecting salutations
  4. Detecting numeric patterns
  5. Detecting email addresses
  6. Detecting alphanumeric strings with a fixed number of characters
  7. Detecting currencies
  8. Detecting monetary values
  9. Detecting dates

Before you begin

  1. In Prediction Studio, create an entity extraction model. For more information, see Creating entity extraction models.
  2. In the entity extraction model that you create, configure each information type that you want to extract (for example, account number, case ID, or phone number), as a separate entity type.

Detecting account numbers

Use regular expression patterns to detect account numbers with a fixed number of digits. The following example uses an account number with six digits.

  1. In the account_number entity type, turn on the Enable Ruta switch.
  2. In the Ruta script box, enter:

    NUM{REGEXP("......") -> MARK(EntityType) };
    where:
    • NUM is the annotation type for marking numbers and digits.
    • REGEXP("......") is the regular expression in which the number of periods represents the number of characters in a specific annotation.

    This expression detects all six-digit numbers in the document as account numbers.

  3. Optional: To improve detection accuracy, add more regular expressions. For example, you can recognize the context in which an account number appears in a document that has a fixed structure (such as in transaction documents).
    • W{ REGEXP("(?i)(account)")}
    • W{ REGEXP("(?i)(number)")}
    • COLON
    • NUM{REGEXP("......") -> MARK(EntityType,4) };
    where:
    • W{ REGEXP("(?i)(account)")} detects the word account. (?i) indicates that the word is case insensitive.
    • W{ REGEXP("(?i)(number)")} detects the word number.
    • COLON detects the colon (:) character.
    NUM{REGEXP("......") -> MARK(EntityType,4) } detects a six-digit number that is preceded by the annotations for the account, number, and colon. MARK(EntityType,4) indicates that only the fourth annotation is marked as the entity type, because you only want to extract the number, not the surrounding context. Lines one through three do not end with a semicolon (;), so Ruta considers these lines as a single script. You can write multiple scripts to detect a single entity type.
  4. Optional: To improve detection accuracy, add more regular expressions. For example, you can recognize the context in which an account number appears in a document without a fixed structure (such as My account number is 435399 or Details for my account number: 394854):
    • W{ REGEXP("(?i)(account)")}
    • W{ REGEXP("(?i)(number)")}
    • ANY[0,4]?
    • NUM{REGEXP("......") -> MARK(EntityType,4) };
    where:
    • ANY[0,4]? allows up to four annotations of any type between the words account number and the six-digit value, so that the number is still detected when other words or punctuation separate it from its label.
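The context-based matching above can be approximated outside of Prediction Studio for quick experimentation. The following Python sketch is not the Ruta engine: the tokenizer, the `find_account_numbers` function, and the `max_gap` parameter are hypothetical names introduced for illustration, and the token split only loosely imitates Ruta annotations.

```python
import re

# Rough stand-in for Ruta's basic annotations: words, numbers, single symbols.
TOKEN = re.compile(r"[A-Za-z]+|\d+|[^\sA-Za-z\d]")


def find_account_numbers(text, max_gap=4):
    """Return six-digit numbers that follow the words 'account number'.

    Up to max_gap tokens (for example 'is' or ':') may sit between the
    label and the number, which mirrors the intent of ANY[0,4]?.
    """
    tokens = TOKEN.findall(text)
    found = []
    for i in range(len(tokens) - 2):
        if tokens[i].lower() != "account" or tokens[i + 1].lower() != "number":
            continue
        # Inspect the token right after the label plus max_gap further tokens.
        for j in range(i + 2, min(i + 3 + max_gap, len(tokens))):
            if re.fullmatch(r"\d{6}", tokens[j]):
                found.append(tokens[j])
                break
    return found


print(find_account_numbers("Details for my account number: 394854"))
```

A six-digit number without the label nearby is ignored, which is the accuracy gain that the optional steps describe.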

Detecting case IDs

Detect work assignment numbers, such as case IDs, by using a regular expression.

  1. In the case_id entity type, turn on the Enable Ruta switch.
  2. In the Ruta script box, enter the following script:
    • CAP{REGEXP("..")}
    • "-"
    • NUM{REGEXP("....") -> MARK(EntityType,1,3)};
    where:
    • CAP{REGEXP("..")} detects a word that consists of exactly two capital letters.
    • "-" detects the dash character.
    • NUM{REGEXP("....") -> MARK(EntityType,1,3)}; detects four-digit numbers that are preceded by the previously detected annotations. MARK(EntityType,1,3) means that annotations 1 through 3 are marked as the entity type, so that the whole case ID (for example, AB-1124) is extracted.
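As a rough cross-check of the case ID pattern, the same shape (two capital letters, a dash, four digits) can be expressed as a single Python regular expression. This is only a sketch; `CASE_ID` and `find_case_ids` are hypothetical names, and word boundaries stand in for Ruta's token boundaries.

```python
import re

# Two capital letters, a dash, then exactly four digits.
CASE_ID = re.compile(r"\b[A-Z]{2}-\d{4}\b")


def find_case_ids(text):
    """Return every substring that has the AB-1124 shape."""
    return CASE_ID.findall(text)


print(find_case_ids("CaseID: AB-1124"))
```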

Detecting salutations

Detect salutations, such as Mr, Mrs, or Dr, by using a regular expression.

  1. In the salutation entity type, turn on the Enable Ruta switch.
  2. In the Ruta script box, enter:

    W{REGEXP("(?i)(mr|mrs|dr|madam|maam|miss|sir|senorita|senor|ms)") -> MARK(EntityType)};
    where:
    • "(?i)" makes the search case insensitive
    • "|" indicates the OR condition
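To see how the case-insensitive alternation behaves, the following Python sketch applies the same list of salutations to plain text. The `SALUTATION` and `find_salutations` names are hypothetical, and the word boundaries (\b) approximate the fact that the Ruta rule tests one whole word annotation at a time.

```python
import re

# Same alternatives as the Ruta script; (?i) makes the match case insensitive.
SALUTATION = re.compile(
    r"(?i)\b(mr|mrs|dr|madam|maam|miss|sir|senorita|senor|ms)\b"
)


def find_salutations(text):
    """Return the salutations found in text, in their original casing."""
    return SALUTATION.findall(text)


print(find_salutations("Customer Name: Miss Sarah Connor"))
```

Words that merely start with a salutation, such as Drive or sirloin, are not matched because the whole word must equal one of the alternatives.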

Detecting numeric patterns

Use a regular expression to detect numeric patterns, for example, social security numbers.

  1. In the entity type called security_number, turn on the Enable Ruta switch.
  2. In the Ruta script box, enter:
    • NUM{REGEXP("0.[1-9]|1..|2..|3..|4..|5..|6..|7..|8..") }
    • "-" NUM{REGEXP("..")}
    • "-" NUM{REGEXP("....") -> MARK(EntityType,1,5)};
    where:
    • NUM{REGEXP("0.[1-9]|1..|2..|3..|4..|5..|6..|7..|8..") }: ensures that the number does not start with 000.
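The effect of the first regular expression is easier to judge when it is run against a few sample values. The following Python sketch reuses the expression unchanged and assumes that REGEXP must match the whole number, as the six-period account number example implies; `AREA`, `SSN`, and `find_ssns` are hypothetical names.

```python
import re

# Expression for the first group, copied from the Ruta script.
AREA = re.compile(r"0.[1-9]|1..|2..|3..|4..|5..|6..|7..|8..")

# Overall shape: three digits, dash, two digits, dash, four digits.
SSN = re.compile(r"\b(\d{3})-(\d{2})-(\d{4})\b")


def find_ssns(text):
    """Return SSN-shaped strings whose first group passes the AREA check."""
    return [
        m.group(0)
        for m in SSN.finditer(text)
        if AREA.fullmatch(m.group(1))
    ]


print(find_ssns("SSN: 345-99-7654"))
```

As written, the expression rejects 000, but it also rejects any first group that starts with 9 and any group that starts with 0 and ends with 0 (for example, 010), so adjust it if such values are valid in your data.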

Detecting email addresses

Detect email addresses by extracting phrases or words that contain the @ symbol and .com string.

  1. In the entity type called email_address, turn on the Enable Ruta switch.
  2. In the Ruta script box, enter:
    • Document{-> RETAINTYPE(SPACE)};
    • SPACE ((W|NUM) (W|NUM)[0,1])+ "@" W+? PERIOD+? W{REGEXP("(?i)([a-zA-Z]{3}|[a-zA-Z]{2})") -> MARK(EntityType,1,5)};
    • SPACE ((W|NUM)+ ("."|"_") )+ (W|NUM)+ "@" W[0,1]? PERIOD[0,1]? W+? PERIOD+? W{REGEXP("(?i)([a-zA-Z]{3}|[a-zA-Z]{2})") -> MARK(EntityType,1,8)};
    where:
    • Line 1 retains space annotations, so that the rules on lines 2 and 3 can use a leading space to locate the start of an email address.
    • Line 2 marks an email address that does not contain a period (.) or an underscore (_) before the @ symbol and contains only one period after the @ symbol. For example, abc@pega.com.
    • Line 3 marks emails that contain a period (.) or an underscore (_) before the @ symbol and domain names with any number of period characters. For example, abc@in.pega.com.
    • + matches at least one annotation.
    • +? matches at least one annotation but stops when the next rule element also matches this annotation.
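The two email rules can be condensed into one plain regular expression for testing sample text outside of Ruta. The following Python sketch is an approximation rather than a translation: `EMAIL` and `find_emails` are hypothetical names, and the expression covers both the simple form (abc@pega.com) and the forms with periods or underscores before the @ symbol or additional domain levels after it.

```python
import re

EMAIL = re.compile(
    r"(?<![\w.@])"                      # do not start in the middle of a word
    r"[A-Za-z0-9]+(?:[._][A-Za-z0-9]+)*"  # local part, optional . or _ separators
    r"@"
    r"(?:[A-Za-z]+\.)+"                 # one or more domain levels
    r"[A-Za-z]{2,3}\b"                  # two- or three-letter ending
)


def find_emails(text):
    """Return the email-shaped substrings found in text."""
    return EMAIL.findall(text)


print(find_emails("Email: sconnor@pega.com"))
```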

Detecting alphanumeric strings with a fixed number of characters

You can use regular expressions with Ruta scripts to annotate alphanumeric strings, for example, transaction numbers.

  1. In the transaction_number entity type, turn on the Enable Ruta switch.
  2. In the Ruta script box, enter:

    Token{ AND(REGEXP(".........."),CONTAINS(NUM),CONTAINS(W)) -> MARK(EntityType)};
    This script detects 10-character long entities that contain at least one digit and at least one letter.
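The three conditions that the script combines with AND (length of ten, at least one digit, at least one letter) can be sketched in Python as follows. The function names are hypothetical, and the sketch assumes that a transaction number contains only letters and digits, which is narrower than the ten-period expression in the Ruta script.

```python
import re


def is_transaction_number(token):
    """True for a 10-character alphanumeric token with a digit and a letter."""
    return (
        re.fullmatch(r"[A-Za-z0-9]{10}", token) is not None
        and any(c.isdigit() for c in token)
        and any(c.isalpha() for c in token)
    )


def find_transaction_numbers(text):
    """Split text into alphanumeric runs and keep those that qualify."""
    return [t for t in re.findall(r"[A-Za-z0-9]+", text) if is_transaction_number(t)]


print(find_transaction_numbers("Transaction number: ABC123PQ9Z"))
```

Requiring both a digit and a letter keeps ten-digit phone numbers and ten-letter words from being marked as transaction numbers.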

Detecting currencies

You can detect currency and money by using a variety of methods. For example, you can create a regular expression, or create a list of keywords for various currency names, symbols, and synonyms, and then refer to that list in the Ruta script.

  1. In the currency entity type, turn on the Enable Ruta switch.
  2. In the Ruta script box, enter the following script:

    W{REGEXP("(?i)(dollar|pounds|rupees)") -> MARK(EntityType)};
    For information about how to create a list of keywords, see Creating entity extraction models.

Detecting monetary values

You can detect monetary values by combining the currency code (for example, USD) and number entity types.

  1. In the money entity type, turn on the Enable Ruta switch.
  2. In the Ruta script box, enter:
    • EntityType{FEATURE("entityType", "currency")}
    • NUM {-> MARK(EntityType,1,2)};
    This script detects the currency that is followed by a number. The name of the entity type must always be in lowercase.
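The currency-then-number sequence can be tried on sample text with a short Python sketch. The keyword list below is an assumption for illustration only (it adds euros so that the Euros 200 sample from the use case matches); in the Ruta script, the currency comes from whatever the currency entity type detects. `MONEY` and `find_money` are hypothetical names.

```python
import re

# Illustrative keyword list; replace it with the currencies that you detect.
MONEY = re.compile(r"(?i)\b(dollar|pounds|rupees|euros)\s+(\d+)\b")


def find_money(text):
    """Return 'currency number' phrases, for example 'Euros 200'."""
    return [m.group(0) for m in MONEY.finditer(text)]


print(find_money("Transaction amount: Euros 200"))
```

Because the order is fixed, a phrase such as 200 Euros is not matched; a second pattern with the elements reversed would be needed for that format.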

Detecting dates

You can use a Ruta script to create a pattern for extracting dates. Ruta enables you to combine multiple detection patterns so that you can detect dates that are written in various formats.

  1. Declare date variables:
    • DECLARE VarA; //Month name
    • DECLARE VarB; //Day
    • DECLARE VarC; //Year
    • DECLARE VarD; //Month Number
  2. Detect a month name:

    W{REGEXP("(?i)(january|february|march|april|may|june|july|august|september|october|november|december|jan|feb|mar|apr|jun|jul|aug|sep|oct|nov|dec)") -> MARK(VarA)};
  3. Detect a day number:

    NUM{REGEXP("[0]?[1-9]|1[0-9]|2[0-9]|3[0-1]")} W?{REGEXP("(?i)(th|st|nd|rd)")-> MARK(VarB,1,2)};
  4. Detect a year:

    NUM{REGEXP("19..|20..|..") -> MARK(VarC)};
  5. Detect a month number:

    NUM{REGEXP("[0]?[1-9]|1[0-2]") -> MARK(VarD)};
  6. Detect a full date that follows the January 1st 2008 or February 28, 2010 pattern:

    VarA VarB PM? VarC {-> MARK(EntityType,1,4)};
  7. Clear the date variables that you declared earlier:
    • VarA{->UNMARK(VarA)};
    • VarB{->UNMARK(VarB)};
    • VarC{->UNMARK(VarC)};
    • VarD{->UNMARK(VarD)};
    Unmarking the temporary variables ensures that they do not interfere with the execution of the next script.

The script above consists of five detection rules, where:

  • W{REGEXP("(?i)(january|february|march|april|may|june|july|august|september|october|november|december|jan|feb|mar|apr|jun|jul|aug|sep|oct|nov|dec)") -> MARK(VarA)};
    detects the month and is case insensitive. The annotation is marked as VarA.
  • NUM{REGEXP("[0]?[1-9]|1[0-9]|2[0-9]|3[0-1]")} W?{REGEXP("(?i)(th|st|nd|rd)") -> MARK(VarB,1,2)};
    detects any number from 1 to 31, optionally followed by an ordinal suffix (th, st, nd, or rd). The question mark character (?) indicates that the suffix annotation is optional. The number and its suffix are marked together as VarB.
  • NUM{REGEXP("19..|20..|..") -> MARK(VarC)} detects a four-digit number that starts with 19 or 20, or a two-digit number. This annotation is marked as VarC.
  • NUM{REGEXP("[0]?[1-9]|1[0-2]") -> MARK(VarD)}; detects numbers from 1 to 12 and is marked as VarD.
  • VarA VarB PM? VarC {-> MARK(EntityType,1,4)}; detects a full date that follows the January 1st 2008 or February 28, 2010 pattern.
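The way the month, day, and year sub-patterns combine into one date pattern can be sketched in Python by joining the same expressions. This is an approximation under stated assumptions: `MONTH`, `DAY`, `YEAR`, `DATE`, and `find_dates` are hypothetical names, and an optional comma or period stands in for the PM? (punctuation mark) element.

```python
import re

MONTH = (
    r"(?:january|february|march|april|may|june|july|august|september|"
    r"october|november|december|jan|feb|mar|apr|jun|jul|aug|sep|oct|nov|dec)"
)
DAY = r"(?:0?[1-9]|1[0-9]|2[0-9]|3[0-1])(?:th|st|nd|rd)?"
YEAR = r"(?:19\d\d|20\d\d|\d\d)"

# Month name, day, optional punctuation, year.
DATE = re.compile(
    rf"\b{MONTH}\s+{DAY}\b\s*[,.]?\s*{YEAR}\b",
    re.IGNORECASE,
)


def find_dates(text):
    """Return dates written like 'January 1st 2008' or 'February 28, 2010'."""
    return DATE.findall(text)


print(find_dates("Paid on January 1st 2008 and February 28, 2010."))
```

A numeric date such as 22/10/2018 is not matched by this combination; covering it would take an additional pattern built on the month number expression.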

In this tutorial, you learned how to use Apache Ruta to automatically discover and retrieve such business data as account numbers, email addresses, transaction numbers, financial figures, and dates.

What to do next

Include the entity extraction models that contain Ruta scripts in your application by configuring a Text analyzer rule. For more information, see Building text analyzers.
