How to integrate with Apache UIMA

PRPC V6.2 includes version 2.3.0 of the Apache UIMA Java framework. This allows your application to call the text analysis and search facilities of the Unstructured Information Management Architecture (UIMA) library.

Advanced: This feature requires advanced integration and Java skills. Visit the Apache UIMA site and contact Pegasystems for additional documentation.

ConceptMapper component

The ConceptMapper is a highly configurable, high-performance dictionary lookup tool, implemented as a UIMA component. Using one of several text matching algorithms, it maps entries in a dictionary onto input documents, producing UIMA annotations.

V6.2 incorporates a sample that demonstrates ConceptMapper. You can copy and extend this example. Pegasystems has modified the Java method setTokenizerDescriptor in class DictionaryResource to look up resources relative to the Java system property uima.datapath.

ConceptMapper uses these UIMA analysis engines:

Aggregate

Primitive(s)

Example

Using the initially installed settings and this dictionary:

<?xml version="1.0" encoding="UTF-8" ?>
<synonym>
  <token canonical="United States" DOCNO="10000">
    <variant base="United States"/>
    <variant base="United States of America"/>
    <variant base="USA"/>
  </token>
  <token canonical="New York City" DOCNO="10001">
    <variant base = "New York City"/>
    <variant base = "NYC"/>
    <variant base = "Big Apple"/>
  </token>
</synonym>

Given the input string "The Big Apple is a nickname for New York City", ConceptMapper produces the following XMI (excerpted):

<?xml version="1.0" encoding="UTF-8" ?>
<xmi:XMI ...>
  <cas:Sofa xmi:id="1" sofaNum="1" sofaID="_InitialView" mimeType="text" sofaString="The Big Apple is a nickname for New York City" />
  <tcas:DocumentAnnotation xmi:id="8" sofa="1" begin="0" end="45" language="x-unspecified" />
  <tokenizer:TokenAnnotation xmi:id="13" sofa="1" begin="0" end="3" text="the" tokenType="0" />
  <tokenizer:TokenAnnotation xmi:id="23" sofa="1" begin="4" end="7" text="big" tokenType="0" />

  <tokenizer:TokenAnnotation xmi:id="93" sofa="1" begin="36" end="40" text="york" tokenType="0" />
  <tokenizer:TokenAnnotation xmi:id="103" sofa="1" begin="41" end="45" text="city" tokenType="0" />
  <conceptMapper:DictTerm xmi:id="113" sofa="1" begin="4" end="13" DictCanon="New York City" enclosingSpan="8" matchedText="Big Apple" matchedTokens="23 33" />
  <conceptMapper:DictTerm xmi:id="125" sofa="1" begin="32" end="45" DictCanon="New York City" enclosingSpan="8" matchedText="New York City" matchedTokens="83 93 103" />

</xmi:XMI>
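
To make the relationship between the dictionary and the DictTerm annotations concrete, the following is a minimal sketch in plain Java. It is not the ConceptMapper implementation: it replaces ConceptMapper's token-based matching with a naive case-insensitive substring search, and the class and method names (DictionaryLookupSketch, loadVariants, findMatches) are hypothetical. It only shows how each variant resolves to its canonical form and to begin/end character offsets in the input.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Illustrative only: a naive substring lookup, not the ConceptMapper algorithm.
class DictionaryLookupSketch {

    // The two-entry dictionary from the example above (XML declaration omitted).
    static final String SAMPLE_DICTIONARY =
        "<synonym>"
        + "<token canonical=\"United States\" DOCNO=\"10000\">"
        + "<variant base=\"United States\"/>"
        + "<variant base=\"United States of America\"/>"
        + "<variant base=\"USA\"/>"
        + "</token>"
        + "<token canonical=\"New York City\" DOCNO=\"10001\">"
        + "<variant base=\"New York City\"/>"
        + "<variant base=\"NYC\"/>"
        + "<variant base=\"Big Apple\"/>"
        + "</token>"
        + "</synonym>";

    /** One hit: character offsets into the input, canonical form, matched text. */
    static final class Match {
        final int begin;
        final int end;
        final String canonical;
        final String matchedText;

        Match(int begin, int end, String canonical, String matchedText) {
            this.begin = begin;
            this.end = end;
            this.canonical = canonical;
            this.matchedText = matchedText;
        }

        @Override
        public String toString() {
            return matchedText + " [" + begin + "," + end + ") -> " + canonical;
        }
    }

    /** Builds a map from each lower-cased variant to its canonical form. */
    static Map<String, String> loadVariants(String dictionaryXml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(dictionaryXml)));
            Map<String, String> variantToCanonical = new LinkedHashMap<>();
            NodeList tokens = doc.getElementsByTagName("token");
            for (int i = 0; i < tokens.getLength(); i++) {
                Element token = (Element) tokens.item(i);
                NodeList variants = token.getElementsByTagName("variant");
                for (int j = 0; j < variants.getLength(); j++) {
                    String base = ((Element) variants.item(j)).getAttribute("base");
                    variantToCanonical.put(base.toLowerCase(Locale.ROOT),
                            token.getAttribute("canonical"));
                }
            }
            return variantToCanonical;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse dictionary", e);
        }
    }

    /** Finds every variant occurrence; assumes lower-casing keeps string length (true for ASCII). */
    static List<Match> findMatches(String text, Map<String, String> variants) {
        String lower = text.toLowerCase(Locale.ROOT);
        List<Match> matches = new ArrayList<>();
        for (Map.Entry<String, String> entry : variants.entrySet()) {
            int from = 0;
            int at;
            while ((at = lower.indexOf(entry.getKey(), from)) >= 0) {
                int end = at + entry.getKey().length();
                matches.add(new Match(at, end, entry.getValue(), text.substring(at, end)));
                from = end;
            }
        }
        matches.sort(Comparator.comparingInt(m -> m.begin));
        return matches;
    }

    public static void main(String[] args) {
        String input = "The Big Apple is a nickname for New York City";
        for (Match match : findMatches(input, loadVariants(SAMPLE_DICTIONARY))) {
            System.out.println(match);
        }
    }
}
```

For the sample input, this sketch reports the same two spans as the DictTerm annotations in the XMI excerpt (offsets 4 to 13 and 32 to 45). Because it ignores token boundaries, it would also match a variant embedded in a longer word, which the real component's tokenizer is there to prevent.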

Text file rules

Five sample descriptor files are saved as text file rules. Your application can override these as needed.

Analysis Engine Descriptors

Type System Descriptors

The dictionary resource is also a text file rule, TextAnalysis.pyDictionary.xml.

Standard activity

The standard activity @baseclass.pxUIMAConceptMapper implements the component. This activity:

  1. initializes path variables
  2. reads the descriptor and resource files (as text file rules) and creates corresponding XML files in the directory indicated by ${uima.datapath} on the file system
  3. reads the aggregate analysis engine description
  4. performs the analysis on the input text document.

JAR Files

V6.2 includes these libraries in prprivate/libredist:

Supply values for these activity parameters as input:

The activity returns its result on the parameter page, as the value of a parameter named xmiOutput, an XML document in XML Metadata Interchange (XMI) format. The output contains two annotation types: TokenAnnotation, which is produced by the OffsetTokenizer annotator, and DictTerm, which is produced by the ConceptMapperOffsetTokenizer annotator.
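
Once the value of xmiOutput is available as a string, the DictTerm annotations can be read with the standard Java DOM parser. The following is a minimal sketch, not part of the product: the class name XmiDictTermReader and the method readDictTerms are hypothetical, the sample string is cut down from the excerpt above, and the code assumes the DictTerm elements use the conceptMapper prefix shown there. The parser is deliberately left non-namespace-aware so that elements can be matched by their prefixed names.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Illustrative only: extracts DictTerm annotations from an XMI string.
class XmiDictTermReader {

    // Cut-down sample based on the excerpt above; real output also carries
    // namespace declarations and the other annotations.
    static final String SAMPLE_XMI =
        "<xmi:XMI>"
        + "<conceptMapper:DictTerm xmi:id=\"113\" sofa=\"1\" begin=\"4\" end=\"13\""
        + " DictCanon=\"New York City\" enclosingSpan=\"8\""
        + " matchedText=\"Big Apple\" matchedTokens=\"23 33\"/>"
        + "<conceptMapper:DictTerm xmi:id=\"125\" sofa=\"1\" begin=\"32\" end=\"45\""
        + " DictCanon=\"New York City\" enclosingSpan=\"8\""
        + " matchedText=\"New York City\" matchedTokens=\"83 93 103\"/>"
        + "</xmi:XMI>";

    /** Returns one line per DictTerm: matched text, canonical form, offsets. */
    static List<String> readDictTerms(String xmi) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Keep prefixed names such as conceptMapper:DictTerm as plain tag names.
            factory.setNamespaceAware(false);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xmi)));
            NodeList terms = doc.getElementsByTagName("conceptMapper:DictTerm");
            List<String> result = new ArrayList<>();
            for (int i = 0; i < terms.getLength(); i++) {
                Element term = (Element) terms.item(i);
                int begin = Integer.parseInt(term.getAttribute("begin"));
                int end = Integer.parseInt(term.getAttribute("end"));
                result.add(term.getAttribute("matchedText") + " -> "
                        + term.getAttribute("DictCanon") + " [" + begin + "," + end + ")");
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XMI output", e);
        }
    }

    public static void main(String[] args) {
        for (String line : readDictTerms(SAMPLE_XMI)) {
            System.out.println(line);
        }
    }
}
```

The same approach applies to the TokenAnnotation elements by changing the tag name passed to getElementsByTagName.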

Initial setup

Create a Java system property uima.datapath, set to the path of a directory on the current server node's file system to which the server has write access.
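
Java system properties are commonly supplied with the -D option on the JVM command line, for example -Duima.datapath=/path/to/dir, where the path is a placeholder; how that option is passed depends on your application server. The following sketch, with the hypothetical class name UimaDataPathCheck, shows one way to confirm from Java code that the property is set and points to a writable directory. It is a diagnostic aid under those assumptions, not something the product requires.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Illustrative only: verifies the uima.datapath system property.
class UimaDataPathCheck {

    /** Returns the configured directory, or throws if it is missing or unusable. */
    static Path requireWritableDataPath() {
        String value = System.getProperty("uima.datapath");
        if (value == null || value.trim().isEmpty()) {
            throw new IllegalStateException("System property uima.datapath is not set");
        }
        Path dir = Paths.get(value);
        if (!Files.isDirectory(dir)) {
            throw new IllegalStateException("Not a directory: " + dir);
        }
        if (!Files.isWritable(dir)) {
            throw new IllegalStateException("No write access to: " + dir);
        }
        return dir;
    }

    public static void main(String[] args) {
        System.out.println("uima.datapath resolves to " + requireWritableDataPath());
    }
}
```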

Related topics: About Text File rules
