How to integrate with Apache UIMA

PRPC V6.2 includes the Apache UIMA Java framework, version 2.3.0. This lets your application call the text analysis and search facilities of the Unstructured Information Management Architecture (UIMA) library.

This feature requires advanced integration and Java skills. Visit the Apache UIMA site and contact Pegasystems for additional documentation.

ConceptMapper component

The ConceptMapper is a highly configurable, high-performance dictionary lookup tool, implemented as a UIMA component. Using one of several text matching algorithms, it maps entries in a dictionary onto input documents, producing UIMA annotations.

V6.2 incorporates a sample that demonstrates ConceptMapper. You can copy and extend this example. Pegasystems has modified the Java method setTokenizerDescriptor in class DictionaryResource to look up resources relative to the Java system property uima.datapath.

ConceptMapper uses these UIMA analysis engines:

Aggregate

OffsetTokenizerMatcher – Composed of primitive analysis engines OffsetTokenizer and ConceptMapperOffsetTokenizer.

Primitive(s)

OffsetTokenizer – Tokenizes the input unstructured text document.
ConceptMapperOffsetTokenizer – Performs dictionary lookup based on the text associated with each token.

Example

Using the initially installed settings and this dictionary:

<?xml version="1.0" encoding="UTF-8" ?>
<synonym>
<token canonical="United States" DOCNO="10000">
    <variant base="United States"/>
    <variant base="United States of America"/>
    <variant base="USA"/>
</token>
<token canonical="New York City" DOCNO="10001">
    <variant base = "New York City"/>
    <variant base = "NYC"/>
    <variant base = "Big Apple"/>
</token>
</synonym>

With the input string "The Big Apple is a nickname for New York City", the analysis produces the following XMI (excerpted):

<?xml version="1.0" encoding="UTF-8" ?>
…
<cas:Sofa xmi:id="1" sofaNum="1" sofaID="_InitialView" mimeType="text" sofaString="The Big Apple is a nickname for New York City" />
<tcas:DocumentAnnotation xmi:id="8" sofa="1" begin="0" end="45" language="x-unspecified" />
<tokenizer:TokenAnnotation xmi:id="13" sofa="1" begin="0" end="3" text="the" tokenType="0" />
<tokenizer:TokenAnnotation xmi:id="23" sofa="1" begin="4" end="7" text="big" tokenType="0" />
…
<tokenizer:TokenAnnotation xmi:id="93" sofa="1" begin="36" end="40" text="york" tokenType="0" />
<tokenizer:TokenAnnotation xmi:id="103" sofa="1" begin="41" end="45" text="city" tokenType="0" />
<conceptMapper:DictTerm xmi:id="113" sofa="1" begin="4" end="13" DictCanon="New York City" enclosingSpan="8" matchedText="Big Apple" matchedTokens="23 33" />
<conceptMapper:DictTerm xmi:id="125" sofa="1" begin="32" end="45" DictCanon="New York City" enclosingSpan="8" matchedText="New York City" matchedTokens="83 93 103" />
…
</xmi:XMI>
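
To make the mapping in this example concrete, the following is a minimal sketch in plain Java, not the ConceptMapper implementation. It assumes a simplified matching strategy: lower-cased words compared as contiguous sequences. The real component offers several configurable matching algorithms. The class and method names (DictionaryLookupSketch, loadDictionary, findTerms) are hypothetical and exist only for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class DictionaryLookupSketch {

    // The sample dictionary from the example above (XML declaration omitted).
    public static final String DICTIONARY =
        "<synonym>"
        + "<token canonical=\"United States\" DOCNO=\"10000\">"
        + "<variant base=\"United States\"/>"
        + "<variant base=\"United States of America\"/>"
        + "<variant base=\"USA\"/>"
        + "</token>"
        + "<token canonical=\"New York City\" DOCNO=\"10001\">"
        + "<variant base=\"New York City\"/>"
        + "<variant base=\"NYC\"/>"
        + "<variant base=\"Big Apple\"/>"
        + "</token>"
        + "</synonym>";

    // Assumption: a "word" is any run of letters or digits.
    private static final Pattern WORD = Pattern.compile("[\\p{L}\\p{N}]+");

    /** Splits text into lower-cased words. */
    public static List<String> words(String text) {
        List<String> out = new ArrayList<>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            out.add(m.group().toLowerCase(Locale.ROOT));
        }
        return out;
    }

    /** Maps each variant (as a list of lower-cased words) to its canonical form. */
    public static Map<List<String>, String> loadDictionary(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            Map<List<String>, String> dictionary = new LinkedHashMap<>();
            NodeList tokens = doc.getElementsByTagName("token");
            for (int i = 0; i < tokens.getLength(); i++) {
                Element token = (Element) tokens.item(i);
                NodeList variants = token.getElementsByTagName("variant");
                for (int j = 0; j < variants.getLength(); j++) {
                    String base = ((Element) variants.item(j)).getAttribute("base");
                    dictionary.put(words(base), token.getAttribute("canonical"));
                }
            }
            return dictionary;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse dictionary XML", e);
        }
    }

    /** Reports every run of consecutive words that equals a dictionary variant. */
    public static List<String> findTerms(String text, Map<List<String>, String> dictionary) {
        List<int[]> spans = new ArrayList<>();   // character offsets of each word
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            spans.add(new int[] {m.start(), m.end()});
        }
        List<String> tokens = words(text);
        List<String> hits = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            for (Map.Entry<List<String>, String> entry : dictionary.entrySet()) {
                int last = i + entry.getKey().size();
                if (entry.getKey().isEmpty() || last > tokens.size()) {
                    continue;
                }
                if (tokens.subList(i, last).equals(entry.getKey())) {
                    int begin = spans.get(i)[0];
                    int end = spans.get(last - 1)[1];
                    hits.add(begin + "-" + end + " " + text.substring(begin, end)
                            + " -> " + entry.getValue());
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        String input = "The Big Apple is a nickname for New York City";
        for (String hit : findTerms(input, loadDictionary(DICTIONARY))) {
            System.out.println(hit);
        }
    }
}
```

For the input string above, the sketch reports "Big Apple" at offsets 4-13 and "New York City" at offsets 32-45, both mapped to the canonical form New York City. These are the same spans as the two DictTerm annotations in the XMI excerpt.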

Text file rules

Five sample descriptor files are saved as text file rules. Your application can override these as needed.

Analysis Engine Descriptors

TextAnalysis.pyOffsetTokenizerMatcher.xml
TextAnalysis.pyOffsetTokenizer.xml
TextAnalysis.pyConceptMapperOffsetTokenizer.xml

Type System Descriptors

TextAnalysis.pyTokenAnnotation.xml
TextAnalysis.pyDictTerm.xml

The dictionary resource is also a text file rule, TextAnalysis.pyDictionary.xml.

Standard activity

The standard activity @baseclass.pxUIMAConceptMapper implements the component. This activity:

initializes path variables
reads the descriptor and resource files (as text file rules) and creates corresponding XML files in the directory indicated by ${uima.datapath} on the file system
reads the aggregate analysis engine description
performs the analysis on the input text document.
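
The second step above, copying descriptor and resource text into files under the uima.datapath directory, can be sketched roughly as follows. This is an illustration in plain Java, not the code of the standard activity. The class name, the method name, the on-disk file names, and the placeholder contents are all assumptions; how the activity actually names the files it writes is not documented here.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DescriptorFileWriterSketch {

    /**
     * Writes each named XML text into the given directory and returns the
     * paths that were written. The directory is created if it is missing.
     */
    public static List<Path> writeResources(Path dataPath, Map<String, String> resources) {
        try {
            Files.createDirectories(dataPath);
            List<Path> written = new ArrayList<>();
            for (Map.Entry<String, String> resource : resources.entrySet()) {
                Path target = dataPath.resolve(resource.getKey());
                Files.write(target, resource.getValue().getBytes(StandardCharsets.UTF_8));
                written.add(target);
            }
            return written;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Fall back to the JVM temp directory when uima.datapath is not set.
        String configured = System.getProperty("uima.datapath",
                System.getProperty("java.io.tmpdir"));

        // Hypothetical file names and placeholder contents, for illustration only.
        Map<String, String> resources = new LinkedHashMap<>();
        resources.put("OffsetTokenizerMatcher.xml", "<placeholder/>");
        resources.put("Dictionary.xml", "<synonym/>");

        for (Path path : writeResources(Paths.get(configured), resources)) {
            System.out.println("wrote " + path);
        }
    }
}
```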

JAR Files

V6.2 includes these libraries in prprivate/libredist:

apache-uima/lib/uima-core.jar
apache-uima/lib/uima-cpe.jar
apache-uima/lib/uima-document-annotation.jar
apache-uima/lib/uimaj-bootstrap.jar
apache-uima/lib/uima-tools.jar
apache-uima/addons/annotator/ConceptMapper/lib/uima-an-conceptMapper.jar

Supply values for these parameters as input to the pxUIMAConceptMapper activity:

content – The unstructured text to be annotated
resourceNames – Optional. Comma-separated list of the XML resource/configuration files
analysisDescriptor – Optional. Name of the analysis engine descriptor
startIndex – Optional. Index at which to start the analysis (integer)
endIndex – Optional. Index at which to end the analysis (integer)

The activity returns its result on the parameter page as the value of a parameter named xmiOutput, an XML document in XML Metadata Interchange (XMI) format. The output contains two annotation types: TokenAnnotation, produced by the OffsetTokenizer annotator, and DictTerm, produced by the ConceptMapperOffsetTokenizer annotator.
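
Once the xmiOutput value is available as a string, the DictTerm annotations can be read with any XML parser. The following is a minimal sketch using the JDK's DOM parser. It assumes the element and attribute names shown in the XMI excerpt earlier in this article. The class and method names are hypothetical, and the namespace URIs in the sample are placeholders, because the excerpt omits the root element's declarations.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XmiOutputReaderSketch {

    // Two DictTerm elements copied from the excerpt, wrapped in a root element
    // with placeholder namespace URIs (the real output declares its own).
    public static final String SAMPLE_XMI =
        "<xmi:XMI xmlns:xmi=\"urn:placeholder:xmi\""
        + " xmlns:conceptMapper=\"urn:placeholder:conceptMapper\">"
        + "<conceptMapper:DictTerm xmi:id=\"113\" sofa=\"1\" begin=\"4\" end=\"13\""
        + " DictCanon=\"New York City\" enclosingSpan=\"8\""
        + " matchedText=\"Big Apple\" matchedTokens=\"23 33\" />"
        + "<conceptMapper:DictTerm xmi:id=\"125\" sofa=\"1\" begin=\"32\" end=\"45\""
        + " DictCanon=\"New York City\" enclosingSpan=\"8\""
        + " matchedText=\"New York City\" matchedTokens=\"83 93 103\" />"
        + "</xmi:XMI>";

    /** Returns one line per DictTerm: matched text, canonical form, and offsets. */
    public static List<String> dictTerms(String xmi) {
        try {
            // The default factory is not namespace-aware, so elements are
            // looked up by their prefixed names exactly as they appear.
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xmi)));
            NodeList terms = doc.getElementsByTagName("conceptMapper:DictTerm");
            List<String> out = new ArrayList<>();
            for (int i = 0; i < terms.getLength(); i++) {
                Element term = (Element) terms.item(i);
                int begin = Integer.parseInt(term.getAttribute("begin"));
                int end = Integer.parseInt(term.getAttribute("end"));
                out.add(term.getAttribute("matchedText") + " -> "
                        + term.getAttribute("DictCanon")
                        + " (" + begin + "-" + end + ")");
            }
            return out;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XMI output", e);
        }
    }

    public static void main(String[] args) {
        for (String line : dictTerms(SAMPLE_XMI)) {
            System.out.println(line);
        }
    }
}
```

The same approach applies to TokenAnnotation elements by changing the element name passed to getElementsByTagName.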

Initial setup

Define a Java system property named uima.datapath whose value is a directory path on the current server node's file system. The server process must have write access to that directory.
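
A Java system property is commonly supplied as a JVM argument of the form -Duima.datapath=<directory>; where that argument is configured depends on your application server. As a quick sanity check, a small program such as the sketch below can confirm that the property is set and points at a writable directory. The class and method names are hypothetical and are not part of PRPC or UIMA.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;

public class UimaDataPathCheck {

    public static final String PROPERTY = "uima.datapath";

    /** Returns a description of the problem, or empty when the setting looks usable. */
    public static Optional<String> findProblem() {
        String value = System.getProperty(PROPERTY);
        if (value == null || value.trim().isEmpty()) {
            return Optional.of("System property " + PROPERTY + " is not set");
        }
        Path dir = Paths.get(value);
        if (!Files.isDirectory(dir)) {
            return Optional.of(value + " is not an existing directory");
        }
        if (!Files.isWritable(dir)) {
            return Optional.of(value + " is not writable by this process");
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(findProblem().orElse(PROPERTY + " looks usable"));
    }
}
```

Run the check with the same user account and JVM arguments as the server node, since write access depends on the account that owns the process.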
