How to integrate with Apache UIMA
This feature requires advanced integration and Java skills. Visit the Apache UIMA site and contact Pegasystems for additional documentation.
ConceptMapper component
ConceptMapper is a highly configurable, high-performance dictionary lookup tool implemented as a UIMA component. Using one of several text-matching algorithms, it maps dictionary entries onto input documents and produces UIMA annotations.
A sample that demonstrates ConceptMapper is included. You can copy and extend this example. Pegasystems has modified the Java method setTokenizerDescriptor in class DictionaryResource to look up resources relative to the Java system property uima.datapath.
ConceptMapper uses these UIMA analysis engines:
Aggregate
- OffsetTokenizerMatcher – Composed of primitive analysis engines OffsetTokenizer and ConceptMapperOffsetTokenizer.
Primitive(s)
- OffsetTokenizer – Tokenizes the input unstructured text document
- ConceptMapperOffsetTokenizer – Performs dictionary lookup based on the text associated with each token
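The division of labor between the two primitive engines can be illustrated with a small, self-contained Java sketch. It does not use the UIMA API and is not ConceptMapper's actual matching algorithm. The class name TwoStageSketch, the Span holder, the \w+ token rule, and the naive phrase search are all assumptions made for illustration. Stage one records each token with its character offsets, and stage two looks up runs of adjacent tokens in a variant-to-canonical map.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not part of UIMA or ConceptMapper.
class TwoStageSketch {
    /** A piece of text with character offsets into the input (end is exclusive). */
    static final class Span {
        final int begin;
        final int end;
        final String text;

        Span(int begin, int end, String text) {
            this.begin = begin;
            this.end = end;
            this.text = text;
        }
    }

    /** Stage 1: find word tokens, keeping their offsets and lowercased text. */
    static List<Span> tokenize(String input) {
        List<Span> tokens = new ArrayList<>();
        Matcher m = Pattern.compile("\\w+").matcher(input);
        while (m.find()) {
            tokens.add(new Span(m.start(), m.end(), m.group().toLowerCase(Locale.ROOT)));
        }
        return tokens;
    }

    /** Stage 2: report each run of adjacent tokens whose joined text is a dictionary key. */
    static List<Span> lookup(List<Span> tokens, Map<String, String> variantToCanonical) {
        List<Span> matches = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            StringBuilder phrase = new StringBuilder();
            for (int j = i; j < tokens.size(); j++) {
                if (j > i) {
                    phrase.append(' ');
                }
                phrase.append(tokens.get(j).text);
                String canonical = variantToCanonical.get(phrase.toString());
                if (canonical != null) {
                    // The match spans from the first token's begin to the last token's end.
                    matches.add(new Span(tokens.get(i).begin, tokens.get(j).end, canonical));
                }
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        // Keys are lowercased because tokenize() lowercases token text.
        Map<String, String> dictionary = new HashMap<>();
        dictionary.put("big apple", "New York City");
        dictionary.put("new york city", "New York City");
        dictionary.put("nyc", "New York City");

        String input = "The Big Apple is a nickname for New York City";
        for (Span match : lookup(tokenize(input), dictionary)) {
            System.out.println(match.begin + "-" + match.end + " -> " + match.text);
        }
    }
}
```

For the input string "The Big Apple is a nickname for New York City", this sketch reports two matches, at offsets 4 to 13 and 32 to 45.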
Example
This example uses the initially installed settings and the following dictionary:
<?xml version="1.0" encoding="UTF-8" ?>
<synonym>
<token canonical="United States" DOCNO="10000">
<variant base="United States"/>
<variant base="United States of America"/>
<variant base="USA"/>
</token>
<token canonical="New York City" DOCNO="10001">
<variant base = "New York City"/>
<variant base = "NYC"/>
<variant base = "Big Apple"/>
</token>
</synonym>
With the input string "The Big Apple is a nickname for New York City", ConceptMapper produces the following XMI (excerpted):
<?xml version="1.0" encoding="UTF-8" ?>
…
<cas:Sofa xmi:id="1" sofaNum="1" sofaID="_InitialView" mimeType="text" sofaString="The Big Apple is a nickname for New York City" />
<tcas:DocumentAnnotation xmi:id="8" sofa="1" begin="0" end="45" language="x-unspecified" />
<tokenizer:TokenAnnotation xmi:id="13" sofa="1" begin="0" end="3" text="the" tokenType="0" />
<tokenizer:TokenAnnotation xmi:id="23" sofa="1" begin="4" end="7" text="big" tokenType="0" />
…
<tokenizer:TokenAnnotation xmi:id="93" sofa="1" begin="36" end="40" text="york" tokenType="0" />
<tokenizer:TokenAnnotation xmi:id="103" sofa="1" begin="41" end="45" text="city" tokenType="0" />
<conceptMapper:DictTerm xmi:id="113" sofa="1" begin="4" end="13" DictCanon="New York City" enclosingSpan="8" matchedText="Big Apple" matchedTokens="23 33" />
<conceptMapper:DictTerm xmi:id="125" sofa="1" begin="32" end="45" DictCanon="New York City" enclosingSpan="8" matchedText="New York City" matchedTokens="83 93 103" />
…
</xmi:XMI>
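The begin and end attributes in the excerpt are character offsets into sofaString, with end exclusive, so the text covered by each annotation can be recovered with a substring call. The following Java sketch checks this against the two DictTerm annotations above. The class and method names are hypothetical, and the code uses only the standard library rather than the UIMA API.

```java
// Hypothetical illustration only; not part of UIMA or ConceptMapper.
class OffsetSketch {
    /** Returns the text covered by an annotation, given its begin and end offsets. */
    static String coveredText(String sofaString, int begin, int end) {
        return sofaString.substring(begin, end);
    }

    public static void main(String[] args) {
        String sofa = "The Big Apple is a nickname for New York City";
        System.out.println(coveredText(sofa, 4, 13));   // Big Apple
        System.out.println(coveredText(sofa, 32, 45));  // New York City
    }
}
```

The recovered text keeps its original case, as in the matchedText attribute, whereas the TokenAnnotation text attributes in the excerpt are lowercased.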
Text file rules
Five sample descriptor files are saved as text file rules. Your application can override these as needed.
Analysis Engine Descriptors
- TextAnalysis.pyOffsetTokenizerMatcher.xml
- TextAnalysis.pyOffsetTokenizer.xml
- TextAnalysis.pyConceptMapperOffsetTokenizer.xml
Type System Descriptors
- TextAnalysis.pyTokenAnnotation.xml
- TextAnalysis.pyDictTerm.xml
The dictionary resource is also a text file rule, TextAnalysis.pyDictionary.xml.
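Because the dictionary is plain XML, its structure can be inspected with the standard Java XML parser. The sketch below, which assumes the token, variant, canonical, and base names shown in the example dictionary above, builds a map from each variant to its canonical form. The class name DictionarySketch is hypothetical, and this is not how ConceptMapper itself loads the dictionary.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical illustration only; not part of UIMA or ConceptMapper.
class DictionarySketch {
    /** Maps each variant's "base" text to the "canonical" value of its enclosing token. */
    static Map<String, String> variantsToCanonical(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            Map<String, String> result = new LinkedHashMap<>();
            NodeList tokens = doc.getElementsByTagName("token");
            for (int i = 0; i < tokens.getLength(); i++) {
                Element token = (Element) tokens.item(i);
                String canonical = token.getAttribute("canonical");
                NodeList variants = token.getElementsByTagName("variant");
                for (int j = 0; j < variants.getLength(); j++) {
                    Element variant = (Element) variants.item(j);
                    result.put(variant.getAttribute("base"), canonical);
                }
            }
            return result;
        } catch (Exception e) {
            throw new IllegalArgumentException("dictionary is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\" ?>"
                + "<synonym><token canonical=\"New York City\" DOCNO=\"10001\">"
                + "<variant base = \"NYC\"/><variant base = \"Big Apple\"/>"
                + "</token></synonym>";
        variantsToCanonical(xml).forEach((variant, canonical) ->
                System.out.println(variant + " -> " + canonical));
    }
}
```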
Standard activity
The standard activity @baseclass.pxUIMAConceptMapper implements the component. This activity:
- initializes path variables
- reads the descriptor and resource files (as text file rules) and creates corresponding XML files in the directory indicated by ${uima.datapath} on the file system
- reads the aggregate analysis engine description
- performs the analysis on the input text document.
JAR Files
The following libraries are included in prprivate/libredist:
- apache-uima/lib/uima-core.jar
- apache-uima/lib/uima-cpe.jar
- apache-uima/lib/uima-document-annotation.jar
- apache-uima/lib/uimaj-bootstrap.jar
- apache-uima/lib/uima-tools.jar
- apache-uima/addons/annotator/ConceptMapper/lib/uima-an-conceptMapper.jar
Supply values for these parameters as input to the pxUIMAConceptMapper activity:
- content – The unstructured text to be annotated
- resourceNames – Optional. Comma-separated list of the XML resource/configuration files
- analysisDescriptor – Optional. Name of the analysis engine descriptor
- startIndex – Optional. Index at which to start the analysis (integer)
- endIndex – Optional. Index at which to end the analysis (integer)
The activity returns its result on the parameter page as the value of a parameter named xmiOutput, an XML document in XML Metadata Interchange (XMI) format. The output contains two annotation types: TokenAnnotation, which is produced by the OffsetTokenizer annotator, and DictTerm, which is produced by the ConceptMapperOffsetTokenizer annotator.
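Once the xmiOutput value has been obtained as a string, the DictTerm annotations can be pulled out with the standard Java XML parser. The sketch below assumes the element name conceptMapper:DictTerm and the attributes DictCanon and matchedText exactly as they appear in the example excerpt. The namespace prefix in a real document may differ, and the class name XmiOutputSketch is hypothetical. It relies on the default, non-namespace-aware parser so that elements can be found by their prefixed names.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical illustration only; not part of UIMA or ConceptMapper.
class XmiOutputSketch {
    /** Returns one "matchedText -> DictCanon" line per DictTerm element in the XMI. */
    static List<String> dictTerms(String xmiOutput) {
        try {
            NodeList terms = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xmiOutput)))
                    .getElementsByTagName("conceptMapper:DictTerm");
            List<String> result = new ArrayList<>();
            for (int i = 0; i < terms.getLength(); i++) {
                Element term = (Element) terms.item(i);
                result.add(term.getAttribute("matchedText")
                        + " -> " + term.getAttribute("DictCanon"));
            }
            return result;
        } catch (Exception e) {
            throw new IllegalArgumentException("xmiOutput is not well-formed XML", e);
        }
    }
}
```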
Initial setup
Create a Java system property named uima.datapath and set it to a directory path on the current server node's file system to which the server has write access.
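A Java system property is commonly supplied as a JVM option of the form -Duima.datapath=<directory>; how that option is passed depends on your application server. As a quick sanity check, a small Java sketch such as the following can confirm that the property is set and points to a writable directory. The class name DataPathCheck and its messages are hypothetical and are not part of the product.

```java
import java.io.File;

// Hypothetical illustration only; not part of the product or of UIMA.
class DataPathCheck {
    /** Returns null when the path is usable, otherwise a short description of the problem. */
    static String problemWith(String dataPath) {
        if (dataPath == null || dataPath.isEmpty()) {
            return "uima.datapath is not set";
        }
        File dir = new File(dataPath);
        if (!dir.isDirectory()) {
            return "not a directory: " + dataPath;
        }
        if (!dir.canWrite()) {
            return "not writable: " + dataPath;
        }
        return null;
    }

    public static void main(String[] args) {
        String problem = problemWith(System.getProperty("uima.datapath"));
        System.out.println(problem == null ? "uima.datapath looks usable" : problem);
    }
}
```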