Integrating with Apache UIMA

ConceptMapper component

ConceptMapper is a highly configurable, high-performance dictionary lookup tool, implemented as a UIMA component. Using one of several text matching algorithms, it maps entries in a dictionary onto input documents, producing UIMA annotations.

A sample demonstrates ConceptMapper; you can copy and extend it. Pegasystems has modified the Java method setTokenizerDescriptor in class DictionaryResource to look up resources relative to the Java system property uima.datapath.

ConceptMapper uses these UIMA analysis engines:

Aggregate
- OffsetTokenizerMatcher – Composed of primitive analysis engines OffsetTokenizer and ConceptMapperOffsetTokenizer
Primitive(s)
- OffsetTokenizer – Tokenizes the unstructured input text document
- ConceptMapperOffsetTokenizer – Performs dictionary lookup based on the text associated with each token
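
The division of work between the two primitive engines can be illustrated with a small, self-contained Java sketch. It is not ConceptMapper's code and does not use the UIMA API: the class name LookupSketch, the regular-expression tokenizer, and the longest-contiguous-match strategy are all assumptions made for illustration. Tokens are lowercased because the sample output later in this topic shows lowercased token text; the dictionary keys passed in are therefore expected to be lowercase as well.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: a toy tokenizer and dictionary lookup, not the UIMA implementation.
class LookupSketch {

    /** A token with character offsets into the original text (end is exclusive). */
    static final class Token {
        final int begin;
        final int end;
        final String text;

        Token(int begin, int end, String text) {
            this.begin = begin;
            this.end = end;
            this.text = text;
        }
    }

    /** A dictionary hit: the span it covers and the canonical form it maps to. */
    static final class Match {
        final int begin;
        final int end;
        final String canonical;

        Match(int begin, int end, String canonical) {
            this.begin = begin;
            this.end = end;
            this.canonical = canonical;
        }
    }

    /** Step 1 (tokenizer role): find word runs, lowercase them, keep their offsets. */
    static List<Token> tokenize(String input) {
        List<Token> tokens = new ArrayList<>();
        Matcher m = Pattern.compile("\\w+").matcher(input);
        while (m.find()) {
            tokens.add(new Token(m.start(), m.end(), m.group().toLowerCase(Locale.ROOT)));
        }
        return tokens;
    }

    /** Step 2 (lookup role): at each token, take the longest token run that is a dictionary variant. */
    static List<Match> lookup(String input, Map<String, String> lowercaseVariantToCanonical) {
        List<Token> tokens = tokenize(input);
        List<Match> matches = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            int bestEnd = -1;
            String bestCanonical = null;
            StringBuilder phrase = new StringBuilder();
            for (int j = i; j < tokens.size(); j++) {
                if (j > i) {
                    phrase.append(' ');
                }
                phrase.append(tokens.get(j).text);
                String canonical = lowercaseVariantToCanonical.get(phrase.toString());
                if (canonical != null) {
                    bestEnd = j;
                    bestCanonical = canonical;
                }
            }
            if (bestEnd >= 0) {
                matches.add(new Match(tokens.get(i).begin, tokens.get(bestEnd).end, bestCanonical));
                i = bestEnd + 1; // continue after the matched run
            } else {
                i++;
            }
        }
        return matches;
    }
}
```

Given a map containing "big apple" and "new york city", both mapped to "New York City", calling LookupSketch.lookup on "The Big Apple is a nickname for New York City" returns two matches covering offsets 4 to 13 and 32 to 45. The real component offers several matching algorithms and configuration options that this sketch does not model.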

Example

Using the initially installed settings and this dictionary:

<?xml version="1.0" encoding="UTF-8" ?>
<synonym>
	<token canonical="United States" DOCNO="10000">
		<variant base="United States"/>
		<variant base="United States of America"/>
		<variant base="USA"/>
	</token>
	<token canonical="New York City" DOCNO="10001">
		<variant base = "New York City"/>
		<variant base = "NYC"/>
		<variant base = "Big Apple"/>
	</token>
</synonym>

With the input string "The Big Apple is a nickname for New York City", the component produces the following XMI (excerpted):

<?xml version="1.0" encoding="UTF-8" ?>
<xmi:XMI …>
	<cas:Sofa xmi:id="1" sofaNum="1" sofaID="_InitialView" mimeType="text" sofaString="The Big Apple is a nickname for New York City" />
	<tcas:DocumentAnnotation xmi:id="8" sofa="1" begin="0" end="45" language="x-unspecified" />
	<tokenizer:TokenAnnotation xmi:id="13" sofa="1" begin="0" end="3" text="the" tokenType="0" />
	<tokenizer:TokenAnnotation xmi:id="23" sofa="1" begin="4" end="7" text="big" tokenType="0" />
	…
	<tokenizer:TokenAnnotation xmi:id="93" sofa="1" begin="36" end="40" text="york" tokenType="0" />
	<tokenizer:TokenAnnotation xmi:id="103" sofa="1" begin="41" end="45" text="city" tokenType="0" />
	<conceptMapper:DictTerm xmi:id="113" sofa="1" begin="4" end="13" DictCanon="New York City" enclosingSpan="8" matchedText="Big Apple" matchedTokens="23 33" />
	<conceptMapper:DictTerm xmi:id="125" sofa="1" begin="32" end="45" DictCanon="New York City" enclosingSpan="8" matchedText="New York City" matchedTokens="83 93 103" />
	…
</xmi:XMI>
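
To consume output like this, an application can read the DictTerm elements with the XML parser that ships with the JDK. The following is a minimal sketch under stated assumptions: the class name DictTermReader and the formatted result strings are hypothetical, and the parser is deliberately left non-namespace-aware so that elements are matched by their prefixed name (conceptMapper:DictTerm) exactly as they appear in the excerpt.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper: lists the dictionary matches found in an XMI string.
class DictTermReader {

    /** Returns one "matchedText -> DictCanon [begin,end)" entry per DictTerm element. */
    static List<String> readDictTerms(String xmi) {
        try {
            // The default factory is not namespace-aware, so lookups use the prefixed tag name.
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(
                    new ByteArrayInputStream(xmi.getBytes(StandardCharsets.UTF_8)));
            NodeList terms = doc.getElementsByTagName("conceptMapper:DictTerm");
            List<String> result = new ArrayList<>();
            for (int i = 0; i < terms.getLength(); i++) {
                Element term = (Element) terms.item(i);
                result.add(term.getAttribute("matchedText")
                        + " -> " + term.getAttribute("DictCanon")
                        + " [" + term.getAttribute("begin")
                        + "," + term.getAttribute("end") + ")");
            }
            return result;
        } catch (Exception e) {
            // Parser configuration, I/O, and well-formedness errors all end up here.
            throw new IllegalArgumentException("Could not parse XMI", e);
        }
    }
}
```

The begin and end attributes are character offsets into sofaString, with end exclusive: in the excerpt, "Big Apple" spans offsets 4 to 13.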

Text file rules

Five sample descriptor files are saved as text file rules. Your application can override these as needed.

Analysis Engine Descriptors
- TextAnalysis.pyOffsetTokenizerMatcher.xml
- TextAnalysis.pyOffsetTokenizer.xml
- TextAnalysis.pyConceptMapperOffsetTokenizer.xml
Type System Descriptors
- TextAnalysis.pyTokenAnnotation.xml
- TextAnalysis.pyDictTerm.xml

The dictionary resource is also a text file rule, TextAnalysis.pyDictionary.xml.

Standard activity

The standard activity @baseclass.pxUIMAConceptMapper implements the component. This activity performs the following actions:

Initializes path variables
Reads the descriptor and resource files (as text file rules) and creates corresponding XML files in the directory indicated by ${uima.datapath} on the file system
Reads the aggregate analysis engine description
Performs the analysis on the input text document
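
The second action, turning rule text into XML files under the data path, can be sketched with the JDK alone. This is not the activity's actual code: the class name DescriptorWriter, the method writeAll, and the file names a caller passes in are hypothetical, and the directory argument stands in for the value of ${uima.datapath}.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;

// Hypothetical helper: writes descriptor and resource XML into a data path directory.
class DescriptorWriter {

    /** Writes each named XML text into dataPath, creating the directory if it is missing. */
    static void writeAll(Map<String, String> xmlByFileName, Path dataPath) {
        try {
            Files.createDirectories(dataPath);
            Path root = dataPath.normalize();
            for (Map.Entry<String, String> entry : xmlByFileName.entrySet()) {
                Path target = root.resolve(entry.getKey()).normalize();
                // Reject names such as "../x.xml" that would escape the data path.
                if (!target.startsWith(root)) {
                    throw new IllegalArgumentException("Illegal file name: " + entry.getKey());
                }
                Files.write(target, entry.getValue().getBytes(StandardCharsets.UTF_8));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Existing files with the same name are overwritten, which suits a step that is meant to refresh the files on each run; whether the standard activity behaves this way is not stated here.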

JAR Files

The following libraries are included in prprivate/libredist:

apache-uima/lib/uima-core.jar
apache-uima/lib/uima-cpe.jar
apache-uima/lib/uima-document-annotation.jar
apache-uima/lib/uimaj-bootstrap.jar
apache-uima/lib/uima-tools.jar
apache-uima/addons/annotator/ConceptMapper/lib/uima-an-conceptMapper.jar

Supply values for these parameters as input to the pxUIMAConceptMapper activity:

content – The unstructured text to be annotated
resourceNames – Optional. Comma-separated list of the XML resource and configuration files
analysisDescriptor – Optional. Name of the analysis engine descriptor
startIndex – Optional. Index at which to start the analysis (integer)
endIndex – Optional. Index at which to end the analysis (integer)

The activity returns its result on the parameter page, as the value of a parameter named xmiOutput: an XML document in XML Metadata Interchange (XMI) format. The output contains annotations of two types: TokenAnnotation, which is produced by the OffsetTokenizer annotator, and DictTerm, which is produced by the ConceptMapperOffsetTokenizer annotator.

Initial setup

Create a Java system property named uima.datapath and set it to a directory on the current server node's file system to which the server process has write access.
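
A Java system property is typically set with a JVM option of the form -Duima.datapath=<directory> when the server starts. The following sketch shows one way to verify the setup from Java before running the activity; the class name DataPathCheck and its error messages are hypothetical and are not part of the product.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical check: confirms that uima.datapath names a writable directory.
class DataPathCheck {

    /** Returns the directory named by uima.datapath, or throws if it is unset or unusable. */
    static Path resolveDataPath() {
        String value = System.getProperty("uima.datapath");
        if (value == null || value.trim().isEmpty()) {
            throw new IllegalStateException("System property uima.datapath is not set");
        }
        Path dir = Paths.get(value.trim());
        if (!Files.isDirectory(dir)) {
            throw new IllegalStateException("uima.datapath is not a directory: " + dir);
        }
        if (!Files.isWritable(dir)) {
            throw new IllegalStateException("uima.datapath is not writable: " + dir);
        }
        return dir;
    }
}
```

Because the property is read per JVM, a check like this has to pass on every server node that runs the activity.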