Integrating with Apache UIMA

Updated on April 5, 2022

This feature requires advanced integration and Java skills. Visit the Apache UIMA site and contact Pegasystems for additional documentation.

ConceptMapper component

ConceptMapper is a highly configurable, high-performance dictionary lookup tool, implemented as a UIMA component. Using one of several text matching algorithms, it maps entries in a dictionary onto input documents, producing UIMA annotations.

A sample that demonstrates ConceptMapper is included. You can copy and extend this example. Pegasystems has modified the Java method setTokenizerDescriptor in the class DictionaryResource to look up resources relative to the Java system property uima.datapath.

ConceptMapper uses these UIMA analysis engines:

Aggregate
- OffsetTokenizerMatcher – Composed of the primitive analysis engines OffsetTokenizer and ConceptMapperOffsetTokenizer
Primitive
- OffsetTokenizer – Tokenizes the unstructured input text document
- ConceptMapperOffsetTokenizer – Performs dictionary lookup based on the text associated with each token
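
To make the tokenizing step more concrete, the following is a minimal sketch of offset-based tokenization in plain Java. It is not the OffsetTokenizer implementation and does not use the UIMA API. The class name OffsetTokenizerSketch, the whitespace-only splitting rule, and the lowercasing of token text are assumptions chosen to resemble the TokenAnnotation entries shown in the Example section.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Illustrative sketch only; not the UIMA OffsetTokenizer implementation. */
final class OffsetTokenizerSketch {

    /** A token with character offsets, loosely modeled on TokenAnnotation. */
    static final class Token {
        final int begin;
        final int end;
        final String text;

        Token(int begin, int end, String text) {
            this.begin = begin;
            this.end = end;
            this.text = text;
        }

        @Override
        public String toString() {
            return begin + "-" + end + " " + text;
        }
    }

    /** Splits on whitespace (an assumption), recording offsets and lowercased text. */
    static List<Token> tokenize(String input) {
        List<Token> tokens = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            // Skip whitespace between tokens.
            while (i < input.length() && Character.isWhitespace(input.charAt(i))) {
                i++;
            }
            int begin = i;
            // Consume the token itself.
            while (i < input.length() && !Character.isWhitespace(input.charAt(i))) {
                i++;
            }
            if (i > begin) {
                String text = input.substring(begin, i).toLowerCase(Locale.ROOT);
                tokens.add(new Token(begin, i, text));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (Token t : tokenize("The Big Apple is a nickname for New York City")) {
            System.out.println(t);
        }
    }
}
```

Each token keeps its begin and end offsets into the original text, which is what allows a later lookup step to report where in the document a dictionary entry matched.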

Example

This example uses the initially installed settings and this dictionary:

<?xml version="1.0" encoding="UTF-8" ?>
<synonym>
	<token canonical="United States" DOCNO="10000">
		<variant base="United States"/>
		<variant base="United States of America"/>
		<variant base="USA"/>
	</token>
	<token canonical="New York City" DOCNO="10001">
		<variant base = "New York City"/>
		<variant base = "NYC"/>
		<variant base = "Big Apple"/>
	</token>
</synonym>

Given the input string "The Big Apple is a nickname for New York City", ConceptMapper produces the following XMI (excerpted):

<?xml version="1.0" encoding="UTF-8" ?>
<xmi:XMI …>
	<cas:Sofa xmi:id="1" sofaNum="1" sofaID="_InitialView" mimeType="text" sofaString="The Big Apple is a nickname for New York City" />
	<tcas:DocumentAnnotation xmi:id="8" sofa="1" begin="0" end="45" language="x-unspecified" />
	<tokenizer:TokenAnnotation xmi:id="13" sofa="1" begin="0" end="3" text="the" tokenType="0" />
	<tokenizer:TokenAnnotation xmi:id="23" sofa="1" begin="4" end="7" text="big" tokenType="0" />
	…
	<tokenizer:TokenAnnotation xmi:id="93" sofa="1" begin="36" end="40" text="york" tokenType="0" />
	<tokenizer:TokenAnnotation xmi:id="103" sofa="1" begin="41" end="45" text="city" tokenType="0" />
	<conceptMapper:DictTerm xmi:id="113" sofa="1" begin="4" end="13" DictCanon="New York City" enclosingSpan="8" matchedText="Big Apple" matchedTokens="23 33" />
	<conceptMapper:DictTerm xmi:id="125" sofa="1" begin="32" end="45" DictCanon="New York City" enclosingSpan="8" matchedText="New York City" matchedTokens="83 93 103" />
	…
</xmi:XMI>
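
The mapping from dictionary variants to canonical forms in this example can be approximated with a short plain-Java sketch. It is not ConceptMapper's matching algorithm and does not use the UIMA API. It assumes a simple case-insensitive substring search with word-boundary checks, and the names DictionaryLookupSketch, Match, and lookup are hypothetical. The sketch reads the same dictionary format and reports the same offsets as the two DictTerm entries above.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Illustrative sketch only; not the ConceptMapper matching algorithm. */
final class DictionaryLookupSketch {

    /** The sample dictionary from this article, without the XML declaration. */
    static final String SAMPLE_DICTIONARY =
        "<synonym>"
        + "<token canonical=\"United States\" DOCNO=\"10000\">"
        + "<variant base=\"United States\"/>"
        + "<variant base=\"United States of America\"/>"
        + "<variant base=\"USA\"/>"
        + "</token>"
        + "<token canonical=\"New York City\" DOCNO=\"10001\">"
        + "<variant base=\"New York City\"/>"
        + "<variant base=\"NYC\"/>"
        + "<variant base=\"Big Apple\"/>"
        + "</token>"
        + "</synonym>";

    /** One dictionary hit, loosely modeled on the DictTerm annotation. */
    static final class Match {
        final int begin;
        final int end;
        final String canonical;
        final String matchedText;

        Match(int begin, int end, String canonical, String matchedText) {
            this.begin = begin;
            this.end = end;
            this.canonical = canonical;
            this.matchedText = matchedText;
        }
    }

    /** Finds every variant in the input, ignoring case, at word boundaries. */
    static List<Match> lookup(String dictionaryXml, String input) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(dictionaryXml)));
            String haystack = input.toLowerCase(Locale.ROOT);
            List<Match> matches = new ArrayList<>();
            NodeList tokens = doc.getElementsByTagName("token");
            for (int t = 0; t < tokens.getLength(); t++) {
                Element token = (Element) tokens.item(t);
                String canonical = token.getAttribute("canonical");
                NodeList variants = token.getElementsByTagName("variant");
                for (int v = 0; v < variants.getLength(); v++) {
                    String needle = ((Element) variants.item(v))
                        .getAttribute("base").toLowerCase(Locale.ROOT);
                    if (needle.isEmpty()) {
                        continue;
                    }
                    int at = haystack.indexOf(needle);
                    while (at >= 0) {
                        int end = at + needle.length();
                        if (isBoundary(haystack, at - 1) && isBoundary(haystack, end)) {
                            matches.add(new Match(at, end, canonical, input.substring(at, end)));
                        }
                        at = haystack.indexOf(needle, at + 1);
                    }
                }
            }
            // Report matches in document order.
            matches.sort(Comparator.comparingInt(m -> m.begin));
            return matches;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse dictionary", e);
        }
    }

    /** True when the index is outside the text or not on a letter or digit. */
    private static boolean isBoundary(String s, int index) {
        return index < 0 || index >= s.length() || !Character.isLetterOrDigit(s.charAt(index));
    }

    public static void main(String[] args) {
        String input = "The Big Apple is a nickname for New York City";
        for (Match m : lookup(SAMPLE_DICTIONARY, input)) {
            System.out.println(m.begin + "-" + m.end + " " + m.canonical + " <- " + m.matchedText);
        }
    }
}
```

For the sample input, the sketch reports "Big Apple" at offsets 4 to 13 and "New York City" at offsets 32 to 45, both with the canonical form "New York City".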

Text file rules

Five sample descriptor files are saved as text file rules. Your application can override these as needed.

Analysis Engine Descriptors
- TextAnalysis.pyOffsetTokenizerMatcher.xml
- TextAnalysis.pyOffsetTokenizer.xml
- TextAnalysis.pyConceptMapperOffsetTokenizer.xml
Type System Descriptors
- TextAnalysis.pyTokenAnnotation.xml
- TextAnalysis.pyDictTerm.xml

The dictionary resource is also a text file rule, TextAnalysis.pyDictionary.xml.

Standard activity

The standard activity @baseclass.pxUIMAConceptMapper implements the component. This activity performs the following actions:

- Initializes path variables
- Reads the descriptor and resource files (as text file rules) and creates corresponding XML files in the directory indicated by ${uima.datapath} on the file system
- Reads the aggregate analysis engine description
- Performs the analysis on the input text document
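
The file-handling part of these actions can be pictured with a small plain-Java sketch. This is not the activity's actual code. The class DescriptorWriterSketch, its method names, and the on-disk file name used in main are hypothetical, and the sketch assumes that each text file rule's content is already available as a string.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Illustrative sketch only; not the pxUIMAConceptMapper activity code. */
final class DescriptorWriterSketch {

    /** Writes one descriptor's XML text under the data path and returns the file path. */
    static Path writeDescriptor(Path dataPath, String fileName, String xml) {
        try {
            Files.createDirectories(dataPath);
            Path target = resolveInside(dataPath, fileName);
            return Files.write(target, xml.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads a previously written descriptor back as text. */
    static String readDescriptor(Path dataPath, String fileName) {
        try {
            byte[] bytes = Files.readAllBytes(resolveInside(dataPath, fileName));
            return new String(bytes, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Resolves the file name and rejects names that escape the data path. */
    private static Path resolveInside(Path dataPath, String fileName) {
        Path base = dataPath.normalize();
        Path target = base.resolve(fileName).normalize();
        if (!target.startsWith(base)) {
            throw new IllegalArgumentException("File name escapes data path: " + fileName);
        }
        return target;
    }

    public static void main(String[] args) {
        // Falls back to the JVM temp directory when uima.datapath is not set.
        String configured = System.getProperty("uima.datapath", System.getProperty("java.io.tmpdir"));
        Path dataPath = Paths.get(configured);
        // Hypothetical file name and placeholder content.
        Path written = writeDescriptor(dataPath, "OffsetTokenizerMatcher.xml", "<placeholder/>");
        System.out.println("Wrote " + written);
    }
}
```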

JAR Files

The following libraries are included in prprivate/libredist:

- apache-uima/lib/uima-core.jar
- apache-uima/lib/uima-cpe.jar
- apache-uima/lib/uima-document-annotation.jar
- apache-uima/lib/uimaj-bootstrap.jar
- apache-uima/lib/uima-tools.jar
- apache-uima/addons/annotator/ConceptMapper/lib/uima-an-conceptMapper.jar

Supply values for these parameters as input to the standard activity:

- content – The unstructured text to be annotated
- resourceNames – Optional. Comma-separated list of the XML resource and configuration files
- analysisDescriptor – Optional. Name of the analysis engine descriptor
- startIndex – Optional. Index at which to start the analysis (integer)
- endIndex – Optional. Index at which to end the analysis (integer)

The activity returns its result on the parameter page as the value of a parameter named xmiOutput, an XML document in XML Metadata Interchange (XMI) format. The output contains two annotation types: TokenAnnotation, which is produced by the OffsetTokenizer annotator, and DictTerm, which is produced by the ConceptMapperOffsetTokenizer annotator.
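
One way to consume the xmiOutput value is to parse it with the JDK's DOM parser and read the DictTerm elements. The following is a sketch under stated assumptions, not a documented API. The class XmiDictTermSketch is hypothetical, the namespace URIs in the sample string are placeholders (real output declares its own), and the element and attribute names are taken from the excerpt in the Example section.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Illustrative sketch only; reads DictTerm elements from an XMI string. */
final class XmiDictTermSketch {

    /** Shortened sample; the namespace URIs are placeholders, not the real ones. */
    static final String SAMPLE =
        "<xmi:XMI xmlns:xmi=\"urn:placeholder:xmi\""
        + " xmlns:conceptMapper=\"urn:placeholder:conceptMapper\">"
        + "<conceptMapper:DictTerm xmi:id=\"113\" sofa=\"1\" begin=\"4\" end=\"13\""
        + " DictCanon=\"New York City\" matchedText=\"Big Apple\"/>"
        + "<conceptMapper:DictTerm xmi:id=\"125\" sofa=\"1\" begin=\"32\" end=\"45\""
        + " DictCanon=\"New York City\" matchedText=\"New York City\"/>"
        + "</xmi:XMI>";

    /** Returns one line per DictTerm: canonical form, matched text, offsets. */
    static List<String> dictTerms(String xmiOutput) {
        try {
            // The default factory is not namespace-aware, so elements are
            // matched by their qualified name exactly as written in the XMI.
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xmiOutput)));
            NodeList nodes = doc.getElementsByTagName("conceptMapper:DictTerm");
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Element term = (Element) nodes.item(i);
                result.add(term.getAttribute("DictCanon") + ": "
                    + term.getAttribute("matchedText")
                    + " (" + term.getAttribute("begin") + "-" + term.getAttribute("end") + ")");
            }
            return result;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XMI", e);
        }
    }

    public static void main(String[] args) {
        for (String line : dictTerms(SAMPLE)) {
            System.out.println(line);
        }
    }
}
```

The same approach applies to TokenAnnotation elements by changing the element name passed to getElementsByTagName.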

Initial setup

Create a Java system property named uima.datapath and set it to a directory on the current server node's file system to which the server has write access.
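
Java system properties are commonly set with a JVM option of the form -Duima.datapath=<directory>; where that option is configured depends on your application server. As a quick check that the property is visible to the JVM and points to a writable directory, a sketch such as the following can help. The class DataPathCheckSketch and its method names are hypothetical.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Illustrative sketch only; verifies the uima.datapath system property. */
final class DataPathCheckSketch {

    static final String PROPERTY = "uima.datapath";

    /** True when the value names an existing, writable directory. */
    static boolean isUsable(String value) {
        if (value == null || value.trim().isEmpty()) {
            return false;
        }
        Path dir = Paths.get(value);
        return Files.isDirectory(dir) && Files.isWritable(dir);
    }

    /** Returns the configured directory, or fails with a descriptive message. */
    static Path requireDataPath() {
        String value = System.getProperty(PROPERTY);
        if (!isUsable(value)) {
            throw new IllegalStateException(
                PROPERTY + " must point to an existing, writable directory; found: " + value);
        }
        return Paths.get(value);
    }

    public static void main(String[] args) {
        System.out.println("Using data path " + requireDataPath());
    }
}
```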
