Integrating with Apache UIMA

This feature requires advanced integration and Java skills. Visit the Apache UIMA site and contact Pegasystems for additional documentation.

ConceptMapper component

ConceptMapper is a highly configurable, high-performance dictionary lookup tool, implemented as a UIMA component. Using one of several text matching algorithms, it maps entries in a dictionary onto input documents, producing UIMA annotations.

A sample that demonstrates ConceptMapper is included; you can copy and extend it. Pegasystems has modified the Java method setTokenizerDescriptor in class DictionaryResource to look up resources relative to the Java system property uima.datapath.

ConceptMapper uses these UIMA analysis engines:

  • Aggregate
    • OffsetTokenizerMatcher – Composed of primitive analysis engines OffsetTokenizer and ConceptMapperOffsetTokenizer
  • Primitive(s)
    • OffsetTokenizer – Tokenizes the unstructured input text document
    • ConceptMapperOffsetTokenizer – Performs dictionary lookup based on the text associated with each token
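
To make the tokenizing stage concrete, here is a minimal plain-Java sketch of offset-based tokenization. It is not the OffsetTokenizer implementation: the class name OffsetTokenizerSketch, the whitespace-only splitting rule, and the lower-casing of token text (chosen to mirror the XMI excerpt in the Example section) are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration only; the real OffsetTokenizer is a UIMA analysis engine.
class OffsetTokenizerSketch {

    /** A token with character offsets into the source text (end is exclusive). */
    static final class Token {
        final int begin;
        final int end;
        final String text;

        Token(int begin, int end, String text) {
            this.begin = begin;
            this.end = end;
            this.text = text;
        }
    }

    /** Splits on whitespace only (an assumption) and records the offsets of each token. */
    static List<Token> tokenize(String input) {
        List<Token> tokens = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            // Skip whitespace between tokens.
            while (i < input.length() && Character.isWhitespace(input.charAt(i))) {
                i++;
            }
            int begin = i;
            // Consume the token itself.
            while (i < input.length() && !Character.isWhitespace(input.charAt(i))) {
                i++;
            }
            if (i > begin) {
                String text = input.substring(begin, i).toLowerCase(Locale.ROOT);
                tokens.add(new Token(begin, i, text));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (Token t : tokenize("The Big Apple is a nickname for New York City")) {
            System.out.println(t.begin + "-" + t.end + " " + t.text);
        }
    }
}
```

Keeping begin and end offsets on every token is what lets a later lookup stage report where in the original text a dictionary entry was found.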

Example

Using the initially installed settings and this dictionary:

<?xml version="1.0" encoding="UTF-8" ?>
<synonym>
  <token canonical="United States" DOCNO="10000">
    <variant base="United States"/>
    <variant base="United States of America"/>
    <variant base="USA"/>
  </token>
  <token canonical="New York City" DOCNO="10001">
    <variant base="New York City"/>
    <variant base="NYC"/>
    <variant base="Big Apple"/>
  </token>
</synonym>
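
The structure of this dictionary (each token carries a canonical form and lists its variants) can be explored with a short plain-Java sketch that reads it into a variant-to-canonical map using the JDK's DOM parser. This is only an illustration of the file layout, not how ConceptMapper loads its dictionary; the class name DictionaryReaderSketch and the SAMPLE constant are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical illustration of the dictionary layout; not ConceptMapper's own loader.
class DictionaryReaderSketch {

    // The sample dictionary, written with single-quoted attributes to avoid Java escaping.
    static final String SAMPLE =
          "<?xml version='1.0' encoding='UTF-8' ?>"
        + "<synonym>"
        + "<token canonical='United States' DOCNO='10000'>"
        + "<variant base='United States'/>"
        + "<variant base='United States of America'/>"
        + "<variant base='USA'/>"
        + "</token>"
        + "<token canonical='New York City' DOCNO='10001'>"
        + "<variant base='New York City'/>"
        + "<variant base='NYC'/>"
        + "<variant base='Big Apple'/>"
        + "</token>"
        + "</synonym>";

    /** Maps each variant's base form to the canonical form of its enclosing token. */
    static Map<String, String> readVariants(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            Map<String, String> variantToCanonical = new LinkedHashMap<>();
            NodeList tokens = doc.getElementsByTagName("token");
            for (int i = 0; i < tokens.getLength(); i++) {
                Element token = (Element) tokens.item(i);
                String canonical = token.getAttribute("canonical");
                NodeList variants = token.getElementsByTagName("variant");
                for (int j = 0; j < variants.getLength(); j++) {
                    Element variant = (Element) variants.item(j);
                    variantToCanonical.put(variant.getAttribute("base"), canonical);
                }
            }
            return variantToCanonical;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse dictionary XML", e);
        }
    }

    public static void main(String[] args) {
        readVariants(SAMPLE).forEach((variant, canonical) ->
            System.out.println(variant + " => " + canonical));
    }
}
```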

With the input string "The Big Apple is a nickname for New York City", the analysis produces the following XMI (excerpted):

<?xml version="1.0" encoding="UTF-8" ?>
…
<cas:Sofa xmi:id="1" sofaNum="1" sofaID="_InitialView" mimeType="text" sofaString="The Big Apple is a nickname for New York City" />
<tcas:DocumentAnnotation xmi:id="8" sofa="1" begin="0" end="45" language="x-unspecified" />
<tokenizer:TokenAnnotation xmi:id="13" sofa="1" begin="0" end="3" text="the" tokenType="0" />
<tokenizer:TokenAnnotation xmi:id="23" sofa="1" begin="4" end="7" text="big" tokenType="0" />
…
<tokenizer:TokenAnnotation xmi:id="93" sofa="1" begin="36" end="40" text="york" tokenType="0" />
<tokenizer:TokenAnnotation xmi:id="103" sofa="1" begin="41" end="45" text="city" tokenType="0" />
<conceptMapper:DictTerm xmi:id="113" sofa="1" begin="4" end="13" DictCanon="New York City" enclosingSpan="8" matchedText="Big Apple" matchedTokens="23 33" />
<conceptMapper:DictTerm xmi:id="125" sofa="1" begin="32" end="45" DictCanon="New York City" enclosingSpan="8" matchedText="New York City" matchedTokens="83 93 103" />
…
</xmi:XMI>
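
The begin and end values on the two DictTerm annotations can be reproduced with a small plain-Java sketch of a dictionary lookup over tokens. It assumes a deliberately simple strategy (case-insensitive, contiguous, longest match first) and is not one of ConceptMapper's actual matching algorithms; the names DictLookupSketch and Match are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration: contiguous, case-insensitive, longest-match-first lookup.
class DictLookupSketch {

    /** One dictionary hit, with character offsets into the input (end is exclusive). */
    static final class Match {
        final int begin;
        final int end;
        final String canonical;
        final String matchedText;

        Match(int begin, int end, String canonical, String matchedText) {
            this.begin = begin;
            this.end = end;
            this.canonical = canonical;
            this.matchedText = matchedText;
        }
    }

    static List<Match> lookup(String text, Map<String, String> variantToCanonical) {
        // Record [begin, end) offsets of whitespace-delimited tokens.
        List<int[]> spans = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            while (i < text.length() && Character.isWhitespace(text.charAt(i))) i++;
            int begin = i;
            while (i < text.length() && !Character.isWhitespace(text.charAt(i))) i++;
            if (i > begin) spans.add(new int[] {begin, i});
        }

        // Normalize variants to lower-case, single-space-separated keys.
        Map<String, String> normalized = new LinkedHashMap<>();
        int maxTokens = 0;
        for (Map.Entry<String, String> e : variantToCanonical.entrySet()) {
            String[] parts = e.getKey().trim().toLowerCase(Locale.ROOT).split("\\s+");
            normalized.put(String.join(" ", parts), e.getValue());
            maxTokens = Math.max(maxTokens, parts.length);
        }

        List<Match> matches = new ArrayList<>();
        int t = 0;
        while (t < spans.size()) {
            boolean found = false;
            // Try the longest candidate phrase first, then shorter ones.
            for (int len = Math.min(maxTokens, spans.size() - t); len >= 1 && !found; len--) {
                StringBuilder key = new StringBuilder();
                for (int k = 0; k < len; k++) {
                    if (k > 0) key.append(' ');
                    int[] s = spans.get(t + k);
                    key.append(text.substring(s[0], s[1]).toLowerCase(Locale.ROOT));
                }
                String canonical = normalized.get(key.toString());
                if (canonical != null) {
                    int begin = spans.get(t)[0];
                    int end = spans.get(t + len - 1)[1];
                    matches.add(new Match(begin, end, canonical, text.substring(begin, end)));
                    t += len; // continue after the matched phrase
                    found = true;
                }
            }
            if (!found) t++;
        }
        return matches;
    }

    public static void main(String[] args) {
        Map<String, String> dict = new LinkedHashMap<>();
        dict.put("New York City", "New York City");
        dict.put("NYC", "New York City");
        dict.put("Big Apple", "New York City");
        for (Match m : lookup("The Big Apple is a nickname for New York City", dict)) {
            System.out.println(m.begin + "-" + m.end + " " + m.matchedText + " => " + m.canonical);
        }
    }
}
```

Under these assumptions the sketch finds "Big Apple" at offsets 4 to 13 and "New York City" at offsets 32 to 45, the same spans as in the excerpt.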

Text file rules

Five sample descriptor files are saved as text file rules. Your application can override these as needed.

  • Analysis Engine Descriptors
    • TextAnalysis.pyOffsetTokenizerMatcher.xml
    • TextAnalysis.pyOffsetTokenizer.xml
    • TextAnalysis.pyConceptMapperOffsetTokenizer.xml
  • Type System Descriptors
    • TextAnalysis.pyTokenAnnotation.xml
    • TextAnalysis.pyDictTerm.xml

The dictionary resource is also a text file rule, TextAnalysis.pyDictionary.xml.

Standard activity

The standard activity @baseclass.pxUIMAConceptMapper implements the component. The activity performs the following actions:

  1. Initializes path variables
  2. Reads the descriptor and resource files (as text file rules) and creates corresponding XML files in the directory indicated by ${uima.datapath} on the file system
  3. Reads the aggregate analysis engine description
  4. Performs the analysis on the input text document
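
Steps 1 and 2 above can be pictured with the following plain-Java sketch, which resolves the directory named by the uima.datapath system property and writes XML content into it. It is an assumption-laden illustration, not the activity's actual code: the class DatapathWriterSketch, its method names, and the file name used in main are hypothetical, and the sketch treats uima.datapath as a single directory.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of materializing descriptor files under uima.datapath.
class DatapathWriterSketch {

    /** Resolves uima.datapath and checks that it names a writable directory. */
    static Path resolveDatapath() {
        String value = System.getProperty("uima.datapath");
        if (value == null || value.trim().isEmpty()) {
            throw new IllegalStateException("System property uima.datapath is not set");
        }
        Path dir = Paths.get(value);
        if (!Files.isDirectory(dir) || !Files.isWritable(dir)) {
            throw new IllegalStateException("uima.datapath is not a writable directory: " + dir);
        }
        return dir;
    }

    /** Writes each entry (file name mapped to XML content) into the given directory. */
    static List<Path> writeResources(Path dir, Map<String, String> xmlByFileName) {
        List<Path> written = new ArrayList<>();
        try {
            for (Map.Entry<String, String> e : xmlByFileName.entrySet()) {
                Path target = dir.resolve(e.getKey());
                Files.write(target, e.getValue().getBytes(StandardCharsets.UTF_8));
                written.add(target);
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return written;
    }

    public static void main(String[] args) {
        // "example-dictionary.xml" is a placeholder file name chosen for this sketch.
        Path dir = resolveDatapath();
        List<Path> files = writeResources(dir,
            Collections.singletonMap("example-dictionary.xml", "<synonym/>"));
        System.out.println("Wrote " + files);
    }
}
```

Run the sketch with the property supplied on the command line, for example java -Duima.datapath=/some/writable/dir DatapathWriterSketch, where the directory path is a placeholder.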

JAR Files

The following libraries are included in prprivate/libredist:

  • apache-uima/lib/uima-core.jar
  • apache-uima/lib/uima-cpe.jar
  • apache-uima/lib/uima-document-annotation.jar
  • apache-uima/lib/uimaj-bootstrap.jar
  • apache-uima/lib/uima-tools.jar
  • apache-uima/addons/annotator/ConceptMapper/lib/uima-an-conceptMapper.jar

Supply values for these parameters as input to the pxUIMAConceptMapper activity:

  • content – The unstructured text to be annotated
  • resourceNames – Optional. Comma separated list of the XML resource/configuration files
  • analysisDescriptor – Optional. Name of the analysis engine descriptor
  • startIndex – Optional. Index at which to start the analysis (integer)
  • endIndex – Optional. Index at which to end the analysis (integer)

The activity returns its result on the parameter page as the value of a parameter named xmiOutput, an XML document in XML Metadata Interchange (XMI) format. The output contains two annotation types: TokenAnnotation, which is produced by the OffsetTokenizer annotator, and DictTerm, which is produced by the ConceptMapperOffsetTokenizer annotator.
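
As a rough idea of how a caller might consume xmiOutput, the following plain-Java sketch extracts the DictTerm annotations from an XMI string with the JDK's DOM parser. The class name XmiOutputReaderSketch and the SAMPLE document are hypothetical; in particular, the conceptMapper namespace URI in SAMPLE is a placeholder, and the attribute names are taken from the excerpt in the Example section.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical illustration of reading DictTerm annotations out of the xmiOutput value.
class XmiOutputReaderSketch {

    // Minimal stand-in for xmiOutput; the conceptMapper namespace URI is a placeholder.
    static final String SAMPLE =
          "<xmi:XMI xmlns:xmi='http://www.omg.org/XMI' xmlns:conceptMapper='urn:example:conceptMapper'>"
        + "<conceptMapper:DictTerm xmi:id='113' sofa='1' begin='4' end='13' DictCanon='New York City'"
        + " enclosingSpan='8' matchedText='Big Apple' matchedTokens='23 33'/>"
        + "<conceptMapper:DictTerm xmi:id='125' sofa='1' begin='32' end='45' DictCanon='New York City'"
        + " enclosingSpan='8' matchedText='New York City' matchedTokens='83 93 103'/>"
        + "</xmi:XMI>";

    /** Returns one line per DictTerm: matched text, canonical form, and character span. */
    static List<String> dictTerms(String xmi) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Without namespace awareness, prefixed names are matched as plain strings.
            factory.setNamespaceAware(false);
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xmi)));
            NodeList terms = doc.getElementsByTagName("conceptMapper:DictTerm");
            List<String> result = new ArrayList<>();
            for (int i = 0; i < terms.getLength(); i++) {
                Element term = (Element) terms.item(i);
                int begin = Integer.parseInt(term.getAttribute("begin"));
                int end = Integer.parseInt(term.getAttribute("end"));
                result.add(term.getAttribute("matchedText") + " -> "
                    + term.getAttribute("DictCanon") + " [" + begin + "," + end + ")");
            }
            return result;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XMI", e);
        }
    }

    public static void main(String[] args) {
        dictTerms(SAMPLE).forEach(System.out::println);
    }
}
```

Matching on the prefixed element name keeps the sketch short but ties it to the prefix used in the output; a namespace-aware lookup would be more robust if the prefix can vary.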

Initial setup

Create a Java system property named uima.datapath and set it to a directory on the current server node's file system to which the server has write access.
