uk.ac.shef.dcs.oak.jate.core.npextractor
Class CandidateTermExtractor

java.lang.Object
  extended by uk.ac.shef.dcs.oak.jate.core.npextractor.CandidateTermExtractor
Direct Known Subclasses:
NGramExtractor, NounPhraseExtractorOpenNLP, WordExtractor

public abstract class CandidateTermExtractor
extends java.lang.Object

Extracts lexical units from texts and defines a number of utility methods for normalizing extracted candidate terms. Each candidate lexical unit is stored in its canonical, lowercase form, as determined by the Normalizer provided. Each canonical form maps to the variants found in the document/corpus. When term frequencies are counted, each variant is searched for in the document/corpus, and the variant frequencies are summed to give the total frequency of the canonical term form.
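
The relationship between a canonical form, its variants, and the summed frequency can be illustrated with a minimal, self-contained sketch. It does not use JATE classes: VariantFrequencySketch, countOccurrences and totalFrequency are hypothetical names, and the naive case-sensitive substring count is an assumption made for illustration, not necessarily how JATE counts occurrences.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

class VariantFrequencySketch {

    // Counts non-overlapping, case-sensitive occurrences of variant in text.
    // Naive substring search: a variant that is a prefix of another variant
    // would be counted twice, so real code needs token-aware matching.
    static int countOccurrences(String text, String variant) {
        if (variant.isEmpty()) {
            return 0;
        }
        int count = 0;
        int from = 0;
        while ((from = text.indexOf(variant, from)) >= 0) {
            count++;
            from += variant.length();
        }
        return count;
    }

    // Sums the occurrences of every variant to obtain the total frequency
    // of the canonical term form.
    static int totalFrequency(String text, Set<String> variants) {
        int total = 0;
        for (String variant : variants) {
            total += countOccurrences(text, variant);
        }
        return total;
    }

    public static void main(String[] args) {
        // Same shape as the map returned by extract(...): canonical form -> variants.
        Map<String, Set<String>> termToVariants = new LinkedHashMap<>();
        Set<String> variants = new LinkedHashSet<>();
        variants.add("Neural Network");
        variants.add("neural networks");
        termToVariants.put("neural network", variants);

        String text = "A Neural Network is trained. Many neural networks overfit.";
        System.out.println(totalFrequency(text, termToVariants.get("neural network"))); // prints 2
    }
}
```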


Field Summary
protected  Normalizer _normaliser
           
protected  StopList _stoplist
           
 
Constructor Summary
CandidateTermExtractor()
           
 
Method Summary
static java.lang.String applyCharacterReplacement(java.lang.String string, java.lang.String pattern)
          Replaces characters matching the given pattern (e.g., [^a-zA-Z\-]) with " " (white space).
static java.lang.String[] applySplitList(java.lang.String string)
          If the input string contains "and", "or" or ",", it is assumed to contain multiple candidate terms, and is split.
static java.lang.String applyTrimStopwords(java.lang.String string, StopList stop, Normalizer normalizer)
          If a string begins or ends with a stop word (e.g., "the"), the stop word is removed.
static boolean containsDigit(java.lang.String string)
           
static boolean containsLetter(java.lang.String string)
           
abstract  java.util.Map<java.lang.String,java.util.Set<java.lang.String>> extract(Corpus c)
           
abstract  java.util.Map<java.lang.String,java.util.Set<java.lang.String>> extract(Document d)
           
abstract  java.util.Map<java.lang.String,java.util.Set<java.lang.String>> extract(java.lang.String content)
           
static boolean hasReasonableNumChars(java.lang.String string)
          Checks whether a candidate term is likely to be a noisy symbol or a non-term.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

_stoplist

protected StopList _stoplist

_normaliser

protected Normalizer _normaliser
Constructor Detail

CandidateTermExtractor

public CandidateTermExtractor()
Method Detail

extract

public abstract java.util.Map<java.lang.String,java.util.Set<java.lang.String>> extract(Corpus c)
                                                                                 throws JATEException
Parameters:
c - corpus
Returns:
a map containing mappings from term canonical form to its variants found in the corpus
Throws:
JATEException

extract

public abstract java.util.Map<java.lang.String,java.util.Set<java.lang.String>> extract(Document d)
                                                                                 throws JATEException
Parameters:
d - document
Returns:
a map containing mappings from term canonical form to its variants found in the document
Throws:
JATEException

extract

public abstract java.util.Map<java.lang.String,java.util.Set<java.lang.String>> extract(java.lang.String content)
                                                                                 throws JATEException
Parameters:
content - a string
Returns:
a map containing mappings from term canonical form to its variants found in the string
Throws:
JATEException

containsLetter

public static boolean containsLetter(java.lang.String string)
Parameters:
string - a string
Returns:
true if the string contains at least one letter; false otherwise

containsDigit

public static boolean containsDigit(java.lang.String string)
Parameters:
string - a string
Returns:
true if the string contains at least one digit; false otherwise
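
As an illustration of the documented contract of containsLetter and containsDigit, the checks can be sketched without JATE as below. CharacterCheckSketch is a hypothetical class, and the use of Character.isLetter and Character.isDigit (which accept Unicode letters and digits) is an assumption; the actual implementation may restrict itself to ASCII.

```java
class CharacterCheckSketch {

    // Returns true as soon as one letter is found.
    static boolean containsLetter(String string) {
        for (int i = 0; i < string.length(); i++) {
            if (Character.isLetter(string.charAt(i))) {
                return true;
            }
        }
        return false;
    }

    // Returns true as soon as one digit is found.
    static boolean containsDigit(String string) {
        for (int i = 0; i < string.length(); i++) {
            if (Character.isDigit(string.charAt(i))) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(containsLetter("42"));  // prints false
        System.out.println(containsDigit("p53"));  // prints true
    }
}
```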

applyCharacterReplacement

public static java.lang.String applyCharacterReplacement(java.lang.String string,
                                                         java.lang.String pattern)
Replaces characters matching the given pattern (e.g., [^a-zA-Z\-]) with " " (white space). Used to normalize texts and candidate terms before counting frequencies.

Parameters:
string - input string to be processed
pattern - regular expression pattern defining characters to be replaced by " " (white space)
Returns:

applySplitList

public static java.lang.String[] applySplitList(java.lang.String string)
If the input string contains "and", "or" or ",", it is assumed to contain multiple candidate terms, and is split. This method is used to process extracted candidate terms. Due to imperfections of NLP tools, candidate terms can sometimes be noisy and need to be further processed/normalized.

Parameters:
string - input string
Returns:
individual strings separated by "and", "or" or ","
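
One possible way to perform such a split is shown below as a self-contained sketch. SplitListSketch is a hypothetical class; matching "and" and "or" only as whole, lowercase words and dropping empty fragments are assumptions made for this example, and the actual method may treat these cases differently.

```java
import java.util.ArrayList;
import java.util.List;

class SplitListSketch {

    // Splits on commas and on the whole words "and" / "or" (case-sensitive),
    // then drops empty fragments such as the one produced by ", and".
    static String[] applySplitList(String string) {
        String[] parts = string.split("\\s*(?:,|\\band\\b|\\bor\\b)\\s*");
        List<String> result = new ArrayList<>();
        for (String part : parts) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                result.add(trimmed);
            }
        }
        return result.toArray(new String[0]);
    }

    public static void main(String[] args) {
        for (String term : applySplitList("precision, recall and f-measure")) {
            System.out.println(term);
        }
    }
}
```

The word boundaries keep strings such as "android" intact, because the "and" inside them is not a separate word.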

applyTrimStopwords

public static java.lang.String applyTrimStopwords(java.lang.String string,
                                                  StopList stop,
                                                  Normalizer normalizer)
If a string begins or ends with a stop word (e.g., "the"), the stop word is removed. This method is used to further process/normalize extracted candidate terms. Due to imperfections of NLP tools, candidate terms can sometimes be noisy and need further normalization.

Parameters:
string - input string
stop - the stop word list
Returns:
null if the string is a stopword; otherwise the normalized string
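
The trimming behaviour can be sketched without JATE as follows. This is only an approximation under stated assumptions: a plain Set<String> stands in for StopList, lowercasing stands in for Normalizer, leading and trailing stop words are stripped repeatedly, and TrimStopwordsSketch is a hypothetical class.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

class TrimStopwordsSketch {

    // Strips stop words from both ends of the string; returns null when
    // nothing but stop words remains.
    static String applyTrimStopwords(String string, Set<String> stop) {
        String[] tokens = string.trim().split("\\s+");
        int start = 0;
        int end = tokens.length - 1;
        while (start <= end && stop.contains(tokens[start].toLowerCase(Locale.ROOT))) {
            start++;
        }
        while (end >= start && stop.contains(tokens[end].toLowerCase(Locale.ROOT))) {
            end--;
        }
        if (start > end) {
            return null; // the whole string consisted of stop words
        }
        return String.join(" ", Arrays.copyOfRange(tokens, start, end + 1));
    }

    public static void main(String[] args) {
        Set<String> stop = new HashSet<>(Arrays.asList("the", "of", "a"));
        System.out.println(applyTrimStopwords("the analysis of", stop)); // prints analysis
        System.out.println(applyTrimStopwords("rate of change", stop));  // prints rate of change
        System.out.println(applyTrimStopwords("The", stop));             // prints null
    }
}
```

Stop words in the middle of a candidate term ("rate of change") are kept, since only the ends are trimmed.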

hasReasonableNumChars

public static boolean hasReasonableNumChars(java.lang.String string)
Checks whether a candidate term is likely to be a noisy symbol or a non-term.

Parameters:
string - the candidate term to check
Returns:
true if the string contains at least 2 letters or digits; false otherwise
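
A minimal sketch of this check is given below. It interprets "at least 2 letters or digits" as a combined count of letters and digits, which is an assumption, and ReasonableCharsSketch is a hypothetical class rather than part of JATE.

```java
class ReasonableCharsSketch {

    // Returns true once two characters that are letters or digits have been seen.
    static boolean hasReasonableNumChars(String string) {
        int count = 0;
        for (int i = 0; i < string.length(); i++) {
            if (Character.isLetterOrDigit(string.charAt(i))) {
                count++;
                if (count >= 2) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(hasReasonableNumChars("--")); // prints false
        System.out.println(hasReasonableNumChars("x-")); // prints false
        System.out.println(hasReasonableNumChars("a1")); // prints true
    }
}
```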