uk.ac.shef.dcs.oak.jate.core.npextractor
Class NounPhraseExtractorOpenNLP

java.lang.Object
  extended by uk.ac.shef.dcs.oak.jate.core.npextractor.CandidateTermExtractor
      extended by uk.ac.shef.dcs.oak.jate.core.npextractor.NounPhraseExtractorOpenNLP

public class NounPhraseExtractorOpenNLP
extends CandidateTermExtractor

Noun phrase extractor implemented with OpenNLP tools. It applies a set of heuristics to clean each candidate noun phrase and reduce it to its normalised root form. These heuristics are:
-Stopwords are trimmed from the head and tail of a phrase, e.g., "the cat on the mat" becomes "cat on the mat".
-Phrases containing "or" or "and" are split, e.g., "Tom and Jerry" becomes "tom" and "jerry".
-A candidate must contain letters.
-A candidate must have at least two characters.
-Characters that do not match the pattern [a-zA-Z\-] are replaced with whitespace.
-A candidate may or may not contain digits; this is set in the property file.
-A candidate must contain no more than N words; N is set in the property file.
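
The heuristics listed above can be sketched in plain Java as follows. This is a minimal illustration, not the JATE implementation: the class name HeuristicsSketch, the stopword set, and the MAX_WORDS value are hypothetical stand-ins for what the library reads from its stoplist and property file, lowercasing stands in for the real normaliser, the order of the steps is an assumption, and the configurable digit check is omitted.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Illustrative sketch only; names and values here are assumptions, not JATE's. */
final class HeuristicsSketch {
    // Hypothetical stopword list; JATE takes its own from a StopList instance.
    static final Set<String> STOPWORDS =
            new HashSet<>(Arrays.asList("the", "a", "an", "of", "on"));
    // Hypothetical limit; in JATE, N comes from the property file.
    static final int MAX_WORDS = 5;

    static List<String> clean(String phrase) {
        List<String> result = new ArrayList<>();
        // Replace characters outside [a-zA-Z\-] with whitespace, then lowercase
        // (lowercasing is a stand-in for the real normaliser).
        String cleaned = phrase.replaceAll("[^a-zA-Z\\-]", " ").toLowerCase(Locale.ROOT);
        // Split on the conjunctions "and" / "or" when they appear as whole words.
        for (String part : cleaned.split("\\s+(?:and|or)\\s+")) {
            List<String> words = new ArrayList<>(Arrays.asList(part.trim().split("\\s+")));
            // Trim stopwords from the head and tail only; inner stopwords stay.
            while (!words.isEmpty() && STOPWORDS.contains(words.get(0))) {
                words.remove(0);
            }
            while (!words.isEmpty() && STOPWORDS.contains(words.get(words.size() - 1))) {
                words.remove(words.size() - 1);
            }
            String candidate = String.join(" ", words);
            boolean hasLetter = candidate.chars().anyMatch(Character::isLetter);
            // Keep candidates with letters, at least two characters, at most MAX_WORDS words.
            if (hasLetter && candidate.length() >= 2 && words.size() <= MAX_WORDS) {
                result.add(candidate);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(clean("the cat on the mat")); // prints [cat on the mat]
        System.out.println(clean("Tom and Jerry"));      // prints [tom, jerry]
    }
}
```

Note that only leading and trailing stopwords are removed, which is why the inner "on the" survives in the first example.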


Field Summary
 
Fields inherited from class uk.ac.shef.dcs.oak.jate.core.npextractor.CandidateTermExtractor
_normaliser, _stoplist
 
Constructor Summary
NounPhraseExtractorOpenNLP(StopList stop, Normalizer normaliser)
          Creates an instance with the specified stopword list and normaliser.
 
Method Summary
 java.util.Map<java.lang.String,java.util.Set<java.lang.String>> extract(Corpus c)
           
 java.util.Map<java.lang.String,java.util.Set<java.lang.String>> extract(Document d)
           
 java.util.Map<java.lang.String,java.util.Set<java.lang.String>> extract(java.lang.String content)
           
 
Methods inherited from class uk.ac.shef.dcs.oak.jate.core.npextractor.CandidateTermExtractor
applyCharacterReplacement, applySplitList, applyTrimStopwords, containsDigit, containsLetter, hasReasonableNumChars
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

NounPhraseExtractorOpenNLP

public NounPhraseExtractorOpenNLP(StopList stop,
                                  Normalizer normaliser)
                           throws java.io.IOException
Creates an instance with the specified stopword list and normaliser.

Parameters:
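stop - the stopword list used to trim stopwords from candidate phrases
normaliser - the normaliser used to reduce candidates to their canonical form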
Throws:
java.io.IOException

Method Detail

extract

public java.util.Map<java.lang.String,java.util.Set<java.lang.String>> extract(Corpus c)
                                                                        throws JATEException
Specified by:
extract in class CandidateTermExtractor
Parameters:
c - corpus
Returns:
a map from each term's canonical form to the variants of that term found in the corpus
Throws:
JATEException
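
One way to picture the corpus-level result is as the union of per-document maps, where variant sets under the same canonical form are merged. The sketch below shows that merge with standard collections only. Whether the library builds its corpus result this way is an assumption, and the class name VariantMapMerge is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Illustrative sketch only; not part of JATE. */
final class VariantMapMerge {
    // Merge per-document maps (canonical form -> variants) into one map.
    static Map<String, Set<String>> merge(List<Map<String, Set<String>>> perDocument) {
        Map<String, Set<String>> corpus = new TreeMap<>();
        for (Map<String, Set<String>> doc : perDocument) {
            for (Map.Entry<String, Set<String>> e : doc.entrySet()) {
                // Create the variant set on first sight, then add this document's variants.
                corpus.computeIfAbsent(e.getKey(), k -> new TreeSet<>()).addAll(e.getValue());
            }
        }
        return corpus;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> doc1 = new TreeMap<>();
        doc1.computeIfAbsent("cat", k -> new TreeSet<>()).add("cats");
        Map<String, Set<String>> doc2 = new TreeMap<>();
        doc2.computeIfAbsent("cat", k -> new TreeSet<>()).add("Cat");
        doc2.computeIfAbsent("dog", k -> new TreeSet<>()).add("dogs");

        List<Map<String, Set<String>>> docs = new ArrayList<>();
        docs.add(doc1);
        docs.add(doc2);
        System.out.println(merge(docs));
    }
}
```

Sorted collections are used here only to make the printed output stable; the declared return type allows any Map and Set implementation.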

extract

public java.util.Map<java.lang.String,java.util.Set<java.lang.String>> extract(Document d)
                                                                        throws JATEException
Specified by:
extract in class CandidateTermExtractor
Parameters:
d - document
Returns:
a map from each term's canonical form to the variants of that term found in the document
Throws:
JATEException

extract

public java.util.Map<java.lang.String,java.util.Set<java.lang.String>> extract(java.lang.String content)
                                                                        throws JATEException
Specified by:
extract in class CandidateTermExtractor
Parameters:
content - the text from which candidate terms are extracted
Returns:
a map from each term's canonical form to the variants of that term found in the string
Throws:
JATEException
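
The shape of the returned map (canonical form as key, surface variants as value) can be illustrated with a small grouping routine. This is a sketch under stated assumptions: the class VariantGrouping and its toy normaliser (lowercase, strip one trailing "s") are hypothetical and far simpler than the normaliser passed to the constructor.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.function.Function;

/** Illustrative sketch only; not part of JATE. */
final class VariantGrouping {
    // Toy normaliser (an assumption): lowercase and drop a single trailing "s".
    static final Function<String, String> TOY_NORMALISER = s -> {
        String lower = s.toLowerCase(Locale.ROOT);
        return (lower.length() > 2 && lower.endsWith("s"))
                ? lower.substring(0, lower.length() - 1)
                : lower;
    };

    // Group surface forms under the canonical form produced by the normaliser.
    static Map<String, Set<String>> group(List<String> surfaceForms,
                                          Function<String, String> normaliser) {
        Map<String, Set<String>> byCanonical = new TreeMap<>();
        for (String form : surfaceForms) {
            byCanonical.computeIfAbsent(normaliser.apply(form), k -> new TreeSet<>()).add(form);
        }
        return byCanonical;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> terms =
                group(Arrays.asList("Cats", "cat", "Dog"), TOY_NORMALISER);
        // Each entry pairs one canonical form with every variant that maps to it.
        for (Map.Entry<String, Set<String>> e : terms.entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```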