uk.ac.shef.dcs.oak.jate.core.npextractor
Class WordExtractor

java.lang.Object
  extended by uk.ac.shef.dcs.oak.jate.core.npextractor.CandidateTermExtractor
      extended by uk.ac.shef.dcs.oak.jate.core.npextractor.WordExtractor

public class WordExtractor
extends CandidateTermExtractor

Extracts words from texts. Words are normalized to reduce inflections. Characters that do not match the pattern [a-zA-Z\-] are replaced by whitespace.
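The character replacement described above can be illustrated with a minimal, stand-alone sketch. It does not use any JATE classes; the class name CharacterReplacementSketch and its two methods are hypothetical, and splitting on whitespace afterwards is an assumption about how words are then separated.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-alone sketch; not part of JATE.
class CharacterReplacementSketch {

    // Replaces every character outside [a-zA-Z\-] with a single space.
    static String replaceNonWordChars(String text) {
        return text.replaceAll("[^a-zA-Z\\-]", " ");
    }

    // Assumption: after replacement, words are separated on runs of whitespace.
    static List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        for (String t : replaceNonWordChars(text).split("\\s+")) {
            if (!t.isEmpty()) {
                out.add(t); // skip the empty token produced by leading whitespace
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Digits and punctuation become spaces; the hyphen is kept.
        System.out.println(tokens("Term-based extraction, v2.0!"));
        // prints [Term-based, extraction, v]
    }
}
```

Note that digits are not in the pattern, so a token such as "v2" is split at the digit rather than kept whole.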


Field Summary
 
Fields inherited from class uk.ac.shef.dcs.oak.jate.core.npextractor.CandidateTermExtractor
_normaliser, _stoplist
 
Constructor Summary
WordExtractor(StopList stop, Normalizer normaliser)
          Creates an instance with specified stopwords list and normaliser.
WordExtractor(StopList stop, Normalizer normaliser, boolean removeStop, int minCharsInWord)
          Creates an instance with specified stopwords list and normaliser, an explicit stop word removal setting, and a minimum word length.
 
Method Summary
 java.util.Map<java.lang.String,java.util.Set<java.lang.String>> extract(Corpus c)
           
 java.util.Map<java.lang.String,java.util.Set<java.lang.String>> extract(Document d)
           
 java.util.Map<java.lang.String,java.util.Set<java.lang.String>> extract(java.lang.String content)
           
 
Methods inherited from class uk.ac.shef.dcs.oak.jate.core.npextractor.CandidateTermExtractor
applyCharacterReplacement, applySplitList, applyTrimStopwords, containsDigit, containsLetter, hasReasonableNumChars
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

WordExtractor

public WordExtractor(StopList stop,
                     Normalizer normaliser)
Creates an instance with specified stopwords list and normaliser. By default, stopwords are ignored, and words that have fewer than 2 characters are ignored.

Parameters:
stop - a list of words which are unlikely to occur in a domain-specific candidate term
normaliser - an instance of a Normalizer which reduces a candidate term to its canonical form

WordExtractor

public WordExtractor(StopList stop,
                     Normalizer normaliser,
                     boolean removeStop,
                     int minCharsInWord)
Creates an instance with specified stopwords list and normaliser, an explicit stop word removal setting, and a minimum word length.

Parameters:
stop - a list of words which are unlikely to occur in a domain-specific candidate term
normaliser - an instance of a Normalizer which reduces a candidate term to its canonical form
removeStop - whether stop words should be ignored in the extracted words
minCharsInWord - words that contain fewer than this number of (non-whitespace) characters are ignored in the extracted words
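
How the removeStop and minCharsInWord parameters interact can be sketched with a small stand-alone filter. This is an illustration only, not JATE's implementation: the class WordFilterSketch and its keep method are hypothetical, the stop list is modelled as a plain Set of strings, and the lower-casing before the stop word lookup is an assumption. The one-argument constructor mirrors the documented defaults (stop words ignored, minimum of 2 characters).

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-alone sketch; not part of JATE.
class WordFilterSketch {
    private final Set<String> stopwords;
    private final boolean removeStop;
    private final int minCharsInWord;

    WordFilterSketch(Set<String> stopwords, boolean removeStop, int minCharsInWord) {
        this.stopwords = stopwords;
        this.removeStop = removeStop;
        this.minCharsInWord = minCharsInWord;
    }

    // Mirrors the documented defaults: stop words removed, at least 2 characters.
    WordFilterSketch(Set<String> stopwords) {
        this(stopwords, true, 2);
    }

    // Returns true if the word would survive both filters.
    boolean keep(String word) {
        // Count only non-whitespace characters.
        int chars = word.replaceAll("\\s", "").length();
        if (chars < minCharsInWord) {
            return false;
        }
        // Assumption: stop word matching is case-insensitive.
        if (removeStop && stopwords.contains(word.trim().toLowerCase())) {
            return false;
        }
        return true;
    }

    public static void main(String[] args) {
        Set<String> stop = new HashSet<>(Arrays.asList("the", "of"));
        WordFilterSketch defaults = new WordFilterSketch(stop);
        WordFilterSketch lenient = new WordFilterSketch(stop, false, 1);
        System.out.println(defaults.keep("the") + " " + lenient.keep("the"));
    }
}
```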

Method Detail

extract

public java.util.Map<java.lang.String,java.util.Set<java.lang.String>> extract(Corpus c)
                                                                        throws JATEException
Specified by:
extract in class CandidateTermExtractor
Parameters:
c - corpus
Returns:
a map containing mappings from term canonical form to its variants found in the corpus
Throws:
JATEException
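
One plausible way to picture a corpus-level result is as the union of per-document maps, where variant sets that share a canonical form are merged. The sketch below is an assumption for illustration and is not taken from JATE's source; CorpusMergeSketch and merge are hypothetical names.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-alone sketch; not part of JATE.
class CorpusMergeSketch {

    // Merges per-document maps; variants with the same canonical form are unioned.
    static Map<String, Set<String>> merge(List<Map<String, Set<String>>> perDocument) {
        Map<String, Set<String>> merged = new HashMap<>();
        for (Map<String, Set<String>> doc : perDocument) {
            for (Map.Entry<String, Set<String>> e : doc.entrySet()) {
                merged.computeIfAbsent(e.getKey(), k -> new HashSet<>())
                      .addAll(e.getValue());
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> doc1 =
                Collections.singletonMap("term", new HashSet<>(Arrays.asList("terms")));
        Map<String, Set<String>> doc2 =
                Collections.singletonMap("term", new HashSet<>(Arrays.asList("Term")));
        // The merged entry for "term" holds both "terms" and "Term".
        System.out.println(merge(Arrays.asList(doc1, doc2)));
    }
}
```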

extract

public java.util.Map<java.lang.String,java.util.Set<java.lang.String>> extract(Document d)
                                                                        throws JATEException
Specified by:
extract in class CandidateTermExtractor
Parameters:
d - document
Returns:
a map containing mappings from term canonical form to its variants found in the document
Throws:
JATEException

extract

public java.util.Map<java.lang.String,java.util.Set<java.lang.String>> extract(java.lang.String content)
                                                                        throws JATEException
Specified by:
extract in class CandidateTermExtractor
Parameters:
content - a string
Returns:
a map containing mappings from term canonical form to its variants found in the string
Throws:
JATEException
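
The shape of the returned map (canonical form mapped to the set of variants found) can be sketched without JATE as follows. The class VariantMapSketch is hypothetical, and toyNormalise is a deliberately crude stand-in for a real Normalizer: it lower-cases and strips a trailing "s" from words longer than three characters, which is not how a proper lemmatiser or stemmer behaves.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.function.Function;

// Hypothetical stand-alone sketch; not part of JATE.
class VariantMapSketch {

    // Groups each surface form under the canonical form the normaliser gives it.
    static Map<String, Set<String>> group(List<String> words,
                                          Function<String, String> normaliser) {
        Map<String, Set<String>> result = new TreeMap<>();
        for (String w : words) {
            result.computeIfAbsent(normaliser.apply(w), k -> new TreeSet<>()).add(w);
        }
        return result;
    }

    // Toy normaliser (assumption): lower-case, then drop one trailing "s".
    static String toyNormalise(String w) {
        String lower = w.toLowerCase();
        if (lower.length() > 3 && lower.endsWith("s")) {
            return lower.substring(0, lower.length() - 1);
        }
        return lower;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("Terms", "term", "Term", "extractor");
        // "Terms", "term" and "Term" all end up under the key "term".
        System.out.println(group(words, VariantMapSketch::toyNormalise));
    }
}
```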