uk.ac.shef.dcs.oak.jate.core.npextractor
Class WordExtractor
java.lang.Object
uk.ac.shef.dcs.oak.jate.core.npextractor.CandidateTermExtractor
uk.ac.shef.dcs.oak.jate.core.npextractor.WordExtractor
public class WordExtractor
- extends CandidateTermExtractor
Extracts words from texts. Words are normalized to reduce inflectional variation. Characters that do not match the pattern
[a-zA-Z\-] are replaced by whitespace.
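The character filtering described above can be illustrated with a minimal, self-contained sketch. It does not use the JATE classes; the class name CharacterFilterSketch and its methods are hypothetical, and splitting on whitespace afterwards is an assumption about how words are then separated.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of the JATE API.
final class CharacterFilterSketch {

    // Replace every character outside [a-zA-Z\-] with a single space.
    static String replaceDisallowed(String text) {
        return text.replaceAll("[^a-zA-Z\\-]", " ");
    }

    // Assumption: words are the whitespace-separated runs that remain.
    static List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        for (String token : replaceDisallowed(text).split("\\s+")) {
            if (!token.isEmpty()) {
                out.add(token);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Digits and punctuation become spaces; hyphens are kept.
        System.out.println(tokens("T-cells (CD4+) were counted, 3 times."));
    }
}
```

Note that digits are replaced as well, so a token such as "CD4" is reduced to "CD" under this pattern.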
Method Summary
java.util.Map<java.lang.String,java.util.Set<java.lang.String>> extract(Corpus c)
java.util.Map<java.lang.String,java.util.Set<java.lang.String>> extract(Document d)
java.util.Map<java.lang.String,java.util.Set<java.lang.String>> extract(java.lang.String content)
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
WordExtractor
public WordExtractor(StopList stop,
Normalizer normaliser)
- Creates an instance with the specified stop word list and normaliser.
By default, stop words are ignored, and words that have fewer than 2 characters
are ignored.
- Parameters:
stop
- a list of words which are unlikely to occur in a domain-specific candidate term
normaliser
- an instance of a Normalizer which reduces a candidate term to its canonical form
WordExtractor
public WordExtractor(StopList stop,
Normalizer normaliser,
boolean removeStop,
int minCharsInWord)
- Creates an instance with the specified stop word list and normaliser.
- Parameters:
stop
- a list of words which are unlikely to occur in a domain-specific candidate term
normaliser
- an instance of a Normalizer which reduces a candidate term to its canonical form
removeStop
- whether stop words should be ignored in the extracted words
minCharsInWord
- words that contain fewer than this number of (non-whitespace) characters are ignored in the
extracted words
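As a rough illustration of how the two constructors relate, the following self-contained sketch mirrors the documented parameters without using the JATE classes. The class WordFilterSketch and its accepts method are hypothetical; the defaults (stop words ignored, minimum of 2 characters) are taken from the two-argument constructor's description, while the case-insensitive stop word lookup is an assumption.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration only; not part of the JATE API.
final class WordFilterSketch {
    private final Set<String> stopWords;
    private final boolean removeStop;
    private final int minCharsInWord;

    // Mirrors the two-argument constructor: documented defaults applied.
    WordFilterSketch(Set<String> stopWords) {
        this(stopWords, true, 2);
    }

    // Mirrors the four-argument constructor: both filters are explicit.
    WordFilterSketch(Set<String> stopWords, boolean removeStop, int minCharsInWord) {
        this.stopWords = stopWords;
        this.removeStop = removeStop;
        this.minCharsInWord = minCharsInWord;
    }

    // Returns true if the word would be kept under the configured filters.
    boolean accepts(String word) {
        String w = word.trim();
        if (w.length() < minCharsInWord) {
            return false; // too short
        }
        // Assumption: stop words are matched case-insensitively.
        return !(removeStop && stopWords.contains(w.toLowerCase()));
    }

    public static void main(String[] args) {
        Set<String> stops = new HashSet<>(Arrays.asList("the", "of"));
        WordFilterSketch defaults = new WordFilterSketch(stops);
        WordFilterSketch keepStops = new WordFilterSketch(stops, false, 3);
        System.out.println(defaults.accepts("the"));  // stop word under defaults
        System.out.println(keepStops.accepts("the")); // stop words retained here
    }
}
```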
extract
public java.util.Map<java.lang.String,java.util.Set<java.lang.String>> extract(Corpus c)
throws JATEException
- Specified by:
extract
in class CandidateTermExtractor
- Parameters:
c
- corpus
- Returns:
- a map from each term's canonical form to the variants of that term found in the corpus
- Throws:
JATEException
extract
public java.util.Map<java.lang.String,java.util.Set<java.lang.String>> extract(Document d)
throws JATEException
- Specified by:
extract
in class CandidateTermExtractor
- Parameters:
d
- document
- Returns:
- a map from each term's canonical form to the variants of that term found in the document
- Throws:
JATEException
extract
public java.util.Map<java.lang.String,java.util.Set<java.lang.String>> extract(java.lang.String content)
throws JATEException
- Specified by:
extract
in class CandidateTermExtractor
- Parameters:
content
- a string
- Returns:
- a map from each term's canonical form to the variants of that term found in the string
- Throws:
JATEException
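The shape of the value returned by the three extract methods, a map from canonical form to the set of surface variants, can be sketched as follows. This is a self-contained approximation that does not call JATE: the class VariantMapSketch is hypothetical, and its normalise method (lower-casing and stripping a trailing "s") is only a stand-in for whatever Normalizer is supplied to the constructor.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical illustration only; not part of the JATE API.
final class VariantMapSketch {

    // Stand-in normaliser (assumption): lower-case, strip one trailing "s".
    static String normalise(String word) {
        String w = word.toLowerCase();
        return (w.length() > 3 && w.endsWith("s")) ? w.substring(0, w.length() - 1) : w;
    }

    // Groups each word found in the content under its canonical form.
    static Map<String, Set<String>> group(String content) {
        Map<String, Set<String>> result = new TreeMap<>();
        String cleaned = content.replaceAll("[^a-zA-Z\\-]", " ");
        for (String token : cleaned.split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            result.computeIfAbsent(normalise(token), k -> new TreeSet<>()).add(token);
        }
        return result;
    }

    public static void main(String[] args) {
        // Iterate canonical forms and list the variants seen for each.
        for (Map.Entry<String, Set<String>> e : group("Cells, cell and CELLS.").entrySet()) {
            System.out.println(e.getKey() + " <- " + e.getValue());
        }
    }
}
```

A caller of the real extract methods would iterate the returned map in the same way, with the keys being canonical forms and each value holding the variants found in the input.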