uk.ac.shef.dcs.oak.jate.core.npextractor
Class NounPhraseExtractorOpenNLP
java.lang.Object
uk.ac.shef.dcs.oak.jate.core.npextractor.CandidateTermExtractor
uk.ac.shef.dcs.oak.jate.core.npextractor.NounPhraseExtractorOpenNLP
public class NounPhraseExtractorOpenNLP
- extends CandidateTermExtractor
Noun phrase extractor implemented with OpenNLP tools. It applies a set of heuristics to clean each candidate noun phrase and
reduce it to its normalised root form. These heuristics are:
-Stopwords are trimmed from the head and tail of a phrase, e.g., "the cat on the mat" becomes "cat on the mat".
-Phrases containing "or" or "and" are split, e.g., "Tom and Jerry" becomes "tom" and "jerry".
-A candidate must contain letters.
-A candidate must have at least two characters.
-Characters that do not match the pattern [a-zA-Z\-] are replaced with whitespace.
-A candidate may or may not contain digits; this is set in the property file.
-A candidate must contain no more than N words; N is set in the property file.
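The heuristics listed above can be illustrated with a minimal, self-contained sketch. This is not the JATE implementation: the class name, method names, stopword set and word limit are hypothetical stand-ins, the order in which the steps run is an assumption, and the digit option and the normalisation step are left out because this page does not say how they interact with the other rules.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Illustrative sketch only; names and constants are assumptions, not JATE code.
final class NounPhraseHeuristicsSketch {

    // Hypothetical stopword set; JATE takes its list from a StopList instance.
    static final Set<String> STOPWORDS =
            new HashSet<>(Arrays.asList("the", "a", "an", "of", "on", "in"));

    // Hypothetical stand-in for the N that JATE reads from its property file.
    static final int MAX_WORDS = 4;

    /** Replaces every character outside [a-zA-Z\-] with a space and collapses runs of spaces. */
    static String replaceIllegalChars(String phrase) {
        return phrase.replaceAll("[^a-zA-Z\\-]", " ").replaceAll("\\s+", " ").trim();
    }

    /** Splits a phrase on the standalone words "and" and "or". */
    static List<String> splitOnConjunctions(String phrase) {
        List<String> parts = new ArrayList<>();
        for (String part : phrase.split("\\s+(?:and|or)\\s+")) {
            if (!part.trim().isEmpty()) {
                parts.add(part.trim());
            }
        }
        return parts;
    }

    /** Removes stopwords from the head and the tail of a phrase, leaving inner ones untouched. */
    static String trimStopwords(String phrase) {
        List<String> words = new ArrayList<>(Arrays.asList(phrase.trim().split("\\s+")));
        while (!words.isEmpty() && STOPWORDS.contains(words.get(0))) {
            words.remove(0);
        }
        while (!words.isEmpty() && STOPWORDS.contains(words.get(words.size() - 1))) {
            words.remove(words.size() - 1);
        }
        return String.join(" ", words);
    }

    /** Checks the length, letter and word-count constraints. */
    static boolean isValid(String candidate) {
        if (candidate.length() < 2) {
            return false;
        }
        if (!candidate.matches(".*[a-zA-Z].*")) {
            return false;
        }
        return candidate.split("\\s+").length <= MAX_WORDS;
    }

    /** Applies the steps in one assumed order and returns the surviving candidates. */
    static List<String> clean(String rawPhrase) {
        List<String> result = new ArrayList<>();
        String cleaned = replaceIllegalChars(rawPhrase.toLowerCase(Locale.ROOT));
        for (String part : splitOnConjunctions(cleaned)) {
            String trimmed = trimStopwords(part);
            if (isValid(trimmed)) {
                result.add(trimmed);
            }
        }
        return result;
    }
}
```

Under these assumptions, `clean("the cat on the mat")` keeps the single candidate "cat on the mat", and `clean("Tom and Jerry")` yields the two candidates "tom" and "jerry", matching the examples in the list.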
Method Summary
java.util.Map<java.lang.String,java.util.Set<java.lang.String>> | extract(Corpus c)
java.util.Map<java.lang.String,java.util.Set<java.lang.String>> | extract(Document d)
java.util.Map<java.lang.String,java.util.Set<java.lang.String>> | extract(java.lang.String content)
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
NounPhraseExtractorOpenNLP
public NounPhraseExtractorOpenNLP(StopList stop,
Normalizer normaliser)
throws java.io.IOException
- Creates an instance with the specified stopword list and normaliser.
- Parameters:
stop
- the stopword list
normaliser
- the normaliser
- Throws:
java.io.IOException
extract
public java.util.Map<java.lang.String,java.util.Set<java.lang.String>> extract(Corpus c)
throws JATEException
- Specified by:
extract
in class CandidateTermExtractor
- Parameters:
c
- corpus
- Returns:
- a map from each term's canonical form to the variants of that term found in the corpus
- Throws:
JATEException
extract
public java.util.Map<java.lang.String,java.util.Set<java.lang.String>> extract(Document d)
throws JATEException
- Specified by:
extract
in class CandidateTermExtractor
- Parameters:
d
- document
- Returns:
- a map from each term's canonical form to the variants of that term found in the document
- Throws:
JATEException
extract
public java.util.Map<java.lang.String,java.util.Set<java.lang.String>> extract(java.lang.String content)
throws JATEException
- Specified by:
extract
in class CandidateTermExtractor
- Parameters:
content
- a string
- Returns:
- a map from each term's canonical form to the variants of that term found in the string
- Throws:
JATEException
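All three extract methods return the same shape: a map from a canonical form to the set of variants seen for it. The sketch below shows one way such maps could be combined, for example to pool per-document results. The class and method names are hypothetical, the sample data in the usage note is invented for illustration, and nothing here is claimed to be how extract(Corpus c) aggregates internally.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Illustrative sketch only; not part of the JATE API.
final class TermVariantMapsSketch {

    /**
     * Merges several canonical-form-to-variants maps into one.
     * Variants for the same canonical form are unioned; the inputs are not modified.
     */
    static Map<String, Set<String>> merge(List<Map<String, Set<String>>> maps) {
        // Sorted containers are used only to make the iteration order predictable.
        Map<String, Set<String>> merged = new TreeMap<>();
        for (Map<String, Set<String>> map : maps) {
            for (Map.Entry<String, Set<String>> entry : map.entrySet()) {
                merged.computeIfAbsent(entry.getKey(), key -> new TreeSet<>())
                      .addAll(entry.getValue());
            }
        }
        return merged;
    }

    /** Counts the total number of variants across all canonical forms. */
    static int countVariants(Map<String, Set<String>> terms) {
        int total = 0;
        for (Set<String> variants : terms.values()) {
            total += variants.size();
        }
        return total;
    }
}
```

For instance, merging a map that holds "cat" with variants {"cat", "cats"} and a second map that holds "cat" with {"Cats"} and "mat" with {"mat"} gives two canonical forms and four variants in total.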