de.l3s.boilerpipe.filters.english
Class NumWordsRulesClassifier
java.lang.Object
de.l3s.boilerpipe.filters.english.NumWordsRulesClassifier
- All Implemented Interfaces:
- BoilerpipeFilter
public class NumWordsRulesClassifier
- extends java.lang.Object
- implements BoilerpipeFilter
Classifies TextBlocks as content/not-content through rules that have
been determined using the C4.8 machine learning algorithm, as described in
the paper "Boilerplate Detection using Shallow Text Features" (WSDM 2010),
particularly using number of words per block and link density per block.
- Author:
- Christian Kohlschütter
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
INSTANCE
public static final NumWordsRulesClassifier INSTANCE
NumWordsRulesClassifier
public NumWordsRulesClassifier()
getInstance
public static NumWordsRulesClassifier getInstance()
- Returns the singleton instance of NumWordsRulesClassifier.
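The INSTANCE field and getInstance() both give access to one shared object, while the public constructor still allows separate instances. A minimal sketch of that arrangement follows; SketchClassifier is a hypothetical stand-in introduced for illustration and is not the library class.

```java
// Hypothetical stand-in for illustration only; not the boilerpipe class.
final class SketchClassifier {
    // Shared instance created once when the class is loaded,
    // mirroring a "public static final ... INSTANCE" field.
    public static final SketchClassifier INSTANCE = new SketchClassifier();

    // The documented class also has a public constructor, so callers
    // are not forced to use the shared instance.
    public SketchClassifier() {
    }

    // Accessor that returns the same object as the INSTANCE field.
    public static SketchClassifier getInstance() {
        return INSTANCE;
    }
}
```

Under these assumptions, SketchClassifier.getInstance() and SketchClassifier.INSTANCE refer to the same object, whereas new SketchClassifier() yields a distinct one.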
process
public boolean process(TextDocument doc)
throws BoilerpipeProcessingException
- Description copied from interface: BoilerpipeFilter
- Processes the given document doc.
- Specified by:
process in interface BoilerpipeFilter
- Parameters:
doc - The TextDocument that is to be processed.
- Returns:
true if changes have been made to the TextDocument.
- Throws:
BoilerpipeProcessingException
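The contract of process (return true only if the document was changed) combined with the three-argument classify method suggests a pass over the blocks in which each block is examined together with its neighbors. The following is a rough sketch of such a pass, not the library's implementation: WindowBlock, WindowFilter, the EMPTY sentinel, and the placeholder rule inside classify are all assumptions made for illustration.

```java
import java.util.List;

// Hypothetical, simplified stand-in for a text block; not the boilerpipe TextBlock.
final class WindowBlock {
    final int numWords;
    boolean content;

    WindowBlock(int numWords, boolean content) {
        this.numWords = numWords;
        this.content = content;
    }

    // Updates the label and reports whether it actually changed.
    boolean setContent(boolean newValue) {
        boolean changed = (content != newValue);
        content = newValue;
        return changed;
    }
}

// Hypothetical filter showing a prev/curr/next sliding window.
final class WindowFilter {
    // Assumed sentinel used where a block has no real neighbor.
    private static final WindowBlock EMPTY = new WindowBlock(0, false);

    // Visits each block with its neighbors and returns true
    // if at least one label was changed.
    static boolean process(List<WindowBlock> blocks) {
        boolean changed = false;
        for (int i = 0; i < blocks.size(); i++) {
            WindowBlock prev = (i > 0) ? blocks.get(i - 1) : EMPTY;
            WindowBlock curr = blocks.get(i);
            WindowBlock next = (i + 1 < blocks.size()) ? blocks.get(i + 1) : EMPTY;
            // Non-short-circuit OR, so classify runs for every block.
            changed = classify(prev, curr, next) | changed;
        }
        return changed;
    }

    // Placeholder rule for illustration only: a block counts as content
    // when it has more than 10 words. The neighbors are unused here.
    static boolean classify(WindowBlock prev, WindowBlock curr, WindowBlock next) {
        return curr.setContent(curr.numWords > 10);
    }
}
```

With this sketch, a first call on freshly labeled blocks reports a change if any label flips, and an immediate second call on the same list reports none, which matches the documented meaning of the return value.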
classify
protected boolean classify(TextBlock prev,
TextBlock curr,
TextBlock next)
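The class description says the rules rely on the number of words and the link density per block, and the signature shows that the previous and next blocks are taken into account. The sketch below illustrates what a rule set of that shape can look like. RuleBlock and RuleSketch are hypothetical names, and the thresholds are illustrative assumptions, not the values learned in the paper or used by the library.

```java
// Hypothetical, simplified stand-in for a text block; not the boilerpipe TextBlock.
final class RuleBlock {
    final int numWords;
    final double linkDensity; // assumed to be the fraction of words inside links, 0.0 to 1.0

    RuleBlock(int numWords, double linkDensity) {
        this.numWords = numWords;
        this.linkDensity = linkDensity;
    }
}

// Hypothetical decision rules over shallow text features.
final class RuleSketch {
    // Returns true if curr looks like content. All thresholds are
    // illustrative assumptions chosen for this example.
    static boolean isContent(RuleBlock prev, RuleBlock curr, RuleBlock next) {
        // Link-heavy blocks (navigation, footers) are treated as boilerplate.
        if (curr.linkDensity > 0.33) {
            return false;
        }
        // A long block with few links is treated as content on its own.
        if (curr.numWords > 16) {
            return true;
        }
        // A short block is kept only if a neighbor is reasonably long,
        // for example a heading directly above a paragraph.
        return prev.numWords > 4 || next.numWords > 15;
    }
}
```

Nested threshold tests of this kind are what a decision tree learner produces, which is why the rules can be written as plain if statements without any model file at run time.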