de.l3s.boilerpipe.filters.english
Class NumWordsRulesClassifier

java.lang.Object
  extended by de.l3s.boilerpipe.filters.english.NumWordsRulesClassifier
All Implemented Interfaces:
BoilerpipeFilter

public class NumWordsRulesClassifier
extends java.lang.Object
implements BoilerpipeFilter

Classifies TextBlocks as content or not-content using rules determined with the C4.8 machine learning algorithm, as described in the paper "Boilerplate Detection using Shallow Text Features" (WSDM 2010). The rules rely mainly on two features of each block: its number of words and its link density.
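
The following is a minimal sketch of what a rule set over word counts and link density can look like. It is not the library's source: the Block type, the field names, and every threshold are assumptions chosen for illustration. The decision considers the current block together with its previous and next neighbours, as the classify signature suggests.

```java
// Illustrative sketch only. Block, its fields, and all thresholds are
// hypothetical; they are not taken from the library.
final class NumWordsRuleSketch {

    /** Hypothetical stand-in holding the two features of a text block. */
    static final class Block {
        final int numWords;
        final double linkDensity; // fraction of words inside links, 0.0 to 1.0

        Block(int numWords, double linkDensity) {
            this.numWords = numWords;
            this.linkDensity = linkDensity;
        }
    }

    /** Decision-tree style rule: true means "content", false means "boilerplate". */
    static boolean isContent(Block prev, Block curr, Block next) {
        // A link-heavy block is treated as boilerplate outright.
        if (curr.linkDensity > 0.33) {
            return false;
        }
        if (prev.linkDensity <= 0.55) {
            // Previous block is not link-heavy: long blocks, or short blocks
            // with wordy neighbours, count as content.
            if (curr.numWords > 16) return true;
            if (next.numWords > 15) return true;
            return prev.numWords > 4;
        }
        // Previous block is link-heavy: require more evidence.
        if (curr.numWords > 40) return true;
        return next.numWords > 17;
    }

    public static void main(String[] args) {
        Block empty = new Block(0, 0.0);   // placeholder for a missing neighbour
        Block nav = new Block(3, 0.9);     // short, almost entirely links
        Block para = new Block(60, 0.02);  // long, almost no links

        System.out.println(isContent(empty, nav, para)); // false
        System.out.println(isContent(nav, para, empty)); // true
    }
}
```

The nested if structure mirrors how a learned decision tree is usually written out as code: each branch tests one feature against one threshold.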

Author:
Christian Kohlschütter

Field Summary
static NumWordsRulesClassifier INSTANCE
           
 
Constructor Summary
NumWordsRulesClassifier()
           
 
Method Summary
protected  boolean classify(TextBlock prev, TextBlock curr, TextBlock next)
           
static NumWordsRulesClassifier getInstance()
          Returns the singleton instance of NumWordsRulesClassifier.
 boolean process(TextDocument doc)
          Processes the given document doc.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

INSTANCE

public static final NumWordsRulesClassifier INSTANCE
Constructor Detail

NumWordsRulesClassifier

public NumWordsRulesClassifier()
Method Detail

getInstance

public static NumWordsRulesClassifier getInstance()
Returns the singleton instance of NumWordsRulesClassifier.
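
As a rough illustration of the pattern these signatures suggest, the sketch below pairs a public static final INSTANCE field with a getInstance() accessor. ClassifierSketch is a hypothetical stand-in, and the assumption that getInstance() simply returns INSTANCE is not confirmed by this page.

```java
// Illustrative sketch: ClassifierSketch is a hypothetical stand-in,
// not the library class.
final class ClassifierSketch {

    // Eagerly created shared instance, mirroring the documented INSTANCE field.
    public static final ClassifierSketch INSTANCE = new ClassifierSketch();

    // Public constructor, as in the documented signature, so callers can
    // still create a separate instance if they want one.
    public ClassifierSketch() {
    }

    // Assumed behaviour: hand back the shared instance.
    public static ClassifierSketch getInstance() {
        return INSTANCE;
    }

    public static void main(String[] args) {
        System.out.println(getInstance() == INSTANCE);          // true
        System.out.println(new ClassifierSketch() == INSTANCE); // false
    }
}
```

Because the constructor is public, the shared instance is a convenience rather than an enforced singleton.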


process

public boolean process(TextDocument doc)
                throws BoilerpipeProcessingException
Description copied from interface: BoilerpipeFilter
Processes the given document doc.

Specified by:
process in interface BoilerpipeFilter
Parameters:
doc - The TextDocument that is to be processed.
Returns:
true if changes have been made to the TextDocument.
Throws:
BoilerpipeProcessingException
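
One plausible way for a filter to satisfy this contract is to walk the blocks with a previous/current/next window and report whether any label changed. The sketch below assumes that approach; Block, EMPTY, and rule are hypothetical names, and the rule itself is a placeholder rather than the library's logic.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only. All types and the rule are hypothetical.
final class SlidingWindowFilterSketch {

    /** Hypothetical stand-in for a text block with a mutable content label. */
    static final class Block {
        final int numWords;
        boolean content;

        Block(int numWords, boolean content) {
            this.numWords = numWords;
            this.content = content;
        }
    }

    // Sentinel used where the first block has no predecessor
    // or the last block has no successor.
    static final Block EMPTY = new Block(0, false);

    /** Placeholder rule: a wordy block, or one with wordy neighbours on both sides. */
    static boolean rule(Block prev, Block curr, Block next) {
        return curr.numWords > 15 || (prev.numWords > 15 && next.numWords > 15);
    }

    /** Relabels every block and returns true if at least one label changed. */
    static boolean process(List<Block> blocks) {
        boolean changed = false;
        for (int i = 0; i < blocks.size(); i++) {
            Block prev = (i == 0) ? EMPTY : blocks.get(i - 1);
            Block curr = blocks.get(i);
            Block next = (i == blocks.size() - 1) ? EMPTY : blocks.get(i + 1);

            boolean label = rule(prev, curr, next);
            if (label != curr.content) {
                curr.content = label;
                changed = true;
            }
        }
        return changed;
    }

    public static void main(String[] args) {
        List<Block> blocks = Arrays.asList(
                new Block(3, false), new Block(40, false), new Block(2, false));

        System.out.println(process(blocks)); // true: the middle block was relabelled
        System.out.println(process(blocks)); // false: nothing left to change
    }
}
```

Returning false on the second pass matches the documented meaning of the return value: true only if changes have been made.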

classify

protected boolean classify(TextBlock prev,
                           TextBlock curr,
                           TextBlock next)