de.l3s.boilerpipe.filters.english
Class KeepLargestFulltextBlockFilter
java.lang.Object
de.l3s.boilerpipe.filters.english.KeepLargestFulltextBlockFilter
- All Implemented Interfaces:
- BoilerpipeFilter
public final class KeepLargestFulltextBlockFilter
- extends java.lang.Object
- implements BoilerpipeFilter
Keeps only the largest TextBlock (by number of words). If more than one
block has the same number of words, the first block is chosen.
All discarded blocks are marked "not content" and flagged as DefaultLabels.MIGHT_BE_CONTENT.
Unlike KeepLargestBlockFilter, the number of words is computed using
HeuristicFilterBase.getNumFullTextWords(TextBlock), which only counts words that occur
in text elements with at least 9 words and are thus believed to be full text.
NOTE: Without language-specific fine-tuning (i.e., running the default instance), this filter
may lead to suboptimal results. Consider using KeepLargestBlockFilter instead, which
works on plain word counts rather than text densities.
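The selection rule described above can be illustrated with a small, self-contained sketch. It does not use the boilerpipe classes: Block, numFullTextWords and keepLargest are hypothetical stand-ins written for this example. The counting rule (a block contributes its word count only when its text density reaches a threshold, otherwise zero) is an assumption suggested by the minTextDensity parameter, not something this page confirms.

```java
import java.util.ArrayList;
import java.util.List;

final class LargestFulltextBlockSketch {

    /** Hypothetical stand-in for a text block; NOT the boilerpipe TextBlock class. */
    static final class Block {
        final String text;
        final int numWords;
        final float textDensity;
        boolean isContent = true;        // models the "content" / "not content" mark
        boolean mightBeContent = false;  // models the MIGHT_BE_CONTENT label

        Block(String text, int numWords, float textDensity) {
            this.text = text;
            this.numWords = numWords;
            this.textDensity = textDensity;
        }
    }

    /** Assumed rule: words count as full text only if the density meets the threshold. */
    static int numFullTextWords(Block b, float minTextDensity) {
        return b.textDensity >= minTextDensity ? b.numWords : 0;
    }

    /**
     * Keeps only the content block with the most full-text words.
     * The strict '>' comparison means the first block wins a tie.
     * Returns true if any block was changed.
     */
    static boolean keepLargest(List<Block> blocks, float minTextDensity) {
        Block largest = null;
        int max = -1;
        for (Block b : blocks) {
            if (!b.isContent) {
                continue;
            }
            int n = numFullTextWords(b, minTextDensity);
            if (n > max) {
                max = n;
                largest = b;
            }
        }

        boolean changed = false;
        for (Block b : blocks) {
            if (b != largest && b.isContent) {
                b.isContent = false;     // discarded: marked "not content"
                b.mightBeContent = true; // but flagged for possible later recovery
                changed = true;
            }
        }
        return changed;
    }

    public static void main(String[] args) {
        List<Block> blocks = new ArrayList<>();
        blocks.add(new Block("navigation", 4, 2.0f));
        blocks.add(new Block("first article paragraph", 40, 10.5f));
        blocks.add(new Block("second article paragraph", 40, 11.0f));
        blocks.add(new Block("short comment", 25, 12.0f));

        boolean changed = keepLargest(blocks, 9f);
        System.out.println("changed = " + changed);
        for (Block b : blocks) {
            System.out.println(b.text + ": content=" + b.isContent
                    + ", mightBeContent=" + b.mightBeContent);
        }
    }
}
```

In this sketch the two 40-word blocks tie, so only the earlier one stays marked as content. Every other block is marked "not content" and flagged, mirroring the behaviour described for discarded blocks.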
- Author:
- Christian Kohlschütter
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
INSTANCE
public static final KeepLargestFulltextBlockFilter INSTANCE
KeepLargestFulltextBlockFilter
public KeepLargestFulltextBlockFilter()
process
public boolean process(TextDocument doc)
throws BoilerpipeProcessingException
- Description copied from interface: BoilerpipeFilter
- Processes the given document doc.
- Specified by:
- process in interface BoilerpipeFilter
- Parameters:
- doc - The TextDocument that is to be processed.
- Returns:
- true if changes have been made to the TextDocument.
- Throws:
- BoilerpipeProcessingException
getNumFullTextWords
protected static int getNumFullTextWords(TextBlock tb)
getNumFullTextWords
protected static int getNumFullTextWords(TextBlock tb, float minTextDensity)