TextBlock.
TextBlock.
BlockProximityFusion instance.
BoilerpipeFilter.
ContentHandler, used by BoilerpipeSAXInput.
BoilerpipeSAXInput.
TextDocuments.
InputSource using SAX and returns a TextDocument.
BoilerpipeSAXInput for the given InputSource.
TextBlocks which have explicitly been marked as "not content".
TextBlocks.
TextBlocks as content/not-content through rules that have been determined using the C4.8 machine learning algorithm, as described in the paper "Boilerplate Detection using Shallow Text Features", particularly using text densities and link densities.
TextBlocks which contain parts of the HTML <TITLE> tag, using some heuristics which are quite specific to the news domain.
TextBlocks "content" which are between the headline and the part that has already been marked content, if they are marked TextBlockLabel.MIGHT_BE_CONTENT.
TextDocument's content.
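The link-density feature named in the classifier entries above can be pictured with a small sketch. This is not boilerpipe's code: the class `DensitySketch`, its method names, and both threshold values are hypothetical, chosen only to show how a word count and a link density might feed a content/not-content rule. The actual rules are the ones learned by the classifier described in the paper.

```java
// Hypothetical illustration only; not part of the boilerpipe API.
final class DensitySketch {
    // Assumed thresholds for this sketch; the real rules come from the trained classifier.
    static final double MAX_LINK_DENSITY = 0.33;
    static final int MIN_WORDS = 10;

    // Link density: the share of a block's words that sit inside anchor text.
    static double linkDensity(int numWords, int numWordsInAnchorText) {
        return numWords == 0 ? 0.0 : (double) numWordsInAnchorText / numWords;
    }

    // A block counts as content here if it is long enough and not dominated by links.
    static boolean isContent(int numWords, int numWordsInAnchorText) {
        return numWords >= MIN_WORDS
                && linkDensity(numWords, numWordsInAnchorText) <= MAX_LINK_DENSITY;
    }
}
```

Under these assumed thresholds, a 40-word paragraph with 2 linked words would be kept, while a 40-word navigation list with 30 linked words would not.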
ArticleExtractor.
ArticleSentencesExtractor.
DefaultExtractor.
LargestContentExtractor.
NumWordsRulesExtractor.
null if no such labels exist.
InputSource.
Reader.
TextDocument object.
TextDocument's content, non-content or both.
InputSource.
URL.
Reader.
TextDocument object.
TextBlocks of this document.
TextDocument.
null if no such title has been set.
TextDocument.
HTMLHighlighter for the given TextDocument and the original HTML text (as a String).
HTMLHighlighter for the given TextDocument and the original HTML text (as an InputSource).
TextBlockLabel.INDICATES_END_OF_TEXT.
TextBlocks
TextBlock only (by the number of words).
TextBlock only (by the number of words).
TextBlock#getNumFullTextWords()). k is 30 by default.
TextBlocks as content/not-content through rules that have been determined using the C4.8 machine learning algorithm, as described in the paper "Boilerplate Detection using Shallow Text Features" (WSDM 2010), particularly using number of words per block and link density per block.
doc.
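The size-based entries above (selection by number of words, and a parameter k that is 30 by default) can be sketched as follows. This assumes the entries refer to picking the single largest block by word count and to checking a minimum number of full-text words. The class `BlockSizeSketch` and its methods are hypothetical helpers, not boilerpipe classes.

```java
import java.util.List;

// Hypothetical illustration only; these helpers are not boilerpipe classes.
final class BlockSizeSketch {
    static final int DEFAULT_K = 30; // default value of k stated in the entry above

    // Returns the index of the block with the most words, or -1 for an empty list.
    // On a tie, the earliest block wins.
    static int indexOfLargest(List<Integer> wordCounts) {
        int best = -1;
        int bestWords = -1;
        for (int i = 0; i < wordCounts.size(); i++) {
            int words = wordCounts.get(i);
            if (words > bestWords) {
                bestWords = words;
                best = i;
            }
        }
        return best;
    }

    // True if a block with this many full-text words meets the minimum k.
    static boolean hasAtLeast(int numFullTextWords, int k) {
        return numFullTextWords >= k;
    }
}
```

For word counts of 12, 85 and 40, `indexOfLargest` selects the second block (index 1), and only that block also passes `hasAtLeast(..., DEFAULT_K)` together with the 40-word block.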
TextBlockLabel.INDICATES_END_OF_TEXT.
TextBlock.addLabel(String) and TextBlock.hasLabel(String).
TextBlocks.
TextDocument with given TextBlocks, and no title.
TextDocument with given TextBlocks and given title.
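The data model these entries imply (a document built from blocks, with an optional title, and blocks that carry string labels) can be pictured with the sketch below. `SketchBlock` and `SketchDocument` are hypothetical stand-ins written for illustration; they only mirror the behaviours the entries mention (`addLabel`/`hasLabel`, a `null` title when none is set, `null` labels when none exist) and are not the library's classes.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-ins for illustration; not the boilerpipe classes themselves.
final class SketchBlock {
    final String text;
    private Set<String> labels; // created lazily, so it stays null until a label is added

    SketchBlock(String text) {
        this.text = text;
    }

    void addLabel(String label) {
        if (labels == null) {
            labels = new HashSet<>();
        }
        labels.add(label);
    }

    boolean hasLabel(String label) {
        return labels != null && labels.contains(label);
    }

    // Returns null if no labels exist.
    Set<String> getLabels() {
        return labels;
    }
}

final class SketchDocument {
    private final String title; // null if no title has been set
    private final List<SketchBlock> blocks;

    // Document with the given blocks and no title.
    SketchDocument(List<SketchBlock> blocks) {
        this(null, blocks);
    }

    // Document with the given blocks and the given title.
    SketchDocument(String title, List<SketchBlock> blocks) {
        this.title = title;
        this.blocks = new ArrayList<>(blocks);
    }

    String getTitle() {
        return title;
    }

    List<SketchBlock> getBlocks() {
        return blocks;
    }
}
```

Returning `null` for a missing title or label set follows the wording of the entries; a fresh design might prefer `Optional` or an empty set, which is a choice this sketch does not make.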
TextDocument containing the extracted TextBlocks.
TextDocument containing the extracted TextBlocks.