de.l3s.boilerpipe.extractors
Class ExtractorBase

java.lang.Object
  extended by de.l3s.boilerpipe.extractors.ExtractorBase
All Implemented Interfaces:
BoilerpipeExtractor, BoilerpipeFilter
Direct Known Subclasses:
ArticleExtractor, ArticleSentencesExtractor, CanolaExtractor, DefaultExtractor, KeepEverythingExtractor, KeepEverythingWithMinKWordsExtractor, LargestContentExtractor, NumWordsRulesExtractor

public abstract class ExtractorBase
extends java.lang.Object
implements BoilerpipeExtractor

The base class for extractors. It also provides helper methods for quickly retrieving the text that remains after processing.
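
The "helper methods" idea can be illustrated with a minimal, self-contained sketch in which several convenience overloads funnel into one abstract method. This is not boilerpipe's source: `SketchExtractorBase`, `TagStrippingExtractor`, and `extract` are hypothetical names invented for illustration, and the tag stripping is a toy stand-in for real extraction. The real methods declare the checked `BoilerpipeProcessingException`; the sketch uses unchecked exceptions for brevity.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import org.xml.sax.InputSource;

/**
 * Hypothetical stand-in, NOT boilerpipe's ExtractorBase: shows how
 * convenience overloads can delegate to a single abstract method.
 */
abstract class SketchExtractorBase {

    /** The one method a concrete extractor implements (hypothetical name). */
    protected abstract String extract(Reader html);

    /** String overload: wrap the markup in a Reader and delegate. */
    public String getText(String html) {
        return getText(new StringReader(html));
    }

    /** InputSource overload: this sketch handles character streams only. */
    public String getText(InputSource is) {
        Reader r = is.getCharacterStream();
        if (r == null) {
            throw new IllegalArgumentException(
                "sketch only supports an InputSource with a character stream");
        }
        return getText(r);
    }

    /** Reader overload: the common sink for the other overloads. */
    public String getText(Reader r) {
        return extract(r);
    }
}

/** Toy subclass: drops everything between '<' and '>' and trims the rest. */
class TagStrippingExtractor extends SketchExtractorBase {
    @Override
    protected String extract(Reader html) {
        StringBuilder out = new StringBuilder();
        boolean inTag = false;
        try {
            int c;
            while ((c = html.read()) != -1) {
                if (c == '<') {
                    inTag = true;
                } else if (c == '>') {
                    inTag = false;
                } else if (!inTag) {
                    out.append((char) c);
                }
            }
        } catch (IOException e) {
            // Rethrow unchecked so the sketch's overloads stay signature-free.
            throw new UncheckedIOException(e);
        }
        return out.toString().trim();
    }
}
```

With this layout a subclass supplies only the extraction step, and callers can pass whichever input type they already have.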

Author:
Christian Kohlschütter

Constructor Summary
ExtractorBase()
           
 
Method Summary
 java.lang.String getText(org.xml.sax.InputSource is)
          Extracts text from the HTML code available from the given InputSource.
 java.lang.String getText(java.io.Reader r)
          Extracts text from the HTML code available from the given Reader.
 java.lang.String getText(java.lang.String html)
          Extracts text from the HTML code given as a String.
 java.lang.String getText(TextDocument doc)
          Extracts text from the given TextDocument object.
 java.lang.String getText(java.net.URL url)
          Extracts text from the HTML code available from the given URL.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 
Methods inherited from interface de.l3s.boilerpipe.BoilerpipeFilter
process
 

Constructor Detail

ExtractorBase

public ExtractorBase()

Method Detail

getText

public java.lang.String getText(java.lang.String html)
                         throws BoilerpipeProcessingException
Extracts text from the HTML code given as a String.

Specified by:
getText in interface BoilerpipeExtractor
Parameters:
html - The HTML code as a String.
Returns:
The extracted text.
Throws:
BoilerpipeProcessingException

getText

public java.lang.String getText(org.xml.sax.InputSource is)
                         throws BoilerpipeProcessingException
Extracts text from the HTML code available from the given InputSource.

Specified by:
getText in interface BoilerpipeExtractor
Parameters:
is - The InputSource containing the HTML.
Returns:
The extracted text.
Throws:
BoilerpipeProcessingException

getText

public java.lang.String getText(java.net.URL url)
                         throws BoilerpipeProcessingException
Extracts text from the HTML code available from the given URL. NOTE: This method is intended mainly for demonstration purposes. If you are going to crawl the Web, consider using getText(InputSource) instead.

Parameters:
url - The URL pointing to the HTML code.
Returns:
The extracted text.
Throws:
BoilerpipeProcessingException
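
For crawling, the note above suggests getText(InputSource). One way a crawler might prepare that argument is sketched below, assuming the response body has already been downloaded by your own HTTP client. The helper class `InputSources` and its method `fromBytes` are hypothetical names, not part of boilerpipe; only the standard `org.xml.sax.InputSource` API is used. Whether the declared encoding is honoured depends on the underlying HTML parser.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.Charset;
import org.xml.sax.InputSource;

/** Hypothetical helper, not part of boilerpipe. */
final class InputSources {
    private InputSources() {}

    /**
     * Wraps an already-downloaded response body in an InputSource.
     *
     * @param body    raw bytes of the HTTP response body
     * @param charset charset reported by the server, or null if unknown
     * @param url     the page URL, recorded as the system identifier
     */
    static InputSource fromBytes(byte[] body, Charset charset, String url) {
        InputSource is = new InputSource(new ByteArrayInputStream(body));
        if (charset != null) {
            // Pass the declared encoding along as a hint for the parser.
            is.setEncoding(charset.name());
        }
        is.setSystemId(url);
        return is;
    }
}
```

The resulting object can then be handed to getText(InputSource) on any concrete extractor, which keeps fetching, timeouts, and retries under the crawler's control.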

getText

public java.lang.String getText(java.io.Reader r)
                         throws BoilerpipeProcessingException
Extracts text from the HTML code available from the given Reader.

Specified by:
getText in interface BoilerpipeExtractor
Parameters:
r - The Reader containing the HTML.
Returns:
The extracted text.
Throws:
BoilerpipeProcessingException

getText

public java.lang.String getText(TextDocument doc)
                         throws BoilerpipeProcessingException
Extracts text from the given TextDocument object.

Specified by:
getText in interface BoilerpipeExtractor
Parameters:
doc - The TextDocument.
Returns:
The extracted text.
Throws:
BoilerpipeProcessingException