de.l3s.boilerpipe.sax
Class HTMLHighlighter

java.lang.Object
  extended by de.l3s.boilerpipe.sax.HTMLHighlighter

public final class HTMLHighlighter
extends java.lang.Object

Highlights text blocks in an HTML document that have been marked as "content" in the corresponding TextDocument.

Author:
Christian Kohlschütter

Method Summary
 java.lang.String getExtraStyleSheet()
          Returns the extra stylesheet definition that will be inserted in the HEAD element.
 java.lang.String getPostHighlight()
          Returns the string that will be inserted after any highlighted HTML block.
 java.lang.String getPreHighlight()
          Returns the string that will be inserted before any highlighted HTML block.
 boolean isOutputHighlightOnly()
          If true, only HTML enclosed within highlighted content will be returned.
static HTMLHighlighter newExtractingInstance()
          Creates a new HTMLHighlighter that is set up to return only the extracted HTML text, including enclosed markup.
static HTMLHighlighter newHighlightingInstance()
          Creates a new HTMLHighlighter that is set up to return the full HTML text, with the extracted text portion highlighted.
 java.lang.String process(TextDocument doc, org.xml.sax.InputSource is)
          Processes the given TextDocument and the original HTML text (as an InputSource).
 java.lang.String process(TextDocument doc, java.lang.String origHTML)
          Processes the given TextDocument and the original HTML text (as a String).
 java.lang.String process(java.net.URL url, BoilerpipeExtractor extractor)
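          Fetches the HTML document at the given URL, processes it with the given extractor, and returns the highlighted result.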
 void setExtraStyleSheet(java.lang.String extraStyleSheet)
          Sets the extra stylesheet definition that will be inserted in the HEAD element.
 void setOutputHighlightOnly(boolean outputHighlightOnly)
          Sets whether only HTML enclosed within highlighted content will be returned, or the whole HTML document.
 void setPostHighlight(java.lang.String postHighlight)
          Sets the string that will be inserted after any highlighted HTML block.
 void setPreHighlight(java.lang.String preHighlight)
          Sets the string that will be inserted prior to any highlighted HTML block.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Method Detail

newHighlightingInstance

public static HTMLHighlighter newHighlightingInstance()
Creates a new HTMLHighlighter that is set up to return the full HTML text, with the extracted text portion highlighted.


newExtractingInstance

public static HTMLHighlighter newExtractingInstance()
Creates a new HTMLHighlighter that is set up to return only the extracted HTML text, including enclosed markup.
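
The difference between the two factory methods can be illustrated with a small stand-alone model. This is only a sketch and does not use boilerpipe itself: the class HighlightModesSketch, its Segment type, and the render method are hypothetical names invented for the example, and the empty pre/post strings used for the extracting case are an assumption rather than documented defaults.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for illustration only; not part of boilerpipe.
class HighlightModesSketch {

    // One piece of the original HTML, flagged as content or boilerplate.
    static final class Segment {
        final String html;
        final boolean isContent;

        Segment(String html, boolean isContent) {
            this.html = html;
            this.isContent = isContent;
        }
    }

    // Content segments are wrapped in pre/post. Non-content segments are
    // copied verbatim, unless highlightOnly is true, in which case they
    // are dropped from the output.
    static String render(List<Segment> segments, boolean highlightOnly,
                         String pre, String post) {
        StringBuilder out = new StringBuilder();
        for (Segment s : segments) {
            if (s.isContent) {
                out.append(pre).append(s.html).append(post);
            } else if (!highlightOnly) {
                out.append(s.html);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Segment> page = Arrays.asList(
                new Segment("<div>Menu</div>", false),
                new Segment("<p>Article <b>text</b></p>", true),
                new Segment("<div>Footer</div>", false));

        // Highlighting mode: whole page, content wrapped in a marker span.
        System.out.println(render(page, false,
                "<span class=\"x-boilerpipe-mark1\">", "</span>"));

        // Extracting mode (assumed empty pre/post): content only,
        // with the markup inside it preserved.
        System.out.println(render(page, true, "", ""));
    }
}
```

In the real class, the corresponding switch is the outputHighlightOnly property, and the wrapping strings are the preHighlight and postHighlight properties described below.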


process

public java.lang.String process(TextDocument doc,
                                java.lang.String origHTML)
                         throws BoilerpipeProcessingException
Processes the given TextDocument and the original HTML text (as a String).

Parameters:
doc - The processed TextDocument.
origHTML - The original HTML document.
Throws:
BoilerpipeProcessingException

process

public java.lang.String process(TextDocument doc,
                                org.xml.sax.InputSource is)
                         throws BoilerpipeProcessingException
Processes the given TextDocument and the original HTML text (as an InputSource).

Parameters:
doc - The processed TextDocument.
is - The original HTML document.
Throws:
BoilerpipeProcessingException
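
The String and InputSource overloads differ only in how the original HTML is supplied. A caller holding the HTML in memory but needing an InputSource could wrap it as sketched below, using only the standard org.xml.sax and java.io classes. The class InputSourceSketch and its helper methods are hypothetical names for this example; whether the library performs the same conversion internally is an assumption.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration only; not part of boilerpipe.
class InputSourceSketch {

    // Wraps an in-memory HTML string in a SAX InputSource
    // backed by a character stream.
    static InputSource toInputSource(String html) {
        return new InputSource(new StringReader(html));
    }

    // Reads the character stream back into a String, to show that
    // the wrapped document is unchanged.
    static String readAll(InputSource is) {
        StringBuilder sb = new StringBuilder();
        try (Reader r = is.getCharacterStream()) {
            char[] buf = new char[1024];
            int n;
            while ((n = r.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        InputSource is = toInputSource("<html><body><p>Hello</p></body></html>");
        System.out.println(readAll(is));
    }
}
```

Note that an InputSource backed by a Reader can be consumed only once, so a fresh one is needed for each pass over the document.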

process

public java.lang.String process(java.net.URL url,
                                BoilerpipeExtractor extractor)
                         throws java.io.IOException,
                                BoilerpipeProcessingException,
                                org.xml.sax.SAXException
Fetches the HTML document at the given URL, processes it with the given extractor, and returns the highlighted result.

Throws:
java.io.IOException
BoilerpipeProcessingException
org.xml.sax.SAXException

isOutputHighlightOnly

public boolean isOutputHighlightOnly()
If true, only HTML enclosed within highlighted content will be returned.


setOutputHighlightOnly

public void setOutputHighlightOnly(boolean outputHighlightOnly)
Sets whether only HTML enclosed within highlighted content will be returned, or the whole HTML document.


getExtraStyleSheet

public java.lang.String getExtraStyleSheet()
Returns the extra stylesheet definition that will be inserted in the HEAD element. By default, this corresponds to a simple definition that marks text in class "x-boilerpipe-mark1" as inline text with yellow background.


setExtraStyleSheet

public void setExtraStyleSheet(java.lang.String extraStyleSheet)
Sets the extra stylesheet definition that will be inserted in the HEAD element. To disable, set it to the empty string: ""

Parameters:
extraStyleSheet - Plain HTML

getPreHighlight

public java.lang.String getPreHighlight()
Returns the string that will be inserted before any highlighted HTML block. By default, this corresponds to <span class="x-boilerpipe-mark1">


setPreHighlight

public void setPreHighlight(java.lang.String preHighlight)
Sets the string that will be inserted prior to any highlighted HTML block. To disable, set it to the empty string: ""


getPostHighlight

public java.lang.String getPostHighlight()
Returns the string that will be inserted after any highlighted HTML block. By default, this corresponds to </span>


setPostHighlight

public void setPostHighlight(java.lang.String postHighlight)
Sets the string that will be inserted after any highlighted HTML block. To disable, set it to the empty string: ""
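
The stylesheet, pre-highlight, and post-highlight strings work together: the class name used in the pre-highlight tag needs a matching selector in the stylesheet, and the post-highlight string must close the tag that the pre-highlight string opens. The sketch below builds one such consistent set of strings. The class HighlightMarkupSketch, the CSS class name "my-mark", and the colour are hypothetical choices for the example, and the generated CSS is not claimed to match the library's default definition.

```java
// Hypothetical helper for illustration only; not part of boilerpipe.
class HighlightMarkupSketch {

    // Builds a style element whose selector targets the given CSS class.
    static String styleSheet(String cssClass, String color) {
        return "<style type=\"text/css\">." + cssClass
                + " { display: inline; background-color: " + color + "; }</style>";
    }

    // Opening tag inserted before each highlighted block.
    static String preHighlight(String cssClass) {
        return "<span class=\"" + cssClass + "\">";
    }

    // Closing tag inserted after each highlighted block; it must match
    // the element opened by preHighlight.
    static String postHighlight() {
        return "</span>";
    }

    public static void main(String[] args) {
        String cssClass = "my-mark"; // assumed class name, chosen for the example
        System.out.println(styleSheet(cssClass, "lightgreen"));
        System.out.println(preHighlight(cssClass) + "content" + postHighlight());
    }
}
```

Strings built this way would be passed to setExtraStyleSheet, setPreHighlight, and setPostHighlight respectively.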