de.l3s.boilerpipe.sax
Class BoilerpipeHTMLContentHandler

java.lang.Object
  extended by de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler
All Implemented Interfaces:
org.xml.sax.ContentHandler

public class BoilerpipeHTMLContentHandler
extends java.lang.Object
implements org.xml.sax.ContentHandler

A simple SAX ContentHandler, used by BoilerpipeSAXInput. It can be driven by different parser implementations, e.g. NekoHTML and TagSoup.
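
The sketch below shows how a SAX ContentHandler is attached to a parser and how text can be grouped into blocks from the callbacks. It is not boilerpipe's implementation. To stay self-contained it uses only the JDK's built-in SAX parser on well-formed XHTML, and the class BlockCollectorSketch, its set of block-level tags, and its accessors are hypothetical stand-ins chosen for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical stand-in handler; DefaultHandler implements org.xml.sax.ContentHandler.
final class BlockCollectorSketch extends DefaultHandler {
    // Assumption: a tiny, fixed set of tags that start or end a text block.
    private static final Set<String> BLOCK_TAGS = Set.of("p", "div", "h1", "h2", "li");

    private final List<String> blocks = new ArrayList<>();
    private final StringBuilder buffer = new StringBuilder();
    private String title;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // Any pending text belongs to the previous block.
        if (isTitle(qName) || isBlockLevel(qName)) {
            flush();
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (isTitle(qName)) {
            title = buffer.toString().trim();
            buffer.setLength(0);
        } else if (isBlockLevel(qName)) {
            flush();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        buffer.append(ch, start, length);
    }

    private static boolean isTitle(String qName) {
        return "title".equalsIgnoreCase(qName);
    }

    private static boolean isBlockLevel(String qName) {
        return BLOCK_TAGS.contains(qName.toLowerCase(Locale.ROOT));
    }

    // Moves the buffered text into the block list, skipping whitespace-only runs.
    private void flush() {
        String text = buffer.toString().trim();
        buffer.setLength(0);
        if (!text.isEmpty()) {
            blocks.add(text);
        }
    }

    String getTitle() {
        return title;
    }

    List<String> getBlocks() {
        return List.copyOf(blocks);
    }

    // Parses the input and returns the handler; results are read only after parsing.
    static BlockCollectorSketch parse(String xhtml) {
        try {
            BlockCollectorSketch handler = new BlockCollectorSketch();
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setContentHandler(handler);
            reader.parse(new InputSource(new StringReader(xhtml)));
            return handler;
        } catch (Exception e) {
            throw new IllegalStateException("Parsing failed", e);
        }
    }
}
```

An HTML-tolerant parser such as NekoHTML or TagSoup would take the place of the JDK parser here. The handler is attached the same way because it depends only on the ContentHandler interface.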

Author:
Christian Kohlschütter

Constructor Summary
BoilerpipeHTMLContentHandler()
          Constructs a BoilerpipeHTMLContentHandler using the DefaultTagActionMap.
BoilerpipeHTMLContentHandler(TagActionMap tagActions)
          Constructs a BoilerpipeHTMLContentHandler using the given TagActionMap.
 
Method Summary
protected  void addTextBlock(TextBlock tb)
           
 void addWhitespaceIfNecessary()
           
 void characters(char[] ch, int start, int length)
           
 void endDocument()
           
 void endElement(java.lang.String uri, java.lang.String localName, java.lang.String qName)
           
 void endPrefixMapping(java.lang.String prefix)
           
 java.lang.String getTitle()
           
 void ignorableWhitespace(char[] ch, int start, int length)
           
 void processingInstruction(java.lang.String target, java.lang.String data)
           
 void recycle()
          Recycles this instance.
 void setDocumentLocator(org.xml.sax.Locator locator)
           
 void setTitle(java.lang.String s)
           
 void skippedEntity(java.lang.String name)
           
 void startDocument()
           
 void startElement(java.lang.String uri, java.lang.String localName, java.lang.String qName, org.xml.sax.Attributes atts)
           
 void startPrefixMapping(java.lang.String prefix, java.lang.String uri)
           
 TextDocument toTextDocument()
          Returns a TextDocument containing the extracted TextBlocks.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

BoilerpipeHTMLContentHandler

public BoilerpipeHTMLContentHandler()
Constructs a BoilerpipeHTMLContentHandler using the DefaultTagActionMap.


BoilerpipeHTMLContentHandler

public BoilerpipeHTMLContentHandler(TagActionMap tagActions)
Constructs a BoilerpipeHTMLContentHandler using the given TagActionMap.

Parameters:
tagActions - The TagActionMap to use, e.g. DefaultTagActionMap.

Method Detail

recycle

public void recycle()
Recycles this instance.
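
The page does not say what recycling resets. A common reading is that it clears per-document state so one handler object can serve several parses. The sketch below illustrates that pattern under this assumption. RecyclableHandlerSketch and its fields are hypothetical and do not reflect boilerpipe's internals.

```java
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler illustrating a reset-and-reuse pattern.
final class RecyclableHandlerSketch extends DefaultHandler {
    private final List<String> blocks = new ArrayList<>();
    private final StringBuilder buffer = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        buffer.append(ch, start, length);
    }

    @Override
    public void endDocument() {
        // Emit whatever text is still buffered as the final block.
        if (buffer.length() > 0) {
            blocks.add(buffer.toString());
            buffer.setLength(0);
        }
    }

    // Assumed semantics: drop all per-document state before the next parse.
    void recycle() {
        blocks.clear();
        buffer.setLength(0);
    }

    // Returns a copy so later recycling does not change earlier results.
    List<String> snapshot() {
        return new ArrayList<>(blocks);
    }
}
```

Taking a copy of the results before recycling keeps the output of one parse from being cleared by the next.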


endDocument

public void endDocument()
                 throws org.xml.sax.SAXException
Specified by:
endDocument in interface org.xml.sax.ContentHandler
Throws:
org.xml.sax.SAXException

endPrefixMapping

public void endPrefixMapping(java.lang.String prefix)
                      throws org.xml.sax.SAXException
Specified by:
endPrefixMapping in interface org.xml.sax.ContentHandler
Throws:
org.xml.sax.SAXException

ignorableWhitespace

public void ignorableWhitespace(char[] ch,
                                int start,
                                int length)
                         throws org.xml.sax.SAXException
Specified by:
ignorableWhitespace in interface org.xml.sax.ContentHandler
Throws:
org.xml.sax.SAXException

processingInstruction

public void processingInstruction(java.lang.String target,
                                  java.lang.String data)
                           throws org.xml.sax.SAXException
Specified by:
processingInstruction in interface org.xml.sax.ContentHandler
Throws:
org.xml.sax.SAXException

setDocumentLocator

public void setDocumentLocator(org.xml.sax.Locator locator)
Specified by:
setDocumentLocator in interface org.xml.sax.ContentHandler

skippedEntity

public void skippedEntity(java.lang.String name)
                   throws org.xml.sax.SAXException
Specified by:
skippedEntity in interface org.xml.sax.ContentHandler
Throws:
org.xml.sax.SAXException

startDocument

public void startDocument()
                   throws org.xml.sax.SAXException
Specified by:
startDocument in interface org.xml.sax.ContentHandler
Throws:
org.xml.sax.SAXException

startPrefixMapping

public void startPrefixMapping(java.lang.String prefix,
                               java.lang.String uri)
                        throws org.xml.sax.SAXException
Specified by:
startPrefixMapping in interface org.xml.sax.ContentHandler
Throws:
org.xml.sax.SAXException

startElement

public void startElement(java.lang.String uri,
                         java.lang.String localName,
                         java.lang.String qName,
                         org.xml.sax.Attributes atts)
                  throws org.xml.sax.SAXException
Specified by:
startElement in interface org.xml.sax.ContentHandler
Throws:
org.xml.sax.SAXException

endElement

public void endElement(java.lang.String uri,
                       java.lang.String localName,
                       java.lang.String qName)
                throws org.xml.sax.SAXException
Specified by:
endElement in interface org.xml.sax.ContentHandler
Throws:
org.xml.sax.SAXException

characters

public void characters(char[] ch,
                       int start,
                       int length)
                throws org.xml.sax.SAXException
Specified by:
characters in interface org.xml.sax.ContentHandler
Throws:
org.xml.sax.SAXException
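
Under the SAX contract, a parser may deliver one run of text through several characters calls, and only the range from start to start + length of the array is valid. The sketch below feeds a handler by hand to show this. ChunkedTextSketch is a hypothetical class and says nothing about how this handler buffers text internally.

```java
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler that accumulates text across several characters() calls.
final class ChunkedTextSketch extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        // Append only the valid slice; the rest of the array may hold unrelated data.
        text.append(ch, start, length);
    }

    String text() {
        return text.toString();
    }

    // Simulates a parser that reuses one buffer and splits the text into two chunks.
    static String demo() {
        ChunkedTextSketch handler = new ChunkedTextSketch();
        char[] shared = "xxHello, world!yy".toCharArray();
        handler.characters(shared, 2, 7); // the slice "Hello, "
        handler.characters(shared, 9, 6); // the slice "world!"
        return handler.text();
    }
}
```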

addTextBlock

protected void addTextBlock(TextBlock tb)

getTitle

public java.lang.String getTitle()

setTitle

public void setTitle(java.lang.String s)

toTextDocument

public TextDocument toTextDocument()
Returns a TextDocument containing the extracted TextBlocks. NOTE: Call this only after parsing has finished.

Returns:
The TextDocument

addWhitespaceIfNecessary

public void addWhitespaceIfNecessary()