de.l3s.boilerpipe.sax
Class BoilerpipeHTMLParser

java.lang.Object
  extended by org.apache.xerces.parsers.XMLParser
      extended by org.apache.xerces.parsers.AbstractXMLDocumentParser
          extended by org.apache.xerces.parsers.AbstractSAXParser
              extended by de.l3s.boilerpipe.sax.BoilerpipeHTMLParser
All Implemented Interfaces:
org.apache.xerces.xni.XMLDocumentHandler, org.apache.xerces.xni.XMLDTDContentModelHandler, org.apache.xerces.xni.XMLDTDHandler, org.apache.xerces.xs.PSVIProvider, org.xml.sax.Parser, org.xml.sax.XMLReader

public class BoilerpipeHTMLParser
extends org.apache.xerces.parsers.AbstractSAXParser

A simple SAX Parser, used by BoilerpipeSAXInput. The parser uses CyberNeko to parse HTML content.

Author:
Christian Kohlschütter

Nested Class Summary
 
Nested classes/interfaces inherited from class org.apache.xerces.parsers.AbstractSAXParser
org.apache.xerces.parsers.AbstractSAXParser.AttributesProxy, org.apache.xerces.parsers.AbstractSAXParser.LocatorProxy
 
Field Summary
 
Fields inherited from class org.apache.xerces.parsers.AbstractSAXParser
ALLOW_UE_AND_NOTATION_EVENTS, DECLARATION_HANDLER, DOM_NODE, fContentHandler, fDeclaredAttrs, fDeclHandler, fDocumentHandler, fDTDHandler, fLexicalHandler, fLexicalHandlerParameterEntities, fNamespaceContext, fNamespacePrefixes, fNamespaces, fParseInProgress, fQName, fResolveDTDURIs, fStandalone, fUseEntityResolver2, fVersion, fXMLNSURIs, LEXICAL_HANDLER, NAMESPACES, STRING_INTERNING
 
Fields inherited from class org.apache.xerces.parsers.AbstractXMLDocumentParser
fDocumentSource, fDTDContentModelSource, fDTDSource, fInDTD
 
Fields inherited from class org.apache.xerces.parsers.XMLParser
ENTITY_RESOLVER, ERROR_HANDLER, fConfiguration
 
Fields inherited from interface org.apache.xerces.xni.XMLDTDHandler
CONDITIONAL_IGNORE, CONDITIONAL_INCLUDE
 
Fields inherited from interface org.apache.xerces.xni.XMLDTDContentModelHandler
OCCURS_ONE_OR_MORE, OCCURS_ZERO_OR_MORE, OCCURS_ZERO_OR_ONE, SEPARATOR_CHOICE, SEPARATOR_SEQUENCE
 
Constructor Summary
BoilerpipeHTMLParser()
          Constructs a BoilerpipeHTMLParser using a default HTML content handler.
BoilerpipeHTMLParser(BoilerpipeHTMLContentHandler contentHandler)
          Constructs a BoilerpipeHTMLParser using the given BoilerpipeHTMLContentHandler.
 
Method Summary
 TextDocument toTextDocument()
          Returns a TextDocument containing the extracted TextBlock s.
 
Methods inherited from class org.apache.xerces.parsers.AbstractSAXParser
attributeDecl, characters, comment, doctypeDecl, elementDecl, endCDATA, endDocument, endDTD, endElement, endExternalSubset, endGeneralEntity, endNamespaceMapping, endParameterEntity, externalEntityDecl, getAttributePSVI, getAttributePSVIByName, getContentHandler, getDeclHandler, getDTDHandler, getElementPSVI, getEntityResolver, getErrorHandler, getFeature, getLexicalHandler, getProperty, ignorableWhitespace, internalEntityDecl, notationDecl, parse, parse, processingInstruction, reset, setContentHandler, setDeclHandler, setDocumentHandler, setDTDHandler, setEntityResolver, setErrorHandler, setFeature, setLexicalHandler, setLocale, setProperty, startCDATA, startDocument, startElement, startExternalSubset, startGeneralEntity, startNamespaceMapping, startParameterEntity, unparsedEntityDecl, xmlDecl
 
Methods inherited from class org.apache.xerces.parsers.AbstractXMLDocumentParser
any, element, empty, emptyElement, endAttlist, endConditional, endContentModel, endGroup, getDocumentSource, getDTDContentModelSource, getDTDSource, ignoredCharacters, occurrence, pcdata, separator, setDocumentSource, setDTDContentModelSource, setDTDSource, startAttlist, startConditional, startContentModel, startDTD, startGroup, textDecl
 
Methods inherited from class org.apache.xerces.parsers.XMLParser
parse
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

BoilerpipeHTMLParser

public BoilerpipeHTMLParser()
Constructs a BoilerpipeHTMLParser using a default HTML content handler.


BoilerpipeHTMLParser

public BoilerpipeHTMLParser(BoilerpipeHTMLContentHandler contentHandler)
Constructs a BoilerpipeHTMLParser using the given BoilerpipeHTMLContentHandler.

Parameters:
contentHandler -
Method Detail

toTextDocument

public TextDocument toTextDocument()
Returns a TextDocument containing the extracted TextBlock s. NOTE: Only call this after AbstractSAXParser.parse(org.xml.sax.InputSource).

Returns:
The TextDocument