java.lang.Object
org.apache.xerces.parsers.XMLParser
org.apache.xerces.parsers.AbstractXMLDocumentParser
org.apache.xerces.parsers.AbstractSAXParser
de.l3s.boilerpipe.sax.BoilerpipeHTMLParser
public class BoilerpipeHTMLParser extends org.apache.xerces.parsers.AbstractSAXParser implements BoilerpipeDocumentSource
A simple SAX parser, used by BoilerpipeSAXInput. The parser uses CyberNeko to parse HTML content.
Nested Class Summary |
---|
Nested classes/interfaces inherited from class org.apache.xerces.parsers.AbstractSAXParser |
---|
org.apache.xerces.parsers.AbstractSAXParser.AttributesProxy, org.apache.xerces.parsers.AbstractSAXParser.LocatorProxy |
Field Summary |
---|
Fields inherited from class org.apache.xerces.parsers.AbstractSAXParser |
---|
ALLOW_UE_AND_NOTATION_EVENTS, DECLARATION_HANDLER, DOM_NODE, fContentHandler, fDeclaredAttrs, fDeclHandler, fDocumentHandler, fDTDHandler, fLexicalHandler, fLexicalHandlerParameterEntities, fNamespaceContext, fNamespacePrefixes, fNamespaces, fParseInProgress, fQName, fResolveDTDURIs, fStandalone, fUseEntityResolver2, fVersion, fXMLNSURIs, LEXICAL_HANDLER, NAMESPACES, STRING_INTERNING |
Fields inherited from class org.apache.xerces.parsers.AbstractXMLDocumentParser |
---|
fDocumentSource, fDTDContentModelSource, fDTDSource, fInDTD |
Fields inherited from class org.apache.xerces.parsers.XMLParser |
---|
ENTITY_RESOLVER, ERROR_HANDLER, fConfiguration |
Fields inherited from interface org.apache.xerces.xni.XMLDTDHandler |
---|
CONDITIONAL_IGNORE, CONDITIONAL_INCLUDE |
Fields inherited from interface org.apache.xerces.xni.XMLDTDContentModelHandler |
---|
OCCURS_ONE_OR_MORE, OCCURS_ZERO_OR_MORE, OCCURS_ZERO_OR_ONE, SEPARATOR_CHOICE, SEPARATOR_SEQUENCE |
Constructor Summary | |
---|---|
|
BoilerpipeHTMLParser()
Constructs a BoilerpipeHTMLParser using a default HTML content handler. |
|
BoilerpipeHTMLParser(BoilerpipeHTMLContentHandler contentHandler)
Constructs a BoilerpipeHTMLParser using the given BoilerpipeHTMLContentHandler. |
protected |
BoilerpipeHTMLParser(boolean ignore)
|
Method Summary | |
---|---|
void |
setContentHandler(BoilerpipeHTMLContentHandler contentHandler)
|
void |
setContentHandler(org.xml.sax.ContentHandler contentHandler)
|
TextDocument |
toTextDocument()
Returns a TextDocument containing the extracted TextBlocks. |
Methods inherited from class org.apache.xerces.parsers.AbstractSAXParser |
---|
attributeDecl, characters, comment, doctypeDecl, elementDecl, endCDATA, endDocument, endDTD, endElement, endExternalSubset, endGeneralEntity, endNamespaceMapping, endParameterEntity, externalEntityDecl, getAttributePSVI, getAttributePSVIByName, getContentHandler, getDeclHandler, getDTDHandler, getElementPSVI, getEntityResolver, getErrorHandler, getFeature, getLexicalHandler, getProperty, ignorableWhitespace, internalEntityDecl, notationDecl, parse, parse, processingInstruction, reset, setDeclHandler, setDocumentHandler, setDTDHandler, setEntityResolver, setErrorHandler, setFeature, setLexicalHandler, setLocale, setProperty, startCDATA, startDocument, startElement, startExternalSubset, startGeneralEntity, startNamespaceMapping, startParameterEntity, unparsedEntityDecl, xmlDecl |
Methods inherited from class org.apache.xerces.parsers.AbstractXMLDocumentParser |
---|
any, element, empty, emptyElement, endAttlist, endConditional, endContentModel, endGroup, getDocumentSource, getDTDContentModelSource, getDTDSource, ignoredCharacters, occurrence, pcdata, separator, setDocumentSource, setDTDContentModelSource, setDTDSource, startAttlist, startConditional, startContentModel, startDTD, startGroup, textDecl |
Methods inherited from class org.apache.xerces.parsers.XMLParser |
---|
parse |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Constructor Detail |
---|
public BoilerpipeHTMLParser()
Constructs a BoilerpipeHTMLParser using a default HTML content handler.
public BoilerpipeHTMLParser(BoilerpipeHTMLContentHandler contentHandler)
Constructs a BoilerpipeHTMLParser using the given BoilerpipeHTMLContentHandler.
Parameters:
contentHandler -
protected BoilerpipeHTMLParser(boolean ignore)
Method Detail |
---|
public void setContentHandler(BoilerpipeHTMLContentHandler contentHandler)
public void setContentHandler(org.xml.sax.ContentHandler contentHandler)
Specified by:
setContentHandler in interface org.xml.sax.XMLReader
Overrides:
setContentHandler in class org.apache.xerces.parsers.AbstractSAXParser
public TextDocument toTextDocument()
Returns a TextDocument containing the extracted TextBlocks. NOTE: Only call this after AbstractSAXParser.parse(org.xml.sax.InputSource).
Specified by:
toTextDocument in interface BoilerpipeDocumentSource
Returns:
TextDocument
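The call-order note on toTextDocument() (parse first, then collect the result) can be illustrated with a minimal, JDK-only sketch. This is not boilerpipe code. The names ParseThenCollect, TextCollectingHandler, toTextBlocks and extract are hypothetical stand-ins, and the sketch uses the JDK's built-in XML SAX parser instead of CyberNeko, so the input must be well-formed XML rather than arbitrary HTML.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class ParseThenCollect {

    /** Hypothetical stand-in for a content handler that gathers text blocks. */
    static class TextCollectingHandler extends DefaultHandler {
        private final List<String> blocks = new ArrayList<>();
        private final StringBuilder current = new StringBuilder();
        private boolean finished = false;

        @Override
        public void startDocument() {
            // Reset state so the handler can be reused for another parse.
            blocks.clear();
            current.setLength(0);
            finished = false;
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // SAX may deliver text in several chunks, so accumulate it.
            current.append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            flush();
        }

        @Override
        public void endDocument() {
            flush();
            finished = true;
        }

        /** Stores the accumulated text as one block if it is not blank. */
        private void flush() {
            String text = current.toString().trim();
            if (!text.isEmpty()) {
                blocks.add(text);
            }
            current.setLength(0);
        }

        /** Analogous in spirit to toTextDocument(): only valid after parsing. */
        List<String> toTextBlocks() {
            if (!finished) {
                throw new IllegalStateException("parse(...) has not completed yet");
            }
            return new ArrayList<>(blocks);
        }
    }

    static List<String> extract(String xhtml) throws Exception {
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        TextCollectingHandler handler = new TextCollectingHandler();
        reader.setContentHandler(handler);
        // Step 1: parse. Step 2: ask for the collected result.
        reader.parse(new InputSource(new StringReader(xhtml)));
        return handler.toTextBlocks();
    }

    public static void main(String[] args) throws Exception {
        String doc = "<html><body><p>First block</p><p>Second block</p></body></html>";
        System.out.println(extract(doc)); // prints [First block, Second block]
    }
}
```

In this sketch, asking for the result before a parse has finished raises an IllegalStateException. Whether BoilerpipeHTMLParser itself fails, returns an empty document, or does something else in that situation is not stated on this page, so treat the NOTE above as the contract.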