org.hippoecm.hst.utils
Class SimpleHtmlExtractor

java.lang.Object
  extended by org.hippoecm.hst.utils.SimpleHtmlExtractor

public class SimpleHtmlExtractor
extends Object

Simple HTML Tag Extractor

Version:
$Id: SimpleHtmlExtractor.java 22564 2010-04-27 12:53:45Z wko $

Method Summary
protected static org.htmlcleaner.HtmlCleaner getHtmlCleaner()
           
static String getInnerHtml(String html, String tagName, boolean byHtmlCleaner)
          Extracts the inner HTML of the first tag matching tagName.
static String getInnerText(String html, String tagName)
          Extracts the inner text of the first tag matching tagName.
static String getText(String html)
          Extracts the text content of the HTML markup.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Method Detail

getHtmlCleaner

protected static org.htmlcleaner.HtmlCleaner getHtmlCleaner()

getInnerHtml

public static String getInnerHtml(String html,
                                  String tagName,
                                  boolean byHtmlCleaner)
Extracts the inner HTML of the first tag matching tagName. If byHtmlCleaner is true, the HTML Cleaner library is used to extract the inner content of that tag.

The byHtmlCleaner option gives more correct results for complex HTML, but it costs more because the input must be cleaned first. For simple HTML input where performance matters, set byHtmlCleaner to false to use simple extraction. For more complex input where correctness matters more, set byHtmlCleaner to true and accept the extra cost.

If tagName is null or empty, then the root element is used.

Parameters:
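html - the HTML input
tagName - the name of the tag to find (the root tag is allowed), or null/empty for the root tag
byHtmlCleaner - true to extract with the HTML Cleaner library, false to use simple extraction
Returns:
the inner HTML of the tag, or null when the tag is not found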
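
To illustrate what "simple extracting" of the first matching tag can look like, here is a minimal sketch. It is not the SimpleHtmlExtractor source: the class InnerHtmlSketch, the method innerHtml, and the regular-expression approach are assumptions made for illustration. The sketch also omits the root-element fallback for a null or empty tagName and does not handle nested tags of the same name, which is the kind of input where a cleaning parser would be the safer choice.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not the implementation of SimpleHtmlExtractor.
final class InnerHtmlSketch {

    // Returns the markup between the first <tagName ...> and the next </tagName>,
    // or null when no such pair is found.
    // Assumption: unlike the documented method, a null or empty tagName
    // yields null here instead of falling back to the root element.
    static String innerHtml(String html, String tagName) {
        if (html == null || tagName == null || tagName.isEmpty()) {
            return null;
        }
        String name = Pattern.quote(tagName);
        // Opening tag with optional attributes, lazily captured content, closing tag.
        Pattern pattern = Pattern.compile(
                "<" + name + "(?:\\s[^>]*)?>(.*?)</" + name + "\\s*>",
                Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
        Matcher matcher = pattern.matcher(html);
        return matcher.find() ? matcher.group(1) : null;
    }
}
```

Called as InnerHtmlSketch.innerHtml(html, "title"), the sketch returns the content of the first title element, or null if there is none. The lazy match stops at the first closing tag, so a div nested inside another div would be cut short.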

getInnerText

public static String getInnerText(String html,
                                  String tagName)
Extracts the inner text of the first tag matching tagName.

If tagName is null or empty, then the root element is used.

Parameters:
html - the HTML input
tagName - the name of the tag to find (the root tag is allowed), or null/empty for the root tag
Returns:
the inner text of the tag
getText

public static String getText(String html)
Extracts the text content of the HTML markup.

Parameters:
html - the HTML input
Returns:
the text extracted from the markup
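
As a rough picture of what extracting text from markup involves, the sketch below strips tags and collapses whitespace. It is an assumption-laden illustration, not the behavior of getText: the class PlainTextSketch and the method text are hypothetical, and this page does not say how the real method treats whitespace, entities, or script content.

```java
// Hypothetical illustration only; not the implementation of SimpleHtmlExtractor.getText.
final class PlainTextSketch {

    // Replaces every tag with a space, collapses runs of whitespace,
    // and trims the result. Character entities such as &amp; are left as-is.
    static String text(String html) {
        if (html == null) {
            return null;
        }
        return html.replaceAll("<[^>]*>", " ")
                   .replaceAll("\\s+", " ")
                   .trim();
    }
}
```

Replacing tags with a space rather than an empty string keeps words from adjacent elements apart, at the cost of an extra space before punctuation that directly follows a closing tag.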


Copyright © 2008-2012 Hippo. All Rights Reserved.