- Common usage
- Providing custom tag info set
- Example: Basic usage
- Example: Use cleaner from multiple threads
- Example: Traverse DOM tree
- Example: Setting cleaner transformations
Common usage
Typically the following steps are taken:
// create an instance of HtmlCleaner
HtmlCleaner cleaner = new HtmlCleaner();

// take default cleaner properties
CleanerProperties props = cleaner.getProperties();

// customize cleaner's behaviour with property setters
props.setXXX(...);

// Clean HTML taken from a simple string, file, URL, input stream,
// input source or reader. The result is the root node of the created
// tree-like structure. A single cleaner instance may be safely used
// multiple times.
TagNode node = cleaner.clean(...);

// optionally find parts of the DOM or modify some nodes
TagNode[] myNodes = node.getElementsByXXX(...);
// and/or
Object[] myNodes = node.evaluateXPath(xPathExpression);
// and/or
aNode.removeFromTree();
// and/or
aNode.addAttribute(attName, attValue);
// and/or
aNode.removeAttribute(attName);
// and/or
cleaner.setInnerHtml(aNode, htmlContent);
// and/or do some other tree manipulation/traversal

// serialize a node to a file, output stream, DOM, JDom...
new XXXSerializer(props).writeXmlXXX(aNode, ...);
myJDom = new JDomSerializer(props, true).createJDom(aNode);
myDom = new DomSerializer(props, true).createDOM(aNode);
Providing custom tag info set
HtmlCleaner implements a default HTML tag set and rules for balancing tags that
are similar to browser behavior. However, the user is free to implement the interface
ITagInfoProvider
or extend one of its implementations in order to provide a custom tag info set.
The easiest way to do that is to write an XML configuration file that describes all tags
and their dependencies, and to use
ConfigFileTagProvider like this:
HtmlCleaner cleaner =
new HtmlCleaner( new ConfigFileTagProvider(myConfigFile) );
Perhaps the best starting point is the default tag ruleset description file.
It is the basis for
DefaultTagProvider.
For example, someone may not like the rule that an implicit TBODY is inserted before TR in an HTML table.
To remove it, find the <tag name="tr"... element in the XML and remove tbody from
its req-enclosing-tags section.
Example: Basic usage
CleanerProperties props = new CleanerProperties();

// set some properties to non-default values
props.setTranslateSpecialEntities(true);
props.setTransResCharsToNCR(true);
props.setOmitComments(true);

// do parsing
TagNode tagNode = new HtmlCleaner(props).clean(
    new URL("http://www.chinadaily.com.cn/")
);

// serialize to xml file
new PrettyXmlSerializer(props).writeToFile(
    tagNode, "chinadaily.xml", "utf-8"
);
Example: Use cleaner from multiple threads
This example demonstrates HtmlCleaner's thread safety: a single instance can be used safely from multiple threads.
final CleanerProperties props = new CleanerProperties();
final HtmlCleaner htmlCleaner = new HtmlCleaner(props);
final SimpleHtmlSerializer htmlSerializer = new SimpleHtmlSerializer(props);

// make 10 threads using the same cleaner and the same serializer
for (int i = 1; i <= 10; i++) {
    final String url = "http://search.eim.ebay.eu/Art/2-1/?en=100&ep=" + i;
    final String fileName = "c:/temp/ebay_art" + i + ".xml";
    new Thread(new Runnable() {
        public void run() {
            try {
                TagNode tagNode = htmlCleaner.clean(new URL(url));
                htmlSerializer.writeToFile(tagNode, fileName, "utf-8");
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }).start();
}
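The same sharing pattern can also be expressed with a fixed thread pool instead of raw Thread objects. The sketch below uses only java.util.concurrent from the JDK. PageCleaner, SharedCleanerSketch and cleanAll are hypothetical names, and PageCleaner is merely a stand-in for the shared HtmlCleaner and serializer pair, so the block shows the threading structure only and not the library's API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class SharedCleanerSketch {

    // Hypothetical stand-in for a thread-safe cleaner; it keeps no mutable state.
    interface PageCleaner {
        String clean(String html);
    }

    // Submits one task per page to a pool of `threads` workers that all share `cleaner`.
    static List<String> cleanAll(PageCleaner cleaner, List<String> pages, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> pending = new ArrayList<>();
            for (String page : pages) {
                pending.add(pool.submit(() -> cleaner.clean(page)));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> future : pending) {
                try {
                    // Futures are read in submission order, so results keep input order.
                    results.add(future.get());
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("interrupted while waiting", e);
                } catch (ExecutionException e) {
                    throw new IllegalStateException("cleaning failed", e.getCause());
                }
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Toy cleaner used only for demonstration: trims and lower-cases the input.
        PageCleaner toy = html -> html.trim().toLowerCase();
        System.out.println(cleanAll(toy, Arrays.asList(" <P>a</P> ", "<B>b</B>"), 4));
    }
}
```

With a real HtmlCleaner, the lambda passed as the cleaner would wrap the shared htmlCleaner and htmlSerializer instances from the example above.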
Example: Traverse DOM tree
Here the node visitor concept is used to traverse the tree structure and update some of the elements.
HtmlCleaner cleaner = new HtmlCleaner();
final String siteUrl = "http://www.themoscowtimes.com/";

TagNode node = cleaner.clean(new URL(siteUrl));

// traverse whole DOM and update images to absolute URLs
node.traverse(new TagNodeVisitor() {
    public boolean visit(TagNode tagNode, HtmlNode htmlNode) {
        if (htmlNode instanceof TagNode) {
            TagNode tag = (TagNode) htmlNode;
            String tagName = tag.getName();
            if ("img".equals(tagName)) {
                String src = tag.getAttributeByName("src");
                if (src != null) {
                    tag.setAttribute("src", Utils.fullUrl(siteUrl, src));
                }
            }
        } else if (htmlNode instanceof CommentNode) {
            CommentNode comment = ((CommentNode) htmlNode);
            comment.getContent().append(" -- By HtmlCleaner");
        }
        // tells visitor to continue traversing the DOM tree
        return true;
    }
});

SimpleHtmlSerializer serializer =
    new SimpleHtmlSerializer(cleaner.getProperties());
serializer.writeToFile(node, "c:/temp/themoscowtimes.html");
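The visitor relies on Utils.fullUrl(siteUrl, src) to turn a relative src value into an absolute URL. As a rough illustration of what that resolution step amounts to, the following stand-alone sketch uses only java.net.URI from the JDK. The names AbsoluteUrlSketch and toAbsolute are hypothetical, and HtmlCleaner's own helper may treat edge cases differently.

```java
import java.net.URI;

public class AbsoluteUrlSketch {

    // Resolves a possibly relative link against the page URL.
    // URI.create throws IllegalArgumentException for malformed input (e.g. raw spaces).
    static String toAbsolute(String pageUrl, String link) {
        return URI.create(pageUrl).resolve(link).toString();
    }

    public static void main(String[] args) {
        String site = "http://www.themoscowtimes.com/";
        // relative path: appended to the site root
        System.out.println(toAbsolute(site, "images/logo.png"));
        // root-relative path: replaces the path of the page URL
        System.out.println(toAbsolute(site, "/static/banner.gif"));
        // already absolute: returned unchanged
        System.out.println(toAbsolute(site, "http://example.org/pic.jpg"));
    }
}
```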
Example: Setting cleaner transformations
The following code snippet demonstrates how to set the transformations from the example:
...
HtmlCleaner cleaner = new HtmlCleaner(...);
...
CleanerTransformations transformations = new CleanerTransformations();

TagTransformation tt = new TagTransformation("cfoutput");
transformations.addTransformation(tt);

tt = new TagTransformation("c:block", "div", false);
transformations.addTransformation(tt);

tt = new TagTransformation("font", "span", true);
tt.addAttributeTransformation("size");
tt.addAttributeTransformation("face");
tt.addAttributeTransformation(
    "style",
    "${style};font-family=${face};font-size=${size};"
);
transformations.addTransformation(tt);
...
cleaner.getProperties().setCleanerTransformations(transformations);
...
TagNode node = cleaner.clean(...);
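The style rule in this snippet uses ${...} placeholders that refer to attributes of the source tag. To make the effect concrete, here is a stand-alone sketch of that kind of placeholder expansion using only the JDK. It is not HtmlCleaner's implementation: the names TemplateSketch and expand are hypothetical, and replacing a missing attribute with an empty string is an assumption.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TemplateSketch {

    // Matches ${name} and captures the attribute name.
    private static final Pattern PLACEHOLDER = Pattern.compile("\\$\\{([^}]+)\\}");

    // Replaces each ${name} in the template with the matching attribute value.
    static String expand(String template, Map<String, String> attributes) {
        Matcher matcher = PLACEHOLDER.matcher(template);
        StringBuffer out = new StringBuffer();
        while (matcher.find()) {
            // Assumption: an attribute that is not present expands to "".
            String value = attributes.getOrDefault(matcher.group(1), "");
            matcher.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        matcher.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // Attributes of a hypothetical <font> tag before transformation.
        Map<String, String> font = new LinkedHashMap<>();
        font.put("face", "Arial");
        font.put("size", "2");
        font.put("style", "color:red");
        // prints: color:red;font-family=Arial;font-size=2;
        System.out.println(expand("${style};font-family=${face};font-size=${size};", font));
    }
}
```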

