Often I find I need to pull out a bit of information from a webpage to reuse inside some code. I’ve always done this from the commandline using a combination of wget, HTML Tidy and xsltproc. Recently I’ve been doing the same thing in program code using some very handy tools written in Java.
Note: the example code below has been updated.
The commandline version looks like this:
wget -O - http://example.com | tidy -q -asxml | xsltproc somexsl.xsl -
where somexsl.xsl looks something like this:
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:h="http://www.w3.org/1999/xhtml" version="1.0">
  <xsl:output method="text"/>
  <xsl:template match="/">
    <xsl:value-of select="/h:html/h:head/h:title"/>
  </xsl:template>
</xsl:stylesheet>
(Note the h: prefix — tidy -asxml puts its output in the XHTML namespace, so a bare /html/head/title would match nothing.)
It’s also possible to do the same thing entirely in Java. John Cowan wrote a wonderful HTML parser called TagSoup that outputs SAX events using a do-the-best-I-can approach (“Just Keep On Truckin'” as he describes it), making sense of even the nastiest badly-written HTML. It produces output in cases where HTML Tidy gives up and tells you that errors in the input must be corrected before it can continue.
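TagSoup presents itself as an ordinary SAX2 XMLReader, so a quick way to see what it does with broken markup is to feed its events through an identity Transformer and print the well-formed result. A minimal sketch, assuming the TagSoup jar is on the classpath (the mangled HTML string is just an illustration):

```java
import java.io.StringReader;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import org.ccil.cowan.tagsoup.Parser;
import org.xml.sax.InputSource;

public class TagSoupDemo {
    public static void main(String[] args) throws Exception {
        // deliberately broken HTML: unclosed tags everywhere
        String html = "<html><title>Hello<p>unclosed <b>bold";
        Parser parser = new Parser(); // TagSoup's SAX2 XMLReader
        // an identity transform serializes the SAX events back out as well-formed XML
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.transform(new SAXSource(parser, new InputSource(new StringReader(html))),
                    new StreamResult(System.out));
    }
}
```

Every tag gets closed in the output, and the elements land in the XHTML namespace — which is why the XPath expressions later on need a namespace prefix.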
Because the SAX events just look like XML to any downstream code, it can be plugged into an XPath processor such as Jaxen. XPath processors need DOM trees to work with (because of the backwards-and-forwards-looking nature of the language which makes streaming processing difficult). JDOM contains a nice class called SAXBuilder that can do this SAX-to-DOM conversion, and handily Jaxen can work with JDOM trees directly. So, the Java equivalent of the commandline above is:
import java.net.URL;
import org.jdom.Document;
import org.jdom.Element;
import org.jdom.input.SAXBuilder;
import org.jaxen.jdom.JDOMXPath;

URL url = new URL("http://example.com");
SAXBuilder builder = new SAXBuilder("org.ccil.cowan.tagsoup.Parser"); // build a JDOM tree from the SAX stream TagSoup provides
Document doc = builder.build(url);
JDOMXPath titlePath = new JDOMXPath("/h:html/h:head/h:title");
titlePath.addNamespace("h", "http://www.w3.org/1999/xhtml"); // TagSoup reports elements in the XHTML namespace
String title = ((Element) titlePath.selectSingleNode(doc)).getText();
System.out.println("Title is " + title);
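The same stack works just as well for repeated values: Jaxen’s selectNodes returns every match, so pulling out, say, all the link targets on a page is a small variation. A sketch assuming the same TagSoup/JDOM/Jaxen jars on the classpath; the HTML fragment here is invented for the example:

```java
import java.io.StringReader;
import java.util.List;
import org.jdom.Attribute;
import org.jdom.Document;
import org.jdom.input.SAXBuilder;
import org.jaxen.jdom.JDOMXPath;

public class LinkLister {
    public static void main(String[] args) throws Exception {
        SAXBuilder builder = new SAXBuilder("org.ccil.cowan.tagsoup.Parser");
        // TagSoup happily wraps this fragment in html/body for us
        String html = "<p>See <a href=a.html>one</a> and <a href='b.html'>two</a>";
        Document doc = builder.build(new StringReader(html));
        JDOMXPath links = new JDOMXPath("//h:a/@href"); // every href on every anchor
        links.addNamespace("h", "http://www.w3.org/1999/xhtml");
        for (Object o : (List<?>) links.selectNodes(doc)) {
            System.out.println(((Attribute) o).getValue());
        }
    }
}
```

Swapping the StringReader for a URL, as in the title example above, turns this into a one-class link scraper.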