UPDATE: Oliver Roup has published updated code that uses the built-in XPath processor in JDK 1.5.
Some emails and comments on "Screenscraping HTML with TagSoup and XPath" alerted me that the example I gave on that page has fallen out of sync with the current release of JDOM and no longer works. I've reworked the example using Xalan 2.5.
The problem seems to be that JDOM is asking the TagSoup parser for full namespace support, which it's not able to give. This new example uses Xalan's SAX2DOM class to make a DOM tree out of the TagSoup SAX stream, then uses the simple XPathAPI wrapper to make the XPath call.
import java.net.URL;

import org.apache.xalan.xsltc.trax.SAX2DOM;
import org.apache.xpath.XPathAPI;
import org.apache.xpath.objects.XObject;
import org.ccil.cowan.tagsoup.Parser;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class example {
    public final static void main(String[] args) throws Exception {
        URL url = new URL("http://example.com");

        Parser p = new Parser();
        // to define the html: prefix (off by default)
        p.setFeature("http://xml.org/sax/features/namespace-prefixes", true);

        // SAX2DOM builds a DOM tree from the SAX events that TagSoup emits
        SAX2DOM sax2dom = new SAX2DOM();
        p.setContentHandler(sax2dom);
        p.parse(new InputSource(url.openStream()));
        Node doc = sax2dom.getDOM();

        // Evaluate the XPath against the DOM; prefixes resolve via the context node
        String titlePath = "/html:html/html:head/html:title";
        XObject title = XPathAPI.eval(doc, titlePath);
        System.out.println("Title is '" + title + "'");
    }
}
This code example can be compiled and run with just the TagSoup classes and the Xalan 2.5 main jar on the classpath.
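On JDK 1.5 or later, the XPath step can also be done with the standard javax.xml.xpath API instead of Xalan's XPathAPI. The following is a minimal sketch of that idea only; it is not Oliver Roup's code. To stay runnable without extra jars, it builds the DOM from a small well-formed XHTML string with the JDK parser rather than from TagSoup. The class name JdkXPathSketch, the helper methods, and the sample markup are made up for illustration, and it assumes the elements are in the XHTML namespace.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;

import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical class name; a sketch, not the code referred to in the UPDATE note.
class JdkXPathSketch {
    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    // Binds the "html" prefix used in the XPath to the XHTML namespace.
    static final NamespaceContext HTML_CONTEXT = new NamespaceContext() {
        public String getNamespaceURI(String prefix) {
            return "html".equals(prefix) ? XHTML_NS : XMLConstants.NULL_NS_URI;
        }
        public String getPrefix(String uri) {
            return XHTML_NS.equals(uri) ? "html" : null;
        }
        public Iterator<String> getPrefixes(String uri) {
            String prefix = getPrefix(uri);
            List<String> prefixes = (prefix == null)
                    ? Collections.<String>emptyList()
                    : Collections.singletonList(prefix);
            return prefixes.iterator();
        }
    };

    // Stand-in for the TagSoup + SAX2DOM step: parse well-formed XHTML into a DOM.
    static Document parse(String xhtml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xhtml)));
    }

    // Evaluates the same title path as the Xalan example and returns its string value.
    static String title(Document doc) throws XPathExpressionException {
        XPath xpath = XPathFactory.newInstance().newXPath();
        xpath.setNamespaceContext(HTML_CONTEXT);
        return xpath.evaluate("/html:html/html:head/html:title", doc);
    }

    public static void main(String[] args) throws Exception {
        String sample = "<html xmlns=\"" + XHTML_NS + "\">"
                + "<head><title>Hello</title></head><body/></html>";
        System.out.println("Title is '" + title(parse(sample)) + "'");
    }
}
```

Because the prefix is bound explicitly through the NamespaceContext, this variant does not depend on xmlns attributes being present in the DOM. In a real scraper the Document passed to title() would come from the TagSoup stream rather than from parse().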