Screenscraping HTML with TagSoup and XPath

April 13th, 2003  |  Published in java, xml  |  10 Comments

Often I find I need to pull out a bit of information from a webpage to reuse inside some code. I’ve always done this from the commandline using a combination of wget, HTML TIDY and xsltproc. Recently I’ve been doing the same thing in program code using some very handy tools written in Java.

Note: the example code below has been updated.


The commandline version looks like this:

wget -O - http://example.com | tidy -asxml - | xsltproc somexsl.xsl -

where somexsl.xsl looks something like this:

<?xml version='1.0'?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version='1.0'>
<xsl:output method="text" />
<xsl:template match="/">
<xsl:value-of select="/html/head/title" />
</xsl:template>
</xsl:stylesheet>
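
As a rough sketch of how the same stylesheet could be applied from Java rather than via xsltproc, the JDK's built-in JAXP transformer can run it. The class name TitleStylesheet, the method extractTitle and the sample page are made up for illustration, and the input here is assumed to be well-formed XML with no namespace already (the job HTML TIDY does in the pipeline above).

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class TitleStylesheet {
    // The same stylesheet as somexsl.xsl, inlined as a string for the sketch.
    static final String XSL =
        "<?xml version='1.0'?>"
        + "<xsl:stylesheet xmlns:xsl='http://www.w3.org/1999/XSL/Transform' version='1.0'>"
        + "<xsl:output method='text'/>"
        + "<xsl:template match='/'>"
        + "<xsl:value-of select='/html/head/title'/>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    // Applies the stylesheet to a well-formed XML string and returns the text output.
    static String extractTitle(String xml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSL)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("transform failed", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical, already-tidied page used only for this example.
        String page = "<html><head><title>Example Domain</title></head>"
            + "<body><p>Hello</p></body></html>";
        System.out.println(extractTitle(page));
    }
}
```

This only replaces the xsltproc step; fetching the page and cleaning up the HTML still have to happen before the transform.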

It’s also possible to do the same thing entirely in Java. John Cowan wrote a wonderful HTML parser called TagSoup that outputs SAX events using a do-the-best-I-can approach (“Just Keep On Truckin’” as he describes it) that attempts to make the best job of even the nastiest badly-written HTML. It produces output in cases when HTML TIDY gives up and tells you that errors in the input must be corrected before it can continue.
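
To give a feel for what "outputs SAX events" means for downstream code, here is a minimal sketch of a handler that collects the text of the title element as the events arrive. It uses the JDK's own SAX parser on a well-formed string so that it runs without extra libraries; the assumption is that TagSoup's org.ccil.cowan.tagsoup.Parser could be used as the XMLReader in its place when the input is real-world HTML. The class and method names are invented for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class TitleHandler extends DefaultHandler {
    private final StringBuilder title = new StringBuilder();
    private boolean inTitle = false;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("title".equals(localName)) inTitle = true;
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("title".equals(localName)) inTitle = false;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Text may arrive in several chunks, so append rather than assign.
        if (inTitle) title.append(ch, start, length);
    }

    // Parses the string with the JDK SAX parser and returns the collected title text.
    static String extractTitle(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true); // needed so localName is filled in
            // Assumption: a TagSoup Parser instance could be substituted here.
            XMLReader reader = factory.newSAXParser().getXMLReader();
            TitleHandler handler = new TitleHandler();
            reader.setContentHandler(handler);
            reader.parse(new InputSource(new StringReader(xml)));
            return handler.title.toString();
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><head><title>Example Domain</title></head><body/></html>";
        System.out.println(extractTitle(page));
    }
}
```

The handler never sees the original markup, only the stream of start, end and character events, which is why the source of those events can be swapped without changing it.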

Because the SAX events look like ordinary XML to any downstream code, TagSoup can be plugged into an XPath processor such as Jaxen. XPath processors need an in-memory document tree to work with, because XPath expressions can look both backwards and forwards through the document, which makes streaming processing difficult. JDOM contains a nice class called SAXBuilder that builds such a tree from SAX events, and handily Jaxen can work with JDOM trees directly. So, the Java equivalent of the commandline above is:

import java.net.URL;
import org.jaxen.jdom.JDOMXPath;
import org.jdom.Document;
import org.jdom.Element;
import org.jdom.input.SAXBuilder;

URL url = new URL("http://example.com");
// build a JDOM tree from a SAX stream provided by TagSoup
SAXBuilder builder = new SAXBuilder("org.ccil.cowan.tagsoup.Parser");
Document doc = builder.build(url);
JDOMXPath titlePath = new JDOMXPath("/h:html/h:head/h:title");
titlePath.addNamespace("h","http://www.w3.org/1999/xhtml");
String title = ((Element)titlePath.selectSingleNode(doc)).getText();
System.out.println("Title is "+title);
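
The h: prefix in the XPath above is there because the elements arrive in the XHTML namespace, and XPath 1.0 only matches namespaced elements through a bound prefix. The sketch below illustrates that point with nothing but the JDK: it swaps in javax.xml.xpath and a W3C DOM tree for Jaxen and JDOM, and parses an already well-formed XHTML string instead of TagSoup output. The class name, the evaluate helper and the sample page are assumptions made for the example.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class NamespacedTitle {
    static final String XHTML = "http://www.w3.org/1999/xhtml";

    // Binds the prefix "h" to the XHTML namespace, like addNamespace("h", ...) in Jaxen.
    static final NamespaceContext H_PREFIX = new NamespaceContext() {
        @Override
        public String getNamespaceURI(String prefix) {
            return "h".equals(prefix) ? XHTML : XMLConstants.NULL_NS_URI;
        }

        @Override
        public String getPrefix(String uri) {
            return XHTML.equals(uri) ? "h" : null;
        }

        @Override
        public Iterator<String> getPrefixes(String uri) {
            return XHTML.equals(uri)
                ? Collections.singletonList("h").iterator()
                : Collections.<String>emptyIterator();
        }
    };

    // Parses the XML into a namespace-aware DOM tree and evaluates the XPath as a string.
    static String evaluate(String xml, String expression) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            xpath.setNamespaceContext(H_PREFIX);
            return xpath.evaluate(expression, doc);
        } catch (Exception e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html xmlns=\"http://www.w3.org/1999/xhtml\">"
            + "<head><title>Example Domain</title></head><body/></html>";
        // The prefixed path finds the title; the unprefixed one matches nothing.
        System.out.println("Title is " + evaluate(page, "/h:html/h:head/h:title"));
        System.out.println("Unprefixed gives [" + evaluate(page, "/html/head/title") + "]");
    }
}
```

If an XPath over TagSoup output unexpectedly returns nothing, a missing namespace prefix is a likely first thing to check.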

Responses

  1. Jason says:

    April 14th, 2003 at 2:06 am (#)

    I’ve used HttpUnit for “web-scraping”. It worked for what I needed it to do.

  2. Matt Biddulph says:

    April 14th, 2003 at 1:56 pm (#)

    Zaid of http://www.altmobile.com suggests his Mobile Internet Studio product for point-and-click HTML scraping. It’s commercial software so I’ve removed his marketing-oriented comment from this page (no offense intended; just personal preference), but do check out his site if you’re interested.

  3. Development Notebook says:

    April 15th, 2003 at 9:59 am (#)

    Screenscraping with TagSoup

    Hmm, I wonder: TagSoup is an HTML cleaner, SAX-style. hackdiary: Screenscraping HTML with TagSoup and XPath…

  4. Leigh Dodds says:

    April 30th, 2003 at 8:46 pm (#)

    Have you tried tweaking the TagSoup schema?

    I’ve not looked too closely myself, but looks like I’m going to have to do that as, not surprisingly, when I ran TagSoup over an HTML page containing Trackback markup it discarded it.

    There’s probably another tutorial in there somewhere…

  5. Max V says:

    December 10th, 2003 at 2:37 pm (#)

    The result of my diploma thesis made the TagSoup approach simpler by giving the user a way to visually select elements in a browser. In the background an XSLT is created. Release scheduled for Xmas 2003.

    http://www.xam.de | http://wal.sf.net

  6. Eric says:

    December 15th, 2003 at 8:04 pm (#)

    Hi, sorry if this is a stupid question, but I’m trying to use TagSoup with JDOM and it seems like JDOM sets an internal feature for namespace-prefixes which won’t turn off even if I try to setFeature(…,false). Just wondering if TagSoup will support this feature soon or do I need to bug JDOM about it :) Thanks! -Eric

  7. Unicast says:

    December 27th, 2003 at 11:11 am (#)

    The Danish media RSS project

    I’ve started recoding the RSS feeds for the major Danish newspapers that I once had. Those feeds were done with…

  8. stuck-on-mobile-e-com ;-) says:

    January 4th, 2004 at 6:33 pm (#)

    html scraping

    by hackdiary: Screenscraping HTML with TagSoup and XPath. Looking forward to experimenting with those tools …

  9. ip address says:

    February 13th, 2004 at 12:03 pm (#)

    Nice summary. Thank you for posting it.

  10. darkerhorse says:

    December 30th, 2008 at 10:54 pm (#)

    Doing the code in Java takes a little less work, I believe. I have used TagSoup, and it works great.