April 21, 2003

An RDF crawler

I wrote an RDF crawler (aka scutter) using Java and the Jena RDF toolkit. It spiders the web gathering up semantic web data and stores it in any of Jena's backend stores (in-memory, Berkeley DB, MySQL, etc.). Download it here.

The system is multithreaded and so can simultaneously download from many sources while the aggregation thread does the processing. It builds a model that remembers the provenance of the RDF and takes care to delete and replace triples if it hits the same URL twice, so you can run it as often as you like to keep the data fresh without bloating the store with out-of-date information. As yet it doesn't do anything with what it gathers; the information's just sitting there waiting for interesting applications to be built on top of it.
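As a rough illustration of the delete-and-replace idea (this is not the scutter's own code, which works on Jena models), the sketch below keys everything by source URL so that a second fetch of the same URL overwrites what the first one stored. The ProvenanceStore class and the use of plain strings to stand in for triples are assumptions made for the example.

```java
import java.util.Collection;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical store that remembers which source URL each statement came from. */
final class ProvenanceStore {
    // source URL -> statements last seen at that URL (strings stand in for RDF triples)
    private final Map<String, Set<String>> statementsBySource = new HashMap<>();

    /** Drops everything previously gathered from sourceUrl and records the fresh statements. */
    void replace(String sourceUrl, Collection<String> statements) {
        statementsBySource.put(sourceUrl, new LinkedHashSet<>(statements));
    }

    /** Union of the current statements from every source. */
    Set<String> allStatements() {
        Set<String> all = new LinkedHashSet<>();
        for (Set<String> fromOneSource : statementsBySource.values()) {
            all.addAll(fromOneSource);
        }
        return all;
    }
}
```

Calling replace twice for the same URL leaves only the statements from the second call, so stale data from an earlier crawl does not accumulate.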

To use it as distributed, set up a MySQL database called "scutter", set the username and password in the DBConnection setup in Scutter.java, then recompile using 'ant compile' (sorry, no handy config files in this 0.1 release). Run the script scutter.sh, passing in as many starting-point URLs as you like. These will be added to the queue, and any rdfs:seeAlso pointers in the downloaded RDF will be recursively followed until no more unique URLs can be found. The biggest known issue at the moment is that it doesn't do proper management to work out when it's run out of URLs - it just stops. The standard log4j.properties file can be edited to change what gets logged - with full debugging information turned on, you get quite a lot of output.
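The queue-and-follow behaviour can be sketched roughly as below. This is a single-threaded simplification, not the scutter's multithreaded implementation; the SeeAlsoFrontier class is invented for the example, and fetchSeeAlso is a hypothetical stand-in for "download this URL, parse the RDF and return its rdfs:seeAlso targets".

```java
import java.util.ArrayDeque;
import java.util.Collection;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.function.Function;

final class SeeAlsoFrontier {
    /**
     * Visits the seed URLs and everything reachable through seeAlso links,
     * each URL once, and returns the URLs in the order they were discovered.
     */
    static Set<String> crawl(Collection<String> seeds,
                             Function<String, Collection<String>> fetchSeeAlso) {
        Set<String> seen = new LinkedHashSet<>(seeds); // every URL ever queued
        Deque<String> queue = new ArrayDeque<>(seen);  // URLs still to download
        while (!queue.isEmpty()) {
            String url = queue.remove();
            for (String next : fetchSeeAlso.apply(url)) {
                // Set.add returns false for a URL already seen, which stops cycles
                if (seen.add(next)) {
                    queue.add(next);
                }
            }
        }
        return seen; // the queue is empty: no more unique URLs can be found
    }
}
```

In this single-threaded form, termination is simply the queue running dry. With several download threads, an empty queue is not enough on its own, because a download still in progress may add more URLs.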

Plans for the future include tying FOAF-related processing into the aggregation such as smushing and mbox_sha1sum normalising, and making a publish/subscribe-based system so that people who can't run their own aggregators can subscribe to the RDF that's gathered.

Posted by Matt Biddulph at 02:33 PM

April 13, 2003

Screenscraping HTML with TagSoup and XPath

Often I find I need to pull out a bit of information from a webpage to reuse inside some code. I've always done this from the commandline using a combination of wget, HTML TIDY and xsltproc. Recently I've been doing the same thing in program code using some very handy tools written in Java.

Note: the example code below has been updated.

The commandline version looks like this:

wget -O - http://example.com | tidy -asxml - | xsltproc somexsl.xsl -

where somexsl.xsl looks something like this:

<?xml version='1.0'?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version='1.0'>
<xsl:output method="text" />
<xsl:template match="/">
<xsl:value-of select="/html/head/title" />
</xsl:template>
</xsl:stylesheet>

It's also possible to do the same thing entirely in Java. John Cowan wrote a wonderful HTML parser called TagSoup that outputs SAX events using a do-the-best-I-can approach ("Just Keep On Truckin'" as he describes it) that attempts to make the best job of even the nastiest badly-written HTML. It produces output in cases when HTML TIDY gives up and tells you that errors in the input must be corrected before it can continue.

Because the SAX events just look like XML to any downstream code, it can be plugged into an XPath processor such as Jaxen. XPath processors need an in-memory document tree to work with (because of the backwards-and-forwards-looking nature of the language, which makes streaming processing difficult). JDOM contains a nice class called SAXBuilder that can do this SAX-to-tree conversion, and handily Jaxen can work with JDOM trees directly. So, the Java equivalent of the commandline above is:

URL url = new URL("http://example.com");
SAXBuilder builder = new SAXBuilder("org.ccil.cowan.tagsoup.Parser"); // build a JDOM tree from a SAX stream provided by tagsoup
Document doc = builder.build(url);
JDOMXPath titlePath = new JDOMXPath("/h:html/h:head/h:title");
titlePath.addNamespace("h","http://www.w3.org/1999/xhtml");
String title = ((Element)titlePath.selectSingleNode(doc)).getText();
System.out.println("Title is "+title);
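One detail worth spelling out is the "h" prefix in the XPath: the elements are in the XHTML namespace, so an unprefixed path like the one in the XSLT above will not match them. The sketch below shows the same effect using only the JDK's own javax.xml.xpath classes on a small well-formed XHTML string (no TagSoup, JDOM or Jaxen involved). The NamespacedTitle class and its evaluate helper are made up for the example.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class NamespacedTitle {
    static final String XHTML = "http://www.w3.org/1999/xhtml";

    /** Evaluates an XPath against the given XML, with the prefix "h" bound to the XHTML namespace. */
    static String evaluate(String xml, String expression) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // without this the namespace is ignored
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            xpath.setNamespaceContext(new NamespaceContext() {
                public String getNamespaceURI(String prefix) {
                    return "h".equals(prefix) ? XHTML : XMLConstants.NULL_NS_URI;
                }
                public String getPrefix(String uri) {
                    return XHTML.equals(uri) ? "h" : null;
                }
                public Iterator<String> getPrefixes(String uri) {
                    return XHTML.equals(uri)
                            ? Collections.singletonList("h").iterator()
                            : Collections.<String>emptyList().iterator();
                }
            });
            return xpath.evaluate(expression, doc); // string value of the first match, or ""
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

With a document such as `<html xmlns="http://www.w3.org/1999/xhtml"><head><title>Example</title></head></html>`, the path /h:html/h:head/h:title finds the title, while /html/head/title evaluates to an empty string.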

Posted by Matt Biddulph at 02:40 PM

April 09, 2003

Using HTTP conditional GET in Java for efficient polling

If you're going to download a resource over HTTP from a URL more than once, there are a couple of features of HTTP you should make sure you're using. By giving the server some metadata about what you saw when you last downloaded the resource, it can give you a status code indicating that the resource hasn't changed and you should continue to use the version you already have.

This issue has been highlighted recently by the bandwidth load caused by the growth in popularity of RSS readers, which repeatedly download RSS files looking for changes. There's a good writeup of the details at The Fishbowl. I didn't find any sample Java source when I went looking recently, so here's some code.

If you're using Jakarta Commons HttpClient and you have an etag and lastModified string cached with a document, then use these lines on your GetMethod instance:

GetMethod get = new UrlGetMethod(url);
get.addRequestHeader(new Header("If-None-Match",etag));
get.addRequestHeader(new Header("If-Modified-Since",lastModified));

then check the response code like this:

client.executeMethod(get);
if(get.getStatusCode() < 300) {
  // server gave us a document
  // getResponseHeader returns null when the server omitted the header
  Header etagHeader = get.getResponseHeader("ETag");
  if(etagHeader != null) {
   String newEtag = etagHeader.getValue(); // stash this somewhere
  }

  // use the raw value: an HTTP date contains a comma, so splitting it into elements would truncate it
  Header modHeader = get.getResponseHeader("Last-Modified");
  if(modHeader != null) {
   String newLastModified = modHeader.getValue(); // stash this somewhere
  }
} else {
  // server didn't give us a document, no update
}

The equivalent lines (taken from nntp//rss) for the standard JDK java.net package are:

HttpURLConnection httpCon = ....
httpCon.setRequestProperty("If-None-Match", etag);
httpCon.setIfModifiedSince(lastModified);

and

if(httpCon.getResponseCode() == HttpURLConnection.HTTP_OK) {
 newEtag = httpCon.getHeaderField("ETag");
 newLastModified = httpCon.getHeaderFieldDate("Last-Modified", 0);
}
if(httpCon.getResponseCode() == HttpURLConnection.HTTP_NOT_MODIFIED) {
  // no change
}
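Putting the java.net fragments together, a complete polling helper might look roughly like this. It is a sketch under assumptions rather than code from nntp//rss: the Validators and ConditionalGet classes are invented for the example, and any response other than 200 is treated as "keep what you have", which is a simplification.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

/** Hypothetical holder for what was remembered from the previous download. */
final class Validators {
    final String etag;       // null if the server sent no ETag
    final long lastModified; // milliseconds since the epoch, 0 if unknown

    Validators(String etag, long lastModified) {
        this.etag = etag;
        this.lastModified = lastModified;
    }
}

final class ConditionalGet {
    /** Decides what to remember next: fresh validators on 200, the old ones otherwise. */
    static Validators interpret(int status, String etag, long lastModified, Validators previous) {
        if (status == HttpURLConnection.HTTP_OK) {
            return new Validators(etag, lastModified);
        }
        return previous; // 304 Not Modified (or an error): the cached copy stays in use
    }

    /** Polls the URL; returns the same object as 'previous' when nothing new was downloaded. */
    static Validators poll(URL url, Validators previous) throws IOException {
        HttpURLConnection con = (HttpURLConnection) url.openConnection();
        // request headers must be set before the connection is made
        if (previous != null && previous.etag != null) {
            con.setRequestProperty("If-None-Match", previous.etag);
        }
        if (previous != null && previous.lastModified > 0) {
            con.setIfModifiedSince(previous.lastModified);
        }
        try {
            int status = con.getResponseCode();
            // on 200 the body would be read from con.getInputStream() here
            return interpret(status, con.getHeaderField("ETag"),
                    con.getHeaderFieldDate("Last-Modified", 0), previous);
        } finally {
            con.disconnect();
        }
    }
}
```

A caller would keep the returned Validators alongside the cached document and pass them back in on the next poll; if the same object comes back, there is nothing new to process.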

Posted by Matt Biddulph at 03:52 PM

April 01, 2003

"Most Popular Entries" sidebar

Noticing the variety in popularity amongst the different topics that the pieces on this site cover, I added a "Most Popular Entries" sidebar to keep track of what people are reading. This is done with a simple application of Apache::ParseLog, XML::RSS, Movable Type's XML-RPC interface and a Movable Type plugin.

Every night after midnight, hits.pl runs and inspects the previous day's Apache access log. Every entry has a predictable URL of the form http://www.hackdiary.com/archive/000articleID.html, so the code looks for hits on those URLs and parses out the article ID. It updates a dbm file, adding the day's hits to hits already recorded for each item.
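The counting step is simple enough to sketch. The real hits.pl is Perl and uses Apache::ParseLog and a dbm file; the version below is a Java illustration of the same idea, with an in-memory map standing in for the dbm file and a regular expression standing in for the log parser. The HitCounter class and the exact pattern are assumptions.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class HitCounter {
    // Matches the request part of an access-log line, e.g. "GET /archive/000123.html HTTP/1.0",
    // and captures the digits of the file name as the article ID.
    private static final Pattern ARTICLE_HIT =
            Pattern.compile("\"GET /archive/(\\d+)\\.html[ ?]");

    /** Adds one day's worth of article hits to the running totals. */
    static void addDay(Map<String, Integer> totals, Iterable<String> logLines) {
        for (String line : logLines) {
            Matcher m = ARTICLE_HIT.matcher(line);
            if (m.find()) {
                // start at 1 for a new article, otherwise add to the existing count
                totals.merge(m.group(1), 1, Integer::sum);
            }
        }
    }
}
```

Because the totals map is passed in rather than created inside the method, running addDay once per day against the same map accumulates hits over time, which is the job the dbm file does in the original.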

After that's finished, hits2rss.pl is run and translates the dbm file into an RSS file by looking up the article titles via the MT XML-RPC interface. An MT plugin parses that file when the index page is rebuilt and makes the sidebar.

Fun hacks like this make me wish that MT had a license that allowed distribution of modifications to its core. Although the plugin interface is flexible, it's a dead-end in the long run without the ability to dig in and change things under the skin.

Posted by Matt Biddulph at 06:23 PM