Archive for April, 2003

An RDF crawler

April 21st, 2003  |  Published in java, rdf

I wrote an RDF crawler (aka scutter) in Java using the Jena RDF toolkit. It spiders the web gathering up Semantic Web data and stores it in any of Jena’s backend stores (in-memory, Berkeley DB, MySQL, etc.). Download it here.

Read the rest of this entry »

Screenscraping HTML with TagSoup and XPath

April 13th, 2003  |  Published in java, xml

I often need to pull a bit of information out of a web page to reuse inside some code. I’ve always done this from the command line using a combination of wget, HTML Tidy and xsltproc. Recently I’ve been doing the same thing in program code using some very handy tools written in Java.

Note: the example code below has been updated.

Read the rest of this entry »

Using HTTP conditional GET in Java for efficient polling

April 9th, 2003  |  Published in java

If you’re going to download a resource over HTTP from a URL more than once, there are a couple of features of HTTP you should make sure you’re using: the ETag and Last-Modified validators. By sending the server this metadata about what you saw when you last downloaded the resource (in the If-None-Match and If-Modified-Since request headers), you let it reply with a 304 Not Modified status code, which indicates that the resource hasn’t changed and you should continue to use the version you already have.
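To make the idea concrete, here is a minimal sketch of a conditional GET using only the JDK’s `HttpURLConnection`. It is not the code from the full entry: the class name `ConditionalGet`, the `Cached` holder and the helper `addValidators` are names made up for this illustration, and the sketch assumes the server sends an `ETag` and/or `Last-Modified` header.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URI;

public class ConditionalGet {

    /** Hypothetical holder for what we remember from the previous download. */
    public static final class Cached {
        public final String etag;        // ETag response header, or null if absent
        public final long lastModified;  // Last-Modified as epoch millis, 0 if absent
        public final byte[] body;        // the bytes we downloaded last time

        public Cached(String etag, long lastModified, byte[] body) {
            this.etag = etag;
            this.lastModified = lastModified;
            this.body = body;
        }
    }

    /** Copies the validators from the previous response onto the next request. */
    public static void addValidators(HttpURLConnection conn, Cached previous) {
        if (previous == null) {
            return; // first download: nothing to compare against
        }
        if (previous.etag != null) {
            conn.setRequestProperty("If-None-Match", previous.etag);
        }
        if (previous.lastModified > 0) {
            // The JDK turns this into an If-Modified-Since header when it connects.
            conn.setIfModifiedSince(previous.lastModified);
        }
    }

    /** Returns the previous copy on 304, or a freshly downloaded one on 200. */
    public static Cached fetch(String url, Cached previous) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(url).toURL().openConnection();
        addValidators(conn, previous);
        try {
            int status = conn.getResponseCode();
            if (status == HttpURLConnection.HTTP_NOT_MODIFIED) {
                return previous; // 304: unchanged, keep using what we have
            }
            if (status != HttpURLConnection.HTTP_OK) {
                throw new IOException("Unexpected HTTP status " + status);
            }
            try (InputStream in = conn.getInputStream()) {
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                byte[] buf = new byte[8192];
                int n;
                while ((n = in.read(buf)) != -1) {
                    out.write(buf, 0, n);
                }
                // Remember the new validators for the next poll.
                return new Cached(conn.getHeaderField("ETag"),
                        conn.getLastModified(), out.toByteArray());
            }
        } finally {
            conn.disconnect();
        }
    }
}
```

A poller would keep the returned `Cached` object between runs and pass it back in on the next call, for example `cached = ConditionalGet.fetch(feedUrl, cached);`. If the same object comes back, nothing has changed and there is nothing to reparse.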

This issue has been highlighted recently by the bandwidth load caused by the growth in popularity of RSS readers, which repeatedly download RSS files looking for changes. There’s a good writeup of the details at The Fishbowl. I didn’t find any sample Java source when I went looking recently, so here’s some code.

Read the rest of this entry »

“Most Popular Entries” sidebar

April 1st, 2003  |  Published in perl, rss

Having noticed how much popularity varies across the topics covered on this site, I added a “Most Popular Entries” sidebar to keep track of what people are reading. It is built with a simple application of Apache::ParseLog, XML::RSS, Movable Type’s XML-RPC interface and a Movable Type plugin.

Read the rest of this entry »