Archive for 2003

XMLEurope 2003 Talk Slides

May 7th, 2003  |  Published in photos, rdf, rss, talks, wordnet

This morning I gave my talk (A Semantic Web Shoebox – Annotating Photos with RSS and RDF) at XMLEurope 2003. The slides are now available.

Read the rest of this entry »

An RDF crawler

April 21st, 2003  |  Published in java, rdf

I wrote an RDF crawler (aka scutter) using Java and the Jena RDF toolkit. It spiders the web gathering up semantic web data and stores it in any of Jena’s backend stores (in-memory, Berkeley DB, MySQL, etc). Download it here.

Read the rest of this entry »

Screenscraping HTML with TagSoup and XPath

April 13th, 2003  |  Published in java, xml

I often need to pull a bit of information out of a webpage to reuse inside some code. I’ve always done this from the command line using a combination of wget, HTML Tidy and xsltproc. Recently I’ve been doing the same thing in program code using some very handy tools written in Java.
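As a rough illustration of the XPath half of this approach, here is a minimal sketch that uses only the XML APIs bundled with the JDK. It is not the code from the full entry. It assumes the markup is already well-formed; real-world HTML would first need cleaning up by a tool such as TagSoup or HTML Tidy. The class and method names (XPathScrape, extract) are made up for this example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper: names are illustrative, not from the original entry.
class XPathScrape {

    /**
     * Parses well-formed markup into a DOM tree and returns the string
     * value of the first node matched by the XPath expression
     * (an empty string if nothing matches).
     */
    static String extract(String wellFormedMarkup, String xpathExpr) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(wellFormedMarkup)));
        return XPathFactory.newInstance().newXPath().evaluate(xpathExpr, doc);
    }
}
```

Calling XPathScrape.extract(page, "/html/head/title") on a tidied page would return the text of its title element.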

Note: the example code below has been updated.

Read the rest of this entry »

Using HTTP conditional GET in Java for efficient polling

April 9th, 2003  |  Published in java

If you’re going to download a resource over HTTP from the same URL more than once, there are a couple of features of HTTP you should make sure you’re using. If you give the server some metadata about what you saw when you last downloaded the resource, it can respond with a status code indicating that the resource hasn’t changed, so you can continue to use the version you already have.
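A minimal sketch of the idea, using only java.net.HttpURLConnection, might look like the following. It is not the code from the full entry. It assumes the server sends ETag and/or Last-Modified headers, and the class and method names (ConditionalGet, addValidators, fetchIfChanged) are invented for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

// Hypothetical helper: names are illustrative, not from the original entry.
class ConditionalGet {
    // Validators remembered from the last successful download.
    private String lastEtag;       // ETag response header, or null if none seen
    private long lastModified;     // Last-Modified in epoch millis, or 0 if unknown

    /** Copies the remembered validators onto an outgoing request. */
    static void addValidators(HttpURLConnection conn, String etag, long modified) {
        if (etag != null) {
            conn.setRequestProperty("If-None-Match", etag);
        }
        if (modified > 0) {
            conn.setIfModifiedSince(modified); // sent as If-Modified-Since
        }
    }

    /** Returns the new body, or null if the server answered 304 Not Modified. */
    byte[] fetchIfChanged(URL url) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        addValidators(conn, lastEtag, lastModified);
        try {
            if (conn.getResponseCode() == HttpURLConnection.HTTP_NOT_MODIFIED) {
                return null; // keep using the copy we already have
            }
            byte[] body;
            try (InputStream in = conn.getInputStream()) {
                body = in.readAllBytes();
            }
            // Only remember the validators once the body has been read.
            lastEtag = conn.getHeaderField("ETag");
            lastModified = conn.getLastModified();
            return body;
        } finally {
            conn.disconnect();
        }
    }
}
```

A poller would keep one ConditionalGet instance per URL and treat a null return as "nothing new".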

This issue has been highlighted recently by the bandwidth load caused by the growth in popularity of RSS readers, which repeatedly download RSS files looking for changes. There’s a good writeup of the details at The Fishbowl. I didn’t find any sample Java source when I went looking recently, so here’s some code.

Read the rest of this entry »

“Most Popular Entries” sidebar

April 1st, 2003  |  Published in perl, rss

Having noticed how much the popularity of the pieces on this site varies by topic, I added a “Most Popular Entries” sidebar to keep track of what people are reading. This is done with a simple combination of Apache::ParseLog, XML::RSS, Movable Type’s XML-RPC interface and a Movable Type plugin.

Read the rest of this entry »

Lightning talk on RDF and the Semantic Web

March 14th, 2003  |  Published in perl, rdf, talks

Last night I gave a lightning talk at the london.pm techmeet that attempted to explain as simply as possible what RDF and the Semantic Web are, and how you can start playing with them in Perl.

Read the rest of this entry »

Installing Debian on a Dell Latitude X200

March 9th, 2003  |  Published in hardware, linux

UPDATE: more notes written recently

My new laptop arrived this week – a Dell Latitude X200. And it’s marvellous. Wonderfully lightweight, good battery life for such a small box, good keyboard and a really clear bright screen. After a quick look at Windows XP, which I’d never seen properly before, I set about installing Linux on it. The Linux on Laptops Dell page has links to some useful bootstrapping information, but there were a few things I found pretty hard to work out. Here are my notes on those things.

Read the rest of this entry »

On the sudden lack of momentum

February 26th, 2003  |  Published in misc

Just when I’d got into a regular rhythm of posting new stuff to this site at least once a week, my laptop died. I stopped using desktop machines a year or so ago, partly prompted by the arrival of cheap wireless networking. A good laptop and a number of built-for-purpose servers (mp3 jukebox, network gateway and webserver, etc) have suited me very well.

I’ve blatted my savings and ordered a shiny new replacement, but until that arrives I won’t be able to properly finish any of the code I’ve been working on. I’m looking forward to posting a new RDF scutter based on an updated version of the foaftool code posted here a few weeks ago.

A freetext-indexing IMAP spider

February 6th, 2003  |  Published in java

Because the Exchange mailserver at work is frustratingly slow and doesn’t have a flexible cross-folder search option, I wanted an indexing spider for IMAP. After a bit of struggling with the JavaMail API and almost no work at all plugging the messages into Lucene (which is impressively clean, flexible and powerful), I had some working code that starts at a folder and works down through its subfolders, indexing messages as it goes.

Read the rest of this entry »

Sha1ing, smushing and aggregating FOAF

February 3rd, 2003  |  Published in foaf, java, rdf

To normalise and aggregate FOAF metadata related to photographs, I needed some new code to:

  • convert foaf:mbox entries to privacy-protected foaf:mbox_sha1sum entries.
  • normalise statements of the form “PICTURE depicts PERSON” to “PERSON depiction PICTURE”.
  • smush disparate references to the same person into references to a single definition of that person.
  • extract depiction triples from a model and copy just the bare minimum of information related to those depictions.

So I wrote foaftool, a Java class that uses Jena. The tarball also contains a couple of servlets that can be used to transform existing content on the web.
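To give a feel for the first step in that list, here is a small sketch of computing a foaf:mbox_sha1sum value: the SHA-1 digest of the full mailbox URI (including the mailto: prefix), written as lowercase hex. This is not foaftool’s own code and it doesn’t use Jena; the class and method names are made up for the example.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper: names are illustrative, not taken from foaftool.
class MboxSha1 {

    /**
     * Returns the lowercase hex SHA-1 digest of a mailbox URI such as
     * "mailto:someone@example.org". The whole URI is hashed, prefix included.
     */
    static String mboxSha1sum(String mboxUri) {
        try {
            MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
            byte[] digest = sha1.digest(mboxUri.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-1 is required to be present on every Java platform.
            throw new IllegalStateException(e);
        }
    }
}
```

Because two descriptions of the same person yield the same hash for the same mailbox, the hashed value can still serve as a key when smushing, without publishing the address itself.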

Read the rest of this entry »