February 26, 2003

On the sudden lack of momentum

Just when I'd got into a regular rhythm of posting new stuff to this site at least once a week, my laptop died. I haven't used desktop machines for the last year or so, partly prompted by the arrival of cheap wireless networking. A good laptop and a number of built-for-purpose servers (mp3 jukebox, network gateway and webserver, etc.) have suited me very well.

I've blatted my savings and ordered a shiny new replacement, but until that arrives I won't be able to properly finish any of the code I've been working on. I'm looking forward to posting a new RDF scutter based on an updated version of the foaftool code posted here a few weeks ago.

Posted by Matt Biddulph at 10:50 AM | Comments (0) | TrackBack

February 06, 2003

A freetext-indexing IMAP spider

Because the Exchange mailserver at work is frustratingly slow and doesn't have a flexible cross-folder search option, I wanted an indexing spider for IMAP. After a bit of struggling with the JavaMail API and almost no work at all plugging the messages into Lucene (which is impressively clean, flexible and powerful), I had some working code that starts at a folder and works down through its subfolders, indexing messages as it goes.

This tarball contains the source, compiled class files and support jars, along with a Jetty setup that will let you run the demo servlet without needing an install of Tomcat or any other servlet engine. Point the indexer at your IMAP host and give it a folder to start from and it will recursively build an index of subject, date, from and mail body. Run Jetty via queryserver.sh and point your browser at http://localhost:9999

The indexer uses the Message-ID as a primary key, so on each run it only indexes mail it hasn't seen before. This means it will work nicely from a regular cronjob. The query code uses the standard Lucene query parser, so it supports queries such as +foo +bar, subject:fish and "phrase search". The spider is independent of the indexer and just fires message events at a MessageListener interface, so it might be useful for other things. The main limitation at the moment (apart from the lack of a nice interface) is that the code only copes with single-part messages of type text/plain. The MailDocument class is the place to start improving that.
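As a rough sketch of that separation (not the code in the tarball), the spider below walks a folder tree and fires one event per message at a listener, and the listener skips any Message-ID it has already seen. Everything here is a stand-in: Mail, Folder, DedupingIndexer and the onMessage method name are made up for illustration, and the in-memory set plays the part of a lookup against the Lucene index.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative sketch only: none of these types come from the tarball or from JavaMail.
class ImapSpiderSketch {

    /** Stand-in for a mail message; only the fields the sketch needs. */
    static final class Mail {
        final String messageId;
        final String subject;

        Mail(String messageId, String subject) {
            this.messageId = messageId;
            this.subject = subject;
        }
    }

    /** Stand-in for an IMAP folder holding messages and subfolders. */
    static final class Folder {
        final List<Mail> messages = new ArrayList<>();
        final List<Folder> subfolders = new ArrayList<>();
    }

    /** The spider knows only this callback; the method name is an assumption. */
    interface MessageListener {
        void onMessage(Mail mail);
    }

    /** Depth-first walk: fire an event per message, then recurse into subfolders. */
    static void crawl(Folder folder, MessageListener listener) {
        for (Mail mail : folder.messages) {
            listener.onMessage(mail);
        }
        for (Folder child : folder.subfolders) {
            crawl(child, listener);
        }
    }

    /** Listener that indexes each Message-ID at most once. */
    static final class DedupingIndexer implements MessageListener {
        // In-memory stand-in for "is this Message-ID already in the index?"
        private final Set<String> seenIds = new HashSet<>();
        private final List<Mail> indexed = new ArrayList<>();

        @Override
        public void onMessage(Mail mail) {
            // Set.add returns false when the ID was already present.
            if (seenIds.add(mail.messageId)) {
                indexed.add(mail);
            }
        }

        int indexedCount() {
            return indexed.size();
        }
    }

    public static void main(String[] args) {
        Folder inbox = new Folder();
        Folder lists = new Folder();
        inbox.messages.add(new Mail("<1@example.org>", "hello"));
        lists.messages.add(new Mail("<2@example.org>", "rdf"));
        inbox.subfolders.add(lists);

        DedupingIndexer indexer = new DedupingIndexer();
        crawl(inbox, indexer);
        crawl(inbox, indexer); // a second run finds nothing new to index
        System.out.println(indexer.indexedCount());
    }
}
```

Because the spider only sees the listener interface, the same walk could feed something other than an indexer, which is the reuse suggested above.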

Posted by Matt Biddulph at 11:00 PM | Comments (3) | TrackBack

February 03, 2003

Sha1ing, smushing and aggregating FOAF

To normalise and aggregate FOAF metadata related to photographs, I needed some new code to:

  • convert foaf:mbox entries to privacy-protected foaf:mbox_sha1sum entries.
  • normalise statements of the form "PICTURE depicts PERSON" to "PERSON depiction PICTURE".
  • smush disparate references to the same person into references to a single definition of that person.
  • extract depiction triples from a model and copy just the bare minimum of information related to those depictions.

So I wrote foaftool, a Java class that uses Jena. The tarball also contains a couple of servlets that can be used to transform existing content on the web.

The first servlet will transform foaf:mbox triples in FOAF data into appropriately-encoded foaf:mbox_sha1sum triples. This makes Edd's FOAF file look like this. Using an extra querystring parameter, it optionally converts foaf:depicts triples to foaf:depiction. foaf:depicts isn't actually in the official FOAF schema at the time of writing, although it is in informal use in many places as it sometimes makes for more elegant modeling. Normalising to foaf:depiction makes working with large amounts of FOAF data simpler.
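The two transformations can be sketched in plain Java without Jena. This is not foaftool itself: the class, the Triple type and the method names are hypothetical, and predicates are written as prefixed strings for brevity. It assumes the usual FOAF convention that mbox_sha1sum is the hex SHA-1 of the whole mailbox URI, including the mailto: prefix.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Illustrative sketch only: hypothetical names, no Jena, not the foaftool code.
class FoafNormaliseSketch {

    /** A bare (subject, predicate, object) triple of strings. */
    static final class Triple {
        final String subject;
        final String predicate;
        final String object;

        Triple(String subject, String predicate, String object) {
            this.subject = subject;
            this.predicate = predicate;
            this.object = object;
        }
    }

    /** Lower-case hex SHA-1 of the given string, e.g. a full "mailto:" URI. */
    static String sha1Hex(String text) {
        try {
            MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
            byte[] digest = sha1.digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-1 not available", e);
        }
    }

    /** Replace "X foaf:mbox mailto:..." with "X foaf:mbox_sha1sum <hash>". */
    static Triple protectMbox(Triple t) {
        if (t.predicate.equals("foaf:mbox")) {
            return new Triple(t.subject, "foaf:mbox_sha1sum", sha1Hex(t.object));
        }
        return t;
    }

    /** Turn "PICTURE foaf:depicts PERSON" into "PERSON foaf:depiction PICTURE". */
    static Triple normaliseDepicts(Triple t) {
        if (t.predicate.equals("foaf:depicts")) {
            return new Triple(t.object, "foaf:depiction", t.subject);
        }
        return t;
    }

    public static void main(String[] args) {
        Triple mbox = new Triple("_:person", "foaf:mbox", "mailto:someone@example.org");
        Triple pic = new Triple("http://example.org/pic.jpg", "foaf:depicts", "_:person");

        Triple hashed = protectMbox(mbox);
        Triple inverted = normaliseDepicts(pic);
        System.out.println(hashed.predicate + " " + hashed.object);
        System.out.println(inverted.subject + " " + inverted.predicate + " " + inverted.object);
    }
}
```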

Writing the smushing code was an entertaining diversion. Smushing is important when merging multiple RDF sources. Say you have two sources, edd1.rdf and edd2.rdf, showing where to find photos of Edd. When merged, the graph structure looks like this:

[Image: graph of the merged edd1.rdf and edd2.rdf data, before smushing]

This is because without smushing, the anonymous nodes that both have Edd's email address are not equated. The smushed version (smushed on mbox_sha1sum using a foaftool servlet) looks like this:

[Image: graph of the same merged data after smushing]
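For anyone who wants the idea without the pictures, here is a minimal sketch of smushing on a key property. It is not the foaftool implementation: it works on plain string triples rather than a Jena model, all names are hypothetical, and it ignores awkward cases such as a node carrying several different key values. The first node seen with a given key value becomes the canonical one, and every later node with the same value is rewritten to it.

```java
import java.util.Collection;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Objects;
import java.util.Set;

// Illustrative sketch only: plain string triples, hypothetical names, no Jena.
class SmushSketch {

    /** A (subject, predicate, object) triple with value equality. */
    static final class Triple {
        final String s;
        final String p;
        final String o;

        Triple(String s, String p, String o) {
            this.s = s;
            this.p = p;
            this.o = o;
        }

        @Override
        public boolean equals(Object other) {
            if (!(other instanceof Triple)) {
                return false;
            }
            Triple t = (Triple) other;
            return s.equals(t.s) && p.equals(t.p) && o.equals(t.o);
        }

        @Override
        public int hashCode() {
            return Objects.hash(s, p, o);
        }
    }

    /** Merge nodes that share a value for keyPredicate into one node. */
    static Set<Triple> smush(Collection<Triple> triples, String keyPredicate) {
        Map<String, String> canonicalByKey = new HashMap<>(); // key value -> first node seen
        Map<String, String> rename = new HashMap<>();         // duplicate node -> canonical node

        for (Triple t : triples) {
            if (t.p.equals(keyPredicate)) {
                String canonical = canonicalByKey.putIfAbsent(t.o, t.s);
                if (canonical != null && !canonical.equals(t.s)) {
                    rename.put(t.s, canonical);
                }
            }
        }

        // Rewrite both ends of every triple; the set drops statements that become identical.
        Set<Triple> smushed = new LinkedHashSet<>();
        for (Triple t : triples) {
            smushed.add(new Triple(rename.getOrDefault(t.s, t.s), t.p, rename.getOrDefault(t.o, t.o)));
        }
        return smushed;
    }

    public static void main(String[] args) {
        Set<Triple> merged = new LinkedHashSet<>();
        merged.add(new Triple("_:a", "foaf:mbox_sha1sum", "hash-of-mbox"));
        merged.add(new Triple("_:a", "foaf:depiction", "http://example.org/pic1.jpg"));
        merged.add(new Triple("_:b", "foaf:mbox_sha1sum", "hash-of-mbox"));
        merged.add(new Triple("_:b", "foaf:depiction", "http://example.org/pic2.jpg"));

        for (Triple t : smush(merged, "foaf:mbox_sha1sum")) {
            System.out.println(t.s + " " + t.p + " " + t.o);
        }
    }
}
```

In the example, the two anonymous nodes _:a and _:b collapse into _:a, which ends up with both depictions and a single mbox_sha1sum statement.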

With the data in normalised and merged form, I want to extract just the triples of the form "X foaf:depiction [picture uri]" and the related foaf:name and foaf:mbox_sha1sum triples. With the foaftool code, I can now merge and extract depictions from any number of RDF sources.
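The extraction step amounts to a two-pass filter, sketched below under the same assumptions as before: plain string triples, hypothetical names, and no claim that foaftool does it this way. The first pass collects every subject that has a foaf:depiction, and the second keeps only the depiction, name and mbox_sha1sum statements about those subjects.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative sketch only: plain string triples, hypothetical names, no Jena.
class DepictionExtractSketch {

    /** A bare (subject, predicate, object) triple of strings. */
    static final class Triple {
        final String s;
        final String p;
        final String o;

        Triple(String s, String p, String o) {
            this.s = s;
            this.p = p;
            this.o = o;
        }
    }

    // The "bare minimum" of properties to copy alongside each depiction.
    private static final Set<String> KEPT_PREDICATES = new HashSet<>(
            Arrays.asList("foaf:depiction", "foaf:name", "foaf:mbox_sha1sum"));

    static List<Triple> extractDepictions(List<Triple> model) {
        // Pass 1: find every node that is depicted somewhere.
        Set<String> depicted = new HashSet<>();
        for (Triple t : model) {
            if (t.p.equals("foaf:depiction")) {
                depicted.add(t.s);
            }
        }

        // Pass 2: keep only the whitelisted statements about those nodes.
        List<Triple> extracted = new ArrayList<>();
        for (Triple t : model) {
            if (depicted.contains(t.s) && KEPT_PREDICATES.contains(t.p)) {
                extracted.add(t);
            }
        }
        return extracted;
    }

    public static void main(String[] args) {
        List<Triple> model = Arrays.asList(
                new Triple("_:a", "foaf:name", "Example Person"),
                new Triple("_:a", "foaf:mbox_sha1sum", "hash-of-mbox"),
                new Triple("_:a", "foaf:depiction", "http://example.org/pic1.jpg"),
                new Triple("_:a", "foaf:homepage", "http://example.org/"),
                new Triple("_:c", "foaf:name", "Nobody Pictured"));

        for (Triple t : extractDepictions(model)) {
            System.out.println(t.s + " " + t.p + " " + t.o);
        }
    }
}
```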

Comments and bugfixes are very welcome; the code has only been tested as far as the JUnit tests included in the tarball.

Posted by Matt Biddulph at 12:02 AM | Comments (0) | TrackBack