Update: Screenscraping HTML with TagSoup and XPath

December 28th, 2003  |  Published in java

UPDATE: Oliver Roup has published updated code that uses the built-in XPath processor in JDK 1.5.

Emails and comments on Screenscraping HTML with TagSoup and XPath alerted me that the example on that page has fallen out of sync with the current release of JDOM and no longer works. I’ve reworked the example using Xalan 2.5.
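
For a rough idea of what the JDK's built-in XPath API looks like, here is a minimal sketch. It is not Oliver Roup's code or the reworked Xalan example. It assumes the markup is already well-formed and has no XML namespace; for real-world HTML, a parser such as TagSoup would have to supply the tree first, and that step is not shown. The class name XPathScrape and the method linkTargets are made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: pulls every link target out of well-formed markup.
class XPathScrape {

    static List<String> linkTargets(String wellFormedMarkup) throws XPathExpressionException {
        XPath xpath = XPathFactory.newInstance().newXPath();

        // Select the href attribute of every <a> element, in document order.
        NodeList hrefs = (NodeList) xpath.evaluate(
                "//a/@href",
                new InputSource(new StringReader(wellFormedMarkup)),
                XPathConstants.NODESET);

        List<String> targets = new ArrayList<>();
        for (int i = 0; i < hrefs.getLength(); i++) {
            targets.add(hrefs.item(i).getNodeValue());
        }
        return targets;
    }

    public static void main(String[] args) throws XPathExpressionException {
        // Assumed sample input; a real page would come from a tidying parser.
        String page = "<html><body><a href='a.html'>A</a>"
                + "<p><a href='b.html'>B</a></p></body></html>";
        for (String target : linkTargets(page)) {
            System.out.println(target);
        }
    }
}
```

Changing the XPath expression is all it takes to pull out a different part of the page, which is the main attraction of this approach over hand-written tree walking.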


An RDF crawler

April 21st, 2003  |  Published in java, rdf

I wrote an RDF crawler (aka scutter) using Java and the Jena RDF toolkit that spiders the web gathering up semantic web data and storing it in any of Jena’s backend stores (in-memory, Berkeley DB, MySQL, etc.). Download it here.


Screenscraping HTML with TagSoup and XPath

April 13th, 2003  |  Published in java, xml

Often I find I need to pull a bit of information out of a webpage to reuse inside some code. I’ve always done this from the command line using a combination of wget, HTML TIDY and xsltproc. Recently I’ve been doing the same thing in program code using some very handy tools written in Java.

Note: the example code below has been updated.


Using HTTP conditional GET in Java for efficient polling

April 9th, 2003  |  Published in java

If you’re going to download a resource over HTTP from a URL more than once, there are a couple of features of HTTP you should make sure you’re using. By giving the server some metadata about what you saw when you last downloaded the resource (the ETag and Last-Modified values), you let the server respond with a status code (304 Not Modified) indicating that the resource hasn’t changed and you should continue to use the version you already have.

This issue has been highlighted recently by the bandwidth load caused by the growth in popularity of RSS readers, which repeatedly download RSS files looking for changes. There’s a good writeup of the details at The Fishbowl. I didn’t find any sample Java source when I went looking recently, so here’s some code.
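
As a stand-in for the idea described above (this is a sketch written for this summary, not the code from the full entry), the following uses only java.net.HttpURLConnection. It remembers the ETag and Last-Modified values from the last successful download and sends them back as If-None-Match and If-Modified-Since. The class name ConditionalGet and its method names are hypothetical, and keeping the cached body in memory is an assumption made to keep the example short.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

// Hypothetical poller for a single URL that keeps the last response in memory.
class ConditionalGet {
    private String etag;        // ETag of the cached copy, or null if none was sent
    private long lastModified;  // Last-Modified of the cached copy in ms, 0 if unknown
    private byte[] cachedBody;  // body of the last 200 response

    // Copy the stored validators onto an outgoing request.
    static void applyValidators(HttpURLConnection conn, String etag, long lastModified) {
        if (etag != null) {
            conn.setRequestProperty("If-None-Match", etag);
        }
        if (lastModified > 0) {
            conn.setIfModifiedSince(lastModified);
        }
    }

    // Returns the current body, reusing the cached copy when the server answers 304.
    byte[] fetch(URL url) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        applyValidators(conn, etag, lastModified);
        try {
            int status = conn.getResponseCode();
            if (status == HttpURLConnection.HTTP_NOT_MODIFIED && cachedBody != null) {
                return cachedBody; // unchanged since the last download
            }
            if (status != HttpURLConnection.HTTP_OK) {
                throw new IOException("Unexpected HTTP status " + status);
            }
            byte[] body;
            try (InputStream in = conn.getInputStream()) {
                body = readAll(in);
            }
            // Remember the validators for the next poll.
            etag = conn.getHeaderField("ETag");
            lastModified = conn.getLastModified();
            cachedBody = body;
            return body;
        } finally {
            conn.disconnect();
        }
    }

    private static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        return out.toByteArray();
    }
}
```

A poller would create one ConditionalGet per feed and call fetch on a timer; when nothing has changed, the server sends no body at all, which is where the bandwidth saving comes from.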


A freetext-indexing IMAP spider

February 6th, 2003  |  Published in java

Because the Exchange mailserver at work is frustratingly slow and doesn’t have a flexible cross-folder search option, I wanted an indexing spider for IMAP. After a bit of struggling with the JavaMail API and almost no work at all plugging the messages into Lucene (which is impressively clean, flexible and powerful), I had some working code that will start at a folder and work down through its subfolders, indexing messages as it goes.


Sha1ing, smushing and aggregating FOAF

February 3rd, 2003  |  Published in foaf, java, rdf

To normalise and aggregate FOAF metadata related to photographs, I needed some new code to:

  • convert foaf:mbox entries to privacy-protected foaf:mbox_sha1sum entries.
  • normalise statements of the form “PICTURE depicts PERSON” to “PERSON depiction PICTURE”.
  • smush disparate references to the same person into references to a single definition of that person.
  • extract depiction triples from a model and copy just the bare minimum of information related to those depictions.

So I wrote foaftool, a Java class that uses Jena. The tarball also contains a couple of servlets that can be used to transform existing content on the web.
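
The first step in the list, turning a foaf:mbox into a foaf:mbox_sha1sum, needs nothing beyond the JDK, so here is a minimal sketch of it. This is not foaftool's code: the class name MboxSha1 and its methods are hypothetical, and the Jena plumbing for reading and writing the triples is left out. FOAF defines the value as the SHA-1 hash of the full mailbox URI, including the mailto: prefix, written as lowercase hex.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper for the mbox -> mbox_sha1sum step only.
class MboxSha1 {

    // Lowercase hex SHA-1 of the UTF-8 bytes of the given string.
    static String sha1Hex(String text) {
        try {
            MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
            byte[] digest = sha1.digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-1 is a required algorithm on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    // Accepts either a bare address or a mailto: URI and hashes the URI form.
    static String mboxSha1sum(String mbox) {
        String uri = mbox.startsWith("mailto:") ? mbox : "mailto:" + mbox;
        return sha1Hex(uri);
    }

    public static void main(String[] args) {
        // Assumed sample address, for illustration only.
        System.out.println(mboxSha1sum("alice@example.org"));
    }
}
```

Because the hash is taken over the exact URI string, two descriptions of the same person only smush together if their mailbox URIs match character for character, so any normalisation (case, whitespace) has to happen before hashing.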


Template for Java projects

January 24th, 2003  |  Published in java

Every time I start a new Java project, no matter what size, the first thing I do is go hunting through my Java directories looking for one to use as a template. Over time I’ve gathered some pretty useful Ant targets and settled on a fairly rational directory structure. Today I got round to building a skeleton set of directories and files that I can reuse in the future. Here’s a tarball of the results.


Photo-annotating bot

January 9th, 2003  |  Published in bots, java, photos, rdf, rest

For a while, a background project of mine has been writing a bot to help me annotate the fairly large number of pictures I post to picdiary (1496 at the last count). Creating a document of RSS-based metadata is a slightly cumbersome text-editor job every time I post a new set of pics.


A Java utility class for the Wordnet namespace

January 2nd, 2003  |  Published in java, rdf, wordnet

To support the work I’ve been doing with Wordnet and RDF, I wrote a utility Java class to handle URIs from the Wordnet ontology for RDF devised by Dan Brickley.


Movable Type categories in RDF

December 30th, 2002  |  Published in java, rdf

I’m collating writings about my various hacks and projects in a Movable Type system, but hopefully without actually creating a blog as such. I’d rather generate ad-hoc navigation based on the categories of the items, so creating RDF-based sitemaps seems like a good idea.
