December 28th, 2003 | Published in java
UPDATE: Oliver Roup has published updated code that uses the built-in XPath processor in JDK 1.5.
Some emails and comments on “Screenscraping HTML with TagSoup and XPath” alerted me to the fact that the example I gave on that page has fallen out of sync with the current release of JDOM and no longer works. I’ve reworked the example using Xalan 2.5.
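As a rough sketch of the XPath half of this approach, here is a version that uses the javax.xml.xpath API built into JDK 1.5 and later. It is not the Xalan 2.5 code from the post, and it assumes the page has already been cleaned up into well-formed markup, which is the job TagSoup does. The class name XPathScrape and the method select are made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: evaluates an XPath expression against well-formed markup.
class XPathScrape {

    /** Returns the text value of every node matched by the expression, in document order. */
    static List<String> select(String wellFormedMarkup, String expression)
            throws XPathExpressionException {
        XPath xpath = XPathFactory.newInstance().newXPath();
        // Assumption: the input is already well-formed; raw HTML would fail to parse here.
        InputSource source = new InputSource(new StringReader(wellFormedMarkup));
        NodeList nodes = (NodeList) xpath.evaluate(expression, source, XPathConstants.NODESET);

        List<String> values = new ArrayList<String>();
        for (int i = 0; i < nodes.getLength(); i++) {
            // For attribute nodes this is the attribute value; for elements, their text.
            values.add(nodes.item(i).getTextContent());
        }
        return values;
    }
}
```

Calling XPathScrape.select(page, "//a/@href") on a tidied page would collect the link targets. If the cleaned-up markup places its elements in a namespace, the expression would also need namespace handling, which this sketch leaves out.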
April 21st, 2003 | Published in java, rdf
I wrote an RDF crawler (aka scutter) using Java and the Jena RDF toolkit. It spiders the web, gathering up semantic web data and storing it in any of Jena’s backend stores (in-memory, Berkeley DB, MySQL, etc). Download it here.
April 13th, 2003 | Published in java, xml
Often I find I need to pull a bit of information out of a webpage to reuse inside some code. I’ve always done this from the command line using a combination of wget, HTML TIDY and xsltproc. Recently I’ve been doing the same thing in program code using some very handy tools written in Java.
Note: the example code below has been updated.
April 9th, 2003 | Published in java
If you’re going to download a resource over HTTP from a URL more than once, there are a couple of features of HTTP you should make sure you’re using. If you give the server some metadata about what you saw when you last downloaded the resource, it can reply with a status code (304 Not Modified) indicating that the resource hasn’t changed and that you should keep using the version you already have.
This issue has been highlighted recently by the bandwidth load caused by the growth in popularity of RSS readers, which repeatedly download RSS files looking for changes. There’s a good writeup of the details at The Fishbowl. I didn’t find any sample Java source when I went looking recently, so here’s some code.
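A minimal sketch of the idea, assuming the metadata in question is the ETag and Last-Modified response headers, sent back as If-None-Match and If-Modified-Since. This is not the code from the full entry; the ConditionalGet class, its Cached holder and the method names are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

// Hypothetical helper showing a conditional GET with HttpURLConnection.
class ConditionalGet {

    /** What we keep from one download so it can be offered back on the next. */
    static class Cached {
        final String etag;        // ETag response header, or null if the server sent none
        final long lastModified;  // Last-Modified as epoch milliseconds, 0 if absent
        final byte[] body;

        Cached(String etag, long lastModified, byte[] body) {
            this.etag = etag;
            this.lastModified = lastModified;
            this.body = body;
        }
    }

    /** Adds the validators from the previous download to a not-yet-connected request. */
    static void applyValidators(HttpURLConnection conn, Cached previous) {
        if (previous == null) {
            return; // first download: nothing to tell the server
        }
        if (previous.etag != null) {
            conn.setRequestProperty("If-None-Match", previous.etag);
        }
        if (previous.lastModified > 0) {
            conn.setIfModifiedSince(previous.lastModified);
        }
    }

    /** Returns the previous copy on 304 Not Modified, otherwise a freshly downloaded one. */
    static Cached fetch(URL url, Cached previous) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        applyValidators(conn, previous);

        int status = conn.getResponseCode();
        if (status == HttpURLConnection.HTTP_NOT_MODIFIED) {
            conn.disconnect();
            return previous; // unchanged: reuse what we already have
        }
        if (status != HttpURLConnection.HTTP_OK) {
            throw new IOException("Unexpected HTTP status " + status + " for " + url);
        }

        InputStream in = conn.getInputStream();
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[8192];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
            return new Cached(conn.getHeaderField("ETag"), conn.getLastModified(), out.toByteArray());
        } finally {
            in.close();
        }
    }
}
```

An RSS reader built this way would keep the Cached value between polls and pass it back to fetch each time, so an unchanged feed costs only a short 304 response rather than a full download.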
February 6th, 2003 | Published in java
Because the Exchange mail server at work is frustratingly slow and doesn’t have a flexible cross-folder search option, I wanted an indexing spider for IMAP. After a bit of struggling with the JavaMail API and almost no work at all plugging the messages into Lucene (which is impressively clean, flexible and powerful), I had some working code that starts at a folder and works down through its subfolders, indexing messages as it goes.
February 3rd, 2003 | Published in foaf, java, rdf
To normalise and aggregate FOAF metadata related to photographs, I needed some new code to:
- convert foaf:mbox entries to privacy-protected foaf:mbox_sha1sum entries.
- normalise statements of the form “PICTURE depicts PERSON” to “PERSON depiction PICTURE”.
- smush disparate references to the same person into references to a single definition of that person.
- extract depiction triples from a model and copy just the bare minimum of information related to those depictions.
So I wrote foaftool, a Java class that uses Jena. The tarball also contains a couple of servlets that can be used to transform existing content on the web.
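The first step in the list above can be sketched as follows. This is not foaftool’s own code: it is a standalone illustration that assumes the usual FOAF convention of taking the SHA-1 hash of the full mailto: URI and writing it as lowercase hex. The class name MboxSha1 and the method mboxSha1sum are made up for the example.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper: turns a foaf:mbox value into a foaf:mbox_sha1sum value.
class MboxSha1 {

    /** Accepts either "mailto:someone@example.org" or a bare address. */
    static String mboxSha1sum(String mbox) {
        // The hash is taken over the whole URI, including the "mailto:" scheme.
        String uri = mbox.startsWith("mailto:") ? mbox : "mailto:" + mbox;
        try {
            MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
            byte[] digest = sha1.digest(uri.getBytes(StandardCharsets.UTF_8));

            // Render the 20 digest bytes as 40 lowercase hex characters.
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-1 is a standard algorithm on the Java platform, so this is unexpected.
            throw new IllegalStateException("SHA-1 not available", e);
        }
    }
}
```

Because the same address always produces the same hash, the hashed value can still be used to recognise two descriptions of the same person without publishing the address itself.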
January 24th, 2003 | Published in java
Every time I start a new Java project, no matter what size, the first thing I do is go hunting through my Java directories looking for one to use as a template. Over time I’ve gathered some pretty useful Ant targets and settled on a fairly rational directory structure. Today I got round to building a skeleton set of directories and files that I can reuse in the future. Here’s a tarball of the results.
January 9th, 2003 | Published in bots, java, photos, rdf, rest
A background project for a while has been to write a bot to help me annotate the fairly large number of pictures I post to picdiary (1496 at the last count). Creating a document of RSS-based metadata is a slightly cumbersome text-editor job every time I post a new set of pics.
January 2nd, 2003 | Published in java, rdf, wordnet
To support the work I’ve been doing with WordNet and RDF, I wrote a utility Java class to handle URIs from the WordNet ontology for RDF devised by Dan Brickley.
December 30th, 2002 | Published in java, rdf
I’m collating writings about my various hacks and projects in a Movable Type system, but hopefully without actually creating a blog as such. I’d rather generate ad-hoc navigation based on the categories of the items, so creating RDF-based sitemaps seems like a good idea.