March 29, 2004

Planet RDF as a starting-point for semantic web crawling

I was thinking about scuttering and useful ways to gather information. It occurred to me that, as well as following rdfs:seeAlso links as usual, it would be worth scanning the associated HTML for semantic links (e.g. FOAF autodiscovery metadata). I decided to use the blogroll at Planet RDF as a testing ground, on the assumption that if anyone was going to embed useful metadata in their HTML, it would be the hackers listed there.

I should note that when Dave puts together the blogroll, he generally insists that the RSS we point to is parseable RDF, and not just tag soup that carries no meaning to a semantic web-aware client. This is well in keeping with the theme of the site and gives us a useful starting point for scuttering.

I hacked up some rough code pretty quickly (cribbing and adapting from some Mark Pilgrim code where necessary). It visits each RSS file in the blogroll with an RDF parser and finds the channel link. It then downloads the HTML from that link and looks for link tags pointing to RDF/XML (type application/rdf+xml). Finally, it outputs a new blogroll augmented with extra rdfs:seeAlso links, and combines all the discovered RDF into a single model.
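
A minimal sketch of the HTML-scanning step, for illustration only (this is not the code used for the experiment). It uses Python's standard html.parser and assumes that autodiscovery links are link tags with type="application/rdf+xml". The names RDFLinkFinder and find_rdf_links are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class RDFLinkFinder(HTMLParser):
    """Collect <link> tags whose type attribute is application/rdf+xml."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []  # list of (absolute_url, title) pairs

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "link":
            return
        attrs = dict(attrs)
        link_type = (attrs.get("type") or "").strip().lower()
        href = attrs.get("href")
        if link_type == "application/rdf+xml" and href:
            # Resolve relative hrefs against the page the HTML came from.
            self.links.append((urljoin(self.base_url, href), attrs.get("title")))


def find_rdf_links(html, base_url):
    """Return (url, title) pairs for every RDF/XML link found in the HTML."""
    finder = RDFLinkFinder(base_url)
    finder.feed(html)
    return finder.links
```

Each URL returned could then be written out as an rdfs:seeAlso on the corresponding blogroll entry. The title attribute (often "FOAF") is kept because it is a cheap hint about what the linked file contains.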

I ran the code (full log text), and it gathered a big bundle of information about the bloggers and an augmented blogroll.

I discovered that in the 33 weblogs listed:

  • 11 published some sort of RDF link in their HTML (though one of those links returned a 404).
  • 9 of these RDF links pointed to FOAF files.
  • 2 didn't publish parseable XML in their RSS, due to Unicode or character encoding issues.
  • 2 had moved their RSS feeds to a new location, signalled by a 302 redirect.
  • 1 was publishing RSS 2.0 under an RDF link.
  • 1 of the blogroll links pointed to an RDF file that isn't RSS (although an RSS file is available at another URL).

When I've got some more hacking time, I'll get back to this dataset and do some more analysis. By smushing the blogroll against the gathered data (via the weblog property, which is an inverse functional property, or IFP), I should be able to build a little visualisation of who knows who on the Planet RDF planet, and gather some extra info to put on the site itself (thumbnail author images, for example). It'd be great to see more people put a link to their FOAF in their blog HTML.
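
Smushing on an IFP boils down to merging any nodes that share a value for that property. A rough sketch, assuming triples are plain (subject, predicate, object) tuples rather than a real RDF model. The smush function and the example data are hypothetical.

```python
# foaf:weblog is declared as an inverse functional property in FOAF.
WEBLOG = "http://xmlns.com/foaf/0.1/weblog"


def smush(triples, ifp=WEBLOG):
    """Merge subjects that share the same value for an IFP.

    Returns a set of triples in which every merged node has been
    replaced by a single representative node.
    """
    parent = {}

    def find(node):
        # Union-find lookup with path halving.
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    first_seen = {}  # IFP value -> first subject seen with that value
    for s, p, o in triples:
        if p == ifp:
            if o in first_seen:
                # Same weblog, so both subjects denote the same thing.
                parent[find(s)] = find(first_seen[o])
            else:
                first_seen[o] = s

    # Rewrite subjects and objects to their representatives.
    return {(find(s), p, find(o)) for s, p, o in triples}


triples = [
    ("_:a", WEBLOG, "http://example.org/blog/"),
    ("_:a", "http://xmlns.com/foaf/0.1/name", "Alice"),
    ("_:b", WEBLOG, "http://example.org/blog/"),
    ("_:b", "http://xmlns.com/foaf/0.1/knows", "_:c"),
]
merged = smush(triples)
```

Here _:a comes from the blogroll and _:b from a discovered FOAF file. After smushing, the name and the knows link hang off the same node, which is what makes a who-knows-who graph possible.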

Posted by Matt Biddulph at 02:58 AM | Comments (0) | TrackBack

March 08, 2004

Work at the BBC

There's a job being advertised at BBC Radio and Music Interactive in London, where I work. It involves Python, XML, CMSes, digital radio and other interesting technologies. You'd like it there.

UPDATE: Applications have now closed.

FURTHER UPDATE: for unknown reasons, this page is (at the time of writing) the number one hit on Google for the search term "work at the bbc" (and perhaps some similar terms). If you came here looking for work, I suggest you look at the BBC Jobs site or perhaps BBC Talent, where budding writers, presenters and DJs are sought.

Posted by Matt Biddulph at 10:18 PM | Comments (1) | TrackBack