Planet RDF as a starting-point for semantic web crawling

March 29th, 2004  |  Published in rdf

I was thinking about scuttering and useful ways to gather information. It occurred to me that, as well as following rdfs:seeAlso links as usual, it would be worth scanning the associated HTML for semantic links (e.g. FOAF autodiscovery metadata). I decided to use the blogroll at Planet RDF as a testing ground, on the assumption that if anyone was going to embed useful metadata in their HTML, it would be the hackers listed there.

I should note that when Dave puts together the blogroll, he generally insists that the RSS we point to is parseable RDF, and not just tag soup that carries no meaning to a semantic web-aware client. This is well in keeping with the theme of the site and gives us a useful starting point for scuttering.

I hacked up some rough code pretty quickly (cribbing and adapting from some Mark Pilgrim code where necessary). It visits each RSS file in the blogroll with an RDF parser and finds the channel link. It then downloads the HTML from that link and looks for link tags pointing to RDF/XML. Finally, it outputs a new blogroll augmented with extra rdfs:seeAlso links, and combines all the discovered RDF into a single model.
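The HTML-scanning step can be sketched roughly as follows. This is not the original code, just a minimal illustration using only the Python standard library. It assumes that the RDF links are advertised as link elements with type="application/rdf+xml" (the usual FOAF autodiscovery convention); the names RDFLinkFinder and find_rdf_links are made up for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class RDFLinkFinder(HTMLParser):
    """Collects <link> elements whose type is application/rdf+xml."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []  # list of (absolute_url, title) pairs

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "link":
            return
        attributes = {name: (value or "") for name, value in attrs}
        media_type = attributes.get("type", "").strip().lower()
        href = attributes.get("href", "")
        if media_type == "application/rdf+xml" and href:
            # Resolve relative hrefs against the page the HTML came from.
            absolute = urljoin(self.base_url, href)
            self.links.append((absolute, attributes.get("title", "")))


def find_rdf_links(html, base_url):
    """Return (url, title) pairs for RDF/XML links found in the HTML."""
    finder = RDFLinkFinder(base_url)
    finder.feed(html)
    return finder.links
```

Each URL returned by find_rdf_links is a candidate for an extra rdfs:seeAlso entry in the augmented blogroll. Fetching it and checking that it really parses as RDF is a separate step that is left out here.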

I ran the code (full log text), and it gathered a big bundle of information about the bloggers and an augmented blogroll.

I discovered that in the 33 weblogs listed:

  • 11 publish some sort of RDF link in their HTML (though one of those links returned a 404).
  • 9 of these RDF links were to FOAF files.
  • 2 didn’t publish parseable XML in their RSS due to Unicode or character encoding issues.
  • 2 had moved their RSS feeds to a new location, signalled by an HTTP 302 redirect.
  • One is publishing RSS 2.0 under an RDF link.
  • One of the links in the blogroll points to an RDF file that isn’t RSS (but there is an RSS file available on another URL).

When I’ve got some more hacking time, I’ll get back to this dataset and do some more analysis. By smushing the blogroll against the gathered data (via the weblog property, an IFP), it should be possible to build a little visualisation of who knows who on Planet RDF, and to gather some extra info to put on the site itself (thumbnail author images, for example). It’d be great to see more people put a link to their FOAF in their blog HTML.
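For anyone unfamiliar with smushing, here is a minimal sketch of the idea, not the code I intend to use. It assumes the data is a plain list of (subject, property, value) tuples rather than a real RDF model, and the function name smush and the node labels in the usage note are invented for illustration. Because foaf:weblog is an inverse functional property, any two nodes that share a weblog value must describe the same person, so they can be merged.

```python
WEBLOG = "http://xmlns.com/foaf/0.1/weblog"


def smush(triples, ifp=WEBLOG):
    """Merge subjects that share a value for an inverse functional property.

    `triples` is an iterable of (subject, property, value) tuples.
    Returns a set of triples with merged nodes rewritten to one identifier.
    """
    triples = list(triples)
    parent = {}  # union-find forest over node identifiers

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Keep the lexically smaller identifier so results are stable.
            parent[max(root_a, root_b)] = min(root_a, root_b)

    first_subject = {}  # IFP value -> first subject seen with that value
    for subject, prop, value in triples:
        if prop == ifp:
            if value in first_subject:
                union(first_subject[value], subject)
            else:
                first_subject[value] = subject

    def canonical(term):
        # Only nodes that took part in a merge are rewritten.
        return find(term) if term in parent else term

    return {(canonical(s), p, canonical(o)) for s, p, o in triples}
```

Given a blogroll node "_:a" and a FOAF node "_:b" that both have the same weblog value, smush rewrites every statement about "_:b" (a name, an image, incoming knows links) so that it is about "_:a" instead, which is the join needed for a who-knows-who picture.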
