I was thinking about scuttering and useful ways to gather information. It occurred to me that, as well as following rdfs:seeAlso links as usual, it would be worth scanning the associated HTML for semantic links (e.g. FOAF autodiscovery metadata). I decided to use the blogroll at Planet RDF as a testing ground, on the assumption that if anyone was going to embed useful metadata in their HTML, it would be the hackers listed there.
I should note that when Dave puts together the blogroll, he generally insists that the RSS we point to is parseable RDF, and not just tag soup that carries no meaning to a semantic web-aware client. This is well in keeping with the theme of the site and gives us a useful starting point for scuttering.
I hacked up some rough code pretty quickly (cribbing and adapting from some Mark Pilgrim code where necessary). It visits each RSS file in the blogroll with an RDF parser and finds the channel link. It downloads the HTML from there and looks for link tags pointing to RDF/XML. Finally, it outputs a new blogroll augmented with extra rdfs:seeAlso links, and combines all the discovered RDF into a single model.
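For illustration, the HTML-scanning step might look something like the sketch below. This is not the code I actually ran: it uses only the Python standard library, assumes the link tags declare type="application/rdf+xml", and the names RDFLinkFinder and find_rdf_links are made up for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class RDFLinkFinder(HTMLParser):
    """Collects href values of <link> tags that advertise RDF/XML."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        # Attribute values may be None for valueless attributes.
        a = {name: (value or "") for name, value in attrs}
        # Assumption: autodiscovery links carry the RDF/XML media type.
        if a.get("type", "").strip().lower() == "application/rdf+xml" and a.get("href"):
            # Resolve relative hrefs against the page URL.
            self.links.append(urljoin(self.base_url, a["href"]))


def find_rdf_links(html, base_url):
    """Return absolute URLs of RDF/XML documents linked from the HTML."""
    finder = RDFLinkFinder(base_url)
    finder.feed(html)
    return finder.links
```

Each URL returned this way would become an extra rdfs:seeAlso entry for that weblog in the augmented blogroll.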
I ran the code (full log text), and it gathered a big bundle of information about the bloggers and an augmented blogroll.
I discovered that in the 33 weblogs listed:
When I've got some more hacking time, I'll get back to this dataset and do some more analysis. By smushing the blogroll against the gathered data (via the weblog property, which is an inverse functional property, or IFP), it'll be possible to build a little visualisation of who knows who on the Planet RDF planet, and gather some extra info to put on the site itself (thumbnail author images, for example). It'd be great to see more people put a link to their FOAF in their blog HTML.
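To make the smushing idea concrete, here is a rough sketch of merging nodes that share a value for an IFP. It works on plain (subject, predicate, object) tuples rather than a real RDF model, and the smush function and the "foaf:weblog" string are illustrative assumptions, not part of any library.

```python
def smush(triples, ifp):
    """Merge subjects that share a value for an inverse functional property."""
    parent = {}

    def find(node):
        # Union-find lookup with path halving.
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    owner = {}  # IFP value -> first subject seen with that value
    for s, p, o in triples:
        if p == ifp:
            if o in owner:
                # Same IFP value means the two subjects are the same thing.
                parent[find(s)] = find(owner[o])
            else:
                owner[o] = s

    # Rewrite every triple to use the canonical node for each merged group.
    return {(find(s), p, find(o)) for s, p, o in triples}
```

Given two blank nodes that both have the same weblog URL, every statement about either one ends up attached to a single node, which is what would let the blogroll entries pick up names and images from the gathered FOAF data.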
rdf Posted by Matt Biddulph at March 29, 2004 02:58 AM