Crawling the Semantic Web

February 12th, 2004  |  Published in rdf, talks, web  |  2 Comments

I’ve had a proposal for a paper accepted for XML Europe 2004. Yay! Looking forward to meeting lots of old friends and making new ones in Amsterdam in April. Let me know if you’re going to be there. Here’s what I submitted:

This presentation examines the problem of semantic web crawling – following links from document to document and gathering the results for searching. Unlike centralised web search facilities, semantic web agents will be distributed, personalised and often highly domain-specific. How can we hold the entire world inside our laptops?

The W3C vision for the Semantic Web is “an extension of the current web in which information is given well-defined meaning, better enabling computers and people to work in cooperation”. It is “the representation of data on the World Wide Web”, expressed using the Resource Description Framework (RDF). Just as the web grew in usefulness as it was traversed, indexed and searched by systems such as Lycos, Altavista and Google, the semantic web requires technologies that can crawl, aggregate and query the RDF data.

This talk presents a modular semantic web crawler designed to explore the provision of services to applications. It highlights differences from and similarities to existing web search systems that gather their source data from the public web.

Rather than have web crawling and aggregation built into every semantic web application, agents will be able to call on aggregation services via webservices, be notified of new resources by publish-and-subscribe mechanisms, or simply receive a stream of RDF statements as they are found. A number of different RDF storage mechanisms are tested, including traditional relational databases and RDF toolkits such as Redland.

Applications will be discussed in the areas of social networks (using the Friend Of A Friend vocabulary) and personal publishing. Models for providing centralised services to lighter-weight agents (such as mobile applications) are explored, and important issues such as trust and attribution of information are covered.


  1. Danny says:

    February 13th, 2004 at 12:59 pm (#)

    Hi Matt, congrats!
    (They didn’t like my proposal, sniff).

    I’m curious – what have you got by way of implementation along these lines? I know of El Chumpo, what else have you got up your sleeve?

  2. steve cayzer says:

    March 1st, 2004 at 9:46 pm (#)

    Hi Matt,

    [catching up at last] well done! I note that I’m the talk just before you, it sounds like you’re setting a high standard to aim for :) Look forward to meeting you…