An RDF crawler

April 21st, 2003  |  Published in java, rdf  |  8 Comments

I wrote an RDF crawler (aka scutter) using Java and the Jena RDF toolkit. It spiders the web, gathering up semantic web data and storing it in any of Jena’s backend stores (in-memory, Berkeley DB, MySQL, etc.). Download it here.


The system is multithreaded and so can simultaneously download from many sources while the aggregation thread does the processing. It builds a model that remembers the provenance of the RDF and takes care to delete and replace triples if it hits the same URL twice, so you can run it as often as you like to keep the data fresh without bloating the store with out-of-date information. As yet it doesn’t do anything with what it gathers; the information’s just sitting there waiting for interesting applications to be built on top of it.
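To make the delete-and-replace idea concrete, here is a minimal sketch in plain Java. It is not the scutter's actual code and does not use Jena: `ProvenanceStore` and `Triple` are hypothetical names, and the store is just a map from source URL to the triples last seen there. It assumes the aggregation step hands over the complete set of triples parsed from one URL at a time.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: keeps triples grouped by the URL they were downloaded from. */
class ProvenanceStore {

    /** Plain subject/predicate/object holder (a stand-in for a real RDF statement). */
    static final class Triple {
        final String subject;
        final String predicate;
        final String object;

        Triple(String subject, String predicate, String object) {
            this.subject = subject;
            this.predicate = predicate;
            this.object = object;
        }
    }

    private final Map<String, List<Triple>> triplesBySource = new HashMap<>();

    /**
     * Drops whatever was previously stored for sourceUrl and stores the fresh
     * triples instead, so re-crawling a URL never leaves stale data behind.
     * Synchronized because several download threads may feed one store.
     */
    synchronized void replace(String sourceUrl, List<Triple> fresh) {
        triplesBySource.put(sourceUrl, new ArrayList<>(fresh));
    }

    /** Returns the triples currently attributed to sourceUrl (empty if none). */
    synchronized List<Triple> from(String sourceUrl) {
        List<Triple> stored = triplesBySource.get(sourceUrl);
        if (stored == null) {
            return Collections.emptyList();
        }
        return Collections.unmodifiableList(new ArrayList<>(stored));
    }

    /** Total number of triples across all sources. */
    synchronized int size() {
        int total = 0;
        for (List<Triple> triples : triplesBySource.values()) {
            total += triples.size();
        }
        return total;
    }
}
```

Calling `replace` twice for the same URL leaves only the second batch in the store, which is the behaviour described above; a real implementation would do the equivalent removal and insertion against the chosen backend.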

To use it as distributed, set up a MySQL database called “scutter”, set the username and password in the DBConnection setup in Scutter.java, then recompile using ‘ant compile’ (sorry, no handy config files in this 0.1 release). Run the script scutter.sh, passing in as many starting-point URLs as you like. These will be added to the queue, and any rdfs:seeAlso pointers in the downloaded RDF will be followed recursively until no more unique URLs can be found. The biggest known issue at the moment is that it has no proper management for working out when it has run out of URLs; it just stops. The standard log4j.properties file can be edited to change what gets logged; with full debugging information turned on, you get quite a lot of output.
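The queue-and-follow behaviour can be sketched roughly as below. This is a single-threaded illustration, not the scutter's real multithreaded code: `SeeAlsoCrawl` is a hypothetical name, and the `seeAlsoOf` function stands in for "download this URL, parse the RDF and return its rdfs:seeAlso targets", so the example runs without any network access.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

/** Hypothetical sketch of following rdfs:seeAlso links until no new URLs turn up. */
class SeeAlsoCrawl {

    /**
     * @param startUrls the starting-point URLs given on the command line
     * @param seeAlsoOf assumed helper: returns the rdfs:seeAlso targets found at a URL
     * @return every unique URL visited, in the order it was first queued
     */
    static Set<String> crawl(List<String> startUrls,
                             Function<String, List<String>> seeAlsoOf) {
        Set<String> seen = new LinkedHashSet<>(); // URLs already queued at some point
        Deque<String> queue = new ArrayDeque<>(); // URLs still waiting to be fetched

        for (String url : startUrls) {
            if (seen.add(url)) {
                queue.add(url);
            }
        }

        // With one thread, an empty queue means the crawl is finished.
        while (!queue.isEmpty()) {
            String url = queue.poll();
            List<String> links = seeAlsoOf.apply(url);
            if (links == null) {
                continue; // nothing usable at this URL
            }
            for (String next : links) {
                if (seen.add(next)) { // add() is false for URLs seen before
                    queue.add(next);
                }
            }
        }
        return seen;
    }
}
```

With several download threads, an empty queue is not enough to declare the crawl finished, because a download still in flight may yet add new URLs; a counter of in-flight downloads alongside the queue is one common way to handle that.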

Plans for the future include tying FOAF-related processing, such as smushing and mbox_sha1sum normalising, into the aggregation, and making a publish/subscribe-based system so that people who can’t run their own aggregators can subscribe to the RDF that’s gathered.
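As a rough idea of what the mbox_sha1sum step could look like: FOAF defines mbox_sha1sum as the SHA-1 hash of the mailbox URI, including the "mailto:" prefix. The sketch below computes it with the standard `MessageDigest` class. The class and method names are made up for illustration, and trimming whitespace and adding a missing "mailto:" prefix are assumptions of mine about what "normalising" might involve, not something this release does.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical helper: turns a foaf:mbox value into a foaf:mbox_sha1sum value. */
class MboxSha1 {

    static String mboxSha1sum(String mbox) {
        // Assumed normalisation: strip surrounding whitespace, ensure the mailto: prefix.
        String uri = mbox.trim();
        if (!uri.startsWith("mailto:")) {
            uri = "mailto:" + uri;
        }
        try {
            MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
            byte[] digest = sha1.digest(uri.getBytes(StandardCharsets.UTF_8));
            // Render the 20-byte digest as 40 lowercase hex characters.
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-1 is a required algorithm on every Java platform.
            throw new IllegalStateException("SHA-1 not available", e);
        }
    }
}
```

Two descriptions that yield the same mbox_sha1sum can then be treated as the same person, which is the basis for smushing.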

Responses

  1. Development Notebook says:

    April 21st, 2003 at 3:25 pm (#)

    Java scutter

    hackdiary: An RDF crawler

  2. Kieran says:

    April 21st, 2003 at 11:06 pm (#)

    Nice work,

    Look forward to trying it out at some point. Next step, I suppose, is to write a web service that can query across many different people’s databases, so we don’t have to run from one great big Google-like central database of crawled results.

  3. Chintan says:

    April 22nd, 2003 at 6:39 am (#)

    I’m sorry, but it is pathetic! Please have a look at http://ontobroker.semanticweb.org/rdfcrawl/
    Rather than spending your effort writing your own thing, you could have modified and enhanced the existing work.

  4. BenM says:

    April 22nd, 2003 at 7:19 am (#)

    Calm down on the bashing, will you! So what if something that works is already available? That doesn’t stop other disciplines developing similar implementations of basic ideas.

    In my opinion it is often useful to do your own thing rather than just to modify or enhance existing work. Sure it’s good to do that as well, but building your own crawler is a good way to learn about or deepen your understanding for yourself. Foaf, not going to use the crawler myself (I’ve programmed my own in C#) but keep up the good work.

  5. Mark says:

    April 22nd, 2003 at 4:39 pm (#)

    Chintan: The RDF crawler you link to was last updated in November 2000, and appears to be abandoned. Furthermore, I can find no mention anywhere in the program or accompanying documentation of any licensing terms. In the absence of such terms, the project defaults to a strict traditional copyright, under which derivative works are not allowed.

    I should also note that this is far, far more courtesy than you deserve.

  6. CH*AS Blog says:

    August 1st, 2003 at 6:18 pm (#)

    RDF, XML Parser in Python

    rdfxml.py: An RDF/XML Parser … Python: rdfxml.py is a standalone Python module in under 10KB that parses RDF/XML using SAX. It was written to be used as a simple drop-in module for larger projects Parsing RDF with python An RDF…

  7. Goodpic says:

    December 9th, 2003 at 12:10 pm (#)

    A Java RDF crawler (search robot)

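    Apparently this is a Java-based search robot that crawls the web looking for RDF and stores the information in a database. An RDF crawler (Hackdiary) Looking at the access logs, on this blog too, Google and other major…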

  8. Blog@ZeroDimension says:

    December 11th, 2003 at 3:59 am (#)

    An RDF crawler

    Goodpic: Java