A freetext-indexing IMAP spider

February 6th, 2003  |  Published in java  |  4 Comments

Because the Exchange mailserver at work is frustratingly slow and doesn’t have a flexible cross-folder search option, I wanted an indexing spider for IMAP. After a bit of struggling with the JavaMail API, and almost no work at all plugging the messages into Lucene (which is impressively clean, flexible and powerful), I had some working code that starts at a folder and works down through its subfolders, indexing messages as it goes.
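
The folder walk described above can be sketched roughly as follows. This is not the code from the tarball: MailFolder and Mail are hypothetical in-memory stand-ins for the JavaMail folder and message classes, and the onMessage signature of MessageListener is an assumption, so treat it as an illustration of the depth-first traversal only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class FolderSpiderSketch {

    /** Hypothetical stand-in for a mail message; not the JavaMail Message class. */
    static final class Mail {
        final String messageId;
        final String subject;

        Mail(String messageId, String subject) {
            this.messageId = messageId;
            this.subject = subject;
        }
    }

    /** Hypothetical stand-in for an IMAP folder holding messages and subfolders. */
    static final class MailFolder {
        final String name;
        final List<Mail> messages;
        final List<MailFolder> subfolders;

        MailFolder(String name, List<Mail> messages, List<MailFolder> subfolders) {
            this.name = name;
            this.messages = messages;
            this.subfolders = subfolders;
        }
    }

    /** Callback the spider fires for every message; the signature is assumed. */
    interface MessageListener {
        void onMessage(String folderPath, Mail mail);
    }

    /** Depth-first walk: report this folder's messages, then recurse into each subfolder. */
    static void spider(MailFolder folder, String parentPath, MessageListener listener) {
        String path = parentPath.isEmpty() ? folder.name : parentPath + "/" + folder.name;
        for (Mail mail : folder.messages) {
            listener.onMessage(path, mail);
        }
        for (MailFolder child : folder.subfolders) {
            spider(child, path, listener);
        }
    }

    /** Builds a tiny folder tree and returns the fired events in visiting order. */
    static List<String> demo() {
        MailFolder archive = new MailFolder("archive",
                Arrays.asList(new Mail("<2@example.org>", "old news")),
                Collections.<MailFolder>emptyList());
        MailFolder inbox = new MailFolder("INBOX",
                Arrays.asList(new Mail("<1@example.org>", "hello")),
                Arrays.asList(archive));
        List<String> events = new ArrayList<>();
        spider(inbox, "", (path, mail) -> events.add(path + ": " + mail.subject));
        return events;
    }

    public static void main(String[] args) {
        for (String event : demo()) {
            System.out.println(event);
        }
    }
}
```

Because the spider only knows about the listener, the same walk could feed something other than an indexer, such as a counter or an exporter.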


This tarball contains the source, compiled class files and support jars, along with a Jetty setup that will let you run the demo servlet without needing an install of Tomcat or any other servlet engine. Point the indexer at your IMAP host, give it a folder to start from, and it will recursively build an index of subject, date, from and mail body. Then run Jetty via queryserver.sh and point your browser at https://localhost:9999

The indexer uses the Message-ID as a primary key, so on each run it only indexes mail it hasn’t seen before. This means it will work nicely from a regular cronjob. The query code uses the standard Lucene query parser, so it supports queries such as +foo +bar, subject:fish and "phrase search". The spider is independent of the indexer and just fires message events at a MessageListener interface, so it might be useful for other things. The main limitation at the moment (apart from the lack of a nice interface) is that the code only copes with single-part messages of type text/plain. The MailDocument class is the place to start improving that.
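
As a rough illustration of the Message-ID check, here is a minimal sketch of an indexer that skips anything it has already seen. The post does not say how the real indexer performs the lookup, so the in-memory Set below is an assumption standing in for a query against the persistent index. The class and method names are hypothetical, and skipping messages that have no Message-ID is also an assumption.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class IncrementalIndexSketch {

    /** Message-IDs indexed so far; a real indexer would consult its on-disk index instead. */
    private final Set<String> indexedIds = new HashSet<>();

    /** Indexes the message only if its Message-ID is new; returns true when it was added. */
    boolean indexIfNew(String messageId) {
        if (messageId == null || !indexedIds.add(messageId)) {
            return false; // no usable key, or already indexed on an earlier run
        }
        // A real implementation would add subject, date, from and body to the index here.
        return true;
    }

    /** Feeds a batch of Message-IDs through the indexer and counts how many were new. */
    int run(List<String> messageIds) {
        int added = 0;
        for (String id : messageIds) {
            if (indexIfNew(id)) {
                added++;
            }
        }
        return added;
    }

    public static void main(String[] args) {
        IncrementalIndexSketch indexer = new IncrementalIndexSketch();
        List<String> mailbox = Arrays.asList("<1@example.org>", "<2@example.org>");
        System.out.println("first run indexed:  " + indexer.run(mailbox));
        System.out.println("second run indexed: " + indexer.run(mailbox));
    }
}
```

Keying on Message-ID makes repeated runs idempotent, which is what makes the cronjob setup safe: a second pass over an unchanged mailbox adds nothing.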

Responses

  1. Erik Hatcher says:

    February 7th, 2003 at 4:07 am (#)

    Have you seen Zoe? It’s a Lucene-based mail indexer with a very snazzy web interface and many many features: http://guests.evectors.it/zoe/

    I’m glad to see more folks coding search engines on e-mail though…. very cool!

  2. lady snipa says:

    May 6th, 2003 at 12:27 pm (#)

    i am the best WHITE ghal mc

  3. Roderik says:

    June 3rd, 2003 at 1:01 pm (#)

    This is so nice to have around. Thanks a lot for including every needed component!

    I’m usually OK with ‘just’ using my IMAP client and finding messages by date/folder. Now, I was looking for a mail indexer because I wanted to do a certain search, but I was hoping not to have to spend hours reconfiguring my system.
    (So Zoe looked like a nice product but I got a bit scared reading its website :) )

    This was the only other thing I found, but it was exactly what I needed:
    – download & unzip
    – run shell command for indexing
    – start webserver which you included in the package
    – search using browser (and stop webserver when I was done)

    Cool!

  4. things says:

    November 14th, 2004 at 11:48 am (#)

    id3 tag indexer

    I thought it would be useful to index MP3s using the data in their ID3 tags, so you can look…