DotLucene (Lucene.NET) + KStemmer + Searcharoo = great!
The overall search engine is composed of three parts:
- A site crawler: In the past, I’ve built search engines that used the raw data inside our CMS, but a crawler seems to work better when you have a fair amount of dynamic content. I found a nice crawler in Searcharoo. It’s a full search engine by itself, but since I wanted to use Lucene, I only used the crawler portion of Searcharoo.
- An indexer: This is where Lucene.NET (or DotLucene) comes in. When Searcharoo downloads a page, the text is sent to Lucene to index.
- A stemmer: Lucene does a great job of indexing and searching, but it doesn’t natively have the ability to search for derivatives of a stem word. For example, if a user searches for “tests”, Lucene doesn’t by default figure out the stem (“test”, removing the plural “s”) and then search for all words based on that stem (“test”, “testing”, or “tested”). But there is a port of KStemmer which handles all the stemming automagically. Example: http://www05.dts.edu/search/?q=tests
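To make the three parts a little more concrete, here is a toy sketch in Python of how they fit together. It is not the Searcharoo, Lucene.NET, or KStemmer code: the `toy_stem` function is a crude suffix stripper standing in for KStemmer, `ToyIndex` is a hypothetical in-memory stand-in for the Lucene index, and the `pages` dictionary is made-up data standing in for whatever the crawler downloads. The one idea it illustrates is that the same stemmer runs at index time and at query time, so a search for “tests” matches pages containing “testing” or “tested”.

```python
import re
from collections import defaultdict


def toy_stem(word):
    """Crude suffix stripper used only for illustration (NOT the KStem algorithm)."""
    word = word.lower()
    for suffix in ("ing", "ed", "s"):
        # Only strip when at least three characters of the word remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tokenize(text):
    """Split text into lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class ToyIndex:
    """Hypothetical in-memory inverted index standing in for Lucene."""

    def __init__(self):
        self.postings = defaultdict(set)  # stem -> set of page URLs

    def add_page(self, url, text):
        # Index time: store each token under its stem.
        for token in tokenize(text):
            self.postings[toy_stem(token)].add(url)

    def search(self, query):
        # Query time: stem the query terms the same way, then AND them together.
        stems = [toy_stem(t) for t in tokenize(query)]
        if not stems:
            return set()
        results = set(self.postings.get(stems[0], set()))
        for stem in stems[1:]:
            results &= self.postings.get(stem, set())
        return results


if __name__ == "__main__":
    # Made-up pages standing in for what a crawler would download.
    pages = {
        "/a": "Testing the new site",
        "/b": "We tested it",
        "/c": "Nothing relevant here",
    }
    index = ToyIndex()
    for url, text in pages.items():
        index.add_page(url, text)
    print(sorted(index.search("tests")))  # prints ['/a', '/b']
```

A real stemmer such as KStemmer is far more careful than this suffix stripper (it would not, for example, turn “class” into “clas”), which is exactly why it is worth plugging in a proper port rather than rolling your own.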
Right now I’m using Lucene 1.4 because 1.9 is not yet out of the RC stage and the new Highlighter 1.5 has some bugs.