DotLucene (Lucene.NET) + KStemmer + Searcharoo = great!

March 7, 2006 | Uncategorized | 5 Comments

I’m in the middle of implementing the current Lucene.NET (a port of the original Java Lucene) on a new site currently under development: http://www05.dts.edu/search.

The overall search engine is composed of three parts:

  1. A site crawler: In the past, I’ve built search engines that used the raw data inside our CMS, but a crawler seems to work better when you have a fair amount of dynamic content. I found a nice crawler in Searcharoo. It’s a full search engine by itself, but since I wanted to use Lucene, I only used the crawler portion of Searcharoo.
  2. An indexer: This is where Lucene.NET (or DotLucene) comes in. When Searcharoo downloads a page, the text is sent to Lucene to index.
  3. A stemmer: Lucene does a great job of indexing and searching, but it doesn’t natively have the ability to search for derivatives of a stem word. For example, if a user searches for “tests”, Lucene doesn’t by default figure out the stem (“test”, removing the plural “s”) and then search for all words based on that stem (“test”, “testing”, or “tested”). But there is a port of KStemmer which handles all the stemming automagically. Example: http://www05.dts.edu/search/?q=tests
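
To make the stemming step concrete, here is a minimal Python sketch of the idea. It is not the code running on the site and it does not use Lucene.NET, Searcharoo, or KStemmer: `toy_stem`, `TinyIndex`, and the sample pages are all hypothetical, and the suffix stripper is only a crude stand-in for a real stemmer. It just shows why stemming both the indexed text and the query lets “tests” match “test”, “testing”, and “tested”.

```python
import re
from collections import defaultdict


def toy_stem(word):
    """Crude suffix stripper; a stand-in for a real stemmer, not KStem itself."""
    for suffix in ("ing", "ed", "s"):
        # Only strip when at least three characters would remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


class TinyIndex:
    """Toy inverted index mapping each stem to the set of URLs containing it."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, url, text):
        # Index the stem of every token, not the raw token.
        for token in tokenize(text):
            self.postings[toy_stem(token)].add(url)

    def search(self, query):
        # Stem the query the same way, then intersect the postings (AND search).
        results = None
        for token in tokenize(query):
            urls = self.postings.get(toy_stem(token), set())
            results = urls if results is None else results & urls
        return sorted(results or [])


# Hypothetical pages, as a crawler might hand them to the indexer.
pages = {
    "/a": "We test every release.",
    "/b": "Testing is tested here.",
    "/c": "Nothing relevant.",
}

index = TinyIndex()
for url, text in pages.items():
    index.add(url, text)

print(index.search("tests"))
```

The important design point is that the same stemming function runs at index time and at query time. If only one side were stemmed, the stems in the index and the terms in the query would no longer line up.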

Right now I’m using Lucene 1.4 because 1.9 is not yet out of the RC stage and the new Highlighter 1.5 has some bugs.

5 responses to “DotLucene (Lucene.NET) + KStemmer + Searcharoo = great!”

  1. 曾登高 says:

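    Here is a summary of what I learned and experienced while studying and developing search engines. The article gives an overview and some details of the spider, word segmentation, indexing, and query modules. I hope it can help beginners with search engines a little, and perhaps give the experts a bit of inspiration too! This is a summary from when I was studying and developing search-engine-related things in 2004, so it may be fairly superficial. I have kept researching this area recently and have some new conclusions beyond that article; when I have time I will write another one to share with everyone!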

  2. ray says:

    I have 3 questions,

    1. I thought Lucene creates the index on the file system, so in the case of a web farm, how do you do it?  

    2. SqlServer has its own full text search capability. Since you are not using that, I’d like to ask what the problem is with SqlServer’s full text search service?

    3. Community Server wrote its own indexer that breaks everything up and stores it in a DB table. How does this approach compare to the other two mentioned above?

    Thanks,

    Ray.

  3. Pierre says:

    Is your implementation of DotLucene (Lucene.NET) + KStemmer + Searcharoo available somewhere?

    I would like to implement it the same way.

    Thanks, Peter

Hi, I'm John Dyer. In my day job, I build websites and create online seminary software for a seminary in Dallas. I also like to release open source tools, including a pretty popular HTML5 video player, and to build tools that help people find the best Bible commentaries and do Bible study. And just for fun, I also wrote a book on the theology of technology and media.
