WebCrawler Engine in C# (first draft)

A few weeks ago, I wrote about using SearchAroo as a spider to index
a site with DotLucene. I've written a new WebCrawler using SearchAroo
as a base and turned it into a library that can be reused in other
applications.

Here are the improvements I've made:

  1. Gets text from the following HTML tag attributes: alt, title, summary, longdesc
  2. Better resolution of relative URLs
  3. The WebDocument object keeps a record of all files it links out to,
    including external and internal links, as well as images. This is
    useful for determining if your site has missing images or broken
    outgoing links.
  4. Compiled into a reusable library (the author of SearchAroo didn't
    want to have a dll, but I feel it's much more usable this way), which
    means it can be plugged into any indexing framework or used for other
    purposes such as simple link checking.
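
To give a rough idea of what items 2 and 3 involve, here is a minimal sketch built only on System.Uri. It is not the library's actual code: LinkResolver, Resolve and IsInternal are names I made up for this illustration, and treating "same host" as "internal" is an assumption.

```csharp
using System;

// Hypothetical helper for illustration only; not part of the crawler library.
public static class LinkResolver
{
    // Resolves an href found on a page against that page's own URL.
    // Returns null if the href is empty, malformed, or not an http(s) link.
    public static Uri Resolve(Uri pageUri, string href)
    {
        if (string.IsNullOrWhiteSpace(href)) return null;

        Uri absolute;
        if (!Uri.TryCreate(pageUri, href.Trim(), out absolute)) return null;

        // Skip mailto:, javascript: and other non-web schemes.
        if (absolute.Scheme != Uri.UriSchemeHttp &&
            absolute.Scheme != Uri.UriSchemeHttps)
        {
            return null;
        }

        // Drop the #fragment so the same page is not recorded twice.
        return new Uri(absolute.GetLeftPart(UriPartial.Query));
    }

    // Assumption: a link is "internal" when its host matches the site's host.
    public static bool IsInternal(Uri siteRoot, Uri link)
    {
        return string.Equals(siteRoot.Host, link.Host,
                             StringComparison.OrdinalIgnoreCase);
    }
}
```

Every resolved link can then be sorted into an internal or external list, and unresolvable hrefs (a null result) can be reported or ignored.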

Here is the basic code to get it running:

string baseUrl = "http://mywebsite.com/";

CrawlerEngine crawler = new CrawlerEngine();

// called once for every document the crawler downloads
crawler.OnDocumentLoaded += new DocumentHandler(crawler_OnDocumentLoaded);

// then start the crawl from baseUrl

void crawler_OnDocumentLoaded(WebDocumentBase webDocument, int level) {

    // do indexing code

    // WebDocumentBase is the base class for all documents that are downloaded and spidered
    // it has the following properties (Uri, ContentType, MimeType, Encoding, Length, TextData, InternalLinks, ExternalLinks, ImageSrcs)

    // if the file is an HTML file, then it can be cast to an HtmlDocument
    // with the following additional properties (Title, Description, Keywords, Html)

    // future additions will hopefully have plugins for PdfDocument and WordDocument
}
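
For what the "do indexing code" part might look like, here is a small sketch of the cast described in the comments above. The two classes in it are stand-ins so the snippet compiles on its own: only the names WebDocumentBase, HtmlDocument, Uri, TextData and Title come from this post, while the property types and the IndexingHandler.Describe helper are assumptions for illustration.

```csharp
using System;

// Stand-in types: the real classes live in the library and have more members.
public class WebDocumentBase
{
    public Uri Uri { get; set; }
    public string TextData { get; set; }
}

public class HtmlDocument : WebDocumentBase
{
    public string Title { get; set; }
}

// Hypothetical helper showing the "is it HTML?" check inside a handler.
public static class IndexingHandler
{
    // Builds a one-line label for a crawled document: the page title when
    // the document is HTML and has one, otherwise its URL.
    public static string Describe(WebDocumentBase webDocument, int level)
    {
        // "as" yields null instead of throwing when the document is not HTML.
        HtmlDocument html = webDocument as HtmlDocument;

        string label = (html != null && !string.IsNullOrEmpty(html.Title))
            ? html.Title
            : webDocument.Uri.AbsoluteUri;

        return string.Format("[{0}] {1}", level, label);
    }
}
```

The same check would let a handler index Title and Description as separate fields for HTML pages and fall back to TextData for everything else.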

Future things I'd like to add:

  1. Other document types (PDF, Word, other Office formats) for indexing, like DotLucene's indexer
  2. More events to help steer the crawling
  3. Weight to heading tags (h1, h2, etc.)

Please note, the namespace "Refresh.Web" is for a future business
endeavor. The code is released under a Creative Commons Attribution license. If you're interested in using it, please leave a comment about any additional features you'd like to see.

  see: http://arachnode.net – an open source site crawler written in C# using SQL Server 2005.

