John Dyer

Technology and web development in curly bracket languages {Javascript, C#, ActionScript}

WebCrawler Engine in C# (first draft)

by John Dyer 7. April 2006 22:30

A few weeks ago, I wrote about using SearchAroo as a spider to index a site with DotLucene. I've written a new WebCrawler using SearchAroo as a base and turned it into a library that can be reused for other applications.

Download Web Crawler (zip file with WebCrawler engine and sample web and forms apps)

Here are the improvements I've made:

  1. Gets text from the following HTML tag attributes: alt, title, summary, longdesc
  2. Resolves relative URLs more reliably
  3. The WebDocument object keeps a record of every file it links out to, including external and internal links, as well as images. This is useful for finding missing images or broken outgoing links on your site.
  4. Compiled into a reusable library, so it can be plugged into any indexing framework or used for other purposes such as simple link checking. (The author of SearchAroo didn't want to have a DLL, but I feel it's much more usable this way.)
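
To illustrate points 2 and 3, here is a minimal sketch of resolving an href against the page it was found on and then sorting the result into internal or external. It is not the code from the WebCrawler library: the LinkSorter class and its method names are made up for this example, and it relies only on the standard System.Uri class.

```csharp
using System;

// Hypothetical helper, not part of the WebCrawler library.
public static class LinkSorter
{
    // Resolves an href found on pageUri into an absolute Uri.
    // Returns null when the href cannot be parsed or is not an http/https link
    // (for example mailto: or javascript: links).
    public static Uri Resolve(Uri pageUri, string href)
    {
        Uri resolved;
        if (!Uri.TryCreate(pageUri, href, out resolved))
        {
            return null;
        }
        if (resolved.Scheme != Uri.UriSchemeHttp && resolved.Scheme != Uri.UriSchemeHttps)
        {
            return null;
        }
        return resolved;
    }

    // Treats a link as internal when it points at the same host as the site root.
    // Assumption: subdomains count as external in this sketch.
    public static bool IsInternal(Uri siteRoot, Uri link)
    {
        return string.Equals(siteRoot.Host, link.Host, StringComparison.OrdinalIgnoreCase);
    }
}
```

For example, resolving "../images/logo.gif" against http://mywebsite.com/articles/2006/page.html gives http://mywebsite.com/articles/images/logo.gif, which IsInternal reports as internal for the site root http://mywebsite.com/.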

Here is the basic code to get it running:

string baseUrl = "http://mywebsite.com/";
CrawlerEngine crawler = new CrawlerEngine();
crawler.OnDocumentLoaded += new DocumentHandler(crawler_OnDocumentLoaded);
crawler.Crawl(baseUrl);

void crawler_OnDocumentLoaded(WebDocumentBase webDocument, int level) {
    // Do your indexing here.
    // WebDocumentBase is the base class for every document that is downloaded and spidered.
    // It has the following properties: Uri, ContentType, MimeType, Encoding, Length,
    // TextData, InternalLinks, ExternalLinks, ImageSrcs.

    // If the file is an HTML file, it can be cast to HtmlDocument,
    // which adds these properties: Title, Description, Keywords, Html.

    // Future additions will hopefully include plugins for PdfDocument and WordDocument.
}
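
As a rough sketch of what the body of that handler might do, the code below turns a loaded document into a record for an indexer, using the HtmlDocument properties when the cast succeeds. The WebDocumentBase and HtmlDocument classes here are stand-ins so the example compiles on its own: only their names and property names come from the description above, and the property types are assumptions. IndexEntry and IndexEntryBuilder are hypothetical names.

```csharp
using System;

// Stand-ins for the library types. Property types are assumptions.
public class WebDocumentBase
{
    public Uri Uri { get; set; }
    public string MimeType { get; set; }
    public string TextData { get; set; }
}

public class HtmlDocument : WebDocumentBase
{
    public string Title { get; set; }
    public string Description { get; set; }
}

// Hypothetical record handed to whatever indexer you use.
public class IndexEntry
{
    public string Url { get; set; }
    public string Title { get; set; }
    public string Summary { get; set; }
    public string Body { get; set; }
}

public static class IndexEntryBuilder
{
    public static IndexEntry FromDocument(WebDocumentBase doc)
    {
        var entry = new IndexEntry
        {
            Url = doc.Uri.AbsoluteUri,
            Title = doc.Uri.AbsoluteUri,   // fallback when no HTML title is available
            Summary = "",
            Body = doc.TextData ?? ""
        };

        // Only HTML documents carry a title and description.
        var html = doc as HtmlDocument;
        if (html != null)
        {
            if (!string.IsNullOrEmpty(html.Title))
            {
                entry.Title = html.Title;
            }
            entry.Summary = html.Description ?? "";
        }
        return entry;
    }
}
```

Using "as" rather than a direct cast means non-HTML documents fall through to the URL-as-title fallback instead of throwing an InvalidCastException.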

Future things I'd like to add:

  1. Other document types (PDF, Word, other Office formats) for indexing, like DotLucene's indexer
  2. More events to help steer the crawling
  3. Weighting for heading tags (h1, h2, etc.)
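
One possible shape for the heading weighting in item 3, sketched with a regular expression. This is not in the library. The HeadingWeights class is hypothetical and the weight formula (7 minus the heading level, so h1 gets 6 and h6 gets 1) is an arbitrary assumption. A regex will also miss malformed markup that a real HTML parser would handle.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the WebCrawler library.
public static class HeadingWeights
{
    // Matches <h1>...</h1> through <h6>...</h6>; \1 requires the same level in the closing tag.
    private static readonly Regex HeadingPattern = new Regex(
        @"<h([1-6])\b[^>]*>(.*?)</h\1\s*>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Strips any tags nested inside a heading, such as <em> or <a>.
    private static readonly Regex TagPattern = new Regex(@"<[^>]+>");

    // Returns each heading's text paired with its weight.
    public static List<KeyValuePair<string, int>> Extract(string html)
    {
        var result = new List<KeyValuePair<string, int>>();
        foreach (Match m in HeadingPattern.Matches(html))
        {
            int level = int.Parse(m.Groups[1].Value);
            string inner = TagPattern.Replace(m.Groups[2].Value, "");
            string text = WebUtility.HtmlDecode(inner).Trim();
            if (text.Length == 0)
            {
                continue;   // skip empty headings
            }
            result.Add(new KeyValuePair<string, int>(text, 7 - level));
        }
        return result;
    }
}
```

An indexer could then boost those terms by the returned weight when it adds the document.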

Please note that the namespace "Refresh.Web" is for a future business endeavor. The code is released under a Creative Commons Attribution license. If you're interested in using it, please leave a comment about additional features you'd like to see.

Comments

3/27/2008 11:04:47 AM # arachnode.net arachnode.net United States | Reply
This post is a bit dated, but if you ever get the notion to investigate crawling any further see:  http://arachnode.net - an open source site crawler written in C# using SQL Server 2005.
4/10/2008 4:40:10 PM # Fabien Fabien | Reply
Your sample is too simple, even if it is a start.

You don't handle proxies and credentials. I work behind a proxy that requires authentication, and your code doesn't work there. If I find a solution, I will send it to you.
4/28/2008 12:58:53 PM # jon jon Republic of the Philippines | Reply
let me see
1/5/2009 10:50:25 PM # arachnode.net arachnode.net United States | Reply
Hey -

I just promoted arachnode.net to release/stable status, for those that are interested!

-an
2/25/2009 1:31:45 AM # Thomas Thomas United States | Reply
The download link is inactive
6/22/2010 7:38:47 AM # Betty Clark Betty Clark United States | Reply
I have read a lot of the comments and I just wonder why people say the things they do, I mean they can find the bad in anything.  I guess that is where we are in this world. Just hurt hurt hurt, no matter what the subject is. Lawrence Williams  www.trybw.com  Fort Myers, Naples, Bonita,Cape Coral  Computer Repair Service
8/9/2010 12:04:52 PM # rathi rathi India | Reply
this is very useful..
8/11/2010 5:23:29 PM # Isagenix Isagenix United States | Reply
thanks you are giving great information.
8/20/2010 2:50:09 PM # malli malli India | Reply
its very nice sir.....
8/24/2010 5:27:26 PM # Brintey Brintey United States | Reply
Thanks for the article and code. Very helpful.
8/26/2010 10:27:38 AM # malli malli India | Reply
its very usefull......
8/27/2010 8:44:16 AM # Isagenix cleanse Isagenix cleanse United States | Reply
Years ago I programmed in C and then started moving to C++. C# is the way to go now. Great code, it is so short and simple.
