To try it out, download the code from:
http://mrpfisterssource.googlegroups.com/web/WebCrawler.zip
Highlights:
Parses robots.txt and sitemap XML files to determine which pages should be crawled
Uses multithreaded searching (via the ThreadPool) and asynchronous web requests to keep CPU load low (a rough sketch of that pattern follows below).
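For illustration only, here is a minimal sketch of the asynchronous fetch pattern this relies on: HttpWebRequest.BeginGetResponse hands the wait over to the network stack, and the callback only runs on a ThreadPool thread once the response has arrived. The AsyncFetcher class and FetchAsync method are hypothetical names, not the project's actual code.

    // Sketch: asynchronous page fetch that avoids tying up a thread while waiting on the network.
    using System;
    using System.IO;
    using System.Net;

    static class AsyncFetcher
    {
        public static void FetchAsync(string url, Action<string> onBody)
        {
            var request = (HttpWebRequest)WebRequest.Create(url);
            request.BeginGetResponse(ar =>
            {
                // This callback runs on a ThreadPool thread only once the response is ready.
                using (var response = (HttpWebResponse)request.EndGetResponse(ar))
                using (var reader = new StreamReader(response.GetResponseStream()))
                    onBody(reader.ReadToEnd());
            }, null);
        }
    }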
How it works:
When a new domain is encountered, e.g. http://www.Microsoft.com, the crawler checks it for a robots.txt file; if one exists, it is parsed along with any sitemap XML files referenced within. Any pages listed are added to the crawl task queue for that particular domain.
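As a rough illustration of that step (the class and method names below are hypothetical, not the project's actual code), checking a domain for robots.txt and pulling page URLs from a referenced sitemap might look like this:

    // Sketch: discover sitemap URLs from robots.txt, then read page URLs from a sitemap.
    using System;
    using System.Collections.Generic;
    using System.Net;
    using System.Xml;

    static class RobotsProbe
    {
        // Downloads robots.txt for a domain and returns any "Sitemap:" URLs it lists.
        public static List<string> GetSitemapUrls(string domain)
        {
            var sitemaps = new List<string>();
            try
            {
                using (var client = new WebClient())
                {
                    string robots = client.DownloadString("http://" + domain + "/robots.txt");
                    foreach (var line in robots.Split('\n'))
                    {
                        var trimmed = line.Trim();
                        if (trimmed.StartsWith("Sitemap:", StringComparison.OrdinalIgnoreCase))
                            sitemaps.Add(trimmed.Substring("Sitemap:".Length).Trim());
                    }
                }
            }
            catch (WebException)
            {
                // No robots.txt (or unreachable): nothing to seed the queue from.
            }
            return sitemaps;
        }

        // Reads the <loc> entries of a sitemap XML file into a list of page URLs.
        public static List<string> GetPagesFromSitemap(string sitemapUrl)
        {
            var pages = new List<string>();
            var doc = new XmlDocument();
            using (var client = new WebClient())
                doc.LoadXml(client.DownloadString(sitemapUrl));
            foreach (XmlNode loc in doc.GetElementsByTagName("loc"))
                pages.Add(loc.InnerText.Trim());
            return pages;
        }
    }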
To avoid hammering any particular domain, this crawler, much like the Web Crawler example in the Windows Mobile 6 SDK, orders tasks in a round-robin style between domains, so every domain gets a turn to add tasks to its queue and have them processed.
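A minimal sketch of that round-robin scheduling, assuming one queue per domain and a rotation of domain names (hypothetical types, not the project's own):

    // Sketch: rotate between per-domain queues so no single domain monopolises the crawl.
    using System.Collections.Generic;

    class RoundRobinScheduler
    {
        private readonly Dictionary<string, Queue<string>> queues = new Dictionary<string, Queue<string>>();
        private readonly Queue<string> domainOrder = new Queue<string>();

        public void Enqueue(string domain, string url)
        {
            Queue<string> q;
            if (!queues.TryGetValue(domain, out q))
            {
                q = new Queue<string>();
                queues[domain] = q;
                domainOrder.Enqueue(domain); // a new domain joins the rotation
            }
            q.Enqueue(url);
        }

        // Returns the next URL to process, cycling through domains so each gets a turn.
        public string DequeueNext()
        {
            int remaining = domainOrder.Count;
            while (remaining-- > 0)
            {
                string domain = domainOrder.Dequeue();
                domainOrder.Enqueue(domain); // rotate the domain to the back
                if (queues[domain].Count > 0)
                    return queues[domain].Dequeue();
            }
            return null; // every queue is empty
        }
    }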
Although the work is ultimately queued on the ThreadPool, tasks are first queued per domain as described above, which allows the current crawl state to be serialised to XML so the work can be continued later.
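For example, the pending per-domain work could be saved and restored with XmlSerializer along these lines (CrawlState and DomainState are illustrative types, not necessarily what the project uses):

    // Sketch: persist pending per-domain URLs to XML so a crawl can be resumed.
    using System.Collections.Generic;
    using System.IO;
    using System.Xml.Serialization;

    public class DomainState
    {
        public string Domain { get; set; }
        public List<string> PendingUrls { get; set; }
    }

    public class CrawlState
    {
        public List<DomainState> Domains { get; set; }

        public void Save(string path)
        {
            var serializer = new XmlSerializer(typeof(CrawlState));
            using (var writer = new StreamWriter(path))
                serializer.Serialize(writer, this);
        }

        public static CrawlState Load(string path)
        {
            var serializer = new XmlSerializer(typeof(CrawlState));
            using (var reader = new StreamReader(path))
                return (CrawlState)serializer.Deserialize(reader);
        }
    }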
Currently very little is extracted from a web page: only links, via regular expressions, for further processing. This gives the user free rein over how the search engine should catalogue pages.
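The link extraction itself can be as small as a single regular expression run over the raw HTML, roughly like this (a sketch, not the exact pattern the project uses):

    // Sketch: pull href values out of a page with a regular expression.
    using System.Collections.Generic;
    using System.Text.RegularExpressions;

    static class LinkExtractor
    {
        private static readonly Regex HrefPattern =
            new Regex(@"href\s*=\s*[""']([^""'#]+)[""']", RegexOptions.IgnoreCase);

        public static List<string> ExtractLinks(string html)
        {
            var links = new List<string>();
            foreach (Match m in HrefPattern.Matches(html))
                links.Add(m.Groups[1].Value);
            return links;
        }
    }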
Enjoy!