Effective Web Page Crawler
Engineering and Technology Journal,
2011, Volume 29, Issue 3, Pages 513-530
Abstract
The World Wide Web (WWW) has grown from a few thousand pages in
1993 to more than eight billion pages at present. Due to this explosion in size,
web search engines are becoming increasingly important as the primary means
of locating relevant information.
This research aims to build a crawler that fetches the most important web
pages. The crawling system that has been built combines three main
techniques. The first is a Best-First technique, which is used to select the
most important page. The second is a Distributed Crawling technique based on
UbiCrawler, which is used to distribute the URLs of the selected web pages to
several machines. The third is a Duplicated Pages Detecting technique that
uses a proposed document fingerprint algorithm.
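
The three techniques named in the abstract can be illustrated with a minimal Python sketch. It is not the system described in the paper: the abstract gives no implementation details, so every name here (BestFirstFrontier, HostAssigner, fingerprint, the machine labels, the example URLs and scores) is hypothetical. The importance score is assumed to be supplied by the caller. The host-to-machine mapping uses consistent hashing, the assignment scheme UbiCrawler is known for, and the paper may use it differently. A SHA-1 hash of whitespace- and case-normalized text stands in for the proposed fingerprint algorithm, which the abstract does not specify.

```python
import hashlib
import heapq
from bisect import bisect_right
from urllib.parse import urlsplit


class BestFirstFrontier:
    """Priority queue of URLs; the highest-scoring URL is crawled next."""

    def __init__(self):
        self._heap = []
        self._seen = set()
        self._counter = 0  # tie-breaker: equal scores pop in insertion order

    def push(self, url, score):
        if url in self._seen:  # each URL enters the frontier at most once
            return
        self._seen.add(url)
        # heapq is a min-heap, so the score is negated to pop the largest first.
        heapq.heappush(self._heap, (-score, self._counter, url))
        self._counter += 1

    def pop(self):
        neg_score, _, url = heapq.heappop(self._heap)
        return url, -neg_score

    def __len__(self):
        return len(self._heap)


def _hash_int(text):
    """Map a string to a 64-bit integer position on the hash ring."""
    return int.from_bytes(hashlib.sha1(text.encode("utf-8")).digest()[:8], "big")


class HostAssigner:
    """Consistent-hashing ring that maps each host to one crawler machine."""

    def __init__(self, machines, replicas=64):
        # Several ring points per machine smooth out the load distribution.
        self._ring = sorted(
            (_hash_int(f"{m}#{i}"), m) for m in machines for i in range(replicas)
        )
        self._keys = [key for key, _ in self._ring]

    def machine_for(self, url):
        host = urlsplit(url).hostname or ""
        # First ring point clockwise from the host's hash, wrapping at the end.
        idx = bisect_right(self._keys, _hash_int(host)) % len(self._ring)
        return self._ring[idx][1]


def fingerprint(text):
    """Stand-in fingerprint (assumption): SHA-1 of normalized page text."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


# Usage with made-up URLs, scores, and machine names.
frontier = BestFirstFrontier()
frontier.push("http://example.org/a", 0.2)
frontier.push("http://example.org/b", 0.9)
next_url, next_score = frontier.pop()  # the 0.9-scored URL comes out first

assigner = HostAssigner(["machine-1", "machine-2", "machine-3"])
target = assigner.machine_for(next_url)  # all URLs of one host go to one machine

seen_fingerprints = set()
page_text = "Example   page body"
fp = fingerprint(page_text)
is_duplicate = fp in seen_fingerprints
seen_fingerprints.add(fp)
```

Hashing on the host rather than the full URL keeps every page of a site on the same machine, which simplifies per-site politeness limits. An exact hash such as the stand-in above only catches pages whose normalized text is identical, so it does not detect near-duplicates.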