Controlling web crawlers so that your blog gets indexed correctly by search engines.
Crawlers, also known as spiders, are programs created by search engines that travel around the internet collecting information about websites. Visiting a website and fetching its pages is known as crawling; recording what was found there in the search engine's database is known as indexing. The information collected by these spiders is then used to rank the websites.
Examples of Web Crawlers
WebCrawler (Pinkerton, 1994) was used to build the first publicly-available full-text index of a subset of the Web. It was based on lib-WWW to download pages, and another program to parse and order URLs for breadth-first exploration of the Web graph. It also included a real-time crawler that followed links based on the similarity of the anchor text with the provided query.
World Wide Web Worm (McBryan, 1994) was a crawler used to build a simple index of document titles and URLs. The index could be searched by using the grep Unix command.
Google Crawler (Brin and Page, 1998) is described in some detail, but the reference covers only an early version of its architecture, which was written in C++ and Python. The crawler was integrated with the indexing process, because text parsing was done both for full-text indexing and for URL extraction. A URL server sends lists of URLs to be fetched by several crawling processes. During parsing, the URLs found were passed to the URL server, which checked whether each URL had been seen before. If not, the URL was added to the queue of the URL server.
WebFountain (Edwards et al., 2001) is a distributed, modular crawler similar to Mercator but written in C++. It features a “controller” machine that coordinates a series of “ant” machines. After repeatedly downloading pages, a change rate is inferred for each page and a non-linear programming method must be used to solve the equation system for maximizing freshness. The authors recommend using this crawling order in the early stages of the crawl, and then switching to a uniform crawling order, in which all pages are visited with the same frequency.
PolyBot (Shkapenyuk and Suel, 2002) is a distributed crawler written in C++ and Python, which is composed of a “crawl manager”, one or more “downloaders” and one or more “DNS resolvers”. Collected URLs are added to a queue on disk, and processed later in batch mode to filter out URLs that have already been seen. The politeness policy considers both third and second level domains (for example, www.example.com and www2.example.com are third level domains) because third level domains are usually hosted by the same Web server.
WebRACE (Zeinalipour-Yazti and Dikaiakos, 2002) is a crawling and caching module implemented in Java, and used as a part of a more generic system called eRACE. The system receives requests from users for downloading Web pages, so the crawler acts in part as a smart proxy server. The system also handles requests for “subscriptions” to Web pages that must be monitored: when the pages change, they must be downloaded by the crawler and the subscriber must be notified. The most outstanding feature of WebRACE is that, while most crawlers start with a set of “seed” URLs, WebRACE is continuously receiving new starting URLs to crawl from.
FAST Crawler (Risvik and Michelsen, 2002) is the crawler used by the FAST search engine, and a general description of its architecture is available. It is a distributed architecture in which each machine holds a “document scheduler” that maintains a queue of documents to be downloaded by a “document processor” that stores them in a local storage subsystem. Each crawler communicates with the other crawlers via a “distributor” module that exchanges hyperlink information.
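Two ideas recur in the crawlers above: a URL server that only queues URLs it has not seen before, and a politeness policy that treats hosts under the same second level domain as one site. The following is a minimal Python sketch of both, not the code of any of these crawlers. The names `UrlServer`, `offer`, `next_url` and `site_key` are made up for illustration, everything is kept in memory, and the "last two host labels" rule is a naive assumption that breaks for suffixes such as co.uk.

```python
from collections import deque
from urllib.parse import urlsplit


def site_key(url):
    """Return a naive politeness key: the last two labels of the host name.

    Assumes the URL includes a scheme (http://...); without one, urlsplit
    cannot find a host and the key is an empty string.
    """
    host = urlsplit(url).hostname or ""
    labels = host.split(".")
    return ".".join(labels[-2:])


class UrlServer:
    """Hypothetical in-memory URL server: a FIFO queue plus a set of seen URLs."""

    def __init__(self):
        self.queue = deque()
        self.seen = set()

    def offer(self, url):
        """Queue the URL only if it has not been seen before."""
        if url in self.seen:
            return False
        self.seen.add(url)
        self.queue.append(url)
        return True

    def next_url(self):
        """Hand the oldest queued URL to a crawling process, or None if empty."""
        return self.queue.popleft() if self.queue else None
```

A crawling process would call `offer` for every link it extracts and `next_url` when it is ready to fetch again. A polite crawler could additionally remember when it last contacted each `site_key` and wait before contacting that site again.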
You can control how crawlers index your blog by using a simple text file named robots.txt.
The Robots Exclusion Protocol from 1994 defines “a method that allows Web site administrators to indicate to visiting robots which parts of their site should not be visited by the robot”. It is only a quasi-standard, but the crawlers sent out by the major search engines do comply with it.
robots.txt is a plain text file located in the root directory of your server. Web robots read it before they fetch a document. If the document the bot is going to fetch is excluded for the particular robot by statements in the robots.txt file, the bot will not request it.
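To get a feel for how a bot applies these rules, here is a small Python sketch that uses the standard library module `urllib.robotparser`. The rules in `ROBOTS_TXT`, the paths `/private/` and `/drafts/`, the bot name `ExampleBot` and the example.com URLs are all made up for illustration; your own file will list whatever parts of your blog you want to keep crawlers away from.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only. On a real site this text
# would be served from http://www.example.com/robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /drafts/
"""


def build_parser(robots_text):
    """Parse robots.txt text that is already in memory (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# A well-behaved bot asks this question before requesting each document.
allowed = parser.can_fetch("ExampleBot", "http://www.example.com/2007/01/hello/")
blocked = parser.can_fetch("ExampleBot", "http://www.example.com/drafts/post")

print("published post may be fetched:", allowed)
print("draft may be fetched:", blocked)
```

With these assumed rules, the published post may be fetched and the draft may not. Keep in mind that robots.txt is a request, not a lock: it only works for bots that choose to respect it.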
If you want to know how to write a robots.txt file, visit http://googleblog.blogspot.com/2007/01/controlling-how-search-engines-access.html.
If you don’t have a blog yet, read Build your blog!
