Thursday, April 20, 2006

Result of the Project

These are the results we obtained from our crawl. In summary, about 22% of the web uses the Robots Exclusion Standard, while about 14% of the content is hidden.

The most interesting observation is the number of errors present in robots.txt files. About 20% of the robots.txt files we crawled contain errors. Although the de facto standard has been around for about a decade, there still seems to be no proper agreement on what constitutes a correct robots.txt.

Use of Robots Exclusion Standard in Different Domains

Domain %
com 23.69
org 20.18
net 21.85
edu 25.72
gov 42.98
info 26.51
Total 22.41


Hidden Fraction of the Web - Domain Wise


Domain  Document-wise %  Size-wise %
info 18.75 17.29
gov 8.87 9.21
edu 11.9 16.7
net 13.1 12.26
org 11.29 12.22
com 15.33 16.62
Total 13.81 14.65


Error and Warning Percentages in the robots.txt in different domains

Domain  Error %  Warning %
com 21.02 30.83
net 21.12 36.59
info 21.05 41.8
edu 18.63 35.08
gov 13.37 22.28
org 18.01 36.2
Total 20.05 33.81


Error Types and their percentages

Error Type: %
Capitalization 30.72
No user agent 22.59
Unrecognized Line 43.55
White space 3.13


Warning Types and their percentages

Warning Type: %
Paths should be absolute 53.79
Allow is not widely supported 2.07
No restrictions 20.47
Unrecognized field 9.79
Space In path 8.44
Wildcards aren't supported 3.48
Repeated User agent 1.96

Monday, April 10, 2006

Update

This is a brief update of what we have done :

We felt that our initial data set was biased towards sites that have a robots.txt, so we decided to enlarge it. To do so, we took the RDF dump from DMOZ and classified its URLs by domain. In each domain, we probed every site for the existence of a robots.txt, up to a maximum of 50000 sites per domain.
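A minimal sketch of such an existence check is given below. It is an illustration only, not the code used in the project: the function names are hypothetical, and it assumes that an HTTP 200 response for /robots.txt means the file exists (a server that returns a custom error page with status 200 would be miscounted).

```python
import urllib.error
import urllib.request
from urllib.parse import urlsplit


def robots_url(site_url):
    """Build the robots.txt URL for the host that serves site_url."""
    parts = urlsplit(site_url)
    return f"{parts.scheme}://{parts.netloc}/robots.txt"


def http_status(url, timeout=10):
    """Return the HTTP status code for url, or None if the host is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx answers still carry a status code.
        return err.code
    except (urllib.error.URLError, OSError):
        return None


def count_robots(sites, limit=50000, get_status=http_status):
    """Probe at most `limit` sites; return (sites probed, sites with robots.txt).

    Assumption: status 200 is treated as "robots.txt exists".
    """
    probed = sites[:limit]
    found = sum(1 for site in probed if get_status(robots_url(site)) == 200)
    return len(probed), found
```

The `get_status` parameter is there so the counting logic can be exercised with a stub instead of real network requests.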

For the sites that had a robots.txt, we crawled 2 levels completely using JoBo to get the size and usage statistics. We did this for a random sample of at most 1000 such websites in each domain.

For all the sites that had a robots.txt, we validated the file using the validation logic of the following website:
http://www.sxw.org.uk/computing/robots/check.html
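The validator's exact rules are not reproduced here. As a rough illustration of what checks of this kind might look like, the sketch below flags a few of the categories named in the result tables. How each category is detected is an assumption made for this example, not the logic of the validator above, and `check_robots` is a hypothetical name.

```python
def check_robots(text):
    """Return (errors, warnings) as lists of (line_number, message).

    The detection rules are assumptions made for illustration only.
    """
    errors, warnings = [], []
    seen_agent = False
    for num, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop comments and outer spaces
        if not line:
            continue
        if ":" not in line:
            errors.append((num, "unrecognized line"))
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        name = field.lower()
        if name == "user-agent":
            seen_agent = True
        elif name == "disallow":
            if not seen_agent:
                # A rule that appears before any User-agent line.
                errors.append((num, "no user agent"))
            if value and not value.startswith("/"):
                warnings.append((num, "paths should be absolute"))
            if "*" in value:
                warnings.append((num, "wildcards aren't supported"))
        elif name == "allow":
            warnings.append((num, "allow is not widely supported"))
        else:
            warnings.append((num, "unrecognized field"))
    return errors, warnings
```

For example, a file whose first line is a Disallow rule would be reported with a "no user agent" error on line 1.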

A total of 30000 robots.txt files were validated using an automated testing tool called iMacros Browser. We obtained a 30-day academic trial license that allowed us to use the tool on such a large number of files. Thanks to iOpus (www.iopus.com) for providing it; it would otherwise have cost $500.

Other testing tools such as WinRunner and TestComplete were considered but abandoned, either because they were too heavyweight or because they lacked a trial version that could automate such a large data set.

Sunday, March 05, 2006

Moved the code to svn.

We have set up an SVN repository for the modified crawler (JoBo).
Here is the link. http://svn2.cvsdude.com/jaliya/robo

Thursday, March 02, 2006

Project update

Identified and finalized the web crawler to use for the crawl. JoBo is written in Java and can be easily customised for the requirements of our project.

Customizations include :
  • Performing the crawl in many threads to complete the crawl sooner
  • Crawl the URLs listed in the robots.txt. This violates the Robots Exclusion Standard, but our goal is only to collect and analyse statistics. We also crawl just once and then analyse the data
  • Collect the data in a MySQL database, with separate tables for the URLs crawled (allowed and disallowed) and their sizes, the contents of each robots.txt, and the host they belong to
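As an illustration of the kind of tables the last item describes, here is a minimal sketch. The table and column names are hypothetical, and it uses Python's built-in sqlite3 module as a stand-in for MySQL so that it runs without a database server; the actual schema used in the project is not shown in this post.

```python
import sqlite3

# Hypothetical schema: one row per host (with its robots.txt contents)
# and one row per crawled URL (with its size and whether it was allowed).
SCHEMA = """
CREATE TABLE hosts (
    id         INTEGER PRIMARY KEY,
    name       TEXT NOT NULL UNIQUE,
    robots_txt TEXT
);
CREATE TABLE urls (
    id         INTEGER PRIMARY KEY,
    host_id    INTEGER NOT NULL REFERENCES hosts(id),
    url        TEXT NOT NULL,
    size_bytes INTEGER NOT NULL,
    allowed    INTEGER NOT NULL   -- 1 = allowed, 0 = disallowed by robots.txt
);
"""


def open_store():
    """Create an in-memory database with the sketch schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def record_url(conn, host, url, size_bytes, allowed):
    """Store one crawled URL, creating the host row on first sight."""
    conn.execute("INSERT OR IGNORE INTO hosts (name) VALUES (?)", (host,))
    host_id = conn.execute(
        "SELECT id FROM hosts WHERE name = ?", (host,)
    ).fetchone()[0]
    conn.execute(
        "INSERT INTO urls (host_id, url, size_bytes, allowed) VALUES (?, ?, ?, ?)",
        (host_id, url, size_bytes, int(allowed)),
    )
```

With a layout like this, the disallowed size of a host is a single query: the sum of size_bytes over its rows where allowed = 0.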

To do :

  • For greater efficiency, use a pool of database connections to perform database operations
  • Eliminate unwanted code in JoBo that is not being used by our application
  • Remove large buffers that currently cause 'Out of memory' errors when the crawler crawls beyond 2000 URLs. Examine whether any other issue causes this error

2) Process the data: once we have collected the data, think about how to analyse it, discuss the approach with Fil before the start of spring break, and implement the analysis.

Thursday, February 09, 2006

Initial Project Plan , Goals and Deadlines

1) 13th Feb 2006
Shortlist candidate crawlers, compare them to see which is easier to modify, and observe the results.
As part of this, Jaliya will evaluate the basic Java web crawler and I shall evaluate another open source crawler called JoBo.

2) 17th Feb 2006
  • Basic design for the application
  • Set up SVN on CS server for the project - Smitha
  • Long term deadlines for the project tasks
3) What next?
These are the main steps involved as part of the project :
  • Modify the crawler to download 'robots.txt'. Crawl the site to get the total size of its crawlable portion. To start with, the 'Open Directory' shall be used as the seed URL.
  • Continue crawling web sites until a fixed number of robots.txt files have been downloaded (200 files, which is configurable)
  • Another process would examine each robots.txt to get the size of the content it lists, which represents the portion of the site that is disallowed for crawling. This could be done using the HTTP header or through a Perl interface.

Steps involved in getting the size of the sites blocked by robots.txt :

a) For each of the URLs in the robots.txt, crawl that URL and get all the sites in that URL

b) For each of these sites, check if the initial URL matches that of the blocked URL.

c) If so, count that site towards the size; otherwise ignore it

  • Another process would validate the robots.txt to check for correctness
  • Analyse and collate statistical information, namely: the domains that use robots.txt, the type of content that is typically hidden (e.g. cgi-bin), and the page rank/importance of the sites that use valid/invalid robots.txt files
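The size calculation in steps a) to c) above can be sketched as follows. This is a minimal illustration, assuming that "matches" means the page path starts with a Disallow prefix and that page sizes are already known; the function names are hypothetical.

```python
def is_blocked(path, disallowed):
    """True if path starts with any non-empty Disallow prefix."""
    return any(prefix and path.startswith(prefix) for prefix in disallowed)


def hidden_fraction(pages, disallowed):
    """Compute the hidden share of a site.

    pages: list of (path, size_bytes) for every page found on the site.
    disallowed: list of Disallow path prefixes from its robots.txt.
    Returns (document-wise %, size-wise %).
    """
    blocked = [(path, size) for path, size in pages if is_blocked(path, disallowed)]
    total_size = sum(size for _, size in pages)
    doc_pct = 100.0 * len(blocked) / len(pages) if pages else 0.0
    size_pct = 100.0 * sum(size for _, size in blocked) / total_size if total_size else 0.0
    return doc_pct, size_pct
```

For instance, a site with four pages totalling 1000 bytes, of which two pages totalling 200 bytes fall under disallowed prefixes, is 50% hidden document-wise and 20% hidden size-wise.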

Monday, January 30, 2006

The ROBOT Meta tag

This is used as an easy alternative to robots.txt, to specify whether a robot can access and index a particular web page. This is done through a ‘Robots’ META tag[4], which can be placed in the head of an HTML document as shown below:

<META NAME="ROBOTS" CONTENT="NOINDEX,NOFOLLOW">

<META NAME="DESCRIPTION" CONTENT="THIS PAGE
....">

As part of the project, we would analyze the html content of web pages to find out whether it contains a ROBOT tag and evaluate the portion of the web that uses these tags as against the robots.txt.
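A minimal sketch of how such a check might be done, using Python's standard html.parser module. The class and function names are hypothetical, and this is an illustration rather than the project's implementation.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives of every <meta name="robots"> tag in a page."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "meta":
            return
        attr = {key: (value or "") for key, value in attrs}
        if attr.get("name", "").lower() == "robots":
            for item in attr.get("content", "").split(","):
                if item.strip():
                    self.directives.add(item.strip().lower())


def robots_meta(html):
    """Return the set of robots directives found in html (empty if none)."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives
```

A page counts as using the tag when the returned set is not empty; the directives themselves (for example noindex or nofollow) show what is being restricted.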

Project Proposal

Project Goals:

The Robots Exclusion Standard [1] is a de facto standard that is used to inform crawlers about the disallowed sections of a web server. It has been in general use since the mid-1990s and is heavily used to limit access to pages or sections such as very deep virtual trees, duplicated information, temporary information, or cgi-scripts with side-effects (such as voting). From the point of view of a crawler this is a voluntary standard, since it does not provide a mechanism to stop crawlers from accessing a disallowed section. However, most crawlers adopt this standard and obey its rules.

Although the standard has been around for almost a decade, extensive research into its usage has not been done. As part of the project, we will perform a statistical analysis of the usage of the standard and explore the following areas:

  • Usage of the standard – what percentage of the web uses (follows) the standard
  • Hidden Web – what percentage of the web is hidden from robots
  • Accuracy of the de facto standard – analyze the correctness of the robots.txt documents that we collect during the research to arrive at a statistical figure for their accuracy
  • Other means of excluding robots, such as the ‘Robots’ META tag[4]

Based on the results of the analysis, we plan to recommend that the robots.txt be accepted as an official standard.

Implementation options:

Currently we are considering two approaches to get the required information using crawlers namely:

  • Crawl a random collection of sites from different domains and get the statistics.
    The results of this approach will depend on the quality and the breadth of the sample that we select for crawling.
  • Crawl the web up to a certain maximum and collect the required information in the process of crawling. The results depend on the number of sites that we crawl during the research. This approach provides the flexibility of running the crawler with different maximum values, and hence makes it possible to improve the correctness of the results.

We will use an open source crawler such as Apache Nutch [2] or Heritrix [3] and enhance it to suit the project’s requirements. Modifications would be made to calculate the size of the web pages that the crawler comes across and the size of the disallowed portion of the web. This can be achieved by downloading only the HTTP headers, without downloading the entire web page. In addition, the crawler will save the robots.txt of each site so that it can later be checked for correctness using a validator.
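As a minimal sketch of the header-only approach, the example below issues an HTTP HEAD request and reads the Content-Length header. It is written in Python for brevity rather than as a crawler modification, the function names are hypothetical, and it assumes the server reports Content-Length at all (many dynamic pages do not, in which case the size is unknown).

```python
import urllib.request


def size_from_headers(headers):
    """Return Content-Length as an int, or None if absent or malformed."""
    value = headers.get("Content-Length")
    try:
        return int(value)
    except (TypeError, ValueError):
        return None


def head_size(url, timeout=10):
    """Ask the server for headers only (HTTP HEAD) and return the page size.

    Returns None when the server does not report a usable Content-Length.
    """
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return size_from_headers(response.headers)
```

When `head_size` returns None, a crawler would have to fall back to downloading the page to measure it.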

References:


[1] Robots Exclusion Standard, http://www.robotstxt.org/wc/norobots.html

[2] Apache Nutch, http://lucene.apache.org/nutch/

[3] Heritrix, http://crawler.archive.org/

[4] Robot Meta Tag, http://www.searchengineworld.com/metatag/robots.htm