Result of the Project
These are the results obtained from our crawl. In summary, about 22% of the web uses the Robots Exclusion Standard, while about 14% of the content is hidden.
The most interesting observation is the number of errors present in robots.txt files. About 20% of the robots.txt files we crawled contain errors. Although the de facto standard has existed for about a decade, there still seems to be no proper agreement on what constitutes a correct file.
Use of Robots Exclusion Standard in Different Domains
| Domain | % |
| com | 23.69 |
| org | 20.18 |
| net | 21.85 |
| edu | 25.72 |
| gov | 42.98 |
| info | 26.51 |
| Total | 22.41 |
Hidden Fraction of the Web - Domain Wise
| Domain | Document-wise % | Size-wise % |
| info | 18.75 | 17.29 |
| gov | 8.87 | 9.21 |
| edu | 11.9 | 16.7 |
| net | 13.1 | 12.26 |
| org | 11.29 | 12.22 |
| com | 15.33 | 16.62 |
| Total | 13.81 | 14.65 |
Error and Warning Percentages in the robots.txt in different domains
| Domain | Error % | Warning % |
| com | 21.02 | 30.83 |
| net | 21.12 | 36.59 |
| info | 21.05 | 41.8 |
| edu | 18.63 | 35.08 |
| gov | 13.37 | 22.28 |
| org | 18.01 | 36.2 |
| Total | 20.05 | 33.81 |
Error Types and their percentages
| Error Type: | % |
| Capitalization | 30.72 |
| No user agent | 22.59 |
| Unrecognized Line | 43.55 |
| White space | 3.13 |
Warning Types and their percentages
| Warning Type: | % |
| Paths should be absolute | 53.79 |
| Allow is not widely supported | 2.07 |
| No restrictions | 20.47 |
| Unrecognized field | 9.79 |
| Space In path | 8.44 |
| Wildcards aren't supported | 3.48 |
| Repeated User agent | 1.96 |
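To make the categories in the two tables above more concrete, here is a minimal sketch of a robots.txt checker that reports a few of them. The rule behind each label is our own guess at a plausible meaning, not the logic of the validator actually used, and the function name `lint_robots_txt` is hypothetical.

```python
# Canonical spellings assumed for the two fields of the original standard.
KNOWN_FIELDS = {"user-agent": "User-agent", "disallow": "Disallow"}


def lint_robots_txt(text):
    """Return a list of (line_number, level, message) tuples."""
    findings = []
    seen_user_agent = False
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0]  # drop comments
        if not line.strip():
            continue
        if line[0].isspace():
            # Assumption: leading white space before a field is an error.
            findings.append((number, "error", "White space"))
        field, colon, value = line.strip().partition(":")
        if not colon:
            findings.append((number, "error", "Unrecognized line"))
            continue
        field, value = field.strip(), value.strip()
        key = field.lower()
        if key == "allow":
            findings.append((number, "warning", "Allow is not widely supported"))
        elif key not in KNOWN_FIELDS:
            findings.append((number, "warning", "Unrecognized field"))
            continue
        elif field != KNOWN_FIELDS[key]:
            # Assumption: any spelling other than the canonical one is flagged.
            findings.append((number, "error", "Capitalization"))
        if key == "user-agent":
            seen_user_agent = True
            continue
        if not seen_user_agent:
            findings.append((number, "error", "No user agent"))
        if value and not value.startswith("/"):
            findings.append((number, "warning", "Paths should be absolute"))
        if " " in value:
            findings.append((number, "warning", "Space In path"))
        if "*" in value:
            findings.append((number, "warning", "Wildcards aren't supported"))
    return findings
```

Counting the messages returned for each file, grouped by domain, would give percentages of the same shape as the tables above.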
Update
This is a brief update of what we have done:
After our initial data set, which we felt was biased towards sites that have robots.txt files, we decided to enlarge the data set. To do so, we obtained the RDF dump from DMOZ and classified the URLs by top-level domain. In each domain, we pinged every site for the existence of a robots.txt file, up to a maximum of 50,000 sites per domain.
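The pinging step can be sketched roughly as follows. This is an illustration only, not the code we ran: the helper names are hypothetical, a HEAD request with status 200 is assumed to mean "robots.txt exists", and the `opener` parameter exists only so the network call can be swapped out.

```python
import urllib.request
from urllib.parse import urlsplit


def _with_scheme(site):
    # DMOZ entries are assumed to be full URLs; bare host names get http://.
    return site if "://" in site else "http://" + site


def top_level_domain(site):
    """Return the last label of the host name, e.g. 'org'."""
    host = urlsplit(_with_scheme(site)).hostname or ""
    return host.rsplit(".", 1)[-1].lower()


def robots_url(site):
    parts = urlsplit(_with_scheme(site))
    return f"{parts.scheme}://{parts.netloc}/robots.txt"


def has_robots_txt(site, opener=urllib.request.urlopen, timeout=10):
    """True if a HEAD request for /robots.txt answers with status 200."""
    request = urllib.request.Request(robots_url(site), method="HEAD")
    try:
        with opener(request, timeout=timeout) as response:
            return response.status == 200
    except OSError:  # URLError and HTTPError are OSError subclasses
        return False


def sites_with_robots(sites, per_domain_limit=50000, opener=urllib.request.urlopen):
    """Check at most per_domain_limit sites per domain; group the hits by domain."""
    checked, found = {}, {}
    for site in sites:
        tld = top_level_domain(site)
        if checked.get(tld, 0) >= per_domain_limit:
            continue
        checked[tld] = checked.get(tld, 0) + 1
        if has_robots_txt(site, opener=opener):
            found.setdefault(tld, []).append(site)
    return found
```

Dividing the number of hits by the number of sites checked in each domain gives a usage percentage per domain.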
For the sites that had a robots.txt file, we crawled two levels completely using JoBo to get the size and usage statistics of robots.txt. This was done for up to 1,000 random websites with a robots.txt file in each domain (or however many were available, if fewer).
For all the sites that had a robots.txt file, we validated the file using the validation logic of the following website:
http://www.sxw.org.uk/computing/robots/check.html
A total of 30,000 robots.txt files were validated using an automated testing tool called iMacros Browser. We obtained a 30-day academic trial license that allowed us to use the tool on such a large number of files. Thanks to iOpus (www.iopus.com) for providing it; it would otherwise have cost $500.
Other testing tools such as WinRunner and TestComplete were considered but abandoned, either because they were too heavyweight or because they lacked a trial version that could automate such a large data set.
Moved the code to svn.
We have set up an SVN repository for the modified crawler (JoBo).
Here is the link.
http://svn2.cvsdude.com/jaliya/robo
Project update
Identified and finalized the web crawler to use to crawl the web. JoBo is written in Java and can be easily customised for the requirements of our project.
Customizations include:
- Performing the crawl in many threads to complete the crawl sooner
- Crawl the URLs in the robots.txt. This would violate the Robots Exclusion Standard, but our goal is to collect statistics and analyse them. Also, we would crawl just once and then analyse the data
- Collect the data in a MySQL database, with tables storing the crawled URLs (allowed and disallowed) and their sizes, the contents of robots.txt, and the host they belong to
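One possible layout for the tables mentioned in the last item is sketched below. This is an assumption for illustration, not our actual schema: the table and column names are hypothetical, and Python's built-in `sqlite3` stands in for MySQL so the example runs without a server.

```python
import sqlite3

# Hypothetical schema: one row per host, one row per crawled URL.
SCHEMA = """
CREATE TABLE hosts (
    id INTEGER PRIMARY KEY,
    host TEXT UNIQUE NOT NULL,
    robots_txt TEXT
);
CREATE TABLE urls (
    id INTEGER PRIMARY KEY,
    host_id INTEGER NOT NULL REFERENCES hosts(id),
    url TEXT NOT NULL,
    size_bytes INTEGER NOT NULL,
    allowed INTEGER NOT NULL  -- 1 = allowed, 0 = disallowed by robots.txt
);
"""


def open_store(path=":memory:"):
    """Create the tables in a new database and return the connection."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def hidden_fractions(conn):
    """Return (document-wise %, size-wise %) of disallowed content."""
    total_docs, hidden_docs, total_size, hidden_size = conn.execute(
        "SELECT COUNT(*), SUM(allowed = 0), SUM(size_bytes), "
        "SUM(CASE WHEN allowed = 0 THEN size_bytes ELSE 0 END) FROM urls"
    ).fetchone()
    if not total_docs or not total_size:
        return (0.0, 0.0)
    return (100.0 * hidden_docs / total_docs, 100.0 * hidden_size / total_size)
```

Keeping the allowed flag and the size on every URL row means both the document-wise and the size-wise hidden fraction come out of a single query.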
To do:
- For greater efficiency, have a pool of database connections to perform database operations
- Eliminate unwanted code in JoBo that is not being used by our application
- Remove large buffers that currently cause 'Out of memory' errors when the crawler crawls beyond 2000 URLs. Examine whether any other issue causes this error
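The connection pool from the first to-do item could look roughly like the sketch below. It is written in Python with `sqlite3` purely for illustration (the crawler itself is Java with MySQL), and the `ConnectionPool` class is hypothetical.

```python
import queue
import sqlite3
from contextlib import contextmanager


class ConnectionPool:
    """A fixed-size pool: connections are created once and then reused."""

    def __init__(self, factory, size):
        self._idle = queue.Queue(maxsize=size)
        for _ in range(size):
            self._idle.put(factory())

    @contextmanager
    def connection(self, timeout=None):
        conn = self._idle.get(timeout=timeout)  # blocks until one is free
        try:
            yield conn
        finally:
            self._idle.put(conn)  # always hand the connection back


# Usage: crawler threads borrow a connection instead of opening their own.
pool = ConnectionPool(
    lambda: sqlite3.connect(":memory:", check_same_thread=False), size=2
)
with pool.connection() as conn:
    conn.execute("SELECT 1")
```

The point of the design is that the cost of opening a connection is paid `size` times in total rather than once per database operation.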
2) Process the data: once we have collected the data, think about how we could analyse it, discuss this with Fil before the start of spring break, and implement it.
Initial Project Plan, Goals and Deadlines
1) 13th Feb 2006
Shortlist crawlers and compare them to see which is easier to modify; observe the results.
As part of this, Jaliya would evaluate the basic Java web crawler and I shall evaluate another open source crawler called JoBo.
2) 17th Feb 2006
Basic design for the application
Set up SVN on CS server for the project - Smitha
Long term deadlines for the project tasks
3) What next?
These are the main steps involved as part of the project :
Modify the crawler to download 'robots.txt'. Crawl the site to get the entire size of the crawlable portion of the site. To start with, the 'Open Directory' shall be used as the seed URL.
Continue crawling web sites until a fixed number of robots.txt files have been downloaded (200 robots.txt files, a number that is configurable).
Another process would examine the robots.txt to get the size of the pages listed in robots.txt, which represents the size of the part of the site that is disallowed for crawling. This could be done using the HTTP headers or through a Perl interface.
Steps involved in getting the size of the pages blocked by robots.txt:
a) For each of the URLs in the robots.txt, crawl that URL and get all the pages under that URL
b) For each of these pages, check if the initial part of its URL matches that of the blocked URL.
c) If so, include that page when calculating the size; otherwise ignore it
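Steps b) and c) amount to a prefix match on the URL path followed by a sum. A minimal sketch, assuming the crawl has already produced (URL, size) pairs and the Disallow paths have been extracted; the function names are hypothetical:

```python
from urllib.parse import urlsplit


def is_blocked(page_url, disallowed_paths):
    """Step b): does the path of page_url start with any blocked path?"""
    path = urlsplit(page_url).path or "/"
    return any(path.startswith(prefix) for prefix in disallowed_paths if prefix)


def hidden_size(pages, disallowed_paths):
    """Step c): add up the sizes of the blocked pages only."""
    return sum(size for url, size in pages if is_blocked(url, disallowed_paths))


pages = [
    ("http://example.org/cgi-bin/vote", 120),
    ("http://example.org/index.html", 800),
    ("http://example.org/tmp/a.html", 80),
]
print(hidden_size(pages, ["/cgi-bin/", "/tmp/"]))  # prints 200
```

An empty Disallow value is skipped here, since in the standard it means that nothing is blocked.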
Another process would validate the robots.txt to check for correctness
Analyse and collate statistical information, namely: the domains that use robots.txt, what type of content is typically hidden (e.g. cgi-bin), and the page rank/importance of the sites that use valid/invalid robots.txt files
The ROBOT Meta tag
This is used as an easy alternative to robots.txt, to specify whether a robot can access and index a particular web page. This is done through a ‘Robots’ META tag [4], which can be placed in the head of an HTML document as shown below:
<META NAME="ROBOTS" CONTENT="NOINDEX,NOFOLLOW">
<META NAME="DESCRIPTION" CONTENT="THIS PAGE ....">
As part of the project, we would analyze the HTML content of web pages to find out whether it contains a ROBOTS META tag, and evaluate the portion of the web that uses these tags as opposed to robots.txt.
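Detecting the tag in a downloaded page could be done along these lines. This is a sketch using Python's standard `html.parser` module; the class and function names are hypothetical and are not part of the crawler.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives of every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag and attribute names, but not values.
        if tag != "meta":
            return
        attributes = {name: (value or "") for name, value in attrs}
        if attributes.get("name", "").strip().lower() == "robots":
            for token in attributes.get("content", "").split(","):
                token = token.strip().lower()
                if token:
                    self.directives.add(token)


def robots_meta_directives(html):
    """Return the set of directives, e.g. {'noindex', 'nofollow'}."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives
```

A page would count as using the tag whenever the returned set is non-empty.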
Project Proposal
Project Goals:
The Robots Exclusion Standard [1] is a de facto standard used to inform crawlers about the disallowed sections of a web server. It has been in general use since the mid-1990s and is heavily used to limit access to pages or sections such as very deep virtual trees, duplicated information, temporary information, or CGI scripts with side effects (such as voting). From a crawler's perspective this is a voluntary standard, since it provides no mechanism to stop crawlers from accessing disallowed sections. However, most crawlers adopt this standard and obey its rules.
Although the standard has been around for almost a decade, extensive research regarding its usage has not been done. As part of the project, we will perform a statistical analysis of the usage of the standard and explore the following areas:
- Usage of the standard – what percentage of the web uses (follows) the standard.
- Hidden Web – what percentage of the web is hidden from robots
- Accuracy of the de facto standard – analyze the accuracy of the robots.txt documents that we collect during the research to come up with a statistical figure for their accuracy
- Other means of preventing robots such as ‘Robots’ META tag[4]
Based on the results of the analysis, we plan to recommend that the robots.txt be accepted as an official standard.
Implementation options:
Currently we are considering two approaches to get the required information using crawlers namely:
- Crawl a random collection of sites from different domains and get the statistics. The results of this approach will depend on the quality and the breadth of the sample that we select for crawling.
- Crawl the web up to a certain maximum and collect the required information in the process of crawling. This depends on the number of sites that we crawl during the research. This approach provides the flexibility of using the crawler with different maximum values and hence will be able to improve the correctness of the results.
We will use an open source crawler such as Apache Nutch [2] or Heritrix [3] and enhance it to suit the project’s requirements. Modifications would be made to calculate the size of the web pages that the crawler comes across and to calculate the size of the disallowed portion of the web. This can be achieved by downloading the HTTP header contents without downloading the entire web page. In addition, the crawler will save the robots.txt files of the various sites, and these files will later be checked for correctness using a validator.
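Reading a page's size from the HTTP headers alone could be sketched as follows. The sketch assumes the server reports a Content-Length in response to a HEAD request, which is not guaranteed (dynamic pages often omit it). The function name and the `opener` parameter are hypothetical; the latter exists only so the network call can be replaced.

```python
import urllib.request


def page_size_from_headers(url, opener=urllib.request.urlopen, timeout=10):
    """Return the Content-Length reported for url, or None if absent or invalid."""
    # A HEAD request asks for the headers only; no body is transferred.
    request = urllib.request.Request(url, method="HEAD")
    with opener(request, timeout=timeout) as response:
        value = response.headers.get("Content-Length")
    try:
        return int(value) if value is not None else None
    except ValueError:
        return None
```

When the function returns None, a crawler would have to fall back to downloading the page to measure it.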
References:
[1] Robots Exclusion Standard, http://www.robotstxt.org/wc/norobots.html
[2] Apache Nutch, http://lucene.apache.org/nutch/
[3] Heritrix, http://crawler.archive.org/
[4] Robots META Tag, http://www.searchengineworld.com/metatag/robots.htm