Robots.txt to become an internet standard after 25 years

Ambiguity over spidering the web, begone.

A quarter of a century after it was first created, the de facto robots.txt standard has been submitted by Google to the Internet Engineering Task Force (IETF), to be formalised and updated to cover modern-day corner cases.

It might seem like a small step, but don't be mistaken; it's a big thing.

Dutch software engineer Martijn Koster, creator of Aliweb, the web's first search engine, proposed a set of rules in 1994 to limit automated web crawlers' access to sites, after a badly written indexer effectively caused a denial-of-service attack on his server.

Webmasters could put the rules into a file, robots.txt, and save it in the root directory of a web server to tell web spiders what data they should and shouldn't access.
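As a rough illustration of how a crawler might consult such a file, the sketch below uses Python's standard-library urllib.robotparser module. The rules, the "ExampleBot" user agent and the example.com addresses are made up for this example, and this is not Google's parser.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
RULES = """\
User-agent: *
Disallow: /private/
"""


def can_fetch(rules: str, user_agent: str, url: str) -> bool:
    """Return True if the given rules let user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network access is needed here.
    parser.parse(rules.splitlines())
    return parser.can_fetch(user_agent, url)
```

Calling can_fetch(RULES, "ExampleBot", "https://example.com/private/page.html") reports that the page is off limits, while the same call for https://example.com/index.html reports that it may be crawled.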

Since the robots.txt file was never made into an official internet standard, there have been several differing interpretations of the protocol over the last two decades of use.

This has made it difficult for webmasters to get the rules right, and Google is now seeking to formalise the protocol, and to update it.

Among the updates is the ability to apply robots.txt to any Uniform Resource Identifier (URI) scheme, not just the Hypertext Transfer Protocol (HTTP) as at present.
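One way to picture this is a helper that works out where the robots.txt file would sit for a URI in any scheme. The sketch below assumes the file still lives at the root of the same scheme and host, which is an assumption made for illustration rather than a detail taken from the proposal. The function name and the example.com addresses are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_location(uri: str) -> str:
    """Guess the robots.txt address for a URI, whatever its scheme.

    Assumption: the file sits at /robots.txt under the same scheme
    and authority (host and port) as the URI being checked.
    """
    parts = urlsplit(uri)
    # Keep scheme and authority, replace the path, drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))
```

Under that assumption, an address such as ftp://example.com/pub/file.txt would be governed by ftp://example.com/robots.txt, just as an HTTPS page would be governed by the robots.txt on its own host.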

Google also proposes requiring crawlers to parse at least the first 500 kibibytes (KiB) of a robots.txt file and defining a maximum file size to avoid undue stress on servers, along with a new maximum caching time of 24 hours.
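A crawler might apply those two limits along the lines of the sketch below. The constant and function names are hypothetical, and treating 500 KiB as the point at which this particular crawler stops reading is a design choice made for the example, not something the proposal requires.

```python
MAX_PARSE_BYTES = 500 * 1024        # 500 KiB, i.e. 512,000 bytes
MAX_CACHE_SECONDS = 24 * 60 * 60    # 24 hours


def bytes_to_parse(body: bytes, limit: int = MAX_PARSE_BYTES) -> bytes:
    """Return the leading part of a robots.txt body that will be parsed.

    Assumption: this crawler ignores anything past the limit.
    """
    return body[:limit]


def cache_is_fresh(fetched_at: float, now: float,
                   max_age: int = MAX_CACHE_SECONDS) -> bool:
    """Return True while a cached copy is younger than max_age seconds."""
    return 0 <= now - fetched_at < max_age
```

Both timestamps are plain seconds, for instance values from time.time(). A copy fetched an hour ago would still be served from the cache, while one fetched 25 hours ago would be fetched again.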

Google is also proposing that in the case of a server failure making the previously parsed robots.txt file inaccessible, known excluded pages should not be crawled for a reasonably long time.
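In code, that fallback could look something like the sketch below. The proposal as described here does not put a number on "reasonably long", so the grace period is left as a parameter the caller must supply. The function name, the prefix-matching approach and the example values are all assumptions made for illustration.

```python
from typing import Iterable


def blocked_during_outage(path: str,
                          known_disallowed_prefixes: Iterable[str],
                          seconds_unreachable: float,
                          grace_seconds: float) -> bool:
    """Decide whether to keep skipping a path while robots.txt is unreachable.

    known_disallowed_prefixes holds the exclusions from the last copy of
    robots.txt that was parsed successfully. grace_seconds is a
    hypothetical, caller-chosen window.
    """
    if seconds_unreachable > grace_seconds:
        # Past the window, this sketch stops enforcing the stale rules.
        # What a crawler should do at this point is not covered here.
        return False
    return any(path.startswith(prefix) for prefix in known_disallowed_prefixes)
```

With a one-week grace period, a path under a previously excluded /private/ directory would stay off limits an hour into an outage, while a path that was never excluded would remain crawlable.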

Improving the syntax definition of robots.txt is also part of the proposed internet standard, to help developers write code to parse the file correctly.

Over the years, the Robots Exclusion Protocol, or REP, became a de facto internet standard, with many indexers, but not all, being compatible with it.

Sites that do not have a robots.txt file are assumed to provide no instructions to crawlers, which will proceed to access all the data on the server in question and crawl the entire site.

Following robots.txt instructions is voluntary, with malicious robots often ignoring the file.

The Internet Archive also no longer follows robots.txt, as the file hinders its work of keeping accurate historical records of web content.

Alongside submitting the protocol to the IETF, Google has open-sourced the C++ library that it uses in production systems to parse robots.txt files on the web servers it indexes.

Google said that while the robots.txt parser has evolved over the past twenty years of use, it still contains pieces of code written in the 1990s.
