Analyze robots.txt with Python Standard Library

2023-09-24

If I hadn’t searched for both “python” and “robots.txt” in the same input box, I would never have known that the Python Standard Library can parse robots.txt with urllib.robotparser.

But the official documentation of urllib.robotparser doesn’t go into much detail. From the documentation, you can check whether a robot may fetch a URL with robot_parser_inst.can_fetch(user_agent, url), which is enough if you are building a crawler bot yourself. But if you want to compute some statistics about robots.txt, like what the most commonly disallowed paths for Googlebot are, the documentation not only doesn’t tell you how to do it, it doesn’t even tell you whether the library can.

The library can in fact do a lot more, and this article is meant to supplement the official documentation.

Parse a downloaded file

If you download a robots.txt file and want to store or cache it and parse it later, you are not going to use the .read() method from the official documentation but the .parse() method.

Note that you need to pass an iterable of lines to the .parse() method, for example:

from urllib.robotparser import RobotFileParser

with open('robots.txt') as fd:
    rfp = RobotFileParser()
    rfp.parse(fd.readlines())
    # or, equivalently:
    # rfp.parse(fd.read().splitlines())

After parsing, you may use the .can_fetch() method to check whether a user agent can crawl a path.
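
For example, continuing with the rfp object from the snippet above (the user agent and path here are just made-up placeholders):

if rfp.can_fetch('MyBot', '/some/path'):
    print('MyBot may crawl /some/path')
else:
    print('MyBot is disallowed from /some/path')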

Entry and RuleLine

Beyond checking whether a crawler bot can fetch a URL, there is more that can be done with robots.txt files, e.g., data analysis and statistics. For that we need the ability to iterate over the rules parsed from a robots.txt file. The documentation of urllib.robotparser doesn’t go into detail about this topic. Entry and RuleLine are the internal classes that give us this ability and let us extract statistics from robots.txt files.

An Entry object groups the rules that apply to the user agents listed in its User-agent lines. Each Entry contains a .rulelines attribute, which is a list of RuleLine objects. A RuleLine object describes the path of a rule and whether that path is allowed or disallowed. A parsed RobotFileParser object may contain several Entry objects: the entry for user agent ‘*’ is stored in .default_entry, and the rest of the Entry objects are in the .entries list attribute.

Below is an example of the Entry and RuleLine objects in a robots.txt file, tagged in the comments:

User-agent: Googlebot  # .entries[0]
Allow: /blogs/  # .entries[0].rulelines[0]
Disallow: /  # .entries[0].rulelines[1]

User-agent: *  # .default_entry
Disallow: /  # .default_entry.rulelines[0]
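
To see this structure concretely, here is a small sketch that parses the example above (minus the tag comments) and inspects the objects; the values in the comments are what current CPython produces and could change, since these classes are undocumented internals:

from urllib.robotparser import RobotFileParser

content = '''\
User-agent: Googlebot
Allow: /blogs/
Disallow: /

User-agent: *
Disallow: /
'''

rfp = RobotFileParser()
rfp.parse(content.splitlines())

print(rfp.entries[0].useragents)                 # ['Googlebot']
print(rfp.entries[0].rulelines[0].path)          # /blogs/
print(rfp.entries[0].rulelines[0].allowance)     # True
print(rfp.default_entry.useragents)              # ['*']
print(rfp.default_entry.rulelines[0].allowance)  # False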

To iterate over all the rules in a robots.txt file, consider the following code snippet:

from urllib.robotparser import RobotFileParser


def iterate_rules(robots_content):
    rfp = RobotFileParser()
    rfp.parse(robots_content.splitlines())
    # The entry for user agent '*' is kept separately in .default_entry
    entries = ([rfp.default_entry, *rfp.entries]
               if rfp.default_entry else rfp.entries)
    for entry in entries:
        for ruleline in entry.rulelines:
            yield (entry.useragents, ruleline.path, ruleline.allowance)
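
As a quick illustration, feeding the content string from the sketch above to iterate_rules() yields one tuple per rule (the ‘*’ entry comes first because of the ordering above); the output comments show roughly what to expect:

for useragents, path, allowance in iterate_rules(content):
    print(useragents, path, 'Allow' if allowance else 'Disallow')
# ['*'] / Disallow
# ['Googlebot'] /blogs/ Allow
# ['Googlebot'] / Disallow

From here it is a small step to aggregate, say, the most frequently disallowed paths for Googlebot across many robots.txt files with a collections.Counter.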

The wildcard in Allow/Disallow

This section is not so much about urllib.robotparser as about the robots.txt protocol itself. In a robots.txt file, you may use ‘*’ as a wildcard symbol in User-agent directives, like

User-agent: *

But what about using ‘*’ in Allow and Disallow directives? That depends. The robots.txt protocol has two major versions. The older version, described at https://www.robotstxt.org/orig.html, has no support for wildcard symbols in Allow/Disallow directives, and Python’s urllib.robotparser is implemented against that version. The newer version, specified in RFC 9309, introduces support for the ‘$’ and ‘*’ symbols; Googlebot and Bingbot both support this version.

This makes the .can_fetch() method in the Python Standard Library somewhat outdated for checking whether a URL is crawlable. But if you want to do some analysis on a bunch of robots.txt files, urllib.robotparser is still a straightforward yet solid choice.
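
A quick sketch of the mismatch: the rule below uses RFC 9309 wildcards, which the standard library matches as literal characters, so the URL is reported as crawlable even though Googlebot or Bingbot would treat it as disallowed (the user agent name is just a placeholder):

from urllib.robotparser import RobotFileParser

rfp = RobotFileParser()
rfp.parse('''\
User-agent: *
Disallow: /*.pdf$
'''.splitlines())

print(rfp.can_fetch('SomeBot', '/docs/report.pdf'))  # True: '*' and '$' are taken literally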

Monkey patch for urllib.robotparser

We can monkey patch urllib.robotparser to support ‘*’ and ‘$’ as wildcards in Allow and Disallow directives. The code is as below (or in a GitHub gist):

'''
A monkey patch for urllib.robotparser to support * and $ in robots.txt.
'''
import re
import urllib.parse
from urllib.robotparser import RobotFileParser, Entry, RuleLine


def get_robots_pattern(path):
    '''Translate a robots.txt path with * and $ wildcards into a regex.'''
    ending = '.*?'
    if path.endswith('$'):
        path = path[:-1]
        ending = '$'
    parts = path.split('*')
    # Percent-quote each literal part the same way the stdlib quotes the
    # URL passed to .can_fetch(), then escape it for use inside a regex.
    parts = map(re.escape, map(urllib.parse.quote, parts))
    return '.*?'.join(parts) + ending


def _rule_line__init__(self, path, allowance):
    if path == '' and not allowance:
        # an empty value means allow all
        allowance = True
    path = urllib.parse.urlunparse(urllib.parse.urlparse(path))
    self.pattern = re.compile(get_robots_pattern(path))
    self.path = path
    self.allowance = allowance


def _rule_line_applies_to(self, filename):
    return self.pattern.match(filename) is not None


RuleLine.__init__ = _rule_line__init__
RuleLine.applies_to = _rule_line_applies_to

__all__ = ['RobotFileParser']

Import RobotFileParser from the code above to use the patched version.
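
For example, assuming the patch is saved as robotparser_patched.py (the module name is just an assumption), the wildcard rule from the previous section now behaves the way Googlebot and Bingbot would interpret it:

# robotparser_patched is the hypothetical file containing the patch above
from robotparser_patched import RobotFileParser

rfp = RobotFileParser()
rfp.parse('''\
User-agent: *
Disallow: /*.pdf$
'''.splitlines())

print(rfp.can_fetch('SomeBot', '/docs/report.pdf'))       # False: matched by /*.pdf$
print(rfp.can_fetch('SomeBot', '/docs/report.pdf.html'))  # True: '$' anchors the end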