If I hadn’t searched for both “python” and “robots.txt” in the same search box, I would never have known that the Python standard library can parse robots.txt with urllib.robotparser. But the official documentation of urllib.robotparser doesn’t go into much detail. From the documentation you learn that you can check whether a URL may be fetched by a robot with robot_parser_inst.can_fetch(user_agent, url), which is enough if you are building a crawler bot yourself. But if you want to compute statistics over robots.txt files, such as which paths are most often disallowed for Googlebot, the documentation tells you neither how to do it nor whether the library can do it at all.
The library can in fact do a lot more, and this article is meant as a supplement to the official documentation.
Parse a downloaded file
If you download a robots.txt file and want to store or cache it and parse it later, you are not going to use the .read() method from the official documentation but the .parse() method. Note that you need to pass an iterable of lines to .parse(), for example:
```python
from urllib.robotparser import RobotFileParser

fd = open('robots.txt')
rfp = RobotFileParser()
rfp.parse(fd.readlines())
# or equivalently:
# rfp.parse(fd.read().splitlines())
```
After parsing, you may use the .can_fetch() method to check whether a user agent can crawl a path.
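For example, a minimal sketch (the bot name mybot, the example.com URLs, and the robots.txt content here are all made up for illustration):

```python
from urllib.robotparser import RobotFileParser

# a hypothetical cached robots.txt
content = "User-agent: *\nDisallow: /private/\n"

rfp = RobotFileParser()
rfp.parse(content.splitlines())

print(rfp.can_fetch('mybot', 'https://example.com/private/data'))  # False
print(rfp.can_fetch('mybot', 'https://example.com/index.html'))    # True
```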
Entry and RuleLine
Beyond checking whether a crawler bot can fetch a URL, there is more one might want to do with robots.txt files, e.g. data analysis and statistics. For that we need the ability to iterate over the rules parsed from a robots.txt file, and the documentation of urllib.robotparser doesn’t cover this topic at all.
Entry and RuleLine are the internal classes that give us access to the parsed rules of robots.txt files.
An Entry object is the list of rules that applies to the user agents named in its .useragents attribute. An Entry holds a .rulelines attribute, which is a list of RuleLine objects, and each RuleLine describes a path together with its allow or disallow state. A parsed RobotFileParser object may contain several Entry objects: the entry for user agent ‘*’ is stored in .default_entry, and the rest of the Entry objects are in the .entries list attribute.
Below is an example of Entry and RuleLine objects in a robots.txt, tagged in the comments:

```
User-agent: Googlebot  # .entries
Allow: /blogs/         # .entries.rulelines
Disallow: /            # .entries.rulelines

User-agent: *          # .default_entry
Disallow: /            # .default_entry.rulelines
```
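Parsing that example shows how the internal attributes line up; a quick sketch (the attribute names are the real urllib.robotparser internals described above, the values are what CPython’s parser produces):

```python
from urllib.robotparser import RobotFileParser

content = '''User-agent: Googlebot
Allow: /blogs/
Disallow: /

User-agent: *
Disallow: /
'''
rfp = RobotFileParser()
rfp.parse(content.splitlines())

print(rfp.default_entry.useragents)           # ['*']
print(rfp.entries[0].useragents)              # ['Googlebot']
print(rfp.entries[0].rulelines[0].path)       # /blogs/
print(rfp.entries[0].rulelines[0].allowance)  # True
```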
To iterate over all the rules in a robots.txt file, consider the following code snippet:
```python
from urllib.robotparser import RobotFileParser

def iterate_rules(robots_content):
    rfp = RobotFileParser()
    rfp.parse(robots_content.splitlines())
    entries = [rfp.default_entry, *rfp.entries] \
        if rfp.default_entry else rfp.entries
    for entry in entries:
        for ruleline in entry.rulelines:
            yield (entry.useragents, ruleline.path, ruleline.allowance)
```
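Feeding the example robots.txt from the previous section through this generator gives one (useragents, path, allowance) triple per rule; a self-contained demo (the function is repeated here so the snippet runs on its own):

```python
from urllib.robotparser import RobotFileParser

def iterate_rules(robots_content):
    # same generator as above
    rfp = RobotFileParser()
    rfp.parse(robots_content.splitlines())
    entries = [rfp.default_entry, *rfp.entries] \
        if rfp.default_entry else rfp.entries
    for entry in entries:
        for ruleline in entry.rulelines:
            yield (entry.useragents, ruleline.path, ruleline.allowance)

content = '''User-agent: Googlebot
Allow: /blogs/
Disallow: /

User-agent: *
Disallow: /
'''
for rule in iterate_rules(content):
    print(rule)
# (['*'], '/', False)
# (['Googlebot'], '/blogs/', True)
# (['Googlebot'], '/', False)
```

Note that the ‘*’ entry comes first because the generator puts .default_entry at the head of the list.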
The wildcard in Allow/Disallow
This section is not so much about urllib.robotparser as about the robots.txt protocol itself. In a robots.txt file, you may use ‘*’ as a wildcard symbol in User-agent directives, as in User-agent: *.
But what about using ‘*’ in Allow and Disallow directives? That depends. The robots.txt protocol has two major versions. The original version, described at https://www.robotstxt.org/orig.html, has no support for wildcard symbols in Allow/Disallow directives, and Python’s urllib.robotparser is implemented against that version. The newer version, specified in RFC 9309, introduces support for the ‘$’ and ‘*’ symbols, and Googlebot and Bingbot both support it.
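You can see the gap directly: the stock parser treats a RFC 9309-style wildcard rule as a literal path prefix, so the rule never matches. A small sketch (the bot name mybot and the URL are made up):

```python
from urllib.robotparser import RobotFileParser

rfp = RobotFileParser()
rfp.parse(['User-agent: *', 'Disallow: /*.pdf$'])

# The 1997-era matcher percent-encodes '*' and '$' and compares prefixes,
# so the PDF is reported as fetchable even though RFC 9309 semantics
# would block it:
print(rfp.can_fetch('mybot', 'https://example.com/file.pdf'))  # True
```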
This makes the .can_fetch() method in the Python standard library somewhat outdated for checking whether a URL is crawlable. But if you want to run some analysis over a bunch of robots.txt files, urllib.robotparser is still a straightforward yet solid choice.
Monkey patch for urllib.robotparser
We can monkey patch urllib.robotparser to support * and $ as wildcards in Allow and Disallow directives. The code is below (also available as a GitHub gist):
```python
'''
A monkey patch for urllib.robotparser to support * and $ in robots.txt.
'''
import re
import urllib.parse
from urllib.robotparser import RobotFileParser, Entry, RuleLine


def get_robots_pattern(path):
    ending = '.*?'
    if path.endswith('$'):
        path = path[:-1]
        ending = '$'
    parts = path.split('*')
    # quote first, then regex-escape, so the backslashes added by
    # re.escape are not percent-encoded away by quote
    parts = map(re.escape, map(urllib.parse.quote, parts))
    return '.*?'.join(parts) + ending


def _rule_line__init__(self, path, allowance):
    if path == '' and not allowance:
        # an empty value means allow all
        allowance = True
    path = urllib.parse.urlunparse(urllib.parse.urlparse(path))
    self.pattern = re.compile(get_robots_pattern(path))
    self.path = path
    self.allowance = allowance


def _rule_line_applies_to(self, filename):
    return self.pattern.match(filename) is not None


RuleLine.__init__ = _rule_line__init__
RuleLine.applies_to = _rule_line_applies_to

__all__ = ['RobotFileParser']
```
Import RobotFileParser from the module above to use the patched version.
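As a quick end-to-end check, the sketch below reapplies the same patch inline (so the snippet runs on its own, without saving the module to a file) and verifies that a RFC 9309-style wildcard rule now takes effect; the bot name mybot and the URLs are made up:

```python
import re
import urllib.parse
from urllib.robotparser import RobotFileParser, RuleLine

def get_robots_pattern(path):
    # condensed version of the helper from the patch above
    ending = '.*?'
    if path.endswith('$'):
        path, ending = path[:-1], '$'
    # quote first, then regex-escape
    parts = [re.escape(urllib.parse.quote(p)) for p in path.split('*')]
    return '.*?'.join(parts) + ending

def _rule_line__init__(self, path, allowance):
    if path == '' and not allowance:
        allowance = True  # an empty value means allow all
    path = urllib.parse.urlunparse(urllib.parse.urlparse(path))
    self.pattern = re.compile(get_robots_pattern(path))
    self.path = path
    self.allowance = allowance

RuleLine.__init__ = _rule_line__init__
RuleLine.applies_to = lambda self, filename: self.pattern.match(filename) is not None

rfp = RobotFileParser()
rfp.parse(['User-agent: *', 'Disallow: /*.pdf$'])
print(rfp.can_fetch('mybot', 'https://example.com/file.pdf'))   # False: the wildcard now applies
print(rfp.can_fetch('mybot', 'https://example.com/page.html'))  # True
```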