Python: urllib2 handlers

Python is a great language with a good community (see the #python IRC channel on Freenode), but its modules can have lousy documentation at times.

One place where I find the documentation lacking is in the description of urllib2’s BaseHandler class. Subclasses of this class can be passed to urllib2’s build_opener function to add functionality to your url-opening activity. Useful applications include modifying HTTP requests and responses, or adding support for an otherwise unsupported network protocol. Unfortunately, while the methods of interest are described, examples are sparse. Here I’ll give a couple of examples of handlers that I’ve come up with. They may not be “pythonic,” and I’d be grateful for pythonic suggestions, but they get their jobs done and demonstrate some of the features of BaseHandler.

The first example is fairly simple: a User-Agent string injector.

import urllib2

class UserAgentProcessor(urllib2.BaseHandler):
    """A handler to add a custom UA string to urllib2 requests
    """
    def __init__(self, uastring):
        self.handler_order = 100
        self.ua = uastring

    def http_request(self, request):
        request.add_header("User-Agent", self.ua)
        return request
    https_request = http_request

The __init__ method takes a string to be used as the user agent string for HTTP requests. __init__ also sets the handler_order attribute to 100; the default for BaseHandler subclasses is 500, but we have to set a smaller number; otherwise our UA will be overwritten with the urllib2 default UA. I’m not entirely sure why (the docs barely mention this attribute), but my experimentation has shown that it’s necessary.

The http_request (and https_request) methods are used to pre-process HTTP (and HTTPS) requests before the requests are actually made. The “request” parameter is a urllib2.Request object, which contains a wealth of editable information, including HTTP headers. Here I’ve simply aliased https_request to http_request, since https requests don’t need special handling to add a UA string.
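As a quick sanity check of my own (not part of the original handler code), you can call http_request by hand on a Request object and confirm that build_opener keeps its handler list sorted by handler_order. The try/except import is only there so the snippet also runs on Python 3, where the module was renamed urllib.request:

```python
try:
    import urllib2                      # Python 2
except ImportError:
    import urllib.request as urllib2    # same machinery under Python 3

class UserAgentProcessor(urllib2.BaseHandler):
    """The handler from above, repeated so this snippet stands alone."""
    handler_order = 100   # run before the default handlers (order 500)

    def __init__(self, uastring):
        self.ua = uastring

    def http_request(self, request):
        request.add_header("User-Agent", self.ua)
        return request
    https_request = http_request

# Exercise http_request directly -- no network involved.
# Note: add_header() normalizes the key to "User-agent".
req = urllib2.Request("http://example.com/")
req = UserAgentProcessor("My Useragent").http_request(req)
print(req.get_header("User-agent"))          # My Useragent

# build_opener slots our handler in by handler_order:
opener = urllib2.build_opener(UserAgentProcessor("My Useragent"))
orders = [h.handler_order for h in opener.handlers]
print(orders == sorted(orders))              # True
```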

To use this handler, we need to build an opener with it:

opener = urllib2.build_opener(UserAgentProcessor("My Useragent"))

Then we can either “install” our new opener or use it straight up:

# install and use the opener
urllib2.install_opener(opener)
urllib2.urlopen("http://whatismyuseragent.com/")

# Just use it
opener.open("http://whatismyuseragent.com/")

A second, more complex example adds transparent gzip support to urllib2.

import urllib2
from gzip import GzipFile
from StringIO import StringIO

class GZipProcessor(urllib2.BaseHandler):
    """A handler to add gzip capabilities to urllib2 requests
    """
    def http_request(self, req):
        req.add_header("Accept-Encoding", "gzip")
        return req
    https_request = http_request

    def http_response(self, req, resp):
        if resp.headers.get("content-encoding") == "gzip":
            gz = GzipFile(
                fileobj=StringIO(resp.read()),
                mode="r"
            )
            old_resp = resp
            resp = urllib2.addinfourl(gz, old_resp.headers, old_resp.url, old_resp.code)
            resp.msg = old_resp.msg
        return resp
    https_response = http_response

You can see that the GZipProcessor class uses the http_request method, like UserAgentProcessor, to add an HTTP header to the request. However, GZipProcessor also uses the http_response method to post-process the response before it is returned to whatever called urlopen().

In http_response, the “req” parameter is the urllib2.Request object, and “resp” is the response: a file-like object (a urllib2.addinfourl instance wrapping the underlying HTTP response). Since the response contains the page contents, that’s what we’re interested in. If the server responded with gzipped content, it should have set the “Content-Encoding” response HTTP header to “gzip”. If we have received gzipped content, we need to make it so that read attempts on the response object return valid strings of text, not chunks of gzip data.

The method I’ve used here is somewhat convoluted and certainly not “pythonic”, but it gets the job done. First, the gzipped content is read from the response and stuck into a StringIO object, which offers a file-like interface to string data. This StringIO object is then passed to GzipFile’s fileobj argument, so reads from the GzipFile yield decompressed text. Now, since the response object carries information other than the page contents (headers, URL, status code), we can’t simply return the GzipFile in its place; instead, we build a replacement response with urllib2.addinfourl, handing it the GzipFile as its file object along with the original headers, URL, and code. By instantiating a new addinfourl object, we don’t have to worry about wiring up each read method by hand; the constructor will take care of that for us. Now, when we request a page and get gzipped contents, we can read(), readline(), etc, seamlessly, as if there were no gzipping involved. The downside is that it will take slightly longer to receive our response, because GZipProcessor has to read the entire contents of the page before decoding it.
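The core trick, stripped of all the urllib2 plumbing, is just GzipFile reading from an in-memory file object. This standalone sketch (using io.BytesIO rather than StringIO, so it also runs on Python 3) mimics what happens to a response body:

```python
import gzip
import io

# Fake a gzipped "response body", the way a server would send it:
buf = io.BytesIO()
gz = gzip.GzipFile(fileobj=buf, mode="wb")
gz.write(b"the actual page contents")
gz.close()
compressed = buf.getvalue()

# The handler's move: wrap the compressed payload in a file-like
# object so that plain read()/readline() calls yield plain text.
wrapped = gzip.GzipFile(fileobj=io.BytesIO(compressed), mode="rb")
print(wrapped.read())    # the decompressed page contents
```

Note that, just as in the handler, the whole compressed payload has to be in memory before GzipFile can be pointed at it.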

Now, granted, there are other ways to do these things. A UA string can be set on a manually created Request object, which can then be passed to a plain opener (created with urllib2.build_opener()) just fine. The same goes for gzip acceptance; you’ll just have to remember to un-zip the contents you get back yourself. But the nice thing about handlers is that you can build your opener once with your custom handlers and not have to worry about the details anywhere else in your code. Also, if you install your opener with urllib2.install_opener(), you can simply use urllib2.urlopen() for every page you need to open, and your handlers will automatically be called as needed. Just plug-and-chug.
