Revision: 59337
Initial Code
Initial URL
Initial Description
Initial Title
Initial Tags
Initial Language
at September 1, 2012 07:15 by scrapy
Initial Code
# A spider middleware to canonicalize the urls of all requests generated from a spider.
from scrapy.http import Request
from scrapy.utils.url import canonicalize_url
class UrlCanonicalizerMiddleware(object):
def process_spider_output(self, response, result, spider):
for r in result:
if isinstance(r, Request):
curl = canonicalize_url(r.url)
if curl != r.url:
r = r.replace(url=curl)
yield r
# Snippet imported from snippets.scrapy.org (which no longer works)
# author: pablo
# date : Sep 07, 2010
Initial URL
Initial Description
Initial Title
URL Canonicalizer Spider middleware
Initial Tags
url
Initial Language
Python