Posted By

scrapy on 09/01/12


Tagged

url scrapy c18n canonicalization spider-middleware


Versions (?)

Who likes this?

1 person have marked this snippet as a favorite

icecreamboyy


URL Canonicalizer Spider middleware


 / Published in: Python
 

  1. # A spider middleware to canonicalize the urls of all requests generated from a spider.
  2.  
  3. from scrapy.http import Request
  4. from scrapy.utils.url import canonicalize_url
  5.  
  6. class UrlCanonicalizerMiddleware(object):
  7. def process_spider_output(self, response, result, spider):
  8. for r in result:
  9. if isinstance(r, Request):
  10. curl = canonicalize_url(r.url)
  11. if curl != r.url:
  12. r = r.replace(url=curl)
  13. yield r
  14.  
  15. # Snippet imported from snippets.scrapy.org (which no longer works)
  16. # author: pablo
  17. # date : Sep 07, 2010
  18.  

Report this snippet  

You need to login to post a comment.