Posted By

hlongmore on 07/01/12


Tagged

crawler spider scraping scrapy



A simple spider using scrapy


Published in: Python

A simple spider that uses the scrapy module to collect the text, title, URL, author, and date of some poems. Although it was written with poems in mind, minor customization should adapt it to other scraping projects where the desired data is present in the static HTML rather than generated dynamically.

# 3rd party imports
# These import paths are from the Scrapy 0.x series; later releases moved
# them (scrapy.spiders, scrapy.linkextractors) and deprecated HtmlXPathSelector.
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector

# My imports
from poetry_analysis.items import PoetryAnalysisItem

# Regex fragment appended to a start URL to match individual poem pages
HTML_FILE_NAME = r'.+\.html'


class PoetryParser(object):
    """
    Provides a common parsing method for poems formatted this one specific way.
    """
    # Matches dates such as "01 July 2012"
    date_pattern = r'(\d{2} \w{3,9} \d{4})'

    def parse_poem(self, response):
        hxs = HtmlXPathSelector(response)
        item = PoetryAnalysisItem()
        # All poetry text is in pre tags
        text = hxs.select('//pre/text()').extract()
        item['text'] = ''.join(text)
        item['url'] = response.url
        # head/title contains "title - a poem by author"
        title_text = hxs.select('//head/title/text()').extract()[0]
        item['title'], item['author'] = title_text.split(' - ')
        item['author'] = item['author'].replace('a poem by', '')
        for key in ['title', 'author']:
            item[key] = item[key].strip()
        # date_pattern is a class attribute, so it must be reached through self;
        # .re() returns a list of every match found
        item['date'] = hxs.select("//p[@class='small']/text()").re(self.date_pattern)
        return item


class PoetrySpider(CrawlSpider, PoetryParser):
    name = 'example.com_poetry'
    allowed_domains = ['www.example.com']
    root_path = 'someuser/poetry/'
    start_urls = ['http://www.example.com/someuser/poetry/recent/',
                  'http://www.example.com/someuser/poetry/less_recent/']
    # Follow only links to .html files under each start URL and parse them as poems
    rules = [Rule(SgmlLinkExtractor(allow=[start_urls[0] + HTML_FILE_NAME]),
                  callback='parse_poem'),
             Rule(SgmlLinkExtractor(allow=[start_urls[1] + HTML_FILE_NAME]),
                  callback='parse_poem')]
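
The title and date handling can be tried out without running a crawl. The following is only a sketch using the standard library: the helper names `split_title` and `find_dates` and the sample strings are made up for illustration, and `rsplit` is swapped in for the snippet's `split` on the assumption that a poem title might itself contain ' - '.

```python
import re

# Same date regex as the spider, e.g. "01 July 2012"
DATE_PATTERN = r'(\d{2} \w{3,9} \d{4})'


def split_title(title_text):
    """Split 'Title - a poem by Author' into a (title, author) tuple."""
    # Splitting once from the right tolerates ' - ' inside the title itself
    title, author = title_text.rsplit(' - ', 1)
    author = author.replace('a poem by', '')
    return title.strip(), author.strip()


def find_dates(text):
    """Return every date-like substring, as a list (like the selector's .re())."""
    return re.findall(DATE_PATTERN, text)


if __name__ == '__main__':
    # Hypothetical sample inputs, not taken from any real page
    print(split_title('Ode - Part Two - a poem by Jane Doe'))
    print(find_dates('Posted 01 July 2012 by someuser'))
```

Because the result of the date lookup is a list, code that consumes the item may want to take the first element or handle the empty case explicitly.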

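
The `allow` entries in the rules are regular expressions that are searched for in each candidate link, so the start URL itself becomes part of the pattern. The sketch below imitates that check with plain `re` to show which URLs would be followed; `is_poem_url` is a hypothetical helper, and escaping the URL prefix with `re.escape` is an addition the snippet does not make (without it, the dots in the hostname match any character).

```python
import re

HTML_FILE_NAME = r'.+\.html'
START_URL = 'http://www.example.com/someuser/poetry/recent/'


def is_poem_url(url, start_url=START_URL):
    """Return True if url looks like an .html page under start_url."""
    # Escape the literal prefix so '.' matches only a real dot
    pattern = re.escape(start_url) + HTML_FILE_NAME
    return re.search(pattern, url) is not None


if __name__ == '__main__':
    # Hypothetical URLs for illustration
    print(is_poem_url(START_URL + 'sonnet.html'))
    print(is_poem_url(START_URL))
```

The index page itself does not match, because the pattern requires at least one character followed by `.html` after the start URL.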