Posted By

scrapy on 09/01/12


Tagged

parsley scrapy parselet


Parsley Spider


 / Published in: Python
 

# "Parsley is a simple language for extracting structured data from web pages. It consists of a powerful Selector Language wrapped with a JSON Structure that can represent page-wide formatting."
#
# Site parsers written in the Parsley language (parselets) are available from the Parselets site.
#
# "Parselets.com is a central repository for user-created APIs to the web, called Parselets. Parselets are snippets of parsing code written in a language called Parsley, which is a familiar combination of CSS, XPath, Regular Expressions, and JSON."
#
# This example integrates Parsley with Scrapy in two steps. ParsleyItem is a new Item class that defines its fields from the keys of a parselet. ParsleySpider extends CrawlSpider with a method, parse_parsley, that parses a response with a parselet and returns a ParsleyItem.

from pyparsley import PyParsley

from scrapy.contrib.spiders import CrawlSpider
from scrapy.item import Item, Field


class ParsleyItem(Item):
    def __init__(self, parslet_code, *args, **kwargs):
        # Declare one Field per top-level key of the parselet, so the
        # values passed on to Item.__init__ are accepted as known fields.
        for name in parslet_code.keys():
            self.fields[name] = Field()

        super(ParsleyItem, self).__init__(*args, **kwargs)


class ParsleySpider(CrawlSpider):
    # Subclasses replace this with their own parselet.
    parslet_code = {}

    def parse_parsley(self, response):
        # Parse the response body with the parselet and wrap the result
        # in an item whose fields match the parselet's keys.
        parslet = PyParsley(self.parslet_code, output='python')
        return ParsleyItem(self.parslet_code, parslet.parse(string=response.body))


# example youtube.com spider

from scrapy.conf import settings
from scrapy.contrib.spiders import Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor


YOUTUBE_PARSLET = {
    "title": "h1",
    "desc": ".description",
    "rating": ".ratingL @title",
    "embed": "#embed_code @value"
}


class YoutubeSpider(ParsleySpider):
    # Search term, read from the QUERY setting.
    query = settings.get('QUERY')

    domain_name = 'youtube.com'
    start_urls = ['http://www.youtube.com/results?search_query=%s&page=1' %
                  query]

    rules = (
        # Follow the paginated search result pages.
        Rule(SgmlLinkExtractor(allow=(r'results\?search_query=%s&page=\d+' %
                                      query,))),
        # Parse each video page linked from the results with the parselet.
        Rule(SgmlLinkExtractor(allow=(r'watch\?v=',),
                               restrict_xpaths=['//div[@id="results-main-content"]']),
             'parse_parsley'),
    )

    parslet_code = YOUTUBE_PARSLET

# Snippet imported from snippets.scrapy.org (which no longer works)
# author: void
# date : Aug 10, 2010
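
The idea behind ParsleyItem, an item whose allowed fields come from the keys of a parselet, can be tried without Scrapy or pyparsley installed. The following is only a sketch under stated assumptions: `ParseletItem`, `fake_parse`, and the `page` dict are hypothetical stand-ins invented for illustration, not part of either library. `fake_parse` just looks each selector up in a dict, where the real snippet calls `PyParsley(...).parse(...)`. Unlike the snippet, which writes to `self.fields`, this sketch keeps the allowed names on each instance, so two items built from different parselets do not affect each other.

```python
class ParseletItem(dict):
    """Hypothetical stand-in for ParsleyItem: accepts only parselet keys."""

    def __init__(self, parselet_code, values=None):
        super(ParseletItem, self).__init__()
        # Allowed field names are the parselet's top-level keys,
        # stored per instance rather than on the class.
        self.allowed = frozenset(parselet_code)
        for name, value in (values or {}).items():
            self[name] = value

    def __setitem__(self, name, value):
        # Reject any key the parselet does not declare.
        if name not in self.allowed:
            raise KeyError("%r is not a field of this parselet" % name)
        super(ParseletItem, self).__setitem__(name, value)


def fake_parse(parselet_code, page):
    """Hypothetical stand-in for a parselet run: maps selectors to values."""
    return {name: page.get(selector) for name, selector in parselet_code.items()}


YOUTUBE_PARSLET = {
    "title": "h1",
    "desc": ".description",
    "rating": ".ratingL @title",
    "embed": "#embed_code @value",
}

# Made-up page content, keyed by selector, for illustration only.
page = {"h1": "Some video", ".description": "A description"}

item = ParseletItem(YOUTUBE_PARSLET, fake_parse(YOUTUBE_PARSLET, page))
other = ParseletItem({"name": "h1"}, {"name": "Some video"})
```

Here `item` accepts exactly the four keys of `YOUTUBE_PARSLET`, selectors missing from `page` come back as `None`, and `other` knows nothing about `title` because its field names were taken from a different parselet.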
