Scrapy pipeline class to store scraped data in MongoDB


Published in: Python

In connection with my [poetry spider](http://snipplr.com/view/65893/a-simple-spider-using-scrapy/), this Scrapy pipeline class stores the scraped data in a MongoDB database.


```python
# Standard Python library imports

# 3rd party modules
import pymongo

from scrapy import log
from scrapy.conf import settings
from scrapy.exceptions import DropItem


class MongoDBPipeline(object):
    def __init__(self):
        # Read the MongoDB connection details from the project settings
        self.server = settings['MONGODB_SERVER']
        self.port = settings['MONGODB_PORT']
        self.db = settings['MONGODB_DB']
        self.col = settings['MONGODB_COLLECTION']
        connection = pymongo.Connection(self.server, self.port)
        db = connection[self.db]
        self.collection = db[self.col]

    def process_item(self, item, spider):
        # Drop the item if any of its fields is empty
        err_msg = ''
        for field, data in item.items():
            if not data:
                err_msg += 'Missing %s of poem from %s\n' % (field, item['url'])
        if err_msg:
            raise DropItem(err_msg)
        self.collection.insert(dict(item))
        log.msg('Item written to MongoDB database %s/%s' % (self.db, self.col),
                level=log.DEBUG, spider=spider)
        return item
```
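For the pipeline to run at all, the MongoDB settings it reads and the pipeline itself have to be declared in the project's `settings.py`. A minimal sketch, assuming a project module named `poetry_crawler` (the module path, database name, and collection name below are placeholders, not from the original post):

```python
# settings.py (sketch; all values here are placeholders)
MONGODB_SERVER = 'localhost'
MONGODB_PORT = 27017
MONGODB_DB = 'poetry'
MONGODB_COLLECTION = 'poems'

# Scrapy versions contemporary with this snippet accepted a plain list here;
# newer releases expect a dict mapping pipeline paths to order values.
ITEM_PIPELINES = ['poetry_crawler.pipelines.MongoDBPipeline']
```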
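Note that `pymongo.Connection`, `scrapy.conf.settings`, and `scrapy.log` have all since been removed from PyMongo and Scrapy. Below is a sketch of how the same pipeline might look against current releases, using `pymongo.MongoClient`, the `from_crawler` hook, and the per-spider logger; the setting names are the same placeholders as above, so adjust to your project:

```python
import pymongo
from scrapy.exceptions import DropItem


class MongoDBPipeline:
    """Same behaviour as the snippet above, rewritten for current APIs."""

    def __init__(self, server, port, db, col):
        self.server = server
        self.port = port
        self.db = db
        self.col = col

    @classmethod
    def from_crawler(cls, crawler):
        # Settings are reached through the crawler object, which replaced
        # the old global scrapy.conf.settings
        s = crawler.settings
        return cls(
            server=s.get('MONGODB_SERVER', 'localhost'),
            port=s.getint('MONGODB_PORT', 27017),
            db=s.get('MONGODB_DB'),
            col=s.get('MONGODB_COLLECTION'),
        )

    def open_spider(self, spider):
        # MongoClient replaced the removed pymongo.Connection
        self.client = pymongo.MongoClient(self.server, self.port)
        self.collection = self.client[self.db][self.col]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # Drop the item if any of its fields is empty
        err_msg = ''
        for field, data in item.items():
            if not data:
                err_msg += 'Missing %s of poem from %s\n' % (field, item['url'])
        if err_msg:
            raise DropItem(err_msg)
        # insert_one replaced the removed Collection.insert
        self.collection.insert_one(dict(item))
        spider.logger.debug('Item written to MongoDB database %s/%s',
                            self.db, self.col)
        return item
```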
