Posted By

diggernaut on 12/07/17


Tagged

data etl Ecommerce scraping webscraping diggernaut bedbathandbeyond



Scraping bedbathandbeyond.com with Diggernaut


Published in: Other
 

URL: https://www.diggernaut.com

This config can be used with the Diggernaut service to scrape bedbathandbeyond.com and retrieve product information. Attention: you will need to use your own proxies for this digger, as Diggernaut's basic proxies don't work with bedbathandbeyond.com.

  1. Create a free account at diggernaut.com.
  2. Log in to your account.
  3. Create a project with any name and description you like.
  4. Open your new project by clicking it, then create a new digger with any name.
  5. You will be offered 3 options; choose the one that uses the meta-language.
  6. The config editor will open. Paste the config code below and click the Save button.
  7. Change the digger mode from Debug to Active and run your digger.
  8. Wait for the run to complete.
  9. Download the data.
  10. Schedule your runs if required.

---
config:
    debug: 2
    agent: Firefox
    proxy: #USE YOUR PROXY HERE LIKE 1.1.1.1:8888
do:
- link_add:
    url:
    - https://www.bedbathandbeyond.com/__ssobj/static/giftsNavOutHol.json?v=8
    - https://www.bedbathandbeyond.com/__ssobj/static/personalizedgiftsNavOutHol.json?v=8
    - https://www.bedbathandbeyond.com/__ssobj/static/beddingNavOutHol.json?v=8
    - https://www.bedbathandbeyond.com/__ssobj/static/bathNavOutHol.json?v=8
    - https://www.bedbathandbeyond.com/__ssobj/static/kitchenNavOutHol.json?v=8
    - https://www.bedbathandbeyond.com/__ssobj/static/diningNavOutHol.json?v=8
    - https://www.bedbathandbeyond.com/__ssobj/static/homedecorNavOutHol.json?v=8
    - https://www.bedbathandbeyond.com/__ssobj/static/furnitureNavOutHol.json?v=8
    - https://www.bedbathandbeyond.com/__ssobj/static/storagecleaningNavOutHol.json?v=8
    - https://www.bedbathandbeyond.com/__ssobj/static/outdoorNavOutHol.json?v=8
    - https://www.bedbathandbeyond.com/__ssobj/static/babykidsNavOutHol.json?v=8
    - https://www.bedbathandbeyond.com/__ssobj/static/healthbeautyNavOutHol.json?v=8
    - https://www.bedbathandbeyond.com/__ssobj/static/moreNavOutHol.json?v=8
    - https://www.bedbathandbeyond.com/__ssobj/static/shopsNavOutHol.json?v=8
- walk:
    to: links
    do:
    - find:
        path: l2url,l3url
        do:
        - parse:
            filter: ^([^\?]+)
        - trim
        - if:
            match: \w+
            do:
            - normalize:
                routine: url
            - link_add:
                pool: catalog
- walk:
    to: links
    pool: catalog
    do:
    - find:
        path: "span#ctl00_InvalidRequest"
        do:
        - proxy_switch
        - page_reload
    - find:
        path: li.lnkNextPage>a
        do:
        - parse:
            attr: href
        - trim
        - if:
            match: \w+
            do:
            - normalize:
                routine: url
            - link_add:
                pool: catalog
    - find:
        path: a.prodImg
        do:
        - parse:
            attr: href
            filter: ^([^\?]+)
        - trim
        - if:
            match: \w+
            do:
            - normalize:
                routine: url
            - link_add:
                pool: pages
- walk:
    to: links
    pool: pages
    mode: unique
    do:
    - find:
        path: "span#ctl00_InvalidRequest"
        do:
        - proxy_switch
        - page_reload
    - find:
        path: "span#ctl00_InvalidRequest"
        do:
        - exit
    - find:
        path: "div#content"
        do:
        - variable_clear: pid
        - variable_set:
            field: brand
            value: BedBathAndBeyond
        - object_new: product
        - eval:
            routine: js
            body: '(function (){var d = new Date(); return d.toISOString()})();'
        - object_field_set:
            object: product
            field: date
        - static_get: url
        - object_field_set:
            object: product
            field: url
        - find:
            path: 'h1#productTitle'
            do:
            - parse
            - space_dedupe
            - trim
            - object_field_set:
                object: product
                field: name
        - find:
            path: div[itemprop="brand"] span[itemprop="name"]
            do:
            - parse
            - space_dedupe
            - trim
            - variable_set: brand
        - variable_get: brand
        - object_field_set:
            object: product
            field: brand
        - find:
            path: p.prodSKU
            slice: 0
            do:
            - parse:
                filter: (\d+)
            - space_dedupe
            - trim
            - object_field_set:
                object: product
                field: sku
        - find:
            path: span[itemprop="priceCurrency"]
            do:
            - parse
            - normalize:
                routine: replace_matched
                args:
                    \$: USD
            - object_field_set:
                object: product
                field: currency
        - find:
            path: span[itemprop="price"],span[itemprop="lowPrice"]
            do:
            - parse:
                filter:
                - ([0-9\.]+)\s*-
                - ([0-9\.]+)
            - object_field_set:
                object: product
                type: float
                field: price
        - find:
            path: li.colorSwatchLi
            do:
            - parse:
                attr: data-attr
            - space_dedupe
            - trim
            - if:
                match: \w+
                do:
                - object_field_set:
                    object: product
                    joinby: "|"
                    field: variations
            - parse:
                attr: data-imgurlthumb
                filter: ^([^\?]+)
            - space_dedupe
            - trim
            - if:
                match: \w+
                do:
                - normalize:
                    routine: url
                - variable_set: iurl
                - register_set: <%iurl%>?scl=1
                - object_field_set:
                    object: product
                    joinby: "|"
                    field: images
        - find:
            path: div[itemprop="description"]
            do:
            - parse
            - space_dedupe
            - trim
            - if:
                match: \w+
                do:
                - object_field_set:
                    object: product
                    field: description
        - find:
            path: div.breadcrumbs>div.alpha>a
            slice: 1:-1
            do:
            - parse
            - space_dedupe
            - trim
            - if:
                match: \w+
                do:
                - object_field_set:
                    object: product
                    joinby: "|"
                    field: category
        - find:
            path: 'img#mainProductImg'
            do:
            - parse:
                attr: src
                filter: ^([^\?]+)
            - if:
                match: \w+
                do:
                - normalize:
                    routine: url
                - variable_set: iurl
                - register_set: <%iurl%>?scl=1
                - object_field_set:
                    object: product
                    joinby: "|"
                    field: images
        - find:
            path: 'div#s7ProductImageWrapper'
            do:
            - parse:
                attr: data-s7imageid
            - if:
                match: \d+
                do:
                - variable_set: iid
                - walk:
                    to: https://s7d9.scene7.com/is/image/BedBathandBeyond/<%iid%>_is?req=set,json,UTF-8
                    do:
                    - find:
                        path: script
                        do:
                        - parse:
                            filter: s7jsonResponse\((.+)\,\&quot;\&quot;\)\;
                        - normalize:
                            routine: unescape_html
                        - normalize:
                            routine: json2xml
                        - to_block
                        - find:
                            path: item>i>n
                            do:
                            - parse
                            - if:
                                match: \d+
                                do:
                                - variable_set: iurl
                                - register_set: https://s7d9.scene7.com/is/image/<%iurl%>?scl=1
                                - object_field_set:
                                    object: product
                                    joinby: "|"
                                    field: images
        - object_save:
            name: product
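
If you want to check what the regular expressions in the config capture before running the digger, here is a rough Python sketch that applies the same patterns outside Diggernaut. The function names (`strip_query`, `first_price`, `split_joined`) and the sample values are made up for illustration and are not part of Diggernaut. It also assumes that a list of filters is tried in order and the first match wins, and that fields written with `joinby: "|"` arrive in the downloaded data as a single pipe-separated string.

```python
import re

# Same patterns as the `filter` options in the config above.
STRIP_QUERY = re.compile(r"^([^\?]+)")
PRICE_FILTERS = [
    re.compile(r"([0-9\.]+)\s*-"),  # price range: capture the number before the dash
    re.compile(r"([0-9\.]+)"),      # single price: capture the first number
]


def strip_query(url):
    """Return everything before the first '?' (hypothetical helper)."""
    match = STRIP_QUERY.search(url)
    return match.group(1) if match else ""


def first_price(text):
    """Try each price pattern in order and return the first usable capture.

    Assumption: this mirrors how a list of filters is evaluated.
    """
    for pattern in PRICE_FILTERS:
        match = pattern.search(text)
        if match:
            try:
                return float(match.group(1))
            except ValueError:
                # The pattern can also match a lone '.', which is not a number.
                continue
    return None


def split_joined(value, sep="|"):
    """Split a pipe-joined field (images, variations, category) into a list."""
    return [part for part in value.split(sep) if part]


if __name__ == "__main__":
    # Sample values invented for illustration only.
    print(strip_query("https://www.bedbathandbeyond.com/store/product/x/123?categoryId=1"))
    print(first_price("$19.99 - $49.99"))
    print(first_price("$24.99"))
    print(split_joined("Bedding|Sheets|Twin"))
```

With a price range, the first pattern captures the lower bound, so the `price` field would hold the lowest price of the range under the assumption above.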
