Posted By

diggernaut on 12/08/17


Tagged

data etl Ecommerce scraping webscraping diggernaut bloomingdales


Scraping bloomingdales.com with Diggernaut


 / Published in: Other
 

URL: https://www.diggernaut.com

This config can be used with the Diggernaut service to scrape bloomingdales.com and retrieve product information. Attention: you will need to use your own list of proxies for this digger, as Diggernaut's basic proxies don't work with bloomingdales.com.

  1. Create a free account at diggernaut.com.
  2. Log in to your account.
  3. Create a project with any name and description you want.
  4. Open your new project by clicking it and create a new digger with any name.
  5. You will be offered 3 options; choose the one that uses the meta-language.
  6. The config editor will open. Copy and paste the config code and click the Save button.
  7. Change the digger mode from Debug to Active and run your digger.
  8. Wait for completion.
  9. Download the data.
  10. Schedule your runs if required.
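
Once the data is downloaded (step 9), a little post-processing may be useful. The config on this page stores category, images and variations with joinby: "|", so each of those fields should arrive as one pipe-separated string. The Python sketch below splits them back into lists. It assumes the dataset was downloaded as a JSON array of flat product records, which is an assumption about the export and not something this snippet states; the function names and the sample record are made up for illustration.

```python
import json

# Fields the config fills with joinby: "|"
MULTI_VALUE_FIELDS = ("category", "images", "variations")


def split_multi_value(record):
    """Return a copy of one product record with pipe-joined fields split into lists."""
    out = dict(record)
    for field in MULTI_VALUE_FIELDS:
        value = out.get(field)
        if isinstance(value, str):
            # Drop empty parts so an empty string becomes an empty list
            out[field] = [part for part in value.split("|") if part]
    return out


def load_products(text):
    """Parse a JSON array of product records (assumed export shape)."""
    return [split_multi_value(record) for record in json.loads(text)]


# Made-up sample record, using the field names set in the config
sample = json.dumps([{
    "sku": "123",
    "name": "Example product",
    "category": "women|dresses",
    "images": "https://images.bloomingdales.com/is/image/BLM/products/1/a.tif"
              "|https://images.bloomingdales.com/is/image/BLM/products/1/b.tif",
    "price": 98.0,
    "currency": "USD",
}])

products = load_products(sample)
print(products[0]["category"])  # ['women', 'dresses']
```

Fields that are missing from a record are left untouched, so records without variations or images pass through unchanged.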

---
config:
    debug: 2
    agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36
    proxy: #put your proxies list here

do:
# Open the site index and go through the category links in the flyout menus
- walk:
    to: https://www.bloomingdales.com/index
    headers:
        accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8
        accept-language: ru-RU,ru;q=0.9,en-US;q=0.8,en;q=0.7
        cache-control: no-cache
    do:
    - find:
        path: '#globalFlyouts a'
        do:
        - pool_clear: main
        # Extract the category id from the link href
        - parse:
            attr: href
            filter:
            - \?id=(\d+)
        - variable_set: pur
        - variable_set:
            field: first
            value: 1
        - if:
            match: (\d)
            do:
            # Queue the first page of the category listing API
            - register_set: https://www.bloomingdales.com/api/navigation/categories/facet?categoryId=<%pur%>&facet=false&pageIndex=1&bcomNavPPP=undefine
            - link_add:
                pool: main
            - walk:
                to: links
                pool: main
                headers:
                    accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8
                    accept-language: ru-RU,ru;q=0.9,en-US;q=0.8,en;q=0.7
                    cache-control: no-cache
                do:
                # Queue a product page link for every product id in the response
                - find:
                    path: productids
                    do:
                    - parse
                    - if:
                        match: (\d)
                        do:
                        - register_set: https://www.bloomingdales.com/shop/product/?ID=<%register%>&CategoryID=<%pur%>
                        - link_add:
                            pool: sub
                # On the first page only, queue the remaining listing pages
                - find:
                    path: productcount
                    do:
                    - parse
                    - if:
                        match: (\d)
                        do:
                        - variable_set: count
                        - variable_get: first
                        - if:
                            match: (\d+)
                            do:
                            - variable_clear: first
                            - eval:
                                routine: js
                                body: (function () {var pages = []; for (var i=2; i*90 <= <%count%>; i++) {pages.push(i)}; return pages.join(",");})();
                            - to_block
                            - split:
                                context: text
                                delimiter: ","
                            - find:
                                path: .splitted
                                do:
                                - parse
                                - register_set: https://www.bloomingdales.com/api/navigation/categories/facet?categoryId=<%pur%>&facet=false&pageIndex=<%register%>&bcomNavPPP=undefine
                                - link_add:
                                    pool: main
# Walk the product pages collected in the sub pool
- walk:
    to: links
    headers:
        accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8
        accept-language: ru-RU,ru;q=0.9,en-US;q=0.8,en;q=0.7
        cache-control: no-cache
    pool: sub
    do:
    - proxy_switch
    - cookie_reset
    - variable_clear: allli
    - variable_clear: descr
    - variable_clear: n
    - object_new: product
    - find:
        in: doc
        path: head
        do:
        # Scrape timestamp
        - eval:
            routine: js
            body: '(function (){var d = new Date(); return d.toISOString()})();'
        - object_field_set:
            object: product
            field: date
        # Keep the page URL up to and including the ID parameter
        - static_get: url
        - filter:
            args:
            - (.+\?[idID]+=\d+)\&
        - object_field_set:
            object: product
            field: url
        # Default brand value
        - register_set: Bloomingdale
        - object_field_set:
            object: product
            field: brand
    - find:
        path: '#productId'
        do:
        - parse:
            attr: value
        - if:
            match: (\d)
            do:
            - object_field_set:
                object: product
                field: sku
    - find:
        path: '#brandNameLink'
        do:
        - parse
        - space_dedupe
        - trim
        - object_field_set:
            object: product
            field: brand
    # Variable n ensures only the first matching element sets the name
    - find:
        path: '#productName, #productTitle'
        do:
        - variable_get: n
        - if:
            match: (\d)
            else:
            - parse
            - space_dedupe
            - trim
            - object_field_set:
                object: product
                field: name
            - variable_set:
                field: n
                value: 1
    - find:
        path: .selectedFOB
        do:
        - parse
        - space_dedupe
        - trim
        - normalize:
            routine: lower
        - object_field_set:
            object: product
            field: category
            joinby: "|"
    # Convert the embedded product JSON to XML and read the remaining fields from it
    - find:
        path: 'script#pdp_data'
        do:
        - parse
        - normalize:
            routine: json2xml
        - to_block
        - find:
            path: colorwayadditionalimages > *, colorwayprimaryimages > *, additionalimages, imagesource
            do:
            - parse
            - split:
                context: text
                delimiter: ','
            - find:
                path: .splitted
                do:
                - parse
                - if:
                    match: (\S)
                    do:
                    - register_set: https://images.bloomingdales.com/is/image/BLM/products/<%register%>
                    - object_field_set:
                        object: product
                        field: images
                        joinby: "|"
        - find:
            path: colorfamily > *
            do:
            - parse
            - if:
                match: (\S)
                do:
                - object_field_set:
                    object: product
                    field: variations
                    joinby: "|"
        - find:
            path: product > seokeywords
            slice: 0:-2
            do:
            - parse
            - space_dedupe
            - trim
            - normalize:
                routine: lower
            - if:
                match: (\S)
                do:
                - object_field_set:
                    object: product
                    field: category
                    joinby: "|"
        - find:
            path: longdescription
            do:
            - parse
            - space_dedupe
            - trim
            - if:
                match: (\S)
                do:
                - object_field_set:
                    object: product
                    field: description
        - find:
            path: product > price
            do:
            - parse
            - space_dedupe
            - trim
            - if:
                match: (\d)
                do:
                - object_field_set:
                    object: product
                    field: price
                    type: float
    - register_set: USD
    - object_field_set:
        object: product
        field: currency
    - object_save:
        name: product
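
To make the category paging in the config easier to follow, here is a rough Python sketch of the same logic: take the category id from a menu link with the \?id=(\d+) filter, request page 1 of the facet API, then queue further pages with the loop from the eval step. The function names and the sample href are hypothetical; the page size of 90 is simply the constant the config uses.

```python
import re

PAGE_SIZE = 90  # constant used by the config's eval step

# Listing API URL exactly as it appears in the config
FACET_URL = (
    "https://www.bloomingdales.com/api/navigation/categories/facet"
    "?categoryId={cat}&facet=false&pageIndex={page}&bcomNavPPP=undefine"
)


def category_id(href):
    # Same pattern as the config's parse filter; returns None when there is no id
    match = re.search(r"\?id=(\d+)", href)
    return match.group(1) if match else None


def extra_pages(count, page_size=PAGE_SIZE):
    # Mirrors the eval step: page indexes i >= 2 while i * page_size <= count
    pages = []
    i = 2
    while i * page_size <= count:
        pages.append(i)
        i += 1
    return pages


def category_page_urls(href, count):
    # Page 1 is always queued; further pages depend on the product count
    cat = category_id(href)
    if cat is None:
        return []
    return [FACET_URL.format(cat=cat, page=page) for page in [1] + extra_pages(count)]


# Hypothetical menu link and product count
for url in category_page_urls("/shop/womens-apparel?id=2910", 270):
    print(url)
```

As in the config, page i is only queued when i*90 <= count, so a count of 200 yields just page 2. If the API really returns 90 items per page, a final partial page would be missed; using math.ceil(count / 90) as the last page index would be one way to cover it.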
