Posted By

diggernaut on 12/08/17


Tagged

data etl Ecommerce scraping webscraping Burton diggernaut



Scraping burton.com with Diggernaut


Published in: Other
 

URL: https://www.diggernaut.com

This config can be used with the Diggernaut service to scrape burton.com and retrieve product information.

  1. Create a free account at diggernaut.com.
  2. Log in to your account.
  3. Create a project with any name and description you want.
  4. Open your new project by clicking it, and create a new digger with any name.
  5. You will then be offered 3 options; choose the one that uses the meta-language.
  6. The config editor will open. Copy and paste the config code below into it and click the save button.
  7. Change the digger mode from Debug to Active and run your digger.
  8. Wait for completion.
  9. Download the data.
  10. Schedule your runs if required.

---
config:
    debug: 2
    agent: Firefox

do:
- link_add:
    pool: main
    url:
    - https://www.burton.com
- walk:
    to: links
    pool: main
    do:
    - variable_clear: ur
    - find:
        path: .category-title.secondary, .category-title.tertiary
        do:
        - parse:
            attr: href
        - variable_set: ur
        - register_set: <%ur%>?sz=999&concise=true&format=ajax
        - pool_clear: cat
        - link_add:
            pool: cat
        - walk:
            to: value
            headers:
                x-requested-with: XMLHttpRequest
            do:
            - find:
                path: body_safe hit > href
                do:
                - parse
                - normalize:
                    routine: url
                - link_add:
                    pool: sub

- walk:
    to: links
    pool: sub
    do:
    - find:
        path: .cps-app-page-body
        do:
        - variable_clear: allli
        - variable_clear: sdecr
        - variable_clear: li
        - variable_clear: ean
        - variable_clear: vid
        - variable_clear: id
        - variable_clear: stp
        - object_new: product
        - eval:
            routine: js
            body: '(function (){var d = new Date(); return d.toISOString()})();'
        - object_field_set:
            object: product
            field: date
        - static_get: url
        - object_field_set:
            object: product
            field: url
        - register_set: Burton
        - object_field_set:
            object: product
            field: brand
        - find:
            in: doc
            path: script:matches(window\.__bootstrap)
            do:
            - parse
            - trim
            - space_dedupe
            - eval:
                routine: js
                body: (function () {var window = {}; <%register%> return JSON.stringify(window.__bootstrap.data)})();
            - normalize:
                routine: json2xml
            - to_block
            - find:
                path: body_safe
                do:
                - node_remove: recommendations
                - find:
                    path: breadcrumbs
                    do:
                    - parse
                    - normalize:
                        routine: url
                    - walk:
                        to: value
                        do:
                        - find:
                            path: displayname
                            slice: 0:-2
                            do:
                            - parse
                            - space_dedupe
                            - trim
                            - object_field_set:
                                object: product
                                field: category
                                joinby: "|"
                - find:
                    path: variationcolor > displayvalue
                    do:
                    - parse
                    - trim
                    - space_dedupe
                    - if:
                        match: (\S)
                        do:
                        - object_field_set:
                            object: product
                            field: variations
                            joinby: "|"
                - find:
                    path: masterid
                    do:
                    - parse
                    - trim
                    - space_dedupe
                    - object_field_set:
                        object: product
                        field: sku
                    - variable_set: id
                    - walk:
                        to: https://www.burton.com/on/demandware.store/Sites-Burton_NA-Site/en_US/Product-GetTechFeaturesJSON?pids=<%id%>
                        do:
                        - find:
                            path: longdescription
                            do:
                            - parse
                            - space_dedupe
                            - trim
                            - if:
                                match: \S
                                do:
                                - object_field_set:
                                    object: product
                                    field: description
                    - walk:
                        to: https://www.burton.com/on/demandware.store/Sites-Burton_NA-Site/en_US/Product-GetVariationJSON?pid=<%id%>&pricing
                        do:
                        - find:
                            path: variationcolor displayname
                            do:
                            - parse
                            - space_dedupe
                            - trim
                            - if:
                                match: (\S)
                                do:
                                - object_field_set:
                                    object: product
                                    field: variations
                                    joinby: "|"
                        - find:
                            path: pricing
                            do:
                            - variable_clear: isSale
                            - find:
                                path: isonsale
                                do:
                                - parse
                                - if:
                                    match: 'true'
                                    do:
                                    - variable_set: isSale
                            - find:
                                path: saleprice
                                do:
                                - variable_get: isSale
                                - if:
                                    match: \S
                                    do:
                                    - parse:
                                        filter:
                                        - ^\s*\$(\d+\.?\d*)\s*-
                                        - ^\s*\$(\d+\.?\d*)
                                    - if:
                                        match: \d
                                        do:
                                        - object_field_set:
                                            object: product
                                            field: price
                                            type: float
                                        - register_set: USD
                                        - object_field_set:
                                            object: product
                                            field: currency
                            - find:
                                path: standardprice
                                do:
                                - variable_get: isSale
                                - if:
                                    match: \S
                                    else:
                                    - parse:
                                        filter:
                                        - ^\s*\$(\d+\.?\d*)\s*-
                                        - ^\s*\$(\d+\.?\d*)
                                    - if:
                                        match: \d
                                        do:
                                        - object_field_set:
                                            object: product
                                            field: price
                                            type: float
                                        - register_set: USD
                                        - object_field_set:
                                            object: product
                                            field: currency
                        - find:
                            path: variationimagedata xl > x1, imagedata xl > x1, images xl > x1
                            do:
                            - parse
                            - trim
                            - space_dedupe
                            - filter:
                                args:
                                - (.+\.png)\?
                            - object_field_set:
                                object: product
                                field: images
                                joinby: "|"

                - find:
                    path: name
                    slice: 0
                    do:
                    - parse
                    - trim
                    - space_dedupe
                    - object_field_set:
                        object: product
                        field: name
        - object_save:
            name: product
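
The price handling in the config can be easier to follow outside the meta-language. The sketch below is not Diggernaut code; it is a rough Python equivalent of the `saleprice` / `standardprice` branches, reusing the two regular expressions from the config. It assumes the filters are tried in order and the first one that matches wins, which is a guess at the service's behavior. The function names and the sample price strings are hypothetical.

```python
import re

# Patterns copied from the config's `parse: filter:` list.
# Assumption: they are tried in order and the first match is used.
PRICE_FILTERS = [
    r"^\s*\$(\d+\.?\d*)\s*-",  # lower bound of a range, e.g. "$59.95 - $79.95"
    r"^\s*\$(\d+\.?\d*)",      # single price, e.g. "$429.95"
]


def extract_price(text):
    """Return the first captured dollar amount as a float, or None if nothing matches."""
    for pattern in PRICE_FILTERS:
        match = re.search(pattern, text)
        if match:
            return float(match.group(1))
    return None


def pick_price(is_on_sale, sale_price, standard_price):
    """Mirror the config: use the sale price when the item is on sale, else the standard price."""
    raw = sale_price if is_on_sale else standard_price
    return extract_price(raw)


if __name__ == "__main__":
    # Hypothetical sample values, not real burton.com data.
    print(pick_price(True, "$59.95 - $79.95", "$99.95"))
    print(pick_price(False, "", " $429.95"))
```

In the config, the `isSale` variable plays the role of `is_on_sale`: it is only set when `isonsale` parses to `true`, so the `saleprice` branch runs on a match and the `standardprice` branch runs in the `else` case.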
