Posted by diggernaut on 12/07/17

Tagged: data, etl, Ecommerce, scraping, webscraping, diggernaut, benmeadows


Scraping benmeadows.com with Diggernaut


Published in: Other

URL: https://www.diggernaut.com

This config can be used with the Diggernaut service to scrape benmeadows.com and retrieve product information.

  1. Create a free account at diggernaut.com.
  2. Log in to your account.
  3. Create a project with any name and description you like.
  4. Open the new project by clicking it, then create a new digger with any name.
  5. Of the three options offered, choose the one that uses the meta-language.
  6. The config editor opens. Copy and paste the config code below and click the Save button.
  7. Change the digger mode from Debug to Active and run the digger.
  8. Wait for the run to complete.
  9. Download the data.
  10. Schedule recurring runs if required.

---
config:
    debug: 2
    agent: Firefox
do:
# Collect category links from the top navigation menu
- walk:
    to: http://www.benmeadows.com
    do:
    - find:
        path: 'ul#topnav>li:has(a#productCategories)>div.subMenu a'
        do:
        - parse:
            attr: href
        - space_dedupe
        - trim
        - if:
            match: \w+
            do:
            - normalize:
                routine: url
            - link_add:
                pool: catalog
# Walk catalog pages: follow pagination and subcategories, queue product pages
- walk:
    to: links
    pool: catalog
    do:
    - find:
        path: .viewPaginationNext
        do:
        - parse:
            attr: href
        - if:
            match: \w+
            do:
            - normalize:
                routine: url
            - link_add:
                pool: catalog
    - find:
        path: 'a#hlNavigation'
        do:
        - parse:
            attr: href
        - space_dedupe
        - trim
        - if:
            match: \w+
            do:
            - normalize:
                routine: url
            - link_add:
                pool: catalog
    - find:
        path: 'a#hladd'
        do:
        - parse:
            attr: href
        - space_dedupe
        - trim
        - if:
            match: \w+
            do:
            - normalize:
                routine: url
            - link_add:
                pool: pages
# Walk product pages and build one product object per page
- walk:
    to: links
    pool: pages
    do:
    - sleep: 2
    - find:
        path: 'div#prodWrap'
        do:
        - object_new: product
        - eval:
            routine: js
            body: '(function (){var d = new Date(); return d.toISOString()})();'
        - object_field_set:
            object: product
            field: date
        - static_get: url
        - object_field_set:
            object: product
            field: url
        - find:
            path: meta[itemprop="identifier"]
            do:
            - parse:
                attr: content
            - space_dedupe
            - trim
            - if:
                match: \d+
                do:
                - object_field_set:
                    object: product
                    field: sku
        - find:
            path: 'span#lblGroupTitle'
            do:
            - parse
            - space_dedupe
            - trim
            - object_field_set:
                object: product
                field: name
        - find:
            path: 'a#imgLink'
            do:
            - parse:
                attr: href
            - space_dedupe
            - trim
            - if:
                match: \w+
                do:
                - normalize:
                    routine: url
                - object_field_set:
                    object: product
                    joinby: "|"
                    field: images
        # Product data embedded as JSON in an inline script
        - find:
            path: script:contains('loadProductPageDropDowns')
            do:
            - parse:
                filter: loadProductPageDropDowns\((.+)\)\;\$\('\#txtHeaderSearch'\)\.focus\(\)\;\}\)\;
            - normalize:
                routine: json2xml
            - to_block
            - find:
                path: body_safe>groupname
                do:
                - parse
                - space_dedupe
                - trim
                - object_field_set:
                    object: product
                    field: name
            - find:
                path: body_safe>groupid
                do:
                - parse
                - space_dedupe
                - trim
                - if:
                    match: \d+
                    do:
                    - object_field_set:
                        object: product
                        field: sku
            - find:
                path: largeimage,secimages>large
                do:
                - parse
                - space_dedupe
                - trim
                - if:
                    match: \w+
                    do:
                    - normalize:
                        routine: url
                    - object_field_set:
                        object: product
                        joinby: "|"
                        field: images
            - find:
                path: properties>children
                do:
                - variable_clear: sort
                - variable_clear: value
                - find:
                    path: sort
                    do:
                    - parse
                    - space_dedupe
                    - trim
                    - variable_set: sort
                - find:
                    path: value
                    do:
                    - parse
                    - space_dedupe
                    - trim
                    - variable_set: value
                - variable_get: sort
                - if:
                    match: \w+
                    do:
                    - variable_get: value
                    - if:
                        match: \w+
                        do:
                        - register_set: "<%sort%>: <%value%>"
                        - object_field_set:
                            object: product
                            joinby: "|"
                            field: variations
        # Default brand, overridden when the page declares one
        - register_set: Ben Meadows
        - variable_set: brand
        - find:
            path: 'img[itemprop="brand"]'
            do:
            - parse:
                attr: content
            - space_dedupe
            - trim
            - variable_set: brand
        - variable_get: brand
        - object_field_set:
            object: product
            field: brand
        - find:
            path: span.currentCrumb>a
            slice: 0:-2
            do:
            - parse
            - space_dedupe
            - trim
            - if:
                match: \w+
                do:
                - object_field_set:
                    object: product
                    joinby: "|"
                    field: category
        # Meta description first, replaced by the on-page description if present
        - find:
            in: doc
            path: meta[name="description"]
            do:
            - parse:
                attr: content
            - space_dedupe
            - trim
            - variable_set: desc
        - find:
            path: 'div#prodDetailedBenefit>div.proDesc'
            do:
            - parse
            - space_dedupe
            - trim
            - variable_set: desc
        - variable_get: desc
        - object_field_set:
            object: product
            field: description
        - find:
            path: meta[itemprop="price"],meta[itemprop="lowPrice"]
            do:
            - parse:
                attr: content
                filter: ([0-9\.\,]+)
            - normalize:
                routine: replace_substring
                args:
                    \,: ''
            - space_dedupe
            - trim
            - object_field_set:
                object: product
                type: float
                field: price
        - find:
            path: meta[itemprop="currency"]
            do:
            - parse:
                attr: content
            - object_field_set:
                object: product
                field: currency
        - object_save:
            name: product
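
The config stores the images, category and variations fields as single strings joined with "|", and each variation as "sort: value". Once the data is downloaded, those strings usually need to be split back into lists. Below is a minimal Python sketch of that post-processing step. It assumes the data was downloaded as JSON, with one object per product and the field names used in the config. The sample record and the helper names (split_field, parse_variations, normalize) are made up for illustration and are not part of Diggernaut.

```python
import json

# Hypothetical record shaped like the fields the config sets; all values are invented.
SAMPLE = """
[{"sku": "12345",
  "name": "Example item",
  "images": "http://www.benmeadows.com/a.jpg|http://www.benmeadows.com/b.jpg",
  "variations": "Size: Large|Color: Orange",
  "category": "Home|Safety",
  "price": 19.99,
  "currency": "USD"}]
"""


def split_field(value):
    """Split a '|'-joined field into a list, dropping empty parts."""
    if not value:
        return []
    return [part.strip() for part in value.split("|") if part.strip()]


def parse_variations(value):
    """Turn 'sort: value|sort: value' into a list of (sort, value) pairs."""
    pairs = []
    for item in split_field(value):
        sort, sep, val = item.partition(": ")
        if sep:  # skip items that lack the 'sort: value' separator
            pairs.append((sort, val))
    return pairs


def normalize(record):
    """Return a copy of the record with the joined fields expanded."""
    out = dict(record)
    out["images"] = split_field(record.get("images"))
    out["category"] = split_field(record.get("category"))
    out["variations"] = parse_variations(record.get("variations"))
    return out


products = [normalize(r) for r in json.loads(SAMPLE)]
```

A value that itself contains "|" would be split in the wrong place, so check the raw fields before relying on this for real data.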
