Posted By

diggernaut on 12/07/17


Tagged

data etl Ecommerce scraping webscraping diggernaut asos


Scraping asos.com with Diggernaut


 / Published in: Other
 

URL: https://www.diggernaut.com

This config can be used with the Diggernaut service to scrape asos.com and retrieve product information.

  1. Create a free account at diggernaut.com.
  2. Log in to your account.
  3. Create a project with any name and description you like.
  4. Open your new project by clicking it, then create a new digger with any name.
  5. You will be offered 3 options; choose the one that uses the meta-language.
  6. The config editor will open. Copy and paste the config code below into it and click the save button.
  7. Change the digger mode from Debug to Active and run your digger.
  8. Wait for the run to complete.
  9. Download the data.
  10. Schedule your runs if required.

---
config:
    debug: 2
    agent: Firefox
do:
- link_add:
    url:
    - http://us.asos.com/women/a-to-z-of-brands/cat/?cid=1340
    - http://us.asos.com/men/a-to-z-of-brands/cat/?cid=1361
- walk:
    to: links
    do:
    - pool_clear: main
    - find:
        path: .brands-list a
        do:
        - parse:
            attr: href
        - normalize:
            routine: replace_substring
            args:
                \s*$: '&pgesize=204'
        - link_add:
            pool: main
    - walk:
        to: links
        pool: main
        do:
        - find:
            path: li.next a
            do:
            - parse:
                attr: href
            - normalize:
                routine: url
            - link_add:
                pool: main
        - find:
            path: .product
            do:
            - parse:
                attr: href
            - link_add:
                pool: sub
    - walk:
        to: links
        pool: sub
        do:
        - variable_clear: isP
        - variable_clear: allli
        - variable_clear: sdescr
        - variable_clear: li
        - variable_clear: json
        - variable_clear: id
        - find:
            path: script:matches(Pages/FullProduct)
            do:
            - variable_set:
                field: isP
                value: 1
            - parse:
                filter:
                - view\(\'\s*(.+)\',
            - normalize:
                routine: replace_substring
                args:
                    \\\': ''
            - normalize:
                routine: unescape_html
            - variable_set: json
        - find:
            path: html
            do:
            - variable_get: isP
            - if:
                match: (1)
                do:
                - object_new: product
                - find:
                    path: head
                    do:
                    - eval:
                        routine: js
                        body: '(function (){var d = new Date(); return d.toISOString()})();'
                    - object_field_set:
                        object: product
                        field: date
                    - static_get: url
                    - object_field_set:
                        object: product
                        field: url
                - find:
                    path: '#breadcrumb li'
                    slice: 1:-2
                    do:
                    - parse
                    - space_dedupe
                    - trim
                    - normalize:
                        routine: replace_matched
                        args:
                            A\s*To\s*Z\s*Of\s*Brands: ''
                    - if:
                        match: (\S)
                        do:
                        - object_field_set:
                            object: product
                            field: category
                            joinby: "|"
                - find:
                    path: .product-code > span
                    do:
                    - parse
                    - space_dedupe
                    - trim
                    - if:
                        match: (\S)
                        do:
                        - object_field_set:
                            object: product
                            field: sku
                - find:
                    path: meta[name="description"]
                    do:
                    - parse:
                        attr: content
                    - space_dedupe
                    - trim
                    - if:
                        match: (\S)
                        do:
                        - object_field_set:
                            object: product
                            field: description
                - find:
                    path: body
                    do:
                    - variable_get: json
                    - normalize:
                        routine: replace_substring
                        args:
                            '\\\\': '\'
                    - normalize:
                        routine: json2xml
                    - to_block
                    - find:
                        path: brandname
                        do:
                        - parse
                        - space_dedupe
                        - trim
                        - if:
                            match: (\S)
                            do:
                            - object_field_set:
                                object: product
                                field: brand
                    - find:
                        path: body_safe > name
                        do:
                        - parse
                        - space_dedupe
                        - trim
                        - if:
                            match: (\S)
                            do:
                            - object_field_set:
                                object: product
                                field: name
                    - find:
                        path: images
                        do:
                        - find:
                            path: colour
                            do:
                            - parse
                            - space_dedupe
                            - trim
                            - if:
                                match: (\S)
                                do:
                                - object_field_set:
                                    object: product
                                    field: variations
                                    joinby: "|"
                        - find:
                            path: url
                            do:
                            - parse
                            - space_dedupe
                            - trim
                            - if:
                                match: (\S)
                                do:
                                - normalize:
                                    routine: replace_substring
                                    args:
                                        \s*$: ?scl=1
                                - object_field_set:
                                    object: product
                                    field: images
                                    joinby: "|"
                    - find:
                        path: price > current
                        do:
                        - parse:
                            filter:
                            - (\d+\.?\d*)
                        - if:
                            match: (\d)
                            do:
                            - object_field_set:
                                object: product
                                field: price
                                type: float
                            - register_set: USD
                            - object_field_set:
                                object: product
                                field: currency
                - object_save:
                    name: product
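
The regex-based steps in the config can be hard to read in isolation, so here is a rough Python sketch of what three of them appear to do: appending a suffix to a link (the `\s*$` rules), pulling a number out of the price text, and cleaning and joining the breadcrumb entries. This is only an illustration under assumptions: the function names are made up for this example, Python's `re` module may not behave exactly like Diggernaut's regex engine, and the sketch does not reproduce the `slice` option or any Diggernaut internals.

```python
import re


def append_suffix(href, suffix):
    # Mirrors the "\s*$" -> suffix rules: trailing whitespace (if any) is
    # replaced by the suffix. count=1 avoids a second, empty match at the
    # very end of the string in Python's re.sub.
    return re.sub(r"\s*$", suffix, href, count=1)


def parse_price(text):
    # Mirrors the "(\d+\.?\d*)" filter followed by "type: float".
    # Returns None when no digits are found (the config skips the field).
    match = re.search(r"(\d+\.?\d*)", text)
    return float(match.group(1)) if match else None


def join_categories(crumbs):
    # Mirrors space_dedupe, trim, the "A To Z Of Brands" removal, the
    # "(\S)" non-empty check and the joinby: "|" setting.
    cleaned = []
    for crumb in crumbs:
        text = re.sub(r"\s+", " ", crumb).strip()
        text = re.sub(r"A\s*To\s*Z\s*Of\s*Brands", "", text)
        if re.search(r"\S", text):
            cleaned.append(text)
    return "|".join(cleaned)


if __name__ == "__main__":
    listing = "http://us.asos.com/women/a-to-z-of-brands/cat/?cid=1340 "
    print(append_suffix(listing, "&pgesize=204"))
    print(parse_price("$45.00"))
    # The breadcrumb texts below are invented sample values.
    print(join_categories(["  Women ", "A To Z Of  Brands", "ASOS\n  Dresses"]))
```

Running the functions on a few sample strings before editing the config is a cheap way to check that a changed pattern still matches what you expect.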
