Posted By

diggernaut on 12/05/17


Tagged

scraping web adidas diggernaut etl Ecommerce



Scraping adidas.com with Diggernaut


 / Published in: Other
 

URL: https://www.diggernaut.com

This config can be used with the Diggernaut service to scrape adidas.com and retrieve product information. Attention: you will need to use your own proxy for this digger, as Diggernaut's basic proxies don't work with adidas.com.

  1. Create a free account at diggernaut.com.
  2. Log in to your account.
  3. Create a project with any name and description you want.
  4. Open your new project by clicking it and create a new digger with any name.
  5. You will be offered 3 options; choose the one that uses the meta-language.
  6. The config editor will open; copy and paste the config code below and click the save button.
  7. Run your digger.
  8. Wait for completion.
  9. Download the data.
  10. Schedule your runs if required.

    ---
    config:
        proxy: #USE YOUR OWN PROXY HERE LIKE 1.1.1.1:8888
        agent: Chrome
        debug: 2
    do:
    - pool_clear: catalog
    - pool_clear: pages
    # Stage 1: collect category links from the home page into the "catalog" pool
    - walk:
        to: http://www.adidas.com/us/
        do:
        - find:
            path: div.contentasset
            slice: 0:2
            do:
            - find:
                path: ul>li>a
                do:
                - parse:
                    attr: href
                - trim
                - if:
                    match: \w+
                    do:
                    - normalize:
                        routine: url
                    - link_add:
                        pool: catalog
    # Stage 2: walk catalog pages, collect product links into the "pages" pool
    - walk:
        to: links
        pool: catalog
        do:
        - sleep: 3
        - find:
            path: a.product-link:not(.design-starter-click)
            do:
            - parse:
                attr: href
            - trim
            - if:
                match: \w+
                do:
                - normalize:
                    routine: url
                - link_add:
                    pool: pages
        # Follow pagination by adding the next page back to the "catalog" pool
        - find:
            path: a.pagging-next-page
            slice: 0
            do:
            - parse:
                attr: href
            - trim
            - if:
                match: \w+
                do:
                - normalize:
                    routine: url
                - link_add:
                    pool: catalog
        - cookie_reset
    # Stage 3: walk product pages and build one "product" object per page
    - walk:
        to: links
        pool: pages
        do:
        - sleep: 3
        - find:
            path: head
            do:
            - variable_clear: pid
            - object_new: product
            - eval:
                routine: js
                body: '(function (){var d = new Date(); return d.toISOString()})();'
            - object_field_set:
                object: product
                field: date
            - static_get: url
            - object_field_set:
                object: product
                field: url
            # Extract the product ID from the page URL, then query the product API
            - filter:
                args: \/([A-Z0-9]+)\.html$
            - if:
                match: \w+
                do:
                - walk:
                    to: https://www.adidas.com/api/products/<%register%>?sitePath=us
                    do:
                    - find:
                        path: body_safe>id
                        do:
                        - parse
                        - space_dedupe
                        - trim
                        - object_field_set:
                            object: product
                            field: sku
                        - variable_set: pid
                        - register_set: Adidas
                        - object_field_set:
                            object: product
                            field: brand
                    - find:
                        path: body_safe>name
                        do:
                        - parse
                        - space_dedupe
                        - trim
                        - object_field_set:
                            object: product
                            field: name
                    - find:
                        path: body_safe>product_description>text
                        do:
                        - parse
                        - space_dedupe
                        - trim
                        - object_field_set:
                            object: product
                            field: description
                    - find:
                        path: body_safe>breadcrumb_list>text
                        do:
                        - parse
                        - space_dedupe
                        - trim
                        - if:
                            match: \w+
                            do:
                            - object_field_set:
                                object: product
                                field: category
                                joinby: "|"
                    - find:
                        path: body_safe>view_list>image_url
                        do:
                        - parse
                        - if:
                            match: \w+
                            do:
                            - normalize:
                                routine: url
                            - object_field_set:
                                object: product
                                field: images
                                joinby: "|"
                    - find:
                        path: body_safe>attribute_list>color,body_safe>product_link_list>default_color
                        do:
                        - parse
                        - space_dedupe
                        - trim
                        - if:
                            match: \w+
                            do:
                            - object_field_set:
                                object: product
                                field: variations
                                joinby: "|"
                    - find:
                        path: body_safe>pricing_information>currentprice
                        do:
                        - parse
                        - object_field_set:
                            object: product
                            type: float
                            field: price
                        - register_set: USD
                        - object_field_set:
                            object: product
                            field: currency
                - object_save:
                    name: product
        - cookie_reset
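
In the product-page stage, the config pulls a product ID out of the page URL with the regular expression `\/([A-Z0-9]+)\.html$` and then requests `https://www.adidas.com/api/products/<%register%>?sitePath=us`. The Python sketch below mirrors only that one step, to show what the pattern matches. It is an illustration, not part of the Diggernaut config: the function names and the sample URL are hypothetical, and it assumes the captured group is what ends up in the register.

```python
import re

# Same pattern as the config's `filter` step: the ID is the last path
# segment before ".html", made of uppercase letters and digits.
PRODUCT_ID_RE = re.compile(r"/([A-Z0-9]+)\.html$")

# URL template taken from the config's inner `walk` step.
API_TEMPLATE = "https://www.adidas.com/api/products/{pid}?sitePath=us"


def extract_product_id(page_url):
    """Return the product ID from a product page URL, or None if absent."""
    match = PRODUCT_ID_RE.search(page_url)
    return match.group(1) if match else None


def build_api_url(page_url):
    """Return the product API URL for a page URL, or None if no ID is found."""
    pid = extract_product_id(page_url)
    return API_TEMPLATE.format(pid=pid) if pid else None


if __name__ == "__main__":
    # Hypothetical product URL, for illustration only.
    print(build_api_url("https://www.adidas.com/us/example-shoes/AB1234.html"))
```

Because the pattern is anchored with `$`, a URL that carries a query string after `.html` would not match, and in the config the `if: match: \w+` check would then skip the API request for that page.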
