Posted By

diggernaut on 12/05/17


Tagged

data etl Ecommerce scraping webscraping diggernaut adidashardware


Scraping adidashardware.com with Diggernaut


Published in: Other
 

URL: https://www.diggernaut.com

This config can be used with the Diggernaut service to scrape product information from adidashardware.com.

  1. Create a free account at diggernaut.com.
  2. Log in to your account.
  3. Create a project with any name and description you want.
  4. Open your new project by clicking it and create a new digger with any name.
  5. You will see three options; choose the one that uses the meta-language.
  6. The config editor will open. Copy and paste the config code below and click the save button.
  7. Run your digger.
  8. Wait for it to complete.
  9. Download the data.
  10. Schedule your runs if required.

---
config:
    debug: 2
    agent: Firefox
do:
- link_add:
    pool: main
    url:
    - https://www.adidashardware.com/training
    - https://www.adidashardware.com/recovery
    - https://www.adidashardware.com/strength
    - https://www.adidashardware.com/yoga
    - https://www.adidashardware.com/benches
    - https://www.adidashardware.com/cardio
- walk:
    to: links
    pool: main
    do:
    - find:
        path: script:matches(Wix Stores)
        do:
        - parse
        - space_dedupe
        - trim
        - filter:
            args:
            - var\s*rendererModel\s*=\s*(.+)\s*;\s*var\s*publicModel
        - normalize:
            routine: json2xml
        - to_block
        - find:
            path: clientspecmap > *:hasChild(appdefinitionname:matches(Wix Stores)) instance
            do:
            - parse
            - variable_set: instance
    - find:
        path: body
        do:
        - static_get: url
        - normalize:
            routine: replace_substring
            args:
                \s*$: ?_escaped_fragment_=
        - walk:
            to: value
            do:
            - find:
                path: meta[property="og:title"]
                do:
                - parse:
                    attr: content
                    filter:
                    - \s*\|\s*(.+)
                - normalize:
                    routine: lower
                - variable_set: cats
            - find:
                path: li[itemprop="itemListElement"] h3[itemprop="name"] > a
                do:
                - variable_clear: uri
                - parse:
                    attr: href
                - normalize:
                    routine: url
                - variable_set: uri
                - variable_clear: ur
                - filter:
                    args:
                    - product-page\/(.+)
                - variable_set: ur
                - walk:
                    to: https://ecom.wix.com/storefront/product/<%ur%>?instance=<%instance%>&locale=en
                    do:
                    - object_new: product
                    - find:
                        path: head
                        do:
                        - variable_get: cats
                        - object_field_set:
                            object: product
                            field: categories
                        - register_set: Adidas
                        - object_field_set:
                            object: product
                            field: brand
                        - eval:
                            routine: js
                            body: '(function (){var d = new Date(); return d.toISOString()})();'
                        - object_field_set:
                            object: product
                            field: date
                        - variable_get: uri
                        - object_field_set:
                            object: product
                            field: url
                    - find:
                        path: script:matches(eCom.eComAppConfig)
                        do:
                        - parse
                        - space_dedupe
                        - trim
                        - filter:
                            args:
                            - eCom\.eComAppConfig\(\'productPageApp\'\,\s*(.+),\s*\'\/\/
                        - normalize:
                            routine: json2xml
                        - to_block
                        - find:
                            path: body_safe
                            do:
                            - find:
                                path: product > description
                                do:
                                - parse
                                - space_dedupe
                                - trim
                                - normalize:
                                    routine: replace_substring
                                    args:
                                        \<p\>\s*\<\/p\>: ''
                                - register_set: <div><%register%></div>
                                - to_block
                                - find:
                                    path: div
                                    do:
                                    - attr_remove:
                                        selector: '*'
                                    - parse
                                    - object_field_set:
                                        object: product
                                        field: description
                            - find:
                                path: product > media:hasChild(mediatype:matches(PHOTO)) > url
                                do:
                                - parse
                                - register_set: https://static.wixstatic.com/media/<%register%>
                                - object_field_set:
                                    object: product
                                    field: images
                                    joinby: "|"
                            - find:
                                path: product > name
                                do:
                                - parse
                                - space_dedupe
                                - trim
                                - object_field_set:
                                    object: product
                                    field: name
                            - find:
                                path: product options:hasChild(optiontype:matches(COLOR)) description
                                do:
                                - parse
                                - space_dedupe
                                - trim
                                - object_field_set:
                                    object: product
                                    field: variations
                                    joinby: "|"

                            - find:
                                path: product > sku, product > managedproductitems > sku
                                do:
                                - parse
                                - space_dedupe
                                - trim
                                - if:
                                    match: (\S{3})
                                    do:
                                    - object_field_set:
                                        object: product
                                        field: sku
                            - find:
                                path: product > prices > price
                                do:
                                - parse:
                                    filter:
                                    - (\d+\.?\d*)
                                - object_field_set:
                                    object: product
                                    field: price
                                    type: float
                                - register_set: GBP
                                - object_field_set:
                                    object: product
                                    field: currency
                    - object_save:
                        name: product
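
For readers who want to check the string handling outside Diggernaut, here is a rough Python sketch of the main text transformations the config performs: pulling the rendererModel JSON out of the page script, finding the Wix Stores instance token, building the `?_escaped_fragment_=` category URL, building the storefront product URL, and extracting the price. It is not part of the config and not how Diggernaut works internally. The function names are hypothetical, and the JSON key names `clientSpecMap` and `appDefinitionName` are an assumption inferred from the lowercased selectors in the config.

```python
import json
import re

# Regular expressions copied from the config's `filter` commands.
RENDERER_MODEL_RE = re.compile(
    r"var\s*rendererModel\s*=\s*(.+)\s*;\s*var\s*publicModel"
)
PRODUCT_SLUG_RE = re.compile(r"product-page\/(.+)")
PRICE_RE = re.compile(r"(\d+\.?\d*)")


def extract_renderer_model(script_text: str) -> dict:
    """Mimic parse + space_dedupe + trim + filter on the page script."""
    flat = re.sub(r"\s+", " ", script_text).strip()
    match = RENDERER_MODEL_RE.search(flat)
    if match is None:
        raise ValueError("rendererModel assignment not found")
    return json.loads(match.group(1))


def find_store_instance(model: dict) -> str:
    """Return the instance token of the app named 'Wix Stores'.

    Key names here are assumed, not confirmed by the source.
    """
    for app in model["clientSpecMap"].values():
        if "Wix Stores" in app.get("appDefinitionName", ""):
            return app["instance"]
    raise ValueError("Wix Stores app not found")


def escaped_fragment_url(url: str) -> str:
    """Append the suffix that the config adds to each category URL."""
    return url.rstrip() + "?_escaped_fragment_="


def storefront_url(product_href: str, instance: str) -> str:
    """Fill the URL template used by the inner `walk` command."""
    match = PRODUCT_SLUG_RE.search(product_href)
    if match is None:
        raise ValueError("not a product-page link")
    slug = match.group(1)
    return (
        "https://ecom.wix.com/storefront/product/"
        f"{slug}?instance={instance}&locale=en"
    )


def parse_price(text: str) -> float:
    """Take the first number in the text, as the price filter does."""
    match = PRICE_RE.search(text)
    if match is None:
        raise ValueError("no number found in price text")
    return float(match.group(1))
```

The sketch uses `rstrip()` plus concatenation for the URL suffix instead of a regex replacement of `\s*$`, because in Python `re.sub` with that pattern can substitute twice when the URL has trailing whitespace.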
