Posted By

diggernaut on 12/05/17


Tagged

data etl Ecommerce scraping webscraping ae diggernaut


Scraping ae.com with Diggernaut


 / Published in: Other
 

URL: https://www.diggernaut.com

This config can be used with the Diggernaut service to scrape ae.com and retrieve product information.

  1. Create a free account at diggernaut.com.
  2. Log in to your account.
  3. Create a project with any name and description you want.
  4. Open your new project by clicking it and create a new digger with any name.
  5. You will be offered 3 options; choose the one that uses the meta-language.
  6. The config editor will open. Copy and paste the config code into it and click the Save button.
  7. Change the digger mode from Debug to Active and run your digger.
  8. Wait for the run to complete.
  9. Download the data.
  10. Schedule your runs if required.

---
config:
    debug: 2
do:
- pool_clear: pages
# Stage 1: collect category links from the home page navigation
- walk:
    to: https://www.ae.com
    do:
    - find:
        path: a.third-level,a.second-level-item
        do:
        - parse:
            attr: href
        - normalize:
            routine: url
        - link_add:
            pool: catalog
# Stage 2: collect product page links from each category page
- walk:
    to: links
    pool: catalog
    do:
    - find:
        path: div.product-tile>div.product-details-container>a[data-qa-link-to="pdp"]
        do:
        - parse:
            attr: href
            # keep only the part of the href before the query string
            filter: ^([^\?]+)
        - normalize:
            routine: url
        - link_add:
            pool: pages
# Stage 3: visit each product page and build one product object
- walk:
    to: links
    pool: pages
    do:
    - sleep: 3
    - find:
        path: div.container-fluid[itemtype="http://schema.org/Product"]
        do:
        - variable_clear: pid
        - object_new: product
        - eval:
            routine: js
            body: '(function (){var d = new Date(); return d.toISOString()})();'
        - object_field_set:
            object: product
            field: date
        - static_get: url
        - object_field_set:
            object: product
            field: url
        - find:
            path: h1.psp-product-name
            do:
            - parse
            - space_dedupe
            - trim
            - object_field_set:
                object: product
                field: name
        - find:
            path: span.pdp-about-cs
            do:
            - parse
            - space_dedupe
            - trim
            - object_field_set:
                object: product
                field: sku
        # product data is embedded as JSON in the $AWP_ENV script variable;
        # convert it to XML so it can be searched with find
        - find:
            in: doc
            path: script:contains("var $AWP_ENV")
            do:
            - parse:
                filter: var\s+\$AWP_ENV\s+\=\s+(.+)\s*$
            - normalize:
                routine: json2xml
            - to_block
            - find:
                path: body_safe>viewdata>availableproducts
                do:
                - find:
                    path: brandname
                    slice: 0
                    do:
                    - parse
                    - space_dedupe
                    - trim
                    - object_field_set:
                        object: product
                        field: brand
                - find:
                    path: colorname
                    do:
                    - parse
                    - space_dedupe
                    - trim
                    - if:
                        match: \w+
                        do:
                        - object_field_set:
                            object: product
                            joinby: "|"
                            field: variations
                - find:
                    path: largeimages
                    do:
                    - parse
                    - space_dedupe
                    - trim
                    - if:
                        match: \w+
                        do:
                        - normalize:
                            routine: url
                        - variable_set: iurl
                        - register_set: "<%iurl%>?scl=1"
                        - object_field_set:
                            object: product
                            joinby: "|"
                            field: images
        # regular price and currency
        - find:
            path: span.psp-product-regularprice[itemprop="priceCurrency"]
            do:
            - parse:
                attr: content
            - object_field_set:
                object: product
                field: currency
        - find:
            path: span.psp-product-regularprice[itemprop="price"]
            do:
            - parse:
                attr: content
            - object_field_set:
                object: product
                type: float
                field: price
        # sale price and currency, when present, overwrite the regular ones
        - find:
            path: span.psp-product-saleprice[itemprop="priceCurrency"],span.psp-product-sale-currency[itemprop="priceCurrency"]
            do:
            - parse:
                attr: content
            - object_field_set:
                object: product
                field: currency
        - find:
            path: span.psp-product-saleprice[itemprop="price"]
            do:
            - parse:
                attr: content
            - object_field_set:
                object: product
                type: float
                field: price
        - find:
            in: doc
            path: div.pdp-about-details-equit
            slice: 0
            do:
            - parse
            - space_dedupe
            - trim
            - object_field_set:
                object: product
                field: description
        - find:
            in: doc
            path: ol.breadcrumb>li.bc-item
            slice: 0:-2
            do:
            - variable_clear: list_item
            - parse
            - space_dedupe
            - trim
            - if:
                match: \w{2,}
                do:
                - object_field_set:
                    object: product
                    joinby: "|"
                    field: category
        - object_save:
            name: product
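
The config stores several multi-value fields (variations, images, category) as single strings joined with "|", and it trims product links with the filter ^([^\?]+). The Python sketch below is not part of the digger; it only illustrates what that filter keeps and how the joined fields could be split again after download. The helper names and the sample values are made up for illustration, and Python's regex engine is assumed to behave like Diggernaut's for this simple pattern.

```python
import re

# Same pattern as the `filter: ^([^\?]+)` line in the config.
QUERY_FILTER = re.compile(r"^([^\?]+)")


def strip_query(href):
    """Return the part of href before the first '?' (the filter's capture group)."""
    match = QUERY_FILTER.search(href)
    return match.group(1) if match else ""


def split_joined(value, sep="|"):
    """Split a field built with `joinby: "|"` back into a list, dropping empty parts."""
    return [part for part in value.split(sep) if part]


# Hypothetical record shaped like the fields the config sets; all values are made up.
sample = {
    "url": strip_query("/p/example-item?menu=cat1"),
    "variations": "Blue|Black",
    "images": "https://example.com/a?scl=1|https://example.com/b?scl=1",
}

colors = split_joined(sample["variations"])
images = split_joined(sample["images"])
print(sample["url"], colors, images)
```

Splitting on "|" is only safe as long as no value contains that character itself, which is worth checking against your own downloaded data.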
