Posted By

diggernaut on 12/06/17


Tagged

data etl Ecommerce scraping webscraping diggernaut anntaylor


Scraping anntaylor.com with Diggernaut



URL: https://www.diggernaut.com

This config can be used with the Diggernaut service to scrape anntaylor.com and retrieve product information.

  1. Create a free account at diggernaut.com.
  2. Log in to your account.
  3. Create a project with any name and description you want.
  4. Open your new project by clicking on it and create a new digger with any name.
  5. You will be offered 3 options; choose the one that uses the meta-language.
  6. The config editor will open. Copy and paste the config code below into it and click the save button.
  7. Change the digger mode from Debug to Active and run your digger.
  8. Wait for the run to complete.
  9. Download the data.
  10. Schedule your runs if required.

---
config:
    debug: 2
do:
- pool_clear: pages
- walk:
    to: https://www.anntaylor.com
    do:
    - find:
        path: nav.sub-nav a
        do:
        - variable_clear: did
        - variable_set:
            field: viewsnum
            value: 0
        - parse:
            attr: data-id
        - variable_set: did
        - walk:
            to: https://www.anntaylor.com/ecws/endecaService.jsp?SortByFacetSelectedValue=remove&DocSortOrder=remove&format=json&catid=<%did%>&question=&fRequest=true&goToPage=1&N=0&categoryType=regular&priceSort=DESC&country=US&currency=USD&Submit=Submit
            do:
            - find:
                path: resultslist>pagination>attributes>pagesavailable
                do:
                - parse
                - variable_set: viewsnum
            - eval:
                routine: js
                body: (function(){var num = <%viewsnum%>;var str = "";for(var i = num; i > 0; i--){if (i != num){str += ","}str += i} return "<div>"+str+"</div>";})();
            - to_block
            - split:
                context: text
                delimiter: ","
            - find:
                path: div
                do:
                - variable_clear: pagenum
                - parse
                - variable_set: pagenum
                - link_add:
                    url: https://www.anntaylor.com/ecws/endecaService.jsp?SortByFacetSelectedValue=remove&DocSortOrder=remove&format=json&catid=<%did%>&question=&fRequest=true&goToPage=<%pagenum%>&N=0&categoryType=regular&priceSort=DESC&country=US&currency=USD&Submit=Submit
                    pool: catalog
- walk:
    to: links
    pool: catalog
    do:
    - find:
        path: resultslist>records>records>attributes>quicklookurl
        do:
        - parse:
            filter: ^([^\?]+)
        - normalize:
            routine: url
        - link_add:
            pool: pages
- walk:
    to: links
    pool: pages
    do:
    - sleep: 3
    - find:
        path: main
        do:
        - variable_clear: pid
        - object_new: product
        - eval:
            routine: js
            body: '(function (){var d = new Date(); return d.toISOString()})();'
        - object_field_set:
            object: product
            field: date
        - static_get: url
        - object_field_set:
            object: product
            field: url
        - find:
            path: h1[itemprop="name"]
            do:
            - parse
            - space_dedupe
            - trim
            - object_field_set:
                object: product
                field: name
        - register_set: Ann Taylor
        - object_field_set:
            object: product
            field: brand
        - find:
            in: doc
            path: script:contains("window.productSettings = ")
            do:
            - parse:
                filter: window\.productSettings\s+=\s+(.+)\s*$
            - normalize:
                routine: json2xml
            - to_block
            - find:
                path: body_safe>currency
                do:
                - parse
                - normalize:
                    routine: replace_matched
                    args:
                        \$: USD
                - object_field_set:
                    object: product
                    field: currency
            - find:
                path: body_safe>products>listprice
                do:
                - parse
                - object_field_set:
                    object: product
                    type: float
                    field: price
            - find:
                path: body_safe>prodid
                do:
                - parse
                - space_dedupe
                - trim
                - variable_set: pid
                - object_field_set:
                    object: product
                    field: sku
            - find:
                path: body_safe>products>skucolors>colors
                do:
                - find:
                    path: colorname
                    do:
                    - parse
                    - space_dedupe
                    - trim
                    - if:
                        match: \w+
                        do:
                        - object_field_set:
                            object: product
                            joinby: "|"
                            field: variations
            - walk:
                to: https://richmedia.channeladvisor.com/ViewerDelivery/productXmlService?profileid=52000652&itemid=<%pid%>&viewerid=196
                do:
                - find:
                    path: img
                    do:
                    - parse:
                        attr: path
                    - normalize:
                        routine: replace_substring
                        args:
                            \&recipeId\=\d+: ''
                    - object_field_set:
                        object: product
                        joinby: "|"
                        field: images
            - find:
                path: body_safe>products>weblongdescription
                do:
                - parse
                - space_dedupe
                - trim
                - object_field_set:
                    object: product
                    field: description
            - find:
                path: body_safe>products>parentcategoryname
                do:
                - parse
                - space_dedupe
                - trim
                - if:
                    match: \w+
                    do:
                    - object_field_set:
                        object: product
                        joinby: "|"
                        field: category
        - object_save:
            name: product
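
For readers who want to follow the less obvious steps of the config outside of Diggernaut, here is a minimal Python sketch of three of them: building one catalog URL per page (the JS `eval` plus `link_add`), cutting the query string off the quick-look URL (the `^([^\?]+)` filter), and pulling the JSON assigned to `window.productSettings` out of a script body. The function names are hypothetical and not part of Diggernaut, and stripping a trailing semicolon before JSON decoding is an assumption about what the page's script looks like. Only the URL template and the regular expressions are taken from the config.

```python
import json
import re

# URL template copied from the config; {did} and {page} stand in for
# the <%did%> and <%pagenum%> variables.
CATALOG_URL = (
    "https://www.anntaylor.com/ecws/endecaService.jsp"
    "?SortByFacetSelectedValue=remove&DocSortOrder=remove&format=json"
    "&catid={did}&question=&fRequest=true&goToPage={page}&N=0"
    "&categoryType=regular&priceSort=DESC&country=US&currency=USD&Submit=Submit"
)


def page_urls(did, pages_available):
    """Return one catalog URL per page, counting down from the last page to 1.

    This is the order in which the config's JS snippet emits page numbers.
    """
    return [
        CATALOG_URL.format(did=did, page=n)
        for n in range(pages_available, 0, -1)
    ]


def strip_query(url):
    """Keep only the part of the URL before the first '?'."""
    # Same pattern as the config's parse filter: ^([^\?]+)
    match = re.match(r"^([^\?]+)", url)
    return match.group(1) if match else url


def extract_product_settings(script_text):
    """Decode the JSON assigned to window.productSettings, or return None."""
    # Same pattern as the config's parse filter.
    match = re.search(r"window\.productSettings\s+=\s+(.+)\s*$", script_text)
    if match is None:
        return None
    # Assumption: the assignment may end with ';', which is not valid JSON.
    return json.loads(match.group(1).rstrip().rstrip(";"))
```

For example, `page_urls("cat1", 3)` (with a made-up category id) yields three URLs whose `goToPage` values run 3, 2, 1. In the config itself, the decoded settings are converted to XML with `json2xml` and then queried with `find`; the sketch stops at the decoded JSON.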
