Posted By

diggernaut on 12/06/17


Tagged

data etl Ecommerce scraping webscraping diggernaut americanapparel


Scraping americanapparel.net with Diggernaut


 / Published in: Other
 

URL: https://www.diggernaut.com

This config can be used with the Diggernaut service to scrape americanapparel.net and retrieve product information.

  1. Create a free account at diggernaut.com
  2. Log in to your account
  3. Create a project with any name and description you want
  4. Open your new project by clicking it and create a new digger with any name
  5. You will then see 3 suggested options; choose the one that uses the meta-language
  6. The config editor will open; copy and paste the config code and click the save button.
  7. Change digger mode from Debug to Active and run your digger.
  8. Wait for completion.
  9. Download data.
  10. Schedule your runs if required.

---
config:
    debug: 2
    agent: Firefox
do:
# Seed the default link pool with the store home page and the factory store
- link_add:
    url: http://store.americanapparel.net
- link_add:
    url: http://store.americanapparel.net/en/factory-store/
# Collect category links from the primary navigation into the "main" pool
- walk:
    to: links
    do:
    - find:
        path: .cd-primary-nav a
        do:
        - parse:
            attr: href
        - normalize:
            routine: url
        - link_add:
            pool: main
# Collect product page links from each category page into the "sub" pool
- walk:
    to: links
    pool: main
    do:
    - find:
        path: .product > a
        do:
        - parse:
            attr: href
        - normalize:
            routine: url
        - link_add:
            pool: sub
# Visit each product page and build one "product" object per page
- walk:
    to: links
    pool: sub
    do:
    - sleep: 3
    - find:
        path: .pdp
        do:
        - variable_clear: allli
        - variable_clear: descr
        - variable_clear: li
        - variable_clear: id
        - variable_clear: views
        - variable_clear: color
        - variable_clear: imgnum
        - variable_clear: imgxl
        - variable_clear: viewsnum
        - variable_clear: stp
        - object_new: product
        # Timestamp of the scrape
        - eval:
            routine: js
            body: '(function (){var d = new Date(); return d.toISOString()})();'
        - object_field_set:
            object: product
            field: date
        - static_get: url
        - object_field_set:
            object: product
            field: url
        # Description: first line of the meta description
        - find:
            in: doc
            path: head meta[name="description"]
            do:
            - parse:
                attr: content
            - space_dedupe
            - trim
            - to_block
            - node_replace:
                path: br
                with: "\n"
            - split:
                context: text
                delimiter: \n+
            - find:
                path: div.splitted
                slice: 0
                do:
                - parse
                - space_dedupe
                - trim
                - object_field_set:
                    object: product
                    field: description
        - find:
            path: .product-style
            do:
            - parse
            - space_dedupe
            - trim
            - object_field_set:
                object: product
                field: sku
        - find:
            path: .price
            do:
            # Sale price, if present; "stp" flags that a price was already set
            - find:
                path: .red-text
                do:
                - parse:
                    filter:
                    - (\d+\.?\d*)
                - if:
                    match: (\d)
                    do:
                    - object_field_set:
                        object: product
                        field: price
                        type: float
                    - register_set: USD
                    - object_field_set:
                        object: product
                        field: currency
                    - register_set: 1
                    - variable_set: stp
            # Regular price, used only when "stp" was not set above
            - find:
                path: span[data-test="test"]
                do:
                - variable_get: stp
                - if:
                    match: (1)
                    else:
                    - parse:
                        filter:
                        - (\d+\.?\d*)
                    - if:
                        match: (\d)
                        do:
                        - object_field_set:
                            object: product
                            field: price
                            type: float
                        - register_set: USD
                        - object_field_set:
                            object: product
                            field: currency
        - find:
            path: .product-name
            do:
            - parse
            - space_dedupe
            - trim
            - object_field_set:
                object: product
                field: name
        - find:
            path: .main-img
            do:
            - parse:
                attr: src
            - object_field_set:
                object: product
                field: images
                joinby: "|"
        - find:
            path: .logo
            slice: 0
            do:
            - parse
            - space_dedupe
            - trim
            - object_field_set:
                object: product
                field: brand
        - find:
            path: .breadcrumbs a
            do:
            - parse
            - space_dedupe
            - trim
            - object_field_set:
                object: product
                field: category
                joinby: "|"
        # Variation data is embedded as JSON in a hidden input; convert it to XML and walk it
        - find:
            path: '.product-details > input#skuVarData'
            do:
            - parse:
                attr: value
            - normalize:
                routine: replace_substring
                args:
                    \s+\/\s+: _
            - normalize:
                routine: json2xml
            - to_block
            - find:
                path: body_safe > name
                do:
                - parse
                - space_dedupe
                - trim
                - object_field_set:
                    object: product
                    field: name
            - find:
                path: colors
                do:
                - find:
                    path: zoomimage
                    do:
                    - parse:
                        filter:
                        - \s*(.+)\?
                    - variable_set: imgxl
                    - register_set: <%imgxl%>?$ProductZoom$
                    - object_field_set:
                        object: product
                        field: images
                        joinby: "|"
                - find:
                    path: name
                    do:
                    - parse
                    - space_dedupe
                    - trim
                    - object_field_set:
                        object: product
                        field: variations
                        joinby: "|"
        - object_save:
            name: product
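
The regular expressions in the config can be tried out locally before running the digger. The following is a minimal Python sketch, not part of Diggernaut: the function names and the sample inputs are hypothetical, and it assumes the `filter` option keeps the first capture group of the first match. It mirrors the price filter `(\d+\.?\d*)`, the `replace_substring` pattern `\s+\/\s+`, and the zoom image filter `\s*(.+)\?` followed by the `?$ProductZoom$` suffix.

```python
import re
from typing import Optional

# Patterns copied from the config above
PRICE_RE = re.compile(r"(\d+\.?\d*)")
SLASH_RE = re.compile(r"\s+/\s+")
ZOOM_RE = re.compile(r"\s*(.+)\?")


def extract_price(text: str) -> Optional[float]:
    """Return the first number found in the text as a float, or None."""
    match = PRICE_RE.search(text)
    return float(match.group(1)) if match else None


def collapse_slashes(text: str) -> str:
    """Replace a slash surrounded by whitespace with an underscore."""
    return SLASH_RE.sub("_", text)


def zoom_image_url(raw: str) -> Optional[str]:
    """Drop the query string and append the ProductZoom preset (assumed behavior)."""
    match = ZOOM_RE.search(raw)
    if match is None:
        return None
    return match.group(1) + "?$ProductZoom$"


if __name__ == "__main__":
    # Hypothetical sample values, not taken from the real site
    print(extract_price("$18.00"))
    print(collapse_slashes("Black / White"))
    print(zoom_image_url("http://example.com/img/abc.jpg?$Thumb$"))
```

If a selector or pattern stops matching after a site redesign, checking the pattern against a saved copy of the page text this way narrows down whether the regex or the CSS path is at fault.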
