Posted By

diggernaut on 12/06/17


Tagged

data etl Ecommerce scraping webscraping diggernaut alexandermcqueen


Scraping alexandermcqueen.com with Diggernaut


Published in: Other
 

URL: https://www.diggernaut.com

This config can be used with the Diggernaut service to scrape alexandermcqueen.com and retrieve product information.

  1. Create a free account at diggernaut.com.
  2. Log in to your account.
  3. Create a project with any name and description you want.
  4. Open your new project by clicking it and create a new digger with any name.
  5. You will be offered 3 options; choose the one that uses the meta-language.
  6. The config editor will open. Copy and paste the config code below and click the Save button.
  7. Change the digger mode from Debug to Active and run your digger.
  8. Wait for completion.
  9. Download the data.
  10. Schedule your runs if required.

---
config:
    debug: 2
    agent: Firefox
do:
- walk:
    to: http://www.alexandermcqueen.com/us/
    do:
    - find:
        path: ul.level-1>li
        do:
        - variable_clear: cat1
        - parse:
            attr: id
        - normalize:
            routine: replace_matched
            args:
                shop_womenswear: Womens
                shop_menswear: Mens
                .+: ''
        - variable_set: cat1
        - find:
            path: ul.level-2>li
            do:
            - variable_clear: cat2
            - find:
                path: a
                slice: 0
                do:
                - parse
                - space_dedupe
                - trim
                - variable_set: cat2
            - find:
                path: ul.level-3>li
                do:
                - variable_clear: cat3
                - find:
                    path: a
                    slice: 0
                    do:
                    - parse
                    - space_dedupe
                    - trim
                    - variable_set: cat3
                    - parse:
                        attr: href
                    - space_dedupe
                    - trim
                    - if:
                        match: \w+
                        do:
                        - normalize:
                            routine: url
                        - walk:
                            to: value
                            do:
                            - find:
                                path: script:contains('yTos.navigation =')
                                do:
                                - parse:
                                    filter: yTos\.navigation\s+\=\s+(.+)\;
                                - normalize:
                                    routine: json2xml
                                - to_block
                                - find:
                                    path: pathandqueryparsed:has(paramname:matches(^sitecode$))
                                    do:
                                    - find:
                                        path: paramvalue
                                        do:
                                        - parse
                                        - variable_set: sitecode
                                - find:
                                    path: pathandqueryparsed:has(paramname:matches(^dept$))
                                    do:
                                    - find:
                                        path: paramvalue
                                        do:
                                        - parse
                                        - variable_set: department
                                - find:
                                    path: pathandqueryparsed:has(paramname:matches(^season$))
                                    do:
                                    - find:
                                        path: paramvalue
                                        do:
                                        - parse
                                        - normalize:
                                            routine: replace_substring
                                            args:
                                                \,: "%2C"
                                        - variable_set: season
                                - find:
                                    path: pathandqueryparsed:has(paramname:matches(^gender$))
                                    do:
                                    - find:
                                        path: paramvalue
                                        do:
                                        - parse
                                        - variable_set: gender
                                - find:
                                    path: pathandqueryparsed:has(paramname:matches(^yurirulename$))
                                    do:
                                    - find:
                                        path: paramvalue
                                        do:
                                        - parse
                                        - variable_set: yurirulename
                                - walk:
                                    to: http://www.alexandermcqueen.com/Search/RenderProducts?ytosQuery=true&department=<%department%>&gender=<%gender%>&season=<%season%>&yurirulename=<%yurirulename%>&page=1&productsPerPage=1000&suggestion=false&totalPages=1&partialLoadedItems=1000&siteCode=<%sitecode%>
                                    do:
                                    - find:
                                        path: article>a
                                        do:
                                        - parse:
                                            attr: href
                                            filter: ^([^\#]+)
                                        - walk:
                                            to: value
                                            do:
                                            - sleep: 2
                                            - find:
                                                path: article.item
                                                do:
                                                - variable_clear: pid
                                                - variable_clear: cid
                                                - object_new: product
                                                - eval:
                                                    routine: js
                                                    body: '(function (){var d = new Date(); return d.toISOString()})();'
                                                - object_field_set:
                                                    object: product
                                                    field: date
                                                - static_get: url
                                                - object_field_set:
                                                    object: product
                                                    field: url
                                                - register_set: Alexander McQueen
                                                - object_field_set:
                                                    object: product
                                                    field: brand
                                                - find:
                                                    path: h2.modelName
                                                    do:
                                                    - parse
                                                    - space_dedupe
                                                    - trim
                                                    - object_field_set:
                                                        object: product
                                                        field: name
                                                - find:
                                                    in: doc
                                                    path: meta[name="description"]
                                                    do:
                                                    - parse:
                                                        attr: content
                                                    - space_dedupe
                                                    - trim
                                                    - variable_set: desc
                                                - find:
                                                    path: div.descriptionsContainer>div.EditorialDescription
                                                    do:
                                                    - parse
                                                    - space_dedupe
                                                    - trim
                                                    - variable_set: desc
                                                - variable_get: desc
                                                - object_field_set:
                                                    object: product
                                                    field: description
                                                - find:
                                                    path: div.itemPriceContainer span.price
                                                    slice: 0
                                                    do:
                                                    - find:
                                                        path: span.currency
                                                        do:
                                                        - parse
                                                        - normalize:
                                                            routine: replace_matched
                                                            args:
                                                                \$: USD
                                                        - object_field_set:
                                                            object: product
                                                            field: currency
                                                    - find:
                                                        path: span.value
                                                        do:
                                                        - parse
                                                        - normalize:
                                                            routine: replace_substring
                                                            args:
                                                            - \,: ''
                                                            - \s+: ''
                                                        - object_field_set:
                                                            object: product
                                                            type: float
                                                            field: price
                                                - find:
                                                    path: div.modelFabricColor>span.value
                                                    do:
                                                    - parse
                                                    - space_dedupe
                                                    - trim
                                                    - if:
                                                        match: \w+
                                                        do:
                                                        - variable_set: pid
                                                        - object_field_set:
                                                            object: product
                                                            field: sku
                                                - variable_get: cat1
                                                - if:
                                                    match: \w{2,}
                                                    do:
                                                    - object_field_set:
                                                        object: product
                                                        joinby: "|"
                                                        field: category
                                                - variable_get: cat2
                                                - if:
                                                    match: \w{2,}
                                                    do:
                                                    - object_field_set:
                                                        object: product
                                                        joinby: "|"
                                                        field: category
                                                - variable_get: cat3
                                                - if:
                                                    match: \w{2,}
                                                    do:
                                                    - object_field_set:
                                                        object: product
                                                        joinby: "|"
                                                        field: category
                                                - find:
                                                    path: ul.alternativeImages>li>img
                                                    do:
                                                    - parse:
                                                        attr: srcset
                                                    - to_block
                                                    - split:
                                                        context: text
                                                        delimiter: \,\s*
                                                    - find:
                                                        path: div.splitted
                                                        slice: 0
                                                        do:
                                                        - parse:
                                                            filter: ^([^\s]+)
                                                        - object_field_set:
                                                            object: product
                                                            joinby: "|"
                                                            field: images
                                                - find:
                                                    path: div.selectColor
                                                    slice: 0
                                                    do:
                                                    - variable_clear: cod10
                                                    - find:
                                                        in: doc
                                                        path: script:contains("yTos.navigation.itemData =")
                                                        do:
                                                        - parse:
                                                            filter: yTos\.navigation\.itemData\s+\=\s+(.+)\;
                                                        - normalize:
                                                            routine: json2xml
                                                        - to_block
                                                        - find:
                                                            path: cod10
                                                            do:
                                                            - parse
                                                            - space_dedupe
                                                            - trim
                                                            - variable_set: cod10
                                                        - walk:
                                                            to: http://www.alexandermcqueen.com/yTos/api/Plugins/ItemPluginApi/GetCombinationsAsync/?siteCode=<%sitecode%>&code10=<%cod10%>
                                                            do:
                                                            - find:
                                                                path: body_safe>colors
                                                                do:
                                                                - find:
                                                                    path: description
                                                                    do:
                                                                    - parse
                                                                    - space_dedupe
                                                                    - trim
                                                                    - if:
                                                                        match: \w+
                                                                        do:
                                                                        - object_field_set:
                                                                            object: product
                                                                            joinby: "|"
                                                                            field: variations
                                                - object_save:
                                                    name: product
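
For readers who want to see what a few of the config's steps amount to outside Diggernaut, here is a rough Python sketch. It is not part of the config and does not use any Diggernaut API. It only mirrors three things the config does: replacing commas in the season value with %2C, filling the <%name%> placeholders in the search URL, and taking the first URL from an image's srcset attribute. The function names and the sample values are made up for illustration, and raising an error on a missing variable is an assumption of this sketch, not documented Diggernaut behaviour.

```python
import re

# Search URL copied from the config; <%name%> marks a variable placeholder.
SEARCH_TEMPLATE = (
    "http://www.alexandermcqueen.com/Search/RenderProducts"
    "?ytosQuery=true&department=<%department%>&gender=<%gender%>"
    "&season=<%season%>&yurirulename=<%yurirulename%>"
    "&page=1&productsPerPage=1000&suggestion=false&totalPages=1"
    "&partialLoadedItems=1000&siteCode=<%sitecode%>"
)


def encode_season(season: str) -> str:
    """Mirror the replace_substring step: every comma becomes %2C."""
    return season.replace(",", "%2C")


def fill_template(template: str, variables: dict) -> str:
    """Substitute each <%name%> placeholder with variables[name].

    Assumption of this sketch: a missing variable raises KeyError.
    """
    return re.sub(r"<%(\w+)%>", lambda m: variables[m.group(1)], template)


def first_srcset_url(srcset: str) -> str:
    """Mirror the images step: split on commas, keep the first candidate,
    then keep only its leading non-whitespace run (the URL)."""
    candidates = re.split(r",\s*", srcset.strip())
    match = re.match(r"([^\s]+)", candidates[0])
    return match.group(1) if match else ""


if __name__ == "__main__":
    # Hypothetical values; the real ones come from the yTos.navigation script.
    variables = {
        "department": "dresses",
        "gender": "D",
        "season": encode_season("A,P"),
        "yurirulename": "searchwithdepartmentgallery",
        "sitecode": "SITE_US",
    }
    print(fill_template(SEARCH_TEMPLATE, variables))
    print(first_srcset_url("http://img.example/a_320.jpg 320w, http://img.example/a_640.jpg 640w"))
```

The season value is encoded before substitution because it is the only variable the config normalizes this way; the other values are inserted as parsed.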
