Posted By

diggernaut on 12/06/17


Tagged

data etl Ecommerce scraping webscraping diggernaut anthropologie


Scraping anthropologie.com with Diggernaut


Published in: Other
 

URL: https://www.diggernaut.com

This config can be used with the Diggernaut service to scrape product information from anthropologie.com.

  1. Create a free account at diggernaut.com.
  2. Log in to your account.
  3. Create a project with any name and description you like.
  4. Open the new project by clicking it, and create a new digger with any name.
  5. You will be offered 3 options; choose the one that uses the meta-language.
  6. The config editor will open; paste the config code below into it and click the save button.
  7. Change the digger mode from Debug to Active and run the digger.
  8. Wait for the run to complete.
  9. Download the data.
  10. Schedule further runs if required.

---
config:
    debug: 2
    agent: Firefox
do:
- walk:
    to: https://www.anthropologie.com
    do:
    # Level-1 navigation items: remember the top-level category name
    - find:
        path: .c-main-navigation__li--level-1
        do:
        - find:
            path: span
            slice: 0
            do:
            - parse
            - space_dedupe
            - trim
            - normalize:
                routine: lower
            - variable_set: cat1
        - find:
            path: .c-main-navigation__li--level-2
            do:
            - variable_clear: subcat
            - find:
                path: .c-main-navigation__a--level-2
                do:
                - parse
                - space_dedupe
                - trim
                - normalize:
                    routine: lower
                - variable_set: cat2
            # Level-3 links: crawl the listing behind each one
            - find:
                path: .c-main-navigation__li--level-3 a
                do:
                - parse
                - space_dedupe
                - trim
                - normalize:
                    routine: lower
                - variable_set: cat3
                - variable_set:
                    field: subcat
                    value: 1
                - parse:
                    attr: href
                - pool_clear: main
                - link_add:
                    pool: main
                - walk:
                    to: links
                    pool: main
                    do:
                    # Pagination: queue the "next" link into the same pool
                    - find:
                        path: .js-pagination__arrow--next
                        slice: 0
                        do:
                        - parse:
                            attr: href
                        - link_add:
                            pool: main
                    # Product tiles: drop the query string, then open the product page
                    - find:
                        path: .c-product-tile__image-link
                        do:
                        - parse:
                            attr: href
                            filter:
                            - (.+)\?
                            - (.+)
                        - normalize:
                            routine: url
                        - walk:
                            to: value
                            do:
                            - find:
                                path: body
                                do:
                                - object_new: product
                                - eval:
                                    routine: js
                                    body: '(function (){var d = new Date(); return d.toISOString()})();'
                                - object_field_set:
                                    object: product
                                    field: date
                                - register_set: Anthropologie
                                - object_field_set:
                                    object: product
                                    field: brand
                                - static_get: url
                                - object_field_set:
                                    object: product
                                    field: url
                                - find:
                                    path: meta[property="product:price:amount"]
                                    do:
                                    - parse:
                                        attr: content
                                    - if:
                                        match: (\d)
                                        do:
                                        - object_field_set:
                                            object: product
                                            field: price
                                            type: float
                                        - register_set: USD
                                        - object_field_set:
                                            object: product
                                            field: currency
                                - find:
                                    path: .o-carousel__flex-wrapper > img.c-product-image
                                    do:
                                    - parse:
                                        attr: src
                                        filter:
                                        - (.+)\?
                                        - (.+)
                                    - normalize:
                                        routine: url
                                    - object_field_set:
                                        object: product
                                        field: images
                                        joinby: "|"
                                # Embedded product data: decode it and convert the JSON to XML
                                - find:
                                    path: script:matches(window\.productData)
                                    do:
                                    - parse:
                                        filter:
                                        - window.productData\s*=\s*\'\s*(.+)\s*\'\s*;
                                    - normalize:
                                        routine: Base64ZLIBDecode
                                    - normalize:
                                        routine: json2xml
                                    - to_block
                                    - find:
                                        path: body_safe
                                        do:
                                        - find:
                                            path: primaryslice:hasChild(displaylabel:matches(Color))
                                            do:
                                            - find:
                                                path: sliceitems > displayname
                                                do:
                                                - parse
                                                - space_dedupe
                                                - trim
                                                - object_field_set:
                                                    object: product
                                                    field: variations
                                                    joinby: "|"
                                            - find:
                                                path: sliceitems
                                                do:
                                                - variable_clear: iid
                                                - find:
                                                    path: id
                                                    slice: 0
                                                    do:
                                                    - parse
                                                    - variable_set: iid
                                                - find:
                                                    path: images
                                                    do:
                                                    - parse
                                                    - register_set: http://images.anthropologie.com/is/image/Anthropologie/<%iid%>_<%register%>
                                                    - object_field_set:
                                                        object: product
                                                        field: images
                                                        joinby: "|"
                                        - find:
                                            path: product > stylenumber
                                            slice: 0
                                            do:
                                            - parse
                                            - space_dedupe
                                            - trim
                                            - object_field_set:
                                                object: product
                                                field: sku
                                        - find:
                                            path: product > product > brand
                                            do:
                                            - parse
                                            - space_dedupe
                                            - trim
                                            - object_field_set:
                                                object: product
                                                field: brand
                                        - find:
                                            path: product > product > displayname
                                            do:
                                            - parse
                                            - space_dedupe
                                            - trim
                                            - object_field_set:
                                                object: product
                                                field: name
                                        - find:
                                            path: product > product > longdescription
                                            do:
                                            - parse
                                            - space_dedupe
                                            - trim
                                            - object_field_set:
                                                object: product
                                                field: description
                                        - variable_get: cat1
                                        - if:
                                            match: (\S)
                                            do:
                                            - object_field_set:
                                                object: product
                                                field: category
                                                joinby: "|"
                                        - variable_get: cat2
                                        - if:
                                            match: (\S)
                                            do:
                                            - object_field_set:
                                                object: product
                                                field: category
                                                joinby: "|"
                                        - variable_get: cat3
                                        - if:
                                            match: (\S)
                                            do:
                                            - object_field_set:
                                                object: product
                                                field: category
                                                joinby: "|"
                                        - object_save:
                                            name: product
            # No level-3 links under this level-2 item: crawl the level-2 link itself
            - variable_get: subcat
            - if:
                match: (1)
                else:
                - find:
                    path: .c-main-navigation__a--level-2
                    do:
                    - parse:
                        attr: href
                    - pool_clear: main
                    - link_add:
                        pool: main
                    - walk:
                        to: links
                        pool: main
                        do:
                        - find:
                            path: .js-pagination__arrow--next
                            slice: 0
                            do:
                            - parse:
                                attr: href
                            - link_add:
                                pool: main
                        - find:
                            path: .c-product-tile__image-link
                            do:
                            - parse:
                                attr: href
                                filter:
                                - (.+)\?
                                - (.+)
                            - normalize:
                                routine: url
                            - walk:
                                to: value
                                do:
                                - find:
                                    path: body
                                    do:
                                    - object_new: product
                                    - eval:
                                        routine: js
                                        body: '(function (){var d = new Date(); return d.toISOString()})();'
                                    - object_field_set:
                                        object: product
                                        field: date
                                    - register_set: Anthropologie
                                    - object_field_set:
                                        object: product
                                        field: brand
                                    - static_get: url
                                    - object_field_set:
                                        object: product
                                        field: url
                                    - find:
                                        path: meta[property="product:price:amount"]
                                        do:
                                        - parse:
                                            attr: content
                                        - if:
                                            match: (\d)
                                            do:
                                            - object_field_set:
                                                object: product
                                                field: price
                                                type: float
                                            - register_set: USD
                                            - object_field_set:
                                                object: product
                                                field: currency
                                    - find:
                                        path: .o-carousel__flex-wrapper > img.c-product-image
                                        do:
                                        - parse:
                                            attr: src
                                            filter:
                                            - (.+)\?
                                            - (.+)
                                        - normalize:
                                            routine: url
                                        - object_field_set:
                                            object: product
                                            field: images
                                            joinby: "|"
                                    - find:
                                        path: script:matches(window\.productData)
                                        do:
                                        - parse:
                                            filter:
                                            - window.productData\s*=\s*\'\s*(.+)\s*\'\s*;
                                        - normalize:
                                            routine: Base64ZLIBDecode
                                        - normalize:
                                            routine: json2xml
                                        - to_block
                                        - find:
                                            path: body_safe
                                            do:
                                            - find:
                                                path: primaryslice:hasChild(displaylabel:matches(Color))
                                                do:
                                                - find:
                                                    path: sliceitems > displayname
                                                    do:
                                                    - parse
                                                    - space_dedupe
                                                    - trim
                                                    - object_field_set:
                                                        object: product
                                                        field: variations
                                                        joinby: "|"
                                                - find:
                                                    path: sliceitems
                                                    do:
                                                    - variable_clear: iid
                                                    - find:
                                                        path: id
                                                        slice: 0
                                                        do:
                                                        - parse
                                                        - variable_set: iid
                                                    - find:
                                                        path: images
                                                        do:
                                                        - parse
                                                        - register_set: http://images.anthropologie.com/is/image/Anthropologie/<%iid%>_<%register%>
                                                        - object_field_set:
                                                            object: product
                                                            field: images
                                                            joinby: "|"
                                            - find:
                                                path: product > stylenumber
                                                slice: 0
                                                do:
                                                - parse
                                                - space_dedupe
                                                - trim
                                                - object_field_set:
                                                    object: product
                                                    field: sku
                                            - find:
                                                path: product > product > brand
                                                do:
                                                - parse
                                                - space_dedupe
                                                - trim
                                                - object_field_set:
                                                    object: product
                                                    field: brand
                                            - find:
                                                path: product > product > displayname
                                                do:
                                                - parse
                                                - space_dedupe
                                                - trim
                                                - object_field_set:
                                                    object: product
                                                    field: name
                                            - find:
                                                path: product > product > longdescription
                                                do:
                                                - parse
                                                - space_dedupe
                                                - trim
                                                - object_field_set:
                                                    object: product
                                                    field: description
                                            - variable_get: cat1
                                            - if:
                                                match: (\S)
                                                do:
                                                - object_field_set:
                                                    object: product
                                                    field: category
                                                    joinby: "|"
                                            - variable_get: cat2
                                            - if:
                                                match: (\S)
                                                do:
                                                - object_field_set:
                                                    object: product
                                                    field: category
                                                    joinby: "|"
                                            - variable_get: cat3
                                            - if:
                                                match: (\S)
                                                do:
                                                - object_field_set:
                                                    object: product
                                                    field: category
                                                    joinby: "|"
                                            - object_save:
                                                name: product
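
The config reads the product details from a window.productData string embedded in a script tag and passes it through the Base64ZLIBDecode routine before converting the JSON to XML. To make that step easier to follow, here is a rough Python sketch of what such a decode could look like outside Diggernaut. It assumes the payload is Base64-encoded, zlib-compressed JSON, which is only what the routine name suggests; the function names and the sample data are hypothetical and not part of Diggernaut or the site.

```python
import base64
import json
import re
import zlib

# Same pattern as the filter in the config: captures the quoted payload.
PRODUCT_DATA_RE = re.compile(r"window.productData\s*=\s*\'\s*(.+)\s*\'\s*;")


def decode_product_data(script_text: str) -> dict:
    """Extract and decode the window.productData payload from a script body.

    Assumption: the payload is Base64-encoded, zlib-compressed JSON.
    """
    match = PRODUCT_DATA_RE.search(script_text)
    if match is None:
        raise ValueError("window.productData assignment not found")
    compressed = base64.b64decode(match.group(1).strip())
    return json.loads(zlib.decompress(compressed).decode("utf-8"))


def encode_product_data(data: dict) -> str:
    """Build a synthetic script body for the demo (hypothetical helper)."""
    payload = base64.b64encode(zlib.compress(json.dumps(data).encode("utf-8")))
    return "window.productData = '" + payload.decode("ascii") + "';"


if __name__ == "__main__":
    # Made-up record, only used to show the round trip.
    sample = {"product": {"styleNumber": "0000", "displayName": "Example"}}
    script = encode_product_data(sample)
    print(decode_product_data(script) == sample)
```

If the real payload uses a different compression wrapper (for example gzip or raw deflate), zlib.decompress would need a matching wbits argument.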

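The config stores the images, variations and category fields as single strings joined with "|". After downloading the data you may want those back as lists. Below is a small Python sketch of that post-processing. It assumes the export is a JSON file containing a list of flat product records keyed by the field names from the config; the actual export layout may differ, and the function names are hypothetical.

```python
import json

# Fields that the config fills with joinby: "|".
PIPE_JOINED_FIELDS = ("images", "variations", "category")


def split_joined_fields(record: dict) -> dict:
    """Return a copy of the record with pipe-joined fields turned into lists."""
    result = dict(record)
    for field in PIPE_JOINED_FIELDS:
        value = result.get(field)
        if isinstance(value, str):
            # Skip empty parts that a leading or trailing "|" would produce.
            result[field] = [part for part in value.split("|") if part]
    return result


def load_products(path: str) -> list:
    """Load an exported JSON file, assumed to hold a list of product records."""
    with open(path, encoding="utf-8") as handle:
        return [split_joined_fields(item) for item in json.load(handle)]
```

For example, load_products("products.json") would return the records with category, images and variations as Python lists, where products.json is a placeholder for whatever file name you saved the download under.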