Posted By

diggernaut on 12/07/17


Tagged

data etl Ecommerce scraping webscraping gap diggernaut athleta


Scraping athleta.gap.com with Diggernaut


Published in: Other
 

URL: https://www.diggernaut.com

This config can be used with the Diggernaut service to scrape athleta.gap.com and retrieve product information.

  1. Create a free account at diggernaut.com.
  2. Log in to your account.
  3. Create a project with any name and description you want.
  4. Open your new project by clicking it, and create a new digger with any name.
  5. You will be offered 3 options; choose the one that uses the meta-language.
  6. The config editor will open. Copy and paste the config code below and click the Save button.
  7. Change the digger mode from Debug to Active and run your digger.
  8. Wait for completion.
  9. Download the data.
  10. Schedule your runs if required.

---
config:
    debug: 2
    agent: Firefox
do:
- walk:
    to: http://athleta.gap.com/
    do:
    - find:
        path: div.topnav_atol>ul>li>a
        do:
        - parse:
            attr: href
        - space_dedupe
        - trim
        - if:
            match: \w+
            do:
            - link_add:
                pool: main
- walk:
    to: links
    pool: main
    do:
    - find:
        path: .sidebar-navigation
        do:
        - node_remove: h1
        - sequence:
            header: h2
            selector: h2,div
        - find:
            path: div.sequence
            do:
            - variable_clear: catname
            - find:
                path: h2
                do:
                - parse
                - space_dedupe
                - trim
                - variable_set: catname
            - find:
                path: .sidebar-navigation--category--link
                do:
                - pool_clear: pager
                - parse:
                    attr: href
                    filter:
                    - cid=(.+)
                - variable_set: cid
                - register_set: http://athleta.gap.com/resources/productSearch/v1/search?cid=<%cid%>&locale=en_US&isFacetsEnabled=true
                - link_add:
                    pool: pager
                - walk:
                    to: links
                    pool: pager
                    do:
                    - variable_clear: ptot
                    - find:
                        path: pageNumberTotal
                        do:
                        - parse
                        - if:
                            match: (^\s*[0-1]\s*$)
                            else:
                            - variable_set: ptot
                    - find:
                        path: pageNumberRequested
                        do:
                        - parse
                        - if:
                            match: (^\s*0\s*$)
                            do:
                            - variable_get: ptot
                            - if:
                                match: (\d)
                                do:
                                - if:
                                    gt: 1
                                    do:
                                    - eval:
                                        routine: js
                                        body: '(function (){var r = ""; for (var i = 1; i<<%ptot%>; i++){r += "<div>"+i+"</div>"}; return r;})();'
                                    - to_block
                                    - find:
                                        path: div
                                        do:
                                        - parse
                                        - variable_set: pageid
                                        - register_set: http://athleta.gap.com/resources/productSearch/v1/search?cid=<%cid%>&locale=en_US&pageId=<%pageid%>&isFacetsEnabled=true
                                        - link_add:
                                            pool: pager
                    - find:
                        path: productCategory > name
                        do:
                        - parse
                        - space_dedupe
                        - trim
                        - variable_set: catname2
                    - find:
                        path: productCategory > childProducts
                        do:
                        - find:
                            path: parentBusinessCatalogItemId
                            do:
                            - parse
                            - if:
                                match: (\S)
                                do:
                                - variable_set: pid
                                - register_set: http://athleta.gap.com/browse/product.do?pid=<%pid%>&cid=<%cid%>
                                - walk:
                                    to: value
                                    do:
                                    - variable_clear: isP
                                    - find:
                                        path: script:matches(gap.pageProductData\s*=\s*\{)
                                        do:
                                        - variable_set:
                                            field: isP
                                            value: 1
                                    - find:
                                        path: html
                                        do:
                                        - variable_get: isP
                                        - if:
                                            match: (1)
                                            do:
                                            - object_new: product
                                            - find:
                                                path: head
                                                do:
                                                - eval:
                                                    routine: js
                                                    body: '(function (){var d = new Date(); return d.toISOString()})();'
                                                - object_field_set:
                                                    object: product
                                                    field: date
                                                - static_get: url
                                                - object_field_set:
                                                    object: product
                                                    field: url
                                                - register_set: 'GAP'
                                                - object_field_set:
                                                    object: product
                                                    field: brand
                                            - find:
                                                path: meta[name="keywords"]
                                                do:
                                                - parse:
                                                    attr: content
                                                - object_field_set:
                                                    object: product
                                                    field: description
                                            - find:
                                                path: script:matches(gap.pageProductData\s*=\s*\{)
                                                do:
                                                - parse:
                                                    filter:
                                                    - gap\.currentBrand\s*=\s*\"(.+)\"\;
                                                - if:
                                                    match: (\S)
                                                    do:
                                                    - object_field_set:
                                                        object: product
                                                        field: brand
                                                - parse
                                                - normalize:
                                                    routine: replace_substring
                                                    args:
                                                        var\s*gap\s*=\s*window\.gap\s*\|\|\s*\{\s*\}\;: ''
                                                        gap\.pageProductData\s*=\s*: ''
                                                        \s*;\s*gap.currentBrand\s*=\s*.*\;: ''
                                                - normalize:
                                                    routine: json2xml
                                                - to_block
                                                - find:
                                                    path: productimages
                                                    do:
                                                    - parse:
                                                        format: html
                                                    - variable_set: imghtml
                                                - find:
                                                    path: variants > productstylecolors > productstylecolorimages
                                                    do:
                                                    - parse
                                                    - normalize:
                                                        routine: lower
                                                    - variable_set: imgpath
                                                    - register_set: <div><%imghtml%></div>
                                                    - to_block
                                                    - find:
                                                        path: safe_<%imgpath%>
                                                        do:
                                                        - variable_clear: getit
                                                        - find:
                                                            path: xlarge
                                                            do:
                                                            - parse
                                                            - if:
                                                                match: (\S)
                                                                do:
                                                                - variable_set:
                                                                    field: getit
                                                                    value: 1
                                                                - normalize:
                                                                    routine: url
                                                                - object_field_set:
                                                                    object: product
                                                                    field: images
                                                                    joinby: "|"
                                                        - variable_get: getit
                                                        - if:
                                                            match: (1)
                                                            else:
                                                            - find:
                                                                path: large
                                                                do:
                                                                - parse
                                                                - if:
                                                                    match: (\S)
                                                                    do:
                                                                    - variable_set:
                                                                        field: getit
                                                                        value: 1
                                                                    - normalize:
                                                                        routine: url
                                                                    - object_field_set:
                                                                        object: product
                                                                        field: images
                                                                        joinby: "|"
                                                        - variable_get: getit
                                                        - if:
                                                            match: (1)
                                                            else:
                                                            - find:
                                                                path: medium
                                                                do:
                                                                - parse
                                                                - if:
                                                                    match: (\S)
                                                                    do:
                                                                    - variable_set:
                                                                        field: getit
                                                                        value: 1
                                                                    - normalize:
                                                                        routine: url
                                                                    - object_field_set:
                                                                        object: product
                                                                        field: images
                                                                        joinby: "|"
                                                        - variable_get: getit
                                                        - if:
                                                            match: (1)
                                                            else:
                                                            - find:
                                                                path: small
                                                                do:
                                                                - parse
                                                                - if:
                                                                    match: (\S)
                                                                    do:
                                                                    - variable_set:
                                                                        field: getit
                                                                        value: 1
                                                                    - normalize:
                                                                        routine: url
                                                                    - object_field_set:
                                                                        object: product
                                                                        field: images
                                                                        joinby: "|"
                                                - find:
                                                    path: body_safe > variants > productstylecolors > colorname
                                                    do:
                                                    - parse
                                                    - if:
                                                        match: (\S)
                                                        do:
                                                        - object_field_set:
                                                            object: product
                                                            field: variations
                                                            joinby: "|"
                                                - find:
                                                    path: body_safe > name
                                                    do:
                                                    - parse
                                                    - if:
                                                        match: (\S)
                                                        do:
                                                        - object_field_set:
                                                            object: product
                                                            field: name
                                                - find:
                                                    path: body_safe > currentmaxprice, body_safe > currentminprice
                                                    do:
                                                    - parse:
                                                        filter:
                                                        - (\d+\.?\d*)
                                                    - if:
                                                        match: (\d+)
                                                        do:
                                                        - object_field_set:
                                                            object: product
                                                            field: price
                                                            type: float
                                                        - register_set: USD
                                                        - object_field_set:
                                                            object: product
                                                            field: currency
                                                - find:
                                                    path: styleid
                                                    slice: 0
                                                    do:
                                                    - parse
                                                    - object_field_set:
                                                        object: product
                                                        field: sku
                                            - find:
                                                path: body
                                                do:
                                                - find:
                                                    path: '.selected'
                                                    do:
                                                    - parse
                                                    - space_dedupe
                                                    - trim
                                                    - object_field_set:
                                                        object: product
                                                        field: category
                                                        joinby: "|"
                                                - variable_get: catname
                                                - if:
                                                    match: (\S)
                                                    do:
                                                    - object_field_set:
                                                        object: product
                                                        field: category
                                                        joinby: "|"
                                                - variable_get: catname2
                                                - if:
                                                    match: (\S)
                                                    do:
                                                    - object_field_set:
                                                        object: product
                                                        field: category
                                                        joinby: "|"
                                            - object_save:
                                                name: product
                    - find:
                        path: productCategory > childCategories
                        do:
                        - variable_clear: catname3
                        - find:
                            path: name
                            slice: 0
                            do:
                            - parse
                            - space_dedupe
                            - trim
                            - variable_set: catname3
                        - find:
                            path: parentBusinessCatalogItemId
                            do:
                            - parse
                            - if:
                                match: (\S)
                                do:
                                - variable_set: pid
                                - register_set: http://athleta.gap.com/browse/product.do?pid=<%pid%>&cid=<%cid%>
                                - walk:
                                    to: value
                                    do:
                                    - variable_clear: isP
                                    - find:
                                        path: script:matches(gap.pageProductData\s*=\s*\{)
                                        do:
                                        - variable_set:
                                            field: isP
                                            value: 1
                                    - find:
                                        path: html
                                        do:
                                        - variable_get: isP
                                        - if:
                                            match: (1)
                                            do:
                                            - object_new: product
                                            - find:
                                                path: head
                                                do:
                                                - eval:
                                                    routine: js
                                                    body: '(function (){var d = new Date(); return d.toISOString()})();'
                                                - object_field_set:
                                                    object: product
                                                    field: date
                                                - static_get: url
                                                - object_field_set:
                                                    object: product
                                                    field: url
                                                - register_set: 'GAP'
                                                - object_field_set:
                                                    object: product
                                                    field: brand
                                            - find:
                                                path: meta[name="keywords"]
                                                do:
                                                - parse:
                                                    attr: content
                                                - object_field_set:
                                                    object: product
                                                    field: description
                                            - find:
                                                path: script:matches(gap.pageProductData\s*=\s*\{)
                                                do:
                                                - parse:
                                                    filter:
                                                    - gap\.currentBrand\s*=\s*\"(.+)\"\;
                                                - if:
                                                    match: (\S)
                                                    do:
                                                    - object_field_set:
                                                        object: product
                                                        field: brand
                                                - parse
                                                - normalize:
                                                    routine: replace_substring
                                                    args:
                                                        var\s*gap\s*=\s*window\.gap\s*\|\|\s*\{\s*\}\;: ''
                                                        gap\.pageProductData\s*=\s*: ''
                                                        \s*;\s*gap.currentBrand\s*=\s*.*\;: ''
                                                - normalize:
                                                    routine: json2xml
                                                - to_block
                                                - find:
                                                    path: productimages
                                                    do:
                                                    - parse:
                                                        format: html
                                                    - variable_set: imghtml
                                                - find:
                                                    path: variants > productstylecolors > productstylecolorimages
                                                    do:
                                                    - parse
                                                    - normalize:
                                                        routine: lower
                                                    - variable_set: imgpath
                                                    - register_set: <div><%imghtml%></div>
                                                    - to_block
                                                    - find:
                                                        path: safe_<%imgpath%>
                                                        do:
                                                        - variable_clear: getit
                                                        - find:
                                                            path: xlarge
                                                            do:
                                                            - parse
                                                            - if:
                                                                match: (\S)
                                                                do:
                                                                - variable_set:
                                                                    field: getit
                                                                    value: 1
                                                                - normalize:
                                                                    routine: url
                                                                - object_field_set:
                                                                    object: product
                                                                    field: images
                                                                    joinby: "|"
                                                        - variable_get: getit
                                                        - if:
                                                            match: (1)
                                                            else:
                                                            - find:
                                                                path: large
                                                                do:
                                                                - parse
                                                                - if:
                                                                    match: (\S)
                                                                    do:
                                                                    - variable_set:
                                                                        field: getit
                                                                        value: 1
                                                                    - normalize:
                                                                        routine: url
                                                                    - object_field_set:
                                                                        object: product
                                                                        field: images
                                                                        joinby: "|"
                                                        - variable_get: getit
                                                        - if:
                                                            match: (1)
                                                            else:
                                                            - find:
                                                                path: medium
                                                                do:
                                                                - parse
                                                                - if:
                                                                    match: (\S)
                                                                    do:
                                                                    - variable_set:
                                                                        field: getit
                                                                        value: 1
                                                                    - normalize:
                                                                        routine: url
                                                                    - object_field_set:
                                                                        object: product
                                                                        field: images
                                                                        joinby: "|"
                                                        - variable_get: getit
                                                        - if:
                                                            match: (1)
                                                            else:
                                                            - find:
                                                                path: small
                                                                do:
                                                                - parse
                                                                - if:
                                                                    match: (\S)
                                                                    do:
                                                                    - variable_set:
                                                                        field: getit
                                                                        value: 1
                                                                    - normalize:
                                                                        routine: url
                                                                    - object_field_set:
                                                                        object: product
                                                                        field: images
                                                                        joinby: "|"
                                                - find:
                                                    path: body_safe > variants > productstylecolors > colorname
                                                    do:
                                                    - parse
                                                    - if:
                                                        match: (\S)
                                                        do:
                                                        - object_field_set:
                                                            object: product
                                                            field: variations
                                                            joinby: "|"
                                                - find:
                                                    path: body_safe > name
                                                    do:
                                                    - parse
                                                    - if:
                                                        match: (\S)
                                                        do:
                                                        - object_field_set:
                                                            object: product
                                                            field: name
                                                - find:
                                                    path: body_safe > currentmaxprice, body_safe > currentminprice
                                                    do:
                                                    - parse:
                                                        filter:
                                                        - (\d+\.?\d*)
                                                    - if:
                                                        match: (\d+)
                                                        do:
                                                        - object_field_set:
                                                            object: product
                                                            field: price
                                                            type: float
                                                        - register_set: USD
                                                        - object_field_set:
                                                            object: product
                                                            field: currency
                                                - find:
                                                    path: styleid
                                                    slice: 0
                                                    do:
                                                    - parse
                                                    - object_field_set:
                                                        object: product
                                                        field: sku
                                            - find:
                                                path: body
                                                do:
                                                - find:
                                                    path: '.selected'
                                                    do:
                                                    - parse
                                                    - space_dedupe
                                                    - trim
                                                    - object_field_set:
                                                        object: product
                                                        field: category
                                                        joinby: "|"
                                                - variable_get: catname
                                                - if:
                                                    match: (\S)
                                                    do:
                                                    - object_field_set:
                                                        object: product
                                                        field: category
                                                        joinby: "|"
                                                - variable_get: catname2
                                                - if:
                                                    match: (\S)
                                                    do:
                                                    - object_field_set:
                                                        object: product
                                                        field: category
                                                        joinby: "|"
                                            - object_save:
                                                name: product
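
For anyone who wants to follow the flow outside Diggernaut, here is a rough Python sketch of four steps the config above performs: pulling the cid out of a category link, building the paginated search URLs for the pager pool, stripping the JavaScript wrapper around gap.pageProductData before parsing it, and picking the largest available image size. Only the regular expressions and URL templates come from the config. The function names, the sample link and the sample script text are made up for illustration, and this is not how Diggernaut implements these commands internally.

```python
import json
import re

SEARCH_URL = "http://athleta.gap.com/resources/productSearch/v1/search"


def extract_cid(href):
    """Apply the config's `cid=(.+)` filter to a category link (hypothetical helper)."""
    match = re.search(r"cid=(.+)", href)
    return match.group(1) if match else None


def pager_urls(cid, page_total):
    """Build the search URLs that end up in the `pager` pool.

    The first request carries no pageId. The JS loop in the config then
    adds pageId=1 .. page_total-1, and only when page_total is above 1.
    """
    urls = [f"{SEARCH_URL}?cid={cid}&locale=en_US&isFacetsEnabled=true"]
    for page_id in range(1, page_total):
        urls.append(
            f"{SEARCH_URL}?cid={cid}&locale=en_US"
            f"&pageId={page_id}&isFacetsEnabled=true"
        )
    return urls


def extract_product_data(script_text):
    """Remove the JS around gap.pageProductData and parse the remaining JSON.

    Mirrors the three replace_substring patterns used in the config.
    """
    text = re.sub(r"var\s*gap\s*=\s*window\.gap\s*\|\|\s*\{\s*\};", "", script_text)
    text = re.sub(r"gap\.pageProductData\s*=\s*", "", text)
    text = re.sub(r"\s*;\s*gap\.currentBrand\s*=\s*.*;", "", text)
    return json.loads(text)


def pick_image(sizes):
    """Return the first non-empty URL, trying xlarge, large, medium, small in order."""
    for size in ("xlarge", "large", "medium", "small"):
        url = sizes.get(size, "").strip()
        if url:
            return url
    return None


if __name__ == "__main__":
    # Made-up inputs, shaped like what the config expects to see.
    cid = extract_cid("/browse/category.do?cid=46750")
    for url in pager_urls(cid, 3):
        print(url)
    script = (
        'var gap = window.gap || {}; '
        'gap.pageProductData = {"name": "Tee", "styleId": "123"}; '
        'gap.currentBrand = "AT";'
    )
    print(extract_product_data(script))
    print(pick_image({"xlarge": "", "large": "/img/l.jpg"}))
```

Note that the `cid=(.+)` filter keeps everything after `cid=`, so it relies on cid being the last parameter in the category link, just as the config does.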
