Posted By

diggernaut on 12/07/17


Tagged

data etl Ecommerce scraping webscraping gap diggernaut bananarepublic


Scraping bananarepublic.gap.com with Diggernaut


Published in: Other
 

URL: https://www.diggernaut.com

This config can be used with the Diggernaut service to scrape product information from bananarepublic.gap.com.

  1. Create a free account at diggernaut.com.
  2. Log in to your account.
  3. Create a project with any name and description you want.
  4. Open your new project by clicking it and create a new digger with any name.
  5. You will be offered 3 options; choose the one that uses the meta-language.
  6. The config editor will open; copy and paste the config code below and click the Save button.
  7. Change the digger mode from Debug to Active and run your digger.
  8. Wait for completion.
  9. Download the data.
  10. Schedule your runs if required.

---
config:
    debug: 2
    agent: Firefox
do:
- walk:
    to: http://bananarepublic.gap.com/
    do:
    - find:
        path: ul.brnavigation-brol>li>a
        do:
        - parse:
            attr: href
        - space_dedupe
        - trim
        - if:
            match: \/browse\/
            do:
            - normalize:
                routine: url
            - link_add:
                pool: main
- walk:
    to: links
    pool: main
    do:
    - find:
        path: .sidebar-navigation
        slice: 0
        do:
        - node_remove: h1
        - sequence:
            header: h2
            selector: h2,div
        - find:
            path: div.sequence
            do:
            - variable_clear: catname
            - find:
                path: h2
                do:
                - parse
                - space_dedupe
                - trim
                - variable_set: catname
            - find:
                path: .sidebar-navigation--category--link
                do:
                - pool_clear: pager
                - parse:
                    attr: href
                    filter:
                    - cid=(.+)
                - variable_set: cid
                - register_set: http://bananarepublic.gap.com/resources/productSearch/v1/search?cid=<%cid%>&locale=en_US&isFacetsEnabled=true
                - link_add:
                    pool: pager
                - walk:
                    to: links
                    pool: pager
                    do:
                    - variable_clear: ptot
                    - find:
                        path: pageNumberTotal
                        do:
                        - parse
                        - if:
                            match: (^\s*[0-1]\s*$)
                            else:
                            - variable_set: ptot
                    - find:
                        path: pageNumberRequested
                        do:
                        - parse
                        - if:
                            match: (^\s*0\s*$)
                            do:
                            - variable_get: ptot
                            - if:
                                match: (\d)
                                do:
                                - if:
                                    gt: 1
                                    do:
                                    - eval:
                                        routine: js
                                        body: '(function (){var r = ""; for (var i = 1; i<<%ptot%>; i++){r += "<div>"+i+"</div>"}; return r;})();'
                                    - to_block
                                    - find:
                                        path: div
                                        do:
                                        - parse
                                        - variable_set: pageid
                                        - register_set: http://bananarepublic.gap.com/resources/productSearch/v1/search?cid=<%cid%>&locale=en_US&pageId=<%pageid%>&isFacetsEnabled=true
                                        - link_add:
                                            pool: pager
                    - find:
                        path: productCategory > name
                        do:
                        - parse
                        - space_dedupe
                        - trim
                        - variable_set: catname2
                    - find:
                        path: productCategory > childProducts
                        do:
                        - find:
                            path: parentBusinessCatalogItemId
                            do:
                            - parse
                            - if:
                                match: (\S)
                                do:
                                - variable_set: pid
                                - register_set: http://bananarepublic.gap.com/browse/product.do?pid=<%pid%>&cid=<%cid%>
                                - walk:
                                    to: value
                                    do:
                                    - variable_clear: isP
                                    - find:
                                        path: script:matches(gap.pageProductData\s*=\s*\{)
                                        do:
                                        - variable_set:
                                            field: isP
                                            value: 1
                                    - find:
                                        path: html
                                        do:
                                        - variable_get: isP
                                        - if:
                                            match: (1)
                                            do:
                                            - object_new: product
                                            - find:
                                                path: head
                                                do:
                                                - eval:
                                                    routine: js
                                                    body: '(function (){var d = new Date(); return d.toISOString()})();'
                                                - object_field_set:
                                                    object: product
                                                    field: date
                                                - static_get: url
                                                - object_field_set:
                                                    object: product
                                                    field: url
                                                - register_set: 'Banana Republic'
                                                - object_field_set:
                                                    object: product
                                                    field: brand
                                                - find:
                                                    path: meta[name="keywords"]
                                                    do:
                                                    - parse:
                                                        attr: content
                                                    - object_field_set:
                                                        object: product
                                                        field: description
                                            - find:
                                                path: script:matches(gap.pageProductData\s*=\s*\{)
                                                do:
                                                - parse:
                                                    filter:
                                                    - gap\.currentBrand\s*=\s*\"(.+)\"\;
                                                - if:
                                                    match: (\S)
                                                    do:
                                                    - object_field_set:
                                                        object: product
                                                        field: brand
                                                - parse
                                                - normalize:
                                                    routine: replace_substring
                                                    args:
                                                        var\s*gap\s*=\s*window\.gap\s*\|\|\s*\{\s*\}\;: ''
                                                        gap\.pageProductData\s*=\s*: ''
                                                        \s*;\s*gap.currentBrand\s*=\s*.*\;: ''
                                                - normalize:
                                                    routine: json2xml
                                                - to_block
                                                - find:
                                                    path: productimages
                                                    do:
                                                    - parse:
                                                        format: html
                                                    - variable_set: imghtml
                                                - find:
                                                    path: variants > productstylecolors > productstylecolorimages
                                                    do:
                                                    - parse
                                                    - normalize:
                                                        routine: lower
                                                    - variable_set: imgpath
                                                    - register_set: <div><%imghtml%></div>
                                                    - to_block
                                                    - find:
                                                        path: safe_<%imgpath%>
                                                        do:
                                                        - variable_clear: getit
                                                        - find:
                                                            path: xlarge
                                                            do:
                                                            - parse
                                                            - if:
                                                                match: (\S)
                                                                do:
                                                                - variable_set:
                                                                    field: getit
                                                                    value: 1
                                                                - normalize:
                                                                    routine: url
                                                                - object_field_set:
                                                                    object: product
                                                                    field: images
                                                                    joinby: "|"
                                                        - variable_get: getit
                                                        - if:
                                                            match: (1)
                                                            else:
                                                            - find:
                                                                path: large
                                                                do:
                                                                - parse
                                                                - if:
                                                                    match: (\S)
                                                                    do:
                                                                    - variable_set:
                                                                        field: getit
                                                                        value: 1
                                                                    - normalize:
                                                                        routine: url
                                                                    - object_field_set:
                                                                        object: product
                                                                        field: images
                                                                        joinby: "|"
                                                        - variable_get: getit
                                                        - if:
                                                            match: (1)
                                                            else:
                                                            - find:
                                                                path: medium
                                                                do:
                                                                - parse
                                                                - if:
                                                                    match: (\S)
                                                                    do:
                                                                    - variable_set:
                                                                        field: getit
                                                                        value: 1
                                                                    - normalize:
                                                                        routine: url
                                                                    - object_field_set:
                                                                        object: product
                                                                        field: images
                                                                        joinby: "|"
                                                        - variable_get: getit
                                                        - if:
                                                            match: (1)
                                                            else:
                                                            - find:
                                                                path: small
                                                                do:
                                                                - parse
                                                                - if:
                                                                    match: (\S)
                                                                    do:
                                                                    - variable_set:
                                                                        field: getit
                                                                        value: 1
                                                                    - normalize:
                                                                        routine: url
                                                                    - object_field_set:
                                                                        object: product
                                                                        field: images
                                                                        joinby: "|"
                                                - find:
                                                    path: body_safe > variants > productstylecolors > colorname
                                                    do:
                                                    - parse
                                                    - if:
                                                        match: (\S)
                                                        do:
                                                        - object_field_set:
                                                            object: product
                                                            field: variations
                                                            joinby: "|"
                                                - find:
                                                    path: body_safe > name
                                                    do:
                                                    - parse
                                                    - if:
                                                        match: (\S)
                                                        do:
                                                        - object_field_set:
                                                            object: product
                                                            field: name
                                                - find:
                                                    path: body_safe > currentmaxprice, body_safe > currentminprice
                                                    do:
                                                    - parse:
                                                        filter:
                                                        - (\d+\.?\d*)
                                                    - if:
                                                        match: (\d+)
                                                        do:
                                                        - object_field_set:
                                                            object: product
                                                            field: price
                                                            type: float
                                                - register_set: USD
                                                - object_field_set:
                                                    object: product
                                                    field: currency
                                                - find:
                                                    path: styleid
                                                    slice: 0
                                                    do:
                                                    - parse
                                                    - object_field_set:
                                                        object: product
                                                        field: sku
                                                    - variable_set: sid
                                            - find:
                                                path: body
                                                do:
                                                - find:
                                                    path: '#topNavWrapper a[class*=_selected]'
                                                    do:
                                                    - parse
                                                    - space_dedupe
                                                    - trim
                                                    - object_field_set:
                                                        object: product
                                                        field: category
                                                        joinby: "|"
                                                - variable_get: catname
                                                - if:
                                                    match: (\S)
                                                    do:
                                                    - object_field_set:
                                                        object: product
                                                        field: category
                                                        joinby: "|"
                                                - variable_get: catname2
                                                - if:
                                                    match: (\S)
                                                    do:
                                                    - object_field_set:
                                                        object: product
                                                        field: category
                                                        joinby: "|"
                                            - object_save:
                                                name: product

                    - find:
                        path: productCategory > childCategories
                        do:
                        - variable_clear: catname3
                        - find:
                            path: name
                            slice: 0
                            do:
                            - parse
                            - space_dedupe
                            - trim
                            - variable_set: catname3
                        - find:
                            path: parentBusinessCatalogItemId
                            do:
                            - parse
                            - if:
                                match: (\S)
                                do:
                                - variable_set: pid
                                - register_set: http://bananarepublic.gap.com/browse/product.do?pid=<%pid%>&cid=<%cid%>
                                - walk:
                                    to: value
                                    do:
                                    - variable_clear: isP
                                    - find:
                                        path: script:matches(gap.pageProductData\s*=\s*\{)
                                        do:
                                        - variable_set:
                                            field: isP
                                            value: 1
                                    - find:
                                        path: html
                                        do:
                                        - variable_get: isP
                                        - if:
                                            match: (1)
                                            do:
                                            - object_new: product
                                            - find:
                                                path: head
                                                do:
                                                - eval:
                                                    routine: js
                                                    body: '(function (){var d = new Date(); return d.toISOString()})();'
                                                - object_field_set:
                                                    object: product
                                                    field: date
                                                - static_get: url
                                                - object_field_set:
                                                    object: product
                                                    field: url
                                                - register_set: 'Banana Republic'
                                                - object_field_set:
                                                    object: product
                                                    field: brand
                                                - find:
                                                    path: meta[name="keywords"]
                                                    do:
                                                    - parse:
                                                        attr: content
                                                    - object_field_set:
                                                        object: product
                                                        field: description
                                            - find:
                                                path: script:matches(gap.pageProductData\s*=\s*\{)
                                                do:
                                                - parse:
                                                    filter:
                                                    - gap\.currentBrand\s*=\s*\"(.+)\"\;
                                                - if:
                                                    match: (\S)
                                                    do:
                                                    - object_field_set:
                                                        object: product
                                                        field: brand
                                                - parse
                                                - normalize:
                                                    routine: replace_substring
                                                    args:
                                                        var\s*gap\s*=\s*window\.gap\s*\|\|\s*\{\s*\}\;: ''
                                                        gap\.pageProductData\s*=\s*: ''
                                                        \s*;\s*gap.currentBrand\s*=\s*.*\;: ''
                                                - normalize:
                                                    routine: json2xml
                                                - to_block
                                                - find:
                                                    path: productimages
                                                    do:
                                                    - parse:
                                                        format: html
                                                    - variable_set: imghtml
                                                - find:
                                                    path: variants > productstylecolors > productstylecolorimages
                                                    do:
                                                    - parse
                                                    - normalize:
                                                        routine: lower
                                                    - variable_set: imgpath
                                                    - register_set: <div><%imghtml%></div>
                                                    - to_block
                                                    - find:
                                                        path: safe_<%imgpath%>
                                                        do:
                                                        - variable_clear: getit
                                                        - find:
                                                            path: xlarge
                                                            do:
                                                            - parse
                                                            - if:
                                                                match: (\S)
                                                                do:
                                                                - variable_set:
                                                                    field: getit
                                                                    value: 1
                                                                - normalize:
                                                                    routine: url
                                                                - object_field_set:
                                                                    object: product
                                                                    field: images
                                                                    joinby: "|"
                                                        - variable_get: getit
                                                        - if:
                                                            match: (1)
                                                            else:
                                                            - find:
                                                                path: large
                                                                do:
                                                                - parse
                                                                - if:
                                                                    match: (\S)
                                                                    do:
                                                                    - variable_set:
                                                                        field: getit
                                                                        value: 1
                                                                    - normalize:
                                                                        routine: url
                                                                    - object_field_set:
                                                                        object: product
                                                                        field: images
                                                                        joinby: "|"
                                                        - variable_get: getit
                                                        - if:
                                                            match: (1)
                                                            else:
                                                            - find:
                                                                path: medium
                                                                do:
                                                                - parse
                                                                - if:
                                                                    match: (\S)
                                                                    do:
                                                                    - variable_set:
                                                                        field: getit
                                                                        value: 1
                                                                    - normalize:
                                                                        routine: url
                                                                    - object_field_set:
                                                                        object: product
                                                                        field: images
                                                                        joinby: "|"
                                                        - variable_get: getit
                                                        - if:
                                                            match: (1)
                                                            else:
                                                            - find:
                                                                path: small
                                                                do:
                                                                - parse
                                                                - if:
                                                                    match: (\S)
                                                                    do:
                                                                    - variable_set:
                                                                        field: getit
                                                                        value: 1
                                                                    - normalize:
                                                                        routine: url
                                                                    - object_field_set:
                                                                        object: product
                                                                        field: images
                                                                        joinby: "|"
                                                - find:
                                                    path: body_safe > variants > productstylecolors > colorname
                                                    do:
                                                    - parse
                                                    - if:
                                                        match: (\S)
                                                        do:
                                                        - object_field_set:
                                                            object: product
                                                            field: variations
                                                            joinby: "|"
                                                - find:
                                                    path: body_safe > name
                                                    do:
                                                    - parse
                                                    - if:
                                                        match: (\S)
                                                        do:
                                                        - object_field_set:
                                                            object: product
                                                            field: name
                                                - find:
                                                    path: body_safe > currentmaxprice, body_safe > currentminprice
                                                    do:
                                                    - parse:
                                                        filter:
                                                        - (\d+\.?\d*)
                                                    - if:
                                                        match: (\d+)
                                                        do:
                                                        - object_field_set:
                                                            object: product
                                                            field: price
                                                            type: float
                                                - register_set: USD
                                                - object_field_set:
                                                    object: product
                                                    field: currency
                                                - find:
                                                    path: styleid
                                                    slice: 0
                                                    do:
                                                    - parse
                                                    - object_field_set:
                                                        object: product
                                                        field: sku
                                                    - variable_set: sid
                                            - find:
                                                path: body
                                                do:
                                                - find:
                                                    path: '#topNavWrapper a[class*=_selected]'
                                                    do:
                                                    - parse
                                                    - space_dedupe
                                                    - trim
                                                    - object_field_set:
                                                        object: product
                                                        field: category
                                                        joinby: "|"
                                                - variable_get: catname
                                                - if:
                                                    match: (\S)
                                                    do:
                                                    - object_field_set:
                                                        object: product
                                                        field: category
                                                        joinby: "|"
                                                - variable_get: catname2
                                                - if:
                                                    match: (\S)
                                                    do:
                                                    - object_field_set:
                                                        object: product
                                                        field: category
                                                        joinby: "|"
                                                - variable_get: catname3
                                                - if:
                                                    match: (\S)
                                                    do:
                                                    - object_field_set:
                                                        object: product
                                                        field: category
                                                        joinby: "|"
                                            - object_save:
                                                name: product
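
The pagination part of the config above can be hard to follow in meta-language form, so here is a minimal Python sketch of the same idea: the first search URL for a category carries no pageId, and when the reported page total is greater than 1, one extra URL is queued per remaining page (pageId from 1 up to total minus 1, matching the loop in the eval step). The URL templates are copied from the config; the names SEARCH_URL and pager_urls are hypothetical and exist only for this illustration. The sketch only builds URLs and makes no requests.

```python
# Illustration only: mirrors how the config fills its "pager" link pool.
# SEARCH_URL and pager_urls are hypothetical names, not part of Diggernaut.
SEARCH_URL = "http://bananarepublic.gap.com/resources/productSearch/v1/search"


def pager_urls(cid, total_pages):
    """Return the search URLs the config would queue for one category id."""
    # First request: no pageId parameter (the config treats this as page 0).
    urls = [f"{SEARCH_URL}?cid={cid}&locale=en_US&isFacetsEnabled=true"]
    # Extra pages are queued only when the reported total is greater than 1,
    # with pageId running from 1 to total_pages - 1.
    if total_pages > 1:
        for page_id in range(1, total_pages):
            urls.append(
                f"{SEARCH_URL}?cid={cid}&locale=en_US"
                f"&pageId={page_id}&isFacetsEnabled=true"
            )
    return urls


if __name__ == "__main__":
    # "12345" is a made-up category id used purely as sample input.
    for url in pager_urls("12345", 3):
        print(url)
```

With a page total of 3 the sketch yields three URLs: the initial one without pageId, then pageId=1 and pageId=2.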
