Posted By

diggernaut on 12/08/17


Tagged

data etl Ecommerce scraping webscraping diggernaut betseyjohnson



Scraping betseyjohnson.com with Diggernaut


 / Published in: Other
 

URL: https://www.diggernaut.com

This config can be used with the Diggernaut service to scrape betseyjohnson.com and retrieve product information.

  1. Create a free account at diggernaut.com.
  2. Log in to your account.
  3. Create a project with any name and description you want.
  4. Open your new project by clicking it and create a new digger with any name.
  5. You will be offered 3 options; choose the one that uses the meta-language.
  6. The config editor will open; copy and paste the config code into it and click the save button.
  7. Change the digger mode from Debug to Active and run your digger.
  8. Wait for the run to complete.
  9. Download the data.
  10. Schedule your runs if required.
---
config:
    debug: 2
    agent: Firefox
do:
- walk:
    to: http://www.betseyjohnson.com/
    do:
    - find:
        path: .off > a
        do:
        - parse:
            attr: href
        - normalize:
            routine: replace_substring
            args:
                '\s*#': ''
        - normalize:
            routine: url
        - link_add:
            pool: main
- walk:
    to: links
    pool: main
    do:
    - find:
        path: .categoryNav a
        do:
        - parse:
            attr: href
        - walk:
            to: value
            do:
            - find:
                path: .viewAll
                do:
                - parse:
                    attr: data-href
                - normalize:
                    routine: url
                - walk:
                    to: value
                    do:
                    - find:
                        path: .mainImage
                        do:
                        - parse:
                            attr: href
                            filter:
                            - (.+)\?
                            - (.+)
                        - normalize:
                            routine: url
                        - link_add:
                            pool: sub
            - find:
                path: .mainImage
                do:
                - parse:
                    attr: href
                    filter:
                    - (.+)\?
                    - (.+)
                - normalize:
                    routine: url
                - link_add:
                    pool: sub
- walk:
    to: links
    pool: sub
    do:
    - object_new: product
    - find:
        in: doc
        path: head
        do:
        - eval:
            routine: js
            body: '(function (){var d = new Date(); return d.toISOString()})();'
        - object_field_set:
            object: product
            field: date
        - static_get: url
        - object_field_set:
            object: product
            field: url
    - find:
        path: 'meta[itemprop="productID"]'
        do:
        - parse:
            attr: content
        - space_dedupe
        - trim
        - object_field_set:
            object: product
            field: sku
    - find:
        path: .breadcrumb a
        do:
        - parse
        - space_dedupe
        - trim
        - normalize:
            routine: lower
        - object_field_set:
            object: product
            field: category
            joinby: "|"
    - find:
        path: 'select.COLOR_NAME > option'
        do:
        - parse:
            attr: value
        - space_dedupe
        - trim
        - if:
            match: (\S)
            do:
            - object_field_set:
                object: product
                field: variations
                joinby: "|"
    - find:
        path: .item-name
        do:
        - parse
        - space_dedupe
        - trim
        - object_field_set:
            object: product
            field: name
    - find:
        path: .productPrice
        do:
        - parse:
            filter:
            - ^\s*\$\s*(\d+\.?\d*)
        - if:
            match: (\d+)
            do:
            - object_field_set:
                object: product
                field: price
                type: float
            - register_set: USD
            - object_field_set:
                object: product
                field: currency
    - find:
        path: script:matches(variantMatrices)
        do:
        - parse:
            filter:
            - \/\/var\s*thumbsAndStuff\s*=\s*(.+);\s*
        - normalize:
            routine: json2xml
        - to_block
        - find:
            path: alts
            do:
            - parse
            - walk:
                to: http://www.betseyjohnson.com/scene7_proxy.jsp?cb=&req=set,json,utf-8&id=<%register%>
                do:
                - find:
                    path: body
                    do:
                    - parse
                    - normalize:
                        routine: replace_substring
                        args:
                        - \/\*jsonp\*\/\s*: ''
                        - s7jsonResponse\(: ''
                        - \}\}\}\,\"\"\)\;: '}}}'
                    - normalize:
                        routine: unescape_html
                    - normalize:
                        routine: json2xml
                    - to_block
                    - find:
                        path: item > i > n
                        do:
                        - parse
                        - register_set: http://s7d9.scene7.com/is/image/<%register%>?scl=1
                        - object_field_set:
                            object: product
                            field: images
                            joinby: "|"
    - find:
        path: .detailsWrap > p
        do:
        - parse
        - space_dedupe
        - trim
        - object_field_set:
            object: product
            field: description
    - find:
        path: meta[itemprop="brand"]
        do:
        - parse:
            attr: content
        - space_dedupe
        - trim
        - object_field_set:
            object: product
            field: brand
    - object_save:
        name: product
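
The regex-based steps in the config (the price filter, the two-step URL filter, and the JSONP cleanup before json2xml) can be tried outside Diggernaut. The following is a minimal Python sketch, not Diggernaut code: the patterns are copied from the config, while the helper names, the assumption that filters are tried in order with the first match winning, and the sample inputs are all hypothetical.

```python
import json
import re

# Patterns copied verbatim from the config.
PRICE_FILTER = r"^\s*\$\s*(\d+\.?\d*)"
URL_FILTERS = [r"(.+)\?", r"(.+)"]
JSONP_REPLACEMENTS = [
    (r"\/\*jsonp\*\/\s*", ""),
    (r"s7jsonResponse\(", ""),
    (r"\}\}\}\,\"\"\)\;", "}}}"),
]


def first_match(patterns, text):
    """Return group 1 of the first pattern that matches, or None.

    Assumption: filters are tried in order and the first match wins,
    which is how the "(.+)\\?" then "(.+)" pair reads as a fallback.
    """
    for pattern in patterns:
        found = re.search(pattern, text)
        if found:
            return found.group(1)
    return None


def strip_jsonp(text):
    """Apply the replace_substring pairs in order, leaving plain JSON."""
    for pattern, replacement in JSONP_REPLACEMENTS:
        text = re.sub(pattern, replacement, text)
    return text


if __name__ == "__main__":
    # All inputs below are made-up samples, not real site responses.
    print(first_match([PRICE_FILTER], "  $ 128.00 "))
    print(first_match(URL_FILTERS, "/product/123?color=red"))
    print(first_match(URL_FILTERS, "/product/123"))
    sample = '/*jsonp*/s7jsonResponse({"set":{"item":{"i":{"n":"Brand/img_1"}}}},"");'
    print(json.loads(strip_jsonp(sample)))
```

The URL filter pair drops the query string when there is one and otherwise keeps the whole link, so the same product is not queued twice under different query parameters.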
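
Because the config sets joinby: "|" on the category, variations and images fields, each of those fields should arrive in the downloaded data as a single pipe-separated string. A small Python sketch for turning them back into lists follows; the record layout and the sample values are assumptions for illustration, and the exact export format depends on what you download from Diggernaut.

```python
# Fields that the config writes with joinby: "|".
JOINED_FIELDS = ("category", "variations", "images")


def split_joined(record, fields=JOINED_FIELDS, sep="|"):
    """Return a copy of record with the listed fields split into lists."""
    result = dict(record)
    for field in fields:
        value = result.get(field)
        if isinstance(value, str):
            # Drop empty parts that a leading or trailing separator would leave.
            result[field] = [part for part in value.split(sep) if part]
    return result


if __name__ == "__main__":
    # Hypothetical record shaped like the product object built by the config.
    record = {
        "sku": "12345",
        "name": "Sample Bag",
        "category": "home|handbags|totes",
        "variations": "BLACK|RED",
        "images": "http://s7d9.scene7.com/is/image/Brand/img_1?scl=1",
    }
    print(split_joined(record))
```

Fields that are missing from a record are left out of the result rather than filled with empty lists, so incomplete products remain easy to spot.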
