Posted By

martinson on 09/13/16


Tagged

data web scraping instagram


Public Instagram User Scraper


Published in: Other


URL: https://www.diggernaut.com

Hey guys,

I'm sharing my Diggernaut scripts for web scraping; I hope they will be useful to you.

This script scrapes user accounts without logging in to Instagram, so there is no risk to your own account. What you can get with it: all information about the user (full name, username, id, avatar, number of follows and followers, number of posts) and information about their posts (URL, image, number of likes, the people who liked it, comments, caption). You will probably not want every post but only, say, the 10 (or 30) most recent; you can adjust that in the settings. You can also set the script's mode to simple or extended. In simple mode it does not retrieve the list of people who liked a post or its comments, so it runs faster and uses less bandwidth.

If you look at the script (lines 5-7):

- type: fieldset
  fields:
  - user: someusername

you can see the settings you may adjust. Replace "someusername" with the username of the Instagram account you want to scrape. You can also scrape multiple users; in that case, add additional settings entries, like below:

- type: fieldset
  fields:
  - user: someusername1
  - user: someusername2
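
To make the multi-user setting a bit more concrete, here is a rough Python sketch (not Diggernaut code). It assumes each `user` value is simply dropped into the `<%user%>` placeholder of the first `walk` URL, which is how the script reads; `profile_urls` is a made-up helper name used only for illustration.

```python
def profile_urls(usernames, template="https://www.instagram.com/<%user%>/"):
    """Return one profile URL per username by filling the <%user%> placeholder.

    Assumption: the iterator runs the script once per `user` value and
    substitutes it into the walk URL; this only mimics that substitution.
    """
    return [template.replace("<%user%>", name) for name in usernames]


if __name__ == "__main__":
    for url in profile_urls(["someusername1", "someusername2"]):
        print(url)
```

So with the two users above, the script would start from two profile pages, one per username.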

Here is the full script:

    1. ---
    2. config:
    3.     agent: Firefox
    4. iterator:
    5. - type: fieldset
    6.   fields:
    7.   - user: someusername
    8. do:
    9. - walk:
   10.     to: https://www.instagram.com/<%user%>/
   11.     do:
   12.     - variable_set:
   13.         field: repeat
   14.         value: 'no'
   15.     - variable_clear: queryid
   16.     - find:
   17.         path: body
   18.         do:
   19.         - parse:
   20.             filter: window\._sharedData\s+\=\s+([^;]+);
   21.         - normalize:
   22.             routine: json2xml
   23.         - to_block
   24.         - find:
   25.             path: config>csrf_token
   26.             do:
   27.             - parse
   28.             - variable_set: token
   29.         - cookie_get: mid
   30.         - variable_set: mid
   31.
   32.     - find:
   33.         path: script[type="text/javascript"]
   34.         do:
   35.         - parse:
   36.             attr: src
   37.         - if:
   38.             match: Commons\.js
   39.             do:
   40.             - normalize:
   41.                 routine: url
   42.             - walk:
   43.                 to: value
   44.                 headers:
   45.                     Cookie: csrftoken=<%token%>; mid=<%mid%>;
   46.                 do:
   47.                 - find:
   48.                     path: script
   49.                     do:
   50.                     - parse
   51.                     - normalize:
   52.                         routine: unescape_html
   53.
   54.                     - filter:
   55.                         args:
   56.                             return\s*e\.profilePosts\.byUserId\.get\(t\)\.pagination\}\,queryId\:\"(\d+)\"
   57.                     - if:
   58.                         match: \d+
   59.                         do:
   60.                         - variable_set: queryid
   61.     - find:
   62.         path: body
   63.         do:
   64.         - cookie_get: mid
   65.         - variable_set: mid
   66.         - parse:
   67.             filter: window\._sharedData\s+\=\s+([^;]+);
   68.         - normalize:
   69.             routine: json2xml
   70.         - to_block
   71.         - find:
   72.             path: config>csrf_token
   73.             do:
   74.             - parse
   75.             - variable_set: token
   76.         - find:
   77.             path: entry_data>profilepage
   78.             do:
   79.             - register_set: https://www.instagram.com/p
   80.             - variable_set: baseurl
   81.             - object_new: user
   82.             - find:
   83.                 path: user>id
   84.                 do:
   85.                 - parse
   86.                 - object_field_set:
   87.                     object: user
   88.                     field: id
   89.                 - variable_set: userid
   90.             - find:
   91.                 path: user>username
   92.                 do:
   93.                 - parse
   94.                 - object_field_set:
   95.                     object: user
   96.                     field: username
   97.             - find:
   98.                 path: user>full_name
   99.                 do:
  100.                 - parse
  101.                 - object_field_set:
  102.                     object: user
  103.                     field: full_name
  104.             - find:
  105.                 path: user>biography
  106.                 do:
  107.                 - parse
  108.                 - object_field_set:
  109.                     object: user
  110.                     field: biography
  111.             - find:
  112.                 path: user>profile_pic_url
  113.                 do:
  114.                 - parse
  115.                 - object_field_set:
  116.                     object: user
  117.                     field: profile_pic_url
  118.             - find:
  119.                 path: user>profile_pic_url_hd
  120.                 do:
  121.                 - parse
  122.                 - object_field_set:
  123.                     object: user
  124.                     field: profile_pic_url_hd
  125.             - find:
  126.                 path: user>external_url
  127.                 do:
  128.                 - parse
  129.                 - object_field_set:
  130.                     object: user
  131.                     field: external_url
  132.             - find:
  133.                 path: user>external_url_linkshimmed
  134.                 do:
  135.                 - parse
  136.                 - object_field_set:
  137.                     object: user
  138.                     field: external_url_linkshimmed
  139.             - find:
  140.                 path: user>country_block
  141.                 do:
  142.                 - parse
  143.                 - object_field_set:
  144.                     object: user
  145.                     field: country_block
  146.             - find:
  147.                 path: user>follows>count
  148.                 do:
  149.                 - parse
  150.                 - object_field_set:
  151.                     object: user
  152.                     field: follows
  153.             - find:
  154.                 path: user>followed_by>count
  155.                 do:
  156.                 - parse
  157.                 - object_field_set:
  158.                     object: user
  159.                     field: followed_by
  160.             - find:
  161.                 path: user>media>nodes
  162.                 do:
  163.                 - object_new: nodes
  164.                 - find:
  165.                     path: id
  166.                     do:
  167.                     - parse
  168.                     - object_field_set:
  169.                         object: nodes
  170.                         field: id
  171.                 - find:
  172.                     path: is_video
  173.                     do:
  174.                     - parse
  175.                     - object_field_set:
  176.                         object: nodes
  177.                         field: is_video
  178.                 - find:
  179.                     path: video_views
  180.                     do:
  181.                     - parse
  182.                     - object_field_set:
  183.                         object: nodes
  184.                         field: video_views
  185.                 - find:
  186.                     path: date
  187.                     do:
  188.                     - parse
  189.                     - normalize:
  190.                         routine: date_format
  191.                         args:
  192.                             format_in: '%s'
  193.                             format_out: '%Y-%m-%d %H:%M:%S'
  194.                     - object_field_set:
  195.                         object: nodes
  196.                         field: date
  197.                 - find:
  198.                     path: dimensions>width
  199.                     do:
  200.                     - parse
  201.                     - object_field_set:
  202.                         object: nodes
  203.                         field: width
  204.                 - find:
  205.                     path: dimensions>height
  206.                     do:
  207.                     - parse
  208.                     - object_field_set:
  209.                         object: nodes
  210.                         field: height
  211.                 - find:
  212.                     path: likes>count
  213.                     do:
  214.                     - parse
  215.                     - object_field_set:
  216.                         object: nodes
  217.                         field: likes_count
  218.                 - find:
  219.                     path: comments>count
  220.                     do:
  221.                     - parse
  222.                     - object_field_set:
  223.                         object: nodes
  224.                         field: comments_count
  225.                 - find:
  226.                     path: comments_disabled
  227.                     do:
  228.                     - parse
  229.                     - object_field_set:
  230.                         object: nodes
  231.                         field: comments_disabled
  232.                 - find:
  233.                     path: caption_safe
  234.                     do:
  235.                     - parse
  236.                     - object_field_set:
  237.                         object: nodes
  238.                         field: caption
  239.                 - find:
  240.                     path: thumbnail_src
  241.                     do:
  242.                     - parse
  243.                     - object_field_set:
  244.                         object: nodes
  245.                         field: thumbnail
  246.                 - find:
  247.                     path: display_src
  248.                     do:
  249.                     - parse
  250.                     - object_field_set:
  251.                         object: nodes
  252.                         field: media
  253.                 - object_save:
  254.                     name: nodes
  255.                     to: user
  256.             - find:
  257.                 path: user>media>page_info
  258.                 do:
  259.                 - find:
  260.                     path: has_next_page
  261.                     do:
  262.                     - parse
  263.                     - if:
  264.                         match: true
  265.                         do:
  266.                         - variable_set:
  267.                             field: repeat
  268.                             value: 'yes'
  269.                 - find:
  270.                     path: end_cursor
  271.                     do:
  272.                     - parse
  273.                     - eval:
  274.                         routine: js
  275.                         body: '(function () {return encodeURIComponent("<%register%>")})();'
  276.                     - variable_set: cursor
  277.             - walk:
  278.                 to: https://www.instagram.com/graphql/query/?query_id=<%queryid%>&variables=%7B%22id%22%3A%22<%userid%>%22%2C%22first%22%3A12%2C%22after%22%3A%22<%cursor%>%22%7D
  279.                 repeat: <%repeat%>
  280.                 headers:
  281.                     accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8
  282.                     accept-language: ru-RU,ru;q=0.8,en-US;q=0.6,en;q=0.4
  283.                     cache-control: max-age=0
  284.                     upgrade-insecure-requests: 1
  285.                     Cookie: mid=<%mid%>;
  286.                 do:
  287.                 - sleep: 5
  288.                 - variable_set:
  289.                     field: repeat
  290.                     value: 'no'
  291.                 - find:
  292.                     path: edge_owner_to_timeline_media>page_info
  293.                     do:
  294.                     - find:
  295.                         path: has_next_page
  296.                         do:
  297.                         - parse
  298.                         - if:
  299.                             match: true
  300.                             do:
  301.                             - variable_set:
  302.                                 field: repeat
  303.                                 value: 'yes'
  304.                     - find:
  305.                         path: end_cursor
  306.                         do:
  307.                         - parse
  308.                         - eval:
  309.                             routine: js
  310.                             body: '(function () {return encodeURIComponent("<%register%>")})();'
  311.                         - variable_set: cursor
  312.                 - find:
  313.                     path: edge_owner_to_timeline_media>count
  314.                     do:
  315.                     - parse
  316.                     - object_field_set:
  317.                         object: user
  318.                         field: media_count
  319.                 - find:
  320.                     path: edge_owner_to_timeline_media>edges>node
  321.                     do:
  322.                     - object_new: nodes
  323.                     - find:
  324.                         path: id
  325.                         do:
  326.                         - parse
  327.                         - object_field_set:
  328.                             object: nodes
  329.                             field: id
  330.                     - find:
  331.                         path: is_video
  332.                         do:
  333.                         - parse
  334.                         - object_field_set:
  335.                             object: nodes
  336.                             field: is_video
  337.                     - find:
  338.                         path: video_views
  339.                         do:
  340.                         - parse
  341.                         - object_field_set:
  342.                             object: nodes
  343.                             field: video_views
  344.                     - find:
  345.                         path: taken_at_timestamp
  346.                         do:
  347.                         - parse
  348.                         - normalize:
  349.                             routine: date_format
  350.                             args:
  351.                                 format_in: '%s'
  352.                                 format_out: '%Y-%m-%d %H:%M:%S'
  353.                         - object_field_set:
  354.                             object: nodes
  355.                             field: date
  356.                     - find:
  357.                         path: dimensions>width
  358.                         do:
  359.                         - parse
  360.                         - object_field_set:
  361.                             object: nodes
  362.                             field: width
  363.                     - find:
  364.                         path: dimensions>height
  365.                         do:
  366.                         - parse
  367.                         - object_field_set:
  368.                             object: nodes
  369.                             field: height
  370.                     - find:
  371.                         path: edge_media_preview_like>count
  372.                         do:
  373.                         - parse
  374.                         - object_field_set:
  375.                             object: nodes
  376.                             field: likes_count
  377.                     - find:
  378.                         path: edge_media_to_comment>count
  379.                         do:
  380.                         - parse
  381.                         - object_field_set:
  382.                             object: nodes
  383.                             field: comments_count
  384.                     - find:
  385.                         path: comments_disabled
  386.                         do:
  387.                         - parse
  388.                         - object_field_set:
  389.                             object: nodes
  390.                             field: comments_disabled
  391.                     - find:
  392.                         path: edge_media_to_caption
  393.                         do:
  394.                         - parse
  395.                         - object_field_set:
  396.                             object: nodes
  397.                             field: caption
  398.                     - find:
  399.                         path: thumbnail_src
  400.                         do:
  401.                         - parse
  402.                         - object_field_set:
  403.                             object: nodes
  404.                             field: thumbnail
  405.                     - find:
  406.                         path: display_url
  407.                         do:
  408.                         - parse
  409.                         - object_field_set:
  410.                             object: nodes
  411.                             field: media
  412.                     - object_save:
  413.                         name: nodes
  414.                         to: user
  415.             - object_save:
  416.                 name: user
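
If the script is hard to follow, the three less obvious steps (pulling the JSON out of `window._sharedData`, building the URL for the next page of posts, and converting the post timestamp) can be sketched in plain Python. This is only an illustration of what the script appears to do, not Diggernaut code: the function names are made up, the UTC timezone for dates is an assumption, and nothing here checks whether Instagram still serves pages in this form.

```python
import json
import re
from datetime import datetime, timezone
from urllib.parse import quote

# Same pattern the script passes to `parse: filter:`. Note that it stops at
# the first ";", so JSON containing a semicolon would be cut short.
SHARED_DATA_RE = re.compile(r"window\._sharedData\s+\=\s+([^;]+);")


def extract_shared_data(html):
    """Return the parsed window._sharedData object, or None if it is absent."""
    match = SHARED_DATA_RE.search(html)
    if match is None:
        return None
    return json.loads(match.group(1))


def build_page_url(query_id, user_id, cursor, page_size=12):
    """Build the GraphQL URL for the next page, like the second `walk` does."""
    variables = json.dumps(
        {"id": user_id, "first": page_size, "after": cursor},
        separators=(",", ":"),  # compact JSON, as in the script's URL
    )
    # safe="" percent-encodes every reserved character, matching the
    # pre-encoded %7B%22id%22... template in the script.
    return (
        "https://www.instagram.com/graphql/query/"
        f"?query_id={query_id}&variables={quote(variables, safe='')}"
    )


def format_timestamp(ts):
    """Convert a Unix timestamp to '%Y-%m-%d %H:%M:%S' (UTC is an assumption)."""
    moment = datetime.fromtimestamp(int(ts), tz=timezone.utc)
    return moment.strftime("%Y-%m-%d %H:%M:%S")


if __name__ == "__main__":
    page = '<script>window._sharedData = {"config": {"csrf_token": "abc"}};</script>'
    print(extract_shared_data(page)["config"]["csrf_token"])
    print(build_page_url("42", "123", "abc"))
    print(format_timestamp("0"))
```

The pagination loop in the script then just repeats the second request while `has_next_page` is true, feeding the new `end_cursor` into the URL each time.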
