/ Published in: Other
Hey guys,
Sharing my diggernaut's scripts for web scraping, hope it will be useful for you.
This is script for scraping user's accounts without logging in to instagram, so no risk. What you can get with it: all information about user (his full name, username, id, avatar, number of follows and followers, number of posts), information about his posts (url, image, number of likes, information about persons who liked, comments, caption). So you probably will be interested in getting not all posts but just lets say 10 (or 30) of most recent. You can adjust it with settings. You can also set mode for script, to simple or extended. In simple mode it will not retrieve list of persons who likes post and comments. It works faster and eats less bandwidth in this case.
If you look into script (lines 5-9):
- type: fieldset
fields:
- user: somusername
you can see settings you may adjust, instead of "someusername" you need to set username of instagram account you want to scrape. You can set multiple users to scrape, you can need in this case add additional settings chunks, like below:
- type: fieldset
fields:
- user: somusername1
- user: somusername2
Sharing my diggernaut's scripts for web scraping, hope it will be useful for you.
This is script for scraping user's accounts without logging in to instagram, so no risk. What you can get with it: all information about user (his full name, username, id, avatar, number of follows and followers, number of posts), information about his posts (url, image, number of likes, information about persons who liked, comments, caption). So you probably will be interested in getting not all posts but just lets say 10 (or 30) of most recent. You can adjust it with settings. You can also set mode for script, to simple or extended. In simple mode it will not retrieve list of persons who likes post and comments. It works faster and eats less bandwidth in this case.
If you look into script (lines 5-9):
- type: fieldset
fields:
- user: somusername
you can see settings you may adjust, instead of "someusername" you need to set username of instagram account you want to scrape. You can set multiple users to scrape, you can need in this case add additional settings chunks, like below:
- type: fieldset
fields:
- user: somusername1
- user: somusername2
Expand |
Embed | Plain Text
Copy this code and paste it in your HTML
--- config: agent: Firefox iterator: - type: fieldset fields: - user: someusername do: - walk: to: https://www.instagram.com/<%user%>/ do: - variable_set: field: repeat value: 'no' - variable_clear: queryid - find: path: body do: - parse: filter: window\._sharedData\s+\=\s+([^;]+); - normalize: routine: json2xml - to_block - find: path: config>csrf_token do: - parse - variable_set: token - cookie_get: mid - variable_set: mid - find: path: script[type="text/javascript"] do: - parse: attr: src - if: match: Commons\.js do: - normalize: routine: url - walk: to: value headers: Cookue: csrftoken=<%token%>; mid=<%mid%>; do: - find: path: script do: - parse - normalize: routine: unescape_html - filter: args: return\s*e\.profilePosts\.byUserId\.get\(t\)\.pagination\}\,queryId\:\"(\d+)\" - if: match: \d+ do: - variable_set: queryid - find: path: body do: - cookie_get: mid - variable_set: mid - parse: filter: window\._sharedData\s+\=\s+([^;]+); - normalize: routine: json2xml - to_block - find: path: config>csrf_token do: - parse - variable_set: token - find: path: entry_data>profilepage do: - register_set: https://www.instagram.com/p - variable_set: baseurl - object_new: user - find: path: user>id do: - parse - object_field_set: object: user field: id - variable_set: userid - find: path: user>username do: - parse - object_field_set: object: user field: username - find: path: user>full_name do: - parse - object_field_set: object: user field: full_name - find: path: user>biography do: - parse - object_field_set: object: user field: biography - find: path: user>profile_pic_url do: - parse - object_field_set: object: user field: profile_pic_url - find: path: user>profile_pic_url_hd do: - parse - object_field_set: object: user field: profile_pic_url_hd - find: path: user>external_url do: - parse - object_field_set: object: user field: external_url - find: path: user>external_url_linkshimmed do: - parse - object_field_set: object: user field: external_url_linkshimmed - find: path: user>country_block do: - parse - object_field_set: object: user field: country_block - find: path: user>follows>count do: - parse - object_field_set: object: user field: follows - find: path: user>followed_by>count do: - parse - object_field_set: object: user field: followed_by - find: path: user>media>nodes do: - object_new: nodes - find: path: id do: - parse - object_field_set: object: nodes field: id - find: path: is_video do: - parse - object_field_set: object: nodes field: is_video - find: path: video_views do: - parse - object_field_set: object: nodes field: video_views - find: path: date do: - parse - normalize: routine: date_format args: format_in: '%s' format_out: '%Y-%m-%d %H:%M:%S' - object_field_set: object: nodes field: date - find: path: dimensions>width do: - parse - object_field_set: object: nodes field: width - find: path: dimensions>height do: - parse - object_field_set: object: nodes field: height - find: path: likes>count do: - parse - object_field_set: object: nodes field: likes_count - find: path: comments>count do: - parse - object_field_set: object: nodes field: comments_count - find: path: comments_disabled do: - parse - object_field_set: object: nodes field: comments_disabled - find: path: caption_safe do: - parse - object_field_set: object: nodes field: caption - find: path: thumbnail_src do: - parse - object_field_set: object: nodes field: thumbnail - find: path: display_src do: - parse - object_field_set: object: nodes field: media - object_save: name: nodes to: user - find: path: user>media>page_info do: - find: path: has_next_page do: - parse - if: match: true do: - variable_set: field: repeat value: 'yes' - find: path: end_cursor do: - parse - eval: routine: js body: '(function () {return encodeURIComponent("<%register%>")})();' - variable_set: cursor - walk: to: https://www.instagram.com/graphql/query/?query_id=<%queryid%>&variables=%7B%22id%22%3A%22<%userid%>%22%2C%22first%22%3A12%2C%22after%22%3A%22<%cursor%>%22%7D repeat: <%repeat%> headers: accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8 accept-language: ru-RU,ru;q=0.8,en-US;q=0.6,en;q=0.4 cache-control: max-age=0 upgrade-insecure-requests: 1 Cookie: mid=<%mid%>; do: - sleep: 5 - variable_set: field: repeat value: 'no' - find: path: edge_owner_to_timeline_media>page_info do: - find: path: has_next_page do: - parse - if: match: true do: - variable_set: field: repeat value: 'yes' - find: path: end_cursor do: - parse - eval: routine: js body: '(function () {return encodeURIComponent("<%register%>")})();' - variable_set: cursor - find: path: edge_owner_to_timeline_media>count do: - parse - object_field_set: object: user field: media_count - find: path: edge_owner_to_timeline_media>edges>node do: - object_new: nodes - find: path: id do: - parse - object_field_set: object: nodes field: id - find: path: is_video do: - parse - object_field_set: object: nodes field: is_video - find: path: video_views do: - parse - object_field_set: object: nodes field: video_views - find: path: taken_at_timestamp do: - parse - normalize: routine: date_format args: format_in: '%s' format_out: '%Y-%m-%d %H:%M:%S' - object_field_set: object: nodes field: date - find: path: dimensions>width do: - parse - object_field_set: object: nodes field: width - find: path: dimensions>height do: - parse - object_field_set: object: nodes field: height - find: path: edge_media_preview_like>count do: - parse - object_field_set: object: nodes field: likes_count - find: path: edge_media_to_comment>count do: - parse - object_field_set: object: nodes field: comments_count - find: path: comments_disabled do: - parse - object_field_set: object: nodes field: comments_disabled - find: path: edge_media_to_caption do: - parse - object_field_set: object: nodes field: caption - find: path: thumbnail_src do: - parse - object_field_set: object: nodes field: thumbnail - find: path: display_url do: - parse - object_field_set: object: nodes field: media - object_save: name: nodes to: user - object_save: name: user
URL: https://www.diggernaut.com