Posted by noah on 07/03/07




Grab linked files from a list of web pages


Published in: Perl
 

URL: http://github.com/textarcana/scrapers/blob/643e6e7cb349fa94cbc3fc88e1d55c7b6a262d11/grabit.pl

how to use

perl grabit.pl urls_for_download.txt

Expects as argument the name of a file containing a newline-delimited list of URLs:

http://example.com/coolstuff
http://example.com/coolstuff/fun
http://example.com/videos/explosions

When invoked, the script presents an interactive prompt asking what type of file should be downloaded. It then downloads all the matching files linked from each of the listed web pages.
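For instance, the prompt (produced by the grab_what subroutine in the script) looks something like this:

```
$ perl grabit.pl urls_for_download.txt
Welcome to Grabit by Noah Sussman

1) jpg|gif|mpg
2) wav|zip
3) zip
4) wav
5) mp3

What type(s) of files would you like to grab?
```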

Note that the location of the download folder is hard-coded to c:/windows/desktop/grabit/, so you may want to change that before running the script.
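On a Unix-like system, for example, you might point $local_directory at "$ENV{HOME}/grabit/" instead and create the folder before running. The path here is just an example, not part of the original script:

```shell
# Create the download folder the script expects
# (assumes $local_directory in grabit.pl was edited to "$ENV{HOME}/grabit/")
mkdir -p "$HOME/grabit"

# Then run the script against a list of URLs
perl grabit.pl urls_for_download.txt
```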

This script is also available on GitHub.

Wait! Do you know about wget and curl?

This script is legacy. People seem to like it (hey, I still use it), but today I would probably not write my own tool to download multiple files from remote sites.

Instead I would likely just use a command-line download tool like wget or curl. lwp-request would also do the trick.
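For example, a rough wget equivalent of what this script does might look like the following. The flags are a sketch, not a drop-in replacement for the interactive extension menu, and the single-file curl URL is hypothetical:

```shell
# Follow each page in the list one level deep and keep only jpg/gif files,
# saving them flat into ./grabit (roughly the script's default behavior)
wget --recursive --level=1 --no-directories \
     --accept jpg,gif \
     --directory-prefix=grabit \
     --input-file=urls_for_download.txt

# Or grab one known file with curl, keeping the remote file name
# (hypothetical URL):
curl --remote-name http://example.com/videos/explosions/boom.mpg
```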

do not comment your code like this!

For a great explanation of the rather baroque commenting style I was using circa 2001, see Steve Yegge's excellent article on code style: Portrait of a n00b.

Of course, when I sit down to write a Perl script today, I use POD to format and publish my comments.
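For contrast with the inline-comment style below, a minimal POD header for a script like this might look like the following (a sketch, not part of the original source):

```perl
=head1 NAME

grabit.pl - download files linked from a list of web pages

=head1 SYNOPSIS

    perl grabit.pl urls_for_download.txt

=cut
```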

#!/usr/local/bin/perl -w
use strict;
use LWP::Simple;
use LWP::UserAgent;
use HTTP::Request;
use HTML::LinkExtor;
use URI::URL;

#MAIN----------------------------------------------------------------------------
#Extract info from the tags in these files:
my $tag_type;        #Extract info from this type of tag ONLY (ok to use | here)
my $local_directory; #Save files here
my $extensions;      #Only save files with these extensions (ok to use | here)

$extensions      = grab_what(); #Let the user choose what type of files to grab
# $ARGV[0] = "c:/windows/desktop/list.txt"; #List of urls to search for files
$local_directory = "c:/windows/desktop/grabit/"; #Store grabbed files here
$tag_type        = "a"; #Look in <A> tags for file URIs

die "\n*******************\n ERROR\n*******************\nPlease create the directory:\n\n $local_directory\n\n"
    unless -d $local_directory; #bail out unless the download directory really exists

while (<>) { #Assume we are reading a file with one URL on each line
    chomp(my $url = $_);
    if ($url ne "") {
        grabit($url, $tag_type, $local_directory, $extensions);
        print $url . "\n";
        #print " "; #delete urls from list file once they've been grabbed
    }
}

#GRABIT--------------------------------------------------------------------------
#Just a wrapper for grab_hyperlinked;
#makes it easier to call it iteratively
sub grabit {
    my ($url, $tag_type, $local_directory, $extensions) = @_;
    grab_hyperlinked($url, $tag_type, $local_directory, $extensions);
}

#GRAB_HYPERLINKED----------------------------------------------------------------
#Search the file at URL for tags of type TAG_TYPE and grab those targets that
#end with one of the chosen EXTENSIONS
sub grab_hyperlinked {
    my ($url, $tag_type, $local_directory, $extensions) = @_;
    my @links = list_links($url, $tag_type);

    #@links = @links[0 .. 7]; #only get the first X files (or comment this out to get all)

    foreach my $image_uri (@links) {
        next if $image_uri eq "";
        if ($image_uri =~ m{\.($extensions)$}io) { #Only save files with the specified extensions
            my $image_name = $image_uri;
            $image_name =~ s{.*/(.*)}{$1}; #strip the path, keeping only the file name
            $image_name = smart_save($image_name, $local_directory); #Don't overwrite files with the same name
            save_image($image_uri, $local_directory . $image_name);
            #print $image_uri;
        }
    }
}

#SMART SAVE----------------------------------------------------------------------
#Check whether the file FILE_NAME already exists in DIRECTORY and, if so, add an
#integer to the end of the file's name, before the extension.
#i.e. if there are 2 files named foo.bar, then the second one to be saved will be renamed foo-1.bar
#The RETURN VALUE is the (possibly new) name of the file.
sub smart_save {
    my ($file_name, $directory) = @_;
    my $int = 0;
    my $ext = $file_name;

    while (-e $directory . $file_name) {
        $ext       =~ s{[^.]*(.*)}{$1}; #extension of file_name
        $file_name =~ s{([^.]*).*}{$1}; #file_name minus extension
        while (-e $directory . $file_name . "-" . $int . $ext) {
            $int++;
        }
        $file_name = $file_name . "-" . $int . $ext; #e.g. foo-1.bar
    }
    return $file_name;
}

#SAVE IMAGE----------------------------------------------------------------------
#Grab a file from a web server and save it locally
#my $file     = 'http://localhost/libraries/images/oiltower/top_boom.jpg'; #The name of the file on the server
#my $download = 'c:\windows\desktop\grabbed.jpg'; #Where the file will be saved locally
#save_image($file, $download);
sub save_image { #copy web FILE to local DOWNLOAD location
    my ($file, $download) = @_;

    my $user_agent = LWP::UserAgent->new;
    my $request    = HTTP::Request->new('GET', $file);
    my $response   = $user_agent->request($request, $download); #passing a file name saves the content to that file
    warn "Could not download $file\n" unless $response->is_success;
}

#LIST LINKS----------------------------------------------------------------------
#Extract the URL from every TAG_TYPE link on the page at URL
#Returns an array containing the absolute URL of each link target
#This code is adapted from the HTML::LinkExtor docs
#my $url   = "http://localhost/lwp/pics.html"; # for instance
#my @links = list_links($url, "a");
sub list_links {
    my ($url, $tag_type) = @_;
    my $user_agent = LWP::UserAgent->new;
    #$user_agent->agent("MSIE/5.5 " . $user_agent->agent);

    # Set up a callback that collects links
    my @images = ();

    # Make the parser. Unfortunately, we don't know the base yet
    # (it might be different from $url)
    my $p = HTML::LinkExtor->new(
        sub {
            my ($tag, %attributes) = @_;
            return if $tag ne $tag_type; # we only look closer at the tags specified by TAG_TYPE
            push(@images, values %attributes);
        }
    );

    # Request the document and parse it as it arrives
    my $response = $user_agent->request(HTTP::Request->new(GET => $url),
                                        sub { $p->parse($_[0]) });

    # Expand all URLs to absolute ones
    my $base = $response->base;
    @images = map { url($_, $base)->abs } @images;

    #print join("\n", @images), "\n";
    return @images;
}

#******************************************************************************
#*                                                                            *
#*                           USER-QUERY FUNCTIONS:                            *
#*                                                                            *
#******************************************************************************

#GRAB WHAT?--------------------------------------------------------------------
#Let the user choose what type(s) of files to grab
sub grab_what {
    my $option_id = 1;
    my $selection;
    my @extensions = qw(
        jpg|gif|mpg
        wav|zip
        zip
        wav
        mp3
    );
    print "Welcome to Grabit by Noah Sussman\n\n";
    foreach my $ext (@extensions) {
        print "$option_id) $ext\n";
        $option_id++;
    }
    print "\nWhat type(s) of files would you like to grab?\n";
    chomp($selection = <STDIN>);
    die "You must enter a number corresponding to an option!!"
        unless ($selection =~ /^\d+$/ and defined $extensions[$selection - 1]);
    print "Extension set to \"$extensions[$selection - 1]\".\nGrabbing...\n";
    return $extensions[$selection - 1];
}

#END
