Posted By
iblis on 10/14/07
Tagged
html munging
Versions
10/14/07 11:49pm
Grab all links from local or remote html file
Published in: Perl
#!/usr/bin/perl -w
use strict;
use Getopt::Std;
use LWP::Simple;
use HTML::Parser;

#
# Grab all links from local or remote html file
# perl html munging
#
# option -a (/ -r) grabs only absolute (/ relative) urls

# get options and argument
#
my %opts;
getopts('ar', \%opts);
my $arg = shift;
die "Usage: $0 [-a | -r] filename [| URL]\n"
    if (not defined $arg or $opts{a} && $opts{r});    # allow either -a or -r

# get the page either from file or url
#
my $page;
if ($arg =~ m!^http://!) {
    # LWP::Simple::get returns undef on failure and does not set $!
    $page = get($arg)
        or die "Couldn't get $arg\n";
}
else {
    open FH, "<", $arg
        or die "Couldn't open $arg: $!\n";
    $page = do { local $/; <FH> };    # slurp the whole file
    close FH;
}

# set the parser and parse
#
my $parser = HTML::Parser->new(
    api_version => 3,
    start_h     => [\&start, "tagname, attr"],
);
my @links;

# start-tag handler: collect the href of every <a> tag
sub start {
    my ($tag, $attr) = @_;
    if ($tag =~ /^a$/ and defined $attr->{href}) {
        return
            if ($attr->{href} =~ m!^http://! and $opts{r});    # exclude absolute url when -r
        return
            if ($attr->{href} !~ m!^http://! and $opts{a});    # exclude relative url when -a
        push @links, $attr->{href};
    }
}

$parser->parse($page);
$parser->eof;

# output
#
print "$_\n" for @links;
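The -a / -r filtering in the start handler can also be tried on its own, without fetching or parsing a page. The following is a minimal sketch, not part of the original snippet: the helper name keep_link and the sample hrefs are hypothetical, and it assumes that "absolute" should also cover https:// links, whereas the script above tests for http:// only.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not in the original snippet): decide whether a link
# should be kept under the -a / -r options.
# Assumption: "absolute" means the href starts with http:// or https://;
# the original script checks http:// only.
sub keep_link {
    my ($href, $opts) = @_;
    my $is_absolute = $href =~ m!^https?://!i ? 1 : 0;
    return 0 if $is_absolute  and $opts->{r};    # -r: keep relative urls only
    return 0 if !$is_absolute and $opts->{a};    # -a: keep absolute urls only
    return 1;
}

# Made-up sample hrefs, for illustration only
my @hrefs = (
    'http://example.com/',
    'https://example.com/a',
    '/about.html',
    'img/logo.png',
);

my @absolute = grep { keep_link($_, { a => 1 }) } @hrefs;
my @relative = grep { keep_link($_, { r => 1 }) } @hrefs;
my @all      = grep { keep_link($_, {}) } @hrefs;

print "absolute: @absolute\n";
print "relative: @relative\n";
```

With neither option set, every href is kept, which matches the script's default behaviour. If the script were saved under a hypothetical name such as grab_links.pl, `perl grab_links.pl -r page.html` would list only the relative links of a local file.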