Posted by noah on 08/15/09

Tagged: sort, Bash, web, perl, scraping, one-liners, uniq, lwp-request



check for broken links (shell one-liner)


Published in: Perl
 

Retrieves the links from a remote HTML page, then checks the response code of each link. Duplicate links are checked only once, and anchors are ignored. That is, http://foo and http://foo#bar are treated as the same URL, so http://foo will be checked only once even if both occur on the page.
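For illustration, here is the normalization step from the one-liner run in isolation. The input line is a hypothetical sample of what lwp-request -o links emits; the exact output format may vary by version.

    # Hypothetical sample line from `lwp-request -o links` (format may vary):
    my $line = "A\thttp://foo/page#bar\n";
    chomp $line;
    # Strip the leading tag name and whitespace, keep everything up to the
    # first "#", and drop the anchor that follows it:
    $line =~ s{ ^ \w* \s* ( [^#]+ ) .* $ }{$1}x;
    print "$line\n";    # prints: http://foo/page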

If the command produces too much output, you can filter it down to just the broken links (if any) by piping the output of the entire one-liner to grep -v "200 OK".
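Schematically (the middle of the pipeline is elided here; the full command appears below):

    lwp-request -o links http://onemorebug.com | ... | grep -v "200 OK"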

Dependencies

These must be installed on your system:

  1. Perl
  2. lwp-request
  3. sort
  4. uniq

Troubleshooting

Check the quotes. The command below has its strings wrapped in double quotes ("), which is appropriate if you are using a Windows shell.

If you are using a Mac or Linux shell, change the double quotes around the strings in the command below to single quotes ('). Then all should work fine.

Keep in mind that the Windows shell wants strings to be double-quoted, while Unix-ish shells want these strings single-quoted so that the shell does not interpolate Perl variables such as $_ and $1 before Perl sees them.
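A quick way to see the difference in a Unix shell (the path shown is just a made-up example):

    $ echo "$HOME"    # double quotes: the shell expands $HOME
    /home/noah
    $ echo '$HOME'    # single quotes: the text passes through literally
    $HOME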

    lwp-request -o links http://onemorebug.com | perl -pe "chomp; $_ =~ s{ ^ \w* \s* ( [^#]+ ) .* $ }{$1}x; undef $_ unless m/^http/; $_ = qq{\"$_\"\n} if $_" | sort | uniq | perl -ne "chomp; print $_ . qq{\t} . qx{lwp-request -ds $_}"
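For reference, here is the same command translated for a Mac or Linux shell by applying the quoting rule above. This is a mechanical translation, not separately tested: the outer double quotes become single quotes, and the escaped \" inside qq{} becomes a plain ", since the Unix shell no longer needs the backslash.

    lwp-request -o links http://onemorebug.com | perl -pe 'chomp; $_ =~ s{ ^ \w* \s* ( [^#]+ ) .* $ }{$1}x; undef $_ unless m/^http/; $_ = qq{"$_"\n} if $_' | sort | uniq | perl -ne 'chomp; print $_ . qq{\t} . qx{lwp-request -ds $_}'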
