Posted By

noah on 11/13/07


command Bash commandline validation crawler perl tidy textonly scraping crawling one-liners

Versions (?)

check linked pages for Tidy validation errors, on the command line

 / Published in: Bash

Given a list of HTML links (for example, a Google results page that has been saved locally) check each link that points to a page on a given domain, and report if Tidy complains that its doctype declaration is missing.

Besides DOCTYPE, other strings to search for include "discarding", "lacks value" and "Error:"

Remember to replace MY_DOMAIN with the actual domain you want to validate against.

  1. lwp-request -o links file:///SAVED_GOOGLE_RESULTS.htm|grep -P "A\s*http://\w*.MY_DOMAIN" | perl -pe "m#A\s*(.*)#; $notify = qq{\n\t$1:\n}; $_=qx{lwp-request $1|tidy -eq 2>&1|grep -e Error -e DOCTYPE}; $_ = $notify .$_ if $_" > report.txt
  3. #Alternate: print out just the HTTP response code for linked pages that have my domain in the link
  5. lwp-request -o links|perl -pe "chomp; $_ =~ s#\w*\s*##; undef $_ unless m/; $_ .= qq{\t} . qx{lwp-request -ds $_} if $_"
  7. #old version
  9. lwp-request -o links file:///C:/SAVED_GOOGLE_RESULTS.htm|grep -P "A\s*http://\w*.MY_DOMAIN" | perl -pe "m#A\s*(.*)#; $notify = qq{\t$1: }; $_=qx{lwp-request $1|tidy -e 2>&1 | grep \"DOCTYPE\"}; print $notify if $_"

Report this snippet  

You need to login to post a comment.