<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/">
<channel>
<title>Snipplr - noah</title>
<link>http://snipplr.com/users/noah/tags/scraping</link>
<description>Recent snippets posted on Snipplr.com</description>
<language>en-us</language>
<pubDate>Sat, 18 May 2013 19:23:45 GMT</pubDate>
<item>
<title>(Perl) check for broken links (shell one-liner)</title>
<link>http://snipplr.com/view/18344/check-for-broken-links-shell-oneliner/</link>
<description><![CDATA[ <p>Retrieves links from a remote HTML page, then checks the response code of each link.  Duplicate links are checked only once, and anchors are ignored.  That is, `http://foo` and `http://foo#bar` are treated as the same URL, so `http://foo` is checked only once even if both URLs occur on the page.

Note that if the command produces too much output, you can filter it down to just the *broken* links (if any) by piping the output of the entire one-liner to `grep -v "200 OK"`.
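
The de-duplication rule described above can be sketched in Python. This is only an illustration of that one step, not the one-liner itself: `unique_links` is a hypothetical helper name, and the sorted set stands in for `sort | uniq`.

```python
from urllib.parse import urldefrag


def unique_links(urls):
    """Return each URL once, with any #fragment removed, in sorted order."""
    # urldefrag("http://foo#bar").url is "http://foo", so both forms collapse.
    return sorted({urldefrag(u).url for u in urls})


if __name__ == "__main__":
    # Hypothetical input mirroring the example in the description.
    for link in unique_links(["http://foo", "http://foo#bar", "http://baz"]):
        print(link)
```

Each URL that survives this step would then be requested exactly once.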

## Dependencies

These must be installed on your system:

0. [Perl](http://perl.org)
0. [lwp-request](http://search.cpan.org/~gaas/libwww-perl-5.831/bin/lwp-request)
0. [sort](http://www.gnu.org/software/coreutils/manual/html_node/sort-invocation.html 'GNU sort sorts lines in a text file')
0. [uniq](http://www.gnu.org/software/coreutils/manual/html_node/uniq-invocation.html 'GNU uniq filters duplicate items from a list')

## Troubleshooting

**Check the quotes.**  The command below has strings wrapped in *double* quotes ("), which is appropriate if you are using a *Windows* shell.

**If you are using a Mac or Linux shell**, change the double quotes around the strings in the command below to *single* quotes (').  Then all should work fine.

Keep in mind that the Windows shell expects strings to be double-quoted, while Unix-like shells are safer with single-quoted strings, because inside double quotes a Unix shell expands `$` variables before Perl ever sees them.</p> ]]></description>
<pubDate>Sat, 15 Aug 2009 08:55:19 GMT</pubDate>
<guid>http://snipplr.com/view/18344/check-for-broken-links-shell-oneliner/</guid>
</item>
<item>
<title>(Bash) LDAP/NTLM authentication on the command line with curl</title>
<link>http://snipplr.com/view/5578/ldapntlm-authentication-on-the-command-line-with-curl/</link>
<description><![CDATA[ <p>Note that curl does not follow redirects by default (pass `-L` if you need it to).  It prompts interactively for the password if `-u` is given a user name without a password.

This snippet is tagged "scraping" because good luck scraping anything off your corporate intranet without LDAP auth :)</p> ]]></description>
<pubDate>Thu, 27 Mar 2008 11:33:30 GMT</pubDate>
<guid>http://snipplr.com/view/5578/ldapntlm-authentication-on-the-command-line-with-curl/</guid>
</item>
<item>
<title>(Bash) Download an entire site with wget -pkr</title>
<link>http://snipplr.com/view/5094/download-an-entire-site-with-wget-pkr/</link>
<description><![CDATA[ <p>## Where to Get Even More WGet Hacks
See also these killer `wget` [hacks by Jeff Veen.](http://www.veen.com/jeff/archives/000573.html)

## The WGet Hacks
Here are a couple of recipes to **download and archive an entire Web site,** starting with the given page and recursing down 1 level.  To recurse deeper, change the numeric argument given after `-l`.


## Pitfalls
As of 2008, WGet doesn't follow @import links in CSS.</p> ]]></description>
<pubDate>Sat, 16 Feb 2008 21:42:46 GMT</pubDate>
<guid>http://snipplr.com/view/5094/download-an-entire-site-with-wget-pkr/</guid>
</item>
<item>
<title>(Bash) Scrape Google from the command line</title>
<link>http://snipplr.com/view/4299/scrape-google-from-the-command-line/</link>
<description><![CDATA[ <p>This code is POC only -- actually using it would violate Google's TOS, which forbids scraping.  It is published here for educational value only.

Hypothetically, the following command should return a list of the top 500 or so hits in Google for onemorebug.com.

Each result will be prefixed with a number, followed by a dot and some whitespace (Lynx adds these).
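
If the numbering gets in the way, it can be stripped afterwards. A minimal Python sketch, assuming each line looks like digits, a dot, then whitespace as described above; `strip_number_prefix` is a hypothetical helper, not part of the original command.

```python
import re

# Optional leading spaces, digits, a dot, then whitespace.
NUMBER_PREFIX = re.compile(r"^\s*\d+\.\s+")


def strip_number_prefix(line):
    """Remove a leading list number such as '  12. ' if one is present."""
    return NUMBER_PREFIX.sub("", line, count=1)


if __name__ == "__main__":
    # Hypothetical sample line in the shape described above.
    print(strip_number_prefix("  12. http://onemorebug.com/"))
```

Lines without a numeric prefix pass through unchanged.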

_You must have Lynx and Wget installed on your system for this to work._

Keep in mind that *nix shells don't like double-quoted strings; see the comments.</p> ]]></description>
<pubDate>Sun, 09 Dec 2007 21:16:58 GMT</pubDate>
<guid>http://snipplr.com/view/4299/scrape-google-from-the-command-line/</guid>
</item>
<item>
<title>(Bash) check linked pages for Tidy validation errors, on the command line</title>
<link>http://snipplr.com/view/4130/check-linked-pages-for-tidy-validation-errors-on-the-command-line/</link>
<description><![CDATA[ <p>Given a list of HTML links (for example, a Google results page that has been saved locally), check each link that points to a page on a given domain, and report whether Tidy complains that the page's doctype declaration is missing.

Besides DOCTYPE, other strings to search for include "discarding", "lacks value" and "Error:"
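
Searching for several of these strings at once might look like the Python sketch below. It only filters text that Tidy has already produced; `flag_tidy_lines` is a hypothetical helper, and the sample messages are illustrative rather than captured from a real run.

```python
# The strings named above; adjust the tuple to taste.
MARKERS = ("DOCTYPE", "discarding", "lacks value", "Error:")


def flag_tidy_lines(tidy_output, markers=MARKERS):
    """Return the lines of Tidy's report that contain any marker string."""
    return [
        line
        for line in tidy_output.splitlines()
        if any(marker in line for marker in markers)
    ]


if __name__ == "__main__":
    # Illustrative report text, assumed for this example.
    sample = (
        "line 1 column 1 - Warning: missing <!DOCTYPE> declaration\n"
        "line 5 column 1 - Warning: inserting implicit <body>\n"
        "line 9 column 3 - Warning: discarding unexpected </div>\n"
    )
    for line in flag_tidy_lines(sample):
        print(line)
```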

Remember to replace MY_DOMAIN with the actual domain you want to validate against.</p> ]]></description>
<pubDate>Tue, 13 Nov 2007 23:47:28 GMT</pubDate>
<guid>http://snipplr.com/view/4130/check-linked-pages-for-tidy-validation-errors-on-the-command-line/</guid>
</item>
<item>
<title>(Bash) Validate a list of Web pages with Tidy, on the command line</title>
<link>http://snipplr.com/view/4129/validate-a-list-of-web-pages-with-tidy-on-the-command-line/</link>
<description><![CDATA[ <p></p> ]]></description>
<pubDate>Tue, 13 Nov 2007 22:54:37 GMT</pubDate>
<guid>http://snipplr.com/view/4129/validate-a-list-of-web-pages-with-tidy-on-the-command-line/</guid>
</item>
<item>
<title>(Bash) Download linked JPEGs from a Web page, on the command line</title>
<link>http://snipplr.com/view/4063/download-linked-jpegs-from-a-web-page-on-the-command-line/</link>
<description><![CDATA[ <p>The following command will download all the files with a JPG extension that are linked from http://flickr.com.
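
The link-selection step can be sketched with Python's standard library in place of LWP and HTML::Tree. This is an assumption-laden illustration, not the snippet itself: `JpegLinkParser` and `jpeg_links` are hypothetical names, `.jpeg` is accepted alongside `.jpg`, and the actual download is left to Wget as in the original.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class JpegLinkParser(HTMLParser):
    """Collect absolute URLs of <a href> links whose path ends in .jpg or .jpeg."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Resolve relative links against the page URL before testing the extension.
        url = urljoin(self.base_url, href)
        if urlparse(url).path.lower().endswith((".jpg", ".jpeg")):
            self.links.append(url)


def jpeg_links(html, base_url):
    parser = JpegLinkParser(base_url)
    parser.feed(html)
    return parser.links


if __name__ == "__main__":
    # Hypothetical page fragment; no network access is performed here.
    page = '<a href="/a.jpg">a</a> <a href="b.html">b</a>'
    print(jpeg_links(page, "http://flickr.com/"))
```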

_Requires the LWP and HTML::Tree Perl modules.  You must also have Wget installed on your system for this to work._</p> ]]></description>
<pubDate>Fri, 02 Nov 2007 22:57:45 GMT</pubDate>
<guid>http://snipplr.com/view/4063/download-linked-jpegs-from-a-web-page-on-the-command-line/</guid>
</item>
<item>
<title>(Perl) scraper</title>
<link>http://snipplr.com/view/3131/scraper/</link>
<description><![CDATA[ <p>For a while I used this to scrape weather.com.  Then they changed their HTML and my script broke.</p> ]]></description>
<pubDate>Tue, 03 Jul 2007 22:48:36 GMT</pubDate>
<guid>http://snipplr.com/view/3131/scraper/</guid>
</item>
</channel>
</rss>