<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/">
<channel>
<title>Snipplr</title>
<link>http://snipplr.com/language/perl/tags/scraping</link>
<description>Recent snippets posted on Snipplr.com</description>
<language>en-us</language>
<pubDate>Fri, 24 May 2013 21:31:50 GMT</pubDate>
<item>
<title>(Perl) check for broken links (shell one-liner) - noah</title>
<link>http://snipplr.com/view/18344/check-for-broken-links-shell-oneliner/</link>
<description><![CDATA[ <p>Retrieves links from a remote HTML page, then checks the response code of each link.  Duplicated links are only checked once, and anchors are ignored.  That is, `http://foo` and `http://foo#bar` are considered to be the same URL, so `http://foo` will be checked only once, even if both URLs occur on the page.

Note that if the command produces too much output for you, you can filter down to just the *broken* links (if any) by piping the output of the entire one-liner to `grep -v "200 OK"`.
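
The de-duplication and filtering described above can be sketched in Python.  This is only an illustration under stated assumptions, not the one-liner itself: the function names `unique_links` and `broken_only` are made up for this example, and the `URL status` line format is assumed rather than taken from real `lwp-request` output.

```python
from urllib.parse import urldefrag


def unique_links(urls):
    """Strip anchors, drop duplicates, and sort (roughly what sort + uniq do)."""
    # urldefrag("http://foo#bar").url is "http://foo"
    bases = {urldefrag(url).url for url in urls}
    # Skip links that were only an anchor, such as "#top"
    return sorted(base for base in bases if base)


def broken_only(status_lines):
    """Keep lines that do not contain "200 OK", like grep -v "200 OK"."""
    return [line for line in status_lines if "200 OK" not in line]


if __name__ == "__main__":
    # Hypothetical input, for illustration only
    print(unique_links(["http://foo#bar", "http://foo", "http://bar"]))
    print(broken_only(["http://foo 200 OK", "http://bar 404 Not Found"]))
```

Here `http://foo#bar` and `http://foo` collapse into a single entry, so only one check would be made for that URL.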

## Dependencies

These must be installed on your system:

0. [Perl](http://perl.org)
0. [lwp-request](http://search.cpan.org/~gaas/libwww-perl-5.831/bin/lwp-request)
0. [sort](http://www.gnu.org/software/coreutils/manual/html_node/sort-invocation.html 'GNU sort sorts lines in a text file')
0. [uniq](http://www.gnu.org/software/coreutils/manual/html_node/uniq-invocation.html 'GNU uniq filters duplicate items from a list')

## Troubleshooting

**Check the quotes.**  The command below has strings wrapped in *double* quotes (") which is appropriate if you are using a *Windows* shell.

**If you are using a Mac or Linux shell** then you need to change the double quotes around the strings in the command below to *single* quotes (').  Then all should work fine.

Keep in mind that the Windows shell expects strings to be double-quoted, while in Unix-ish shells single quotes are the safer choice, because double quotes let the shell expand `$` variables inside the Perl code.</p> ]]></description>
<pubDate>Sat, 15 Aug 2009 08:55:19 GMT</pubDate>
<guid>http://snipplr.com/view/18344/check-for-broken-links-shell-oneliner/</guid>
</item>
<item>
<title>(Perl) The Onion's What Do You Think? Scrape to RSS - noah</title>
<link>http://snipplr.com/view/8240/the-onions-what-do-you-think-scrape-to-rss/</link>
<description><![CDATA[ <p>Old and busted but still gets the job done.</p> ]]></description>
<pubDate>Sat, 06 Sep 2008 23:08:32 GMT</pubDate>
<guid>http://snipplr.com/view/8240/the-onions-what-do-you-think-scrape-to-rss/</guid>
</item>
<item>
<title>(Perl) Mother 3 Translation Blog RSS Tidy Up - noah</title>
<link>http://snipplr.com/view/8236/mother-3-translation-blog-rss-tidy-up/</link>
<description><![CDATA[ <p>This one I like.  Scrapes the blog and fixes their RSS feed to include images, video, mini updates and whatnot.  Images are busted and I'm too lazy to fix them.  Left as an exercise for the reader.</p> ]]></description>
<pubDate>Sat, 06 Sep 2008 23:03:31 GMT</pubDate>
<guid>http://snipplr.com/view/8236/mother-3-translation-blog-rss-tidy-up/</guid>
</item>
<item>
<title>(Perl) The Onion Infographic Scrape to RSS - noah</title>
<link>http://snipplr.com/view/8234/the-onion-infographic-scrape-to-rss/</link>
<description><![CDATA[ <p>Scrapes the infographic to RSS.  Old and busted.</p> ]]></description>
<pubDate>Sat, 06 Sep 2008 22:59:51 GMT</pubDate>
<guid>http://snipplr.com/view/8234/the-onion-infographic-scrape-to-rss/</guid>
</item>
<item>
<title>(Perl) Girl With a One Track Mind Scraper - noah</title>
<link>http://snipplr.com/view/8233/girl-with-a-one-track-mind-scraper/</link>
<description><![CDATA[ <p>Didn't bother fixing it, but it works well enough to scrape Girl With a One Track Mind.</p> ]]></description>
<pubDate>Sat, 06 Sep 2008 22:58:36 GMT</pubDate>
<guid>http://snipplr.com/view/8233/girl-with-a-one-track-mind-scraper/</guid>
</item>
<item>
<title>(Perl) Get Your War On Scrape to RSS - noah</title>
<link>http://snipplr.com/view/8232/get-your-war-on-scrape-to-rss/</link>
<description><![CDATA[ <p>Really old and busted Get Your War On scraper but it still works so there.</p> ]]></description>
<pubDate>Sat, 06 Sep 2008 22:57:22 GMT</pubDate>
<guid>http://snipplr.com/view/8232/get-your-war-on-scrape-to-rss/</guid>
</item>
<item>
<title>(Perl) Broken Gallup Election 2008 Polls Scrape to RSS - noah</title>
<link>http://snipplr.com/view/8231/broken-gallup-election-2008-polls-scrape-to-rss/</link>
<description><![CDATA[ <p>Scrapes the presidential stuff into an RSS feed.</p> ]]></description>
<pubDate>Sat, 06 Sep 2008 22:56:19 GMT</pubDate>
<guid>http://snipplr.com/view/8231/broken-gallup-election-2008-polls-scrape-to-rss/</guid>
</item>
<item>
<title>(Perl) Engrish Scrape to RSS - noah</title>
<link>http://snipplr.com/view/8228/engrish-scrape-to-rss/</link>
<description><![CDATA[ <p>Scrapes Engrish to RSS.</p> ]]></description>
<pubDate>Sat, 06 Sep 2008 22:52:28 GMT</pubDate>
<guid>http://snipplr.com/view/8228/engrish-scrape-to-rss/</guid>
</item>
<item>
<title>(Perl) APOD Scrape to RSS - noah</title>
<link>http://snipplr.com/view/8226/apod-scrape-to-rss/</link>
<description><![CDATA[ <p>Scrapes the Astronomy Picture of the Day and packages it into a messy RSS feed.</p> ]]></description>
<pubDate>Sat, 06 Sep 2008 22:49:22 GMT</pubDate>
<guid>http://snipplr.com/view/8226/apod-scrape-to-rss/</guid>
</item>
<item>
<title>(Perl) Achewood Scrape to RSS - noah</title>
<link>http://snipplr.com/view/8225/achewood-scrape-to-rss/</link>
<description><![CDATA[ <p>Generates an RSS feed from Achewood's stuff.</p> ]]></description>
<pubDate>Sat, 06 Sep 2008 22:47:26 GMT</pubDate>
<guid>http://snipplr.com/view/8225/achewood-scrape-to-rss/</guid>
</item>
<item>
<title>(Perl) New York Times Scrape to RSS - noah</title>
<link>http://snipplr.com/view/8224/new-york-times-scrape-to-rss/</link>
<description><![CDATA[ <p>Crawls all over the NYT RSS feed to glue entire articles together for your enjoyment.  Requires registration.

Pick a feed URL from http://topics.nytimes.com/top/reference/timestopics/index.html and have at it.

`perl nyt.pl --user=aksdhf --pass=aksdfhasouidf --url=http://asiudhfasdfj`</p> ]]></description>
<pubDate>Sat, 06 Sep 2008 22:44:32 GMT</pubDate>
<guid>http://snipplr.com/view/8224/new-york-times-scrape-to-rss/</guid>
</item>
<item>
<title>(Perl) scraper - noah</title>
<link>http://snipplr.com/view/3131/scraper/</link>
<description><![CDATA[ <p>For a while I used this to scrape weather.com.  Then they changed their HTML and my script broke.</p> ]]></description>
<pubDate>Tue, 03 Jul 2007 22:48:36 GMT</pubDate>
<guid>http://snipplr.com/view/3131/scraper/</guid>
</item>
</channel>
</rss>