Posted on 09/06/08


Tagged: rss, scraping, theonion, whatdoyouthink



The Onion's What Do You Think? Scrape to RSS


Published in: Perl

URL: http://www.theonion.com/content/topics/American+Voices

Old and busted but still gets the job done.

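#!/usr/bin/perl

use strict;
use warnings;

use HTML::Entities;
use LWP::Simple;

# LWP::Simple's get() returns decoded text, so write the feed out as UTF-8.
binmode(STDOUT, ":encoding(UTF-8)");

my $base_url  = "http://www.theonion.com";
my $index_url = "$base_url/content/topics/American+Voices";

# Print an error item, close the feed so it stays well-formed, then stop.
# (A plain die() would send the message to STDERR and leave the XML unclosed.)
sub fail {
    my ($message) = @_;
    print "<item><title>whatdoyouthink.pl - $message</title></item>\n".
          "</channel>\n".
          "</rss>\n";
    exit 1;
}

# Feed header and channel metadata.
print "<?xml version=\"1.0\"?>\n".
      "<rss version=\"2.0\">\n".
      "\n".
      "<channel>\n".
      "<title>The Onion - What Do You Think?</title>\n".
      "<link>$index_url</link>\n".
      "<description>The Onion's place to vent anger at All of Us USA.</description>\n".
      "<ttl>180</ttl>\n".
      "<skipDays>\n".
      "\t<day>Saturday</day>\n".
      "\t<day>Sunday</day>\n".
      "</skipDays>\n".
      "<category>Humor</category>\n".
      "<language>en-us</language>\n".
      "\n";

# Fetch the index page; get() returns undef on failure.
my $html_string = get($index_url);
fail("Error grabbing article index!") unless defined $html_string;

# Article links on the index page carry class="plain".
while ($html_string =~ m/<a href="(.*?)" class="plain">.*?<\/a>/g) {
    my $path = $1;
    my $url  = $base_url . decode_entities($path);

    my $article = get($url);
    fail("Error fetching article!") unless defined $article;

    my ($title, $intro, $content);

    # Headline, followed by the intro text that runs up to the thumbs block.
    if ($article =~ m/<h2 class="title">(.*?)<\/h2>(.*?)<div id="thumbs">/s) {
        $title = $1;
        $intro = $2;
    }
    else {
        fail("Error grabbing article title and heading!");
    }

    # The responses sit between the thumbs block and the amvo_below block.
    if ($article =~ m/<div id="thumbs">(.*?)<div id="amvo_below">/s) {
        $content = $1;
    }
    else {
        fail("Error grabbing article text!");
    }

    # Escape only the XML-significant characters in the title and link.
    my $safe_title = encode_entities(decode_entities($title), '<>&"');
    my $safe_url   = encode_entities($url, '<>&"');

    # A literal "]]>" in the body would end the CDATA section early, so split it.
    my $body = $intro . $content;
    $body =~ s/\]\]>/]]]]><![CDATA[>/g;

    print "<item>\n".
          "<title>" . $safe_title . "</title>\n".
          "<link>" . $safe_url . "</link>\n".
          "<category>Humor</category>\n".
          "<description>\n".
          "<![CDATA[" . $body . "]]>\n".
          "</description>\n".
          "</item>\n\n";
}

print "</channel>\n".
      "</rss>\n";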
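Since the scrape depends entirely on two patterns matching the article markup, it can help to try them offline before pointing the script at the live site. The following is only a sketch: the sample markup is made up to fit the patterns, and `extract_parts` is a hypothetical helper name, not part of the original script. The real page layout may well have changed.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical article markup, shaped only to satisfy the script's patterns.
my $article = <<'HTML';
<h2 class="title">Sample Headline</h2>
<p>Intro paragraph.</p>
<div id="thumbs">
<p>First response.</p>
<div id="amvo_below"></div>
HTML

# Returns (title, intro, content), or an empty list if either pattern fails.
sub extract_parts {
    my ($html) = @_;

    # Headline plus everything between it and the thumbs block.
    my ($title, $intro) =
        $html =~ m/<h2 class="title">(.*?)<\/h2>(.*?)<div id="thumbs">/s
        or return;

    # Everything between the thumbs block and the amvo_below block.
    my ($content) =
        $html =~ m/<div id="thumbs">(.*?)<div id="amvo_below">/s
        or return;

    return ($title, $intro, $content);
}

my ($title, $intro, $content) = extract_parts($article);
print "title:   $title\n";
print "intro:   $intro\n";
print "content: $content\n";
```

If `extract_parts` comes back empty for a saved copy of a real article page, the markup has probably changed and the patterns need updating.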
