Posted By

lfatr on 02/03/11


Tagged

ruby csv seo open-uri scrapi



SEO scraper (writes to CSV)


Published in: Ruby

Reads pages.txt (one URL path per line, relative to the hard-coded site prefix), fetches each page, extracts its title and h1 tag, and writes one row per page to output.csv (path, title, headline, http_status).
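To make the input and output formats concrete, here is a minimal sketch of how one line of pages.txt could end up as one line of output.csv. The helper name `csv_line_for` and the sample values are hypothetical and not part of the snippet; only the column order and the site prefix come from it.

```ruby
require 'csv'

# Hypothetical helper (not in the original snippet): combines one raw line
# from pages.txt with already-scraped values into one line of output.csv.
def csv_line_for(raw_line, title, headline, status, prefix = 'http://www.bikeshd.co.uk/')
  path = prefix + raw_line.strip # strip the trailing newline read from the file
  CSV.generate_line([path, title, headline, status])
end

# Sample values are made up for illustration.
puts csv_line_for("about-us\n", 'About us, the team', 'About', '200')
# http://www.bikeshd.co.uk/about-us,"About us, the team",About,200
```

The CSV library quotes any field that contains a comma, so titles with commas do not shift the columns.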

    require 'open-uri'
    require 'scrapi'
    require 'csv'

    HTTP_PREFIX = 'http://www.bikeshd.co.uk/'

    # Scraper for the h1 and title of a page, defined once and reused.
    SCRAPER = Scraper.define do
      array :items
      process "html", :items => Scraper.define {
        process "h1", :headline => :text
        process "title", :title => :text
        result :headline, :title
      }
      result :items
    end

    # Functions

    # Fetches HTTP_PREFIX + path and returns a hash with the full URL,
    # the page title, the h1 headline and the HTTP status code.
    def check_page(path)
      url = HTTP_PREFIX + path

      # Check http status and read the body in a single request
      status, html = URI.open(url) { |file| [file.status[0], file.read] }

      # Scrape it!
      items = SCRAPER.scrape(html)
      item = items && items.first

      {
        "path"     => url,
        "title"    => item ? item.title.to_s : '',
        "headline" => item ? item.headline.to_s : '',
        "status"   => status
      }
    rescue OpenURI::HTTPError => error
      # Error response: keep the status code, leave title and headline empty
      { "path" => url, "title" => '', "headline" => '', "status" => error.io.status[0] }
    end

    # Execution

    results = []

    puts "GETTING PAGES -----------------------------"

    File.foreach("pages.txt").with_index do |line, i|
      path = line.strip       # drop the trailing newline
      next if path.empty?     # skip blank lines

      puts "#{i} >>> #{path}"
      results << check_page(path)

      # Uncomment to stop early during a test run
      # break if i == 500

      # Pause for 3 seconds after every 50 pages
      sleep 3 if results.size % 50 == 0
    end

    puts "WRITING CSV -------------------------------"

    CSV.open("output.csv", "wb") do |csv|
      results.each do |row|
        csv << [row["path"], row["title"], row["headline"], row["status"]]
      end
    end
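
If scrapi is not available, the title and h1 extraction can be roughly approximated with regular expressions from the standard library. This is a swapped-in technique, not what the snippet above does. The names `extract_seo_fields` and `clean_text` are hypothetical, and the sketch assumes reasonably well-formed HTML. It does not decode entities such as `&amp;` and only looks at the first title and first h1.

```ruby
# Regex-based stand-in for the scraping step. Hypothetical helper names;
# this is not scrapi's API and not a full HTML parser.
def extract_seo_fields(html)
  title    = html[%r{<title[^>]*>(.*?)</title>}mi, 1]
  headline = html[%r{<h1[^>]*>(.*?)</h1>}mi, 1]
  {
    'title'    => clean_text(title),
    'headline' => clean_text(headline)
  }
end

# Removes nested tags and collapses whitespace.
# Returns '' when the tag was not found (fragment is nil).
def clean_text(fragment)
  fragment.to_s.gsub(/<[^>]+>/, '').gsub(/\s+/, ' ').strip
end

# Made-up sample page for illustration.
html = '<html><head><title> Bikes | Shop </title></head>' \
       '<body><h1>Mountain <em>Bikes</em></h1></body></html>'

fields = extract_seo_fields(html)
puts fields['title']    # Bikes | Shop
puts fields['headline'] # Mountain Bikes
```

Missing tags come back as empty strings, which matches how the snippet fills the title and headline columns for error responses.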
