Posted By: noah on 12/09/07


Tagged: search, google, results, commandline, iterator, perl, wget, metrics, aggregator, lynx, scraping, analysis, one-liners




Scrape Google from the command line


Published in: Bash
 

This code is a proof of concept (POC) only -- actually using it would violate Google's Terms of Service, which forbid scraping. It is published here for educational value only.

Hypothetically, the following command should return a list of the top 500 or so hits in Google for onemorebug.com.

Each result will be prefixed with a number, a dot, and some whitespace (Lynx adds these when dumping links).

You must have Lynx and Wget installed on your system for this to work.

Keep in mind that *nix shells interpolate variables like $i inside double-quoted strings, so on those platforms the argument to perl -e must be wrapped in single quotes; see the comments.

    perl -e '$i=0;while($i<1000){sleep 1;open(WGET,qq/|xargs lynx -dump/);printf WGET qq{http://www.google.com/search?q=site:onemorebug.com&hl=en&start=$i&sa=N},$i+=10}' | grep '//[^/]*onemorebug\.com/'
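For readers who find the one-liner dense, here is a hypothetical shell-only sketch of the same loop. It builds the same 100 paged search URLs (start=0, 10, ..., 990) that the Perl code generates; the echo keeps it a dry run that prints the lynx commands instead of actually fetching anything (which would violate the TOS):

```shell
# Build the list of paged search URLs, one per line, exactly as the
# Perl loop does (start=0,10,...,990 -> 100 pages of 10 results each).
urls=$(for start in $(seq 0 10 990); do
    printf 'http://www.google.com/search?q=site:onemorebug.com&hl=en&start=%s&sa=N\n' "$start"
done)

# Dry run: print the lynx invocations rather than executing them.
printf '%s\n' "$urls" | while read -r u; do
    echo lynx -dump "$u"
done
```

Removing the echo (and adding a `sleep 1` per iteration to match the original's pacing) and piping the output through `grep '//[^/]*onemorebug\.com/'` would reproduce the one-liner's behavior.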


Comments

Posted By: hemanthhm on January 11, 2009

syntax error at -e line 1, near "=" Unterminated operator at -e line 1.

??

Posted By: noah on June 11, 2009

@hemanthhm I don't know what to tell you -- it works fine for me, I just double-checked.

Posted By: knshetty on July 19, 2009

For some reason, if a Perl script wrapped in double quotes (i.e. perl -e "...") produces a syntax error, try the single-quoted alternative instead: perl -e '...'. Applying that pattern to the script at hand, we get: perl -e '$i=0;while($i<1000){ ... }'

Posted By: noah on September 29, 2009

I think I finally get what the problem was here: the *nix shell uses single quotes, while the DOS/Windows shell uses double quotes. So you have to be aware of which platform you are on and wrap the argument to perl -e in the appropriate type of quotes.
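The difference can be demonstrated with a quick experiment in any POSIX shell: inside double quotes the shell substitutes $i before the command ever runs, while single quotes pass it through as literal text.

```shell
i=EXPANDED

# Double quotes: the shell substitutes $i before echo ever sees it.
double=$(echo "value: $i")

# Single quotes: $i reaches echo as the literal two characters "$i" --
# exactly what you want when handing Perl code to perl -e on *nix.
single=$(echo 'value: $i')

echo "$double"   # value: EXPANDED
echo "$single"   # value: $i
```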

Posted By: scraper on November 21, 2009

Thanks for the nice perl command. That for sure is one more proof that perl is a spaghetti language but powerful :-)

While this perl/lynx code will work to get results, it won't really work well.

I recently stumbled upon an article called "Scraping Google for Fun and Profit" that goes much deeper into the subject. It shows how you can scrape not just a few hundred but millions of hits from Google. Free PHP code is included, with filtering of advertisements and parsing of the data (title, description, host, URL, etc.) into an array.

Works for web and console.

Here is the article, hope you like it: http://google-scraper.squabbel.com

Posted By: noah on October 6, 2010

@scraper thanks for the link. For heavier scraping I'd personally recommend Selenium, Mechanize, or the HTTP module of Node.js, depending on your needs.

Hypothetically, if I needed to scrape pages from Google, I imagine this script would serve my needs, because I'd be doing it to assess which Web properties are most important to a client in terms of SEO and uptime. Theoretically I could then build functional tests to "watch" those assets. But such functional tests are so expensive to maintain that I doubt I'd ever reach coverage for even the first ~500 URLs in Google.

In practice, covering the first page of search results with functional tests would probably be sufficiently valuable.
