Posted By

ginoplusio on 01/16/10


Tagged

curl regular expression preg replace web bot spider




PHP bot that retrieves the text of a page with CURL


Published in: PHP
 

URL: http://www.barattalo.it/2010/01/16/php-web-page-to-text-function/

I found this small bot on www.php.net, on the preg_replace page; thanks to the author of that script. It fetches a URL with cURL and returns the text content of the page, with scripts, styles, tags and comments stripped out. It can be used to pull the text from a site and find relevant words to search for.

    function webpage2txt($url) {
        $user_agent = "Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)";

        $ch = curl_init(); // initialize curl handle
        curl_setopt($ch, CURLOPT_URL, $url); // URL to fetch (a GET request)
        curl_setopt($ch, CURLOPT_FAILONERROR, true); // fail on HTTP status >= 400
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
        curl_setopt($ch, CURLOPT_TIMEOUT, 15); // time out after 15s
        curl_setopt($ch, CURLOPT_USERAGENT, $user_agent);
        // CURLOPT_PORT is left unset: forcing port 80 breaks https:// URLs.

        $document = curl_exec($ch);
        curl_close($ch);

        if ($document === false) {
            return ""; // request failed or timed out
        }

        $search = array(
            '@<script[^>]*?>.*?</script>@si', // strip out javascript
            '@<style[^>]*?>.*?</style>@si', // strip style blocks (no U flag: it would make .*? greedy)
            '@<[\/\!]*?[^<>]*?>@si', // strip out HTML tags
            '@<![\s\S]*?--[ \t\n\r]*>@', // strip multi-line <!-- ... --> comments
            '/\s{2,}/', // collapse runs of whitespace
        );

        // Strip markup first, then decode entities, so that escaped markup
        // such as &lt;b&gt; survives as literal text instead of being removed.
        $text = preg_replace($search, "\n", $document);
        $text = html_entity_decode($text);

        // The whitespace pattern above already collapses every run, so a
        // plain trim() is all the remaining cleanup that is needed.
        return trim($text);
    }

    echo webpage2txt("http://www.repubblica.it");
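The "find relevant words" idea can be sketched with a simple word-frequency count over the returned text. This is only a rough illustration: `top_words()` and its `$limit` and `$minLength` parameters are hypothetical names that are not part of the original snippet, and treating the most frequent longer words as "relevant" is an assumption, not a real keyword-extraction method.

```php
<?php
// Hypothetical helper (not from the original snippet): counts how often each
// word appears in a block of plain text and returns the most frequent ones.
function top_words($text, $limit = 10, $minLength = 4) {
    // Lowercase the text; use mbstring when available so accented letters work.
    $lower = function_exists('mb_strtolower')
        ? mb_strtolower($text, 'UTF-8')
        : strtolower($text);

    // Split on any run of characters that are neither letters nor digits.
    $words = preg_split('/[^\p{L}\p{N}]+/u', $lower, -1, PREG_SPLIT_NO_EMPTY);
    if ($words === false) {
        return array(); // the input was not valid UTF-8
    }

    $counts = array();
    foreach ($words as $word) {
        // Skip words shorter than $minLength bytes (mostly articles and prepositions).
        if (strlen($word) < $minLength) {
            continue;
        }
        $counts[$word] = isset($counts[$word]) ? $counts[$word] + 1 : 1;
    }

    arsort($counts); // highest count first
    return array_slice($counts, 0, $limit, true);
}

// Offline example on a fixed string, so no network access is needed.
$sample = "The page loads. The page text is cleaned, the text is counted, and the page is done.";
print_r(top_words($sample, 2)); // page => 3, text => 2

// With the bot above it would be called as:
// print_r(top_words(webpage2txt("http://www.repubblica.it")));
```

A real page would probably also need a stop-word list for its language, since frequent but uninformative words of four or more letters still get through this filter.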
