
Revision: 22577
at January 16, 2010 09:07 by ginoplusio


Updated Code
function webpage2txt($url) {
	$user_agent = "Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)";

	$ch = curl_init();    // initialize curl handle
	curl_setopt($ch, CURLOPT_URL, $url); // URL to fetch (plain GET request)
	curl_setopt($ch, CURLOPT_FAILONERROR, 1);              // Fail on errors
	curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);    // allow redirects
	curl_setopt($ch, CURLOPT_RETURNTRANSFER,1); // return into a variable
	// No CURLOPT_PORT: cURL takes the port from the URL (80 for http, 443 for https)
	curl_setopt($ch, CURLOPT_TIMEOUT, 15); // times out after 15s

	curl_setopt($ch, CURLOPT_USERAGENT, $user_agent);

	$document = curl_exec($ch);
	if ($document === false) {
		return "";   // request failed (HTTP error, timeout, DNS failure, ...)
	}

	$search = array('@<script[^>]*?>.*?</script>@si',  // Strip out javascript
		'@<style[^>]*?>.*?</style>@si',     // Strip style blocks (no U flag: it would make .*? greedy)
		'@<[\/\!]*?[^<>]*?>@si',            // Strip out HTML tags
		'@<![\s\S]*?--[ \t\n\r]*>@',        // Strip multi-line comments including CDATA
		'/\s{2,}/',                         // Collapse runs of whitespace into one newline
	);

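	// Strip markup first, then decode entities, so that escaped text such as
	// "&lt;b&gt;" survives as a literal "<b>" instead of being removed as a tag.
	$text = html_entity_decode(preg_replace($search, "\n", $document));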

	// Remove leading and trailing whitespace.
	$text = trim($text);

	return $text;
}

echo webpage2txt("http://www.repubblica.it");

Revision: 22576
at January 16, 2010 09:06 by ginoplusio


Initial Code
function webpage2txt($url) {
	$user_agent = "Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)";

	$ch = curl_init();    // initialize curl handle
	curl_setopt($ch, CURLOPT_URL, $url); // URL to fetch (plain GET request)
	curl_setopt($ch, CURLOPT_FAILONERROR, 1);              // Fail on errors
	curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);    // allow redirects
	curl_setopt($ch, CURLOPT_RETURNTRANSFER,1); // return into a variable
	// No CURLOPT_PORT: cURL takes the port from the URL (80 for http, 443 for https)
	curl_setopt($ch, CURLOPT_TIMEOUT, 15); // times out after 15s

	curl_setopt($ch, CURLOPT_USERAGENT, $user_agent);

	$document = curl_exec($ch);
	if ($document === false) {
		return "";   // request failed (HTTP error, timeout, DNS failure, ...)
	}

	$search = array('@<script[^>]*?>.*?</script>@si',  // Strip out javascript
		'@<style[^>]*?>.*?</style>@si',     // Strip style blocks (no U flag: it would make .*? greedy)
		'@<[\/\!]*?[^<>]*?>@si',            // Strip out HTML tags
		'@<![\s\S]*?--[ \t\n\r]*>@',        // Strip multi-line comments including CDATA
		'/\s{2,}/',                         // Collapse runs of whitespace into one newline
	);

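	// Strip markup first, then decode entities, so that escaped text such as
	// "&lt;b&gt;" survives as a literal "<b>" instead of being removed as a tag.
	$text = html_entity_decode(preg_replace($search, "\n", $document));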

	// Remove leading and trailing whitespace.
	$text = trim($text);

	return $text;
}

echo webpage2txt("http://www.rockit.it");

Initial URL
http://www.barattalo.it/2010/01/16/php-web-page-to-text-function/

Initial Description
I found this nice small bot on www.php.net; thanks to the author of the script on the preg_replace page.
It returns the text content of a URL, and it can be used to pull the text from a site and pick out relevant words to search for.
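
As a rough illustration of the "find relevant words" idea, here is a minimal sketch that counts the most frequent words in a block of plain text. The helper name `top_words` and its `$limit` and `$minLength` parameters are hypothetical and not part of the original snippet; the sample string stands in for the output of `webpage2txt($url)` so the example runs without a network connection. Lowercasing uses `strtolower`, which only folds ASCII letters, so accented capitals are counted separately from their lowercase forms.

```php
<?php
// Hypothetical helper (not part of the original snippet): return the
// $limit most frequent words of at least $minLength characters.
function top_words($text, $limit = 5, $minLength = 4) {
	// Match runs of letters/digits that are at least $minLength characters long.
	$pattern = '/[\p{L}\p{N}]{' . (int) $minLength . ',}/u';
	if (!preg_match_all($pattern, strtolower($text), $matches)) {
		return array();   // no words found, or input was not valid UTF-8
	}

	$counts = array_count_values($matches[0]);   // word => number of occurrences
	arsort($counts);                             // highest count first
	return array_slice($counts, 0, $limit, true);
}

// In practice $text would come from webpage2txt($url).
$text = "Rome weather today: sunny weather in Rome, cloudy weather in Milan.";
print_r(top_words($text, 2));   // weather => 3, rome => 2
```

A real keyword extractor would probably also drop stop words for the page's language; the length filter here is only a crude stand-in for that.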

Initial Title
PHP bot that retrieves the text of a page with CURL

Initial Tags


Initial Language
PHP