Clean Word HTML using Regular Expressions - PHP Snipplr Social Repository

Revision: 5314

at February 28, 2008 05:19 by localhorst

Updated Code

function cleanHTML($html) {
/// <summary>
/// Removes all FONT and SPAN tags, and all Class and Style attributes.
/// Designed to get rid of non-standard Microsoft Word HTML tags.
/// </summary>
// start by completely removing all unwanted tags

$html = ereg_replace("<(/)?(font|span|del|ins)[^>]*>","",$html);

// then run another pass over the html (twice), removing unwanted attributes

$html = ereg_replace("<([^>]*)(class|lang|style|size|face)=(\"[^\"]*\"|'[^']*'|[^>]+)([^>]*)>","<\\1>",$html);
$html = ereg_replace("<([^>]*)(class|lang|style|size|face)=(\"[^\"]*\"|'[^']*'|[^>]+)([^>]*)>","<\\1>",$html);

return $html
}

Revision: 5313

at February 27, 2008 13:49 by localhorst

Initial Code

function cleanHTML($html) {
/// <summary>
/// Removes all FONT and SPAN tags, and all Class and Style attributes.
/// Designed to get rid of non-standard Microsoft Word HTML tags.
/// </summary>
// start by completely removing all unwanted tags

$html = ereg_replace("<(/)?(font|span|del|ins)[^>]*>","",$html);

// then run another pass over the html (twice), removing unwanted attributes

$html = ereg_replace("<([^>]*)(class|lang|style|size|face)=(\"[^\"]*\"|'[^']*'|[^>]+)([^>]*)>","<\\1>",$html);
$html = ereg_replace("<([^>]*)(class|lang|style|size|face)=(\"[^\"]*\"|'[^']*'|[^>]+)([^>]*)>","<\\1>",$html);

// sample word html <p class="aaa" style="background:dot">abc</p> will return <p > </p>
}

Initial URL

http://tim.mackey.ie/CommentView,guid,2ece42de-a334-4fd0-8f94-53c6602d5718.aspx

Initial Description

The PHP Code appears in Post Comments

Initial Title

Clean Word HTML using Regular Expressions

Initial Tags

php

Initial Language

PHP

Choose a language for easy browsing: