.Net ramblings
# Wednesday, November 23, 2005
Clean Word HTML using Regular Expressions

Introduction

I've spent a long time trying different approaches to getting rid of MS Word HTML when importing or pasting text into my content management system, with very mixed success.  Previous efforts used the MSHTML element DOM, but this was slow and difficult to implement.  I think I've finally found a satisfactory and fast solution using only regular expressions.  Please feel free to use it in your applications, and post any improvements you find.

The Code

// requires: using System.Text.RegularExpressions;

/// <summary>
/// Removes all FONT and SPAN tags, and all Class and Style attributes.
/// Designed to get rid of non-standard Microsoft Word HTML tags.
/// </summary>
private string CleanHtml(string html)
{
    // start by completely removing all unwanted tags
    html = Regex.Replace(html, @"<[/]?(font|span|xml|del|ins|[ovwxp]:\w+)[^>]*?>", "", RegexOptions.IgnoreCase);
    // then run another pass over the html (twice), removing unwanted attributes
    html = Regex.Replace(html, @"<([^>]*)(?:class|lang|style|size|face|[ovwxp]:\w+)=(?:'[^']*'|""[^""]*""|[^\s>]+)([^>]*)>","<$1$2>", RegexOptions.IgnoreCase);
    html = Regex.Replace(html, @"<([^>]*)(?:class|lang|style|size|face|[ovwxp]:\w+)=(?:'[^']*'|""[^""]*""|[^\s>]+)([^>]*)>","<$1$2>", RegexOptions.IgnoreCase);
    return html;
}

Samples of non-standard Microsoft Word HTML

<SPAN lang=EN-IE style="mso-ansi-language: EN-IE">
<p class="MSO Normal">
<UL style="MARGIN-TOP: 0cm" type=circle>
<o:p>&nbsp;</o:p>
<li class=MsoNormal style='mso-list:l3 level1 lfo3;tab-stops:list 36.0pt'>

Explanation of Regular Expressions

I've spent a good deal of time examining the problematic tags that MS Word inserts in its HTML; some examples are shown above.  The code above is based on a few requirements for my CMS:

  • remove all FONT and SPAN tags, because all the content in my CMS is styled through style-sheets.
  • remove all CLASS and STYLE attributes because they mean nothing outside of the original Word document
  • remove all namespace tags and attributes like <o:p> and < ... v:shape ... >

The first regular expression removes unwanted tags, and is broken down as follows:

<[/]?(font|span|xml|del|ins|[ovwxp]:\w+)[^>]*?>
  • match an open tag character <
  • and optionally match a close tag sequence </  (because we also want to remove the closing tags)
  • match any of the list of unwanted tags: font,span,xml,del,ins
  • a pattern is given to match any of the namespace tags, anything beginning with o,v,w,x,p, followed by a : followed by another word
  • match any attributes as far as the closing tag character >
  • the replace string for this regex is "", which will completely remove the instances of any matching tags.
  • note that we are not removing anything between the tags, just the tags themselves
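
To see the first expression in isolation, here is a minimal sketch that runs it over two fragments modelled on the samples above. The class and method names (TagStripDemo, StripUnwantedTags) are hypothetical and only there to make the snippet self-contained; the pattern itself is the one from CleanHtml.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TagStripDemo
{
    // Same pattern as the first pass of CleanHtml.
    private const string UnwantedTags = @"<[/]?(font|span|xml|del|ins|[ovwxp]:\w+)[^>]*?>";

    // Hypothetical helper: removes the unwanted tags but keeps whatever sits between them.
    public static string StripUnwantedTags(string html)
    {
        return Regex.Replace(html, UnwantedTags, "", RegexOptions.IgnoreCase);
    }

    public static void Main()
    {
        // Opening and closing SPAN tags are removed, the text between them is kept.
        Console.WriteLine(StripUnwantedTags("<SPAN lang=EN-IE style=\"mso-ansi-language: EN-IE\">Hello</SPAN>"));

        // Namespace tags such as <o:p> are removed; a plain <p> has no colon after the p, so it is left alone.
        Console.WriteLine(StripUnwantedTags("<p>text<o:p>&nbsp;</o:p></p>"));
    }
}
```

The second call shows why the namespace alternative needs the colon: without it, the p in [ovwxp] would also swallow ordinary paragraph tags.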

The second regular expression removes unwanted attributes, and is broken down as follows:

<([^>]*)(?:class|lang|style|size|face|[ovwxp]:\w+)=(?:'[^']*'|""[^""]*""|[^\s>]+)([^>]*)>
  • match an open tag character <
  • capture any text before the unwanted attribute (This is $1 in the replace expression)
  • match (but don't capture) any of the unwanted attributes: class, lang, style, size, face, o:p, v:shape etc.
  • there should always be an = character after the attribute name
  • match the value of the attribute by identifying the delimiters. These can be single quotes, double quotes, or no quotes at all.
  • for single quotes, the pattern is: ' followed by anything but a ' followed by a '
  • similarly for double quotes. The doubled "" is just how a single " is written inside a C# verbatim (@) string; the regex engine itself sees "[^"]*"
  • for a non-delimited attribute value, I specify the pattern as anything except whitespace or the closing tag character >, so the match stops at the end of the value
  • lastly, capture whatever comes after the unwanted attribute in ([^>]*)
  • the replacement string <$1$2> reconstructs the tag without the unwanted attribute found in the middle.
  • note: each pass only removes one occurrence of an unwanted attribute per tag, which is why I run the same regex twice.  For example, take the html fragment: <p class="MSO Normal" style="Margin-TOP:3em"> 
    a single pass will only remove one of these attributes.  Running the regex twice will remove the second one.  I can't think of many reasonable cases where it would need to be run more than that, although a tag carrying three or more unwanted attributes would need one extra pass for each.
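
The one-attribute-per-pass behaviour can be checked with a small sketch. Because the leading ([^>]*) is greedy, each pass strips the last unwanted attribute in a tag. StripOnce below is simply the second pass of CleanHtml; StripAll, which repeats the replacement until the string stops changing, is an alternative sketch and not part of the original function. The names AttributeStripDemo, StripOnce and StripAll are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AttributeStripDemo
{
    // Same pattern as the second pass of CleanHtml.
    private const string UnwantedAttribute =
        @"<([^>]*)(?:class|lang|style|size|face|[ovwxp]:\w+)=(?:'[^']*'|""[^""]*""|[^\s>]+)([^>]*)>";

    // One pass: removes at most one unwanted attribute from each tag.
    public static string StripOnce(string html)
    {
        return Regex.Replace(html, UnwantedAttribute, "<$1$2>", RegexOptions.IgnoreCase);
    }

    // Hypothetical variant: keep going until a pass changes nothing.
    // Every successful pass shortens the string, so the loop always ends.
    public static string StripAll(string html)
    {
        string previous;
        do
        {
            previous = html;
            html = StripOnce(html);
        } while (html != previous);
        return html;
    }

    public static void Main()
    {
        string tag = "<p class=\"MSO Normal\" style=\"Margin-TOP:3em\">";
        string once = StripOnce(tag);    // style is gone, class is still there
        string twice = StripOnce(once);  // class is gone as well
        Console.WriteLine(once);
        Console.WriteLine(twice);

        // Three unwanted attributes: two fixed passes would leave one behind.
        Console.WriteLine(StripAll("<li class=MsoNormal lang=EN-IE style='mso-list:l3' align=center>"));
    }
}
```

Note that the whitespace around each removed attribute is left in place, so the cleaned tags can contain runs of spaces. That is harmless in HTML, but worth knowing if the output is compared as text.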

Suggestions!

If you have any suggestions or improvements, please post them here as comments.
Thanks :)

p.s. thanks to BinBin for the fix to preserve attributes like 'align=center'.
Wednesday, November 23, 2005 3:40:36 PM (GMT Standard Time, UTC+00:00)  #    Comments [27]  .Net General


Sunday, December 18, 2005 11:08:23 PM (GMT Standard Time, UTC+00:00)
A superb example of reg exp and very handy too!!
tom
Wednesday, December 21, 2005 4:22:17 PM (GMT Standard Time, UTC+00:00)
note from author: updated first regular expression to also remove del and ins tags which appeared in a Word doc with comments from different users, probably from the "track changes" feature or something.
tim mackey
Wednesday, February 01, 2006 2:01:33 PM (GMT Standard Time, UTC+00:00)
[Apologies if posted twice: I hit submit, it didn't display a confirmation, but just brought me back to the same page, with the form still filled in, but a different anti-robot code]

Randomly surfed in, thought that
http://office.microsoft.com/en-gb/assistance/HA010549981033.aspx might be handy to you, though I think you have most of these covered already.

You can also somewhat refine your second regex so that:
1) unquoted attributes are a single word only,
2) whitespace is permitted around the '=' sign (they may not do this now, but it's legal, so futureproofing is good),
3) it will remove any number of adjacent dodgy attributes rather than only one at a time,
4) o/v/w/x/p: tags can contain hyphens and other nasties,
5) a name is required for the tag, followed by whitespace,
6) whitespace is required after each attribute, and is removed with it
7) A minimal match is used rather than a character set, to match the ending "anything not a >"

<([^>]+\s+)(?:(?:class|lang|style|size|face|[ovwxp]:[^\s=]+)\s*=\s*(?:'[^']*'|""[^""]*""|[^\s>]+)\s+)+.*?>

It would be nice, though, to remove all dodgy attributes within a tag, even where those dodgy attributes are separated by a legitimate attribute.

So in order to accomplish this, what can we do to our regex? At the moment, it is basically:
<([^>]+\s+)(?:names=values\s+)+([^>]*)>

For this, we need to start using lookaround assertions. If your regex engine doesn't grok variable-width lookbehind (and most don't), then at first sight, it seems that you're stuck with the above. You can't say "any distance after an opening tag".

But then... do we care about the exact position of the opening tag? No. We only care that it doesn't appear within the match, nor between the match and the first closing-tag.

(?:names=values\s+)+(?=[^<]*?>)

Pow. Replace the matches with the empty string (no more need to use captured $1$2 references), and you're done. For safety, though, some refinement should be done to the "name" and "values" clauses, like so:

(?:(?:class|lang|style|size|face|[ovwxp]:[^<=>\s]+)\s*=\s*(?:'[^<']*'|""[^<""]*""|[^\s<>]+)\s+)+(?=[^<]*?>)

This slightly paranoid revision means that even pages where people are talking about MS tags will not accidentally have parts stripped out. This comes at the expense of preventing MS parameters that contain '<' in the value field from being stripped.

Note that I haven't TESTED any of the above at all. Hope it comes in somehow useful anyway though.
Friday, April 21, 2006 5:26:22 AM (GMT Daylight Time, UTC+01:00)
The first regex doesn't seem to work, error message reports unmatched [ in regex
Beholder
Friday, April 21, 2006 8:55:07 AM (GMT Daylight Time, UTC+01:00)
[quote]...The first regex doesnt seem to work, error message reports unmatched "[" in regex[/quote]

it works fine for me. if you read it, you can see that all the square brackets are matched. did you remember to use the @ in front of the string?

tim
tim
Saturday, May 27, 2006 12:43:15 PM (GMT Daylight Time, UTC+01:00)
Excellent! But...

Two pretty useful and basic things that get deleted are paragraph alignment and indentation.

Presumably there is some way of first checking for, say, 'text-align:center' and keeping it?



Jon
Saturday, May 27, 2006 1:35:25 PM (GMT Daylight Time, UTC+01:00)
hi Jon,
you're right, any attributes specified in the style property will be removed, as is the design of the regex. you'd have to change the expression quite a lot to behave the way you want. the simplest way i can think of would be to run another regex beforehand, that would remove any 'text-align:xyz' style attributes and parse them in as a 'align=xyz' html attribute, which would not be picked up by the subsequent regular expression. but i don't know what this would look like. it's complex enough already!
tim
Saturday, June 03, 2006 1:08:11 PM (GMT Daylight Time, UTC+01:00)
I was able to retain the formatting I need (centring and indentation) by simply first finding 'text-align:center' and 'margin-left:x' [where x is 1 or more] and then replacing them between 'false' HTML tags.

E.g. myString = Regex.Replace(myString, "margin-left:[1-9]", "'>#INDENT THIS#<span ")

Then at the end I replace #INDENT THIS# with " style=""margin-left:3em"">"

A couple of other Word tags I found I needed to replace in the line 1 of the Regex are 'div' and 'st' ('st' are Office 2002+ 'smart tags')

Another useful thing is the 'HTML Tidy' program http://www.w3.org/People/Raggett/tidy/. This (tidy.exe) can be called from the command line, so after I strip the Word HTML code I call HTML Tidy to check the code for errors or problems. It's a very quick and easy way of ensuring any HTML code is correct and problem free.
Jon
Monday, October 16, 2006 6:13:30 PM (GMT Daylight Time, UTC+01:00)
any chance of a vb version of this?
brian
Monday, October 16, 2006 6:16:13 PM (GMT Daylight Time, UTC+01:00)
hi brian, sorry don't have the time. shouldn't be too hard, just pull out the semi-colons and the @ string literal at the beginning of each regex...
there are very good free tools to convert C# code to VB, i suggest a google search.
good luck
tim
tim
Monday, October 16, 2006 11:15:44 PM (GMT Daylight Time, UTC+01:00)
If you're interested in converting a BLOCK of MS Word (from, say, a copy/paste operation), I just blogged about how to do this. You may be able to use the same technique for an entire Word HTML doc. Just put the DHTML control into Design Mode (see post below) and then save web.Document.InnerHTML to a file.

Copy Paste HTML From MS Word: IE's DHTML Editing Control (in a .NET WinApp)
http://blogs.msdn.com/noahc/archive/2006/10/16/copy-paste-html-from-ms-word-ie-s-dhtml-editing-control-in-a-net-winapp.aspx
Tuesday, May 08, 2007 4:06:49 AM (GMT Daylight Time, UTC+01:00)
Thanks Tim for your code, this is what I was looking for. However I couldn't find a PHP script available so I tried my best to modify your script to work with PHP. I have tested my script with FCKEditor and it really works. Of course there may be bugs :-) so your comments are welcome

function cleanHTML($html) {
    /// <summary>
    /// Removes all FONT and SPAN tags, and all Class and Style attributes.
    /// Designed to get rid of non-standard Microsoft Word HTML tags.
    /// </summary>
    // start by completely removing all unwanted tags
    $html = ereg_replace("<(/)?(font|span|del|ins)[^>]*>","",$html);

    // then run another pass over the html (twice), removing unwanted attributes
    // \\1 is the text before the attribute and \\4 the text after it;
    // [^[:space:]>]+ stops an unquoted value at the first space so later attributes survive
    $html = ereg_replace("<([^>]*)(class|lang|style|size|face)=(\"[^\"]*\"|'[^']*'|[^[:space:]>]+)([^>]*)>","<\\1\\4>",$html);
    $html = ereg_replace("<([^>]*)(class|lang|style|size|face)=(\"[^\"]*\"|'[^']*'|[^[:space:]>]+)([^>]*)>","<\\1\\4>",$html);

    // sample word html <p class="aaa" style="background:dot">abc</p> comes back as a bare <p> tag (plus leftover whitespace) around abc
    return $html;
}
Viet le
Friday, August 03, 2007 2:24:38 PM (GMT Daylight Time, UTC+01:00)
Thanks Tim! Your article offers great help!

The first expression works well, however the second one has a minor issue. For the situation below
<p class=MsoNormal align=center>, it will strip off everything after the class attribute, including the align attribute which should be kept, so a \s was added to resolve this issue as shown below.

<([^>]*)(?:class|lang|style|size|face|[ovwxp]:\w+)=(?:'[^']*'|""[^""]*""|[^>\s]+)([^>]*)>

BTW, I enjoyed reading your blog!
BinBin
binbin
Wednesday, December 19, 2007 3:47:46 PM (GMT Standard Time, UTC+00:00)
Unable to get the second expression working with javascript :

var sPAttern2 = new RegExp("<([^>]*)(?:class|lang|style|size|face|[ovwxp]:\w+)=(?:'[^']*'|""[^""]*""|[^\s>]+)([^>]*)>","igm");

Syntax error at Line *** position 94:Expected ')'

Any help would be greatly appreciated.

Thank you




Rev
Wednesday, December 19, 2007 3:52:58 PM (GMT Standard Time, UTC+00:00)
hi rev. it's probably a syntax difference with the way javascript treats quote characters in string literals. you should escape them as per javascript syntax. good luck
tim.
tim
Wednesday, December 19, 2007 4:54:09 PM (GMT Standard Time, UTC+00:00)
Tim,

Thank you, yes found the issue is the double quotes :)

""[^""]*""

Not a javascript expert so googling along ;)

Rev
Rev
Monday, December 24, 2007 12:06:31 PM (GMT Standard Time, UTC+00:00)
Have resolved the double quote issue using \".

However I have an issue with removing non-delimited attribute values using this example:

<TABLE class=MsoNormalTable style="WIDTH: 100%; mso-cellspacing: 1.5pt" cellPadding=0 width="100%" border=0>

This would be amended to:

<TABLE soNormalTable cellPadding=0 width="100%" border=0>

Does not handle 'class=MsoNormalTable' - looking into this now.

Apart from this works great!

Thank you very much.
REV
Tuesday, February 19, 2008 2:33:04 AM (GMT Standard Time, UTC+00:00)
// For javascript
// h: html code

h = h.replace(/<[/]?(font|st1|shape|path|lock|imagedata|stroke|formulas|span|xml|del|ins|[ovwxp]:\w+)[^>]*?>/gi, '')

h = h.replace(/<([^>]*)style="([^>"]*)"([^>]*)>/gi, '<$1 $3>')
h = h.replace(/<([^>]*)style='([^>']*)'([^>]*)>/gi, '<$1 $3>')
h = h.replace(/<([^>]*)style=([^> ]*) ([^>]*)>/gi, '<$1 $3>')
h = h.replace(/<([^>]*)style=([^>]*)>/gi, '<$1>')

h = h.replace(/<([^>]*)class="([^>"]*)"([^>]*)>/gi, '<$1 $3>')
h = h.replace(/<([^>]*)class='([^>']*)'([^>]*)>/gi, '<$1 $3>')
h = h.replace(/<([^>]*)class=([^> ]*) ([^>]*)>/gi, '<$1 $3>')
h = h.replace(/<([^>]*)class=([^>]*)>/gi, '<$1>')

Please comment for improvement or mistakes. I'm going to use this. Thanks.
Friday, March 14, 2008 10:35:06 PM (GMT Standard Time, UTC+00:00)
Thanks...this was exactly what I needed!
Gary
Thursday, July 03, 2008 5:11:39 AM (GMT Daylight Time, UTC+01:00)
Shouldn't the PHP code use eregi_replace rather than ereg_replace since Tim's code passed in the flag RegexOptions.IgnoreCase?
Tuesday, December 09, 2008 7:13:07 PM (GMT Standard Time, UTC+00:00)
' VB.net version
''' <summary>
''' http://tim.mackey.ie/CommentView,guid,2ece42de-a334-4fd0-8f94-53c6602d5718.aspx
''' Removes all FONT and SPAN tags, and all Class and Style attributes.
''' Designed to get rid of non-standard Microsoft Word HTML tags.
''' </summary>
Private Function CleanHtml(ByVal html As String) As String
    ' start by completely removing all unwanted tags
    html = Regex.Replace(html, "<[/]?(font|span|xml|del|ins|[ovwxp]:\w+)[^>]*?>", "", RegexOptions.IgnoreCase)
    ' then run another pass over the html (twice), removing unwanted attributes
    html = Regex.Replace(html, "<([^>]*)(?:class|lang|style|size|face|[ovwxp]:\w+)=(?:'[^']*'|""[^""]*""|[^\s>]+)([^>]*)>", "<$1$2>", RegexOptions.IgnoreCase)
    html = Regex.Replace(html, "<([^>]*)(?:class|lang|style|size|face|[ovwxp]:\w+)=(?:'[^']*'|""[^""]*""|[^\s>]+)([^>]*)>", "<$1$2>", RegexOptions.IgnoreCase)
    Return html
End Function
Thursday, October 29, 2009 3:46:24 PM (GMT Standard Time, UTC+00:00)
How do we remove smart tags using the same code?
Mimi
Monday, November 09, 2009 7:41:39 PM (GMT Standard Time, UTC+00:00)
Tim,
Excellent code. I wanted to remove a little more from the text I was getting from MS Word. Here is my Code.

Private Function CleanHtml(ByVal html As String) As String
    html = Regex.Replace(html, "<[/]?(font|link|m|a|st1|meta|object|style|span|xml|del|ins|[ovwxp]:\w+)[^>]*?>", "", RegexOptions.IgnoreCase)
    html = Regex.Replace(html, "<([^>]*)(?:class|lang|style|size|face|[ovwxp]:\w+)=(?:'[^']*'|""[^""]*""|[^\s>]+)([^>]*)>", "<$1$2>", RegexOptions.IgnoreCase)
    html = Regex.Replace(html, "<([^>]*)(?:class|lang|style|size|face|[ovwxp]:\w+)=(?:'[^']*'|""[^""]*""|[^\s>]+)([^>]*)>", "<$1$2>", RegexOptions.IgnoreCase)
    html = customClean(html, "<!--[if", "<![endif]-->")
    html = customClean(html, "<!-- /*", "-->")
    Return html
End Function

' Removes every block that starts with begStr and ends with endStr.
Private Function customClean(ByVal html As String, ByVal begStr As String, ByVal endStr As String) As String
    Dim i As Integer = html.IndexOf(begStr)
    While i >= 0
        ' look for the end marker only after the start marker
        Dim j As Integer = html.IndexOf(endStr, i + begStr.Length)
        If j < 0 Then Exit While ' unterminated block: leave the rest untouched
        html = html.Remove(i, (j - i) + endStr.Length)
        i = html.IndexOf(begStr, i)
    End While
    Return html
End Function
James Schwietert
Thursday, August 26, 2010 5:07:59 PM (GMT Daylight Time, UTC+01:00)
Here's a version for any of you ColdFusion users out there...

<cffunction name="cleanUpWord" access="public" output="false" returntype="string" returnformat="JSON" hint="I clean up MS Word code">
<cfargument name="inputString" type="string" required="yes">

<cfset var local = StructNew()>

<!--- The two regex expressions in this function were taken from http://tim.mackey.ie/CleanWordHTMLUsingRegularExpressions.aspx --->

<cfset local.cleanText = ReplaceNoCase(arguments.inputString,"<p ","<p><p ","all")> <!--- Keep our P tag when it has bullshit MS Word attributes --->
<cfset local.cleanText = ReReplaceNoCase(local.cleanText,"<[/]?(font|span|xml|del|ins|o|st1|[ovwxp]:\w+)[^>]*?>","","all")> <!--- Borrowed Regex --->
<cfset local.cleanText = ReReplaceNoCase(local.cleanText,"<([^>]*)(?:class|lang|style|size|face|[ovwxp]:\w+)=(?:'[^']*'|""[^""]*""|[^\s>]+)([^>]*)>","","all")> <!--- Borrowed Regex --->
<cfset local.cleanText = ReplaceNoCase(local.cleanText,"&ndash;","-","all")> <!--- Get rid of unnecessary escape sequences --->
<cfset local.cleanText = ReReplaceNoCase(local.cleanText,"&rsquo;|&lsquo;","'","all")> <!--- Get rid of unnecessary escape sequences --->
<cfset local.cleanText = ReReplaceNoCase(local.cleanText,"&rdquo;|&ldquo;","""","all")> <!--- Get rid of unnecessary escape sequences --->

<cfset local.cleanText = ReReplaceNoCase(local.cleanText,"“|”","&quot;","all")> <!--- Get rid of MS Word SmartQuotes --->

<cfreturn local.cleanText>
</cffunction>
Josh
Wednesday, May 11, 2011 9:07:21 PM (GMT Daylight Time, UTC+01:00)
Thank you so much for posting the code for this. It was extremely helpful in quickly resolving a problem we were having with word formatting showing up on web pages.
sh
Monday, October 10, 2011 2:08:00 AM (GMT Daylight Time, UTC+01:00)
Thanks so much for sharing. Getting stuck on reg expression was making me suicidal :D
mike
Friday, November 11, 2011 6:28:52 PM (GMT Standard Time, UTC+00:00)
For Python:

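import re

# filename and outputFilename are assumed to be set by the caller
with open(filename) as f:
    text = f.read()

flags = re.IGNORECASE | re.MULTILINE
removeStupidTags = re.compile(r"<[/]?(font|span|xml|del|ins|[ovwxp]:\w+)[^>]*?>", flags)
text = removeStupidTags.sub("", text)

# triple quotes let the pattern hold both ' and "; the doubled "" in the C# version is C#-only escaping
removeStupidAttributes = re.compile(r"""<([^>]*)(?:class|lang|style|size|face|[ovwxp]:\w+)=(?:'[^']*'|"[^"]*"|[^\s>]+)([^>]*)>""", flags)
text = removeStupidAttributes.sub(r"<\1\2>", text)
text = removeStupidAttributes.sub(r"<\1\2>", text)

# write the cleaned text out (the original opened the file but never wrote to it)
with open(outputFilename, 'w') as f:
    f.write(text)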
Chris G