I've spent a long time trying many different approaches to getting rid of MS Word HTML when importing or pasting text into my content management system, with very mixed success. Previous efforts involved using the MSHTML element DOM, but this was slow and difficult to implement. I think I've finally found a satisfactory and fast solution using only regular expressions. Please feel free to use it in your applications, and post any improvements you may find.
    /// <summary>
    /// Removes all FONT and SPAN tags, and all Class and Style attributes.
    /// Designed to get rid of non-standard Microsoft Word HTML tags.
    /// </summary>
    private string CleanHtml(string html)
    {
        // start by completely removing all unwanted tags
        html = Regex.Replace(html, @"<[/]?(font|span|xml|del|ins|[ovwxp]:\w+)[^>]*?>", "", RegexOptions.IgnoreCase);
        // then run another pass over the html (twice), removing unwanted attributes
        html = Regex.Replace(html, @"<([^>]*)(?:class|lang|style|size|face|[ovwxp]:\w+)=(?:'[^']*'|""[^""]*""|[^\s>]+)([^>]*)>", "<$1$2>", RegexOptions.IgnoreCase);
        html = Regex.Replace(html, @"<([^>]*)(?:class|lang|style|size|face|[ovwxp]:\w+)=(?:'[^']*'|""[^""]*""|[^\s>]+)([^>]*)>", "<$1$2>", RegexOptions.IgnoreCase);
        return html;
    }
    <SPAN lang=EN-IE style="mso-ansi-language: EN-IE">
    <p class="MSO Normal">
    <UL style="MARGIN-TOP: 0cm" type=circle>
    <o:p> </o:p>
    <li class=MsoNormal style='mso-list:l3 level1 lfo3;tab-stops:list 36.0pt'>
I've spent a good deal of time examining the problematic tags that MS Word inserts in its HTML; some examples are shown above. The code above is based on a few requirements for my CMS:
The first regular expression removes unwanted tags, and is broken down as follows:
<[/]?(font|span|xml|del|ins|[ovwxp]:\w+)[^>]*?>
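To make the pieces of this pattern easier to see, here is a minimal sketch that writes the same expression with `RegexOptions.IgnorePatternWhitespace`, so each part can carry a comment. The class name `TagStripDemo` and the method name `RemoveUnwantedTags` are hypothetical and only serve the illustration; the pattern itself is the one from `CleanHtml`.

```csharp
using System.Text.RegularExpressions;

public static class TagStripDemo
{
    // Same pattern as the first Regex.Replace call in CleanHtml,
    // spread over several lines so each part can be annotated.
    private static readonly Regex UnwantedTag = new Regex(@"
        <                        # opening angle bracket
        [/]?                     # optional slash, so closing tags match too
        (font|span|xml|del|ins   # tag names to drop entirely
         |[ovwxp]:\w+)           # or a prefixed tag such as o:p
        [^>]*?                   # any attributes inside the tag, matched lazily
        >                        # closing angle bracket
        ", RegexOptions.IgnoreCase | RegexOptions.IgnorePatternWhitespace);

    // Hypothetical helper: removes the matched tags but keeps the text between them.
    public static string RemoveUnwantedTags(string html)
    {
        return UnwantedTag.Replace(html, "");
    }
}
```

Calling `TagStripDemo.RemoveUnwantedTags("<SPAN lang=EN-IE>Hello</SPAN><o:p> </o:p><p class=MsoNormal>World</p>")` should leave `Hello <p class=MsoNormal>World</p>`. Note that only the tags themselves are removed: the text inside them, including text inside `<del>`, stays in the output, and the `class` attribute on `<p>` is left for the second expression.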
The second regular expression removes unwanted attributes, and is broken down as follows:
<([^>]*)(?:class|lang|style|size|face|[ovwxp]:\w+)=(?:'[^']*'|""[^""]*""|[^\s>]+)([^>]*)>
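Each match of this pattern covers a whole tag but removes only one attribute from it, which is presumably why `CleanHtml` runs the replacement twice. A tag with three or more unwanted attributes would still keep some of them after two passes. The sketch below shows this and one possible workaround: repeating the replacement until the string stops changing. `AttributeStripDemo`, `StripOnce` and `StripAll` are hypothetical names, not part of the original code, and the loop is a suggested variation rather than the article's method.

```csharp
using System.Text.RegularExpressions;

public static class AttributeStripDemo
{
    // Same pattern as the second and third Regex.Replace calls in CleanHtml:
    //   <([^>]*)               start of tag plus everything before the attribute (group 1)
    //   (?:class|lang|...)=    the unwanted attribute name and its equals sign
    //   (?:'...'|"..."|bare)   a single-quoted, double-quoted, or unquoted value
    //   ([^>]*)>               whatever follows in the tag (group 2) and the closing bracket
    private static readonly Regex UnwantedAttribute = new Regex(
        @"<([^>]*)(?:class|lang|style|size|face|[ovwxp]:\w+)=(?:'[^']*'|""[^""]*""|[^\s>]+)([^>]*)>",
        RegexOptions.IgnoreCase);

    // One pass: strips at most one unwanted attribute from each tag.
    public static string StripOnce(string html)
    {
        return UnwantedAttribute.Replace(html, "<$1$2>");
    }

    // Hypothetical variant: repeat the pass until nothing changes.
    // Every successful replacement shortens the string, so the loop terminates.
    public static string StripAll(string html)
    {
        string previous;
        do
        {
            previous = html;
            html = StripOnce(html);
        } while (html != previous);
        return html;
    }
}
```

For the input `<p class=MsoNormal lang=EN-IE style='margin:0'>x</p>`, two calls to `StripOnce` still leave one of the three attributes in place, while `StripAll` removes all of them. In both cases the spaces that separated the attributes remain inside the tag, so a further whitespace clean-up may be wanted.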
If you have any suggestions or improvements, please post them here as comments. Thanks :)