# Wednesday, November 23, 2005
Wednesday, November 23, 2005 3:40:36 PM (GMT Standard Time, UTC+00:00) ( .Net General )

Introduction

I've spent a long time trying many different approaches to getting rid of MS Word HTML when importing or pasting text into my content management system, with very mixed success.  Previous efforts involved using the MSHTML element DOM, but this was slow and difficult to implement.  I think I've finally found a satisfactory and fast solution using only regular expressions.  Please feel free to use it in your applications, and post any improvements you may find.

The Code

using System.Text.RegularExpressions;

/// <summary>
/// Removes FONT, SPAN, XML, DEL, INS and namespaced (o:, v:, w:, x:, p:) tags,
/// and the class, lang, style, size, face and namespaced attributes.
/// Designed to get rid of non-standard Microsoft Word HTML.
/// </summary>
private string CleanHtml(string html)
{
    // start by completely removing all unwanted tags (the content between them is kept)
    html = Regex.Replace(html, @"<[/]?(font|span|xml|del|ins|[ovwxp]:\w+)[^>]*?>", "", RegexOptions.IgnoreCase);
    // then remove unwanted attributes; each pass removes at most one attribute per tag,
    // so the same replace is run twice
    html = Regex.Replace(html, @"<([^>]*)(?:class|lang|style|size|face|[ovwxp]:\w+)=(?:'[^']*'|""[^""]*""|[^\s>]+)([^>]*)>","<$1$2>", RegexOptions.IgnoreCase);
    html = Regex.Replace(html, @"<([^>]*)(?:class|lang|style|size|face|[ovwxp]:\w+)=(?:'[^']*'|""[^""]*""|[^\s>]+)([^>]*)>","<$1$2>", RegexOptions.IgnoreCase);
    return html;
}

Samples of non-standard Microsoft Word HTML

<SPAN lang=EN-IE style="mso-ansi-language: EN-IE">
<p class="MSO Normal">
<UL style="MARGIN-TOP: 0cm" type=circle>
<o:p>&nbsp;</o:p>
<li class=MsoNormal style='mso-list:l3 level1 lfo3;tab-stops:list 36.0pt'>

Explanation of Regular Expressions

I've spent a good deal of time examining the problematic tags that MS Word inserts in its HTML; some examples are shown above.  The code above is based on a few requirements for my CMS:

  • remove all FONT and SPAN tags, because all the styling in my CMS is done through style-sheets.
  • remove all CLASS and STYLE attributes, because they mean nothing outside of the original Word document
  • remove all namespace tags and attributes like <o:p> and < ... v:shape ... >

The first regular expression removes unwanted tags, and is broken down as follows:

<[/]?(font|span|xml|del|ins|[ovwxp]:\w+)[^>]*?>
  • match an open tag character <
  • and optionally match a / straight after it, so that the closing tags </...> are removed as well
  • match any of the list of unwanted tags: font, span, xml, del, ins
  • a pattern is given to match any of the namespace tags: one of the letters o, v, w, x, p, followed by a : followed by one or more word characters
  • match any attributes as far as the closing tag character >
  • the replace string for this regex is "", which will completely remove the instances of any matching tags.
  • note that we are not removing anything between the tags, just the tags themselves
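
To see the first expression on its own, here is a minimal sketch that applies only that pattern to a fragment assembled from the Word samples above. The class name TagStripDemo and the method StripTags are hypothetical names made up for this illustration; the pattern is copied unchanged from CleanHtml.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TagStripDemo
{
    // Same pattern as the first Regex.Replace call in CleanHtml.
    private const string UnwantedTags = @"<[/]?(font|span|xml|del|ins|[ovwxp]:\w+)[^>]*?>";

    // Hypothetical helper: removes the unwanted tags, keeps whatever sits between them.
    public static string StripTags(string html)
    {
        return Regex.Replace(html, UnwantedTags, "", RegexOptions.IgnoreCase);
    }

    public static void Main()
    {
        string input = "<p><SPAN lang=EN-IE style=\"mso-ansi-language: EN-IE\">Hello</SPAN><o:p>&nbsp;</o:p></p>";
        // The SPAN and o:p tags (opening and closing) are removed; <p>, </p> and the content stay.
        Console.WriteLine(StripTags(input));   // prints <p>Hello&nbsp;</p>
    }
}
```

The plain <p> tag survives because the namespace alternative needs a colon straight after the letter.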

The second regular expression removes unwanted attributes, and is broken down as follows:

<([^>]*)(?:class|lang|style|size|face|[ovwxp]:\w+)=(?:'[^']*'|"[^"]*"|[^\s>]+)([^>]*)>
(in the C# verbatim string in the code, each " is written as "")
  • match an open tag character <
  • capture any text before the unwanted attribute (this is $1 in the replace expression)
  • match (but don't capture) any of the unwanted attribute names: class, lang, style, size, face, or a namespaced attribute such as v:shape
  • there should always be an = character after the attribute name
  • match the value of the attribute by identifying the delimiters. These can be single quotes, or double quotes, or no quotes at all.
  • for single quotes, the pattern is: ' followed by anything but a ' followed by a '
  • similarly for double quotes.
  • for a non-delimited attribute value, the pattern is anything up to the next whitespace character or the closing tag character >.  Stopping at whitespace leaves any attributes that follow intact.
  • lastly, capture whatever comes after the unwanted attribute in ([^>]*) (this is $2 in the replace expression)
  • the replacement string <$1$2> reconstructs the tag without the unwanted attribute found in the middle.
  • note: each pass removes only one occurrence of an unwanted attribute per tag, which is why I run the same regex twice.  For example, take the html fragment: <p class="MSO Normal" style="Margin-TOP:3em">
    a single pass removes only one of these attributes (the last one, style, because the leading ([^>]*) is greedy).  Running the regex a second time removes the other one.  A tag carrying three or more unwanted attributes would need a further pass, but I can't think of any reasonable cases where it would need to be run more than twice.
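
The one-attribute-per-pass behaviour is easy to check in isolation. The sketch below reuses the attribute pattern unchanged; StripOnce and StripAll are hypothetical helper names, and StripAll (repeat until a pass changes nothing) is a possible variation for tags with more than two unwanted attributes, not what CleanHtml itself does.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AttributeStripDemo
{
    // Same pattern as the second and third Regex.Replace calls in CleanHtml.
    private const string UnwantedAttributes =
        @"<([^>]*)(?:class|lang|style|size|face|[ovwxp]:\w+)=(?:'[^']*'|""[^""]*""|[^\s>]+)([^>]*)>";

    // One pass: removes at most one unwanted attribute from each tag.
    public static string StripOnce(string html)
    {
        return Regex.Replace(html, UnwantedAttributes, "<$1$2>", RegexOptions.IgnoreCase);
    }

    // Hypothetical variation: keep running passes until the text stops changing.
    public static string StripAll(string html)
    {
        string previous;
        do
        {
            previous = html;
            html = StripOnce(html);
        } while (html != previous);
        return html;
    }

    public static void Main()
    {
        // Three unwanted attributes plus one (align) that should be preserved.
        string tag = "<li class=MsoNormal lang=EN-IE style='mso-list:l3 level1 lfo3' align=center>";
        Console.WriteLine(StripOnce(StripOnce(tag)));  // two passes: class is still there
        Console.WriteLine(StripAll(tag));              // only align=center remains
    }
}
```

Each pass leaves behind the whitespace that surrounded the removed attribute, so the cleaned tag can contain runs of spaces; browsers ignore them.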

Suggestions!

If you have any suggestions or improvements, please post them here as comments.
Thanks :)

p.s. thanks to BinBin for the fix to preserve attributes like 'align=center'.
# Friday, November 11, 2005
Friday, November 11, 2005 11:58:32 AM (GMT Standard Time, UTC+00:00) ( Asp.Net )

I'm just posting this here for reference.  It is a DataGrid that supports standard paging and sorting, and displays the range of record indices currently shown, e.g. "1-15 of 1000 records".

private void bindGrid()
{
    DataSet ds = new DB.Audits().SelectAllPendingAudits();

    // the sorting is always retrieved from viewstate, if it exists or not.
    string sort = String.Concat(ViewState["Sort"], "");
    if(sort != "")
        this.lblSort.Text = "Sorted by " + sort + " in ascending order";

    int numRows = ds.Tables[0].Rows.Count;
    if(numRows > 0)
    {
        int start = this.DataGrid1.CurrentPageIndex * this.DataGrid1.PageSize + 1;
        int end = Math.Min(numRows, start + this.DataGrid1.PageSize - 1);
        this.lblHeading.Text = String.Format("Displaying {0}-{1} of {2} records", start, end, numRows);
        this.DataGrid1.AllowPaging = (numRows > this.DataGrid1.PageSize);    // don't show pager unless relevant
        this.DataGrid1.Visible = true;
        DataView dv = new DataView(ds.Tables[0]);
        dv.Sort = sort;
        this.DataGrid1.DataSource = dv;
        this.DataGrid1.DataBind();
    }
    else
    {
        this.lblHeading.Text = "No records";            
        this.DataGrid1.Visible = false;
    }
}

private void DataGrid1_ItemCommand(object source, System.Web.UI.WebControls.DataGridCommandEventArgs e)
{
    if(e.CommandName == "Sort")
    {
        ViewState["Sort"] = e.CommandArgument.ToString();    // is picked up in bindGrid() function
        this.DataGrid1.CurrentPageIndex = 0;
        this.bindGrid();                
    }            
}

private void DataGrid1_PageIndexChanged(object source, System.Web.UI.WebControls.DataGridPageChangedEventArgs e)
{
    this.DataGrid1.CurrentPageIndex = e.NewPageIndex;
    bindGrid();
}
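
The "1-15 of 1000 records" label comes down to a little arithmetic on the zero-based page index. Here it is pulled out into a standalone sketch so it can be checked without a DataGrid; PagingRangeDemo and DescribeRange are hypothetical names used only for this illustration.

```csharp
using System;

public static class PagingRangeDemo
{
    // Mirrors the start/end arithmetic in bindGrid(); pageIndex is zero-based,
    // like DataGrid.CurrentPageIndex.
    public static string DescribeRange(int pageIndex, int pageSize, int numRows)
    {
        if (numRows <= 0)
            return "No records";
        int start = pageIndex * pageSize + 1;               // first record on this page (1-based)
        int end = Math.Min(numRows, start + pageSize - 1);  // last page may be only partly full
        return String.Format("Displaying {0}-{1} of {2} records", start, end, numRows);
    }

    public static void Main()
    {
        Console.WriteLine(DescribeRange(0, 15, 1000));   // Displaying 1-15 of 1000 records
        Console.WriteLine(DescribeRange(66, 15, 1000));  // Displaying 991-1000 of 1000 records
    }
}
```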
# Thursday, November 03, 2005
Thursday, November 03, 2005 6:52:23 PM (GMT Standard Time, UTC+00:00) ( Asp.Net )

I looked on the newsgroups to see if anyone had posted anything about this, and found a few dead-end posts which seemed to conclude that it couldn't be done.
I used a very simple approach that works well, and am posting it here for anyone looking to see how to do it.  The requirement is to present a RadioButtonList with images instead of just text.

string imageBankFolder = "/ImageBankFolder/Thumbnails/";
DataSet ds = new DB.ImageBank().Select(); // get your dataset from wherever
foreach(DataRow dr in ds.Tables[0].Rows)
   this.RadioButtonList1.Items.Add(new ListItem(String.Format("<img src='{0}'>", imageBankFolder + dr["ImageFile"].ToString()), dr["ImageID"].ToString()));

This displays the images only.  Note: Firefox works fine with this (you can click on the image to select it), but IE6 requires you to actually click on the round radio button icon.  To work around this, I included some text above the image; the text sits beside the button, so it is more intuitive for the user to click the text or the radio icon.  To include some text above the image, try the following:

this.RadioButtonList1.Items.Add(new ListItem(String.Format("{1}<BR><img src='{0}'>", imageBankFolder + dr["ImageFile"].ToString(), dr["Text"].ToString()), dr["ImageID"].ToString()));
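
If the file names or captions come from user-entered data, it may be worth encoding them before dropping them into the markup. The sketch below only builds the string that would be passed as the ListItem text; BuildItemHtml, the file name cat.jpg and the caption are all hypothetical, and it assumes System.Net.WebUtility is available (.NET Framework 4 or later; on older ASP.NET, HttpUtility.HtmlEncode plays the same role).

```csharp
using System;
using System.Net;

public static class RadioImageMarkupDemo
{
    // Hypothetical helper: builds the HTML used as the ListItem text.
    // With an empty caption it produces the image-only form, otherwise caption<BR>image.
    public static string BuildItemHtml(string folder, string imageFile, string caption)
    {
        string src = WebUtility.HtmlEncode(folder + imageFile);
        if (String.IsNullOrEmpty(caption))
            return String.Format("<img src='{0}'>", src);
        return String.Format("{1}<BR><img src='{0}'>", src, WebUtility.HtmlEncode(caption));
    }

    public static void Main()
    {
        // Made-up sample values, not taken from a real image bank.
        Console.WriteLine(BuildItemHtml("/ImageBankFolder/Thumbnails/", "cat.jpg", "Tom & Jerry"));
        // prints Tom &amp; Jerry<BR><img src='/ImageBankFolder/Thumbnails/cat.jpg'>
    }
}
```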

Hope this helps someone out there.
