.Net ramblings
# Wednesday, 23 November 2005
Clean Word HTML using Regular Expressions


I've spent a long time trying different approaches to getting rid of MS Word HTML when importing or pasting text into my content management system, with very mixed success.  Previous efforts involved using the MSHTML Element DOM, but this was slow and difficult to implement.  I think I've finally found a satisfactory and fast solution using only regular expressions.  Please feel free to use it in your applications, and post any improvements you may find.

The Code

/// <summary>
/// Removes all FONT and SPAN tags, and all Class and Style attributes.
/// Designed to get rid of non-standard Microsoft Word HTML tags.
/// </summary>
// requires: using System.Text.RegularExpressions;
private string CleanHtml(string html)
{
    // start by completely removing all unwanted tags
    html = Regex.Replace(html, @"<[/]?(font|span|xml|del|ins|[ovwxp]:\w+)[^>]*?>", "", RegexOptions.IgnoreCase);
    // then run another pass over the html (twice), removing unwanted attributes
    html = Regex.Replace(html, @"<([^>]*)(?:class|lang|style|size|face|[ovwxp]:\w+)=(?:'[^']*'|""[^""]*""|[^\s>]+)([^>]*)>","<$1$2>", RegexOptions.IgnoreCase);
    html = Regex.Replace(html, @"<([^>]*)(?:class|lang|style|size|face|[ovwxp]:\w+)=(?:'[^']*'|""[^""]*""|[^\s>]+)([^>]*)>","<$1$2>", RegexOptions.IgnoreCase);
    return html;
}

Samples of non-standard Microsoft Word HTML

<SPAN lang=EN-IE style="mso-ansi-language: EN-IE">
<p class="MSO Normal">
<UL style="MARGIN-TOP: 0cm" type=circle>
<li class=MsoNormal style='mso-list:l3 level1 lfo3;tab-stops:list 36.0pt'>

Explanation of Regular Expressions

I've spent a good deal of time examining the problematic tags that MS Word inserts in its HTML; some examples are shown above.  The code above is based on a few requirements for my CMS:

  • remove all FONT and SPAN tags, because all the content in my CMS is done through style-sheets.
  • remove all CLASS and STYLE attributes because they mean nothing outside of the original Word document
  • remove all namespace tags and attributes like <o:p> and < ... v:shape ... >

The first regular expression removes unwanted tags, and is broken down as follows:

  • match an open tag character <
  • and optionally match a close tag sequence </  (because we also want to remove the closing tags)
  • match any of the list of unwanted tags: font,span,xml,del,ins
  • a pattern is given to match any of the namespace tags, anything beginning with o,v,w,x,p, followed by a : followed by another word
  • match any attributes as far as the closing tag character >
  • the replace string for this regex is "", which will completely remove the instances of any matching tags.
  • note that we are not removing anything between the tags, just the tags themselves
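
The breakdown above can also be written straight into the pattern. This is only a sketch: it is the same first regular expression, spread over several lines with RegexOptions.IgnorePatternWhitespace so each piece carries its own comment. The class and method names (TagStripper, StripUnwantedTags) are made up for the illustration and are not part of my CMS code.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TagStripper
{
    // Same pattern as the first Regex.Replace call in the post, laid out verbosely.
    // IgnorePatternWhitespace lets the pattern span lines and carry # comments.
    private static readonly Regex UnwantedTags = new Regex(@"
        <                                     # the open tag character
        [/]?                                  # optionally a slash, so closing tags match too
        (font|span|xml|del|ins|[ovwxp]:\w+)   # unwanted tag names, or namespace tags such as o:p
        [^>]*?                                # any attributes inside the tag
        >                                     # the closing tag character",
        RegexOptions.IgnoreCase | RegexOptions.IgnorePatternWhitespace);

    // Removes the matching tags but keeps whatever text sits between them.
    public static string StripUnwantedTags(string html)
    {
        return UnwantedTags.Replace(html, "");
    }

    public static void Main()
    {
        string input = "<SPAN lang=EN-IE style=\"mso-ansi-language: EN-IE\">Hello</SPAN><o:p></o:p>";
        Console.WriteLine(StripUnwantedTags(input)); // prints "Hello"
    }
}
```

The text between the tags ("Hello") survives; only the SPAN and o:p tags themselves are removed.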

The second regular expression removes unwanted attributes, and is broken down as follows:

  • match an open tag character <
  • capture any text before the unwanted attribute (This is $1 in the replace expression)
  • match (but don't capture) any of the unwanted attributes: class, lang, style, size, face, o:p, v:shape etc.
  • there should always be an = character after the attribute name
  • match the value of the attribute by identifying the delimiters. these can be single quotes, or double quotes, or no quotes at all.
  • for single quotes, the pattern is: ' followed by anything but a ' followed by a '
  • similarly for double quotes. 
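  • for a non-delimited attribute value, I specify the pattern as anything except whitespace or the closing tag character >, so the match stops at the end of the value and any attributes that follow it are left alone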
  • lastly, capture whatever comes after the unwanted attribute in ([^>]*)
  • the replacement string <$1$2> reconstructs the tag without the unwanted attribute found in the middle.
  • note: each pass only removes one occurrence of an unwanted attribute per tag, which is why I run the same regex twice.  For example, take the HTML fragment: <p class="MSO Normal" style="Margin-TOP:3em"> 
    A single pass will only remove one of these attributes.  Running the regex twice will remove the second one.  I can't think of many reasonable cases where it would need to be run more than that, although a tag carrying three or more unwanted attributes would need one extra pass for each.
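
If you would rather not guess how many passes are enough, one option is to repeat the attribute pass until the HTML stops changing. The following is a minimal sketch of that idea, not the code I actually run in my CMS: the patterns are the two from the post, while the names WordHtmlCleaner and CleanHtmlUntilStable are invented for the example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WordHtmlCleaner
{
    // The two patterns from the post, unchanged.
    private static readonly Regex UnwantedTags = new Regex(
        @"<[/]?(font|span|xml|del|ins|[ovwxp]:\w+)[^>]*?>",
        RegexOptions.IgnoreCase);

    private static readonly Regex UnwantedAttribute = new Regex(
        @"<([^>]*)(?:class|lang|style|size|face|[ovwxp]:\w+)=(?:'[^']*'|""[^""]*""|[^\s>]+)([^>]*)>",
        RegexOptions.IgnoreCase);

    public static string CleanHtmlUntilStable(string html)
    {
        // First remove the unwanted tags outright.
        html = UnwantedTags.Replace(html, "");

        // Each pass strips at most one unwanted attribute per tag, so keep
        // going until a pass changes nothing. Every replacement shortens the
        // string, so the loop always terminates.
        string previous;
        do
        {
            previous = html;
            html = UnwantedAttribute.Replace(html, "<$1$2>");
        } while (html != previous);

        return html;
    }

    public static void Main()
    {
        // Three unwanted attributes on one tag: two fixed passes would leave one behind.
        string input = "<p class=MsoNormal lang=EN-IE style=\"margin:0\">text</p>";
        Console.WriteLine(CleanHtmlUntilStable(input));
    }
}
```

The cost is one extra pass at the end that finds nothing to replace, which seems a fair price for not having to count attributes. Note that the whitespace around the removed attributes is left behind inside the tag, exactly as with the two-pass version.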


If you have any suggestions or improvements, please post them here as comments.
Thanks :)

p.s. thanks to BinBin for the fix to preserve attributes like 'align=center'.
Wednesday, 23 November 2005 15:40:36 (GMT Standard Time, UTC+00:00)  #    Comments [27]  .Net General

# Friday, 11 November 2005
Asp.Net generally useful datagrid code

i'm just posting this here for reference.  it is a datagrid that supports standard paging and sorting, and displays the current set of record indices e.g. "1-15 of 1000 records"

private void bindGrid()
{
    DataSet ds = new DB.Audits().SelectAllPendingAudits();

    // the sort expression is always read from viewstate; it is "" if nothing has been stored yet.
    string sort = String.Concat(ViewState["Sort"], "");
    if(sort != "")
        this.lblSort.Text = "Sorted by " + sort + " in ascending order";

    int numRows = ds.Tables[0].Rows.Count;
    if(numRows > 0)
    {
        int start = this.DataGrid1.CurrentPageIndex * this.DataGrid1.PageSize + 1;
        int end = Math.Min(numRows, start + this.DataGrid1.PageSize - 1);
        this.lblHeading.Text = String.Format("Displaying {0}-{1} of {2} records", start, end, numRows);
        this.DataGrid1.AllowPaging = (numRows > this.DataGrid1.PageSize);    // don't show pager unless relevant
        this.DataGrid1.Visible = true;
        DataView dv = new DataView(ds.Tables[0]);
        dv.Sort = sort;
        this.DataGrid1.DataSource = dv;
        this.DataGrid1.DataBind();    // the grid renders nothing until it is bound
    }
    else
    {
        this.lblHeading.Text = "No records";
        this.DataGrid1.Visible = false;
    }
}

private void DataGrid1_ItemCommand(object source, System.Web.UI.WebControls.DataGridCommandEventArgs e)
{
    if(e.CommandName == "Sort")
    {
        ViewState["Sort"] = e.CommandArgument.ToString();    // is picked up in bindGrid() function
        this.DataGrid1.CurrentPageIndex = 0;
        bindGrid();    // rebind so the new sort shows; drop this if bindGrid() already runs later in the page lifecycle
    }
}

private void DataGrid1_PageIndexChanged(object source, System.Web.UI.WebControls.DataGridPageChangedEventArgs e)
{
    this.DataGrid1.CurrentPageIndex = e.NewPageIndex;
    bindGrid();    // rebind so the new page shows; same caveat as above
}
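
The "1-15 of 1000 records" arithmetic is easy to get off by one, so here it is on its own as a small sketch that runs outside a page. PagingSummary and Describe are names made up for this example; pageIndex is zero-based, like DataGrid.CurrentPageIndex.

```csharp
using System;

public static class PagingSummary
{
    // Builds the heading text for one page of results.
    // pageIndex is zero-based; pageSize and numRows are plain counts.
    public static string Describe(int pageIndex, int pageSize, int numRows)
    {
        if (numRows <= 0)
            return "No records";

        int start = pageIndex * pageSize + 1;                 // first record on this page (1-based)
        int end = Math.Min(numRows, start + pageSize - 1);    // last record, capped on the final page
        return String.Format("Displaying {0}-{1} of {2} records", start, end, numRows);
    }

    public static void Main()
    {
        Console.WriteLine(Describe(0, 15, 1000));   // Displaying 1-15 of 1000 records
        Console.WriteLine(Describe(66, 15, 1000));  // Displaying 991-1000 of 1000 records
    }
}
```

The Math.Min call is what keeps the last page honest: 1000 records at 15 per page leaves only 10 on the final page.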

Friday, 11 November 2005 11:58:32 (GMT Standard Time, UTC+00:00)  #    Comments [0]  Asp.Net

# Thursday, 03 November 2005
HowTo: present a radiobuttonlist with images

i looked on the newsgroups to see if anyone had posted anything about this, and i found a few dead-end posts which seemed to conclude that it couldn't be done. 
i used a very simple approach that works well, and am posting it here for anyone looking to see how to do it.   the requirements are to present a radio-button-list with images instead of just text.

string imageBankFolder = "/ImageBankFolder/Thumbnails/";
DataSet ds = new DB.ImageBank().Select(); // get your dataset from wherever
foreach(DataRow dr in ds.Tables[0].Rows)
   this.RadioButtonList1.Items.Add(new ListItem(String.Format("<img src='{0}'>", imageBankFolder + dr["ImageFile"].ToString()), dr["ImageID"].ToString()));

this displays the images only.  note: firefox works fine with this, you can click on the image to select it, but IE6 requires you to actually click on the round radio button icon.  to work around this, i included some text above the image, which sits beside the button, and it is more intuitive for the user to click the text or the radio icon then.  to include some text above the image, try the following:

this.RadioButtonList1.Items.Add(new ListItem(String.Format("{1}<BR><img src='{0}'>", imageBankFolder + dr["ImageFile"].ToString(), dr["Text"].ToString()), dr["ImageID"].ToString()));

hope this helps someone out there.

Thursday, 03 November 2005 18:52:23 (GMT Standard Time, UTC+00:00)  #    Comments [14]  Asp.Net

# Wednesday, 05 October 2005
Send Ctrl-Alt-Delete via Remote Desktop

i wanted to change the admin password on my web server through remote desktop, but Ctrl-Alt-Delete always goes to the local computer. 

i found out you can also use Ctrl-Alt-End to achieve the same thing, which works in remote desktop.

Wednesday, 05 October 2005 12:35:21 (GMT Daylight Time, UTC+01:00)  #    Comments [46]  Windows Server

# Friday, 30 September 2005
HowTo: Backup RRAS configuration to text files

Thanks to Dusty Harper for his post on the server.networking MS newsgroup on how to back up RRAS settings using Netsh:

Netsh Routing Dump > Routing.txt
Netsh RAS Dump > RAS.txt 

Then you can use Netsh Exec to play back a file, e.g. Netsh Exec Routing.txt.

Note: if you do an ntbackup of system state, the RRAS settings are also included in that.  I like having the text file versions as well, just in case.

Friday, 30 September 2005 13:49:56 (GMT Daylight Time, UTC+01:00)  #    Comments [0]  Windows Server

# Wednesday, 28 September 2005
Removing rows from a dataset (i.e. achieving TOP functionality when you can't use SQL)

This might sound really obvious, but I couldn't find a better way.  Normally I would use TOP in the SQL query to limit the number of records I want to retrieve, but in my case this value is parameterised, and Access won't allow me to parameterise that value.  I tried using a DataView, but TOP isn't one of its supported functions.  So I just loop through the dataset and keep removing rows until the right number of records is reached.

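int maxItems = 5;
// drop rows from the end until only the first maxItems rows remain
while(ds.Tables[0].Rows.Count > maxItems)
    ds.Tables[0].Rows.RemoveAt(ds.Tables[0].Rows.Count - 1);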
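
Here is the same idea as a small sketch you can run on its own, with a DataTable built in memory standing in for the Access query result. The names TopRows and KeepTop are invented for the example, and removing rows from the end is my assumption about which rows to drop: it keeps the first rows in whatever order the query returned them.

```csharp
using System;
using System.Data;

public static class TopRows
{
    // Trims the table in place so that at most maxItems rows remain,
    // keeping the rows at the start of the table.
    public static void KeepTop(DataTable table, int maxItems)
    {
        if (maxItems < 0)
            throw new ArgumentOutOfRangeException("maxItems");

        while (table.Rows.Count > maxItems)
            table.Rows.RemoveAt(table.Rows.Count - 1);
    }

    public static void Main()
    {
        // Stand-in data: eight rows with IDs 1 to 8.
        DataTable table = new DataTable("Items");
        table.Columns.Add("ID", typeof(int));
        for (int i = 1; i <= 8; i++)
            table.Rows.Add(i);

        KeepTop(table, 5);
        Console.WriteLine(table.Rows.Count);      // 5
        Console.WriteLine(table.Rows[4]["ID"]);   // 5
    }
}
```

If the order matters, sort in the query itself (ORDER BY) before trimming, since this only looks at row position.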

Wednesday, 28 September 2005 11:01:17 (GMT Daylight Time, UTC+01:00)  #    Comments [0]  .Net General | Database

# Saturday, 10 September 2005
Mapping .html pages to Asp.Net

I was doing an upgrade on a web site recently, and all the pages were .html pages.  I wanted to add some .Net functionality, but didn't want to change all the URLs, which would break bookmarks, search engine listings etc., as well as scare off the client with the strange ".aspx" file extensions.  Yes, many Irish companies are still technophobic.

Add an IIS mapping for .html

I remembered how to change mappings for a file extension in IIS (web site properties > home directory > configuration), so I did this for .html pages by adding a mapping for .html to aspnet_isapi.dll (copy the full path from the mapping for .aspx).

Add a HttpHandler to the application web.config file

when i did the above, my .net code was ignored and rendered as plain text.  i found out this was because the web application (at the .net level) wasn't configured to handle .html files as .aspx files. this is what i added to my web.config to get it working:

<configuration>
  <system.web>
    <httpHandlers>
      <add verb="*" path="*.html" type="System.Web.UI.PageHandlerFactory" />
    </httpHandlers>
  </system.web>
</configuration>

now the whole application works with full .net functionality, overcoming all those migration problems usually associated with .net upgrades.

Saturday, 10 September 2005 14:38:39 (GMT Daylight Time, UTC+01:00)  #    Comments [0]  Asp.Net

# Friday, 09 September 2005
An asp.net button that disables itself automatically after clicking.

Some users of a web application i wrote insist on clicking buttons more than once, probably out of impatience. this often causes duplicate key exceptions with the database, because the first time they clicked the button the record was created, and the second time they clicked it, an exception is thrown, so they get the error screen and don't know what they did wrong. 

I wanted to write a button control that would disable itself automatically and re-enable itself once it was finished.  I couldn't find any good samples out there.  JavaScript is obviously the answer, and the solution I came up with is quite simple.  Here's the code (it currently only works with .Net 1.1):

// requires: using System; using System.ComponentModel; using System.Web.UI; using System.Web.UI.WebControls;

/// <summary>
/// A button control that disables itself when clicked, and changes the text to "Please wait..."
/// This is to prevent duplicate clicks by impatient or novice users.
/// It requires the button to be placed in a server form.
/// </summary>
[DefaultProperty("Text"), ToolboxData("<{0}:SmartButton runat=server></{0}:SmartButton>")]
public class SmartButton : Button
{
    /// <summary>
    /// Add an 'onClick' attribute to disable the button when it is clicked, and submit the form,
    /// invoking the postback.
    /// The onClick code handles the case where __EVENTTARGET is registered on the page, in which case
    /// this variable is set to the button ID, and the form is submitted.
    /// The other case is where __EVENTTARGET does not exist on the page, i found this sometimes
    /// occurred on pages with only one button. In this case, the form is simply submitted, and the
    /// button_click event will be raised by virtue of the default submit button in the form.
    /// </summary>
    protected override void Render(HtmlTextWriter output)
    {
        string onClick = "if(this.form != null && this.form.__EVENTTARGET != null){ this.form.__EVENTTARGET.value='" + this.UniqueID + "'; this.disabled = true; this.value = 'Please wait...'; this.form.submit(); } else this.form.submit(); ";
        if(this.Attributes["onclick"] != null) // prepend the existing onClick attributes
            onClick = this.Attributes["onclick"].ToString() + onClick;
        this.Attributes.Add("onclick", onClick);
        base.Render(output); // let the base Button write itself out, now with the onclick attribute
    }

    protected override void OnClick(EventArgs e)
    {
        // do the OnClick code first
        base.OnClick (e);

        // then reset the enabled + text values to their original state.
        // the script needs a <script> wrapper to run, and the literal braces are doubled for String.Format.
        // ClientID is the id attribute the button is rendered with in the browser.
        int insertAt = Math.Max(this.Page.Controls.Count-1, 0); // never insert at -1 if there are no controls on the page
        this.Page.Controls.AddAt(insertAt, new LiteralControl(String.Format(@"
<script type='text/javascript'>
if(document.getElementById('{0}') != null)
{{
    document.getElementById('{0}').disabled = false;
    document.getElementById('{0}').value = '{1}';
}}
</script>
", this.ClientID, this.Text)));
    }
}

Friday, 09 September 2005 15:41:29 (GMT Daylight Time, UTC+01:00)  #    Comments [2]  Asp.Net

# Saturday, 27 August 2005
A 'Progress-Task-List' control

I decided to write this control when i realised there are some cases where a progress bar is not enough, even with a label that gets updated for each new stage of a list of operations.

Sometimes you want the user to see what's coming next, what has already been done, and the ProgressTaskList control does just that.  it's nothing new of course, often used in installers and the like, but there didn't appear to be any .Net control like this.

I've submitted the control to code-project (URL to follow), which is where any updates will be posted.  If you have comments or suggestions, it's best to put them on the code-project site please.

I'm quite pleased with the way it turned out.  The code is very simple, and i couldn't find any bugs having done lots of testing.

As you can see in the screenshot to the left, it handles scrolling quite well, and automatically jumps to the current task when it is starting a new one.

To use it you just specify TaskItems as a string[] and call Start() to set it off. Then call NextTask() every time a task is finished to advance it to the next task. 

You can download the source here (30Kb) if you like, but check on code-project for updates.

Saturday, 27 August 2005 01:28:29 (GMT Daylight Time, UTC+01:00)  #    Comments [0]  .Net Windows Forms

# Friday, 19 August 2005
Tortilla Española: the real deal

Ever since I went to Malaga in Southern Spain 10 years ago, I have tried and failed to reproduce the authentic taste of the amazing "Tortilla Española", the Spanish Omelette.  I remember paying about a euro for a large tortilla that would be perfectly acceptable to eat for breakfast, lunch or (and!) dinner.
Fortunately, I came across a recipe online today that I am posting here for future reference.  Although I am fairly handy with the old omelettes in general, this was a real find, in particular the discovery that you fry the potatoes in lots of olive oil, which makes them go soft and gives a lovely soft texture to the whole tortilla.
You can see it at its original location here; I'm only copying it here in case that URL ever disappears or goes down.

Spanish Tortilla

Serves four as a main course; twelve as a tapa.

  • 1 and 3/4 cups vegetable oil for frying (or plain olive oil)
  • about 5 medium-sized potatoes, peeled
  • 2 tsp. coarse salt
  • 2 or 3 medium-sized onions, diced
  • 5 medium cloves garlic, very coarsely chopped
  • 6 large eggs
  • 1/8 tsp. freshly ground black pepper

In a 10 or 11 inch non-stick skillet (should be at least 2 inches deep), heat the oil on medium high. While the oil is heating, slice the potatoes thinly, about 1/8 inch. Transfer to a bowl and sprinkle on the 2 tsp. of salt, tossing to distribute it well.

When the oil is very hot (a potato slice will sizzle vigorously around the edges without browning), gently slip the potatoes into the oil with a skimmer or slotted spoon. Fry the potatoes, turning occasionally (trying not to break them) and adjusting the heat so they sizzle but don't crisp or brown. Set a sieve over a bowl or else line a plate with paper towels. When the potatoes are tender, after 10 to 12 min., transfer them with the skimmer to the sieve or lined plate.

Add the onions and garlic to the pan. Fry, stirring occasionally, until the onions are very soft and translucent but not browned (you might need to lower the heat), 7 to 9 min. Remove the pan from the heat and, using the skimmer, transfer the onions and garlic to the sieve or plate with the potatoes. Drain the oil from the skillet, reserving at least 1 Tbs. (strain the rest and reserve to use again, if you like) and wipe out the pan with a paper towel so it's clean. Scrape out any stuck-on bits, if necessary.

In a large bowl, beat the eggs, 1/4 tsp. salt, and the pepper with a fork until blended. Add the drained potatoes, onions, and garlic and mix gently to combine with the egg, trying not to break the potatoes (some will anyway).

Heat the skillet on medium high. Add the 1 Tbs. reserved oil. Let the pan and oil get very hot (important so the eggs don't stick), and then pour in the potato and egg mixture, spreading it evenly. Cook for 1 min. and then lower the heat to medium low, cooking until the eggs are completely set at the edges, halfway set in the center, and the tortilla easily slips around in the pan when you give it a shake, 8 to 10 min. You may need to nudge the tortilla loose with a knife or spatula. (I found i had to turn it down very low to keep it from burning)

Set a flat, rimless plate that's at least as wide as the skillet upside down over the pan. Lift the skillet off the burner and, with one hand against the plate and the other holding the skillet's handle, invert the skillet so the tortilla lands on the plate (it should fall right out). Set the pan back on the heat and slide the tortilla into it, using the skimmer to push any stray potatoes back in under the eggs as the tortilla slides off the plate. Once the tortilla is back in the pan, tuck the edges in and under itself (to neaten the sides). Cook until a skewer inserted into the center comes out clean, hot, and with no uncooked egg on it, another 5 to 6 min.

Transfer the tortilla to a serving platter and let cool at least 10 min. Serve warm, at room temperature, or slightly cool. Cut into wedges or small squares, sticking a toothpick in each square if serving as an appetizer.

If the idea of cold tortilla doesn't get you going, you should try it, it might surprise you like it did me.  I didn't even like eggs when i got hooked on tortillas :)

Many thanks and all credits to Sarah Jay for sharing this great recipe.
By the way, it's incredibly filling because of all that oil, so eat about half as much as you'd think, then wait a while to see how you get on!  No wonder the Spaniards have so many siestas; eating tortilla all the time would knock anyone out.

Friday, 19 August 2005 16:34:36 (GMT Daylight Time, UTC+01:00)  #    Comments [2]  General