Download: demo WinForms app to test the algorithm (7k)
Download: source code (3k)
Did you ever notice that on big sites like microsoft.com, if you reach a
page that doesn't exist, they don't just say "Sorry, 404."? They give
you a list of pages that are similar to the one you requested.
This is obviously very nice for your users to have, and it's easy
enough to integrate into your site. This article provides source
code and explains the algorithm to accomplish this feature. Note: the
real benefit of the approach outlined here is the semi-intelligent
fuzzy matching, which finds pages whose names are similar to, but not
exactly the same as, the requested URL.
The need for this grew out of a client of mine who was changing
content management system. Every URL in the site changed, so all
the search engine results led to 404 pages. This was obviously a
big inconvenience, so I put this together to help users find their way
through the new site when arriving from a search engine.
Go to the following page, http://www.iserc.ie/December15-ISERCWorkshoponTesting.html (which doesn't exist), and the 404 page should give you a list of pages with quite similar names.
To compare strings, you can use System.String.IndexOf, or you can use
regular expressions to match similarities, but all of these methods are
very unforgiving of slight discrepancies in the string. In the example
URL above, the page name is December15-ISERCWorkshoponTesting.html, but
under the new content management system the URL is December 15 - ISERC
Workshop - Software Testing.html, which is different enough to make
traditional string comparison techniques fall down.
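As a quick illustration of why exact matching fails here, the following sketch (the variable names are mine) checks the old page name against the new one with the techniques mentioned above:

```csharp
using System;

class IndexOfDemo
{
    static void Main()
    {
        string requested = "December15-ISERCWorkshoponTesting";
        string actual = "December 15 - ISERC Workshop - Software Testing";

        // IndexOf only finds exact substrings, so this near-match is
        // invisible to it: the spacing differences alone defeat it.
        Console.WriteLine(actual.IndexOf(requested)); // -1 (not found)

        // An exact-equality comparison obviously fails too.
        Console.WriteLine(actual == requested);       // False
    }
}
```

A human can see the two names refer to the same page, but nothing short of a fuzzy comparison will connect them.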
So I looked around for a fuzzy string comparison routine, and came
across an algorithm devised by Vladimir Levenshtein. His algorithm
figures out how different two strings are, based on how many character
additions, deletions and modifications are necessary to change one
string into the other. This is called the 'edit distance', i.e. how far
you have to go to make two strings match. It is very useful because it
takes into account slight differences in spacing, punctuation and
spelling. I found the algorithm at http://www.merriampark.com/ld.htm,
where Lasse Johansen kindly ported it to C#. The algorithm is explained
at that site, and it is well worth a read to see how it is done.
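The source download includes Lasse Johansen's port, but to make the idea concrete, here is a standalone sketch of the standard dynamic-programming formulation of edit distance (the class and method names here are mine, not the ones in the download):

```csharp
using System;

static class EditDistanceDemo
{
    // Classic dynamic-programming edit distance: d[i, j] holds the cost of
    // turning the first i characters of a into the first j characters of b.
    public static int EditDistance(string a, string b)
    {
        int[,] d = new int[a.Length + 1, b.Length + 1];

        for (int i = 0; i <= a.Length; i++) d[i, 0] = i; // i deletions
        for (int j = 0; j <= b.Length; j++) d[0, j] = j; // j insertions

        for (int i = 1; i <= a.Length; i++)
        {
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = (a[i - 1] == b[j - 1]) ? 0 : 1; // substitution cost
                d[i, j] = Math.Min(
                    Math.Min(d[i - 1, j] + 1,   // delete a character from a
                             d[i, j - 1] + 1),  // insert a character into a
                    d[i - 1, j - 1] + cost);    // substitute (or keep a match)
            }
        }
        return d[a.Length, b.Length];
    }
}
```

For example, EditDistance("kitten", "sitting") returns 3: two substitutions and one insertion.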
I originally had a problem with the algorithm because it gave
surprising results in certain situations. If the 404 page request was
for 'hello', and there is a valid page called 'hello_new_version' and
another valid page called 'abcde', then the 'abcde' page gets a better
score, because fewer changes are needed to make it the same as 'hello':
just change the five characters of 'abcde' into 'hello'. That is five
changes, even though 'hello_new_version' is semantically a better
match. Fortunately, a kind newsgroup participant named Patrice
suggested that I divide the score by the length of the comparison
string, to normalise the results. This worked perfectly, and I found
that a score between 0 (a perfect match) and 0.6 (a good match) is worth
including as a suggested page. You can change this threshold in the
ComputeResults() method if you want to make it more or less flexible.
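The arithmetic behind Patrice's suggestion is easy to check by hand. Using the raw distances for the example above (5 to turn 'abcde' into 'hello', and 12 insertions to turn 'hello' into 'hello_new_version'), normalising by the candidate's length reverses the ranking:

```csharp
using System;

class NormalisationDemo
{
    static void Main()
    {
        // Raw edit distances, as the Levenshtein algorithm would return them:
        int distAbcde = 5;  // 'hello' -> 'abcde': substitute all five characters
        int distLong  = 12; // 'hello' -> 'hello_new_version': append 12 characters

        // Divide by the comparison string's length to normalise:
        double scoreAbcde = (double)distAbcde / "abcde".Length;             // 1.00
        double scoreLong  = (double)distLong  / "hello_new_version".Length; // ~0.71

        // Lower is better, so the semantically closer page now wins:
        Console.WriteLine(scoreAbcde > scoreLong); // True
    }
}
```

Before normalisation, 'abcde' (distance 5) beat 'hello_new_version' (distance 12); after it, 'hello_new_version' scores roughly 0.71 against 1.0 for 'abcde'.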
    private void Page_Load(object sender, System.EventArgs e)
    {
        // the four key tasks: capture the requested URL (stored in
        // this.requestedUrl), set up the list of valid site URLs,
        // compute the results, and bind them to the page
        SetUpSiteUrls();
        ComputeResults();
        BindList();
    }

The above code shows the four key tasks that make up this solution. Each method is explained below.
    private void SetUpSiteUrls()
    {
        this.validUrls = new ArrayList();
        // Insert code here to add the pages in your site to this arraylist
    }
    private void ComputeResults()
    {
        ArrayList results = new ArrayList(); // used to store the results

        // build up an arraylist of the positive results
        foreach (string s in validUrls)
        {
            // don't waste time calculating the edit distance of nothing with something
            if (s == "") continue;

            double distance = Levenshtein.CalcEditDistance(s, this.requestedUrl); // both in lower case

            // anything between 0.0 and 0.6 is a good match; the algorithm
            // always returns a value >= 0
            double meanDistancePerLetter = distance / s.Length;

            if (meanDistancePerLetter <= 0.60D)
            {
                // add this result to the list. Use DictionaryEntry because we
                // want to store both the score and the hyperlink. Can't use a
                // SortedList because it doesn't allow duplicate keys, and two
                // hyperlinks can have the same edit distance.
                results.Add(new DictionaryEntry(meanDistancePerLetter,
                    "<a href='" + s + ".html'>" + s + "</a>"));
            }
        }
    }
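Since a SortedList is ruled out by its ban on duplicate keys, the plain ArrayList needs sorting by score before it is displayed. One way to do that (a sketch; the comparer class is mine and not part of the article's source) is an IComparer that orders DictionaryEntry items by their score key, smallest (best) first:

```csharp
using System.Collections;

// Hypothetical comparer, not in the article's download: orders
// DictionaryEntry items by their double score key, ascending.
// Duplicate scores are perfectly fine with this approach.
class ScoreComparer : IComparer
{
    public int Compare(object x, object y)
    {
        double a = (double)((DictionaryEntry)x).Key;
        double b = (double)((DictionaryEntry)y).Key;
        return a.CompareTo(b);
    }
}

// Usage, just before binding the results:
//     results.Sort(new ScoreComparer());
```

After the sort, the best suggestions appear at the top of the list.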
    private void BindList()
    {
        if (results.Count > 0)
        {
            this.lblHeader.Text = "The following pages have similar names to " + this.requestedUrl;
            this.DataList1.DataSource = results;
            this.DataList1.DataBind(); // required for the DataList to render
        }
        else
        {
            this.lblHeader.Text = "Unable to find any pages in this site that have similar names to " + this.requestedUrl;
        }
    }
The 'magic' in the code is all done by the
Levenshtein.CalcEditDistance method, which returns the distance between
two strings. It is included in the source download.
If you're interested in testing out the Levenshtein algorithm, I've
written a Windows Forms application that lets you enter a string (e.g.
a page URL) and a list of strings to compare it against (e.g. all
the page URLs in your site), and it gives you the 'edit distance'
scores. Download here (7k).
I think this is a great feature because it adds significant value to
the user experience of a web site. Please feel free to comment below
if you have any questions, find any bugs, have improvements, can't
get it working, or use it in a novel way.
© Copyright 2017 Tim Mackey