Comparing the Match Percentage of Two Strings

Goog Feed, General, ColdFusion    Comments (4)


Today at work, my boss asked me to check into "Fuzzy Match" capabilities in ColdFusion for a project we're working on here at work. Basically we want to compare two strings and find out what is the percentage of match between the two. For example does string one match string two with 70% accuracy or better?

So after a couple Google searches I came upon a partial solution. All signs pointed to the fact that I would have to use the Levenshtein distance formula which calculates the number of replacements, insertions or deletions necessary to match string one exactly up to string two.

I happened across a ColdFusion UDF written by Nicholas Zograpos which uses the Levenshtein formula to do this computation and passes back the number of characters needed to change string one into string two. You can find the UDF here.

I put this UDF at the top of my CFC.

<cfscript>
/**
* Computes the Levenshtein distance between two strings.
*
* @param s First string. (Required)
* @param t Second string. (Required)
* @return Returns a number.
* @author Nicholas Zographos (nicholas@nezen.net)
* @version 1, March 15, 2004
*/
function levDistance(s,t) {
var d = ArrayNew(2);
var i = 1;
var j = 1;
var s_i = "A";
var t_j = "A";
var cost = 0;

var n = len(s)+1;
var m = len(t)+1;

d[n][m]=0;

if (n is 1) {
return m;
}

if (m is 1) {
return n;
}

for (i = 1; i lte n; i=i+1) {
d[i][1] = i-1;
}

for (j = 1; j lte m; j=j+1) {
d[1][j] = j-1;
}

for (i = 2; i lte n; i=i+1) {
s_i = Mid(s,i-1,1);

for (j = 2; j lte m; j=j+1) {
t_j = Mid(t,j-1,1);

if (s_i is t_j) {
cost = 0;
}
else {
cost = 1;
}
d[i][j] = min(d[i-1][j]+1, d[i][j-1]+1);
d[i][j] = min(d[i][j], d[i-1][j-1] + cost);
}
}

return d[n][m];
}
</cfscript>

Next, I created a function of my own inside of my CFC which calls the levDistance function passing in two strings, it then divides the result by the length of the string we want to match against (stringTwo) and multiplies by 100 which gives the percentage of error between the two strings in numeric format. Finally I compare the error percentage with my threshold or my max error percentage I am willing to allow, passing the user back a true/false based on the result.

You'll find my function below.

<cffunction name="stringCompare" access="public" returnType="boolean" output="false">
<cfargument type="any" name="stringOne" default="" required="yes">
<cfargument type="any" name="stringTwo" default="" required="yes">

<cfset var result = true />
<cfset var threshold = 30 /><!---This means that we are only allowing for a 30% error rate, 70% match in essence.--->
<cfset var stringCompare = levDistance(trim(arguments.stringOne),trim(arguments.stringTwo))/len(trim(arguments.stringTwo))*100 />

<!---Comparing the result.--->
<cfif stringCompare GT threshold>
<cfset result = false />
</cfif>

<cfreturn result>
</cffunction>

Finally, one last bit of code. How to call the function from the view.

<cfset st1 = "How are you doing?" />
<cfset st2 = "How are ya doing?" />

#myObj.stringCompare(stringOne=st1,stringTwo=st2)#

You can play with the percentage numbers as needed to fit your situation. This really helped me in a problem I faced, hopefully it will do the same for you!

Comments (Comment Moderation is enabled. Your comment will not appear until approved.)
You might look into SOUNDEX. I don't think it'll help with this % match, but it might help with fuzzy matches, which it sounds like you might be doing.
# Posted By shan | 7/23/08 6:00 PM
@Shan

According to Ben Forta -Soundex checks up to first 4 sounds only, and not numbers or any non alpha characters.

It didn't seem to be as reliable.
# Posted By Joe Gautreau | 7/23/08 6:04 PM
Pretty cool. I have a version of the Levenshtein method I found in C and ported to cfscript. It returns the distance, percentage match, and an HTML highlighted version of each string showing the differences. It's found here:
http://www.bradwood.com/string_compare/
# Posted By Brad Wood | 7/24/08 5:52 AM
@Brad

Cool stuff! Thanks for posting.
# Posted By Joe Gautreau | 7/24/08 10:22 AM