Introduction - Soundex more flawed than fuzzy?
Many genealogy websites offer searches using a choice of exact name, Metaphone, or Soundex. Soundex and Metaphone searches are ways of searching for a surname the way it is pronounced, rather than the way it is spelled. This kind of search is called approximate match or fuzzy search.
Soundex is sometimes named the Russell Soundex, after Robert Russell and Margaret Odell who patented it back in 1918 and 1922. Soundex uses a very crude sort of phonetic matching. Because this system is very simple it will generate many "false positives"; and this is the reason that Soundex, though very old, is not frequently used in the many Oracle databases I have managed. And if Soundex was used we got complaints from our end-users. Just google for terms like "Soundex problem", "Soundex false positive", "Soundex poor matching", "Soundex accuracy" or "Soundex false match" to check this for yourself.
Quote:
Only 33% of the matches that would be returned by Soundex would be correct. Even more significant was the finding that fully 25% of correct matches would fail to be discovered by Soundex. (Alan Stanier, Sept 1990, Computers in Genealogy, Vol.3, No. 7)
Because of these problems inherent in the rather simple Soundex algorithm there are many variants, e.g. the American Soundex System, the NYSIIS algorithm, the Daitch-Mokotoff Soundex System and many more. The only conclusion can be: Soundex is more flawed than fuzzy.
Better then Soundex
Metaphone is an algorithm for encoding a word to a phonetic key. To implement a good fuzzy search algorithm take a look at Metaphone - it is in the public domain so it can be freely used.
Suppose you would like to perform queries that finds not only exactly spelled words but also words that sound similar and tolerates typos.
For example I have a column "name" and want to search for "Keyser" and also want to return "Kayser, Keizer" etc. The names "Kayser, Keizer" and "Keyser" are all translated to "KSR" by Metaphone; the Z is replaced by S and all vowels are removed.
Another example: the place Bucuresti, Bucarest, Boekarest and Bukarest are all translated to "BKRST" - there's the match! Metaphone has many more phonetic tricks up it's sleeve.
Implement Metaphone fuzzy searching
Setting up a fuzzy search on your data can be just as easy as a normal "exact" search. Soundex is included as a function in Oracle; Metaphone is not so you'll have to use one of the available PL/SQL sources. There are a few versions of Metaphone available for PL/SQL; i like the package made by Scott Stephens best.
Metaphone for Oracle package by Scott Stephens - metaphone10.sql
Test it like this:
SQL> @metaphone10
Package created.
Package body created.
No errors.
SQL> select meta.phone('Bucarest') from DUAL ;
SQL> BKRST
Then try to query your CONTACT_NAMES table (or Customers or another table) :
select name from CONTACT_NAMES
where meta.phone(name) = meta.phone('&searchname')
Metaphone is more advanced then Soundex and needs more computing. The above query will perform on a few thousand records; however on a table of 1 million records it might take too long - perhaps a few minutes.
So in this case you'll have to create a table to contain the metaphone key of the names in your CONTACT_NAMES table.
Lets call this phonetic key table the CONTACT_METAPHONE table :
create table CONTACT_METAPHONE as
select CONTACT_ID, meta.phone(NAME) NAME_KEY
from CONTACT_NAMES ;
And the new fast query is:
select name from CONTACT_NAMES C
,CONTACT_METAPHONE M
where M.CONTACT_ID = C.CONTACT_ID
and M.NAME_KEY = meta.phone('&searchname')
Now obviously you will need some triggers on CONTACT_NAMES to keep the CONTACT_METAPHONE table synchronized with it's parent table. That's pretty straightforward though.
Metaphone language limitations.
Metaphone has it's limitations due to the different pronunciation rules for different languages. A quote from the author of the Metaphone algorithm Lawrence Philips :
Quote: (Dr. Dobbs, 1-june-2000 ) English is not only well known for having maddeningly irrational spelling practices, but in America we have also accumulated names from all over the world. The first algorithm to deal with this problem, Soundex, is gratifyingly simple, but it is not at all an adequate solution. It often fails to do the job of returning alternatives that are pronounced similarly to the search string.
Soundex and Metaphone belong to a class of algorithms usually known as "phonetic encoding" or "sound alike" algorithms - a heuristic type of fuzzy matching. They input a word or name, and return an encoded key, which should be the same for any words that are pronounced similarly. Therefore, Metaphone encodes "Stephan" as STFN. Since "Steven" and "Stefan" are pronounced similarly, you should be able to find those spellings as well. But Soundex doesn't know much about English spelling peculiarities, so it won't!
I published Metaphone in Computer Language magazine in 1990, as a replacement for Soundex. I coded in many common rules of English pronunciation that Soundex doesn't cover, such as when 'C' is pronounced as 'S' and when it is pronounced as 'K.'
Metaphone is written for the English language - therefore this version of Metaphone uses the English language rules - it may need to be localized on other languages. This is the part that needs a lot of expertise and testing.
A number of Metaphone 4GL sources for Oracle are on page 2 : metaphone Page 2
For any questions :
Please e-mail : info AT bytelife.nl
Bytelife is a registered trademark.