25 Sep 2018

Word similarity matching using the Soundex algorithm in Python




Word similarity matching is an essential part of text cleaning and text analysis.

Suppose your text contains many misspellings of proper nouns such as names and places, and you need to convert all the variants of each name or place to one standard form.

This is where the Soundex algorithm helps: it lets you check whether two words sound alike.

Soundex is a phonetic algorithm that finds similar-sounding terms.

A Soundex search algorithm takes a word, such as a person's name, as input and produces a character string that identifies a set of words that sound (roughly) alike.

The Soundex method is based on six phonetic types of human speech sounds (bilabial, labiodental, dental, alveolar, velar, and glottal). These types are defined by where you put your lips and tongue to make the sounds.


Let’s try it in Python.

(Note: I used version 1.1 of the fuzzy package, i.e. fuzzy==1.1.)

import fuzzy
# Create an encoder that returns a 10-character phonetic code
soundex = fuzzy.Soundex(10)
# Text to process
word = 'phone'
soundex(word)

Output:
'P500000000'


Here P is the first letter of the word ‘phone’.

Now suppose someone misspells ‘phone’ as ‘fone’. Let’s see whether Soundex can still identify it.

import fuzzy
# Create an encoder that returns a 10-character phonetic code
soundex = fuzzy.Soundex(10)
# Text to process
word = 'fone'
soundex(word)

Output:
'F500000000'


You can see that, apart from the first character, the phonetic codes for ‘phone’ and ‘fone’ are the same. So we can use the Soundex algorithm to solve this kind of problem by comparing the codes without their first character.

word = 'fone'
soundex(word)[1:]
word = 'phone'
soundex(word)[1:]

Output (the same for both words):
'500000000'
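
The comparison above can be wrapped in a small helper. This is only a minimal sketch: same_sound_ignoring_first is a hypothetical name, not part of fuzzy, and it works directly on code strings such as the two outputs shown earlier. Keep in mind that dropping the first character makes the match looser, so unrelated words that differ only in their first letter will match as well.

```python
def same_sound_ignoring_first(code_a: str, code_b: str) -> bool:
    """Return True when two Soundex codes agree after their first character."""
    return code_a[1:] == code_b[1:]


# Codes copied from the outputs shown earlier for 'phone' and 'fone'.
print(same_sound_ignoring_first('P500000000', 'F500000000'))  # True
# A code with different digits does not match.
print(same_sound_ignoring_first('P500000000', 'R163000000'))  # False
```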


How does the Soundex algorithm work?


Step1:

Store the first letter of the word separately and convert it to upper case.


‘phone’ => ‘P’, ‘hone’


Step2:

Drop every a, e, i, o, u, y, h and w from the rest of the word.


‘hone’ => ‘n’

Step3:

Replace the remaining letters (everything after the first letter) with digits as follows:

  • b, f, p, v → 1
  • c, g, j, k, q, s, x, z → 2
  • d, t → 3
  • l → 4
  • m, n → 5
  • r → 6

‘n’ => ‘5’

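The replacement table can be written as a Python dictionary. A minimal sketch, where GROUPS, SOUNDEX_DIGITS and to_digits are illustrative names rather than anything provided by fuzzy:

```python
# Letter groups from the table above, each paired with its digit.
GROUPS = (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
          ("l", "4"), ("mn", "5"), ("r", "6"))

# Expand the groups into a letter -> digit lookup table.
SOUNDEX_DIGITS = {letter: digit for letters, digit in GROUPS for letter in letters}


def to_digits(letters: str) -> str:
    """Replace each letter that has a Soundex digit; skip any other character."""
    return "".join(SOUNDEX_DIGITS[ch] for ch in letters.lower() if ch in SOUNDEX_DIGITS)


print(to_digits("n"))  # 5
```
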
Step4:


Join the uppercase first letter with the digits from step 3.


So the phonetic value of the word ‘phone’ is P5.

Because we asked for 10 characters [soundex = fuzzy.Soundex(10); soundex(word)],

8 zeros are appended after P5 (10 characters in total), so the final value is P500000000.
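
The four steps can be put together in plain Python. The sketch below follows the simplified steps exactly as described above; simple_soundex is a hypothetical helper, not the fuzzy implementation. Standard Soundex additionally merges adjacent letters that share a digit, so for words containing such letters this sketch may give a different code than a full implementation. It also assumes a non-empty word made of ASCII letters.

```python
DROPPED = set("aeiouyhw")
DIGITS = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for letter in letters:
        DIGITS[letter] = digit


def simple_soundex(word: str, size: int = 10) -> str:
    # Step 1: keep the first letter separately, in upper case.
    first, rest = word[0].upper(), word[1:].lower()
    # Step 2: drop a, e, i, o, u, y, h, w from the rest of the word.
    kept = [ch for ch in rest if ch not in DROPPED]
    # Step 3: replace the remaining letters with their digits.
    digits = "".join(DIGITS[ch] for ch in kept if ch in DIGITS)
    # Step 4: join, then cut or zero-pad to the requested length.
    return (first + digits)[:size].ljust(size, "0")


print(simple_soundex("phone"))  # P500000000
print(simple_soundex("fone"))   # F500000000
```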

Let’s work through one more word.



Word = 'Rupert'


Step1:
‘Rupert’ => ‘R’, ‘upert’

Step2:

‘upert’ => ‘prt’

Step3:

‘prt’ => 163


Step4:

R163 => R163000000 (6 zeros added to bring the length to 10)
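
A hand calculation like this one can be double-checked in a few lines. The sketch below uses str.maketrans and str.translate from the standard library to apply steps 2 and 3 in one pass; trace is an illustrative name, and the code assumes plain ASCII letters and the simplified steps described in this post. It also shows that ‘Robert’ gets the same code as ‘Rupert’, so two distinct names can share one code.

```python
# Map each consonant to its digit and delete a, e, i, o, u, y, h, w.
TABLE = str.maketrans("bfpvcgjkqsxzdtlmnr", "111122222222334556", "aeiouyhw")


def trace(word: str, size: int = 10) -> str:
    """Print the intermediate values for a word and return its padded code."""
    first, rest = word[0].upper(), word[1:].lower()   # step 1
    digits = rest.translate(TABLE)                    # steps 2 and 3
    code = (first + digits)[:size].ljust(size, "0")   # step 4 plus zero padding
    print(f"{word}: {first!r}, {rest!r} -> {digits!r} -> {code}")
    return code


trace("Rupert")  # Rupert: 'R', 'upert' -> '163' -> R163000000
trace("Robert")  # Robert: 'R', 'obert' -> '163' -> R163000000
```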

Conclusion:

In this tutorial you learned:
  • What word similarity matching is.
  • Why word similarity matching is important for text analysis.
  • How to do word similarity matching using the Soundex algorithm in Python.
  • How the Soundex algorithm works.