The Soundex Algorithm

Soundex is one of a number of phonetic algorithms, assigning values to words or names so that they can be compared for similarity of pronounciation. It is probably the best know such algorithm as it is built in to most major RDBMSs, as well as PHP and other languages.

It doesn’t take much thought to realise that the whole area of phonetic algorithms is a minefield, and Soundex itself is rather restricted in its usefulness. In fact, after writing this implementation I came to the conclusion that it is rather mediocre but at least coding it up does give a better understanding of how it works and therefore its usefulness and limitations.

Wikipedia has a surprisingly brief article on the topic Soundex on Wikipedia which you might like to read.

The Algorithm

The purpose of the algorithm is to create for a given word a four-character string. The first character is the first character of the input string. The subsequent three characters are any of the numbers 1 to 6, padded to the right with zeros if necessary. The idea is that words that sound the same but are spelled differently will have the same Soundex encoding.

The steps involved are:

  • Copy the first character of the input string to the first character of the output string

  • For subsequent characters in the input string, add digits to the output string according to the table below, up to a maximum of three digits (ie. a total output string length of 4). Note that a number of input letters are ignored, including all vowels. Also, further occurences of an input letter with the same encoding are ignored.

  • If we reach the end of the input string before the output string reaches 4 characters, pad it to the right with zeros.

Letter Encodings

This table lists the digits assigned to the letters A-Z. I have assigned 0 to letters which are ignored, and note that uppercase and lowercase letters are treated the same.

Input letterEncoding
A0
B1
C2
D3
E0
F1
G2
H0
I0
J2
K2
L4
M5
N5
O0
P1
Q2
R6
S2
T3
U0
V1
W0
X2
Y0
Z2

The Code

When you are ready to start coding create a new folder and within it create the following empty files. You can also download the source code from the Downloads page or clone/download it from Github if you prefer.

  • soundex.h
  • soundex.c
  • main.c

Open soundex.h and enter this single function prototype.

soundex.h

//--------------------------------------------------------
// FUNCTION PROTOTYPES
//--------------------------------------------------------
void soundex(char* name, char* s);

Now open soundex.c and enter this function itself.

soundex.c

#include<string.h>
#include<ctype.h>
#include<stdio.h>
#include<stdlib.h>

#include"soundex.h"

//--------------------------------------------------------
// FUNCTION soundex
//--------------------------------------------------------
void soundex(char* name, char* s)
{
    int si = 1;
    char c;

    //                 ABCDEFGHIJKLMNOPQRSTUVWXYZ
    char mappings[] = "01230120022455012623010202";

    s[0] = toupper(name[0]);

    for(int i = 1, l = strlen(name); i < l; i++)
    {
        c = toupper(name[i]) - 65;

        if(c >= 0 && c <= 25)
        {
            if(mappings[c] != '0')
            {
                if(mappings[c] != s[si-1])
                {
                    s[si] = mappings[c];
                    si++;
                }

                if(si > 3)
                {
                    break;
                }
            }
        }
    }

    if(si <= 3)
    {
        while(si <= 3)
        {
            s[si] = '0';
            si++;
        }
    }
}

The function arguments consist of a char* for the name to be encoded, and another for the encoding. Passing a char pointer for the result removes the need to dynamically allocate memory within the soundex function and then call (or forget to call!) free in the calling code. This way we just create an auto variable in the calling code and pass it to the soundex function.

The int si is the current index of s, the result string. The char c is current letter in the input string, modified as we will see in a moment.

Next we create a mappings string. This represents the output values for each letter of the alphabet as per the above table. We then set the first letter of the output string to the first letter of the input.

Next we enter a for loop through the input string; note that the loop starts at 1 as we have already dealt with the first character. Within the loop we assign c to the current input letter, converted to upper case. We then subtract 65 so the numeric value corresponds to the indexes of the mappings array.

Next we check the value is within the range 0 to 25, ie. an uppercase letter. If not it is ignored, but if so we check if its corresponding numeric value is not 0. We then check the value is not the same as the previous to implement the rule that consecutive identical values are skipped, and then set the next value of the output string to the correct number. The si index is then incremented, before we check if it is more than 3; if so we break out of the loop.

Finally, we need to check if we have not yet filled up the encoded string s, which can happen if there are not enough encodable letters in the input string. If this is the case we simply pad out the string with 0s in a while loop.

That's the algorithm implemented. We do not need to return anything because, as I mentioned above, we are populating a pointer argument from the calling code. We can now write a short main function to try out the function.

main.c

#include<stdio.h>
#include<math.h>
#include<time.h>
#include<locale.h>
#include<stdlib.h>

#include"soundex.h"

//--------------------------------------------------------
// FUNCTION main
//--------------------------------------------------------
void main(void)
{
    puts("---------------------------\n| code-in-c.com - Soundex |\n---------------------------\n");

    char* names1[] = {"Johnson", "Adams", "Davis", "Simons", "Richards", "Taylor", "Carter", "Stevenson", "Taylor", "Smith", "McDonald", "Harris", "Sim", "Williams", "Baker", "Wells", "Fraser", "Jones", "Wilks", "Hunt", "Sanders", "Parsons", "Robson", "Harker"};

    char* names2[] = {"Jonson", "Addams", "Davies", "Simmons", "Richardson", "Tailor", "Chater", "Stephenson", "Naylor", "Smythe", "MacDonald", "Harrys", "Sym", "Wilson", "Barker", "Wills", "Frazer", "Johns", "Wilkinson", "Hunter", "Saunders", "Pearson", "Robertson", "Parker"};

    int namecount = sizeof(names1) / sizeof(names1[0]);

    char s1[] = "     ";
    char s2[] = "     ";

    for(int i = 0; i < namecount; i++)
    {
        soundex(names1[i], s1);
        soundex(names2[i], s2);

        printf("%-20s%-6s%-20s%-6s\n", names1[i], s1, names2[i], s2);
    }
}

The main function first creates a couple of string arrays, each pair of names being similar to some degree. To avoid hard-coding the array size the next line picks it up using sizeof. What we think of as an array of strings is of course actually an array of pointers, so sizeof(names1) return the size of a pointer (most likely 8 bytes or 64 bits) times the number of elements. To get the number of elements we therefore just need to divide that by the size of one element, ie. the size of a pointer.

In this case there are 24 elements, so sizeof(names1) = 192. Divide that by sizeof(names1[0]) which is 8 and we get 24. This of course assumes we are using a 64 bit machine.

We then create a couple of strings for the encoded values and then loop through the name pairs, calling the soundex function for each, and finally printing out the names and their Soundex encodings.

Now compile and run the program with these commands in your terminal.

Build and run

gcc main.c soundex.c -std=c11 -lm -o main
./main

Program output

---------------------------
| code-in-c.com - Soundex |
---------------------------

Johnson             J525  Jonson              J525
Adams               A352  Addams              A352
Davis               D120  Davies              D120
Simons              S520  Simmons             S520
Richards            R263  Richardson          R263
Taylor              T460  Tailor              T460
Carter              C636  Chater              C360
Stevenson           S315  Stephenson          S315
Taylor              T460  Naylor              N460
Smith               S530  Smythe              S530
McDonald            M235  MacDonald           M235
Harris              H620  Harrys              H620
Sim                 S500  Sym                 S500
Williams            W452  Wilson              W425
Baker               B260  Barker              B626
Wells               W420  Wills               W420
Fraser              F626  Frazer              F626
Jones               J520  Johns               J520
Wilks               W420  Wilkinson           W425
Hunt                H530  Hunter              H536
Sanders             S536  Saunders            S536
Parsons             P625  Pearson             P625
Robson              R125  Robertson           R163
Harker              H626  Parker              P626

As you can see, the algorithm is not perfect. Even with this small selection of names a few problems are apparent. Ignoring repeating values means Simons and Simmons are given the same encoding, and using only the first few letters means Richards and Richardson are also encoded the same. Ignoring vowels means that Wells and Wills, Sanders and Saunders, Parsons and Pearson are all given the same encoding despite not actually being homophones.

Please let me have your comments and suggestions, and follow Code in C on Twitter for news of future posts.

Leave a Reply

Your email address will not be published. Required fields are marked *