Normal Distribution

One of the most useful bits of number-crunching you can do with data is to calculate the probability distribution, in the earnest hope that it will be a reasonable fit for one of the recognised distributions such as the normal distribution. In this project I will write a small C library to calculate the normal distribution for a given data set.

The Normal Distribution

Most people will have an intuitive understanding of the normal distribution even if they have no formal education in statistics. As an example let's think about people's heights: a few people are very short, rather more are fairly short, a lot are around average height, and then the numbers drop back down as we start to consider heights above average, eventually reaching zero.

If we draw a graph of heights and their respective frequencies we might expect to see something like this. (Note this is just fictitious data I have created for this project.)

normal distribution

The bars show actual data, whereas the red circles show the theoretical normal distribution for this particular data. As you can see there is a very good but not perfect fit, for example the frequency of 66 inches is 21, but the theoretical frequency assuming the data fits a normal distribution is 20.1516.

So how do we calculate, for example, 20.1516 from our data? Short answer: using the formula below, borrowed from Wikipedia. The variables used in this formula are:

  • x - the individual data value, eg. 21
  • μ - the arithmetic mean of the data
  • σ - the standard deviation of the data
normal distribution

The formula looks slightly intimidating but later on I'll implement it in C to calculate the normal values for each item of data. In fact that is at the core of this whole project - to iterate a set of data and apply the above formula to calculate the values plotted as red circles.

The Plan

Before starting to code let's consider our requirements and how we are going to implement them. Basically what we want to end up with is a data structure with a row for each different value in the original data, and columns for the following:

  • The value (eg. height)
  • The actual frequency of that value
  • The actual probability of that value
  • The theoretical frequency according to the normal distribution
  • The theoretical probability according to the normal distribution

For simplicity I am restricting the data to integers, each of which will be counted separately when calculating frequencies. A refinement would be to allow for real values (float or double in C) and to group these into class intervals for the purposes of calculating frequencies. A further refinement could be to extend the code to calculate other types of probability distribution.

Coding

Create a new folder and within it create the following empty files. You can download the source code from the Downloads page or pull it from Github if you prefer.

  • data.h
  • data.c
  • normaldistribution.h
  • normaldistribution.c
  • main.c

In the real world we would be getting our data from a database or some kind of file but for the purposes of this project I have written a function to populate an array with hard-coded sample data. In a couple of earlier projects on this site I have used randomly generated sample data but here we need data which roughly follows a normal distribution so have resorted to hard coding it. I have on my list of future projects a program to generate random data following certain probability distributions but for now this will have to suffice.

In data.h enter the function prototype.

data.h

//--------------------------------------------------------
// FUNCTION PROTOTYPES
//--------------------------------------------------------
bool populatedata(int data[8]);

Now open data.c and enter the function body.

data.c

#include<stdio.h>
#include<stdbool.h>

#include<string.h>

//--------------------------------------------------------
// FUNCTION populatedata
//--------------------------------------------------------
void populatedata(int data[138])
{
    memcpy(data, (int[138]){66,64,65,68,69,71,69,67,66,64,66,60,65,68,66,66,67,66,60,
                            63,69,64,66,68,65,65,71,65,70,67,66,69,67,68,70,60,66,63,
                            67,67,68,68,63,72,69,71,63,69,67,64,67,71,68,62,64,66,68,
                            65,65,66,62,67,63,67,64,67,67,64,69,69,64,62,63,63,70,65,
                            68,66,67,66,64,62,68,71,63,65,67,66,69,66,67,70,70,66,69,
                            66,64,62,64,64,61,67,65,67,68,69,71,65,61,71,69,67,61,66,
                            61,66,65,61,65,70,63,65,64,70,67,62,66,68,66,65,68,65,65,
                            70,68,68,62,63}, sizeof(int[138]));
}

The normaldistribution.h header file contains two struct declarations. The central one is prob_dist which will hold an array of individual items, one for each discrete value, as well as overall totals. The other struct is prob_dist_item to hold values, frequencies and probabilities. We'll see exactly how this works later.

Also in the header file are the four function prototypes.

normaldistribution.h

#include<stdlib.h>
#include<stdbool.h>
#include<stdio.h>
#include<math.h>

//--------------------------------------------------------
// STRUCT prob_dist_item
//--------------------------------------------------------
typedef struct prob_dist_item
{
    int value;
    int frequency;
    double probability;
    double normal_probability;
    double normal_frequency;
} prob_dist_item;

//--------------------------------------------------------
// STRUCT prob_dist
//--------------------------------------------------------
typedef struct prob_dist
{
    prob_dist_item* items;
    int count;

    double total_probability;
    double total_normal_probability;
    double total_frequency;
    double total_normal_frequency;
} prob_dist;

//--------------------------------------------------------
// FUNCTION PROTOTYPES
//--------------------------------------------------------
prob_dist* normal_distribution_create();
void normal_distribution_calculate(int* data, int size, prob_dist* pd);
void normal_distribution_print(prob_dist* pd);
void normal_distribution_free(prob_dist* pd);

That's the preliminaries and infrastructure done so we can get stuck in to the central pieces of code. Open normaldistribution.c and enter or paste the following. I'll run through it in detail afterwards.

normaldistribution.c

#include<stdio.h>
#include<stdlib.h>
#include<math.h>

#include"normaldistribution.h"

#ifndef M_PI
#define M_PI 3.14159
#endif

#ifndef M_E
#define M_E 2.71828
#endif

//--------------------------------------------------------
// FUNCTION PROTOTYPES
//--------------------------------------------------------
static int indexof(int value, prob_dist* pd);
static int compare_prob_dist_item(const void* a, const void* b);

//--------------------------------------------------------
// FUNCTION normal_distribution_create
//--------------------------------------------------------
prob_dist* normal_distribution_create()
{
    prob_dist* pd = malloc(sizeof(prob_dist));

    pd->count = 0;
    pd->items = NULL;

    pd->total_probability = 0;
    pd->total_normal_probability = 0;
    pd->total_normal_frequency = 0;

    return pd;
}

//--------------------------------------------------------
// FUNCTION normal_distribution_calculate
//--------------------------------------------------------
void normal_distribution_calculate(int* data, int size, prob_dist* pd)
{
    int index;
    double total = 0;
    double sumofsquares = 0;
    double mean;
    double variance;
    double stddev;

    pd->total_frequency = size;

    // CALCULATE FREQUENCIES
    // iterate data
    // add new values to pd
    // or
    // increment frequency of existing values

    for(int i = 0; i < size; i++)
    {
        index = indexof(data[i], pd);

        // already exists
        if(index >= 0)
        {
            pd->items[index].frequency++;
        }
        // does not exist
        else
        {
            pd->count++;

            // first item
            if(pd->items == NULL)
            {
                pd->items = malloc(sizeof(prob_dist_item));
            }
            else
            {
                pd->items = realloc(pd->items, sizeof(prob_dist_item) * pd->count);
            }

            pd->items[pd->count - 1].value = data[i];
            pd->items[pd->count - 1].frequency = 1;
        }

        total += data[i];
        sumofsquares += pow(data[i], 2);
    }

    // SORT ITEMS
    qsort(pd->items, pd->count, sizeof(prob_dist_item), compare_prob_dist_item);

    // CALCULATE MEAN, VARIANCE AND STANDARD DEVIATION
    mean = total / size;
    variance = (sumofsquares - ((pow(total, 2)) / size)) / size;
    stddev = sqrt(variance);

    // CALCULATE PROBABILITIES OF EACH UNIQUE VALUE
    for(int c = 0; c < pd->count; c++)
    {
        pd->items[c].probability = (double)pd->items[c].frequency / (double)size;

        pd->items[c].normal_probability = ((1.0 / (stddev * sqrt(2.0 * M_PI))) * (pow(M_E, -1.0 * ((pow((pd->items[c].value - mean), 2.0)) / ( variance * 2.0)))));

        pd->items[c].normal_frequency = pd->items[c].normal_probability * size;

        pd->total_probability += pd->items[c].probability;

        pd->total_normal_probability += pd->items[c].normal_probability;
        
        pd->total_normal_frequency += pd->items[c].normal_frequency;
    }
}

//--------------------------------------------------------
// FUNCTION normal_distribution_print
//--------------------------------------------------------
void normal_distribution_print(prob_dist* pd)
{
    printf("Value | Probability | Normal Prob | Freq | Normal Freq\n------------------------------------------------------\n");

    for(int i = 0; i < pd->count; i++)
    {
        printf("%5d |%12.6lf |%12.4lf |%5d |%12.4lf\n", pd->items[i].value, pd->items[i].probability, pd->items[i].normal_probability, pd->items[i].frequency, pd->items[i].normal_frequency);
    }

    printf("------------------------------------------------------\n");

    printf("      |%12.4lf |%12.6lf |%5.0lf |%12.4lf\n", pd->total_probability, pd->total_normal_probability, pd->total_frequency, pd->total_normal_frequency);

    printf("------------------------------------------------------\n");
}

//--------------------------------------------------------
// FUNCTION normal_distribution_free
//--------------------------------------------------------
void normal_distribution_free(prob_dist* pd)
{
    free(pd->items);

    free(pd);
}

//--------------------------------------------------------
// FUNCTION indexof
//--------------------------------------------------------
static int indexof(int value, prob_dist* pd)
{
    for(int i = 0; i < pd->count; i++)
    {
        if(pd->items[i].value == value)
        {
            return i;
        }
    }

    return -1;
}

//--------------------------------------------------------
// STATIC FUNCTION compare_prob_dist_item
//--------------------------------------------------------
static int compare_prob_dist_item(const void* a, const void* b)
{
    if(((prob_dist_item*)a)->value < ((prob_dist_item*)b)->value)
        return -1;
    else if(((prob_dist_item*)a)->value > ((prob_dist_item*)b)->value)
        return 1;
    else
        return 0;
}

Firstly we need to #define a couple of constants if they don't already exist - these are Pi (π) and e, or Euler's Number.

Next there are a couple of function prototypes for static functions which I'll describe when we get to them.

The normal_distribution_create function uses malloc to get enough memory for a prob_dist struct and then initializes the members to 0 or NULL.

Now we move on to normal_distribution_calculate, the central function of the whole project.

Firstly we'll declare a few variables for later use, and set pd->total_frequency to size.

Next we iterate the data. On each iteration if we haven't encountered the value before we add a new prob_dist_item to the array and set the frequency to 1. If we have already seen the value before it will already have a prob_dist_item with its associated frequency, so we just increment the frequency. To check which is the case I have used a separate static function called indexof. Along the way we also increment total and sumofsquares for later use.

We now need to sort the items. This is not strictly essential but users aren't going to want their numbers all jumbled up. I have used the qsort function from stdlib.h. I wrote a post about qsort a while ago so won't go into details here, but briefly it takes an array to sort, the size of the array, the size of an individual item, and a pointer to a comparator function which we must provide ourselves. How you implement this is up to you but it takes void pointers to the two items to compare and must return <1, 0 or >1 depending on whether the first item is less than, equal to or more than the second. Our version is compare_prob_dist_item.

We can now calculate the arithmetic mean, variance and standard deviation, before incrementing the recently created array of prob_dist_item structs to fill in the missing values.

For each discrete value we calculate the following three values:

  • probability - simply the frequency divided by size

  • normal_probability - this uses the equation shown above but repeated here
    normal distribution

  • normal_frequency - this is the normal probability multiplied by the number of data items

We also increment the three totals, total_probability, total_normal_probability and total_normal_frequency.

Now let's look at the normal_distribution_print function. This prints out a heading and then iterates the prob_dist_item array, printing out the contents in a table.

The normal_distribution_free function is very simple; it just calls free on the prob_dist struct's array of prob_dist_item structs, and then on the prob_dist struct itself.

The main function creates an int array and then calls the various functions we have written. To recap and provide an overview of the entire process I'll list the steps:

  • Create an int array
  • Populate it with populatedata
  • Create a prob_dist struct
  • Calculate actual and normal distribution frequencies and probabilities with normal_distribution_calculate
  • Print these with normal_distribution_print
  • Free up dynamic memory with normal_distribution_free

main.c

#include<stdio.h>
#include<stdlib.h>
#include<stdbool.h>

#include"data.h"
#include"normaldistribution.h"

//--------------------------------------------------------
// FUNCTION main
//--------------------------------------------------------
int main(void)
{
    puts("---------------------------------------\n| code-in-c.com - Normal Distribution |\n---------------------------------------\n");

    int data[138];

    populatedata(data);

    prob_dist* pd = normal_distribution_create();

    normal_distribution_calculate(data, 138, pd);

    normal_distribution_print(pd);

    normal_distribution_free(pd);

    return EXIT_SUCCESS;
}

We have now finished the coding so to compile and run the program run the following in your terminal.

Compile and Run

gcc main.c data.c normaldistribution.c -std=c11 -lm -o main

./main

This is the program output. The normal probability and frequency totals don't quite add up to 1 and 138 respectively due to roundings.

Program Output

---------------------------------------
| code-in-c.com - Normal Distribution |
---------------------------------------

Value | Probability | Normal Prob | Freq | Normal Freq
------------------------------------------------------
   60 |    0.021739 |      0.0120 |    3 |      1.6597
   61 |    0.036232 |      0.0255 |    5 |      3.5189
   62 |    0.050725 |      0.0473 |    7 |      6.5242
   63 |    0.072464 |      0.0766 |   10 |     10.5773
   64 |    0.094203 |      0.1087 |   13 |     14.9952
   65 |    0.123188 |      0.1347 |   17 |     18.5894
   66 |    0.152174 |      0.1460 |   21 |     20.1516
   67 |    0.137681 |      0.1384 |   19 |     19.1024
   68 |    0.108696 |      0.1147 |   15 |     15.8343
   69 |    0.086957 |      0.0832 |   12 |     11.4773
   70 |    0.057971 |      0.0527 |    8 |      7.2747
   71 |    0.050725 |      0.0292 |    7 |      4.0320
   72 |    0.007246 |      0.0142 |    1 |      1.9542
------------------------------------------------------
      |      1.0000 |    0.983270 |  138 |    135.6913
------------------------------------------------------

Please let me have your comments and suggestions, and follow Code in C on Twitter for news of future posts.

Leave a Reply

Your email address will not be published. Required fields are marked *