Linear Regression: Fitting a Line to Data

In a previous post I implemented the Pearson Correlation Coefficient, a measure of how much one variable depends on another. The three sets of bivariate data I used for testing and demonstration are shown again below, along with their corresponding scatterplots. As you can see these scatterplots now have lines of best fit added, their gradients and heights being calculated using least-squares regression which is the subject of this article.

Data Set 1

xy
1032
2044
4068
4574
6092
6598
75110
80116

dataset1

Data Set 2

xy
1040
2040
4060
4580
6090
65110
75100
80130

dataset2

Data Set 3

xy
10100
2010
40130
4590
6040
6580
75180
8050

dataset3

y = ax + b

To draw the line of best fit we need the two end points which are then just joined up. The values of x are 0 and 90, so we need a formula in the form

General Form of Linear Equation

y = ax + b

We can then plug in 0 and 90 as the values of x to get the corresponding values for y. a represents the gradient or slope of the line, for example if y increases at the same rate as x then a will be 1, if it increases at twice the rate then a will be 2 etc. (The x and y axes might be to different scales so the apparent gradient might not be the same as the actual gradient.) b is the offset, or the amount the line is shifted up or down. If the scatterplot's x-axis starts at or passes through 0 then the height of the line at that point will be the same as b.

I have used the equation of the line in the form ax + b, but mx + c is also commonly used. Also, the offset is sometimes put first, ie. b + ax or c + mx.

Interpolation and Extrapolation

Apart from drawing a line to give an immediate and intuitive impression of the relationship between two variables, the equation given by linear or other kinds of regression can also be used to estimate values of y for values of x either within or outside the known range of values.

Estimating values within the current range of the independent variable is known as interpolation, for example in Data Set 1 we have no data for x = 30, but once we have calculated a and b we can use them to calculate an estimated value of y for x = 30.

The related process of estimating values outside the range of known x values (in these examples < 0 and > 90) is known as extrapolation.

The results of interpolation and extrapolation need to be treated with caution. Even if the known data fits a line exactly does not imply that unknown values will also fit, and of course the more imperfect the fit the more unlikely unknown values are to be on or near the regression line.

One reason for this is a limited set of data which might not be representative of all possible data. Another is that the range of independent variables might be small, fitting a straight line within a limited range but actually following a more complex pattern over a wider range of values.

As an example, imagine temperature data for a short period during spring. It might appear to rise linearly during this period, but of course we know it will level out in summer before dropping off through winter, before repeating an annual cycle. A linear model for such data is clearly useless for extrapolation.

Is Regression AI?

The concept of regression discussed here goes back to the early 19th Century, when of course it was calculated by hand using pencil and paper. Thefore the answer to the question "is it artificial intelligence" is obviously "don't be stupid, of course it isn't!"

The term artificial intelligence has been around since 1956, and existed as a serious idea (ie. beyond just sci-fi or fantasy) since the 1930s and was Alan Turing's original concept. The idea that for decades computers would be little more than calculators or filing systems would not have impressed him. This is perhaps longer than many people imagine but still nowhere near as far back as the early 1800s.

AI has has a chequered history, full of false starts, dead ends and disappointments, and has only started to become mainstream and actually useful in the past few years. This is due mainly to the emergence of machine learning, so AI and ML are sometimes now being used interchangeably.

The existence of very large amounts of data ("Big Data") plus the computing power to crunch that data using what are actually very simple algorithms has led to this revolution. With enough data and computing power you can derive generalised rules (ie. numbers) from samples which can be used to make inferences or decisions on other, similar, items of data.

Hopefully you can see what I am getting at here: carry out regression on the data you have, then use the results for interpolation and extrapolation about other possible data - almost a definition of Machine Learning.

I won't answer my own question "Is Regression AI?" but if you have any opinions one way or the other please let me know in the Comments below.

Writing the Code

For this project we will write a function to create sample arrays of data corresponding to that shown above, and then write the code necessary to carry out linear regression, ie. calculate a and b, for each of them.

Create a new folder and then create the following empty files:

  • main.c
  • data.h
  • data.c
  • linearregression.h
  • linearregression.c

Open main.c and type or paste the following. You can also download the source code from the Downloads page if you prefer.

(Note that data.h and data.c are copied from the earlier Pearson Correlation article. Copying and pasting code into different projects is bad practice of course but I want to keep the source code for each project entirely self-contained. If you followed along the with Pearson Correlation code you might like to create the files for this project in the same folder and just have one copy of the data files.)

main.c

#include<stdio.h>
#include<stdlib.h>
#include<stdbool.h>

#include"data.h"
#include"linearregression.h"

//--------------------------------------------------------
// FUNCTION PROTOTYPES
//--------------------------------------------------------
void printdata(double* independent, double* dependent, int size);

//--------------------------------------------------------
// FUNCTION main
//--------------------------------------------------------
int main(void)
{
    puts("-------------------------------------\n| code-in-c.com - Linear Regression |\n-------------------------------------\n");

    double independent[8];
    double dependent[8];

    for(int dataset = 1; dataset <=3; dataset++)
    {
        if(populatedata(independent, dependent, dataset))
        {
            printf("Data Set %d\n-----------\n", dataset);

            printdata(independent, dependent, 8);

            lin_reg lr;

            linear_regression(independent, dependent, 8, &lr);

            printf("y = %gx + %g\n\n", lr.a, lr.b);
        }
    }

    return EXIT_SUCCESS;
}

//--------------------------------------------------------
// FUNCTION printdata
//--------------------------------------------------------
void printdata(double* independent, double* dependent, int size)
{
    for(int i = 0; i < size; i++)
    {
        printf("%3.0lf\t%3.0lf\n", independent[i], dependent[i]);
    }
}

In main we create a pair of arrays for the bivariate data, and then enter a for loop for the three available sets of data. Within the loop the arrays are populated and the data displayed, the linear_regression function is called and the result printed.

Now enter the following code into data.h and data.c. The populatedata function simply copies hard-coded data into the two arrays.

data.h

//--------------------------------------------------------
// FUNCTION PROTOTYPES
//--------------------------------------------------------
bool populatedata(double independent[8], double dependent[8], int dataset);

data.c

#include<string.h>
#include<stdio.h>
#include<stdbool.h>

//--------------------------------------------------------
// FUNCTION populatedata
//--------------------------------------------------------
bool populatedata(double independent[8], double dependent[8], int dataset)
{
    bool success = true;

    switch(dataset)
    {
        case 1:
            memcpy(independent, (double[8]){10,20,40,45,60,65,75,80}, sizeof(double[8]));
            memcpy(dependent, (double[8]){32,44,68,74,92,98,110,116}, sizeof(double[8]));
            break;

        case 2:
            memcpy(independent, (double[8]){10,20,40,45,60,65,75,80}, sizeof(double[8]));
            memcpy(dependent, (double[8]){40,40,60,80,90,110,100,130}, sizeof(double[8]));
            break;

        case 3:
            memcpy(independent, (double[8]){10,20,40,45,60,65,75,80}, sizeof(double[8]));
            memcpy(dependent, (double[8]){100,10,130,90,40,80,180,50}, sizeof(double[8]));
            break;

        default:
            puts("Invalid dataset requested");
            success = false;
            break;
    }

    return success;
}

Now we get to the heart of the matter - the linear_regression function.

We have a minor problem here in that we need to return two values, a and b, but of course only one value can be returned by a function. There are a few ways to get round this.

The first way would be to write two separate functions, one to calculate a and one to calculate b. This would be fiddly and repetitive to implement and irritating to use so we'll reject that solution.

Another way would be to create and return a struct containing a and b, but local variables cannot be returned from a function as they go out of scope when the function exits. We would therefore need to allocate memory using malloc but the problem with that approach is that the calling code would be responsible for calling free on the struct when it had finished with it. The person calling the function might forget to do so or might not even know it is necessary, so although possible this is not an advisable solution to the problem. (Hopefully the person calling the function would use a utility like Valgrind to check any memory obtained using malloc/calloc/realloc is freed but memory management in C is something that should be isolated and foolproofed as much as possible.)

The solution we will go with is to still use a struct, but to pass it "empty" as a pointer to linear_regression which will then set a and b. This pattern of calling code passing pointers to empty data structures for functions to populate is one I generally use in these type of situations, and was also used above in populatedata.

Open linearregression.h and linearregression.c, and enter the following code.

linearregression.h

//--------------------------------------------------------
// STRUCT lin_reg
//--------------------------------------------------------
typedef struct lin_reg
{
    double a;
    double b;
} lin_reg;

//--------------------------------------------------------
// FUNCTION PROTOTYPES
//--------------------------------------------------------
void linear_regression(double* independent, double* dependent, int size, lin_reg* lr);

linearregression.c

#include<stdlib.h>
#include<stdbool.h>
#include<stdio.h>
#include<string.h>
#include<math.h>

#include"linearregression.h"

//--------------------------------------------------------
// FUNCTION PROTOTYPES
//--------------------------------------------------------
static double arithmetic_mean(double* data, int size);
static double mean_of_products(double* data1, double* data2, int size);
static double variance(double* data, int size);

// --------------------------------------------------------
// FUNCTION linear_regression
// --------------------------------------------------------
void linear_regression(double* independent, double* dependent, int size, lin_reg* lr)
{
    double independent_mean = arithmetic_mean(independent, size);
    double dependent_mean = arithmetic_mean(dependent, size);
    double products_mean = mean_of_products(independent, dependent, size);
    double independent_variance = variance(independent, size);

    lr->a = (products_mean - (independent_mean * dependent_mean) ) / independent_variance;

    lr->b = dependent_mean - (lr->a * independent_mean);
}

//--------------------------------------------------------
// FUNCTION arithmetic_mean
//--------------------------------------------------------
static double arithmetic_mean(double* data, int size)
{
    double total = 0;

    // note that incrementing total is done within the for loop
    for(int i = 0; i < size; total += data[i], i++);

    return total / size;
}

//--------------------------------------------------------
// FUNCTION mean_of_products
//--------------------------------------------------------

static double mean_of_products(double* data1, double* data2, int size)
{
    double total = 0;

    // note that incrementing total is done within the for loop
    for(int i = 0; i < size; total += (data1[i] * data2[i]), i++);

    return total / size;
}

//--------------------------------------------------------
// FUNCTION variance
//--------------------------------------------------------
static double variance(double* data, int size)
{
    double squares[size];

    for(int i = 0; i < size; i++)
    {
        squares[i] = pow(data[i], 2);
    }

    double mean_of_squares = arithmetic_mean(squares, size);
    double mean = arithmetic_mean(data, size);
    double square_of_mean = pow(mean, 2);
    double variance = mean_of_squares - square_of_mean;

    return variance;
}

To calculate a and b we need the following four values:

  • The mean of the independent values
  • The mean of the dependent values
  • The mean of the products of the pairs of data
  • The variance of the independent values

Calculating these is farmed out to separate functions. These are general-purpose functions with applications outside just linear regression so if we were writing a wider statistical library or application these would form part of a library of utility functions.

Having calculated the four values mentioned above we use them to calculate a and b, assigning them to the struct pointer argument.

That's the code finished so we can compile and run it with the following lines in Terminal.

Compiling

gcc main.c linearregression.c data.c -std=c11 -lm -o main

Running

./main

The program output shows each of the three sets of data, along with the formulae of their respective lines of best fit.

Program Output

-------------------------------------
| code-in-c.com - Linear Regression |
-------------------------------------

Data Set 1
-----------
 10      32
 20      44
 40      68
 45      74
 60      92
 65      98
 75     110
 80     116
y = 1.2x + 20

Data Set 2
-----------
 10      40
 20      40
 40      60
 45      80
 60      90
 65     110
 75     100
 80     130
y = 1.24249x + 19.9022

Data Set 3
-----------
 10     100
 20      10
 40     130
 45      90
 60      40
 65      80
 75     180
 80      50
y = 0.441649x + 63.1936

In the section above on interpolation and extrapolation, I used x = 30 in Data Set 1 as an example of missing data which could be estimated by interpolation. Now we have values of a and b for that data we can use them as follows:

Interpolation Example

y = 1.2 * 30 + 20
= 56

Please let me have any ideas, suggestions or questions in Comments below, and follow Code in C on Twitter for news of future posts.

This entry was posted in Uncategorized. Bookmark the permalink.

Leave a Reply

Your email address will not be published. Required fields are marked *