Levenshtein Word Distance

A while ago I wrote an implementation of the Soundex Algorithm which attempts to assign the same encoding to words which are pronounced the same but spelled differently. In this post I'll cover the Levenshtein Word Distance algorithm which is a related concept measuring the "cost" of transforming one word into another by totalling the number of letters which need to be inserted, deleted or substituted.

The Levenshtein Word Distance has a fairly obvious use in helping spell checkers decided which words to suggest as alternatives to mis-spelled words: if the distance is low between a mis-spelled word and an actual word then it is likely that word is what the user intended to type. However, it can be used in any situation where strings of characters need to be compared, such as DNA matching.

Continue reading

Estimating Pi

I started off calling this project "Calculating Pi" but soon realised that I needed to rename it "Estimating Pi". Pi is an irrational number starting off 3.14159 and then carrying on for an infinite number of digits with no pattern which anybody has ever discovered. Therefore it cannot be calculated as such, just estimated to (in principle) any number of digits.

Continue reading

Serializing and Deserializing Data Structures

If you want to convert the contents of a data structure to a string to either save it to disc or transmit it over the wire you will probably choose either XML or JSON. Both of these have the advantage of including metadata with the data itself, so it is possible to recreate the data structure purely from the XML or JSON without any additional knowledge. The downside is that both these formats can be very bloated.

In a restricted environment where you have control over all the code writing and reading data you can get away without a data + metadata format, and just save the data itself. In this project I will write a few functions to create an array of structs, and then serialize/deserialize them to/from a file.

Continue reading

Boolean Bits

The C99 standard introduced the _Bool type as well as stdbool.h which allows you to use bool, true and false. _Bool uses a byte to store true/false, yes/no, on/off or whatever the semantics of your program might be, but of course you only really need 1 bit so 7 bits are wasted. Most of the time this isn't worth worrying about but in some rare cases where you need a lot of Booleans it might be worthwhile looking into ways of being a bit more efficient with memory usage. This article presents my particular approach to doing this.

Continue reading

Redirection and Piping

This post will demonstrate programs which each perform a single very specific task but which can be chained together in such a way that the output of one forms the input of another. Connecting programs like this, or piping to use the correct terminology, enables more complex workflows or processes to be run.

This post includes just three short programs to carry out the following tasks:

  • Generating data

  • Filtering data from the data generating program

  • Calculating totals from the filtering program
Continue reading

Star Catalog 1.0

This post covers the first part of a project to develop a star catalog in C. The aims of this initial stage are quite modest - we will write a program to import star data from a file into a data structure and then print it to the screen in a neat format.

Future plans for the project include calculating a few extra pieces of information from existing data, sorting and filtering the data, and exporting it to various other formats.

All these things are specific to stellar data but of course the principles and coding techniques can be applied to any field.

Continue reading