Statistics and Programming - Fundamentals

By doug

If there were one thing I could go back and do over in my college career... well, there would probably be a lot of things I'd change. One academic choice I would change is to make better use of my statistics class. I've recently been thinking about my lack of understanding of good statistics and how it could be applied to testing and verifying software systems (or any study, for that matter). I wanted a more hands-on and familiar way to approach statistics, so I looked up statistics for programmers. My first results were interesting:

Programmers Need To Learn Statistics Or I Will Kill Them All

Does knowledge of statistics make you a better programmer?

Statistics For Programmers I: The Problem

I was kind of surprised. There are also arguments for reasoning from statistical evidence, as many other fields have done for a long time. Check out this presentation by Greg Wilson - What We Actually Know About Software Development, and Why We Believe It's True.

Greg Wilson - What We Actually Know About Software Development, and Why We Believe It's True from CUSEC on Vimeo.

I suggest watching the entire hour. My understanding of the presentation is this: many fields (various engineering disciplines, the social sciences, etc.) use good statistical evidence to argue and support claims, while the world of software engineering has yet to do this properly. From what I understand, Greg argues that many of the benefits claimed for development and management styles (agile, XP, etc.) are not properly supported by data. He goes on to describe many studies in software engineering that were done improperly, as well as many that were done properly and reveal a lot of fascinating information. His point stands out strongly in my mind: wherever possible, data must be used to demonstrate problems.

If for no other reason, I believe it's helpful for keeping your programming skills sharp. If you sit down and try to figure out how to develop some of the basic statistical functions (mean, median, standard deviation, variance, etc.), I think you may be surprised by the challenges that arise from implementing just these.

Thanks to my refresher on the basics from Think Stats, I've created a basic Java Statistics class. It is modeled on Java's Math class in that it contains a variety of static methods that perform different statistical functions. In particular, it supports calculating the mean, variance, standard deviation and median. The methods are designed to support any type derived from Number, but mainly the Double type has been tested.

The mean is one of the most crucial functions in this set (and also the most straightforward). While not the most important statistical tool on its own, it is the base that the standard deviation and variance rely upon. Let's take a look at the mean function:

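// Requires: import java.util.AbstractList;
public static <T extends Number> Double mean(AbstractList<T> values) {
    // An empty or missing list has nothing to average; treat its mean as zero.
    if (values == null || values.isEmpty()) {
        return Double.valueOf(0.0);
    }

    double sum = 0.0;
    for (T value : values) {
        sum += value.doubleValue();
    }

    // Double.valueOf replaces the deprecated new Double(...) constructor.
    return Double.valueOf(sum / values.size());
}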

Simply take the sum of all values in the collection and divide by the number of values summed. If an empty set is provided, the function returns zero. The mean is the fundamental piece of the variance and standard deviation. Essentially, the variance measures how far each value lies from the mean by averaging the squared distances, and the standard deviation is the square root of the variance. In turn, this gives a better idea of how spread out the measured points are around the mean.
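To make the relationship between the mean, the variance and the standard deviation concrete, here is a minimal sketch. It is not the code from Statistics.java: the class name StatsSketch is hypothetical, it takes a List rather than an AbstractList, and it assumes the population variance (dividing by n). A sample variance would divide by n - 1 instead.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class for illustration; not the Statistics.java from this post.
final class StatsSketch {

    // Arithmetic mean; returns 0.0 for null or empty input, like the mean function above.
    static <T extends Number> double mean(List<T> values) {
        if (values == null || values.isEmpty()) {
            return 0.0;
        }
        double sum = 0.0;
        for (T value : values) {
            sum += value.doubleValue();
        }
        return sum / values.size();
    }

    // Population variance: the average squared distance from the mean.
    // Assumption: divides by n; a sample variance would divide by n - 1.
    static <T extends Number> double variance(List<T> values) {
        if (values == null || values.isEmpty()) {
            return 0.0;
        }
        double avg = mean(values);
        double sumOfSquares = 0.0;
        for (T value : values) {
            double diff = value.doubleValue() - avg;
            sumOfSquares += diff * diff;
        }
        return sumOfSquares / values.size();
    }

    // Standard deviation: the square root of the variance, in the same units as the data.
    static <T extends Number> double standardDeviation(List<T> values) {
        return Math.sqrt(variance(values));
    }

    public static void main(String[] args) {
        List<Double> values = Arrays.asList(2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0);
        System.out.println(mean(values));              // prints 5.0
        System.out.println(variance(values));          // prints 4.0
        System.out.println(standardDeviation(values)); // prints 2.0
    }
}
```

In this example the squared distances from the mean add up to 32 across 8 values, so the variance is 4 and the standard deviation is 2. In other words, a typical value sits about 2 away from the mean of 5.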

Check out the source code, Statistics.java.

Next time, I'll talk about determining the median - the middle value of a sorted set of numbers.


Sailing heart-ships
thru broken harbors
Out on the waves in the night
Still the searcher
must ride the dark horse
Racing alone in his fright.
Tell me why, tell me why

Neil Young, Tell Me Why


Tags: programming, java, statistics