Stats, MR and Data: Mean, mode and median on a skewed noisy distribution

Introduction

There are a lot of articles and blog entries on the differences between the three (usual) measures of the average of a distribution. Here's my version.

The Averages

Mean - the easiest to calculate as it's just the sum of the data points divided by the count of the point. See wikipedia for more details.
Mode - the most common value within a set of data points. See wikipedia for more details.
Median - the middle point of an ordered set of data points. If the number of points is even then the average of the two middle points can be taken. Again it's probably best to look at wikipedia for more complete details

The Data

The data points are taken from an ad-hoc survey. People were asked to answer a set of questions and the time to complete them was recorded. The times recorded range from around 3 seconds to just over 100,000 seconds. Both seem extremely unlikely. Below is a chart of the data with a red line denoting the smoothed distribution:

The chart was produced in R with the ggplot2 package. Basically the data points were loaded, tabulated and smoothed. As you can see, not much detail is shown. The extremes completely swamp the interesting data. A second chart, restricted to an upper limit of 800 seconds shows the details more clearly:

You'll also note that I've coloured the line according to the frequency. There's no reason apart from the fact that I think it looks good. It adds no meaning to the chart.

So what values do we get for the three averages from the raw data:

Mean - 198 seconds
Mode - 95 seconds
Median - 132 seconds

For this type of distribution with data that is skewed to the right we would always expect mean to be bigger than the median which will be bigger than the mode.

The following chart shows the same plot as above but with the averages shown:

There are two problems with these values though.

The outliers are being included in the averages. The bulk of the data lies between 0 and 400 seconds. Values above this should be included but at what point do we start to say that the data points are outliers.
The mode is calculated from the noisy data. On the distribution above, this is probably not too much of a problem. But what if the peak wasn't so pronounced? Then the noise could produce a value that is far from the 'true' value of the peak.

So what do we do? And what is the best measure of the average?

Removing Outliers

There are many ways to remove outliers but in this case I'm just going to look at the effect of the data points on the mean. So, calculating the mean for point 1, then points 1 to 2, and so forth until we get to a mean of points 1 to n. We can then show the effect of the individual points on the mean:

The elbow of this line lies around 500 seconds. Beyond this point the mean rise slowly but significantly. Showing the lower values in more detail:

This show that the mean basically stops rising with any 'speed' at around 2000 seconds. This seems a reasonable cut off point.

Recalculating the mean for this upper limit gives a value of 171 seconds, a reduction of 27 seconds (14%).

Noisy Modes

The next problem was how much is the mode affected by having noisy data. To see this I'm going to fit a spline to the data using the R function smooth.spline. I've restricted the degrees of freedom to 40. The default value gave a line that was too wavy.

The smoothed line is shown above in chart 2. You can see that the peak of the smoothed data does not correspond to the most common data point.

To calculate the max of the smoothed data you can just run:

dur3 <- smooth.spline(x=dur2$dur,y=dur2$freq,df=40)

dur4 <- data.frame(cbind(dur3$x,dur3$y))

dur4$X1[dur4$X2==max(dur4$X2)]

where dur2 is the tabulated original data.

This gives a modal value of 107 seconds, an increase of 12 seconds.

The Median

This doesn't change at all when restring the data set to 0 to 2000 seconds.

Which Average to choose?

So, which average should you choose to show the average time of completion for the questionnaire.

I think that the modal value would give an answer that was too low. It doesn't take into account the skew nature of the data.

The mean also has problems in that it is overly affected by outliers. It's not too much trouble to remove these for this scenario. After all if it takes around 2 minutes for most people to complete the questionnaire then cutting the data off at around 30 minutes seems sensible.

The median value seems a good compromise. It could still be said to be too low given the nature of the distribution but it is relatively simple to calculate and you don't have to worry about outliers in the data set.

Another question to answer would do we need a single measure of an average for this? The chart itself gives a better indication of how long it took to answer. By reducing it to a single value we lose a lot of the information such as the spread of the times. Maybe taking the mode and the spread at half the height of the peak would be better. This gives a spread of 67 to 174 seconds and therefore includes all measures of the average.

Stats, MR and Data

Thursday, 5 September 2013

Mean, mode and median on a skewed noisy distribution