Friday 23 August 2013

Death Statistics in England and Wales - Postscript

In the last post about death statistics I put in an animated GIF that I had created using a combination of Excel and GIMP.

Quite why I did it this way I don't know. It took hours (probably) to adjust the chart, copy it to paint, save it and then copy it as a layer to GIMP.

I've since discovered a few websites detailing how to create exactly the same thing using R. Googling 'animated gif R' gives, almost, thousands of examples. I now realise quite how stupid I was.

Below is the script that I used:

-------------------------------------------------------------------------------
death <- read.table("death.txt",header=TRUE)
#summary(death)

library(ggplot2)
library(Cairo)

for(i in 2:49){
  dp <- ggplot(data=death,aes(x=Age,y=death[,i])) + geom_line() + ylim(0,0.05)
  
  if(i<10){
    ttle <- paste('plot0',i,'.png',sep="")
  } else {
    ttle <- paste('plot',i,'.png',sep="")
  }

  CairoPNG(filename=ttle,width=480,height=480,dpi=300)
  #png(filename=ttle)
  print(dp)
  dev.off()
}

shell('"C:/Program Files/GraphicsMagick/gm.exe" convert -delay 15 -loop 0 plot*.png death.gif')
shell('"C:/Program Files/FFMpeg/bin/ffmpeg.exe" -start_number 02 -i plot%02d.png -vcodec libx264 death.mp4')
-------------------------------------------------------------------------------

The script loads in the data which is arranged as a table with age as the rows, year as the columns and deaths as a proportion of the total.

It then loads the required libraries.

Lines 7-20 do the work of creating the plots. These are saved as PNG files. Two things to note:
  1. I've set the y axis scale to run from 0->0.05. This is to keep the axes on a consistent scale.
  2. To make sure that the PNG files are ordered correctly I've had to zero pad the numbers for 2-9.
The work of creating the animated GIF is done by GraphicMagick. You can also use ImageMagick.

I won't go into any more detail as other sites give much better information. There's no point copying it.

The last line uses ffmeg to create a movie from the plots. This was inspired by www.animatedgraphs.co.uk.

Below is the resultant GIF:

The whole process takes seconds - once the script is written.

Thursday 8 August 2013

An Excel macro to simplistically seasonally adjust time series data

Introduction

There are many ways to deseasonalise seasonal time series data.

I'm going to write about one of the easier methods. I can't remember what the official name for the method is but all we're doing is dividing the data point by the average for each seasonal period.

That definition probably needs explaining. So, say, for example we have quarterly data for 3 years. We take the average for each quarter over the years i.e. average Q1 for years 1, 2 and 3. Then do the same for Q2, Q3 and Q4. After this you divide the Q1 points by the Q1 average, the Q2 points by the Q2 average, etc.

There are some things to be careful of when using this method:

  1. The more data you have the better. You'll probably need more than 3 years worth of data for this to work well.
  2. You need to be careful of outliers affecting the seasonal averages by too much when there is very little data.
  3. The data series needs to be stationary. Or at least, mostly stationary. If it's not, you tend to get step changes in the resulting seasonally adjusted data. By stationary I mean that there is no long term trend in the data.

The Macro

Here is the listing of the macro.

--------------------------------------------------------------------------
Function SeasonAverage(labelRange As range, dataRange As range) As Variant

    Dim labelSum As Dictionary
    Dim labelCount As Dictionary
    Dim labelAvg As Dictionary
    Dim totalSum As Double
    Dim totalCount As Integer
    Dim OutputRows As Long
    Dim OutputCols As Long
    Dim output() As Double
    Dim i As Integer
    
    totalSum = 0
    totalCount = 0
    
    Set labelSum = New Dictionary
    Set labelCount = New Dictionary
    Set labelAvg = New Dictionary
    
    With Application.Caller
        OutputRows = .Rows.Count
        OutputCols = .Columns.Count
    End With
    ReDim output(1 To OutputRows, 1 To OutputCols)
    
    'Is dataRange, labelRange and outputRange the same size and shape
    If dataRange.Rows.Count = labelRange.Rows.Count & dataRange.Columns.Count = labelRange.Columns.Count & dataRange.Rows.Count = OutputRows & dataRange.Columns.Count = OutputCols Then
        Exit Function
    End If
    
    If labelRange.Rows.Count < labelRange.Columns.Count Then
        For i = 1 To labelRange.Columns.Count
            If labelSum.Exists(labelRange(1, i).Value) Then
                labelSum.Item(labelRange(1, i).Value) = labelSum.Item(labelRange(1, i).Value) + dataRange(1, i).Value
            Else
                labelSum.Add Key:=labelRange(1, i).Value, Item:=dataRange(1, i).Value
            End If
            If labelCount.Exists(labelRange(1, i).Value) Then
                labelCount.Item(labelRange(1, i).Value) = labelCount.Item(labelRange(1, i).Value) + 1
            Else
                labelCount.Add Key:=labelRange(1, i).Value, Item:=1
            End If
            totalSum = totalSum + dataRange(1, i).Value
            totalCount = totalCount + 1
        Next
    Else
        For i = 1 To labelRange.Rows.Count
            If labelSum.Exists(labelRange(i, 1).Value) Then
                labelSum.Item(labelRange(i, 1).Value) = labelSum.Item(labelRange(i, 1).Value) + dataRange(i, 1).Value
            Else
                labelSum.Add Key:=labelRange(i, 1).Value, Item:=dataRange(i, 1).Value
            End If
            If labelCount.Exists(labelRange(i, 1).Value) Then
                labelCount.Item(labelRange(i, 1).Value) = labelCount.Item(labelRange(i, 1).Value) + 1
            Else
                labelCount.Add Key:=labelRange(i, 1).Value, Item:=1
            End If
            totalSum = totalSum + dataRange(i, 1).Value
            totalCount = totalCount + 1
        Next
    End If
    
    'Calculate average
    'Dim ts As String
    Dim rKey
    For Each rKey In labelSum
        labelAvg.Add Key:=rKey, Item:=((labelSum.Item(rKey) * totalCount) / (labelCount.Item(rKey) * totalSum))
    Next
    
    'Output data
    If dataRange.Rows.Count < dataRange.Columns.Count Then
        For i = 1 To dataRange.Columns.Count
            output(1, i) = dataRange(1, i).Value / labelAvg.Item(labelRange(1, i).Value)
        Next
    Else
        For i = 1 To dataRange.Rows.Count
            output(i, 1) = dataRange(i, 1).Value / labelAvg.Item(labelRange(i, 1).Value)
        Next
    End If
    
    Set labelSum = Nothing
    Set labelCount = Nothing
    Set labelAvg = Nothing
    
    SeasonAverage = output
    
End Function
--------------------------------------------------------------------------


The function takes two arguments. The first is the list of labels. These need to be the list of the seasons (Q1, Q2, etc. or January, February, etc.). The second argument is the data. The output is a variant and hence the function needs to be used as an array function.

The labels are placed in to dictionary objects and averages calculated for each label. Then the data is divided by these averages.

Example

I'm going to use an artificial set of data for my example. The data has been generated by adding a constant line of 100 to a seasonal trend (ranging from -10 to +40) and then some noise (normal distribution with a mean of zero and standard deviation of 4). This is basically the definition of a seasonal data series.

The first chart shows the raw generated data:


The next chart shows the data once the seasonal component has been removed.

And the third chart compares the generated noise with the derived noise from applying the macro.

As expected the noises compare quite well. As they should, given the mathematics involved.

Conclusion

This is a very simplistic way of seasonally adjusting time series data. It does not take into account trend breaks, outliers (although if there is enough data, this method can be used to spot them), holiday movements, trading days, variable days in the month etc.

It is quite a good first stab at adjusting data and seeing long term trends though. Also, because of the simplicity of the method you can be fairly sure that not too many artifacts are being introduced into the data.

As always, comments, potential improvements etc are welcome.

Thursday 1 August 2013

Excel macro to detect outliers

Introduction

Automatically detecting outliers in a set of data is generally fairly difficult. There are various papers for example from the University of York or the University of Pittsburgh summarising the various methods. What follows is a description of one the most simple. It is useful for straight line data (univariate) or any data that can be made linear.

The macro just looks at the correlation of the whole data series and calculates, for each data point, the difference of the correlation from the total without that point.

The point with the highest difference is the point that affects the correlation the most and is the most likely to be the outlier.

The Macro

As per usual most of the macro relates to arranging the data and checking the inputs.
--------------------------------------------------------------------------------
Function OutlierCorrelation(dataRange1 As range, dataRange2 As range) As Variant

    Dim OutputRows As Long
    Dim OutputCols As Long
    Dim output() As Double
    Dim vert As Boolean
    Dim ma As Double
    Dim i, j As Integer
    Dim totalCor As Double
    Dim cor() As Double
    Dim x() As Double
    Dim y() As Double
    
    With Application.Caller
        OutputRows = .Rows.Count
        OutputCols = .Columns.Count
    End With
    ReDim output(1 To OutputRows, 1 To OutputCols)
    
    'Check that dataRange1, dataRange2 and outputRange are the same size
    If dataRange1.Rows.Count <> dataRange2.Rows.Count Then Exit Function
    If dataRange1.Columns.Count <> dataRange2.Columns.Count Then Exit Function
    If dataRange1.Rows.Count <> OutputRows Then Exit Function
    If dataRange1.Columns.Count <> OutputCols Then Exit Function
    
    vert = False
    If dataRange1.Rows.Count > dataRange1.Columns.Count Then
        vert = True
        ReDim x(1 To dataRange1.Rows.Count - 1)
        ReDim y(1 To dataRange1.Rows.Count - 1)
        ReDim cor(1 To dataRange1.Rows.Count)
    Else
        ReDim x(1 To dataRange1.Columns.Count - 1)
        ReDim y(1 To dataRange1.Columns.Count - 1)
        ReDim cor(1 To dataRange1.Columns.Count)
    End If
    
    'Populate output with zeroes
    For i = 1 To OutputRows
        For j = 1 To OutputCols
            output(i, j) = 0
        Next
    Next
    
    'Calculate total correlation
    totalCor = Application.WorksheetFunction.Correl(dataRange1, dataRange2)
    
    If vert = True Then
        For i = 1 To dataRange1.Rows.Count
            For j = 1 To dataRange1.Rows.Count
                If i = j Then
                    
                Else
                    If j > i Then
                        x(j - 1) = dataRange1(j, 1).Value
                        y(j - 1) = dataRange2(j, 1).Value
                    Else
                        x(j) = dataRange1(j, 1).Value
                        y(j) = dataRange2(j, 1).Value
                    End If
                End If
            Next
            cor(i) = Correlation(x, y)
            output(i, 1) = Abs(cor(i) - totalCor)
        Next
    Else
        For i = 1 To dataRange1.Columns.Count
            For j = 1 To dataRange1.Columns.Count
                If i = j Then
                    
                Else
                    If j > i Then
                        x(j - 1) = dataRange1(1, j).Value
                        y(j - 1) = dataRange2(1, j).Value
                    Else
                        x(j) = dataRange1(1, j).Value
                        y(j) = dataRange2(1, j).Value
                    End If
                End If
            Next
            cor(i) = Correlation(x, y)
            output(1, i) = Abs(cor(i) - totalCor)
        Next
    End If
    
    OutlierCorrelation = output
    
End Function
--------------------------------------------------------------------------------


The macro also needs another function to calculate the correlation:
--------------------------------------------------------------------------------
Function Correlation(x() As Double, y() As Double) As Double

    Dim sx As Double
    Dim sy As Double
    Dim s1 As Double
    Dim s2 As Double
    Dim s3 As Double
    Dim k As Integer

    sx = 0
    sy = 0
    s1 = 0
    s2 = 0
    s3 = 0
    For k = 1 To UBound(x)
        sx = x(k) + sx
        sy = y(k) + sy
    Next
    sx = sx / UBound(x)
    sy = sy / UBound(x)
        
    For k = 1 To UBound(x)
        s1 = s1 + (x(k) - sx) * (y(k) - sy)
        s2 = s2 + (x(k) - sx) ^ 2
        s3 = s3 + (y(k) - sy) ^ 2
    Next
    Correlation = s1 / Sqr(s2 * s3)

End Function
--------------------------------------------------------------------------------

Example

Using the following data, which admittedly is very false:

The line was produced using y=2x+3 then adding uniform errors (8*rand()-4). I could have used normal errors using the norminv() function but it was easier using the rand() function on its own. It doesn't make too much difference for the purposes. I then changed the error on one of the points to be much higher. No prizes for spotting it and indeed a quick calculation of point-fitted line would quickly find it.

Using the macro gives:

The outlier is now very obvious. The correlation difference for the outlier is about 20 times higher than for the other points. For comparison the difference from the best fit line for the outlier is 6.5 time the average difference of the other points.

So this example is rather artificial but does mark less obvious outliers with real data.