Wednesday 29 May 2013

An Excel Auto-Correlation Macro

Wikipedia defines the auto-correlation as the cross-correlation of a signal with itself. For a series of data points measured through time it is the correlation between points measured a specific time apart.

I won't go into any more detail about auto-correlation as there are many good descriptions online:

  1. Wikipedia
  2. Mathworld
  3. NIST Engineering Statistics Handbook - this handbook is well worth reading for definitions of other time series related functions
There's no point repeating what others have already written. Especially when they've written it better than I ever could.

The only things I will define relate directly to the function. These are:
  • Lag - this is the number of points n from point t to calculate the auto-correlation for. So for lag=1, the correlation is calculated for points (t, t+1, t+2, ...) with points (t+1, t+2, t+3, ...).
  • Difference - when calculating the auto-correlation one often needs to calculate it on the series (t+1-t) rather than (t). This is to ensure some degree of stationarity and a clearer picture of the auto-correlation.

Function

So, on to the function. It's another array function (see here for details on how to use them). I've
called the function 'AutoCor' and is used as per the example below:
=AutoCor($C$3:$C$304,$H$3:$H$23,I$2)
where, in this case cells $C$3:$C$304 hold the data, $H$3:$H$23 hold the lags and I$2 holds the number of times to difference the data.

-----------------------------------------------------------------------------

Function AutoCor(dataRange As range, lag As range, diff As Integer) As Variant

    Dim x() As Double           'Array to hold data values
    Dim sx As Double
    Dim sy As Double
    Dim s1 As Double
    Dim s2 As Double
    Dim s3 As Double
    Dim i, j, k As Integer      'Loop variables
    Dim y() As Integer          'Lag array
    Dim OutputRows As Long
    Dim OutputCols As Long
    Dim output() As Double
    
    'Set-up data variable
    ReDim x(dataRange.Rows.Count - 1)
    If diff > 0 Then
        For i = 0 To dataRange.Rows.Count - 2
            x(i) = dataRange(i + 1, 1).Value - dataRange(i + 2, 1).Value
        Next
    Else
        For i = 0 To dataRange.Rows.Count - 1
            x(i) = dataRange(i + 1, 1).Value
        Next
    End If
    
    If diff > 1 Then
        For k = 1 To diff
            For i = 0 To dataRange.Rows.Count - 2 - k
                x(i) = x(i) - x(i + 1)
            Next
        Next
    End If
    
    'Set-up lag variable
    ReDim y(lag.Rows.Count - 1)
    For i = 0 To lag.Rows.Count - 1
        y(i) = lag(i + 1, 1).Value
    Next
    
    'Set-up output variable
    With Application.Caller
        OutputRows = .Rows.Count
        OutputCols = .Columns.Count
    End With
    If OutputRows <> dataRange.Rows.Count And OutputCols <> dataRange.Columns.Count Then Exit Function
    ReDim output(1 To OutputRows, 1 To OutputCols)
    
    For j = 0 To lag.Rows.Count - 1
        sx = 0
        sy = 0
        s1 = 0
        s2 = 0
        s3 = 0
        For k = 0 To dataRange.Rows.Count - y(j) - 1 - diff
            sx = x(k) + sx
            sy = x(k + y(j)) + sy
        Next
        sx = sx / (dataRange.Rows.Count - y(j) - diff)
        sy = sy / (dataRange.Rows.Count - y(j) - diff)
        
        For k = 0 To dataRange.Rows.Count - y(j) - 1 - diff
            s1 = s1 + (x(k) - sx) * (x(k + y(j)) - sy)
            s2 = s2 + (x(k) - sx) ^ 2
            s3 = s3 + (x(k + y(j)) - sy) ^ 2
        Next
        output(j + 1, 1) = s1 / Sqr(s2 * s3)
    
    Next
    
    AutoCor = output

End Function
-----------------------------------------------------------------------------


Example

For the example, I'm going to use the clothing sales indices for the UK from 1988 to 2013 from the ONS:

This is a lot of data (in times series analysis anyway) and should give a nice clear auto-correlation. I'll plot lags of 0 to 20 for differences of 0 to 4 (because I can).


Lag 0 1 2 3 4
0 1 1 1 1 1
1 0.723 -0.298 -0.687 -0.758 -0.801
2 0.609 -0.165 0.192 0.308 0.392
3 0.590 -0.084 0.020 -0.017 -0.066
4 0.612 -0.047 -0.080 -0.097 -0.096
5 0.660 0.118 0.142 0.157 0.168
6 0.642 -0.053 -0.166 -0.180 -0.190
7 0.654 0.112 0.134 0.150 0.162
8 0.602 -0.040 -0.070 -0.087 -0.087
9 0.572 -0.086 0.017 -0.022 -0.075
10 0.592 -0.167 0.177 0.301 0.395
11 0.710 -0.282 -0.666 -0.744 -0.791
12 0.994 0.984 0.980 0.979 0.976
13 0.709 -0.289 -0.674 -0.743 -0.785
14 0.588 -0.165 0.189 0.303 0.385
15 0.568 -0.090 0.016 -0.019 -0.067
16 0.592 -0.045 -0.076 -0.094 -0.094
17 0.643 0.119 0.143 0.158 0.168
18 0.624 -0.055 -0.170 -0.183 -0.192
19 0.637 0.116 0.137 0.152 0.164
20 0.582 -0.039 -0.068 -0.085 -0.087

Plotting differences 0 and 1 gives:
 and:


The original time series has a strong month on month growth and this is reflected in the auto-correlation chart with no differencing. There are very few lags with a value less than 0.6.

However, when the data is differenced once all correlations for lags other than 1 and 12 are close to zero. This is expected in this data series as the sales of clothing are highly seasonal with Christmas being the peak period every year.

Significance

How can we tell whether any one value of auto-correlation is significant though? It is generally accepted that significance is shown by a value that is greater than the square root of 2/number of points.

In this case, this is 0.081.

Due to the number of points, several lags are deemed significant, although only lag 12 has a very high significance.

2 comments:

  1. hi!can you elaborate a bit on the "difference" term? I am trying to find the correlation of a controller's output signal in order to determine its oscillation..i ve tried with the box-jenkins autocorrelation function (rk) but the results dindnt make any sense..i have thousands of sample data and all are equi-spaced..if you have any ideas that may help me please reply!
    THANK YOU!

    ReplyDelete
    Replies
    1. The difference term is just the number of times the data has been differenced. By differencing we're calculating the auto-correlation on the difference between one data point and the next i.e. x(i+1)-x(i). Generally this removes trend from a data series. This would normally have to be done as most of the analysis techniques rely on this to be true (among other things). There are plenty of sites that have a lot more detail on this - national statistics sites are normally quite good. The Australian Bureau of Statistics has a good introduction to TS analysis (google 'abs time series introductory course')

      Box-jenkins generally refers to ARIMA/ARMA modelling with auto-correlation being used to work out which model you should be using. I'm not sure what you're doing. If you just want to know the frequencies of the oscillation, a fourier transform may be useful. It depends on the type of data you're getting.

      Delete