Tuesday 25 March 2014

Market Research and Statistics Excel add-in Part 3

I've updated the Excel add-in. It's now on version 0.6.13.

Most of the changes are bug fixes but I've added some functionality. Most notably these are:

  • Partial auto-correlation function
  • Generate a set of frequency tables from a data set
  • List file details in a folder
The full list of changes since version 0.6 and the add-in itself are on:
  • sciolist.weebly.com
  • http://sourceforge.net/projects/surveyscience
Comments on the add-in are very welcome.

Thursday 20 March 2014

Excel macro for partial auto correlation


Introduction

Partial auto-correlation (PACF) is useful in time series analysis. It is generally used with the auto-correlation coefficient for determining the order of the ARIMA processes to be fitted to whatever data set you may have.

According to Dr. Roland Fuss 'the partial auto-correlation describes the supplementary information provided by the additional lag'. Lag in this sense being the time difference between one point and another.

Essentially you use the ACF and PACF to determine what equation to fit to a set of time series data.

The equations that describe the PACF are as follows:
The value in the first equation is what we're interested in.

The Macro

The heart of the macro is contained within the following few lines of code:

1    t(0, 0) = 1
2    t(1, 1) = p(1)
3    For k = 2 To maxLag
4        'Calculate factors to take away from p(i)'
5        totalt = 0
6        For j = 1 To k - 1
7            If k - 1 <> j And k - 2 > 0 Then
8                t(k - 1, j) = t(k - 2, j) - t(k - 1, k - 1) * t(k - 2, k - 1 - j)
9            End If
10           totalt = totalt + t(k - 1, j) * p(k - j)
11       Next
12       t(k, k) = (p(k) - totalt) / (1 - totalt)
13        
14   Next


All this macro is doing is taking the auto-correlations, held in p(), and calculating the partial auto-correlations, the t() values or pi in the equation. It's very simple but does assume that all values of t(k,j) for smaller values of k and j have already been calculated.

Line 8 holds equation 2 and calculates the value of t() when k and j are not the same. Otherwise it is assumed that the values have already been calculated.

Lines 6 to 12 hold equation 1.

The bulk of the macro loads the data set, calculates the auto-correlations and only then calculates the partial auto-correlations. It's detailed next.


The Complete Code

And here is the complete PACF function for Excel:

Function PAutoCor(dataRange As range, Optional lag As Variant, Optional diff As Variant) As Variant
    Dim x() As Double           'Array to hold data values'
    Dim lags() As Integer       'Array to hold lag values'
    Dim sx As Double
    Dim sy As Double
    Dim s1 As Double
    Dim s2 As Double
    Dim s3 As Double
    Dim i, k As Integer         'Loop variables'
    Dim outputRows As Long
    Dim outputCols As Long
    Dim output() As Variant     'Temporary output array'
    Dim vertOutput As Boolean
    Dim vertInput As Boolean
    Dim vertLag As Boolean
    Dim numLags As Integer
    Dim maxLag As Integer
    Dim a As Integer
    Dim b As Integer
    Dim dataSize As Integer
    Dim p() As Double
    Dim t() As Double
    Dim j As Integer
    Dim totalt As Double
    
    'How many lags are there to calculate for?'
    With Application.Caller
        outputRows = .Rows.Count
        outputCols = .Columns.Count
    End With
    ReDim output(1 To outputRows, 1 To outputCols)
    
    'Vertical or horizontal output?'
    If outputRows > outputCols Then
        vertOutput = True
        numLags = outputRows
        'Set all output() to "#N/A"'
        For i = 1 To outputRows
            output(i, 1) = CVErr(xlErrNA)
        Next
    Else
        vertOutput = False
        numLags = outputCols
        'Set all output() to "#N/A"'
        For i = 1 To outputCols
            output(1, i) = CVErr(xlErrNA)
        Next
    End If
    
    'Vertical or horizontal input?'
    If dataRange.Rows.Count > dataRange.Columns.Count Then
        vertInput = True
        a = 1
        b = 0
        dataSize = dataRange.Rows.Count - 1
    Else
        vertInput = False
        a = 0
        b = 1
        dataSize = dataRange.Columns.Count - 1
    End If
    ReDim x(dataSize)
    
    'Check that lag is there'
    If IsMissing(lag) Then
        ReDim lags(numLags)
        'Populate lags'
        For i = 1 To numLags
            lags(i) = i
        Next
    ElseIf TypeName(lag) = "Range" Then
        'Horizontal or vertical lag'
        If lag.Rows.Count > lag.Columns.Count Then
            vertLag = True
            'Check that lag range matches output range in size'
            'Need to change this for maxLag'
            If lag.Rows.Count <> numLags Then numLags = lag.Rows.Count
            ReDim lags(numLags)
            maxLag = 0
            'Populate lags'
            For i = 1 To numLags
                lags(i) = lag(i, 1).Value
                If lags(i) > maxLag Then maxLag = lags(i)
            Next
        Else
            vertLag = False
            'Check that lag range matches output range in size'
            If lag.Columns.Count <> numLags Then numLags = lag.Columns.Count
            ReDim lags(numLags)
            maxLag = 0
            'Populate lags'
            For i = 1 To numLags
                lags(i) = lag(1, i).Value
                If lags(i) > maxLag Then maxLag = lags(i)
            Next
        End If
    Else
        'Should I check for array/single values?'
        ReDim lags(1)
        'Need to check if integer'
        lags(1) = lag
        'Set numLags to 1 just in case'
        numLags = 1
    End If
    
    'Check that diff is there'
    If IsMissing(diff) Then
        diff = 0
    End If
    
    'Fill data array x()'
    If diff > 0 Then
        For i = 0 To dataSize - 1
            If vertInput Then
                x(i) = dataRange(i + 1, 1).Value - dataRange(i + 2, 1).Value
            Else
                x(i) = dataRange(1, i + 1).Value - dataRange(1, i + 2).Value
            End If
        Next
    Else
        For i = 0 To dataSize
            If vertInput Then
                x(i) = dataRange(i + 1, 1).Value
            Else
                x(i) = dataRange(1, i + 1).Value
            End If
        Next
    End If
    
    If diff > 1 Then
        For k = 1 To diff
            For i = 0 To dataSize - 1 - k
                x(i) = x(i) - x(i + 1)
            Next
        Next
    End If
    
    'Redim for maxlag'
    ReDim p(maxLag)
    ReDim t(maxLag, maxLag)
    
    'Calculate autocorrelation'
    For i = 0 To maxLag
        sx = 0
        sy = 0
        s1 = 0
        s2 = 0
        s3 = 0
        For k = 0 To dataSize - i - diff
            sx = x(k) + sx
            sy = x(k + i) + sy
        Next
        sx = sx / (dataSize + 1 - i - diff)
        sy = sy / (dataSize + 1 - i - diff)
        
        For k = 0 To dataSize - i - diff
            s1 = s1 + (x(k) - sx) * (x(k + i) - sy)
            s2 = s2 + (x(k) - sx) ^ 2
            s3 = s3 + (x(k + i) - sy) ^ 2
        Next
        
        p(i) = s1 / Sqr(s2 * s3)
    Next
    
    'Set all ts to zero'
    For k = 1 To maxLag
        For j = 1 To maxLag
            t(j, k) = 0
        Next
    Next

    t(0, 0) = 1
    t(1, 1) = p(1)
    For k = 2 To maxLag
        'Calculate factors to take away from p(i)'
        totalt = 0
        For j = 1 To k - 1
            If k - 1 <> j And k - 2 > 0 Then
                t(k - 1, j) = t(k - 2, j) - t(k - 1, k - 1) * t(k - 2, k - 1 - j)
            End If
            totalt = totalt + t(k - 1, j) * p(k - j)
        Next
        t(k, k) = (p(k) - totalt) / (1 - totalt)
        
    Next
    
    For k = 1 To numLags
        If vertOutput Then
            output(k, 1) = t(lags(k), lags(k))
        Else
            output(1, k) = t(lags(k), lags(k))
        End If
    Next
    
    PAutoCor = output
    
End Function


Conclusion

I've added this function to my add-in which can be found on Sourceforge. Version 0.6.12 contains this macro.

As ever, all comments are welcome.