Thursday, 14 August 2014

Kernel Smoothing Macro for Excel

Introduction

There are multiple ways to smooth a series of data. I wrote an earlier entry detailing the use filters to smooth a data set. Kernel smoothing is another method that is related to filter smoothing. However, instead of using a given, limited set of weights to smooth the data, a function is used to calculate weights for every single data point.

I won't go into the mathematical detail here as Wikipedia has several pages that describe the method in a far better way than I ever could. All I will say is that the method is relatively simple. All we're doing is calculating a weighted average of each point based on every other single point. There are two pages that are helpful:

  1. Kernel Smoother - describes the kernel smoother and how it works
  2. Kernel statistics - describes the various kernels that can be used

The Macro

The macro itself takes 4 inputs:
  1. The y-coordinates of the data series
  2. The x-coordinates of the data series
  3. The kernel to use
  4. The scale parameter of the kernel
The x and y coordinates are fairly self-explanatory. The x-coordinates are filled with values from 1 to the number of y values if no x-coordinates are specified.

The kernel is the function used to calculate the weight. There are 8 kernel functions to choose from:
  1. Gaussian
  2. Uniform
  3. Triangular
  4. Epanechnikov
  5. Quartic
  6. Cubic
  7. Logistic
  8. Cosine
The default kernel function used is a gaussian and the default scale is 1. The kernel function can be specified with either the full name or the first letter of the function (two in the case of the cosine function).

The scale parameter determines the 'width' of the function i.e. how many neighbouring points are used to calculate the new value of each point. The scale can either be a reference to a cell containing a number, a range or a number. If the scale is a range then it has to be the same size as the range of the y-coordinates if it's size is greater than 1 cell. Also the scales are used by applying scale 1 to point 1, scale 2 to point 2, etc. for every point.

So the function takes the form:
=SmoothKernel(sy as Range,
              Optional sx as Range,
              Optional kerneltype as Variant,
              Optional scaleP as Variant)

The Code

The code isn't particularly elegant, especially when it comes to specifying the scale parameter and the kernel function. I will probably try to change it at some point but it still works so maybe not.

Also the odd comments at the beginning of the macro are in the form they are because I created a macro to run through the functions and create some HTML documentation from them. I'll write an entry about it at some point.


Public Function SmoothKernel(sy As Range, Optional sx As Range, Optional kerneltype As Variant, Optional scaleP As Variant) As Variant
    '@desc;Function to smooth data via kernels by converting every point to the corresponding kernel and summing the result.
    '@param;sy;Range;The y coordinates
    '@param;sx;Range;The x coordinates
    '@param;kernel type;Variant;The kernel to apply. Should be one of Gaussian, Uniform, Triangular, Epanechnikov, Quartic, Cubic, Logistic or Cosine. Defaults to Gaussian.
    '@param;scale;Variant;The size of the kernel
    '@return;Variant;The smoothed points
    
    Dim outputRows As Long
    Dim outputCols As Long
    Dim output() As Double
    Dim vert As Boolean
    Dim vertOutput As Boolean
    Dim i As Integer
    Dim j As Integer
    Dim k As Integer
    Dim x() As Double
    Dim y() As Double
    Dim kernel As String
    Dim tot_ker1 As Double
    Dim tot_ker2 As Double
    Dim b() As Double
    Dim temp_ker As Double
    
    'Test whether data is arranged vertically or horizontally
    vert = False
    k = sy.Columns.Count
    If sy.Rows.Count > sy.Columns.Count Then
        vert = True
        k = sy.Rows.Count
    End If
    
    With Application.Caller
        outputRows = .Rows.Count
        outputCols = .Columns.Count
    End With
    ReDim output(1 To outputRows, 1 To outputCols)
    
    'Test whether output is arranged vertically or horizontally
    vertOutput = False
    If outputRows > outputCols Then
        vertOutput = True
    End If
    
    'Check that output range is the same size as input range
    If outputRows <> sy.Rows.Count And outputCols <> sy.Columns.Count Then Exit Function
    
    'Populate output with zeroes
    For i = 1 To outputRows
        For j = 1 To outputCols
            output(i, j) = 0
        Next
    Next
    
    'Redimension variables
    ReDim x(1 To k)
    ReDim y(1 To k)
    ReDim b(1 To k)

    'Define kernels
    If IsMissing(kerneltype) Then
        kernel = "k_G"
    ElseIf kerneltype = "Gaussian" Or kerneltype = "G" Then
        kernel = "k_G"
    ElseIf kerneltype = "Uniform" Or kerneltype = "U" Then
        kernel = "k_U"
    ElseIf kerneltype = "Triangular" Or kerneltype = "T" Then
        kernel = "k_T"
    ElseIf kerneltype = "Epanechnikov" Or kerneltype = "E" Then
        kernel = "k_E"
    ElseIf kerneltype = "Quartic" Or kerneltype = "Q" Then
        kernel = "k_Q"
    ElseIf kerneltype = "Cubic" Or kerneltype = "C" Then
        kernel = "k_C"
    ElseIf kerneltype = "Logistic" Or kerneltype = "L" Then
        kernel = "k_L"
    ElseIf kerneltype = "Cosine" Or kerneltype = "Co" Then
        kernel = "k_Co"
    Else
        kernel = "k_G"
    End If
    
    'Define scale
    If IsMissing(scaleP) Then
        For i = 1 To k
            b(i) = 1
        Next
    Else
        If TypeName(scaleP) = "Range" Then
            If scaleP.Rows.Count = 1 And scaleP.Columns.Count = 1 Then
                For i = 1 To k
                    b(i) = scaleP
                Next
            ElseIf scaleP.Rows.Count <> sy.Rows.Count And scaleP.Columns.Count <> sy.Columns.Count Then
                Exit Function
            ElseIf scaleP.Rows.Count > scaleP.Columns.Count Then
                For i = 1 To k
                    b(i) = scaleP(i, 1).Value
                Next
            Else
                For i = 1 To k
                    b(i) = scaleP(1, i).Value
                Next
            End If
        Else
            For i = 1 To k
                b(i) = scaleP
            Next
        End If
    End If
    
    'Populate temporary variables
    For i = 1 To k
        If sx Is Nothing Then
            x(i) = i
        Else
            If vert = True Then
                x(i) = sx(i, 1).Value
                y(i) = sy(i, 1).Value
            Else
                x(i) = sx(1, i).Value
                y(i) = sy(1, i).Value
            End If
        End If
    Next
    
    For i = 1 To k
        tot_ker1 = 0
        tot_ker2 = 0
        For j = 1 To k
            If kernel = "k_U" Then
                temp_ker = k_U(x(i), x(j), b(j))
            ElseIf kernel = "k_T" Then
                temp_ker = k_T(x(i), x(j), b(j))
            ElseIf kernel = "k_E" Then
                temp_ker = k_E(x(i), x(j), b(j))
            ElseIf kernel = "k_Q" Then
                temp_ker = k_Q(x(i), x(j), b(j))
            ElseIf kernel = "k_C" Then
                temp_ker = k_C(x(i), x(j), b(j))
            ElseIf kernel = "k_L" Then
                temp_ker = k_L(x(i), x(j), b(j))
            ElseIf kernel = "k_Co" Then
                temp_ker = k_Co(x(i), x(j), b(j))
            Else
                temp_ker = k_G(x(i), x(j), b(j))
            End If
            
            tot_ker1 = tot_ker1 + temp_ker * y(j)
            tot_ker2 = tot_ker2 + temp_ker
        Next
        If vertOutput = True Then
            output(i, 1) = tot_ker1 / tot_ker2
        Else
            output(1, i) = tot_ker1 / tot_ker2
        End If
    Next
    
    SmoothKernel = output
    
End Function

The kernel functions are:

Private Function k_G(x1 As Double, x2 As Double, b As Double) As Double
    'Gaussian kernel
    k_G = Exp(-(((x1 - x2) ^ 2) / (2 * b ^ 2)))
    
End Function

Private Function k_E(x1 As Double, x2 As Double, b As Double) As Double
    'Epanechnikov kernel
    If Abs((x1 - x2) / b) > 1 Then
        k_E = 0
    Else
        k_E = (3 / 4) * (1 - ((x1 - x2) / b) ^ 2)
    End If

End Function

Private Function k_L(x1 As Double, x2 As Double, b As Double) As Double
    'Logistic kernel
    k_L = 1 / (Exp((x1 - x2) / b) + Exp(-(x1 - x2) / b))
End Function

Private Function k_U(x1 As Double, x2 As Double, b As Double) As Double
    'Uniform kernel
    If Abs((x1 - x2) / b) > 1 Then
        k_U = 0
    Else
        k_U = 1 / 2
    End If

End Function

Private Function k_T(x1 As Double, x2 As Double, b As Double) As Double
    'Triangular kernel
    If Abs((x1 - x2) / b) > 1 Then
        k_T = 0
    Else
        k_T = (1 - Abs((x1 - x2) / b))
    End If

End Function

Private Function k_Q(x1 As Double, x2 As Double, b As Double) As Double
    'Quartic kernel
    If Abs((x1 - x2) / b) > 1 Then
        k_Q = 0
    Else
        k_Q = (15 / 16) * (1 - ((x1 - x2) / b) ^ 2) ^ 2
    End If

End Function

Private Function k_C(x1 As Double, x2 As Double, b As Double) As Double
    'Cubic kernel
    If Abs((x1 - x2) / b) > 1 Then
        k_C = 0
    Else
        k_C = (35 / 32) * (1 - ((x1 - x2) / b) ^ 2) ^ 3
    End If

End Function

Private Function k_Co(x1 As Double, x2 As Double, b As Double) As Double
    'Cosine kernel
    If Abs((x1 - x2) / b) > 1 Then
        k_Co = 0
    Else
        k_Co = (WorksheetFunction.Pi / 4) * Cos((WorksheetFunction.Pi / 2) * ((x1 - x2) / b))
    End If

End Function

Example

Just to give you a flavour of what the macro does, I created a series of data based on the sine function. Specifically the data is sin(x)+0.3*rand()-0.3

This will also show the shortcomings of the method and one way of surmounting them (although not entirely).



The above picture shows the original data series with the s=axis showing degrees. Let's apply the function to the data by entering:
=SmoothKernel(C2:C362,A2:A362,"G",1)
into cells D2:D362 and then pressing Ctrl+Shift+Enter to enter the formula. This gives:

It's better but not smooth. I've also added a dashed line to indicate a sine function - the original function that we hope to obtain from smoothing the data. So now can increase the scale to 7:

This makes the resulting data series smooth but notice the ends. Because there are no data points to the left of the first few points (obviously) the estimate is too high. It's correct in the sense that this is the weighted average for the point but it's obviously not a good estimate.

So what can we do? Change the scale parameter for the points around the ends. If we increase the scale where the estimate is off and decrease the scale close to these points. In this case, I made the weights 14, 12, 10 and then 3 for 19 cells. This gave:

Better, but it still shows end effects so you need to be careful what scale/weight you apply.

One other way of applying the weights here would be to apply weight 1 to all points when calculating the smoothed value of point 1. This would limit the end effect in a different way. I haven't written this into the macro yet but may do later.

As ever comments are welcome.