Introduction
Automatically detecting outliers in a set of data is generally fairly difficult. There are various papers for example from the University of York or the University of Pittsburgh summarising the various methods. What follows is a description of one the most simple. It is useful for straight line data (univariate) or any data that can be made linear.
The macro just looks at the correlation of the whole data series and calculates, for each data point, the difference of the correlation from the total without that point.
The point with the highest difference is the point that affects the correlation the most and is the most likely to be the outlier.
The Macro
As per usual most of the macro relates to arranging the data and checking the inputs.
--------------------------------------------------------------------------------
Function OutlierCorrelation(dataRange1 As range, dataRange2 As range) As Variant
Dim OutputRows As Long
Dim OutputCols As Long
Dim output() As Double
Dim vert As Boolean
Dim ma As Double
Dim i, j As Integer
Dim totalCor As Double
Dim cor() As Double
Dim x() As Double
Dim y() As Double
With Application.Caller
OutputRows = .Rows.Count
OutputCols = .Columns.Count
End With
ReDim output(1 To OutputRows, 1 To OutputCols)
'Check that dataRange1, dataRange2 and outputRange are the same size
If dataRange1.Rows.Count <> dataRange2.Rows.Count Then Exit Function
If dataRange1.Columns.Count <> dataRange2.Columns.Count Then Exit Function
If dataRange1.Rows.Count <> OutputRows Then Exit Function
If dataRange1.Columns.Count <> OutputCols Then Exit Function
vert = False
If dataRange1.Rows.Count > dataRange1.Columns.Count Then
vert = True
ReDim x(1 To dataRange1.Rows.Count - 1)
ReDim y(1 To dataRange1.Rows.Count - 1)
ReDim cor(1 To dataRange1.Rows.Count)
Else
ReDim x(1 To dataRange1.Columns.Count - 1)
ReDim y(1 To dataRange1.Columns.Count - 1)
ReDim cor(1 To dataRange1.Columns.Count)
End If
'Populate output with zeroes
For i = 1 To OutputRows
For j = 1 To OutputCols
output(i, j) = 0
Next
Next
'Calculate total correlation
totalCor = Application.WorksheetFunction.Correl(dataRange1, dataRange2)
If vert = True Then
For i = 1 To dataRange1.Rows.Count
For j = 1 To dataRange1.Rows.Count
If i = j Then
Else
If j > i Then
x(j - 1) = dataRange1(j, 1).Value
y(j - 1) = dataRange2(j, 1).Value
Else
x(j) = dataRange1(j, 1).Value
y(j) = dataRange2(j, 1).Value
End If
End If
Next
cor(i) = Correlation(x, y)
output(i, 1) = Abs(cor(i) - totalCor)
Next
Else
For i = 1 To dataRange1.Columns.Count
For j = 1 To dataRange1.Columns.Count
If i = j Then
Else
If j > i Then
x(j - 1) = dataRange1(1, j).Value
y(j - 1) = dataRange2(1, j).Value
Else
x(j) = dataRange1(1, j).Value
y(j) = dataRange2(1, j).Value
End If
End If
Next
cor(i) = Correlation(x, y)
output(1, i) = Abs(cor(i) - totalCor)
Next
End If
OutlierCorrelation = output
End Function
--------------------------------------------------------------------------------
The macro also needs another function to calculate the correlation:
--------------------------------------------------------------------------------
Function Correlation(x() As Double, y() As Double) As Double
Dim sx As Double
Dim sy As Double
Dim s1 As Double
Dim s2 As Double
Dim s3 As Double
Dim k As Integer
sx = 0
sy = 0
s1 = 0
s2 = 0
s3 = 0
For k = 1 To UBound(x)
sx = x(k) + sx
sy = y(k) + sy
Next
sx = sx / UBound(x)
sy = sy / UBound(x)
For k = 1 To UBound(x)
s1 = s1 + (x(k) - sx) * (y(k) - sy)
s2 = s2 + (x(k) - sx) ^ 2
s3 = s3 + (y(k) - sy) ^ 2
Next
Correlation = s1 / Sqr(s2 * s3)
End Function
--------------------------------------------------------------------------------
Example
Using the following data, which admittedly is very false:
The line was produced using y=2x+3 then adding uniform errors (8*rand()-4). I could have used normal errors using the norminv() function but it was easier using the rand() function on its own. It doesn't make too much difference for the purposes. I then changed the error on one of the points to be much higher. No prizes for spotting it and indeed a quick calculation of point-fitted line would quickly find it.
Using the macro gives:
The outlier is now very obvious. The correlation difference for the outlier is about 20 times higher than for the other points. For comparison the difference from the best fit line for the outlier is 6.5 time the average difference of the other points.
So this example is rather artificial but does mark less obvious outliers with real data.
No comments:
Post a Comment