Stats, MR and Data: Excel macro to detect outliers

Introduction

Automatically detecting outliers in a set of data is generally fairly difficult. There are various papers for example from the University of York or the University of Pittsburgh summarising the various methods. What follows is a description of one the most simple. It is useful for straight line data (univariate) or any data that can be made linear.

The macro just looks at the correlation of the whole data series and calculates, for each data point, the difference of the correlation from the total without that point.

The point with the highest difference is the point that affects the correlation the most and is the most likely to be the outlier.

The Macro

As per usual most of the macro relates to arranging the data and checking the inputs.

--------------------------------------------------------------------------------

Function OutlierCorrelation(dataRange1 As range, dataRange2 As range) As Variant

Dim OutputRows As Long

Dim OutputCols As Long

Dim output() As Double

Dim vert As Boolean

Dim ma As Double

Dim i, j As Integer

Dim totalCor As Double

Dim cor() As Double

Dim x() As Double

Dim y() As Double

With Application.Caller

OutputRows = .Rows.Count

OutputCols = .Columns.Count

End With

ReDim output(1 To OutputRows, 1 To OutputCols)

'Check that dataRange1, dataRange2 and outputRange are the same size

If dataRange1.Rows.Count <> dataRange2.Rows.Count Then Exit Function

If dataRange1.Columns.Count <> dataRange2.Columns.Count Then Exit Function

If dataRange1.Rows.Count <> OutputRows Then Exit Function

If dataRange1.Columns.Count <> OutputCols Then Exit Function

vert = False

If dataRange1.Rows.Count > dataRange1.Columns.Count Then

vert = True

ReDim x(1 To dataRange1.Rows.Count - 1)

ReDim y(1 To dataRange1.Rows.Count - 1)

ReDim cor(1 To dataRange1.Rows.Count)

Else

ReDim x(1 To dataRange1.Columns.Count - 1)

ReDim y(1 To dataRange1.Columns.Count - 1)

ReDim cor(1 To dataRange1.Columns.Count)

End If

'Populate output with zeroes

For i = 1 To OutputRows

For j = 1 To OutputCols

output(i, j) = 0

'Calculate total correlation

totalCor = Application.WorksheetFunction.Correl(dataRange1, dataRange2)

If vert = True Then

For i = 1 To dataRange1.Rows.Count

For j = 1 To dataRange1.Rows.Count

If i = j Then

Else

If j > i Then

x(j - 1) = dataRange1(j, 1).Value

y(j - 1) = dataRange2(j, 1).Value

Else

x(j) = dataRange1(j, 1).Value

y(j) = dataRange2(j, 1).Value

End If

cor(i) = Correlation(x, y)

output(i, 1) = Abs(cor(i) - totalCor)

Else

For i = 1 To dataRange1.Columns.Count

For j = 1 To dataRange1.Columns.Count

If i = j Then

Else

If j > i Then

x(j - 1) = dataRange1(1, j).Value

y(j - 1) = dataRange2(1, j).Value

Else

x(j) = dataRange1(1, j).Value

y(j) = dataRange2(1, j).Value

End If

cor(i) = Correlation(x, y)

output(1, i) = Abs(cor(i) - totalCor)

End If

OutlierCorrelation = output

End Function

--------------------------------------------------------------------------------

The macro also needs another function to calculate the correlation:

--------------------------------------------------------------------------------

Function Correlation(x() As Double, y() As Double) As Double

Dim sx As Double

Dim sy As Double

Dim s1 As Double

Dim s2 As Double

Dim s3 As Double

Dim k As Integer

sx = 0

sy = 0

s1 = 0

s2 = 0

s3 = 0

For k = 1 To UBound(x)

sx = x(k) + sx

sy = y(k) + sy

sx = sx / UBound(x)

sy = sy / UBound(x)

For k = 1 To UBound(x)

s1 = s1 + (x(k) - sx) * (y(k) - sy)

s2 = s2 + (x(k) - sx) ^ 2

s3 = s3 + (y(k) - sy) ^ 2

Correlation = s1 / Sqr(s2 * s3)

End Function

--------------------------------------------------------------------------------

Example

Using the following data, which admittedly is very false:

The line was produced using y=2x+3 then adding uniform errors (8*rand()-4). I could have used normal errors using the norminv() function but it was easier using the rand() function on its own. It doesn't make too much difference for the purposes. I then changed the error on one of the points to be much higher. No prizes for spotting it and indeed a quick calculation of point-fitted line would quickly find it.

Using the macro gives:

The outlier is now very obvious. The correlation difference for the outlier is about 20 times higher than for the other points. For comparison the difference from the best fit line for the outlier is 6.5 time the average difference of the other points.

So this example is rather artificial but does mark less obvious outliers with real data.

Stats, MR and Data

Thursday, 1 August 2013

Excel macro to detect outliers

Introduction

The Macro

Example

No comments:

Post a Comment