Sunday, 22 December 2013

D3 - Simple Line Chart

There seem to be a lot of tutorials on the web about D3 (www.d3js.org). However I've yet to find one that explains,simply, how to create a line chart. They all seem to be overly complicated, don't explain exactly what is happening, or call functions, classes and variables by the same name.

So here's my attempt. It took me a good few hours to get this working, which is pretty poor given that what I achieved is extremely simple.

Wednesday, 27 November 2013

Interactive Excel Maps - part 2

Last time I detailed the classes needed for the maps. This time I'll show the code I used to create the maps.

The first thing we need to do is create a new module and set up some global variables. A dictionary object to hold a reference to the map object and ensure that the memory used by it doesn't get collected. A variable to hold the count of maps and a MapEvent object that will switch on and off events when switching sheets and books etc.

Public mcd As New Dictionary
Public mcCount As Integer
Dim clsMapEvent As New MapEvent

The next step is to create the subroutines that will enable and disable the map events. These subroutines are called from the MapEvent class.

Sub Set_All_Maps()
    Dim mc1 As New MapChart
    Dim cs As New VBA.Collection
    Dim chtNumber As Integer
    
    ' Will not work or enable events for active sheet if sheet is a chart sheet'
 
    ' Enable events for all charts embedded on a sheet'
    ' Works for embedded charts on a worksheet or chart sheet'
    If ActiveSheet.ChartObjects.Count > 0 Then
        Dim chtObj As ChartObject
        Dim chtnum As Integer
        Dim sp As Shape
 
        chtnum = 1
        For Each chtObj In ActiveSheet.ChartObjects
            If Left(chtObj.name, 3) = "Map" Then
                Set mc1.MapChart = chtObj.Chart
                Set mc1.srcDataRange = range(Cells(2, 1), Cells(2, 2)).CurrentRegion
                For Each sp In chtObj.Chart.Shapes
                    cs.Add sp
                Next
                Set mc1.reg_shapes = cs
                If Len(chtObj.name) > 3 Then
                    chtNumber = CInt(Mid(chtObj.name, 5, Len(chtObj.name) - 4))
                    mc1.SetScaleLow Workbooks("surveyScience.xlam").Worksheets("MapSheet").Cells(chtNumber, 4).Value, Workbooks("surveyScience.xlam").Worksheets("MapSheet").Cells(chtNumber, 5).Value, Workbooks("surveyScience.xlam").Worksheets("MapSheet").Cells(chtNumber, 6).Value
                    mc1.SetScaleHigh Workbooks("surveyScience.xlam").Worksheets("MapSheet").Cells(chtNumber, 7).Value, Workbooks("surveyScience.xlam").Worksheets("MapSheet").Cells(chtNumber, 8).Value, Workbooks("surveyScience.xlam").Worksheets("MapSheet").Cells(chtNumber, 9).Value
                End If
                mcd.Add mcd.Count + 1, mc1
            End If
        Next
    End If
End Sub
 
Sub Reset_All_Maps()
    ' Disable events for all charts previously enabled'
    Dim chtnum As Integer
    
    On Error Resume Next
    
    For chtnum = 1 To mcd.Count
        Set mcd.Item(chtnum).MapChart = Nothing
    Next
    mcd.RemoveAll
End Sub

Some things need explaining in the above code. Firstly this is all run from an add-in I've called surveyscience.xlam (available here, if you're interested). It contains a sheet called MapSheet that holds details of the maps that I've defined. This is used to update the properties of the map chart. Also, this will only work for embedded charts. Adding code for this should be fairly simple though. I didn't bother as I was using the maps for a dashboard.


The Map Routine

Finally, we need to define the subroutine to create the actual map. This happens in two stages. The first stage loads the map definition to memory, creating the region and region_bit objects along the way. The map created by this macro is based on the details held in the MapSheet worksheet. The variable mapRow passed to the subroutine is used to look up the details from MapSheet.

The second part creates the chart object and draws the map over the chart. The macro is fairly long unfortunately.

Private Sub CreateMap(mapRow As Integer)
    'Maps'
    Dim sdr As range        'Source data range'
    Dim mc1 As New MapChart 'New map chart'
    Dim co As Shape         'Base for map'
    
    'Maxs and mins of map data'
    Dim minX As Double
    Dim minY As Double
    Dim maxX As Double
    Dim maxY As Double

    minX = 1E+200
    minY = 1E+200
    maxX = -1E+200
    maxY = -1E+200
    
    'Load file'
    Dim i As Integer
    Dim j As Integer
    Dim k As Integer
    Dim fileNum As Integer
    Dim mapFSO As FileSystemObject
    Dim mapFile
    Dim LineText As String
    Dim arr
    Dim header
    Dim regName As String
    Dim typ As String
    Dim mapRegs As New VBA.Collection
    Dim temp_reg As Region
    Dim temp_bit As Region_Bit
    Dim temp_point As MapPoint2D
    Dim temp_coord
    Dim defStr1 As String
    Dim wk As Worksheet
    Dim mapRange As range
    
    'Add workbook if none exists'
    If Workbooks.Count = 0 Then
        Workbooks.Add
    End If
    
    'Add worksheet'
    Set wk = Worksheets.Add

    'Get path of map'
    Set mapRange = Workbooks("surveyScience.xlam").Worksheets("MapSheet").Cells(1, 1).CurrentRegion
    Set mapRange = range(mapRange(mapRow, 1), mapRange(mapRow, 9))
    defStr1 = Workbooks("surveyScience.xlam").Worksheets("MapSheet").Cells(mapRow, 2).Text
    
    Set mapFSO = CreateObject("Scripting.FileSystemObject")
    Set mapFile = mapFSO.OpenTextFile(defStr1, 1, False, -1)
    
    'Extract data from file'
    Do While mapFile.AtEndOfStream <> True
        LineText = mapFile.ReadLine
        arr = Split(LineText, ";")
        header = Split(arr(0), ",")
        regName = header(5)
        wk.Cells(mapRegs.Count + 2, 1) = regName
        Set temp_reg = New Region
        temp_reg.setName (regName)
        temp_reg.setColour CInt(header(1)), CInt(header(2)), CInt(header(3))
        If header(0) = "f" Then
            temp_reg.setFill (True)
        Else
            temp_reg.setFill (False)
        End If
            
        arr = Split(arr(1), " ")
        For i = 0 To UBound(arr)
            If arr(i) = "M" Then
                Set temp_bit = New Region_Bit
                typ = "M"
            ElseIf arr(i) = "L" Then
                typ = "L"
            ElseIf arr(i) = "z" Then
                typ = "z"
                temp_reg.addBit temp_bit
                Set temp_bit = Nothing
            Else
                'Split point into x,y'
                temp_coord = Split(arr(i), ",")
                Set temp_point = New MapPoint2D
                temp_point.SetX (temp_coord(0))
                temp_point.SetY (temp_coord(1))
                If typ = "M" Then
                    temp_bit.setStartPoint temp_point
                ElseIf typ = "L" Then
                    temp_bit.addPoint temp_point
                End If
                    
                'Set mins and maxs'
                If temp_coord(0) > maxX Then maxX = temp_coord(0)
                If temp_coord(1) > maxY Then maxY = temp_coord(1)
                If temp_coord(0) < minX Then minX = temp_coord(0)
                If temp_coord(1) < minY Then minY = temp_coord(1)
                    
                Set temp_point = Nothing
            End If
        Next
            
        mapRegs.Add temp_reg
        Set temp_reg = Nothing
    Loop
    mapFile.Close
    
    'Create chart'
    Set co = wk.Shapes.AddChart
    co.Select
    
    'Increase size of chart by 20%'
    co.width = co.width * 1.2
    co.Height = co.Height * 1.2
    
    'Change name so that we can identify chart as a map'
    ActiveChart.Parent.name = "Map_" & mapRow
    
    'Assign active charts as map chart'
    Set mc1.MapChart = ActiveChart
    Set sdr = wk.range(Cells(2, 1), Cells(mapRegs.Count + 1, 2))
    Set mc1.srcDataRange = sdr
    Call mc1.SetScaleLow(mapRange.Cells(1, 4).Value, mapRange.Cells(1, 5).Value, mapRange.Cells(1, 6).Value)
    Call mc1.SetScaleHigh(mapRange.Cells(1, 7).Value, mapRange.Cells(1, 8).Value, mapRange.Cells(1, 9).Value)
    'Set properties of chart'
    ActiveChart.SetSourceData Source:=sdr
    ActiveChart.ChartType = xlColumnClustered
    ActiveChart.PlotBy = xlColumns
    ActiveChart.Legend.Delete
    ActiveChart.HasTitle = False
    ActiveChart.Axes(xlValue).Delete
    ActiveChart.Axes(xlCategory).Delete
    ActiveChart.Axes(xlValue).MajorGridlines.Delete
    
    'Maxs and mins of plot area'
    Dim x1 As Double
    Dim y1 As Double
    Dim x2 As Double
    Dim y2 As Double
    x1 = ActiveChart.PlotArea.Left
    y1 = ActiveChart.PlotArea.Top
    x2 = ActiveChart.PlotArea.Left + ActiveChart.PlotArea.width
    y2 = ActiveChart.PlotArea.Top + ActiveChart.PlotArea.Height
    
    'Background'
    'Chart Area background'
    With ActiveChart.Shapes.BuildFreeform(msoEditingAuto, -6, -6)
        .AddNodes msoSegmentLine, msoEditingAuto, -6, ActiveChart.ChartArea.Height
        .AddNodes msoSegmentLine, msoEditingAuto, ActiveChart.ChartArea.width, ActiveChart.ChartArea.Height
        .AddNodes msoSegmentLine, msoEditingAuto, ActiveChart.ChartArea.width, -6
        .AddNodes msoSegmentLine, msoEditingAuto, -6, -6
        .ConvertToShape.name = "Background"
    End With
    
    'Calculate scaling factors'
    Dim mx As Double
    Dim cx As Double
    Dim my As Double
    Dim cy As Double
    Dim dA As Double
    Dim rA As Double
    Dim ex As Double
    
    mx = (x2 - x1) / (maxX - minX)
    my = (y2 - y1) / (maxY - minY)
    cx = x1 - mx * minX
    cy = y1 - my * minY
    
    'Keep aspect the same'
    dA = Abs((y2 - y1) / (x2 - x1))
    rA = Abs((maxY - minY) / (maxX - minX))
    If (dA > rA) Then
        ex = (maxY - minY) * (dA / rA - 1)
        mx = (x2 - x1) / (maxX - minX)
        my = (y2 - y1) / ((maxY + ex / 2) - (minY - ex / 2))
        cx = x1 - mx * minX
        cy = y1 - my * (minY - ex / 2)
    ElseIf (dA < rA) Then
        ex = (maxX - minX) * (rA / dA - 1)
        mx = (x2 - x1) / ((maxX + ex / 2) - (minX - ex / 2))
        my = (y2 - y1) / (maxY - minY)
        cx = x1 - mx * (minX - ex / 2)
        cy = y1 - my * minY
    Else
        mx = 1
        my = 1
        cx = 0
        cy = 0
    End If
    
    'Draw parts of map'
    Dim reg_shape
    Dim t As Shape
    Dim rs As New VBA.Collection
    
    'Cycle through regions'
    For i = 1 To mapRegs.Count
        'Cycle through bits'
        For j = 1 To mapRegs.Item(i).getNumBits
            Set temp_bit = mapRegs.Item(i).getBit(j)
            'Create new shape'
            Set reg_shape = ActiveChart.Shapes.BuildFreeform(msoEditingAuto, temp_bit.getStartPoint().GetX() * mx + cx, temp_bit.getStartPoint().GetY() * my + cy)
            'Cycle through points'
            For k = 1 To mapRegs.Item(i).getBit(j).getNumPoints
                'Add point to shape'
                reg_shape.AddNodes msoSegmentLine, msoEditingAuto, temp_bit.getPoint(k).GetX() * mx + cx, temp_bit.getPoint(k).GetY() * my + cy
            Next
            'Convert to shape'
            If mapRegs.Item(i).getFill() = True Then
                Set t = reg_shape.ConvertToShape
                With t
                    .name = mapRegs.Item(i).getName
                    .fill.ForeColor.RGB = mapRegs.Item(i).getColour2()
                    .Line.ForeColor.RGB = RGB(0, 0, 0)
                    .Line.Weight = 0.25
                    
                End With
                rs.Add t
            End If
            
            Set reg_shape = Nothing
        Next
    Next
    
    Set mc1.reg_shapes = rs
    mcCount = mcCount + 1
    mcd.Add mcCount, mc1
End Sub

Conclusion

And there it is, a way to create reasonable and extensible maps in Excel.

Presumably you could combine the two parts of the macro and create the shapes directly from the file. It might be easier. However, splitting the code makes it easier to change the subroutine to load in different map formats.

One limitation that I've found in Excel 2013 is that the shapes seem to default to a mitre join. This makes the maps look horrendous. There is no way to programmatically change the joins to round. You have to manually create a shape, change the join type, set the shape as the default type and then run the macro. Odd.

As usual, comments are welcome.

Tuesday, 26 November 2013

Interactive Excel Maps - part 1

Introduction

I've recently been trying to get some interactive maps working in Excel for a dashboard we were using at work.

What follows is the first part of an explanation of how I did it and an example of what it looks like.


Of course I could have created the maps much more simply in R or Quantum GIS but that wouldn't have helped for the dashboard.

Thursday, 21 November 2013

Market Research and Statistics Excel add-in Part 2

I've removed the password protection from my add-in and it's now on version 0.6.

It's in the same place on www.surveyscience.co.uk

Wednesday, 23 October 2013

Excel Box Plot Macro

Introduction

A boxplot is a way of plotting the median of a set of data points alongside the quartile ranges and any outliers. Basically it's a good way of showing the range and distribution of a univariate data set.

The boxplot consists of a line showing the median, a box encompassing the first and third quartiles, and whiskers showing lines at 1.5 times the inter-quartile range above the third and below the first quartile. The inter-quartile range is just the third quartile minus the first quartile.

The image below is an example of what the macro produces:


It was produced from a random set of data generated from the normal distribution with a mean of 0 and a standard deviation of 1.

The Macro

The macro takes the raw data, a univariate set of points, calculates the quartiles and plots them on an open-high-low-close stock chart. Lines are added for the median and the inter-quartile boundaries. The outliers are added as a XY scatter plot series.

A form is used to get the range holding the data but you can adjust the macro to used any old range. The form also has a box to select whether you want the outliers plotted. Again, you can change the macro for this.

There are two parts to the macro. The first takes the data, calculates the various values and plots them. The second adjusts the chart to update it when it is resized or recalculated.

Here's the first part:

-------------------------------------------------------------------------------
'Define global variables
Dim bpd As New Dictionary
Dim boxPlotCount As Integer

Sub Boxplot()
    Dim q1 As Double
    Dim q2 As Double
    Dim q3 As Double
    Dim iqr As Double
    Dim Data As range
    Dim counter As Long
    Dim i As Long
    Dim outliers As range
    Dim outliersArray() As Integer
    Dim a1 As Shape
    Dim a2 As Shape
    Dim a3 As Shape
    Dim m As Double
    Dim c As Double
    Dim bpt As New BoxPlotChart
    
    'Show form to get location of data
    BoxPlotForm.Show
    
    If BoxPlotForm.cancelB = True Then
        Unload BoxPlotForm
        Exit Sub
    End If
    
    Set Data = range(BoxPlotForm.RefEdit1.Text)
    
    'Calculate median, first quartile, third quartile and inter-quartile range
    q1 = WorksheetFunction.Quartile_Inc(Data, 1)
    q2 = WorksheetFunction.Quartile_Inc(Data, 2)
    q3 = WorksheetFunction.Quartile_Inc(Data, 3)
    iqr = 1.5 * (q3 - q1)
    
    'Create new sheet
    Worksheets.Add
    Cells(1, 2).Value = "Data"
    Cells(1, 3).Value = "Chart Data"
    
    Cells(2, 1).Value = "Inter quartile range"
    Cells(3, 1).Value = "Quartile 1"
    Cells(4, 1).Value = "Median"
    Cells(5, 1).Value = "Quartile 3"
    Cells(6, 1).Value = "Inter quartile range"
    
    Cells(2, 2).Value = iqr
    Cells(3, 2).Value = q1
    Cells(4, 2).Value = q2
    Cells(5, 2).Value = q3
    Cells(6, 2).Value = iqr
    
    'Filter list of points to those outside IQR
    Cells(1, 5).Value = "Extremes"
    counter = 2
    For i = 1 To Data.Rows.Count
        If Data(i, 1).Value > (q3 + iqr) Or Data(i, 1).Value < (q1 - iqr) Then
            Cells(counter, 5).Value = Data(i, 1).Value
            counter = counter + 1
        End If
    Next
    
    Set outliers = range(Cells(2, 5), Cells(counter - 1, 5))
    ReDim outliersArray(counter - 2)
    For i = 0 To counter - 2
        outliersArray(i) = 1
    Next
        
    Cells(2, 3).Value = iqr
    Cells(3, 3).Value = q1
    Cells(4, 3).Value = q3 + iqr
    Cells(5, 3).Value = q1 - iqr
    Cells(6, 3).Value = q3
        
    For i = 9 To 12
        Cells(3, i).Value = q1
        Cells(4, i).Value = q3 + iqr
        Cells(5, i).Value = q1 - iqr
        Cells(6, i).Value = q3
    Next
    
    'Plot a ohlc chart chart
    ActiveSheet.Shapes.AddChart.Select
    ActiveChart.SetSourceData Source:=range(Cells(3, 9), Cells(6, 12))
    ActiveChart.ChartType = xlStockOHLC
    ActiveChart.PlotBy = xlRows
    ActiveChart.SetSourceData Source:=range(Cells(3, 3), Cells(6, 3))
    Set bpt.BoxChart = ActiveChart
    
    'Delete temporary data
    range(Cells(3, 9), Cells(6, 12)).ClearContents
    
    'Plot outliers
    If BoxPlotForm.PlotCheckBox.Value = True And counter > 2 Then
        'Plot these as a scatterplot with x of 1
        For i = 1 To counter - 2
            With ActiveChart.SeriesCollection.NewSeries
                .ChartType = xlXYScatter
                .Values = outliers(i, 1)
                .Name = "Outliers" & i
                .XValues = outliersArray(i - 1)
                .MarkerStyle = -4168
                .MarkerSize = 3
                .MarkerForegroundColor = RGB(0, 0, 0)
            End With
        Next
        
        'Need to make sure that both axes line up
        ActiveChart.Axes(xlValue, xlSecondary).MinimumScale = ActiveChart.Axes(xlValue).MinimumScale
        ActiveChart.Axes(xlValue, xlSecondary).MaximumScale = ActiveChart.Axes(xlValue).MaximumScale
        ActiveChart.Axes(xlValue).MinimumScale = ActiveChart.Axes(xlValue, xlSecondary).MinimumScale
        ActiveChart.Axes(xlValue).MaximumScale = ActiveChart.Axes(xlValue, xlSecondary).MaximumScale
    End If
    
    'Get rid of horizontal grid lines, x axis labels and ticks
    With ActiveChart.Axes(xlCategory)
        .MajorTickMark = xlNone
        .TickLabelPosition = xlNone
    End With
    ActiveChart.Legend.Delete
    ActiveChart.Axes(xlValue).MajorGridlines.Delete
    
    'Plot line on chart
    m = (ActiveChart.PlotArea.InsideTop - (ActiveChart.PlotArea.InsideTop + ActiveChart.PlotArea.InsideHeight)) / (ActiveChart.Axes(xlValue).MaximumScale - ActiveChart.Axes(xlValue).MinimumScale)
    c = ActiveChart.PlotArea.InsideTop - m * ActiveChart.Axes(xlValue).MaximumScale
    Set bpt.s1 = ActiveChart.Shapes.AddLine((ActiveChart.PlotArea.InsideLeft + 0.3 * ActiveChart.PlotArea.InsideWidth), m * q2 + c, (ActiveChart.PlotArea.InsideLeft + 0.7 * ActiveChart.PlotArea.InsideWidth), m * q2 + c)
    bpt.s1.Line.ForeColor.ObjectThemeColor = msoThemeColorAccent2
    
    'Plot lines at high and low markers
    Set bpt.s2 = ActiveChart.Shapes.AddLine((ActiveChart.PlotArea.InsideLeft + 0.33 * ActiveChart.PlotArea.InsideWidth), m * (q3 + iqr) + c, (ActiveChart.PlotArea.InsideLeft + 0.66 * ActiveChart.PlotArea.InsideWidth), m * (q3 + iqr) + c)
    bpt.s2.Line.ForeColor.ObjectThemeColor = msoThemeColorText1
    Set bpt.s3 = ActiveChart.Shapes.AddLine((ActiveChart.PlotArea.InsideLeft + 0.33 * ActiveChart.PlotArea.InsideWidth), m * (q1 - iqr) + c, (ActiveChart.PlotArea.InsideLeft + 0.66 * ActiveChart.PlotArea.InsideWidth), m * (q1 - iqr) + c)
    bpt.s3.Line.ForeColor.ObjectThemeColor = msoThemeColorText1
    
    bpt.tq = q3 + iqr
    bpt.lq = q1 - iqr
    bpt.q2 = q2
    
    boxPlotCount = boxPlotCount + 1
    'Add chart to dictionary
    bpd.Add boxPlotCount, bpt
    
End Sub
-------------------------------------------------------------------------------

And here is the boxplot object that replots the lines when it is adjusted. You'll need to create a class module called 'BoxPlotChart' to hold the code:

-------------------------------------------------------------------------------
Option Explicit

Public WithEvents BoxChart As Chart
Public posY As Integer
Public s1 As Shape
Public s2 As Shape
Public s3 As Shape
Public tq As Double
Public lq As Double
Public q2 As Double

Private Sub BoxChart_Calculate()
    'Redraw the median line
    
    'Need to know the position of the top and bottom in both
    'screen coordinates and scale
    
    Dim m As Double
    Dim c As Double
    
    m = (ActiveChart.PlotArea.InsideTop - (ActiveChart.PlotArea.InsideTop + ActiveChart.PlotArea.InsideHeight)) / (ActiveChart.Axes(xlValue).MaximumScale - ActiveChart.Axes(xlValue).MinimumScale)
    c = ActiveChart.PlotArea.InsideTop - m * ActiveChart.Axes(xlValue).MaximumScale
    
    If Not (s1 Is Nothing) Then
        s1.Left = (ActiveChart.PlotArea.InsideLeft + 0.3 * ActiveChart.PlotArea.InsideWidth)
        s1.Top = m * q2 + c
        s1.width = (0.4 * ActiveChart.PlotArea.InsideWidth)
    End If
    
    If Not (s2 Is Nothing) Then
        s2.Left = (ActiveChart.PlotArea.InsideLeft + 0.33 * ActiveChart.PlotArea.InsideWidth)
        s2.Top = m * tq + c
        s2.width = (0.33 * ActiveChart.PlotArea.InsideWidth)
    End If
    
    If Not (s3 Is Nothing) Then
        s3.Left = (ActiveChart.PlotArea.InsideLeft + 0.33 * ActiveChart.PlotArea.InsideWidth)
        s3.Top = m * lq + c
        s3.width = (0.33 * ActiveChart.PlotArea.InsideWidth)
    End If
    
End Sub

Private Sub BoxChart_Resize()
    'Redraw the median line
    
    'Need to know the position of the top and bottom in both
    'screen coordinates and scale
    
    Dim m As Double
    Dim c As Double
    
    m = (ActiveChart.PlotArea.InsideTop - (ActiveChart.PlotArea.InsideTop + ActiveChart.PlotArea.InsideHeight)) / (ActiveChart.Axes(xlValue).MaximumScale - ActiveChart.Axes(xlValue).MinimumScale)
    c = ActiveChart.PlotArea.InsideTop - m * ActiveChart.Axes(xlValue).MaximumScale
    
    s1.Left = (ActiveChart.PlotArea.InsideLeft + 0.3 * ActiveChart.PlotArea.InsideWidth)
    s1.Top = m * q2 + c
    s1.width = (0.4 * ActiveChart.PlotArea.InsideWidth)
    
    s2.Left = (ActiveChart.PlotArea.InsideLeft + 0.33 * ActiveChart.PlotArea.InsideWidth)
    s2.Top = m * tq + c
    s2.width = (0.33 * ActiveChart.PlotArea.InsideWidth)
    
    s3.Left = (ActiveChart.PlotArea.InsideLeft + 0.33 * ActiveChart.PlotArea.InsideWidth)
    s3.Top = m * lq + c
    s3.width = (0.33 * ActiveChart.PlotArea.InsideWidth)
    
End Sub
-------------------------------------------------------------------------------

Limitations

Although the macro works fine when you run it, if you save the workbook, close it and then reopen it, the charts are just charts. They no longer resize properly.

Presumably you'd need to cycle through the charts in the workbook and re-assign any that were boxplots to BoxPlotCharts.

Limitation number two is that it only creates a boxplot for one set of data. It would be fairly trivial to extend this to more than one set of data.

Limitation number three is that the function used to calculate the quartiles only appeared in Excel 2010. I believe that there is a 'quartile' function in previous versions though so it can be changed for this.

Conclusion

Hopefully this macro is useful. As ever, if you have any comments, improvements or find any bugs, etc. then please comment.

Saturday, 19 October 2013

Spline Interpolation Excel Macro

Introduction

This is my attempt at producing an Excel function to interpolate data using splines. A good introduction to spline interpolation is, once again, given by wikipedia.

The algorithm is taken from the book 'Numerical Analysis, 7th Edition, Burden and Faires'. The macro seems to give the same answers as the example in the book so I'm fairly confident that it mostly works.

The macro takes:
  1. The x coordinates of a set of points
  2. The y coordinates of a set of points
  3. The x coordinates of the points that you want to interpolate
  4. An optional parameter of the gradient at point zero
  5. An optional parameter of the gradient at point n
If the optional parameters are missing then free interpolation is performed, otherwise a clamped interpolation is done. The macro outputs the interpolated Y coordinates.


Monday, 30 September 2013

Market Research and Statistics Excel add-in

Just a quick post to say that I've collected the various macros detailed on this blog, with others, in one Excel add in.

I've put this add in on to:

Hopefully people will find it useful. One word of caution - it's about version 0.1 at the moment. You'll need to be wary of bugs etc.

But please give it a go and tell me what you think.

Thursday, 5 September 2013

Mean, mode and median on a skewed noisy distribution

Introduction

There are a lot of articles and blog entries on the differences between the three (usual) measures of the average of a distribution. Here's my version.

The Averages

  • Mean - the easiest to calculate as it's just the sum of the data points divided by the count of the point. See wikipedia for more details.
  • Mode - the most common value within a set of data points. See wikipedia for more details.
  • Median - the middle point of an ordered set of data points. If the number of points is even then the average of the two middle points can be taken. Again it's probably best to look at wikipedia for more complete details


The Data

The data points are taken from an ad-hoc survey. People were asked to answer a set of questions and the time to complete them was recorded. The times recorded range from around 3 seconds to just over 100,000 seconds. Both seem extremely unlikely. Below is a chart of the data with a red line denoting the smoothed distribution:
 

The chart was produced in R with the ggplot2 package. Basically the data points were loaded, tabulated and smoothed. As you can see, not much detail is shown. The extremes completely swamp the interesting data. A second chart, restricted to an upper limit of 800 seconds shows the details more clearly:

You'll also note that I've coloured the line according to the frequency. There's no reason apart from the fact that I think it looks good. It adds no meaning to the chart.

So what values do we get for the three averages from the raw data:
  • Mean - 198 seconds
  • Mode - 95 seconds
  • Median - 132 seconds
For this type of distribution with data that is skewed to the right we would always expect mean to be bigger than the median which will be bigger than the mode.

The following chart shows the same plot as above but with the averages shown:


There are two problems with these values though.
  1. The outliers are being included in the averages. The bulk of the data lies between 0 and 400 seconds. Values above this should be included but at what point do we start to say that the data points are outliers.
  2. The mode is calculated from the noisy data. On the distribution above, this is probably not too much of a problem. But what if the peak wasn't so pronounced? Then the noise could produce a value that is far from the 'true' value of the peak.
So what do we do? And what is the best measure of the average?

Removing Outliers

There are many ways to remove outliers but in this case I'm just going to look at the effect of the data points on the mean. So, calculating the mean for point 1, then points 1 to 2, and so forth until we get to a mean of points 1 to n. We can then show the effect of the individual points on the mean:
The elbow of this line lies around 500 seconds. Beyond this point the mean rise slowly but significantly. Showing the lower values in more detail:
This show that the mean basically stops rising with any 'speed' at around 2000 seconds. This seems a reasonable cut off point.

Recalculating the mean for this upper limit gives a value of 171 seconds, a reduction of 27 seconds (14%).

Noisy Modes

The next problem was how much is the mode affected by having noisy data. To see this I'm going to fit a spline to the data using the R function smooth.spline. I've restricted the degrees of freedom to 40. The default value gave a line that was too wavy.

The smoothed line is shown above in chart 2. You can see that the peak of the smoothed data does not correspond to the most common data point.

To calculate the max of the smoothed data you can just run:
dur3 <- smooth.spline(x=dur2$dur,y=dur2$freq,df=40)
dur4 <- data.frame(cbind(dur3$x,dur3$y))
dur4$X1[dur4$X2==max(dur4$X2)]
where dur2 is the tabulated original data.

This gives a modal value of 107 seconds, an increase of 12 seconds.

The Median

This doesn't change at all when restring the data set to 0 to 2000 seconds.

Which Average to choose?

So, which average should you choose to show the average time of completion for the questionnaire.

I think that the modal value would give an answer that was too low. It doesn't take into account the skew nature of the data.

The mean also has problems in that it is overly affected by outliers. It's not too much trouble to remove these for this scenario. After all if it takes around 2 minutes for most people to complete the questionnaire then cutting the data off at around 30 minutes seems sensible.

The median value seems a good compromise. It could still be said to be too low given the nature of the distribution but it is relatively simple to calculate and you don't have to worry about outliers in the data set.

Another question to answer would do we need a single measure of an average for this? The chart itself gives a better indication of how long it took to answer. By reducing it to a single value we lose a lot of the information such as the spread of the times. Maybe taking the mode and the spread at half the height of the peak would be better. This gives a spread of 67 to 174 seconds and therefore includes all measures of the average.

Friday, 23 August 2013

Death Statistics in England and Wales - Postscript

In the last post about death statistics I put in an animated GIF that I had created using a combination of Excel and GIMP.

Quite why I did it this way I don't know. It took hours (probably) to adjust the chart, copy it to paint, save it and then copy it as a layer to GIMP.

I've since discovered a few websites detailing how to create exactly the same thing using R. Googling 'animated gif R' gives, almost, thousands of examples. I now realise quite how stupid I was.

Below is the script that I used:

-------------------------------------------------------------------------------
death <- read.table("death.txt",header=TRUE)
#summary(death)

library(ggplot2)
library(Cairo)

for(i in 2:49){
  dp <- ggplot(data=death,aes(x=Age,y=death[,i])) + geom_line() + ylim(0,0.05)
  
  if(i<10){
    ttle <- paste('plot0',i,'.png',sep="")
  } else {
    ttle <- paste('plot',i,'.png',sep="")
  }

  CairoPNG(filename=ttle,width=480,height=480,dpi=300)
  #png(filename=ttle)
  print(dp)
  dev.off()
}

shell('"C:/Program Files/GraphicsMagick/gm.exe" convert -delay 15 -loop 0 plot*.png death.gif')
shell('"C:/Program Files/FFMpeg/bin/ffmpeg.exe" -start_number 02 -i plot%02d.png -vcodec libx264 death.mp4')
-------------------------------------------------------------------------------

The script loads in the data which is arranged as a table with age as the rows, year as the columns and deaths as a proportion of the total.

It then loads the required libraries.

Lines 7-20 do the work of creating the plots. These are saved as PNG files. Two things to note:
  1. I've set the y axis scale to run from 0->0.05. This is to keep the axes on a consistent scale.
  2. To make sure that the PNG files are ordered correctly I've had to zero pad the numbers for 2-9.
The work of creating the animated GIF is done by GraphicMagick. You can also use ImageMagick.

I won't go into any more detail as other sites give much better information. There's no point copying it.

The last line uses ffmeg to create a movie from the plots. This was inspired by www.animatedgraphs.co.uk.

Below is the resultant GIF:

The whole process takes seconds - once the script is written.

Thursday, 8 August 2013

An Excel macro to simplistically seasonally adjust time series data

Introduction

There are many ways to deseasonalise seasonal time series data.

I'm going to write about one of the easier methods. I can't remember what the official name for the method is but all we're doing is dividing the data point by the average for each seasonal period.

That definition probably needs explaining. So, say, for example we have quarterly data for 3 years. We take the average for each quarter over the years i.e. average Q1 for years 1, 2 and 3. Then do the same for Q2, Q3 and Q4. After this you divide the Q1 points by the Q1 average, the Q2 points by the Q2 average, etc.

There are some things to be careful of when using this method:

  1. The more data you have the better. You'll probably need more than 3 years worth of data for this to work well.
  2. You need to be careful of outliers affecting the seasonal averages by too much when there is very little data.
  3. The data series needs to be stationary. Or at least, mostly stationary. If it's not, you tend to get step changes in the resulting seasonally adjusted data. By stationary I mean that there is no long term trend in the data.

The Macro

Here is the listing of the macro.

--------------------------------------------------------------------------
Function SeasonAverage(labelRange As range, dataRange As range) As Variant

    Dim labelSum As Dictionary
    Dim labelCount As Dictionary
    Dim labelAvg As Dictionary
    Dim totalSum As Double
    Dim totalCount As Integer
    Dim OutputRows As Long
    Dim OutputCols As Long
    Dim output() As Double
    Dim i As Integer
    
    totalSum = 0
    totalCount = 0
    
    Set labelSum = New Dictionary
    Set labelCount = New Dictionary
    Set labelAvg = New Dictionary
    
    With Application.Caller
        OutputRows = .Rows.Count
        OutputCols = .Columns.Count
    End With
    ReDim output(1 To OutputRows, 1 To OutputCols)
    
    'Is dataRange, labelRange and outputRange the same size and shape
    If dataRange.Rows.Count = labelRange.Rows.Count & dataRange.Columns.Count = labelRange.Columns.Count & dataRange.Rows.Count = OutputRows & dataRange.Columns.Count = OutputCols Then
        Exit Function
    End If
    
    If labelRange.Rows.Count < labelRange.Columns.Count Then
        For i = 1 To labelRange.Columns.Count
            If labelSum.Exists(labelRange(1, i).Value) Then
                labelSum.Item(labelRange(1, i).Value) = labelSum.Item(labelRange(1, i).Value) + dataRange(1, i).Value
            Else
                labelSum.Add Key:=labelRange(1, i).Value, Item:=dataRange(1, i).Value
            End If
            If labelCount.Exists(labelRange(1, i).Value) Then
                labelCount.Item(labelRange(1, i).Value) = labelCount.Item(labelRange(1, i).Value) + 1
            Else
                labelCount.Add Key:=labelRange(1, i).Value, Item:=1
            End If
            totalSum = totalSum + dataRange(1, i).Value
            totalCount = totalCount + 1
        Next
    Else
        For i = 1 To labelRange.Rows.Count
            If labelSum.Exists(labelRange(i, 1).Value) Then
                labelSum.Item(labelRange(i, 1).Value) = labelSum.Item(labelRange(i, 1).Value) + dataRange(i, 1).Value
            Else
                labelSum.Add Key:=labelRange(i, 1).Value, Item:=dataRange(i, 1).Value
            End If
            If labelCount.Exists(labelRange(i, 1).Value) Then
                labelCount.Item(labelRange(i, 1).Value) = labelCount.Item(labelRange(i, 1).Value) + 1
            Else
                labelCount.Add Key:=labelRange(i, 1).Value, Item:=1
            End If
            totalSum = totalSum + dataRange(i, 1).Value
            totalCount = totalCount + 1
        Next
    End If
    
    'Calculate average
    'Dim ts As String
    Dim rKey
    For Each rKey In labelSum
        labelAvg.Add Key:=rKey, Item:=((labelSum.Item(rKey) * totalCount) / (labelCount.Item(rKey) * totalSum))
    Next
    
    'Output data
    If dataRange.Rows.Count < dataRange.Columns.Count Then
        For i = 1 To dataRange.Columns.Count
            output(1, i) = dataRange(1, i).Value / labelAvg.Item(labelRange(1, i).Value)
        Next
    Else
        For i = 1 To dataRange.Rows.Count
            output(i, 1) = dataRange(i, 1).Value / labelAvg.Item(labelRange(i, 1).Value)
        Next
    End If
    
    Set labelSum = Nothing
    Set labelCount = Nothing
    Set labelAvg = Nothing
    
    SeasonAverage = output
    
End Function
--------------------------------------------------------------------------


The function takes two arguments. The first is the list of labels. These need to be the list of the seasons (Q1, Q2, etc. or January, February, etc.). The second argument is the data. The output is a variant and hence the function needs to be used as an array function.

The labels are placed in to dictionary objects and averages calculated for each label. Then the data is divided by these averages.

Example

I'm going to use an artificial set of data for my example. The data has been generated by adding a constant line of 100 to a seasonal trend (ranging from -10 to +40) and then some noise (normal distribution with a mean of zero and standard deviation of 4). This is basically the definition of a seasonal data series.

The first chart shows the raw generated data:


The next chart shows the data once the seasonal component has been removed.

And the third chart compares the generated noise with the derived noise from applying the macro.

As expected the noises compare quite well. As they should, given the mathematics involved.

Conclusion

This is a very simplistic way of seasonally adjusting time series data. It does not take into account trend breaks, outliers (although if there is enough data, this method can be used to spot them), holiday movements, trading days, variable days in the month etc.

It is quite a good first stab at adjusting data and seeing long term trends though. Also, because of the simplicity of the method you can be fairly sure that not too many artifacts are being introduced into the data.

As always, comments, potential improvements etc are welcome.

Thursday, 1 August 2013

Excel macro to detect outliers

Introduction

Automatically detecting outliers in a set of data is generally fairly difficult. There are various papers for example from the University of York or the University of Pittsburgh summarising the various methods. What follows is a description of one the most simple. It is useful for straight line data (univariate) or any data that can be made linear.

The macro just looks at the correlation of the whole data series and calculates, for each data point, the difference of the correlation from the total without that point.

The point with the highest difference is the point that affects the correlation the most and is the most likely to be the outlier.

The Macro

As per usual most of the macro relates to arranging the data and checking the inputs.
--------------------------------------------------------------------------------
Function OutlierCorrelation(dataRange1 As range, dataRange2 As range) As Variant

    Dim OutputRows As Long
    Dim OutputCols As Long
    Dim output() As Double
    Dim vert As Boolean
    Dim ma As Double
    Dim i, j As Integer
    Dim totalCor As Double
    Dim cor() As Double
    Dim x() As Double
    Dim y() As Double
    
    With Application.Caller
        OutputRows = .Rows.Count
        OutputCols = .Columns.Count
    End With
    ReDim output(1 To OutputRows, 1 To OutputCols)
    
    'Check that dataRange1, dataRange2 and outputRange are the same size
    If dataRange1.Rows.Count <> dataRange2.Rows.Count Then Exit Function
    If dataRange1.Columns.Count <> dataRange2.Columns.Count Then Exit Function
    If dataRange1.Rows.Count <> OutputRows Then Exit Function
    If dataRange1.Columns.Count <> OutputCols Then Exit Function
    
    vert = False
    If dataRange1.Rows.Count > dataRange1.Columns.Count Then
        vert = True
        ReDim x(1 To dataRange1.Rows.Count - 1)
        ReDim y(1 To dataRange1.Rows.Count - 1)
        ReDim cor(1 To dataRange1.Rows.Count)
    Else
        ReDim x(1 To dataRange1.Columns.Count - 1)
        ReDim y(1 To dataRange1.Columns.Count - 1)
        ReDim cor(1 To dataRange1.Columns.Count)
    End If
    
    'Populate output with zeroes
    For i = 1 To OutputRows
        For j = 1 To OutputCols
            output(i, j) = 0
        Next
    Next
    
    'Calculate total correlation
    totalCor = Application.WorksheetFunction.Correl(dataRange1, dataRange2)
    
    If vert = True Then
        For i = 1 To dataRange1.Rows.Count
            For j = 1 To dataRange1.Rows.Count
                If i = j Then
                    
                Else
                    If j > i Then
                        x(j - 1) = dataRange1(j, 1).Value
                        y(j - 1) = dataRange2(j, 1).Value
                    Else
                        x(j) = dataRange1(j, 1).Value
                        y(j) = dataRange2(j, 1).Value
                    End If
                End If
            Next
            cor(i) = Correlation(x, y)
            output(i, 1) = Abs(cor(i) - totalCor)
        Next
    Else
        For i = 1 To dataRange1.Columns.Count
            For j = 1 To dataRange1.Columns.Count
                If i = j Then
                    
                Else
                    If j > i Then
                        x(j - 1) = dataRange1(1, j).Value
                        y(j - 1) = dataRange2(1, j).Value
                    Else
                        x(j) = dataRange1(1, j).Value
                        y(j) = dataRange2(1, j).Value
                    End If
                End If
            Next
            cor(i) = Correlation(x, y)
            output(1, i) = Abs(cor(i) - totalCor)
        Next
    End If
    
    OutlierCorrelation = output
    
End Function
--------------------------------------------------------------------------------


The macro also needs another function to calculate the correlation:
--------------------------------------------------------------------------------
Function Correlation(x() As Double, y() As Double) As Double

    Dim sx As Double
    Dim sy As Double
    Dim s1 As Double
    Dim s2 As Double
    Dim s3 As Double
    Dim k As Integer

    sx = 0
    sy = 0
    s1 = 0
    s2 = 0
    s3 = 0
    For k = 1 To UBound(x)
        sx = x(k) + sx
        sy = y(k) + sy
    Next
    sx = sx / UBound(x)
    sy = sy / UBound(x)
        
    For k = 1 To UBound(x)
        s1 = s1 + (x(k) - sx) * (y(k) - sy)
        s2 = s2 + (x(k) - sx) ^ 2
        s3 = s3 + (y(k) - sy) ^ 2
    Next
    Correlation = s1 / Sqr(s2 * s3)

End Function
--------------------------------------------------------------------------------

Example

Using the following data, which admittedly is very false:

The line was produced using y=2x+3 then adding uniform errors (8*rand()-4). I could have used normal errors using the norminv() function but it was easier using the rand() function on its own. It doesn't make too much difference for the purposes. I then changed the error on one of the points to be much higher. No prizes for spotting it and indeed a quick calculation of point-fitted line would quickly find it.

Using the macro gives:

The outlier is now very obvious. The correlation difference for the outlier is about 20 times higher than for the other points. For comparison the difference from the best fit line for the outlier is 6.5 time the average difference of the other points.

So this example is rather artificial but does mark less obvious outliers with real data.