Monday, 9 March 2015

Petrol Price Exploration

Introduction

With the petrol price coming down so rapidly recently I decided to have a look and see what correlation there is between the crude oil price and the price of a litre of petrol sold at a garage.

I'm sure this sort of analysis has come up quite a lot recently. It's certainly nothing new but I thought it would be interesting nonetheless.

The first thing to do is to get hold of the necessary data:
  1. Petrol prices - ONS
  2. Brent crude oil prices - US Energy Information Administration
  3. Dollar to Pound conversion rates - Bank of England
The petrol prices are an average of pump prices collected from four oil companies and two supermarkets on the Monday of each week. The data goes back to 2003.

The Brent crude spot prices are a daily series although there are gaps in the series for holidays etc. The data goes back to 1987.

The dollar to pound conversion rate is again a daily series and again there are gaps on the holidays. The gaps are similar to the gaps found in the Brent crude prices but are not always the same - different country, different holidays.



It's interesting to note that the drop in price in 2008 was much larger than the recent drop. The two series show similar trends but they're not currently directly comparable. Let's see if we can improve on that.


Manipulation

First we need to convert the Brent crude prices to pounds per barrel. For this we use the Bank of England conversion rate data.
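In the spreadsheet this is just a division. As a sketch, with hypothetical cell references (B2 holding the Brent price in dollars per barrel and C2 the dollars-per-pound rate for the same day):

=B2/C2

which gives the price in pounds per barrel.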

Next we take only the Brent crude prices from the Monday of each week so that we are comparing prices on the same day of each week.
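One way to pick out the Mondays (again with hypothetical references: dates in column A, converted prices in column B) is a helper column using Excel's WEEKDAY function, which returns 2 for a Monday by default:

=IF(WEEKDAY(A2)=2, B2, "")

Filtering out the blanks then leaves the Monday-only series.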

For both of these data series I had to interpolate some data points to fill holes in the series. This was done using the SplineInterpolate macro from the Survey Science Excel add-in. There's no particular reason to use this over a straight-line interpolation. I just thought I would.

The fourth adjustment to the data was to the pump prices of petrol. I removed the taxes. There are two taxes added - a duty and VAT. I've assumed here that duty is added first and then VAT.
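As a sketch of the tax removal (assuming the rates in force at the time: fuel duty of 57.95p per litre and VAT at 20%), with the pump price in pence per litre in A2, the ex-tax price is:

=A2/1.2 - 57.95

i.e. strip the VAT off first, then subtract the duty.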

The result is below:


As you can see, the correlation is remarkably good. Some of this apparent correlation is due to the way Excel has chosen the scales but it still looks good.

I did also try a smoothing algorithm on the data but I think it hides too much of the detail for the analysis.


Correlation

So taking the data from the beginning of 2011, there appears to be a fairly stable set of prices. Let's see how much time lag there is between the crude oil price and pump prices.

To do this, I've taken the prices and calculated the correlation for the two original series and then calculated the correlation for the two series when the pump prices are shifted back by a number of weeks.
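A sketch of the shifted correlation, with a hypothetical layout (weekly crude prices in B2:B196, pump prices in C2:C196 on matching dates, and the lag in weeks in F1, keeping the ranges clear of the ends of the data):

=CORREL(B2:B196, OFFSET(C2:C196, $F$1, 0))

A positive lag compares each crude price with the pump price that many weeks later. Varying F1 and charting the correlations gives the charts below.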

We get the following two charts:


The first chart shows the correlation for shifts of -14 to +14 weeks. The second chart shows the peak in more detail with interpolated points (spline).

This shows that pump prices follow crude oil prices most closely 2 weeks and 5 days later, i.e. there is a lag of 2 weeks and 5 days.

Accounting for this lag, let's plot the two series against each other:

The plot looks indicative of a strong relationship between the two prices. However, note the few points towards the bottom left. These come from the recent drop in prices. Taking these out we get:


Still a strong relationship, but the R squared value has dropped from 0.9 to 0.64. So you can predict the pump price from the crude oil price, but the prediction will often be off by a fair amount. Much of this is due to the volatility of the crude oil prices compared to the pump prices, as the plot below shows. It is a plot of the two series accounting for the lag:


Further Analysis

One of the questions often raised is whether the price at the pump rises quickly on a rise in crude oil but drops slowly when crude oil prices decline.

There are two related ways to look at this:
  • Are the peaks in the two series closer together than the troughs?
  • Are the positive gradients of the pump prices steeper than the negative gradients?
Unfortunately I'm going to have to leave that for another time. Until then.

Wednesday, 4 February 2015

Extracting data from text files using Java

I use Java a fair amount to look at data contained within text files. It can be data from surveys, logs of various processes and sometimes extracts from databases.

This post links to four articles on how to read data line by line from a simple text file. The four articles are all very similar but detail different methods of accessing the data. They all consist of five steps:
  • Defining the file
  • Opening the file
  • Reading in the data
  • Doing something with the data
  • Outputting the result
The articles are:
  1. Extract data from a simple text file
  2. Extract data from gzipped text files - useful if the individual files are large. The biggest bottleneck is disc access time, especially on HDDs.
  3. Extract data from zipped text files (zip archives)
  4. Extract data from a directory of text files - it's often the case that you want to analyse data from a whole raft of files.
Hopefully they will be of use.

Wednesday, 15 October 2014

New rim weighting Excel macro

Just a quick post to say that I've created a new rim weighting macro that takes separate demographic and target ranges as inputs. This is a change from the previous macro, which made you put targets against every cell in the input range.

So the input looks like this:


Clicking on the weight by order (new) button brings up the dialog:

As you can see, the form is much simpler. The output looks like this:

The macro can be found in versions 0.8.x+ of my add-in which is in turn found on Survey Science.

Thursday, 14 August 2014

Kernel Smoothing Macro for Excel

Introduction

There are multiple ways to smooth a series of data. I wrote an earlier entry detailing the use of filters to smooth a data set. Kernel smoothing is another method that is related to filter smoothing. However, instead of using a given, limited set of weights to smooth the data, a function is used to calculate weights for every single data point.

I won't go into the mathematical detail here as Wikipedia has several pages that describe the method in a far better way than I ever could. All I will say is that the method is relatively simple: all we're doing is calculating a weighted average of each point based on every other point (the formula is given after the links below). There are two pages that are helpful:

  1. Kernel smoother - describes the kernel smoother and how it works
  2. Kernel (statistics) - describes the various kernels that can be used
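In symbols, what the macro below computes is the Nadaraya-Watson weighted average: for a kernel K and per-point scales b_j,

\hat{y}_i = \frac{\sum_{j=1}^{n} K\left(\frac{x_i - x_j}{b_j}\right) y_j}{\sum_{j=1}^{n} K\left(\frac{x_i - x_j}{b_j}\right)}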

The Macro

The macro itself takes 4 inputs:
  1. The y-coordinates of the data series
  2. The x-coordinates of the data series
  3. The kernel to use
  4. The scale parameter of the kernel
The x and y coordinates are fairly self-explanatory. The x-coordinates are filled with values from 1 to the number of y values if no x-coordinates are specified.

The kernel is the function used to calculate the weight. There are 8 kernel functions to choose from:
  1. Gaussian
  2. Uniform
  3. Triangular
  4. Epanechnikov
  5. Quartic
  6. Cubic
  7. Logistic
  8. Cosine
The default kernel function used is a Gaussian and the default scale is 1. The kernel function can be specified with either the full name or the first letter of the function (the first two letters in the case of the cosine function).

The scale parameter determines the 'width' of the function i.e. how many neighbouring points contribute to the new value of each point. The scale can be a number, a reference to a single cell containing a number, or a range. If it is a range larger than a single cell then it must be the same size as the y-coordinate range, and the scales are applied point by point: scale 1 to point 1, scale 2 to point 2, and so on.

So the function takes the form:
=SmoothKernel(sy as Range,
              Optional sx as Range,
              Optional kerneltype as Variant,
              Optional scaleP as Variant)
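For example (cell references hypothetical), entered as an array formula over an output range the same size as the input:

=SmoothKernel(B2:B100, A2:A100, "E", 2)
=SmoothKernel(B2:B100, A2:A100, "G", D2:D100)

The first uses an Epanechnikov kernel with a fixed scale of 2; the second uses a Gaussian kernel with a separate scale for each point.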

The Code

The code isn't particularly elegant, especially when it comes to specifying the scale parameter and the kernel function. I will probably try to change it at some point but it still works so maybe not.

Also the odd comments at the beginning of the macro are in the form they are because I created a macro to run through the functions and create some HTML documentation from them. I'll write an entry about it at some point.


Public Function SmoothKernel(sy As Range, Optional sx As Range, Optional kerneltype As Variant, Optional scaleP As Variant) As Variant
    '@desc;Function to smooth data via kernels by converting every point to the corresponding kernel and summing the result.
    '@param;sy;Range;The y coordinates
    '@param;sx;Range;The x coordinates
    '@param;kerneltype;Variant;The kernel to apply. Should be one of Gaussian, Uniform, Triangular, Epanechnikov, Quartic, Cubic, Logistic or Cosine. Defaults to Gaussian.
    '@param;scaleP;Variant;The scale (width) of the kernel
    '@return;Variant;The smoothed points
    
    Dim outputRows As Long
    Dim outputCols As Long
    Dim output() As Double
    Dim vert As Boolean
    Dim vertOutput As Boolean
    Dim i As Long
    Dim j As Long
    Dim k As Long
    Dim x() As Double
    Dim y() As Double
    Dim kernel As String
    Dim tot_ker1 As Double
    Dim tot_ker2 As Double
    Dim b() As Double
    Dim temp_ker As Double
    
    'Test whether data is arranged vertically or horizontally
    vert = False
    k = sy.Columns.Count
    If sy.Rows.Count > sy.Columns.Count Then
        vert = True
        k = sy.Rows.Count
    End If
    
    With Application.Caller
        outputRows = .Rows.Count
        outputCols = .Columns.Count
    End With
    ReDim output(1 To outputRows, 1 To outputCols)
    
    'Test whether output is arranged vertically or horizontally
    vertOutput = False
    If outputRows > outputCols Then
        vertOutput = True
    End If
    
    'Check that output range is the same size as input range
    If outputRows <> sy.Rows.Count And outputCols <> sy.Columns.Count Then Exit Function
    
    'Populate output with zeroes
    For i = 1 To outputRows
        For j = 1 To outputCols
            output(i, j) = 0
        Next
    Next
    
    'Redimension variables
    ReDim x(1 To k)
    ReDim y(1 To k)
    ReDim b(1 To k)

    'Define kernels
    If IsMissing(kerneltype) Then
        kernel = "k_G"
    ElseIf kerneltype = "Gaussian" Or kerneltype = "G" Then
        kernel = "k_G"
    ElseIf kerneltype = "Uniform" Or kerneltype = "U" Then
        kernel = "k_U"
    ElseIf kerneltype = "Triangular" Or kerneltype = "T" Then
        kernel = "k_T"
    ElseIf kerneltype = "Epanechnikov" Or kerneltype = "E" Then
        kernel = "k_E"
    ElseIf kerneltype = "Quartic" Or kerneltype = "Q" Then
        kernel = "k_Q"
    ElseIf kerneltype = "Cubic" Or kerneltype = "C" Then
        kernel = "k_C"
    ElseIf kerneltype = "Logistic" Or kerneltype = "L" Then
        kernel = "k_L"
    ElseIf kerneltype = "Cosine" Or kerneltype = "Co" Then
        kernel = "k_Co"
    Else
        kernel = "k_G"
    End If
    
    'Define scale
    If IsMissing(scaleP) Then
        For i = 1 To k
            b(i) = 1
        Next
    Else
        If TypeName(scaleP) = "Range" Then
            If scaleP.Rows.Count = 1 And scaleP.Columns.Count = 1 Then
                For i = 1 To k
                    b(i) = scaleP
                Next
            ElseIf scaleP.Rows.Count <> sy.Rows.Count And scaleP.Columns.Count <> sy.Columns.Count Then
                Exit Function
            ElseIf scaleP.Rows.Count > scaleP.Columns.Count Then
                For i = 1 To k
                    b(i) = scaleP(i, 1).Value
                Next
            Else
                For i = 1 To k
                    b(i) = scaleP(1, i).Value
                Next
            End If
        Else
            For i = 1 To k
                b(i) = scaleP
            Next
        End If
    End If
    
    'Populate temporary variables
    For i = 1 To k
        'y values are needed whether or not x coordinates were supplied
        If vert = True Then
            y(i) = sy(i, 1).Value
        Else
            y(i) = sy(1, i).Value
        End If
        If sx Is Nothing Then
            x(i) = i
        Else
            If vert = True Then
                x(i) = sx(i, 1).Value
            Else
                x(i) = sx(1, i).Value
            End If
        End If
    Next
    
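    'Smooth: each output point is the kernel-weighted (Nadaraya-Watson)
    'average of all the y values, using the per-point scale b(j)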
    For i = 1 To k
        tot_ker1 = 0
        tot_ker2 = 0
        For j = 1 To k
            If kernel = "k_U" Then
                temp_ker = k_U(x(i), x(j), b(j))
            ElseIf kernel = "k_T" Then
                temp_ker = k_T(x(i), x(j), b(j))
            ElseIf kernel = "k_E" Then
                temp_ker = k_E(x(i), x(j), b(j))
            ElseIf kernel = "k_Q" Then
                temp_ker = k_Q(x(i), x(j), b(j))
            ElseIf kernel = "k_C" Then
                temp_ker = k_C(x(i), x(j), b(j))
            ElseIf kernel = "k_L" Then
                temp_ker = k_L(x(i), x(j), b(j))
            ElseIf kernel = "k_Co" Then
                temp_ker = k_Co(x(i), x(j), b(j))
            Else
                temp_ker = k_G(x(i), x(j), b(j))
            End If
            
            tot_ker1 = tot_ker1 + temp_ker * y(j)
            tot_ker2 = tot_ker2 + temp_ker
        Next
        If vertOutput = True Then
            output(i, 1) = tot_ker1 / tot_ker2
        Else
            output(1, i) = tot_ker1 / tot_ker2
        End If
    Next
    
    SmoothKernel = output
    
End Function

The kernel functions are:

Private Function k_G(x1 As Double, x2 As Double, b As Double) As Double
    'Gaussian kernel
    k_G = Exp(-(((x1 - x2) ^ 2) / (2 * b ^ 2)))
    
End Function

Private Function k_E(x1 As Double, x2 As Double, b As Double) As Double
    'Epanechnikov kernel
    If Abs((x1 - x2) / b) > 1 Then
        k_E = 0
    Else
        k_E = (3 / 4) * (1 - ((x1 - x2) / b) ^ 2)
    End If

End Function

Private Function k_L(x1 As Double, x2 As Double, b As Double) As Double
    'Logistic kernel: K(u) = 1/(e^u + 2 + e^-u)
    k_L = 1 / (Exp((x1 - x2) / b) + 2 + Exp(-(x1 - x2) / b))
End Function

Private Function k_U(x1 As Double, x2 As Double, b As Double) As Double
    'Uniform kernel
    If Abs((x1 - x2) / b) > 1 Then
        k_U = 0
    Else
        k_U = 1 / 2
    End If

End Function

Private Function k_T(x1 As Double, x2 As Double, b As Double) As Double
    'Triangular kernel
    If Abs((x1 - x2) / b) > 1 Then
        k_T = 0
    Else
        k_T = (1 - Abs((x1 - x2) / b))
    End If

End Function

Private Function k_Q(x1 As Double, x2 As Double, b As Double) As Double
    'Quartic kernel
    If Abs((x1 - x2) / b) > 1 Then
        k_Q = 0
    Else
        k_Q = (15 / 16) * (1 - ((x1 - x2) / b) ^ 2) ^ 2
    End If

End Function

Private Function k_C(x1 As Double, x2 As Double, b As Double) As Double
    'Cubic kernel
    If Abs((x1 - x2) / b) > 1 Then
        k_C = 0
    Else
        k_C = (35 / 32) * (1 - ((x1 - x2) / b) ^ 2) ^ 3
    End If

End Function

Private Function k_Co(x1 As Double, x2 As Double, b As Double) As Double
    'Cosine kernel
    If Abs((x1 - x2) / b) > 1 Then
        k_Co = 0
    Else
        k_Co = (WorksheetFunction.Pi / 4) * Cos((WorksheetFunction.Pi / 2) * ((x1 - x2) / b))
    End If

End Function

Example

Just to give you a flavour of what the macro does, I created a series of data based on the sine function. Specifically, the data is sin(x)+0.3*rand()-0.3

This will also show the shortcomings of the method and one way of surmounting them (although not entirely).



The above picture shows the original data series with the x-axis showing degrees. Let's apply the function to the data by entering:
=SmoothKernel(C2:C362,A2:A362,"G",1)
into cells D2:D362 and then pressing Ctrl+Shift+Enter to enter the formula. This gives:

It's better but not smooth. I've also added a dashed line to indicate a sine function - the original function that we hope to obtain from smoothing the data. So now we can increase the scale to 7:

This makes the resulting data series smooth but notice the ends. Because there are no data points to the left of the first few points (obviously) the estimate is too high. It's correct in the sense that this is the weighted average for the point but it's obviously not a good estimate.

So what can we do? We can change the scale parameter for the points around the ends, increasing the scale where the estimate is off and decreasing it close to those points. In this case, I made the weights 14, 12, 10 and then 3 for 19 cells. This gave:

Better, but it still shows end effects so you need to be careful what scale/weight you apply.
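In formula terms the variable scales might look like this (hypothetical layout, with the per-point scales held in a helper column E):

=SmoothKernel(C2:C362, A2:A362, "G", E2:E362)

where E2:E362 holds the larger scales at the ends and 3 everywhere else.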

One other way of applying the weights here would be to apply weight 1 to all points when calculating the smoothed value of point 1. This would limit the end effect in a different way. I haven't written this into the macro yet but may do later.

As ever comments are welcome.

Sunday, 8 June 2014

UK TV Listings in Excel


This article describes a simple way of getting TV programme listings from the Internet and into Excel using VBA.

For this I'm going to use the XMLTV feed provided by the Radio Times. This feed is usable only for personal and non-commercial use and is copyrighted. It is useful for analysing the TV guide though.

Method

The first thing to do is create a new module to hold the code. Let's call this TVGuide. To hold all the data we need to create a new workbook and worksheet. Then we can set up an external data query from a web page to get the channel data. I used the macro recorder to do this part which is why it's a bit verbose. I'm not sure how much of the code I need but it doesn't do any harm leaving it there.


Sub GetChannels()
'
' Get Channels from Radio Times
'

    Dim wb As Workbook
    Dim wk As Worksheet
    
    Set wb = Application.Workbooks.Add
    Set wk = wb.Worksheets.Add
    wk.Name = "Channels"
    
    With ActiveSheet.QueryTables.Add(Connection:="URL;http://xmltv.radiotimes.com/xmltv/channels.dat", Destination:=Range("$A$1"))
        .Name = "channels"
        .FieldNames = True
        .RowNumbers = False
        .FillAdjacentFormulas = False
        .PreserveFormatting = True
        .RefreshOnFileOpen = False
        .BackgroundQuery = True
        .RefreshStyle = xlInsertDeleteCells
        .SavePassword = False
        .SaveData = True
        .AdjustColumnWidth = True
        .RefreshPeriod = 0
        .WebSelectionType = xlEntirePage
        .WebFormatting = xlWebFormattingNone
        .WebPreFormattedTextToColumns = True
        .WebConsecutiveDelimitersAsOne = True
        .WebSingleBlockTextImport = False
        .WebDisableDateRecognition = False
        .WebDisableRedirections = False
        .Refresh BackgroundQuery:=False
    End With
    
End Sub
     

You'll note that the address of the feed is http://xmltv.radiotimes.com/xmltv/channels.dat. This page contains a list of the channel numbers with their descriptions separated by a pipe character. So the next job is to separate these into two columns.


'Split data by pipe
Columns("A:A").Select
Selection.TextToColumns Destination:=Range("A1"), DataType:=xlDelimited, _
    TextQualifier:=xlDoubleQuote, ConsecutiveDelimiter:=False, Tab:=False, _
    Semicolon:=False, Comma:=False, Space:=False, Other:=True, OtherChar _
    :="|", FieldInfo:=Array(Array(1, 1), Array(2, 1)), TrailingMinusNumbers:=True

'Delete row 2
Rows("2:2").Select
Selection.Delete Shift:=xlUp

'Label columns
Range("A1").FormulaR1C1 = "Channel Number"
Range("B1").FormulaR1C1 = "Channel"

'Change widths of columns
Columns("A:A").EntireColumn.AutoFit
Columns("B:B").EntireColumn.AutoFit

'Sort data
wk.Sort.SortFields.Clear
wk.Sort.SortFields.Add Key:=Range("B:B"), SortOn:=xlSortOnValues, Order:=xlAscending, DataOption:=xlSortNormal
With wk.Sort
    .SetRange Range("A1").CurrentRegion
    .Header = xlYes
    .MatchCase = False
    .Orientation = xlTopToBottom
    .SortMethod = xlPinYin
    .Apply
End With
     

The above code needs to be inserted before the End Sub of the previous code example. I've also labelled the columns, deleted a row, changed the widths of the columns and sorted the data by channel description. This makes it easier to find the channel you're interested in.

So, that's nice. You have a list of channels sorted by their description. It's not very useful though. Fortunately, it's also easy to get the actual TV guide data for each channel. The address is almost exactly the same as before but instead of channels.dat at the end of the address we have {channel number}.dat where {channel number} is the number from the previous list.

Now all we need is some way to use the first list to get the actual time data. The easiest way (well, I thought it was easy) is to create a class module and enable events on the workbook that you've created.

So let's create that new class module and call it ChannelSelector. We want to allow events within the workbook we've created. Therefore we need to create a variable for the workbook in question and use the WithEvents keyword. We also need to attach this newly defined variable to the workbook that was created earlier.


Option Explicit
Private WithEvents App As Workbook

Private Sub Class_Initialize()
    Set App = ActiveWorkbook
End Sub
     

The above five lines are almost all that's needed to do this. We also need to create a variable within the TVGuide module for this class i.e. instantiate an object of this class. So:


Option Explicit
Public ChannelHandler As ChannelSelector
     

defines the variable in the module. We then Set this variable to a new instance of the class defined above. This is done at the end of the GetChannels() subroutine.


'Set events for channel numbers
Set ChannelHandler = New ChannelSelector
     

Now that we've enabled events for this workbook, we need to create a subroutine to handle them. In this case we want to respond when the user double clicks on a channel number, so we use the SheetBeforeDoubleClick event. Below is the code for the event handler.


Private Sub App_SheetBeforeDoubleClick(ByVal Sh As Object, ByVal Target As Range, Cancel As Boolean)
    Dim channel As String
    Dim wk As Worksheet
    
    'Cancel event
    Cancel = True
    
    channel = Target.Text
    
    Set wk = ActiveWorkbook.Worksheets.Add
    wk.Name = channel
    With ActiveSheet.QueryTables.Add(Connection:="URL;http://xmltv.radiotimes.com/xmltv/" & channel & ".dat", Destination:=Range("$A$1"))
        .Name = "chan"
        .FieldNames = True
        .RowNumbers = False
        .FillAdjacentFormulas = False
        .PreserveFormatting = True
        .RefreshOnFileOpen = False
        .BackgroundQuery = True
        .RefreshStyle = xlInsertDeleteCells
        .SavePassword = False
        .SaveData = True
        .AdjustColumnWidth = True
        .RefreshPeriod = 0
        .WebSelectionType = xlEntirePage
        .WebFormatting = xlWebFormattingNone
        .WebPreFormattedTextToColumns = True
        .WebConsecutiveDelimitersAsOne = True
        .WebSingleBlockTextImport = False
        .WebDisableDateRecognition = False
        .WebDisableRedirections = False
        .Refresh BackgroundQuery:=False
    End With
    
    'Split data by tilde
    Columns("A:A").Select
    Selection.TextToColumns Destination:=Range("A1"), DataType:=xlDelimited, _
        TextQualifier:=xlDoubleQuote, ConsecutiveDelimiter:=False, Tab:=False, _
        Semicolon:=False, Comma:=False, Space:=False, Other:=True, OtherChar _
        :="~", FieldInfo:=Array(Array(1, 1), Array(2, 1)), TrailingMinusNumbers:=True
        
End Sub
     

This uses almost exactly the same code as before to extract data from the XMLTV feed. This time however the data needs to be split by the tilde (~) character.

Conclusion

And there you have it: the listings for the next two weeks of all the programmes on the channel that you just double clicked.

Obviously this is just a simple example and there is no error checking whatsoever. If you double click on something other than a relevant channel number then the program will fail with a runtime error.

There's a lot that can be done to expand on this and produce a nice TV Guide. I just wanted to give a simple introduction though.

Thursday, 17 April 2014

Clearing the Excel Text To Columns Pasting Feature using VBA

One issue with using the 'Text to Columns' feature within Excel is that any text that you subsequently paste also gets split into columns - whether you wanted it done or not.

There doesn't seem to be any way to turn this feature off. However, there are three methods that I know of that can stop it happening. These are:

  1. Close Excel - this seems to me to be a bit of a nuclear option. You have to save all your work, close the application and then reopen all the spreadsheets again. Boring.
  2. Use the 'Text to Columns' feature but with different options. This sets the TTC procedure to a different delimiter and so you can now paste without data wandering off into subsequent columns. Annoying.
  3. Use a macro to reset the TTC options. It's described below. Better.
The macro is very simple and just repeats the steps that you'd take in option 2 above.

The Code


1 Dim v As Variant
2
3 v = ActiveCell
4 ActiveCell.Value = "Rem"
5 ActiveCell.TextToColumns Tab:=False, Semicolon:=False, Comma:=False, Space:=False, Other:=False, DataType:=xlDelimited
6 ActiveCell = v
Line 1 declares a variable to hold whatever is in the current active cell. Line 3 then populates this variable with the active cell's value. Line 4 puts a temporary value in the cell. Then in line 5 we run a TTC with no delimiters. This resets the paste options. Line 6 restores the original value.

There are a few problems with writing the macro like this.
  • For a start, you have to select a cell.
  • This cell can have no protection enabled.
  • A worksheet/book has to be open.
  • The active cell can't be part of a pivot table.

Survey Science Add-in

A better way would be to include this procedure as part of an add-in that could be run via a button in a custom ribbon and use a worksheet in the add-in. This way it can be run with a single click and you know what cell it was over-writing.

And this is how I've implemented it in my add-in.
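For what it's worth, here is a minimal sketch of that approach. The worksheet name is hypothetical, and the add-in's actual code may differ:

Sub ResetTextToColumns()
    'Use a scratch cell on the add-in's own sheet so no user data is touched
    Dim c As Range
    Set c = ThisWorkbook.Worksheets("Scratch").Range("A1")

    c.Value = "Rem"
    c.TextToColumns Destination:=c, DataType:=xlDelimited, _
        Tab:=False, Semicolon:=False, Comma:=False, Space:=False, Other:=False
    c.ClearContents
End Sub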

As usual comments are welcome.

Wednesday, 9 April 2014

Market Research and Statistics Excel add-in Part 4

Another update to the Excel add-in. It's now on version 0.7.

I felt that it was time to up the minor version number as I'd added more functionality and fixed a lot more bugs. I've even managed to write some documentation for rim weighting.

The most notable addition is a subroutine to select a stratified sample from a sampling frame.

The full list of changes for version 0.7 and the add-in itself are on:
http://sourceforge.net/projects/surveyscience

Comments on the add-in are always welcome.