Wednesday 9 December 2015

WordPress blog

I've now added a WordPress blog to my website. It doesn't have many entries on it at the moment, but I'll be updating that one from now on rather than this one, I think.

Wednesday 21 October 2015

American Football Statistics part 2

In the last post, I went through an interesting (for me, anyway) observation from a book about the proportion of wins for American football teams. I only gave a chart for one example, the Buffalo Bills.

Well, I've now put the data together for the rest of the teams and placed it on my site. You can now look at the stats for the rest of the teams:


Or at least those teams that have data from 1960 to 2014. I still find it hard to see much periodicity in quite a few of the teams, and some teams have quite lengthy periods of high win percentages (the San Francisco 49ers, for example):


It's only to be expected that there would be some exceptions. What I should do if I wanted to 'prove' it one way or the other is to calculate the periodicity for each team and then see what correlation we get. Next time maybe.

Data courtesy of www.pro-football-reference.com
Charts courtesy of D3 (www.d3js.org)

Monday 19 October 2015

American Football Statistics

Recently I've been reading a book about how mathematical our everyday lives are (Towing Icebergs, Falling Dominoes, and Other Adventures in Applied Mathematics by Robert B. Banks). In this book the author has an interesting chapter dedicated to the statistics of American football.

In this chapter he uses data on the performances of NFL teams from 1960 to 1992 to suggest that a first-order linear delay differential equation (with a discrete delay) can be used to model a team's winning record for each season, and that the performance is periodic, with a specific time between each peak in the percentage of wins for a season.

The rationale is that (basically) the performance of the team from the previous season dictates the order (in reverse) of selection of new talent for the upcoming season.

The equation he derives is:
dU(t)/dt = a[U_m - U(t - τ)]

where U(t) is the proportion of wins in each season, U_m is the league-wide average value of U, a is a growth coefficient and τ is the time delay.

Irritatingly, he says that there are many ways to solve the equation but then uses an approximate, 'risky' method to produce a solution (a Taylor series expansion, in case you're wondering). He does, however, provide references so that you can follow up on the details.
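For what it's worth, the equation is easy to explore numerically. Below is a minimal JavaScript sketch that integrates it with a forward Euler step and a constant history before t = τ; the parameter values are purely illustrative and not taken from the book.

    // A minimal numerical sketch (not the book's method): integrate
    // dU/dt = a[U_m - U(t - tau)] with forward Euler, holding U at its
    // initial value before t = tau. All parameter values are illustrative.
    function simulateWins(a, Um, tau, U0, dt, tEnd) {
        var lagSteps = Math.round(tau / dt);    // delay expressed in time steps
        var U = [U0];                           // U[k] approximates U(k * dt)
        for (var k = 0; k * dt < tEnd; k++) {
            var delayed = (k - lagSteps >= 0) ? U[k - lagSteps] : U0;
            U.push(U[k] + dt * a * (Um - delayed));
        }
        return U;
    }

    // e.g. league average 0.5, a delay of a few seasons, starting at 0.2
    var wins = simulateWins(0.3, 0.5, 4, 0.2, 0.1, 60);
    console.log(wins[wins.length - 1]);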

The example he goes through in most detail is that of the Buffalo Bills. The graph of their performance for the years in question is below (thanks to this site for the data). It does indeed seem to be periodic between the years he mentioned.



This seemed odd to me. Obviously you wouldn't be able to determine the exact rank of each team in each year but you would know which teams were likely to be going down or up the rankings. You'd be able to tell, for example, that if your team did well one season, then they were likely to do less well the next.

However, when you look at the data for subsequent years, the pattern is harder to make out. It seems that the Bills were on a consistent downward trend after this, although given the volatility of the data it's possible that a lot of different frequencies would fit this chart.

Has something changed in the way selection is now carried out? I don't know. Is the pattern that he spotted (or at least went through) the same for other teams throughout this time period? Does it also change after 1992?

I'll follow it up by looking at all the teams that were around between 1960 and 1992 next time.

Monday 12 October 2015

Analysis of the numbers of views of my blog part 2

Last time I looked at the number of views per month I was getting on my blog. This time I will go into more detail on the numbers behind this.

First up is the number of views per day for each entry, plotted over time. This should give me an indication of which entries are more popular. I've divided the number of views by the number of days since the entry was written so that individual entries can be compared on an equal basis.
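In code terms, the normalisation is simply the view count divided by the age of the entry; a quick sketch (with a made-up date and count) is below:

    // Normalise an entry's total views by its age in days so that old and
    // new posts can be compared. The date and view count are invented.
    function viewsPerDay(totalViews, publishedDate, asOfDate) {
        var msPerDay = 24 * 60 * 60 * 1000;
        var ageInDays = (asOfDate - publishedDate) / msPerDay;
        return totalViews / ageInDays;
    }

    console.log(viewsPerDay(350, new Date("2014-01-06"), new Date("2015-10-12")));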




The picture is possibly a little hard to see when plotted like this, but it does show that there are some entries that have a much higher 'popularity rating' than others. To put this in a better perspective I've plotted a frequency distribution of them:



There are roughly four entries/posts that have more than 1 view per day. These are my 'popular' entries (and yes, this is popular in a very relative sense).

Let's look at the titles for the top 20 posts. Is there something in common? If I want to make this blog more popular is there something I should focus on? Something other than statistics probably. Anyway here is the chart:



Well, 2 of the popular posts are Java-related and 2 are Excel-related. I only have 4 Java-related posts, so this is fairly good for Java. However, I have over 20 Excel-related posts - not so good for Excel, especially as one of those is to do with rim weighting.

My next chart shows the number of times that a post label appears. For those that don't know, the labels are just descriptive words or phrases that describe the post.



You can see that I've been concentrating on Excel quite a bit over the years.

The last chart shows the average views per day for these labels ranked by the average.



This just reiterates the points above - Java is popular. Having said that, Java also has a very high standard deviation: only one Java post has ever scored relatively highly, and it was one of the first that I did, about JavaFX rather than plain old Java.

As for whether I can make the blog more popular by picking what to write about based on these figures, I'm not sure. I think there is far too much noise in the data to pick out a common theme. I'm sure I could do more analysis on this data, but that's one for the future, I think.

Wednesday 7 October 2015

Analysis of the numbers from my blog

I've decided to write shorter entries for my blog. My entries end up being very long and covering a lot of ground, so I'm splitting them up to cover only one or a few aspects at a time.

So for my first shorty, I'm going to look at the number of views this blog gets. Initially it will just be a look at the number of views per month. I'll then go into the views per post, what readers seem to find most interesting, and so on.

So, without further ado, here are the numbers of views per month plotted with the number of posts per month:


The left hand y-axis shows the number of views per month. The right hand y-axis shows the number of blog posts per month.

So, I've been getting on average about 800 views per month since the beginning of 2014. I've no idea if this is good or bad. Probably really bad, as I don't connect with other people or mention the blog to anyone.

You can also see that there has been a major slowdown in blog entries recently. This corresponds to a drop in the number of views.

The next chart shows the difference in views from one month to the next plotted against the number of blog entries per month.



Simply calculating the correlation between the two series in the chart above gives a value of 0.2. This suggests there is very little relationship between the number of blog posts and the monthly increase in the number of views. It's all very random.
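For anyone who wants to reproduce the figure, it's just the standard Pearson correlation; a quick JavaScript sketch with made-up numbers is below:

    // Sketch of the calculation: the Pearson correlation between two
    // equal-length series. The numbers below are made up for illustration.
    function pearson(xs, ys) {
        var n = xs.length;
        var mx = 0, my = 0;
        for (var i = 0; i < n; i++) { mx += xs[i]; my += ys[i]; }
        mx /= n;
        my /= n;
        var sxy = 0, sxx = 0, syy = 0;
        for (var j = 0; j < n; j++) {
            var dx = xs[j] - mx, dy = ys[j] - my;
            sxy += dx * dy;
            sxx += dx * dx;
            syy += dy * dy;
        }
        return sxy / Math.sqrt(sxx * syy);
    }

    var viewChanges = [120, -40, 65, 10, -85, 30];   // monthly change in views
    var postsPerMonth = [3, 1, 2, 0, 1, 2];          // posts in each month
    console.log(pearson(viewChanges, postsPerMonth));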

I will leave you with one final chart:

This shows the cumulative number of views and the cumulative number of posts. As you can see, the number of views rises roughly linearly with time. So much so, in fact, that fitting a linear trendline through the data from January 2014 onwards gives a gradient of ~869 views per month.
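The trendline gradient is an ordinary least squares fit, roughly as sketched below (again, the data points here are invented):

    // Sketch of the trendline fit: the ordinary least squares gradient of
    // cumulative views against month number. The data points are invented.
    function olsGradient(xs, ys) {
        var n = xs.length;
        var mx = 0, my = 0;
        for (var i = 0; i < n; i++) { mx += xs[i]; my += ys[i]; }
        mx /= n;
        my /= n;
        var num = 0, den = 0;
        for (var j = 0; j < n; j++) {
            num += (xs[j] - mx) * (ys[j] - my);
            den += (xs[j] - mx) * (xs[j] - mx);
        }
        return num / den;    // here: views per month
    }

    var month = [0, 1, 2, 3, 4, 5];
    var cumulativeViews = [800, 1650, 2500, 3400, 4150, 5100];
    console.log(olsGradient(month, cumulativeViews));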

From this it would seem that nothing I have, or have not done, has affected the upward trend.

Next time, I'll look at the data from the individual blog entries.

Monday 3 August 2015

London Land Use 2005

I've put another article on my website about London land use statistics.

I've used R, R Markdown, ggplot2, d3heatmap and maptools to plot the data. This does unfortunately mean that it can't easily be posted here and that it's a fairly sizeable page (about 1.4 MB when created in RStudio).

Here are some pictures from the page:



Anyway, see what you think.

Wednesday 3 June 2015

Rim weighting web tool

Another rim weighting tool. This time I've stuck it on my site and programmed it in JavaScript.

It uses the same conventions as the Excel tool and you will need:

  1. A tab delimited file holding the demographics. It will need to have a header labelling each of the columns of data.
  2. A tab delimited file holding the targets. This will need to have a header and 3 columns with the rim name, the cell name and the target. The rim name will need to correspond to one of the headers in the demographic file.
Drag these files to the relevant boxes on the page, set the parameters and then click on the rim weight button.

The weights will then be placed in the weights tab below the button.

The tool is still at a beta stage, so improvements will be made, but it does work. I do need to test its speed though.

As it's JavaScript, it all happens in the browser. This means that nothing has to be uploaded to a server, but it also means that nothing is saved.
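To give a flavour of the browser-only approach, below is a stripped-down sketch of reading a dropped tab-delimited file with the FileReader API. The element id and variable names are made up for illustration; it isn't the tool's actual code:

    // Illustrative sketch: read a tab-delimited file dropped onto a box,
    // entirely in the browser, and split it into a header plus data rows.
    var dropBox = document.getElementById("demo_drop_box");   // hypothetical id

    dropBox.addEventListener("dragover", function (e) {
        e.preventDefault();                      // allow the drop
    });

    dropBox.addEventListener("drop", function (e) {
        e.preventDefault();
        var file = e.dataTransfer.files[0];      // the dragged file
        var reader = new FileReader();
        reader.onload = function (evt) {
            var lines = evt.target.result.split(/\r?\n/).filter(function (l) {
                return l.length > 0;
            });
            var header = lines[0].split("\t");   // column names
            var rows = lines.slice(1).map(function (l) { return l.split("\t"); });
            console.log(header, rows.length + " rows read");
        };
        reader.readAsText(file);                 // nothing leaves the browser
    });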



Comments always welcome.

Tuesday 5 May 2015

Sample size estimator

I've decided to put some of the tools from my Excel add-in on my website. The first one I've done is the sample size estimator for surveys found here.
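For anyone curious about what sits behind a calculator like this, the textbook formula for the sample size needed to estimate a proportion (with a finite population correction) is sketched below in JavaScript. It's the standard calculation rather than a copy of the tool's code:

    // Textbook sample size formula for estimating a proportion (not the
    // tool's code). z is the z-score for the confidence level, p the
    // expected proportion, e the margin of error, N the population size.
    function sampleSize(z, p, e, N) {
        var n0 = (z * z * p * (1 - p)) / (e * e);   // infinite-population size
        if (N) {
            n0 = n0 / (1 + (n0 - 1) / N);           // finite population correction
        }
        return Math.ceil(n0);
    }

    // 95% confidence (z ≈ 1.96), p = 0.5, ±5% margin, population of 10,000.
    console.log(sampleSize(1.96, 0.5, 0.05, 10000));   // ≈ 370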

Give it a go and see what you think. As ever, comments are welcome.

Friday 17 April 2015

D3 Tooltips for a line chart

I wrote a blog entry on my attempts to write a line chart in D3 here. One thing it didn't have was tooltips when you hovered over the chart to tell you what the data was at that point.

I've added these for an article I recently wrote for my website. Their behaviour is limited but it's a start.

Composition

The tips themselves are constructed from one SVG 'rect' element and two SVG 'text' elements. They are positioned via the mouseover event.

Code

The first thing to do is to create the SVG element to hang all the child elements off:
    var chk = d3.select("#cht1")
        .append("svg")
        .attr("class", "chk")
        .attr("width", 960)
        .attr("height", 600);

Then we need to create the tooltip elements:
    chk.append("rect")
        .attr("width", 70)
        .attr("height", 50)
        .attr("x", "-2000")
        .attr("y", "-2000")
        .attr("rx", "2")
        .attr("ry", "2")
        .attr("class", "tooltip_box")
        .attr("id", "tooltip1")
        .attr("opacity", "0.0");

    chk.append("text")
        .attr("class", "bbd_tooltip_text")
        .attr("id", "bbd_tt_txt1")
        .attr("x", "-2000")
        .attr("y", "-2000")
        .attr("dy", ".35em")
        .attr("dx", ".35em")
        .text(" ");

    chk.append("text")
        .attr("class", "bbd_tooltip_text")
        .attr("id", "bbd_tt_txt2")
        .attr("x", "-2000")
        .attr("y", "-2000")
        .attr("dy", ".35em")
        .attr("dx", ".35em")
        .text(" ");

I've given the elements a starting position of (-2000,-2000). It's not strictly necessary as I could have just made their opacity attribute equal to zero. I've also given the elements ids and classes.

Now we need to make them move with the mouse. I've added <rect> elements over the data points and it is to these that we add the event function:
    .on("mouseover", function(d, i) {
//Select mouse position
var ym = d3.mouse(this)[1];
d3.select("#tooltip1")
.attr("x", x_scale(resp_data[i].x)+10)
.attr("y", ym)
.attr("opacity", "0.5");
d3.select("#bbd_tt_txt1")
.attr("x", x_scale(resp_data[i].x)+10)
.attr("y", ym+12)
.text("x="+resp_data[i].x);
d3.select("#bbd_tt_txt2")
.attr("x", x_scale(resp_data[i].x)+10)
.attr("y", ym+32)
.text("y="+resp_data[i].y);
   })

In the code above resp_data is an array holding all the data, x_scale is a D3 scale object and ym holds the y position of the mouse.

For the three elements of the tooltip I've changed the opacity, the position and the text.

We also need to clear up the tooltip when we exit the <rect> element:
    .on("mouseout", function() {
d3.select("#tooltip1")
.attr("x", "-2000")
.attr("y", "-2000")
.attr("opacity", "0.0");
d3.select("#bbd_tt_txt1")
.attr("x", "100")
.attr("y", "100")
.text(" ");
d3.select("#bbd_tt_txt2")
.attr("x", "-2000")
.attr("y", "-2000")
.text(" ");
    })

and finally we need to change the tooltip when the mouse moves:
    .on("mousemove", function(d, i) {
var ym = d3.mouse(this)[1];
d3.select("#tooltip1")
.attr("x", x_scale(resp_data[i].x)+10)
.attr("y", ym)
.attr("opacity", "0.5");
d3.select("#bbd_tt_txt1")
.attr("x", x_scale(resp_data[i].x)+10)
.attr("y", ym+12)
.text("x="+resp_data[i].x);
d3.select("#bbd_tt_txt2")
.attr("x", x_scale(resp_data[i].x)+10)
.attr("y", ym+32)
.text("y="+resp_data[i].y);
    })

And that's it, apart from CSS to style the elements. I'll leave that up to you though.

Thursday 16 April 2015

Ofcom Broadband Report 2013 Analysis

I still can't get D3 to work with Blogger, so I've written this as an article on my website.

It basically just looks at the data contained within the Ofcom 2013 broadband report with charts and maps created using D3.

I'll create some posts to explain how I did it later. Hope you enjoy!

Monday 9 March 2015

Petrol Price Exploration

Introduction

With the petrol price coming down so rapidly recently I decided to have a look and see what correlation there is between the crude oil price and the price of a litre of petrol sold at a garage.

I'm sure this sort of analysis has come up quite a lot recently. It's certainly nothing new but I thought it would be interesting nonetheless.

The first thing to do is to get hold of the necessary data:
  1. Petrol prices - ONS
  2. Brent crude oil prices - US Energy Information Administration
  3. Dollar to Pound conversion rates - Bank of England
The petrol prices are an average of pump prices collected from four oil companies and two supermarkets on the Monday of each week. The data goes back to 2003.

The Brent crude spot prices are a daily series although there are gaps in the series for holidays etc. The data goes back to 1987.

The dollar to pound conversion rate is again a daily series and again there are gaps on the holidays. The gaps are similar to the gaps found in the Brent crude prices but are not always the same - different country, different holidays.



It's interesting to note that the drop in price in 2008 was much larger than the recent drop. The two series show similar trends but they're not currently directly comparable. Let's see if we can improve on that.


Manipulation

First we need to convert the Brent crude prices to pounds per barrel. For this we use the Bank of England conversion rate data.

Next we take only the Brent crude prices from the Monday of each week so that we are comparing prices on the same day of each week.

For both of these data series I had to interpolate some data points to fill holes in the series. This was done using the SplineInterpolate macro from the SurveyScience Excel add-in. There's no particular reason to use this over a straight-line interpolation; I just thought I would.

The fourth adjustment to the data was to the pump prices of petrol. I removed the taxes. There are two taxes added - a duty and VAT. I've assumed here that duty is added first and then VAT.
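For reference, the tax removal works out roughly as in the sketch below. The duty and VAT figures are illustrative round numbers rather than the exact series I used:

    // Sketch of the tax removal. Duty is assumed to be added first, then VAT
    // on top of (base + duty), so VAT is stripped first and then the duty.
    // The rates below are illustrative round figures, not the exact series.
    function exTaxPrice(pumpPencePerLitre, dutyPencePerLitre, vatRate) {
        var exVat = pumpPencePerLitre / (1 + vatRate);   // strip VAT first
        return exVat - dutyPencePerLitre;                // then remove the duty
    }

    // e.g. a pump price of 110 p/litre with ~58p duty and 20% VAT
    console.log(exTaxPrice(110, 58, 0.20));   // ≈ 33.7 p/litre before tax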

The result is below:


As you can see, the correlation is remarkably good. Some of this apparent correlation is due to the way Excel has chosen the scales but it still looks good.

I did also try a smoothing algorithm on the data but I think it hides too much of the detail for the analysis.


Correlation

So, taking the data from the beginning of 2011, there appears to be a fairly stable set of prices. Let's see how much time lag there is between the crude oil price and pump prices.

To do this, I've calculated the correlation between the two original series, and then recalculated it with the pump prices shifted back by varying numbers of weeks.
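In code terms, the lag search boils down to correlating the crude series against a shifted copy of the pump series for a range of lags and picking the lag with the highest correlation. A rough JavaScript sketch (with the actual price series omitted) is below:

    // Rough sketch of the lag search (the actual price series are omitted):
    // correlate crude prices against pump prices shifted by k weeks, for a
    // range of lags, and look for the k with the highest correlation.
    function pearson(xs, ys) {
        var n = Math.min(xs.length, ys.length), mx = 0, my = 0;
        for (var i = 0; i < n; i++) { mx += xs[i]; my += ys[i]; }
        mx /= n;
        my /= n;
        var sxy = 0, sxx = 0, syy = 0;
        for (var j = 0; j < n; j++) {
            sxy += (xs[j] - mx) * (ys[j] - my);
            sxx += (xs[j] - mx) * (xs[j] - mx);
            syy += (ys[j] - my) * (ys[j] - my);
        }
        return sxy / Math.sqrt(sxx * syy);
    }

    function laggedCorrelations(crude, pump, maxLag) {
        var results = [];
        for (var k = -maxLag; k <= maxLag; k++) {
            // compare crude[t] with pump[t + k], i.e. pump lagging by k weeks
            var a = k >= 0 ? crude.slice(0, crude.length - k) : crude.slice(-k);
            var b = k >= 0 ? pump.slice(k) : pump.slice(0, pump.length + k);
            results.push({ lag: k, r: pearson(a, b) });
        }
        return results;
    }

    // e.g. with two equal-length arrays of weekly prices:
    // var table = laggedCorrelations(crudeGBP, pumpExTax, 14);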

We get the following two charts:


The first chart shows the correlation for shifts of between -14 and +14 weeks. The second chart shows the peak in more detail, with interpolated (spline) points.

This shows that pump prices follow crude oil prices most closely 2 weeks and 5 days later i.e. there is a lag of 2 weeks and 5 days.

Accounting for this lag, let's plot the two series against each other:

The plot looks indicative of a strong relationship between the two prices. However, note that there are very few points towards the bottom left - these come from the recent drop in prices. Taking these out, we get:


Still a strong relationship, but the R-squared value has dropped from 0.9 to 0.64. So you can predict the pump price given the crude oil price, but you will often be off by quite a bit. Much of this is due to the volatility of the crude oil price compared to the pump price, as the plot below shows. It is a plot of the two series accounting for the lag:


Further Analysis

One of the questions often raised is whether the price at the pump rises quickly on a rise in crude oil but drops slowly when crude oil prices decline.

There are two related ways to look at this:
  • Are the peaks in the two series closer together than the troughs?
  • Are the positive gradients of the pump prices steeper than the negative gradients?
Unfortunately I'm going to have to leave that for another time. Until then.

Wednesday 4 February 2015

Extracting data from text files using Java

I use Java a fair amount to look at data contained within text files. It can be data from surveys, logs of various processes and sometimes extracts from databases.

This post links to four articles on how to read data line by line from text files. The four articles are all very similar but detail different methods of accessing the data. They all consist of five steps:
  • Defining the file
  • Opening the file
  • Reading in the data
  • Doing something with the data
  • Outputting the result
The articles are:
  1. Extract data from a simple text file
  2. Extract data from gzipped text files - useful if the individual files are large, as the biggest bottleneck is usually disc access time, especially on HDDs.
  3. Extract data from zipped text files (zip archives)
  4. Extract data from a directory of text files - it's often the case that you want to analyse data from a whole raft of files.
Hopefully they will be of use.