Stats, MR and Data

Monday, 29 July 2013

ONS Interactive Content

Just a quick post this time. Came across the interactive data sets that the ONS seem to be pushing quite a lot on their home page.

Some of them are excellent and very interesting. I particularly liked their flow maps. There are two that are quite interesting:

Internal migration - where do people move to within the England and Wales.
UK trade in services - how much is being exported and where to (within the EU).

I did something similar in Java recently whilst working on some Spanish data. What I did only shows distribution of data within the regions of Spain so not quite as flashy. I'll write it up soon, although it may take a while.

The interactive content on the ONS website is split into several sections including the recent census, economy, the labour market, people and places, population and migration and education.

All very exciting.

Wednesday, 3 July 2013

European GDP Statistics

Much is written about GDP in Europe these days. So I thought I'd have a look and see what the actual figures are and what has changed in them over the last few years.

I obtained the figures from Eurostat. The figures in question were the GDP per capita figures and the GDP figures were from the 'nama_gdp_c' data set.

The first chart below shows the average relative GDP for separate European countries from 2000 to 2014 (forecasted) indexed on Germany.

I then looked at the growth in GDP for each of the top 5 countries by indexing the GDP for each year on the GDP for 2000.

As you can see, the UK has shown the largest relative drop, although it seems to be growing fairly well since 2009. It's still not back at the levels it was at in 2007 though. The other 4 countries show a small decline in 2009.

The third chart shows the GDP per capita in PPS. To quote Eurostat:

Gross domestic product (GDP) is a measure for the economic activity. It is defined as the value of all goods and services produced less the value of any goods or services used in their creation. The volume index of GDP per capita in Purchasing Power Standards (PPS) is expressed in relation to the European Union (EU27) average set to equal 100. If the index of a country is higher than 100, this country's level of GDP per head is higher than the EU average and vice versa. Basic figures are expressed in PPS, i.e. a common currency that eliminates the differences in price levels between countries allowing meaningful volume comparisons of GDP between countries. Please note that the index, calculated from PPS figures and expressed with respect to EU27 = 100, is intended for cross-country comparisons rather than for temporal comparisons.

Therefore it shows the GDP per capita in PPS relative to the EU as a whole.

On this measure Italy has dropped the most, closely followed by the UK. Germany has actually grown. But remember these figures are relative to the EU. Therefore even if Germany looks like it's growing, on this measure, it's only against the rest of Europe.

Anyway, of the 5 big EU countries, the UK seems to be doing quite badly.

Saturday, 29 June 2013

Deaths statistics in England and Wales

I found some interesting data on the ONS website recently. It related to the number of registered deaths by year by age of the person in England and Wales. A bit morbid maybe but interesting none the less.

Above is an animated gif of the number of deaths by year of age as a proportion of the total number of deaths in that year. The gif was created in GIMP, following their tutorial, although I did find it a lot easier to just paste the charts as new layers. I've decided that I don't really like animated gifs. You can't stop them, and when you want to study a particular point in detail it's gone before you can.

Anyway, there are 3 interesting points to this:

Average age - The movement of the peak to the right.
Infant mortality - The drop in the percentage of the number of deaths of people aged zero.
The war - The dip in the percentage of the number of deaths of people aged 80 around the year 2000.

An Excel macro to concatenate values in ranges of cells

Introduction

The concatenate function in Excel seems very limited. Maybe I'm not using it correctly or there's a clever way of concatenating ranges of cells that I don't know but having to click on every cell that you want to concatenate just seems a waste of time.

I therefore thought that a better solution would be to add a function that automatically concatenates ranges of cells. Better yet, you can specify a separator for the cells if, for example, you wanted to output a range of semi-colon delimited data rather than just comma or tab delimited.

Macro

Anyway here's the macro:

-----------------------------------------------------------------------------

Function ConcatenateRange(d As Range, Optional separator As String) As String

Dim s As String

Dim i, j As Integer

s = ""

For i = 1 To d.Columns.Count

For j = 1 To d.Rows.Count

s = s + d.Cells(j, i).Text + separator

s = Left(s, Len(s) - Len(separator))

ConcatenateRange = s

End Function

-----------------------------------------------------------------------------

Example

Say you had a list of data spread over six cells from A1 to F1:

The function will join the cells, separated by the semi-colon.

Monday, 24 June 2013

WEFF - determining the loss of precision after demographically weighting a sample

Introduction

To improve the estimate of a measurement from a sample, the sample is usually demographically weighted to the population being measured. One example would be using a sample to measure the sales of cars in a country. You may wish the sample to represent the population of the country in question but what variables do you want to control for/to? What effect does increasing the number of variables you weight by have on the data? Sometimes you may be better off not weighting to a variable if it has little to no effect on the accuracy of the estimate but a large effect on the precision.

The WEFF

One way of measuring the effect of the weights and the weighting structure on the precision of the estimate is to use what is called the WEFF (sometimes just called F - see the Journal of Official Statistics, volume 19, No. 2, 2003 pp 81-97 for more details).

The WEFF is defined as:

Which is just one plus the population standard deviation of the weights divided by the mean of the weights. An alternative formulation is:

where n is the number of weights and x is the weight variable.

The loss in precision is determined by dividing the sample size by the WEFF. Therefore a small value is desirable.

Rim Weighting Excel Macro

Update - 09/12/2015

I've updated the macro and put it into one Excel add-in. It's now available here.

Introduction

In a previous post I wrote about two methods to demographically weight a market research sample to make the sample represent a population. For example, if you had a sample of 1000 people to measure the average height of a certain country, you might want to weight the sample by age and gender so that the sample represented the population in those two variables (dimensions, fields) exactly.

In this post I'm going to detail a macro that can be used to rim weight a sample.

The algorithm for rim weighting is very simple. Graham Kalton and Ismael Flores-Cervantes give a very good explanation of rim weighting (and other demographic weighting methods) in an article for the Journal of Official Statistics. This website is well worth a look for other articles on running panels, weighting schemes and, it seems, everything else to do with statistics. Anyway, I give a short explanation below.

To give a concrete example and help with the explanation I'll use the example given in the previous blog entry.

	Gender
Age	Male	Female	Total
Young	30%	20%	50%
Old	25%	25%	50%
Total	55%	45%	100%

The above table shows the population proportions for age and gender. Let's assume the sample proportions are as below:

	Gender
Age	Male	Female	Total
Young	20%	35%	55%
Old	20%	25%	45%
Total	40%	60%	100%

The totals are called the marginal distributions or the rims.

Rim weighting is an iterative process. The first dimension's (in this case age) totals are used to calculate a weight such that the sample totals (40% and 60%) equal the population totals (55% and 45%). This weight is then adjusted so that the next dimension's (gender) sample totals equal the population totals for that dimension. This is one iteration.

The process is repeated until either all the sample totals equal the population totals or the number of iterations exceeds a predefined maximum.

Using an online panel to show Windows OS share

Deciphering http user agent data from an online panel

The http user agent string is the data that is sent by the browser or software to identify that software to the website.

Looking at the http user agent data generated by your panel members can be very instructive. You can find out what operating they are using, what browser and the browser version. Some user agents will also tell you what device the software is being run on. The downside to this is that user agent data can be spoofed. See here for a fuller explanation of spoofing.

One benefit online panels have over the usual statistics from providers, such as Statcounter, is that you know exactly who has logged on and will not over or under count them in the statistics. As ever, Wikipedia has a list of reasons for over and under counting. There's no point repeating them here.

So, how have I decoded the strings? Badly, it turns out. There are better ways to do what I've done and there are websites around that have decoders available. See wurfl and MobileESP for examples.

The user agent is made up of around 5 parts. Websites such as www.useragentstring.com show exactly how these are made up.

I was initially only interested in 4 items from the string and these are relatively easy to extract. Therefore I did it myself. Some of the aforementioned websites are subject to restrictive licences for commercial use and so I couldn't use them for work. The 4 items are:

Operating system
OS version
Browser
Browser version

Part 1 is a simple matter of looking for the relevant string, although you do need to be careful with the order in which you search as Android phones will have Linux in the string as well (assuming that you're interested in splitting out Android) and iPads etc. have OS X as part of their string.

So I searched for:
a) Windows
b) Android
c) Intel Mac OS X
d) Linux
e) Mac OS X - iPad and iPhones

This won't find Blackberry or Symbian systems, although it's trivial to add these.
Part 2 is more complicated but the OS version is generally around the OS string. Therefore it's just a case of creating a regular expression to extract the relevant information. For all systems we can use the pattern:

((\d+)(\.\d+)*)

The pattern is searching for any number of digits (one or more) and then any number of a dot coupled with any number of digits (one or more) zero or more times. Hopefully that makes sense.

You will need to replace underscores with dots in the user agent string and Windows strings pre-NT need to be catered for differently (Win98 is one example). The first group from the match is captured.

Part 3 is once again simple. We look for the strings:
a) MSIE
b) Firefox
c) Chrome
d) Safari
e) Opera

Chrome strings always include Safari so it needs to be searched for first. Again Blackberry and feature phones won't be found if only searching with these strings.

Part 4 is not so simple and requires a bit of faff to find the version. The above pattern can be reused with the addition of \/ at the beginning to extract the version number. The version number is usually with the browser string and so should be relatively easy to find but inconsistencies within the user agents often cause problems.

Results

Here's an example of the data that can be extracted:

This chart shows the smoothed, raw (so not weighted to be representative) percentage share of the various Windows operating systems in Great Britain from early 2007 to early 2013. I've also put in the release dates of the operating systems (according to Wikipedia).

More work would need to be done on this (weighting, test the smoothing) but it does show fairly well the tail off in the rise in Windows 7 since Windows 8 was released. Also although it's early days the take up of the Windows 8 is only just behind that of the previous 2 systems.

The gradient of the Windows 7 share line is clearly steeper than either Vista of Windows 8 but it's not by much. The next few months will be interesting.

I also find it interesting that Vista never had more share than XP.

Because the data is from a panel we could look at switching and demographics. Another time maybe.

Stats, MR and Data

Monday, 29 July 2013

ONS Interactive Content

Wednesday, 3 July 2013

European GDP Statistics

Saturday, 29 June 2013

Deaths statistics in England and Wales

Wednesday, 26 June 2013

An Excel macro to concatenate values in ranges of cells

Introduction

Macro

Example

Monday, 24 June 2013

WEFF - determining the loss of precision after demographically weighting a sample

Introduction

The WEFF

Saturday, 22 June 2013

Rim Weighting Excel Macro

Update - 09/12/2015

Introduction

Monday, 3 June 2013

Using an online panel to show Windows OS share

Deciphering http user agent data from an online panel

Results