Saturday 29 June 2013

Deaths statistics in England and Wales

I found some interesting data on the ONS website recently. It related to the number of registered deaths by year by age of the person in England and Wales. A bit morbid maybe but interesting none the less.



Above is an animated gif of the number of deaths by year of age as a proportion of the total number of deaths in that year. The gif was created in GIMP, following their tutorial, although I did find it a lot easier to just paste the charts as new layers. I've decided that I don't really like animated gifs. You can't stop them, and when you want to study a particular point in detail it's gone before you can.

Anyway, there are 3 interesting points to this:

  1. Average age - The movement of the peak to the right.
  2. Infant mortality - The drop in the percentage of the number of deaths of people aged zero.
  3. The war - The dip in the percentage of the number of deaths of people aged 80 around the year 2000.


Wednesday 26 June 2013

An Excel macro to concatenate values in ranges of cells

Introduction

The concatenate function in Excel seems very limited. Maybe I'm not using it correctly or there's a clever way of concatenating ranges of cells that I don't know but having to click on every cell that you want to concatenate just seems a waste of time.

I therefore thought that a better solution would be to add a function that automatically concatenates ranges of cells. Better yet, you can specify a separator for the cells if, for example, you wanted to output a range of semi-colon delimited data rather than just comma or tab delimited.

Macro

Anyway here's the macro:

-----------------------------------------------------------------------------
Function ConcatenateRange(d As Range, Optional separator As String) As String

    Dim s As String
    Dim i, j As Integer
    
    s = ""

    For i = 1 To d.Columns.Count
        For j = 1 To d.Rows.Count
            s = s + d.Cells(j, i).Text + separator
        Next
    Next
    
    s = Left(s, Len(s) - Len(separator))
    
    ConcatenateRange = s

End Function
-----------------------------------------------------------------------------

Example

Say you had a list of data spread over six cells from A1 to F1:


The function will join the cells, separated by the semi-colon.

Monday 24 June 2013

WEFF - determining the loss of precision after demographically weighting a sample

Introduction

To improve the estimate of a measurement from a sample, the sample is usually demographically weighted to the population being measured. One example would be using a sample to measure the sales of cars in a country. You may wish the sample to represent the population of the country in question but what variables do you want to control for/to? What effect does increasing the number of variables you weight by have on the data? Sometimes you may be better off not weighting to a variable if it has little to no effect on the accuracy of the estimate but a large effect on the precision.

The WEFF

One way of measuring the effect of the weights and the weighting structure on the precision of the estimate is to use what is called the WEFF (sometimes just called F - see the Journal of Official Statistics, volume 19, No. 2, 2003 pp 81-97 for more details).

The WEFF is defined as:
Which is just one plus the population standard deviation of the weights divided by the mean of the weights. An alternative formulation is:

where n is the number of weights and x is the weight variable.

The loss in precision is determined by dividing the sample size by the WEFF. Therefore a small value is desirable.

Saturday 22 June 2013

Rim Weighting Excel Macro

Update - 09/12/2015

I've updated the macro and put it into one Excel add-in. It's now available here.

Introduction

In a previous post I wrote about two methods to demographically weight a market research sample to make the sample represent a population. For example, if you had a sample of 1000 people to measure the average height of a certain country, you might want to weight the sample by age and gender so that the sample represented the population in those two variables (dimensions, fields) exactly.

In this post I'm going to detail a macro that can be used to rim weight a sample.

The algorithm for rim weighting is very simple. Graham Kalton and Ismael Flores-Cervantes give a very good explanation of rim weighting (and other demographic weighting methods) in an article for the Journal of Official Statistics. This website is well worth a look for other articles on running panels, weighting schemes and, it seems, everything else to do with statistics. Anyway, I give a short explanation below.

To give a concrete example and help with the explanation I'll use the example given in the previous blog entry.
Gender
Age Male Female Total
Young 30% 20% 50%
Old 25% 25% 50%
Total 55% 45% 100%

The above table shows the population proportions for age and gender. Let's assume the sample proportions are as below:
Gender
Age Male Female Total
Young 20% 35% 55%
Old 20% 25% 45%
Total 40% 60% 100%


The totals are called the marginal distributions or the rims.

Rim weighting is an iterative process. The first dimension's (in this case age) totals are used to calculate a weight such that the sample totals (40% and 60%) equal the population totals (55% and 45%). This weight is then adjusted so that the next dimension's (gender) sample totals equal the population totals for that dimension. This is one iteration.

The process is repeated until either all the sample totals equal the population totals or the number of iterations exceeds a predefined maximum.

Monday 3 June 2013

Using an online panel to show Windows OS share

Deciphering http user agent data from an online panel

The http user agent string is the data that is sent by the browser or software to identify that software to the website.

Looking at the http user agent data generated by your panel members can be very instructive. You can find out what operating they are using, what browser and the browser version. Some user agents will also tell you what device the software is being run on. The downside to this is that user agent data can be spoofed. See here for a fuller explanation of spoofing.

One benefit online panels have over the usual statistics from providers, such as Statcounter, is that you know exactly who has logged on and will not over or under count them in the statistics. As ever, Wikipedia has a list of reasons for over and under counting. There's no point repeating them here.

So, how have I decoded the strings? Badly, it turns out. There are better ways to do what I've done and there are websites around that have decoders available. See wurfl and MobileESP for examples.

The user agent is made up of around 5 parts. Websites such as www.useragentstring.com show exactly how these are made up.

I was initially only interested in 4 items from the string and these are relatively easy to extract. Therefore I did it myself. Some of the aforementioned websites are subject to restrictive licences for commercial use and so I couldn't use them for work. The 4 items are:

  1. Operating system
  2. OS version
  3. Browser
  4. Browser version

Part 1 is a simple matter of looking for the relevant string, although you do need to be careful with the order in which you search as Android phones will have Linux in the string as well (assuming that you're interested in splitting out Android) and iPads etc. have OS X as part of their string.

So I searched for:
a) Windows
b) Android
c) Intel Mac OS X
d) Linux
e) Mac OS X - iPad and iPhones

This won't find Blackberry or Symbian systems, although it's trivial to add these.
Part 2 is more complicated but the OS version is generally around the OS string. Therefore it's just a case of creating a regular expression to extract the relevant information. For all systems we can use the pattern:
 ((\d+)(\.\d+)*)
The pattern is searching for any number of digits (one or more) and then any number of a dot coupled with any number of digits (one or more) zero or more times. Hopefully that makes sense.

You will need to replace underscores with dots in the user agent string and Windows strings pre-NT need to be catered for differently (Win98 is one example). The first group from the match is captured.

Part 3 is once again simple. We look for the strings:
a) MSIE
b) Firefox
c) Chrome
d) Safari
e) Opera

Chrome strings always include Safari so it needs to be searched for first. Again Blackberry and feature phones won't be found if only searching with these strings.

Part 4 is not so simple and requires a bit of faff to find the version. The above pattern can be reused with the addition of \/ at the beginning to extract the version number. The version number is usually with the browser string and so should be relatively easy to find but inconsistencies within the user agents often cause problems.


Results

Here's an example of the data that can be extracted:



This chart shows the smoothed, raw (so not weighted to be representative) percentage share of the various Windows operating systems in Great Britain from early 2007 to early 2013. I've also put in the release dates of the operating systems (according to Wikipedia).

More work would need to be done on this (weighting, test the smoothing) but it does show fairly well the tail off in the rise in Windows 7 since Windows 8 was released. Also although it's early days the take up of the Windows 8 is only just behind that of the previous 2 systems.

The gradient of the Windows 7 share line is clearly steeper than either Vista of Windows 8 but it's not by much. The next few months will be interesting.

I also find it interesting that Vista never had more share than XP.

Because the data is from a panel we could look at switching and demographics. Another time maybe.