Stats, MR and Data: Demographically Weighting a Market Research Sample

What’s a sample?

When researching a subject, such as the sales volume of a chocolate bar or how many people have bought a car in the last year, we ask a group of people to provide the information. This group is the sample.

To extend the first example above, we might want to measure the sales of chocolate in Great Britain. To do this we ask a sample of people how much chocolate they have bought over a period of time. From this we can extrapolate an estimate of total sales for the area we’re interested in.

This brings to mind several questions:
1) How big a sample do we need?
2) If the sample is picked randomly, how do we estimate the accuracy?
3) If the demographic profile of the sample doesn’t match that of the population, how accurate is the estimate?
4) Can this estimate be improved?

Why demographically weight?

Demographic weighting aligns the profile of the sample with that of the population, which improves the accuracy of the estimate (by how much?). This can be seen intuitively by taking an extreme example: a sample that is 10% male and 90% female, drawn from a population that is 50% male and 50% female. If we are estimating the average height of the population, a simple average will give an estimate that is far too low, because men are on average taller than women and are badly under-represented in the sample. We need to take the weighted average instead.
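To make the height example concrete, here is a minimal sketch in Python. The heights (178 cm for men, 164 cm for women) and the sample of ten people are made up purely for illustration; only the 10/90 and 50/50 splits come from the example above.

```python
# Hypothetical sample: 1 man and 9 women (10% male, 90% female).
# The heights in cm are invented for illustration only.
sample = [("male", 178.0)] + [("female", 164.0)] * 9

# Population profile from the example: 50% male, 50% female.
target_share = {"male": 0.5, "female": 0.5}

# Share of each group actually present in the sample.
n = len(sample)
sample_share = {
    group: sum(1 for g, _ in sample if g == group) / n
    for group in target_share
}

# Each respondent's weight is target share / sample share for their group.
weights = [target_share[g] / sample_share[g] for g, _ in sample]

simple_mean = sum(h for _, h in sample) / n
weighted_mean = sum(w * h for w, (_, h) in zip(weights, sample)) / sum(weights)

print(f"simple mean:   {simple_mean:.1f} cm")
print(f"weighted mean: {weighted_mean:.1f} cm")
```

With these made-up numbers the simple mean sits close to the female average, while the weighted mean lands halfway between the two groups, as the 50/50 population profile says it should.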

This is, in effect, all that demographic weighting is used for.

How?

There are two main methods of demographically weighting a sample to produce an estimate of a market figure. Which one is used depends on how big the sample is and how many dimensions are used within the weighting.

Cell weighting – the demographics used for weighting are interlaced and targets are derived for every single cell. For example, to estimate bread purchasing by households in Great Britain we might want to weight by age, gender and household size, which gives 3 dimensions. Imagine a 3 dimensional grid with age on one axis, gender on another and household size on the third. These are all discrete variables, and for each intersection (e.g. male, 25-34 years old, 2 person household) we would need to know the target, i.e. how many such people we would expect within the population.

In principle this type of weighting is more accurate, as every cell is given its own target. In practice it is almost impossible to derive accurate targets for most of these cells.
Even with a sample of a reasonable size, some cells will be very small. The target may even round to zero. Should you then delete the data for someone who falls in that cell?
The more dimensions used for weighting, the harder it is to derive targets and the more volatile the weights will be.
Computationally simple – the weight is just the target divided by the actual number of people for the cell.
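The "target divided by actual" rule can be sketched in a few lines of Python. The respondents, the age bands and the target shares below are all hypothetical, and only two dimensions are used to keep the example short.

```python
from collections import Counter

# Hypothetical respondents, each described by (gender, age band).
respondents = [
    ("male", "16-34"), ("male", "35+"), ("male", "35+"),
    ("female", "16-34"), ("female", "16-34"), ("female", "16-34"),
    ("female", "35+"), ("female", "35+"),
]

# Hypothetical population targets for every interlaced cell (shares sum to 1).
cell_target_share = {
    ("male", "16-34"): 0.25, ("male", "35+"): 0.25,
    ("female", "16-34"): 0.25, ("female", "35+"): 0.25,
}

n = len(respondents)
cell_count = Counter(respondents)

# Weight for a cell = target count / actual count in that cell.
cell_weight = {
    cell: cell_target_share[cell] * n / count
    for cell, count in cell_count.items()
}

# Every respondent simply picks up the weight of the cell they fall into.
weights = [cell_weight[r] for r in respondents]

for cell, w in sorted(cell_weight.items()):
    print(cell, round(w, 3))
```

Note that a cell with a target but no respondents never appears in `cell_count`, so it gets no weight at all and its target cannot be met. That is the small-cell problem described above showing up in code.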

Rim weighting – the demographics used for weighting are not interlaced and targets are only derived for each of the dimensions. Using the example above, we would only need to know targets for males, 25-34 year olds and 2 person households separately.

This type of weighting is less accurate as only the rims (gender or age etc.) are guaranteed to match the targets. The interlaced targets may be completely off.
Targets are generally easier to derive – for example you may be able to find out what proportion of people use each broadband Internet supplier. You will almost certainly not be able to find out, to any degree of accuracy, how each supplier's customers break down by gender and age.
More computationally complex – with more than one dimension an iterative approach has to be taken, because adjusting the weights to hit one rim generally knocks the others off target.
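The iterative approach is usually called raking, or iterative proportional fitting. What follows is only a rough sketch of that general idea under simple assumptions, not a production implementation: the function name `rim_weights`, the respondents and the target shares are all hypothetical, and the code assumes every target category has at least one respondent.

```python
def rim_weights(respondents, targets, max_iter=100, tol=1e-9):
    """Rake weights so each rim matches its target share.

    respondents: list of dicts, e.g. {"gender": "male", "age": "16-34"}
    targets: {dimension: {category: population share}}
    """
    n = len(respondents)
    weights = [1.0] * n
    for _ in range(max_iter):
        largest_change = 0.0
        for dim, shares in targets.items():
            # Current weighted total of each category on this rim.
            totals = {cat: 0.0 for cat in shares}
            for w, r in zip(weights, respondents):
                totals[r[dim]] += w
            # Scale every respondent so this rim hits its target exactly.
            factor = {cat: shares[cat] * n / totals[cat] for cat in shares}
            for i, r in enumerate(respondents):
                f = factor[r[dim]]
                weights[i] *= f
                largest_change = max(largest_change, abs(f - 1.0))
        # Stop once a full pass over all rims barely moves the weights.
        if largest_change < tol:
            break
    return weights


# Hypothetical sample and rim targets (no interlaced targets needed).
respondents = [
    {"gender": "male", "age": "16-34"},
    {"gender": "male", "age": "35+"},
    {"gender": "male", "age": "35+"},
    {"gender": "female", "age": "16-34"},
    {"gender": "female", "age": "16-34"},
    {"gender": "female", "age": "16-34"},
    {"gender": "female", "age": "35+"},
    {"gender": "female", "age": "35+"},
]
targets = {
    "gender": {"male": 0.5, "female": 0.5},
    "age": {"16-34": 0.4, "35+": 0.6},
}

weights = rim_weights(respondents, targets)
```

Each pass fixes one rim at a time, which disturbs the rims fixed earlier in the pass. Repeating the passes until the scaling factors are all very close to one is what makes the method iterative.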

The table below illustrates the differences between the two types of demographic weighting. In this example we are weighting by two demographics, age and gender. The blue highlighted cells show the targets for cell weighting. The grey highlighted cells show the targets for rim weighting. The grey cells are the rims of the table. From this table you can see the differences you will get from the two methods. Let’s take young males. The target for the cell is 30%. However, if we were to rim weight the data, the target would be 27.5%. So we’re 2.5 percentage points off, which is roughly 8% of the cell target.

Advantages and disadvantages of weighting

The main advantage of weighting is that it makes any estimate of the subject we’re interested in more representative and accurate. After all, weighting the data gives you a weighted average of the data.

However, it does not make the estimate more precise. In fact, weighting the data will generally increase the standard errors of any estimate. The degree to which it does so can be found out by calculating the WEFF, the weighting effect.

Also, it should be remembered that weighting to certain demographics does not make the sample more representative of the demographics that you do not weight to. For example, if you do not take social class into account in your weighting scheme, any analysis by social class will be subject to systematic sampling errors. Social class A might be under-represented, for instance, if you only sample in certain areas or from certain types of people.

WEFF

The WEFF is basically one plus the squared coefficient of variation of the weights: take the population standard deviation of the weights, divide it by the mean of the weights, square the result and add one, i.e. WEFF = 1 + (sd / mean)^2. When calculating the standard errors, the sample size is divided by the WEFF to give the effective sample size. As you can imagine, the weights can have a dramatic effect on how big a sample is needed to achieve a given level of accuracy.
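As a quick illustration, the calculation can be sketched in Python using only the standard library. The function names and the example weights are hypothetical.

```python
from statistics import mean, pstdev


def weff(weights):
    """Weighting effect: 1 + (population sd of weights / mean of weights)^2."""
    return 1.0 + (pstdev(weights) / mean(weights)) ** 2


def effective_sample_size(weights):
    """Actual sample size divided by the WEFF."""
    return len(weights) / weff(weights)


# Equal weights: no loss of precision.
equal = [1.0, 1.0, 1.0, 1.0]

# One respondent carrying three times the weight of the others.
uneven = [3.0, 1.0, 1.0, 1.0]

for label, w in (("equal", equal), ("uneven", uneven)):
    print(label, round(weff(w), 3), round(effective_sample_size(w), 3))
```

In the uneven case the four respondents behave like an unweighted sample of only three. The effective sample size can equivalently be written as the square of the sum of the weights divided by the sum of the squared weights, which is a handy cross-check.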

It is for this reason that the weights are generally capped at a relatively low level.

Conclusion

This is just a quick overview of demographic weighting. I haven’t really answered many of the questions asked at the beginning. The next post in the market research entries will cover the algorithm for rim weighting and I’ll detail an Excel macro to generate the weights. Until then.


Wednesday, 15 May 2013
