Thursday 30 May 2013

A Java Convex Hull Program

I was a little reluctant to publish this as, although the algorithm works, it has two fairly major limitations:

  1. it only works with positive integers
  2. it fails to terminate if the start point (the left-most point) is not on the convex hull
However it may be useful to some people and it is at least a record of what I've tried. So on to the program.

As noted in a previous blog entry I used this when creating the area maps for UK postcodes. It didn't show the areas very well - after all, most postcode areas are concave rather than convex - but it did at least make nice maps.

The algorithm is just an implementation of the determinant method of calculating the convex hull. Google 'java convex hull' for some good websites showing the details. The only problem with these websites is that they never showed the code used - none that I saw anyway. In any case, where is the fun in having something spelled out to you (oh the irony). Listed next is what I came up with.


    private ArrayList<Point> calcVHull(ArrayList<Point> listPoints){
        //Calculate convex vector of points
        ArrayList<Point> hullPoints = new ArrayList<>();
        Point lp = new Point(Integer.MAX_VALUE,0);
        Point rp = new Point(0,0);
        Point np = new Point();
        
        //Calculate leftmost and rightmost points - lp and rp
        for(int i=0;i<listPoints.size();i++){
            if(listPoints.get(i).x<=lp.x){
                lp.move(listPoints.get(i).x, listPoints.get(i).y);
            }
            if(listPoints.get(i).x>=rp.x){
                rp.move(listPoints.get(i).x, listPoints.get(i).y);
            }
        }
        
        //So calculate determinant and find the point for which every det is positive
        hullPoints.add(new Point(lp));
        np.setLocation(lp);
        
        do {
            for(int i=0;i<listPoints.size();i++){
                boolean nextPoint = true;
                if(np.x==listPoints.get(i).x && np.y==listPoints.get(i).y){
                    continue;
                }
                for(int j=0;j<listPoints.size();j++){
                    if(i==j || (np.x==listPoints.get(j).x && np.y==listPoints.get(j).y)){
                        continue;
                    }
                    
                    long det = (long) np.x * ((long) listPoints.get(i).y - (long) listPoints.get(j).y)
                             + (long) listPoints.get(i).x * ((long) listPoints.get(j).y - (long) np.y)
                             + (long) listPoints.get(j).x * ((long) np.y - (long) listPoints.get(i).y);
                    if(det<0){
                        nextPoint = false;
                        break;
                    }
                }
                if(nextPoint){
                    np.setLocation(listPoints.get(i));
                    hullPoints.add(new Point(np.x,np.y));
                    //System.out.println(np.x + ";" + np.y);
                    break;
                }
            }
        //This condition does not work well. It will sometimes never end
        //if the left most point is not on the convex hull
        } while (!(np.x==lp.x && np.y==lp.y));
        
        return hullPoints;
    }


Performance

The chart below shows the time taken in seconds to calculate the hull with the number of points in the area along the x-axis.



The time taken is largely irrelevant as your computer will be different to mine. What is more important is the shape of the line. The order is roughly n*log(n) - which surprised me. I thought it would scale as the square of the number of points.

Limitations

As stated earlier, this will only work with positive integers. It should be fairly easy to convert doubles to integers though, if you don't mind limiting the accuracy: if the numbers you were dealing with had only 2 decimal places you'd multiply them by 100.
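
For instance, a small helper along these lines would do it. This is just a sketch of the idea and not part of the original program - toScaledPoints and the list of x/y pairs are made up for the example:

    //Assumes java.awt.Point, java.util.ArrayList and java.util.List are imported,
    //as in the rest of the program
    private ArrayList<Point> toScaledPoints(List<double[]> coords, int decimalPlaces){
        double factor = Math.pow(10, decimalPlaces);   //e.g. 100 for 2 decimal places
        ArrayList<Point> points = new ArrayList<>();
        for(double[] c : coords){
            points.add(new Point((int) Math.round(c[0] * factor),
                                 (int) Math.round(c[1] * factor)));
        }
        return points;
    }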

The bigger problem is the fact that the algorithm will never end if the starting point is not on the hull. This only happens when the left-most point lies on a straight line with 2 other points that are part of the convex hull. It occurs very rarely but becomes more likely when there are a lot of points within a small area. I guess one way around this could be to split the points into a top half and a bottom half along the line between the left-most and right-most points, calculate the convex hulls of the two halves and combine them. This is also one way of speeding up the algorithm, but I decided not to implement it - and there's no guarantee that it wouldn't also create starting points that aren't on the hull.

The other guaranteed way to stop this happening would be to check whether the new point is already in the hull every time one is added. I'm not sure what this would do to the speed though. One to test.
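
For what it's worth, the check inside the loop could look something like this. It's an untested sketch of the idea above, and it assumes the do/while has been labelled hullLoop so we can break out of the whole thing:

    //Untested sketch: in place of hullPoints.add(new Point(np.x,np.y)) in the loop above
    Point candidate = new Point(np.x, np.y);
    if(hullPoints.contains(candidate)){
        break hullLoop;  //we've come back round to a point already on the hull, so stop
    }
    hullPoints.add(candidate);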

Wednesday 29 May 2013

An Excel Auto-Correlation Macro

Wikipedia defines the auto-correlation as the cross-correlation of a signal with itself. For a series of data points measured through time it is the correlation between points measured a specific time apart.

I won't go into any more detail about auto-correlation as there are many good descriptions online:

  1. Wikipedia
  2. Mathworld
  3. NIST Engineering Statistics Handbook - this handbook is well worth reading for definitions of other time series related functions
There's no point repeating what others have already written. Especially when they've written it better than I ever could.

The only things I will define relate directly to the function. These are:
  • Lag - this is the number of points n between the two copies of the series used to calculate the auto-correlation. So for lag=1, the correlation is calculated between points (t, t+1, t+2, ...) and points (t+1, t+2, t+3, ...).
  • Difference - when calculating the auto-correlation one often needs to calculate it on the differenced series (x(t+1) - x(t)) rather than on the raw series x(t). This is to ensure some degree of stationarity and give a clearer picture of the auto-correlation. (There's a rough sketch of both definitions just after this list.)
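
As a rough illustration of those two definitions, here is a plain Java sketch of the textbook formula (it is not the Excel function this post is about - just the lag-k auto-correlation of a once-differenced series):

--------------------------------------------------------------------------
//Lag-k auto-correlation, optionally calculated on the once-differenced series
static double autocorrelation(double[] series, int lag, boolean difference){
    double[] x = series;
    if(difference){
        x = new double[series.length - 1];
        for(int t = 0; t < x.length; t++) x[t] = series[t + 1] - series[t];
    }
    double mean = 0;
    for(double v : x) mean += v;
    mean /= x.length;

    double num = 0, den = 0;   //lagged covariance and overall variance
    for(int t = 0; t < x.length; t++){
        if(t + lag < x.length) num += (x[t] - mean) * (x[t + lag] - mean);
        den += (x[t] - mean) * (x[t] - mean);
    }
    return num / den;
}
--------------------------------------------------------------------------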

Function

So, on to the function. It's another array function (see here for details on how to use them). I've

Thursday 23 May 2013

Java Google Maps polyline encoding function

In this post I'm going to detail the polyline encoding method I wrote in Java a while back.

It was originally written to graphically show the areas covered by certain UK postal sectors on Google Maps. The program converted the post codes in the postal sector into a list of points defining either a convex or concave hull and produced the URL of a Google Map showing the area. The concave hull algorithm (based on alpha shapes) doesn't work too well and takes an age to run so I haven't really used it.

It didn't actually display the map as the firewall where I work gets a little upset by multiple requests for the same address (as would happen when you test anything).

Anyway, on to the function.

The definition of the algorithm is here on the Google developers maps site. I won't go through it as I'll just be repeating what is written, with much more style, on the algorithm site.

The definition of the static map API is here, again on the Google developers maps site.

Taking the definition of the algorithm, I produced this, messy, Java code:


-----------------------------------------------------------------------------
private String encodePolyline(double latlong){
    //Google's procedure for encoding polyline data
    //This doesn't cater for backslashes in string literals i.e. the character sequence "\"
    //which should be returned as "\\". Function createPolyLine will do this.

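For anyone who just wants the gist, here is a bare-bones sketch of the per-value encoding step, written straight from the algorithm description on the Google developers page. It is not the function from the listing above, and it skips the backslash escaping mentioned in the comments:

-----------------------------------------------------------------------------
private static String encodeSignedValue(double value){
    //Scale to 5 decimal places and round
    int num = (int) Math.round(value * 1e5);
    //Left-shift one bit; invert if negative so the sign ends up in the low bit
    int sgnNum = num << 1;
    if(num < 0){
        sgnNum = ~sgnNum;
    }
    StringBuilder sb = new StringBuilder();
    //Emit 5-bit chunks, least significant first, setting the 0x20 bit on all but the last
    while(sgnNum >= 0x20){
        sb.append((char) ((0x20 | (sgnNum & 0x1f)) + 63));
        sgnNum >>>= 5;
    }
    sb.append((char) (sgnNum + 63));
    return sb.toString();
}
-----------------------------------------------------------------------------

Fed the value -179.9832104 from the worked example in Google's documentation, this returns `~oia@, which matches.
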
Tuesday 21 May 2013

List of useful statistical and market research websites

Below is a list of websites I've found useful over the years in the field of market research. Hope you find them useful too, although they do have a very UK centric bias. I've split them into various areas.


Population and other country wide statistics

  • Office of National Statistics - the UK government statistics website. Has masses of useful data on most aspects of the UK including the economy, population, government etc. Good luck finding anything though - it’s a horrible maze of unrelated pages.
  • UK & NI National Population Projections - a branch of the ONS (above) that has downloads of the national population projections for the UK, GB, Wales, Scotland, England, NI etc. Very useful and easy to navigate site if you want to know population figures for, for example, demographic weighting.
  • Eurostat - the European equivalent of the ONS, it’s the statistical office of the European Union, situated in Luxembourg. The place to go for Europe wide population projections etc.
  • Eurostat population projections - holds the population projections for Europe at a country level.
  • Spanish Statistical Office - the Spanish equivalent of the ONS. Excellent when compared to the ONS. Population projections were a breeze to find and everything was laid out well and in an easy to read font.
  • US Census Bureau - US census department
  • United Nations Statistics Division - UN statistics website. Has a lot of information (demographic yearbooks, etc.) and quite a few handbooks with advice on designing samples, surveys, errors and other areas in all parts of the world.
These are only the ones that I've used over the years. There are obviously ones for other countries but the UN covers most (all?) countries. I'll add more later.

Market Research Websites

  • Market Research Society - the world's leading research association (apparently). Supports and promotes market research. Has useful guidelines and advice for data protection, freedom of information, online research, questionnaire design, etc.
  • ESOMAR - world association for market, social and opinion research. Another market research society.
  • efamro - umbrella organisation of the European market research societies.
As you can imagine there are a lot more market research society websites out there that cover the various countries. I won't list them all (efamro covers a fair amount of Europe after all) but I will add to the list when I get time.

Market Research News Websites

  • MR Web - a market research news service and jobs website. You can subscribe to their emailing list for both jobs and news which I recommend.
  • Research Live - has news, articles, features and blogs. I'm assuming that it's linked to the MRS as this website is fairly prominent on there. It also has a jobs search area.
There are obviously other websites with market research news (Marketing Week comes to mind) but these two are fairly good.


Statistics Websites

  • R - not a statistics website exactly but a statistical software package. Does a lot of statistical analyses out of the box and has a load of packages that can be downloaded to add functions. It does have a steep learning curve though and a lot of the syntax is very unintuitive.
  • Journal of Statistical Software - publishes articles, etc. on the subject of statistical software and algorithms. Has some interesting articles and includes code for algorithms should you wish to implement them.
  • Journal of Official Statistics - a lot of articles on methodology such as methods of weighting survey data, handling wave non-response of panels, response rates to different question layouts etc. There are about 28 years' worth of articles on here and some of them are very relevant to running panels and market research in general. It's published by Statistics Sweden.
  • GeoHive - as it says on the site, it is mainly global population statistics but also has geopolitical data. It also lists what must be all the statistical agencies in the world under its resources section. Impressive and makes my list look rather inadequate.
  • Statsoft - this is the link to the textbook that describes, in broad strokes, a lot of statistical concepts. Good for an overview but doesn't go into much detail. Statsoft also produce a data analysis software package.
  • GRETL - free, open source software for regression, econometrics and time series analysis. I've never actually used it but it comes with an impressive manual describing the various analyses that are available in a lot of detail.
  • Royal Statistical Society - a society with the mission statement of 'Advancing the science and application of statistics, and promoting use and awareness for public benefit'. A busy site that I've never found to be very useful. Others may disagree.


Conclusion

This is only a start to the list. I'll update it when I get the time and if I find any other sites. There are a few blogs I want to put on here and some of the more general mathematical sites and other statistical software package websites. I'm getting tired of writing 'statistics' though, so all in good time.

If there are any that you think should be added then stick them in a comment.

Wednesday 15 May 2013

Demographically Weighting a Market Research Sample


What’s a sample?

When trying to research a subject or find out information about something, such as the sales of a chocolate bar or how many people have bought cars in the last year, a group of people is asked to provide that information. This group is the sample.

To extend the first example above, we might want to measure the sales of chocolate in Great Britain. To do this we ask a sample of people how much chocolate they have bought over a period of time. From this we can extrapolate an estimate of total sales for the area we’re interested in.

This brings to mind several questions:
1) How big a sample do we need?
2) If the sample is picked randomly, how do we estimate the accuracy?
3) If the demographic profile of the sample doesn’t match that of the population, how accurate is the estimate?
4) Can this estimate be improved?


Why demographically weight?

Demographic weighting is used to align the profile of the sample to that of the population. This improves the accuracy of the estimate (by how much?). This can be seen (intuitively) by taking the extreme example of having a sample that is 10% male, 90% female compared to a population that is 50% male, 50% female. If we are estimating the average height of the population then taking a simple average will give an estimate that is far too low. We need to take the weighted average.
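
To put some made-up numbers on it: if the men in the sample average, say, 178 cm and the women 164 cm, the simple sample average is 0.1 × 178 + 0.9 × 164 = 165.4 cm, whereas weighting up to the 50/50 population split gives 0.5 × 178 + 0.5 × 164 = 171 cm.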

This is, in effect, all that demographic weighting is used for.


How?

There are two main methods of demographically weighting a sample to produce an estimate of a market figure. Which one is used depends on how big the sample is and how many dimensions are used within the weighting.


  • Cell weighting – the demographics used for weighting are interlaced and targets are derived for every single cell. For example, to derive an estimate of bread purchasing by households in Great Britain we might want to weight by age, gender and household size. This gives 3 dimensions. Imagine a 3-dimensional graph with age on one axis, gender on another and household size on the third. These are all discrete variables, and for each intersection (e.g. male, 25-34 year olds, 2 person household) we would need to know the target, i.e. how many of these we would expect within the population.
    • This type of weighting will be more accurate as every cell is given an accurate target. However it is almost impossible to derive accurate targets for most of these cells.
    • Even with a reasonably sized sample some cells will be very small. The target may even round to zero. Should you then delete the data for someone from this cell?
    • The more dimensions used for weighting, the harder it is to derive targets and the more volatile the weights will be.
    • Computationally simple – the weight is just the target divided by the actual number of people for the cell.
  • Rim weighting – the demographics used for weighting are not interlaced and targets are only derived for each of the dimensions. Using the example above, we would only need to know targets for males, 25-34 year olds and 2 person households separately.
    • This type of weighting is less accurate as only the rims (gender or age etc.) are guaranteed to match the targets. The interlaced targets may be completely off.
    • Targets are generally easier to derive – for example you may be able to find out what proportion of people have each broadband network as their supplier, but you will almost certainly not be able to find out their age and gender to any degree of accuracy.
    • More computationally complex – with more than 2 dimensions an iterative approach has to be taken (a sketch of the iteration follows this list).
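
To give a flavour of that iterative approach, here is a small raking (iterative proportional fitting) sketch in Java with two rims and made-up sample counts and targets. It is only an illustration of the idea, not the rim weighting macro I'll cover in the next post:

--------------------------------------------------------------------------
public class RimWeightSketch {
    public static void main(String[] args) {
        //Hypothetical sample counts: rows are three age bands, columns are male/female
        double[][] counts = {{80, 120}, {150, 150}, {170, 330}};
        double[] ageTargets = {250, 300, 450};     //rim targets - each set sums to 1000
        double[] genderTargets = {500, 500};

        double[][] adjusted = new double[3][2];
        for (int i = 0; i < 3; i++) adjusted[i] = counts[i].clone();

        //Alternately scale rows to the age targets and columns to the gender targets
        for (int iter = 0; iter < 50; iter++) {
            for (int i = 0; i < 3; i++) {
                double rowSum = adjusted[i][0] + adjusted[i][1];
                for (int j = 0; j < 2; j++) adjusted[i][j] *= ageTargets[i] / rowSum;
            }
            for (int j = 0; j < 2; j++) {
                double colSum = adjusted[0][j] + adjusted[1][j] + adjusted[2][j];
                for (int i = 0; i < 3; i++) adjusted[i][j] *= genderTargets[j] / colSum;
            }
        }

        //As with cell weighting, a cell's weight is its adjusted total over its actual count
        for (int i = 0; i < 3; i++)
            for (int j = 0; j < 2; j++)
                System.out.printf("age band %d, gender %d: weight %.3f%n",
                        i, j, adjusted[i][j] / counts[i][j]);
    }
}
--------------------------------------------------------------------------

After the loop each rim matches its target (to within rounding) but the individual cells need not - which is exactly the trade-off described above.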

The table below illustrates the differences between the two types of demographic weighting. In this example we are weighting by two demographics, age and gender. The blue highlighted cells show the targets for cell weighting. The grey highlighted cells show the targets for rim weighting - the rims of the table. From this table you can see the differences you will get from the two methods. Let’s take young males. The target for the cell is 30%. However, if we were to rim weight the data, the cell would come out at 27.5%. So we’re 2.5 percentage points off, which is almost 10% away from the target in relative terms.




Advantages and disadvantages of weighting

The main advantage of weighting is that it makes any estimate of the subject we’re interested in more representative and accurate. After all, weighting the data gives you a weighted average of the data.

However, it does not make the estimate more precise. In fact, weighting the data will generally increase the standard errors of any estimate. The degree to which it does so can be found out by calculating the WEFF, the weighting effect.

Also, it should be remembered that just because you are weighting to certain demographics, this does not make the sample more representative of the demographics that you do not weight to. For example, if you do not take social class into account in your weighting scheme, any analysis on this will be subject to systematic sampling errors. These could include a lack of social class A if you only sample in certain areas or certain types of people.


WEFF

The WEFF is basically one plus the square of the population standard deviation of the weights divided by the square of their mean, i.e. WEFF = 1 + (s/m)^2, where s is the population standard deviation and m is the mean of the weights. When calculating standard errors, the effective sample size is the actual sample size divided by the WEFF. As you can imagine, the weights can have a dramatic effect on how big a sample is needed to achieve a given level of accuracy.
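
As a quick sketch of that calculation (in Java rather than Excel, and just my reading of the definition above):

--------------------------------------------------------------------------
//WEFF = 1 + (population standard deviation of the weights / mean of the weights)^2.
//The effective sample size is then the actual sample size divided by the WEFF.
static double weff(double[] weights){
    double mean = 0;
    for(double w : weights) mean += w;
    mean /= weights.length;

    double variance = 0;   //population variance, so divide by n rather than n - 1
    for(double w : weights) variance += (w - mean) * (w - mean);
    variance /= weights.length;

    return 1 + variance / (mean * mean);
}
--------------------------------------------------------------------------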

It is for this reason that the weights are generally capped at a relatively low level.


Conclusion

This is just a quick overview of demographic weighting. I haven’t really answered many of the questions asked at the beginning. The next post in the market research entries will cover the algorithm for rim weighting and I’ll detail an Excel macro to generate the weights. Until then.

Monday 13 May 2013

Gain Function of a filter in Excel

What have I gained

Last time I wrote about how to apply a filter to some time series data. The example given at the end of the post showed the effect of applying a filter to some clothing data from the ONS.

However, what did the filter actually do?

As mentioned before, the filter removes some frequencies from the series. Which frequencies have been removed is revealed by looking at the gain function, although it should be pointed out that we're interested in the periodicities rather than the frequencies.

The macro

It's another array function - this cuts down on processing and makes the calculation within Excel quicker and more efficient.

Without further ado:
--------------------------------------------------------------------------
Function GainFunction(filter As Variant, Optional b As Variant) As Variant

    Dim OutputRows As Long
    Dim OutputCols As Long
    Dim output() As Double
    Dim vert As Boolean
    Dim i As Integer
    Dim j As Integer
    Dim c() As Double
    Dim d() As Double
    Dim size As Integer
    Dim g1() As Double
    Dim g2() As Double
    
    'Load filter into array c
    If TypeName(filter) = "Range" Then
        If filter.Rows.Count > filter.Columns.Count Then
            ReDim c(1 To filter.Rows.Count)
            For i = 1 To UBound(c)
                c(i) = filter(i, 1)
            Next
        Else
            ReDim c(1 To filter.Columns.Count)
            For i = 1 To UBound(c)
                c(i) = filter(1, i)
            Next
        End If
    Else
        ReDim c(1 To UBound(filter))
        For i = 1 To UBound(c)
            c(i) = filter(i)
        Next
    End If
    
    With Application.Caller
        OutputRows = .Rows.Count
        OutputCols = .Columns.Count
    End With
    ReDim output(1 To OutputRows, 1 To OutputCols)
    
    'Populate output with zeroes
    For i = 1 To OutputRows
        For j = 1 To OutputCols
            output(i, j) = 0
        Next
    Next
    
    If IsMissing(b) Then
        If OutputRows > OutputCols Then
            vert = True
            size = OutputRows
            ReDim g1(1 To OutputRows)
            ReDim g2(1 To OutputRows)
            ReDim d(1 To OutputRows)
        Else
            vert = False
            size = OutputCols
            ReDim g1(1 To OutputCols)
            ReDim g2(1 To OutputCols)
            ReDim d(1 To OutputCols)
        End If
        
        For i = 1 To size
            d(i) = i
        Next
    Else
        'Load b into array d
        If TypeName(b) = "Range" Then
            If b.Rows.Count > b.Columns.Count Then
                vert = True
                size = b.Rows.Count
                If size > OutputRows Then size = OutputRows
                If size < OutputRows Then Exit Function
                ReDim g1(1 To size)
                ReDim g2(1 To size)
                ReDim d(1 To size)
                For i = 1 To size
                    d(i) = b(i, 1)
                Next
            Else
                vert = False
                size = b.Columns.Count
                If size > OutputCols Then size = OutputCols
                If size < OutputCols Then Exit Function
                ReDim g1(1 To size)
                ReDim g2(1 To size)
                ReDim d(1 To size)
                For i = 1 To UBound(d)
                    d(i) = b(1, i)
                Next
            End If
        Else
            size = UBound(b)
            If OutputRows > OutputCols Then
                vert = True
                If size > OutputRows Then size = OutputRows
                If size < OutputRows Then Exit Function
            Else
                vert = False
                If size > OutputCols Then size = OutputCols
                If size < OutputCols Then Exit Function
            End If
            
            ReDim g1(1 To size)
            ReDim g2(1 To size)
            ReDim d(1 To size)
            For i = 1 To size
                d(i) = b(i)
            Next
        End If
    End If
    
    'Create gain function
    For i = 1 To size
        g1(i) = 0
        g2(i) = 0
        For j = 1 To UBound(c)
            g1(i) = g1(i) + c(j) * Cos(j * 2 * [pi()] / d(i))
            g2(i) = g2(i) + c(j) * Sin(j * 2 * [pi()] / d(i))
        Next
        If vert = True Then
            output(i, 1) = Sqr(g1(i) ^ 2 + g2(i) ^ 2)
        Else
            output(1, i) = Sqr(g1(i) ^ 2 + g2(i) ^ 2)
        End If
    Next
    
    GainFunction = output

End Function
--------------------------------------------------------------------------

As ever, most of the macro is concerned with setting up the parameters and dealing with the fact that the input data could be horizontal or vertical. Only the last 10 or so lines actually calculate the gain function.

The optional variable/parameter b is the list of periodicities to calculate the function for. You don't have to include it, but I strongly recommend that you do. In this way you can control what you're checking.
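
For example, you might list the periodicities 2 to 40 in A1:A39, select a 39-cell column for the output, enter =GainFunction($D$1:$D$13,A1:A39) and confirm it with Ctrl+Shift+Enter (the cell references here are purely for illustration).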

I also recommend that you don't bother looking at periodicities less than 1. Below 1 the frequencies tend toward infinity and the gain function becomes very spiky (that's the technical term).

An Example

Using the 13 point symmetric filter (1/13, 1/13, 1/13, 1/13, 1/13, 1/13, 1/13, 1/13, 1/13, 1/13, 1/13, 1/13, 1/13) gives the following gain function:


As you can see from the chart, zeroes in the gain function occur at periodicities of 13, 6.5, 4.33 and so on - that is, at 13 divided by each whole number. These periodicities are removed from the data, and everything from just over 1 up to around 14 is, for the most part, removed. So most of the short term, high frequency variation is smoothed out.

The filter shown here has the same value for each of the 13 terms. It doesn't have to, although equal terms do make it easier to predict what will happen to the data. Essentially each smoothed data point is just the average of 13 points - the point itself and the 6 points either side of it.
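
If you want to sanity-check the chart without Excel, the same calculation is easy to knock up in Java. This mirrors the last few lines of the macro and is not part of the original workbook:

--------------------------------------------------------------------------
public class GainCheck {
    //Gain at a given periodicity: sqrt(g1^2 + g2^2), where g1 and g2 are the
    //cosine and sine sums, exactly as in the VBA above
    static double gain(double[] filter, double period){
        double g1 = 0, g2 = 0;
        for(int j = 1; j <= filter.length; j++){
            g1 += filter[j - 1] * Math.cos(j * 2 * Math.PI / period);
            g2 += filter[j - 1] * Math.sin(j * 2 * Math.PI / period);
        }
        return Math.sqrt(g1 * g1 + g2 * g2);
    }

    public static void main(String[] args){
        double[] ma13 = new double[13];
        java.util.Arrays.fill(ma13, 1.0 / 13);
        //Zeroes fall at 13 divided by a whole number: 13, 6.5, 4.33, 3.25, ...
        for(double p : new double[]{13, 6.5, 13.0 / 3, 3.25, 24, 2}){
            System.out.printf("periodicity %.2f -> gain %.4f%n", p, gain(ma13, p));
        }
    }
}
--------------------------------------------------------------------------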

There is another aspect to filters which I will cover in the next blog: the change to the phasing of the data.

Until next time.