Thursday, November 14, 2013

What is data sampling?

We all know about food sampling. We try a selection of the dishes on offer at different restaurants and conduct a taste test. Then we draw a conclusion about which restaurant is better.

It is the same in data analysis. Often there is not enough time, energy, money or manpower to measure every single item in the population, or it is simply not possible to do so.

Sampling is a shortcut method for investigating a whole population. Data is gathered on a small part of the population and used to infer what the whole picture is like.

An appropriate sampling strategy is adopted to obtain a representative and statistically valid sample of the whole.

Today, I will touch on the basics of sampling strategy. 

In general, the bigger the sample size, the more accurately it represents the population. However, there is a need to balance obtaining a statistically valid representation against the amount of resources needed.

A sampling strategy made with the minimum of bias is the most statistically valid. 

There are 3 main types of sampling strategy:

1. Random - the least biased of the 3. E.g. to survey 100 people in a small town with a population of 1,000, we pick 100 people at random, so that every person has an equal chance of being chosen.

2. Systematic - essentially a variant of 1 that works from an ordered list of the 1,000 people in our earlier example. Divide 1,000 by the sample size of 100 to get an interval of 10. Next, pick a random number between 1 and 10, say 6. The 6th, 16th, 26th, ... and 996th records then form the sample.

3. Stratified - in this form of sampling, the population is first divided into two or more non-overlapping groups (known as strata) based on some characteristic of interest in the research. A random sample is then drawn from each group. The whole procedure is described as stratified random sampling.
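
To make the first two strategies concrete, here is a minimal Python sketch using the town of 1,000 from the example. The function names and the idea of numbering the residents 1 to 1,000 are my own assumptions for illustration, not a standard recipe.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Pick n distinct items; every item has an equal chance of selection."""
    rng = random.Random(seed)
    return rng.sample(population, n)

def systematic_sample(population, n, start=None, seed=None):
    """Pick every k-th item from an ordered list, where k = len(population) // n.

    `start` is the 1-based position of the first pick; if omitted, it is
    chosen at random between 1 and k.
    """
    step = len(population) // n
    if start is None:
        start = random.Random(seed).randint(1, step)
    return population[start - 1::step][:n]

# Hypothetical town: residents numbered 1 to 1,000.
town = list(range(1, 1001))

random_pick = simple_random_sample(town, 100, seed=1)
systematic_pick = systematic_sample(town, 100, start=6)

print(len(random_pick))                          # 100
print(systematic_pick[:3], systematic_pick[-1])  # [6, 16, 26] 996
```

With a starting number of 6 and an interval of 10, the systematic pick reproduces the 6th, 16th, ... 996th records from the example.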

More on Stratified Method

Let me dwell a little more on the stratified method.

The key benefit of this method is to ensure that cases from smaller strata of the population are included in sufficient numbers to allow comparison. 

E.g. suppose we are interested in how job satisfaction varies by race among a group of employees at a firm. To explore this issue, we need to create a sample of the employees of the firm. However, if the employee population at this particular firm is predominantly White, a simple random sample of employees is likely to leave us with very small numbers of African American and Asian employees. The numbers in one or more of the smaller groups could be too small for comparison.

Rather than taking a simple random sample from the firm's population at large, in a stratified sampling design we ensure that the number of employees drawn from each racial group is in proportion to that group's share of the firm as a whole. Say we want a sample of 1,000 employees, and the firm is 75% White, 10% African American and 15% Asian. We would stratify the employees by race (a group of White employees, a group of African American employees, etc.), then randomly draw 750 employees from the White group, 100 from the African American group and 150 from the Asian group. This yields a sample that is proportionately representative of the firm as a whole.
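
The proportional draw can be sketched in a few lines of Python. This is only an illustration: the firm's headcounts (7,500, 1,000 and 1,500 employees) are assumed figures chosen so that the 750/100/150 split works out, and `stratified_sample` is a hypothetical helper name.

```python
import random

def stratified_sample(strata, total_n, seed=None):
    """Draw a random sample from each stratum, sized by its population share.

    strata: dict mapping group name -> list of members.
    Note: rounding can make the overall total differ slightly from total_n
    when the shares do not divide evenly.
    """
    rng = random.Random(seed)
    population_size = sum(len(members) for members in strata.values())
    sample = {}
    for name, members in strata.items():
        k = round(total_n * len(members) / population_size)
        sample[name] = rng.sample(members, k)
    return sample

# Assumed headcounts for illustration only (10,000 employees in total).
firm = {
    "White": [f"White-{i}" for i in range(7500)],
    "African American": [f"AfricanAmerican-{i}" for i in range(1000)],
    "Asian": [f"Asian-{i}" for i in range(1500)],
}

sample = stratified_sample(firm, 1000, seed=42)
sizes = {name: len(members) for name, members in sample.items()}
print(sizes)  # {'White': 750, 'African American': 100, 'Asian': 150}
```

Each group is sampled separately, so every group is certain to appear in the sample in the intended numbers rather than being left to chance.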

Hope this little post has piqued your interest in data analysis.
