Monday, November 11, 2013

Data analysis - Logic and Simplicity

Today, I would like to share something that is my bread and butter, and something close to my heart: data analysis.

If I were to use one word to describe data analysis, it would be beauty - sheer beauty - to be exact.

Data analysis is a body of methods that helps to uncover the story behind the numbers. In more technical terms, it is used to describe facts, detect patterns, develop explanations, and test hypotheses. It is used in all of the sciences, as well as in business, administration, and policy.

In data analysis, the golden rule is to simplify the problem. The best data analyst is the one who gets the job done, and done well, with the simplest methods. Oscar Wilde once said, "Simplicity is beauty." That is why, to me, data analysis is beauty.

Data analysis is both a science and an art. Present the same data to two people, and you are likely to get two different answers.

Of course, there is a common toolkit that data analysts share. It goes like this:
1. Think about the data.
2. Look for the central tendency, such as the mean or the median. It can also be a rate of growth.
3. Look for outliers and explore the possible reasons for them. E.g. are they due to the data collection process? If need be, go back to first principles.
4. Prove your point using evidence.
5. Ask whether the data make sense. Think about the world behind the numbers and let good sense and reason guide the analysis.
6. Strive for parsimony. Parsimony is the analyst’s version of the phrase “Keep It Simple.” It means getting the job done with the simplest tools, provided that they work.
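
Rules 2 and 3 can be tried out in a few lines of Python. This is only a sketch: the numbers are made up for illustration, and the 1.5 x IQR cut-off (Tukey's rule) is my own choice of a simple outlier check, not the only one. The function name flag_outliers is also just something I picked.

```python
from statistics import mean, median, quantiles

# Made-up figures for illustration; the last value mimics a data-entry error.
values = [12, 14, 13, 15, 14, 13, 140]

def flag_outliers(data, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = quantiles(data, n=4)  # the three quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [x for x in data if x < low or x > high]

print("mean:    ", round(mean(values), 2))
print("median:  ", median(values))
print("outliers:", flag_outliers(values))
```

The mean is dragged far away from the typical value by a single odd number, while the median barely moves. The check only tells you which number looks odd. Finding out why is still the analyst's job.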

To me, the most powerful rule is the first one, “Think”, followed by the fifth one. The data are telling us something about the real world, but what?

Here is a good example of how a data analyst thinks.

Between 1790 and 1990 the population of the United States increased by 245 million people, from 4 million to 249 million people. 

Can one then say that the population grew at an average rate of 1.2 million people per year, that is, 245 million people divided by 200 years?

The arithmetic is correct — 245 million people divided by 200 years is approximately 1.2 million people per year. But the interpretation “grew at an average rate of 1.2 million people per year” would be wrong. 

Why? Because the conclusion is not “sensible”. How could a population of 4 million people in 1790 have grown to about 5.2 million in a single year? That would have been a 30 percent increase in one year, which is not likely (and didn’t happen).

It would be more valid to describe the annual growth as a percentage, stating that the population increased by an average of about 2 percent per year: 2 percent per year when the population was 4 million (as it was in 1790), and 2 percent per year when the population was about 250 million (as it was in 1990).
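
For anyone who wants to check the arithmetic, here is a small Python sketch. It assumes a constant compound rate r over the whole period, so that start * (1 + r) ** years == end, and it uses the rounded population figures quoted above. The function name is my own.

```python
def average_annual_growth_rate(start, end, years):
    """Constant yearly rate r such that start * (1 + r) ** years == end."""
    return (end / start) ** (1 / years) - 1

start_pop = 4_000_000    # 1790, rounded as in the text
end_pop = 249_000_000    # 1990, rounded as in the text
years = 200

# The "1.2 million people per year" figure: total increase spread evenly.
absolute_per_year = (end_pop - start_pop) / years

# What that figure would imply as a percentage jump in the very first year.
implied_first_year = absolute_per_year / start_pop

# The percentage rate that carries 4 million to 249 million in 200 years.
rate = average_annual_growth_rate(start_pop, end_pop, years)

print(f"Average absolute increase: {absolute_per_year:,.0f} people per year")
print(f"Implied first-year jump:   {implied_first_year:.0%}")
print(f"Compound annual rate:      {rate:.1%}")
```

The first figure is right as arithmetic but implies a jump of roughly 30 percent in the first year. The compound rate comes out at roughly 2 percent and stays the same whether the population is 4 million or 249 million.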

Have fun thinking.
