The Subtle Subject of Statistics

Encyclopædia Britannica describes statistics as “the science of collecting, analyzing, presenting, and interpreting data.” The Babylonians were among the first to use statistics. They applied the results of their census to determine how much food was needed to feed their population. Later civilizations used census records to assess and collect taxes, and to raise armies.

The U.S. Constitution empowers the government to conduct a census every ten years in order to apportion direct taxes and representation in the House of Representatives. Today, census statistics are also used by the government to track trends in the U.S. economy, and to allocate appropriations for things like schools, roads, and entitlement programs. You can learn more about this on the U.S. Census Bureau website at https://www.census.gov/.

In statistics, we talk about special concepts like “randomness”, “distribution”, and “sampling”. Something is random if its value cannot be predicted with certainty in advance; the most we can do is describe how likely each of its possible values is. So suppose we have ten balls numbered sequentially 1, 2, 3, 4, 5, 6, 7, 8, 9, 10. If we put all the balls in an urn and select one (without looking!), then the number we draw is random, and in this case it is equally likely to be any value from 1 to 10.

This brings us to the idea of a distribution. A distribution describes how often each of the possible values occurs. In this example, we say the distribution is uniform because every possible number is equally likely to be selected. We can express this idea using probability notation: p(selecting a 1) = p(selecting a 2) = … = p(selecting a 10) = 1/10.
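
The urn experiment can be tried out on a computer. The following is a minimal Python sketch, assuming that each ball is put back in the urn after it is drawn; the function names draw_from_urn and relative_frequencies, and the seed value, are made up for this illustration. It tallies how often each number comes up so the result can be compared with the expected 1/10.

```python
import random
from collections import Counter


def draw_from_urn(num_draws, seed=0):
    """Draw balls numbered 1 to 10, one at a time, replacing each after the draw."""
    rng = random.Random(seed)  # fixed seed (an assumption) so runs are repeatable
    balls = list(range(1, 11))
    return [rng.choice(balls) for _ in range(num_draws)]


def relative_frequencies(draws):
    """Return the fraction of draws that landed on each ball number."""
    counts = Counter(draws)
    return {value: counts[value] / len(draws) for value in range(1, 11)}


if __name__ == "__main__":
    freqs = relative_frequencies(draw_from_urn(10_000))
    for value, freq in freqs.items():
        # Each fraction should be close to, but not exactly, 1/10.
        print(value, round(freq, 3))
```

With only a handful of draws the fractions can stray quite far from 1/10; they settle closer to it as the number of draws grows.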

One of the most common distributions is the normal distribution. When it is plotted, the X coordinate represents the value a sample takes on, and the Y coordinate represents how frequently values near X occur (strictly speaking, the probability density). The mean, or average, sits at the center of its bell-shaped curve and is denoted MU. For a set of samples, it is calculated as the sum of all the sampled values of X divided by the number of samples, or MU = (X1 + X2 + … + Xn)/n. The standard deviation SIGMA describes the spread or dispersion of the curve, and is given by:

SIGMA = SQRT[((X1-MU)^2 + (X2-MU)^2 + (X3-MU)^2 + … + (Xn-MU)^2)/n]

where “^2” means “squared” and “SQRT[ ]” means “take the square root of the expression that follows in the brackets”. The equation for the normal distribution is:

Y = EXP[-(X-MU)^2/(2*SIGMA^2)] / (SIGMA * SQRT[2*PI])

where EXP[ ] is the exponential function, and PI = 3.14159…. For more details, type “normal distribution” into Wolfram Alpha at https://www.wolframalpha.com/.
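
To make the formulas for MU, SIGMA, and Y concrete, here is a small Python sketch. The function names are hypothetical and the sample values are invented for illustration. It assumes, as in the text, that the sum of squared differences is divided by n, and that the denominator of the normal curve is the whole product of SIGMA and the square root of 2*PI.

```python
import math


def mean(values):
    """MU: the sum of the values divided by how many there are."""
    return sum(values) / len(values)


def std_dev(values):
    """SIGMA: square root of the average squared distance from MU (divides by n)."""
    mu = mean(values)
    return math.sqrt(sum((x - mu) ** 2 for x in values) / len(values))


def normal_pdf(x, mu, sigma):
    """Y: height of the normal curve at x for a given MU and SIGMA."""
    exponent = -((x - mu) ** 2) / (2 * sigma ** 2)
    return math.exp(exponent) / (sigma * math.sqrt(2 * math.pi))


if __name__ == "__main__":
    samples = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up data for illustration
    mu = mean(samples)
    sigma = std_dev(samples)
    print("MU =", mu)
    print("SIGMA =", sigma)
    print("Y at the mean =", normal_pdf(mu, mu, sigma))
```

For these eight made-up values, the squared differences from the mean add up to 32, so MU works out to 5 and SIGMA to 2. The curve is tallest at X = MU and falls off symmetrically on either side.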

Sampling involves the way you collect data values during your statistical experiment. You need to know what kinds of information you want to collect, what the sources of your information will be, what sort of sampling will be used (random sampling is one desirable method), how the data will be collected (e.g., via questionnaire or door-to-door survey), when and how often the data will be collected, what you plan to compare the data to, and how you intend to analyze the data. You want to design your sampling strategy to eliminate as many biases as you can.
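
As one illustration of random sampling, the sketch below picks survey participants so that every member of the population has the same chance of being chosen. It is only a sketch: the list of 500 households, the sample size of 25, and the function name simple_random_sample are all invented for this example, and a real study would need a complete and accurate list of the population to draw from.

```python
import random


def simple_random_sample(population, sample_size, seed=None):
    """Pick sample_size distinct members, each equally likely to be chosen."""
    rng = random.Random(seed)  # pass a seed only if you need a repeatable draw
    return rng.sample(population, sample_size)


# Hypothetical population: 500 households identified by made-up labels.
households = [f"household-{i}" for i in range(1, 501)]

# Choose 25 households to survey; the seed is an arbitrary illustrative value.
chosen = simple_random_sample(households, 25, seed=42)

if __name__ == "__main__":
    for name in chosen:
        print(name)
```

Letting the computer choose removes one source of bias, namely the researcher's own preferences about whom to ask. It does not fix other problems, such as people who decline to answer.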

Poor sampling strategies have been the downfall of many statistical studies. They can provide unreliable results, invalidate your conclusions, and, if published, tarnish your reputation. Smart researchers consult statisticians not just on the sampling strategy, but on the overall methodology for their study, in hopes of avoiding such gotchas.

What have we learned about statistics? Statistics is not all that hard to understand. Statistical concepts like randomness, distributions, and sampling are straightforward. But statistics can be tricky to use correctly. Poorly designed and executed statistical studies can lead to spurious results. While most respectable researchers try to conduct reliable studies, biases can sneak in. Worse still, there are individuals out there who have every intention of misleading with statistics, and we’re going to call them out next week.