Simple Statistics: Dispersion

This is part 3 of our series on Simple Statistics; previous segments are available in our archives.

Yesterday, we talked about ways to measure the center of your data. Today, we’ll build on that by focusing on common ways to measure the spread of your data: the range, variance, and standard deviation. Knowing how far your data is spread out, particularly how far it is from the center, is important because it will help you understand the types of outcomes to expect in the future.

Measures of dispersion

The range is one of the easiest measures of dispersion to calculate. It is just the difference between the maximum and minimum values in your data set. Using my example from yesterday, the maximum and minimum transactions were $250.71 and $43.24, respectively. Therefore, the range, $207.47, gives us a quick indication that the transaction amounts are pretty spread out.
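As a quick illustration, here is a minimal Python sketch of the range calculation. The variable names are mine, and the list simply reuses the six transaction amounts that appear in the tables later in this issue.

```python
# Transaction amounts for Customers A through F (in dollars).
amounts = [52.87, 50.06, 61.34, 49.23, 43.24, 250.71]

# The range is the largest value minus the smallest value.
data_range = max(amounts) - min(amounts)

print(f"max = ${max(amounts):.2f}, min = ${min(amounts):.2f}")
print(f"range = ${data_range:.2f}")
```

Because the range only looks at the two most extreme values, a single unusual transaction is enough to change it dramatically.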

Though the range is helpful for knowing the “width” of your data, you still don’t have a sense as to where most values fall – are they close together or very spread out? That’s what the standard deviation and its related metric, the variance, tell you.[1] They measure how far your data points fall from the mean.

When measuring distances from the mean, by definition, you’ll have some values that are less than the mean and some that are greater, so some differences will be negative. One way to deal with the negative differences is to square them, which will result in only positive values. The mean of the squared differences is called the variance, often denoted 𝜎². In order to put the units back to what you started with, take the square root of the variance, which is called the standard deviation, often denoted 𝜎. In general, smaller standard deviations indicate your data points are closer together.
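Written as formulas, the description above looks like the following. The symbols N (the number of data points) and x_i (the i-th data point) are my own notation for this sketch, 𝜇 is the mean, and this is the population version of the calculation.

```latex
\mu = \frac{1}{N}\sum_{i=1}^{N} x_i,
\qquad
\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2,
\qquad
\sigma = \sqrt{\sigma^2}
```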

Let’s take a look at how these calculations play out using the transactional data from Doug’s Desserts, noting that we already computed the mean yesterday ($51.35):

Customer     Transaction Amount ($)   Mean ($)   Difference ($)   Squared Differences ($²)
Customer A                    52.87      51.35             1.52                       2.31
Customer B                    50.06      51.35            -1.29                       1.66
Customer C                    61.34      51.35             9.99                      99.80
Customer D                    49.23      51.35            -2.12                       4.49
Customer E                    43.24      51.35            -8.11                      65.77


Taking the mean of the squared differences, we find the variance of our data to be 34.81 dollars squared. By taking the square root, we find that the standard deviation is $5.90.
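If you would like to check these numbers yourself, here is a minimal Python sketch that follows the same steps as the table. The variable names are my own, and the code uses the unrounded mean, so intermediate values may differ from the table in the last digit.

```python
import math

# Transaction amounts for Customers A through E (in dollars).
amounts = [52.87, 50.06, 61.34, 49.23, 43.24]

# Step 1: the mean of the data.
mean = sum(amounts) / len(amounts)

# Step 2: squared difference between each value and the mean.
squared_diffs = [(x - mean) ** 2 for x in amounts]

# Step 3: variance is the mean of the squared differences (population version).
variance = sum(squared_diffs) / len(amounts)

# Step 4: standard deviation is the square root of the variance.
std_dev = math.sqrt(variance)

print(f"mean = {mean:.2f}")
print(f"variance = {variance:.2f} dollars squared")
print(f"standard deviation = ${std_dev:.2f}")
```

Rounded to two decimal places, the results should agree with the $51.35 mean, 34.81 variance, and $5.90 standard deviation quoted here.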

Now let’s look at the variance and standard deviation when we add in Customer F’s transaction, noting that the mean we calculated yesterday was $84.58:

Customer     Transaction Amount ($)   Mean ($)   Difference ($)   Squared Differences ($²)
Customer A                    52.87      84.58           -31.71                   1,005.52
Customer B                    50.06      84.58           -34.52                   1,191.63
Customer C                    61.34      84.58           -23.24                     540.10
Customer D                    49.23      84.58           -35.35                   1,249.62
Customer E                    43.24      84.58           -41.34                   1,709.00
Customer F                   250.71      84.58           166.13                  27,599.18

 

The variance of this data is 5,549.17 dollars squared and the standard deviation is $74.49, both much larger than in the previous calculation. The purpose of these metrics is to give you some insight into how spread out your data is, so they should get larger as the data gets more spread out, which they do!
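To see the two cases side by side, here is a short sketch using Python’s built-in statistics module. Its pvariance and pstdev functions compute the population versions used in this issue (variance and stdev are the sample counterparts); the labels and variable names are my own.

```python
import statistics

# Transaction amounts in dollars: Customers A-E, then with Customer F added.
first_five = [52.87, 50.06, 61.34, 49.23, 43.24]
all_six = first_five + [250.71]

results = {}
for label, data in [("A-E", first_five), ("A-F", all_six)]:
    variance = statistics.pvariance(data)  # population variance
    std_dev = statistics.pstdev(data)      # population standard deviation
    results[label] = (variance, std_dev)
    print(f"{label}: variance = {variance:,.2f}, standard deviation = ${std_dev:.2f}")
```

One large transaction is enough to move the standard deviation from under $6 to over $74, which is a good reminder that both metrics are sensitive to extreme values.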

Distributions

The metrics we’ve covered today are all related to a larger theme of the “distribution” of your data. Measures of central tendency don’t tell you the full story of what you can expect in your data, so you need to combine them with measures of dispersion to have a good sense of the types of outcomes you should expect in your data. In the example below, both the orange and red distributions[2] have the same mean 𝜇 (shown with the blue dashed line). But the orange distribution has a smaller standard deviation than the red distribution since its data points are closer together. Only looking at the mean as a metric of what to expect would be misleading since the means are the same, but the distributions of the data are very different!

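[Figure: Dispersion. Orange and red distributions with the same mean 𝜇 (blue dashed line) and different standard deviations.]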
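To put rough numbers on this idea, here is a small sketch using NormalDist from Python’s statistics module (Python 3.8 or later). The mean of 0 and the standard deviations of 1 and 2 are hypothetical values I picked for illustration; they are not taken from the figure, and share_within is a helper name of my own.

```python
from statistics import NormalDist

def share_within(dist, half_width):
    """Probability that a value lands within half_width of the distribution's mean."""
    return dist.cdf(dist.mean + half_width) - dist.cdf(dist.mean - half_width)

# Two normal distributions with the same mean but different spreads
# (hypothetical parameters, chosen only for illustration).
narrow = NormalDist(mu=0.0, sigma=1.0)
wide = NormalDist(mu=0.0, sigma=2.0)

narrow_share = share_within(narrow, 1.0)
wide_share = share_within(wide, 1.0)

print(f"narrow: {narrow_share:.1%} of values within 1 unit of the mean")
print(f"wide:   {wide_share:.1%} of values within 1 unit of the mean")
```

Under these assumptions, roughly 68% of the narrow distribution falls within one unit of the mean, compared with roughly 38% of the wide one, even though the two means are identical.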

Tomorrow we will start talking about how two variables are related to each other by looking at covariance and correlation.

Questions? Send any questions on data analytics or pricing strategy to doug@outlier.ai and I’ll answer them in future issues!

p.s. There was a typo yesterday: the median should have been $51.47, not $51.20, in the second example (thanks Phil!).

 

[1] The calculations of variance and standard deviation differ slightly depending on whether you are using the full population of data or only a small sample of it. With a sample, you divide the sum of squared differences by n − 1 rather than n, an adjustment called Bessel’s correction. In the examples today, I use the population version.

[2] These are examples of a commonly used distribution, the normal (or bell-curve) distribution.