Simple Statistics: Central Tendency

This is part 2 of our series on Simple Statistics; previous segments are available in our archives.

One of the goals of data analysis is to reduce lots of information down to easily digestible metrics. That’s where descriptive statistics can help! They can boil down huge amounts of data to a single, insightful number. One of the best places to start to understand your data is by measuring the “center” of your data set using the mean, median, and mode.

Measures of central tendency

There are lots of ways to measure the center point of your data. You are probably most familiar with the mean,[1] or the “average,” of a data set, 𝜇. Mathematically, the mean is calculated by summing all of the values in the data and then dividing by the number of observations. The median is the middle number of your data set, after sorting the values. Finally, the mode is the value that appears most frequently in your data set.
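The three definitions above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names and the sample numbers are made up for this example, and the even-count rule for the median follows the convention described in footnote [2].

```python
from collections import Counter


def mean(values):
    """Sum of the values divided by the number of observations."""
    return sum(values) / len(values)


def median(values):
    """Middle value after sorting; mean of the two middle values if the count is even."""
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def modes(values):
    """Every value that is tied for the highest frequency."""
    counts = Counter(values)
    top = max(counts.values())
    return [value for value, count in counts.items() if count == top]


sample = [2, 3, 3, 5, 7]  # made-up numbers, for illustration only
print(mean(sample), median(sample), modes(sample))
```

Returning a list from `modes` is a deliberate choice here, since a data set can have more than one most-frequent value.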

While there are multiple measures of central tendency, you want to use the right tool for the job. Suppose at my fictional company, Doug’s Desserts, you want to forecast how much revenue you should expect per transaction. Computing the “center” of your recent transactions would be a good estimate of what to expect in the near future.

Assume the following are a series of recent transactions from Doug’s Desserts:

Customer      Transaction Amount ($)
Customer A    52.87
Customer B    50.06
Customer C    61.34
Customer D    49.23
Customer E    43.24

The mean transaction amount ($51.35) is quite similar to the median ($50.06). Because the transactions are not too far apart, both the mean and the median give you a similar estimate of what to expect in the future. In this example, the mode is not helpful since each of the transaction values is unique – each transaction amount is a mode.
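As a quick check, these figures can be reproduced with Python's standard `statistics` module. The variable names below are illustrative; the amounts are the five transactions listed for Customers A through E.

```python
import statistics

# Transaction amounts for Customers A-E, taken from the list of recent transactions.
amounts = [52.87, 50.06, 61.34, 49.23, 43.24]

mean_amount = statistics.mean(amounts)
median_amount = statistics.median(amounts)
# multimode returns every value tied for most frequent (Python 3.8+).
mode_amounts = statistics.multimode(amounts)

print(f"mean:   {mean_amount:.2f}")
print(f"median: {median_amount:.2f}")
print(f"modes:  {mode_amounts}")
```

Because every amount occurs exactly once, `multimode` hands back all five values, which is the sense in which the mode is unhelpful here.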

Suppose that the next transaction at Doug’s Desserts is:

Customer      Transaction Amount ($)
Customer F    250.71

Including the new transaction, the mean jumps up to $84.58, but the median only moves to $51.47.[2] The large difference between them arises because the mean is more sensitive to outliers than the median. You could overestimate the amount of money you should expect per transaction if you only looked at the mean. Therefore, if your goal is to describe the most “typical” value to expect, then the median is your best bet, as it is much less sensitive to extreme values in your data.
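To see the outlier effect directly, the sketch below recomputes both measures from the listed amounts, before and after Customer F's transaction is added, and compares how far each one moves. The variable names are illustrative only.

```python
import statistics

before = [52.87, 50.06, 61.34, 49.23, 43.24]  # Customers A-E
after = before + [250.71]                      # plus Customer F

# How far each measure moves when the single large transaction arrives.
mean_shift = statistics.mean(after) - statistics.mean(before)
median_shift = statistics.median(after) - statistics.median(before)

print(f"mean moved by   {mean_shift:.2f}")
print(f"median moved by {median_shift:.2f}")
```

With six observations the median is the mean of the two middle values, so it can only shift as far as the neighboring transactions allow, while the mean absorbs a share of the full outlier.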

As its own metric, the mean is preferred to the median in cases where you want to take advantage of its mathematical properties.[3] However, you get the most value from computing both metrics: if they are very different, you probably have some outliers that you should take a closer look at.

Is the mode ever useful?

Given that the mode was not very helpful in either of the previous examples, you might be wondering when you would use this metric. While the mode can be interesting in other examples of continuous, numeric data, it is often most useful when looking at categorical data, i.e., discrete values. For example, suppose Customers A-E all bought cupcakes, but Customer F bought a wedding cake. Then the mode of order types would be cupcakes. In this example, the mode is useful for understanding the most common type of product purchased, while the mean and the median are not relevant or calculable.
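For categorical data like this, finding the mode is just a matter of counting. A minimal sketch, assuming the six orders are stored as plain strings (the list and variable names are made up for illustration):

```python
from collections import Counter

# Order types for Customers A-F: five cupcake orders and one wedding cake.
orders = ["cupcakes"] * 5 + ["wedding cake"]

counts = Counter(orders)
# most_common(1) returns a list holding the single (value, count) pair with the highest count.
most_common_type, frequency = counts.most_common(1)[0]

print(most_common_type, frequency)
```

Trying to average these labels would make no sense, which is why the mode is the natural measure of the center here.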

In most cases, these measures of central tendency will provide different results, though there are some special cases where they will all be the same. As shown in the example, whether these are the same or different depends on the dispersion, or distribution, of the values in your data set – our topic for tomorrow!

Questions? Send any questions on data analytics or pricing strategy to doug@outlier.ai and I’ll answer them in future issues!

 

[1] Did you know that there are multiple types of means? The one you are probably most familiar with is called the arithmetic mean, which is what we will focus on here. But if you want to learn more, check out the geometric and harmonic means as well.

[2] When there is an even number of observations, the median is typically taken to be the mean of the two “middle” values.

[3] For example, if you are given the mean value and number of observations, you can multiply them to get the total value. The same calculation does not generally apply to the median though. Using the example earlier, the total revenue of the six transactions is $507.45, or the mean (about $84.58) times six transactions. The median ($51.47) times six observations comes to only about $309.

The mean is also easy to update with new values: all you need is the mean, the number of observations, and the new value. In order to re-calculate the mean with Customer F’s transaction, I didn’t have to re-sum all the values and divide. I could simply multiply the mean ($51.35) times five transactions to get $256.75, add the new transaction ($250.71), and divide by six. To update the median though, you need to re-sort all of the observations to find the new middle value.
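The update rule described in this footnote can be written as a one-line function. This is a sketch under the assumption that you keep the unrounded mean; `update_mean` is a hypothetical helper name, not part of any library.

```python
def update_mean(old_mean, count, new_value):
    """Return the mean after adding one observation, without re-summing the data."""
    return (old_mean * count + new_value) / (count + 1)


# Unrounded mean of Customers A-E (their amounts sum to 256.74).
old_mean = 256.74 / 5

# Fold in Customer F's transaction.
new_mean = update_mean(old_mean, 5, 250.71)

# Mean times number of observations recovers the total revenue.
total = new_mean * 6

print(f"{new_mean:.3f} {total:.2f}")
```

Starting from the rounded mean ($51.35) instead gives a running total of $256.75, a cent off from the true sum, which is why it helps to carry the unrounded value between updates.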