Yesterday we talked about using a metric to determine whether two variables are related to each other. That is great for describing what has happened in the past, but how do we use that information to predict the future (last week’s topic)? One common technique is linear regression.
Linear regression is a fundamental statistical concept that is widely used to analyze the relationship between two (or more)¹ variables. I’ll go over some of the more important concepts related to linear regression today, but there are lots of details I won’t be able to cover. I encourage you to read more on your own to make sure you understand when and how to use linear regression.
The best way to start thinking about linear regression is that you are trying to understand how one variable, commonly called the dependent variable, depends on the value of another variable, commonly called the independent variable. You build your model based on the historical data you have available, and then, using the results of the trained model, you can predict the value of the dependent variable based on the independent variable.
Using the example data from Doug’s Desserts we’ve been looking at this week, the dependent variable would be the revenue from each transaction in my data and the independent variable would be the time since opening. I can use linear regression to answer the question, “how much more or less can I expect to make per transaction for every additional hour my store is open?” by finding the straight line that best fits my data and using that to predict future outcomes.
How does it work?
The crux of the algorithm is that a linear regression model tells you the optimal line to draw through your data. This line is found by minimizing the sum of the squared vertical distances between the data points and the modeled line, using a technique called least squares. So, out of the infinite number of straight lines one could draw through the data set, linear regression tells you the one that fits the data best. That sounds like magic! Not quite, just calculus, but still pretty amazing. The good news is that there are many tools out there that will solve this for you, so you don’t need to do it by hand. Here is what the best fit line looks like for the Doug’s Desserts data I’ve been showing you this week. I’ll show you how to create linear regression models and plots like this in future posts.
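To make the least squares idea concrete, here is a minimal sketch in plain Python. The data is made up for illustration (it is not the Doug’s Desserts data), and the function names `fit_line` and `sum_squared_error` are my own. For a single independent variable, the calculus works out to a simple closed-form answer, which is all this code computes.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = co-variation of x and y divided by the variation of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # The best-fit line always passes through the point (mean_x, mean_y).
    intercept = mean_y - slope * mean_x
    return intercept, slope


def sum_squared_error(xs, ys, intercept, slope):
    """Sum of squared vertical distances between the points and a line."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data: hours since opening and revenue per transaction.
hours = [0.0, 1.0, 2.0, 3.0, 4.0]
revenue = [4.0, 8.0, 8.0, 12.0, 13.0]

intercept, slope = fit_line(hours, revenue)
print(f"intercept={intercept:.2f}, slope={slope:.2f}")  # intercept=4.60, slope=2.20
```

Any other line you try, for example one with the same slope but a shifted intercept, gives a larger value from `sum_squared_error` on this data. That is what "best fit" means here.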
Reading the output
No matter what tool you use to create a linear model, you will see lots of numbers in the output, many of which might not make sense to you unless you are a statistician. The coefficient estimates of the model are always a great place to start in reading the output. In the case of Doug’s Desserts, the intercept was estimated as $25.09 and the coefficient for Time, which represents the slope of the best fit line, was estimated at $15.03. The intercept gives me a starting point—when the store opens I can expect transactions of $25.09. For each hour Doug’s Desserts is open, I can expect an extra $15.03 per transaction. For example, after being open for two hours, I should expect a total transaction amount of $25.09 + (2 * $15.03) = $55.15.
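The arithmetic above is all a fitted linear model does when it makes a prediction. As a small sketch, here it is in Python using the two coefficient estimates quoted above; the function name `predict_transaction` is my own and not part of any tool.

```python
# Coefficient estimates quoted in the text for the Doug's Desserts model.
INTERCEPT = 25.09       # expected transaction amount at opening, in dollars
SLOPE_PER_HOUR = 15.03  # expected change per additional hour open, in dollars


def predict_transaction(hours_open):
    """Predicted transaction amount after the store has been open `hours_open` hours."""
    return INTERCEPT + SLOPE_PER_HOUR * hours_open


print(f"${predict_transaction(2):.2f}")  # $55.15
```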
At this point, you may be thinking that the red line in the example seems to fit pretty poorly since the line does not look all that close to the data points. Even though it is the line that fits best, it might be a bad idea to predict my revenue based on it. The standard errors, which estimate how much each coefficient estimate would vary from one sample of data to another, help tell you how much you should trust your coefficients. The smaller the standard error relative to the coefficient, the more precise the estimate. In my example, the standard errors are large, which gives me little confidence that the model is actually worth using. Much of that uncertainty comes from Customer F’s anomalously large transaction!
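To see how a single unusual transaction can inflate a standard error, here is a rough sketch using the textbook formula for the standard error of the slope in a one-variable regression. Both data sets are invented for illustration (they are not the Doug’s Desserts data), and the function name is my own.

```python
import math


def slope_and_standard_error(xs, ys):
    """Return (slope, standard error of the slope) for a one-variable regression."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    intercept = mean_y - slope * mean_x
    # Residual variance: squared misses divided by n - 2, because two
    # coefficients (intercept and slope) were estimated from the data.
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    residual_variance = sse / (n - 2)
    return slope, math.sqrt(residual_variance / sxx)


# Hypothetical data: the two revenue lists differ only in the last transaction.
hours = [0, 1, 2, 3, 4, 5]
typical = [10, 12, 13, 15, 17, 18]
with_outlier = [10, 12, 13, 15, 17, 90]

slope_a, se_a = slope_and_standard_error(hours, typical)
slope_b, se_b = slope_and_standard_error(hours, with_outlier)
print(f"typical:      slope={slope_a:.2f}, standard error={se_a:.2f}")
print(f"with outlier: slope={slope_b:.2f}, standard error={se_b:.2f}")
```

In the second data set the one large value both pulls the slope upward and makes its standard error many times larger, so the estimate deserves much less trust.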
Finally, you will often hear about the R-squared of a regression, which measures the share of the variation in the dependent variable that is explained by the independent variable. In general, the higher the R-squared, the better the fit. The R-squared in our example is 0.26, meaning 26% of the variation in revenue is explained by the time of purchase. There is still a lot left unexplained! However, a low R-squared does not immediately discredit a model; it really depends on what you are studying.
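For the curious, R-squared is simple to compute once you have the fitted line: compare how much the line misses by with how much the data varies around its own average. The sketch below uses invented data, so its result has nothing to do with the 0.26 from my example, and the function name `r_squared` is my own.

```python
def r_squared(xs, ys):
    """Share of the variation in ys explained by a least-squares line on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    intercept = mean_y - slope * mean_x
    # Unexplained variation: squared distances from the fitted line.
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    # Total variation: squared distances from the plain average of ys.
    sst = sum((y - mean_y) ** 2 for y in ys)
    return 1 - sse / sst


# Hypothetical data: hours since opening and revenue per transaction.
hours = [0.0, 1.0, 2.0, 3.0, 4.0]
revenue = [4.0, 8.0, 8.0, 12.0, 13.0]

r2 = r_squared(hours, revenue)
print(f"R-squared: {r2:.2f}")  # R-squared: 0.93
```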
I hope this week’s review of simple statistics has been useful, but you should not use these metrics blindly. Even though the math to calculate them is well understood, the trick to using them effectively is in their interpretation. A great example of this nuance is Anscombe’s quartet, which shows how you can end up with the same statistical measurements from very different data sets. It is always a good idea to sanity check your models and results, and using charts is a good way to get started. That’s why next week I’ll talk about some ways to visually communicate the ideas we’ve talked about this week, plus some general thoughts on data visualization. The following week, I’ll talk about the tools you can use to calculate and visualize your results.
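If you want to check the Anscombe’s quartet claim yourself, here is a small sketch that fits a line to two of the four data sets. The values are typed in as I recall them from the published data set, so treat them as an assumption and verify them against a trusted source. The helper `fit_line` is my own.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    return mean_y - slope * mean_x, slope


# First two data sets of Anscombe's quartet (they share the same x values).
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y_first = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y_second = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]

fit_first = fit_line(x, y_first)
fit_second = fit_line(x, y_second)
print(f"first:  intercept={fit_first[0]:.2f}, slope={fit_first[1]:.2f}")
print(f"second: intercept={fit_second[0]:.2f}, slope={fit_second[1]:.2f}")
```

Both fits come out with an intercept of about 3.00 and a slope of about 0.50, yet a scatter plot shows the first set is roughly linear with noise while the second follows a smooth curve. The numbers alone would never tell you that.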
¹ Though linear regression can be applied to describe the relationship between more than two variables, it’s easiest to conceptualize what is going on by only considering two variables, so that will be my focus today.