Data Visualization: Plotting Regression Results

This is part 3 of our series on Data Visualization. Previous segments are available in our archives.

When we discussed linear regression last week, we focused on a model that only had two variables. In that case it was easy to interpret and plot the results on top of a scatterplot. Most models you’ll create have more than two dimensions, making it hard to plot the regression coefficients on top of the data. Today, I’ll show you my preferred way of plotting this data, using the same simulated data I used yesterday for my hypothetical company, Doug’s Desserts.

Making sense of regression output

I’ve created a model that predicts revenue based on the time of the transaction, the channel of the transaction, and the interaction of these two variables. Here are the metrics from last week:

Coefficient               Estimate   Standard Error
Intercept                    44.71             1.33
Time                          0.58             0.37
Channel (Store)              -6.02             1.85
Time x Channel (Store)       -0.82             0.51

R-squared:  0.08523
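Written out as an equation, the estimates in the table describe the fitted model below. This is a sketch that assumes “Channel (Store)” is an indicator variable equal to 1 for store purchases and 0 for online purchases, and that Time is a numeric predictor. The symbol names are mine, not part of the original output.

```latex
\begin{aligned}
\widehat{\text{Revenue}} &= 44.71 + 0.58\,\text{Time} - 6.02\,\text{Store} - 0.82\,(\text{Time} \times \text{Store}) \\
\text{Online } (\text{Store} = 0):\quad \widehat{\text{Revenue}} &= 44.71 + 0.58\,\text{Time} \\
\text{Store } (\text{Store} = 1):\quad \widehat{\text{Revenue}} &= 38.69 - 0.24\,\text{Time}
\end{aligned}
```

Under that assumption, the “Time” coefficient is the slope for online purchases, and the interaction coefficient is how much that slope changes for store purchases.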

Note how this model has four coefficient estimates, shown in the “Estimate” column of the table, instead of the two we talked about last week, so we can’t make the same plot as before to show the linear relationship between two variables. Instead, in order to plot the coefficients and understand their significance, I like to plot error bars.
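The intervals behind those error bars can be approximated from the table alone. The Python sketch below computes each one as the estimate plus or minus 1.96 standard errors. The 1.96 multiplier is my assumption (a normal approximation); the exact intervals would use a t-distribution with the model's residual degrees of freedom, which the output above does not report. The function and variable names are hypothetical.

```python
# Estimates and standard errors copied from the regression table above.
COEFFICIENTS = {
    "Intercept": (44.71, 1.33),
    "Time": (0.58, 0.37),
    "Channel (Store)": (-6.02, 1.85),
    "Time x Channel (Store)": (-0.82, 0.51),
}

# Assumption: normal approximation for a 95% interval. The exact multiplier
# depends on the residual degrees of freedom, which are not shown here.
Z_95 = 1.96


def confidence_interval(estimate, std_error, z=Z_95):
    """Return (lower, upper) bounds of the approximate confidence interval."""
    half_width = z * std_error
    return estimate - half_width, estimate + half_width


def includes_zero(lower, upper):
    """True when zero lies inside the interval [lower, upper]."""
    return lower <= 0.0 <= upper


for name, (estimate, std_error) in COEFFICIENTS.items():
    lower, upper = confidence_interval(estimate, std_error)
    print(f"{name:<24} {estimate:>7.2f}  [{lower:>7.2f}, {upper:>7.2f}]  "
          f"includes zero: {includes_zero(lower, upper)}")
```

With these inputs, the approximate intervals for “Time” and “Time x Channel (Store)” straddle zero, while those for “Intercept” and “Channel (Store)” do not.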

[Figure: coefficient estimates plotted as dots with 95% confidence interval error bars]

The error bars represent the 95% confidence interval of each coefficient estimate, while the dot marks the point estimate itself. When the confidence interval of a coefficient includes zero, there is not enough evidence to conclude that the coefficient is different from zero, so the variable is not a good predictor of the dependent variable.[1] In this model, the intervals for two of the terms, “Time” and “Time x Channel (Store)”, include zero, so they are not good predictors of revenue and I’ve colored them light grey. The “Intercept” and “Channel (Store)” confidence intervals do not include zero, so these are good predictors of revenue[2] and are shown in black. From this we know that the channel used to purchase goods from Doug’s Desserts predicts how much revenue the business will earn. In this case, because the “Channel (Store)” coefficient is negative, we can conclude that purchases from the store generate less revenue than online sales.
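For readers who want to draw a similar chart, here is a minimal sketch using Python and matplotlib. It is not necessarily the tool or layout used for the plot above: the horizontal orientation, the 1.96 multiplier (a normal approximation to the 95% interval), the output filename, and all function and variable names are my assumptions. The estimates and standard errors come from the regression table.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so the script runs without a display
import matplotlib.pyplot as plt

# (term, estimate, standard error) copied from the regression table.
ROWS = [
    ("Intercept", 44.71, 1.33),
    ("Time", 0.58, 0.37),
    ("Channel (Store)", -6.02, 1.85),
    ("Time x Channel (Store)", -0.82, 0.51),
]


def plot_coefficients(rows, z=1.96):
    """Draw one dot plus a horizontal error bar per coefficient.

    Assumption: z=1.96 approximates the 95% interval; an exact interval
    would use the t-distribution for the model's residual degrees of freedom.
    """
    fig, ax = plt.subplots(figsize=(7, 3))
    for y, (name, estimate, std_error) in enumerate(rows):
        half_width = z * std_error
        crosses_zero = estimate - half_width <= 0 <= estimate + half_width
        # Light grey when the interval includes zero, black otherwise.
        color = "lightgrey" if crosses_zero else "black"
        ax.errorbar(estimate, y, xerr=half_width, fmt="o",
                    color=color, capsize=4)
    # Dashed reference line at zero makes the crossing easy to see.
    ax.axvline(0, linestyle="--", color="grey", linewidth=1)
    ax.set_yticks(range(len(rows)))
    ax.set_yticklabels([name for name, _, _ in rows])
    ax.invert_yaxis()  # keep the table's top-to-bottom order
    ax.set_xlabel("Coefficient estimate (95% confidence interval)")
    fig.tight_layout()
    return fig, ax


fig, ax = plot_coefficients(ROWS)
fig.savefig("coefficients.png", dpi=150)  # hypothetical output filename
```

The dashed line at zero is a design choice that lets a reader see at a glance which intervals cross it, without reading the axis values.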

There is a ton of information packed into the regression output and I’ve only talked about how to visualize one part of it. I encourage you to read more about what each value in the output means.

Tomorrow, I’ll talk about a common pitfall when using a single plot to show the relationship between two measurements: dual-scaled axes.

Questions? Send any questions on data analytics or pricing strategy to doug@outlier.ai and I’ll answer them in future issues!

1 In statistics speak, we’d say that we failed to reject the null hypothesis that this coefficient is equal to zero.

2 In statistics speak, we’d say that we rejected the null hypothesis that this coefficient is equal to zero.