# Simple Statistics: Data Relationships

This is part 4 of a five-part series on Simple Statistics.

The metrics we’ve discussed so far are used to understand the nuances of a single variable. While that’s a great start to your analysis, you often also want to know how variables relate to one another. The **covariance** and **correlation** tell you this, so you can understand which of your actions may affect other outcomes.

## Measures of relationship

The covariance and correlation are metrics that provide insight into the amount of linear¹ relationship between two variables. When these metrics are positive, the variables move in the same direction: when one goes up, the other tends to go up as well. When the metrics are negative, the variables move in opposite directions: when one goes up, the other tends to go down. When the metrics are close to zero, there is not much of a linear relationship between the two variables.
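To make the three cases concrete, here is a minimal sketch using made-up numbers (they are not from the shop example) and a hypothetical helper, `covariance`, that applies the mean-of-products calculation described in the next paragraph. The last case illustrates note 1: `curved` is completely determined by `x`, yet the covariance is zero because the relationship is not linear.

```python
from statistics import fmean

def covariance(xs, ys):
    """Population covariance: mean of the products of deviations from the mean."""
    mean_x, mean_y = fmean(xs), fmean(ys)
    return fmean((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

x = [-2, -1, 0, 1, 2]                     # illustrative values only

same_direction = [2 * v + 1 for v in x]   # rises as x rises
opposite_direction = [-v for v in x]      # falls as x rises
curved = [v * v for v in x]               # depends on x, but not linearly

print(covariance(x, same_direction))      # positive
print(covariance(x, opposite_direction))  # negative
print(covariance(x, curved))              # zero
```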

The covariance is calculated by computing, for each observation, the difference between each variable’s observed value and that variable’s mean, multiplying the two differences, and taking the mean² of the products. For example, suppose I want to know if there is a relationship between the amount each customer spends in my shop and the time of day (measured in hours since the store opened that day).

Customer | Transaction ($) | Mean ($) | Difference ($) |
---|---|---|---|
Customer A | 52.87 | 84.58 | -31.71 |
Customer B | 50.06 | 84.58 | -34.52 |
Customer C | 61.34 | 84.58 | -23.24 |
Customer D | 49.23 | 84.58 | -35.35 |
Customer E | 43.24 | 84.58 | -41.34 |
Customer F | 250.71 | 84.58 | 166.13 |

Customer | Time Since Store Opening (hours) | Mean (hours) | Difference (hours) |
---|---|---|---|
Customer A | 0.75 | 3.96 | -3.21 |
Customer B | 1.00 | 3.96 | -2.96 |
Customer C | 3.00 | 3.96 | -0.96 |
Customer D | 5.50 | 3.96 | 1.54 |
Customer E | 6.50 | 3.96 | 2.54 |
Customer F | 7.00 | 3.96 | 3.04 |

Now we multiply the differences.

Customer | Difference ($) | Difference (hours) | Product |
---|---|---|---|
Customer A | -31.71 | -3.21 | 101.79 |
Customer B | -34.52 | -2.96 | 102.18 |
Customer C | -23.24 | -0.96 | 22.31 |
Customer D | -35.35 | 1.54 | -54.44 |
Customer E | -41.34 | 2.54 | -105.00 |
Customer F | 166.13 | 3.04 | 505.04 |

If we take the mean of the “Product” column, we find the covariance to be 95.31 (note how Customer F’s transaction pulls the mean higher; this will be relevant tomorrow). From the positive sign, we conclude that transaction size tends to go up as the time since opening goes up. The magnitude of 95.31 is harder to interpret, though, because it depends on the units in which the data are measured. Suppose I measured the time since opening in minutes instead of hours. The covariance jumps to 5,718.71 even though the underlying data did not change.
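The calculation can be reproduced with a short sketch in Python. The function name `population_covariance` and the variable names are illustrative choices, not part of the original walkthrough. The sketch uses the population form (dividing by the number of observations), which is the form that gives 95.31; the sample form mentioned in note 2 would divide by one fewer.

```python
from statistics import fmean

# Data from the tables above.
transactions = [52.87, 50.06, 61.34, 49.23, 43.24, 250.71]  # dollars
hours_open = [0.75, 1.00, 3.00, 5.50, 6.50, 7.00]           # hours since opening

def population_covariance(xs, ys):
    """Mean of the products of each pair's deviations from their means."""
    mean_x, mean_y = fmean(xs), fmean(ys)
    products = [(x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)]
    return fmean(products)

cov_hours = population_covariance(transactions, hours_open)

# Same data, with time expressed in minutes instead of hours.
minutes_open = [h * 60 for h in hours_open]
cov_minutes = population_covariance(transactions, minutes_open)

print(f"{cov_hours:.2f}")    # 95.31
print(f"{cov_minutes:.2f}")  # 5718.71
```

Converting hours to minutes multiplies every time deviation by 60, so the covariance is multiplied by 60 as well.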

That is where correlation comes in handy. If you divide the covariance by the product of the standard deviations of the two variables, you get a normalized value (*i.e.*, one that does not depend on the scale of the data) that is always between -1 and 1, called the correlation coefficient. The correlation coefficient is independent of the units of measurement: whether I measure the time since opening in hours or minutes, it is 0.51.
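As a rough check of the unit-independence claim, the sketch below divides the population covariance by the product of the population standard deviations. The helper names `population_covariance` and `correlation` are illustrative assumptions; `fmean` and `pstdev` come from Python's standard `statistics` module.

```python
from statistics import fmean, pstdev

# Data from the tables above.
transactions = [52.87, 50.06, 61.34, 49.23, 43.24, 250.71]  # dollars
hours_open = [0.75, 1.00, 3.00, 5.50, 6.50, 7.00]           # hours since opening

def population_covariance(xs, ys):
    """Mean of the products of each pair's deviations from their means."""
    mean_x, mean_y = fmean(xs), fmean(ys)
    return fmean((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

def correlation(xs, ys):
    """Covariance divided by the product of the two standard deviations."""
    return population_covariance(xs, ys) / (pstdev(xs) * pstdev(ys))

r_hours = correlation(transactions, hours_open)
r_minutes = correlation(transactions, [h * 60 for h in hours_open])

print(f"{r_hours:.2f}")    # 0.51
print(f"{r_minutes:.2f}")  # 0.51
```

Switching to minutes scales both the covariance and the standard deviation of the time variable by 60, so the factor cancels in the ratio.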

## Correlation is not causation

You’ve probably read or heard somewhere that “correlation does not imply causation” and seen examples of spurious correlation. This is a reminder that many variables can be related to one another without one causing the other, so you should consider several of them in your model before drawing conclusions.

[1] These metrics measure the linear relationship between variables. Even when covariance and correlation are close to 0, the variables could have a relationship, just not a linear one.

[2] Similar to variance, there is a technical difference between the calculation if you are using the full population of data rather than a small sample of data.