Trendlines

This is part 1 of a 5 part series on Trendlines.

One of the best and most versatile tools for analyzing data is the trendline. Trendlines capture the aggregate behavior of a data set and are useful in applications ranging from ad campaign optimization to data auditing. They can help you identify problems in your business or predict the future.

Consider this data, which, like much real-world data, is very noisy:

[Chart: raw, noisy data]

Now, look at that same data with a trendline:

[Chart: the same data with a trendline]

As you can see, the trendline captures the aggregate movement of the data, making it much easier to see the overall trend.

There are many mathematical approaches to creating trendlines, ranging from simple (moving average) to complex (ARIMA), and the approach that works best depends largely on the type and nature of your data. Choosing the right approach can be the difference between an informative trendline and a very misleading faux trend.

This week we will survey a variety of ways you can create trendlines and when each approach might be appropriate.

Tomorrow we’ll get started with the simplest form of trendline, moving averages.

Trendlines: Moving Averages

This is part 2 of a 5 part series on Trendlines.

The simplest form of trendline is the moving average. It is called a moving average because you choose a window (known as the period) and for each point of your data you average the points around it to create a new, averaged value. You move this window along your data to generate a new series of averaged points which are less noisy than the raw values, and more representative of your data’s natural pattern.
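To make the window idea concrete, here is a minimal Python sketch. It assumes a trailing window (each point is averaged with the points just before it), which is one common convention; a centered window is equally valid. The function name `moving_average` and the sample numbers are made up for illustration.

```python
def moving_average(values, period):
    """Trailing moving average.

    Each output averages the current point and the (period - 1) points
    before it. Positions without a full window are returned as None.
    """
    if period < 1:
        raise ValueError("period must be at least 1")
    averaged = []
    for i in range(len(values)):
        if i + 1 < period:
            averaged.append(None)  # not enough earlier points yet
        else:
            window = values[i + 1 - period : i + 1]
            averaged.append(sum(window) / period)
    return averaged


data = [4, 8, 6, 10, 12, 8, 14]  # illustrative numbers, not real data
smoothed = moving_average(data, period=3)
print(smoothed)
```

Raising `period` smooths more aggressively, but it also leaves more empty positions at the start of the series.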

For example, the following shows how the moving average is calculated over a series of numbers using a period of 3:

[Chart: moving average calculation with a period of 3]

By moving the window along the data, you reach an averaged value for each point. This averaging will smooth your data, creating a trendline that captures the natural trend of your data. Below is a moving average (period of 10) on our raw data from yesterday:

[Chart: raw data with a moving average, period of 10]

The longer the period you choose, the smoother the data and the clearer the trend.

There are obvious advantages and disadvantages to using moving averages for trendlines.

Pros

• Moving averages are easy to calculate and quickly give you a general idea of the movement of your data, including cyclical changes.

Cons

• You have to choose the period to use, which could introduce bias.
• As you can see in the chart above, a moving average can’t estimate values at the beginning of your data (as there aren’t enough points to average). You can shorten the period for those points, but then their values will be calculated differently from the other points and will potentially be less reliable.
• Moving averages are great for explaining existing data, but have no predictive properties so you cannot easily use them to estimate what will happen next.

Tomorrow we’ll cover a more powerful technique for adding trendlines to your data: linear and polynomial regressions.

“Some problems are so complex that you have to be highly intelligent and well informed just to be undecided about them.”
– Laurence Johnston Peter, Peter’s Almanac (1982)

Trendlines: Regressions

This is part 3 of a 5 part series on Trendlines.

Linear and Polynomial Regression Trendlines

Linear regressions are the most common approach to adding trendlines to data, as they are both very flexible and well suited to predicting future values. We’ve covered linear regressions previously, so I won’t review how to calculate them here; instead, we’ll discuss how to use them effectively for trendlines.

While polynomial regressions require fitting more coefficients to the input values than a linear regression does, they can be very effective in estimating trends. The biggest challenge is deciding what degree of polynomial is the right fit for your data. A general rule of thumb is that every degree you add to your polynomial adds another bend to the curve of your trendline. The following is our raw data with a 2nd degree polynomial trendline:

[Chart: raw data with a 2nd degree polynomial trendline]

And here is the same data with a 6th degree polynomial:

[Chart: raw data with a 6th degree polynomial trendline]

You can see the 6th degree version has significantly more curves in it. Whatever software you use to create a polynomial regression trendline will ask you to choose the degree, and it’s important to choose wisely! Notice that the first example (2nd degree) shows the trend increasing at the end, while the second (6th degree) shows it decreasing at the end. Depending on the degree you choose, the trendline can tell a different story about what’s likely to happen next!

Speaking of which, let’s cover the pros and cons of this approach to trendlines.

Pros

• Linear and polynomial regressions emphasize the trend in the trendline, allowing you to see the overall movement of the data. If the data is increasing or decreasing, it should become clear.
• The output of a linear or polynomial regression is a formula that describes your data, so you can easily predict future values using that same formula. This means your trendline can double as a short-term prediction!

Cons

• As we mentioned, choosing the degree of the polynomial in your regression is critical. Too high and you will over-fit your data and it will be no better than a moving average. Too low and it might not accurately reflect the movement of the data.
• If there are significant shifts in the middle of your data, such as changes in data definitions or collection practices, the regression model will have trouble adjusting. As a result you can only run a regression on fairly clean data to ensure good results.
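The points above about fitting a formula and choosing a degree can be sketched in a few lines of Python. This is a minimal, dependency-free illustration that solves the least-squares normal equations directly; it is not how any particular charting tool does it, and the normal equations become numerically unreliable for high degrees. The names `fit_polynomial` and `predict` and the sample data are made up for this example.

```python
def fit_polynomial(xs, ys, degree):
    """Least-squares polynomial fit via the normal equations.

    Returns coefficients [c0, c1, ..., c_degree] for
    y = c0 + c1*x + ... + c_degree * x**degree.
    """
    n = degree + 1
    # Normal equations: a[r][c] = sum(x^(r+c)), b[r] = sum(y * x^r)
    a = [[sum(x ** (r + c) for x in xs) for c in range(n)] for r in range(n)]
    b = [sum(y * x ** r for x, y in zip(xs, ys)) for r in range(n)]
    # Gaussian elimination with partial pivoting
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
            b[r] -= factor * b[col]
    # Back substitution
    coeffs = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(a[r][c] * coeffs[c] for c in range(r + 1, n))
        coeffs[r] = (b[r] - tail) / a[r][r]
    return coeffs


def predict(coeffs, x):
    """Evaluate the fitted formula at x."""
    return sum(c * x ** i for i, c in enumerate(coeffs))


xs = list(range(10))
ys = [3, 5, 4, 6, 8, 7, 9, 12, 11, 14]  # illustrative numbers, not real data
line = fit_polynomial(xs, ys, degree=1)
curve = fit_polynomial(xs, ys, degree=2)
# Each formula can be evaluated one step past the data (x = 10)
print(predict(line, 10), predict(curve, 10))
```

Comparing the two predictions at `x = 10` is a quick way to see how much the story about "what happens next" depends on the degree you picked.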

Tomorrow we’ll cover an advanced way to add trendlines to data which combines both the moving average and the linear regression: the auto-regressive integrated moving average (ARIMA).

“Once you make a decision, the universe conspires to make it happen.”

Trendlines: ARIMA Trendlines

This is part 4 of a 5 part series on Trendlines.

One of the more advanced, generalized methods for analyzing trends in time series data is the ARIMA model; ARIMA is an acronym for Autoregressive Integrated Moving Average. These models are particularly helpful for understanding how data can be predicted over time while controlling for cyclical, seasonal, and potentially irregular changes. Let’s start by breaking the model down into its three components to understand what it is doing, then look at an example, and finally talk about the pros and cons.

Beginning in the middle, the term integrated (I) refers to the number of times the model takes a difference of data points. A difference of points is calculated by subtracting the previous value from each value. This is done to highlight the changes left unexplained by trends and seasonality.
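As a small illustration of differencing, here is a Python sketch using the usual convention of subtracting the previous value from each value. The function name `difference` and the sales figures are made up for this example.

```python
def difference(values, order=1):
    """Apply differencing `order` times.

    Each pass replaces the series with value[t] - value[t-1],
    so the result is one element shorter per pass.
    """
    result = list(values)
    for _ in range(order):
        result = [curr - prev for prev, curr in zip(result, result[1:])]
    return result


sales = [10, 12, 15, 19, 24]  # illustrative numbers, not real data
print(difference(sales))           # [2, 3, 4, 5]
print(difference(sales, order=2))  # [1, 1, 1]
```

Here the raw series climbs steadily, the first difference still climbs, and the second difference is flat, which is the kind of stable series the remaining components of the model are meant to work with.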

Next, the moving average (MA) model (which is actually different from the moving average calculation we discussed previously) attempts to explain the values that result from taking the difference of points as movements around an average value. These movements are modeled as their own regression, in which each value depends on the random errors (shocks) from recent periods.

Finally, the autoregressive (AR) model is one where the values each day are assumed to be linked together. The structure of the model is very similar to the typical linear regression we talked about yesterday: in an autoregressive model, the predicted value today depends on its value on the previous day(s), plus an error term to account for shocks to the system.
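To show what "today depends on yesterday" looks like in practice, here is a minimal Python sketch of the simplest case, where each value depends only on the one before it. It estimates the relationship with ordinary least squares, which is a simplification; statistical packages typically estimate ARIMA models with more sophisticated methods. The function name `fit_ar1` and the sample series are made up for this example.

```python
def fit_ar1(values):
    """Fit value[t] = intercept + slope * value[t-1] by least squares."""
    prev = values[:-1]  # yesterday's values
    curr = values[1:]   # today's values
    n = len(prev)
    mean_prev = sum(prev) / n
    mean_curr = sum(curr) / n
    cov = sum((p - mean_prev) * (c - mean_curr) for p, c in zip(prev, curr))
    var = sum((p - mean_prev) ** 2 for p in prev)
    if var == 0:
        raise ValueError("series is constant; slope is undefined")
    slope = cov / var
    intercept = mean_curr - slope * mean_prev
    return intercept, slope


series = [20.0, 22.0, 21.0, 23.0, 22.5, 24.0, 23.5, 25.0]  # illustrative
intercept, slope = fit_ar1(series)
# One-step-ahead prediction from the last observed value
next_value = intercept + slope * series[-1]
print(intercept, slope, next_value)
```

Depending on more than one previous day means adding more lagged values as inputs to the same kind of regression.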

So, now that we know what each part of an ARIMA model does, let’s think about an example to make it all more clear. Suppose you had data for the sale of a particular good each day. This might be a good candidate for an ARIMA model because the sales of products are generally consistent each day, but have weekly cyclicality and seasonal changes. Plus, there are often shocks to the system that cause abnormal changes that persist for a short period of time, e.g., coupon offers or advertising campaigns.

Pros

• ARIMA models are extremely flexible and can be used in situations where your data has cyclical and seasonal changes, and the potential for irregular shocks.

Cons

• You need to determine how many time periods to use for each component. For example, do the sales today depend on just the sales yesterday, or both yesterday and the day before, or further back in time?
• These models only work on time series data.

“If it be right, do it boldly; if it be wrong, leave it undone.”

Trendlines: Detecting Real Trends

This is part 5 of a 5 part series on Trendlines.

When Not to Use Trendlines

While you can, in theory, add a trendline to almost any data, it is not always a good idea. The techniques we have reviewed this week are simply mathematics, and mathematics can be applied to any group of numbers. Take, for example, the following random data:

[Chart: random data]

This is completely random data. Does a trendline make sense for this data? Absolutely not; it’s random! Can I still add a trendline? Sure, why not? Let’s add a 2nd degree polynomial regression:

[Chart: random data with a 2nd degree polynomial trendline]

If you can add a nice trendline like this to random data, the potential for using trendlines to mislead should be apparent. Using data to make decisions relies on clear communication of reliable results (see our series on Data Storytelling), so if the mathematics won’t stop you from adding trendlines, you will need to use your judgment.

So, how do you decide if adding a trendline makes sense? Here are a few questions to ask:

• Is there really a trend? If you can’t see any trend at all without the trendline, chances are there is no real trend there at all. Trendlines are better at helping you understand trends that are already there, not finding trends hiding in noisy data.
• Does the math check out? R-squared (aka the Coefficient of Determination) is a measure of how well a line fits the actual data. The lower the R-squared value, the worse the fit. The trendline in the above random data had an R-squared value of 0.03, which means it doesn’t really fit at all, so adding a trendline doesn’t make sense. Note, however, that a low R-squared does not automatically discredit a model; it really depends on what you are studying.
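If your software doesn’t report R-squared, it is simple to compute from the actual values and the values your trendline produces at the same points. The sketch below uses the standard definition (one minus the ratio of residual variation to total variation); the function name `r_squared` and the numbers are made up for illustration.

```python
def r_squared(actual, fitted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - f) ** 2 for a, f in zip(actual, fitted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    if ss_tot == 0:
        raise ValueError("actual values are constant; R-squared is undefined")
    return 1 - ss_res / ss_tot


actual = [3, 5, 4, 6, 8]             # illustrative data points
fitted = [3.2, 4.2, 5.2, 6.2, 7.2]   # illustrative trendline values
print(r_squared(actual, fitted))
```

A value near 1 means the trendline tracks the data closely, while a value near 0 means it explains little more than a flat line drawn at the average would.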

The addition of trendlines to your data is, in the end, a judgment that you will make as a data user. Choose carefully.

In Review: Trendlines are a powerful tool for understanding data, enabling you to do things ranging from anomaly detection to predicting the future. There are a number of different ways you can calculate trendlines, and you will need to use your judgment to decide which method to use and whether a trendline is appropriate for your data at all.

Enjoy your 4th of July! Next Tuesday is the 4th of July holiday here in the United States, so we are going to take a break from sending you a daily email. Look for your next Data Driven Daily on Monday, July 10. Happy 4th!

“Mathematical and absolute certainty is seldom to be attained in human affairs”