Last week, we talked about the value of understanding measures of both central tendency and dispersion in your data. Today, I’ll show you some ways I like to explore and visualize this information using simulated data from Doug’s Desserts, my hypothetical company. I’ll look at a data set of 1,000 transactions (one per row) that records each transaction’s amount, time (measured in hours since the store opened), and whether the purchase was made in the store or online.
Looking at the raw data
Before I try to summarize my data, either numerically or with a chart, I always start by visualizing the raw data. This step gives me a frame of reference for the type of data I will be working with and helps me understand the numerical results I’ll calculate. A scatterplot is a great place to start because it shows all of the observations in a single chart and begins to give you an idea of the central tendency and dispersion of your data.
One limitation with this type of plot is that many of the data points are overplotted and all blend together into clumps of black ink. Also, you get a general idea of how the data is distributed, but the trends in your data are not very clear. To address these challenges, I like to use some other techniques to improve this chart.
First, to alleviate the overplotting, I lower the opacity of the points so that they are semi-transparent. Now a section of dark color indicates a concentration of points. I also add “rugs” to the axes, again semi-transparent, to give me a better idea of where the center of the data lies on each axis.
One last step that I’ll often take is to color my data points by categorical variables, in this case Channel.
This chart is starting to give me some insight into what is going on. It looks like the red points, i.e., online transactions, might have slightly higher revenue than the blue points, i.e., transactions in the store, but are intermixed on the Time axis. We’ll need to explore further.
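If you want to try the opacity, rug, and color steps yourself, here is a minimal sketch in Python with matplotlib, which is not necessarily the tool behind the charts in this post. The data is a made-up stand-in generated with NumPy, not the Doug’s Desserts data set; the gamma parameters, the 12-hour day, and the output file name are all assumptions chosen for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

# Made-up stand-in for 1,000 transactions (NOT the article's data).
rng = np.random.default_rng(0)
n = 1000
channel = rng.choice(["Online", "Store"], size=n)
time = rng.uniform(0, 12, size=n)  # hours since opening (assumed 12-hour day)
revenue = np.where(
    channel == "Online",
    rng.gamma(16.0, 2.9, size=n),  # assumed shape/scale for online sales
    rng.gamma(9.0, 4.0, size=n),   # assumed shape/scale for in-store sales
)

colors = {"Online": "red", "Store": "blue"}
fig, ax = plt.subplots(figsize=(7, 5))
for name, color in colors.items():
    mask = channel == name
    # Semi-transparent points: dark areas now mean many overlapping points.
    ax.scatter(time[mask], revenue[mask], s=15, alpha=0.3,
               color=color, label=name)
    # Rug along the x axis: x in data units, y pinned to the bottom edge.
    ax.plot(time[mask], np.zeros(mask.sum()), "|", color=color, alpha=0.2,
            markersize=12, transform=ax.get_xaxis_transform(), clip_on=False)
    # Rug along the y axis: y in data units, x pinned to the left edge.
    ax.plot(np.zeros(mask.sum()), revenue[mask], "_", color=color, alpha=0.2,
            markersize=12, transform=ax.get_yaxis_transform(), clip_on=False)

ax.set_xlabel("Time (hours since opening)")
ax.set_ylabel("Revenue ($)")
ax.legend(title="Channel")
fig.savefig("scatter_rug.png", dpi=150)
```

The `alpha` values are a matter of taste: the more points you have, the lower you will probably want to set them.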
Digging deeper by summarizing data
Let’s dig a bit more deeply into the relationship between revenue and transaction channel. I prefer to begin by bucketing the values into bins, for example five-dollar increments, and creating a histogram. Binning helps you see more of the big picture, especially with continuous data like money, where the raw values are too granular to show a pattern. Notice that I’ve continued using opacity and color to differentiate the channels and view them at the same time.
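The binning step can be sketched in Python as follows. This is an illustration under assumptions, not the code behind the chart in this post: the two samples are made-up gamma draws, and the names `bin_width`, `edges`, and `counts` are my own.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

# Made-up revenue samples for each channel (NOT the article's data).
rng = np.random.default_rng(0)
online = rng.gamma(16.0, 2.9, size=500)
store = rng.gamma(9.0, 4.0, size=500)

# Five-dollar bins: edges at 0, 5, 10, ... up to the first multiple of 5
# at or above the largest observation, so both channels share the same bins.
bin_width = 5
top = np.ceil(max(online.max(), store.max()) / bin_width) * bin_width
edges = np.arange(0, top + bin_width, bin_width)

fig, ax = plt.subplots()
counts = {}
for name, values, color in [("Online", online, "red"), ("Store", store, "blue")]:
    # alpha keeps both histograms visible where they overlap.
    counts[name], _, _ = ax.hist(values, bins=edges, alpha=0.5,
                                 color=color, label=name)

ax.set_xlabel("Revenue ($, binned in $5 increments)")
ax.set_ylabel("Number of transactions")
ax.legend(title="Channel")
fig.savefig("histogram_by_channel.png", dpi=150)
```

Passing the same `edges` to both calls matters: if each channel picked its own bins, the bars would not line up and the comparison would be harder to read.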
A quick glance at this chart suggests that the median online transaction is likely higher than the median in-store transaction. As a final step, to compare the actual median and mean values for each transaction channel, I like to create box plots with the histogram of the data represented as a Wilkinson dot plot.
There is a lot going on in this chart, so let me explain. The dots represent binned values of the data, very similar to the histogram above. The white boxes show the quartiles of the data: the top of each box is the 75th percentile, the bottom of the box is the 25th percentile, and the line in the middle of the box is the 50th percentile, i.e., the median. The “whiskers” extending from each box give you a sense of how far the data reaches before points count as outliers¹. I’ve also plotted the mean value for each channel as a dark grey point so that I can see how the median compares to the mean, giving me an even better sense of the distribution of the data.
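To make the box, median line, and mean marker concrete, here is a rough Python sketch using matplotlib’s `boxplot`. It is an approximation under assumptions: the data is made up, the Wilkinson dot plot layer is omitted, and the whiskers follow matplotlib’s default rule (the most extreme points within 1.5 times the interquartile range of the box), which may not match the definition used in the chart above.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

# Made-up revenue samples for each channel (NOT the article's data).
rng = np.random.default_rng(0)
groups = {
    "Online": rng.gamma(16.0, 2.9, size=500),
    "Store": rng.gamma(9.0, 4.0, size=500),
}

# Numerical summary: the same quartiles the boxes will draw, plus the mean.
summary = {}
for name, values in groups.items():
    q1, median, q3 = np.percentile(values, [25, 50, 75])
    summary[name] = {"q1": q1, "median": median, "q3": q3,
                     "mean": values.mean()}
    print(f"{name}: Q1={q1:.2f} median={median:.2f} "
          f"Q3={q3:.2f} mean={values.mean():.2f}")

fig, ax = plt.subplots()
result = ax.boxplot(
    list(groups.values()),
    showmeans=True,  # draw the mean in addition to the median line
    meanprops=dict(marker="o", markerfacecolor="dimgray",
                   markeredgecolor="dimgray"),
)
ax.set_xticks([1, 2])  # boxplot places the boxes at positions 1, 2, ...
ax.set_xticklabels(list(groups.keys()))
ax.set_ylabel("Revenue ($)")
fig.savefig("boxplot_mean.png", dpi=150)
```

When the mean marker sits above the median line, as it tends to with right-skewed data like revenue, a few large transactions are pulling the average up.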
So, we can see clearly that the median online transaction revenue ($45.87) is lower than the online mean ($46.55), but higher than both the median and mean in-store transaction revenue ($36.04 and $37.96, respectively). We also have a good sense of how the data is distributed. With this information in hand, we should feel confident that we have an interesting data set that is worth modeling. Tomorrow we’ll use linear regression to explore the linear relationship between the variables we’ve discussed today and think about how to visualize those relationships.
¹ The exact definition relates to the interquartile range. See this help document for more details of the definition used in this plot.