Visualizing Data: When, Why, and How
Part 3: The Importance of Integrity: How Plot Parameters Influence Interpretation (Article 1)
This article is part of a three-part series on data visualization. Parts 1 and 2 focus on using data visualization throughout the data science workflow and determining whether visualizing your data is an appropriate approach for communicating information. Part 3, The Importance of Integrity, focuses on factors that affect effective and honest communication of a data story. This is the first article of Part 3.
Data visualization is a form of communication, whether you’re telling the story to yourself or to someone else. This means that you have the power to clarify or to conceal, to tell the truth or to deceive. And because images are so powerful, this form of communication is often more compelling than just using words or numbers – so it is even more important to maintain integrity.
In its best form, data visualization enhances understanding. In its worst form, it not only detracts from communication, but actually misleads the viewer into believing something untrue despite technically accurate data.[1] As a scientist or communicator, you bear the responsibility of being honest and maintaining integrity with your visualizations. So, how do you navigate this?
Communication is hard, no matter the form. Think about the pitfalls of communicating with language. Even when you speak the same language as someone else, you may have different basic assumptions and word connotations that result in different interpretations of the same sentence – and this doesn’t even account for verbal complexities like tone and phrasing. However, the more you understand about how people interpret words and ideas, the more accurately you can communicate an idea.
The same is true for visualization. A viewer’s interpretation can be affected by many subtle choices that you make, whether intentionally or unintentionally. The more tools you have for understanding how people interpret visualizations, the more clearly you will be able to communicate. In this section, we will discuss several aspects of visualizations that affect interpretation, and some key concepts that will give you tools for thinking about and assessing your own visualizations.
In Part 1 of this series, we discussed several examples of how data visualization can enhance your workflow, supporting understanding of steps throughout the data science process, as well as insight around the data itself. In Part 2, we discussed some factors to help you determine whether or not data visualization is the right choice for communicating information. In this final part, The Importance of Integrity, we will build on these ideas by discussing some considerations that are often overlooked when visualizing data and that influence the viewer’s interpretation. The goal is to demonstrate how basic decisions around plot structure and features can help or hinder your ability to communicate an honest story. We will discuss scales and ranges, and use concepts from colour theory to explain how seemingly innocuous or arbitrary colour choices can subtly influence our interpretation of underlying patterns. The examples will be simple to demonstrate concepts clearly, but the concepts are just as applicable to more complex scenarios.
Part 3 focuses on visualization for communication, when you’re telling the story of your data to someone else. However, any consideration for communicating data to others also applies to interpreting data yourself. As you go through the points below, consider how you may also be impacted by the nuances of data visualization throughout your own analytical process.
There are 3 articles in Part 3:
- How Plot Parameters Influence Interpretation
- How Colour Choice Influences Interpretation
- Maps: Potentials & Pitfalls
By the end of Part 3, hopefully you will be armed with a few new tools for understanding how people may interpret your visualizations, so that you can better communicate your data story!
How plot parameters influence interpretation
The scale of a plot shapes how we interpret variation in the data by providing an unspoken context. Are numbers close to the end of a range actually high, or just at the high end of your scale?
For example, the range of a y-axis can make increases or decreases appear more minor or more substantial. The plots below show the number of daily births in Quebec in 1978 (monthly mean).[2] Both plots show the same data, but with two different y-axis ranges. Do the peaks in May and September look important or minimal?
In (a), which is scaled to the range of the y values (mean daily births), there appears to be a notable change in daily birth rate throughout the year. Births in May look substantially higher than throughout most of the rest of the year, and there is another notable peak in September. Just glancing at the plot, it looks like the September peak is half the height of the May peak – although “half” does not necessarily have the meaning you would expect, as it is relative to a potentially arbitrary range.
In (b), which is scaled to start at 0 and have some space between the data and the top of the plot, the fluctuations look substantially less important. There still appears to be a seasonal trend, but it also appears to be fairly unimportant compared to the relative consistency of the value throughout the year.
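The effect described for (a) and (b) can be put into numbers. The following is a minimal sketch, assuming made-up monthly values (they are not the Quebec data) and two hypothetical helper functions, `apparent_height` and `peak_ratio`. It measures how tall the September peak is drawn relative to the May peak under each choice of y-axis range.

```python
# Hypothetical monthly means of daily births (illustrative values only,
# not the Quebec data described in the text).
monthly_means = {
    "Jan": 240, "Feb": 245, "Mar": 250, "Apr": 255,
    "May": 280, "Jun": 255, "Jul": 250, "Aug": 252,
    "Sep": 260, "Oct": 248, "Nov": 243, "Dec": 241,
}


def apparent_height(value, y_min, y_max):
    """Return the fraction of the y-axis extent that `value` reaches."""
    return (value - y_min) / (y_max - y_min)


def peak_ratio(data, smaller, larger, y_min, y_max):
    """Ratio of the drawn heights of two months for a given axis range."""
    return (apparent_height(data[smaller], y_min, y_max)
            / apparent_height(data[larger], y_min, y_max))


# (a) Axis scaled to the range of the data.
tight_ratio = peak_ratio(monthly_means, "Sep", "May",
                         min(monthly_means.values()),
                         max(monthly_means.values()))

# (b) Axis starting at 0 with some headroom above the data (300 is arbitrary).
zero_ratio = peak_ratio(monthly_means, "Sep", "May", 0, 300)

print(f"Sep/May height, data-range axis: {tight_ratio:.2f}")
print(f"Sep/May height, zero-based axis: {zero_ratio:.2f}")
```

With these made-up values, September is drawn at half the height of May on the data-range axis, but at about 93% of its height on the zero-based axis. The data are identical; only the baseline that "half" is measured against has changed.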
Without knowing more about births in Quebec, or, more importantly, the contextual question for these plots, it’s difficult to tell whether this fluctuation is notable. Is the question about seasonal cycles, such that plot (a) is more relevant? Or is the question related to hospitals’ preparation for births, such that the small fluctuations over the year may be less relevant than the relatively consistent birth rate throughout the year?
What if this data were presented with similar data from different years?
The additional time series help provide context for the fluctuation in the first time series. On one hand, the seasonal fluctuations appear to be relatively consistent. On the other hand, the annual fluctuations may be greater; the highest values in 1988 are comparable to the lowest values in 1978.
Even with more contextual data, however, the scale of the y-axis drives the viewer’s sense of the importance of the fluctuation, no matter the original question driving the visualization. If possible, the scale of a plot should be relevant to the question being asked; if not, it is important to provide context in some other way, such as the legend or text. In this case, if the question is about seasonal variation, the plot should have an axis that shows this variation, as in (a). If the question is about birth rate throughout and across years and the total range of seasonal variation (about 80 births/day) is negligible in this context, the second plot (b) may be more appropriate. For data exploration, both plots are useful.
It may seem obvious that plot ranges will affect interpretation of the data, but it is easy to forget, whether because you’re using default plotting parameters, glancing at a figure on a website or from your own data, or forgetting that viewers of your figures won’t have the same understanding that you do. It’s important to remember that decisions you make when visualizing your data should relate back to the question you are asking and the context of the data itself!
Interpretation of a trend in data can be similarly influenced by plot parameters driven by extreme points. For example, when outliers or other extreme points influence axis ranges, it can appear that there is no trend in data when there actually is one. When this is the case, there are several options for presenting the data differently. These include presenting an additional plot with more limited axis ranges and exploring data transformations.
In the two examples below, each dataset (a-c and d-f) is presented in three different ways: with all data, with a subset of the data, and with log-transformed y values. How does each presentation affect the conclusions you might draw from the data?
With both datasets, a quick glance at the original plots (a and d) might suggest that there are very few points of interest – many points are close to 0.
However, examining these low points more closely (b and e) suggests that there is a pattern in both cases. In (b), there is an apparent sinusoidal relationship between X and Y. In (e), Y appears to increase exponentially with X, an indicator that log-transformation may be meaningful here.
When Y is log-transformed, differences between smaller values are more visible relative to differences between larger values. This means that high outliers have less of an effect on the overall range of values. In (c), log-transformation effectively shows that there is a pattern between X and log(Y), but does not necessarily help with modelling the relationship. In (f), log-transformation shows that the relationship between X and log(Y) is actually linear – thus it not only makes the pattern more visible, it also gives you more information about the kind of relationship between the variables.
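To make the effect of log-transformation concrete, here is a small sketch under stated assumptions: the data are generated from an arbitrary exponential curve (not the datasets shown in the plots), and `steps` and `axis_share` are hypothetical helpers. It checks two things: whether consecutive differences become constant after the transformation (a straight line), and how much of the axis the smaller values occupy before and after.

```python
import math

# Hypothetical data following y = 2 * exp(0.5 * x); the constants are arbitrary.
xs = list(range(11))
ys = [2.0 * math.exp(0.5 * x) for x in xs]
log_ys = [math.log(y) for y in ys]


def steps(values):
    """Differences between consecutive values."""
    return [b - a for a, b in zip(values, values[1:])]


def axis_share(values, upto):
    """Fraction of the axis range spanned by values[0] through values[upto].

    Assumes the values are sorted in increasing order.
    """
    return (values[upto] - values[0]) / (values[-1] - values[0])


raw_steps = steps(ys)        # grow with x: a curve on the raw scale
log_steps = steps(log_ys)    # constant (0.5): a straight line on the log scale

raw_share = axis_share(ys, 5)      # first six points are squeezed near the bottom
log_share = axis_share(log_ys, 5)  # the same points fill half of the log axis

print(f"raw steps: first {raw_steps[0]:.2f}, last {raw_steps[-1]:.2f}")
print(f"log steps: first {log_steps[0]:.2f}, last {log_steps[-1]:.2f}")
print(f"share of axis used by first six points: raw {raw_share:.2f}, log {log_share:.2f}")
```

On the raw scale, the first six of eleven points are compressed into less than a tenth of the axis, which is why they look like they sit near 0. After the transformation they span half of it, and the constant step size shows that the relationship between X and log(Y) is linear.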
Time series ranges
The above section demonstrated how the y-axis range can affect one’s interpretation of a pattern in data. The same is true for the x-axis range. For example, the amount of a time series that is depicted in a plot affects the viewer’s context for interpretation of changes over time.
The plot below shows the daily closing price of the Swiss Market Index, Switzerland’s most important stock market index, from mid-1991 through mid-1994. Imagine that it is sometime in 1995, and you are casually interested in how the SMI has been doing lately. What would you think of the following plot?
If this plot were accompanied by a story about how the SMI is dropping, you might believe it. But what if the rest of the more recent data were shown? Or, what if this were put into the context of the next few years?
With the larger picture, while the drop in 1994 still seems like it might be notable, the conclusion about the current state or recent trend changes.
In the abstract, it might seem obvious that the importance of a temporal trend depends on its past and future context. However, this doesn’t stop people from drawing conclusions from visualizations of recent trends, nor others from presenting partial stories. Whether you are the viewer or the visualizer, it is important to ask yourself: What comes next?
Of course, it’s not a simple issue of what is right or wrong, but also of what your story is about. Are you telling a story about a drop in the SMI in 1994 and its causes and effects, or are you telling a story about recent trends and hypothetical predictions for the future? The former might warrant a closer zoom on the data around 1994, whereas the latter would warrant a broader picture. In either case, however, it is important to maintain integrity by providing the appropriate context.
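The dependence on the plotted window can be sketched numerically. The values below are invented for illustration (they are not actual SMI closing prices), and `pct_change` is a hypothetical helper. The sketch compares the change seen through a narrow window with the change seen across the whole series.

```python
# Hypothetical index values at equally spaced dates (illustrative only,
# not actual SMI closing prices).
index_values = [1700, 1900, 2100, 2600, 3000, 2800, 2600, 2700, 3300, 3900, 5600, 7500]


def pct_change(series, start, end):
    """Percent change from series[start] to series[end]."""
    return 100.0 * (series[end] - series[start]) / series[start]


# A narrow window that happens to cover only the dip.
window_change = pct_change(index_values, 4, 6)

# The whole series, which puts the dip in context.
full_change = pct_change(index_values, 0, len(index_values) - 1)

print(f"change over the narrow window: {window_change:+.1f}%")
print(f"change over the full series:   {full_change:+.1f}%")
```

Both numbers are accurate summaries of the same invented series, yet one suggests decline and the other strong growth. Which window is appropriate depends on the story being told, and the viewer needs enough context to know which one they are looking at.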
Example: Using lines to highlight a nonexistent pattern
Another design choice that affects interpretation is the use of lines between points on a plot. When points are plotted against a continuous variable and have a logical ordered relationship, it makes sense to connect them with lines to show this relationship. However, when points are plotted against a discrete (categorical) variable, these lines can be misleading.
The previous examples have been cases where there is no de facto right or wrong approach, but where the best approach depends on the context and the question being asked. In this case, however, the right course is clearer. Except in very rare circumstances[3], don’t draw lines between points representing data for categorical variables. (In fact, you probably shouldn’t ever represent categorical data as points.)
For example, this plot shows the weights of a variety of adult mammals of different species, drawn as points connected by lines.
The lines between the points suggest that there is a trend between them, but because the species axis is discrete (a categorical variable, rather than a continuous variable) and there is no de facto reason for the species to be related in an ordered way, nor to be spaced at equal intervals, the trend is nonsensical. The lines help draw your eye between the points, but they imply a continuous relationship that does not exist.[4] This is misleading to the viewer and should not be done. (In fact, plotting this data as points is also probably not a good idea. It would be better to try a bar graph or a boxplot.)
In this case, the lines between the points are confusing and irrelevant, but at least they are unlikely to lead to any particular incorrect conclusion being drawn. A more pernicious case would be when the x-axis has units that could be continuous, but the sampled points are not evenly spaced. Here the offending decision is not only to connect the points, but also to plot a continuous variable as a categorical one. This could suggest a notable trend or event where none actually exists.
For example, in the following plot, daily birth rate in Quebec is again plotted, but this time for time points that were sampled at inconsistent intervals.
In the plot above, it looks like there was a sudden increase between the last two points of the time series. What if the x-axis were plotted as a continuous variable?
In this case, it becomes clear that although there was an increase in daily births after 1987, it was not as sudden as it originally looked. Instead, it appears to be an increase over several years.
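The difference between the two plots comes down to what the horizontal distance between samples represents. Here is a minimal sketch, assuming invented sample values (not the Quebec data) and two hypothetical helpers, `step_changes` and `yearly_rates`. It compares the change per plotted step, as on a categorical axis, with the change per year, as on a continuous axis.

```python
# Hypothetical (year, mean daily births) samples taken at uneven intervals
# (illustrative values only, not the Quebec data).
samples = [(1977, 255), (1978, 258), (1979, 262), (1980, 260), (1987, 235), (1990, 268)]


def step_changes(points):
    """Change in value between neighbouring samples, ignoring the gap in years.

    This is what the eye reads when years are treated as evenly spaced categories.
    """
    return [v2 - v1 for (_, v1), (_, v2) in zip(points, points[1:])]


def yearly_rates(points):
    """Change in value per year between neighbouring samples.

    This is the slope drawn when years are placed on a continuous axis.
    """
    return [(v2 - v1) / (y2 - y1) for (y1, v1), (y2, v2) in zip(points, points[1:])]


per_step = step_changes(samples)
per_year = yearly_rates(samples)

print(f"last segment, categorical axis: {per_step[-1]:+d} per step")
print(f"last segment, continuous axis:  {per_year[-1]:+.1f} per year")
```

In this invented example, the final segment rises by 33 in a single categorical step, which reads as a sudden jump. On a continuous axis the same rise is spread over three years, at 11 per year, which reads as a gradual increase.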
The above examples demonstrate how setting up your plot space, including the axis limits and the axis scales, determines the context for the data and therefore its interpretation. Of course, there are no set rules for determining how to set up a plot honestly; that depends on the story that you are trying to tell. The important thing to remember is that the interpretation of the data will depend on these parameters, and you can use that understanding to improve your ability to tell a data story.
Part 2 of the series is When Is Data Visualization a Good Choice.
The third part of this series, The Importance of Integrity, consists of three articles: How Plot Parameters Influence Interpretation, How Colour Choice Influences Interpretation, and Maps: Potentials & Pitfalls (forthcoming).
1. To be clear, it is possible to mislead with technically accurate data through the use of numbers just as much as with visualizations. For example, a reader’s conclusion can be skewed by what aspect of the data you choose to describe and what analyses you show. As an excellent example, in 2015, Dr. John Bohannon – not an MD, but a science journalist with a PhD in bacterial molecular biology – demonstrated this point by conducting and publishing a study that showed that chocolate can help with weight loss. Despite his use of p-hacking to get a positive result (in this case, a tiny sample size of 15 people, and a full 18 different measurements, two of which were significant), the study was published (by a predatory or fake journal) and picked up (extraordinarily) widely by the media. Conclusion: You can lie with numbers. Additionally: Don’t eat chocolate for weight loss purposes. Eat it, in moderation, because it’s delicious.
2. Number of daily births in Quebec, Jan. 01, 1977 to Dec. 31, 1990; Source: Time Series Data Library (citing: Hipel and McLeod (1994))
3. I won’t discount the possibility that an exception does exist where there might be a very good reason for wanting to highlight the trend between discrete points and where this will not be misleading to the viewer. However, even if this case does exist, it is not good practice.
4. It actually takes some creative coding to convince ggplot to put lines between points representing categorical variables – which is good!