line charts

Stata tutorial: Adding the 95% Confidence Interval to a Two-way Line Plot

I created a tutorial on how to add the 95% CI to a two-way line plot in Stata. I use the “connected” command to generate a line plot in Stata, and then I added the 95% CI to each value. Surprisingly, Stata does not have a native feature to allow users to generate these 95% CI on a two-way line plot.

I used the AHRQ Medical Expenditure Panel Survey (MEPS) database for the motivating example. In this tutorial, we plotted the average total healthcare expenditure from 2008 to 2019.

I build this tutorial on Stata, but I used R Markdown to write the tutorial. The R Markdown code is located in my GitHub site (Stata - Line plot with 95% CI tutorial).

You can find the tutorial on my Github site and RPubs page.

I used Stata SE 17 to build this.

Communicating data effectively with data visualizations: Part 38 (Replicating the LA Times COVID-19 tracker)

INTRODUCTION

Recently, the staff at the Los Angeles Times (LA Times) provided a COVID-19 tracker on their website. This is an impressive set of data visualizations of COVID-19 cumulative cases, new cases, vaccinations, and deaths. I was particularly struck by the “New cases by day” figure which includes a bar chart overlaid with a 7-day moving average line chart. The visualization effectively used the moving average to adjust for the spikes in new COVID-19 cases but maintained the spikes on a daily basis. None of the data are lost and illustrates the spikes in new COVID-19 cases while adjusting for the moving average. The color schemes were also optimal where the daily new cases used a softer color, but the moving average line used a darker color highlighting its importance in the figure.

I wanted to write an article on how to replicate this figure using Excel.

Source: Los Angeles Times, “Tracking the coronavirus in California,” url: https://www.latimes.com/projects/california-coronavirus-cases-tracking-outbreak/ [Accessed on June 24, 2021] * This is for educational purposes only.

DATA SOURCE

Data used in this article can be found on the LA Times GitHub site. I used the “latimes-county-totals.csv” data (link to the raw data). I also made the data available with the final figure on the following Dropbox location.

 

TUTORIAL

Step 1. Download and visually inspect the data.

After you’ve downloaded the data, take a moment to inspect them. The columns that are used in this tutorial are “date” and “new_confirmed_cases.” But you can use the other columns to replicate other parts of the LA Times COVID-19 tracker.

Step 2. Insert a bar chart and select the appropriate data.

Insert a clustered column chart using the Insert tab on the Excel ribbon. When selecting the data, make sure that you select the “new_confirmed_cases” (other data are available, but the new cases are what we are replicating in this exercise).

The default bar chart does a pretty good job of replicating the LA Times figure.

However, we’ll have to do a few edits to the axes to match the LA Times figure.

 

Step 3. Modify the axes.

Let’s focus on the Y-axis first. Right-click on the Y-axis and select “Format Axis…” In the Axis Options panel, change the Minimum value to 0 and the Major value to 20000. This will match the settings in the target figure. (Note: There are negative values in the data, but these are very small numbers and assumed to be ignorable.) Next, in the Number options, change Category to “Number” and the value in the “Decimal places” to 0. Make sure that you check the box next to “Use 1000 Separator (,)” to replicate the same format in the target figure.

For the X-axis, right-click on the bottom axis and select “Format Axis…”This will open the Axis Options panel where you can make several adjustments to the X-axis. First, we want to change the X-axis display values from dates to months. Change the Number Category field to “Custom” then change the Format Code to “mmm”; make sure to click on “Add” for the changes to take effect. Next, go to the Axis Type area and change the Minimum to “02/01/2020” since we want our timeline to begin on Feb of 2020. Then change the Major value to 4 to match the monthly interval of the target figure. The X-axis should be thicker with tick marks on the outside. To modify these, navigate to the Tick Marks option and change Major type to “Outside” and then click on the Paint Bucket (Fill & Line) option; increase the Width to 1.5. These should match the target figure’s X-axis format.

Step 4. Add the 7-day moving average.

Excel has a Data Analysis tool that will automatically estimate the 7-day average. I’ve written a previous tutorial that describes how to use this tool. I’ll briefly review how to estimate a 7-day moving average.

In the Data tab, click on the Data Analysis tool (instructions on how to install the Data Analysis tool is here). This will open the Data Analysis Tools box. Select “Moving Average” from the tools kit and enter the appropriate values in the options box. For the Input Range, select all the values from “new_confirmed_cases” column. Enter a value of “7” in the Interval field; this will automatically calculated the 7-day moving average. In the Output Range, select a single cell where you want to moving average to be pasted after it is calculated. I chose to use the next available cell on the dataset ($F$2).

Step 5. Add the 7-day moving average to the chart.

To include the moving average data to the current daily new cases bar chart, right-click on the chart and select “Select Data.” This will open a box where you can add new data. Select “Add” which will open the “Edit Series” box. Updates the Series name with the name of the column (“moving_avg”). For the Interval field, change this to “7” for the 7-day moving average. Then in the Series values, select the 7-day moving average data.

By default, Excel will generate a bar chart for the 7-day moving average. However, we want a link chart. We can change this by right-clicking on the bars of the chart and selecting “Change Series Chart Type…” This will open a box that will allow us to select the type of chart for each data. For the “moving_avg” data, change the Chart Type to “Line.” This will create a line chart for the 7-day moving average which will be overlaid over the daily new cases.

Step 6. Modifying the chart format.

To closely match the chart to the one presented in the LA Times, I made the following adjustments. Your mileage may vary depending on the library of fonts available. I tried to select fonts that most Excel users will have access to.

I changed the Y-axis font to Adabi script. The X-axis font was changed to Arial Nova.

The width of the horizontal gridlines was increased to 1.5. The color of the daily new cases bar chart was changed to a light blue using a hex code of #8DC6DF. The color of the 7-day moving average was changed to a dark blue using a hex code of #2B869B; additionally, the width was increased to 2.0.

Step 7. Comparison between LA Times and user-generated charts.

Once the modifications have been made, compare the charts.

CONCLUSIONS

Using data from the LA Times, we can replicate the data visuals on their COVID-19 tracker website. This allows users to verify the data that are presented on a public site. Additionally, it allows us to generate our own data visualizations that could inform policy and education the public on the rate of new cases in California.

 

REFERENCES

Data was based on the LA Times (link), which was accessed on 24 June 2021.

Excel file used for this exercise can be download from the following Dropbox folder.

 

ACKNOWLEDGEMENT

The data visual used in this exercise was based on the work of the staffers at the LA Times. They deserve all the credit and acknowledgement for developing these stunning visuals.

Communicating data effectively with data visualizations - Part 1 (Principles of Data Viz)

Introduction

Data visualization is a form of visual communication that takes quantitative information and displays it as a graphic, an abstraction of the real world. Effective data communication makes complex statistical analysis accessible without excessive mental burden. It is also used to identify patterns through data exploration. Unlike information visualization which includes catch-phrases such as “Infoviz” and “Infographics,” data visualization is intuitive, informative, and “pretty” while simultaneously focused on scientifically structured comparisons, analytic precision, and statistical inference. The challenge is compressing all the quantitative information into a single chart or graphic that provides a narrative or purpose that can be synthesized and acted on with very little mental effort.

There are a variety of data visualizations that can be used such as choropleths, heatmaps, scatter plots, and dot plots (this list is not all inclusive). The selection is dependent on the data, audience, and narrative. How complex is the analysis? Who are you presenting this information to? Why should the audience care?

The best way to present data effectively is with a good story. Your graphic should be able to tell a story based on the quantitative information. Every graphic you create should be a self-contained narrative of the data. This can be achieved using simple tools, but the creation of effective data visualization depends more on your ability to tell a good story. The purpose of this article is to highlight some important principles of data visualization, review common data visualizations, and develop a mechanism to select the most effective data visualization.

Principles of data visualization

Data visualization can be traced to several different schools of thought (e.g., Edward Tufte and William S. Cleveland), but the fundamental principles are similar and often overlap. Edward Tufte identified several key principles when developing data visualizations (Table 1).

Table 1. Tufte's principles for graphical integrity. *

Principle**

Description

Avoid chart junk

Inventive displays seldom generate interest. Rather, they generate visual noise.

 

Data-ink ratio

Use ink to show the data. Ink that does not contribute to the reporting of the data should be removed.

 

Numbers should be directly proportional to the numerical quantities represented

The "Lie Factor" is a proportion of the Size of the effect shown in the graphing / Size of the effect in the data. The graphic should not inflate the actual magnitude of the change.

 

Use small multiples and repeat

High quality information graphic portrays many numbers per square inch. Small multiple, comparative images work especially well for this. Examples include sparklines.

 

Avoid graphical distortions and ambiguity

Avoid distortions of numbers by graphic devices. Show data variation in context, and label them. Write out explanation of the data on the graphic itself. Properly label events in the data.

 

Multifunction

Information layers and architecture emerge best when data display elements serve multiple functions. Different readings at different levels of detail (micro-macro) serve this goal well. For example, the y-axis can be used to provide scale while calling out to important values by either coloring that value differently or enlarging it.

 

Show data variation, not design variation

Use scales that are similar and do not generate ambiguity. Be consistent in the data when displaying them as a graphic.

 

In time-series displays of money, deflated and standardized units of monetary measurement are nearly always better than nominal units

Properly adjust current due to inflation or population growth. We want to the currency in real purchasing power (value) rather than nominal purchasing power.

 

The number of information-carrying (variable) dimensions depicted should not exceed the number of dimensions in the data

Using graphics to show the proportional change of a metric can bias our perception due to the number of dimensions that are changing. If we look at a single metric such as budget, then we are only looking at a one-dimensional scale, meaning that when the budget increase, it only changes in one dimension. However, it is easy to use a display such as a 2-dimensional picture and scale it up according to the one-dimensional scale. For example, if we have a 2-dimensional graphic and we scale it according to an increase on a one-dimensional metric, the actual proportional increase in 4 times (2^2 = 4). If this was a 3-dimensional object, then the proportional increase in 8 (2^3 = 8).

 

Graphics must not quote data out of context

An accurate picture must report the totality of the effect. Showing only one piece of the data with graphics is just as bad as the data. Context is critical. In time-series analysis, it is imperative that the researcher provides an illustration of the overall trend including any changes in seasonality. Therefore, apply rational judgement when presenting data visualization. The use of comparison groups helps to answer any secular impacts that may not be captured when looking at data at a single point in time.

 

* From Tufte ER. (2001) The Visual Display of Quantitative Information. Second Edition. Cheshire, CT. Graphics Press, LLC.

** This table provides fundamental principles on graphical integrity and data graphics and is not all inclusive.

Figure 1. Box plots of MLB wins in the 2017 season. [click to enlarge]

Dot plots are simple graphics that use points (filled in circles) instead of line or bars on a simple scale. They convey the same information as bar charts, but use less ink to do so. The advantage they provide is that they reduce the junk of the bar charts which contain useless space that are uninformative. In Figure 2a, the dot plot provides the same information from the previous bar charts; however, there is a better sense of scale with the removal of the clutter introduced by the bar charts. Like the bar charts, use pastel colors to dampen the effect of the teams that are not the focus of the chart and use solid colors to bring out the teams with the most and least wins (Figure 2b). The minor grid lines do not provide any information about the data and should be removed (Figure 2c). Finally, Figure 2d takes the dot plots and use data values to provide the audience with the actual number of wins. This is also reinforced by the pastel and solid colors, which provide good contrast between the teams that have the most and least wins.

Figure 2. Dot plots of MLB wins in the 2017 season. [click to enlarge]

Line plots are graphics that use lines to illustrate a trend. A line plot would not be appropriate for the baseball wins example because the x-axis does not have any continuous scale, which is needed for line plots. Table 2 provides data on MLB players’ batting averages from 2013 to 2017. The table provides us with information across five years, but the order and rankings are difficult to determine.

Table 2. Batting averages of Major League Baseball players (2013-2017).

Players

2013

2014

2015

2016

2017

Yasiel Puig

0.319

0.296

0.255

0.263

0.260

Justin Turner

0.280

0.340

0.294

0.275

0.332

Michael Trout

0.323

0.287

0.299

0.315

0.329

Ichiro Suzuki

0.262

0.284

0.229

0.291

0.250

The table doesn’t do a good job illustrating the trends over time. Instead, it is a good reference that is searchable. When it comes to visually telling a story, the table doesn’t do a good job. Converting this table to several line plots can help illustrate the changes in each players’ batting averages over time. Figure 3a trends each player’s batting averages, but the clutter makes it difficult to identify any important patterns. For graphics that use a time interval (or continuous interval) on the x-axis, it is useful to truncate the y-axis to see any incremental changes in the trend.

Figure 3b truncates the batting average from 0 to 0.360 to 0.200 to 0.360. Now the changes in batting average is more discernable from this truncated y-axis. It’s clear that Yasiel Puig’s batting average declined from 2013, but Justin Turner’s batting average improved. However, this still feels cluttered. The different lines and colors make it hard tell that Justin Turner was improving. In fact, it seems like all the players except for Yasiel Puig were improving. To make sense of the clutter, let’s assume that we were interested in the player who had the most improvement from 2013. Calculating the percent change between 2013 and 2017 and then putting it on the graphic provides us with some metric to distinguish Justin Turner from the rest of the other players.

Figure 3c adds the percent change in batting averages from 2013 to 2017 with the player’s name. The legend was removed because it didn’t contribute much to the graphic once the names were adjacent to each line. Despite these modifications, it’s not easy to distinguish the improvement in batting averages for Justin Turner. There are too many competing colors, which distract the focus from Justin Turner’s improvement.

Figure 3d dampens the non-critical lines using a single pastel color and matching the to the trend lines, which highlights Justin Turner’s trend line, the only one with color. This technique draws your attention to Justin Turner’s trend while providing details about the change in trend and the player associated with that change.

Figure 3. Line plots of MLB players’ batting averages (2013-2017). [click to enlarge]

Summary

So far, basic principles and examples of data visualization were presented in this article, which is part of an on-going series on data visualization. Since this is a primer on data visualizations, you should review existing graphics and try to apply some of these principles. Web-based data visualizations are prevalent and can be found in places such as the R-Shiny gallery and Tableau gallery. As you start to explore different data visualizations, you’ll discover many creative and useful tools. Next issue, we’ll discuss other data visualization graphics that will reflect the Tufte’s principles for graphical integrity and excellence.

References

Tufte ER. (2001) The Visual Display of Quantitative Information. Second Edition. Cheshire, CT. Graphics Press, LLC.

Knaflic CN. (2015) Storytelling with Data: A Data Visualization Guide for Business Professionals. Hoboken, New York. John Wiley & Sons, Inc.