data visualization

Stata tutorial: Adding the 95% Confidence Interval to a Two-way Line Plot

I created a tutorial on how to add the 95% CI to a two-way line plot in Stata. I use the “connected” command to generate a line plot in Stata, and then I added the 95% CI to each value. Surprisingly, Stata does not have a native feature to allow users to generate these 95% CI on a two-way line plot.

I used the AHRQ Medical Expenditure Panel Survey (MEPS) database for the motivating example. In this tutorial, we plotted the average total healthcare expenditure from 2008 to 2019.

I build this tutorial on Stata, but I used R Markdown to write the tutorial. The R Markdown code is located in my GitHub site (Stata - Line plot with 95% CI tutorial).

You can find the tutorial on my Github site and RPubs page.

I used Stata SE 17 to build this.

Communicating data effectively with data visualization: Part 41 (Color Blind Friendly Palette)

BACKGROUND

Data visualization has evolved into sophisticated interactive graphics where users can gather more information compared to simple two-dimensional graphics. Color has become an important element to data visualization adding another dimension of information and data. However, people with color vision deficiency (also known as “color blindness”) are unable to experience or benefit from the use of colors in data visualization. This can have important implications when developing our data visualization, which is meant to inform and educate. To maximize the uptake and adoption of  our tools and reports with data visualization, we need to consider using color blind friendly palettes.

 

Recently, I created a report using the RColorBrewer color palette, which contains three types of color palettes: sequential, diverging, and qualitative. This color palette package in R was developed by Cynthia Brewer, and you can experiment with different color combinations at https://colorbrewer2.org.

 

Sequential palettes are used for ordered data which are represented varying gradient levels of color. For example, a low order might be represented by a light shade of blue, but a high order might be represented by a darker shade of blue (Figure 1). 

Figure 1. Example of a sequential palette

Diverging palettes are used to put equal emphasis on the mid-range values with the ends denoted by contrasting hues. For example, using colors to represent the values between 1 and 5 where 3 is the mid-range, we could color 3 as White, 1 as Blue, and 5 as Red (Figure 2).

Figure 2. Example of a diverging palette

Qualitative palettes are used to on categorical data, and the colors are used to represent distinct categories. Suppose you have a variable called food and there are 5 categories in that variable (chicken, meat, fish, vegetables, and fruit). You can use different hues to represent each category (Figure 3)

COLOR VISION DEFICIENCY

Color vision deficiency affects approximately 8% of males and 0.4% of females in the general population who are of European descent.1 However, these numbers vary among other races. For example, the prevalence of color vision deficiency among males in China and Japan are 4% and 6.5%, respectively.

 

There are several types of color vision deficiency. The red-green color vision deficiency is the most common, which are classified as protanomaly for red weakness and deuteranomaly for green weakness.2 For red-blind and green-blind, the categories are protanopia and deuteranopia, respectively. Less prevalent are the blue-weakness/blind (tritanomaly and tritanopia) and complete absence of color vision (achromatopsia). Examples of the different types of color vision deficiency are provided in the figure below (Figure 4).

Figure 4. Examples of color vision deficiency.

POTENTIAL SOLUTIONS

Picking color palettes that would not affect readers with color vision deficiency is not easy. Particularly, when you have a lot of colors on the palette that you are using. One tool that is helpful is ColorBrewer 2.0. You can select the “colorblind safe” option to select color palettes are friendly for readers with color vision deficiency (Figure 5).

Figure 5. ColorBrewer options. Source: https://colorbrewer2.org/

Another method to generate data visualizations for readers with color vision deficiency is to use different line styles or shapes. For example, we can use different types of dashed lines and markers for this line chart (Figure 6).

Figure 6. Line chart using different dash line types and markers (square, triangle, and circle).

CONCLUSIONS

Color vision deficiency can hinder our readers from extracting information from our data visualization. Hence, it is critical that we generate data visualization using the methods above to help our audience. Using the correct color palette and varying the style of the figures by changing the line styles (e.g., dashed) or marker shapes, can assist our readers regardless of their color vision deficiency. More importantly, it will help to reduce any barriers to accessing vital information for our entire audience.

  

ACKNOWLEDGMENTS

I used the Coblis-Color Blindness Simulator to generate the varying color vision deficiency examples in Figure 4.

 

REFERENCES

1.         Birch J. Worldwide prevalence of red-green color deficiency. J Opt Soc Am A Opt Image Sci Vis. 2012;29(3):313-320. doi:10.1364/JOSAA.29.000313

2.         Chakrabarti S. Psychosocial aspects of colour vision deficiency: Implications for a career in medicine. Natl Med J India. 2018;31(2):86-96. doi:10.4103/0970-258X.253167

Communicating data effectively with data visualization: Part 40 (Percentage of population with COVID-19 vaccination)

INTRODUCTION

Since the COVID-19 pandemic began on 12 December 2019, there have been several innovations in the forms of vaccinations and treatments. Of critical importance is the world-wide international support for COVID-19 vaccination. A timeline of the COVID-19 pandemic can be found at the CDC website.

The website informationisbeautiful.net has a great COVID-19 dashboard that ranks countries that have administered vaccinations.

Source: www.informationisbeautiful.net (https://informationisbeautiful.net/visualizations/covid-19-coronavirus-infographic-datapack/)

TUTORIAL

We will recreate this chart using Excel. Data can be download from my Dropbox folder or from Our World in Data website.

The data will contain information for 39 countries. These variables include the total number of vaccinations, the number of people vaccinated, the number of people who are fully vaccinated, the number of boosters, the total population, percentage of the population who are vaccinated, percentage of the population who are fully vaccinated, and the percentage of the population with boosters.

We will create a vertical bar chart for the percentage of the total population that received at least one COVID-19 vaccination. You can use these methods to recreate the vertical bar charts of the remaining figures in the figure above.

Visually inspect the data. It should look like the following.

We will us the “total_vaccinations,” “percent_vaccinated,” “percent_full_vaccinated,” and “percent_boosted” to generate the figures.

Step 1: Insert the vertical bar chart.

Once the vertical bar chart is inserted onto the Excel sheet, right click on the chart and click on “Select data.” We want to highlight the data column titled, “percent_fully_vaccinated.” For the “Series name,” enter “% fully vaccinated.”

Step 2. Edit the labels.

In the “Horizontal (Category) Axis Labels,” click on the “Edit” button and select the
location.” In the “Axis label range,” highlight the values in the “location” column. This will assign each value of the “percent_fully_vaccinated” with the country.

Step 3. Delete unnecessary labels and titles.

The figure should look like the following. We can delete the x-axis, the axis title, and the chart title. Extend the chart so that you can view all the countries. You can also narrow the chart so that it will resemble the size of the informationisbeautiful.net figures.

Step 4. Create a progress bar

We will add a progress bar, which will allow us to see how much of the population is fully vaccinated. This requires us to use the “full_bar” column which has a value of “1” for each country. This value represents a 100% progress goal.

Right-click on the chart and add a new data entry. Select the data under the column called “full_bar.” The values are all “1” to represent a 100% progress bar.

The chart will have two bars for each country; one bar represents the percentage of the population that is fully vaccinated, and other bar represents 100% progress. We need to edit the overlay and the width of the bar to match the ones on the informationisbeautiful.net charts.

Right-click the chart and select “Format Data Series…” Change the “Series Overlap” to 100% and the “Gap Width” to 40%. You may not see the blue bar chart because the red bar chart will be in front of it. To fix this, go to the “Select Data Source” window and use the arrow to reposition the data series. This will place the blue bar chart in front of the red bar chart.

Step 5. Add the data labels.

The next step will include the data labels to the progress bar. We want to have the data labels aligned on the right. When you right-click the blue chart, you will get the data labels to be aligned to the left or the end of the bar. This does not get us to the data labels aligned on the right.

To do this, we need to right-click the red bar (not the blue bar) and click on “Add data labels.” This will all the data value for the progress bar, which is “1.”

Right-click on one of the data labels that has a “1” and select “Format Data Labels…” This will open a window where you will need to check the box next to “Value From Cells.” When prompted, select the values in the “percent_fully_vaccinated” column. This will add the percentage of the population that is fully vaccinated on the right-side of the bar chart. Additionally, uncheck the box next to “Value.” This will remove the value from the bar chart so that you will no longer see the “1.”

The chart should be updated to have the percentage of the population who are fully vaccinated aligned to the right-side of the bar chart.

Step 6. Modify the chart to improve the aesthetics.

Once you’ve added the data labels and aligned them on the right-side of the bar chart, you can start to edit the font color, bar colors, and delete the grid lines to emulate the chart in informationinbeautiful.net website.

I changed the background to a dark gray and the data labels to a white color. I also changed the font to Arial Narrow.

I only presented part of the figure here. But you can download and view the whole figure from my Drobox folder.

CONCLUSIONS

Using the vertical bar chart with progress bars allow us to emulate the figures generated by informationisbeautiful.net. I like these types of charts for data visualization because they provide some indication of progress. You can see that the United States is behind many nations when it comes to having the population fully vaccinated. This is much lower than the UAE and South Korea where 95.6% and 86.6% of the total population are fully vaccinated. Hopefully, you can use this exercise to help you develop similar charts for your data visualizations.

 

REFERENCES

Data was access using the Our World in Data COVID-19 site.

The data visualization that inspired this exercise was based on informationisbeautiful.net COVID-19 Coronavirus Data Dashboard.

The Excel file with the data used for this tutorial is available on my Dropbox folder.

Visualizing linear regression models using R - Part 2

I continue my previous blog post on visualizing linear regression models using R (link). Part 2 focuses on using visualization to assess whether the model’s residuals were associated with the predicted values and whether they are normally distributed.

The R Markdown code that I wrote to create this tutorial is located on my GitHub site (link).

You can find the tutorials on my RPubs site:

  • Part 1 - Visualizing linear regression model using R (link)

  • Part 2 - Visualizing linear regression model using R (link)

(NOTE: on 30 January 2022, I updated these tutorials and they can be found in my RPubs page here. The R Markdown code is saved on my GitHub page here.)

Visualizing linear regression models using R - Part 1

I wrote a tutorial on how to visualize linear regression models using R. In the tutorial I used the lm() command and the predict3d package to generate the models and visualize them using R. You can view the RPubs tutorial here. (NOTE: on 30 January 2022, I updated this tutorial and it can be found in my RPubs page here.) I created this tutorial using R Markdown, and the codes are available on my GitHub site (link).

Reproduction number—COVID-19

BACKGROUND

As the COVID-19 pandemic, which began in December 2019, continues into its second year, public health measures have been put into place to mitigate its spread. At the time of writing this article, there have been over 4.5 million deaths and over 216 million cases due to COVID-19.[1] Surveillance of COVID-19 remains an important public health measure of understanding the spread and impact. Daily reports such as the John Hopkins COVID-19 dashboard provide end users with visual and statistical information about the surges in cases and deaths associated with COVID-19. However, one measure that is of great interest is the reproduction number or R0.

 

Reproduction number (R0) and effective reproduction number (Rt)

The reproduction number is the number of new cases that is directly caused by exposure to a single case.[2,3] Figure 1 provides a visual explanation of the basic reproduction number. However, the underlying assumption with R0 is that everyone in the population is susceptible to infection. With the introduction of vaccines, the R0 isn’t a good measure of the reproductive capabilities of COVID-19. Instead, the effective reproduction number (Rt) is used to provide a more realistic reproduction number based on the population being infected, recovered, or vaccinated. The Rt changes over time as the population susceptible to infection changes.

Figure 1. Basic reproduction number.

I wanted to create a figure that would highlight the changes associated with the Rt for each state in the United States. To do this, I downloaded the Rt data from the by Xihong Lin's Group in the Department of Biostatistics at the Harvard T.H. Chan School of Public Health. They have an amazing COVID-19 tracker dashboard that captures the changing patterns of Rt for each state. Then I created a Cleveland plot to show where the Rt was near the beginning of the pandemic and where it is currently (August 2021). (Note: I wrote a tutorial on creating Cleveland plots that you can review here.) Here is the final figure (because of the length of the figure, I cropped it to show the first 30 states or territories):

 

Figure 2. Effective reproduction number (Rt) for U.S. states and territories, April 17, 2020 (past) to August 14, 2021 (recent).

The blue dots denote the most recent effective reproduction number (14 August 2021) and the past dots denote the earliest effective reproduction number (17 April 2020).

It seems that some states have gotten worse in terms of increase effective reproduction number since the beginning of the pandemic. This could be due to lack of good data in the early phases of the pandemic. However, what is of concern is the high effective reproduction numbers in some states (Rt > 2), which indicates that the pandemic is still spreading at an alarming rate.

There were some missing data which are identified by a single dot (blue or red) or an empty field in the recent or past effective reproduction number. Rather than fill these in, I left them empty. There may be data in between the two time periods that I could have used, but I left those out.

One thing to mention is that this Cleveland plot only tells us one dimension of the effective reproduction number story (the difference between the most recent Rt and the earliest Rt). It doesn’t tell us much about how the effective reproduction number changes across time. For that, I direct your attention to the Lin’s Laboratory Group at Harvard, they have a great figure that shows the fluctuation of the effective reproduction number for the U.S. and its states/territories (see example):

Source: Lin’s Laboratory Group at Harvard (link). [last accessed on 30 August 2021].

CONCLUSIONS

The effective reproduction number provides us with some interesting patterns in spread of COVID-19 by states/territories. It seems to have worsened over time, but this could be due to poor data early in the pandemic. There are some issues with the us of effective reproduction number for policy decisions. Reporting delays can impact the estimates for the effective reproduction number. A technique called “nowcasting” is used to estimate the reproduction number.[3] But when I explored some of the work in this area, there appears to be a variety of methods for performing this technique. Despite this limitation, the effective reproduction number may be useful to evaluate public health policy decisions to reduce the spread of the COVID-19 pandemic.[4,5]

 

DATA SOURCE

I provided the link to the COVID-19 Spread Tracker from the Lin Lab at Harvard. You can also download a curated version of the data for this article from my Dropbox folder. The data are current as of 17 August 2021. If you’re interested in recreating this Cleveland plot, I recommend downloading the most recent data to see how much the effective reproduction number has changed.

REFERENCES

  1. Worldometeres.info. COVID Live Update: 217,770,381 Cases and 4,521,936 Deaths from the Coronavirus - Worldometer. Accessed August 30, 2021. https://www.worldometers.info/coronavirus/

  2. Lim J-S, Cho S-I, Ryu S, Pak S-I. Interpretation of the Basic and Effective Reproduction Number. J Prev Med Pub Health. 2020;53(6):405-408. doi:10.3961/jpmph.20.288

  3. Adam D. A guide to R — the pandemic’s misunderstood metric. Nature. 2020;583(7816):346-348. doi:10.1038/d41586-020-02009-w

  4. Inglesby TV. Public Health Measures and the Reproduction Number of SARS-CoV-2. JAMA. 2020;323(21):2186-2187. doi:10.1001/jama.2020.7878

  5. Pan A, Liu L, Wang C, et al. Association of Public Health Interventions With the Epidemiology of the COVID-19 Outbreak in Wuhan, China. JAMA. 2020;323(19):1915-1923. doi:10.1001/jama.2020.6130

Communicating data effectively with data visualizations: Part 38 (Replicating the LA Times COVID-19 tracker)

INTRODUCTION

Recently, the staff at the Los Angeles Times (LA Times) provided a COVID-19 tracker on their website. This is an impressive set of data visualizations of COVID-19 cumulative cases, new cases, vaccinations, and deaths. I was particularly struck by the “New cases by day” figure which includes a bar chart overlaid with a 7-day moving average line chart. The visualization effectively used the moving average to adjust for the spikes in new COVID-19 cases but maintained the spikes on a daily basis. None of the data are lost and illustrates the spikes in new COVID-19 cases while adjusting for the moving average. The color schemes were also optimal where the daily new cases used a softer color, but the moving average line used a darker color highlighting its importance in the figure.

I wanted to write an article on how to replicate this figure using Excel.

Source: Los Angeles Times, “Tracking the coronavirus in California,” url: https://www.latimes.com/projects/california-coronavirus-cases-tracking-outbreak/ [Accessed on June 24, 2021] * This is for educational purposes only.

DATA SOURCE

Data used in this article can be found on the LA Times GitHub site. I used the “latimes-county-totals.csv” data (link to the raw data). I also made the data available with the final figure on the following Dropbox location.

 

TUTORIAL

Step 1. Download and visually inspect the data.

After you’ve downloaded the data, take a moment to inspect them. The columns that are used in this tutorial are “date” and “new_confirmed_cases.” But you can use the other columns to replicate other parts of the LA Times COVID-19 tracker.

Step 2. Insert a bar chart and select the appropriate data.

Insert a clustered column chart using the Insert tab on the Excel ribbon. When selecting the data, make sure that you select the “new_confirmed_cases” (other data are available, but the new cases are what we are replicating in this exercise).

The default bar chart does a pretty good job of replicating the LA Times figure.

However, we’ll have to do a few edits to the axes to match the LA Times figure.

 

Step 3. Modify the axes.

Let’s focus on the Y-axis first. Right-click on the Y-axis and select “Format Axis…” In the Axis Options panel, change the Minimum value to 0 and the Major value to 20000. This will match the settings in the target figure. (Note: There are negative values in the data, but these are very small numbers and assumed to be ignorable.) Next, in the Number options, change Category to “Number” and the value in the “Decimal places” to 0. Make sure that you check the box next to “Use 1000 Separator (,)” to replicate the same format in the target figure.

For the X-axis, right-click on the bottom axis and select “Format Axis…”This will open the Axis Options panel where you can make several adjustments to the X-axis. First, we want to change the X-axis display values from dates to months. Change the Number Category field to “Custom” then change the Format Code to “mmm”; make sure to click on “Add” for the changes to take effect. Next, go to the Axis Type area and change the Minimum to “02/01/2020” since we want our timeline to begin on Feb of 2020. Then change the Major value to 4 to match the monthly interval of the target figure. The X-axis should be thicker with tick marks on the outside. To modify these, navigate to the Tick Marks option and change Major type to “Outside” and then click on the Paint Bucket (Fill & Line) option; increase the Width to 1.5. These should match the target figure’s X-axis format.

Step 4. Add the 7-day moving average.

Excel has a Data Analysis tool that will automatically estimate the 7-day average. I’ve written a previous tutorial that describes how to use this tool. I’ll briefly review how to estimate a 7-day moving average.

In the Data tab, click on the Data Analysis tool (instructions on how to install the Data Analysis tool is here). This will open the Data Analysis Tools box. Select “Moving Average” from the tools kit and enter the appropriate values in the options box. For the Input Range, select all the values from “new_confirmed_cases” column. Enter a value of “7” in the Interval field; this will automatically calculated the 7-day moving average. In the Output Range, select a single cell where you want to moving average to be pasted after it is calculated. I chose to use the next available cell on the dataset ($F$2).

Step 5. Add the 7-day moving average to the chart.

To include the moving average data to the current daily new cases bar chart, right-click on the chart and select “Select Data.” This will open a box where you can add new data. Select “Add” which will open the “Edit Series” box. Updates the Series name with the name of the column (“moving_avg”). For the Interval field, change this to “7” for the 7-day moving average. Then in the Series values, select the 7-day moving average data.

By default, Excel will generate a bar chart for the 7-day moving average. However, we want a link chart. We can change this by right-clicking on the bars of the chart and selecting “Change Series Chart Type…” This will open a box that will allow us to select the type of chart for each data. For the “moving_avg” data, change the Chart Type to “Line.” This will create a line chart for the 7-day moving average which will be overlaid over the daily new cases.

Step 6. Modifying the chart format.

To closely match the chart to the one presented in the LA Times, I made the following adjustments. Your mileage may vary depending on the library of fonts available. I tried to select fonts that most Excel users will have access to.

I changed the Y-axis font to Adabi script. The X-axis font was changed to Arial Nova.

The width of the horizontal gridlines was increased to 1.5. The color of the daily new cases bar chart was changed to a light blue using a hex code of #8DC6DF. The color of the 7-day moving average was changed to a dark blue using a hex code of #2B869B; additionally, the width was increased to 2.0.

Step 7. Comparison between LA Times and user-generated charts.

Once the modifications have been made, compare the charts.

CONCLUSIONS

Using data from the LA Times, we can replicate the data visuals on their COVID-19 tracker website. This allows users to verify the data that are presented on a public site. Additionally, it allows us to generate our own data visualizations that could inform policy and education the public on the rate of new cases in California.

 

REFERENCES

Data was based on the LA Times (link), which was accessed on 24 June 2021.

Excel file used for this exercise can be download from the following Dropbox folder.

 

ACKNOWLEDGEMENT

The data visual used in this exercise was based on the work of the staffers at the LA Times. They deserve all the credit and acknowledgement for developing these stunning visuals.

Communicating data effectively with data visualizations: Part 37 (visualization COVID vaccinations by race, gender, and age)

INTRODUCTION

The introduction of COVID-19 vaccines sparked a race to administer as many people as possible in the United States (US). To date (31 May 2021), there have been over 167 million people with at least one dose of the vaccine administered according to the Centers for Disease Control and Prevention (CDC).1 However, there have been some concerns about disparities in vaccine coverage among racial, gender, and age groups. The CDC Covid Data Tracker indicates that a disproportionate number of racial groups are not receiving the vaccine. Additionally, there are a disproportionate number of older people who have received at least one dose of the vaccine and less males receiving the vaccine compared to the percentages that make up the US population. We will use simple data visualization techniques to characterize the vaccination of racial groups in the US using data from the CDC.

 

DATA SOURCE

Data used in this article are located at the CDC’s website (link).

There are several pieces of data that are important. The “% Persons at least One Dose” represents the percentage of people who received at least one dose in the US among those with data on race. The “% US population” represents the percentage of the US population made up by the group. For example, among those who received vaccine, 14.2% identified as Hispanic/Latino, which makes up 17.2% of the entire US population.

Figure 1.jpg

VACCINATIONS BY RACE

A horizontal bar chart was constructed to visually compare the percentage of people in different racial groups who received a vaccine and the percentage of people that made up that racial group in the US.

Figure 2.jpg

Based on this chart, the percentages of Hispanic/Latino and Black, non-Hispanic who received vaccines are less than the percentages that make up the US population. This is in contract to the White non-Hispanic and Asian, non-Hispanic racial groups who have a similar percentages of people who received the vaccine and make up the US population. This figure gives us an illustration on the potential disparities associated with vaccinations.

However, another way to view this figure is to look at the differences in the percentages that received vaccination and make up the US population. With the figure below, we can easily see the disparities.

Figure 3.jpg

VACCINATIONS BY GENDER

According to the CDC, the percentage of males who received vaccination is lower than the percentage that make up the US population. Females have the opposite issue; the percentage of females who received vaccination exceed the percentage the make up the US population.

Figure 4.jpg

We can also view this as the differences in the percentages that received COVID-19 vaccination and make up the US population. The percentages of females who received a vaccination exceeds the percentage that makes up the US population, but the percentage of males who received a vaccination is below the percentage that makes up the US population.

Figure 5.jpg

VACCINATIONS BY AGE GROUPS

According to the CDC, most of the vaccinations occurred among age groups of 40-49, 50-64, 65-74, and 75+ years. However, fewer percentages of the younger age groups have received the vaccination compared to the percentages that make up the US population.

Figure 6.jpg

When we change the metric to look at the differences in the percentages that received the vaccination and the percentages that make up the US population, it is apparent that younger age groups are not receiving the vaccines. However, this is a reflection of the vaccination roll out when early on older age groups were given priority.

fIGURE 7.jpg

CONCLUSIONS

Using simple data visualization techniques and differences in the percentages of vaccination and US population composition, we can graphically display disparities in vaccinations by racial, gender, and age groups. It is unclear what causes these disparities, but it will be paramount that any barriers to a fair and equitable distribution of healthcare be addressed. The biggest challenge will be to establish healthcare policies that will distribute vaccines equitably to all patients in the US.

 

REFERENCES

  1. Centers for Disease Control and Prevention. COVID Data Tracker. Centers for Disease Control and Prevention. Published March 28, 2020. Accessed May 31, 2021. https://covid.cdc.gov/covid-data-tracker

Communicating data effectively with data visualizations: Part 36 (Scatter plots and Moving Averages for COVID-19 vaccination)

INTRODUCTION

As of 27 April 2021, the Centers for Disease Control and Prevention (CDC) reported that approximately 42.7% of the U.S. population have received at least one dose of the coronavirus (COVID-19) vaccine and 29.1% are fully vaccinated.1 Moreover, a total of over 200 million doses have been administered. Despite the effectiveness of the availability of COVID-19 vaccines, there are some people who have decided to not receive the vaccine. The Economist published a data visualization that summarized the results of weekly surveys that asked respondents, “When a covid-19 vaccine becomes available to you, will you get vaccinated?” This figure shows that approximately 60% of the population will get or already have received the COVID-19 vaccine. However, a little over 20% stated that they will not get the vaccine and nearly 20% are unsure. Using data from The Economist, we can use Microsoft Excel to recreate the figure below:

* Source: The Economist – “When a covid-19 vaccine becomes available to you, will you get vaccinated?” (link). This is for educational purposes only.

DATA

The data to recreate the figure can be found at The Economist YouGov data site (link). I also abstracted the data and saved them onto an Excel file, which is located here. Using the data from The Economist, we will create a scatter plot and use the 2-day moving average to re-create the figure above.

Creating the Scatter Plot with Moving Average

Step 1. Inspect the data

The data include the percentage of respondents who answered “Yes” (includes both Yes and those who received the vaccine), “No”, and “Not Sure”. The timeline is composed of the calendar year, month, and the week the data was collected. Notice that for some of the months, there were 5 weeks. I altered the scale so that the data would fit into a 4-week bin. Hence, you’ll notice that the weekly scale is 1, 2, 3, 3.5, and 4.

Step 2. Highlight the data and insert a line chart

Step 3. Modify the data source

Once the line chart is inserted onto the workspace, you’ll notice that the scatter plot hasn’t been created. A few more steps are required. Right-click on the chart and click on “Select Data” to open the window that will allow you to make modifications to the data source.

Click on the “Edit” option and select the rows containing the Year and Month on the data sheet. This will create the X-axis labels for our data visualization. You should notice that the X-axis labels have changed to reflect the Year and Month.

To further modify the data source, right-click on the chart again and click on “Select Data….” At the data source window, remove “Year”, “Month”, and “Week” from the legend series to only display the percentage of respondents to the survey question. Once these have been removed, click “Ok.” This will generate a link chart with the respondents’ answers to the survey question.

Step 4. Change line chart to scatter plot

Currently, the X-axis displays the data from 2021 before 2020. We can change this by right-clicking on the X-axis and then clicking on “Format Axis.” This brings up a window where we can click on the box by “Categories in reverse order” for the X-axis to display the respondents answers in chronological order.

To change the line chart to a scatter plot, right-click on the line and select  “Format Data Series…” This will open the options to modify the data on the chart. Instead of a line chart, we want a scatter plot. To make this change, click on the paint bucket and under the line group click on “No line” to remove the line chart. Then under the Marker group, select the circle under “Type.” This will replace the line chart with a scatter plot.

Repeat this for the other lines and you should have a chart that looks like the following:

Step 5. Adjust the aesthetics and add a moving average line

The final step involves making changes to the aesthetics of the chart. You can remove the labels on the Y-axis, delete the gridlines, delete the chart title, eliminate the border of the X-axis, and then delete the legend.  You can add moving average lines for each of the scatter plots by right-clicking on the data points and selecting “Add Trendline…” Then select “Moving average” with “2” periods. Change the moving average line from a dashed line to a solid line and a higher thickness.

The final figure can be achieved by changing the colors of the scatter plots and moving average lines. You add additional labels to emulate the figure from The Economist.

CONCLUSIONS

Using data from The Economist, we recreated the figure showing how the perceptions of respondents changed regarding getting a COVID-19 vaccine from 2020 to 2021. All the data files can be downloaded from here.

 

REFERENCES

1. CDC. COVID Data Tracker. Centers for Disease Control and Prevention. Published March 28, 2020. Accessed April 27, 2021. https://covid.cdc.gov/covid-data-tracker

The Economist YouGov data site (link)

 

Communicating data effectively with data visualizations: Part 35 (Heat maps – COVID-19 Deaths by Week and State)

INTRODUCTION

With the COVID-19 pandemic entering its second year, the number of deaths in the United States (US), has total over 546 thousand deaths (data from March 26, 2021). Of note is the variation in death rates across US. CDC provides a daily update on the number of deaths reported by each state (and major cities) in the US, which is accessible through their portal. However, it can be difficult to visualize all these data; proper visualizations will assist the audience with interpretability and add to the narrative that the COVID-19 pandemic has been more severe is some states compared to others. The Economist provided some helpful data visualizations for Great Britain by using heat maps to report the deviations from the expected weekly death counts due to COVID-19. Here is an example.

* Source: The Economist – Tracking COVID-19 excess deaths across countries (link). This is for educational purposes only.

From this visualization, it is clear that all sectors of Britain were hit by COVID-19 at similar times. Some areas had higher deviations from the expected mortality rates, while other had lower deviations. Additionally, there appears to be another peak in the deviations from the expected number of deaths in late 2020. Using this example as a model, we can recreate this heat map for the US. Fortunately, The Economist has a GitHub portal with all the data for different countries including the US.

DATA

Data for this exercise is located the The Economist GitHub portal. We will use the “covid-19-excess-deaths-tracker,” which includes data on the deviation from the expected deaths (%). Data on the deviation from the expected deaths (%) for the US will be used to emulate the example from The Economist.

 

Heat map of deviations from the number expected death (%) for the US.

We will use Excel to create a heat map of the weekly deviations from the number of expected death because of COVID-19 in the US from January 2020 to February 2021.

Step 1. Inspect the data:

The data for this exercise is located here. Download and open the file. Review the data under the “exercise data” sheet. It should look like the following:

We will manipulate this data so copy and paste this into a new sheet.

Step 2. Add spaces:

One of the elements that The Economist added to their heat map were spaces between each state or territory.

This can be accomplished manually by inserted a row between each state or territory. However, when there are a lot of rows, this can be a burden. A systematic method includes a couple of columns (number and alternate) to create the spaces we need. These will represent the spaces between the states/territories.

Once you have created these, select the entire data set. Then sort them by “number” and then by “alternate”; see picture below:

You should have spaces between each state/territory now.

Step 3. Change column width and row height:

Select the columns and right click to change the width. Set the value to 2. Once you done this, the weekly deviations should look like the following:

Next, use the Filter feature in Excel to only select the “alternate” values  of 1. This should only show values for the “alternate” column that equal to 1. Select the rows where the column values are 1 and right-click. This will allow you to select “Row Height” so that you can change the value to 25. Please see the illustration on how to do this below:

Now that you’ve change the row height for the “alternate” column values for 1, we will change the row height for the “alternate” column values for 0 and 0.5 using the same method. The only difference is that the row height will be 5 instead of 25. See the illustration below on how to perform this step:

Step 4. Add condition formatting rules:

Now that the data column width and row height have been set properly, you can add the conditional formatting rules to change the color of the boxes to reflect the deviations from the number of expected death (%) due to COVID-19.

First, select all the data. Then go to the Home tab and select “Conditional Formatting.” Make sure to click on “New Rule...” Select the “3-Color Scale” because we will use the gradient levels to create our heat map. The darker color will denote the higher deviations and the lighter color will denote the low or no deviations from the number of expected deaths (%). Make sure to change the “Type” to “Number.” You can enter any values that work for you, but for this example, use “0”, “0.5”, and “1” for the “Minimum”, “Midpoint”, and “Maximum” types, respectively.

Next, with the data still selected, right-click on the data to open a window. Select “Format Cells” because we want to remove the values from the cells. In the Number tab, select “Custom” and enter the following: "";"";"";"". This will change the number values so that it doesn’t show in the cells.

The Excel file should start looking like a heat map. However, there are a few more features we can do to emulate the heat map on The Economist website.

Step 5. Adding borders to the heat map:

After formatting the cells so that the numbers are not visible, you can add borders to the heat map. Select the heat map area, right-click anywhere on the data and select “Format Cells”; you will select the “Border” tab and change the border color to white. Then apply the borders to the heat map.

Once you’ve applied the border color change, the heat map should look like the following:

Next, we will add a border to separate the states/territories from each other.

Use the “Filter” feature and select rows where the “alternate” values are 0.5. The rows will be very narrow since we set them to a value of 5 earlier. Now, select the data area including the state/territories column and right-click on the data. Select “Format Cells” and then change the border color to a light shade of gray and apply this to the top border. This will apply a light gray border to separate the states/territories from each other. See illustration below for how to perform this step.

Once you’ve added the border, the heat map should look like the following:

Step 6. Final steps:

In the last steps, you can change the font, hide some of the values, and add a label to identify the weekly deviations. For the final figure, I hid the “alternate” and “number” columns, and I hid the weekly labels. I replaced the weekly labels with a more informative timeline to based on quarters. I also rearranged the states/territories so that I had the United States at the top and some of the more active states/territories at the top (e.g., New York and New York City). This should be remarkably similar to the heat map developed by The Economist. The final figure can be downloaded from the following link.

CONCLUSIONS

Using the data from The Economist, we recreated their heat map for the US. We can visualize when the COVID-19-related deaths exceeded the expected all-cause mortality deaths on a weekly basis from January 2020 to February 2021. For example, New York City was hit with a high percentage of COVID-19 deaths around April 2020, which was also reported for The District of Columbia, New York State, and New Jersey. However, North Dakota, South Dakota, California, and Arizona were hit with high percentages of COVID-19 deaths that exceeded the expected number of deaths later in the year.

Heat maps provide us with a useful visual to compare and contrast different states/territories and the severity of COVID-19 deaths using simple tools provided by Excel.

 

REFERENCES

I used the lessons from a previous blog on developing heat maps (link).

Data on deviations from the number of expected deaths (%) in the US was provided by The Economist (link). However, I cleaned the data so that you can use it for your own exercise (link).