September 22, 2020

Communicating data effectively with data visualizations: Part 29 (Forest plots)

September 22, 2020

INTRODUCTION

A recent paper by Wang and colleagues reported that patients with recent diagnosis of substance use disorder (SUD) had a greater risk for COVID-19.[1] The adjusted odds ratio was 8.699 with a 95% confidence interval (CI) of 8.411 to 8.997). Patients with opioid use disorder (OUD) were at the greatest risk.

The authors used a forest plot to summarize their findings (Figure 1). A forest plot is a diagram that displays the measurement of interest (e.g., odds ratio) with horizontal error bars to represent the 95% CI for several variables, which are aligned on the Y-axis. Forest plots are common in pair-wise meta-analysis where multiple studies are used to describe the effect size of the treatment versus the comparator group. Studies are arranged along the vertical axis and the odds ratio with 95% CI are displayed next to the studies. This allows the readers to see the effect size (e.g., odds ratio) and the uncertainty surrounding each study (or variable) in the meta-analysis.

Wang and colleague used this method to illustrate the odds of developing COVID-19 for different types of substance use disorders diagnosis along with their uncertainties. It’s an effective way to illustrate how much of a risk each SUD diagnosis category is associated with developing COVID-19.

Figure 1. Forest from the study that we will recreate.[1]*

(*This figure is used for educational purposes only.)

DATA

We will use the study by Wang and colleagues[1] to recreate their forest plot using Excel. Although it is much easier to code this in Stata or R, for the purpose of this tutorial, we will use Excel.

Step 1. Get the data

Since there are a few data points, we can enter these directly into an Excel spreadsheet. There are some nuances that we will need to consider when plotting these data points. A template is available to assist you with entering the data correctly. I have provided an illustration below for how you should set up your data in Excel. (You can download this template here.)

Once the data have been entered into Excel, we can begin to generate the figure.

Step 2. Insert a scatter plot chart

In the tab, select Insert and then select the scatter plot drop down. You will see a series of different scatter plots to choose from. Select the one that says “Scatter.”

Step 3. Select the data for the Scatter plot

Right-click in the empty chart area to select the data. Select the Add button to select the data of interest.

In the Series name field, enter “data.” Then for the Series X-values, select the values in the Odds ratio column. For the Series Y-values, select the value in the Y position column as indicated in the figure.

Your scatter should appear on the Chart area.

Step 4. Add labels on the Y-axis

We want to have the labels on the Y-axis indicating what SUD diagnosis was associated with an increased odd for developing COVID-19. Similar to the previous data entry, we will begin by Right-clicking on the Chart Area and selecting Add data. For the X values, select the values in the Y Label position column. For the Y values, select the values in the Y position column as indicated below. Once you select the data, the Chart Area will update with the orange scatter points on the Y-axis.

Right-click the orange scatter points on the Y-axis. Then select the “Add Data Labels” to include data labels to the orange scatter points.

Right click on the data labels and click on the Format Data Series. Make sure to check the box next to Series Name and uncheck the box next to Y value. Check the box next to the Values From Cell, click on the Select Range box to open another window where you can select the data labels. Select all the SUD diagnosis for this data range in the Select Data Label Range field.

Data labels for the SUD diagnoses will appear on the right side of the scatter on the Y-axis. Next, we will reposition the SUD diagnoses labels to the left of the Y-axis. Right-click on the orange scatter and then select “None” in the Marker options to remove the scatter on the Y-axis. Then Right-click on the Y-axis value and then enter the “Delete” key on your keyboard. This will remove the labels on the Y-axis (e.g., 1, 2, 3, 4, 5, 6, and 7). All that should remain are the SUD diagnoses labels. To reposition these, right-click on the SUD diagnoses labels and then check the “Left” option in the Label Position field.

This will reposition the SUD diagnoses labels to the left of the Y-axis.

Step 5. Add the error bars for the 95% CI

Next, we will include the horizontal error bars to represent the 95% CI. Click anywhere in the Chart Area and the Chart Design tab will become available in the Ribbon. Select “Add Chart Element” dropdown arrow to open up the available options and select “Error Bars.” Select the “Standard Errors” to display both the horizontal and vertical error bars around the scatter points on the chart.

We want to keep the horizontal error bars, so we need to delete the vertical error bars. Select the vertical error bars and then hit the “Delete” key on your keyboard. This will remove the vertical error bars and leave you with only the horizontal error bars. Next, you want to adjust the horizontal error bars. Currently, this is not displaying the correct 95% CI. Right-click on the horizontal error bar and select the “Format Error Bars…” option.

Select the “Specify Value” box next to the Custom option for the error bars. We will determine what Excel plots for the 95% CIs. For the “Positive Error Value” select the values under the “UL – OR” column. Similarly, for the “Negative Error Value” select the values under the “LL – OR” column.

The appropriate error bars now reflect the 95% CIs from the figure generated by Wang and colleagues.

Step 6. Adding the null line at Odds Ratio = 1

To add the vertical line that cross where the odds ratio (OR) is equal to 1, we need to use the column “Null position.” Right-click anywhere on the Chart Area and click on “Select Data. This opens up the data menu. Click on “Add Data.” Then follow the instructions to select values in the Null Position column for the Series X values box and the values in the Y position column for the values in the Series Y values box. We’ll name the “null” data series since this is where the odds ratio is equal to 1.

Right click on the scatter and select “Change Series Chart Type” to open the window where you can select different chart styles. For the Null data series (e.g., odds ratio is equal to 1), change the Chart Type from “Scatter” to “Scatter with Straight Lines.” This will generate a straight vertical line along the values where the odds ratio is equal to 1.

Right-click on the scatter point along the straight line and Format Data Series…; then remove the marker by selection None under the Marker Options.

Step 7. Modify the chart presentation

At this stage, most of the necessary steps to include the forest plot is complete. Final steps involve changing the colors, adjusting the length of the Y-axis, and removing the gridlines. I also included the odds ratio (OR) and 95% CI by the forest plot on the right side by enter each value into the corresponding cells in Excel. I also added a blue line at the top of the forest plot and some text boxes for the labels corresponding to the SUD diagnoses and the odds ratios with their 95% confidence intervals.

CONCLUSIONS

The final Excel forest plot is similar to the one generated by Wang and colleagues.1 I opted to leave out the P-values since they were all significant and did not include any additional information to the chart. Some additional modifications included the use of a red dotted line for where the odds ratio (OR) is equal to 1 and the use of a blue top border to separate the labels for the chart columns.

Forest plots are great when you want to show the impact each variable has on a particular outcome. In our example, each of the different SUD diagnosis has an impact on the odds of developing COVID-19. From the forest plot, it is easy to identify OUD as having the greatest odds of developing COVID-19.

One additional thing that we can do is order this from the highest odds ratio to the lowest odds ratio, which will give us a better way to compare relative strengths across the different SUD diagnosis categories (see below).

You can download the Excel file for this exercise here.

REFERENCES

Wang QQ, Kaelber DC, Xu R, Volkow ND. COVID-19 risk and outcomes in patients with substance use disorders: analyses from electronic health records in the United States. Mol Psychiatry. Published online September 14, 2020:1-10. doi:10.1038/s41380-020-00880-7.

Mark Bounthavong

August 21, 2020

Data visualization

Communicating data effectively with data visualizations: Part 28 (Scatter plot)

Mark Bounthavong

August 21, 2020

Data visualization

INTRODUCTION

Between 1918 to 1919, the influenza pandemic (also known as the “Spanish Flu”) raged across the world and caused over 40 million deaths. Cities in the United States enacted nonpharmaceutical interventions (e.g., social distancing, shelter-in-place mandates) to reduce the transmission of the influenza pandemic, overall and peak attack rates, and the number of deaths. Some of the cities were successful in mitigating the calamity associated with the pandemic, but others were not. The experiences that these cities learned in the past yield important insight for policy makers today to tackle the current COVID-19 pandemic.

Markel and colleagues (2007) reported on the impact of nonpharmaceutical interventions enacted by cities in the United States and their effect that they had on mitigating the influenza pandemic of 1918 to 1919.[1] Briefly, their report highlights that cities that implemented these public health interventions early had greater delays in the time to reaching peak mortality, lower peak mortality rates, and lower total mortality.

We will recreate one of the figures (Figure 1c) in this manuscript using Excel and the data provided.

Figure from the study that we will recreate.[1]*

DATA

Data for this tutorial come directly from the study’s Table 1. We will use the Public Health Response (days) in the X-axis and the Excess Pneumonia and Influenza Mortality rate (deaths per 100,000 population). You can download the data from the following link.

Step 1. Get the data

Download the data from this link. Data has been cleaned specifically for this tutorial.

Step 2. Insert a scatter plot chart

After downloading the data, open the Excel file. Look for the column that contains the Public health response time, days; this will be the data for the X-axis. Now, look for the column that contains the Excess pneumonia and influenza mortality, deaths / 100,000 population; this is the data for the Y-axis.

In Excel, insert the Scatter plot by selecting the Scatter option in the Charts tab.

Step 3. Select the data for the Scatter plot

An empty figure will appear. Right-click in any area in the empty figure and you should be able to click on “Select Data”. From there, click on “Add” to add data and select the appropriate data for the X-axis values and the Y-axis values.

Clicking “OK” will generate a scatter plot of the excess deaths across the time the public health responded to the pandemic.

Step 5. Adjust the axes

First, we want to move the Y-axis so that it is flushed with the left side of the chart instead of intersecting at zero on the X-axis.

This will change the Y-axis position from its intersection on the X-axis = 0 to X-axis = -15.

Step 5. Change the color of the scatter

To finalize the scatter plot, change the color and size of the scatter.

FINAL SCATTER PLOT

Once all the adjustments have been made, we can add some data labels for some of the select cities, which were also highlighted with a different color.

CONCLUSION

After recreating the figure from the paper by Markel and colleagues,[1] it is clear that as public health response is delayed, there is a general trend for excess deaths due to the influenza pandemic to increase. Although other types of interventions occurred during this pandemic, the findings from Markel and colleague provides some empirical evidence that early public health measures have significant contributions in terms of mitigating the excess deaths due to the influenza pandemic. Policy makers can use the lessons from the past to inform them about the effectiveness of public health nonpharmaceutical interventions in delaying or reducing the mortality of the current COVID-19 pandemic.

REFERENCE

Markel H, Lipman HB, Navarro JA, et al. Nonpharmaceutical Interventions Implemented by US Cities During the 1918-1919 Influenza Pandemic. JAMA. 2007;298(6):644-654. doi:10.1001/jama.298.6.644

Mark Bounthavong

July 20, 2020

Data visualization

Communicating data effectively with data visualizations: Part 27 (Building a COVID-19 surveillance dashboard using Power BI)

Mark Bounthavong

July 20, 2020

Data visualization

INTRODUCTION

With the advent of the COVID-19 pandemic, several high-profile dashboards have been created using data from around the world. Since Power BI is a new tool, I thought it would be a good opportunity to develop a simple COVID-19 dashboard using it using existing data.

We will use this opportunity to create the following dashboard:

DATA

Data for this tutorial came from Our World in Data (OWID) GitHub site. You can download a version of the data that focuses on the United States from my Dropbox folder. There are a lot of variables in this dataset; however, the codebook for all of the variables is located here.

POWER BI TUTORIAL

Step 1. Download data

The data for the United States is located here. Download this to a convenient location.

Step 2. Loading the data into Power BI

When you start Power BI, select the “Get data.” Make sure you click on the Excel format and then click on “Connect.” Locate the data where you saved it and then in the Navigator window, select “Sheet 1.” Make sure to Load the data to complete the process.

Step 3. Plotting total deaths

In the Visualizations panel, select the link chart. In the Fields panel, you can select the “data” variable. However, we will not use the built-in hierarchy. Instead, we will use the date order (e.g., January 1, 2020; January 2, 2020; January 3, 2020; etc).

After you have selected the date for the X-axis, the next variable is the Total Deaths.

Step 4. Change the labels

The current visualization uses the default labels. However, we can change these in the Visualization panel.

To change the name of the visual, you will need to use the Format option

Once you’ve done these steps, the first visualization should be nearly complete. Additional edits to the visualization can be done using the Format (paint roller icon) such as the background color and line color.

Step 6. Include the Card text

You can add additional visualization such as the number of deaths using the card feature in the Visualization panel.

Step 7. Repeat steps for the other visualizations

You can repeat these steps for the other COVID-19 metrics (e.g., new deaths, total cases, and number of new cases).

Once you have added all the visualizations to your dashboard, you will hopefully get something like the example below.

CONCLUSIONS

Power BI makes it very convenient to build dashboards using available data. Other software are also available, so I would recommend exploring them. However, I would also encourage you to explore other dashboards that are created using Power BI; Microsoft has a library of great examples. By using the methods presented in this tutorial, you can build other dashboards using different types of data.

Files used in this tutorial are available here.

REFERENCES

Data for this tutorial came from the Our World in Data GitHub site.

Mark Bounthavong

June 26, 2020

Data visualization, GIS

Communicating data effectively with data visualizations: Part 26 (COVID-19 choropleth)

Mark Bounthavong

June 26, 2020

Data visualization, GIS

INTRODUCTION

Power BI is a useful tool that allows you to build dashboards and data visuals. One ability that makes Power BI unique is its mapping feature. Although there are pre-built map functions in Power BI, better and easier applications are available through their AppSource store. This tutorial will go over how to create a choropleth of the most recent COVID-19 pandemic using the incidence rate (dated June 21, 2020) for the United States (US) using Power BI and the Drilldown Choropleth application.

DATA

You will need to download two types of data: (1) COVID-19 incidence rate for each county in the United States (You can download the Excel file from this Dropbox link) and (2) Shape files for the United States. Fortunately, there are many sources of shape files for the United States, and I have made one available using the Albers USA projection, which has the state of Alaska and Hawaii in the lower left corner of the map. You can use the link to the map shape file here. Data for the shape files comes from the US Census. [I downloaded the ZIP file and converted the SHP files to TopoJSON using the following conversion tool (Mapshaper.org).] The COVID-19 data contains US county codes called the Federal Information Processing Standard Publication (FIPS), a 5-digit code that identifies counties. Incidence rate is defined as the number of confirmed cases per 100,000 persons.

Albers US projection

POWER BI TUTORIAL

For this tutorial, you will need to have Power BI installed on your desktop to re-create this choropleth.

Step 1. Get the data

When you start Power BI, you will see a start screen that allows you to select the data you need. Since the COVID-19 incidence rates for the US data are saved as an Excel file, you will need to select Excel as the data type and then connect to the data.

After clicking “Connect,” the Navigator will open. Select Sheet 1 containing the data. Notice that there are two variables (FIPS1 and FIPS). These are the data that will be used to pair the incidence rate to the US county code. The FIPS code has a “0” in front of some of the county code (e.g., 05001) for a total of 5 digits. However, notice that these are in numeric format so the preceding “0” is missing. To remedy this, we need transform the data. Make sure to select “Transform Data” from the Navigator screen.

After selection “Transform,” you will need to change the FIPS variable from numeric to text.

You will be asked to either “Replace” or “Add New Step;” make sure that you select “Replace.” This will replace the numeric data to text, which is what we want for the FIPS variable. Check the data after you do this; notice that the “0” is now preceding values that are less than 5 digits. (FIPS code is a 5-digit county code.) After the data has been transformed, select “Close & Apply” to finalize the data. This will load the data into Power BI, which you will use to build the choropleth.

Step 2. Download and Install “Drilldown Choropleth” app

Power BI allows you to download apps from the AppSource store. Click on the “…” and select “Get more visuals.” This will open the AppSource store where you can type the name of the app you want.

Step 3. Enter data into the Drilldown Choropleth app

After you install the Drilldown Choropleth, an icon will appear in your Visualization panel. Click on this icon (Drilldown Choropleth); this will open up options in the Visualization panel. There are two important fields (“Location” and “Values”). Click and drag the FIPS variable into the “Location” field and the Incidence_Rate variable into the “Values” field.

Next, click on the Paint roll icon to open the Drilldown Choropleth options. Expand the “Shape” option to change the “Projection” to “albersUSA” and then enter the link to the TopoJSON shape file for the United States counties. The TopoJSON link is here: https://raw.githubusercontent.com/mbounthavong/Maps/master/c_03mr20.json

After entering the projection type and the TopoJSON link, you can make changes to the color of the choropleth. Since this is a choropleth, you will need to enter values for the Minimum, Center, and Maximum colors. This will generate a gradient where the darker colors reflect high incidence rates and the lighter colors reflect low incidence rates of COVID-19.

Finally, you can change the size of the visual and save it. Since there are a lot of data being processed, updating the choropleth will take a few minutes. With a little patience, you will generate a suitable choropleth of the COVID-19 incidence across US counties for June 21, 2020.

CONCLUSIONS

Power BI allows us to create visually spectacular choropleths. Additionally, it allows us to create choropleths that are interactive. You can hover over each shape and see the incidence rate per county. However, there are limitations. Inserting a legend to describe what the colors represent is not easily possible (I was unable to solve this problem). There are roundabout ways to get the legend inserted into this choropleth, but it will require additional software (as far as I know). There is potential for the Drilldown Choropleth app to improve by adding an easy way to include functionalities for legends and labels. Despite this limitation, Power BI allows us to create these choropleths without having to resort to more complex programming in R or Python.

Files used in this tutorial are available here.

REFERENCES

Mapshaper.org (link: https://mapshaper.org/)

US Census Cartographic files (link: https://www.census.gov/geographies/mapping-files/time-series/geo/carto-boundary-file.html)

COVID-19 data come from the Johns Hopkins GitHub site (link: https://github.com/CSSEGISandData/COVID-19/blob/master/csse_covid_19_data/csse_covid_19_daily_reports/06-21-2020.csv)

Mark Bounthavong

May 19, 2020

Data visualization

Communicating data effectively with data visualizations: Part 25 (Moving averages)

Mark Bounthavong

May 19, 2020

Data visualization

INTRODUCTION

As the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) continues around the world, data visualization experts have provided stunning visuals of the pandemic’s terrifying impact on public health. Several of these include a moving average metric that smooths the curves of the pandemic’s mortality data. Common moving averages presented by several of these visualizations include the 5-day average daily deaths and the 7-day average daily deaths. Information Is Beautiful presents mortality on a 7-day moving average while International SOS presents the average daily deaths as 5-day moving averages.

Figure 1. 7-day moving average daily deaths by Information Is Beautiful (Last updated on 19 May 2020).

MOTIVATING EXAMPLE

In this article, we will replicate the 5-day and 7-day (moving) average daily deaths due to SARS-CoV-2 using data from the European Centre for Disease Control and Prevention (ECDC)1 or from the Our World in Data’s GitHub site.2 Due to the changing nature of SARS-CoV-2 data, this exercise will have used data that would ultimately be updated in the future. Please visit the ECDC or the Our World in Data site to download the most recent SARS-CoV-2 data.

We will download data from the Our World in Data’s GitHub site and then use Excel to recreate this plot for the United States (US).

Step 1. Download data from the OWID Covid-19 GitHub site

You can download the raw data for all the countries from the OWID’s site here. However, you can also download the cleaned data for this article (parsed for the United States) from my Dropbox folder here (I cleaned the data and prepared them for use in Excel).

The data has the following format:

We are interested in the new_deaths column, which is the number of new confirmed deaths by day. The moving average will estimate the average daily deaths across either 5-day or 7-day time periods.

Step 2. Add columns for the 5-day and 7-day moving average.

When you load the data into Excel, you will need to create two columns; one for the 5-day average daily death and another for the 7-day average daily deaths. I used MAD_5 for 5-day moving average daily deaths and MAD_7 for 7-day moving average daily deaths.

Step 3. Activate the Analysis ToolPak

Excel has a tool that will perform simple analysis; however, you may need to activate this if it is not on the ribbon.

Once the Options selection has been selected, you can navigate to the Add-in option and open the window to select the Analysis ToolPak. You only need the Analysis ToolPak for this exercise, but it is a good idea to explore the other tools available (e.g., Solver).

Step 4. Estimate the moving average

Once the Analysis ToolPak is activated, return to your main worksheet with the new data columns. Start by selecting the cell where you want the first moving average value to be calculated. Then select the Data Analysis add-in. This will open a window with different Analysis Tools. Select the Moving Average tool.

Once you click “Ok,” you will be asked for several elements to calculate the moving average. You will need to select the Input Range, the number of days needed for calculating the moving average, and the Output Range. After selecting the appropriate information, you can click “Ok” to generate the moving average. Do these steps for the 5-Day and 7-Day moving average daily deaths.

Step 5. Generate the line graph

Now that the average daily deaths are calculated for 5-day and 7-day moving averages, you can generate a line graph that resembles Figure 1.

Insert a line graph onto the Excel worksheet. Right click on any empty area of the line graph; this will open up the edit options.

You can make edits to the figure like entering the data to create the line graph. The Series values is the set of numbers that you calculated for the 5-day average daily deaths.

Once you have selected the data for the Series values, you can edit the x-axis label. Currently, the x-axis defaults to a sequence of 1, 2, 3, …, n. You want this to reflect the dates which are on the B column. Click on Edit and select the dates on the B column to properly relabel the x-axis.

This will generate the first line graph with the 5-day average daily deaths visualization.

Step 6. Finalize the figure

After adding the 5-day average daily deaths, go ahead and repeat these steps for the 7-day average daily deaths. Using Excel’s format feature, you can make changes to the figure’s size, height, and colors. To replicate Figure 1, you will need to change the y-axis to a logarithmic scale.

After changing the scale on the y-axis and adding the 7-day average daily deaths, you should get a figure similar to the one below.

You can also present this without the logarithmic scale.

CONCLUSIONS

The moving average daily deaths smooths the line plots and provide a trend that is easy to understand and interpret. The major difference between these moving average plots and the mortality curves (see previous blog) is the ability to see the decline in daily deaths. The average daily deaths curve allows us to see the decline in the amount of deaths on a moving average. Mortality curves don’t do that since they are based on the total amount of deaths. Despite the differences in plots, both the moving average daily deaths and mortality curves provide critical visual information about the current SARS-CoV-2 trends.

You can download the data and complete exercise here.

REFERENCES

European Centre for Disease Control and Prevention. Download today’s data on the geographic distribution of COVID-19 cases worldwide. European Centre for Disease Prevention and Control. Published April 18, 2020. Accessed May 19, 2020. https://www.ecdc.europa.eu/en/publications-data/download-todays-data-geographic-distribution-covid-19-cases-worldwide
Our World in Data. GitHub: Owid/Covid-19-Data. Our World in Data; 2020. Accessed May 19, 2020. https://github.com/owid/covid-19-data

Mark Bounthavong

May 8, 2020

Methods

Formulating a good research question

Mark Bounthavong

May 8, 2020

Methods

On April 16, 2020, I gave a presentation to students from the International Society for Pharmacoeconomics and Outcomes Research (ISPOR) Student Chapter at the University of Washington’s Comparative Health Outcomes, Policy, & Economics (CHOICE) Institute. I reviewed some of the ways to think about a research topic, how to narrow the scope of the topic, and how to formulate a specific and testable research question. The presentation was meant to help students develop their own process for developing a good research question for their thesis.

I discussed the FINER criteria for formulating a research question

I also discussed the PICOT format of a research question.

The presentation is available on the CHOICE Institute’s blog: https://choiceblog.org/2020/04/27/best-practices-in-developing-research-questions/

Mark Bounthavong

April 18, 2020

Data visualization

Communicating data effectively with data visualizations: Part 24 (Mortality Curves)

Mark Bounthavong

April 18, 2020

Data visualization

INTRODUCTION

The continual threat of infection by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has ground the world to a metaphysical stop. Economies appear to be under threat of a long recession, political debates have delayed needed relief to citizens, lack of N95 masks for healthcare workers place them at greater risk for doing their jobs, and mortality has increased world-wide. We are undoubtedly experiencing an seminal period in the 21st Century and data analysts have rushed to develop stunning visuals and dashboards such as the ones developed by Johns Hopkins University, “Our World in Data”,1 and the Centers for Disease Control and Prevention to feature the impact the SARS-CoV-2 is having on our world.

MOTIVATING EXAMPLE

In this article, we will replicate the total deaths due to SARS-CoV-2 using data from the European Centre for Disease Control and Prevention (ECDC)2 or from the Our World in Data’s GitHub site.3 Due to the changing nature of SARS-CoV-2 data, this exercise will have used data that would ultimately be updated in the future. Please visit the ECDC site to download the most recent SARS-CoV-2 data.

The figure we will replicate is one posted on the “Our World in Data” website and looks like the following:

Source: Our World in Data. Total confirmed COVID-19 deaths: How rapidly are they increasing? URL: https://ourworldindata.org/grapher/covid-confirmed-deaths-since-5th-death [last accessed: 17 April 2020].

We will download data from the ECDC and then use Excel to recreate this plot for several countries (we won’t create plots for all the countries, but you can feel free to do so by taking advantage of the available data).

Step 1. Download data from the ECDC.

You can download the raw data from the ECDC’s site here. Alternatively, you can also download the cleaned data for this article here (I cleaned the data and prepared them for use in Excel).

The data has the following format:

Each column represents a country and the rows represents the total number of deaths for each day after the 5th confirmed death.

Step 2. Select all of the data and insert a line chart.

Once all the data have been downloaded, select all of them. Insert a line chart and Excel will automatically generate a figure for you. This figure will need to be edited further, but Excel does a good job of plotting the total number of deaths along the X-axis (time).

Right-click in the chart region and click on the Select Data option. You want to de-select the “Days since the 5th total confirmed death” because this is not the value of interest. Rather, this represents the values for the X-axis (time).

Once you de-select the “Days since the 5th total confirmed death,” your line chart should look like the following figure.

Step 3. Changing the scale on the Y-axis.

In the figure from the Our World in Data, the values for total deaths are plotted using a log-scale. When Excel generates the line chart, it automatically uses the continuous scale on the Y-axis. To change this, you need to right-click on the Y-axis and then select Format Axis...

Once you make the changes to the Y-axis scale, the line chart should now look similar to the one from Our World in Data.

Step 4. Adding the axes labels and formatting the lines.

Once the line chart’s Y-axis has been transformed into a log-scale, you can make changes to the axes labels and the line formatting. Select the Design tab to made changes to the Y-axis label. Select the Add Chart Element to open the drop-down menu, then select the Axis Title followed by the Primary Vertical option. This will allow you to make changes to the Y-axis label.

Change the Y-axis label to read, “Log number of COVID-19 deaths.” Do the same thing for the X-axis label, but change it to read “Days since 5th death occurred.” Your figure should look like the following.

You can make the lines thinner by right-clicking on one of them (e.g., China), opening up the options. Select Format Data Series… and then adjusting the Width to be 1.5 points. This will make the line easier to see without having to take up space with the thicker lines.

Repeat this process for all the lines in the chart. Once you have completed that, the line chart should look like the following.

Step 5. Now all that’s left is changing some of the aesthetics.

The final line chart replicates the figure from Our World in Data and provides the references line for a doubling in the number of deaths for each country. The reference line (“Doubling every 5 days”) was creating using a base of 10 on a log-scale to replicate a doubling of that value every 5 days. The reference line was also placed on a secondary axis to create a continuous line (since creating a line on the same scale as the other countries would have yielded gaps because we’re doubling deaths every 5 days). In the Design tab, you can add the secondary horizontal axis to match that of the primary horizontal axis (time). The secondary vertical and horizontal axes had their font color changed to white to hide them from view and to clean the final figure.

Gridlines were added along with the label for the reference line, which indicates to point where the total confirmed SARS-CoV-2 deaths are doubling every five days.

Here is the final chart after some formatting changes were made.

Conclusions

Although we plotted total deaths from SARS-CoV-19 from existing data, these were limited to five countries. More countries can be added using the available data, and it is encouraged that you try to plot all the other countries as an exercise. The reference line provides us with the doubling of deaths on a log-scale and carefully provides the readers with a threshold where certain doubling of deaths would be reported. China seems to have controlled their total number of deaths, but there is a spike at the end of day 87 that shows an increase in deaths. This may be due to reporting error or a change in the definitions of death. The US, including the other European countries, are trailing the rest of the world in containing the SARA-CoV-2 pandemic. Hopefully, this type of data visualization will help inform decision makers to develop policies that would mitigate the impact of SARS-CoV-19 on mortality any parts of the world.

You can download the data and complete exercise here.

References

Roser M, Ritchie H, Ortiz-Ospina E, Hasell J. Coronavirus Disease (COVID-19) – Statistics and Research. Our World Data. March 2020. https://ourworldindata.org/coronavirus. Accessed April 17, 2020.
European Centre for Disease Control and Prevention. Download today’s data on the geographic distribution of COVID-19 cases worldwide. European Centre for Disease Prevention and Control. https://www.ecdc.europa.eu/en/publications-data/download-todays-data-geographic-distribution-covid-19-cases-worldwide. Published April 18, 2020. Accessed April 18, 2020.
Our World in Data. GitHub: Owid/Covid-19-Data. Our World in Data; 2020. https://github.com/owid/covid-19-data. Accessed April 18, 2020.

Mark Bounthavong

March 19, 2020

Data visualization

Communicating data effectively with data visualizations: Part 23 (Epidemic Curves)

Mark Bounthavong

March 19, 2020

Data visualization

INTRODUCTION

In December 2019, a novel strain of coronavirus was detected in Wuhan, the capital city of the Hubei province in China. This coronavirus was designated as severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). This current iteration of the coronavirus has many similar characteristics to its earlier ancestor SARS-CoV-1,[1] which was first detected in 2003 and known simply as SARS. According to a recent study, clinical characteristics of patients in China who were infected with SARS-CoV-2 included fever (up to 88.7% who were hospitalized) and cough (67.8%).[2] The median age of patients infected was 47.0 years (IQR: 35.0-58.0) with a large distribution of 58.3% over the age of 50 years having severe symptoms. Additionally, the case fatality rate was reported to be 1.4%.

In January 2020, the World Health Organization (WHO) in regards to the SARS-CoV-2 outbreak declared a global health emergency.[3] Regardless, as SARS-CoV-2 spread across the globe into a pandemic, many countries started to report the attributable number of cases and deaths. According to the WHO, the global total of confirmed cases is at 191,127 and global deaths is at 7,807 (as reported on March 18, 2020).[4]

One of the most important tools in understanding the SARS-CoV-2 epidemic course is the epidemic curve. Epidemic curves allow epidemiologists to visualize the progression of an outbreak by surveilling the number of cases across time.[5] The epidemic curve informs epidemiologists about the pattern of the outbreak’s spread, magnitude, time to exposure, and outliers. Moreover, the epidemic curve is constantly updated as more data become available.

As the SARS-CoV-2 pandemic spreads to other countries, many data visualizations have been developed to help educate and inform people. Johns Hopkins University has developed a real-time dashboard with epidemic curves on the SARS-CoV-2 pandemic that is an excellent source of global cases and mortality. The Centers for Disease Control and Prevention (CDC) also has a series of data visualizations on the SARS-CoV-2 outbreak in the United States including an epidemic curve.

This article will review the features of an epidemic curve and provide a tutorial on creating one based on the available data from the CDC on the SARS-CoV-2 outbreak in the United States.

EPIDEMIC CURVE

When an outbreak happens, there is an urgency to determine when it first occurred. Epidemiologists carefully, collect data to determine who patient zero is and when the case was first identified. This gives them a starting point for when the outbreak occurred. Epidemic curves provide information on the outbreaks’ spread, magnitude, incubation period, outliers, and time trend. Key features of the epidemic curve include the number of cases on the Y-axis and the date of illness on the X-axis. Figure 1 illustrates the key features of the epidemic curve for a point-source outbreak.

Figure 1. Key features of the epidemic curve for a point-source outbreak.

Epidemic curves can tell us information about the outbreak’s pattern of spread. Figure 1 illustrates the pattern of spread for a point-source outbreak. In a point-source outbreak, a single source of contamination affects a group of people at a single event (e.g., rotten potato salad at a dinner party). Other patterns include continuous-source and propagated-source outbreaks. Continuous-source outbreak occurs when the group of people are exposed to a source of contamination for a period of time (e.g., lead poisoning in children). Propagated-source outbreak occurs when the contamination is spread from person-to-person (e.g., flu). The SARS-CoV-2 is an example of a mixed-source outbreak where the early outbreak was due to a common-source (e.g., possibly from zoonotic transmission from animal to human) followed by a propagated-source outbreak where the virus is spread from person-to-person via air droplets or physical contact.[6] Based on a recent study, the mean incubation period for SARS-CoV-2 is 5.1 days (95% CI: 4.5 to 5.8 days).[6]

Motivating example

The SARS-CoV-2 outbreak in the United States was first reported in January 14, 2020. Data on the number of cases can be downloaded from the CDC’s SARS-CoV-2 surveillance site (Note: CDC data are updated daily; hence, the data for this exercise will not reflect these changes). We will use these data to create an epidemic curve of the SARS-CoV-2 outbreak in the United States using Excel. You can download the data files used in this exercise here.

The data are arranged in a wide format where the date (time) is represented by columns and the number of cases is represented by rows. This makes it much easier to generate the epidemic curve in Excel.

Step 1. Highlight the data and Insert a bar chart.

Select the data and insert a clustered column chart. The default version will provide a simple epidemic curve. However, we want to remove the spaces between the bars. To do that, we will need to format the

Step 2. Changing the size of the bars.

To change the size of the bars, we need to right-click one of them to bring up the editing menu. Then we select the Format Data Series… to bring up the options. We set the Gap Width of the bars to 0% so that their sides are in contact with each other. But to distinguish them, we can change the outline’s color to White. Increasing the width of the border’s color will increase gaps between each bar.

After a few more changes (e.g., color and labels), the final epidemic curve will represent the CDC’s data on SARS-CoV-2 on March 18, 2020 (Figure 2). Since data are constantly changing and require validation during an outbreak, this epidemic curve will eventually change. It is recommended that you constantly update this exercise’s data in order to have the most recent, accurate, and valid data from the CDC on SARS-CoV-2. You can also compare your findings to those of the CDC at their website.

Figure 2. Cases of SARS-CoV-2 in the United States.

Conclusions

Epidemic curves are helpful in understanding a disease outbreak in a community. They provide us with a visual representation of the outbreak’s magnitude, pattern, and time period, which will allow us to implement public health policy to stem, reduce, and eventually eradicate the contagion from our population. Although this is a short introduction on epidemic curves, it will, hopefully, be enough for you to review and interpret other epidemic curves in the news or literature.

Files related to this exercise are available here.

References

van Doremalen N, Bushmaker T, Morris DH, et al. Aerosol and Surface Stability of SARS-CoV-2 as Compared with SARS-CoV-1. N Engl J Med. 2020;0(0):null. doi:10.1056/NEJMc2004973
Guan W, Ni Z, Hu Y, et al. Clinical Characteristics of Coronavirus Disease 2019 in China. N Engl J Med. 2020;0(0):null. doi:10.1056/NEJMoa2002032
World Health Organization. Statement on the second meeting of the International Health Regulations (2005) Emergency Committee regarding the outbreak of novel coronavirus (2019-nCoV). https://www.who.int/news-room/detail/30-01-2020-statement-on-the-second-meeting-of-the-international-health-regulations-(2005)-emergency-committee-regarding-the-outbreak-of-novel-coronavirus-(2019-ncov). Accessed March 19, 2020.
World Health Organization. Coronavirus Disease 2019 (COVID-19) Situation Report –58.; 2020. https://www.who.int/docs/default-source/coronaviruse/situation-reports/20200318-sitrep-58-covid-19.pdf?sfvrsn=20876712_2. Accessed March 19, 2020.
Centers for Disease Control and Prevention. Interpretation of Epidemic (Epi) Curves during Ongoing Outbreak Investigations | Foodborne Outbreaks | Food Safety | CDC. https://www.cdc.gov/foodsafety/outbreaks/investigating-outbreaks/epi-curves.html. Published November 16, 2018. Accessed March 19, 2020.
Lauer SA, Grantz KH, Bi Q, et al. The Incubation Period of Coronavirus Disease 2019 (COVID-19) From Publicly Reported Confirmed Cases: Estimation and Application. Ann Intern Med. March 2020. doi:10.7326/M20-0504

Mark Bounthavong

February 10, 2020

Data visualization

Communicating data effectively with data visualizations: Part 22 (How to create a double axes figure in Excel)

Mark Bounthavong

February 10, 2020

Data visualization

INTRODUCTION

Displaying two types of information on a single figure is commonly done using double axes. Normally, you would use a single y-axis to reflect the outcome or metric of interest and a single x-axis for a standard set of categories (e.g., time). However, if the scales are different, leveraging a secondary axis can enhance and improve the visualization. In this article, we will demonstrate how to create a secondary axis using Excel.

MOTIVATING EXAMPLE

For example, the total overdose deaths are plotted on the y-axis and the year is plotted on the x-axis (Figure 1).

Figure 1. Total overdose deaths in the United States, 1999 to 2017.Source: https://www.drugabuse.gov/related-topics/trends-statistics/overdose-death-rates — **Figure 1. Total overdose deaths in the United States, 1999 to 2017.**
Source: https://www.drugabuse.gov/related-topics/trends-statistics/overdose-death-rates

Total overdose deaths in the United States (US) is large and growing. In 1999 it started at 16,849 deaths, but by 2019, that number rose to over seventy thousand. This is a large number and the scale that is used can impact how the viewer will synthesize this visualization.

However, if you wanted to add another line or data dimension such as the number of deaths due to opioids with a benzodiazepine prescription, that number may not be on the same scale as the total overdose deaths that’s currently represented by the y-axis. Figure 2 illustrates the Total Overdose Deaths and Overdose Deaths due to an Opioid-Benzodiazepine combination. Notice how the Opioid Deaths due to an Opioid-Benzodiazepine combination is plotted on the same scale as the Total Overdose Deaths. Using the same scale makes it difficult to discern the change in deaths associated with an Opioid-Benzodiazepine combination.

**Figure 2. Overdose Deaths due to Opioids, 1999 to 2017.**

To improve this, we may consider using a secondary y-axis to correctly scale for the number of deaths due to opioids-benzodiazepine co-prescribing.

Step 1. Open the Excel document with the data, which is located here.

Step 2. Review the data. In this example, we have the total overdose deaths and the deaths associated with opioids-benzodiazepine co-prescribing from 1999 to 2017.

Step 3. Select the three columns and use the Excel Insert tab to select the line figure.

Step 4. Review the line graph.

The default line plot that is generated includes all the variables in a single y-axis. This is not what we want.

We need to correct this and select the correct axes for these variables.

Step 5. The Year variable needs to be on the x-axis.

Right-click anywhere on the chart space. A menu of options will appear. You want to click on the Select Data… option to make changes to the data and how they are displayed on the chart.

Then Remove the Year variable from the Legend Entries (Series) panel. This will remove Year from being plotted on the y-axis. Next, you want to edit the Total Overdose Deaths in the Horizontal (Category) Axis Labels panel.

After clicking on Edit, you will be given a small window to select the range of data to display on the x-axis. Use the button in the Axis label range to select the Years 1999 to 2017. This will populate the x-axis with the years corresponding to the Total Overdose Deaths.

Step 6. Selecting the secondary y-axis.

Once the Year has been correctly specified, we can create a secondary y-axis to represent the Deaths due to opioids-bzd since these numbers are much smaller than the Total Overdose Deaths.

Right click on the line that represents Deaths due to opioids-bzd. Select the Format Data Series… option. This will give you a menu with the choice to use the Primary or Secondary axis. Select the Secondary axis.

This will generate a chart that will include two y-axes: (1) Total Overdose Deaths and (2) Deaths due to opioids-bzd.

You can format the figure to improve its appearance with Excel’s other functions. In this example, the line colors were changed to provide a better contrast, the y-axes were properly labelled, and the font colors matched those of the lines.

CONCLUSIONS

Using double axes can improve the narrative of your data visualization. Take advantage of this option when presenting data with the same x-axis, but different measures. In this example, the metric of interest was death, but the scales were different. Total Overdose Deaths were significantly larger than Deaths due to opioid-benzodiazepine co-prescribing. Therefore, to better visualize the increasing number of deaths due to opioid-benzodiazepine co-prescribing, a secondary y-axis was added.

SUPPLEMENTS

Data and Excel file used for this exercise are available here.

Mark Bounthavong

January 30, 2020

MEPS

Medical Expenditure Panel Survey (MEPS) Guide - Part 1

Mark Bounthavong

January 30, 2020

MEPS

INTRODUCTION

Medical Expenditure Panel Survey (MEPS) is a publicly available dataset on healthcare expenditures that is representative of the US population.

The MEPS homepage contains vital information about the methods used to validate household responses and guides on how to properly use these data for research or exploration. You can learn about MEPS in its background section.

MEPS data files are available for download here. The most important file is the Full Year Consolidated Data Files, which contains the data for unique household responses on their characteristics and expenditures. These data are great practice for those interested in learning more about MEPS. Each of the Full Year Consolidated Data Files contain information about the data in the form of Documentation and Code Books. For example, the 2017 Full Year Consolidated Data Files Documentation and Code Book are located here.

If you area a Stata user, there are Stata programming statements available to copy and paste into a Stata *.do file. These programming statements are used to transform the MEPS data into a *.dta file that is usable by Stata. Follow the instructions in the programming statement to properly transform the MEPS data. This is similar to an extract-transform-load (ETL) process.

MEPS has a library of reports that uses its data. You can search for topics using their search engine. For example, Report #43 describes the annual opioid usage among adults treated for conditions associated with pain versus other conditions from 2013 to 2015.

Other examples of MEPS data being used in research include the following:

Hamad R, Niedzwiecki MJ. The short-term effects of the earned income tax credit on health care expenditures among US adults. Health Serv Res. 2019 Dec;54(6):1295-1304. doi: 10.1111/1475-6773.13204. Epub 2019 Sep 30.
Watanabe JH. Examining the Pharmacist Labor Supply in the United States: Increasing Medication Use, Aging Society, and Evolution of Pharmacy Practice. Pharmacy (Basel). 2019 Sep 19;7(3). pii: E137. doi: 10.3390/pharmacy7030137.
Bounthavong M, Li M, Watanabe JH. An evaluation of health care expenditures in Crohn's disease using the United States Medical Expenditure Panel Survey from 2003 to 2013. Res Social Adm Pharm. 2017 May - Jun;13(3):530-538. doi: 10.1016/j.sapharm.2016.05.042. Epub 2016 May 20.