February 22, 2021

Communicating data effectively with data visualizations: Part 34 (Progress bars in Excel)

INTRODUCTION

With the introduction of the COVID-19 vaccines, states have monitored their vaccination progress. Each state’s current vaccination progress is available at USAFacts.org. California, which has experienced some of the highest COVID-19 case rates and deaths, has used approximately 81.2% of its current allocation of COVID-19 vaccines. Here is a screen capture of California’s progress as of 22 February 2021:

This visualization contains several progress bars showcasing the COVID-19 vaccination progress in California. We will re-create this in Excel using the data presented in this visualization.

Step 1. Enter data into Excel.

Using the visualization, we can enter the data into Excel. The data should be arranged in the following manner as reflected in Table 1:

Step 2. Select the data and insert a clustered bar chart.

The default chart is automatically generated by Excel. However, we want the axis labels and the legend entries swapped. To do that, right-click on the chart area and click on “Select Data…” In the window that opens, click on “Switch Row/Column” to change the horizontal axis labels.

Step 3. Modify chart to add data labels.

Right-click on the chart area and click on “Select Data…” This will bring up a window that allows you to make changes to the data structure. We will unselect the “Goal” and “Current progress” data. This will leave only the “Goal (%)” and “%” bars. Next, right-click the top bar and click on “Add data labels.”

Step 4. Overlap the two bars.

Next, we want to overlap the two bars so that we can show the progress of vaccinations in California. Right-click on one of the bars and click on “Format Data Series…” Change the series overlap to 100%. This will cause the two bars to overlap, which will look like a progress bar. Since we only added the data label for the top bar, we can clearly see the vaccination progress in California.
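The same overlap idea can be sketched outside Excel. The short Python example below is not part of the original workbook; it is a minimal illustration, assuming a hypothetical helper named progress_bar, of drawing a “progress” series on top of a full-length “goal” series. The 81.2% value is the California figure quoted in the introduction.

```python
def progress_bar(percent, width=40, fill="#", empty="-"):
    """Render a percentage (0-100) as a fixed-width text progress bar.

    The `empty` characters play the role of the full-length goal bar and the
    `fill` characters play the role of the progress bar drawn on top of it.
    """
    # Clamp so values outside 0-100 cannot overflow the bar.
    clamped = max(0.0, min(100.0, percent))
    filled = round(width * clamped / 100)
    return f"[{fill * filled}{empty * (width - filled)}] {clamped:.1f}%"


# 81.2 is the California figure quoted in the introduction.
print(progress_bar(81.2))
```

As in the Excel chart, the label is attached only to the progress series, so the reader sees one number against a full-length background.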

Step 5. Modify the aesthetics.

The progress chart was modified by using the Abadi font and changing the patterns and colors of the bars. I also added the title “COVID-19 vaccination progress, California.”

Step 6. Add some text.

Then we can add some text to emulate the visualization from USAFacts.org. Change the order of the data so that it reflects the one on the USAFacts.org website.

Conclusions

Using the overlap feature of Excel will allow us to make progress charts. The example from California showcases how this visualization can effectively show the current vaccination progress. These types of visualizations will help decision makers monitor progress and make any changes when necessary. Files for this exercise can be downloaded here.

References

Data from USAFacts.org.

Mark Bounthavong

January 20, 2021

Data visualization

Communicating data effectively with data visualizations: Part 33 (Bar charts with secondary axis)

INTRODUCTION

A secondary axis allows us to plot two data series with large differences in scale. For instance, the number of new COVID-19 cases, which is in the hundreds of thousands, will dwarf the number of employees who lost their jobs, which is in the tens of thousands. This order-of-magnitude difference makes it difficult to compare the two metrics side-by-side on a single figure. However, we can take advantage of the secondary vertical axis to present both series in a way that is visually intuitive to interpret while preserving the differences in their scale. One problem with this method is how Excel executes it. Instead of placing the bars (or columns) side-by-side, Excel will overlay them. This is illustrated in Figure 1.

Figure 1. Excel overlays the two bars in the figure when using the secondary axis.

To address this issue, we will use a method described by Steve True on his Excel Dashboard Templates Website.

MOTIVATING EXAMPLE

We will use data from California’s Employment Development Department to visualize the number of employees who lost their jobs during the COVID-19 pandemic and compare these trends to the number of new COVID-19 cases in California. Data on the number of COVID-19 cases in California came from the California Open Data Portal. You can also download the Excel exercise file from the following shared Dropbox location.

Step 1. Selecting the data.

First, inspect the data. There are three columns of data that contain the month, the number of employees affected, and the number of new COVID-19 cases. Next, select the data and insert a “Clustered Column” chart.

Excel will automatically generate a figure where the dates are on the X-axis and the values for the metrics are on the Primary Y-axis.

Notice that the Number of new cases is roughly an order of magnitude larger than the number of employees who lost their jobs during the pandemic. On a shared axis, it is difficult to see how the pandemic impacts the number of employees who’ve lost their jobs. To address this, we’ll use the secondary axis for the Number of new cases.

Step 2. Select the secondary axis.

To select the secondary axis, right-click on one of the orange bars that denote the Number of new cases. This will open a menu where you can select “Format Data Series…” Once you’ve done this, select the “Secondary Axis.” This will change your current bar chart into a chart with two axes.

The problem with this updated chart is the way Excel overlays the bars. Here is an illustration of how Excel does this. By changing the gap width, we reveal that the bars are actually over one another and not stacked.

Step 3. Fixing this problem.

The solution to this issue is to add gaps between the bars. Add two new columns between the Number of Employees affected and the Number of new cases; label these as “gap1” and “gap2.”

Now, select all the data and insert a bar chart. You should get the following chart.

Right-click on the yellow bar to open the Format Data Series option. In the Format Data Series window, use the “Series Options” drop-down button to select the data element we want to edit. The first data element is “gap2” and the second data element is “gap1.” We want to assign “gap2” to the Secondary Axis and “gap1” to the Primary Axis. Make sure that you change the “Gap Width” to 25% for both the “gap1” and “gap2” data elements. Keep the “Series Overlap” at 0% for both data elements.

Now, your bar chart should have the bars for the Number of Employees affected and the Number of new cases side-by-side (rather than overlaid) and using both axes.

We could improve this bar chart by editing the legend to remove the “gap1” and “gap2” labels, change the color of the bars, modify the fonts, add axis labels, and add a chart title.

CONCLUSIONS

It is possible to use Excel to create bar charts with two vertical axes. Although Excel doesn’t do this intuitively, we can use the extra columns denoted by “gap1” and “gap2” to generate the space needed to have the bars side-by-side.

REFERENCES

I ran into this problem when I wanted to use two different scales of metrics on a recent project, and I was perplexed as to why Excel doesn’t intuitively create the bar chart that I needed. Fortunately, I found Steve True’s Excel Dashboard Templates website, where he details how to solve this issue. I highly recommend visiting his site, as he has a wealth of resources on how to use Excel efficiently.

California COVID-19 data are located at the California Open Data Portal

Data on California layoffs are located at California Employment Development Department

Mark Bounthavong

December 16, 2020

Data visualization

Communicating data effectively with data visualizations: Part 32 (John W. Tukey short biography)

INTRODUCTION

John Wilder Tukey (1915 to 2000) was a mathematician, statistician, and data visualization pioneer. He is credited with coining computer science terms such as “bit” (a shortened version of binary digit) and “software.” However, Tukey is best remembered for his contributions to data visualization.

Tukey developed the foundations for exploratory data analysis (EDA), which has taken root as a critical first step to understanding the complexities of observations. According to Tukey, EDA was about hypothesis generation. Unlike confirmatory analysis, EDA uses data to identify potential hypotheses that explain observed phenomena and to assist with the selection of appropriate statistical tests. According to Tukey, data visualization plays an important role in EDA. Visualizing the data using his methods allows analysts to better understand the data. Tukey was responsible for developing innovative data visualizations including a new count tally system, stem-and-leaf displays, and box-and-whisker plots. This article will highlight several of Tukey’s innovative data visualizations that continue to be used in today’s data analysis world.

Source: https://alchetron.com/John-Tukey

“The greatest value of a picture is when it forces us to notice what we never expected to see.”

John W. Tukey

Exploratory Data Analysis (1977)

Count tally

The conventional counting tally uses the vertical “stroke” method, where a single pencil stroke denotes a single count. On the fifth count, a diagonal line is sketched across the four vertical strokes. This can cause miscounts due to the difficulty of interpreting the strokes. Here are some errors that Tukey highlights in his textbook, Exploratory Data Analysis:

Source: Tukey JW. Exploratory Data Analysis. Pearson; 1st edition (January 1, 1977)

Tukey developed an innovative counting tally method that was more efficient than conventional methods by using dots and lines to indicate counts. This method is considered easier to interpret and less prone to miscounts, especially when the counting is performed quickly.

Stem-and-leaf display

Although histograms allow us to see whether our data are normally distributed, they do not provide much additional information. Tukey developed an innovative method to capture additional elements while visualizing the data’s distribution. This visualization is known as the stem-and-leaf display, which provides data analysts with both descriptive information and the data distribution. He believed that the histogram left out critical information that would be informative to the analyst. By using the stem-and-leaf display, a data analyst can observe the raw values of the data and quickly identify the mode and outliers. The following is a figure taken from Tukey’s 1977 text Exploratory Data Analysis. The figure represents two different stem-and-leaf plots that display the same data. The “#” column gives the frequency in each bin; the bins are ordered by the leading digit of the value. For example, the value 16 is represented as 1 | 6.
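To make the construction concrete, here is a minimal Python sketch of a stem-and-leaf display. It is not Tukey’s own notation or code; the function name stem_and_leaf and the sample values are made up for illustration, and the sketch assumes non-negative whole numbers that are split at the last digit.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Return stem-and-leaf rows for non-negative integers.

    The stem is every digit except the last; the leaf is the last digit.
    """
    rows = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        rows[stem].append(leaf)
    if not rows:
        return []
    lines = []
    # Include empty stems so gaps in the distribution stay visible.
    for stem in range(min(rows), max(rows) + 1):
        leaves = "".join(str(leaf) for leaf in rows.get(stem, []))
        lines.append(f"{stem} | {leaves}")
    return lines


# Made-up values for illustration only.
sample = [16, 12, 23, 27, 21, 35, 16, 48]
for line in stem_and_leaf(sample):
    print(line)
```

Each row keeps the raw digits, so the length of a row works like a histogram bar while the individual values remain readable.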

Box-and-whisker plots

Tukey discussed improving the box-and-whisker plot (also known as the box plot) by standardizing the whisker length at 1.5 times the interquartile range (IQR). This allows analysts to identify outliers, which are the values that fall beyond the whiskers. In one example from his textbook, Tukey highlights the benefits of the box-and-whisker plot by plotting the elevations of states and volcanoes. The reader can easily identify the outliers as they are labeled (by Tukey’s hand) on the plots.
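The 1.5 × IQR rule is easy to compute directly. The following Python sketch is only an illustration: the function names tukey_fences and outliers are hypothetical, and the quartiles come from the standard library’s statistics.quantiles with the “inclusive” method, which is one of several quartile conventions and may differ slightly from the hinges Tukey used.

```python
import statistics


def tukey_fences(values, k=1.5):
    """Return the (lower, upper) limits that lie k * IQR beyond the quartiles."""
    # Quartile conventions vary; "inclusive" is an assumption of this sketch.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def outliers(values, k=1.5):
    """Return the values that fall outside the fences."""
    low, high = tukey_fences(values, k)
    return [v for v in values if v < low or v > high]


# Made-up values for illustration only.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
print(tukey_fences(data))
print(outliers(data))
```

Values returned by outliers are the ones a box-and-whisker plot would draw as individual points beyond the whiskers.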

CONCLUSIONS

Tukey made significant contributions to mathematics, computer science, statistics, and data analysis. But his pursuit of efficient methods to display data led to innovative data visualizations that we continue to use. Data visualization, according to Tukey, was an important part of analysis from which we could generate hypotheses and select the appropriate inferential tests. He saw the world in a different way, which has helped us shed a little illumination on the mysteries of the world.

REFERENCE

1. Tukey JW. Exploratory Data Analysis. Pearson; 1st edition (January 1, 1977).

Mark Bounthavong

November 18, 2020

Data visualization

Communicating data effectively with data visualizations: Part 31 (Visualizing COVID-19 deaths using sparklines)

INTRODUCTION

As the COVID-19 pandemic continues to place a public health burden on our society, the death toll in the United States has reached a terrible milestone; over 240,000 deaths were reported as of November 17, 2020.[1] COVID-19 is now the third leading cause of death, behind heart disease and cancer and ahead of accidents (unintentional injuries) and chronic respiratory diseases (reported by Scientific American).[2] Of note is the epidemiologic spread of new deaths over time. New York has borne the brunt of the deaths; early in the pandemic, New York City had the highest 7-day moving average of COVID-related deaths in the country (1,193 deaths on April 21, 2020), but, recently, the 7-day moving average has dropped to 8 deaths (November 16, 2020). One way that we can visualize this is to use sparklines. Excel has a feature that allows users to plot sparklines onto the spreadsheet. We will use this feature to visualize the change in new COVID-related deaths over time by state.

MOTIVATING EXAMPLE

We will recreate the following figure using the Excel sparklines feature to visualize the changing number of new deaths due to COVID-19 from January to October 2020. Data can be downloaded from the CDC Tracker site or from the Dropbox folder. (Data were pulled on November 18, 2020.)

CONSTRUCTING THE SPARKLINES

Step 1. Open and examine the data

You can download and open the Excel file located here. The data should look like the following:

The data are in wide format and are ordered by Total Deaths in descending order. The abbreviation NYC represents New York City instead of the state. The data also include districts (some cities) and US territories.

Step 2. Identify the location of the sparklines options

Excel’s sparkline feature is located in the ribbon area. You can locate this in the Insert tab under the sparklines area.

For this tutorial, we will use the “Column” option. This will generate bar charts within the cell that we identify.

Step 3. Insert the sparkline

In the cell next to the “Total Deaths” column, we will add the column label “New Deaths Over Time.” Select the empty cell for the first state and insert the sparkline. A window will open prompting for information. Data Range should include the data for the state from Jan to Oct. Once you’ve selected your data, click on “OK.”
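For readers who want to see what a column sparkline does with one row of data, here is a minimal Python sketch. It does not call Excel; the function name sparkline and the monthly counts are made up for illustration, and the sketch assumes each row is scaled to its own minimum and maximum.

```python
BLOCKS = "▁▂▃▄▅▆▇█"


def sparkline(values):
    """Map each value to one of eight block characters.

    Heights are scaled to this row's own minimum and maximum (an assumption
    of this sketch), so rows with very different totals remain readable.
    """
    low, high = min(values), max(values)
    span = high - low
    if span == 0:
        # A flat series is drawn at the lowest height.
        return BLOCKS[0] * len(values)
    top = len(BLOCKS) - 1
    return "".join(BLOCKS[round((v - low) / span * top)] for v in values)


# Made-up monthly counts (Jan to Oct) for illustration only.
monthly_deaths = [0, 0, 5, 120, 90, 40, 20, 10, 8, 6]
print(sparkline(monthly_deaths))
```

Because every row is scaled independently, the shape of the trend is comparable across states even when the totals are not.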

Step 4. Change row height to fit the sparkline

The sparkline is inserted into a narrow cell with a low row height. We can change this by right-clicking on the row label, selecting “Row Height…” changing the “Row Height” to 45, and then confirming that this is the height you want. Feel free to change the height to fit your needs.

Step 5. Select other states

At this point, the sparklines are nearly complete. You can expand this to the other states by selecting the corner of the cell and dragging the sparkline to the other cells.

Step 6. Hide cells

We can hide the cells with the number of deaths from Jan to Oct. However, when you do this, the sparklines will disappear. By default, Excel uses only the visible cells to create the sparklines, so when you hide the data, the sparklines have nothing to draw. To fix this, open the sparklines ribbon and select the “Edit Data” box. This opens several options. Select “Hidden & Empty Cells…” and check “Show data in hidden rows and columns.” This will allow you to hide the cells that the sparklines use to generate the visuals.

Step 7. Final touches

You can change the color of the sparklines and add some table lines to separate the states. In this tutorial, the Arial Nova Light font was used. Here are the final sparklines for the five states with the highest total number of COVID-related deaths in the United States.

CONCLUSIONS

Using sparklines can enhance a data visualization by adding a time series of historical events (e.g., COVID-19 deaths). This is useful when you want to incorporate this alongside some data in a table, spreadsheet, or dashboard. In our tutorial, we used the history of deaths for each state in the United States to give the audience some perspective on what has occurred beyond the total number of deaths. For instance, New York City had the highest number of total deaths, but historically, these deaths occurred in the months of April to July and dropped significantly in recent months. Other states started out with a low number of deaths but have had many of their total deaths in recent months. This adds another dimension to the visualization by showing how differently the epidemiologic spread of deaths occurred across the United States.

REFERENCES

1. CDC. COVID-19 Cases, Deaths, and Trends in the US | CDC COVID Data Tracker. Centers for Disease Control and Prevention. Published March 28, 2020. Accessed November 18, 2020. https://covid.cdc.gov/covid-data-tracker
2. Stix YZ Gary. COVID-19 Is Now the Third Leading Cause of Death in the U.S. Scientific American. Accessed November 18, 2020. https://www.scientificamerican.com/article/covid-19-is-now-the-third-leading-cause-of-death-in-the-u-s1/

Mark Bounthavong

October 20, 2020

Data visualization

Communicating data effectively with data visualizations: Part 30 (Butterfly charts)

INTRODUCTION

COVID-19 data on cases and deaths highlight the devastating impact the disease has had on public health. As of 20 October 2020, there have been over 8.5 million cases and over 200,000 deaths. A majority of deaths have been among the elderly, while a majority of the cases have been among the younger population. The Wall Street Journal recently published an article describing this relationship. Of concern is the potential for transmission to occur between the younger group, who have the most cases, and the elderly population.

In this tutorial, we will compare the distribution of cases and deaths across different age groups to visualize the relationships between these dimensions. To do that, we will use a butterfly chart, which juxtaposes two horizontal bar charts in a mirror-like fashion. Butterfly charts allow us to plot two data sets using a common dimension; this allows us to visually compare their differences and scales.

Here is an example of a butterfly chart from datavizproject.com.

MOTIVATING EXAMPLE

We will use data from the CDC on COVID-19 cases and deaths distributed by age groups. You can download the data from the CDC website here. You can also download the Excel workbook for this exercise here.

CONSTRUCTING THE BUTTERFLY CHART

Step 1. Open the Excel File and review the data.

The main data will include the Age Group, Percentage of cases, Percentage of deaths, Mirror1, Mirror2, and Middle. These columns will be used to build the butterfly chart. The next smaller table will be used to re-align the data on a different X-axis.

Here are some data definitions:

Age Group = Age distribution of the population
Percentage of cases = Proportion of patients within each age group that had confirmed COVID-19 testing
Percentage of deaths = Proportion of patients within each age group that died of COVID-19 related disease
Mirror1 = Represents the gap created on the left side of the butterfly chart. This is estimated using: 50 – Percentage of cases (50 was used because it is a round value comfortably above the max value). For instance, the max value for the Percentage of cases is 23.7. Therefore, Mirror1 = 50 – 23.7 = 26.3.
Mirror2 = Represents the gap created on the right side of the butterfly chart. This is estimated using: 50 – Percentage of deaths (50 was used because it is a round value comfortably above the max value)
Middle = Represents the gap in the middle where we will place our data labels
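The helper columns defined above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the workbook itself: the function names mirror_gap and butterfly_row are hypothetical, the death percentage in the example is made up, and the Middle width of 40 is inferred from the later step where the right side of the chart starts at X=90.

```python
def mirror_gap(percentage, half_width=50):
    """Return the padding that pushes a bar out to `half_width`.

    With half_width=50 this is the tutorial's formula: 50 - percentage.
    """
    if not 0 <= percentage <= half_width:
        raise ValueError("percentage must be between 0 and half_width")
    return round(half_width - percentage, 1)


def butterfly_row(pct_cases, pct_deaths, half_width=50, middle=40):
    """Return one row in stacking order for the stacked bar chart.

    `middle=40` is an assumption inferred from the chart's X positions.
    """
    return [
        mirror_gap(pct_cases, half_width),   # Mirror1 (no fill)
        pct_cases,                           # Percentage of cases
        middle,                              # Middle (no fill, holds the labels)
        pct_deaths,                          # Percentage of deaths
        mirror_gap(pct_deaths, half_width),  # Mirror2 (no fill)
    ]


# 23.7 is the maximum Percentage of cases quoted above; 0.5 is made up.
print(butterfly_row(23.7, 0.5))
```

Because Mirror1 plus the Percentage of cases always equals 50, every left-hand bar ends at the same position, which is what produces the mirrored shape.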

Step 2. Create a stacked horizontal bar chart.

Select the data shown in the figure below and select the Stacked Bar Chart.

The stacked bar chart will look like the following:

Step 3. Order the categories.

Once the stacked bar chart is created, we will re-order the categories so that the Middle values are in the middle of the group. The order should be Mirror1, Percentage of cases, Middle, Percentage of deaths, and Mirror2.

Step 4. Remove color from the bars.

Next, we will remove the color fills from the Mirror1, Middle, and Mirror2 bars in the stacked bar chart. We should start to see the beginnings of a butterfly chart.

Step 5. Add labels to the middle of the butterfly chart.

Once the selected bars have their fill colors removed, we can add labels to reflect the age categories. First, we will reverse the order of the Y-axis by right-clicking on it and then selecting “Categories in reverse order.” Second, we can remove the gridlines by clicking on them and then pressing the “Delete” key. Third, we will add age category labels to the Middle bars by right-clicking on the bars, selecting “Add data labels,” then selecting “Format data label,” checking “Category Name,” and unchecking “Values.” This should replace the values with the age category names for the Middle bar in the stacked bar chart.

Step 6. Adding the new X-axis labels.

Since the current stacked bar chart uses the X-axis from the main data table, we don’t have a normalized axis. To fix that, we will need to add a new set of data and then replace our current X-axis with the updated X-axis.

First, we need to establish where we would like to zero-out our normalized X-axis. Looking at the current X-axis, the left side of the butterfly chart starts at X=50 and the right side of the butterfly chart starts at X=90.

Second, right-click anywhere on the chart area and click on “Select Data…” Then click on “Add” to add the new data series. In the Series Values, select the values in the Old X-axis column.

Your chart will look a little strange, but that’s okay. We’ll change the axis so that it looks a little bit more reasonable.

Third, right-click on one of the bars from the newly added data and click on “Change Series Chart Type…” Then change the Chart Type for Series6 from Stacked Bar to Scatter. This will change the bars to a scatter plot that we will manipulate into a new X-axis.

Fourth, we will right-click on the scatter plot and open a window to update the data by clicking on “Select Data…” Then on Series6, click on “Edit” to update the data. Using the values on the main table, select the Old X-axis values for the Y-values in the Edit Series window; then select the values in the New X-axis for the X-values.

The chart will have a scatter pattern like an upside-down trapezoid.

We will use a trick with the Y-axis to make this shape a straight line. Let’s add some data labels to the scatter. Afterwards, we want to reposition the label values to the bottom of the scatter points.

To change the scatter points from an upside-down trapezoid to a straight line, we will compress the Y-axis. To do this, right-click on the Y-axis and set the axis range to run from 0 to 10,000.

The bar chart will now have the scatter at the bottom of the chart along with the labels for the scatter points.
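The relabeling in Step 6 amounts to measuring each tick from the nearest inner edge of the butterfly. Here is a minimal Python sketch of that mapping. It is an illustration only: the function name label_for_position is hypothetical, the zero points 50 and 90 are the values read off the chart above, and the tick spacing of 10 is an assumption.

```python
def label_for_position(x, left_zero=50, right_zero=90):
    """Return the normalized axis label for a position on the stacked axis.

    Positions at or left of `left_zero` count up toward the left edge,
    positions at or right of `right_zero` count up toward the right edge,
    and positions inside the middle gap have no label.
    """
    if x <= left_zero:
        return left_zero - x
    if x >= right_zero:
        return x - right_zero
    return None


# Assumed tick positions, spaced every 10 units on each side.
positions = list(range(0, 51, 10)) + list(range(90, 141, 10))
for x in positions:
    print(x, "->", label_for_position(x))
```

Both sides of the chart now read from 0 at the center outward, so a bar of the same length represents the same percentage on either wing.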

Step 7. Delete the Y-axes, remove the legend, and zoom into the chart.

We are nearly done. All that’s left is to clean the chart of unnecessary labels and axes. First, delete the two Y-axes. Then delete the legend. We can also remove the top X-axis, since there’s no need to have it; we just want to keep the bottom X-axis. We can hide the scatter points by right-clicking on them and then changing the fill and border colors to none.

After this series of aesthetic changes, your chart should look like the following.

CONCLUSIONS

Based on the final butterfly chart, we can see that the younger patients had a large percentage of cases and the elderly patients had a large percentage of deaths. Policy makers can review this visualization and immediately identify this association, and they may conclude that the reason why there are so many deaths in the elderly population is transmission from the younger population.

REFERENCES

I used the following YouTube video by Doug H to help me write this tutorial.

The Excel file for this tutorial is located here.

The WSJ article that highlights the association between age and COVID-19 cases/deaths can be located here (but you will need a subscription to read the whole story).

Mark Bounthavong

September 22, 2020

Data visualization

Communicating data effectively with data visualizations: Part 29 (Forest plots)

INTRODUCTION

A recent paper by Wang and colleagues reported that patients with a recent diagnosis of substance use disorder (SUD) had a greater risk for COVID-19.[1] The adjusted odds ratio was 8.699 with a 95% confidence interval (CI) of 8.411 to 8.997. Patients with opioid use disorder (OUD) were at the greatest risk.

The authors used a forest plot to summarize their findings (Figure 1). A forest plot is a diagram that displays the measurement of interest (e.g., odds ratio) with horizontal error bars to represent the 95% CI for several variables, which are aligned on the Y-axis. Forest plots are common in pair-wise meta-analysis where multiple studies are used to describe the effect size of the treatment versus the comparator group. Studies are arranged along the vertical axis and the odds ratio with 95% CI are displayed next to the studies. This allows the readers to see the effect size (e.g., odds ratio) and the uncertainty surrounding each study (or variable) in the meta-analysis.

Wang and colleagues used this method to illustrate the odds of developing COVID-19 for different types of substance use disorder diagnoses along with their uncertainties. It’s an effective way to illustrate how much risk of developing COVID-19 is associated with each SUD diagnosis category.

Figure 1. Forest plot from the study that we will recreate.[1]*

(*This figure is used for educational purposes only.)

DATA

We will use the study by Wang and colleagues[1] to recreate their forest plot using Excel. Although it is much easier to code this in Stata or R, for the purpose of this tutorial, we will use Excel.

Step 1. Get the data

Since there are a few data points, we can enter these directly into an Excel spreadsheet. There are some nuances that we will need to consider when plotting these data points. A template is available to assist you with entering the data correctly. I have provided an illustration below for how you should set up your data in Excel. (You can download this template here.)

Once the data have been entered into Excel, we can begin to generate the figure.

Step 2. Insert a scatter plot chart

In the ribbon, select the Insert tab and then select the scatter plot drop-down. You will see a series of different scatter plots to choose from. Select the one that says “Scatter.”

Step 3. Select the data for the Scatter plot

Right-click in the empty chart area to select the data. Select the Add button to select the data of interest.

In the Series name field, enter “data.” Then for the Series X-values, select the values in the Odds ratio column. For the Series Y-values, select the values in the Y position column as indicated in the figure.

Your scatter should appear on the Chart area.

Step 4. Add labels on the Y-axis

We want to have the labels on the Y-axis indicating which SUD diagnosis was associated with increased odds of developing COVID-19. Similar to the previous data entry, we will begin by right-clicking on the Chart Area and selecting Add data. For the X values, select the values in the Y Label position column. For the Y values, select the values in the Y position column as indicated below. Once you select the data, the Chart Area will update with the orange scatter points on the Y-axis.

Right-click the orange scatter points on the Y-axis. Then select the “Add Data Labels” to include data labels to the orange scatter points.

Right-click on the data labels and click on the Format Data Series. Make sure to check the box next to Series Name and uncheck the box next to Y value. Check the box next to the Values From Cell, then click on the Select Range box to open another window where you can select the data labels. Select all the SUD diagnoses for this data range in the Select Data Label Range field.

Data labels for the SUD diagnoses will appear on the right side of the scatter on the Y-axis. Next, we will reposition the SUD diagnoses labels to the left of the Y-axis. Right-click on the orange scatter and then select “None” in the Marker options to remove the scatter on the Y-axis. Then click on the Y-axis values and press the “Delete” key on your keyboard. This will remove the labels on the Y-axis (e.g., 1, 2, 3, 4, 5, 6, and 7). All that should remain are the SUD diagnoses labels. To reposition these, right-click on the SUD diagnoses labels and then check the “Left” option in the Label Position field.

This will reposition the SUD diagnoses labels to the left of the Y-axis.

Step 5. Add the error bars for the 95% CI

Next, we will include the horizontal error bars to represent the 95% CI. Click anywhere in the Chart Area and the Chart Design tab will become available in the Ribbon. Select the “Add Chart Element” dropdown arrow to open up the available options and select “Error Bars.” Select “Standard Error” to display both the horizontal and vertical error bars around the scatter points on the chart.

We want to keep the horizontal error bars, so we need to delete the vertical error bars. Select the vertical error bars and then hit the “Delete” key on your keyboard. This will remove the vertical error bars and leave you with only the horizontal error bars. Next, you want to adjust the horizontal error bars. Currently, they do not display the correct 95% CI. Right-click on a horizontal error bar and select the “Format Error Bars…” option.

Select the “Specify Value” box next to the Custom option for the error bars. This lets us specify what Excel plots for the 95% CIs. For the “Positive Error Value,” select the values under the “UL – OR” column. Similarly, for the “Negative Error Value,” select the values under the “LL – OR” column.

The appropriate error bars now reflect the 95% CIs from the figure generated by Wang and colleagues.
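Custom error bars are drawn as distances from the point estimate, not as the confidence limits themselves. The Python sketch below illustrates that conversion with the overall estimate quoted in the introduction. The function name error_bar_lengths is hypothetical, and the sketch assumes both distances are stored as positive numbers; check how the template’s “LL – OR” column is signed before relying on this.

```python
def error_bar_lengths(odds_ratio, lower, upper, digits=3):
    """Return (negative_length, positive_length) for custom error bars.

    Both lengths are positive distances from the odds ratio to the
    lower and upper 95% confidence limits (an assumption of this sketch).
    """
    if not lower <= odds_ratio <= upper:
        raise ValueError("expected lower <= odds_ratio <= upper")
    negative = round(odds_ratio - lower, digits)
    positive = round(upper - odds_ratio, digits)
    return negative, positive


# Overall adjusted odds ratio and 95% CI quoted in the introduction.
print(error_bar_lengths(8.699, 8.411, 8.997))
```

The two lengths are generally not equal, because a confidence interval for an odds ratio is symmetric on the log scale rather than on the plotted scale.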

Step 6. Adding the null line at Odds Ratio = 1

To add the vertical line that crosses where the odds ratio (OR) is equal to 1, we need to use the column “Null position.” Right-click anywhere on the Chart Area and click on “Select Data.” This opens up the data menu. Click on “Add Data.” Then follow the instructions to select the values in the Null Position column for the Series X values box and the values in the Y position column for the Series Y values box. We’ll name this data series “null” since this is where the odds ratio is equal to 1.

Right-click on the scatter and select “Change Series Chart Type” to open the window where you can select different chart styles. For the null data series (i.e., where the odds ratio is equal to 1), change the Chart Type from “Scatter” to “Scatter with Straight Lines.” This will generate a straight vertical line along the values where the odds ratio is equal to 1.

Right-click on a scatter point along the straight line and select “Format Data Series…”; then remove the marker by selecting None under the Marker Options.

Step 7. Modify the chart presentation

At this stage, most of the steps needed to build the forest plot are complete. The final steps involve changing the colors, adjusting the length of the Y-axis, and removing the gridlines. I also included the odds ratio (OR) and 95% CI on the right side of the forest plot by entering each value into the corresponding cells in Excel. I also added a blue line at the top of the forest plot and some text boxes for the labels corresponding to the SUD diagnoses and the odds ratios with their 95% confidence intervals.

CONCLUSIONS

The final Excel forest plot is similar to the one generated by Wang and colleagues.1 I opted to leave out the P-values since they were all significant and did not add any information to the chart. Some additional modifications included the use of a red dotted line where the odds ratio (OR) is equal to 1 and the use of a blue top border to separate the labels for the chart columns.

Forest plots are great when you want to show the impact each variable has on a particular outcome. In our example, each of the different SUD diagnoses has an impact on the odds of developing COVID-19. From the forest plot, it is easy to identify OUD as having the greatest odds of developing COVID-19.

One additional thing that we can do is order this from the highest odds ratio to the lowest odds ratio, which will give us a better way to compare relative strengths across the different SUD diagnosis categories (see below).
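Ordering the rows before plotting is a simple descending sort on the odds ratio. A minimal Python sketch is shown here. The labels and values are placeholders, not the estimates reported by Wang and colleagues.

```python
def sort_by_odds_ratio(rows):
    """Sort (label, odds_ratio) pairs from the highest to the lowest odds ratio."""
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Placeholder rows for illustration only
rows = [("Diagnosis A", 1.8), ("Diagnosis B", 3.2), ("Diagnosis C", 2.4)]
for label, odds_ratio in sort_by_odds_ratio(rows):
    print(label, odds_ratio)
```

In Excel, the same result comes from sorting the data table on the odds ratio column, as long as the confidence limits and Y positions are kept aligned with their rows.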

You can download the Excel file for this exercise here.

REFERENCES

Wang QQ, Kaelber DC, Xu R, Volkow ND. COVID-19 risk and outcomes in patients with substance use disorders: analyses from electronic health records in the United States. Mol Psychiatry. Published online September 14, 2020:1-10. doi:10.1038/s41380-020-00880-7.

Mark Bounthavong

August 21, 2020

Data visualization

Communicating data effectively with data visualizations: Part 28 (Scatter plot)

INTRODUCTION

Between 1918 and 1919, the influenza pandemic (also known as the “Spanish Flu”) raged across the world and caused over 40 million deaths. Cities in the United States enacted nonpharmaceutical interventions (e.g., social distancing, shelter-in-place mandates) to reduce the transmission of the influenza pandemic, overall and peak attack rates, and the number of deaths. Some of the cities were successful in mitigating the calamity associated with the pandemic, but others were not. The lessons these cities learned yield important insight for policy makers tackling the current COVID-19 pandemic.

Markel and colleagues (2007) reported on the nonpharmaceutical interventions enacted by cities in the United States and the effect they had on mitigating the influenza pandemic of 1918 to 1919.[1] Briefly, their report highlights that cities that implemented these public health interventions early had greater delays in the time to reaching peak mortality, lower peak mortality rates, and lower total mortality.

We will recreate one of the figures (Figure 1c) in this manuscript using Excel and the data provided.

Figure from the study that we will recreate.[1]*

DATA

Data for this tutorial come directly from the study’s Table 1. We will use the Public Health Response (days) on the X-axis and the Excess Pneumonia and Influenza Mortality rate (deaths per 100,000 population) on the Y-axis. You can download the data from the following link.

Step 1. Get the data

Download the data from this link. Data has been cleaned specifically for this tutorial.

Step 2. Insert a scatter plot chart

After downloading the data, open the Excel file. Look for the column that contains the Public health response time, days; this will be the data for the X-axis. Now, look for the column that contains the Excess pneumonia and influenza mortality, deaths / 100,000 population; this is the data for the Y-axis.

In Excel, insert the Scatter plot by selecting the Scatter option in the Charts tab.

Step 3. Select the data for the Scatter plot

An empty figure will appear. Right-click in any area in the empty figure and you should be able to click on “Select Data”. From there, click on “Add” to add data and select the appropriate data for the X-axis values and the Y-axis values.

Clicking “OK” will generate a scatter plot of the excess deaths across the time the public health responded to the pandemic.

Step 4. Adjust the axes

First, we want to move the Y-axis so that it is flush with the left side of the chart instead of intersecting at zero on the X-axis.

This will change the Y-axis position from its intersection on the X-axis = 0 to X-axis = -15.

Step 5. Change the color of the scatter

To finalize the scatter plot, change the color and size of the scatter.

FINAL SCATTER PLOT

Once all the adjustments have been made, we can add some data labels for some of the select cities, which were also highlighted with a different color.

CONCLUSION

After recreating the figure from the paper by Markel and colleagues,[1] it is clear that as the public health response is delayed, there is a general trend for excess deaths due to the influenza pandemic to increase. Although other types of interventions occurred during this pandemic, the findings from Markel and colleagues provide some empirical evidence that early public health measures contribute substantially to mitigating the excess deaths due to the influenza pandemic. Policy makers can use the lessons from the past to inform them about the effectiveness of public health nonpharmaceutical interventions in delaying or reducing the mortality of the current COVID-19 pandemic.
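One way to put a number on the general trend in a scatter plot is the slope of a least-squares line, which is the same quantity Excel reports when you add a linear trendline. The Python sketch below uses invented numbers, not the values from Table 1 of Markel and colleagues, and it is not a reproduction of their analysis.

```python
def least_squares_slope(x, y):
    """Return the slope of the ordinary least-squares line of y on x."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx


# Invented data: response time in days and excess deaths per 100,000
response_days = [-10, 0, 10, 20]
excess_deaths = [400, 500, 600, 700]
print(least_squares_slope(response_days, excess_deaths))
```

A positive slope means that excess mortality tends to rise as the response is delayed. It describes an association in the plotted data and does not by itself establish cause.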

REFERENCE

Markel H, Lipman HB, Navarro JA, et al. Nonpharmaceutical Interventions Implemented by US Cities During the 1918-1919 Influenza Pandemic. JAMA. 2007;298(6):644-654. doi:10.1001/jama.298.6.644

Mark Bounthavong

July 20, 2020

Data visualization

Communicating data effectively with data visualizations: Part 27 (Building a COVID-19 surveillance dashboard using Power BI)

INTRODUCTION

With the advent of the COVID-19 pandemic, several high-profile dashboards have been created using data from around the world. Since Power BI is a new tool, I thought it would be a good opportunity to use it to develop a simple COVID-19 dashboard with existing data.

We will use this opportunity to create the following dashboard:

DATA

Data for this tutorial came from Our World in Data (OWID) GitHub site. You can download a version of the data that focuses on the United States from my Dropbox folder. There are a lot of variables in this dataset; however, the codebook for all of the variables is located here.

POWER BI TUTORIAL

Step 1. Download data

The data for the United States is located here. Download this to a convenient location.

Step 2. Loading the data into Power BI

When you start Power BI, select “Get data.” Make sure you click on the Excel format and then click on “Connect.” Locate the data where you saved it and then, in the Navigator window, select “Sheet 1.” Make sure to click “Load” to complete the process.

Step 3. Plotting total deaths

In the Visualizations panel, select the line chart. In the Fields panel, you can select the “date” variable. However, we will not use the built-in hierarchy. Instead, we will use the date order (e.g., January 1, 2020; January 2, 2020; January 3, 2020; etc.).

After you have selected the date for the X-axis, the next variable is the Total Deaths.

Step 4. Change the labels

The current visualization uses the default labels. However, we can change these in the Visualization panel.

To change the name of the visual, you will need to use the Format option.

Once you’ve done these steps, the first visualization should be nearly complete. Additional edits to the visualization can be done using the Format (paint roller icon) such as the background color and line color.

Step 6. Include the Card text

You can add an additional visualization, such as the number of deaths, using the card feature in the Visualization panel.

Step 7. Repeat steps for the other visualizations

You can repeat these steps for the other COVID-19 metrics (e.g., new deaths, total cases, and number of new cases).
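The daily and cumulative metrics are related by simple differencing: each day's new deaths are that day's total minus the previous day's total. The OWID data already include both kinds of column, so the Python sketch below only illustrates the relationship. The function name and the numbers are hypothetical.

```python
def daily_from_cumulative(totals):
    """Convert a cumulative series into day-by-day increments."""
    daily = []
    previous = 0
    for total in totals:
        daily.append(total - previous)  # change since the previous day
        previous = total
    return daily


# Hypothetical cumulative deaths over four days
print(daily_from_cumulative([0, 1, 4, 9]))
```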

Once you have added all the visualizations to your dashboard, you will hopefully get something like the example below.

CONCLUSIONS

Power BI makes it very convenient to build dashboards using available data. Other software packages are also available, so I would recommend exploring them. However, I would also encourage you to explore other dashboards that are created using Power BI; Microsoft has a library of great examples. By using the methods presented in this tutorial, you can build other dashboards using different types of data.

Files used in this tutorial are available here.

REFERENCES

Data for this tutorial came from the Our World in Data GitHub site.

Mark Bounthavong

June 26, 2020

Data visualization, GIS

Communicating data effectively with data visualizations: Part 26 (COVID-19 choropleth)

INTRODUCTION

Power BI is a useful tool that allows you to build dashboards and data visuals. One ability that makes Power BI unique is its mapping feature. Although there are pre-built map functions in Power BI, better and easier applications are available through their AppSource store. This tutorial will go over how to create a choropleth of the most recent COVID-19 pandemic using the incidence rate (dated June 21, 2020) for the United States (US) using Power BI and the Drilldown Choropleth application.

DATA

You will need to download two types of data: (1) the COVID-19 incidence rate for each county in the United States (you can download the Excel file from this Dropbox link) and (2) shape files for the United States. Fortunately, there are many sources of shape files for the United States, and I have made one available using the Albers USA projection, which places the states of Alaska and Hawaii in the lower left corner of the map. You can use the link to the map shape file here. Data for the shape files come from the US Census. [I downloaded the ZIP file and converted the SHP files to TopoJSON using the following conversion tool (Mapshaper.org).] The COVID-19 data contain US county codes called Federal Information Processing Standards (FIPS) codes, which are 5-digit codes that identify counties. Incidence rate is defined as the number of confirmed cases per 100,000 persons.

Albers US projection
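The incidence rate defined above is a simple rescaling of the case count by the population. A minimal Python sketch follows. The function name and the county figures are hypothetical and are not taken from the Johns Hopkins data.

```python
def incidence_rate_per_100k(confirmed_cases, population):
    """Return confirmed cases per 100,000 persons."""
    if population <= 0:
        raise ValueError("population must be positive")
    return confirmed_cases * 100_000 / population


# Hypothetical county: 250 confirmed cases among 50,000 residents
print(incidence_rate_per_100k(250, 50_000))
```

Scaling by population is what makes counties of very different sizes comparable on the same color scale.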

POWER BI TUTORIAL

For this tutorial, you will need to have Power BI installed on your desktop to re-create this choropleth.

Step 1. Get the data

When you start Power BI, you will see a start screen that allows you to select the data you need. Since the COVID-19 incidence rates for the US data are saved as an Excel file, you will need to select Excel as the data type and then connect to the data.

After clicking “Connect,” the Navigator will open. Select Sheet 1 containing the data. Notice that there are two variables (FIPS1 and FIPS). These are the data that will be used to pair the incidence rate to the US county code. The FIPS code has a “0” in front of some of the county codes (e.g., 05001) for a total of 5 digits. However, notice that these are in numeric format, so the leading “0” is missing. To remedy this, we need to transform the data. Make sure to select “Transform Data” from the Navigator screen.

After selecting “Transform Data,” you will need to change the FIPS variable from numeric to text.

You will be asked to either “Replace” or “Add New Step”; make sure that you select “Replace.” This will convert the numeric data to text, which is what we want for the FIPS variable. Check the data after you do this; notice that the “0” now precedes values that have fewer than 5 digits. (The FIPS code is a 5-digit county code.) After the data has been transformed, select “Close & Apply” to finalize the data. This will load the data into Power BI, which you will use to build the choropleth.
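The leading-zero problem is easy to see outside Power BI as well. The Python sketch below shows the general idea of turning a numeric county code back into a 5-character FIPS string. It is an analogue of the fix, not the Power Query step itself, and the function name is hypothetical.

```python
def to_fips(code):
    """Return a county code as a 5-character string, restoring leading zeros."""
    text = str(int(code))
    if len(text) > 5:
        raise ValueError("county FIPS codes have at most 5 digits")
    return text.zfill(5)  # pad on the left with zeros up to 5 characters


print(to_fips(5001))   # a code that lost its leading zero as a number
print(to_fips(36061))  # a code that already has 5 digits
```

Storing the code as text matters because the shape file identifies counties by the full 5-character string, so a code such as 5001 would not match 05001.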

Step 2. Download and Install “Drilldown Choropleth” app

Power BI allows you to download apps from the AppSource store. Click on the “…” and select “Get more visuals.” This will open the AppSource store where you can type the name of the app you want.

Step 3. Enter data into the Drilldown Choropleth app

After you install the Drilldown Choropleth, an icon will appear in your Visualization panel. Click on this icon (Drilldown Choropleth); this will open up options in the Visualization panel. There are two important fields (“Location” and “Values”). Click and drag the FIPS variable into the “Location” field and the Incidence_Rate variable into the “Values” field.

Next, click on the paint roller icon to open the Drilldown Choropleth options. Expand the “Shape” option to change the “Projection” to “albersUSA” and then enter the link to the TopoJSON shape file for the United States counties. The TopoJSON link is here: https://raw.githubusercontent.com/mbounthavong/Maps/master/c_03mr20.json

After entering the projection type and the TopoJSON link, you can make changes to the color of the choropleth. Since this is a choropleth, you will need to enter values for the Minimum, Center, and Maximum colors. This will generate a gradient where the darker colors reflect high incidence rates and the lighter colors reflect low incidence rates of COVID-19.
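To give a sense of how a three-stop gradient can turn a value into a color, here is a minimal Python sketch that interpolates linearly between a minimum, center, and maximum color. How the Drilldown Choropleth app actually computes its colors is not documented in this tutorial, so the linear interpolation, the function names, and the RGB values are all assumptions.

```python
def blend(color_a, color_b, fraction):
    """Linearly blend two RGB tuples; fraction=0 gives color_a, 1 gives color_b."""
    return tuple(round(a + (b - a) * fraction) for a, b in zip(color_a, color_b))


def gradient_color(value, vmin, vcenter, vmax, min_color, center_color, max_color):
    """Map a value onto a min -> center -> max color gradient."""
    if not vmin < vcenter < vmax:
        raise ValueError("expected vmin < vcenter < vmax")
    value = max(vmin, min(vmax, value))  # clamp values outside the range
    if value <= vcenter:
        fraction = (value - vmin) / (vcenter - vmin)
        return blend(min_color, center_color, fraction)
    fraction = (value - vcenter) / (vmax - vcenter)
    return blend(center_color, max_color, fraction)


# Assumed colors: light for low rates, dark for high rates
LIGHT, MIDDLE, DARK = (255, 245, 235), (253, 141, 60), (127, 39, 4)
print(gradient_color(750, 0, 500, 1000, LIGHT, MIDDLE, DARK))
```

Because incidence rates are often skewed, the choice of center value has a large effect on how much of the map appears light or dark.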

Finally, you can change the size of the visual and save it. Since there is a lot of data being processed, updating the choropleth will take a few minutes. With a little patience, you will generate a suitable choropleth of the COVID-19 incidence across US counties for June 21, 2020.

CONCLUSIONS

Power BI allows us to create visually spectacular choropleths. Additionally, it allows us to create choropleths that are interactive. You can hover over each shape and see the incidence rate per county. However, there are limitations. Inserting a legend to describe what the colors represent is not easily possible (I was unable to solve this problem). There are roundabout ways to get the legend inserted into this choropleth, but it will require additional software (as far as I know). There is potential for the Drilldown Choropleth app to improve by adding an easy way to include functionalities for legends and labels. Despite this limitation, Power BI allows us to create these choropleths without having to resort to more complex programming in R or Python.

Files used in this tutorial are available here.

REFERENCES

Mapshaper.org (link: https://mapshaper.org/)

US Census Cartographic files (link: https://www.census.gov/geographies/mapping-files/time-series/geo/carto-boundary-file.html)

COVID-19 data come from the Johns Hopkins GitHub site (link: https://github.com/CSSEGISandData/COVID-19/blob/master/csse_covid_19_data/csse_covid_19_daily_reports/06-21-2020.csv)

Mark Bounthavong

May 19, 2020

Data visualization

Communicating data effectively with data visualizations: Part 25 (Moving averages)

INTRODUCTION

As the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) continues to spread around the world, data visualization experts have provided stunning visuals of the pandemic’s terrifying impact on public health. Several of these include a moving average metric that smooths the curves of the pandemic’s mortality data. Common moving averages presented by these visualizations include the 5-day average daily deaths and the 7-day average daily deaths. Information Is Beautiful presents mortality as a 7-day moving average while International SOS presents the average daily deaths as a 5-day moving average.

Figure 1. 7-day moving average daily deaths by Information Is Beautiful (Last updated on 19 May 2020).

MOTIVATING EXAMPLE

In this article, we will replicate the 5-day and 7-day (moving) average daily deaths due to SARS-CoV-2 using data from the European Centre for Disease Prevention and Control (ECDC)1 or from the Our World in Data’s GitHub site.2 Because SARS-CoV-2 data are updated continually, the data used in this exercise will eventually be out of date. Please visit the ECDC or the Our World in Data site to download the most recent SARS-CoV-2 data.

We will download data from the Our World in Data’s GitHub site and then use Excel to recreate this plot for the United States (US).

Step 1. Download data from the OWID Covid-19 GitHub site

You can download the raw data for all the countries from the OWID’s site here. However, you can also download the cleaned data for this article (parsed for the United States) from my Dropbox folder here (I cleaned the data and prepared them for use in Excel).

The data has the following format:

We are interested in the new_deaths column, which is the number of new confirmed deaths by day. The moving average will estimate the average daily deaths across either 5-day or 7-day time periods.

Step 2. Add columns for the 5-day and 7-day moving average.

When you load the data into Excel, you will need to create two columns: one for the 5-day average daily deaths and another for the 7-day average daily deaths. I used MAD_5 for the 5-day moving average daily deaths and MAD_7 for the 7-day moving average daily deaths.

Step 3. Activate the Analysis ToolPak

Excel has a tool that will perform simple analysis; however, you may need to activate this if it is not on the ribbon.

Once the Options selection has been selected, you can navigate to the Add-in option and open the window to select the Analysis ToolPak. You only need the Analysis ToolPak for this exercise, but it is a good idea to explore the other tools available (e.g., Solver).

Step 4. Estimate the moving average

Once the Analysis ToolPak is activated, return to your main worksheet with the new data columns. Start by selecting the cell where you want the first moving average value to be calculated. Then select the Data Analysis add-in. This will open a window with different Analysis Tools. Select the Moving Average tool.

Once you click “Ok,” you will be asked for several elements to calculate the moving average. You will need to select the Input Range, the number of days needed for calculating the moving average, and the Output Range. After selecting the appropriate information, you can click “Ok” to generate the moving average. Do these steps for the 5-Day and 7-Day moving average daily deaths.
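The calculation behind the tool is straightforward: each value is the mean of the current day and the preceding days in the window. Below is a minimal Python sketch of a trailing moving average. The function name and the numbers are hypothetical, and the assumption that the window ends on the current day (with no value until a full window is available) is mine, so check it against your Excel output.

```python
def moving_average(values, window):
    """Return the trailing moving average of values.

    Positions without a full window of data are returned as None.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    averages = []
    for i in range(len(values)):
        if i + 1 < window:
            averages.append(None)  # not enough days yet
        else:
            chunk = values[i + 1 - window : i + 1]
            averages.append(sum(chunk) / window)
    return averages


# Hypothetical daily deaths for one week
new_deaths = [0, 2, 4, 6, 8, 10, 12]
print(moving_average(new_deaths, 5))
print(moving_average(new_deaths, 7))
```

A longer window smooths more but also reacts more slowly to changes, which is the main trade-off between the 5-day and 7-day versions.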

Step 5. Generate the line graph

Now that the average daily deaths are calculated for 5-day and 7-day moving averages, you can generate a line graph that resembles Figure 1.

Insert a line graph onto the Excel worksheet. Right click on any empty area of the line graph; this will open up the edit options.

You can make edits to the figure like entering the data to create the line graph. The Series values is the set of numbers that you calculated for the 5-day average daily deaths.

Once you have selected the data for the Series values, you can edit the x-axis label. Currently, the x-axis defaults to a sequence of 1, 2, 3, …, n. You want this to reflect the dates, which are in column B. Click on Edit and select the dates in column B to properly relabel the x-axis.

This will generate the first line graph with the 5-day average daily deaths visualization.

Step 6. Finalize the figure

After adding the 5-day average daily deaths, go ahead and repeat these steps for the 7-day average daily deaths. Using Excel’s format feature, you can make changes to the figure’s size, height, and colors. To replicate Figure 1, you will need to change the y-axis to a logarithmic scale.
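On a logarithmic axis, a value is positioned by its logarithm, so equal distances represent equal ratios rather than equal differences. The Python sketch below illustrates this with base 10. The function name and the values are hypothetical.

```python
import math


def log10_positions(values):
    """Return the base-10 logarithm of each value, or None where it is undefined."""
    # Zero and negative values have no logarithm, so they cannot be placed
    return [math.log10(v) if v > 0 else None for v in values]


# Hypothetical average daily deaths
print(log10_positions([1, 10, 100, 1000, 0]))
```

This is also why days with an average of zero deaths cannot be drawn on the logarithmic version of the chart.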

After changing the scale on the y-axis and adding the 7-day average daily deaths, you should get a figure similar to the one below.

You can also present this without the logarithmic scale.

CONCLUSIONS

The moving average of daily deaths smooths the line plots and provides a trend that is easy to understand and interpret. The major difference between these moving average plots and the mortality curves (see previous blog) is the ability to see the decline in daily deaths. The average daily deaths curve allows us to see the decline in the number of deaths on a moving average. Mortality curves don’t do that since they are based on the total number of deaths. Despite the differences in plots, both the moving average daily deaths and mortality curves provide critical visual information about the current SARS-CoV-2 trends.

You can download the data and completed exercise here.

REFERENCES

European Centre for Disease Prevention and Control. Download today’s data on the geographic distribution of COVID-19 cases worldwide. European Centre for Disease Prevention and Control. Published April 18, 2020. Accessed May 19, 2020. https://www.ecdc.europa.eu/en/publications-data/download-todays-data-geographic-distribution-covid-19-cases-worldwide
Our World in Data. GitHub: Owid/Covid-19-Data. Our World in Data; 2020. Accessed May 19, 2020. https://github.com/owid/covid-19-data