September 15, 2025

R Programming

Loading data into the R environment

September 15, 2025

R Programming

I wrote a short exercise on how to load *.csv and *.xlsx files into the R environment, which I posted on my RPubs site (link)

Mark Bounthavong

August 30, 2025

R Programming, Statistics & Probability

R - Tips and Tricks (Guide) - Part 2

Mark Bounthavong

August 30, 2025

R Programming, Statistics & Probability

I wrote a second R guide to help students navigate and use R and RStudio in their biostatistics course. I focused on creating vectors, matrices, and dataframes.

The guide can be found on my RPubs site.

Mark Bounthavong

October 20, 2020

Data visualization

Communicating data effectively with data visualizations: Part 30 (Butterfly charts)

Mark Bounthavong

October 20, 2020

Data visualization

INTRODUCTION

COVID-19 data on cases and deaths highlight the devastating impact it has had on public health. As of 20 October 2020, there has been over 8.5 million cases and over 200,000 deaths. A majority of deaths have been among the elderly while a majority of the cases have been among the younger population. The Wall Street Journal recently published an article describing this relationship. Of concern is the potential for transmission to occur between the younger group who have the most cases and the elderly population.

In this tutorial, we will compare the distribution of cases and deaths across different age groups to visualize the relationships between these dimensions. To do that, we will use a butterfly chart, which juxtaposes two vertical bar charts in a mirror-like fashion. Butterfly charts allow us to plot two data sets using a common dimension; this allows us to visually see their differences and scales.

Here is an example of a butterfly chart from datavizproject.com.

MOTIVATING EXAMPLE

We will use data from the CDC on COVID-19 cases and deaths distributed by age groups. You can download the data from the CDC website site here. You can also download the Excel workbook for this exercise here.

CONSTRUCTING THE BUTTERFLY CHART

Step 1. Open the Excel File and review the data.

The main data will include the Age Group, Percentage of cases, Percentage of deaths, Mirror1, Mirror2, and Middle. These columns will be used to build the butterfly chart. The next smaller table will be used to re-align the data on a different X-axis.

Here are some data definitions:

Age Group = Age distribution of the population
Percentage of cases = Proportion of patients within each age group that had confirmed COVID-19 testing
Percentage of deaths = Proportion of patients within each age group that died of COVID-19 related disease
Mirror1 = Represents the amount of gap that is created from the left variable of the butterfly chart. This is estimated using: 50 – Percentage of cases (50 was used because it was a reasonable value after the max value). For instance, the max value for the Percentage of cases is 23.7. Therefore, Mirror1 = 50 – 23.7 = 26.3.
Mirror2 = Represents the amount of gap that is created from the right variable of the butterfly chart. This is estimated using: 50 – Percentage of deaths (50 was used because it was a reasonable value after the max value)
Middle = Represents the gap in the middle where we will place our data labels

Step 2. Create a stacked horizontal bar chart.

Select the data shown in the figure below and select the Stacked Bar Chart.

The stacked bar chart will look like the following:

Step 3. Order the categories.

Once the stacked bar chart is created, we will re-order the categories so that the Middle values are in the middle of the group. The order should be Mirror1, Percentage of cases, Middle, Percentage of deaths, and Mirror2.

Step 4. Remove color from the bars.

Next, we will remove the color fills from the Mirror1, Middle, and Mirror2 bars from the stacked bar chart. We should start to see the beginnings of a butterfly chart.

Step 5. Add labels to the middle of the butterfly chart.

Once the selected bars have their fill colors removed, we can add labels to reflect the age categories. First, we will reverse the order of the Y-axis by right-clicking on it and the selecting the “Categories in reverse order.” Second, we can remove the gridlines by clicking on them and then clicking on the “Delete” button. Third, we will add age category labels to the Middle bars by right-clicking on the bars, selecting “Add data labels,” then select “Format data label” and check the “Category Name” and uncheck the “Values.” This should replace the values with the age category names for the Middle bar in the stacked bar chart.

Step 6. Adding the new X-axis labels.

Since the current stacked bar chart uses the X-axis from the main data table, we don’t have a normalized axis. To do that, we will need to add a new set of data and then replace our current X-axis with the updated X-axis.

First, we need to establish where we would like to zero-out our normalized X-axis. Looking at the current X-axis, the left side of the butterfly chart starts at X=50 and the right side of the butterfly chart starts at X=90.

Second, Right-click anywhere on the chart area. Click to add a new data set, then click on the “Add” to add the new data. In the Series Values, select the values in the Old X-axis column.

You chart will look a little strange, but that’s okay. We’ll change the axis so that it looks a little bit more reasonable.

Third, right click on one of the bars from the newly added data, click on the “Change Series Chart Type…” Then change the Chart Type from Stacked Bar Chart for Series 6 to Scatter. This will change the bars to a scatter plot that we will manipulate into a new X-axis.

Fourth, we will right-click on the scatter plot and open a window to update the data by clicking on “Select Data…” Then on Series6, click on “Edit” to update the data. Using the values on the main table, select the Old X-axis values for the Y-values in the Edit Series window; then select the values in the New X-axis for the X-values.

The chart will have a scatter pattern like an upside trapezoid.

We will use a trick with the Y-axis to make this shape a straight line. Let’s add some data labels to the scatter. Afterwards, we want to reposition the label values to the bottom of the scatter points.

To change the scatter points from an upside trapezoid to a straight line, we will compress the Y-axis. To do this right-click on the Y-axis, and change the range of the axis from 0 to 10,000.

The bar chart will now have the scatter at the bottom of the chart along with the labels for the scatter points.

Step 7. Delete the Y-axes, remove the legend, and zoom into the chart.

We are nearly done. All that’s left is to clean the chart of unnecessary labels and axes. First, delete the two Y-axes. Then delete the legend. We can also remove the top X-axis, there’ no need to have that. We just want to keep the bottom X-axis. We can remove the scatter by right-clicking on it and then changing the fill and border colors to none.

After a series of these aesthetic change, your chart should look like the following.

CONCLUSIONS

Based on the final butterfly chart, we can see that the younger patients had a large percentage of cases and the elderly patients had a large percentage of deaths. Policy makers can review this visualization and immediate identify this association, and they may conclude that the reason why there are so many deaths in the elderly population is due to transmission from the younger population.

REFERENCES

I used the following YouTube video by Doug H to help me write this tutorial.

The Excel file for this tutorial is located here.

The WSJ article that highlights the association between age and COVID-19 cases/deaths can be located here (but you will need a subscription to read the whole story).

Mark Bounthavong

February 10, 2020

Data visualization