December 5, 2021

Logistic regression in R

I wrote a tutorial on how to construct logistic regression models in R. This tutorial was published on RPubs (link). I go through the use of the glm() command to fit a crude (unadjusted) logistic regression model and a multivariable logistic regression model. The data (diabetes.csv) that I used for the motivating example is located here.

The RMarkdown code I used to create the tutorial is located on my GitHub site (link).
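
For readers who want a quick feel for the two model types before opening the tutorial, here is a minimal sketch. It is not the tutorial's code: because the columns of diabetes.csv are not listed here, it assumes the built-in mtcars data instead, with transmission type (am) standing in as the binary outcome.

```r
# Minimal sketch with the built-in mtcars data (NOT the tutorial's diabetes.csv).
# Outcome: am (0 = automatic, 1 = manual). Predictors are chosen for illustration only.

# Crude (unadjusted) model: a single predictor
crude <- glm(am ~ wt, data = mtcars, family = binomial(link = "logit"))

# Multivariable model: the same predictor plus one additional covariate
adjusted <- glm(am ~ wt + hp, data = mtcars, family = binomial(link = "logit"))

# Exponentiate the log-odds coefficients to obtain odds ratios
or_crude <- exp(coef(crude))
or_adjusted <- exp(coef(adjusted))

# Wald-type 95% confidence intervals on the odds ratio scale
ci_adjusted <- exp(confint.default(adjusted))

print(round(or_crude, 3))
print(round(cbind(OR = or_adjusted, ci_adjusted), 3))
```

Comparing the odds ratio for wt between the two models shows how much the estimate moves once the second covariate is added.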

Mark Bounthavong

November 10, 2021

Data visualization, R Programming, Methods

Visualizing linear regression models using R - Part 2

I continue my previous blog post on visualizing linear regression models using R (link). Part 2 focuses on using visualization to assess whether the model’s residuals were associated with the predicted values and whether they are normally distributed.

The R Markdown code that I wrote to create this tutorial is located on my GitHub site (link).
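
As a rough illustration of the two checks described above (residuals against predicted values, and normality of the residuals), here is a minimal base R sketch. It assumes the built-in mtcars data and an arbitrary model, and it is not the code from the tutorial.

```r
# Minimal sketch: residual diagnostics for a linear model (mtcars is an assumed example).
fit <- lm(mpg ~ wt + hp, data = mtcars)

res <- resid(fit)    # residuals
pred <- fitted(fit)  # predicted (fitted) values

op <- par(mfrow = c(1, 2))

# 1. Residuals vs. predicted values: look for a random scatter around zero
plot(pred, res,
     xlab = "Predicted values", ylab = "Residuals",
     main = "Residuals vs. predicted")
abline(h = 0, lty = 2)

# 2. Normal Q-Q plot: points close to the line suggest roughly normal residuals
qqnorm(res)
qqline(res)

par(op)
```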

You can find the tutorials on my RPubs site:

Part 1 - Visualizing linear regression model using R (link)
Part 2 - Visualizing linear regression model using R (link)

(NOTE: on 30 January 2022, I updated these tutorials and they can be found in my RPubs page here. The R Markdown code is saved on my GitHub page here.)

Mark Bounthavong

October 28, 2021

R Programming, Statistics & Probability

Visualizing linear regression models using R - Part 1

I wrote a tutorial on how to visualize linear regression models using R. In the tutorial, I used the lm() command to fit the models and the predict3d package to visualize them. You can view the RPubs tutorial here. (NOTE: on 30 January 2022, I updated this tutorial and it can be found in my RPubs page here.) I created this tutorial using R Markdown, and the code is available on my GitHub site (link).
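
The general idea of fitting a model with lm() and then drawing its predictions can be sketched as follows. This sketch deliberately uses base R graphics and predict() rather than the predict3d package, and it assumes the built-in mtcars data, so treat it as an illustration rather than the tutorial's approach.

```r
# Minimal sketch: visualize a simple linear regression with a 95% confidence band.
fit <- lm(mpg ~ wt, data = mtcars)

# Grid of predictor values spanning the observed range
grid <- data.frame(wt = seq(min(mtcars$wt), max(mtcars$wt), length.out = 50))

# predict() returns a matrix with columns fit, lwr, and upr
pred <- predict(fit, newdata = grid, interval = "confidence")

plot(mtcars$wt, mtcars$mpg,
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon",
     main = "Fitted line with 95% confidence band")
lines(grid$wt, pred[, "fit"], lwd = 2)
lines(grid$wt, pred[, "lwr"], lty = 2)
lines(grid$wt, pred[, "upr"], lty = 2)
```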

Mark Bounthavong

September 26, 2021

R Programming

R tutorial on using the epitools package to assess confounding and interaction

I created an R tutorial on using the epitools and epiR packages to assess confounding and interaction. It is located on my RPubs page here. I used R Markdown to create this tutorial and I uploaded my code on my GitHub page here. Here is a figure summarizing the lessons from the tutorial.
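
As a rough companion to the tutorial, the sketch below shows the usual logic for assessing confounding and interaction: compare the crude odds ratio with a stratum-adjusted one, then look at the stratum-specific odds ratios. It assumes the built-in UCBAdmissions table and uses mantelhaen.test() from base R's stats package rather than epitools or epiR, so it is not the tutorial's code.

```r
# Minimal sketch: crude vs. Mantel-Haenszel adjusted odds ratio (assumed example data).
# UCBAdmissions is a 2 x 2 x 6 table: Admit x Gender x Dept.

# Odds ratio from a 2 x 2 table
odds_ratio <- function(tab) (tab[1, 1] * tab[2, 2]) / (tab[1, 2] * tab[2, 1])

# Crude association: collapse over department
crude_tab <- apply(UCBAdmissions, c(1, 2), sum)
crude_or <- odds_ratio(crude_tab)

# Adjusted association: Mantel-Haenszel common odds ratio across departments
mh_or <- unname(mantelhaen.test(UCBAdmissions)$estimate)

# Stratum-specific odds ratios: large differences between strata hint at interaction
stratum_or <- apply(UCBAdmissions, 3, odds_ratio)

print(round(c(crude = crude_or, adjusted = mh_or), 2))
print(round(stratum_or, 2))
```

A crude estimate that differs noticeably from the adjusted one points toward confounding by the stratifying variable, while stratum-specific estimates that differ from each other point toward interaction.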

Mark Bounthavong

August 30, 2021

Data visualization, Methods, Commentary

Reproduction number—COVID-19

BACKGROUND

As the COVID-19 pandemic, which began in December 2019, continues into its second year, public health measures have been put into place to mitigate its spread. At the time of writing this article, there have been over 4.5 million deaths and over 216 million cases due to COVID-19.[1] Surveillance of COVID-19 remains an important public health tool for understanding its spread and impact. Daily reports such as the Johns Hopkins COVID-19 dashboard provide end users with visual and statistical information about the surges in cases and deaths associated with COVID-19. However, one measure that is of great interest is the reproduction number, or R0.

Reproduction number (R0) and effective reproduction number (Rt)

The basic reproduction number (R0) is the average number of new cases directly caused by exposure to a single case.[2,3] Figure 1 provides a visual explanation of the basic reproduction number. However, the underlying assumption with R0 is that everyone in the population is susceptible to infection. With the introduction of vaccines, R0 isn’t a good measure of the reproductive capabilities of COVID-19. Instead, the effective reproduction number (Rt) is used to provide a more realistic reproduction number that accounts for the share of the population that has already been infected, has recovered, or has been vaccinated. The Rt changes over time as the population susceptible to infection changes.
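
To make the relationship between the two numbers concrete, here is a minimal sketch. It assumes the simple textbook approximation in which Rt equals R0 multiplied by the fraction of the population still susceptible (homogeneous mixing). The R0 value and the susceptible fractions are hypothetical, not COVID-19 estimates.

```r
# Minimal sketch under a homogeneous-mixing assumption: Rt = R0 * susceptible fraction.
r0 <- 3                               # hypothetical basic reproduction number
susceptible <- c(1.0, 0.8, 0.5, 0.3)  # hypothetical fractions still susceptible

rt <- r0 * susceptible

# Susceptible fraction below which Rt drops under 1 (spread starts to shrink)
threshold <- 1 / r0

print(data.frame(susceptible = susceptible, rt = rt))
print(threshold)
```

When everyone is susceptible, Rt equals R0. As immunity from infection or vaccination grows, Rt falls, and spread declines once Rt is below 1.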

Figure 1. Basic reproduction number.

I wanted to create a figure that would highlight the changes associated with the Rt for each state in the United States. To do this, I downloaded the Rt data from Xihong Lin's Group in the Department of Biostatistics at the Harvard T.H. Chan School of Public Health. They have an amazing COVID-19 tracker dashboard that captures the changing patterns of Rt for each state. Then I created a Cleveland plot to show where the Rt was near the beginning of the pandemic and where it is currently (August 2021). (Note: I wrote a tutorial on creating Cleveland plots that you can review here.) Here is the final figure (because of the length of the figure, I cropped it to show the first 30 states or territories):

Figure 2. Effective reproduction number (Rt) for U.S. states and territories, April 17, 2020 (past) to August 14, 2021 (recent).

The blue dots denote the most recent effective reproduction number (14 August 2021) and the red dots denote the earliest effective reproduction number (17 April 2020).
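
The structure of a Cleveland plot like this one can be sketched in base R as follows. The state names and Rt values are invented placeholders (not values from the Harvard tracker), and the column names are assumptions made for illustration.

```r
# Minimal sketch of a Cleveland (dumbbell) dot plot with placeholder data.
rt <- data.frame(
  state  = c("State A", "State B", "State C", "State D"),
  past   = c(1.10, 0.95, 1.30, NA),    # hypothetical earliest Rt (NA = missing)
  recent = c(1.25, 1.05, 0.90, 1.15)   # hypothetical most recent Rt
)
y <- seq_len(nrow(rt))

op <- par(mar = c(4, 6, 2, 1))  # widen the left margin for the state labels
plot(NA,
     xlim = range(c(rt$past, rt$recent), na.rm = TRUE),
     ylim = c(0.5, nrow(rt) + 0.5),
     yaxt = "n", xlab = "Effective reproduction number (Rt)", ylab = "")
axis(2, at = y, labels = rt$state, las = 1)

# Connect the two time points only where both values are available
both <- !is.na(rt$past) & !is.na(rt$recent)
segments(rt$past[both], y[both], rt$recent[both], y[both], col = "grey60")

points(rt$past, y, pch = 19, col = "red")     # earliest value
points(rt$recent, y, pch = 19, col = "blue")  # most recent value
abline(v = 1, lty = 2)                        # Rt = 1 reference line
par(op)
```

Rows with a missing value show a single dot and no connecting segment.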

It seems that some states have gotten worse, in terms of an increased effective reproduction number, since the beginning of the pandemic. This could be due to a lack of good data in the early phases of the pandemic. However, what is of concern is the high effective reproduction number in some states (Rt > 2), which indicates that COVID-19 is still spreading at an alarming rate.

There were some missing data which are identified by a single dot (blue or red) or an empty field in the recent or past effective reproduction number. Rather than fill these in, I left them empty. There may be data in between the two time periods that I could have used, but I left those out.

One thing to mention is that this Cleveland plot only tells us one dimension of the effective reproduction number story (the difference between the most recent Rt and the earliest Rt). It doesn’t tell us much about how the effective reproduction number changes across time. For that, I direct your attention to Lin’s Laboratory Group at Harvard; they have a great figure that shows the fluctuation of the effective reproduction number for the U.S. and its states/territories (see example):

Source: Lin’s Laboratory Group at Harvard (link). [last accessed on 30 August 2021].

CONCLUSIONS

The effective reproduction number provides us with some interesting patterns in the spread of COVID-19 by states/territories. It seems to have worsened over time, but this could be due to poor data early in the pandemic. There are some issues with the use of the effective reproduction number for policy decisions. Reporting delays can impact the estimates for the effective reproduction number. A technique called “nowcasting” is used to estimate the reproduction number.[3] But when I explored some of the work in this area, there appeared to be a variety of methods for performing this technique. Despite this limitation, the effective reproduction number may be useful to evaluate public health policy decisions to reduce the spread of the COVID-19 pandemic.[4,5]

DATA SOURCE

I provided the link to the COVID-19 Spread Tracker from the Lin Lab at Harvard. You can also download a curated version of the data for this article from my Dropbox folder. The data are current as of 17 August 2021. If you’re interested in recreating this Cleveland plot, I recommend downloading the most recent data to see how much the effective reproduction number has changed.

REFERENCES

1. Worldometers.info. COVID Live Update: 217,770,381 Cases and 4,521,936 Deaths from the Coronavirus - Worldometer. Accessed August 30, 2021. https://www.worldometers.info/coronavirus/
2. Lim J-S, Cho S-I, Ryu S, Pak S-I. Interpretation of the Basic and Effective Reproduction Number. J Prev Med Pub Health. 2020;53(6):405-408. doi:10.3961/jpmph.20.288
3. Adam D. A guide to R — the pandemic’s misunderstood metric. Nature. 2020;583(7816):346-348. doi:10.1038/d41586-020-02009-w
4. Inglesby TV. Public Health Measures and the Reproduction Number of SARS-CoV-2. JAMA. 2020;323(21):2186-2187. doi:10.1001/jama.2020.7878
5. Pan A, Liu L, Wang C, et al. Association of Public Health Interventions With the Epidemiology of the COVID-19 Outbreak in Wuhan, China. JAMA. 2020;323(19):1915-1923. doi:10.1001/jama.2020.6130

Mark Bounthavong

July 30, 2021

Data visualization, R Programming

Forest plots in R

I wrote a tutorial on how to create forest plots in R. It’s posted on the RPubs site; here’s the link.

I wrote the tutorial in R Markdown and posted the code on my GitHub page (link).

This was an entertaining exercise to learn how to do this in R. There are, of course, lots of ways to create forest plots in R, but I wanted to learn how to do this using ggplot2 and some native R packages.
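
As one of those many possible approaches, here is a minimal ggplot2 sketch of a forest plot. It is not the code from the tutorial: it assumes the ggplot2 package is installed and, to avoid inventing study results, plots the coefficients and 95% confidence intervals from an arbitrary lm() fit to the built-in mtcars data.

```r
# Minimal sketch: forest plot of regression coefficients with ggplot2 (assumed example).
library(ggplot2)

fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)
ci <- confint(fit)  # 95% confidence intervals, one row per coefficient

est <- data.frame(
  term     = names(coef(fit)),
  estimate = unname(coef(fit)),
  lower    = ci[, 1],
  upper    = ci[, 2]
)
est <- est[est$term != "(Intercept)", ]  # the intercept is usually not plotted

p <- ggplot(est, aes(y = term)) +
  geom_vline(xintercept = 0, linetype = "dashed") +          # line of no effect
  geom_segment(aes(x = lower, xend = upper, yend = term)) +  # confidence interval
  geom_point(aes(x = estimate), size = 3) +                  # point estimate
  labs(x = "Coefficient (95% CI)", y = NULL) +
  theme_minimal()

print(p)
```

For ratio measures such as odds ratios, the reference line would sit at 1 instead of 0, and a log-scaled x-axis (scale_x_log10()) is a common choice.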

Here is the final forest plot.

Mark Bounthavong

June 24, 2021

Data visualization

Communicating data effectively with data visualizations: Part 38 (Replicating the LA Times COVID-19 tracker)

INTRODUCTION

Recently, the staff at the Los Angeles Times (LA Times) provided a COVID-19 tracker on their website. This is an impressive set of data visualizations of COVID-19 cumulative cases, new cases, vaccinations, and deaths. I was particularly struck by the “New cases by day” figure, which includes a bar chart overlaid with a 7-day moving average line chart. The visualization effectively used the moving average to smooth the spikes in new COVID-19 cases while still showing the daily values. None of the data are lost: the figure still shows the daily spikes in new COVID-19 cases alongside the smoothed trend. The color scheme was also well chosen: the daily new cases used a softer color, while the moving average line used a darker color that highlights its importance in the figure.

I wanted to write an article on how to replicate this figure using Excel.

Source: Los Angeles Times, “Tracking the coronavirus in California,” url: https://www.latimes.com/projects/california-coronavirus-cases-tracking-outbreak/ [Accessed on June 24, 2021] * This is for educational purposes only.

DATA SOURCE

Data used in this article can be found on the LA Times GitHub site. I used the “latimes-county-totals.csv” data (link to the raw data). I also made the data available with the final figure on the following Dropbox location.

TUTORIAL

Step 1. Download and visually inspect the data.

After you’ve downloaded the data, take a moment to inspect them. The columns that are used in this tutorial are “date” and “new_confirmed_cases.” But you can use the other columns to replicate other parts of the LA Times COVID-19 tracker.

Step 2. Insert a bar chart and select the appropriate data.

Insert a clustered column chart using the Insert tab on the Excel ribbon. When selecting the data, make sure that you select the “new_confirmed_cases” (other data are available, but the new cases are what we are replicating in this exercise).

The default bar chart does a pretty good job of replicating the LA Times figure.

However, we’ll have to do a few edits to the axes to match the LA Times figure.

Step 3. Modify the axes.

Let’s focus on the Y-axis first. Right-click on the Y-axis and select “Format Axis…” In the Axis Options panel, change the Minimum value to 0 and the Major value to 20000. This will match the settings in the target figure. (Note: There are negative values in the data, but these are very small numbers and assumed to be ignorable.) Next, in the Number options, change Category to “Number” and the value in the “Decimal places” to 0. Make sure that you check the box next to “Use 1000 Separator (,)” to replicate the same format in the target figure.

For the X-axis, right-click on the bottom axis and select “Format Axis…” This will open the Axis Options panel where you can make several adjustments to the X-axis. First, we want to change the X-axis display values from dates to months. Change the Number Category field to “Custom” then change the Format Code to “mmm”; make sure to click on “Add” for the changes to take effect. Next, go to the Axis Type area and change the Minimum to “02/01/2020” since we want our timeline to begin in February 2020. Then change the Major value to 4 to match the monthly interval of the target figure. The X-axis should be thicker with tick marks on the outside. To modify these, navigate to the Tick Marks option and change Major type to “Outside” and then click on the Paint Bucket (Fill & Line) option; increase the Width to 1.5. These should match the target figure’s X-axis format.

Step 4. Add the 7-day moving average.

Excel has a Data Analysis tool that will automatically estimate the 7-day average. I’ve written a previous tutorial that describes how to use this tool. I’ll briefly review how to estimate a 7-day moving average.

In the Data tab, click on the Data Analysis tool (instructions on how to install the Data Analysis tool are here). This will open the Data Analysis Tools box. Select “Moving Average” from the list of tools and enter the appropriate values in the options box. For the Input Range, select all the values from the “new_confirmed_cases” column. Enter a value of “7” in the Interval field; this will automatically calculate the 7-day moving average. In the Output Range, select a single cell where you want the moving average to be pasted after it is calculated. I chose to use the next available cell in the dataset ($F$2).
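
For readers who prefer to check the calculation outside of Excel, a trailing 7-day moving average can be sketched in R as shown below. This is an assumption-laden illustration: the case counts are placeholder values rather than the LA Times data, and the trailing window (the current day plus the six days before it) is assumed to mirror what the Excel tool produces.

```r
# Minimal sketch: trailing 7-day moving average with base R (placeholder data).
new_confirmed_cases <- c(10, 20, 30, 40, 50, 60, 70, 80, 90, 100)

# sides = 1 averages the current value and the 6 values before it;
# the first 6 positions are NA because a full 7-day window is not yet available
moving_avg <- as.numeric(
  stats::filter(new_confirmed_cases, rep(1 / 7, 7), sides = 1)
)

print(data.frame(new_confirmed_cases, moving_avg))
```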

Step 5. Add the 7-day moving average to the chart.

To add the moving average data to the current daily new cases bar chart, right-click on the chart and select “Select Data.” This will open a box where you can add new data. Select “Add” which will open the “Edit Series” box. Update the Series name with the name of the column (“moving_avg”). For the Interval field, change this to “7” for the 7-day moving average. Then in the Series values, select the 7-day moving average data.

By default, Excel will generate a bar chart for the 7-day moving average. However, we want a line chart. We can change this by right-clicking on the bars of the chart and selecting “Change Series Chart Type…” This will open a box that will allow us to select the type of chart for each data series. For the “moving_avg” data, change the Chart Type to “Line.” This will create a line chart for the 7-day moving average, which will be overlaid on the daily new cases.

Step 6. Modifying the chart format.

To closely match the chart to the one presented in the LA Times, I made the following adjustments. Your mileage may vary depending on the library of fonts available. I tried to select fonts that most Excel users will have access to.

I changed the Y-axis font to Adabi script. The X-axis font was changed to Arial Nova.

The width of the horizontal gridlines was increased to 1.5. The color of the daily new cases bar chart was changed to a light blue using a hex code of #8DC6DF. The color of the 7-day moving average was changed to a dark blue using a hex code of #2B869B; additionally, the width was increased to 2.0.
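
The same layered design (light bars for daily counts, a darker line for the 7-day moving average, month labels, and thousands separators) can also be sketched in base R. This is only an illustration of the chart structure: the daily counts below are synthetic placeholders, not the LA Times data, and the hex codes are the ones listed above.

```r
# Minimal sketch: daily bars with an overlaid 7-day moving average (synthetic data).
date <- seq(as.Date("2020-02-01"), as.Date("2020-05-31"), by = "day")
day_index <- seq_along(date)

# Synthetic daily counts with a slow wave and a weekly spike (placeholder values)
new_confirmed_cases <- round(1000 + 400 * sin(day_index / 15) +
                               300 * (day_index %% 7 == 0))

moving_avg <- as.numeric(
  stats::filter(new_confirmed_cases, rep(1 / 7, 7), sides = 1)
)

# Bars for the daily counts; barplot() returns the bar midpoints
bp <- barplot(new_confirmed_cases, col = "#8DC6DF", border = NA,
              space = 0, axes = FALSE)

# Overlay the moving average at the bar midpoints
lines(bp, moving_avg, col = "#2B869B", lwd = 2)

# Y-axis with thousands separators
ticks <- axTicks(2)
axis(2, at = ticks, labels = format(ticks, big.mark = ",", trim = TRUE), las = 1)

# X-axis labeled with abbreviated month names at the first day of each month
first_of_month <- which(format(date, "%d") == "01")
axis(1, at = bp[first_of_month], labels = format(date[first_of_month], "%b"))
```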

Step 7. Comparison between LA Times and user-generated charts.

Once the modifications have been made, compare the charts.