Communicating data effectively with data visualization – Part 17 (Multivariate Dimensions)

MULTI-DIMENSIONAL DATA VISUALIZATIONS

Data visualizations can improve how we see complex data, particularly when multiple dimensions are involved. For instance, on an X-Y plane, we can have the months on the X-axis and the number of patients on the Y-axis (Figure 1). Let's imagine that the number of patients represents some outcome you are interested in (e.g., the number of patients who have 5+ prescription medications). Time and the number of patients are dimensions on this two-dimensional plane.

 

Figure 1. Two-dimensional X-Y axes figure.

Figure 1.png

As a rule, whenever you want to display multiple dimensions, each dimension needs to be represented in a single figure, which is challenging given that a figure normally sits on a two-dimensional plane. What if we wanted a figure with more than two dimensions? What if we wanted a figure with months on the X-axis, the number of patients on the Y-axis, and a third dimension denoting gender? How would we go about doing that? Figure 2 illustrates how we can do this by adding lines and labeling them using different colors.

 

Figure 2. Figure with three dimensions.

Figure 2.png

Figure 2 captures three dimensions of data in a single two-dimensional figure. The number of patients is captured on the Y-axis and time in months on the X-axis. Gender is represented by the colored lines, which show how the relationship between the number of patients and time differs for males and females.

In terms of notation, we use the color blue for the following condition:

E[Y | male]

 

We use the color red for the following condition:

E[Y | female]

 

The legend provides additional clarification that the different line colors denote the gender types. It is critical to include a clear and intuitive legend so that your readers immediately recognize what each color refers to. Without a legend, your audience will have to guess which color belongs to which gender.

How about adding another dimension such as age? This would increase the number of dimensions on this figure from three to four. For example, what if we wanted to see how being older (80+ years) impacted the relationship between the number of patients and time across genders? Well, this can be accomplished by using different types of lines (e.g., dotted lines and dashed lines).

Figure 3 illustrates how using different types of lines (dotted for patients 80+ years old and dashed for patients under 80 years old) can provide a visual account of the differences across genders and age groups in the number of patients over the months. The legend provides additional clarification as to the age groups associated with the different line types.

 

Figure 3. Figure with four dimensions.

Figure 3.png

In terms of notation, we continue to use the color blue for the following conditions but add different line types for the age groups:

E[Y | 80+years & male] (dotted lines)

E[Y | <80 years & male] (dashed lines)

 

Similarly, we continue to use the color red for the following conditions but add different line types for the age groups:

E[Y | 80+years & female] (dotted lines)

E[Y | <80 years & female] (dashed lines)

Using colors and line types allows us to capture multiple dimensions in a two-dimensional figure. Essentially, we are showing a stratified descriptive analysis of the age groups nested within each gender and their relationships between the number of patients and time (months).

How about adding a fifth dimension? How could one do that?

A simple way to introduce a fifth dimension is to use the concept of small multiples from Edward Tufte.[1] Tufte uses small multiples to include additional dimensions. Figure 4 illustrates how we can leverage small multiples to look at the differences in the relationship between the number of patients and time for different genders and age groups across different states.

 

Figure 4. Small multiples of states with differing patterns of number of patients and time for different genders and age groups.

Figure 4.png

Using small multiples allows us to compare the differences in the association between the number of patients and months across states, stratified by gender and age group. The number of patients increased across time for all gender and age groups in California. Similar patterns are observed in Virginia, but the rate of increase in the number of patients across time is lower for the female group and its age cohorts. In Ohio, different patterns are observed compared to California and Virginia. Males and their associated age groups have a decreasing number of patients across time. Conversely, females and their age cohorts have a positive correlation between the number of patients and time.
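For readers who build figures in R rather than Excel, here is a minimal ggplot2 sketch (not from the original post) of the five-dimension figure described above. The data frame is entirely simulated and only meant to show how each dimension maps to an aesthetic: months to the X-axis, number of patients to the Y-axis, gender to color, age group to line type, and state to small multiples via facets.

# Minimal ggplot2 sketch; all data below are simulated for illustration only
library(ggplot2)

df <- expand.grid(
  month     = 1:12,
  gender    = c("Male", "Female"),
  age_group = c("80+ years", "<80 years"),
  state     = c("California", "Virginia", "Ohio")
)
set.seed(17)
df$patients <- 100 + 5 * df$month + rnorm(nrow(df), sd = 10)  # made-up outcome

ggplot(df, aes(x = month, y = patients, colour = gender, linetype = age_group)) +
  geom_line() +
  facet_wrap(~ state) +                     # small multiples: one panel per state
  labs(x = "Month", y = "Number of patients",
       colour = "Gender", linetype = "Age group")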

 

CONCLUSIONS

Adding dimensions can improve the figure you design by incorporating complex relationships across different data characteristics. In our example, we demonstrated how to integrate five dimensions of data into a two-dimensional figure that tells us about the association between the outcome (number of patients) and time (months) across states, stratified by gender and age group. Be creative with how you integrate multiple dimensions into a figure. Ask yourself if this is something that will help improve the story the figure is conveying. There are times when a simple figure will do. But when you have a lot of data and want to tell a story, consider adding dimensions to the figure to get a narrative that will excite and capture your audience's attention.

 

REFERENCES

  1. Tufte ER. The Visual Display of Quantitative Information. 2nd ed. Cheshire, CT: Graphics Press, LLC; 2001.

Communicating data effectively with data visualization – Part 16 (UpSet diagrams)

INTRODUCTION

Venn diagrams are useful visualizations that illustrate intersections between several groups. A common Venn diagram includes three transparent circles that overlap each other. In Figure 1, a Venn diagram with three groups denoted as A, B, and C has a total of 4 intersections (AB, AC, BC, and ABC).

Figure 1. Venn diagram with three groups and four intersections.

 
Figure 1.png
 

However, this can get more complicated when there are more than three groups or many intersections. Figure 2 illustrates a more complicated example.

Figure 2. Complicated Venn diagram.

 
Figure 2.png
 

How many intersections are there? It's quite difficult to determine based on this figure. An alternative data visualization method that captures the complexity of intersections and overlaps like those in Figure 2 is the UpSet diagram. Please go to the Gehlenborg Lab for more information about UpSet diagrams.

UpSet diagrams visualize complex intersections as a matrix in which the rows are the groups (sets) and the columns are the intersections, with a bar indicating the size of each intersection.[1,2] Each column can represent a different combination of groups. This is a very useful way to see how many campaigns are delivered in a single academic detailing visit.

Academic detailers deliver educational outreach to providers using a combination of tools that include unbiased evidence-based educational materials, online clinical dashboards, and audit-and-feedback approaches.[3,4] However, a single academic detailing interaction can include several campaigns and key messages. Since each visit may incorporate several campaigns, there needs to be a mechanism to capture the different sets of combinations. From an operations perspective, monitoring the number of combinations is challenging, especially when you want to capture workload for your staff or facility. Fortunately, UpSet diagrams, developed by researchers at the Department of Biomedical Informatics, Harvard Medical School and the School of Computing, University of Utah, can easily capture these complex combinations of campaigns within each academic detailing visit.

Below is an example of an UpSet diagram that lists the different campaigns in the rows and the number of visits for the various combinations of campaigns (Figure 3). For example, there are 31 visits that include the Pain and OEND campaigns. Further, there are 119 visits that include the OUD and OEND campaigns. Notice how easy it is to visually see which campaigns were combined into a single visit and how many campaigns each visit involved. By using an UpSet diagram, you can visually identify which combinations were more or less frequent.

These UpSet diagrams are useful ways to capture the complex nature of academic detailing and the various combinations of campaigns delivered.

Figure 3. Example UpSet diagram with several campaigns and the number of visits.  

MOTIVATING EXAMPLE

Excel is unable to create UpSet diagrams with its current tools. Fortunately, the researchers at Harvard University and the University of Utah have created an UpSet R Shiny app that you can access online. All you need is your own set of data to upload. The R Shiny app will generate the UpSet figure and you can adjust the settings to get the figure you need.

The UpSet R Shiny app is located here.

When creating a data set, make sure that you save it as a *.CSV file with headers.

Use the following example dataset to upload onto the UpSet R Shiny app, which is located here.

After you open the UpSet R Shiny app in a browser, make sure that you are in the Option 1 tab. This is where you will see instructions for uploading the data on the academic detailing visits whose combinations you want to view.

Browse for the file that contains the data on all the academic detailing visits with their campaigns. The data should be saved as a *.CSV file. We will use the pbmads_2.csv file. The data structure should have each campaign as a binary variable (1 = YES, 0 = NO), similar to how you would create a dummy variable.

Figure 4_1.png

Make sure to keep the columns separated using the “Comma” separator.

Then, click on “Plot” to see how the UpSet diagram appears in the R Shiny app.

After you click on Plot, you will see the app update and you should be able to view the Setting and Plot areas. The Setting area allows you to configure your plot. You can change the order of the plot and the type of ordering rules (e.g., Frequency and Degree). You can also change the size of the fonts using the Advanced section.

Ordering by Frequency sorts the intersections by the number of visits in descending order.

Ordering by Degree sorts the intersections by the number of campaigns they combine, in ascending order.

Based on these figures, you can easily discern how many campaigns academic detailing visits combine into a single educational outreach. The most complex intersection comprises the visits that include the Opioid Use Disorder (OUD), Opioid Overdose Education and Naloxone Distribution (OEND), posttraumatic stress disorder (PTSD), Pain, and Other campaign topics (N=3). The single campaign with the most stand-alone visits is the OUD (alone) campaign, with 175 visits.

 

CONCLUSIONS

UpSet diagrams make it easy to categorize academic detailing visits into different combination categories. We can apply this method to other program monitoring metrics, such as the differences in these visit combinations across time and the number of attendees. Other areas where UpSet diagrams are useful include visualizing complex combinations of genetic markers. Try to think of ways you can use this method to simplify complex Venn diagrams or complex intersections across different groups.

Since the UpSet R Shiny app has limited functionality, you can explore other features using R or Python to generate more complex UpSet diagrams. The UpSetR package is available for the R environment. The UpSetPlot package is available for Python.
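As a rough illustration (separate from the R Shiny workflow above), a minimal UpSetR sketch might look like the following; it assumes a CSV structured like the pbmads_2.csv file described earlier, with one binary 0/1 column per campaign.

# Minimal sketch with the UpSetR package; assumes one 0/1 column per campaign
library(UpSetR)

visits <- read.csv("pbmads_2.csv")     # file name taken from the tutorial above

upset(visits,
      nsets = 5,                        # number of campaign columns to display
      order.by = "freq",                # order intersections by frequency
      mainbar.y.label = "Number of visits",
      sets.x.label = "Visits per campaign")

Switching order.by from "freq" to "degree" reproduces the Degree ordering described for the R Shiny app.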

You can access the GitHub site for UpSet diagrams here.

I encourage you to read the papers on UpSet diagrams by Conway and colleagues and Lex and colleagues: Paper 1 and Paper 2.

YouTube video on UpSet diagrams.

 

REFERENCES

1. Conway JR, Lex A, Gehlenborg N. UpSetR: an R package for the visualization of intersecting sets and their properties. Bioinforma Oxf Engl. 2017;33(18):2938-2940. doi:10.1093/bioinformatics/btx364

2. Lex A, Gehlenborg N, Strobelt H, Vuillemot R, Pfister H. UpSet: Visualization of Intersecting Sets. IEEE Trans Vis Comput Graph. 2014;20(12):1983-1992. doi:10.1109/TVCG.2014.2346248

3. Avorn J, Soumerai SB. Improving drug-therapy decisions through educational outreach. A randomized controlled trial of academically based “detailing.” N Engl J Med. 1983;308(24):1457-1463. doi:10.1056/NEJM198306163082406

4. Avorn J. Academic Detailing: "Marketing" the Best Evidence to Clinicians. JAMA. 2017;317(4):361-362. doi:10.1001/jama.2016.16036

 

Communicating data effectively with data visualization – Part 15 (Diverging Stacked Bar Chart for Likert scales)

BACKGROUND

Surveys or questionnaires are used to capture respondents' perceptions about any number of products, ideas, or subjects. You can ask someone anything from how they are feeling to how much they agree with a particular statement. Or you can ask someone if they are satisfied with a product or service they recently received. Items (or questions) in surveys can solicit these types of responses. Free-response questions allow respondents to write their answers in an unstructured manner as long as they address the question or purpose of the survey. But most surveys ask questions that require a specific response using multiple choice, rating scales, or Likert scales.

In this article, we will discuss the Likert-type scale and how we can visualize this using Excel.

 

LIKERT-TYPE SCALE

The Likert scale was first developed by Rensis Likert, a psychologist working in the early 20th century. He developed the 5-point Likert scale as part of his PhD dissertation in order to capture people's ratings of international affairs. The Likert scale is unique because it provides a rating that is ordered sequentially (positive to negative, or agreement to disagreement).

 

Figure 1. Example of a 5-point Likert scale.

Figure 1.png

The 5-point Likert scale is quite common in psychometric research. A statement is usually provided and the participant is asked to rate their level of agreement. Notice how the scale is ordered sequentially, from Strongly Disagree on the left of the scale to Strongly Agree on the right. This is an important feature of Likert-type scales, with an added convenience: higher values are associated with greater agreement. You can reverse this as well, but we will keep this order for the remainder of this article.

 

VISUALIZE THE LIKERT-TYPE SCALE

There are many ways to visualize a Likert scale. We can use pie or bar charts to capture the different responses to a Likert-type question or statement.

 

Figure 2. Bar and Pie charts used to visualize Likert scale responses.

Figure 2.png

However, the best way to visualize Likert scales is to build a Diverging Stacked Bar Chart.

Figure 3. Diverging stacked bar chart using a set of hypothetical data for three statements.

Figure 3-1.png

The red dotted line in Figure 3 represents the divergent point where the stacked horizontal bars align. This is effective when you want to suggest that a certain set of ranked responses is more important than the others. In this example, Strongly Agree and Agree are given precedence over the other ranks. Rightly so, since a majority of the responses were Strongly Agree or Agree.

 

MOTIVATING EXAMPLE

Let’s assume that we administered a survey with three questions. The results are as follows:

Figure 4.png

We have a 5-point Likert scale with responses in all the different ranks. A majority of the responses were either Strongly Agree or Agree, so we’ll create a diverging point with these two ranks.

 

TUTORIAL

The Excel file for the tutorial is located here.

There are two ways to create this diverging stacked bar chart.

Method 1: I learned how to create diverging stacked bar charts from Stephanie Evergreen's blog Evergreendata. I will go through her method step by step and then go over an alternative method.

Part 1.1 Estimate the buffers at the end of the stacks.

First, change the values to percentages (Step 1). Then, determine where the divergence will occur (Step 2). For our example, the divergence is between Neutral and Agree.

The next step involves estimating the values at the ends of the stacked bar chart. There are two ends, left and right (Steps 3 and 4). Once the buffers have been estimated, we will plot the stacked bar chart.

Part 1.2. Plot the divergence stacked bar chart.

First, select all the data (Step 1). Then select the 100% Stacked Bar Chart from the Insert tab (Step 2). This should generate a default stacked bar chart (Step 3).

Right-click on any area in the chart and click "Select Data…" to change the data arrangement (Step 4). Select "Switch Row/Column" to change the Y-X arrangement of the data.

Once you switch the rows and columns, the chart will change and look like the one below.

Now, we want to remove the color of the buffers at the ends of the stacked bar (Step 7). Right-click the left end of the stack and select “No Fill” (Step 8). Repeat this for the right end of the stack.

The diverging stacked bar chart should resemble the figure we presented at the beginning of this article.

Figure 11.png

Removing the buffer labels from the legend, deleting the grid lines, changing the font (Adobe Gothic Standard B), and changing the colors of the stacked bars can improve the figure.

Figure 12.png

The challenge with this chart is the labels on the axis. The statements are too far to the left of the diverging stacked bar chart. To fix this, delete the labels on the left and insert text boxes with the statements.

Figure 13.png

Remove the borders, right-align the statements, and add labels. We should now be very close to the figure presented at the beginning of this article.

Figure 14.png

Here is the final diverging stacked bar chart.

Figure 3.png

Method 2: I learned this other method from John Peltier at his Peltier Tech Blog.

This method creates the divergent point based on distance. For instance, if you want the divergent point to sit where the Strongly Agree and Agree responses begin, you subtract the distance from that point, which the following steps accomplish by making the values to the left of the divergent point negative.

Figure 14-5.png

First, change all the values left of the divergent point to negative (Step 1).

Figure 15.png

Then rearrange the order of responses so that the furthest rank is closest to the divergent point (Step 2).

Then select the data (Step 3), insert a 100% Stacked Bar Chart (Step 4), and view the chart (Step 5).

Right-click anywhere in the chart area and click "Select Data…" (Step 6). Then switch the rows and columns (Step 7).

You will get the chart on the left below. However, this is not complete. Right-click on the Y-axis (Step 8) and select Format Axis (Step 9).

Change the label position to "Low" (Step 10) and then review the chart. Notice where the divergent point is located. This is the same as the stacked bar chart constructed using Method 1.

Now you can change the colors, delete the gridlines, and remove the X-axis to create the plot below. Notice that some of the values are negative. That is because of the data we rearranged earlier to generate the distance from the divergent point. To fix this, you will need to manually change the values.

Figure 21.png

After manually changing the values, your plot will look similar to the one below.

Figure 22.png

Conclusions

Visualizing the Likert scale using horizontal diverging stacked bar charts is a good way to see how participants responded to questions or statements on a survey or questionnaire. However, not all Likert-type scales will necessarily need a diverging stacked bar chart to illustrate their point. You can also use a conventional stacked bar chart, which we will discuss in a future article.

The Excel file for this tutorial is located here.
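If you prefer to build the same kind of chart in R rather than Excel, here is a minimal ggplot2 sketch. The percentages are hypothetical, and the divergent point is placed between Neutral and Agree by plotting Agree/Strongly Agree as positive values and the other ranks as negative values; the horizontal line at zero plays the role of the red dotted divergence line in Figure 3.

# Minimal ggplot2 sketch of a diverging stacked bar chart; percentages are hypothetical
library(ggplot2)

likert <- data.frame(
  statement = rep(c("Statement 1", "Statement 2", "Statement 3"), each = 5),
  response  = factor(rep(c("Strongly Disagree", "Disagree", "Neutral",
                           "Agree", "Strongly Agree"), times = 3),
                     levels = c("Strongly Disagree", "Disagree", "Neutral",
                                "Agree", "Strongly Agree")),
  pct       = c( 5, 10, 15, 40, 30,
                10, 10, 20, 35, 25,
                 5,  5, 10, 45, 35)
)

# Ranks to the left of the divergent point are plotted as negative values
likert$signed_pct <- ifelse(likert$response %in% c("Agree", "Strongly Agree"),
                            likert$pct, -likert$pct)

ggplot(likert, aes(x = statement, y = signed_pct, fill = response)) +
  geom_col() +
  coord_flip() +
  geom_hline(yintercept = 0, linetype = "dotted", colour = "red") +
  labs(x = NULL, y = "Percentage of responses", fill = NULL)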

 

REFERENCES

I used the following references to help write this article.

Stephanie Evergreen’s blog Evergreendata is an excellent resource for learning about other data visualization methods.

John Peltier's blog Peltier Tech Blog is another wonderful resource where you can learn more about Excel charts and data visualization.

The following paper provides details on how to create diverging stacked bar charts using R.

Heiberger RM, Robbins NB. Design of diverging stacked bar charts for Likert scales and other applications. Journal of Statistical Software. 2014;57(5): 1-32. https://www.jstatsoft.org/article/view/v057i05/v57i05.pdf

Communicating data effectively with data visualization – Part 14 (Gantt Charts)

INTRODUCTION

A well-organized calendar can help you schedule meetings, avoid conflicts, and remember important dates. Applying data visualization to a calendar can help identify key events throughout the day, week, or month. Here is an example of a color-coded calendar for a single person from Sadiq Javer at BoostSolutions.

Figure 1 - exampel calendar.png

Each meeting or event is color coded to indicate a particular category. For a single user, this is sufficient to manage a complex day, week, or month. However, if you are a project manager or lead managing the calendars of a group or team, this task can be challenging.

A solution is to use a Gantt chart to organize the calendars of several members of your team. A Gantt chart is a type of bar chart that provides a longitudinal visualization of schedules and timelines. It was invented by Henry Gantt, who is known for his work on scientific management. Gantt charts are useful for project management and can illustrate major deadlines or milestones in a project's life cycle.

Conveniently, the same tools used for project management can be applied to managing schedules for multiple team members in a group. In this article, we will apply the Gantt chart to managing a team’s schedule using Excel.

MOTIVATING EXAMPLE

We will create a Gantt chart using a hypothetical team's schedule to visualize their meetings and vacations.

Download the Excel sheet here.

Suppose we have a team who will be taking vacation in the upcoming calendar year (2019). There are several important dates that the Team will need to block for meetings. In order to avoid conflicts, a Gantt chart is used to plan an efficient annual schedule.

Here is a figure of our Gantt chart.

The Gantt chart blocks out weekends and holidays so that the manager can easily see the entire 7-day week. Each column represents a day nested within a 7-day week. The months are color coded to identify when each begins and ends. Each staff member has a unique color to identify their days off, and the team meeting is highlighted in red to indicate the critical meeting dates.


TUTORIAL

You can use any version of Excel to build the Gantt chart. After opening a new Excel sheet, follow these steps.

Step 1. Resize the column’s width:

Figure 3.png

Resizing the column width to 2.33 seems to give an efficiently sized cell for the days.

Step 2. Assigning days and months:

In our example, each column represents one day. Therefore, we can assign 7 days to a week. Since each month starts on a different day, we make sure to start with the correct day in our calendar. In 2019, January begins on a Tuesday; therefore, our Gantt chart will start on Tuesday.

Figure 4.png

Step 3. Highlight the weekends and holidays:

Hopefully, your team doesn’t have to work on the weekends. However, there are exceptions. Clinicians work on the weekends, so your Gantt chart may need an indicator for differential pay (if it is part of the benefits). In our example, we will assume that no one from the team works on the weekends.

The holidays are highlighted with a different color from the weekends.

Figure 5.png

Step 4. Add the team members and meetings to the Gantt chart.

Once you block out the holidays and weekends, you can start entering information on meetings and team members’ vacations. Different colors were used for the meetings and individual team members to provide easy visualization.

Figure 6.png

CONCLUSIONS

The final Gantt chart should help you organize your team's schedule while making sure that there are no conflicts with important team meetings or deadlines. Although Gantt charts were designed for project management, they can also be used to efficiently manage a team's complex schedule.

You can download the Excel exercise at this link.
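As an optional alternative to the Excel grid, here is a minimal ggplot2 sketch of a Gantt-style schedule; the staff names and dates below are entirely made up for illustration.

# Minimal Gantt-style sketch in ggplot2; all names and dates are hypothetical
library(ggplot2)

schedule <- data.frame(
  who   = c("Team meeting", "Staff A", "Staff B", "Staff C"),
  start = as.Date(c("2019-01-15", "2019-01-07", "2019-02-11", "2019-03-04")),
  end   = as.Date(c("2019-01-16", "2019-01-11", "2019-02-15", "2019-03-08"))
)

ggplot(schedule, aes(y = who)) +
  geom_segment(aes(x = start, xend = end, yend = who), linewidth = 4) +
  labs(x = "Date (2019)", y = NULL, title = "Team schedule (Gantt-style)")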

REFERENCES

I used the following references to assist me with this tutorial.

Sadiq Javer's article on Gantt charts published on the BoostSolutions.com website.

Wikipedia’s page on Gantt charts, Henry Gantt, and Scientific Management.

Communicating data effectively with data visualization - Part 13 (Box and Whisker Diagrams)

BACKGROUND

A box plot (box and whisker diagram) is a great way to display the distribution of a continuous (e.g., interval) variable. A typical box plot shows the mean, median, interquartile values, and the minimum and maximum values. Figure 1 illustrates these elements on a box plot. Until recently, Microsoft Excel did not have an option to graph box plots. However, in the 2016 version of Microsoft Excel, box plots were added as part of the statistical chart features.

Figure 1. Example of a box plot (box and whisker diagram).

 
Figure 1.png
 

MOTIVATING EXAMPLE

We will use randomly generated data to create box plots across four hypothetical quarters (Q1FY19, Q2FY19, Q3FY19, and Q4FY19). The data contain the number of visits to the doctor from several outpatient specialty clinics. Here is what the data look like for the first two sites. Data for the example can be found here.

 
Figure 2.png
 

Site 1 has 45 visits in Q1FY19 and Site 2 has 44 visits in Q1FY19. To create the box plots, we need to use the long format, which uses multiple rows for each site.

 

EXERCISE

In this article, we will generate box plots that will visualize the average number of visits and its distribution across quarters.

Figure 3.png

After clicking on the Box and Whisker chart option, you will need to select the data that will be used to generate the box plots across the quarters.

Figure 4.png
Figure 5.png
Figure 6.png

Click “OK” and the default box plot will look like Figure 2.  

Figure 2. Default box plot generated by Excel 2016.

Figure 7.png

After a few changes to the color and labels, our box plot can be improved (Figure 3).

Figure 3. Updated box plots.

Figure 8.png

These box plots give us an idea of the changes in the number of visits across quarters, including the distribution of the data. For each box plot, the mean (indicated by the "X") is not too different from the median (indicated by the solid horizontal line). However, there is greater variation in the distribution of the number of visits in Q2FY19 and Q3FY19 compared to Q1FY19 and Q4FY19. We can see that there was an increase in the number of visits, on average, between Q1FY19 and Q3FY19, but this dropped sharply in Q4FY19. This may be due to some kind of change (e.g., seasonal variation) and should be explored.
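For readers working outside Excel, a minimal base-R sketch (with simulated, hypothetical visit counts) produces a similar set of box plots by quarter:

# Minimal base-R sketch of box plots by quarter; visit counts are simulated
set.seed(2019)
visits <- data.frame(
  quarter  = rep(c("Q1FY19", "Q2FY19", "Q3FY19", "Q4FY19"), each = 30),
  n_visits = c(rnorm(30, mean = 45, sd = 5),  rnorm(30, mean = 52, sd = 10),
               rnorm(30, mean = 55, sd = 10), rnorm(30, mean = 42, sd = 5))
)

boxplot(n_visits ~ quarter, data = visits,
        xlab = "Quarter", ylab = "Number of visits",
        main = "Number of visits by quarter")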

 

Conclusions

The box plot provides a nice visualization of the mean number of visits across quarters, including the variation and distribution of the data. Plotting these in Microsoft Excel 2016 allows you to explore your data and motivates you to generate explanations or hypotheses for its behavior.

References

I used several online references to write this article.

The Dummies series provides a good illustration of the box plot elements, which can be located here.

I watched this YouTube video by stickpet on how to use Microsoft Excel 2016 to generate box plots.

Cobb-Douglas production function and costs minimization problem

Update 2: This article was updated on 12 August 2023 when Dimanjan Dahal (Twitter account) identified a better way to present the Lagrangian functions. I updated this to better reflect the minimization problem and set the partial derivative solutions to 0. Thank you, Dimanjan.

Update 1: This article was updated on 11 October 2021 when an anonymous reader identified an error with the example used at the end. The error was the negative value generated for the output elasticity of capital. In the previous example, I used R to generate a set of random numbers that were used in a regression model. The beta coefficient generated a negative value, which was used in the linear form of the Cobb-Douglas equation. Since the output elasticity should be between 0 and 1, this negative coefficient should not be possible. Hence, I've updated the data frame used in the example to avoid this issue. Appreciation goes out to the anonymous reader who identified this error.

INTRODUCTION

The Cobb-Douglas (CD) production function is an economic production function with two or more variables (inputs) that describes the output of a firm. Typical inputs include labor (L) and capital (K). A similar functional form, U(x), is used to describe utility maximization. However, in this example, we will learn how to answer a minimization problem subject to (s.t.) the CD production function as a constraint.

The functional form of the CD production function:

 
Figure1.png
 

where the output Y is a function of labor (L) and capital (K), A is the total factor productivity (a constant), alpha represents the output elasticity of labor, beta represents the output elasticity of capital, and alpha + beta = 1 represents constant returns to scale (CRS). The partial derivative of the CD function with respect to (w.r.t.) labor (L) is:

 
Figure2.png
 

Recall that quantity produced is based on the labor and capital; therefore, we can solve for alpha:

 
Figure3.png
 

This will yield the marginal product of labor (L). If alpha = 2, then a 10% increase in labor (L) will result in a 20% increase in output (Y).

The partial derivative of the CD function with respect to (w.r.t.) capital (K) is:

 
Figure4.png
 

This will yield the marginal product of capital (K).

The CD production function can be converted to a linear model by taking the logarithm of both sides of the equation:

 
Figure5.png
 

This allows for ordinary least squares (OLS) regression, which is commonly used in economics to understand the association between the inputs (L and K) and production (Y).

However, what happens when we are interested in the marginal cost with respect to (w.r.t.) production (Y)? This becomes a constrained (cost) minimization problem where the firm controls how much L and K to use. In other words, we want to minimize the cost subject to (s.t.) the output constraint:

 
Figure6.png
 

Cost becomes a function of wage (w), the amount of labor (L), price of capital (r), and the amount of capital (K). To determine the optimal amount of inputs (L and K), we solve this minimization constraint using the Lagrange multiplier method:
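Since the Lagrangian figure is not reproduced here, a standard way to write this cost-minimization Lagrangian and its first-order conditions (consistent with the setup above, with lambda as the multiplier on the output constraint) is:

\mathcal{L}(L, K, \lambda) = wL + rK + \lambda \left( Y - A L^{\alpha} K^{\beta} \right)

\frac{\partial \mathcal{L}}{\partial L} = w - \lambda \alpha A L^{\alpha - 1} K^{\beta} = 0

\frac{\partial \mathcal{L}}{\partial K} = r - \lambda \beta A L^{\alpha} K^{\beta - 1} = 0

\frac{\partial \mathcal{L}}{\partial \lambda} = Y - A L^{\alpha} K^{\beta} = 0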

 
 

Solve for L

 
Figure8.png
 

Substitute L in the constraint term (CD production function) in order to solve for K

 
Figure9.png
 

Now, we can completely solve for L (as a function of Y, A, w, and r) by substituting for K

 
Figure10.png
 

Substitute L and K into the cost minimization problem

 
Figure11.png
 

Simplify

 
Figure12.png
 

Final cost function

 
Figure13.png
 

Let's see how we can use the results from a regression model to give us information about total costs w.r.t. the quantity produced.

Recall the linear form of the Cobb-Douglas production function:

 
Figure14.png
 

I simulated some data where we have the capital, labor, and quantity produced in R.

## Use the following libraries:
library(jtools)
library(broom)
library(ggstance)
library(broom.mixed)

## Generate random data for the data frame (cddata)
set.seed(1234)
production <- sample(100:600, 30, replace=TRUE)
labor <- sample(50:350, 30, replace=TRUE)
capital <- sample(6000:7000, 30, replace=TRUE)

## Cost function parameters: wage and price constants
wage <- 35.00
price <- 30.00

## Set up the data frame (cddata):
cddata <- data.frame(production = production, labor = labor, capital = capital, wage = wage, price = price)

## Name rows using some timeline from 1988 to 2017 (30 years for 30 observations for each variable):
row.names(cddata) <- 1988:2017

Then I perform a regression model using OLS

## Setting up the model, where log(a) is eliminated due to it being the intercept.
cd.lm <- lm(formula = log(production) ~ log(labor) + log(capital), data = cddata)
summary(cd.lm)

Residuals:
     Min       1Q   Median       3Q      Max
-0.96586 -0.25176  0.06148  0.37513  0.67433

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)   4.44637   17.41733   0.255    0.800
log(labor)    0.14373    0.23595   0.609    0.548
log(capital)  0.05581    2.00672   0.028    0.978

Residual standard error: 0.5065 on 27 degrees of freedom
Multiple R-squared:  0.01414,	Adjusted R-squared:  -0.05888
F-statistic: 0.1937 on 2 and 27 DF,  p-value: 0.8251

After running the model, I stored the coefficients for use later in the production function.

## Store the coefficients
coeff <- coef(cd.lm)

## Assign the values to the production function parameters where Y = AL^(alpha)K^(beta)
intercept <- coeff[1]
alpha <- coeff[2]
beta <- coeff[3]

From the parameters, we can get log(A) (the intercept, so A = exp(intercept)), alpha (the coefficient on log(labor)), and beta (the coefficient on log(capital)).

 
linear form of CD function.jpg
 

This will give us the quantity produced (Y) for given data on labor (L) and capital (K).
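As a quick sketch (not in the original code), the fitted quantity on the original scale can be recovered from the stored coefficients, remembering that the intercept estimates log(A):

## Sketch: fitted quantity produced on the original scale, where A = exp(intercept)
Y_hat <- exp(intercept) * labor^alpha * capital^beta
head(Y_hat)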

We can get the total costs (C) based on the quantity produced (Y) using the cost function:

 
Figure16.png
 

I set up my R code so that I have the intercept, alpha, beta, labor, wage, and price of capital defined. I estimated each part of the cost function separately and then multiplied the parts at the end.

## Cost
## Note: the model is in logs, so A = exp(intercept); parentheses are needed
## around (alpha + beta) so the exponents match the derived cost function.
PartA <- (production / exp(intercept))^(1 / (alpha + beta))
PartB <- wage^(alpha / (alpha + beta))
PartC <- price^(beta / (alpha + beta))
PartD <- as.complex(alpha / beta)^(beta / (alpha + beta)) + as.complex(beta / alpha)^(alpha / (alpha + beta))

costs <- PartA * PartB * PartC * PartD
Note: R returns "NaN" when you raise a negative base to a non-integer power (for example, x^alpha when x is negative), which can happen here if an estimated coefficient is negative. Wrapping the base term with "as.complex(x)" works around the issue by carrying out the computation with complex numbers.

I plot the relationship between quantity produced and cost. In other words, this tells us the lowest costs needed to produce the quantities on the plot.

plot(production, costs)
 
cost production curve.jpg
 

CONCLUSIONS

Using the Cobb-Douglas production function and the cost minimization approach, we were able to find the optimal conditions for the cost function and plot the outcome relative to the quantity produced. As production increases, the minimum cost needed increases in a non-linear fashion, which makes sense given that Y (quantity produced) appears in the numerator on the right-hand side of the cost function and is positively related to cost.

This was a fun exercise that made me think about the usefulness of the Cobb-Douglas production function, which I learned to optimize multiple times in my Economics courses. I was excited to find a pleasant utility for it using simulated data and will probably explore more exercises like this in the future.

REFERENCES

I used a lot of resources to write this blog, which are provided below.

A site dedicated to the discussion of economics called EconomicsDiscussion.net was a great resource.

These papers were incredibly helpful in preparing the example in R:

  • Lin CP. The application of Cobb-Douglas production cost functions to construction firms in Japan and Taiwan. Review of Pacific Basin Financial Markets and Policies Vol. 5, No. 1 (2002): 111–128.

  • Larriviere JB, Sandler R. A student friendly illustration and project: empirical testing of the Cobb-Douglas production function using major league baseball. Journal of Economics and Economic Education Research, Volume 13, Number 3, 2012: 81-92

  • Hu, ZH. Reliable Optimal Production Control with Cobb-Douglas Model. Reliable Computing. 1998; 4(1): 63-69.

I encountered some issues regarding complex numbers in R. Fortunately, I found some great resources about it.

  • I found a great discussion about R’s calculation of exponents and “NaN” results and why complex numbers can mess up your math in R.

  • Another good site (R Tutorial: An Introduction to Statistics) explaining complex numbers in R.

  • John Myles White wrote a nice article about complex numbers in R.

Acknowledgements: I would like to thank the user who reached out to me about the coefficient errors for the output elasticity of capital. This helps me learn from my mistakes and correct them. Without the support and guidance from the community, I would not achieve my goal of being a lifelong learner. Thank you.

Using Stata’s bysort command for panel data in time series analysis

BACKGROUND

Sorting information in panel data is crucial for time series analysis. For example, sorting by time requires you to use the sort or bysort command to ensure that the panel is ordered correctly. Moreover, when it comes to panel data where you may have to distinguish a patient located at two different sites or a patient with multiple events (e.g., deaths), it's important to organize the data properly.

You can download the sample data and Stata code at the following links:

Data

Code

 

MOTIVATING EXAMPLE

In this example, we have a data set with time (months) in the columns and patients in the rows (this is called a wide-format data set). For each month, there are different numbers of observations. For instance, in Month 1 there were five observations, but in Month 7 there were only three.

The highlighted boxes indicate a patient who was observed at two different sites. There are two ways to approach this: (1) remove the patient from Site B or (2) keep the patient by distinguishing them at each site. Removing the patient will result in a loss of information for Site B, but keeping the patient complicates the panel data when we convert from wide to long format.

Figure 1.png

Converting this from wide to long format would result in the following panel data. Review each patient, in particular the months of observations reported. Notice that not all patients have observations for all the months (Months 1 to 7). Some patients have observations for scattered months (e.g., Patient 3). Of note is Patient 2, who has observations at Sites A and B. Since we opted to keep Patient 2's data for Sites A and B, we need a method to ensure that the panel data is ordered correctly. Interestingly, Patient 8 has an observed event (Death) three times, at Months 5, 6, and 7. Since a patient should experience death only once, the extra events may be a coding error and should be removed. Using the Stata sort and bysort commands will allow us to fix this problem.

Figure 2.png

The bysort command has the following syntax:

bysort varlist1 (varlist2): stata_cmd

Stata sorts the data according to varlist1 and varlist2, but stata_cmd operates within groups defined only by varlist1. This is a handy way to make sure that your ordering involves multiple variables while Stata performs the command on groups defined by the first set of variables.

 

REMOVE REPEATED DEATHS FROM PATIENT 8

First, we want to make sure we eliminate the repeated deaths from Patient 8. We can do this using the bysort command and summing the values of Death. Since Death == 1, we can sum up the total Deaths a patient experiences and drop those values that are greater than 1—because a patient can only die once.

***** Identify patients with repeated death events.
bysort id site (month death): gen byte repeat_deaths = sum(death==1)
drop if repeat_deaths > 1 

The alternative methods use the sort command:

* Alternative Method 1:
by id site (month death), sort: gen byte repeat_deaths = sum(death==1)
drop if repeat_deaths > 1

* Alternative Method 2:
sort id site month death
by id site (month death): gen byte repeat_deaths = sum(death ==1)
drop if repeat_deaths > 1
Figure 3.png

Now we have a data set without the unnecessary death values for Patient 8. Therefore, Patient 8 will not be counted in months 6 and 7 because they are no longer contributing to the denominator.

 

COUNT THE NUMBER OF DEATHS PER MONTH

Suppose we want to perform a single group time series analysis. We would want to sum up the number of deaths across the months. We can do this using the bysort command.

First, we have to think about how we want to count deaths. Since Death == 1, we want to add up the number of deaths for each month. Initially, we were worried that Death would be counted two more times for Patient 8, but we solved this problem by removing those events from Patient 8.

Figure 4.png

The following command will yield the above results in a long format.

bysort month: egen byte total_deaths = total(death)

We use the egen command because we are using a more complex function. Details on when to use gen versus egen are located at this site.

 

DETERMINING THE DENOMINATOR—COUNTING THE NUMBER OF PATIENTS CONTRIBUTING INFORMATION

Next, we want to determine the number of patient observations contributed to each month. To do this, we can use the bysort command again.

***** Determine the denominator -- using bysort and counter variable
gen counter = 1
bysort month: egen byte total_obs = total(counter)

This should yield the following results:

Figure 5.png

CHANGING FROM PATIENT-LEVEL DATA TO SINGLE-GROUP DATA

Currently, the data are set up at the patient level. We want to change this to the single-group level, or the aggregate monthly level. To do this, we have to eliminate the repeated month measurements for our total deaths (numerator) and total observations (denominator).

***** Drop duplicate months
bysort month: gen dup = cond(_N==1, 0, _n)
drop if dup > 1

We can visualize this by plotting two separate connected lines showing the values for each month.

****** Plot the total number of deaths and total number of observations
graph twoway (connected total_deaths month, lcol(navy)) ///
             (connected total_obs month, lcol(cranberry) ytitle("Number") ///
	      xtitle("Months") ylab(, nogrid) graphregion(color(white)))
Figure 6.png

We can take this a step further and calculate the prevalence.

***** Estimate the prevalence (per 100 population) and plot
gen prev = (total_deaths / total_obs ) * 100	

graph twoway connected prev month, ytitle("Prevalence of Death" "per 100 population") ///
	     xtitle("Months") ylab(, nogrid) graphregion(color(white))
Figure 7.png

CONCLUSIONS

Using the bysort command can help us fix a variety of data issues in time series analysis. In this example, we have patient-level data that contained repeated deaths for one patient and a patient who was observed at different sites. Using the bysort command to distinguish between sites allowed us to properly identify the patient as unique to each site. Additionally, we used bysort to identify the patient with multiple deaths and eliminated those values from the aggregate monthly values. Then we finalized our single-group data set by summing the total deaths and observations per month and removing the duplicates.

You can download the Stata code from my Github site.

 

REFERENCES

I used the following references to write this blog.

Stata commands: bysort:

https://stats.idre.ucla.edu/stata/faq/can-i-do-by-and-sort-in-one-command/

 

Stata commands: gen versus egen:

https://stats.idre.ucla.edu/stata/seminars/notes/stata-class-notesmodifying-data/

Is my d20 killing me? – using the chi square test to determine if dice rolls are biased

BACKGROUND

Every Tuesday, my friends and I enjoy playing role-playing games (RPGs), especially tabletop RPGs such as Dungeons & Dragons (D&D). Every week, we get together, pull out our laptops and character sheets, and review our previous notes to return to the fictional fantasy worlds we created (or that were created for us) and do battle, solve mysteries, and tell stories over some ciders (and Le Croix). This ritual is important because it allows us to disconnect from the real world and let our imaginations run wild. After every session, we think about the various actions that took place and review how things would have been different if the roll of a dice had gone a different way.

I first started playing D&D Second Edition when I was a kid after I was exposed to it at a comic book store (Golden Apple Comics in Los Angeles). I still remember the strange colorful dice rolling on a tabletop mat and people scratching away at paper using stats that I wasn't familiar with. In high school, my friends and I would play different campaigns from the D&D and Forgotten Realms worlds, creating characters based on rule books using statistics and probabilities. The key ingredient in any adventure is having your fate determined by a single dice roll. The iconic dice in RPGs is the d20, or the 20-sided dice. A d20 is usually used to determine whether you "hit" your opponent, whether your skills identify a trap that has been set, or whether you can charm your way out of an unnecessary fight. More often than not, there is the chance that a critical fail (a d20 roll of 1) can occur. When this happens, you fail to hit your opponent and trip over yourself during combat, miss the trap and activate it, killing someone in your party, or piss off the non-player character (NPC) and have them attack you. Not only will something go wrong, it will go wrong spectacularly. So, it's only natural that we look at the d20 that was rolled and ask, "Is my d20 killing me?"

Luckily, there is a statistical test that we can use to answer this common question.

 

CHI SQUARE TEST

The chi square test is one of the most common statistical tests performed in the sciences. In its simplest form, the chi square test is used to detect whether the observed frequencies are different from the expected frequencies across different categories. For example, for a 6-sided dice, the probability that the number 6 will land is 1/6, or about 16.7%. This is true for every value of the 6-sided dice if it is unbiased.

Figure 1.png

But what if the dice was biased? Suppose we roll the 6-sided dice 100 times and we get the following results:

Figure 2.png

Visually, we can see that there is some bias with this 6-sided dice. We don't know what the bias is, but something is causing this dice to roll a "3" more often than it should (approximately two times more than expected). Conversely, this 6-sided dice rolls a "1" fewer times than it should (approximately 70% less often than expected).

Figure 3.png

Using these data, we can perform a chi square test.

First, we use the following formula:

 
Chi square.png
 

where O is the observed frequency for category i and E is the expected frequency for category i.

We need another piece of information: the degrees of freedom. To estimate the degrees of freedom, we use the following equation: df = (R-1) * (C-1), where R = number of rows and C = number of columns. For the 6-sided dice, df = (2-1) * (6-1) = 5.

We can set up the formula using the following table.

Figure 4.png

The total value of 32.96 is the chi square statistic. To determine the p-value, we need to use a chi square distribution table like the one shown below.

Figure 5.png

So, with 5 degrees of freedom and a chi square statistic of 32.96, the probability of a test statistic more extreme than the one observed is less than 1%, assuming the dice is fair. In other words, the dice is definitely biased at a type I error of 5%. I should throw away this dice.
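If you would rather skip the printed table, a one-line check in R (a convenience, not part of the worked example above) gives the same conclusion:

## p-value for a chi square statistic of 32.96 with 5 degrees of freedom
pchisq(32.96, df = 5, lower.tail = FALSE)
## returns a value far below 0.001, consistent with the table lookup above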

 

MOTIVATING EXAMPLE

Now, let’s do this for a 20-sided dice. I’m not going to actually roll the dice 100 times, but I will generate a simulation.

> #######################################################################
> ## Simulate a d20 dice roll with 100 trials
> #######################################################################
> sims <- sample(x = 1:20, size=100, replace=TRUE)
> 
> ## Generate frequency table
> table(sims)
sims
 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 
 6  8  2  2  3  2  2  6  7  3  4  8  8  6  3  7  7  4  5  7 
> 
> ## Generate probability table
> prob <- table(sims) / length(sims)
> 
> ## Plot the frequency of the rolls
> plot(table(sims), xlab = 'd20 rolls', ylab = 'Frequency', main = 'Frequency of events for each possible d20 roll (Trials=100)')
> 
> ## Plot the probability of the rolls
> plot(prob, xlab = 'd20 rolls', ylab = 'Frequency', main = 'Probability of events for each possible d20 roll (Trials=100)')
> 
> ## Perform chi square test
> chi2 <- chisq.test(table(sims))
> chi2

    Chi-squared test for given probabilities

data:  table(sims)
X-squared = 19.2, df = 19, p-value = 0.4441
Figure 6.png

Based on this first simulation run of 100 rolls, the dice appears to be unbiased.

Let’s try 1000 rolls.

> #######################################################################
> ## Simulate a d20 dice roll with 1000 trials
> #######################################################################
> sims <- sample(x = 1:20, size=1000, replace=TRUE)
> 
> ## Generate frequency table
> table(sims)
sims
 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 
51 45 41 54 55 54 33 50 48 46 44 56 50 64 43 49 50 49 54 64 
> 
> ## Generate probability table
> prob <- table(sims) / length(sims)
> 
> ## Plot the frequency of the rolls
> plot(table(sims), xlab = 'd20 rolls', ylab = 'Frequency', main = 'Frequency of events for each possible d20 roll (Trials = 1000)')
> 
> ## Plot the probability of the rolls
> plot(prob, xlab = 'd20 rolls', ylab = 'Frequency', main = 'Probability of events for each possible d20 roll (Trials=1000)')
> 
> ## Perform chi square test
> chi2 <- chisq.test(table(sims))
> chi2

    Chi-squared test for given probabilities

data:  table(sims)
X-squared = 20.08, df = 19, p-value = 0.3898
Figure 7.png

Still unbiased. But notice how the frequencies for each value of the d20 are starting to even out. Unlike the previous frequency figure, where there were more fluctuations, you see fewer of them with more rolls.

How about 10,000 rolls?

> #######################################################################
> ## Simulate a d20 dice roll with 10,000 trials
> #######################################################################
> sims <- sample(x = 1:20, size=10000, replace=TRUE)
> 
> ## Generate frequency table
> table(sims)
sims
  1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20 
496 477 518 469 504 492 491 551 507 499 474 527 519 532 493 506 503 473 509 460 
> 
> ## Generate probability table
> prob <- table(sims) / length(sims)
> 
> ## Plot the frequency of the rolls
> plot(table(sims), xlab = 'd20 rolls', ylab = 'Frequency', main = 'Frequency of events for each possible d20 roll (Trials = 10,000)')
> 
> ## Plot the probability of the rolls
> plot(prob, xlab = 'd20 rolls', ylab = 'Frequency', main = 'Probability of events for each possible d20 roll (Trials=10,000)')
> 
> ## Perform chi square test
> chi2 <- chisq.test(table(sims))
> chi2

    Chi-squared test for given probabilities

data:  table(sims)
X-squared = 19.872, df = 19, p-value = 0.4023
Figure 8.png

Definitely smoother. As we perform more and more rolls of the d20, we get a nearly equal number of rolls for each value.

 

A BIASED EXAMPLE: IS MY D20 TRYING TO KILL ME?

What if the dice was actually biased? What then? Let's use another d20 and simulate rolls where a critical fail (a roll of 1) occurs 80% of the time.

> #######################################################################
> ## Simulate a d20 dice roll with 10000 trials -- BIASED sample
> ## This is a biased d20 where the number 1 has an 80% probability of hitting.
> #######################################################################
> sims <- sample(x = 1:20, size=10000, prob=c(0.8, 0.01052632, 0.01052632, 0.01052632, 0.01052632, 0.01052632, 0.01052632, 0.01052632, 0.01052632, 0.01052632, 0.01052632, 0.01052632, 0.01052632, 0.01052632, 0.01052632, 0.01052632, 0.01052632, 0.01052632, 0.01052632, 0.01052632), replace=TRUE)
> 
> ## Generate frequency table
> table(sims)
sims
   1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16   17   18   19   20 
7952   99  104  111  111  104  120  109   98   93  107   99  107  110  116  109  118  122  104  107 
> 
> ## Generate probability table
> prob <- table(sims) / length(sims)
> 
> ## Plot the frequency of the rolls
> plot(table(sims), xlab = 'd20 rolls', ylab = 'Frequency', main = 'Frequency of events for each possible d20 roll (Trials=10,000)')
> 
> ## Plot the probability of the rolls
> plot(prob, xlab = 'd20 rolls', ylab = 'Frequency', main = 'Probability of events for each possible d20 roll (Trials=10,000)')
> 
> ## Perform chi square test
> chi2 <- chisq.test(table(sims))
> chi2

    Chi-squared test for given probabilities

data:  table(sims)
X-squared = 116910, df = 19, p-value < 2.2e-16
Figure 9.png

Wow! This d20 is really biased! At a statistical significance threshold of 5%, the very small P-value (P < 2.2 x 10^-16) indicates that this d20 is statistically different from a fair d20. Maybe that's why I have more critical fails than any other member of my party. I definitely will not be using this dice in the future.

 

CONCLUSIONS

The chi square test is useful for detecting bias in anything that produces frequencies of rolls or events. You can use the chi square test for a variety of things, such as testing the fairness of a coin, comparing the frequency of male and female characters across different character classes, and determining whether the actual observations match what you expected. So, when you're playing D&D with your friends and you suspect that your d20 is rolling a critical fail more often than not, you may want to run a little experiment using the chi square test.

The R code can be found on my GitHub site.

 

REFERENCES

I had help writing this blog. The code for the chi square simulation came from Francis J. DiTraglia, Assistant Professor of Economics at the University of Pennsylvania. His website is here. The page where I found his code is here.

For those interested in probability and games, you should check out this great resource from the Mathematics Assessment Resource Service at the University of Nottingham & UC Berkeley. It uses mathematics to design several games of chance. Fun to do in between campaigns.

And for those who want a more academic presentation on RPGs, Paul Mason wrote an incredible piece that can be found here. Citation: Mason, Paul. 2012. "A History of RPGs: Made by Fans; Played by Fans." Transformative Works and Cultures, no. 11. http://dx.doi.org/10.3983/twc.2012.0444

 

Communicating data effectively with data visualization - Part 12 (Waffle Charts)

BACKGROUND

In data visualization circles, the pie chart is considered an inefficient tool to convey parts of a whole. Edward Tufte often criticizes the use of the pie chart to display data visually, stating that

"A table is nearly always better than a dumb pie chart; the only worse design than a pie chart is several of them, for then the viewer is asked to compare quantities located in spatial disarray both within and between charts."[1]

The major reason pie charts are disliked by data scientists and other pundits is the way our brains work. Mostly, we are good at judging things visually, but with pie charts it is hard to distinguish the relative proportion of a slice to the whole. For example, can you tell the difference in proportions between the two pie charts below? Which one has more of component B?

Figure 1.png

Since it is challenging to identify the differences between the two pie charts, several alternatives exist to present the data accurately and effectively. In this article, we will discuss one such method: the waffle chart.

Waffle charts are grid-based visuals with equal-sized blocks that convey parts of a whole accurately and efficiently. Some have called waffle charts "square pie charts." They are usually proportional and arranged in a 10 x 10 grid.

 
Figure 2.png
 

Colors can be used to distinguish the contribution of groups or categories to the whole, where each square represents one percentage point, totaling 100.

Figure 3.png

Waffle charts are great for presenting proportions or parts of a whole, and they should be used instead of pie charts.
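
Although this article builds the waffle chart in Excel, the same idea can be sketched in R with ggplot2. This is only a rough sketch; the category names and shares below are made up for illustration and should be replaced with your own percentages (summing to 100).

library(ggplot2)

parts <- c(A = 40, B = 30, C = 20, D = 10)            # hypothetical shares that sum to 100
grid <- expand.grid(x = 1:10, y = 1:10)               # 10 x 10 grid = 100 squares
grid$category <- factor(rep(names(parts), times = parts),
                        levels = names(parts))        # one square per percentage point

ggplot(grid, aes(x = x, y = y, fill = category)) +
  geom_tile(color = 'white') +                        # white borders between squares
  coord_equal() +
  theme_void() +
  labs(fill = 'Category')

Because expand.grid() varies x fastest, the squares fill from the bottom row upward, left to right, mirroring the 1-to-100 numbering used in the Excel grid later in this tutorial.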

 

MOTIVATING EXAMPLE

We will use the 2016 National Health Expenditure dataset to illustrate the use of waffle charts. We will compare expenditures at each decade from 1965 to 2015.

You can download the data from the National Health Expenditures Accounts (NHEA) website:

https://www.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/NationalHealthExpendData/NationalHealthAccountsHistorical.html

The data provide the percentage of health expenditures attributable to each component at each decade from 1965 to 2015 (Table 1).

 

Table 1. National Health Expenditures from 1965 to 2015.

Figure 4a.png

In the above table (we will start with 1965), 89% of total expenditures went to Health Consumption Expenditures, made up of Personal Health Care (83%), Government Administration and Net Cost of Health Insurance (4%), and Government Public Health Activities (2%). The remainder was spent on Investment (11%). We have to compute the cumulative percentage spent across the different categories to generate our waffle charts.

This is easily done by cumulatively summing the individual components (Personal Health Care, Government Administration and Net Cost of Health Insurance, Government Public Health Activities, and Investment) so that they total 100 (Table 2).

 

Table 2. Cumulative values of the individual expenditures.

Figure 4b.png
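
As a quick sanity check, the cumulative values in Table 2 can be computed with a single line in R (the 83, 4, 2, and 11 percent components come from the 1965 column of Table 1):

cumsum(c(83, 4, 2, 11))   # Personal Health Care, Administration, Public Health Activities, Investment
# [1]  83  87  89 100

These cumulative thresholds (83, 87, 89, and 100) are the values that the conditional formatting rules will reference later in this tutorial.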

Step 1. Setting up the grid

In Excel, we want to create a 10 by 10 grid of squares. To do this, change your view from Normal to Page Layout. We are doing this because Excel does not measure row height and column width on the same scale in the Normal view; in Page Layout view, both are measured in inches. Set the column width and row height to 0.5 inch. You should have a 10 by 10 grid of squares that are 0.5 inch on each side.

Figure 5.png

Step 2: Label the squares in the grid

Once the grid has been sized correctly, we will assign a value to each square in the grid. These values correspond to the percentage points used by the cumulative values in Table 2. Start at the lower left with a cell value of 1 and count up to a cell value of 100 in the upper right of the grid.

 
Figure 6.png
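
The numbering scheme can be mimicked in R to see the layout at a glance (this is just a sketch of the grid, not part of the Excel workbook):

grid_values <- matrix(1:100, nrow = 10, byrow = TRUE)[10:1, ]   # fill left to right, bottom row to top row
grid_values[10, 1]   # lower-left square
# [1] 1
grid_values[1, 10]   # upper-right square
# [1] 100
grid_values[1, 1]    # upper-left square
# [1] 91

Note that the upper-left square holds the value 91, which is the cell we will select in Step 3.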
 

Step 3: Apply conditional formatting for each cell in the grid

Once the cells have been assigned a value, we can use Excel’s conditional formatting tool to assign colors for each of the expenditure components.

Select the cell that has the value “91.” (It doesn’t matter which cell you select, but we will use the upper-left corner for simplicity; this is also cell D5.)

In the New Formatting Rule dialog, change the Style option to “Classic,” and then select “Use a formula to determine which cells to format.”

Figure 7.png

Insert a formula that checks whether the cell’s value is less than or equal to the total cumulative percentage (in this case, 100 percent).

Figure 8.png

The cell references in these formulas are specific to this example, but you can adapt them to an example of your own. (The Excel file that this article is based on can be downloaded here.)
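
The logic behind the stacked rules can also be expressed compactly in R. The sketch below is an analogue of the Excel formulas, not the formulas themselves, using the 1965 cumulative thresholds of 83, 87, 89, and 100:

cell_values <- 1:100
component <- cut(cell_values,
                 breaks = c(0, 83, 87, 89, 100),
                 labels = c('Personal Health Care', 'Administration',
                            'Public Health Activities', 'Investment'))
table(component)   # counts per component: 83, 4, 2, and 11, matching Table 1

Each cell takes the color of the smallest cumulative threshold it satisfies, which is exactly what layering the “less than or equal to” rules accomplishes in Excel.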

 

Step 4: Change colors to match the different expenditure categories

Next, we will change the fill and font colors so that they match. By making the font color the same as the fill color, the values in the cells become invisible while still being referenced by the conditional formatting rules we just created.

Figure 9.png

Step 5: Apply the conditional formatting to ALL the cells in the grid

After you change the colors for the font and fill, you will need to make sure that the conditional formatting rule is applied to the entire grid.

Figure 10.png

You should notice that all the cells in the grid are the same color. This is because the first conditional formatting rule is based on the total cumulative percentage, which is 100, so every cell value (1 through 100) satisfies the rule and receives the same color.

 
Figure 11.png
 

Step 6: Add a new conditional formatting rule for the next expenditure component

The next step is to apply another conditional formatting rule for the next highest cumulative percentage value: Government Public Health Activities, at 89 percent.

Figure 12.png

We repeat the process for each expenditure component. When we are done, we should have four conditional formatting rules, one for each expenditure component.

Figure 13.png

All the formulas for the conditional formatting are listed below:

Figure 14.png

Step 7: Repeat this process for the other decades

Once we complete the conditional formatting for the other decades, we can present the waffle charts together.

 

Final waffle chart

The following waffle charts incorporate the health expenditures for each decade from 1965 to 2015. The border color was changed to white, and the labels use the Century Gothic font. Inside each waffle is the percentage of Investment associated with that decade. You can download this exercise’s Excel file here.

Figure 15.png

Based on the waffle charts, we can see that Investment (spending on noncommercial biomedical research and on structures and equipment by health care establishments) has decreased as a share of total spending over time. Conversely, the shares for Personal Health Care, Government Administration, and Public Health Activities have increased.

CONCLUSIONS

Waffle charts are a better alternative to pie charts because we can discern the exact value of each part that makes up the whole. In this case, we can easily visualize the decrease in Investment’s share of US health expenditures for each decade from 1965 to 2015.

 

You can re-create these findings using the Excel file located here.

REFERENCES

I used the following references to assist with the development of this article. They were incredibly helpful for learning the methods and understanding how to make the most of waffle charts.

 

[1] Tufte ER. The Visual Display of Quantitative Information. Cheshire, CT: Graphics Press; 2001.

 

Everyday Office’s YouTube video

https://www.youtube.com/watch?v=HZe5SzgxlsQ

Michael Sandberg’s Data Visualization Blog

https://datavizblog.com/2014/09/09/dataviz-squaring-the-pie-chart-waffle-chart/

Robert Kosara’s Eagereyes Blog

https://eagereyes.org/techniques/pie-charts

Sumit Bansal’s Trump Excel: The Smart Way Blog

https://trumpexcel.com/waffle-chart-excel/

Jonathan Schwabish’s PolicyViz Blog provides another method for creating waffle charts using data validation

https://policyviz.com/2018/04/26/interactive-waffle-charts-in-excel/

 

 

 

Communicating data effectively with data visualization - Part 11 (Waterfall charts)

BACKGROUND

Changes in data values happen. But when you want to visualize these changes, you need to choose the visualization that best tells the story. In this article, we will explore waterfall charts, a form of data visualization that illustrates changes from a reference value across a sequentially ordered axis (e.g., time).

 

MOTIVATING EXAMPLE

I created a dataset in which the average duration of an academic detailing educational outreach visit was estimated from Quarter 1 of Fiscal Year 2017 (Q1 FY17) to Quarter 4 of Fiscal Year 2018 (Q4 FY18). The dataset includes the reference (or base) average duration of a visit (in minutes) followed by the positive and negative changes to that reference value in each subsequent quarter. (The Excel example can be retrieved from the following link.)

 
Figure 1.png
 

There are a few things to consider when reviewing this table. The reference value is 25 minutes, which was the average duration in Q1 FY17. In Q2 FY17, the average duration decreased by 5 minutes to a new value of 20 minutes. Then in Q3 FY17, the average duration increased by 15 minutes to 35 minutes, a net gain of 10 minutes over the reference value of 25 minutes (25 + 10 = 35).

Figure 2.png

Understanding how these values reflect the net gain from the reference value is critical in building the waterfall chart.
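
A quick way to see the running values behind the table is a cumulative sum of the changes (25, -5, and +15 are the values described above):

cumsum(c(25, -5, 15))   # Q1 FY17 reference, then the Q2 and Q3 FY17 changes
# [1] 25 20 35

These running values of 25, 20, and 35 minutes are what the waterfall chart will trace visually.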

 

TUTORIAL

Step 1: Create three new columns (base, fall, and rise)

We need to do some calculations to generate the base values for the waterfall chart.

Start by creating three additional columns and label them as base, fall, and rise (I learned how to design my table from this YouTube video by United Computers).

Figure 3.png

Step 2a: Estimate the value for the fall column

For the fall column, we look at the Duration column. If the value in the Duration column is positive, then the value in the “fall” column should be zero. If the value in the Duration column is negative, then we want the absolute (positive) value in the “fall” column.

Here is an illustration of what the “fall” column calculation requires.

Figure 4.png

Step 2b: Estimate the value for the rise column

To estimate the rise column, we again look at the Duration column. If the value in the Duration column is positive, we enter it into the “rise” column. If the value in the Duration column is negative, then we enter 0 in the “rise” column.

Here is an illustration of what the “rise” column calculation requires.

Figure 5.png

Step 3: Estimate the value for the base column

The base column provides the necessary reference for us to create the waterfall chart. Unlike the other columns, we start the base calculation in the Q2 FY17 row, since Q1 FY17 holds the reference value. Each base value is the previous row’s base plus the previous row’s “rise” value minus the current row’s “fall” value; for Q2 FY17, this means adding the “rise” value from Q1 FY17 and subtracting the “fall” value from Q2 FY17. We do this for all the remaining rows.

Here is an illustration of what the “base” column calculation requires.

Figure 6.png

Step 4: Review the final data

The final data should look like the following.

Final data.png
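
For readers who prefer to check the arithmetic outside Excel, here is a minimal R sketch of the same fall, rise, and base calculations. Only the Q1–Q3 FY17 values of 25, -5, and +15 from the article are used, and the Q1 FY17 base is treated as zero so that the reference value sits in the rise column (an assumption consistent with Step 2b, not something stated in the workbook).

change <- c(25, -5, 15)                      # reference value, then the quarterly changes
rise <- ifelse(change > 0, change, 0)        # positive changes (Step 2b)
fall <- ifelse(change < 0, abs(change), 0)   # negative changes, as positive magnitudes (Step 2a)

base <- numeric(length(change))              # assumption: the reference row's base is 0
for (i in 2:length(change)) {
  base[i] <- base[i - 1] + rise[i - 1] - fall[i]   # previous base + previous rise - current fall (Step 3)
}

wf <- data.frame(quarter = c('Q1 FY17', 'Q2 FY17', 'Q3 FY17'),
                 base = base, fall = fall, rise = rise)
wf
#   quarter base fall rise
# 1 Q1 FY17    0    0   25
# 2 Q2 FY17   20    5    0
# 3 Q3 FY17   20    0   15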

Step 5: Create a stacked bar chart

Select the data from the Time, base, fall, and rise columns and then insert a stacked bar chart.

Figure 7.png

You should have a chart that looks like the following.

Figure 8.png

Step 6: Hide the base stacked bars

The next step is to hide the base column values by selecting the blue bars and changing the formatting to No Fill and No Line.

Figure 9.png

Step 7: Format the chart for final presentation

You should format the chart using colors that indicate the direction of the rise and fall in the average duration of educational outreach visits. I selected blue for an increase and red for a decrease. I also added the Y-axis label and decreased the gap width to 10%.
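
Continuing the wf data frame from the sketch above, a rough ggplot2 analogue of Steps 5 through 7 stacks the three columns and makes the base segment invisible. The colors and legend handling here are assumptions meant to mimic the Excel chart, not the article's exact formatting.

library(ggplot2)
library(tidyr)

long <- pivot_longer(wf, cols = c(rise, fall, base),
                     names_to = 'part', values_to = 'minutes')
long$part <- factor(long$part, levels = c('rise', 'fall', 'base'))   # base should stack at the bottom
long$quarter <- factor(long$quarter, levels = wf$quarter)

ggplot(long, aes(x = quarter, y = minutes, fill = part)) +
  geom_col(width = 0.9) +                                # stacked columns
  scale_fill_manual(values = c(rise = 'steelblue',       # increases in blue
                               fall = 'firebrick',       # decreases in red
                               base = 'transparent'),    # hide the base segment (Step 6)
                    breaks = c('rise', 'fall')) +        # keep the hidden base out of the legend
  labs(x = NULL, y = 'Average duration of visit (minutes)', fill = NULL)

If the base segment stacks on top instead of underneath in your version of ggplot2, reverse the factor levels or use position_stack(reverse = TRUE).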

 

Final waterfall chart

Figure 10.png

CONCLUSIONS

The waterfall chart visualizes the rise and fall in the average duration of an academic detailing visit across two fiscal years. This visualization gives the viewer an intuitive summary of the changes that occurred and whether action plans are needed. In this example, we see that the changes did not deviate substantially from the base of 25 minutes per visit across the two fiscal years. Therefore, there may not be a need to spend resources investigating this element of academic detailing.

In the next example, I re-created the waterfall chart to illustrate seasonal changes.

Figure 11.png

Like the previous waterfall chart, this is a clear and intuitive visualization of the changes that occurred across two fiscal years. The red bars easily illustrate that the average duration decreased immediately after Q1 FY17, but increased after Q1 FY18 only to follow the same pattern as the previous fiscal year. This tells us that something is going on during Q1 of each fiscal year, which may require some kind of investigation of the program at the beginning of each fiscal year.

Microsoft Office 365 for Windows and Microsoft Office 2016 for the Mac have a new feature for creating waterfall charts. You may need to check the Office Updates to acquire these new features.

 
waterfall icon.png
 

You can learn more about this feature on the following Microsoft Office support pages (link 1, link 2).

 

REFERENCES

I used the following YouTube video by United Computers to help generate the waterfall charts in this article.

I also referred to Cole N. Knaflic’s blog on waterfall charts.