Data visualization

Communicating data effectively with data visualizations—Part 18 (Histograms)

BACKGROUND

Inspecting your data is an important part of data analysis preparation. Data, like all things, should behave according to some reasonable expectation. For example, if we randomly sampled a group of people in the U.S., we would reasonably expect to get 50% males and 50% females. Similarly, if we examined the age distribution of this sample, we would expect to have a normal distribution.

At the macro level, we may only be interested if the mean and standard deviations are representative of the population distribution. Since we sample from the population (randomly), we would expect to get similar means (and medians). This can be accomplished using simple Excel functions (or commands in statistical packages) to generate a descriptive summary. Table 1 describes the summary statistics for the total fat consumed by a sample of 8,327 responders to the National Health and Nutrition Examination Survey (NHANES) survey.

TABLE 1.png

We can see that the mean and the median are different, which is an indicator that the distribution is not normal. However, we may be interested in learning more about the distribution or behavior of this variable. Are there any outliers? How skewed is the distribution?

HISTOGRAMS

To visualize this, we will need to generate a histogram. A histogram is a visual representation (bars) of the distribution of data (usually continuous). It uses spacings called “bins” to count the number of times a value falls into that bin. A histogram looks like a bar chart, but the key difference is that in the histogram the adjacent bars are touching each other rather than having a space between them. Another difference is that histograms plot the frequency (or density) of a value or a range of values for a continuous data type; whereas, bar charts plot the count of a discrete data type (Figure 1).

Figure 1. Comparisons between histogram and bar chart.

Keep in mind that the number of bins for histograms should be just enough to make out the distribution and not too small to be too much information. This is Grice’s maxim of quantity where data are presented in an informative manner without overwhelming the audience with too much information.[1] Creating smaller bins to increase the resolution of the histogram is unnecessary when all you want is a general visualization of the data’s distribution.

 

MOTIVATING EXAMPLE

We will use data from the NHANES survey (2015-2016) to generate a histogram in Excel. The data can be downloaded from my Dropbox folder here. I cleaned the file so that all missing data were dropped. In total, there are three variables:

·      seqn = subject identifier

·      drqsdiet = special diet (Yes/No/Don’t know)

·      dr1ttfat = amount of total fat (gm) consumed

We will create a histogram to visualize the distribution of total fat consumed by the subjects. To start, let’s select the data and insert a histogram chart from the Insert Tab.

A histogram will be inserted near where your data are located on the worksheet. Excel automatically selects the bin sizes for you. But you can customize this to your needs.

Figure 3 -histogram.png

Right click anywhere x-axis and select Format Axis. You should see a column on the right side appear with options to modify the bin sizes.

You can modify the bin width, number of bins, the overflow bin, and underflow bin.

The bin width can be larger or smaller depending on how much resolution you want. You should balance this out with the appropriate number of bins you want to show. According Grice’s maxim of quantity, you don’t want to overwhelm your audience. In Excel, you can only modify either the bin width or the number of bins; never both.

The overflow bin indicates what the last bin should be. If anything is over the overflow bin value (X), then Excel will collapse those frequencies into that last bin. For example, if I wanted the overflow bin to be 137 grams or greater, I enter “137” into the overflow bin field. You can do the same thing on the other end of the x-axis with the underflow bin value.

Once you’ve figure out how to change the number of bins, let’s change the number of bins from 66 to 100, 75, 50, and 25 to observe how the histogram changes.

Notice that the histogram with a bin size of 100 is really fine whereas the bin size of 25 is blocky. We can tell from all of these figures that there is a right skew to the distribution due to a few outliers. There are 3 subjects who consume more than 400 grams of total fat compared to 19 subjects who consume between 300 and 399 grams of total fat. The higher resolution doesn’t really help us determine that the total fat consumption is right skewed compared to the figures with bin sizes of 75 and 50. If I were presenting to an audience or publishing an appendix, I would select either the figure with a bin size of 75 or 50. These two histograms illustrate the peak at the mean and the right-skewed distribution without violating Grice’s maxim of quantity. However, different situations will require you to make different choices, so I encourage you to explore the design features on Excels’ histogram.

STEM-AND-LEAF HISTOGRAM

The stem-and-leaf display is an alternative histogram that uses the prefix of number to assign positions into the bins. The following figure is a randomly selected number of subjects from our NHANES data. The first subject consumed 14 grams of total fat which is indicated by the 1* | 4. The 1* represents the first digit of “14” and the “|” separates the next digit. Similarly, there is one subject who consumed 22 grams of total fat indicated by the 2* | 2 and another subject consumed 24 grams of total fat (2* | 4).

 
Figure 7 - stem-and-leaf.png
 

CONCLUSIONS

Histograms are a great visualization tool to quickly check whether your continuous data are normally distributed. You can identify whether the mean is close to the median or whether there are left or right skewness to your data. Moreover, you can change the bin sizes of a histogram to become more refined or less so. But according to Grice’s maxim of quantity, it is best to present enough data that will get the information across to your audience without overwhelming them with unnecessary details.

 

REFERENCES

Grice, H. P. Logic and Conversation.  In Cole P. and Morgan J. (Eds), Syntax and Semantics: Vol 3, Speech Acts. Academic Press, New York, pp.43-58, 1975.

Communicating data effectively with data visualization – Part 17 (Multivariate Dimensions)

MULTI-DIMENSIONAL DATA VISUALIZATIONS

Data visualizations can improve how we see complex data, in particular, where multiple dimensions are involved. For instance, in an X-Y plan, we can have the months on the X-axis and the number of patients on the Y-axis (Figure 1). Let’s imagine that number of patients represents some outcome you are interested in (e.g., number of patients who has 5+ prescription medications). Time and the number of patients are dimensions on this two dimensional plan.

 

Figure 1. Two-dimensional X-Y axes figure.

Figure 1.png

As a rule, whenever you want to display multiple dimensions, each dimension needs to be represented onto a single figure, which is challenging given that a figure is normally on a two-dimensional plane. What if we wanted to have a figure with more than two dimensions? What if we wanted to have a figure with months on the X-axis, number of patients on the Y-axis, and include a third dimension denoting different genders? How would we go about doing that? Figure 2 illustrates how we can do this by adding lines and labeling them using different colors.

 

Figure 2. Figure with three dimensions.

Figure 2.png

Figure 2 is able to capture three dimensions of data into a single two dimensional figure. The number of patients is captured in the Y-axis and the time in months is captured in the X-axis. Gender is represented by the colored lines that show the difference in the relationship between number of patients and time associated with males and females.

Alternatively, we use the color blue for the following conditions:

E[Y | male]

 

We use the color red for the following conditions:

E[Y | female]

 

The legend providers additional clarification that the different line colors denote the gender types. It is critical to include clear and intuitive legends so that your readers will immediately recognize their reference and label. Without a legend, your audience will have to guess what color belongs to what gender type.

How about adding another dimension such as age? This would increase the number of dimensions on this figure from three to four. For example, what if we wanted to see how being older (80+ years) impacted the relationship between the number of patients and time across genders? Well, this can be accomplished by using different types of lines (e.g., dotted lines and dashed lines).

Figure 3 illustrates how using different types of lines (dotted for the 80+ year old patient and dashed for the <80 year old patient) can provide a visual accounting of the differences across genders and across age in terms of the number of patients and months. The legend provides additional clarification as to the age groups associated with the different line types.

 

Figure 3. Figure with four dimensions.

Figure 3.png

Alternatively, we continue to use the color blue for the following conditions but add different line types for the age groups:

E[Y | 80+years & male] (dotted lines)

E[Y | <80 years & male] (dashed lines)

 

Similarly, we continue to use the color red for the following conditions but add different line types for the age groups:

E[Y | 80+years & female] (dotted lines)

E[Y | <80 years & female] (dashed lines)

Using colors and line types allow us to capture multiple dimensions onto a two dimensional figure. We essentially are showing a stratified descriptive analysis of the age groups nested within each gender and their relationships between the number of patients and time (months).

How about adding a fifth dimension? How could one do that?

A simple way to introduce a fifth dimension is to use the concept of small multiples by Edward Tufte.[1] Tufte uses small multiples to include additional dimensions. Figure 4 illustrates how we can leverage small multiples to look at the differences in the relationships between number of patient and time for different genders and age groups across different states.

 

Figure 4. Small multiples of states with differing patterns of number of patients and time for different genders and age groups.

Figure 4.png

Using small multiples allow us to compare the differences in the association between number of patients and months across different states stratified by gender and age groups. The number of patients increased across time for all gender and age groups in California. Similar patterns are observed in Virginia, but the rate of increase in the number of patients across time is lower in the female group and age cohorts. In Ohio, different patterns are observed compared to California and Virginia. Males and their associated age groups have a decreasing number of patients across time. Conversely, females and their age cohorts have a positive correlation between the number of patients and time.

 

CONCLUSIONS

Adding dimensions can improve the figure you design by incorporating complex relationships across different data characteristics. In our example, we demonstrate how we can integrate dive dimensions of data to a two-dimensional figure that tell us information about the association between the outcomes (number of patients) with time (months) across states stratified by gender and age groups. Be creative with how you integrate multiple dimensions into a figure. Ask yourself if this is something that will help improve the story the figure is conveying. There are times when a simple figure will do. But when you have a lot of data and want to tell a story, consider adding dimensions to the figure to get a narrative that will excite and capture your audience’s attention.

 

REFERENCES

  1. Tufte ER. The Visual Display of Quantitative Information. Second. Cheshire, CT: Graphics Press, LLC.; 2001.

Communicating data effectively with data visualization – Part 16 (UpSet diagrams)

INTRODUCTION

Venn diagrams are useful visualizations that illustrate intersections between several groups. A common Venn diagram includes three transparent circles that overlap each other. In Figure 1, a Venn diagram with three groups denoted as A, B, and C has a total of 4 intersections (AB, AC, BC, and ABC).

Figure 1. Venn diagram with three groups and four intersections.

 
Figure 1.png
 

However, this can get more complicated when there are more than three groups or multiple interactions. Figure 2 illustrates a more complicated example.

Figure 2. Complicated Venn diagram.

 
Figure 2.png
 

How many interactions are there? It’s quite difficult to determine based on this figure. An alternative data visualization method to capture the complexity of interactions and overlaps illustrated by Figure 1 is called an UpSet diagram. Please go to the Gehlenborg Lab for more information about UpSet diagrams.

UpSet diagrams visualizes complex intersections as sets of a matrix where the columns represent the number of interactions across different sets of groups (rows).[1,2] Each set can represent a combination of interactions. This is a very useful way to see how many campaigns are delivered in a single academic detailing visit.

Academic detailers deliver educational outreach to providers using a combination of tools that include unbiased evidence-based educational materials, online clinical dashboards, and audit-and-feedback approaches.[3,4] However, a single academic detailing interaction can include several campaigns and key messages. Since each visit may incorporate several campaigns, there needs to be a mechanism to capture the different sets of combinations. From an operations perspective, monitoring the number of interactions is challenging, especially, when you want to capture workload for your staff or facility. Fortunately, UpSet diagrams, developed by researchers at the Department of Biomedical Informatics, Harvard Medical School and the Institute School of Computing, University of Utah, can easily capture these complex combinations of campaigns with each academic detailing visit.

Below is an example of an UpSet diagram that lists the different campaigns in the rows and the number of visits using the various combinations of campaigns (Figure 3). For example, there are 31 visits that include the Pain and OEND campaigns. Further, there are 119 visits that include the OUD and OEND campaigns. Notice how easy it was to visually see the different campaigns that were combined into a single visit. A single visit can be illustrated to show how many campaigns were involved. By using an UpSet diagram, you can visually capture which complex interactions were more frequent or infrequent.

These UpSet diagrams are useful ways to capture the complex nature of academic detailing and the various combinations of campaigns delivered.

Figure 3. Example UpSet diagram with several campaigns and the number of visits.  

MOTIVATING EXAMPLE

Excel is unable to create UpSet diagrams with its current tools. Fortunately, the researchers at Harvard University and the University of Utah have created an UpSet R Shiny app that you can access online. All you need is your own set of data to upload. The R Shiny app will generate the UpSet figure and you can adjust the settings to get the figure you need.

The UpSet R Shiny app is located here.

When creating a data set, make sure that you save it as a *.CSV file with headers.

Use the following example dataset to upload onto the UpSet R Shiny app, which is located here.

After you open the UpSet R Shiny app in a browser, make sure that you are in the Option 1 tab. This is where you will see some instructions to upload the data you want to view the complex interactions for an academic detailing visit.

Browse for the file that contains the data on all the academic detailing visits with their campaigns. The data should be saved as a *.CSV file. We will use the pbmads_2.csv file. The data structure should have the each campaign as a binary variable (1 = YES, 0=NO) similar to how you would create a dummy variable.

Figure 4_1.png

Make sure to keep the columns separated using the “Comma” separator.

Then, click on “Plot” to see how the UpSet diagram appears in the R Shiny app.

After you click on Plot, you will see the app update and you should be able to view the Setting and Plot areas. The Setting area allows you to configure your plot. You can change the order of the plot and the type of ordering rules (e.g., Frequency and Degree). You can also change the size of the fonts using the Advanced section.

The Frequency order provides the largest number of visits in descending order.

The Degree order provides the campaigns with the most complex interactions or combinations in ascending order. 

Based on these figures, you can easily discern the complexity of interactions academic detailing visits include into a single educational outreach. The most complex interaction are the visits that include the Opioid Use Disorder (OUD), Opioid Overdose Education and Naloxone Distribution (OEND), posttraumatic stress disorder (PTSD), Pain, and Other campaign topics (N=3). The individual solo campaign with the most visits is the OUD (alone) campaign with 175 visits.

 

CONCLUSIONS

UpSet diagrams make it easy to categorize the academic detailing visits into different combination categories. We can apply this method to other program monitoring metrics such the differences in these visit combinations across time and the number of attendees. Other additional areas where UpSet diagrams are useful include complex genetic markers. Try and think of ways where you can use this method to simplify complex Venn diagrams or complex interactions across different groups.

Since the UpSet R Shiny app has limited functionality, you can explore other features using R or Python to generate more complex UpSet diagrams. The UpSetR package is available for the R environment. The UpSetPlot package is available for Python.

You can access the GitHub site for UpSet diagrams here.

I encourage you to read the papers on UpSet diagrams by Conway and colleagues and Lex and colleagues: Paper 1 and Paper 2.

YouTube video on UpSet diagrams.

 

REFERENCES

1. Conway JR, Lex A, Gehlenborg N. UpSetR: an R package for the visualization of intersecting sets and their properties. Bioinforma Oxf Engl. 2017;33(18):2938-2940. doi:10.1093/bioinformatics/btx364

2. Lex A, Gehlenborg N, Strobelt H, Vuillemot R, Pfister H. UpSet: Visualization of Intersecting Sets. IEEE Trans Vis Comput Graph. 2014;20(12):1983-1992. doi:10.1109/TVCG.2014.2346248

3. Avorn J, Soumerai SB. Improving drug-therapy decisions through educational outreach. A randomized controlled trial of academically based “detailing.” N Engl J Med. 1983;308(24):1457-1463. doi:10.1056/NEJM198306163082406

4.Avorn J. Academic Detailing: “Marketing” the Best Evidence to Clinicians. JAMA. 2017;317(4):361-362. doi:10.1001/jama.2016.16036

 

Communicating data effectively with data visualization – Part 15 (Diverging Stacked Bar Chart for Likert scales)

BACKGROUND

Surveys or questionnaires are used to capture respondent’s perceptions about any number of products, ideas, or subjects. You can ask someone how they are feeling to how much they agree with a particular statement. Or you can ask someone if they are satisfied with a product or service they recently received. Items (or questions) in surveys can solicit these types of responses. Free response questions allow the respondents to write their responses in an unstructured manner as long as it answers the question or purpose of the survey. But most surveys ask questions that require a specific response using multiple choice, ratings scale, or Likert scales.

In this article, we will discuss the Likert-type scale and how we can visualize this using Excel.

 

LIKERT-TYPE SCALE

The Likert scale was first developed by Rensis Likert, who was a psychologist in the early 20th Century. He developed the 5-point Likert scale as part of his PhD dissertation in order to capture peoples’ ratings of international affairs. The Likert scale is unique because it provides a rating that is ordered sequentially (Positively to Negatively or Agreement to Disagreement).

 

Figure 1. Example of a 5-point Likert scale.

Figure 1.png

The 5-point Likert scale is quite common in psychometric research. A statement is usually provided and the participant is asked to rate their level of agreement. Notice how the scale is ordered sequentially from Strongly Disagree in the left of the scale and Strongly Agree to the right of the scale. This is an important feature of Likert-type scales with an added convenience, higher values are associated with higher agreements. You can reverse this as well, but we will keep the order for the remainder of this article.

 

VISUALIZE THE LIKERT-TYPE SCALE

There are many ways to visualize a Likert scale. We can use pie or bar charts to capture the different responses to a Likert-type question or statement.

 

Figure 2. Bar and Pie charts used to visualize Likert scale responses.

Figure 2.png

However, the best way to visualize Likert scales is to build a Diverging Stacked Bar Chart.

Figure 3. Diverging stacked bar chart using a set of hypothetical data for three statements.

Figure 3-1.png

The red dotted line in Figure 3 represents the divergent point where the stacked horizontal bar chart aligns. This is effective when you want to suggest that certain set of ranked responses are more important than the other. In this example, Strongly Agree and Agree are given more precedence than the other ranks. Rightly so, since a majority of the responses were Strongly Agree or Agree.

 

MOTIVATING EXAMPLE

Let’s assume that we administered a survey with three questions. The results are as follows:

Figure 4.png

We have a 5-point Likert scale with responses in all the different ranks. A majority of the responses were either Strongly Agree or Agree, so we’ll create a diverging point with these two ranks.

 

TUTORIAL

The Excel file for the tutorial is located here.

There are two ways to create this diverging stacked bar chart.

Method 1: I learned how to create diverging stacked bar charts from Stephanie Evergreen’s blog Evergreendata.  I got her method step-by-step and then go over an alternative method.

Part 1.1 Estimate the buffers at the end of the stacks.

First, change the values to percentages (Step 1). Then, determine where the divergence will occur (Step 2). For our example, the divergence is between Neutral and Agree.

The next step involves estimating the values at the ends of the stacked bar chart. There are two ends, left and right (Steps 3 and 4). Once the buffers have been estimated, we will plot the stacked bar chart.

Part 1.2. Plot the divergence stacked bar chart.

First, select all the data (Step1). Then Select the 100% Stacked Bar Chart from the Insert tab (Step 2). This should generate a default stacked bar chart (Step 3).

Right-click on the any area in the chart and click on the “Select data…” to change the data arrangement (Step 4). Select the “Switch row/column” to change the Y-X arrangement of the data.

Once you switch the rows and column, the chart will change and look like the one below.

Now, we want to remove the color of the buffers at the ends of the stacked bar (Step 7). Right-click the left end of the stack and select “No Fill” (Step 8). Repeat this for the right end of the stack.

The diverging stacked bar chart should resemble the figure we presented at the beginning of this article.

Figure 11.png

Removing the buffer labels from the legend, deleting the grid lines, changing the font (Adobe Gothic Standard B), and changing the stacked bars colors can improve the figure.

Figure 12.png

The challenge with this chart is the labels on the axis. The statements are too far to the left of the diverging stacked bar chart. To fix this, delete the labels on the left and insert text boxes with the statements.

Figure 13.png

Remove the borders, right alignment with the statements, and add labels. We should be very close to the figure we presented above.

Figure 14.png

Here is the final diverging stacked bar chart.

Figure 3.png

Method 2: I learned this other method from John Peltier at his Peltier Tech Blog.

This method requires you to create a divergent point based on distance. For instance, if you want to make sure that you have the divergent point where the responses are at Strongly Agree and Agree. So, subtract the distance from that point.

Figure 14-5.png

First, change all the values left of the divergent point to negative (Step 1).

Figure 15.png

Then rearrange the order of responses so that the furthest rank is closest to the divergent point (Step 2).

The select the data (Step 3), Insert a 100% Stacked Bar Chart (Step 4), and then Visualize the chart (Step 5).

Right-click anywhere in the chart area and click “Select Date…” (Step 6). Then Switch row and column (Step 7).

You will get the chart on the left below. However, this is not complete. Right-click on the Y-axis (Step 8) and the Format Axis (Step 9).

Change the label position for “Low” (Step 10) and then review the chart. Notice where the divergent point is located at. This is the same as the previous stacked bar chart that was constructed using Method 1.

Now, you can change the colors, delete the gridlines, remove the X-axis to create a plot below. Notice that there are some values that are negative. That’s because of the data we rearranged earlier to generate the distance from the divergent point. To fix this, you will need to manually change the values.

Figure 21.png

After manually change the values, your plot will look similar to the one below.

Figure 22.png

Conclusions

Visualizing the Likert scale using horizontal diverging stacked bar charts is a good method to see how the participants respond to questions or statements on a survey or questionnaire. However, not all Likert-type scales will necessarily need a diverging stacked bar chart to illustrate its point. You can also use a conventional stacked bar chart, which we will discuss in a future article.

The Excel file for this tutorial is located here.

 

REFERENCES

I used the following references to help write this article.

Stephanie Evergreen’s blog Evergreendata is an excellent resource for learning about other data visualization methods.

John Peltier’s blot  Peltier Tech Blog is another wonderful resource where you can learn more about Excel charts and data visualization.

The following paper provides details on how to create diverging stacked bar charts using R.

Heiberger RM, Robbins NB. Design of diverging stacked bar charts for Likert scales and other applications. Journal of Statistical Software. 2014;57(5): 1-32. https://www.jstatsoft.org/article/view/v057i05/v57i05.pdf

Communicating data effectively with data visualization – Part 14 (Gantt Charts)

INTRODUCTION

A useful calendar can be helpful in scheduling your meetings, avoiding conflicts, and remembering important dates. Applying data visualization to a calendar can help to identify key events throughout the day, week, or month. Here is an example of a color-coded calendar for a single person from Sadiq Javer from BoostSolutions.

Figure 1 - exampel calendar.png

Each meeting or event is color coded to indicate a particular category. For a single user, this is sufficient to manage a complex day, week, or month. However, if you are a project manager or lead, managing the calendars of a group or team, this task can be challenging.

A solution is to use a Gantt chart to organize the calendars of several members in your team. Gantt chart is a type of bar chart that provides a longitudinal visualization of schedules and timelines. It was invented by Henry Gantt who is known for his work on scientific management. Gantt charts are useful for project management and can illustrate major deadlines or milestones in the project’s life cycle.

Conveniently, the same tools used for project management can be applied to managing schedules for multiple team members in a group. In this article, we will apply the Gantt chart to managing a team’s schedule using Excel.

MOTIVATING EXAMPLE

We will create a Gantt chart using a hypothetical team’s schedule to visualize their and vacations.

Download the Excel sheet here.

Suppose we have a team who will be taking vacation in the upcoming calendar year (2019). There are several important dates that the Team will need to block for meetings. In order to avoid conflicts, a Gantt chart is used to plan an efficient annual schedule.

Here is a figure of our Gantt chart.

The Gantt chart blocks weekend and holidays so that the manager can easily see the entire 7-day week. Each column represents a day nested in a 7-day week. The months are color coded to identify when it begins and ends. Each staff has a unique color to identify their days off, and the Team meeting is highlighted in red to indicate the critical meeting dates.


TUTORIAL

You can use any version of Excel to build the Gantt chart. After opening a new Excel sheet, follow these steps.

Step 1. Resize the column’s width:

Figure 3.png

Resizing the column’s width to 2.33 seems to give an efficient size cell for the days.

Step 2. Assigning days and months:

In our example, each column represents one day. Therefore, we can assign 7 days into a week. Since the month starts on different days, we make sure to start with the correct day in our calendar. In 2019, January begins on Tuesday, therefore, our Gantt chart will start on Tuesday.

Figure 4.png

Step 3. Highlight the weekends and holidays:

Hopefully, your team doesn’t have to work on the weekends. However, there are exceptions. Clinicians work on the weekends, so your Gantt chart may need an indicator for differential pay (if it is part of the benefits). In our example, we will assume that no one from the team works on the weekends.

The holidays are highlight with a different color from the weekend

Figure 5.png

Step 4. Include the team members and Meetings to the Gantt chart.

Once you block out the holidays and weekends, you can start entering information on meetings and team members’ vacations. Different colors were used for the meetings and individual team members to provide easy visualization.

Figure 6.png

CONCLUSIONS

The final Gantt chart should be able to help you organize your team’s schedule while making sure that there are no conflicts with important team meetings or deadlines. Although Gantt charts were designed for project management, it can also be used to efficiently manage a team’s complex schedule.

You can download the Excel exercise at this link.

REFERENCES

I used the following references to assist me with this tutorial.

Sadiq Javer’s article on Gantt charts published on the Boostsoultions.com website.

Wikipedia’s page on Gantt charts, Henry Gantt, and Scientific Management.

Communicating data effectively with data visualization - Part 13 (Box and Whisker Diagrams)

BACKGROUND

Box plot (box and whisker diagram) is a great way to display distribution of a continuous (e.g., interval) data variable. A typical box plot will contain the mean, median, interquartile values, and the minimum and maximum values. Figure 1 illustrates these elements on a box plot. Up until recently, Microsoft Excel did not have an option to graph box plots. However, in the 2016 version of Microsoft Excel, box plots were added as part of the statistical features.

Figure 1. Example of a box plot (box and whisker diagram (Figure 1).

 
Figure 1.png
 

MOTIVATING EXAMPLE

We will use data that was randomly generated to create box plots across four hypothetical quarters (Q1FY19, Q2FY19, Q3FY19, and Q4FY19). The data will contact the number of visits to the doctor from several outpatient specialty clinic. Here is what the data looks like from the first two sites. Data for the example can be found here.

 
Figure 2.png
 

Site 1 has 45 visits in Q1FY19 and Site 2 has 44 visits in Q1FY19. To create the box plots, we need to use the long format which uses multiple rows for each site.

 

EXERCISE

In this article, we will generate box plots that will visualize the average number of visits and its distribution across quarters.

Figure 3.png

After clicking on the Box and Whisker plot, you will need to select the data that will be used to generate the box plots across the quarters.  

Figure 4.png
Figure 5.png
Figure 6.png

Click “OK” and the default box plot will look like Figure 2.  

Figure 2. Default box plot generated by Excel 2016.

Figure 7.png

After a few changes to the color and labels, our box plot can be improved (Figure 3).

Figure 3. Updated box plots.

Figure 8.png

These box plots give us an idea of the changes in the number of visits across quarters including the distribution of the data. For each box plots, the mean indicated by the “X” is not too different from the median (indicated by the solid horizontal line).  However, there is greater variation in the distribution of the number of visits in Q2FY19 and Q3FY19 compared to Q1FY19 and Q4FY19. We can see that there was an increase in the number of visits, on average, between Q1FY19 and Q3FY19, but this drop significantly in Q4FY19. This may be due to some kind of change (e.g., seasonal variation) and should be explored.

 

Conclusions

The box plot provides us with a nice data visualization of the mean number of visits across quarters including the variation and distribution of the data. Plotting these in Microsoft Excel 2016 will allow you to explore your data and motivate you to explore and generate some explanation or hypothesis for their behavior.

References

I used several online references to write this article.

The Dummies series provide a good illustration of the box plot elements, which can be located here.  

I watched this YouTube video by stickpet on how to use Microsoft Excel 2016 to generate box plots.

Communicating data effectively with data visualization - Part 12 (Waffle Charts)

BACKGROUND

In data visualization circles, the pie chart is considered an inefficient tool to convey parts of a whole. Edward Tufte often criticizes the use of the pie chart to display data visually stating that

A table is nearly always better than a dumb pie chart; the only worse design than a pie chart is several of them, for then the viewer is asked to compare quantities located in spatial disarray both within and between charts.”[1]

The major reason why pie charts are disliked by data scientists and other pundits is due to the way our brain works. Mostly, we are good at judging things visually, but with pie charts, it is hard to distinguish the relative proportion of a slice to the whole. For example, can you tell the difference in proportions between the two pie charts below? Which one has more of component B?

Figure 1.png

Since it is challenging to identify the differences between the two pie charts, several alternatives exists to present the data accurately and effectively. In this article, we will discuss one such method using a waffle chart.

Waffle charts are grid-based visuals that have equal size blocks that convey parts of a whole accurately and efficiently. Some have called waffle charts the “square pie charts.” They are usually proportional and arranged in a 10 x 10 grid.

 
Figure 2.png
 

Colors can be used to distinguish the contribution of groups or categories to the whole where each square represents a percentage point totaling to 100.

Figure 3.png

Waffle charts are great at presenting data where you are describing the proportions or parts of a whole and should be used instead of pie charts.

 

MOTIVATING EXAMPLE

We will use the 2016 National Healthcare Expenditure dataset to illustrate the use of waffle charts. We will compare expenditures from the decades between 1965 and 2015.

You can download the data from the National Health Expenditures Accounts (NHEA) website:

https://www.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/NationalHealthExpendData/NationalHealthAccountsHistorical.html

The data provide the percentages of expenditures for different components spent on health for each decade from 1965 to 2015 (Table 1).

 

Table 1. National Health Expenditures from 1965 to 2015.

Figure 4a.png

In the above table (we will start with 1965), 89% of Health Consumption Expenditures was due to Personal Health Care (83%), Government Administration And Net Cost Of Health Insurance (4%), and Government Public Health Activities (2%). The remainder was spent on Investments (11%). We have to estimate the cumulative percentage spent across the different categories to generate our waffle charts.

This is easily done by summing the individual components (Personal Health Care, Government Administration And Net Cost Of Health Insurance, Government Public Health Activities, and Investment) so that they will total 100 (Table 2).

 

Table 2. Cumulative values of the individual expenditures.

Figure 4b.png

Step 1. Setting up the grid

In Excel, we want to create a 10 by 10 square grid. To do this, change your view from Normal to Page Layout. We are doing this because Excel has a unique way of measuring row and height size (they are not on the same scale under the Normal view). When you change to the Page Layout, the scales for the columns and rows are in inches. Set the size for the columns and rows to 0.5 inch. You should have a 10 by 10 grid with squared that have a height and length of 0.5 inch.

Figure 5.png

Step 2: Label the squares in the grid

Once the grid has been sized correctly, we will assign a value for each square in the grid. These values will correspond to the percentage point that you have in the cumulative values in Table 2. Start at the lower left with a cell value of 1 up to a cell value of 100 in the upper right of the grid.

 
Figure 6.png
 

Step 3: Apply conditional formatting for each cell in the grid

Once the cells have been assigned a value, we can use Excel’s conditional formatting tool to assign colors for each of expenditure components.

Select the cell that has the value “91.” (It doesn’t matter what cell you select, but we will use the upper-most left corner for simplicity; this is also cell D5).

Change the Select option to “Classic.” Then change the New Formatting Rule to “Use a formula to determine which cells to format.”

Figure 7.png

Insert the formula where you have the value is less than or equal to the total cumulative percentage (e.g., 100 percent).

Figure 8.png

The cells for the equations are specific to this example, but you can apply this towards an example of your own. (The Excel file that is based on this article can be downloaded here.)

 

Step 4: Change colors to match the different expenditure categories

Next, we will change the fill and font colors so that they will match. By having the font color the same as the fill color, the values in the cells will appear invisible, but still referenced using the conditional formatting rules we just created.

Figure 9.png

Step 5: Apply the conditional formatting to ALL the cells in the grid

After you change the colors for the font and fill, you will need to make sure that the conditional formatting rule is applied to the entire grid.

Figure 10.png

You should notice that all the cells in the grid are the same color. This is because the conditional formatting is first based on the total cumulative percentage, which is 100. Therefore, all the values in the cells should be the same color.

 
Figure 11.png
 

Step 6: Add a new conditional formatting rule for the next expenditure component

The next step is to apply another conditional formatting rule for the next highest cumulative percentage value, which would be Government Public Health Access, which is 89 percent.

Figure 12.png

We repeat the process for each expenditure component. When we are done, we should have four conditional formatting rules for each expenditure component.

Figure 13.png

All the formulas for the conditional formatting are listed below:

Figure 14.png

Step 7: Repeat this process for the other decades

Once we complete these conditional formatting for the other decades, we can present these waffle charts together.

 

Final waffle chart

The following waffle chart incorporated the health expenditures for each decade starting from 1965 to 2015. The border color was changed to white and the label used the Century Gothic font. Inside each waffle is the percentage of Investment associated with each decade. You can download this exercise’s Excel file here.

Figure 15.png

Based on the waffle charts, we can see that Investments (spending for noncommercial biomedical research and expenditures by health care establishments on structures and equipment) has decreased over time. Conversely, expenditures for Personal Health Care, Administration, and Health Activities have increased.

CONCLUSIONS

Using waffle charts is a better alternative to pie charts because we can discern the exact value of the parts that make up the whole. In this case, we can easily visualize the decrease in Investments when it comes to health expenditure spending in the US for each decade between 1965 to 2015.

 

You can re-create these findings using the Excel file located here.

REFERENCES

I used the following references to assist with the development of this article. They have been incredibly helpful in learning the methods and better understanding how to leverage the power of using waffle charts.

 

[1] Tufte ER. The Visual Display of Quantitative Information. 2001. Graphic Press. Cheshire. CT.

 

Everyday Office’s YouTube video

https://www.youtube.com/watch?v=HZe5SzgxlsQ

Michael Sandberg’s Data Visualization Blog

https://datavizblog.com/2014/09/09/dataviz-squaring-the-pie-chart-waffle-chart/

Robert Kosara’s Eagereyes Blog

https://eagereyes.org/techniques/pie-charts

Sumit Bansal’s Trump Excel: The Smart Way Blog

https://trumpexcel.com/waffle-chart-excel/

Jonathan Schwabish’s PolicyViz Blog provides another method to creating waffle charts using data validation

https://policyviz.com/2018/04/26/interactive-waffle-charts-in-excel/

 

 

 

Communicating data effectively with data visualization - Part 11 (Waterfall charts)

BACKGROUND

Changes in data values happens. But when you want to visualize these changes, you need to choose the visualization that best explains the narrative. In this article, we will explore the use of waterfall charts, which are a form of data visualization that illustrates changes from the reference value across some sequentially ordered axis (e.g., time).

 

MOTIVATING EXAMPLE

I created a dataset where the average duration of an academic detailing educational outreach visit was estimated from Quarter 1 Fiscal Year 2017 (Q1 FY17) to Quarter 4 Fiscal Year 2018 (Q4 FY18). The initial dataset includes the reference (or base) average duration (in minutes) of an academic detailing educational outreach visit followed by positive and negative changes to the reference value. (Excel example can be retrieved from the following link.)

 
Figure 1.png
 

There are a few things to consider when reviewing this table. The reference value is 25 minutes, which was the average duration in Q1 FY17. However, in Q2 FY17, the average duration decreased by 5 minutes for a new value of 20 minutes. Then in Q3 FY17, the average duration increased by 15 minutes for a net gain of 10 minutes from the reference value of 25 minutes (25 + 10).

Figure 2.png

Understanding how these values reflect the net gain from the reference value is critical in building the waterfall chart.

 

TUTORIAL

Step 1: Create three new columns (base, fall, and rise)

We need to do some calculations to generate the base values for the waterfall chart.

Start by creating three additional columns and label them as base, fall, and rise (I learned how to design my table from this YouTube video by United Computers).

Figure 3.png

 Step 2: Estimate the value for the fall column

For the fall column, we want to look at the Duration column. If the value in the Duration column is positive, then we need to have the value in the “fall” column as zero. If the value in the Duration column is negative, when we want the absolute (positive) value in the “fall” column.  

Here is an illustration of what the “fall” column calculation requires.

Figure 4.png

Step 2: Estimate the value for the rise column

To estimate the rise column we, again, look to the Duration column. If the value in the Duration column is positive, we enter it into the “rise” column. If the value in the Duration column is negative, then we enter 0 in the “rise” column.

Here is an illustration of what the “rise” column calculation requires.

Figure 5.png

Step 3: Estimate the value for the base column

The base column provides the necessary reference for us to create the waterfall chart. Unlike the other columns, we start in the base column Q2 FY17 row since the “base” column value in Q1 FY17 is the reference. Then we add the “rise” column in Q1 FY17 and subtract the “fall” column in Q2 FY17. We do this for all the remaining cells.

Here is an illustration of what the “base” column calculation requires.

Figure 6.png

Step 4: Review the final data

The final data should look like the following.

Final data.png

Step 5: Create a stacked bar chart

Select the data from the Time, base, fall, and rise columns and then insert a stacked bar chart.

Figure 7.png

You should have a chart that looks like the following.

Figure 8.png

Step 6: Hide the base stacked bars

The next step is to hide the base column values by selecting the blue bars and changing the formatting to No Fill and No Line.

Figure 9.png

Step 7: Format the chart for final presentation

You should format the chart using colors that indicate the direction of the rise and fall in the average duration of educational outreach visits. I selected blue for an increase and red for a decrease. I also added the Y-axis label and decreased the gap width to 10%.

 

Final waterfall chart

Figure 10.png

CONCLUSIONS

The waterfall chart visualizes the rise and fall in average duration of an academic detailing visit across two fiscal years. This visualization provides the viewer with an intuitive summary of the changes that occurred and whether action plans are needed. From this example, we notice that the changes did not deviate from the base of 25 minutes per visit across two fiscal years. Therefore, there may not be a need to spend resources to investigate this element of academic detailing.

In this example, I re-created the waterfall chart to illustrate seasonal changes.

Figure 11.png

Like the previous waterfall chart, this is a clear and intuitive visualization of the changes that occurred across two fiscal years. The red bars easily illustrate that the average duration decreased immediately after Q1 FY17, but increased after Q1 FY18 only to follow the same pattern as the previous fiscal year. This tells us that something is going on during Q1 of each fiscal year, which may require some kind of investigation of the program at the beginning of each fiscal year.

Microsoft Office 365 for Windows and Microsoft Office 2016 for the Mac have a new feature for creating waterfall charts. You may need to check the Office Updates to acquire these new features.

 
waterfall icon.png
 

You can learn about these in the following Microsoft Office sites (link 1, link 2)

 

REFERENCES

I used the following YouTube video by United Computers to help generate the waterfall charts in this article.

I also referred to Cole N. Knaflic’s blog on waterfall charts.

 

Developing choropleths using the United States Veterans Integrated System Network (VISN) shapefiles

BACKGROUND

When I want to present VISN-level data, I consider using choropleths because they are visually appealing and provide a good reference to other VISNs. Choropleths are maps that uses polygons or shapes that are shaded according to a metric such that the color indicates the intensity of that metric. For instance, if you wanted to compare population density across different states, you can use a choropleth to illustrate this difference.

An example of a choropleth comes from the Centers for Disease Control and Prevention that illustrates the prevalence of obesity by state. The legend tells us the prevalence of obesity at each state and the colors denote the level of the prevalence. The cranberry color denotes an obesity prevalence of 35% or greater whereas a lighter green color with dots denotes a prevalence of less than 20%. This choropleth providers a quick visual guide on the prevalence of obesity across the United States (U.S.).

Figure 1.png

Motivating example

In past reports, I have generated a choropleth using VISN-level data. Unlike the U.S. shapefile (map files with coordinates; normally with the *.shp extension), the VISN shapefile is specific to the VA and doesn’t not follow the borders of the states used in typical U.S. shapefiles.

In this example, I will provide a step by step guide on how you can generate your own VISN-level choropleth for use in reports and presentations. The files for this tutorial are available on following Dropbox link.

 

TUTORIAL

Step 1: Download QGIS

Download QGIS, which is a free Geographic Information Systems (GIS) software for both the Windows and Mac operating systems. Watch the following video for a step-by-step guide on downloading and using the program. The program is simple to use and does not involve any coding. After you install the software, proceed to the next step. (Contact your local IT support to have this installed on a government-owned system.)

 

Step 2: Download the VISN shapefiles

The VA provides shapefiles online at the following link. Download the file titled FY2017_Q4_VISN.zip. This file will contain all the necessary files that you will need to build your choropleth.

 

Step 3: Download VISN-level data on total population

We will need VISN-level data to join with the VISN-level shapefile in QGIS. You can download the data from the following VA public site. Go to the Population Tables and download the “All Veterans Integrated Service Networks” Excel file. It will contain data on the total population at each VISN.

With QGIS, you will need to have two types of files for the data. I recommend using a text editor (not the Windows native notepad) such as Notepad++ or Brackets. In the text editor, open the file with the data and save it as a *.csv. The reason we do this is to make sure that the data is in text format. There are certain values that you want to ensure include the “0” in front of the other numbers (e.g., “01,” “02,” “03,” etc). If you don’t include the “0” you will not be able to join the data to the shapefiles.

After you save this as a *.csv (please include the extension onto the title), then you can open a new document on Notepad++ and enter the data column format. For instance, if column 1 is in text format, then type “String” for the first column. If the second data column is in numeric format, type “Integer.” We have a total of seven columns; therefore, we need to have seven data formats. See the example below.

Step 4: Open QGIS and add the VISN shapefile layer

Click on the Layer tab > Add Layer > Add Vector Layer and browse for the VISN shapefile.

Click on Open and make sure to click Add to add the VISN shapefile onto your QGIS software workspace. You should see the following image appear.

Notice how the polygons are in the form of the VISNs instead of the states. This is an important difference between what you see with the U.S. shapefiles and the VA shapefiles.

 

Step 5: Add the VISN-level population data

Include the VISN-level population data by downloading it from the VA public site. However, you can also use the Dropbox link that I host with the files already formatted for QGIS here. For this exercise, it would be easier to use the files I provide in the Dropbox link since the formatting may be challenging to implement. For more discussion about the proper formatting, please watch the following video.

You add the VISN-level population data by dragging and dropping it into the Layers panel. Use the file titled “visn_population_2018.csv” and make sure to drop it into the Layers panel. QGIS will automatically recognize the data types because the “visn_population_2018.csvt” file is in the same folder as the “visn_population_2018.csv.”

Step 6: Join the data to the shapefile

Double-click the VISN shapefile in the Layers panel; this will open a new window called the Layers Properties. Click on the JOIN icon and select the data you want to join to the VISN shapefile. Make sure to select VISN for the Join and Target field. This will use the VISN number to join the data to the shapefile. After you select OK, make sure to click on Apply.

Step 7: Adding classes and color

In the Layer Properties window, select the Symbology icon, which will open the menu to add classes and change the color of the different classes. Above the column field, select the Graduated level. In the Column field, select the visn_population_2018_Total, which is the total population of the VISN. Then select Quantile in the Mode field. Change the color ramp field to a blue hue. Click apply and you should immediately see the VISN shapefile file change colors in the workspace.

Your project workspace should look like the following map.

Step 8: Adding labels

Click on the Label icon and select the Single Labels option. Select the Labels variable and then click apply. This may take a while since QGIS has to identify the polygon’s location and insert the labels. The labeling may take about 3-5 minutes because the VISN shapefiles have layers and polygons. I recommend saving this step for when you export the final image generated using the composure function of QGIS to save yourself time.

After the labeling is done, you should see the VISN labels for each polygon.

 

Step 9: Use the composer to finalize your choropleth

The composer is QGIS’s workspace that allows you to customize the choropleth. Select the composer and name it “VISN population” and then select the sections you want to insert into the composer using the Adds New Map to the Layout icon. Once everything is finalized, you can export this as a *.png or *.pdf file. (At this time, you may turn on the labels if you waited to add these at the very end.)

This is what the choropleth looks like after we finish composing it.

Step 10: Add a coordinate reference system (CRS)

Right now, the map is not an accurate portrayal of the United States in relation to the surface of the Earth. It should be more round at the top due to the distance from the North Pole and the fact that the Earth is a sphere. To ensure that we are accurately portraying the U.S., we need to install the appropriate coordinate reference system (CRS). To do this we need to first click on the Properties of the project and select CRS. We add the CRS from the server using the Datum Transformations window. We change both the Source and Destination CRSs and use the USA_Contiguous_Albers_Equal_Area_Conic CRS (ID = EPSG:102003).

Once the CRS is installed into our CRS database, we can select it to change the shape of the map to correctly conform to the shape of the Earth.

This is what the choropleth looks like in the project workspace.

After we add the labels and compose the final elements, the choropleth looks like the following.

CONCLUSIONS

Using choropleths can highlight important differences across VISNs that would be lost in a table or difficult to present in an alternative chart. Based on the population levels, VISN 22, 17, 10 and 8 have large populations of veterans receiving care at the VA. Areas where there is low prevalence of veterans are in VISN 1, 5, 9, and 15. One thing to consider is that we did not normalize the data based on a single denominator. You can play around with how you want to do that by using the U.S. census, which can be found here. As an added exercise, see if you can create something similar using the U.S. shapefiles, which are located here. Additionally, you can use multiple choropleths (small multiples) to show changes across time or another dimension. Choropleths are excellent visuals that can contribute to a narrative; using the VISN shapefiles will allow you to generate visuals that can enhance a report or presentation.

 

REFERENCES

I used the following references to develop this tutorial.

https://www.youtube.com/watch?v=aLmMovuydqI&t=387s

https://www.youtube.com/watch?v=rG6UphZGmg4&t=615s

https://www.youtube.com/watch?v=LNJj3g6SRqU

 

Download QGIS here:

https://www.qgis.org/en/site/forusers/download

 

VA population data:

https://www.va.gov/vetdata/veteran_population.asp

 

VISN shapefiles:

https://catalog.data.gov/dataset/veterans-integrated-services-networks-visn-markets-submarkets-sectors-and-counties-by-geog

 

 

Communicating data effectively with data visualizations - Part 10 (Heat Maps)

BACKGROUND

A heat map is a data visualization tool that uses positioning and coloring to identify clusters and correlations in multivariable analysis. The most common type of heat map is a 2 x 2 matrix, where two variables are examined using the rows and columns (R x C) positions on a matrix. A heat map matrix helps us identify any patterns or similarities across the different dimensions. In a heat map, color is critical in denoting degrees of change. (Please see refer to the past blog on colors.)

For example, we can see the changes in opioid overdose rates across time for each state (Figure 1). As the rates increase, the cells are darker (indicated by the legend). All the states experience an increase in opioid overdose deaths, but State 1 is experiencing it faster than States 2 and 3.

In this tutorial, we will perform two methods to create heat maps. The first method will use the built in Excel conditional formatting rules and the second method will use VBA macros.

Figure 1. Heat map matrix.

Figure 1 - heatmap matrix.png

MOTIVATING EXAMPLE

We will continue to use the state-level drug overdose mortality data from the CDC.

https://www.cdc.gov/drugoverdose/data/statedeaths.html

Mortality rate is presented as the number of deaths per 100,000 population.

In this tutorial, we will develop a heat map using opioid overdose mortality from 2013 to 2016.

The data setup in Excel has State indicators as the rows and time indicators as the columns. We are visualizing the change in opioid overdose mortality from 2013 to 2016 for each state. Figure 2 illustrates the data structure for the first seven states.

 

Figure 2. Data structure for the first seven states.  

Figure 2 - Data structure.png

 

METHOD 1: CONDITIONAL FORMATTING RULES

Excel has a convenient tool that allow us to use conditional formatting to shade our heat map matrix.

Step 1: First highlight all the data.

 

Step 2: In the Excel Ribbon, select Conditional formatting and then New.

Figure 3 - conditional formatting.png

Step 3: Select “Format all cells based on their values” and change the Format Style to “3-Color scale.” Change the color to the different shades you are interested in using. (In this example, we used a blue base with varying degrees of shading.) Select percentile and then click “Ok.” The percentile will use the Median for each column to distinguish the middle category for the rates of opioid overdose mortality for each state.

Figure 4 - 3-color scale.png

Step 4: Visually inspect the results. If there are no apparent pattern in the heat map, we will need to sort the rate of opioid overdose mortality for 2016 in descending order.

Figure 4 - Descending order.png

The heat map should look like the following:

 
Figure 6 - heat map example 1.png
 

There is a pattern emerging in regards to the rate of opioid overdose mortality across the different states. West Virginia has the highest opioid overdose mortality rate in 2016, but they also appear to have the highest from 2013 to 2015. The dark cell in 2016 indicates that West Virginia has exceeded the 50 deaths per 1000 population incidence rate. Other states also have high rates of opioid overdose mortality across 2013 to 2016, which continued to increase. The 2013 column is clearly lighter indicating that the opioid overdose mortality has increased over time up to the available data in 2016.

 

Step 5: We can improve this heat map by removing the numbers, which can be distracting. We need to select all the data and format the cells. Select only the numeric data and then format the number. Use the “Custom” category and enter the following: "";"";"";"". This will change the number values so that it doesn’t show in the cells.

Figure 7 - change the number format.png

Step 6: Change the column width and row height so that you have a nice square-like matrix for the final result. We used a column width of 5 and a row height of 30. (Only the first fifteen states are shown.)

The heat map can be used to quickly identify states with the highest opioid overdose mortality and the trends across time. The states with the highest rates of opioid overdose mortality are clustered at the top while the states with the lowest rates are clustered at the bottom.

We could also arrange this into regions of the US to further stratify the results (not shown).

 

METHOD 2: USING VBA MACROS FOR CONDITIONAL FORMATTING OF MORE THAN 3 CATEGORIES

Excel only allows us to choose up to 3-Color scales. If we wanted to use more than 3 color categories, we will need to use VBA macros.

But before we do, we need to think about the colors for the scales. Since we have more than 3 categories, we will need to figure out how to divide the colors.

 

COLOR CODES

We will need to determine the base color for our heat map. In this example, we will use a blue base-color and change the shading using the RGB color values. RGB colors are based on a system using a combination of three base colors (red, green, and blue) that can be used to change the intensity of the color from a range between 0 and 250. An example of an RGC color table can be found in the following site.

For this example, we used the following RGB color values where dark-navy denotes high rates of opioid overdose mortality (50 or more per 1000 population). However, you can change the values of these colors however you like.

Figure color codes.png

We will continue to use the blue color base and change the gradient using the RGB values.

 

VISUAL BASIC FOR APPLICATION (VBA) MACRO

The VBA macro comes from the site Excel For Beginners and written by Kristoff deCunha. We used the VBA code on the site, but modified it for this tutorial.

The VBA code is written in a way where any changes in the values will automatically update the color of the cell. Additionally, the code also sorts the 2016 opioid overdose mortality rate in descending order, adds thin-white continuous borders around the cells, and changes the font to Times New Roman.

Here are the VBA macros used for Method 2.

 

Macro 1 changes the font to Times New Roman.

‘Macro1
Sub ChangeFont()

Dim rng As Range
Set rng = Range("J1:N52")

With rng.Font
    .Name = "Times New Roman"
    .Size = 12
    .Strikethrough = False
    .Superscript = False
    .Subscript = False
    .OutlineFont = False
    .Shadow = False
    .Underline = xlUnderlineStyleNone
    .ThemeColor = xlThemeColorLight1
    .TintAndShade = 0
    .ThemeFont = xlThemeFontNone
End With

End Sub

 

Macro 2 changes the cell shading to the 6-Color scale (blue-base)

‘Macro2
Sub ChangeCellColor()


Dim rng As Range
Dim oCell As Range

Set rng = Range("K2:N52")

For Each oCell In rng
        Select Case oCell.Value
                 Case 50 To 100
                     oCell.Interior.Color = RGB(37, 54, 97)
                     oCell.Font.Bold = True
                     oCell.Font.Name = "Times New Roman"
                     oCell.HorizontalAlignment = xlCenter
                 Case 40 To 49.999
                     oCell.Interior.Color = RGB(56, 83, 145)
                     oCell.Font.Bold = True
                     oCell.Font.Name = "Times New Roman"
                     oCell.HorizontalAlignment = xlCenter
                 Case 30 To 39.999
                     oCell.Interior.Color = RGB(147, 168, 215)
                     oCell.Font.Bold = True
                     oCell.Font.Name = "Times New Roman"
                     oCell.HorizontalAlignment = xlCenter
                 Case 20 To 29.999
                     oCell.Interior.Color = RGB(183, 198, 228)
                     oCell.Font.Bold = True
                     oCell.Font.Name = "Times New Roman"
                     oCell.HorizontalAlignment = xlCenter
                 Case 10 To 19.999
                     oCell.Interior.Color = RGB(218, 225, 240)
                     oCell.Font.Bold = True
                     oCell.Font.Name = "Times New Roman"
                     oCell.HorizontalAlignment = xlCenter
                 Case 0 To 9.999
                     oCell.Interior.Color = RGB(230, 237, 253)
                     oCell.Font.Bold = True
                     oCell.Font.Name = "Times New Roman"
                     oCell.HorizontalAlignment = xlCenter
                  Case Else
                     oCell.Interior.ColorIndex = xlNone
         End Select
    Next oCell
    
End Sub

 

Macro 3 sorts the 2016 opioid overdose mortality rates in descending order.

‘Macro3
Sub SortColumn()

Dim DataRange As Range
Dim keyRange As Range
Set DataRange = Range("J1:N52")
Set keyRange = Range("N1")
DataRange.Sort Key1:=keyRange, Order1:=xlDescending

End Sub

 

Macro 4 hides the font from the heat map.

‘Macro4
Sub HideFont()
    
Dim rng As Range
Dim oCell As Range

Set rng = Range("K2:N52")

    For Each oCell In rng
        oCell.Font.Color = oCell.Interior.Color
    Next oCell
    
End Sub

 

Macro 5 creates thick white borders for each cell in the table.

‘Macro5
Sub WhiteOutlineCells()

    Dim rng As Range

    Set rng = Range("J1:N52")

    With rng.Borders
        .LineStyle = xlContinuous
        .Color = vbWhite
        .Weight = xlThick
    End With
    
End Sub

INSERTING DATA AND RUNNING MACRO

The five macros are assigned to a button in the Excel Macro-Enabled Workbook. Pressing the button will perform the task of creating a 6-Color scale heat map. Download the Excel Macro-Enabled Workbook here. This file will have the raw data and the macro-enabled worksheet for you to create a heat map.

Step 1: Copy the raw data from the “combined” worksheet.

Figure 9 - Copy Data.png

Step 2: Paste it in the worksheet “heatmap_2” starting on cell "J1".

Figure 10 - Copy onto J1.png

Step 3: Then press the “PressStart” button to run the macros. Your final heat map should look like the following:

Compare the heat map from Method 1 (3-Color scale) to the one generated by Method 2 (6-Color scale). The heat map with the 6-Color scale has a lighter pattern compared to the 3-Color scale heat map. The differences are dramatic. Depending on the granularity of the heat map you want, either one of these color scales would be fine. However, Method 2 requires some VBA coding.

* Not all states shown.

 

Conclusions

Heat maps allow us to observe patterns in the data. In our example, we notice that West Virginia has a high rate of opioid overdose mortality indicated by the clustering of dark cells from 2013 to 2016. Other states had similar patterns as West Virginia. Using heat maps provides a quick and easy interpretation of the changes in opioid overdose mortality across time and the states that are clustered together that have high rates of opioid overdose mortality.

 

REFERENCES

I used the following websites to help develop this tutorial.

Conditional formatting with more than 3 categories:
https://sites.google.com/site/excelforbeginners/Home/vba-codes/conditional-formatting-more-than-3-criterias
 
Changing the RGB color codes:
https://analysistabs.com/excel-vba/change-background-color-of-cell-range/
https://www.w3schools.com/colors/colors_rgb.asp

Excel color palette library:
http://dmcritchie.mvps.org/excel/colors.htm

Excellent site for VBA coding:
https://www.excel-easy.com/vba.html