box plots

Communicating data effectively with data visualizations: Part 32 (John W. Tukey short biography)

INTRODUCTION

John Wilder Tukey (1915 to 2000) was a mathematician, statistician, and data visualization pioneer. He is credited with coining computer science terms such as “bit” (short for binary digit) and “software.” However, Tukey is best remembered for his contributions to data visualization.

Tukey developed the foundations for exploratory data analysis (EDA), which has taken root as a critical first step in understanding the complexities of observations. According to Tukey, EDA is about hypothesis generation. Unlike confirmatory analysis, EDA uses data to identify potential hypotheses that explain observed phenomena and to assist with the selection of appropriate statistical tests. Data visualization plays an important role in EDA because visualizing the data with his methods allows analysts to better understand it. Tukey was responsible for developing innovative data visualizations including a new count tally system, stem-and-leaf displays, and box-and-whisker plots. This article will highlight several of Tukey’s innovative data visualizations that continue to be used in today’s data analysis world.

 

“The greatest value of a picture is when it forces us to notice what we never expected to see.”

John W. Tukey

Exploratory Data Analysis (1977)

Count tally

The conventional tally uses the vertical “stroke” method, in which a single pencil stroke denotes a single count. On the fifth count, a diagonal line is drawn across the four vertical strokes. This can cause miscounts because the strokes are difficult to interpret. Here are some errors that Tukey highlights in his textbook, Exploratory Data Analysis:

Source: Tukey JW. Exploratory Data Analysis. Pearson; 1st edition (January 1, 1977)

Tukey developed an innovative tally method that was more efficient than conventional methods, using dots and lines to indicate counts. This method is considered easier to interpret and less prone to miscounts, especially when the counting is performed quickly.
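To make the dot-and-line idea concrete, here is a minimal Python sketch. It assumes the scheme as it is usually described from Exploratory Data Analysis: each glyph holds ten counts, built from four dots (counts 1 to 4), four lines joining the dots into a box (counts 5 to 8), and two diagonals (counts 9 and 10). The function names tukey_tally and tally_total are hypothetical and only for illustration.

```python
def tukey_tally(count):
    """Split a count into Tukey-style tally glyphs of up to ten marks each.

    Returns a list of (dots, sides, diagonals) tuples, one per glyph.
    Assumed scheme: 4 dots, then 4 box sides, then 2 diagonals.
    """
    if count < 0:
        raise ValueError("count must be non-negative")
    glyphs = []
    remaining = count
    while remaining > 0:
        marks = min(remaining, 10)         # a full glyph holds ten counts
        dots = min(marks, 4)               # counts 1-4: corner dots
        sides = min(max(marks - 4, 0), 4)  # counts 5-8: lines joining the dots
        diagonals = max(marks - 8, 0)      # counts 9-10: the two diagonals
        glyphs.append((dots, sides, diagonals))
        remaining -= marks
    return glyphs


def tally_total(glyphs):
    """Recover the count by adding up every mark in every glyph."""
    return sum(dots + sides + diagonals for dots, sides, diagonals in glyphs)


print(tukey_tally(23))  # [(4, 4, 2), (4, 4, 2), (3, 0, 0)]
```

A count of 23 becomes two complete glyphs and one partial glyph with three dots, so the total can be read at a glance as 10 + 10 + 3.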

 
 

Stem-and-leaf display

Although histograms show the overall shape of a distribution (for example, whether the data look approximately normal), they hide the individual values. Tukey believed that the histogram left out critical information that would be informative to the analyst, so he developed a method that captures additional detail while still visualizing the data’s distribution. This visualization, known as the stem-and-leaf display, gives data analysts both descriptive information and the data distribution. By using the stem-and-leaf display, a data analyst can observe the raw values of the data and quickly identify the mode and outliers. The following is a figure taken from Tukey’s 1977 text Exploratory Data Analysis. The figure represents two different stem-and-leaf plots that display the same data. The “#” represents the frequency in each bin, and the bins are ordered by the leading digit of the value. For example, the value 16 is represented as 1 | 6.

 

Source: Tukey JW. Exploratory Data Analysis. Pearson; 1st edition (January 1, 1977)
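As a rough illustration of how a stem-and-leaf display is built, here is a minimal Python sketch. It assumes non-negative whole numbers, uses the last digit as the leaf and everything before it as the stem, and runs on made-up sample values (not the data in Tukey's figure). The function name stem_and_leaf is hypothetical.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Return the lines of a simple stem-and-leaf display.

    Assumes a non-empty list of non-negative integers; the last digit
    is the leaf and the remaining leading digits form the stem.
    """
    stems = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)
        stems[stem].append(leaf)
    lines = []
    # Print every stem in the range, even empty ones, so gaps stay visible.
    for stem in range(min(stems), max(stems) + 1):
        leaves = "".join(str(leaf) for leaf in stems.get(stem, []))
        lines.append(f"{stem} | {leaves}")
    return lines


sample = [16, 12, 25, 31, 18, 25, 47]  # hypothetical values
for line in stem_and_leaf(sample):
    print(line)
# 1 | 268
# 2 | 55
# 3 | 1
# 4 | 7
```

The row 1 | 268 holds the values 12, 16, and 18, so the raw data stay readable while the row lengths trace the shape of the distribution. The number of leaves in each row could be appended to mimic the “#” column.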

 

Box-and-whisker plots

Tukey discussed improving the box-and-whisker plot (also known as the box plot) by standardizing the maximum whisker length at 1.5 times the interquartile range (IQR), measured from the edges of the box. This allows analysts to identify outliers as the values that fall beyond the whiskers. In one example from his textbook, Tukey highlights the benefits of the box-and-whisker plot by displaying the elevations of states and volcanoes. The reader can easily identify the outliers as they are labeled (by Tukey’s hand) on the plots.
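The 1.5 times IQR rule can be sketched in a few lines of Python. This is an illustration under stated assumptions: the quartiles come from the default method of Python's statistics.quantiles, which is only one of several quartile conventions and may differ slightly from the hinges Tukey used, and the data are made up. The function names tukey_fences and find_outliers are hypothetical.

```python
import statistics


def tukey_fences(values, k=1.5):
    """Return the (lower, upper) limits k * IQR beyond the quartiles."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def find_outliers(values, k=1.5):
    """Return the values that fall outside the fences."""
    lower, upper = tukey_fences(values, k)
    return [v for v in values if v < lower or v > upper]


data = [10, 12, 13, 14, 15, 16, 18, 60]  # hypothetical values
print(tukey_fences(data))
print(find_outliers(data))
```

In this made-up sample, only the value 60 lies beyond the upper limit, so it would be drawn as an individual point rather than covered by the whisker.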

CONCLUSIONS

Tukey made significant contributions to mathematics, computer science, statistics, and data analysis. But his pursuit of efficient methods to display data led to innovative methods of data visualization that we continue to use. Data visualization, according to Tukey, was an important part of analysis from which we could generate hypotheses and select the appropriate inferential tests. He saw the world in a different way, which has helped us shed a little light on the mysteries of the world.

 

REFERENCE

1. Tukey JW. Exploratory Data Analysis. Pearson; 1st edition (January 1, 1977).

Communicating data effectively with data visualization - Part 13 (Box and Whisker Diagrams)

BACKGROUND

A box plot (box and whisker diagram) is a great way to display the distribution of a continuous (e.g., interval) data variable. A typical box plot will contain the mean, median, interquartile values, and the minimum and maximum values. Figure 1 illustrates these elements on a box plot. Until recently, Microsoft Excel did not have an option to graph box plots. However, in the 2016 version of Microsoft Excel, box plots were added as part of the statistical features.

Figure 1. Example of a box plot (box and whisker diagram).

 
Figure 1.png
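For readers who want to check the elements of a box plot by hand, here is a minimal Python sketch that computes them for one variable. It assumes the default quartile method of Python's statistics.quantiles, which may not match the quartiles Excel draws exactly, and the visit counts are made up. The function name box_plot_summary is hypothetical.

```python
import statistics


def box_plot_summary(values):
    """Return the numbers a typical box plot displays for one variable."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    return {
        "min": min(values),
        "q1": q1,          # lower edge of the box
        "median": median,  # line inside the box
        "q3": q3,          # upper edge of the box
        "max": max(values),
        "mean": statistics.mean(values),
    }


visits = [38, 41, 44, 45, 47, 52, 55]  # hypothetical visit counts
for name, value in box_plot_summary(visits).items():
    print(name, value)
```

The distance between q1 and q3 is the interquartile range, which sets the height of the box.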
 

MOTIVATING EXAMPLE

We will use randomly generated data to create box plots across four hypothetical quarters (Q1FY19, Q2FY19, Q3FY19, and Q4FY19). The data will contain the number of visits to the doctor from several outpatient specialty clinics. Here is what the data looks like from the first two sites. Data for the example can be found here.

 
Figure 2.png
 

Site 1 has 45 visits in Q1FY19 and Site 2 has 44 visits in Q1FY19. To create the box plots, we need to use the long format, which uses multiple rows for each site (one row per site and quarter).
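If your data start in the wide format (one row per site, one column per quarter), the reshaping can also be done outside Excel. Here is a minimal Python sketch of that step. Only the Q1FY19 values of 45 and 44 come from the example; every other number is a made-up placeholder, and the column names and the function name wide_to_long are hypothetical.

```python
QUARTERS = ["Q1FY19", "Q2FY19", "Q3FY19", "Q4FY19"]

# Only the Q1FY19 values (45 and 44) come from the article;
# every other number is a made-up placeholder.
wide_rows = [
    {"site": "Site 1", "Q1FY19": 45, "Q2FY19": 50, "Q3FY19": 58, "Q4FY19": 41},
    {"site": "Site 2", "Q1FY19": 44, "Q2FY19": 47, "Q3FY19": 61, "Q4FY19": 39},
]


def wide_to_long(rows, quarters):
    """Turn one row per site into one row per site and quarter."""
    long_rows = []
    for row in rows:
        for quarter in quarters:
            long_rows.append(
                {"site": row["site"], "quarter": quarter, "visits": row[quarter]}
            )
    return long_rows


long_rows = wide_to_long(wide_rows, QUARTERS)
for r in long_rows:
    print(r["site"], r["quarter"], r["visits"])
```

Each site now occupies four rows, one per quarter, which is the layout described above.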

 

EXERCISE

In this article, we will generate box plots that will visualize the average number of visits and its distribution across quarters.

Figure 3.png

After clicking on the Box and Whisker plot, you will need to select the data that will be used to generate the box plots across the quarters.  

Figure 4.png
Figure 5.png
Figure 6.png

Click “OK” and the default box plot will look like Figure 2.  

Figure 2. Default box plot generated by Excel 2016.

Figure 7.png

After a few changes to the color and labels, our box plot can be improved (Figure 3).

Figure 3. Updated box plots.

Figure 8.png

These box plots give us an idea of the changes in the number of visits across quarters, including the distribution of the data. For each box plot, the mean (indicated by the “X”) is not too different from the median (indicated by the solid horizontal line). However, there is greater variation in the distribution of the number of visits in Q2FY19 and Q3FY19 compared to Q1FY19 and Q4FY19. We can see that there was an increase in the number of visits, on average, between Q1FY19 and Q3FY19, but this dropped noticeably in Q4FY19. This may be due to some kind of change (e.g., seasonal variation) and should be explored.
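The same comparison can be made numerically. The sketch below groups long-format records by quarter and reports the mean, the median, and the IQR for each one. The records are made up and do not reproduce the example data, the quartiles use the default method of Python's statistics.quantiles, and the function name summarize_by_quarter is hypothetical.

```python
import statistics
from collections import defaultdict


def summarize_by_quarter(records):
    """Group (quarter, visits) pairs and summarize each quarter."""
    by_quarter = defaultdict(list)
    for quarter, visits in records:
        by_quarter[quarter].append(visits)
    summary = {}
    for quarter, values in by_quarter.items():
        q1, median, q3 = statistics.quantiles(values, n=4)
        summary[quarter] = {
            "mean": statistics.mean(values),
            "median": median,
            "iqr": q3 - q1,  # height of the box: spread of the middle half
        }
    return summary


# Hypothetical records, five sites per quarter.
records = [
    ("Q1FY19", 44), ("Q1FY19", 45), ("Q1FY19", 46), ("Q1FY19", 45), ("Q1FY19", 43),
    ("Q3FY19", 50), ("Q3FY19", 62), ("Q3FY19", 55), ("Q3FY19", 70), ("Q3FY19", 48),
]
for quarter, stats in summarize_by_quarter(records).items():
    print(quarter, stats)
```

A larger IQR corresponds to a taller box, and a mean close to the median corresponds to the “X” sitting near the horizontal line.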

 

CONCLUSIONS

The box plot provides us with a clear visualization of the mean number of visits across quarters, including the variation and distribution of the data. Plotting these in Microsoft Excel 2016 will allow you to explore your data and motivate you to generate explanations or hypotheses for its behavior.

REFERENCES

I used several online references to write this article.

The Dummies series provides a good illustration of the box plot elements, which can be located here.

I watched this YouTube video by stickpet on how to use Microsoft Excel 2016 to generate box plots.