JMPer Cable

Richard_Zink

For those who are unfamiliar with the Pareto plot, this website presents an interesting example where the plot is applied to medication errors in a hospital setting. I use this example in data visualization courses that I teach (Figure 1).

Figure 1. Pareto plot of medication errors.

The idea behind Pareto plots is to help analysts identify the most common sources of quality issues. To address the majority of issues (say, 80%), the goal is to implement solutions for the error types shown in red in Figure 1. In other words, if I resolve the quality issues behind the medication errors labeled missed doses, wrong time, wrong drug, and overdose, approximately 80% of all medication errors will be addressed. Note that these solutions may also address some of the less frequent medication errors, and they may not resolve every issue related to the four main error types. However, at least I have a clear idea of where to focus to get the biggest bang for my effort.
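The 80% cutoff described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic behind a Pareto plot: the counts are hypothetical (they are not the values behind Figure 1), and the function names are my own.

```python
from itertools import accumulate

def pareto_table(counts):
    """Return (category, count, cumulative percent) rows, largest count first."""
    ordered = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    total = sum(count for _, count in ordered)
    running = accumulate(count for _, count in ordered)
    return [(name, count, 100.0 * cum / total)
            for (name, count), cum in zip(ordered, running)]

def categories_to_reach(counts, target_pct=80.0):
    """Leading categories needed for the cumulative share to reach target_pct."""
    selected = []
    for name, _, cum_pct in pareto_table(counts):
        selected.append(name)
        if cum_pct >= target_pct:
            break
    return selected

# Hypothetical counts for illustration only, NOT the data shown in Figure 1.
example_counts = {
    "missed doses": 40, "wrong time": 25, "wrong drug": 10, "overdose": 8,
    "wrong patient": 7, "wrong route": 5, "other": 5,
}
vital_few = categories_to_reach(example_counts)
```

With these made-up counts, the first four categories are the ones that carry the cumulative share past 80%.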

For the remainder of this post, I work with simulated data from 100 patients with the frequency of an outcome recorded on each of 10 possible study days (Figure 2). This outcome could be anything, such as the frequency of adverse events experienced on that day, or the number of times the patient needed to apply eye drops to address allergic symptoms. I will return to the First Observation column later.

Figure 2. Unsorted simulated data of events.

We can produce a Pareto-like plot of event frequency by patient (Figure 3).

Figure 3. Pareto plot of events by patient.

In Figure 3, the X axis represents patients sorted in descending order by the total number of events they experienced over the 10-day period; the Y axis is the total number of events experienced. The black line is the cumulative proportion of the total number of events as we proceed from left to right in the plot. You can use the grid lines to find the percentage of interest (say, 80%) and hover over the black line to identify the patient at which the cumulative total reaches approximately that percentage of all events. In Figure 3, approximately 80% of events is reached at Patient 6. Figure 3 has some limitations:

It does not describe the cumulative event frequency.
It does not communicate the cumulative frequency or percentage of patients involved among the 80% of events.

The Figure 2 data table has a column called First Observation, which contains the formula presented in Figure 4. This formula labels the first record within each patient with a 1 and all other records with a 0.

Figure 4. Formula to identify the first record for each patient.

This formula uses the Col Min function from the Statistical category. The statistical formulas that summarize column data are some of my favorites in JMP. At a later time, I’ll produce a more detailed blog post on these functions, full of useful examples, including a detailed description of the formulas used below to produce columns within the sorted data table.
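The formula in Figure 4 is not reproduced in text here, so the following Python sketch only illustrates the result it produces. It assumes the flag is 1 on the row whose index is the smallest row index for that patient, which is one natural way a column minimum could be used; the function name and the sample data are hypothetical.

```python
def first_observation_flags(patient_ids):
    """Flag the first row seen for each patient with 1, all later rows with 0."""
    first_row = {}
    for row, pid in enumerate(patient_ids):
        # setdefault keeps the first (minimum) row index recorded per patient
        first_row.setdefault(pid, row)
    return [1 if first_row[pid] == row else 0
            for row, pid in enumerate(patient_ids)]

# Hypothetical patient column: three rows for patient 1, two for 2, one for 3.
patients = [1, 1, 1, 2, 2, 3]
flags = first_observation_flags(patients)
```

Summing the flags gives the number of distinct patients, which is what makes the column useful for counting patients in a table that has one row per patient-day.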

I can use this variable to produce a cumulative patient percentage (in red) in the Pareto plot of Figure 5. The point at which approximately 80% of cumulative events is reached corresponds to 69% of the patients. Admittedly, obtaining this estimate is a bit cumbersome, as it requires me to drop from the black line to the red line and attempt to match up the needed information.

Figure 5. Pareto plot with cumulative patient percentage.
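Matching the two lines by eye can also be done numerically. The Python sketch below is an assumption-laden illustration: it takes patient totals that are already sorted in descending order and returns the point whose cumulative event percentage is closest to the target, along with the corresponding cumulative patient percentage. The data and the function name are hypothetical.

```python
def nearest_to_target(sorted_totals, target_pct=80.0):
    """Return (rank, cumulative event %, cumulative patient %) closest to target_pct.

    sorted_totals must already be in descending order, one value per patient.
    """
    grand_total = sum(sorted_totals)
    n_patients = len(sorted_totals)
    running = 0
    best = None
    for rank, total in enumerate(sorted_totals, start=1):
        running += total
        event_pct = 100.0 * running / grand_total
        patient_pct = 100.0 * rank / n_patients
        gap = abs(event_pct - target_pct)
        if best is None or gap < best[0]:
            best = (gap, rank, event_pct, patient_pct)
    return best[1:]

# Hypothetical totals for four patients, largest first.
rank, event_pct, patient_pct = nearest_to_target([10, 5, 3, 2])
```

Choosing the nearest point rather than the first point at or above the target is a design choice; it matches reading "approximately 80%" off the plot.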

However, Figure 5 still has some issues:

It does not communicate the cumulative frequency of patients or events.
To get data for the cumulative percentage of patients or events, I have to hover over each line individually, which is not convenient.

Note that I could solve the cumulative frequencies issue by adding two additional parallel Y axes: one to communicate cumulative event frequency and one to communicate cumulative patient frequency. However, with this approach:

There are two additional Y axes and two additional lines making the plot more crowded.
I still need to hover over each line individually to get the data of interest.

I can solve these problems, but it requires some additional work in the data table. Figure 6 presents the above data table with some additional variables defined, and sorted by Patient Total (descending), Patient, and Day.

Figure 6. Sorted simulated data of events.

Why the extra steps?

The Patient Total column allows me to sort the data table to produce the four cumulative frequency and percentage variables according to the order of the X axis needed for the plot.
Computing these variables in the data table (or alternatively within Graph Builder) allows me to label the rows according to these columns so that no matter where I hover, I will get a complete picture of cumulative events and patients.
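To make the derived columns concrete, here is a minimal Python sketch of the same idea. It is not the JMP formulas from the data table: it summarizes to one row per patient instead of repeating values on every patient-day row, and the column names and sample records are hypothetical.

```python
from collections import defaultdict
from itertools import accumulate

def patient_pareto_rows(records):
    """records: iterable of (patient, day, events) tuples, one per patient-day.

    Returns one dict per patient, ordered by descending patient total, with
    cumulative event and patient frequencies and percentages.
    """
    totals = defaultdict(int)
    for patient, _day, events in records:
        totals[patient] += events

    # Sort by patient total (descending), then by patient to break ties.
    ordered = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))
    grand_total = sum(totals.values())  # assumed to be greater than zero
    n_patients = len(ordered)

    rows = []
    cum_events = accumulate(total for _, total in ordered)
    for rank, ((patient, total), cum) in enumerate(zip(ordered, cum_events), start=1):
        rows.append({
            "patient": patient,
            "patient_total": total,
            "cum_event_freq": cum,
            "cum_event_pct": 100.0 * cum / grand_total,
            "cum_patient_freq": rank,
            "cum_patient_pct": 100.0 * rank / n_patients,
        })
    return rows

# Hypothetical data: four patients observed on two days each.
records = [("A", 1, 1), ("A", 2, 2), ("B", 1, 5), ("B", 2, 5),
           ("C", 1, 0), ("C", 2, 2), ("D", 1, 3), ("D", 2, 2)]
rows = patient_pareto_rows(records)
```

Because every row carries all four cumulative values, any single row answers both questions at once: how many events, and how many patients, have accumulated up to that point.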

See Figure 7.

Figure 7. Patient Pareto with cumulative frequencies and percentages.

Figure 7 shows that regardless of whether I hover on the cumulative percentage lines or the Event Frequency bars, I get a complete picture of the data. Because this study has 100 patients, Patient 6 is the 69th patient in terms of both cumulative frequency and cumulative percentage, and this patient represents 807 cumulative events (which is 79.74% of 1,012 total events across all patients).

I glossed over a brief point in the Figure 6 data table. These columns use several statistical column formulas, and as I mentioned above, I will describe them in detail in a future blog post. The important point to note, however, is that all of these calls use the Excluded option. This option is critical so that when a Data Filter is used to subset the patients, the cumulative variables update appropriately according to this smaller population. This step was not required for the Figure 2 data table, as Graph Builder handled those computations automatically.
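The effect described here, cumulative values that are recomputed over only the patients that remain after filtering, can be illustrated in Python. This sketch does not reproduce how JMP's Excluded option or the Data Filter works internally; it only shows that the denominator and the running total must both be restricted to the kept patients. The totals, the sex lookup, and the function name are hypothetical.

```python
def cumulative_event_pct(patient_totals, keep=lambda patient: True):
    """Cumulative event percentage per patient, computed only over kept patients."""
    kept = {p: t for p, t in patient_totals.items() if keep(p)}
    ordered = sorted(kept.items(), key=lambda kv: (-kv[1], kv[0]))
    grand_total = sum(kept.values())  # denominator uses the subset only
    running, out = 0, {}
    for patient, total in ordered:
        running += total
        out[patient] = 100.0 * running / grand_total
    return out

# Hypothetical patient totals and sex assignments.
totals = {"A": 3, "B": 10, "C": 2, "D": 5}
sex = {"A": "F", "B": "M", "C": "F", "D": "F"}

everyone = cumulative_event_pct(totals)
females = cumulative_event_pct(totals, keep=lambda p: sex[p] == "F")
```

In this toy example, patient D sits at 75% of cumulative events in the full population but at 50% among females only, which is the kind of change the cumulative columns need to reflect when a filter is applied.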

Figure 8 presents a Patient Pareto plot for females, of which there are 20 in this study, using the Data Filter. At approximately 80% of cumulative events, 160 events have occurred (of 204 total) in 13 patients, representing 65% of female patients.
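As a quick arithmetic check, the percentages quoted for the full study (807 of 1,012 events in 69 of 100 patients) and for the female subset (160 of 204 events in 13 of 20 patients) can be recomputed in Python. The helper name is my own.

```python
def pct(part, whole):
    """Percentage of whole represented by part, rounded to two decimals."""
    return round(100.0 * part / whole, 2)

# All patients, at Patient 6 (Figure 7).
all_events_pct = pct(807, 1012)
all_patients_pct = pct(69, 100)

# Female patients only (Figure 8).
female_events_pct = pct(160, 204)
female_patients_pct = pct(13, 20)
```

The female event percentage works out to about 78.4%, which is the closest available point to the 80% target in that subset.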

Figure 8. Patient Pareto for females.

You can download the data tables from Figures 2 and 6 from the links in the upper right of this post to examine formulas for the cumulative frequencies and percentages. The embedded Patient Pareto scripts enable you to produce the plots, which you can explore in further detail by opening the Control Panel from the red triangle menu. From here, you can see how variables were assigned to specific roles with the necessary options to produce the plot.