Use JMP to find patterns (and anomalies) in data

jacob_rhyne · Dec 10, 2021 08:45 AM

Earlier this year, my manager sent me a very interesting post from Data Colada, a blog post written by three behavioral scientists that examines published research. The authors of Data Colada worked with a team of anonymous researchers and found some curious anomalies in a car mileage data set that was used in a 2012 paper. I am responsible for testing the Explore Patterns platform in JMP, which can be used to identify unusual or unexpected patterns in data. In this post, I use Explore Patterns, along with other useful tools in JMP, to search for unexpected or unusual patterns in the car mileage data.

Background

The first few rows and columns of the car mileage data set are shown below. These data were collected from a study in which 13,488 customers of a particular auto insurance agency were asked to report their current odometer readings. The readings reported by the customers are included in the data as the update_car1 to update_car4 columns, as each customer reported information for up to four cars. The data also contain the odometer readings for each vehicle when the insurance policies were first issued. These columns are named baseline_car1 to baseline_car4. Since all policies were not issued at the same time, the time elapsed from the baseline to the update varies for the observations in this study.

First Few Rows.png

In the Data Colada post, the researchers presented four anomalies that were found in this data set. Let’s dig into each of these findings using JMP.

Anomaly #1: The distribution of the miles driven is uniform

The authors at Data Colada found that the distribution of miles driven was consistent with a uniform distribution ranging from 0 to 50,000 miles.

Since the data include a baseline and update column for each car's odometer readings, we create formula columns to store the miles driven by subtracting the baseline values from the updated values. In JMP, we then use the Distribution platform to view the distribution for the baseline and updates values, as well as the total miles driven for each car. Consider the Distribution report of the first car for each customer.

First Car Distribution.png

The shape of the distribution of baseline mileage is similar to the distribution of the updated mileage, with the median odometer reading about 26,000 miles higher at the update time. This finding is not by itself surprising. What I find most interesting about the baseline and update readings is that two participants owned vehicles that have been driven nearly 1 million miles! They must have very good mechanics.

The most alarming part of the report is the third Distribution above, showing the difference between the first two. The distribution of miles driven between the baseline and update times appears to be uniformly distributed between 0 and 50,000 miles. There are no values above 50,000 miles, and each bin of the histogram contains approximately the same number of participants. Visual inspection gives solid evidence in this case that the data are uniformly distributed. However, for data that are not as obvious, you could perform a goodness-of-fit test using the Fit Beta option in the Distribution platform.

The screenshot below shows the distribution of miles driven (update – baseline) for each of the participants’ cars. We find that each distribution appears uniformly distributed between 0 and 50,000 for the second, third, and fourth cars as well. Based on this distribution pattern, we can conclude that the updated mileage values could have been generated by adding uniformly distributed random values to the baseline mileage values.

Distribution of Mileage.png

Anomaly #2: Baseline mileage used rounding, and updated mileages did not

The authors at Data Colada found that the baseline mileage had many values rounded to the nearest thousand while the updated mileage did not contain this rounding.

The researchers at Data Colada investigated rounding by looking at the last three digits of the mileages. We could use JSL to take a similar approach, but instead let’s use Explore Patterns! If you’ve never used Explore Patterns before, you can find it by selecting Analyze > Screening.

Explore Patterns Menu.png

After we launch Explore Patterns to explore each baseline and update column, we get the output seen in the following screenshot. In the Univariate Patterns report, select the baseline and update values for the first car to compare the patterns.

Explore Patterns Car 1.png

By default, Explore Patterns includes the reports for the most duplicated values in each column, the longest runs in each column, and the longest duplicate sequences in each column. For the first car owned by participants, there is not much to be gained from the Longest Runs or Longest Duplicate Sequences reports. However, the Most Duplicated Values report shows an interesting contrast. In the baseline mileages for the first owned car, there are many values that appear to be rounded to the nearest 10,000. This is not necessarily an unexpected pattern to find; when I’m asked the mileage of my vehicle, I always round when answering.

The Most Duplicated Values report for the updated mileages is quite different. In this report, there are far fewer duplicates, and the duplicates that appear are not rounded like the baseline values. This is what would happen if the update columns were generated by adding uniformly distributed random values to the baseline values. Therefore, this finding is consistent with Anomaly #1.

Anomaly #3 – Data were entered in two fonts, and each observation has a ‘twin’ in the other font

The researchers noticed that in the original data file, the observations were recorded using two different fonts. Through some clever sorting, the researchers found that each baseline observation in one font corresponded to a very similar row in the other font.

I applaud the researchers for noticing the different fonts and taking this approach. However, let’s suppose the data were all in the same font and this red flag wasn’t raised. Can we use JMP to identify any curious relationships between rows and columns? Of course we can – this is the JMP Blog after all.

We can use Explore Patterns to investigate if there any linear relationships between columns for contiguous rows. Consider the following simple example. In the first four rows, the columns have an exact linear relationship where Y=2*X. In the next three rows, the two columns again have an exact linear relationship where Y=0.5*X.

Linear Relationships Simple Example.png

We can use Explore Patterns to search for linear relationships like this between the baseline columns. If we had not already noticed Anomaly #1, it would make sense to look for linear relationships in the updated columns, too.

Linear Relationships Explore Patterns.png

If there were many exact linear relationships, or perhaps large runs of linear relationships, it may be a sign that the values in one column were generated using a multiple of another column. Here, there are only a few rows for which the baseline columns have an exact linear relationship. We could use the Colorize option to view these cells and inspect for any irregularities, but for a data set this size with many rounded values, it’s not surprising to find a few linear relationships.

Using Explore Patterns, we’ve investigated whether there are any suspicious linear relationships between rows for two columns for the baseline values. However, the results were not very useful. Now, let’s also look for linear relationships between columns among two rows, again for the baseline values. To do this, we will need to use Explore Patterns on the transpose of our data.

I used the Transpose option in JMP to transpose the data by the baseline columns, for participants that reported mileage for at least three vehicles. I excluded rows where participants had two or fewer cars as these would not help identify linear relationships.

Transpose of Data.png

In the Explore Patterns report for this transposed data, there are many exact linear relationships where the slope is approximately one. Thus, there are many rows in the original data table that have a near twin row of baseline values across the cars. (Hint: You can right-click the Slope column and use the Select Where feature to highlight all rows where the slope is close to one. There are 69 pairs where the Slope is between 0.99 and 1.01).

Explore Patterns Exact Linear Relationships.png

You can select Colorize above the table of linear relationships to colorize all values in an exact linear relationship, or you can right-click the Y column to colorize specific values. The table below contains the colorized values for the first three rows of the Linear Relationships report.

Driving Twins.png

Explore Patterns has identified several pairs of ‘driving twins’ where the baseline mileage values were very similar across all cars, having an exact linear relationship. We could follow up on this by looking for more pairs that do not have exact linear relationship but are still very closely related. As illustrated in the Data Colada post, once you sort by the baseline value for the fourth car, it is easy to identify many driving twins.

Anomaly #4 – One font has many rounded baseline values, and the other does not

The researchers found that the baseline values in Calibri font were rounded, and the values in Cambria font were not rounded.

Now, suppose we knew that some rows were written in one font and the other rows were written in a second font, and that we have a column in our data with the font for each row. We may use Explore Patterns with the font name as a By column.

Explore Patterns with Font.JPG

As seen in the Explore Patterns reports, the baseline mileage values for the rows in Calibri font have many rounded values, while the baseline values for the Cambria rows are not rounded. Since we would expect customer reported mileages to be rounded, this suggests that the original reported values were the rows with Calibri font and the “driving twin” data was created and then entered in the Cambria font.

Conclusion

Whenever you acquire a new data set, you likely do some exploratory analysis and data cleaning. Perhaps you search for outliers, check the data for missing values, or view the distribution of each variable. I hope this post has encouraged you to consider using Explore Patterns to check for unusual patterns in your data. Not all patterns you find using Explore Patterns are evidence that anything is wrong or concerning, but the utility provides you with a tool to find patterns that warrant further investigation.

Ben_Ingham94 · ‎12-11-2021

Great read! Interesting to learn these functions in JMP

jacob_rhyne · ‎12-13-2021

Thanks for reading @Ben_Ingham94! I'm glad you enjoyed it.

madhu · ‎01-14-2022

good analysis in JMP

madhu · ‎01-14-2022

To create the Transpose of Car Millage data how have you identified (or picked up) the participants who have reported millages for at least three vehicles? I can do it manually by checking each row, however, is there any formula to use?

jacob_rhyne · ‎01-14-2022

Hi @madhu thanks for reading, and for the question. To identify the participants who reported mileages for at least three vehicles, you can use the Select Where feature of JMP to select these rows. Then when transposing the data, you can choose to use only the selected rows. Here is a JSL script that selects rows for all participants with three or more cars and then transposes the data using only the selected rows.

Data Table( "Car Mileage Data" ) << Select Where(
	Sum(Is Missing(:baseline_car1), 
		Is Missing(:baseline_car2),
		Is Missing(:baseline_car3),
		Is Missing(:baseline_car4)
	) <= 1);
	
Data Table( "Car Mileage Data" ) << Transpose(
	columns( :baseline_car1, :baseline_car2, :baseline_car3, :baseline_car4 ),
	Transpose selected rows only( 1 ),
	Output Table( "Transpose of Car Mileage Data" )
);

madhu · ‎01-14-2022

Hi @jacob_rhyne

Thank you for your prompt reply. I am new in JMP (using JMP 16 Pro) and do not know much about JMP. Unfortunately, I do not find the Select Where feature anywhere on JMP menu.

Further, is there any way to insert the JSL code on the data file to execute the codes automatically?

jacob_rhyne · ‎01-14-2022

Hi @madhu. You can find Select Where under Rows > Row Selection > Select Where.

Select Where.JPG

However if you want to use JMP interactively to select these rows, instead of using JSL, I would recommend adding a new column and creating a formula to identify these rows. To do this, please try the following steps.

Select Cols > New Columns... from the top menu
In the New Column menu, select Column Properties > Formula
In the Formula editor, enter the formula as below

Formula Column.JPG

When you've entered the formula, click Ok to create the column. In the new column, the rows with the value 1 correspond to participants with three or more cars. You can select all rows with the value 1 by right-clicking a row with the value 1 and clicking "Select Matching Cells" like below.

Select Matching Cells.JPG

For your question about using JSL to do this, a great place to start is to look at the Enhanced Log to find JSL that was generated for the actions you performed interactively. For example, the screenshot below shows the Log after opening the data table attached to this page, creating the new formula column described above, and transposing the data using the selected rows. You can save the script from the Log to a new script that you can run again any time.

Enhanced Log.JPG

I hope this is helpful. Thanks again for reading.

madhu · ‎01-16-2022

Hi @jacob_rhyne

Thank you for the guidance. I have now created the transpose matrix.

The only think I can not do in this exercise is to change the column title id=12, id=23, id=36 etc in one go. Do I need to change each column name manually. I wanted to use a formula that adds id= to replace all column title in one go.

jacob_rhyne · ‎01-18-2022

Hi @madhu. Thank you for this question, I'll share two solutions to make the column titles for the Transposed data table use this format.

First, you can create a new column in the original data table that uses the formula seen in the screenshot below.

Car Mileage Data - Enhanced ID Column.JPG

After you've created this column, you can then use it as the Label column when performing the Transpose.

Transpose Setup.JPG

A second option you can use, since you've already created the Transposed data table, is to use JSL to edit the column titles. To do this, please open your Transposed data table then select File > New > Script and paste the script below into the script editor window. Running this script will update the column titles to the desired format.

dt = Data Table("Transpose of Car Mileage Data");
colNames = dt << Get Column Names(string);
For(i=2, i <= Length(colNames), i++,
	Column(dt, i) << Set Name("id=" || char(colNames[i]))
);

I hope this helps. Thanks again for your comments.

madhu · ‎01-29-2022

Thank you @jacob_rhyne

this certainly helps