Level Up your Python game with JMP: Cleaning data with Python and JMP

Bill_Worley · Sep 19, 2024 10:00 AM

So, you’d like to learn more about Python and JMP integration. First, make sure you are up to speed by watching the related video, if you haven’t already.

A group of JMP SEs have developed a series of blogs that showcase the ease with which JMP and Python can be integrated to handle everyday tasks, such as cleaning data, as well as more advanced analysis and modeling. A special thanks to Yasmine Hajar (@yasmine_hajar) and Wendy Tseng (@wendytseng) for their help with the Python coding used in the video and blog.

In the video, we show how to find potential outliers using just your mouse and your eyes, and then how to remove them from the data table. You also learn how to analyze for missing values and, if necessary, impute surrogate values to see what the analysis would look like if the data weren’t missing.

JMP and Python have been complementary for several years now, and with the release of JMP 18, it is much easier to integrate Python analytics with the excellent data visualization in JMP. However, and I am highly biased here, data cleaning in JMP is much easier than it is in Python for one very basic reason: coding.

As an example, with JMP you can visually scan for univariate outliers in the Distribution platform and for multivariate outliers in the Multivariate platform. You can also scan for Hotelling’s T² outliers across all data of interest using the Model Driven Multivariate Control Chart platform, but we will save that for another time. Again, please watch the accompanying video to see all this capability in action.

In the meantime, here is a list of some of the data cleaning options in Python:

Extract specific patterns from text data, such as email addresses, phone numbers, or dates.
Remove unwanted characters or substrings from text data.
Replace specific patterns or substrings with new values.
Standardize text data by converting all characters to lowercase or uppercase.
Identify and remove duplicates based on text data.
Rename columns to a more recognizable set of labels.
Drop unnecessary columns in a DataFrame.
Change the index of a DataFrame.
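For reference, here is a minimal pandas sketch of the options listed above. It is not the code from the video; the data, the column names, and the simple email pattern are all made up for illustration.

```python
import pandas as pd

# Hypothetical toy data; every name and value here is invented for illustration.
df = pd.DataFrame({
    "ID": [101, 102, 103, 104],
    "Contact Info": [
        "Ann <ANN@Example.com>",
        "Bob <bob@example.com>",
        "ann <ann@example.com>",
        "Cy (no email)",
    ],
    "Notes": ["ok!!", "re-test##", "ok", "n/a"],
    "Scratch": [0, 0, 0, 0],
})

# Extract a pattern (a deliberately simple, non-exhaustive email regex).
df["email"] = df["Contact Info"].str.extract(
    r"([\w.+-]+@[\w-]+\.[\w.]+)", expand=False
)

# Remove unwanted characters.
df["Notes"] = df["Notes"].str.replace(r"[!#]", "", regex=True)

# Replace a specific substring with a new value.
df["Notes"] = df["Notes"].str.replace("n/a", "missing", regex=False)

# Standardize text by converting to lowercase.
df["email"] = df["email"].str.lower()

# Remove duplicates based on a text column (the first occurrence is kept).
deduped = df.drop_duplicates(subset="email")

# Rename columns, drop an unnecessary column, and change the index.
cleaned = (
    deduped.rename(columns={"Contact Info": "contact", "Notes": "notes"})
    .drop(columns=["Scratch"])
    .set_index("ID")
)

print(cleaned)
```

Each step is a single line, but you do need to know which method to call and how to write the pattern, which is the coding overhead discussed above.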

The first five options can be done quickly with JMP’s Recode option. Plus, using the Column Names option under the Cols menu, it only takes a few clicks to rename columns.

Dropping unnecessary columns in a data table is as easy as selecting the columns you no longer need, then right-clicking on any of the selected column headers and choosing Delete Columns.

When it comes to finding missing values and imputing missing data, JMP leads the way. Locating and sequestering outliers is also straightforward in JMP, as you can see below.

Below is the Python code to open a JMP table and explore for outliers and missing values. Below that is an image of the data frame and some of the output for outliers and missing values. Note that the Python code flags a value as an outlier when its Z-score has an absolute value greater than 3. Based on this criterion, Python found outliers in virtually every continuous column in the data table.
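As a rough illustration of the Z-score criterion, here is a minimal pandas sketch. It is not the code from the video: the data is made up, and it assumes the sample standard deviation (the pandas default), which may differ slightly from the calculation used there.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 20 typical readings, one extreme value, one missing value.
readings = [48.0, 49.0, 50.0, 51.0, 52.0] * 4 + [120.0, np.nan]
df = pd.DataFrame({"reading": readings, "label": ["a"] * len(readings)})

# Only continuous (numeric) columns get a Z-score.
numeric = df.select_dtypes(include="number")

# Z-score per column; pandas skips missing values when computing mean and std.
z = (numeric - numeric.mean()) / numeric.std()

# Flag values whose Z-score has an absolute value greater than 3.
outlier_mask = z.abs() > 3
outlier_counts = outlier_mask.sum()

# Count missing values in every column.
missing_counts = df.isna().sum()

print(outlier_counts)
print(missing_counts)
```

A fixed cutoff like this flags values mechanically, so it is still up to you to decide whether a flagged value is truly an outlier.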

To screen for outliers in JMP, follow this example, which uses Arrhythmia.jmp from the JMP sample data index. Go to Analyze > Screening > Explore Outliers.

You can decide which data points are truly outliers, and you can easily recalculate the outlier ranges with just a couple of clicks. JMP recommends that you never completely remove data from your data set, because an audit trail may be needed later. You can always Hide & Exclude data in JMP to keep it out of your analyses without deleting it from the data set. Hiding data in Python requires more coding and most likely means that the data cannot be seen by anyone.
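If you want similar behavior in Python, one possible approach is to flag rows instead of deleting them. The sketch below assumes a hypothetical "excluded" column and a made-up threshold; it is not a built-in pandas feature and not an equivalent of Hide & Exclude.

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({"reading": [48.0, 51.0, 120.0, 50.0]})

# Hypothetical flag column: True marks rows to leave out of analyses.
# The threshold of 100 is invented for this example.
df["excluded"] = df["reading"] > 100

# Analyses run on the filtered rows; the flagged row stays in df.
included = df.loc[~df["excluded"], "reading"]
mean_included = included.mean()

print(len(df), len(included), mean_included)
```

The flagged row is still in the data frame for anyone who audits it later, but every analysis has to remember to apply the filter.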

To explore missing values in JMP, open the data table and go to Analyze > Screening > Explore Missing Values.
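For comparison, a minimal pandas sketch of the same idea might look like the following. The data is made up, and the simple median fill is only a stand-in; it is not the same as the imputation methods available in JMP.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in each column.
df = pd.DataFrame({
    "age": [34.0, np.nan, 51.0, 47.0],
    "weight": [70.0, 82.0, np.nan, 64.0],
})

# How many values are missing in each column?
missing_per_column = df.isna().sum()

# Which rows contain at least one missing value?
rows_with_missing = df[df.isna().any(axis=1)]

# Simple median imputation; the result is a new frame and df is left untouched.
imputed = df.fillna(df.median(numeric_only=True))

print(missing_per_column)
print(imputed)
```

Keeping the original frame alongside the imputed copy makes it easy to run the analysis both ways and compare the results.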