Predictive modelling with machine learning in JMP – K Nearest Neighbours (KNN)


The K Nearest Neighbours, or KNN, platform predicts a response value for a given observation using the responses of the observations in that observation’s local neighbourhood. It can be used to classify a categorical response as well as to predict a continuous response.

 

KNN is a non-parametric, supervised machine learning algorithm. It works on the underlying assumption that observations with similar predictor values also have similar responses. To apply this assumption, KNN computes the distance between the given observation and every other observation, then keeps the k observations with the smallest distances as its neighbours.

 

Because each prediction depends only on nearby observations, KNN can classify observations with irregular prediction boundaries. However, every predictor contributes to the distance, so the algorithm is sensitive to irrelevant predictors and the choice of predictors can affect your results.

 

Calculate the distance

JMP Pro 17 uses the Euclidean distance to measure how far the predictor values of one observation are from those of another. In a 2D space, the Euclidean distance d between (x1, y1) and (x2, y2) is simply:

d = sqrt((x2-x1)^2 + (y2-y1)^2)

The Euclidean method requires that all values are real numbers.
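The distance formula above extends to any number of predictors. The following Python sketch is only an illustration of the calculation, not JMP's implementation; the function names `euclidean_distance` and `k_nearest` are made up for this example.

```python
import math

def euclidean_distance(a, b):
    """Euclidean distance between two equal-length sequences of real numbers."""
    if len(a) != len(b):
        raise ValueError("points must have the same number of coordinates")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def k_nearest(query, points, k):
    """Return the indices of the k points closest to query, nearest first."""
    order = sorted(range(len(points)),
                   key=lambda i: euclidean_distance(query, points[i]))
    return order[:k]

print(euclidean_distance((0, 0), (3, 4)))           # 5.0
print(k_nearest((0, 0), [(5, 5), (1, 0), (0, 2)], 2))  # [1, 2]
```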

 

Making predictions

How KNN forms its prediction depends on the type of response.

- Continuous responses

JMP predicts values as the average of the responses of the k nearest neighbours. Each continuous predictor is scaled by its standard deviation. Missing values for a continuous predictor are replaced by the mean of that predictor.

- Categorical responses

In JMP, the predicted value is the most frequent response level among the k nearest neighbours. If two or more levels tie for most frequent, the predicted response is assigned by randomly selecting one of the tied levels.
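The two prediction rules, together with the scaling and missing-value handling described for continuous predictors, can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not JMP's code: the function names are hypothetical, missing values are represented as `None`, and the sample standard deviation of the observed values is used for scaling (the article does not say which variant JMP uses).

```python
import random
import statistics
from collections import Counter

def impute_and_scale(column):
    """Replace missing values (None) with the column mean, then divide
    every value by the standard deviation of the observed values."""
    observed = [v for v in column if v is not None]
    mean = statistics.mean(observed)
    sd = statistics.stdev(observed)  # assumption: sample standard deviation
    filled = [mean if v is None else v for v in column]
    return [v / sd for v in filled]

def predict_continuous(neighbour_responses):
    """Continuous response: average of the k nearest neighbours' responses."""
    return statistics.mean(neighbour_responses)

def predict_categorical(neighbour_responses, rng=random):
    """Categorical response: most frequent level; ties are broken at random."""
    counts = Counter(neighbour_responses)
    top = max(counts.values())
    tied = sorted(level for level, c in counts.items() if c == top)
    return rng.choice(tied)

print(predict_continuous([1.0, 2.0, 3.0]))
print(predict_categorical(["Good", "Good", "Bad"]))
```

Passing a seeded `random.Random` instance as `rng` makes the tie-breaking reproducible.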

 

The k value or n_neighbors

Different values of k can give different predictions. In JMP, you only need to specify the maximum value of k, k_max. For each k with 1 ≤ k ≤ k_max, JMP builds a model using only the training set observations, and each of these models is used to classify the validation set observations. The model with the lowest misclassification rate on the validation set is selected, and its k is the best k.
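The selection of k can be sketched in Python as a loop over 1 ≤ k ≤ k_max. This is a simplified illustration rather than JMP's algorithm: the function names are hypothetical, the data are assumed to be lists of numeric tuples already scaled, and ties in the vote go to the nearest tied neighbour's level so that the result is reproducible (JMP, as described above, picks at random).

```python
import math
from collections import Counter

def classify(query, train_X, train_y, k):
    """Majority vote among the k training rows nearest to query."""
    order = sorted(range(len(train_X)),
                   key=lambda i: math.dist(query, train_X[i]))
    labels = [train_y[i] for i in order[:k]]
    counts = Counter(labels)
    top = max(counts.values())
    # Assumption: on a tie, take the level of the nearest tied neighbour.
    return next(label for label in labels if counts[label] == top)

def misclassification_rate(train_X, train_y, valid_X, valid_y, k):
    """Share of validation rows whose predicted level is wrong."""
    wrong = sum(classify(x, train_X, train_y, k) != y
                for x, y in zip(valid_X, valid_y))
    return wrong / len(valid_X)

def best_k(train_X, train_y, valid_X, valid_y, k_max):
    """Fit on the training set for each k and score on the validation set."""
    rates = {k: misclassification_rate(train_X, train_y, valid_X, valid_y, k)
             for k in range(1, k_max + 1)}
    # Lowest validation misclassification rate wins; smaller k breaks ties.
    return min(rates, key=lambda k: (rates[k], k)), rates

# Toy data: two clusters plus one mislabelled training row near the first.
train_X = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5), (0.5, 0.6)]
train_y = ["Good", "Good", "Good", "Bad", "Bad", "Bad", "Bad"]
valid_X = [(0.5, 0.5), (5.5, 5.5)]
valid_y = ["Good", "Bad"]

k, rates = best_k(train_X, train_y, valid_X, valid_y, k_max=3)
print(k, rates)
```

In this toy data the mislabelled row misleads the k = 1 model, while a larger neighbourhood outvotes it, which is why the validation set is useful for choosing k.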

 

Validation for KNN in JMP modelling

In JMP Pro, you can supply a numeric validation column for KNN modelling. JMP uses this column to partition the data into sets before modelling, which helps to reduce overfitting and to select a good predictive model. For KNN, the validation column is optional. Without it, however, the model can overfit: it becomes overly specific to the training data and gives less accurate predictions for new observations.

For KNN, you can partition the data into two or three sets by the following methods:

| Method | Set | Function |
|---|---|---|
| Train and tune | Training | Estimate the model parameters |
| | Validation | Tune the model parameters in the model fitting algorithm, and ultimately choose a model with good predictive ability |
| Train, tune, and evaluate | Training | Estimate the model parameters |
| | Validation | Tune the model parameters in the model fitting algorithm, and ultimately choose a model with good predictive ability |
| | Test | Independently evaluate the performance of the fitted model |
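To make the partitioning concrete, here is a small Python sketch that builds a numeric validation column by randomly assigning each row to a set. It is an illustration only, not the JMP Make Validation Column platform: the function name, the 60/20/20 split, and the coding 0 = training, 1 = validation, 2 = test are assumptions made for this example.

```python
import random

def make_validation_column(n_rows, fractions=(0.6, 0.2, 0.2), seed=1):
    """Return a list of n_rows codes: 0 = training, 1 = validation, 2 = test.

    fractions gives the training and validation shares; the test set
    receives whatever rows remain (use a last share of 0.0 for two sets).
    """
    n_train = round(n_rows * fractions[0])
    n_valid = round(n_rows * fractions[1])
    n_test = n_rows - n_train - n_valid
    codes = [0] * n_train + [1] * n_valid + [2] * n_test
    random.Random(seed).shuffle(codes)  # random assignment, reproducible
    return codes

column = make_validation_column(10)
print(column)
```

Calling `make_validation_column(8, fractions=(0.75, 0.25, 0.0))` gives the two-set Train and tune layout with no test rows.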

 

Learn more about validation in the JMP documentation: Example of the Make Validation Column Platform (jmp.com)

Example

In this example, we want to use KNN to classify a customer as either Good Risk or Bad Risk based on their financial history.

  1. Select Help > Sample Data Folder and open Equity.jmp.
  2. Select Analyze > Predictive Modeling > K Nearest Neighbors.
  3. Select BAD and click Y, Response. This is the response to analyse.
  4. Select LOAN and CLNO and click X, Factor. These are the predictor variables.
  5. Select Validation and click Validation. This column is used to partition the data into sets before modelling, to avoid overfitting and to select a good predictive model. We used the Train, tune, and evaluate method in this example.
  6. Click OK.

 

To save the prediction equation

  1. Right click the BAD red triangle and select Publish Prediction Formula.
  2. Next to Number of Neighbors, K, leave the default value of 7.
  3. Click OK.

 

The prediction equation is saved in the Formula Depot. You can compare the performance of alternative models published to the Formula Depot with that of the K = 7 nearest neighbour model using the Model Comparison option in the Formula Depot. Find more information regarding the Formula Depot in Help/JMP Documentation Library, page 3763.

If you are interested, go back to step 4, where you chose the X, Factor columns. This time, select all the columns from LOAN through CLNO as X, Factor. What changed, and why?

 

Shuran,

Reviewed by Jeremy T. @Jeremy_Tee 

Jan 2024, Based on JMP(R) Pro 17.2.0

 

This article does not reflect the positions and opinions of JMP Statistical Discovery nor SAS Institute.