Evan_Morris
Level IV

Metrics for Discriminant Analysis

I have a discriminant analysis that I'm using to classify some materials. It works pretty well and I'm going to take it to production shortly. Some items, of course, end up being "edge cases" where slight variations can move them from one classification to the next. So for instance Sample A-1 (first sampling of stream A) ends up being classed as a Gamma object, whereas Sample A-2 (second sampling of stream A) is classified as a Theta object due to small variations in their nature.

 

Alternatively, I have cases where the material doesn't fit well in any of the classes but is assigned to one anyway.

 

What I'm looking for is the best way to normalize a scoring metric for how well something fits its assigned class. As far as I can tell, the probability function uses a logit, so it has too much exponential movement in its nature to be of much value here (basically the A-1/A-2 flip above will show 99% for either category based on minute perturbations).
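To make the saturation effect concrete, here is a minimal numeric sketch (plain Python/NumPy, not JMP output, and the distance values are made up) of how posteriors computed from squared distances via a softmax pile up near 0 and 1, so the probability says almost nothing about how well a point fits either class in absolute terms:

```python
import numpy as np

def posteriors_from_sqdist(sqdist):
    """Posterior class probabilities from squared distances to each class
    center, equal priors assumed: p_k proportional to exp(-0.5 * sqdist_k)."""
    logp = -0.5 * np.asarray(sqdist, dtype=float)
    logp -= logp.max()                 # numerical stability
    p = np.exp(logp)
    return p / p.sum()

# Hypothetical SQDIST values for one sample against classes Gamma and Theta.
print(posteriors_from_sqdist([10.0, 19.0]))  # ~[0.989, 0.011] -> "99% Gamma"
print(posteriors_from_sqdist([19.0, 10.0]))  # ~[0.011, 0.989] -> "99% Theta"

# Both distances large (poor fit to *either* class), yet still ~99%:
print(posteriors_from_sqdist([40.0, 49.0]))  # ~[0.989, 0.011]
```

Once the distance difference exceeds a few units the posterior pins near 99%, and a sign flip in that difference flips which class sits at 99%, regardless of absolute fit.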

 

Alternatively, I've tried looking at the SQDIST parameter. The problem with this one is that it isn't normalized, so you can see very small SQDISTs for one tight cluster and very large SQDISTs for another, looser cluster. I tried normalizing each SQDIST to the median of its cluster, which helped but still didn't give the result I wanted. Maybe z-score them?
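To compare the two ideas mentioned here, a small sketch (plain Python/pandas, with made-up values and hypothetical column names): dividing each SQDIST by its cluster's median versus z-scoring within the cluster. The z-score accounts for each cluster's spread as well as its location, which is why it tends to behave better when one cluster is tight and another is loose.

```python
import pandas as pd

# Made-up squared distances: Gamma is a tight cluster, Theta is a loose one.
df = pd.DataFrame({
    "pred_class": ["Gamma"] * 4 + ["Theta"] * 4,
    "sqdist":     [1.0, 1.2, 1.4, 3.0,   30.0, 45.0, 60.0, 200.0],
})
grp = df.groupby("pred_class")["sqdist"]

# Option 1: normalize to the cluster median (location only, ignores spread).
df["ratio_to_median"] = df["sqdist"] / grp.transform("median")

# Option 2: per-cluster z-score (accounts for location and spread).
df["zscore"] = (df["sqdist"] - grp.transform("mean")) / grp.transform("std")

print(df)
```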

 

Anyway, I'm just wondering if there is already a good off-the-shelf parameter in place for this. Here are the details of what I would want:

 

1) A normalized metric for how well the data fits its best bucket

2) A normalized metric for how well the data fits its second-best bucket

3) Enough linearity in these to be easily interpretable (no exponential decay/growth)

 

The normalization is really the key. My goal, ultimately, is to develop a control chart of new data as it is processed by the LDA, to catch outliers as they come in and to gauge the overall health of the LDA classes.
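As a rough illustration of that monitoring idea (not a JMP feature, just a sketch with made-up baseline statistics, hypothetical column names, and a conventional 3-sigma-style limit): standardize each incoming SQDIST against its assigned class's baseline and flag points whose distance is unusually large.

```python
import pandas as pd

# Baseline per-class distance statistics, estimated once from historical data
# (values here are made up for illustration).
baseline = {
    "Gamma": {"mean": 1.3,  "std": 0.4},
    "Theta": {"mean": 50.0, "std": 15.0},
}
CONTROL_LIMIT = 3.0   # conventional 3-sigma-style threshold; tune as needed

# Hypothetical incoming samples scored by the LDA.
new_data = pd.DataFrame({
    "sample":     ["A-1",   "A-2",   "B-1"],
    "pred_class": ["Gamma", "Theta", "Theta"],
    "sqdist":     [1.4,      48.0,    400.0],
})

# Standardize each new SQDIST against its class's baseline, then flag points
# whose distance is unusually large for the class they were assigned to.
new_data["sqdist_z"] = new_data.apply(
    lambda r: (r["sqdist"] - baseline[r["pred_class"]]["mean"])
              / baseline[r["pred_class"]]["std"],
    axis=1,
)
new_data["flag_outlier"] = new_data["sqdist_z"] > CONTROL_LIMIT
print(new_data[new_data["flag_outlier"]])   # here only B-1 would be flagged
```

Using stored baseline statistics rather than each new batch's own mean and standard deviation keeps the chart sensitive to drift in the incoming stream.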

Kevin_Anderson
Level VI

Re: Metrics for Discriminant Analysis

Hi, Evan_Morris!

 

An inconvenient answer to your question is that there is no perfect metric for measuring the performance of your classification problem. A quick web search turns up 30-50 different metrics that try to summarize the results of a confusion matrix, and that's a blinking sign that each of them runs into trouble under some set of conditions. Figuring out those conditions is an exercise usually left to the reader.

 

One question: statisticians tend to use the word "normalized" differently than other professions. What do you mean when you say "the normalization is really the key"? What does normalization mean to you? Does "normalization" = "correcting for class imbalance"?

 

One way to do that would be to take a desirability approach to the Area Under the Curve (AUC) of the Receiver Operating Characteristic (ROC) curve that's available under JMP's little inverted red triangle (LIRT) next to Discriminant Analysis -> Score Options -> ROC Curve. Right-click on the AUC table and choose Make Into Data Table. Set the individual desirability of each class to be 1 at an AUC of 1 and 0 at an AUC of 0.5 or below, then calculate a geometric mean of the individual desirabilities weighted by the sample size in each class. Use that summarized desirability as a figure of merit.
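A minimal numeric sketch of that figure of merit (plain Python rather than a JSL script; the class names, AUC values, and counts are made up): map each class's AUC linearly to a desirability of 0 at AUC = 0.5 and 1 at AUC = 1, then combine with a sample-size-weighted geometric mean.

```python
import numpy as np

# Hypothetical per-class AUC values (from the ROC/AUC table) and class sizes.
auc     = {"Gamma": 0.97, "Theta": 0.88, "Delta": 0.75}
n_class = {"Gamma": 120,  "Theta": 45,   "Delta": 30}

def desirability(a):
    """Linear ramp: 0 at AUC <= 0.5, 1 at AUC = 1."""
    return max(0.0, min(1.0, (a - 0.5) / 0.5))

d = np.array([desirability(auc[k]) for k in auc])
w = np.array([n_class[k] for k in auc], dtype=float)
w /= w.sum()

# Sample-size-weighted geometric mean of the per-class desirabilities.
figure_of_merit = float(np.exp(np.sum(w * np.log(d))))
print(round(figure_of_merit, 3))
```

Note that any single class with a desirability of 0 drives the geometric mean to 0, which is the usual (and intentional) behavior of desirability functions.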

 

Another way: there are references claiming that the Matthews Correlation Coefficient (MCC) is superior to some other popular binary classification performance metrics, especially in the case of class imbalance, in part because it explicitly uses every cell in the confusion matrix. The MCC has been generalized to the multiclass case (which I assume you have), where it is called Rk.
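For reference, a small sketch of computing the multiclass MCC outside of JMP; this uses scikit-learn's matthews_corrcoef, which handles the multiclass (Rk) case, and the labels below are made up:

```python
from sklearn.metrics import matthews_corrcoef

# Hypothetical actual vs. predicted class labels from the discriminant model.
y_true = ["Gamma", "Gamma", "Theta", "Theta", "Delta", "Delta", "Delta"]
y_pred = ["Gamma", "Theta", "Theta", "Theta", "Delta", "Gamma", "Delta"]

# +1 is perfect agreement, 0 is no better than chance, negatives are worse.
print(matthews_corrcoef(y_true, y_pred))
```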

 

I am unfamiliar with your application and I can't confidently recommend either approach over the other.  It does, however, seem clear to me that either approach would require a JSL script, and each would involve lots of effort and testing.

 

I hope someone has a more efficient and effective answer for your question than I.

 

Good luck!

Evan_Morris
Level IV

Re: Metrics for Discriminant Analysis

So one shortcut solution I found (thanks Di!) was to use the Col Standardize function with a by-group variable. This standardizes the data around a normal distribution, I think, which doesn't work 100% since the data is highly skewed, but it's still pretty darn effective and was very easy to put in place.
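For anyone working outside of JMP, the same by-group standardization can be done in one step in pandas (column names are hypothetical); it is the per-class z-score sketched earlier in the thread:

```python
import pandas as pd

# Hypothetical table: squared distance plus the assigned class as the by-group.
df = pd.DataFrame({"pred_class": ["Gamma", "Gamma", "Theta", "Theta"],
                   "sqdist":     [1.1,      1.9,     42.0,    61.0]})

# Equivalent of Col Standardize with a by-group: center and scale within class.
df["sqdist_std"] = df.groupby("pred_class")["sqdist"].transform(
    lambda s: (s - s.mean()) / s.std()
)
print(df)
```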

Re: Metrics for Discriminant Analysis

BTW, the practice of "centering and scaling" data is very common. This practice does not necessarily imply or assume a normal distribution.