Need help: Grouping Categorical values, finding optimal number of categories and...

miguello · Oct 2, 2018 05:03 PM

Folks,

I have a problem that I'm hoping somebody would help me find a solution for, or at least nudge me in a right direction.

So, let's say we have N tools. They have multiple atributes, let's say 10. This might be exact model number, color, bells and whistles, modifications etc.

They also have hundreds of parameters. Some might be the same across all the tools, some might depend on a combination of attributes or a single attribute.

And I need to manage all those many hundreds of parameters. These parameters might change and be different from what it should be and I know each parameter at any point of time.

Now, the approach is following: break all the tools into few big groups, let's call them categories, that are defined by combinations of attributes. Let's say all tools of model A and blue color are one category, Model A red color are second category and Model B is third category, regardless of the color.

Now, not all the parameters are going to be the same within one category. Some specific parameter might be different between Model A red color green bell and Model A red color yellow bell. So we find the most common combination and call it "default" for that category. All those that are different we need to explain and put some kind of comment for them that explains why the parameter is different.

Eventually, we will have two unknown numbers: number of categories we need to break it into and number of exception comments we need to fill for parameters that do not match that of "default".

If we set one category - it's going to be easy to maintain one category, but we're going to have too many exceptions from the "default" to be manageable. If we create a category for almost each combination of attributes, we won't have to manage too many exceptions from "default", but we'll have trouble managing that many categories.

The problem is based on the set of data, find the optimal number of categories, where it both has reasonable amount of categories with reasonable amount of exceptions.

Any ideas on how to approach solving this problem?

Thanks,

M.

gzmorgan0 · Oct 3, 2018 04:05 AM

miguello,

This is reminiscent of a profile analysis or K Nearest Neighbor. Create a table where each row is an entity and other than the entity column, there is a column for each attribute. I suggest there be a code for the attribute, like 1=red and 2=green, etc., then according to your hypothetical example, the column for bell will have these values. I suggest you specify labels for the values.

Then you can define one or more metric and weights. For example, it might be most important that certain specifications must match: group A must be high capacity sprayer (> some cubic units) and group C low capacity(< some cubic units) and group B is other. To refine or subdefine the groups you might use:

Latent Class Analysis - note the categories do not need to be coded, but coding can simplify other analyses.
a Multiple Correspondence Analysis, which would show likely attribute pairings
k nearest neighbor methods, see Scripting Index Functions: KDE (statistical) and KDTable (matrix).

That said, often groups are defined over time by their results. Since it is getting close to midterm U.S. elections, I'll use it as an example. Predictions are often made on classes. These classes are defined/refined by age, sex, education, city vs. rural, which state, religion, church attendance, # of times voted in a midterm, in other words a mix of social and spatial demographics but also behavior (# times voted, church attendance).

Regarding the entity mismatch, that will need a little extra care. Suppose group A entities are closely matched for configuration, except older entities do not have 3 in situ sensors and the newer entities have the performance read-out from the sensors. Mismatch will need to be calculated by available data. Continuing the scenario, suppose there are 100 metrics for all group A tools but the newer ones have the 3 extra, or 103. Many JMP platforms easily handle missing values, but how you handle this depends upon which analysis/criteria you might use to identify a mismatch.

I like the JMP virtual join. If I were doing this analysis, I might create my table of enity attributes to define the categories and then virtually join it with the table of other variables. This also makes it easy if the other variables have multiple values (rows) per run.

These are just crumbs. Good luck with this interesting task.

miguello · Oct 3, 2018 01:45 PM

Thanks a lot for the reply, gave me something to think on. The main thing I want to mention is that I'm only interested in match/mismatch, no matter the distance between values. As I mentioned the parent problem is managing hundreds of tool constants\parameters. In this sense I need to know if it's POR or non-POR, or in other words, if there is any mismatch with Golden Parameter List. The problem basically is finding a optimal number of Golden Parameter Lists. Another thing is that many parameters are in HEX format. Without built in functions that would help easily parse them in decimal this would stay as a character value, meaning that I can only do categorical analysis. On one hand that simplifies things since the distance between parameters is going to be calculated based on ordering of values, as far as I understand. On the other hand that limits types of analysis that I can perform.

More on this in a separate comment below. I'll describe a methodology, at least how I understand it, and then later maybe the final solution.

Thanks again!

dale_lehman · Oct 3, 2018 09:00 AM

This sounds like a clustering problem to me. You are unsure about the number of clusters and their makeup and want to identify deviations within the cluster. Given the number of variables you have, you may have to reduce the dimensionality before clustering, such as using principal components. I would say that my own efforts at clustering have been less than satisfying - the clusters rarely match any identifiable groups (I've usually done this with individual consumers or college applicants for instance). It sounds to me like you want to define the clusters beforehand, such as putting "similar" tools together. The problem is that what you think may be "similar" may differ from what the data says is "similar." If you think you have an idea of the groups, then one approach might be to compare that ex ante clustering with what a clustering algorithm produces - then explore the differences between the two.

miguello · Oct 3, 2018 01:47 PM

Dale,

Yes, you're right, eventually I stopped on cluster analysis. The only detail is that I have many parameters as HEX values. Which, if I want to avoid lots of scripting trying to parse them reduces me to only hierarchical clustering that can deal with nominal values. The good thing - I don't have to go with principal components.

I will describe more in a separate comment.

miguello · Oct 3, 2018 4:28 PM

All,

Thanks for the comment.

Here is the path I chose to approach this problem.

My initial table was my tools as columns, and my rows are values for different parameters. Also a column for parameter name.

First thing I did was I stacked all the tool columns. So I ended up with three columns: Tool, Parameter Tag, Parameter Value. No attributes mentioned whatsoever - no models, nothing...!

Parameter Value is a character nominal value since lots of values are in HEX format.

Then I started Hierarchical Clustering platform and chose the following options (listed those different from default): Chose "Data is stacked" (because I just stacked it!), got few more roles available. Casted Parameter Values to Y, Parameter Tag to Attribute ID, Tool to Object ID.

Got a nice clustering tree.

Below was the distance graph.

The first few splits were responsible for the bulk of the parameters. And based on tool distribution (remember that we do not know in this analysis what is the model, color, bells and whistles of each tool) between the clusters they were repsonsible for the following:

1st split: Tool Models

2nd split: Tool submodel on one of the models

3rd split: Major subsystem version

and so one.

The plot below - the distance plot, in my understanding basically gives the average distance for all cluster members from cluster average. For nominal values this is defined by ordering. In some sense this can serve as an estimate of outliers. It pretty much saturates after the thirs split. Which means that if we were to go with 2 or 3 cluster, we would have to explain a lot of exceptions. With 4 clusters it is significantly less on this data set. We chose 5 clusters.

One can save a table of cluster means (which basically should be the most frequent value for categorical values), and I believe if I save such table for each number of clusters, and I join those tables to the main table (or virtually join - nod to gzmorgan), I can detect if a specific value is different from POR/default/Golden value and therefore calculate an exact number of exceptions for each number of clusters.

Then the only thing to decide - how many clusters (categories) you want to manage, and how many dispositions you're ready to do for those exceptions.

Thanks,

M.

Need help: Grouping Categorical values, finding optimal number of categories and exceptions

Re: Need help: Grouping Categorical values, finding optimal number of categories and exceptions

Re: Need help: Grouping Categorical values, finding optimal number of categories and exceptions

Re: Need help: Grouping Categorical values, finding optimal number of categories and exceptions

Re: Need help: Grouping Categorical values, finding optimal number of categories and exceptions

Re: Need help: Grouping Categorical values, finding optimal number of categories and exceptions