All,
Thanks for the comment.
Here is the path I chose to approach this problem.
My initial table had one column per tool and one row per parameter, plus a column for the parameter name.
The first thing I did was stack all the tool columns, so I ended up with three columns: Tool, Parameter Tag, Parameter Value. No attributes mentioned whatsoever - no models, nothing...!
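For anyone who wants to see the stacking step outside of JMP, here is a minimal Python sketch using only the standard library. The tool names, parameter tags and values are made up for illustration, and `stack_tools` is a hypothetical helper, not anything JMP provides.

```python
# Hypothetical wide table: one row per parameter, one column per tool.
wide_rows = [
    {"Parameter Tag": "P1", "Tool_A": "0x1F", "Tool_B": "0x1F", "Tool_C": "0x2A"},
    {"Parameter Tag": "P2", "Tool_A": "0x00", "Tool_B": "0x01", "Tool_C": "0x01"},
    {"Parameter Tag": "P3", "Tool_A": "ON", "Tool_B": "ON", "Tool_C": "OFF"},
]


def stack_tools(rows, id_col="Parameter Tag"):
    """Turn one-column-per-tool rows into Tool / Parameter Tag / Parameter Value records."""
    stacked = []
    for row in rows:
        for col, value in row.items():
            if col == id_col:
                continue  # the parameter name is carried along, not stacked
            stacked.append({
                "Tool": col,
                "Parameter Tag": row[id_col],
                # Keep values as text so HEX strings are treated as nominal labels.
                "Parameter Value": str(value),
            })
    return stacked


stacked = stack_tools(wide_rows)
for record in stacked[:3]:
    print(record)
```

Each tool/parameter pair becomes one row, which is the layout the "Data is stacked" option expects.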
Parameter Value is a character, nominal column, since many of the values are in HEX format.
Then I launched the Hierarchical Clustering platform and chose the following options (listing only those that differ from the defaults): I chose "Data is stacked" (because I just stacked it!), which made a few more roles available. I cast Parameter Value to Y, Parameter Tag to Attribute ID, and Tool to Object ID.
Got a nice clustering tree.
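To illustrate the idea of clustering tools on nominal parameter values, here is a rough Python sketch. It is an assumption-laden stand-in, not what JMP does internally: it uses a simple mismatch fraction as the distance and average linkage for merging, and the records, `mismatch`, `average_linkage` and `agglomerate` are all hypothetical names chosen for this example.

```python
from itertools import combinations

# Hypothetical stacked records: (Tool, Parameter Tag, Parameter Value).
records = [
    ("T1", "P1", "0x1F"), ("T1", "P2", "0x00"), ("T1", "P3", "ON"),
    ("T2", "P1", "0x1F"), ("T2", "P2", "0x00"), ("T2", "P3", "OFF"),
    ("T3", "P1", "0x2A"), ("T3", "P2", "0x01"), ("T3", "P3", "OFF"),
    ("T4", "P1", "0x2A"), ("T4", "P2", "0x01"), ("T4", "P3", "OFF"),
]

# One profile per tool: {tool: {tag: value}}.
profiles = {}
for tool, tag, value in records:
    profiles.setdefault(tool, {})[tag] = value
tags = sorted({tag for _, tag, _ in records})


def mismatch(a, b):
    """Fraction of parameters on which two tools disagree (assumed distance)."""
    return sum(profiles[a].get(t) != profiles[b].get(t) for t in tags) / len(tags)


def average_linkage(c1, c2):
    """Mean pairwise mismatch between the members of two clusters."""
    return sum(mismatch(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))


def agglomerate(tools):
    """Repeatedly merge the two closest clusters; return the merge history."""
    clusters = [(t,) for t in sorted(tools)]
    history = []
    while len(clusters) > 1:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: average_linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        dist = average_linkage(clusters[i], clusters[j])
        history.append((clusters[i], clusters[j], dist))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return history


history = agglomerate(profiles)
for left, right, dist in history:
    print(left, "+", right, "at distance", round(dist, 3))
```

Reading the merge history from the last entry backwards gives the splits from the top of the tree down, and the distances recorded at each merge are the kind of numbers a distance graph would plot.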
Below was the distance graph.
The first few splits accounted for the bulk of the parameters. Based on how the tools were distributed between the clusters (remember that in this analysis we do not know the model, color, bells and whistles of each tool), the splits corresponded to the following:
1st split: Tool Models
2nd split: Tool submodel on one of the models
3rd split: Major subsystem version
and so on.
The plot below the tree - the distance plot - in my understanding basically gives the average distance of all cluster members from the cluster average. For nominal values this is defined by ordering. In some sense this can serve as an estimate of outliers. It pretty much saturates after the third split, which means that if we were to go with 2 or 3 clusters, we would have to explain a lot of exceptions. With 4 clusters there are significantly fewer on this data set. We chose 5 clusters.
One can save a table of cluster means (which for categorical values should basically be the most frequent value). I believe that if I save such a table for each number of clusters and join those tables to the main table (or virtually join - nod to gzmorgan), I can detect whether a specific value differs from the POR/default/Golden value, and therefore calculate the exact number of exceptions for each number of clusters.
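The exception count described here can be sketched in a few lines of Python. This is only an illustration under assumptions: the records and the cluster assignments are invented, the "Golden" value is taken to be the most frequent value per cluster and parameter, and `count_exceptions` is a hypothetical helper.

```python
from collections import Counter, defaultdict

# Hypothetical stacked records: (Tool, Parameter Tag, Parameter Value).
records = [
    ("T1", "P1", "0x1F"), ("T1", "P2", "0x00"), ("T1", "P3", "ON"),
    ("T2", "P1", "0x1F"), ("T2", "P2", "0x00"), ("T2", "P3", "OFF"),
    ("T3", "P1", "0x2A"), ("T3", "P2", "0x01"), ("T3", "P3", "OFF"),
    ("T4", "P1", "0x2A"), ("T4", "P2", "0x01"), ("T4", "P3", "OFF"),
]

# Hypothetical tool-to-cluster assignments for several candidate cluster counts.
assignments = {
    1: {"T1": 0, "T2": 0, "T3": 0, "T4": 0},
    2: {"T1": 0, "T2": 0, "T3": 1, "T4": 1},
    4: {"T1": 0, "T2": 1, "T3": 2, "T4": 3},
}


def count_exceptions(records, tool_to_cluster):
    """Count values that differ from the most frequent value of their cluster."""
    values = defaultdict(list)  # (cluster, tag) -> all values seen in that cluster
    for tool, tag, value in records:
        values[(tool_to_cluster[tool], tag)].append(value)
    # Assumed "Golden" value: the mode per cluster and parameter.
    # Ties are resolved arbitrarily; the exception count is the same either way
    # when exactly two values tie.
    golden = {key: Counter(vals).most_common(1)[0][0] for key, vals in values.items()}
    return sum(
        value != golden[(tool_to_cluster[tool], tag)]
        for tool, tag, value in records
    )


for k, mapping in sorted(assignments.items()):
    print(k, "cluster(s):", count_exceptions(records, mapping), "exceptions")
```

More clusters means fewer exceptions to disposition but more categories to manage, which is the trade-off the next sentence is about.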
Then the only thing left to decide is how many clusters (categories) you want to manage, and how many dispositions you're ready to do for those exceptions.
Thanks,
M.