Hi - I am trying to decide on the best multivariate analysis method (e.g. principal component analysis, k-means clustering, etc.) for the following situation. I have a mapped dataset with 12,928 records, each corresponding to a well with sample results. Each row of data (each location on my map) has a well name, latitude, longitude, and results for compound A, compound B, compound C, compound D, compound E, etc. (8 chemical compounds in all).
These wells have been contaminated by one of three sources: 1) air deposition, 2) process waste, or 3) a combination of the two (mixed). Each source is associated with a unique source signature (e.g. the air deposition source tends to have high compound X and low compound Y, while process waste tends to have high compounds Y and Z and low compound X). So each row (i.e. well location) of my dataset is associated with one of the three sources of contamination. My goal is to identify which source is most likely for each record in my dataset.
Importantly, a subset of ~800 records in my dataset is known to be associated with the air deposition source signature, so this subset can be used as a training set for that signature. I can also come up with a subset of ~50 records that are representative of the process waste source signature.
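For concreteness, the kind of labeling I have in mind could be sketched roughly as follows, in Python rather than JMP and on made-up data. Everything here is an assumption for illustration only: the synthetic signatures, the centered log-ratio transform, the function names, and the 1/3 and 2/3 cutoffs are hypothetical and not taken from my real dataset. The idea is to score each well by where it falls between the centroids of the two training subsets (0 is air-like, 1 is process-like) and to call the in-between values "mixed".

```python
import numpy as np

def clr(x):
    """Centered log-ratio transform of strictly positive concentrations (rows = wells)."""
    logx = np.log(x)
    return logx - logx.mean(axis=1, keepdims=True)

def mixing_score(x, air_train, proc_train):
    """Project each well onto the line joining the two training centroids.

    A score near 0 resembles the air deposition subset, near 1 the process waste subset.
    """
    mu_air = clr(air_train).mean(axis=0)
    mu_proc = clr(proc_train).mean(axis=0)
    d = mu_proc - mu_air
    return (clr(x) - mu_air) @ d / (d @ d)

def classify(score, lo=1 / 3, hi=2 / 3):
    """Hypothetical cutoffs; in practice these would need to be tuned or validated."""
    return np.where(score < lo, "air", np.where(score > hi, "process", "mixed"))

# Synthetic stand-ins for the 8 compounds (NOT real data).
rng = np.random.default_rng(0)
air_sig = np.array([8.0, 1, 1, 1, 1, 1, 1, 1])   # high X, low Y and Z
proc_sig = np.array([1.0, 8, 8, 1, 1, 1, 1, 1])  # low X, high Y and Z

def simulate(signature, n):
    """Signature times multiplicative lognormal noise."""
    return signature * rng.lognormal(0.0, 0.1, size=(n, signature.size))

air_train = simulate(air_sig, 800)   # mirrors the ~800 known air deposition records
proc_train = simulate(proc_sig, 50)  # mirrors the ~50 process waste records

# Unlabeled wells; "mixed" is simulated as the geometric mean of the two
# signatures, a simplification (real mixing is additive in concentration).
new_air = simulate(air_sig, 5)
new_proc = simulate(proc_sig, 5)
new_mixed = simulate(np.sqrt(air_sig * proc_sig), 5)

labels_air = classify(mixing_score(new_air, air_train, proc_train))
labels_proc = classify(mixing_score(new_proc, air_train, proc_train))
labels_mixed = classify(mixing_score(new_mixed, air_train, proc_train))
print(labels_air, labels_proc, labels_mixed)
```

This is only meant to show the structure of the problem (two labeled subsets, one unlabeled intermediate class), not a recommended method.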
Any suggested approaches in JMP or JMP Pro would be greatly appreciated. Thanks in advance!