Solved: Re: Finding subset of sample that is representative of the whole

tbob · Dec 21, 2023 11:33 AM

Hi All,

I have a set of 49 data points that have repeatable x,y coordinates. I would like to know which subset of 9 sites best represents the 49 sites. I tried hierarchical clustering, but either that didn't work or I am not using it correctly.

Thanks!

SDF1 · Dec 21, 2023 11:52 AM

Hi @tbob ,

You should be able to do this using the Subset platform, Tables > Subset. After you select the Random type you want, e.g. random fraction of the data table, or random selection of a fixed number of rows, you can select the Stratify box and then select the Columns of interest that you want to stratify on -- perhaps a response variable, or just your input variables. When you stratify, JMP tries to make a subset that has the same distribution, mean, standard deviation, etc., so that you're viewing a representative subset of all your data.

Hope this helps!

DS

tbob · Dec 21, 2023 01:30 PM

I seem to be missing something. I tried to do this but when I choose a sample size of 9, the output is the original table.

SDF1 · Dec 21, 2023 02:00 PM

Hi @tbob ,

You're right, and I can confirm that behavior. My apologies: I thought stratifying in Subset would work like stratifying when creating a validation column. When you create a validation column, it works as I described before: JMP tries to give the different data sets the same distribution, mean, standard deviation, etc.

Instead, the help page for stratifying with Subset describes that the sample size you enter is the number of rows sampled per stratum. So, if you stratify on a column that has 4 levels and you choose a sample size of 4, you get 4 x 4 = 16 rows in the subset, because there are 4 levels (strata) and you selected 4 samples per stratum.
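To make the per-stratum arithmetic concrete, here is a minimal Python sketch of that sampling logic. It is not JMP code: the `stratified_sample` helper, the `group` and `id` columns, and the rule that a stratum smaller than the requested size is returned whole are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_sample(rows, key, n_per_stratum, seed=0):
    """Draw up to n_per_stratum rows at random from each stratum.

    rows is a list of dicts; key names the stratification column.
    Assumption of this sketch: a stratum with fewer rows than
    n_per_stratum is returned whole.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for row in rows:
        strata[row[key]].append(row)
    sample = []
    for members in strata.values():
        k = min(n_per_stratum, len(members))
        sample.extend(rng.sample(members, k))
    return sample


# Hypothetical table: 40 rows, a "group" column with 4 levels (10 rows each).
rows = [{"id": i, "group": "ABCD"[i % 4]} for i in range(40)]

# 4 strata x 4 rows per stratum = 16 rows.
four_by_four = stratified_sample(rows, "group", 4)
print(len(four_by_four))  # 16

# Stratifying on a column that is unique per row makes every row its own
# stratum, so under this sketch's assumption every row comes back.
by_unique_id = stratified_sample(rows, "id", 9)
print(len(by_unique_id))  # 40
```

If the Stratify columns in the 49-site table were unique per row, for example the x,y coordinates (a guess about the setup, not something confirmed in this thread), the second case would be consistent with getting the original table back.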

You might try not using Stratify, then compare some statistics of your subset and the main data table to make sure the distributions are similar, or compare whatever else you need to confirm that the subset represents the whole.
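One way to picture that comparison, sketched in Python rather than JMP: draw many random 9-row subsets, score each by how far its mean and standard deviation are from those of the full data, and keep the closest one. The data, the `mismatch` score, and the number of tries are all made up for illustration; which statistics matter depends on what "representative" means for your measurement.

```python
import random
import statistics


def summary(values):
    """Return (mean, sample standard deviation) of a list of numbers."""
    return statistics.mean(values), statistics.stdev(values)


def mismatch(subset, full):
    """Distance between subset and full-data mean/stdev, in full-data stdev units."""
    m_s, s_s = summary(subset)
    m_f, s_f = summary(full)
    return abs(m_s - m_f) / s_f + abs(s_s - s_f) / s_f


def best_random_subset(values, k, n_tries=2000, seed=0):
    """Try n_tries random k-subsets and keep the one with the lowest mismatch."""
    rng = random.Random(seed)
    best_idx, best_score = None, float("inf")
    for _ in range(n_tries):
        idx = sorted(rng.sample(range(len(values)), k))
        score = mismatch([values[i] for i in idx], values)
        if score < best_score:
            best_idx, best_score = idx, score
    return best_idx, best_score


# Hypothetical response values for 49 sites (simulated, not real data).
data_rng = random.Random(1)
full = [data_rng.gauss(100.0, 5.0) for _ in range(49)]

idx, score = best_random_subset(full, 9)
print("chosen row indices:", idx)
print("mismatch score:", round(score, 4))
```

This is a random search, so it finds a good subset among those tried, not a guaranteed best one out of all possible 9-site subsets.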

Alternatively, you could create a fake strata column where every entry is the same, say, setting all values to 1. Then you stratify on that column and select 9 as your sample size, and that should do it.
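The reason the constant column works can be sketched in a few lines of Python (again an illustration, not JMP code; the `site` and `stratum` column names are hypothetical). With only one distinct value there is a single stratum holding all 49 rows, so a per-stratum sample size of 9 is just a simple random sample of 9 rows.

```python
import random

# Hypothetical 49-row table with a constant "stratum" column set to 1.
rows = [{"site": i, "stratum": 1} for i in range(1, 50)]

# Group rows by the stratum column; one distinct value means one group.
strata = {}
for row in rows:
    strata.setdefault(row["stratum"], []).append(row)

# Sample 9 rows from each stratum (here, from the only one).
rng = random.Random(0)
subset = [r for members in strata.values() for r in rng.sample(members, 9)]
print(len(strata), len(subset))  # 1 9
```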

I get why it's done differently here, but I still find it a bit strange and non-intuitive, which is unusual for JMP platforms.

Hope this helps!

DS

Mark_Bailey · Dec 21, 2023 12:57 PM

How will you use the 9 selected sites? Why won't you use all 49 sites?

tbob · Dec 21, 2023 01:02 PM

The 49-site measurement is a test that qualifies production, while the 9-site measurement is taken on actual product. Product is limited to 9 sites.