I have a large dataset and I need to subsample it to form two groups. One is a simple random sample (SRS), which is easy. The other is a purposeful sample that will match a given distribution. I have the data for the distribution I am trying to match, and I know its mean and SD. How can I do this?
I can't think of a way to do this without scripting it. Even then, I don't know what strategy or algorithm you might use. Do you have any references or examples of how this might be accomplished? With that, the Community might be able to provide some direction on the JSL.
I think I figured it out.
I'm glad to hear it, Sarah.
Can you share what method you ended up with?
A is approximately bimodal; this is the group I want to match. B is a much larger population (10x larger) with a strong left skew. The mean of A << the mean of B. I am trying to get a subset of B that matches A.
Get a freq table for group A.
Get a histogram of group B, using the same bin width as for group A.
Take an SRS from the first bin of group B, with size n equal to the count in the first bin of group A, so that the sample of B has the same n in that bin as A.
Repeat for each bin across the histogram in order to build the complete subset of B.
Used a two-sample t-test to make sure the means weren't too different between the groups.
This took a while to build and I had to be careful not to accidentally select the wrong rows. But it worked.
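For anyone who would rather script the steps above than pick rows by hand, here is a minimal sketch in Python rather than JSL. It runs on synthetic data, not the original groups, and the function names (bin_index, matched_subsample, welch_t), the choice of 10 equal-width bins, and the seeds are all assumptions made for illustration. It stops with an error if some bin of B holds fewer rows than A needs, and it computes only the Welch t statistic, not a p-value.

```python
import math
import random
import statistics


def bin_index(x, lo, hi, n_bins):
    """Index of the equal-width bin on [lo, hi] that contains x, or None if x is outside."""
    if x < lo or x > hi:
        return None
    # The maximum value falls in the last bin rather than opening a new one.
    return min(int((x - lo) / (hi - lo) * n_bins), n_bins - 1)


def matched_subsample(a, b, n_bins=10, seed=0):
    """Return row indices of b whose per-bin counts equal the per-bin counts of a."""
    rng = random.Random(seed)
    lo, hi = min(a), max(a)  # assumes a has at least two distinct values

    # Frequency table for group A.
    counts_a = [0] * n_bins
    for x in a:
        counts_a[bin_index(x, lo, hi, n_bins)] += 1

    # Sort the rows of group B into the same bins; rows outside A's range are ignored.
    pools = [[] for _ in range(n_bins)]
    for row, x in enumerate(b):
        k = bin_index(x, lo, hi, n_bins)
        if k is not None:
            pools[k].append(row)

    # SRS without replacement from each bin of B, sized to match A.
    chosen = []
    for k in range(n_bins):
        if len(pools[k]) < counts_a[k]:
            raise ValueError(f"bin {k}: B has {len(pools[k])} rows, A needs {counts_a[k]}")
        chosen.extend(rng.sample(pools[k], counts_a[k]))
    return sorted(chosen)


def welch_t(x, y):
    """Welch two-sample t statistic (unequal variances)."""
    se = math.sqrt(statistics.variance(x) / len(x) + statistics.variance(y) / len(y))
    return (statistics.mean(x) - statistics.mean(y)) / se


# Synthetic stand-ins: A is bimodal, B is 10x larger, left-skewed, with a higher mean.
data_rng = random.Random(1)
group_a = [data_rng.gauss(40, 5) for _ in range(50)] + [data_rng.gauss(65, 5) for _ in range(50)]
group_b = [100 * data_rng.betavariate(3, 2) for _ in range(1000)]

subset_idx = matched_subsample(group_a, group_b, n_bins=10, seed=2)
subset_b = [group_b[i] for i in subset_idx]

print("rows selected:", len(subset_idx))
print("t statistic, A vs all of B:", round(welch_t(group_a, group_b), 2))
print("t statistic, A vs matched subset:", round(welch_t(group_a, subset_b), 2))
```

Because the subset is built bin by bin, its histogram matches A's by construction, so comparing the two sets of bin counts is a more direct check than the t-test alone. Returning row indices instead of values makes it harder to select the wrong rows by accident.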