Re: How to Select a quota sample from a data set

Samira

Hello everyone

I am working on a research project that requires a representative sample. I have already collected the data, but I need to select a subset that meets four quota criteria:
- Gender: 50% males and 50% females.
- Age: 1/3 from 18 to 30 years old, 1/3 from 31 to 50 years old, and 1/3 over 51 years old.
- Household income level: 1/3 low, 1/3 medium, and 1/3 high.
- Region: 1/5 of the study population from each of the five regions of the city (North, South, Centre, East, and West).

How can I create this sample?

I am using JMP Pro 17.

hogi · | Posted in reply to message from Samira 12-05-2024

sounds like a textbook exercise - is there a chapter with the solution?

Samira · | Posted in reply to message from hogi 12-05-2024

No.
I was actually discussing this with a colleague so that we can start the data analysis for a project that uses this quota sampling technique.

txnelson · | Posted in reply to message from Samira 12-06-2024

This is a complex problem. How large is the data table you are pulling data from? What sample size do you need? Your 4 columns give 2 x 3 x 3 x 5 = 90 combinations. Did you and your colleague come up with an idea of how to approach the problem?

Jim
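
A minimal sketch for checking how many rows each of the 90 combinations actually has. It assumes the four quota columns are named gender, age, income and region (names taken from the example tables further down the thread, not from the original data), and "cell_counts" is just an illustrative table name.

```jsl
dt = Current Data Table();

// One row per observed combination; Summary adds an "N Rows" count column.
cellCounts = dt << Summary(
	Group( :gender, :age, :income, :region ),
	Freq( "None" ),
	Weight( "None" ),
	Output Table( "cell_counts" )
);

// Fewer than 90 rows here means at least one combination is empty.
counts = Column( cellCounts, "N Rows" ) << Get Values;
Show( N Rows( cellCounts ), Min( counts ) );
```

The smallest count is the largest number of rows that can be drawn from every combination at once.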

hogi · | Posted in reply to message from txnelson 12-06-2024

So you don't want to pick the sample according to the actual shares in your data, but according to the fixed quota fractions?

What about the combination female, > 51 yrs, high income, North?
Should its share be 1/2 * 1/3 * 1/3 * 1/5 = 1/90? (*)
This is very easy to compute, but maybe too strict and not what is intended.
Just think of the case where the original data contain no rows for female, > 51 yrs, high income, North.

On the other hand: if only the marginal fractions 1/2, 1/3, 1/3 and 1/5 have to be fulfilled, one could construct extreme cases with 0 sampled rows for female, > 51 yrs, high income, North.
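
To make the strict interpretation concrete, here is a small arithmetic sketch. The total sample size n = 900 is a made-up number for illustration only; the thread does not state the intended sample size.

```jsl
// Number of levels per quota variable, as listed in the question.
nGender = 2;
nAge = 3;
nIncome = 3;
nRegion = 5;

nCells = nGender * nAge * nIncome * nRegion; // 90 combinations
cellShare = 1 / nCells;                      // same as 1/2 * 1/3 * 1/3 * 1/5

n = 900;               // hypothetical total sample size
perCell = n / nCells;  // rows required from every single combination

Show( nCells, cellShare, perCell );
```

Under this interpretation, every one of the 90 combinations must have at least perCell rows in the collected data, which is exactly where the strict version can fail.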

Samira · | Posted in reply to message from hogi 12-06-2024

Thanks for your reply.

The idea is to conduct the research on a sample that is representative of the population of the whole city.

dlehman1 · | Posted in reply to message from Samira 12-05-2024

This is not elegant and I'm not even sure it will work, but it might. You can create 0,1 columns for each of your criteria and then combine them into a single 0,1 column where 1 means that all of the individual criterion columns were =1. That will give the desired subset. And, if you want a random selection from such rows, just use a validation column stratified by that 0,1 column. As I said, not elegant but might work.

hogi · Dec 6, 2024 09:38 AM

data table - ideal case:

New Table( "quota_samples",
	Add Rows( 100000 ),
	Compress File When Saved( 1 ),
	New Column( "gender",
		Character,
		Formula( Match( Floor( Random Uniform() * 2 ), 0, "M", "F" ) ),
		Compact(),
		Set Selected
	),
	New Column( "income",
		Character,
		Formula(
			Match( Floor( Random Uniform() * 3 ), 0, "low", 1, "medium", "high" )
		),
		Compact()
	),
	New Column( "region",
		Character,
		Formula(
			Match( Floor( Random Uniform() * 5 ),
				0, "N",
				1, "S",
				2, "W",
				3, "E",
				"C"
			)
		)
	),
	New Column( "age",
		Character,
		Formula(
			Match( Floor( Random Uniform() * 3 ), 0, "young", 1, "medium", "old" )
		)
	)
)

hogi · Dec 8, 2024 4:11 AM

For such an ideal table (that is, if the proportions in your data set already match the fractions you want), you can pick random samples from the full table, random samples per variant, a fixed number of samples per variant, or a combination of all three.
You will get the subgroups with the requested fractions (1/2, 1/3, 1/3 and 1/5), up to random fluctuation for the first two options.

// random sampling: full data set
// cum_prob = random rank / row count; filtering on cum_prob <= p keeps a fraction p of all rows
If( Not( Current Data Table() << Has Column( "cum_prob" ) ),
	New Column( "cum_prob",
		Formula( Col Rank( Random Uniform() ) / (Col Number( 1 )) )
	)
);

// random sampling: per variant
// same idea, but ranked within each gender/income/region/age combination
If( Not( Current Data Table() << Has Column( "cum_prob_indiv" ) ),
	New Column( "cum_prob_indiv",
		Formula(
			tmp = Random Uniform(); // tmp = 1; // **)
			Col Rank( tmp, :gender, :income, :region, :age ) /
			(Col Number( tmp, :gender, :income, :region, :age ))
		)
	)
);

// force ratios: 1/2, 1/3, 1/3, 1/5
// rank_indiv = random rank within each combination; filtering on rank_indiv <= M keeps M rows per combination
If( Not( Current Data Table() << Has Column( "rank_indiv" ) ),
	New Column( "rank_indiv",
		Formula( Col Rank( Random Uniform(), :gender, :income, :region, :age ) )
	)
);

Graph Builder(
	Size( 518, 448 ),
	Show Control Panel( 0 ),
	Graph Spacing( 4 ),
	Variables( X( :gender ), X( :income ), X( :region ), X( :age ) ),
	Elements( Position( 1, 1 ), Bar( X,  Summary Statistic( "N" ) ) ),
	Elements( Position( 2, 1 ), Bar( X,  Summary Statistic( "N" ) ) ),
	Elements( Position( 3, 1 ), Bar( X,  Summary Statistic( "N" ) ) ),
	Elements( Position( 4, 1 ), Bar( X,  Summary Statistic( "N" ) ) ),
	Local Data Filter(
        Title( "how many samples do you want ? " ),
		Add Filter(
			columns( :cum_prob, :cum_prob_indiv, :rank_indiv )
		)
	)
);

**) Instead of using CDFs with Random Uniform(), one can randomize the row order and use CDFs of "1".

hogi · Dec 6, 2024 1:12 PM

The last option also works for less systematic tables like the one below.
The only limitation: if one of the variants has only a few rows, that sets a hard upper limit on the number of samples that can be selected per variant.

It follows a simple rule:
If one of the variants (A) has just N rows, take all of them and pick the same number of random rows from each of the other variants. Strictly speaking, for variant A this is NOT "sampling".
So it may be better to pick just M << N random rows from each of the 90 variants.
You can use the same (JSL) logic; just adjust N to a lower value.

variants = {};
For Each( {gender}, {"F", "M"},
	For Each( {age}, {"young", "medium", "old"},
		For Each( {income}, {"low", "medium", "high"},
			For Each( {region}, {"N", "S", "E", "W", "C"},
				Insert Into( variants, Concat Items( {gender, age, income, region} ) )
			)
		)
	)
);

// Build an unbalanced ("unfair") table from the list of variants.
Eval(
	Eval Expr(
		New Table( "unfair",
			Add Rows( 100000 ),
			New Column( "variant", Character,
				Formula(
					variants = As Constant( Expr( variants ) );
					// if the lookup fails, fall back to a fixed variant
					Try(
						variants[Floor( Random Normal( 45, 30 ) )],
						"F young medium C"
					);
				)
			),
			// split the variant string back into its four components
			New Column( "gender", Character, Formula( Word( 1, :variant ) ) ),
			New Column( "age", Character, Formula( Word( 2, :variant ) ) ),
			New Column( "income", Character, Formula( Word( 3, :variant ) ) ),
			New Column( "region", Character, Formula( Word( 4, :variant ) ) )
		)
	)
);
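
The "pick M rows from each of the 90 variants" rule could be scripted roughly as follows. This is only a sketch under assumptions: M = 10 is a made-up value, the column names rand and pick_rank and the output table name quota_sample are illustrative, and the random numbers are stored as static values so the ranks do not change between selecting and subsetting.

```jsl
dt = Current Data Table(); // assumed to contain gender, age, income, region

// Count rows per combination; empty combinations do not appear at all.
Summarize( dt, cells = By( :gender, :age, :income, :region ), counts = Count );
nCells = N Rows( counts );
smallest = Min( counts );
Show( nCells, smallest );

M = 10; // hypothetical per-variant sample size

If( nCells < 90 | M > smallest,
	Throw( "At least one variant has fewer than M rows; lower M or relax the quotas." )
);

// Static random numbers, then a random rank within each combination.
dt << New Column( "rand", Numeric, Set Each Value( Random Uniform() ) );
dt << New Column( "pick_rank", Numeric,
	Formula( Col Rank( :rand, :gender, :age, :income, :region ) )
);
dt << Run Formulas;

// Keep the M lowest-ranked rows of every combination.
dt << Select Where( :pick_rank <= M );
quotaSample = dt << Subset(
	Selected Rows( 1 ),
	Selected columns only( 0 ),
	Output Table( "quota_sample" )
);
```

With equal counts in all 90 variants, the marginal fractions 1/2, 1/3, 1/3 and 1/5 follow automatically, at the cost of discarding most rows from the well-populated variants.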