Hi @hogi and @Samira,
I've just read through the post and the answers quickly, and I'm still not 100% clear about your objective, as a "representative" dataset can mean different things:
Do you want to better understand and model a response/phenomenon based on demographic data (age, gender, location, etc.), or do you want a "representative" sample in order to make inferences/generalizations about the population?
- In the first case, because you would like to link the response/phenomenon to specific attributes, it's best to have a balanced dataset, i.e. balanced levels for age, gender, location, etc., which makes the modeling and the interpretation of the results easier.
Using a DoE approach would help you investigate and analyze the different demographic factors as independently and individually as possible: either build your Custom design "from scratch" and find the people matching the different attribute combinations, or build a D-Optimal design with a Candidate Set approach from the demographic data you have collected.
- In the second case, because you want to generalize the results of the sample to the population, you need a sample that is representative of your population (i.e. as "biased" as the population). So perhaps you won't have 50/50 male/female, as the population might be skewed to 40/60 for example, and you need to respect this demographic specificity.
Using a stratified approach on the demographic data you have collected for your 5 regions might be helpful. You may have to consider two aspects: sampling each region in proportion to its population (make sure the ratio of sample size to region population is the same for all regions, so a more populated region will have more people in the sample dataset), and sampling within each region according to its demographic data (gender, age, etc.).
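To make the proportional idea above more concrete, here is a minimal Python sketch (not a JMP feature, just an illustration). The region names, population counts and sample size are made-up placeholders, and `proportional_allocation` is a hypothetical helper that uses largest-remainder rounding so the allocated counts always add up to the requested sample size.

```python
from math import floor


def proportional_allocation(population_counts, sample_size):
    """Split sample_size across strata in proportion to their population.

    Uses largest-remainder rounding so the result sums to sample_size.
    """
    total = sum(population_counts.values())
    exact = {k: sample_size * v / total for k, v in population_counts.items()}
    alloc = {k: floor(q) for k, q in exact.items()}
    leftover = sample_size - sum(alloc.values())
    # Hand the remaining units to the strata with the largest fractional parts.
    by_remainder = sorted(exact, key=lambda k: exact[k] - alloc[k], reverse=True)
    for k in by_remainder[:leftover]:
        alloc[k] += 1
    return alloc


# Hypothetical populations for 5 regions (placeholder numbers).
regions = {"A": 500_000, "B": 300_000, "C": 120_000, "D": 50_000, "E": 30_000}
per_region = proportional_allocation(regions, 200)

# Second level: inside one region, respect an assumed 40/60 gender split.
gender_in_d = proportional_allocation({"male": 40, "female": 60}, per_region["D"])

print(per_region)
print(gender_in_d)
```

The same function can be reused for any other demographic variable (age group, income level) once you know its distribution within each region.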
In both cases, I would recommend gathering as much data as possible on the populations of your 5 regions, to better understand and inform your sample creation method.
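Once such data are collected, the candidate-set idea from the first case could be sketched roughly as below. This is only an illustration under assumptions: it uses a simple greedy, one-point-at-a-time selection that maximizes det(X'X), which is not the algorithm JMP uses for Custom/D-Optimal designs. The function name `greedy_d_optimal`, the small ridge term and the toy candidate table (two factors coded -1/+1 plus an intercept column) are all hypothetical. NumPy is required.

```python
import numpy as np


def greedy_d_optimal(X, n_runs, ridge=1e-6):
    """Greedily pick n_runs rows of the model matrix X to increase det(X'X).

    Simplified sequential heuristic for illustration only.
    """
    X = np.asarray(X, dtype=float)
    info = ridge * np.eye(X.shape[1])  # small ridge keeps the matrix invertible
    remaining = list(range(X.shape[0]))
    chosen = []
    for _ in range(n_runs):
        inv = np.linalg.inv(info)
        # det(M + x x') = det(M) * (1 + x' M^-1 x): pick the largest gain.
        gains = [X[i] @ inv @ X[i] for i in remaining]
        best = remaining[int(np.argmax(gains))]
        chosen.append(best)
        remaining.remove(best)
        info = info + np.outer(X[best], X[best])
    return chosen


# Toy candidate set: columns are intercept, gender (-1/+1), location (-1/+1).
# It is deliberately unbalanced: one combination appears ten times.
candidates = np.array(
    [[1, 1, 1]] * 10 + [[1, 1, -1], [1, -1, 1], [1, -1, -1]]
)
picked = greedy_d_optimal(candidates, n_runs=4)
print(candidates[picked])
```

Even though the candidate table is dominated by one combination, the four selected people cover the four gender/location combinations, which is the kind of balance you want in order to estimate each factor separately.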
On a side note, you can also find literature about the segmentation of demographic data. I'm not sure the age segmentation proposed here is really helpful, because the age ranges have different widths: 18-30 (12 years), 31-50 (19 years), 51+ (20+ years?). Also, how do you define the household income levels low/medium/high? Based on analysis or on predefined thresholds/criteria?
Some examples of demographic segmentation: https://xperiencify.com/what-is-demographic-segmentation/
I hope this answer provides some ideas,
Victor GUILLER
"It is not unusual for a well-designed experiment to analyze itself" (Box, Hunter and Hunter)