dale_lehman · Jun 8, 2023 5:22 PM

I am working with some Census data that comes from a survey. There is a weight variable, because the survey uses a fairly complex sampling scheme to ensure adequate representation of different groups. I am aware that JMP will compute weighted averages when analyzing distributions if I put the weight variable in the Weight box of the dialog. I am also aware that JMP ignores the weight variable when analyzing the distribution of a nominal variable, but it is not clear to me why. In particular, I am wondering what the correct way is to describe the distribution of a nominal variable. For example, suppose there are two groups and group 1 comprises 40% of the total respondents, while group 1 respondents hold 60% of the total weight. The Distribution platform will show 40% in group 1. If I put the weight variable in the Frequency box, then group 1 will comprise 60% of the total. Which is the correct fraction in group 1? Is it the raw percentage or the percentage weighted by the survey weights? And, if it is the latter, why isn't it handled by putting the weight variable in the Weight box rather than the Frequency box?
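
The two-group example can be checked numerically. The following is a minimal sketch in Python rather than JMP; the five rows and their weights are made-up values, chosen so that group 1 has 40% of the rows and 60% of the total weight. The function names are hypothetical.

```python
def unweighted_share(groups, target):
    """Share of rows that belong to `target`, ignoring any weights."""
    return sum(1 for g in groups if g == target) / len(groups)


def freq_weighted_share(groups, weights, target):
    """Share of total weight carried by rows in `target`.

    This treats each weight as a count of copies of the row,
    which is what using the weight as a frequency amounts to.
    """
    in_group = sum(w for g, w in zip(groups, weights) if g == target)
    return in_group / sum(weights)


# Made-up data: 2 of 5 rows are in group 1 (40% of respondents).
groups = [1, 1, 2, 2, 2]
# Made-up weights: group 1 carries 18 of 30 (60% of the total weight).
weights = [9.0, 9.0, 4.0, 4.0, 4.0]

raw = unweighted_share(groups, 1)
weighted = freq_weighted_share(groups, weights, 1)
print(f"raw share: {raw:.2f}, weight-as-frequency share: {weighted:.2f}")
```

With the weight used as a frequency, group 1's share is simply its share of the total weight.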

peng_liu · Oct 27, 2020 12:34 PM

The current behavior that may cause confusion is mainly due to historical reasons, and I believe it needs to stay as it is. Otherwise, existing JSL scripts that users run in production could produce unexpected results without any warning, and such changes would be difficult to detect.
Some references may help in understanding the current behavior:
1) JMP Documentation. Basic Analysis > Distributions > Launch the Distribution Platform. https://www.jmp.com/support/help/en/15.0/jmp/launch-the-distribution-platform.shtml
2) FREQ statement and WEIGHT statement in SAS Proc Univariate Documentation: https://support.sas.com/documentation/cdl/en/procstat/63104/HTML/default/procstat_univariate_sect012... https://support.sas.com/documentation/cdl/en/procstat/63104/HTML/default/procstat_univariate_sect021...

dale_lehman · Oct 27, 2020 01:07 PM

Thank you for attempting to address my question, but your response is a bit too cryptic for me. The JMP documentation did not help; it just states what I stated in my question: weights can be used for continuous variables, but are ignored for nominal variables. I don't understand your statement regarding "historical reasons." What are these? From the other references you provided, I suspect this has something to do with SAS rather than JMP. I'm not a SAS user, so I wouldn't know. Just say what you mean.

More to the point, are you confirming that when you have survey weights, they should be cast in the frequency role for nominal variables, but in the weight role for continuous variables? That is what I have come to believe, but I'd appreciate some confirmation of that, even if the "historical reasons" must remain obscure.

peng_liu · Oct 27, 2020 03:18 PM

To answer where to put the weight variable, I need to ask what the weight variable represents. Does weight=2 mean there are two identical survey results, e.g. two people gave exactly the same answers to all questions, if each row in your data is one survey result? If so, the weight variable should go to "Freq". "Freq" literally means how many copies of the same observation there are. Otherwise, if you assign it to "Weight", we need to see how "Weight" will affect the statistics that you are interested in, and that has to be judged case by case. For example, if you put the weight variable in "Weight", then N will be the number of rows in your data table; the Mean will be the same as the Mean obtained with the weight variable in "Freq"; the Std Dev, however, will differ from the one obtained with the weight variable in "Freq". "Weight" may affect any computation whose formula involves "w_i", but one needs to check the documentation for each statistic to be sure.
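
The N / Mean / Std Dev contrast can be sketched in a few lines of Python. This is an illustration, not JMP's code: it assumes the SAS PROC UNIVARIATE default definitions, where the variance divisor is (sum of frequencies - 1) for Freq and (number of rows - 1) for Weight. The function names and the data are made up.

```python
import math


def freq_stats(x, f):
    """N, mean, and std dev when f is treated as a frequency (row copies)."""
    n = sum(f)  # N counts every copy
    mean = sum(fi * xi for xi, fi in zip(x, f)) / n
    ss = sum(fi * (xi - mean) ** 2 for xi, fi in zip(x, f))
    return n, mean, math.sqrt(ss / (n - 1))


def weight_stats(x, w):
    """N, mean, and std dev when w is treated as a weight.

    Assumed definition: the divisor is (number of rows - 1).
    """
    n = len(x)  # N stays at the number of rows
    mean = sum(wi * xi for xi, wi in zip(x, w)) / sum(w)
    ss = sum(wi * (xi - mean) ** 2 for xi, wi in zip(x, w))
    return n, mean, math.sqrt(ss / (n - 1))


# Made-up data: three rows, each with weight 2.
x = [1.0, 2.0, 3.0]
w = [2, 2, 2]

n_f, mean_f, sd_f = freq_stats(x, w)
n_w, mean_w, sd_w = weight_stats(x, w)
print(n_f, mean_f, round(sd_f, 4))
print(n_w, mean_w, round(sd_w, 4))
```

Under these assumed definitions the two means agree, while N and the standard deviation differ because the divisor changes.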

Here are two more references that might help explain the differences between Freq and Weight. Both are from SAS. And yes, SAS had a strong influence on the design of JMP platforms in the early days.
https://blogs.sas.com/content/iml/2013/09/13/frequencies-vs-weights-in-regression.html
https://support.sas.com/kb/22/600.html

As for "historical reasons", I can only speak from my own experience. I can see two reasons:
1) "Freq" used to be integer only. That is no longer true in JMP, but it still seems to be the case in some SAS procedures. Adding "Weight", which can hold larger values and fractional values, allowed some calculations that were not possible with "Freq".
2) "Weight" can be used as a clever trick in some statistical computations without changing the data. That SAS blog gives a good example of how outliers can be handled by re-weighting. In my experience, these tricks were usually developed along with the classic methodologies. They are valuable in particular situations, but one needs to be careful about what they do.

dale_lehman · Oct 27, 2020 04:13 PM

These are weights from Census Bureau surveys. Since they use a complex stratified sampling procedure, a household weight represents the number of households that a responding household stands for. While the household itself is a single respondent, it may represent 1000 households of that "type." So it does not fit the description of frequency, in that there are no other respondents with exactly those responses. But if it is used as a weight, it will be ignored in the distribution of a categorical variable. So, for example, if you want to know what % of people have characteristic x, then the % of respondents will differ from the % of the population, not just because of random sampling but also because of the sampling procedure used. It seems to me that we want to use the weights, but the only way they will influence the % with characteristic x is to use them as a frequency, not a weight. I believe that is correct for the point estimate, but probably not for estimating the standard error.

dale_lehman · Oct 27, 2020 04:41 PM

To make this clearer, here is an example from a recent survey. In response to a question about whether someone in the household has changed their educational plans, 5.3% of the 109,051 respondents gave an affirmative response. Adding the household weights in the Weight role changes nothing. Adding the household weights as a frequency provides an estimate of 4.8% of 121,520,180 households. I believe 4.8% is the correct point estimate. Of course, the standard error will be wrong; in fact, I don't think there is any easy way to get an estimate of the standard error. One must use the 80 replicate weights that Census provides for that purpose. But, for now, I am only asking about the point estimate, and why the weights should be used as frequencies for a nominal variable and as weights for a continuous variable (such as household income). I think you have answered this last question (no good reason from my perspective, but a historical reason having to do with SAS users and old scripts). Can you confirm the part about the point estimate?
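
For the standard error, the replicate-weight approach mentioned above can be sketched as follows. This is a hedged illustration in Python, not a verified recipe for this survey: the 4/R multiplier is an assumption, taken from the successive-difference replication formula that Census documents for surveys such as the American Community Survey, so check the technical documentation of the survey you are using. The data, the four replicate weight columns, and the function names are all made up.

```python
import math


def weighted_share(y, w):
    """Weighted proportion of rows with y == 1."""
    return sum(wi * yi for yi, wi in zip(y, w)) / sum(w)


def replicate_se(y, full_w, rep_ws, factor):
    """Point estimate and replicate-based standard error.

    `factor` multiplies the sum of squared deviations of the replicate
    estimates from the full-sample estimate. Its value depends on the
    survey's replication method and is an assumption here.
    """
    theta = weighted_share(y, full_w)
    reps = [weighted_share(y, rw) for rw in rep_ws]
    var = factor * sum((r - theta) ** 2 for r in reps)
    return theta, math.sqrt(var)


# Made-up data: 4 respondents, 1 = affirmative response.
y = [1, 0, 0, 1]
full_w = [10, 20, 30, 40]
# Made-up replicate weights (a real file would have 80 columns).
rep_ws = [
    [12, 18, 30, 40],
    [8, 22, 30, 40],
    [10, 20, 33, 37],
    [10, 20, 27, 43],
]

theta, se = replicate_se(y, full_w, rep_ws, factor=4 / len(rep_ws))
print(f"estimate: {theta:.3f}, replicate SE: {se:.4f}")
```

The point estimate uses only the full-sample weight; the replicate weights enter only through the spread of the re-computed estimates.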

peng_liu · Oct 28, 2020 09:29 AM

Please allow me to call your weight variable W to facilitate the discussion.

First, I want to address your conclusion that "the weights should be used as frequencies for a nominal variable and weights for a continuous variable". This is not a correct conclusion about how Freq and Weight should be used in the Distribution platform. I have explained what Freq and Weight do. To summarize, here is what I would suggest when one has difficulty deciding whether to use Freq or Weight:
1) In the majority of cases, use your W variable in Freq. In this situation, W means the count of replicates of each individual observation.
2) Use your W variable in Weight only if you know what you are doing.

Now let me come back to your W variable. As you have described, it does not squarely fit into either Freq or Weight. The purpose of W is to counter the different selection probabilities in the survey. Suppose you are interested in estimating both the population proportion and the standard error of the estimate. The correct tool to use is the SAS procedure Proc SURVEYMEANS, I believe. If the Distribution platform is the equivalent of the SAS procedure Proc UNIVARIATE, I don't think JMP has a platform equivalent to Proc SURVEYMEANS. I have checked the Categorical platform, but I don't think it is equivalent.

So to me, your real challenge is not deciding whether to put W in Freq or Weight in the Distribution platform, but understanding what calculations to perform in Distribution to replicate the results you would get if you had access to Proc SURVEYMEANS. At this point, I can confirm that your point estimate from Distribution with W in Freq will replicate the estimate from Proc SURVEYMEANS. The formulas of the two calculations are mathematically equivalent. And you are right, you won't be able to replicate the standard error of the estimate. The details of calculating the standard error can be found in the documentation of Proc SURVEYMEANS.
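
One way to see why the point estimates agree: with W in Freq, the share reported for a level is its summed frequency over the total frequency, which is the same ratio as a weighted proportion. The notation below is introduced here for illustration and is not taken from either manual: $w_i$ is the weight on row $i$, and $y_i$ is 1 if row $i$ has the characteristic and 0 otherwise.

```latex
\hat{p} \;=\; \frac{\sum_{i=1}^{n} w_i \, y_i}{\sum_{i=1}^{n} w_i}
```

The numerator is the total weight of the rows with the characteristic, and the denominator is the total weight of all rows, so only the point estimate is covered by this identity, not its standard error.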
