
Is it okay to transform categorical data into weighted percentages?

keffinger

New Contributor

Joined:

Mar 30, 2017

I am comparing the efficiency of ovens by measuring consistency in cookie baking (well, not really, I'm just using this for an example). After summarizing data, I have two responses, "Cooked Evenly" and "Cooked Unevenly".

 

Can I create a new column "% Cooked Evenly" and run Fit Y by X where Y = % Cooked Evenly, X = oven, and Freq = Batches of Cookies? In doing this, I think I'm giving weight (from the number of batches) to the percentages. Running the data this way lets me run Means/Anova and Student's t test to get a connecting letters report, making it easy to compare several ovens at the same time.

Normally, I stack the "Cooked evenly" and "Cooked Unevenly" and run a contingency analysis (observing proportions) on two ovens at a time, finding out which are significantly different from each other. I often find it difficult to get a nice statistics overview when I use contingency analysis on multiple ovens at the same time.

 

After running data in both scenarios, I get the same overview of which ovens are significantly outperforming other ovens, but I get the information in a fraction of the time by giving "weight" to a percentage.

 

Is this an acceptable way to approach analyzing categorical data? Any and all criticisms are welcome! Thanks!

6 REPLIES
markbailey

Staff

Joined:

Jun 23, 2011

You have one response with two outcomes. Why not use logistic regression? This method is based on the proportions/probabilities as you describe but you don't have to manually transform the data. Make sure that the response column uses the nominal modeling type, then select Analyze > Fit Model and enter this column in the Y role. It will automatically switch to nominal logistic regression. Then build the linear predictor with any effects that you want to test.

The analysis platform will provide all the features that you need including the Prediction Profiler.

So to be clear, you do not need to summarize the outcomes and transform the response to proportions. Logistic regression will take care of everything for you.

Learn it once, use it forever!
keffinger

New Contributor

Joined:

Mar 30, 2017

Thanks for the feedback.

When I run the fit model, I'm assuming I still stack the responses? So that Y = Label (cooked evenly/cooked unevenly), X = oven, Freq = Data (counts of batches cooked evenly or cooked unevenly).
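For readers following along outside JMP, the stacked layout described here can be sketched in a few lines of Python. This is only an illustration, not JSL: the oven names and counts are made up, and the column names `Label` and `Count` simply mirror the Y and Freq roles mentioned in the post.

```python
# Hypothetical summarized data: one row per oven, two count columns.
wide = [
    {"Oven": "A", "Cooked Evenly": 45, "Cooked Unevenly": 5},
    {"Oven": "B", "Cooked Evenly": 30, "Cooked Unevenly": 20},
    {"Oven": "C", "Cooked Evenly": 38, "Cooked Unevenly": 12},
]


def stack(rows, id_col, value_cols):
    """Turn each wide row into one long row per response level."""
    stacked = []
    for row in rows:
        for label in value_cols:
            stacked.append(
                {id_col: row[id_col], "Label": label, "Count": row[label]}
            )
    return stacked


# Y role = Label, X role = Oven, Freq role = Count.
stacked = stack(wide, "Oven", ["Cooked Evenly", "Cooked Unevenly"])
for r in stacked:
    print(r["Oven"], r["Label"], r["Count"])
```

Each oven ends up with two rows, one per outcome, and the counts carry the weight of each row.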

 

The thing I like about manually transforming the data is that the summary shows a connecting letters report. Where can I find this information in the logistic regression output? I'm not very familiar with this report.

ron_horne

Super User

Joined:

Jun 23, 2011

The letters report is not available as far as I know. In general, the contrast options are limited in the logistic platform.

If you ask for the odds ratios, you get a formal test for all combinations of categories.
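To give a rough idea of what a pairwise odds ratio comparison involves, here is a minimal Python sketch. The counts are invented for illustration, and the test shown is a simple Wald z-test on the log odds ratio. JMP's own report may use a different test, and nothing here adjusts for multiple comparisons.

```python
import math
from itertools import combinations

# Hypothetical counts per oven: (cooked evenly, cooked unevenly).
counts = {"A": (45, 5), "B": (30, 20), "C": (38, 12)}


def odds_ratio_test(even1, uneven1, even2, uneven2):
    """Odds ratio of oven 1 vs oven 2 with a Wald z-test on the log scale.

    Assumes all four counts are positive; a zero cell would need a
    continuity correction or an exact method instead.
    """
    log_or = math.log((even1 * uneven2) / (uneven1 * even2))
    se = math.sqrt(1 / even1 + 1 / uneven1 + 1 / even2 + 1 / uneven2)
    z = log_or / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal p-value
    return math.exp(log_or), z, p_value


results = {}
for a, b in combinations(counts, 2):
    results[(a, b)] = odds_ratio_test(*counts[a], *counts[b])
    odds_ratio, z, p = results[(a, b)]
    print(f"{a} vs {b}: OR = {odds_ratio:.2f}, z = {z:.2f}, p = {p:.4f}")
```

An odds ratio near 1 means the two ovens have similar odds of cooking a batch evenly.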

 

 

markbailey

Staff

Joined:

Jun 23, 2011

The connecting letters report is not appropriate for a binary response. This form of multiple comparisons is intended for a one-way ANOVA with a continuous response.

You manually created the proportions, but this response should not use the continuous modeling type. Regression and ANOVA assume that the true continuous response is unbounded, but your proportion is bounded to the interval [0, 1]. So while the proportion can vary continuously over this range, it is not a true continuous response.
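One way to see the practical effect of the bound is to compare an interval built directly on the proportion scale with one built on the log-odds scale. The sketch below is plain Python with a made-up oven (19 of 20 batches cooked evenly). Both intervals are simple Wald approximations, not what any particular JMP report computes.

```python
import math


def wald_ci_proportion(successes, n, z=1.96):
    """Normal-theory interval computed directly on the proportion scale."""
    p = successes / n
    half = z * math.sqrt(p * (1 - p) / n)
    return p - half, p + half


def wald_ci_logit(successes, n, z=1.96):
    """Interval on the log-odds scale, mapped back to a proportion.

    Assumes 0 < successes < n so that the log odds are finite.
    """
    failures = n - successes
    logit = math.log(successes / failures)
    half = z * math.sqrt(1 / successes + 1 / failures)

    def expit(x):
        return 1 / (1 + math.exp(-x))

    return expit(logit - half), expit(logit + half)


# Hypothetical oven: 19 of 20 batches cooked evenly.
raw_low, raw_high = wald_ci_proportion(19, 20)
logit_low, logit_high = wald_ci_logit(19, 20)
print(f"proportion scale: ({raw_low:.3f}, {raw_high:.3f})")
print(f"logit scale:      ({logit_low:.3f}, {logit_high:.3f})")
```

With these invented counts, the interval on the proportion scale extends past 1, while the back-transformed logit interval stays inside (0, 1).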

Your response is binary so you should use logistic regression or a GLM with a binomial distribution and a logit link function.
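As a rough sketch of what the binomial model with a logit link does when oven is the only effect, the Python below computes the fitted log odds per oven and a likelihood-ratio test against a model with one common probability. The counts are hypothetical, and the helper `chi2_sf` is written here only for the example so that no external library is needed. With a single categorical effect the model is saturated, so no iterative fitting is required.

```python
import math

# Hypothetical counts per oven: (cooked evenly, cooked unevenly).
# All counts are assumed positive so that every log is finite.
counts = {"A": (45, 5), "B": (30, 20), "C": (38, 12)}


def binomial_loglik(groups, probs):
    """Binomial log-likelihood, dropping the constant combinatorial term."""
    return sum(
        e * math.log(p) + u * math.log(1 - p)
        for (e, u), p in zip(groups, probs)
    )


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable, integer df >= 1."""
    h = x / 2
    if df % 2 == 0:
        terms = [h ** k / math.factorial(k) for k in range(df // 2)]
        return math.exp(-h) * sum(terms)
    terms = [
        h ** (k - 0.5) / math.gamma(k + 0.5)
        for k in range(1, (df + 1) // 2)
    ]
    return math.erfc(math.sqrt(h)) + math.exp(-h) * sum(terms)


groups = list(counts.values())

# With oven as the only effect, each fitted probability is simply the
# observed proportion for that oven.
fitted = [e / (e + u) for e, u in groups]
log_odds = {oven: math.log(e / u) for oven, (e, u) in counts.items()}

# Null model: one common probability shared by every oven.
total_even = sum(e for e, _ in groups)
total = sum(e + u for e, u in groups)
pooled = [total_even / total] * len(groups)

lr_chi2 = 2 * (binomial_loglik(groups, fitted) - binomial_loglik(groups, pooled))
df = len(groups) - 1
p_value = chi2_sf(lr_chi2, df)
print(f"LR chi-square = {lr_chi2:.2f}, df = {df}, p = {p_value:.4f}")
```

A small p-value here suggests that at least one oven differs from the others. Which ovens differ is a separate question for pairwise comparisons.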

ron_horne

Super User

Joined:

Jun 23, 2011

I agree with Mark: logistic regression is the way to go.

In addition, there is a twofold repetitive structure to the data. The ovens (A, B, C...) are repeated, and each repetition contains further repetitions in the form of batches. If all rows with the same oven letter refer to the identical oven, then the summary of the data is misleading. If they are not identical, that needs to be controlled for. The same applies to the batches.

The way the data is presented makes me think there is some dependency between the rows in the table.

ron

keffinger

New Contributor

Joined:

Mar 30, 2017

Ron - think of the different rows as different "cookie types". Sorry for the confusion.