Solved: Constraining Fit Model Output (or prediction profiler) to a Range

Shujinko · Dec 1, 2021 06:15 PM

I have three variables that influence a value (a proportion from 0 to 1). When I create a generalized linear model using the Fit Model platform, I get a prediction equation that largely works but still allows predictions below 0 or above 1 for certain input values. This is not desired, since the estimate is supposed to be a proportion.

Is there a way in which I can constrain this, either by doing a certain type of data formatting or manipulation (e.g. a log transformation), or perhaps a set of conditions set in the Fit Model platform?

Thank you!

Mark_Bailey · Dec 2, 2021 09:56 AM

You might try one of these two approaches. Both of them are based on the logit transform of the response Y. The logit takes an argument in (0, 1), which matches the constraint you want, and returns a value in (-infinity, infinity), which is what regression requires.

I mocked up some data to illustrate this approach.

The first approach is to apply the Logit transform to the response Y in the Fit Model dialog. Select Analyze > Fit Model and enter the data columns in the respective analysis roles. Select the column in the Y role, then click the red triangle next to Transform and select Logit.

The results will look familiar (you can use the Profiler, for example), but the constraint should now apply.
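
A minimal sketch of the transform approach outside JMP, in Python, for readers who want to see why the constraint holds. The data, the single predictor, and the helper names (`fit_line`, `predict_logit`) are made up for illustration and are not from the thread. The model is fitted to logit(Y), and predictions are mapped back with the inverse logit, which can only return values strictly between 0 and 1.

```python
import math

def logit(p):
    """Map a proportion in (0, 1) onto the whole real line."""
    return math.log(p / (1.0 - p))

def inv_logit(z):
    """Map any real number back into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_line(xs, ys):
    """Ordinary least squares for y = b0 + b1 * x with one predictor."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    return my - b1 * mx, b1

# Made-up illustration data: one factor x and an observed proportion y
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0.10, 0.25, 0.55, 0.80, 0.93]

a0, a1 = fit_line(xs, ys)                      # fit on the raw proportion
b0, b1 = fit_line(xs, [logit(y) for y in ys])  # fit on the logit scale

def predict_untransformed(x):
    return a0 + a1 * x

def predict_logit(x):
    # Back-transform the linear prediction so it lands in (0, 1)
    return inv_logit(b0 + b1 * x)

for x in (-5.0, 0.0, 3.0, 8.0, 10.0):
    print(f"x={x:5.1f}  untransformed={predict_untransformed(x):7.3f}  "
          f"logit fit={predict_logit(x):.6f}")
```

Far outside the observed range, the untransformed line leaves the interval while the back-transformed prediction only approaches its ends.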

The other approach is to use a Generalized Linear Model. Start as before but do not apply the transformation. Instead, click Personality and select Generalized Linear Model. Click Distribution and select Normal. Click Link and select Logit.
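
The difference from the first approach is where the normal error is assumed: here it is on the original proportion scale, and the logit only links the mean to the linear predictor. With a normal distribution, that amounts to least squares between Y and the inverse logit of the linear predictor. A rough Python sketch of that idea follows, using made-up data, one predictor, and a damped Gauss-Newton loop chosen for brevity; it is a stand-in, not a description of the algorithm JMP uses.

```python
import math

def inv_logit(z):
    """Numerically stable inverse logit; the result is always in (0, 1)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def sse(b0, b1, xs, ys):
    """Sum of squared errors on the original (proportion) scale."""
    return sum((y - inv_logit(b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

def fit_normal_logit_link(xs, ys, iters=50):
    """Least-squares fit of mean = inv_logit(b0 + b1 * x) by Gauss-Newton."""
    b0, b1 = 0.0, 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            mu = inv_logit(b0 + b1 * x)
            d = mu * (1.0 - mu)        # derivative of the mean w.r.t. the linear predictor
            r = y - mu                 # residual on the proportion scale
            g0 += d * r
            g1 += d * x * r
            h00 += d * d
            h01 += d * d * x
            h11 += d * d * x * x
        det = h00 * h11 - h01 * h01
        s0 = (h11 * g0 - h01 * g1) / det   # solve the 2x2 normal equations
        s1 = (h00 * g1 - h01 * g0) / det
        # Halve the step until the fit does not get worse
        step, current = 1.0, sse(b0, b1, xs, ys)
        while step > 1e-8 and sse(b0 + step * s0, b1 + step * s1, xs, ys) > current:
            step /= 2.0
        b0 += step * s0
        b1 += step * s1
    return b0, b1

# Made-up illustration data: one factor x and an observed proportion y
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0.10, 0.25, 0.55, 0.80, 0.93]

b0, b1 = fit_normal_logit_link(xs, ys)

def predict(x):
    return inv_logit(b0 + b1 * x)

for x in (-5.0, 0.0, 3.0, 8.0, 10.0):
    print(f"x={x:5.1f}  predicted proportion={predict(x):.6f}")
```

No response transform is applied, so the observed Y values themselves are never passed through the logit; only the predicted mean is constrained.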

Shujinko · Dec 3, 2021 02:31 PM

Thank you Mark! That's what I was looking for. As a follow-up:

In my data set, we have X1, X2, and the proportion of respondents who answered Y. Multiple respondents each saw different X1, X2 combinations (e.g. 100 respondents each looking at a set of 10 combinations) and responded either Y or something else (X).

Since my raw data has the response of either Y/X for each X1 X2 combination, I would also be able to model the probability of Y/X as a function of X1 and X2 by using the nominal logistic capability. This yields a prediction expression and a confusion matrix.

Do you have any insight regarding whether using the raw data of Y/X responses is a better or worse approach compared to using the proportions of Y? My goal is to simply be as accurate and precise as possible when making predictions with novel combinations of X1 and X2.

Mark_Bailey · Dec 6, 2021 11:11 AM

In a sense, your ratio Y/X is already a probability. The maximum likelihood estimator for the probability of an event is the proportion of the event out of the total events and non-events.

Many people use logistic regression for your purpose. The raw data are counts like you have. The response is event or non-event, and it is modeled with a binomial distribution. Logistic regression, or equivalently a GLM with the binomial distribution and the logit link function, addresses these issues directly. Have you considered this approach?
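
A small Python sketch of the binomial/logit fit, assuming made-up counts, a single predictor, and independent responses (the thread's data has repeated respondents, which this ignores). The function name `fit_binomial_logit` is hypothetical. It fits the same model twice, once on aggregated event counts and once on the equivalent raw 0/1 rows, to show that the two data layouts lead to the same coefficient estimates.

```python
import math

def inv_logit(z):
    """Numerically stable inverse logit."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_binomial_logit(xs, events, trials, iters=25):
    """Maximum likelihood for P(event) = inv_logit(b0 + b1 * x) by Newton-Raphson."""
    b0, b1 = 0.0, 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, k, n in zip(xs, events, trials):
            p = inv_logit(b0 + b1 * x)
            w = n * p * (1.0 - p)      # binomial variance of the count
            g0 += k - n * p            # score for the intercept
            g1 += (k - n * p) * x      # score for the slope
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Made-up aggregated data: events out of trials at each x
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
events = [10, 25, 55, 80, 93]
trials = [100, 100, 100, 100, 100]

agg_b0, agg_b1 = fit_binomial_logit(xs, events, trials)

# The same information as one row per response (1 = event, 0 = non-event)
raw_x, raw_y = [], []
for x, k, n in zip(xs, events, trials):
    raw_x += [x] * n
    raw_y += [1] * k + [0] * (n - k)

raw_b0, raw_b1 = fit_binomial_logit(raw_x, raw_y, [1] * len(raw_x))

print(f"aggregated: b0={agg_b0:.6f}, b1={agg_b1:.6f}")
print(f"raw rows:   b0={raw_b0:.6f}, b1={raw_b1:.6f}")
```

Under these assumptions, the choice between raw rows and event counts is a matter of data layout rather than of model.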

Shujinko · Dec 7, 2021 03:22 PM

I have evaluated the following:

1. GLM (normal) Logit to predict ratios on aggregated data: AICc -57.2

2. GLM (binomial) Logit to predict Event/Non-Event on raw data: AICc 1907, gives me probability of Non-Event

3. Nominal Logistic to predict Event/Non-Event on raw data: AICc 1907, gives me probability of Event, and probability of Non-Event, and a "Most-likely" evaluation based on whether p(Event) or p(Non-Event) is greater
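
Options 2 and 3 report the same AICc, which is consistent with them being the same fit reported from opposite sides: with a logit link, the probability of Non-Event is one minus the probability of Event, and the linear predictor only changes sign. A short Python sketch with hypothetical coefficients (not estimates from this thread) also shows the "most likely" rule, which amounts to a 0.5 cutoff.

```python
import math

def inv_logit(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients for the Event side of the model
b0, b1, b2 = -3.0, 0.8, 0.5

def p_event(x1, x2):
    return inv_logit(b0 + b1 * x1 + b2 * x2)

def p_non_event(x1, x2):
    # Same model with every coefficient negated
    return inv_logit(-(b0 + b1 * x1 + b2 * x2))

def most_likely(x1, x2):
    return "Event" if p_event(x1, x2) > p_non_event(x1, x2) else "Non-Event"

for x1, x2 in [(1.0, 1.0), (2.0, 3.0), (4.0, 2.0)]:
    print(x1, x2, round(p_event(x1, x2), 4), round(p_non_event(x1, x2), 4),
          most_likely(x1, x2))
```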

I have read that a lower AICc value is an indicator of a better model. One concern that I have with using the binary raw data is misclassifications (based on the confusion matrix, I see a sizable number of mis-predictions on the training set). Would we consider the differences between AICc values (~1964) to be substantially in favor of Option 1 being the superior approach? I'm perfectly happy with a ratio as my output.

Mark_Bailey · Dec 8, 2021 12:00 PM

You cannot compare AICc when the response changes. Think of the general fitting process as finding the best Y = F(p|X) where F is a function with parameters p given the fixed data X. You can change anything about F and compare the differences with AICc, but you cannot change anything about Y.
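
A mild illustration of the same point in Python, using made-up data: the identical straight-line fit is scored once with the response as a proportion and once as a percentage. The AICc formula below assumes normal errors and counts three parameters (intercept, slope, error variance); how JMP counts parameters is an assumption here. Nothing about the quality of the fit changes, yet AICc shifts by 2 n ln(100).

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for y = b0 + b1 * x with one predictor."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    return my - b1 * mx, b1

def normal_aicc(xs, ys, k=3):
    """AICc of a straight-line fit with normal errors.

    k counts the intercept, the slope, and the error variance
    (an assumption about how parameters are counted).
    """
    n = len(xs)
    b0, b1 = fit_line(xs, ys)
    rss = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    neg2_loglik = n * (math.log(2.0 * math.pi * rss / n) + 1.0)
    return neg2_loglik + 2.0 * k + 2.0 * k * (k + 1) / (n - k - 1)

# Made-up illustration data
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y_proportion = [0.12, 0.18, 0.31, 0.42, 0.55, 0.63, 0.78, 0.85]
y_percent = [100.0 * y for y in y_proportion]

aicc_proportion = normal_aicc(xs, y_proportion)
aicc_percent = normal_aicc(xs, y_percent)

print(f"AICc with Y as a proportion: {aicc_proportion:.2f}")
print(f"AICc with Y as a percentage: {aicc_percent:.2f}")
print(f"difference: {aicc_percent - aicc_proportion:.2f}")
```

If a change of units alone moves AICc this much, a change from aggregated proportions to binary rows, with a different number of observations and a different distribution, cannot be read as evidence for one model over the other.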

JGMM · Jan 26, 2023 11:40 AM

Hi Mark,

I think I have a related question here - I've been trying to perform a LogitPct transform on data from a DoE experiment. We're interested in finding those factors that maximize measured level of product and also minimize measured levels of impurities. The LogitPct helps ensure that the Prediction Profiler doesn't give non-realistic yields of over 100% for product or under 0% for impurities.

The problem is that some of my DoE experimental data points include zero (e.g., level was below the detection limit of the instrument). Both Logit and LogitPct return errors when running the Model, since taking Log of 0 is undefined - I think bounds for Logit and LogitPct in JMP are (0,1) rather than [0,1]. I don't want to omit these rows from analysis, since that "desymmetrizes" the DoE and reduces the number of experiments. Is there anything to do beyond recoding the zeroes to "dummy" values functionally close to zero such as 0.0001?

Mark_Bailey · Jan 26, 2023 12:26 PM

Yes, you are correct that these transforms expect numbers in the domain (0,1), not [0,1] or beyond. I do what you suggest - I replace 0 with a very small positive number, like 0.001, and 1 with a number very slightly less, like 0.999. This change should work for you.
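
A minimal Python sketch of this recoding, assuming the replacement values 0.001 and 0.999 mentioned above. The helper names are hypothetical, and `logit_pct` assumes that LogitPct is the logit of the percentage divided by 100.

```python
import math

def clamp_proportion(p, eps=0.001):
    """Move values at or beyond the ends of [0, 1] just inside (0, 1)."""
    return min(max(p, eps), 1.0 - eps)

def logit(p):
    return math.log(p / (1.0 - p))

def logit_pct(pct, eps=0.001):
    """Logit of a percentage (assumed here to mean logit(pct / 100))."""
    return logit(clamp_proportion(pct / 100.0, eps))

# Made-up impurity levels in percent, including a below-detection zero
impurity_pct = [0.0, 0.4, 2.5, 12.0, 100.0]
transformed = [logit_pct(v) for v in impurity_pct]

for raw, z in zip(impurity_pct, transformed):
    print(f"{raw:6.1f} %  ->  {z:8.4f}")

# The replacement value is a modeling choice: compare two candidates for zero
print(logit(clamp_proportion(0.0, eps=0.001)), logit(clamp_proportion(0.0, eps=0.0001)))
```

The last line shows that the replacement value is not neutral: 0.001 maps to about -6.9 on the logit scale and 0.0001 to about -9.2, so it is worth checking that the conclusions hold for more than one choice.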
