
## Logistic Analysis > Receiver Operator Curve Analysis: what are the assumptions about data structure, distribution and sampling?

Hi JMP Community,

When conducting a logistic analysis (i.e., Y = categorical and X = continuous), are there any assumptions about data structure, distribution, and/or sampling method, especially with regard to ROC analysis?

Specifically, in clinical research aimed at identifying predictors of response to treatment, I have noticed that some of my colleagues select the patients with the best and the worst responses, excluding the "middle" of the response distribution. Once they identify a candidate predictor, they go on to produce ROC analyses on the same biased data set, and the AUCs appear to be quite inflated. Does this intentional selection of patients violate the statistical assumptions built into ROC analysis?

Best,

TS

Thierry R. Sornasse
Accepted Solutions

## Re: Logistic Analysis > Receiver Operator Curve Analysis: what are the assumptions about data structure, distribution and sampling?

This is a complicated question, but the simple answer is "Yes, the data can skew the ROC curve". The ROC curve is based on a prediction model, so if "bad" data were used to build that model, the ROC curve will not be reliable either.

The ROC curve is a way to examine the sensitivity and specificity of a decision rule (a rule to predict a binary outcome, Y = 0 or 1) across varying thresholds. The model (M) is assumed to produce a numeric output, and the decision is made by comparing that output with a threshold value (T) using the rule

If(M > T) --> Predict Y=1, otherwise --> predict Y=0.

The ROC curve lets you see what effect the threshold has on classification. In general, the higher the area under the ROC curve (AUC), the better the model is at classifying correctly over a range of potential thresholds.
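As a rough illustration of the rule above, here is a minimal Python sketch (not JMP's implementation) that sweeps the threshold T over a set of model outputs and records sensitivity and 1 - specificity at each step. The function names `roc_points` and `auc`, and the four data points, are made up for this example.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) pairs, one per threshold T."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    # Sweep T from the highest score down to -inf so the curve runs (0, 0) -> (1, 1).
    thresholds = sorted(set(scores), reverse=True) + [float("-inf")]
    points = []
    for t in thresholds:
        predicted = [1 if m > t else 0 for m in scores]  # rule: M > T -> predict Y=1
        tp = sum(1 for p, y in zip(predicted, labels) if p == 1 and y == 1)
        fp = sum(1 for p, y in zip(predicted, labels) if p == 1 and y == 0)
        points.append((fp / n_neg, tp / n_pos))  # (1 - specificity, sensitivity)
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Made-up model outputs M and true outcomes Y, for illustration only.
scores = [0.1, 0.4, 0.35, 0.8]
labels = [0, 0, 1, 1]
curve = roc_points(scores, labels)
print(curve)
print(auc(curve))
```

In this toy data, three of the four (Y=1, Y=0) pairs are ranked in the right order by the model output, which is why the area comes out at 0.75.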

No other assumptions are really being made about the underlying data or distribution.


## Re: Logistic Analysis > Receiver Operator Curve Analysis: what are the assumptions about data structure, distribution and sampling?

Hi,
Thank you for your feedback. Just to clarify, the example cited in my original post is about exploratory biomarker research and not about clinical development: decision making about patients' health is never conducted that way.
Best,
TS
Thierry R. Sornasse

## Re: Logistic Analysis > Receiver Operator Curve Analysis: what are the assumptions about data structure, distribution and sampling?

Even though the use case does not involve patient outcomes, omitting the middle of the biomarker distribution will inflate the estimates of sensitivity and specificity. There is no need or reason to exclude those data. Why wait to find out later, when all the data are used, that the performance is much lower or not even reproducible? This practice will set false expectations.
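The inflation can be reproduced with a small simulation. The Python sketch below is only an illustration under stated assumptions: a normally distributed response, a biomarker correlated with it at an arbitrary 0.5, responders defined by a median split, and "extremes" defined as the top and bottom quartiles of response, as in the practice described in the original question. None of these numbers come from real data.

```python
import random


def auc_pairwise(pos, neg):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


random.seed(1)  # arbitrary seed so the run is repeatable
RHO = 0.5       # assumed correlation between biomarker and response
N = 1000        # assumed cohort size

patients = []
for _ in range(N):
    response = random.gauss(0, 1)
    biomarker = RHO * response + (1 - RHO ** 2) ** 0.5 * random.gauss(0, 1)
    patients.append((response, biomarker))

patients.sort()  # order patients by response, worst to best
half, quarter = N // 2, N // 4

# Representative cohort: everyone, split into responders and non-responders at the median.
auc_full = auc_pairwise([b for _, b in patients[half:]],
                        [b for _, b in patients[:half]])

# Cherry-picked cohort: best quartile vs. worst quartile, middle half discarded.
auc_extremes = auc_pairwise([b for _, b in patients[-quarter:]],
                            [b for _, b in patients[:quarter]])

print("AUC, all patients: ", round(auc_full, 3))
print("AUC, extremes only:", round(auc_extremes, 3))
```

The biomarker and its true relationship to response are identical in both calculations; only the patient selection differs, and the extremes-only AUC comes out higher.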

Learn it once, use it forever!

## Re: Logistic Analysis > Receiver Operator Curve Analysis: what are the assumptions about data structure, distribution and sampling?

Hi Mark,
Thank you for your feedback: this is exactly why I asked the question. I'm indeed concerned about setting unreasonable expectations with this biased approach: when I present ROC analysis data using a representative patient population (random selection), I'm often told that the results from my cherry-picking colleagues are much "stronger". Hence the need for solid justification on my part.
Best,
TS
Thierry R. Sornasse

## Re: Logistic Analysis > Receiver Operator Curve Analysis: what are the assumptions about data structure, distribution and sampling?

Perhaps someone should question your colleagues about the validity and rigor of their approach. On what statistical principles is it based? What are the risks of this approach? What biases does it introduce to estimates of parameters, AUC, and any other data-dependent results? Are your colleagues experts in statistical analysis? Have they invited qualified experts to validate their approach?

(Also be aware that such a practice can cause the separation or quasi-separation problem, which leads to unstable estimates of the logistic model parameters. You might know about this issue already.)
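For readers who have not run into separation before, here is a minimal Python sketch of why it destabilizes the fit. It assumes a toy data set in which every Y=0 case has a lower X than every Y=1 case, and it evaluates the logistic log-likelihood for ever steeper slopes with the cut point held midway between the groups. The data and the helper names are hypothetical.

```python
import math


def log_sigmoid(z):
    """Numerically stable log(1 / (1 + exp(-z)))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def log_likelihood(slope, cut, xs, ys):
    """Log-likelihood of the model P(Y=1 | x) = sigmoid(slope * (x - cut))."""
    total = 0.0
    for x, y in zip(xs, ys):
        z = slope * (x - cut)
        total += log_sigmoid(z) if y == 1 else log_sigmoid(-z)
    return total


# Toy data with complete separation: all Y=0 lie below 3.5, all Y=1 lie above it.
xs = [1, 2, 3, 4, 5, 6]
ys = [0, 0, 0, 1, 1, 1]

slopes = [1, 10, 100]
lls = [log_likelihood(b, 3.5, xs, ys) for b in slopes]
for b, ll in zip(slopes, lls):
    print("slope", b, "log-likelihood", ll)
```

The log-likelihood keeps climbing toward 0 as the slope grows, so no finite slope maximizes it. That is the sense in which the parameter estimates become unstable when the groups are fully separated on the predictor.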

Reminds me of the early days of ad hoc data analysis of microarrays...
