- Logistic Analysis > Receiver Operator Curve Analysis: what are the assumptions a...

Jul 6, 2020 12:58 PM

Hi JMP Community,

When conducting a logistic analysis (i.e. Y = Categorical and X = Continuous), are there any assumptions in data structure, distribution, and/or sampling method especially in regard to ROC analysis?

Specifically, in clinical research aimed at identifying predictors of response to treatment, I have noticed that some of my colleagues select patients with the best response and the worst response, excluding the "middle" of the response distribution. Upon identification of a candidate predictor, they go on to produce ROC analyses (on the same biased data set) in which the AUCs appear to be quite inflated. Hence, does the intentional selection of patients violate the statistical assumptions built into ROC analysis?

Thank you for your help.

Best,

TS

Thierry R. Sornasse

1 ACCEPTED SOLUTION


This is a complicated question, but the simple answer is "Yes, the data can skew the ROC curve". The reason for this is that the ROC curve is based on a prediction model, and if "bad" data was used to generate that model, then the ROC curve will not be good either.

The ROC curve is a way to examine the sensitivity and specificity of a decision rule (a rule to predict a discrete binary outcome Y=0|1) using varying thresholds for the model. The model (M) is assumed to have a numeric output value, and the decision is then made based on a threshold value (T) using the rule

If(M > T) --> Predict Y=1, otherwise --> predict Y=0.

The ROC curve lets you see the effect the threshold has on classification. In general, the higher the area under the ROC curve (AUC), the better the model is at classifying correctly over a range of potential thresholds.

No other assumptions are really being made about the underlying data or distribution.
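As a rough illustration of the threshold rule above, here is a minimal sketch in plain Python (not JMP's implementation). The function names `roc_points` and `area_under_curve` and the toy scores are made up for this example. It sweeps T over the observed model outputs, applies "predict Y=1 if M > T", and records the false positive rate (1 - specificity) and true positive rate (sensitivity) at each threshold.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) pairs, one per threshold T,
    for the rule: predict Y=1 if score > T, otherwise predict Y=0."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    # Sweep T from "nothing predicted positive" down to "everything predicted positive".
    thresholds = [float("inf")] + sorted(set(scores), reverse=True) + [float("-inf")]
    points = []
    for t in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if s > t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s > t and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


def area_under_curve(points):
    """Trapezoidal area under the ROC points (the AUC)."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Toy data, invented for illustration only.
scores = [0.1, 0.4, 0.35, 0.8]   # model output M
labels = [0, 0, 1, 1]            # observed outcome Y
curve = roc_points(scores, labels)
print(curve)
print(area_under_curve(curve))
```

Nothing in this calculation checks how the rows were sampled: the curve simply describes whatever data it is given, which is why a selected subset can produce a flattering curve.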

## Re: Logistic Analysis > Receiver Operator Curve Analysis: what are the assumptions about data structure, distribution and sampling?

Created: Jul 7, 2020 4:27 AM | Last Modified: Jul 7, 2020 4:43 AM | Posted in reply to message from Thierry_S 07-06-2020

## Re: Logistic Analysis > Receiver Operator Curve Analysis: what are the assumptions about data structure, distribution and sampling?

Hi,

Thank you for your feedback. Just to clarify, the example cited in my original post is about exploratory biomarker research and not about clinical development: decision making about patients' health is never conducted that way.

Best,

TS


Thierry R. Sornasse

## Re: Logistic Analysis > Receiver Operator Curve Analysis: what are the assumptions about data structure, distribution and sampling?

Even though the use case does not involve patient outcomes, the omission of the middle of the biomarker distribution will inflate the estimates of sensitivity and specificity. There is no need or reason to exclude these data. Why wait to find out later, when all the data are used, that the performance is much lower or not even reproducible? This practice will set false expectations.
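The inflation can be seen in a small simulation. This is only a sketch under stated assumptions, not an analysis of anyone's real data: the response is a standard normal variable, the biomarker is the response plus independent noise, responders are defined by a median split, and the "extreme" cohort keeps only the bottom and top quartiles of response, as described in the original post. The names `auc` and `simulate`, the noise level, and the seed are all hypothetical choices.

```python
import random


def auc(scores, labels):
    """AUC as the fraction of (Y=1, Y=0) pairs in which the Y=1 case has the
    higher score; ties count as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def simulate(n=1000, noise_sd=1.5, seed=0):
    """Return (AUC on the full cohort, AUC on the extreme responders only)."""
    rng = random.Random(seed)
    response = [rng.gauss(0, 1) for _ in range(n)]
    biomarker = [r + rng.gauss(0, noise_sd) for r in response]

    # Sort patients by response, keeping each patient's biomarker value.
    x = [b for _, b in sorted(zip(response, biomarker))]
    half, quarter = n // 2, n // 4

    # Full cohort: lower half = non-responders (0), upper half = responders (1).
    full_auc = auc(x, [0] * half + [1] * (n - half))

    # Extreme cohort: worst quartile vs. best quartile, middle discarded.
    extreme_x = x[:quarter] + x[n - quarter:]
    extreme_auc = auc(extreme_x, [0] * quarter + [1] * quarter)
    return full_auc, extreme_auc


full_auc, extreme_auc = simulate()
print(f"AUC, all patients:      {full_auc:.3f}")
print(f"AUC, extremes only:     {extreme_auc:.3f}")
```

The biomarker is the same in both cohorts; only the patient selection differs. Dropping the hard-to-classify middle makes the two groups easier to tell apart, so the second AUC comes out higher even though the predictor is no better.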

Learn it once, use it forever!

## Re: Logistic Analysis > Receiver Operator Curve Analysis: what are the assumptions about data structure, distribution and sampling?

Hi Mark,

Thank you for your feedback: this is exactly why I asked the question. I'm indeed concerned about setting unreasonable expectations with this biased approach: when I present ROC analysis data using a representative patient population (random selection), I'm often told that the results from my cherry-picking colleagues are much "stronger". Hence the need for solid justification on my part.

Best,

TS


Thierry R. Sornasse

## Re: Logistic Analysis > Receiver Operator Curve Analysis: what are the assumptions about data structure, distribution and sampling?

Perhaps someone should question your colleagues about the validity and rigor of their approach. On what statistical principles is it based? What are the risks of this approach? What biases does it introduce to estimates of parameters, AUC, and any other data-dependent results? Are your colleagues experts in statistical analysis? Have they invited qualified experts to validate their approach?

(Also be aware that such a practice can cause the separation or quasi-separation problem, leading to unstable estimates of the logistic model parameters. You might know about this issue already.)
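For a single continuous predictor, the separation problem can be checked directly. The sketch below is illustrative only: `separation_status` and `log_likelihood` are hypothetical helper names, and the data are invented. It reports whether one outcome group lies entirely below the other, and then shows that under complete separation the logistic log-likelihood keeps improving as the slope grows, so no finite maximum likelihood estimate exists.

```python
import math


def separation_status(x, y):
    """Classify a single predictor x against a binary outcome y as
    'complete', 'quasi', or 'none' separation."""
    x0 = [v for v, label in zip(x, y) if label == 0]
    x1 = [v for v, label in zip(x, y) if label == 1]
    for low, high in ((x0, x1), (x1, x0)):
        if max(low) < min(high):
            return "complete"   # a threshold splits the groups perfectly
    for low, high in ((x0, x1), (x1, x0)):
        if max(low) == min(high):
            return "quasi"      # groups touch only at a shared boundary value
    return "none"


def log_likelihood(x, y, cut, slope):
    """Logistic log-likelihood with P(Y=1) = 1 / (1 + exp(-slope * (x - cut)))."""
    total = 0.0
    for v, label in zip(x, y):
        z = slope * (v - cut)
        signed = z if label == 1 else -z
        total -= math.log1p(math.exp(-signed))
    return total


# Invented data: every Y=0 value is below every Y=1 value.
x = [1, 2, 3, 6, 7, 8]
y = [0, 0, 0, 1, 1, 1]
print(separation_status(x, y))
for slope in (1, 5, 25):
    # The log-likelihood keeps rising toward 0 as the slope increases.
    print(slope, log_likelihood(x, y, cut=4.5, slope=slope))
```

Keeping only the best and worst responders makes this situation more likely, because the overlapping cases in the middle are exactly the ones that keep the slope estimate finite.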

Reminds me of the early days of *ad hoc* data analysis of microarrays...

Learn it once, use it forever!