Bayesian approach to machine learning for Bernoulli response

Feb 12, 2019 12:13 PM

I have a data set with about 2000 observations of about 150 features or covariates. I'm trying to predict a 0/1 outcome. The outcome is a risk score for disease, R1. The assay generates all the feature values (F1, ..., F150) in a database, as well as a risk value for a related disease (R0). The thought is that R0 is highly predictive for R1, and that if we use the features F1, ..., F150 we might have some sort of update rule we can apply to R0. Stated a bit differently, I'm thinking of using R0 as the Bernoulli parameter for a prior distribution of R1 and somehow training an update formula to use on R0 based on the values of F1, ..., F150.

The training algorithm would have to allow a 0/1 probability score model to be built that lets us specify a prior value for the score. This sounds like a pretty straightforward Bayesian problem. But I haven't done this before and I'm not sure how the feature information would be incorporated into the parameter update expression.

Would we generate a second probability score?

Would we use the linear predictor prior to going into the logit transform?

Would we modify R0 in some additive way, or by some multiplicative factor?

Clearly I have more questions than answers. If anyone can point me in the right direction that would be great. If JMP allows this sort of thing, I'd love to hear about it.
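One possible reading of the "update rule" asked about above is a logistic regression in which logit(R0) enters as a fixed offset, so the features adjust R0 additively on the log-odds scale (equivalently, multiplicatively on the odds). The sketch below is only an illustration under that assumption, not something the thread confirms as the right model; the function names, the plain gradient-ascent fit, and the optional ridge penalty `l2` are all hypothetical choices made for the example.

```python
import math

def logit(p):
    return math.log(p / (1.0 - p))

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_offset_logistic(X, r0, y, lr=0.1, epochs=500, l2=0.0):
    """Fit b0 and beta in  logit(p1) = logit(r0) + b0 + x . beta.

    logit(r0) is an offset with its coefficient fixed at 1, so
    b0 = 0 and beta = 0 reproduce r0 exactly (no update).
    """
    n, d = len(X), len(X[0])
    offset = [logit(p) for p in r0]
    b0, beta = 0.0, [0.0] * d
    for _ in range(epochs):
        g0, g = 0.0, [0.0] * d
        for xi, oi, yi in zip(X, offset, y):
            p = sigmoid(oi + b0 + sum(b * v for b, v in zip(beta, xi)))
            err = yi - p  # gradient of the Bernoulli log-likelihood
            g0 += err
            for j in range(d):
                g[j] += err * xi[j]
        b0 += lr * g0 / n
        beta = [b + lr * (gj / n - l2 * b) for b, gj in zip(beta, g)]
    return b0, beta

def update_r0(r0_i, x_i, b0, beta):
    """Updated probability for one observation."""
    return sigmoid(logit(r0_i) + b0 + sum(b * v for b, v in zip(beta, x_i)))

# Toy data (made up): one feature, prior risk 0.5 for everyone.
X = [[1.0]] * 4 + [[-1.0]] * 4
r0 = [0.5] * 8
y = [1, 1, 1, 0, 0, 0, 0, 1]
b0, beta = fit_offset_logistic(X, r0, y)
p_updated = update_r0(0.5, [1.0], b0, beta)
```

Under this formulation, a fitted `beta` that is indistinguishable from zero on held-out data would suggest the features carry little information about the outcome beyond what R0 already captures.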

2 ACCEPTED SOLUTIONS

I'm a bit confused - you say your model uses 100% of the features (150 of them!). Virtually all machine learning models I have run never use all the features. You may enter them all as candidates, but rarely (never) do they all get used. In any case, if you think your model is too complex, you can drop out each feature individually (tedious, but someone can probably help script this) to see how much the performance goes down. An alternate version of this - which I have seen in other software but not in JMP yet - is to replace each feature (one at a time) with random data and compare the performance.
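The "replace each feature with random data" idea can be sketched roughly as follows. This version shuffles one column at a time (a common variant usually called permutation importance) rather than drawing fresh random values, and it assumes a hypothetical `predict` function for an already-fitted model; both are assumptions of the example, not anything specific to JMP.

```python
import random

def accuracy(y_true, y_pred):
    return sum(int(a == b) for a, b in zip(y_true, y_pred)) / len(y_true)

def permutation_importance(predict, X, y, metric=accuracy, n_repeats=20, seed=0):
    """Average drop in `metric` when each feature column is shuffled.

    `predict` maps one row (a list of feature values) to a prediction.
    A drop near zero means the model barely relies on that feature.
    """
    rng = random.Random(seed)
    baseline = metric(y, [predict(row) for row in X])
    drops = []
    for j in range(len(X[0])):
        total = 0.0
        for _ in range(n_repeats):
            col = [row[j] for row in X]
            rng.shuffle(col)  # break the link between feature j and y
            X_perm = []
            for row, v in zip(X, col):
                new_row = list(row)
                new_row[j] = v
                X_perm.append(new_row)
            total += baseline - metric(y, [predict(r) for r in X_perm])
        drops.append(total / n_repeats)
    return baseline, drops

# Toy example (made up): the model only looks at feature 0.
def toy_predict(row):
    return int(row[0] > 0)

X = [[1, 5], [-1, 3], [1, 2], [-1, 7], [1, 1], [-1, 0]]
y = [toy_predict(r) for r in X]
baseline, drops = permutation_importance(toy_predict, X, y)
```

It is usually more informative to compute the drops on held-out data rather than on the rows used for training.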

You also appear to want to see if you can get better predictions by adding something to the model you already have (what is left to add if all the features are in it already?). I think there are many ways to do this - but I think more about trying different models, and using some kind of ensemble method to combine them, rather than using additional features. Bayesian model averaging is one way to do this. But your original idea about using your first model as a prior and then adding features to produce a "Bayesian update" still doesn't make sense to me.

Yes, 100% of the features are used to calculate p0. The system actually generates thousands of features, but we needed to be able to validate the algorithm for computing p0 offline. So we found (through reading the code) every feature that was used to calculate p0 and saved them. Mind you, the "calculation" for p0 as a function of the features is too complex and implicit to ever let you write it down (hundreds of binary classifiers in parallel and serial configurations; I didn't build the system). But we do know that we have captured all the necessary and sufficient features to do an offline simulation and validate the computation for p0.

My goal is not to modify anything about the model for calculating p0. It is to see if, by chance, the very same features might contain some "magic" information that allows us to predict some other disease. My thought is that this is unlikely, especially since the features have been generated specifically to, in some optimal and minimal manner, predict D0. There are probably some information-theoretic ways to test this assumption.

Again, I need to come up with a fair and principled way to either show that the concept just will not work or that it will. It has been easy to generate plenty of red flags (covariate drift as an example). My attempt to use some sort of modified "Bayesian" approach is my way of using the best statistic the features have been designed to generate, and then seeing if there is some independent, residual information that might **help** predict D1. I'm a skeptic, but I'm among a group where others (management) have a dogmatic view that machine learning can magically solve any problem imaginable. They don't appreciate that most of these models (LASSO, ridge, PLS) are, at the end of the day, just simple main effects (generalized) linear models.

Once again, I appreciate your engagement. It has been very helpful.

4 REPLIES

Re: Bayesian approach to machine learning for Bernoulli response

I am by no means a Bayesian expert, but it seems to me that what you want to do is not correct. Bayesian updating usually refers to incorporating new information - such as additional data points. Why add additional features? Just put them in your original model. Or, if using something like boosted trees, the algorithm already will use the features to "update" its predictions based on the degree to which it is making incorrect classifications. In other words, I think you are suggesting an apparatus that would appear "Bayesian" but in fact is just examining how much each feature contributes to the model, given the other features in the model. That is essentially what all predictive models do as is - leaving some features out manually and then adding them in to see how predictions change seems like just manually doing what predictive models do automatically.

Please, if I'm wrong about this, somebody straighten me out.

Re: Bayesian approach to machine learning for Bernoulli response

Thanks Dale,

I am probably quite in agreement with you. The whole intent of the project is to see if the features, already used to train on D0, might have some "orthogonal" information that can be used to predict D1. Restated: does the noise in the original model for p0, which is an incredibly complex model based 100% on the features, contain any information that can be used to train for p1? I want to keep p0 as the initial estimate since it has already proven to be quite useful for D0.

If I just go ahead and naively use something like LASSO (or ridge or PLS or ...) to train a model for p1 based on the features, I get better performance than predicting D1 using only p0. But I'm concerned that I might unknowingly be training on noise with all the CV going on. Moreover, I have evidence from independent experiments that the model for p1 suffers from covariate shift while p0 is quite resilient to different data sets.
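One rough way to probe the "training on noise" worry is a label-permutation test: shuffle the outcome, re-run the entire cross-validated pipeline, and see how often the shuffled data scores as well as the real data. The sketch below assumes a hypothetical `cv_score(X, y)` callable that wraps the whole procedure (including any feature selection and tuning); the leave-one-out 1-nearest-neighbour scorer is only a stand-in so the example runs, not the LASSO/ridge/PLS models discussed here.

```python
import random

def permutation_test(cv_score, X, y, n_perm=200, seed=0):
    """Compare the observed CV score with scores on shuffled labels.

    Returns the observed score, the null scores, and a one-sided
    p-value using the usual (count + 1) / (n_perm + 1) correction.
    """
    rng = random.Random(seed)
    observed = cv_score(X, y)
    null_scores = []
    for _ in range(n_perm):
        y_perm = list(y)
        rng.shuffle(y_perm)  # destroys any real feature/outcome link
        null_scores.append(cv_score(X, y_perm))
    exceed = sum(s >= observed for s in null_scores)
    p_value = (exceed + 1) / (n_perm + 1)
    return observed, null_scores, p_value

def loo_1nn_accuracy(X, y):
    """Stand-in scorer: leave-one-out accuracy of a 1-nearest-neighbour rule."""
    hits = 0
    for i, xi in enumerate(X):
        nearest = min(
            (j for j in range(len(X)) if j != i),
            key=lambda j: sum((a - b) ** 2 for a, b in zip(xi, X[j])),
        )
        hits += int(y[nearest] == y[i])
    return hits / len(X)

# Toy data (made up): two well-separated clusters.
X = [[0.0], [0.1], [0.2], [0.3], [0.4], [5.0], [5.1], [5.2], [5.3], [5.4]]
y = [0] * 5 + [1] * 5
observed, null_scores, p_value = permutation_test(loo_1nn_accuracy, X, y)
```

Note that shuffling the outcome also breaks its relationship with p0, so this only checks whether the cross-validated score beats chance. Judging whether the features add anything beyond p0 would need a comparison against a p0-only baseline on the same folds.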

Of course it comes down to me to give either a thumbs down or up on the whole concept, and so I want to simultaneously challenge the concept as well as give it the best chance. That's why I'm leaning toward this "Bayesian" approach. It's a bit of a stretch to call it Bayesian, I agree, because we don't get to observe a Bernoulli event; we only get to look at the predicates in a model for the prior parameter.

I appreciate your thoughts.
