I'm working with a file of approximately 8,500 records, of which approximately 100 have a response variable value of Y and the rest a value of N. I'm running a Partition model on the response variable Y and then adding in various X factors. Given the relatively low number of "responses", I have two questions about using this model.
1) Should I weight the response variable using conditional formatting (adding a new column)?
2) What percentage should I use for the validation portion (20%, 30%...)?
For that total size, I would do a 70/30 or 60/40 split. You could try both. How many factors do you have?
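With only about 100 "Y" rows, a purely random split can leave the validation set with noticeably more or fewer "Y" rows than expected. One way to guard against that is to split each class separately (a stratified split). The following is a minimal sketch in plain Python, not JMP's own validation-column logic; the function name `stratified_split` and the exact counts of 100 and 8,400 are assumptions made for illustration.

```python
import random

def stratified_split(labels, validation_fraction=0.3, seed=0):
    """Return (train_idx, valid_idx), holding out the same fraction of each class."""
    rng = random.Random(seed)
    train_idx, valid_idx = [], []
    for cls in sorted(set(labels)):
        # Row positions belonging to this class, shuffled reproducibly.
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_valid = round(len(idx) * validation_fraction)
        valid_idx.extend(idx[:n_valid])
        train_idx.extend(idx[n_valid:])
    return sorted(train_idx), sorted(valid_idx)

# Hypothetical data shaped like the file described above: 100 "Y" and 8,400 "N".
labels = ["Y"] * 100 + ["N"] * 8400
train_idx, valid_idx = stratified_split(labels, validation_fraction=0.3)

n_valid_y = sum(labels[i] == "Y" for i in valid_idx)
n_train_y = sum(labels[i] == "Y" for i in train_idx)
print(len(train_idx), len(valid_idx), n_train_y, n_valid_y)
```

Changing `validation_fraction` to 0.4 gives the 60/40 variant, so both splits can be compared on the same footing.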
I probably would, though I guess it depends a bit on what counts as a good classifier for your situation. If you want a huge penalty for missing the "Y" responses, so that you do an adequate job of predicting those, weighting would be a good idea. I'm not an expert in this area, but I would probably play around with different weights and observe the impact on the confusion matrix.
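As a rough illustration of what "play around with different weights" could look like, here is a sketch in plain Python. It builds a weight column using inverse class frequency (one common starting point, not necessarily the right weights for your problem) and tallies a confusion matrix. The helper names `balanced_weights` and `confusion_counts` are hypothetical, and the five predictions at the end are invented toy values, not model output.

```python
from collections import Counter

def balanced_weights(labels):
    """Weight each class by n / (k * class_count) so every class has equal total weight."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * count) for cls, count in counts.items()}

def confusion_counts(actual, predicted, positive="Y"):
    """Tally true/false positives and negatives, treating `positive` as the event of interest."""
    counts = {"tp": 0, "fn": 0, "fp": 0, "tn": 0}
    for a, p in zip(actual, predicted):
        if a == positive:
            counts["tp" if p == positive else "fn"] += 1
        else:
            counts["fp" if p == positive else "tn"] += 1
    return counts

# Hypothetical class counts matching the file described above.
labels = ["Y"] * 100 + ["N"] * 8400
weights = balanced_weights(labels)
weight_column = [weights[y] for y in labels]  # one weight per row, like a new column
print(weights["Y"], round(weights["N"], 4))

# Invented toy predictions, only to show how the matrix is read.
actual = ["Y", "Y", "N", "N", "N"]
predicted = ["Y", "N", "N", "Y", "N"]
cm = confusion_counts(actual, predicted)
sensitivity = cm["tp"] / (cm["tp"] + cm["fn"])  # share of actual "Y" rows that were caught
print(cm, sensitivity)
```

Raising the "Y" weight would generally be expected to move counts from `fn` to `tp` at the cost of more `fp`, which is the trade-off to watch in the confusion matrix.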
You might also want to include the profit matrix as a weighting scheme. Take a look at this link to learn a little more:
Hope that helps.
@chris_kirchberg, That is really cool and is exactly the kind of thing I had in mind. Learned something new today.
Great info, I will incorporate that. Now that I've run various validation proportions, both weighted and unweighted, I do get slightly varying results, as would be expected. Which statistics would you suggest paying the most attention to when choosing the best results from the modeling?
Don't try to optimize too much or you'll overfit your validation set. Just pick a setting that results in relative agreement between the training and validation set and go with it.
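One possible way to put a number on "relative agreement" is to compute the same statistic on both sets and look at the gap. The sketch below uses misclassification rate purely as an example statistic; the function names are hypothetical, the labels are invented toy data, and what counts as an acceptable gap is a judgment call rather than a fixed rule.

```python
def misclassification_rate(actual, predicted):
    """Fraction of rows where the predicted class differs from the actual class."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    wrong = sum(a != p for a, p in zip(actual, predicted))
    return wrong / len(actual)

def agreement_gap(train_actual, train_pred, valid_actual, valid_pred):
    """Return (train_rate, valid_rate, valid_rate - train_rate)."""
    train_rate = misclassification_rate(train_actual, train_pred)
    valid_rate = misclassification_rate(valid_actual, valid_pred)
    return train_rate, valid_rate, valid_rate - train_rate

# Invented toy labels: 1 error in 10 training rows, 1 error in 5 validation rows.
train_actual = ["N"] * 9 + ["Y"]
train_pred = ["N"] * 10
valid_actual = ["N", "N", "N", "N", "Y"]
valid_pred = ["N", "N", "N", "N", "N"]

train_rate, valid_rate, gap = agreement_gap(train_actual, train_pred, valid_actual, valid_pred)
print(train_rate, valid_rate, gap)
```

A validation rate far worse than the training rate suggests overfitting. With so few "Y" rows, the overall rate can look good even when most "Y" rows are missed, so it may be worth applying the same comparison to the "Y" rows alone.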
Thanks. And when you say relative agreement, how best would you gauge that? Similar R2 values for both training and validation?