Follow-up Question on Scoring a New Data Table Using a Model

jenkins_macedo

Community Trekker

Joined:

Jul 13, 2015

This is a follow-up to a similar question asked a few weeks ago. Let's say you used a set of historical data on customers' confirmed enrollments (the dependent variable) in a series of products that your organization provides. The model was developed using several input variables (independent variables), for example: number of orders, order type, product ordered, products added to cart, call log frequency, call type, call recency, refer a friend (RAF), RAF Recency, # Pageviews, Time on Page (seconds), page view product type, # clicks, # Times promoted by Email, # Times promoted by Direct Mail, # Contribution (Donation), Donation Type, Donation $, Donation Recency, etc.


Thus, # confirmed enrollments is a function of {Var 1, Var 2, Var 3, Var 4, Var 5, Var 6, Var 7, Var 8, Var 9, Var 10, etc.}.

At the end of the model development process, I saved the prediction formula to the data table used to develop the model. Thus, the prediction formula predicts # confirmed enrollments as described above.

Now, I want to use this model to score a new data table with similar variables. My question is: should the new data table include # confirmed enrollments, or should it include only the variables that are listed in the prediction formula? My reasoning so far is that the prediction formula predicts # confirmed enrollments and contains the coefficient estimates for each of those variables, so having # confirmed enrollments in the new data table that is to be scored with the prediction formula would make the scores autocorrelate with the response.

I just want to make sure that I am doing the right thing. Any expert advice would be appreciated.

Jenkins

Jenkins Macedo
Accepted Solution

Jenkins,

If I understand your situation correctly, you have a new table with VARS (some number of predictors), PREDICTED ENROLLMENTS (this is your formula column that uses the VARS), and then CONFIRMED ENROLLMENTS (true measured results). Your data table does not need the confirmed enrollments to score your data using your prediction formula (the column is not part of the formula). However, you do not need to remove the column from your table. If your goal is to "test" (validate, verify, etc.) your model, then you would want to compare your actuals (CONFIRMED) to the predicted values and see how "good" your model's performance was, and for this you would need both the predicted and confirmed columns. Note that "good" is in quotes because what is good in one setting could be bad in another (it depends on the risk/benefit of your prediction being correct or not).
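To make the distinction concrete, here is a minimal Python sketch (not JMP's own scripting language) of scoring versus validating. The coefficients, column names, and rows are all hypothetical, made up purely for illustration; the point is only that the scoring function reads the predictor columns and never touches the confirmed enrollments.

```python
# Hypothetical stand-in for a saved prediction formula: an intercept plus
# one coefficient per predictor. These numbers are invented for illustration.
INTERCEPT = 0.25
COEFFICIENTS = {"num_orders": 0.125, "num_pageviews": 0.0625}


def score_row(row):
    """Apply the formula using only the predictor columns it names."""
    return INTERCEPT + sum(coef * row[name] for name, coef in COEFFICIENTS.items())


# Hypothetical new data table. The second row has no actuals yet,
# and it can still be scored.
new_table = [
    {"num_orders": 2, "num_pageviews": 10, "confirmed_enrollments": 1},
    {"num_orders": 0, "num_pageviews": 4},
]

# Scoring: add the predicted column to every row.
for row in new_table:
    row["predicted_enrollments"] = score_row(row)

# Validation: only rows that have actuals can be compared with predictions.
validation_rows = [r for r in new_table if "confirmed_enrollments" in r]
errors = [
    r["confirmed_enrollments"] - r["predicted_enrollments"] for r in validation_rows
]

print([r["predicted_enrollments"] for r in new_table])
print(errors)
```

Leaving the confirmed column in the table is harmless here because `score_row` only looks up the names listed in `COEFFICIENTS`.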

Karen

jenkins_macedo


Hi Karen, you understood the concept correctly, just as I described it. All the variables in the new data table are those listed in the prediction formula. In addition, I have a separate column containing the actual enrollments for each customer, which is not part of the scoring. A column with the prediction formula then scores all those variables, and the results are compared by sorting the scores from largest to smallest and ranking them into deciles. That way, we can validate the model's performance against the new data table that was scored using the model. Does that make sense?

In our case, we used the model to score 1.2 million customers using the model's variables, brought in those 1.2 million customers' actual enrollments as a separate column, sorted the model scores (the scoring did not include # confirmed enrollments, because that is what the model is predicting), created deciles based on the sorted scores, and graphed the deciles against the actual enrollments to see how each decile's predicted enrollments compared with its actual enrollments.
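The decile comparison described above can be sketched roughly as follows in plain Python. This is only an illustration under assumptions: the function name `decile_table`, the equal-count binning rule, and the toy scores and actuals are all hypothetical, and a real run on 1.2 million rows would more likely be done inside JMP or with a data-frame library.

```python
def decile_table(scores, actuals, n_bins=10):
    """Rank rows by score (largest first), cut them into equal-count bins,
    and report the mean predicted and mean actual value in each bin."""
    ranked = sorted(zip(scores, actuals), key=lambda pair: pair[0], reverse=True)
    n = len(ranked)
    table = []
    for b in range(n_bins):
        # Integer arithmetic keeps bin sizes within one row of each other.
        chunk = ranked[b * n // n_bins:(b + 1) * n // n_bins]
        if not chunk:
            continue  # fewer rows than bins
        table.append({
            "decile": b + 1,  # decile 1 holds the highest scores
            "count": len(chunk),
            "mean_predicted": sum(s for s, _ in chunk) / len(chunk),
            "mean_actual": sum(a for _, a in chunk) / len(chunk),
        })
    return table


# Toy data, invented for illustration: 20 customers with made-up scores
# and made-up actual enrollment counts.
scores = [float(i) for i in range(1, 21)]
actuals = [i // 5 for i in range(1, 21)]

table = decile_table(scores, actuals)
for row in table:
    print(row)
```

If the model ranks customers well, `mean_actual` should fall steadily from decile 1 to decile 10, and comparing it with `mean_predicted` in each row shows where the model over- or under-predicts.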

If this is right, then I am doing the right thing. I have seen others do it this way, and I just wanted to share it with folks like you and the many others here so I can learn from your expertise and have you confirm that I am doing it right.

Jenkins Macedo