Hi, that is a good suggestion.
I’ll explain what I did using the following screenshot, which shows data for one bot. The real dataset contains about 15,000 bots. Please don’t use my example values to fit the model — the predicted values are based on the cumulative formula generated from all 15,000 bots.
The first three columns of the table on the left contain the data as of 90 days ago: BOT ID, Age, and Cost (1 means a repair, 0 means end of study). These three columns were used to fit the JMP Recurrence Analysis proportional Poisson Process model. (The other 15,000 or so bots are not shown; I am using one bot only to illustrate what I did.) I then obtained the cumulative formula from the fitted model. The fourth column, Age Plus 90 Days (the current age), was fed into this formula to calculate the fifth column, Predicted Repairs in 90 Days (the prediction for the current time).
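To make the "age plus 90 days goes into the cumulative formula" step concrete, here is a minimal Python sketch. The function `cumulative_repairs` and its `scale` and `shape` parameters are placeholders I made up for illustration; they are not the formula or the estimates that JMP produced, so the output will not match the 6.28 in the screenshot. The age of 851 is simply 941 minus 90.

```python
def cumulative_repairs(age, scale=300.0, shape=1.2):
    """Hypothetical power-law cumulative repair function.

    Stand-in for the cumulative formula saved from the fitted model;
    scale and shape are made-up values, not fitted estimates.
    """
    return (age / scale) ** shape


age_90_days_ago = 851                    # 941 - 90 for the example bot
current_age = age_90_days_ago + 90       # the "Age Plus 90 Days" column

# The fifth column is the formula evaluated at the current age.
predicted_cumulative = cumulative_repairs(current_age)
print(current_age, round(predicted_cumulative, 2))
```

In the real workflow the saved JMP formula would take the place of `cumulative_repairs`.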
The value 6.28 is the predicted total number of repairs for this bot at age 941. Since 3 repairs had been observed as of 90 days ago, the predicted repairs for this bot in the most recent 90 days are 6.28 − 3 = 3.28. For the overall totals, I sum the predicted repairs at the oldest age of each bot to get the total prediction, and I sum the true repairs (as of 90 days ago) of each bot to get the total true repairs. The difference between those two sums is the total predicted repairs for the most recent 90 days.
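The subtraction and the totals can be sketched in Python as follows. This is only an illustration of the bookkeeping, assuming the predicted cumulative values have already been computed by the model formula. The ID `bot_A` is a placeholder for the example bot, and `bot_B` with its values is entirely made up so that the sum has more than one term.

```python
# Each record: (bot_id,
#               repairs observed as of 90 days ago,
#               predicted cumulative repairs at the current age).
# "bot_A" carries the values from the example; "bot_B" is made up.
records = [
    ("bot_A", 3, 6.28),
    ("bot_B", 1, 1.90),
]


def predicted_window_repairs(records):
    """Per-bot predicted repairs in the last 90 days: predicted minus observed."""
    return {bot_id: predicted - observed
            for bot_id, observed, predicted in records}


per_bot = predicted_window_repairs(records)

# Totals: sum of predictions minus sum of observed repairs.
total_predicted = sum(predicted for _, _, predicted in records)
total_observed = sum(observed for _, observed, _ in records)
total_window = total_predicted - total_observed

print({bot: round(value, 2) for bot, value in per_bot.items()})
print(round(total_window, 2))
```

Summing the per-bot differences gives the same total as taking the difference of the two sums, so either order of operations works.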
The right table (current data) shows that only 3 repairs had actually occurred by day 941, the same number as 90 days ago, so the true number of repairs for this bot in the most recent 90 days is 3 − 3 = 0. I calculate the true repairs in the most recent 90 days for all bots in the same way.
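For the true counts, the comparison of the two snapshots looks roughly like this in Python. The row layout (bot ID, age, cost) follows the columns described above, but the bot ID and the individual repair ages are made up; only the counts (3 and 3) and the end-of-study ages (851 and 941) come from the example.

```python
from collections import Counter


def repair_counts(rows):
    """Count repair events (cost == 1) per bot; cost == 0 rows mark end of study."""
    return Counter(bot_id for bot_id, age, cost in rows if cost == 1)


# Rows are (bot_id, age, cost). Repair ages 200/450/700 are hypothetical.
snapshot_90_days_ago = [
    ("bot_A", 200, 1), ("bot_A", 450, 1), ("bot_A", 700, 1), ("bot_A", 851, 0),
]
snapshot_current = [
    ("bot_A", 200, 1), ("bot_A", 450, 1), ("bot_A", 700, 1), ("bot_A", 941, 0),
]

before = repair_counts(snapshot_90_days_ago)
now = repair_counts(snapshot_current)

# Use every bot in the current data, so bots with zero repairs are included
# (a Counter returns 0 for a missing key).
bots = {bot_id for bot_id, _, _ in snapshot_current}
actual_window = {bot: now[bot] - before[bot] for bot in bots}
total_actual_window = sum(actual_window.values())

print(actual_window, total_actual_window)
```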
When I compare the two numbers, I consistently see an overprediction. For example, for all 15,000 bots, the number of true repairs in the most recent 90 days is 6,996, while the predicted repairs for the same bots are 9,306 — a 33.0% overprediction. I’ve tried several time windows and always get around a 30% overprediction.
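For reference, the overprediction percentage is computed relative to the true count. A quick check of the arithmetic in Python, using the two totals quoted above (the variable names are just for illustration):

```python
actual_total = 6996      # true repairs in the most recent 90 days, all bots
predicted_total = 9306   # predicted repairs for the same bots and window

excess = predicted_total - actual_total
overprediction_pct = excess / actual_total * 100

print(excess)                        # 2310
print(f"{overprediction_pct:.1f}%")  # 33.0%
```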
I’m concerned that I might have done something wrong.
What do you think? Thanks
I have also attached a simulated dataset in case anyone wants to play around with it.
The data has three columns: Machine ID, Cost (1 means a repair, 0 means end of study), and Age. The question is how to use this data to predict the number of repairs during the following 90 days. Thanks.