Subscribe Bookmark RSS Feed

Prediction future end values depending on previous data set

saitcopuroglu

Community Trekker

Joined:

Sep 29, 2014

Dear All,

I have a data table which presents the sales of year 2015 by sales date.

After adding a Cumulative Summary column I have the final amount of year 2015 "total revenue".

Depending on this data (I also have the same for 2012, 2013, 2014 if that helps a healthier prediction) how will I predict the total revenue of 2016 by filling the "Sales for 2016" column day by day?

Message was edited by: Sait Copuroglu
Second data table added " Predicting Future Values Years Weeks"

5 REPLIES
kevin_c_anderso

Community Trekker

Joined:

Jun 5, 2014

Hi, Sait!

It's a great question:  How will you predict Total Revenue in 2016? 

Here's what I recommend:

  • Get more data.  The file you enclosed is basically of two partial years.  This is insufficient for any kind of accurate prediction of a seasonal variable.
  • Understand the missing data.  They don't appear to be Missing At Random.  Are they really missing?  Zero?  Can they be imputed somehow?  How?
  • Assuming these are hotel data, think about what can drive hotel occupancy...day of week, day of year, week of year, holidays, conferences, etc.  Build several good, defensible models using those variables with some crossvalidation and average the predictions of the models.  Use the average of the prediction formulae to forecast every day of 2016, and sum the days to get Total Revenue.

I ignored the missing data, made three models, averaged them, and predicted 5.4M euros in 2015 and 7.1M euros in 2016.  Better data will support better models.  Better models will hopefully support better predictions.

Niels Bohr is reputed to have said "Prediction is difficult, especially about the future."

Good luck!

saitcopuroglu

Community Trekker

Joined:

Sep 29, 2014

Dear Kevin,

Thanks for your attention and for your reply!

Actually I feel I am very lucky with this prediction, last 4 years data is moving within a 85% confidence interval  (Please see inserted pic).

Here's what I recommend:

  • Get more data.  The file you enclosed is basically of two partial years.  This is insufficient for any kind of accurate prediction of a seasonal variable.
  • Understand the missing data.  They don't appear to be Missing At Random.  Are they really missing?  Zero?  Can they be imputed somehow?  How? The missing data is "no sales that date", means zero.
  • Assuming these are hotel data, think about what can drive hotel occupancy...day of week, day of year, week of year, holidays, conferences, etc.  Build several good, defensible models using those variables with some crossvalidation and average the predictions of the models.  Use the average of the prediction formulae to forecast every day of 2016, and sum the days to get Total Revenue. Yes you're right, this data set is of a Resort Hotel. Could you please help undertanding how to build several defensible model? Cross validation?

I ignored the missing data, made three models, averaged them, and predicted 5.4M euros in 2015 and 7.1M euros in 2016.  Better data will support better models.  Better models will hopefully support better predictions. Could you please share your work so I try to understand how to ignore missing data, build models, average them, predict? Did you by Time Series or Fit Model or...?

Niels Bohr is reputed to have said "Prediction is difficult, especially about the future."

Many thanks again.

10447_LBLCumNetAccRevYears.png

kevin_c_anderso

Community Trekker

Joined:

Jun 5, 2014

Hi, Sait!

I'd much rather be lucky than good.  I'm not going to share my work with you...I am going to try to help you by consulting with you, not by doing the job for you.

Zero is not missing!  Egyptian accountants had a symbol for zero in 1740 BC.  We should honor prehistory by using it in place of the missing data in your file.

But your graph of the 2014 data shows a monotonic increase that doesn't jive with the large numbers of zeroes (used to be missing) in the Revenue data.  So I guess I am even more confused with this new information.

The functional shapes on the graph all look similar year-over-year, and you may do better by just using the average of the functions at each day.  That would certainly be more parsimonious than my approach, and it may even be more accurate.  Remember, all you gave us was two partial years...

I know very little about the resort hotel business.  But it makes some intuitive sense to me that, for instance, weekends would be more lucrative than weekdays, that summer vacationtime would be more lucrative than winter, that holidays like New Year's Eve would make more money than other days, etc.  So I made some variables like DayOfYear, DayOfWeek, WeekOfYear, Year, etc.  I then fit "decent" Neural Net, Bootstrap Forest, and Boosted Tree models with the response of Revenue with those variables as Factors, and saved the prediction formulae.  I made a variable that summarized the mean of those three model predictions. Then I incremented the dates through 2016, made the daily predictions, and summed the means by Year.  It's a very naive approach that could probably be improved dramatically with some subject knowledge.



saitcopuroglu

Community Trekker

Joined:

Sep 29, 2014

Dear Kevin!

I'd much rather be lucky than good.  I'm not going to share my work with you...I am going to try to help you by consulting with you, not by doing the job for you. I would be much more happier. Did not request the data to be lazy but to spare you time as I can by trying to understand on my own.

Zero is not missing!  Egyptian accountants had a symbol for zero in 1740 BC.  We should honor prehistory by using it in place of the missing data in your file. Yes true, statistically missing, financially zero.

But your graph of the 2014 data shows a monotonic increase that doesn't jive with the large numbers of zeroes (used to be missing) in the Revenue data.  So I guess I am even more confused with this new information. Actually I could make a data table with the missing revenue data not displayed but each year the date of sales differ so I've thought I could make a calendar date column to sweep all. Under these circumstances I think making a "Week Of The Year" time variable will be much better.

The functional shapes on the graph all look similar year-over-year, and you may do better by just using the average of the functions at each day.  That would certainly be more parsimonious than my approach, and it may even be more accurate.  Remember, all you gave us was two partial years... I guess the exponential smoothing or weighted average would be the best as leisure & tourism in Turkey in 2016 is predicted to be more likely as 2015 rather than 2012 and I think I can handle it.

I know very little about the resort hotel business.  But it makes some intuitive sense to me that, for instance, weekends would be more lucrative than weekdays, that summer vacationtime would be more lucrative than winter, that holidays like New Year's Eve would make more money than other days, etc. Actually what you were describing is the characteristics of "City Hotel" business. Weekend or weekdays do not play into Resort Hotel business since the average stay night is 12. Starts operating in May and ends with November (no christmas). So I made some variables like DayOfYear, DayOfWeek, WeekOfYear, Year, etc.  I then fit "decent" Neural Net, Bootstrap Forest, and Boosted Tree models with the response of Revenue with those variables as Factors, and saved the prediction formulae.  I made a variable that summarized the mean of those three model predictions. Then I incremented the dates through 2016, made the daily predictions, and summed the means by Year.  It's a very naive approach that could probably be improved dramatically with some subject knowledge. Resort Hotels typically take very little direct reservations from their guests but make contracts with Tour Operators for the whole year in advance. 65 percent of the whole revenue is almost completed with the contracts months before the hotel starts operating (that's why the sales date for given operating year is consisting also the previous year). Predicting the remaining 35 percent is coming from "hot sales" which is the skills of Sales Management Team. They give a prediction about which market will produce what amount of sales and when, which is put in an excel file and called as "Revenue Budget" and "Cost Management Team" calculates the expected expenses depending on the data given by Revenue Budget and we call it "Expenses Budget". Total profit is the difference of the two. What I would like to monitor is the deviation of actual sales from the Revenue Budget as every day sales data flows in, which market is under performing or over performing and what would be the final actual sales amount compared to revenue budget.


I guess first I need to prepare the data table in such a format that I can use for Time Series and Fit Model and Fit Y by X but I don't know how to.. ( as michael.anderson​ mentioned "You can take out isolate seasonal effects from linear trends in revenue.  This also gives the ability to do forecasting.")

I have added a data table " Predicting Future Values Years Weeks" I guess which is better for demonstration purposes.


Many thanks in advance,


M_Anderson

Staff

Joined:

Nov 21, 2014

Assuming that you got the missing data issue Kevin showed, this would likely be a good choice for using the Time series platform - under modeling.  You can take out isolate seasonal effects from linear trends in revenue.  This also gives the ability to do forecasting.

Best,

M