<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: deployment model in Discussions</title>
    <link>https://community.jmp.com/t5/Discussions/deployment-model/m-p/373493#M62313</link>
    <description>&lt;P&gt;Meanwhile I believe that I will not be able to fit a valid/reasonable model for this data.&lt;/P&gt;&lt;P&gt;My first approach was to generate a validation column stratified by the Xs.&lt;/P&gt;&lt;P&gt;The time series hint from&amp;nbsp;&lt;a href="https://community.jmp.com/t5/user/viewprofilepage/user-id/1701"&gt;@dale_lehman&lt;/a&gt;&amp;nbsp;was useful, so I also looked at the data as a time series,&lt;/P&gt;&lt;P&gt;realizing that the combination of the Xs changes over time. And here again it looks to me like "New" is different. Yes, the Xs cover the same range, but that is not enough. I think the combinations are different in "New", which means it is a kind of extrapolation. This would not be an issue for regression, but for methods like RF it matters (the model is multidimensional and highly nonlinear, so it is difficult to look inside).&lt;/P&gt;&lt;P&gt;On the other hand, I was not able to fit a reasonable model with any other method (neural, boosted tree) either. It looks to me as if we do not have the proper information in the X columns to build a solid model.&lt;/P&gt;&lt;P&gt;So I would look into the physics of the process, or whatever it is:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;how the Xs may/can change the Y,&lt;/LI&gt;&lt;LI&gt;what the noise may be,&lt;/LI&gt;&lt;LI&gt;what makes things change over time, and include that in the dataset.&lt;/LI&gt;&lt;LI&gt;I'm also curious about the data structure; there are groups of rows with the same date and the same values for some parameters.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;What surprised me is that when I made a validation column stratified by time, I got the same result (a fair model for the first part, no model with the new data).&lt;/P&gt;&lt;P&gt;Good luck&lt;/P&gt;</description>
    <pubDate>Thu, 01 Apr 2021 11:58:01 GMT</pubDate>
    <dc:creator>Georg</dc:creator>
    <dc:date>2021-04-01T11:58:01Z</dc:date>
    <item>
      <title>deployment model</title>
      <link>https://community.jmp.com/t5/Discussions/deployment-model/m-p/372727#M62263</link>
      <description>&lt;P&gt;Hi to everybody.&lt;/P&gt;&lt;P&gt;Attached are two JMP files. The one named Training-validation-test was used to build an RF model with the following results.&lt;/P&gt;&lt;P&gt;The file contains one numeric response and 20 different predictors, one of which is categorical. The predictors are sensor values.&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="Capture.PNG" style="width: 551px;"&gt;&lt;img src="https://community.jmp.com/t5/image/serverpage/image-id/31710i8D5D3DF36DAB8BB4/image-size/large?v=v2&amp;amp;px=999" role="button" title="Capture.PNG" alt="Capture.PNG" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;As you can see, the bias is very good, while the variance is slightly high (92% - 69% = 23%).&lt;/P&gt;&lt;P&gt;Then a new database with examples never seen by the model was used to simulate the deployment of the algorithm on new data.&lt;/P&gt;&lt;P&gt;To our great surprise, the result was really poor, with an R2 between the predicted and the actual response close to zero. We were not able to understand this behavior. After all, the R2 = 68% for the test data does not seem really bad, so we were certainly expecting an R2 larger than zero.&lt;/P&gt;&lt;P&gt;Any help/comment is welcome.&lt;/P&gt;&lt;P&gt;Felice-Gian&lt;/P&gt;</description>
      <pubDate>Fri, 09 Jun 2023 00:31:21 GMT</pubDate>
      <guid>https://community.jmp.com/t5/Discussions/deployment-model/m-p/372727#M62263</guid>
      <dc:creator>FR60</dc:creator>
      <dc:date>2023-06-09T00:31:21Z</dc:date>
    </item>
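    <!-- Editor's note: a minimal Python sketch (outside JMP, not the poster's
         JMP Pro workflow) of the experiment described above: fit an RF on a
         random split, then score it on never-seen deployment data. The file
         names, column names ("Y", "X1".."X20", one categorical), and the
         split are assumptions for illustration only.

    import pandas as pd
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.metrics import r2_score
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("train_val_test.csv")   # hypothetical export of the JMP table
    new = pd.read_csv("new_data.csv")        # hypothetical never-seen data

    X = pd.get_dummies(df.drop(columns="Y"))  # one-hot encode the categorical predictor
    y = df["Y"]
    X_new = pd.get_dummies(new.drop(columns="Y")).reindex(columns=X.columns, fill_value=0)

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=1)
    rf = RandomForestRegressor(n_estimators=500, random_state=1).fit(X_tr, y_tr)

    print("random test R2:", r2_score(y_te, rf.predict(X_te)))      # looks fine
    print("deployment R2:", r2_score(new["Y"], rf.predict(X_new)))  # near zero in this thread
    -->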
    <item>
      <title>Re: deployment model</title>
      <link>https://community.jmp.com/t5/Discussions/deployment-model/m-p/372774#M62268</link>
      <description>&lt;P&gt;Virtually all of the X variables (with the exception of X12, X13, and X19) are significantly different, on average, between the two sets. So, my guess would be that the new data is not simply another random sample from the same population, but comes from a different population. It is not surprising that your model, which fit the original data fairly well, does so poorly on new data, when that new data comes from a different population.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I suppose the question to ask is why it comes from a different population. Apparently you thought the data was being drawn from the same population (or something close to that). But most of these variables look so different between the two sets that such an assumption is clearly not correct.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;To see the differences, I concatenated the two tables and just used Fit Y by X, with Y being all of your X variables and X being the source table created when I concatenated them.&lt;/P&gt;</description>
      <pubDate>Tue, 30 Mar 2021 18:11:27 GMT</pubDate>
      <guid>https://community.jmp.com/t5/Discussions/deployment-model/m-p/372774#M62268</guid>
      <dc:creator>dale_lehman</dc:creator>
      <dc:date>2021-03-30T18:11:27Z</dc:date>
    </item>
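    <!-- Editor's note: a sketch of the same check outside JMP; instead of
         concatenating the tables and running Fit Y by X, test each numeric
         predictor for a mean shift between the two sets with a Welch t-test.
         File names and the "X" column-naming convention are assumptions.

    import pandas as pd
    from scipy.stats import ttest_ind

    old = pd.read_csv("train_val_test.csv")
    new = pd.read_csv("new_data.csv")

    for col in [c for c in old.columns if c.startswith("X")]:
        if old[col].dtype.kind in "if":   # numeric predictors only
            t, p = ttest_ind(old[col].dropna(), new[col].dropna(), equal_var=False)
            print(f"{col}: p = {p:.2e}" + ("  (differs)" if p < 0.01 else ""))
    -->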
    <item>
      <title>Re: deployment model</title>
      <link>https://community.jmp.com/t5/Discussions/deployment-model/m-p/372828#M62275</link>
      <description>&lt;P&gt;I agree with&amp;nbsp;&lt;a href="https://community.jmp.com/t5/user/viewprofilepage/user-id/1701"&gt;@dale_lehman&lt;/a&gt;; the new data is different.&lt;/P&gt;&lt;P&gt;I was able to fit a similar model, which also performs poorly on the new data.&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="Georg_0-1617137935128.png" style="width: 400px;"&gt;&lt;img src="https://community.jmp.com/t5/image/serverpage/image-id/31723iEF1676A456C0F4F9/image-size/medium?v=v2&amp;amp;px=400" role="button" title="Georg_0-1617137935128.png" alt="Georg_0-1617137935128.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;The scatterplot matrix shows some details, e.g. X5 uses only a very small range of the original data; the same holds for X11 ...&lt;/P&gt;&lt;P&gt;In both pictures the new data is selected. Maybe this helps you to find the reason why there is a difference.&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="Georg_1-1617138125434.png" style="width: 400px;"&gt;&lt;img src="https://community.jmp.com/t5/image/serverpage/image-id/31724i2D643CC64E545490/image-size/medium?v=v2&amp;amp;px=400" role="button" title="Georg_1-1617138125434.png" alt="Georg_1-1617138125434.png" /&gt;&lt;/span&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 30 Mar 2021 21:05:15 GMT</pubDate>
      <guid>https://community.jmp.com/t5/Discussions/deployment-model/m-p/372828#M62275</guid>
      <dc:creator>Georg</dc:creator>
      <dc:date>2021-03-30T21:05:15Z</dc:date>
    </item>
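    <!-- Editor's note: a sketch that quantifies what the scatterplot matrix
         shows, namely how much of the original range each predictor actually
         uses in the new data (e.g. X5, X11). All names are assumptions.

    import pandas as pd

    old = pd.read_csv("train_val_test.csv")
    new = pd.read_csv("new_data.csv")

    for col in [c for c in old.columns if c.startswith("X")]:
        if old[col].dtype.kind in "if":
            span_old = old[col].max() - old[col].min()
            span_new = new[col].max() - new[col].min()
            if span_old > 0:
                print(f"{col}: new data spans {span_new / span_old:.0%} of the original range")
    -->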
    <item>
      <title>Re: deployment model</title>
      <link>https://community.jmp.com/t5/Discussions/deployment-model/m-p/372927#M62281</link>
      <description>&lt;P&gt;Hi Dale and Georg,&lt;/P&gt;&lt;P&gt;thanks for your replies.&lt;/P&gt;&lt;P&gt;Honestly, the two distributions for each predictor do not seem so different to me (see the histograms in the attached picture; with "bls" we indicate the Training-Val-Test set and with "exp" the set never seen by the model). Maybe X20 shows a new peak in the never-seen data that was absent before, but the remaining distributions are only less populated than the previous ones.&lt;/P&gt;&lt;P&gt;If the predictor distributions of the training set are wider than those of the test set, the model should have "visited" the phase space occupied by the test-set predictors. Correct? So why should it make bad predictions?&lt;/P&gt;&lt;P&gt;Anyway, I confirmed (even though for us this is strange) what you say: by randomly taking 2000 instances (instead of the last 2000 instances) as the data never seen by the random forest model, once trained it was able to predict the response correctly. So if a slight difference in average/std makes a big difference in prediction performance, how can we use these models in production, where the signals change continuously?&lt;/P&gt;&lt;P&gt;Rgds., Felice/Gian&lt;/P&gt;</description>
      <pubDate>Wed, 31 Mar 2021 08:44:47 GMT</pubDate>
      <guid>https://community.jmp.com/t5/Discussions/deployment-model/m-p/372927#M62281</guid>
      <dc:creator>FR60</dc:creator>
      <dc:date>2021-03-31T08:44:47Z</dc:date>
    </item>
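    <!-- Editor's note: a sketch of the comparison described above: hold out
         2000 random rows versus the last 2000 rows and compare R2. Assumes a
         single combined table with a "Date" column; all names are
         hypothetical.

    import pandas as pd
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.metrics import r2_score

    df = pd.read_csv("all_data.csv", parse_dates=["Date"]).sort_values("Date")
    X = pd.get_dummies(df.drop(columns=["Y", "Date"]))
    y = df["Y"]

    def holdout_r2(test_idx):
        train_idx = df.index.difference(test_idx)
        rf = RandomForestRegressor(n_estimators=300, random_state=1)
        rf.fit(X.loc[train_idx], y.loc[train_idx])
        return r2_score(y.loc[test_idx], rf.predict(X.loc[test_idx]))

    print("random 2000 held out:", holdout_r2(df.sample(2000, random_state=1).index))
    print("last 2000 held out:", holdout_r2(df.index[-2000:]))
    -->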
    <item>
      <title>Re: deployment model</title>
      <link>https://community.jmp.com/t5/Discussions/deployment-model/m-p/372977#M62287</link>
      <description>&lt;P&gt;Now that I look more closely, I see that this is time series data. From my initial views, it does not appear that the process has changed significantly. However, I think the problem is that the train-validate-test paradigm is not really appropriate for time series data. I'd suggest creating a distinct time period (e.g., the last month of the first data set) as the validation set and re-estimating the model. My guess is that the model performance on the validation and never-seen-before data should be more similar if you estimate the model that way. I'm still a bit concerned about the noticeable differences between the X variables in the two data sets - it might require some more sophisticated time series modeling (such as modeling potential seasonality).&lt;/P&gt;</description>
      <pubDate>Wed, 31 Mar 2021 13:21:38 GMT</pubDate>
      <guid>https://community.jmp.com/t5/Discussions/deployment-model/m-p/372977#M62287</guid>
      <dc:creator>dale_lehman</dc:creator>
      <dc:date>2021-03-31T13:21:38Z</dc:date>
    </item>
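    <!-- Editor's note: a sketch of the suggestion above: instead of a random
         validation column, mark the last month of the first data set as the
         validation period. The "Date" column name is an assumption.

    import pandas as pd

    df = pd.read_csv("train_val_test.csv", parse_dates=["Date"])
    cutoff = df["Date"].max() - pd.DateOffset(months=1)
    df["Validation"] = (df["Date"] > cutoff).map({False: "Training", True: "Validation"})
    print(df["Validation"].value_counts())
    -->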
    <item>
      <title>Re: deployment model</title>
      <link>https://community.jmp.com/t5/Discussions/deployment-model/m-p/373493#M62313</link>
      <description>&lt;P&gt;Meanwhile I believe that I will not be able to fit a valid/reasonable model for this data.&lt;/P&gt;&lt;P&gt;My first approach was to generate a validation column stratified by the Xs.&lt;/P&gt;&lt;P&gt;The time series hint from&amp;nbsp;&lt;a href="https://community.jmp.com/t5/user/viewprofilepage/user-id/1701"&gt;@dale_lehman&lt;/a&gt;&amp;nbsp;was useful, so I also looked at the data as a time series,&lt;/P&gt;&lt;P&gt;realizing that the combination of the Xs changes over time. And here again it looks to me like "New" is different. Yes, the Xs cover the same range, but that is not enough. I think the combinations are different in "New", which means it is a kind of extrapolation. This would not be an issue for regression, but for methods like RF it matters (the model is multidimensional and highly nonlinear, so it is difficult to look inside).&lt;/P&gt;&lt;P&gt;On the other hand, I was not able to fit a reasonable model with any other method (neural, boosted tree) either. It looks to me as if we do not have the proper information in the X columns to build a solid model.&lt;/P&gt;&lt;P&gt;So I would look into the physics of the process, or whatever it is:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;how the Xs may/can change the Y,&lt;/LI&gt;&lt;LI&gt;what the noise may be,&lt;/LI&gt;&lt;LI&gt;what makes things change over time, and include that in the dataset.&lt;/LI&gt;&lt;LI&gt;I'm also curious about the data structure; there are groups of rows with the same date and the same values for some parameters.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;What surprised me is that when I made a validation column stratified by time, I got the same result (a fair model for the first part, no model with the new data).&lt;/P&gt;&lt;P&gt;Good luck&lt;/P&gt;</description>
      <pubDate>Thu, 01 Apr 2021 11:58:01 GMT</pubDate>
      <guid>https://community.jmp.com/t5/Discussions/deployment-model/m-p/373493#M62313</guid>
      <dc:creator>Georg</dc:creator>
      <dc:date>2021-04-01T11:58:01Z</dc:date>
    </item>
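    <!-- Editor's note: "same ranges but different combinations" can be made
         measurable with a multivariate distance of each new row from the
         training cloud. Mahalanobis distance is used here purely as one
         illustration (not Georg's JMP analysis), and all names are assumed.

    import numpy as np
    import pandas as pd

    old = pd.read_csv("train_val_test.csv")
    new = pd.read_csv("new_data.csv")
    cols = [c for c in old.columns if c.startswith("X") and old[c].dtype.kind in "if"]

    mu = old[cols].mean().to_numpy()
    cov_inv = np.linalg.pinv(old[cols].cov().to_numpy())

    def mahalanobis(frame):
        d = frame[cols].to_numpy() - mu
        return np.sqrt(np.einsum("ij,jk,ik->i", d, cov_inv, d))

    # New rows far beyond typical training distances indicate extrapolated combinations.
    print("train median distance:", np.median(mahalanobis(old)))
    print("new median distance:", np.median(mahalanobis(new)))
    -->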
    <item>
      <title>Re: deployment model</title>
      <link>https://community.jmp.com/t5/Discussions/deployment-model/m-p/373539#M62321</link>
      <description>&lt;P&gt;I have a couple of suggestions. First, I would try first differencing your response variable and then trying random forests again. For me, that is the easiest approach to time series data - and it is possible that the data will be better behaved when looked at that way. I would still manually create the validation set as the last time period rather than a random selection of rows. The second idea is something you noticed - there are time groupings, such as multiple observations closely linked in time. You might consider making the time sequence uniform by summarizing observations closely related in time. Time series are better behaved if the time variable is uniformly measured, so your data will need to be summarized into equally spaced time intervals. As you are suggesting, it is time to go back to thinking about exactly what is being measured and how it is being measured.&lt;/P&gt;</description>
      <pubDate>Thu, 01 Apr 2021 14:35:44 GMT</pubDate>
      <guid>https://community.jmp.com/t5/Discussions/deployment-model/m-p/373539#M62321</guid>
      <dc:creator>dale_lehman</dc:creator>
      <dc:date>2021-04-01T14:35:44Z</dc:date>
    </item>
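    <!-- Editor's note: a sketch of both suggestions: put the irregular time
         stamps onto a uniform grid by summarizing nearby observations, then
         first-difference the response. The "Date" column name and the
         1-minute grid are assumptions.

    import pandas as pd

    df = pd.read_csv("all_data.csv", parse_dates=["Date"]).set_index("Date")

    # Summarize observations closely related in time into equally spaced intervals.
    uniform = df.resample("1min").mean(numeric_only=True).dropna(how="all")

    # Model the first difference dY(t) = Y(t) - Y(t-1) instead of Y(t).
    uniform["dY"] = uniform["Y"].diff()
    -->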
    <item>
      <title>Re: deployment model</title>
      <link>https://community.jmp.com/t5/Discussions/deployment-model/m-p/373943#M62362</link>
      <description>&lt;P&gt;Hi Georg and Dale,&lt;/P&gt;&lt;P&gt;thank you very much for your answers. I took a look at the lag plots of all predictors and the response, and to my big surprise I discovered a very long autocorrelation (see attached pictures).&lt;/P&gt;&lt;P&gt;Since the data come from a semiconductor fab, the value 25 visible in the ACF corresponds to the number of wafers in a lot.&lt;/P&gt;&lt;P&gt;The out-of-line points in the lag plots are just the first wafers of a different lot loaded into the machine. I think the autocorrelation is due to internal feedback in our machine that keeps the working point. Of course those adjustments are hidden inside the chamber and not under our control or understanding. The response also shows a clear lag that I cannot explain, but it is there.&lt;/P&gt;&lt;P&gt;I tried lags, moving averages, smoothing and other predictor feature engineering, but the results continue to be really bad.&lt;/P&gt;&lt;P&gt;As a next step I want to try the differencing that Dale proposed. Do you believe that the simple difference X(t)-X(t-1) can be enough?&lt;/P&gt;&lt;P&gt;Ciao, Felice&lt;/P&gt;</description>
      <pubDate>Fri, 02 Apr 2021 15:34:24 GMT</pubDate>
      <guid>https://community.jmp.com/t5/Discussions/deployment-model/m-p/373943#M62362</guid>
      <dc:creator>FR60</dc:creator>
      <dc:date>2021-04-02T15:34:24Z</dc:date>
    </item>
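    <!-- Editor's note: a sketch that computes the ACF numerically, so the
         lag-25 (wafers-per-lot) structure described above can be read off
         directly. statsmodels is assumed available; names are hypothetical.

    import pandas as pd
    from statsmodels.tsa.stattools import acf

    y = pd.read_csv("all_data.csv")["Y"].dropna()
    rho = acf(y, nlags=50)
    for lag in (1, 2, 25, 50):
        print(f"lag {lag}: ACF = {rho[lag]:.2f}")
    -->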
    <item>
      <title>Re: deployment model</title>
      <link>https://community.jmp.com/t5/Discussions/deployment-model/m-p/373976#M62366</link>
      <description>&lt;P&gt;The PACF plot suggests that the first two lags are the relevant ones - by far, the first lag is the most important. So, I think simple first order differencing would be a good thing to try. It may be possible that one of the machine learning models (e.g., RF, boosted trees, neural nets) will perform well, so I would try them. But you also should try some more standard time series models (where you would not use the differencing, as the time series models are built to account for the autocorrelation in various ways). In theory, if you use first order differencing (and if that largely eliminates the time structure), then you can probably use a random train-validate-test set for your modeling. At least, that's what I think - others may know better.&lt;/P&gt;</description>
      <pubDate>Fri, 02 Apr 2021 18:04:58 GMT</pubDate>
      <guid>https://community.jmp.com/t5/Discussions/deployment-model/m-p/373976#M62366</guid>
      <dc:creator>dale_lehman</dc:creator>
      <dc:date>2021-04-02T18:04:58Z</dc:date>
    </item>
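    <!-- Editor's note: a sketch of the check implied above: compare the PACF
         of the response before and after first differencing; if lags 1 and 2
         dominate, differencing should flatten it. Names are assumptions.

    import pandas as pd
    from statsmodels.tsa.stattools import pacf

    y = pd.read_csv("all_data.csv")["Y"].dropna()
    print("PACF of Y:", pacf(y, nlags=3)[1:].round(2))
    print("PACF of diff(Y):", pacf(y.diff().dropna(), nlags=3)[1:].round(2))
    -->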
    <item>
      <title>Re: deployment model</title>
      <link>https://community.jmp.com/t5/Discussions/deployment-model/m-p/374128#M62384</link>
      <description>&lt;P&gt;Hi Dale,&lt;/P&gt;&lt;P&gt;since the predictors also show the same PACF, do you think it could be useful to difference the predictors too, besides the response?&lt;/P&gt;&lt;P&gt;Thanks, Felice&lt;/P&gt;</description>
      <pubDate>Sat, 03 Apr 2021 09:36:45 GMT</pubDate>
      <guid>https://community.jmp.com/t5/Discussions/deployment-model/m-p/374128#M62384</guid>
      <dc:creator>FR60</dc:creator>
      <dc:date>2021-04-03T09:36:45Z</dc:date>
    </item>
    <item>
      <title>Re: deployment model</title>
      <link>https://community.jmp.com/t5/Discussions/deployment-model/m-p/374129#M62385</link>
      <description>&lt;P&gt;Time series models are not my strength, so hopefully someone will give you a more definitive answer. I would not difference the features at first - I'd build the models and look at the residuals to see if they exhibit the same autocorrelation. If they do, then perhaps I'd try differencing some of the features.&lt;/P&gt;</description>
      <pubDate>Sat, 03 Apr 2021 11:03:09 GMT</pubDate>
      <guid>https://community.jmp.com/t5/Discussions/deployment-model/m-p/374129#M62385</guid>
      <dc:creator>dale_lehman</dc:creator>
      <dc:date>2021-04-03T11:03:09Z</dc:date>
    </item>
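    <!-- Editor's note: a sketch of the residual check suggested above: fit a
         model on the differenced response without differencing the features,
         then look for leftover autocorrelation in the residuals. All names
         are assumptions.

    import pandas as pd
    from sklearn.ensemble import RandomForestRegressor
    from statsmodels.tsa.stattools import acf

    df = pd.read_csv("all_data.csv").dropna()
    X = pd.get_dummies(df.filter(regex="^X"))
    dy = df["Y"].diff()

    mask = dy.notna()
    rf = RandomForestRegressor(n_estimators=300, random_state=1).fit(X[mask], dy[mask])
    resid = dy[mask] - rf.predict(X[mask])
    print("residual ACF, lags 1 to 5:", acf(resid, nlags=5)[1:].round(2))
    -->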
    <item>
      <title>Re: deployment model</title>
      <link>https://community.jmp.com/t5/Discussions/deployment-model/m-p/374164#M62388</link>
      <description>&lt;P&gt;Hi Dale,&lt;/P&gt;&lt;P&gt;unfortunately, even using the differenced response variable, none of the regression models was able to return a good result.&lt;/P&gt;&lt;P&gt;So at this point maybe I need to assume that our X variables are not good predictors of the response.&lt;/P&gt;&lt;P&gt;Anyway, the doubt remains whether, for data like ours (sensor data), we have to split training/test taking the time axis into account, or whether we can use the random-split technique instead. In the literature I found several papers where, for sensor data, a random training/test split is used. But if we do it the same way, the results seem really good until we deploy the model into production and discover that the model is completely wrong. So where is the truth?&lt;/P&gt;&lt;P&gt;Attached you can find a paper that applies machine learning techniques to a semiconductor tool (an etcher). They try to do predictive maintenance by looking at the sensor data, so the data should be really similar to ours. If you take a look at the part of the paper highlighted in green, it looks like they split the data taking the time axis into account and then treat the data as a time series.&lt;/P&gt;&lt;P&gt;I hope that more people besides Dale can have their say and help us to solve this dilemma.&lt;/P&gt;&lt;P&gt;Are data like ours multiple time series or not?&lt;/P&gt;&lt;P&gt;It is true that the predictors X and the response Y depend on time. But if we want to predict Y from the X values, why should the time variable be so important? To me, Y should depend on the X variables, which are set automatically by the machine during the process phase. Of course this happens chronologically in time, but the X values depend on the recipe and the machine state. So I have some difficulty accepting the dependency on time.&lt;/P&gt;&lt;P&gt;Happy Easter to the whole community.&lt;/P&gt;&lt;P&gt;Felice&lt;/P&gt;</description>
      <pubDate>Sat, 03 Apr 2021 18:05:23 GMT</pubDate>
      <guid>https://community.jmp.com/t5/Discussions/deployment-model/m-p/374164#M62388</guid>
      <dc:creator>FR60</dc:creator>
      <dc:date>2021-04-03T18:05:23Z</dc:date>
    </item>
    <item>
      <title>Re: deployment model</title>
      <link>https://community.jmp.com/t5/Discussions/deployment-model/m-p/374184#M62392</link>
      <description>&lt;P&gt;I strongly suspect you need to spend more time preparing the data. To fit time series models, the observations must be equally spaced and with no duplicate time values. You have 4 sensors (?) with unevenly spaced responses. If you want to make sense of this data, I believe you need to aggregate the values so there is a single observation at each time period, and you might want to create a variable for the length of time between timed observations. I'd also suggest doing a separate analysis for each type of sensor rather than using sensor as one of your features, as the behavior of each appears to be different.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I wouldn't be prepared to conclude that your features are not predictive - I think too much noise is introduced by mixing the sensor types as well as having nonstandard time intervals. If you address these issues, there may be a good predictive model yet. The modeling part is not the problem - the data preparation is (as is often the case). At least that is what I think.&lt;/P&gt;</description>
      <pubDate>Sat, 03 Apr 2021 20:47:36 GMT</pubDate>
      <guid>https://community.jmp.com/t5/Discussions/deployment-model/m-p/374184#M62392</guid>
      <dc:creator>dale_lehman</dc:creator>
      <dc:date>2021-04-03T20:47:36Z</dc:date>
    </item>
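    <!-- Editor's note: a sketch of the data preparation described above: one
         aggregated row per time stamp, a time-gap feature, and a separate
         frame per sensor type. The "Date" and "Sensor" column names are
         assumptions.

    import pandas as pd

    df = pd.read_csv("all_data.csv", parse_dates=["Date"])

    for sensor, grp in df.groupby("Sensor"):   # separate analysis per sensor type
        agg = grp.groupby("Date").mean(numeric_only=True).sort_index()
        agg["gap_minutes"] = agg.index.to_series().diff().dt.total_seconds() / 60
        print(sensor, agg.shape)
    -->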
    <item>
      <title>Re: deployment model</title>
      <link>https://community.jmp.com/t5/Discussions/deployment-model/m-p/375809#M62567</link>
      <description>&lt;P&gt;Maybe this can help the discussion. I used the excellent Model Screening option in JMP Pro, using the nested option to tune hyperparameters combined with k-fold cross-validation. Here are the results obtained after some hours of JMP work.&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="Picture1.png" style="width: 604px;"&gt;&lt;img src="https://community.jmp.com/t5/image/serverpage/image-id/31979i239287D88E429E36/image-size/large?v=v2&amp;amp;px=999" role="button" title="Picture1.png" alt="Picture1.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;What surprised me is the large difference among the models in terms of performance. The tree models and KNN showed an R2 above 50%, while all the fit models showed values well below 50% (close to zero), even with lasso and ridge regularization. Maybe it is this difference that explains the bad performance of the RF/boosted models once deployed in production?&lt;/P&gt;&lt;P&gt;What do you think? Any comment/suggestion/help?&lt;/P&gt;&lt;P&gt;Felice.&lt;/P&gt;</description>
      <pubDate>Sun, 11 Apr 2021 13:11:59 GMT</pubDate>
      <guid>https://community.jmp.com/t5/Discussions/deployment-model/m-p/375809#M62567</guid>
      <dc:creator>FR60</dc:creator>
      <dc:date>2021-04-11T13:11:59Z</dc:date>
    </item>
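    <!-- Editor's note: a sketch of the Model Screening idea outside JMP:
         nested cross-validation, with an inner grid search tuning the
         hyperparameters and an outer k-fold giving an honest R2 per model
         family. The tiny grids and all names are assumptions.

    import pandas as pd
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.linear_model import Lasso
    from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

    df = pd.read_csv("train_val_test.csv").dropna()
    X = pd.get_dummies(df.filter(regex="^X"))
    y = df["Y"]

    candidates = {
        "random forest": GridSearchCV(RandomForestRegressor(random_state=1),
                                      {"max_features": [0.3, 0.7]}, cv=3),
        "lasso": GridSearchCV(Lasso(max_iter=10000), {"alpha": [0.01, 0.1, 1.0]}, cv=3),
    }
    outer = KFold(n_splits=5, shuffle=True, random_state=1)
    for name, model in candidates.items():
        scores = cross_val_score(model, X, y, cv=outer, scoring="r2")
        print(f"{name}: mean R2 = {scores.mean():.2f}")
    -->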
  </channel>
</rss>

