
Hello,

I have a general question about an application of DOE. I have a dataset with 30 independent features; the response is yield. The data are rather noisy, and when I fit a Random Forest (RF) I get at best an R^2 of 0.70 on the validation data. Running an actual DOE is not possible because of the logistics involved. However, suppose I take the RF model as my data-generating process. I then design an experiment using the min/max of each feature. This gives a huge design matrix, but that is not a problem because computing power is not an issue. I then feed the DOE matrix into the RF to generate predicted yield, and use OLS to model that yield with the design matrix as my features. I get an R^2 of 0.92!
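
A minimal sketch of the workflow described above, in Python with scikit-learn (an assumption; the post does not say which software was used). The data are synthetic, only 6 features are used so that the two-level full factorial stays at 2^6 = 64 runs (30 features would give 2^30 runs), and all variable names are hypothetical.

```python
import itertools

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Hypothetical stand-in for the observational dataset: 6 features, noisy yield.
n_obs, n_feat = 500, 6
X = rng.uniform(0.0, 1.0, size=(n_obs, n_feat))
y = (3.0 * X[:, 0] - 2.0 * X[:, 1] + X[:, 2] * X[:, 3]
     + rng.normal(0.0, 1.0, size=n_obs))

# Step 1: fit the random forest that will act as the "data-generating process".
rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

# Step 2: two-level full factorial using the observed min/max of each feature.
lows, highs = X.min(axis=0), X.max(axis=0)
design = np.array(list(itertools.product(*zip(lows, highs))))  # shape (64, 6)

# Step 3: "run" the experiment by predicting yield from the RF.
y_sim = rf.predict(design)

# Step 4: fit OLS (main effects only) to the simulated responses.
ols = LinearRegression().fit(design, y_sim)
r2_sim = ols.score(design, y_sim)  # R^2 of OLS against RF output, not real yield
print(f"R^2 of OLS on RF-generated responses: {r2_sim:.3f}")
```

Note that `r2_sim` measures how well a linear model reproduces the RF's predictions at the corner points of the design; it does not measure how well either model predicts real yield.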

This may sound like cheating and may not be valid, so I would like to get experts' opinions on this approach.

The goal here is to find the feature settings that maximize the yield.

 

Thanks,

1 REPLY
P_Bartell
Level VIII

Re: DOE

My two cents: why bother with the DOE? You already have a model from the random forest, which you can use with the Prediction Profiler to find the optimal settings for your 'features' (which I am interpreting as factors, in DOE speak). Plus, I think it's cheating a bit because, if nothing else, there is zero noise in the system with respect to the responses: they are deterministic outputs of the RF model for each of the treatment combinations.
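
Both points can be illustrated with a small sketch in Python with scikit-learn (an assumption; the thread itself is about JMP). The data are synthetic, the names are hypothetical, and the random search is only a crude stand-in for exploring the fitted model directly; it is not how JMP's Prediction Profiler works.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)

# Hypothetical noisy dataset: 5 factors, one response.
X = rng.uniform(0.0, 1.0, size=(400, 5))
y = 2.0 * X[:, 0] - X[:, 1] ** 2 + rng.normal(0.0, 0.5, size=400)

rf = RandomForestRegressor(n_estimators=100, random_state=1).fit(X, y)

# Point 1: "replicate runs" of the same settings return identical responses,
# so the simulated experiment contains no noise at all.
probe = X[:10]
replicates = np.vstack([rf.predict(probe) for _ in range(3)])
replicate_spread = replicates.max(axis=0) - replicates.min(axis=0)

# Point 2: search the fitted model directly for high-yield settings,
# staying inside the observed range of each factor.
lows, highs = X.min(axis=0), X.max(axis=0)
candidates = rng.uniform(lows, highs, size=(20000, X.shape[1]))
pred = rf.predict(candidates)
best_idx = int(np.argmax(pred))
best_settings = candidates[best_idx]
best_pred = pred[best_idx]

print("spread across replicates:", replicate_spread)
print("best settings found:", np.round(best_settings, 3))
print("predicted yield at those settings:", round(float(best_pred), 3))
```

Because the search queries the RF itself, it skips the intermediate OLS fit entirely, and any optimum it finds is only as trustworthy as the RF's predictions in that region.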