Hi @lala,
If you want to reproduce the analysis on another file, you can launch the Partition platform manually with the following settings:
- Your response (here XX) in the "Y, Response" panel
- Your factor(s) (here YY) in the "X, Factor" panel
- Finally, you can set a validation portion (between 10 and 20%) in the corresponding panel (bottom left).
The JSL code corresponding to this analysis would be:
// Launch platform: Partition
Data Table( "pic" ) << Partition(
    Y( :XX ),
    X( :YY ),
    Validation Portion( 0.2 ),
    Informative Missing( 1 )
);
However, I don't know how you could automate the splitting. You may have to create a metric that indicates when to stop splitting, based for example on the difference between training R² and validation R², or on how much a metric changes between the previous split and the new one (for training and validation R², for example).
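To make the idea of such a stopping metric more concrete, here is a minimal sketch of the logic, written in Python rather than JSL purely for illustration. It assumes you have already recorded the training and validation R² after each split (for example by reading them off the report). The function name `choose_n_splits` and the thresholds `max_gap` and `min_gain` are hypothetical, not JMP options, and the numbers are made up.

```python
def choose_n_splits(r2_train, r2_valid, max_gap=0.10, min_gain=0.01):
    """Return how many splits to keep (hypothetical stopping rule).

    r2_train[i] and r2_valid[i] are the R² values after i + 1 splits.
    Splitting stops before the first split where either
      - training R² exceeds validation R² by more than max_gap, or
      - validation R² improves by less than min_gain over the previous split.
    """
    if len(r2_train) != len(r2_valid):
        raise ValueError("r2_train and r2_valid must have the same length")

    keep = 0
    previous_valid = 0.0  # assumption: R² of the unsplit tree is about 0
    for train, valid in zip(r2_train, r2_valid):
        overfitting = (train - valid) > max_gap
        stalled = (valid - previous_valid) < min_gain
        if overfitting or stalled:
            break
        keep += 1
        previous_valid = valid
    return keep


# Made-up R² values, for illustration only
r2_train = [0.40, 0.62, 0.75, 0.83, 0.90]
r2_valid = [0.38, 0.58, 0.70, 0.705, 0.69]
n_splits = choose_n_splits(r2_train, r2_valid)
print(n_splits)
```

With these values the rule keeps the first three splits: at the fourth one, validation R² barely moves while training R² keeps climbing. The two thresholds would need to be tuned to your data.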
The red vertical lines come from the prediction formula of the Decision Tree and are directly linked to the inner workings of tree-based methods. A Decision Tree creates splits in the factor values based on a criterion, in order to create more homogeneous subsets of data. For a Regression Tree, this criterion is often MSE (Mean Squared Error), MAE (Mean Absolute Error) or SSE (Sum of Squared Errors).
The value used for splitting is determined by testing every value of every factor and choosing the one that minimizes the SSE. By splitting your data into "chunks" and predicting the mean value within each chunk, you reduce the difference between predicted and actual values, and therefore the SSE. This explains the "staircase" look of the prediction formula.
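The split search described above can be sketched in a few lines of Python (again only as an illustration, not the code JMP runs). It handles a single factor and a single split; a real tree repeats this for every factor and then recursively inside each resulting subset. The function name `best_split`, the data, and the choice of placing the threshold at the midpoint between two neighbouring values are assumptions made for this sketch.

```python
def best_split(x, y):
    """Return (threshold, sse) for the single split of x that minimizes SSE."""

    def sse(values):
        # Sum of squared errors around the mean of the chunk
        mean = sum(values) / len(values)
        return sum((v - mean) ** 2 for v in values)

    pairs = sorted(zip(x, y))  # sort the rows by factor value
    best_threshold, best_sse = None, float("inf")
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between two identical factor values
        left = [yv for _, yv in pairs[:i]]
        right = [yv for _, yv in pairs[i:]]
        total = sse(left) + sse(right)
        if total < best_sse:
            # Assumption: place the cut halfway between the two neighbours
            best_threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
            best_sse = total
    return best_threshold, best_sse


# Made-up data with an obvious jump between x = 3 and x = 4
x = [1, 2, 3, 4, 5, 6]
y = [10, 11, 10, 20, 21, 19]
threshold, sse_after = best_split(x, y)
print(threshold, sse_after)
```

Here the chosen cut falls between 3 and 4, and each side is then predicted by its own mean, which is what produces one "step" of the staircase.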
In the data table, the predicted values from this formula are in the saved column "XX Predictor". To save this column yourself from the platform, click on the red triangle next to "Partition", go into "Save Columns" and choose "Save Prediction Formula". As you can see, the predicted value is the same within each range/split, and corresponds to the average of the actual values in that split/group. This is quite helpful in your use case, as you can directly identify groups of similar values (those that share the same predicted value from the Decision Tree).
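If you export the actual and predicted columns, identifying the groups amounts to grouping rows by their predicted value. A minimal Python sketch, assuming two plain lists of equal length (the function name `groups_by_prediction` and the values are hypothetical):

```python
from collections import defaultdict


def groups_by_prediction(actual, predicted):
    """Group actual values by the predicted value they share."""
    groups = defaultdict(list)
    for a, p in zip(actual, predicted):
        groups[p].append(a)
    return dict(groups)


# Made-up values: two leaves, predicting 11.0 and 21.0
actual = [10, 11, 12, 20, 22]
predicted = [11.0, 11.0, 11.0, 21.0, 21.0]

groups = groups_by_prediction(actual, predicted)
for value, members in groups.items():
    # Each predicted value equals the mean of the actual values in its group
    print(value, members, sum(members) / len(members))
```

Grouping on exact equality works here because every row in a leaf receives the identical predicted value; with rounded or re-computed predictions you may need a tolerance instead.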
Some resources about Regression Trees:
StatQuest (YouTube): https://youtu.be/g9c66TUylZ4?si=-oyWbyieIlfwxLoH
Interpretable Machine Learning (Christoph Molnar): https://christophm.github.io/interpretable-ml-book/tree.html
https://saedsayad.com/decision_tree_reg.htm#:~:text=Decision%20tree%20builds%20regression%20or,decis....
Hope this answer will help you,
Victor GUILLER
L'Oréal Data & Analytics
"It is not unusual for a well-designed experiment to analyze itself" (Box, Hunter and Hunter)