
best model

maryam_nourmand — Wed, 10 Jul 2024 07:51:53 GMT

Hello
My question is: how can I find the best parametric model that fits my dataset well?
I want to find a parametric model that predicts my response well.

Re: best model

Victor_G — Wed, 10 Jul 2024 12:47:06 GMT

Hello @maryam_nourmand,

Your question concerns a broad topic, and there may be a lot of questions to address before answering it:

How is the data collected ? Observational study, experimental design, ... ? Representativeness and completeness of the dataset ? An Exploratory Data Analysis may be helpful to detect some patterns and possible pitfalls regarding the assumptions in regression models, like multicollinearity which may require adapted model like PLS or pre-processing steps like PCA.
Objective of the model(s) ? Causal explanations, prediction & optimization, or both (also linked to the available dataset and collection method) ?
Validation strategy/feature selection ? How to ensure the model(s) created has the right level of complexity and still has good predictive performance for example ? Do you assess the model performances and robustness through a "standard" Machine Learning validation strategy (with cross-validation or train/validation/test splits), or through a "statistically-oriented" approach, based on likelihood, information criteria (AICc, BIC), p-values ... ? Note that the model complexity should also be directly limited by the data collection : if you have factors with 3 different levels for example, you won't be able to fit higher terms than 2nd order terms.
Evaluation/selection metrics and threshold ? How do you evaluate the models ? What is the selection process/criterion : do you select the ones with the best predictive results with the selected metric, or do you select all models which have a better performance than a benchmark model or a naive model, ... ? How do you finally test the model ?
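For the "statistically-oriented" route mentioned above, the information criteria can be computed directly from a fitted model's maximized log-likelihood. Here is a minimal Python sketch using the standard textbook definitions. The function names `aic`, `aicc` and `bic` are illustrative only (they are not JMP functions); `k` is the number of estimated parameters and `n` the number of observations.

```python
import math


def aic(log_likelihood: float, k: int) -> float:
    """Akaike information criterion: 2k - 2 ln(L)."""
    return 2 * k - 2 * log_likelihood


def aicc(log_likelihood: float, k: int, n: int) -> float:
    """Small-sample corrected AIC; only defined when n > k + 1."""
    if n - k - 1 <= 0:
        raise ValueError("AICc needs n > k + 1")
    return aic(log_likelihood, k) + (2 * k * (k + 1)) / (n - k - 1)


def bic(log_likelihood: float, k: int, n: int) -> float:
    """Bayesian information criterion: k ln(n) - 2 ln(L)."""
    return k * math.log(n) - 2 * log_likelihood
```

Lower values are better for all three, and they are only comparable between models fitted to the same response on the same rows.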

Some of these questions and answers were already described in previous posts :

https://community.jmp.com/t5/Discussions/Statistical-Significance/m-p/765928/highlight/true#M94573

https://community.jmp.com/t5/Discussions/Analysis-of-split-plot-design-with-full-factorial-vs-RSM/m-p/770579/highlight/true#M95183

Creating, comparing and selecting model(s) requires evaluation metrics linked to your objective and thresholds/criteria to select one or several models. If you simply want the best predictive model, you could:

Create a model with a standard ML validation strategy (cross-validation for example) or a strategy able to control the model's complexity,
Use one or several metrics linked to predictive accuracy, like RMSE, MAPE, ...
Compare models based on the metric(s) and domain expertise : which one(s) is/are the most appropriate/relevant for your topic and which ones have the best performances,
Choose to estimate individual predictions with the selected model(s) to see how/where they differ, and/or to use a combined model to average out the prediction errors.
Test the model in "real" situation/production environment.
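The first three steps above can be sketched outside JMP as well. The following Python example is only an illustration on made-up toy data: it compares a naive benchmark (always predict the mean) with a simple least-squares line using k-fold cross-validated RMSE. All names (`fit_mean`, `fit_line`, `cross_validated_rmse`) and the toy data are assumptions for the sketch, not anything from the original dataset.

```python
import math
import random


def rmse(actual, predicted):
    """Root mean squared error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def mape(actual, predicted):
    """Mean absolute percentage error, in percent (assumes no actual value is 0)."""
    return 100 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


def fit_mean(xs, ys):
    """Naive benchmark: always predict the training mean."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y


def fit_line(xs, ys):
    """Least-squares line y = a + b * x with one predictor."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    a = mean_y - b * mean_x
    return lambda x: a + b * x


def cross_validated_rmse(fit, xs, ys, k=5, seed=0):
    """Average held-out RMSE over k folds."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    scores = []
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        scores.append(rmse([ys[i] for i in fold], [model(xs[i]) for i in fold]))
    return sum(scores) / k


# Toy data (assumption): a linear trend plus Gaussian noise.
rng = random.Random(1)
xs = [i / 10 for i in range(100)]
ys = [2.0 + 0.5 * x + rng.gauss(0, 0.3) for x in xs]

cv_naive = cross_validated_rmse(fit_mean, xs, ys)
cv_line = cross_validated_rmse(fit_line, xs, ys)
print(f"naive: {cv_naive:.3f}  line: {cv_line:.3f}")
```

A candidate model is only worth keeping if it clearly beats the naive benchmark on the held-out folds; the same loop works with `mape` or any other metric in place of `rmse`.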

You might be interested in these resources as well:

(it might help screening parametric and Machine Learning models options and compare them simultaneously)

I hope this first discussion starter will help you,

Re: best model

maryam_nourmand — Thu, 11 Jul 2024 12:05:32 GMT

If I want to explain my goal more precisely:

I have an initial dataset related to cancer patient data. I want to find a suitable parametric statistical model that best fits my dataset. Using this model, I aim to simulate data in order to apply a shift in the model's intercept. Ultimately, I want to see how quickly my pre-existing machine learning model can detect this shift in a control chart (i.e., obtaining the ARL). For this purpose, I need the best-fitting model on my data so that I can simulate data from it.

Re: best model

Victor_G — Thu, 11 Jul 2024 13:50:19 GMT

Hi @maryam_nourmand,

Ok, then you're more likely taking a Data Mining approach, with a fixed dataset of observational data where you try to fit an acceptable predictive model. Using validation columns (with stratification on your features) will help validate your model (to avoid overfitting) and test it on production data. In terms of modeling strategy, you'll very likely use stepwise approaches to select features, and Generalized Regression approaches with the validation column method.

As an example, I used the Cancer_Data dataset from Kaggle to predict if there is a benign or malign cancer based on individual characteristics : https://www.kaggle.com/datasets/erdemtaha/cancer-data?resource=download

After a first exploratory data analysis focused mainly on distributions and correlations between features, I created a validation column formula (stratification on the features) with 70/20/10 ratios for training/validation/test sets, and then used a Generalized Regression model with the validation column method, with all main effects and two-factor interaction terms entered as possible terms in the model. I chose an adaptive Elastic Net as the features are strongly correlated, but there might be other options as well, like PLS or PCA pre-processing.
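For readers who want to reproduce the 70/20/10 split outside JMP, here is a rough Python sketch. It is a simplification: it stratifies on a single categorical column, whereas the JMP validation column described above stratifies on the features. The function name `stratified_validation_column` and the labels "B"/"M" in the usage are assumptions for illustration.

```python
import random
from collections import defaultdict


def stratified_validation_column(labels, ratios=(0.7, 0.2, 0.1), seed=0):
    """Assign 'Training', 'Validation' or 'Test' to each row, stratified by label."""
    rng = random.Random(seed)
    rows_by_label = defaultdict(list)
    for row, label in enumerate(labels):
        rows_by_label[label].append(row)

    assignment = [None] * len(labels)
    for rows in rows_by_label.values():
        rng.shuffle(rows)
        n_train = round(ratios[0] * len(rows))
        n_valid = round(ratios[1] * len(rows))
        for position, row in enumerate(rows):
            if position < n_train:
                assignment[row] = "Training"
            elif position < n_train + n_valid:
                assignment[row] = "Validation"
            else:
                assignment[row] = "Test"  # whatever remains after the first two sets
    return assignment


# Usage on made-up labels: 60 benign ("B") and 40 malignant ("M") rows.
labels = ["B"] * 60 + ["M"] * 40
validation_column = stratified_validation_column(labels)
```

Each class keeps roughly the same 70/20/10 proportions, so a rare class is not accidentally left out of the validation or test set.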

You can then save the formula of this model, and use the Prediction Profiler to create simulations, to assess impact of features effects on the response, and possibly estimate the effect of increasing noise on the predicted response.

I hope these few options may help you for your topic,

Re: best model

dlehman1 — Thu, 11 Jul 2024 15:06:05 GMT

Your response prompts me to ask: what do you mean by a "parametric statistical model"? I usually think of machine learning models as non-parametric, so are you excluding such models here? It is unclear since you say you have a pre-existing machine learning model. Do you want to compare a parametric and a non-parametric model? If you use the Model Screening platform, you can build a number of both types of predictive models. Using whatever you find to be the "best fitting," you can then save the prediction formula in order to do simulations. I'm not entirely sure what you mean by a "shift in the model's intercept" but I think you could just put an additive disturbance into the formula to generate the simulated data.

Re: best model

maryam_nourmand — Thu, 11 Jul 2024 19:22:47 GMT

Thank you for your response.

But I think for simulation, it is better to go to the section `save columns->save simulation formula` because it contains a formula for simulation that I can use for writing code as well, right?

And thank you for introducing the dataset.

But do you have a dataset related to a treatment process that includes multiple stages? I mean, after the first treatment, the disease relapses, and the second treatment is performed, and the treatment information such as drug dose, etc., is recorded. If you have such a dataset, I would appreciate it if you could share it.

Re: best model

maryam_nourmand — Thu, 11 Jul 2024 19:27:36 GMT

Yes, by "parametric model" I do not mean machine learning models.

My goal in finding the best parametric model is to simulate and generate data with a larger quantity than my initial dataset. I want to use more data for my machine learning model, but I want the simulated data to closely resemble and be similar to my initial dataset.

Re: best model

dlehman1 — Thu, 11 Jul 2024 21:33:29 GMT

I don't see why you need a parametric model to do that. You can run any predictive model and use the Profiler to simulate any number of additional data points, specifying different values for the independent variables and adding random noise to the predictions. Perhaps I am not understanding what you intend to do, but I don't see the parametric model part of this as necessary.
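The idea above does not depend on the model type, which a small Python sketch may make clearer. It assumes you have some prediction function exported from any model; here `predict` is a hypothetical callable, and the input rows, noise level and `shift` argument are illustrative assumptions. It resamples observed input rows, predicts, and adds Gaussian noise.

```python
import random


def simulate_rows(predict, observed_inputs, n_new, noise_sd, shift=0.0, seed=0):
    """Generate n_new (inputs, response) pairs from any prediction function.

    predict: callable mapping an input row to a predicted response (hypothetical).
    shift:   optional additive disturbance applied to every prediction.
    """
    rng = random.Random(seed)
    simulated = []
    for _ in range(n_new):
        row = rng.choice(observed_inputs)  # resample an observed input row
        response = predict(row) + shift + rng.gauss(0.0, noise_sd)
        simulated.append((row, response))
    return simulated


# Usage with a made-up prediction function and two made-up input rows.
def toy_predict(row):
    return 1.0 + 2.0 * row["dose"]


toy_inputs = [{"dose": 1.0}, {"dose": 2.0}]
new_data = simulate_rows(toy_predict, toy_inputs, n_new=500, noise_sd=0.3)
```

Resampling whole rows keeps the correlations between inputs intact; drawing each input independently would not.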

Re: best model

maryam_nourmand — Fri, 12 Jul 2024 01:45:39 GMT

What is the difference between saving simulation formulas through "Save Columns -> Save Simulation Formula" and simulating using the Profiler tool?

The reason I am using a parametric model for simulation is that I want to have the relationship and the simulation formula so that I can write the corresponding code and repeat this process 100 times in a loop (to obtain the ARL of my control chart based on the model I constructed in phase I ).

If I were to do this manually with the software, it would be time-consuming and difficult.
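The loop described here could look roughly like the Python sketch below. It is only a skeleton under stated assumptions: the process is a stand-in (normal observations whose mean moves by `shift`) and the signal rule is a plain 3-sigma Shewhart limit, not the machine-learning-based chart from the project. In practice `draw` would call the simulation formula and `signals` would call the actual chart. All function names are hypothetical.

```python
import random
import statistics


def run_length(draw, signals, max_steps=10_000):
    """Number of observations drawn until `signals` first returns True."""
    for step in range(1, max_steps + 1):
        if signals(draw()):
            return step
    return max_steps  # censored: no signal within max_steps


def average_run_length(make_draw, signals, n_reps=100, seed=0):
    """Mean run length over n_reps independent simulated sequences."""
    rng = random.Random(seed)
    lengths = [run_length(make_draw(rng), signals) for _ in range(n_reps)]
    return statistics.mean(lengths)


def shifted_normal(shift):
    """Stand-in process: unit-variance normal whose mean is moved by `shift`."""
    return lambda rng: (lambda: rng.gauss(shift, 1.0))


def three_sigma(value):
    """Stand-in signal rule: point outside the +/- 3 sigma limits."""
    return abs(value) > 3.0


arl_in_control = average_run_length(shifted_normal(0.0), three_sigma)
arl_shifted = average_run_length(shifted_normal(2.0), three_sigma)
print(f"in-control ARL: {arl_in_control:.1f}, shifted ARL: {arl_shifted:.1f}")
```

With only 100 replicates the ARL estimate is quite noisy, especially in control, so it may be worth increasing `n_reps` once the loop runs.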

Re: best model

Victor_G — Fri, 12 Jul 2024 07:34:23 GMT

Hi @maryam_nourmand,

It depends on what your objective is.

Saving a simulation formula can help you assess the coefficients distributions of your model, by switching in the diagnosis response with the diagnosis simulation formula : Simulate

The simulation formula is just a condensed version of the prediction formulas you can save from the model fit report: instead of having one probability column for each class and a final classification column, you only save one column with probabilities calculation and classification inside the same formula.

If you're more interested in the robustness of your model regarding variations in your inputs, then using the Simulator from the Prediction Profiler with variations in the inputs (and possibly adding noise to the output) and running a Simulation Experiment may help you. You can then generate variations in your inputs, and possibly shift the distributions of your features to see how it affects your model, as well as increase the noise to see how robust your prediction model might be.

On a side note and related to the points brought by @dlehman1 :

"My goal in finding the best parametric model is to simulate and generate data with a larger quantity than my initial dataset. I want to use more data for my machine learning model, but I want the simulated data to closely resemble and be similar to my initial dataset."

In order to simulate and generate data that is similar to the real data collected, you have to "mimic" the data generation process. If there are strong non-linearities, correlations, or other patterns found by your ML model and not considered by the parametric model, you should simulate and generate data with the ML model, or else you'll introduce a strong bias in the simulation/generated data. I don't understand the need to have a separate model for data generation and prediction for this use case.

Note that no matter how predictive a model might be, it's only a simplification of a phenomenon you're trying to understand and predict, so the data generation from this model may still be more or less biased (and probably generate less noisy outcomes than real data). But it can still be useful to assess its robustness to variations in the data like you intend to do.

And no, sorry, I don't have a dataset that includes multiple treatment stages. I just found this one and used it to provide an illustrative example that could fit your use case, that's all.

Hope this answer will help you,

Re: best model

dlehman1 — Fri, 12 Jul 2024 10:43:18 GMT

@VictorG

I guess I'm going to learn some new JMP features. Please tell me where you got the Simulation window with the Columns to switch out that you are showing. I can't find it anywhere. I do see where to run simulation experiments, but I can't find anywhere that prompts saving a simulation formula. Thanks.

Re: best model

Victor_G — Fri, 12 Jul 2024 11:17:46 GMT

Hi @dlehman1,

The simulation feature can be accessed by right-clicking on any panel/report of model platform : Launch the Simulate Feature

Here as an example on model summary of SVM :

Then you'll be prompted to choose the columns to switch in and switch out. In the context of Simulate, you could use it on the parameters estimates of a linear regression model, switching out the response with the simulated response to assess the parameters estimates distribution, or on the model summary to assess the consistency of the model's performances : The Simulate Window

Concerning the simulation formula, it can be found and saved as one of the options under "Save Columns" in the Generalized Regression platform :

I use this "Simulate" option to assess robustness of Machine Learning models using a Validation column formula. Using this column as a validation in the model dialog window, I can then simulate and switch in and out these column values, to represent slightly different training/validation/test sets with the same stratification or grouping method. It can help assess robustness of the algorithm (like cross-validation would also enable : https://community.jmp.com/t5/Discussions/How-can-I-automate-and-summarize-many-repeat-validations-into/m-p/632192/highlight/true#M83061), and possibly test the performances improvement (if any) of the model by fine-tuning the hyperparameters (https://community.jmp.com/t5/Discussions/Boosted-Tree-Tuning-TABLE-DESIGN/m-p/609591/highlight/true#M81062).

Hope this answer may help you,

Re: best model

maryam_nourmand — Sat, 13 Jul 2024 05:10:34 GMT

Hello again,
For example, look at this simulation formula:
(-0.741104921699375)
+ 0.0237113448295257 * :weight
+ Match( :sex, 0, (:age - 33.8085106382979) * 0.0091364700024452, 1, (:age - 33.8085106382979) * 0, . )
+ (:weight - 63.3936170212766) * Match( :family, 0, -0.015643799304455, 1, 0, . )
+ (:tumor size - 2.49893617021277) * Match( :family, 0, 0.441690642908435, 1, 0, . )
+ (:tumor size - 2.49893617021277) * Match( :hyper, 0, -0.346279666197286, 1, 0, . )
+ Match( :hyper, 0, Match( :dissection, 0, 0.354356313550484, 1, 0, . ), 1, Match( :dissection, 0, 0, 1, 0, . ), . )
+ (0.15468861367462 + 0) * Tangent( Pi() * (Random Uniform() - 0.5) )

I need a formula like this for my project because I have to write Python code and generate data 100 times to calculate the ARL.
Did I manage to convey my point?
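As a rough illustration, the JSL formula above might be translated to Python along the following lines. Several things here are assumptions rather than facts from the thread: the dictionary keys (in particular `tumor_size` for the `:tumor size` column), the helper names, and the `intercept_shift` argument, which is added only to show where a shift in the intercept could enter. The last term, a scale times `Tangent( Pi() * (Random Uniform() - 0.5) )`, is the inverse-CDF way of drawing Cauchy noise, so the model appears to have been fitted with a Cauchy response distribution.

```python
import math
import random


def match(value, cases):
    """Mimic JSL Match(): return the case for `value`, or NaN (JSL missing '.')."""
    return cases.get(value, math.nan)


def linear_predictor(row, intercept_shift=0.0):
    """Deterministic part of the saved simulation formula."""
    weight, age, tumor = row["weight"], row["age"], row["tumor_size"]
    sex, family = row["sex"], row["family"]
    hyper, dissection = row["hyper"], row["dissection"]
    return (
        -0.741104921699375 + intercept_shift
        + 0.0237113448295257 * weight
        + (age - 33.8085106382979) * match(sex, {0: 0.0091364700024452, 1: 0.0})
        + (weight - 63.3936170212766) * match(family, {0: -0.015643799304455, 1: 0.0})
        + (tumor - 2.49893617021277) * match(family, {0: 0.441690642908435, 1: 0.0})
        + (tumor - 2.49893617021277) * match(hyper, {0: -0.346279666197286, 1: 0.0})
        + match(hyper, {
            0: match(dissection, {0: 0.354356313550484, 1: 0.0}),
            1: match(dissection, {0: 0.0, 1: 0.0}),
        })
    )


def simulate_response(row, rng, intercept_shift=0.0, scale=0.15468861367462):
    """One simulated response: linear predictor plus Cauchy noise."""
    cauchy_noise = scale * math.tan(math.pi * (rng.random() - 0.5))
    return linear_predictor(row, intercept_shift) + cauchy_noise


# Usage on a made-up patient row, with and without a shift in the intercept.
rng = random.Random(42)
patient = {"weight": 70.0, "age": 40.0, "tumor_size": 3.0,
           "sex": 0, "family": 1, "hyper": 0, "dissection": 0}
baseline = [simulate_response(patient, rng) for _ in range(100)]
shifted = [simulate_response(patient, rng, intercept_shift=0.5) for _ in range(100)]
```

Because Cauchy noise has very heavy tails, a few simulated values will be extreme; that is a property of the fitted distribution, not a bug in the translation.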