abmayfield
Level VI

building multiple neural network models

I am using the Neural platform to make predictions, and, being new to neural networks, I am only slightly familiar with the input parameters: number of hidden layers, number of nodes per hidden layer, boosting models, learning rate, and tours. What I want to do is minimize RMSE and the validation set's misclassification rate. What I've been doing is iteratively changing each parameter one by one, saving the model performance statistics, and pasting them into a new JMP table, but this is going to take days since there are so many combinations of layers, nodes, tours, etc. Would it be possible to write a script so that JMP Pro builds, say, 1,000 models and dumps the results into a table, so that I don't have to manually change each model input parameter?

#Hidden layers: 1 or 2

#Sigmoidal nodes: 0, 1, 2, 3, or 4

#Linear nodes: 0, 1, 2, 3, or 4

#Radial nodes: 0, 1, 2, 3, or 4

#boosting models: no idea, 1-5 maybe?

#learning rate: 0-0.5 in 0.1 increments

#tours: 1, 5, 10, 20, 100

 

So that would be 2 x 5 x 5 x 5 x 5 x 5 x 5 = 31,250 combinations.

I guess I could whittle a few of these down once I am more familiar with the dataset. Of course, having many nodes and a high number of boosting models doesn't make sense (nor do certain other combinations), but we're still talking about potentially hundreds of models worth testing. Surely this sort of model screening/comparison could be scripted, right?
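A minimal sketch of the screening loop being asked for, written in Python with scikit-learn's MLPRegressor as a stand-in for JMP's Neural platform (a JSL version would call the Neural platform inside a loop in the same spirit; all grids, sizes, and column names here are illustrative assumptions):

```python
from itertools import product

import pandas as pd
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

# Toy data standing in for the real table.
X, y = make_regression(n_samples=200, n_features=10, noise=0.5, random_state=1)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=1)

layers = [1, 2]                 # number of hidden layers
nodes = [1, 2, 3, 4]            # nodes per hidden layer
rates = [0.1, 0.2, 0.3]         # learning rates

rows = []
for n_layers, n_nodes, lr in product(layers, nodes, rates):
    model = MLPRegressor(hidden_layer_sizes=(n_nodes,) * n_layers,
                         learning_rate_init=lr, max_iter=2000, random_state=1)
    model.fit(X_tr, y_tr)
    rmse = mean_squared_error(y_val, model.predict(X_val)) ** 0.5
    rows.append({"layers": n_layers, "nodes": n_nodes,
                 "learning_rate": lr, "val_RMSE": rmse})

# One results table, sorted so the best model floats to the top --
# no manual copying and pasting of performance statistics.
results = pd.DataFrame(rows).sort_values("val_RMSE")
print(results.head())
```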

Anderson B. Mayfield
60 REPLIES
SDF1
Super User

Re: building multiple neural network models

Hi @abmayfield ,

 

  Sorry for the late reply; I was out on vacation last week.

 

  With regard to your question about the holdback portion, it is my understanding that JMP will randomly select that percentage of rows based on your holdback value. So, if you have 100 rows and select 0.3 as the holdback portion, each time JMP runs the NN platform it will select a new random set of 30 rows as the validation portion. In terms of using the tuning table: on each iteration, the script calls the NN platform with a new set of model parameters. If you've chosen the holdback option for validation and entered a number between 0 and 1, then on each new run in the tuning table, JMP will select a different set of rows according to the holdback portion. However, if you are running multiple tours during a single NN model fit, it will use the same holdback rows during each tour; it is only when the NN platform is called again that a new validation subset is generated. This is my understanding, at least.
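A quick sketch of that behavior in Python, with train_test_split standing in for the holdback mechanism (an analogue only, not the platform's internals): with no fixed seed, every fresh call holds back a different random 30%.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rows = np.arange(100)                                # 100-row table
_, held_1 = train_test_split(rows, test_size=0.3)    # first "platform call"
_, held_2 = train_test_split(rows, test_size=0.3)    # second call, new subset
print(sorted(held_1)[:5], sorted(held_2)[:5])        # almost surely different
```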

 

  For the above situation, it is not possible to know which rows are held back (at least not that I'm aware of). If you wanted to do something like that, I would choose the excluded-rows holdback validation option: you can then select which rows in your data table should be excluded, and only those will be used for validation. Presumably, you could also feed the NN platform a specific random seed, say 10191, for each tuning run with a holdback portion, and it will use that seed each time to generate the validation subset. Since it's starting from the same random seed each time the NN platform is called, in theory JMP should select the same rows for each tuning run. However, in that case you still wouldn't know which rows those were. Again, this is my understanding of how it works.
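And a sketch of the fixed-seed idea in the same Python analogue; a side benefit of doing it in a script is that you can simply print which rows were held back:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rows = np.arange(100)
_, held_a = train_test_split(rows, test_size=0.3, random_state=10191)
_, held_b = train_test_split(rows, test_size=0.3, random_state=10191)
assert list(held_a) == list(held_b)    # same seed, same holdback subset
print(sorted(held_a))                  # ...and now you know which rows they are
```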

 

Hope this helps!

DS

abmayfield
Level VI

Re: building multiple neural network models

Thanks so much for explaining that, and I'm glad I asked, because my assumption was wrong. It makes more sense, though, that each "run" would be a random shuffling in terms of which samples are in the training set and which are held back. As you stated before, with my small, wide dataset it makes more sense to manually exclude rows or else create a validation column, so really I'm just trying to make sure I "master" all the various options, since hopefully in the future I WILL have a dataset in which holdback and k-fold could be employed. I find the latter especially attractive/clever.

I have run about 15,000 models with your GUI, no exaggeration. Now I think I'm going to play around with comparing NN vs. XGBoost, a technique I had never heard of until your GUI. I don't even know if it's in JMP Pro 16 (though maybe it's hidden somewhere)!

Anderson B. Mayfield
SDF1
Super User

Re: building multiple neural network models

Hi @abmayfield ,

 

  As for the XGBoost add-in, you'll need to download and install it separately. It works with JMP Pro 15 and up. You can find the add-in here. Be sure to read the documentation and try some of the test modeling that Russ explains in the post before moving on to your own modeling. Right now, that platform still requires a fair bit of understanding from the user about how to modify/manipulate the hyperparameters of the fit. I found this blog very helpful when it comes to tuning the XGBoost platform (there are MANY different parameters to tune, which makes the whole space quite cumbersome). Although it's written for Python users, the same basic approach works well for the JMP platform. This video is also helpful, and you'll also want to read up on what each parameter in the model does, what its working ranges are, and why you may or may not want to change it from the default value here. It is definitely not a platform to dive into without doing some background homework. I found this out the hard way by diving in first and realizing I didn't know what I was doing. So I did some homework, and it helped a lot. Now I can tune models pretty quickly using the framework from the blog I listed above.

 

Good luck!

DS

abmayfield
Level VI

Re: building multiple neural network models

Thanks so much for all the info. I indeed would totally have tried to build tons of models without having any idea what I was doing. Two years ago I did this with another approach and was on the verge of trying to write a Nature article, thinking I'd changed marine biology as we know it, before I realized that I was interpreting the model completely incorrectly, haha. Thankfully, AI and machine learning models are all the rage at the moment where I work, so, even though my motivation for using them was driven by the traditional approaches NOT cutting it (rather than my wanting to do AI because it's "sexy"), I hope I can use this wave of interest in these sorts of analyses to try to learn all the ins and outs. 

Anderson B. Mayfield
abmayfield
Level VI

Re: building multiple neural network models

I started to use your NN GUI again after upgrading to JMP Pro 16.1, and I started to get the "subscript range" error again. Have you encountered this issue?

 

On another note, do you have a DOE script for the XGBoost tuning table? I am going to explore that approach with your GUI as well! Thanks so much!

Anderson B. Mayfield
abmayfield
Level VI

Re: building multiple neural network models

Never mind on point #1. I think it was something weird on my end, or perhaps the lack of a suitable stratification of the samples into training, validation, and test sets. Basically, I have a totally unbalanced design in which one group I am trying to predict is much more common than the others, so I think this is causing the issue!

Anderson B. Mayfield
SDF1
Super User

Re: building multiple neural network models

Hi @abmayfield ,

 

  I have not upgraded to 16.1 yet, so I have not experienced the same error; however, I will keep my eyes open in case it happens and post an update if needed.

 

  As for XGBoost, I do have a tuning table to optimize that one; it's attached here. The XGBoost platform has many different tuning parameters that can be modified, some more influential than others. I typically follow a step-wise tuning of the model because the parameter space is just so big. The tuning table does not include all possible parameters, just the most influential ones (you can always add more with the appropriately named columns). You can access the whole list of parameters in the platform by opening the "Advanced Options" submenu in the XGBoost window. Since you mentioned that you have a highly imbalanced data set, you'll want to pay close attention to the scale_pos_weight parameter (its default is 1; for imbalanced binary targets, the XGBoost documentation suggests something like the ratio of negative to positive cases).
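For illustration, a hedged sketch of that ratio in the Python xgboost API (the JMP add-in exposes the same scale_pos_weight parameter; the data below are made up):

```python
import numpy as np
from xgboost import XGBClassifier

y = np.array([0] * 90 + [1] * 10)            # 9:1 imbalance, like a rare class
X = np.random.rand(100, 5)                   # toy predictors

spw = (y == 0).sum() / (y == 1).sum()        # negative/positive ratio = 9.0
model = XGBClassifier(scale_pos_weight=spw,  # up-weights the rare class
                      n_estimators=100, learning_rate=0.1)
model.fit(X, y)
```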

 

  In fact, you can even run a tuning design from the platform itself, where it will use a space-filling DOE to generate the tuning table. It's pretty handy and flexible. However, for keeping CPU and RAM use down, I find that running it from within JMP the way my GUI does works better.
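As a rough analogue of that space-filling tuning design, here is a Latin-hypercube sketch in Python (the parameter bounds are illustrative assumptions, not values from the platform):

```python
import pandas as pd
from scipy.stats import qmc

# 50 space-filling runs over three XGBoost hyperparameters.
sampler = qmc.LatinHypercube(d=3, seed=1)
raw = sampler.random(n=50)                   # points in [0, 1)^3
lo, hi = [0.01, 3, 0.5], [0.3, 10, 1.0]      # learning_rate, max_depth, subsample
table = pd.DataFrame(qmc.scale(raw, lo, hi),
                     columns=["learning_rate", "max_depth", "subsample"])
table["max_depth"] = table["max_depth"].round().astype(int)  # depth is an integer
print(table.head())
```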

 

  I've found that for my situations, following these steps works really well for XGBoost:

 

1. Fix the learning rate, max_depth, min_child_weight, gamma, subsample, and colsample_bytree (I set them to their default values at first) and find the right number of trees; the fixed parameters get tuned in the later steps.

 

2. Tune max_depth and min_child_weight

 

3. Tune gamma (it's a good idea to re-tune the number of trees before going to step 4)

 

4. Tune subsample and colsample_bytree

 

5. Tune alpha and lambda, the regularization parameters

 

6. Tune the learning rate: lower the learning rate and increase the number of trees.

 

  At each step, I generate a tuning table for the parameters that I wish to tune while keeping the others fixed at their values for that step; I typically run through this sequence twice. It's critical that you use cross-validation (either k-fold or a validation column) when running XGBoost. You can run the fit with no cross-validation, but then you risk overfitting the data and not ending up with a very good predictive model. A sketch of one of these steps appears below.
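To make one step concrete, here is a hedged sketch of step 2 in the Python xgboost API (random toy data; the fixed values are plausible defaults, not prescriptions):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

X = np.random.rand(200, 8)                    # toy data
y = np.random.randint(0, 2, 200)

# Everything except max_depth and min_child_weight stays fixed for this step.
base = XGBClassifier(n_estimators=100, learning_rate=0.1, gamma=0,
                     subsample=0.8, colsample_bytree=0.8)
grid = {"max_depth": [3, 5, 7, 9], "min_child_weight": [1, 3, 5]}

# 5-fold cross validation, as stressed above.
search = GridSearchCV(base, grid, cv=5, scoring="neg_log_loss").fit(X, y)
print(search.best_params_)                    # carry these into step 3 (gamma)
```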

 

Hope this helps!

DS

abmayfield
Level VI

Re: building multiple neural network models

Wow! Thanks so much for the table and the explanation. I was thinking that, since I am finally familiar with all the NN jargon, XGBoost would be a breeze, but it seems I still have a lot to learn (just to get the hang of the jargon, for one thing)! It seems XGBoost is sort of the "darling" of the machine learning world at the moment, but, more importantly, it might actually be better suited to the types of data I generate, so I definitely need to do a deep dive into all this (for instance, while your GUI is running in the background!). Speaking of which, would you mind if I mentioned the NN GUI, and very likely even demoed it, in my JMP Discovery talk (which I am recording this week)? I would be sure to give you credit, but I totally understand if you want to be the one to unveil it to a larger audience and wouldn't want to "steal your thunder." I would be using the Neural platform only, since I doubt I will have mastered XGBoost by Friday (when I record the talk), haha.

Anderson B. Mayfield
SDF1
Super User

Re: building multiple neural network models

Hi @abmayfield ,

 

  No problem, I'm glad it could help. Yeah, tuning XGBoost can be a bit more involved. I found a blog here that is very useful (it's the last blog entry on that page, though for some reason the link to the article is not responding). It gives an example of how tuning can be done in Python, but it's essentially the same in JMP once you've figured out the jargon for the model parameters. I also found this site and its accompanying documentation VERY helpful for reading up on what XGBoost does and which parameters make the model more or less conservative. One thing I like about it is that it's much more stable than boosted tree or random forest. Both of those have some stochastic behavior in R^2, resulting in models that differ slightly each time you run the fit with all the parameters the same. XGBoost gives you the same fit quality each time you run the same set of parameters.

 

  Please feel free to mention the NN GUI in your Discovery talk. I plan on attending virtually, and was going to try to go to the in-person event in Cary (I'm just an hour away), but I think with the continuing problems from the pandemic they stopped the in-person meetup. I have posted the GUI to the add-ins page, so technically I guess it's "published" and out there. Thank you for asking, and for thinking of maybe even demoing it. I hope it runs smoothly! I'm not the best programmer, but I can get things to work. If you run into any issues, please let me know and I'll try to correct them ASAP.

 

Thanks!

DS

abmayfield
Level VI

Re: building multiple neural network models

Great, thanks so much. I did notice something weird today: on my 13-inch MacBook Air running Catalina and JMP Pro 16.1, everything works. I then migrated over to my 16-inch work MacBook Pro, also running Catalina and JMP Pro 16.1, used the exact same data table and your GUI, and I get that weird "expected character arg 1 in access or evaluation of 'Outline Box', Outline Box/*###*/(o_box)" error regardless of what I enter. Do you think it could actually be linked to the display (i.e., the larger screen)? Surely not. It seems strange that a script would run on one computer and not the other. Any ideas?

Anderson B. Mayfield