abmayfield
Level VI

building multiple neural network models

I am using the Neural platform to make predictions, and, being new to neural networks, I am only slightly familiar with the input parameters: number of hidden layers, number of nodes per hidden layer, boosting models, learning rate, and tours. What I want to do is minimize RMSE and the validation misclassification rate. What I've been doing is iteratively changing each parameter one at a time, saving the model performance statistics, and pasting them into a new JMP table, but this will take days since there are so many combinations of layers, nodes, tours, etc. Would it be possible to write a script so that JMP Pro builds, say, 1,000 models and dumps the results into a table, sparing me from manually changing each input parameter?

#Hidden layers: 1 or 2

#Sigmoidal nodes: 0, 1, 2, 3, or 4

#Linear nodes: 0, 1, 2, 3, or 4

#Radial nodes: 0, 1, 2, 3, or 4

#Boosting models: no idea, 1-5 maybe?

#Learning rate: 0-0.5 in 0.1 increments

#Tours: 1, 5, 10, 20, 100

 

So that would be 2 x 5 x 5 x 5 x 5 x 6 x 5 = 37,500 combinations (the learning-rate grid contributes six values, 0 through 0.5), so on the order of 30,000-plus models.

I guess I could whittle a few of these down once I'm more familiar with the dataset. Of course, having many nodes and a high number of boosting models doesn't make sense (nor do certain other combinations), but we're still talking about potentially hundreds of models worth testing. Surely this sort of model screening/comparison could be scripted, right? Something like the sketch below is what I have in mind.
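A minimal JSL sketch of what I'm imagining: fit one Neural model per grid point and collect a fit statistic into a results table. The column names (:Y, :X1-:X3, :Validation), the grid values, and the report box/statistic positions are all placeholders; the box positions shift with JMP version and response type.

Names Default To Here( 1 );
dt = Current Data Table();
results = New Table( "NN screening",
    New Column( "nTanH", Numeric ),
    New Column( "Tours", Numeric ),
    New Column( "vRMSE", Numeric )
);
tour_list = {1, 5, 10, 20, 100};
For( nh = 1, nh <= 4, nh++,                      // sigmoidal (TanH) nodes
    For( t = 1, t <= N Items( tour_list ), t++,
        nn = dt << Neural(
            Y( :Y ),
            X( :X1, :X2, :X3 ),
            Validation( :Validation ),
            Fit( NTanH( nh ), Number of Tours( tour_list[t] ) ),
            Invisible
        );
        // pull the validation statistics out of the report; box index 10
        // and statistic position 1 are placeholders
        v_stats = (nn << Report)[Number Col Box( 10 )] << Get;
        nn << Close Window;
        results << Add Rows( 1 );
        r = N Rows( results );
        results:nTanH[r] = nh;
        results:Tours[r] = tour_list[t];
        results:vRMSE[r] = v_stats[1];
    )
);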

Anderson B. Mayfield
60 REPLIES
abmayfield
Level VI

Re: building multiple neural network models

Wow, this GUI has really gotten an upgrade! The only thing is that now I get an error of "Subscript Range in access or evaluation of 'V_stats[ /*###*/5 ]', V_stats[/*###*/5]" regardless of the validation type I use. I wonder if it's because I have such a small sample size for trying to have training, validation, and test samples for each of four groups (as Y's)?
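For context, that error means the script indexed past the end of a list it pulled from the report. A hypothetical guard around such a line, where V_stats stands in for whatever list the script reads, might look like:

If( N Items( V_stats ) >= 5,
    v_rmse = V_stats[5],
    Write( "V_stats only has ", N Items( V_stats ), " entries\!N" )
);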

Anderson B. Mayfield
SDF1
Super User

Re: building multiple neural network models

Hi @abmayfield ,

 

  Sorry it took a bit to get back to you. I was able to reproduce the error you were getting and found the mistake in my code: lines 348 and 349 had the wrong values for the variables ncol_box_v and ncol_box_test. They should be 10 and 19, respectively. I've corrected that and re-run the NN with the validation column, and it works as it should. I'm attaching the corrected code. Sorry for the inconvenience. Thanks for catching that, and if you find any other bugs, please let me know.
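  For anyone following along, those two variables index number column boxes in the Neural report; stripped to its core, the lookup is something like this (nn being the Neural platform object, with the box indices from the fix above):

rep = nn << Report;                          // nn = the Neural platform object
v_stats = rep[Number Col Box( 10 )] << Get;  // validation fit statistics
t_stats = rep[Number Col Box( 19 )] << Get;  // test fit statistics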

 

Hope this helps!

DS

abmayfield
Level VI

Re: building multiple neural network models

Thanks for looking into that. It does work now with a validation column, but I get the same error when I use holdback or kfold. I wonder if it's because those methods don't necessarily stratify the data such that each bin I am trying to predict is represented? I'll try it with a bigger dataset and see if I get the same "subscript range" error.

Anderson B. Mayfield
SDF1
Super User

Re: building multiple neural network models

Hi @abmayfield ,

 

  Thanks again for catching that and letting me know -- my models tend to have numerical responses, so I don't usually need to edit things for nominal data types.

 

  Anyway, I think there might have been a change from JMP 15 to 16 that led to different number column box positions in the report windows, because I needed to change all of those values from what they were to what they are now, and it worked before in JMP 15 (I recall testing it out). I've re-checked, and this updated version correctly runs with the different validation methods for NN: excluded rows holdback, holdback, kfold, and validation column. Sorry once again for the inconvenience. I'm glad you can test it out and let me know about bugs so that it can get better and help others. Please let me know if you find anything else!

 

Thanks!

DS

abmayfield
Level VI

Re: building multiple neural network models

AWESOME! Now we're cooking with fire. The only thing I see now is that, even though the GUI asks you to tell it which is the data table and which is the tuning table, it still sometimes gets them mixed up. I think it defaults to whichever table is closer to the "top of the pile" (I tend to have 10-20 tables open, or else this would likely not be a problem). 

Would it be possible to add a simple rule that says, "If the validation column has only training and validation terms, do not calculate the test data," or something along those lines? I guess the alternative is to go through the code manually and put "//" in front of all the "test" options. I ask because sometimes I use training and validation ONLY, rather than training/validation/test (since my sample sizes are small). If you enter a validation column with ONLY training and validation terms now, you get the "subscript range" error, and I think it's because the script is looking for "test" data that are not there! Something like the sketch below is what I mean.
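A rough JSL sketch of that rule, assuming the validation column is literally named "Validation":

dt = Current Data Table();
Summarize( dt, val_levels = By( :Validation ) );  // unique levels as strings
If( N Items( val_levels ) == 3,
    Write( "training, validation, and test present; read all stats\!N" ),
    Write( "only training and validation; skip the test boxes\!N" )
);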

Anderson B. Mayfield
SDF1
Super User

Re: building multiple neural network models

Hi @abmayfield ,

 

  I have that same issue with my data tables. I think you're right: if you don't run it directly from the data table you're modeling from, it somehow defaults to the last opened data table in your JMP home window. I do not know how to get around this.

 

  Hmm, the code should work with a validation column that has only training and validation, or all three. I wrote it to check the number of levels in the validation column and adjust accordingly, so that it records the appropriate fit statistics whether there are two levels or three. I will double-check that tomorrow. I think the subscript range error is likely due to the changes between 15 and 16. When 16 came out, I did not go back and double-check the code, but I ran it through multiple tests in 15 and everything worked. Nonetheless, I will fix this and get things corrected so that it works as expected; I just won't have time until Friday morning to get to it.

 

  Again, thanks for helping me debug the code, and sorry for the inconvenience. I'll get back to you when I've checked it for all the data types and validation types, as I did when submitting it as a JMP add-in.

 

Thanks!

DS

SDF1
Super User

Re: building multiple neural network models

Hi @abmayfield ,

 

  So, I found out what the problem was in my code. It wasn't an issue from one version of JMP to another; it has to do with how many levels there are in a nominal or ordinal response. This changes the location of the number column box in the report window. I needed to adjust the code to reflect that when going from a response with 2, 3, 4, or 6 levels, the fit statistics change location within the report. The nice thing is that the location shifts by a fixed amount related to how many levels there are, so I was able to fix the code relatively easily. It took me a while to track down that bug, though, because I had been using the PowderMetallurgy file from the sample data, and its response has only two levels, not four like yours. A sketch of the idea follows.
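  In outline, the fix amounts to computing the box index from the level count; the base index and step below are placeholders, not the script's actual numbers:

dt = Current Data Table();
Summarize( dt, y_levels = By( :Y ) );  // levels of the nominal response
base_index = 10;                       // box index for a two-level response (placeholder)
step = 3;                              // shift per additional level (placeholder)
ncol_box_v = base_index + (N Items( y_levels ) - 2) * step;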

 

  Attached is the updated version that is more generalized to different levels of the response variable. The code should already work for 2-level or 3-level validation columns as well as nominal, ordinal, and continuous responses.

 

  I tested all four fit platforms -- NN, Boosted Tree, Bootstrap Forest, and XGBoost -- with your data set, and it works properly across all data types and the different validation options. Please note that at present, JMP can only fit a two-level response (nominal or ordinal) using the Boosted Tree platform; the other platforms can handle more levels than that.

 

  I'm pretty sure it's now working as intended (except for defaulting to the last opened data table -- I do not know how to get around that). However, if you do find a bug in it, please let me know.

 

Thanks, and good luck!

DS

abmayfield
Level VI

Re: building multiple neural network models

This is SO awesome. I am doing batches of 100 iterations for each validation method and have yet to encounter any errors. I have not tried "excluded rows" yet but am confident it will work.

 

kfold always gives me the same, perfect (overfit?) solution, though I think this is a testament to the size/format of my dataset rather than any sort of error in the script.

 

Scientifically, I have been doing some research, and it appears that my "small" datasets (dozens of samples and hundreds of potential predictors) are probably more amenable to XGBoost; neural nets may be overkill. So I am going to try that platform, too. I didn't even know XGBoost was supported in JMP at all (whereas I had seen random forests and likely the other approaches in your GUI). In other words, I look forward to taking advantage of ALL components of this excellent research tool! Thanks so much for your help (again), and I'll let you know if I encounter any other errors.

Anderson B. Mayfield
SDF1
Super User

Re: building multiple neural network models

Hi @abmayfield ,

 

  I'm really glad to hear that it's working for you and I'm hopeful the bugs have been ironed out. I wish I had a better solution for the top of the pile data table issue. I honestly don't know how to get around that.

 

  Anyway, I do believe you are right about Kfold xvalidation. That method typically works better with tall tables and not so great with wide tables. If you have very few data entries, then using a validation column stratified on the response is a good idea. Or, if you use the GenReg platform, you can use the leave-one-out approach. But typically, Kfold is not the best if you have few data points to consider.

 

  I would definitely recommend looking into XGBoost; you can download the add-in from the JMP add-in page. It's a very stable and robust algorithm, and it works really well with a validation column.

 

  You might also consider looking into support vector machines, which JMP Pro has. The method was originally developed for classification problems but can also handle regression. In any case, it's always a good idea to compare different models using a hold-out data set that has not been used in training or xvalidation of the model. I typically build several different models and then look at how they compare on a completely new data set to see which is best at predicting the outcome.

 

Glad I could help!,

DS

abmayfield
Level VI

Re: building multiple neural network models

I have now run literally 10,000-odd models thanks to your epic GUI. I did have one question regarding holdback. Say my tuning table has 100 rows (i.e., different models to test). Does each row get a random holdback based on the percentage, or do the same X samples get held back for each model? I am guessing it's the latter; in other words, if samples A, B, C, and D are held back from model #1, those same four samples get held back from model #2. Regardless, my follow-up question would be: is it possible to KNOW which samples get held back?

In any event, as we discussed, a stratified validation column really makes more sense in my case, because I can ensure that at least one sample from each treatment is included in training, validation, and test, whereas a random 30% holdback could theoretically hold back ONLY samples from the SAME treatment (which would bias the prediction in favor of said treatment!).
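For reference, a rough JSL sketch of building such a stratified validation column by hand, so the held-back rows are always known; the :Treatment column name and the 60/20/20 split are placeholders:

Names Default To Here( 1 );
dt = Current Data Table();
dt << New Column( "Validation", Numeric, Nominal );  // 0 = training, 1 = validation, 2 = test
Summarize( dt, groups = By( :Treatment ) );          // unique treatment levels
For( g = 1, g <= N Items( groups ), g++,
    rows = dt << Get Rows Where( Char( :Treatment ) == groups[g] );
    rows = rows[Rank( J( N Rows( rows ), 1, Random Uniform() ) )]; // shuffle within treatment
    n = N Rows( rows );
    For( i = 1, i <= n, i++,
        dt:Validation[rows[i]] = If(
            i <= Ceiling( 0.6 * n ), 0,  // training
            i <= Ceiling( 0.8 * n ), 1,  // validation
            2                            // test
        )
    );
);

This guarantees every treatment contributes rows to all three sets, which a purely random holdback cannot.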

Anderson B. Mayfield