We are having a little trouble wrapping our heads around how JMP treats categorical data in random forests. We created a small pilot data set and mapped the categorical data using a variety of techniques, including many suggested in this forum. However, I don't really understand why we see so large a difference in performance across these mappings. If I am mapping a discrete set of values to another discrete set of values (e.g., character strings to integers), why should it make so much of a difference in JMP?
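For concreteness, here is a minimal sketch of the kind of mapping we mean (the column values and names are made up, and this is plain Python rather than our actual pipeline): the same categorical column encoded once as integers and once as one-hot indicator columns before being fed to a random forest.

```python
# Hypothetical categorical column; values are illustrative only.
colors = ["red", "green", "blue", "green", "red"]

# Mapping 1: character strings -> integers (label encoding).
levels = sorted(set(colors))                 # ['blue', 'green', 'red']
to_int = {c: i for i, c in enumerate(levels)}
as_ints = [to_int[c] for c in colors]        # [2, 1, 0, 1, 2]

# Mapping 2: one-hot / indicator columns, one per level.
as_onehot = [[1 if c == lvl else 0 for lvl in levels] for c in colors]

print(as_ints)       # [2, 1, 0, 1, 2]
print(as_onehot[0])  # row for "red" -> [0, 0, 1]
```

Both encodings carry the same information, which is why we expected a tree-based method to be largely insensitive to the choice; an implementation that treats the integer column as ordinal could of course split it differently than one that treats the original strings as nominal.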
We don't see this kind of variation when using Python's or MATLAB's random forest algorithms. With JMP, the differences in error rates, both on held-out data and on the training set, are substantial.
We have read most of the posts on this topic and can supply more specifics, including a trial data set, if necessary. But before we jump down the rabbit hole of choosing a mapping that optimizes performance in JMP, I was hoping someone could briefly explain why JMP's implementation of random forests is so sensitive to how categorical data is mapped.
Thanks.