using categorical data with random forests in JMP

TheTerminalMan — Fri, 09 Jun 2023 00:58:19 GMT

We are having a little trouble understanding how JMP treats categorical data in random forests. We created a small pilot data set and mapped the categorical data using a variety of techniques, including many suggested in this forum. However, I don't really understand why the choice of mapping should make so much difference in performance. If I am mapping one discrete set of values to another (e.g., character strings to integers), why should it matter so much in JMP?

We don't see this kind of variation with the random forest implementations in Python or MATLAB. With JMP, the error rates on both the held-out data and the training set differ significantly from one mapping to the next.

We have read most of the posts on this topic and can supply more specifics, including a trial data set, if necessary. But before we go down the rabbit hole of choosing a mapping that optimizes performance in JMP, I was hoping someone could briefly explain why JMP's implementation of random forests is so sensitive to how categorical data are mapped.

Thanks.

Re: using categorical data with random forests in JMP

Mark_Bailey — Thu, 08 Dec 2022 13:24:53 GMT

Are the mapped integer values using the nominal modeling type, or the default continuous modeling type?
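
To illustrate why the modeling type matters, here is a minimal sketch in plain Python. It is not JMP code and makes no claim about JMP's internals; the function names (`best_continuous_split`, `best_nominal_split`) and the toy data are hypothetical. It assumes a tree that splits continuous columns on a single threshold and nominal columns by grouping levels. Under that assumption, integer codes treated as continuous can only be split along their numeric order, so the best achievable split depends on the mapping, while a nominal treatment gives the same result under any relabeling.

```python
from itertools import combinations

def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2 * p * (1 - p)

def split_impurity(left, right):
    """Size-weighted impurity of a two-way split."""
    n = len(left) + len(right)
    return (len(left) * gini(left) + len(right) * gini(right)) / n

def best_continuous_split(codes, y):
    """Best threshold split when the codes are treated as ordered numbers."""
    best = float("inf")
    for t in sorted(set(codes))[:-1]:
        left = [yi for c, yi in zip(codes, y) if c <= t]
        right = [yi for c, yi in zip(codes, y) if c > t]
        best = min(best, split_impurity(left, right))
    return best

def best_nominal_split(levels, y):
    """Best split when levels are unordered: try every grouping of levels."""
    uniq = sorted(set(levels))
    best = float("inf")
    for k in range(1, len(uniq)):
        for subset in combinations(uniq, k):
            left = [yi for v, yi in zip(levels, y) if v in subset]
            right = [yi for v, yi in zip(levels, y) if v not in subset]
            best = min(best, split_impurity(left, right))
    return best

# Toy data (hypothetical): the response is 1 for levels "a" and "c" only.
levels = ["a", "b", "c", "d"] * 5
y = [1 if v in ("a", "c") else 0 for v in levels]

# Two arbitrary string-to-integer mappings of the same four levels.
map_1 = {"a": 1, "b": 2, "c": 3, "d": 4}
map_2 = {"a": 1, "c": 2, "b": 3, "d": 4}
codes_1 = [map_1[v] for v in levels]
codes_2 = [map_2[v] for v in levels]

cont_1 = best_continuous_split(codes_1, y)
cont_2 = best_continuous_split(codes_2, y)
nominal_1 = best_nominal_split(codes_1, y)
nominal_2 = best_nominal_split(codes_2, y)

print("continuous:", cont_1, cont_2)
print("nominal:   ", nominal_1, nominal_2)
```

In this toy case the continuous treatment finds a perfect single split only under the second mapping, because that mapping happens to place "a" and "c" next to each other. The nominal treatment finds the perfect split under both mappings.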

Re: using categorical data with random forests in JMP

TheTerminalMan — Fri, 09 Dec 2022 18:15:54 GMT

Hi Mark,

Good point. My grad student doing this work said "oh!" :)

Thanks very much,

-Joe
