Assign category to row keeping total probability fixed


Dec 2, 2019 8:05 AM

Dear JMP Community,

(WE10, 64-bit, JMP Pro 4.1.0)

I am trying to generate a column formula that performs an action similar to "create validation column" by assigning either 0, 1, or 2 to the row. I am trying the Random Category function as a way to do this, and it works, but not the way I want. It assigns either 0, 1, or 2 to the rows, but when you add up the frequency of each category, you get a value larger than 1, which I don't want.

In "create validation column", you can assign a certain percentage of your data table as "train", "validate", or "test" (0, 1, 2), and those fractions always sum to 1. I want to do this with a function, such as Random Category, so that the fractions of 0s, 1s, and 2s add up to 1. For simplicity's sake, I want to keep the number of 1s equal to the number of 2s, i.e. the validation and test fractions are equal and together make up 1 minus the training fraction.

The purpose behind all of this is that I'd like to try tuning a model by changing the number of rows in the train, validate, and test sets in order to minimize the RMSE. This is done by right-clicking the statistic of interest and then selecting "Simulate", which does bootstrapping and re-runs the model to evaluate how the statistic changes. But to do that, you need a function to swap in that sets the number of rows in each of the validation sets.

Each time it runs a simulation, it would "generate" a new validation column, but with a different ratio of "train", "validate", "test" rows. The output could then be used to find the optimal ratio of the three for having a better performing model.

Any thoughts/feedback on this is greatly appreciated.

Thanks!,

DS

1 ACCEPTED SOLUTION

I would try this formula.

```
p = Random Uniform();
If( p <= 0.5, 0, p <= 0.8, 1, 2 );
```

I used 50% training, 30% validation, and 20% testing.
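A quick way to check that the three fractions from this rule always sum to 1 is to apply it many times in a script and tabulate the result. This is only a sketch: the names `n`, `cats`, `frac0`, `frac1`, and `frac2` are made up for illustration, and the cut points are the same 0.5 and 0.8 used above.

```
Names Default To Here( 1 );
n = 10000; // number of simulated rows (arbitrary choice)

// Apply the same rule n times: one uniform draw per element.
cats = J( n, 1, p = Random Uniform(); If( p <= 0.5, 0, p <= 0.8, 1, 2 ) );

// Fraction of elements that landed in each category.
frac0 = Sum( cats == 0 ) / n;
frac1 = Sum( cats == 1 ) / n;
frac2 = Sum( cats == 2 ) / n;

Show( frac0, frac1, frac2, frac0 + frac1 + frac2 );
```

Because every draw falls into exactly one of the three branches, the total should come out to 1 (up to floating-point rounding), while the individual fractions should land near 0.5, 0.3, and 0.2 with some sampling noise.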

Learn it once, use it forever!

Re: Assign category to row keeping total probability fixed

Hi @markbailey,

Thanks for the suggestion. This works; however, I modified it slightly because I don't want a fixed ratio of train, validate, and test. Instead, I want to change the ratio on each bootstrap simulation to see whether there is an optimum ratio of train/validate/test. Since the selections in a validation column can affect the model fit, a simple 0.6, 0.2, 0.2 split may not be optimal. Maybe it should be 0.72, 0.15, 0.13, for example.

```
p = Random Uniform();
If(
	p <= Random Uniform( x1, x2 ), 0,
	p <= Random Uniform( x3, x4 ), 1,
	p <= Random Uniform( x5, x6 ), 2
);
```

Adjusting the x1, x2, x3, x4, x5, and x6 can lead to a range of different ratios of the train/validate/test in the formula column.

Thanks!,

DS

Re: Assign category to row keeping total probability fixed

You should only use one random variable at a time. There is no need to call Random Uniform() more than once, before calling the If() function. Use your variables for the Boolean expressions.

```
p = Random Uniform();
If( p <= x1, 0, p <= x2, 1, 2 );
```
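One way to see why a single draw is enough, sketched under the assumption that a threshold $T$ (my notation, not from the thread) is drawn uniformly from $[x_1, x_2]$ independently of $p$ on every row, with $0 \le x_1 \le x_2 \le 1$:

$$
P(p \le T) = E\big[\,P(p \le T \mid T)\,\big] = E[T] = \frac{x_1 + x_2}{2}
$$

Under that assumption, redrawing the first threshold on every row gives each row the same chance of landing in category 0 as a fixed cut point at the midpoint of the range. The extra calls to Random Uniform() therefore do not change the expected training fraction; to vary the ratio from one run to the next, the cut points themselves have to change between runs.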

Re: Assign category to row keeping total probability fixed

Hi @markbailey,

If I use the modified version, I get a wider range of ratios than if I use the condensed form.

The other way, I can see a wider range for the ~~test~~ training set (about 65%-78%) with the others distributed accordingly -- roughly 50/50 validation/test. This is really what I'm after: change the percentage of the ~~test~~ training set between a low value and upper value and then split the remainder 50/50: validation/test.

Thanks!,

DS
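The goal described in this post (a training fraction that varies between a lower and an upper value, with the remainder split 50/50) could be sketched as a column formula along these lines. It relies on two assumptions that I have not verified for this use case: that variables assigned in a column formula keep their values from row to row, and that Simulate re-evaluates the formula from the first row on each run. The names `train frac` and `valid cut` are hypothetical, and the 0.65 to 0.78 range is just the one mentioned above.

```
// Draw the split once, when the formula evaluates the first row.
If( Row() == 1,
	train frac = Random Uniform( 0.65, 0.78 );
	// Validation and test share the remainder equally.
	valid cut = train frac + (1 - train frac) / 2
);

// One uniform draw per row, compared against the shared cut points.
p = Random Uniform();
If( p <= train frac, 0, p <= valid cut, 1, 2 );
```

The design choice here is that the cut points are shared by every row within one evaluation of the column, so the realized split tracks the drawn `train frac`, while a fresh evaluation draws a different split.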

Re: Assign category to row keeping total probability fixed

The scheme is solid. It performs better with larger samples, though. Here is a better version perhaps. (I don't usually use formulas.)

```
Names Default to Here( 1 );

// Build a sample table with random data and an empty validation column.
n = 500;
dt = New Table( "Again and Again",
	Add Rows( n ),
	New Column( "Y", "Numeric", "Continuous", Values( J( n, 1, Random Normal() ) ) ),
	New Column( "X1", "Numeric", "Continuous", Values( J( n, 1, Random Normal() ) ) ),
	New Column( "X2", "Numeric", "Continuous", Values( J( n, 1, Random Normal() ) ) ),
	New Column( "Validation", "Numeric", "Nominal" )
);

// Training gets a fixed share; validation and test split the remainder equally.
training proportion = 0.6;
valid threshold = 1 - ((1 - training proportion) / 2);

// One uniform draw per row picks the category: 0 = train, 1 = validate, 2 = test.
:Validation << Set Values(
	J( n, 1, p = Random Uniform(); If( p <= training proportion, 0, p <= valid threshold, 1, 2 ) )
);
```

Is there a reason that you would hold out such a large proportion for testing?

Re: Assign category to row keeping total probability fixed

Thanks again for your suggestions! I had a typo in my previous response and edited it. The larger portion should be for the training set, not the test set.