topic Assign category to row keeping total probability fixed in Discussions
https://community.jmp.com/t5/Discussions/Assign-category-to-row-keeping-total-probability-fixed/m-p/236980#M46775
<P>Dear JMP Community,</P><P> </P><P>(WE10, 64-bit, JMP Pro 4.1.0)</P><P> </P><P> I am trying to generate a column formula that performs an action similar to "create validation column" by assigning either 0, 1, or 2 to each row. I am trying the Random Category function as a way to do this, and it works, but not the way I want. It assigns either 0, 1, or 2 to the rows, but when you add up the frequency of each category, you get a value larger than 1, which I don't want.</P><P> </P><P> In "create validation column", you can assign a certain percentage of your data table as "train", "validate", or "test" (0, 1, 2), and those fractions always add up to 1. I want to do this with a function, such as Random Category, so that the fractions of 0's, 1's, and 2's add up to 1. For simplicity's sake, I want to keep the number of 1's equal to the number of 2's, i.e., the validation and test fractions are equal and together add up to 1 minus the training fraction.</P><P> </P><P> The purpose behind all of this is that I'd like to try tuning a model by changing the sizes of the train, validate, and test sets in order to minimize the RMSE. This is done by right-clicking the statistic of interest and then selecting "Simulate", which does bootstrapping and re-runs the model to evaluate how the statistic changes. However, you need a function to swap in that sets the number of rows in each of the validation sets.</P><P> </P><P> Each time it runs a simulation, it would "generate" a new validation column, but with a different ratio of "train", "validate", and "test" rows. The output could then be used to find the ratio of the three that gives the best-performing model.</P><P> </P><P> Any thoughts/feedback on this are greatly appreciated.</P><P> </P><P>Thanks!,</P><P>DS</P>Mon, 02 Dec 2019 16:05:55 GMThttps://community.jmp.com/t5/Discussions/Assign-category-to-row-keeping-total-probability-fixed/m-p/236980#M46775DS2019-12-02T16:05:55ZRe: Assign category to row keeping total probability fixed
https://community.jmp.com/t5/Discussions/Assign-category-to-row-keeping-total-probability-fixed/m-p/236993#M46776
<P>I would try this formula.</P>
<P> </P>
<PRE><CODE class=" language-jsl">p = Random Uniform();
If( p <= 0.5, 0, p <= 0.8, 1, 2 );</CODE></PRE>
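<P>As a rough check on the formula above (a minimal sketch, not part of the original reply): the cutoffs are cumulative, so a row gets 0 with probability 0.5, 1 with probability 0.8 - 0.5 = 0.3, and 2 with probability 1 - 0.8 = 0.2. Each row receives exactly one category, so the realized fractions always sum to 1, although in a finite table they only approximate the targets. The row count n and the names v, f0, f1, and f2 are assumptions for illustration.</P>
<PRE><CODE class=" language-jsl">Names Default To Here( 1 );
n = 10000; // assumed number of rows, for illustration only
// One uniform draw per element, compared against the cumulative cutoffs
v = J( n, 1, p = Random Uniform(); If( p <= 0.5, 0, p <= 0.8, 1, 2 ) );
// Realized fraction of each category
f0 = Sum( v == 0 ) / n;
f1 = Sum( v == 1 ) / n;
f2 = Sum( v == 2 ) / n;
Show( f0, f1, f2, f0 + f1 + f2 );</CODE></PRE>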
<P> </P>
<P>I used 50% training, 30% validation, and 20% testing.</P>Mon, 02 Dec 2019 18:58:34 GMThttps://community.jmp.com/t5/Discussions/Assign-category-to-row-keeping-total-probability-fixed/m-p/236993#M46776markbailey2019-12-02T18:58:34ZRe: Assign category to row keeping total probability fixed
https://community.jmp.com/t5/Discussions/Assign-category-to-row-keeping-total-probability-fixed/m-p/237014#M46784
<P>Hi <LI-USER uid="5358"></LI-USER>,</P><P> </P><P> Thanks for the suggestion. This works; however, I modified it slightly because I don't want a fixed ratio of train, validate, and test. Instead, I want the ratio to change with each bootstrap simulation to see whether there is an optimum ratio of train/validate/test. Since the selections in a validation column can affect the model fit, maybe a simple 0.6, 0.2, 0.2 split isn't optimum. Maybe it should be 0.72, 0.15, 0.13, for example.</P><P> </P><PRE><CODE class=" language-jsl">p = Random Uniform();
If( p <= Random Uniform(x1,x2), 0,
 p <= Random Uniform(x3,x4), 1,
 p <= Random Uniform(x5,x6), 2 );</CODE></PRE><P>Adjusting x1, x2, x3, x4, x5, and x6 can lead to a range of different train/validate/test ratios in the formula column.</P><P> </P><P>Thanks!,</P><P>DS</P>Mon, 02 Dec 2019 21:07:51 GMThttps://community.jmp.com/t5/Discussions/Assign-category-to-row-keeping-total-probability-fixed/m-p/237014#M46784DS2019-12-02T21:07:51ZRe: Assign category to row keeping total probability fixed
https://community.jmp.com/t5/Discussions/Assign-category-to-row-keeping-total-probability-fixed/m-p/237030#M46788
<P>You should use only one random variable at a time. Call Random Uniform() once, before the If() function; there is no need to call it again inside the conditions. Use your variables as the cutoffs in the Boolean expressions.</P>
<P> </P>
<PRE><CODE class=" language-jsl">p = Random Uniform();
If( p <= x1, 0, p <= x2, 1, 2 );</CODE></PRE>Mon, 02 Dec 2019 21:15:50 GMThttps://community.jmp.com/t5/Discussions/Assign-category-to-row-keeping-total-probability-fixed/m-p/237030#M46788markbailey2019-12-02T21:15:50ZRe: Assign category to row keeping total probability fixed
https://community.jmp.com/t5/Discussions/Assign-category-to-row-keeping-total-probability-fixed/m-p/237038#M46792
<P>Hi <LI-USER uid="5358"></LI-USER>,</P><P> </P><P> If I use the modified version, I get a wider range of ratios than if I use the condensed form.</P><P> </P><P> The other way, I can see a wider range for the <STRIKE>test</STRIKE> training set (about 65%-78%) with the others distributed accordingly -- roughly 50/50 validation/test. This is really what I'm after: change the percentage of the <STRIKE>test</STRIKE> training set between a lower and an upper value, and then split the remainder 50/50 between validation and test.</P><P> </P><P>Thanks!,</P><P>DS</P>Tue, 03 Dec 2019 14:35:20 GMThttps://community.jmp.com/t5/Discussions/Assign-category-to-row-keeping-total-probability-fixed/m-p/237038#M46792DS2019-12-03T14:35:20ZRe: Assign category to row keeping total probability fixed
https://community.jmp.com/t5/Discussions/Assign-category-to-row-keeping-total-probability-fixed/m-p/237040#M46794
<P>The scheme is solid. It performs better with larger samples, though. Here is a better version perhaps. (I don't usually use formulas.)</P>
<P> </P>
<PRE><CODE class=" language-jsl">Names Default to Here( 1 );
n = 500;
// Example table with random data and an empty validation column
dt = New Table( "Again and Again",
    Add Rows( n ),
    New Column( "Y", "Numeric", "Continuous", Values( J( n, 1, Random Normal() ) ) ),
    New Column( "X1", "Numeric", "Continuous", Values( J( n, 1, Random Normal() ) ) ),
    New Column( "X2", "Numeric", "Continuous", Values( J( n, 1, Random Normal() ) ) ),
    New Column( "Validation", "Numeric", "Nominal" )
);
// Cumulative cutoffs: the remainder after training is split evenly
training proportion = 0.6;
valid threshold = 1 - ((1 - training proportion) / 2);
// One uniform draw per row: 0 = training, 1 = validation, 2 = test
:Validation << Set Values(
    J( n, 1, p = Random Uniform(); If( p <= training proportion, 0, p <= valid threshold, 1, 2 ) )
);</CODE></PRE>
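<P>One way to get a different split on each run, sketched under assumptions rather than offered as a definitive method: draw the training proportion once per run instead of once per row, then split the remainder evenly as in the script above. A cutoff that is redrawn for every row tends to average out to a fixed ratio across the table, so it may not vary the split much between runs. The bounds 0.65 and 0.78 are assumed (taken from the range mentioned earlier in this thread), and low and high are hypothetical names; n and dt come from the script above.</P>
<PRE><CODE class=" language-jsl">// Assumed bounds for the training fraction, for illustration only
low = 0.65;
high = 0.78;
// Drawn once per run, so every row in this run shares the same cutoffs
training proportion = Random Uniform( low, high );
valid threshold = 1 - ((1 - training proportion) / 2);
// One uniform draw per row: 0 = training, 1 = validation, 2 = test
dt:Validation << Set Values(
    J( n, 1, p = Random Uniform(); If( p <= training proportion, 0, p <= valid threshold, 1, 2 ) )
);</CODE></PRE>
<P>Rerunning these lines would give a new training fraction each time, with validation and test each taking roughly half of what is left.</P>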
<P> </P>
<P>Is there a reason that you would hold out such a large proportion for testing?</P>Mon, 02 Dec 2019 22:42:21 GMThttps://community.jmp.com/t5/Discussions/Assign-category-to-row-keeping-total-probability-fixed/m-p/237040#M46794markbailey2019-12-02T22:42:21ZRe: Assign category to row keeping total probability fixed
https://community.jmp.com/t5/Discussions/Assign-category-to-row-keeping-total-probability-fixed/m-p/237081#M46806
Thanks again for your suggestions! I had a typo in my previous response and edited it. The larger portion should be for the training set, not the test set.Tue, 03 Dec 2019 14:36:40 GMThttps://community.jmp.com/t5/Discussions/Assign-category-to-row-keeping-total-probability-fixed/m-p/237081#M46806DS2019-12-03T14:36:40Z