<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Subset creation keeping original distribution in Discussions</title>
    <link>https://community.jmp.com/t5/Discussions/Subset-creation-keeping-original-distribution/m-p/724443#M90681</link>
    <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.jmp.com/t5/user/viewprofilepage/user-id/47783"&gt;@bbenny7&lt;/a&gt;&amp;nbsp;,&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp; I found this topic interesting because I have a similar need: subsetting a data table correctly while stratifying on a column. In the past I've done it the way&amp;nbsp;&lt;a href="https://community.jmp.com/t5/user/viewprofilepage/user-id/53879"&gt;@dlehman1&lt;/a&gt;&amp;nbsp;suggested, using a validation column stratified on the column(s) of interest, but I was curious how to do it a different way in case multiple data tables were needed. I tried the approach that&amp;nbsp;&lt;a href="https://community.jmp.com/t5/user/viewprofilepage/user-id/5358"&gt;@Mark_Bailey&lt;/a&gt;&amp;nbsp;suggested, but I couldn't get it to work as his original code was laid out. I had to split the New Table() call into two statements: one that defines the new table, and a second that assigns the values from the column of interest in the original data table. Here is the modified code that worked for me:&lt;/P&gt;&lt;PRE&gt;&lt;CODE class=" language-jsl"&gt;Names Default To Here( 1 );

dt = Data Table( "originaldatatable" ); //assigns the original data table to the variable dt

samplesize = 300; //number of rows to draw; 300 is the size used in the test described below

subscript = J( samplesize, 1, Random Integer( 1, N Rows( dt ) ) ); //creates the random integer vector of length 'samplesize'

dt2 = New Table( "Sample", New Column( "Data" ) ); //creates new data table with column Data

dt2:Data &amp;lt;&amp;lt; Set Values( dt:originalColumn[subscript] ); //assigns values to Data based on the row entries for the originalColumn&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;&amp;nbsp; As a quick test, I generated 4 subsets by putting the subscript line inside a For() loop (so that each pass draws a new set of row numbers) and compared the distributions of the 4 sets. Their summary statistics are all very similar.&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="SDF1_0-1708025111456.png" style="width: 400px;"&gt;&lt;img src="https://community.jmp.com/t5/image/serverpage/image-id/61153iFF10B1D5B38FA3B6/image-size/medium?v=v2&amp;amp;px=400" role="button" title="SDF1_0-1708025111456.png" alt="SDF1_0-1708025111456.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp; I ran a similar test with 4 stratified validation columns and looked at their statistics. The N is different because the Make Validation Column platform wouldn't generate the same ratios that I used above, where I arbitrarily chose 300. Again, the results are all very similar.&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="SDF1_1-1708026727602.png" style="width: 400px;"&gt;&lt;img src="https://community.jmp.com/t5/image/serverpage/image-id/61154iB9E5D4C58B17119A/image-size/medium?v=v2&amp;amp;px=400" role="button" title="SDF1_1-1708026727602.png" alt="SDF1_1-1708026727602.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp; Either approach should get you where you want to go.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Good luck!&lt;/P&gt;&lt;P&gt;DS&lt;/P&gt;</description>
    <pubDate>Thu, 15 Feb 2024 19:52:35 GMT</pubDate>
    <dc:creator>SDF1</dc:creator>
    <dc:date>2024-02-15T19:52:35Z</dc:date>
    <item>
      <title>Subset creation keeping original distribution</title>
      <link>https://community.jmp.com/t5/Discussions/Subset-creation-keeping-original-distribution/m-p/724413#M90670</link>
      <description>&lt;P&gt;If I have a data table with a continuous variable and I want to create several subsets that keep the same distribution as the original table, how can I do it?&lt;/P&gt;&lt;P&gt;I have JMP Pro and I have tried the Stratify option in the Subset menu, but I could not figure out how to do it.&lt;/P&gt;</description>
      <pubDate>Thu, 15 Feb 2024 15:27:33 GMT</pubDate>
      <guid>https://community.jmp.com/t5/Discussions/Subset-creation-keeping-original-distribution/m-p/724413#M90670</guid>
      <dc:creator>bbenny7</dc:creator>
      <dc:date>2024-02-15T15:27:33Z</dc:date>
    </item>
    <item>
      <title>Re: Subset creation keeping original distribution</title>
      <link>https://community.jmp.com/t5/Discussions/Subset-creation-keeping-original-distribution/m-p/724422#M90673</link>
      <description>&lt;P&gt;You could create a validation column (under Predictive Modeling), stratified as you want, and then use those subsets. I think you are limited, at least initially, to 3 subsets (training, validation, and test), but you could repeat the process if you need more. I've never used the Stratify option in the Subset menu; it appears to work, but it creates only one subset at a time.&lt;/P&gt;</description>
      <pubDate>Thu, 15 Feb 2024 16:29:41 GMT</pubDate>
      <guid>https://community.jmp.com/t5/Discussions/Subset-creation-keeping-original-distribution/m-p/724422#M90673</guid>
      <dc:creator>dlehman1</dc:creator>
      <dc:date>2024-02-15T16:29:41Z</dc:date>
    </item>
    <item>
      <title>Re: Subset creation keeping original distribution</title>
      <link>https://community.jmp.com/t5/Discussions/Subset-creation-keeping-original-distribution/m-p/724423#M90674</link>
      <description>&lt;P&gt;I think all you need to do is use the random sampling capability in the Tables &amp;gt; Subset platform. Random sampling by either rate or size will give you data tables with the same distribution as the original table.&lt;/P&gt;</description>
      <pubDate>Thu, 15 Feb 2024 17:11:40 GMT</pubDate>
      <guid>https://community.jmp.com/t5/Discussions/Subset-creation-keeping-original-distribution/m-p/724423#M90674</guid>
      <dc:creator>txnelson</dc:creator>
      <dc:date>2024-02-15T17:11:40Z</dc:date>
    </item>
    <item>
      <title>Re: Subset creation keeping original distribution</title>
      <link>https://community.jmp.com/t5/Discussions/Subset-creation-keeping-original-distribution/m-p/724427#M90675</link>
      <description>&lt;P&gt;Strictly speaking, sampling with replacement is required if you want the same distribution as the original. That is the basis for resampling and bootstrap methods. The Subset command performs random sampling without replacement. You need a script, I think, to get what you need. The sampling would be based on computing random subscripts like this:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-jsl"&gt;subscript = J( sample size, 1, Random Integer( 1, N Row( dt ) )&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;This vector could be used in the creation of a data column in a new data table.&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-jsl"&gt;New Table( "Sample", New Column( "Data", Value( dt:originalColumn[subscript] ) ) );&lt;/CODE&gt;&lt;/PRE&gt;</description>
      <pubDate>Thu, 15 Feb 2024 18:18:28 GMT</pubDate>
      <guid>https://community.jmp.com/t5/Discussions/Subset-creation-keeping-original-distribution/m-p/724427#M90675</guid>
      <dc:creator>Mark_Bailey</dc:creator>
      <dc:date>2024-02-15T18:18:28Z</dc:date>
    </item>
    <item>
      <title>Re: Subset creation keeping original distribution</title>
      <link>https://community.jmp.com/t5/Discussions/Subset-creation-keeping-original-distribution/m-p/724443#M90681</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.jmp.com/t5/user/viewprofilepage/user-id/47783"&gt;@bbenny7&lt;/a&gt;&amp;nbsp;,&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp; I found this topic interesting because I have a similar need: subsetting a data table correctly while stratifying on a column. In the past I've done it the way&amp;nbsp;&lt;a href="https://community.jmp.com/t5/user/viewprofilepage/user-id/53879"&gt;@dlehman1&lt;/a&gt;&amp;nbsp;suggested, using a validation column stratified on the column(s) of interest, but I was curious how to do it a different way in case multiple data tables were needed. I tried the approach that&amp;nbsp;&lt;a href="https://community.jmp.com/t5/user/viewprofilepage/user-id/5358"&gt;@Mark_Bailey&lt;/a&gt;&amp;nbsp;suggested, but I couldn't get it to work as his original code was laid out. I had to split the New Table() call into two statements: one that defines the new table, and a second that assigns the values from the column of interest in the original data table. Here is the modified code that worked for me:&lt;/P&gt;&lt;PRE&gt;&lt;CODE class=" language-jsl"&gt;Names Default To Here( 1 );

dt = Data Table( "originaldatatable" ); //assigns the original data table to the variable dt

samplesize = 300; //number of rows to draw; 300 is the size used in the test described below

subscript = J( samplesize, 1, Random Integer( 1, N Rows( dt ) ) ); //creates the random integer vector of length 'samplesize'

dt2 = New Table( "Sample", New Column( "Data" ) ); //creates new data table with column Data

dt2:Data &amp;lt;&amp;lt; Set Values( dt:originalColumn[subscript] ); //assigns values to Data based on the row entries for the originalColumn&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;&amp;nbsp; As a quick test, I generated 4 subsets by putting the subscript line inside a For() loop (so that each pass draws a new set of row numbers) and compared the distributions of the 4 sets. Their summary statistics are all very similar.&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="SDF1_0-1708025111456.png" style="width: 400px;"&gt;&lt;img src="https://community.jmp.com/t5/image/serverpage/image-id/61153iFF10B1D5B38FA3B6/image-size/medium?v=v2&amp;amp;px=400" role="button" title="SDF1_0-1708025111456.png" alt="SDF1_0-1708025111456.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp; I ran a similar test with 4 stratified validation columns and looked at their statistics. The N is different because the Make Validation Column platform wouldn't generate the same ratios that I used above, where I arbitrarily chose 300. Again, the results are all very similar.&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="SDF1_1-1708026727602.png" style="width: 400px;"&gt;&lt;img src="https://community.jmp.com/t5/image/serverpage/image-id/61154iB9E5D4C58B17119A/image-size/medium?v=v2&amp;amp;px=400" role="button" title="SDF1_1-1708026727602.png" alt="SDF1_1-1708026727602.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp; Either approach should get you where you want to go.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Good luck!&lt;/P&gt;&lt;P&gt;DS&lt;/P&gt;</description>
      <pubDate>Thu, 15 Feb 2024 19:52:35 GMT</pubDate>
      <guid>https://community.jmp.com/t5/Discussions/Subset-creation-keeping-original-distribution/m-p/724443#M90681</guid>
      <dc:creator>SDF1</dc:creator>
      <dc:date>2024-02-15T19:52:35Z</dc:date>
    </item>
    <item>
      <title>Re: Subset creation keeping original distribution</title>
      <link>https://community.jmp.com/t5/Discussions/Subset-creation-keeping-original-distribution/m-p/724626#M90756</link>
      <description>&lt;P&gt;Because you have JMP Pro and you mentioned Stratify, here is another option: "Make K-Fold Columns" from the XGBoost Add-In for JMP Pro (&lt;A href="https://community.jmp.com/t5/JMP-Add-Ins/XGBoost-Add-In-for-JMP-Pro/ta-p/319383" target="_blank"&gt;https://community.jmp.com/t5/JMP-Add-Ins/XGBoost-Add-In-for-JMP-Pro/ta-p/319383&lt;/A&gt;). See pages 3, 4, and 5 of the "XGBoost Add-in.pdf" file.&lt;/P&gt;
&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="peng_liu_0-1708265326457.png" style="width: 400px;"&gt;&lt;img src="https://community.jmp.com/t5/image/serverpage/image-id/61207i281727C9F490C0DA/image-size/medium?v=v2&amp;amp;px=400" role="button" title="peng_liu_0-1708265326457.png" alt="peng_liu_0-1708265326457.png" /&gt;&lt;/span&gt;&lt;/P&gt;
&lt;P&gt;The function creates subsets (the add-in calls them folds) that keep the original distribution (the add-in calls this balanced) while respecting stratification by keeping the stratification variables balanced as well. See the histograms on page 5 of that PDF document.&lt;/P&gt;</description>
      <pubDate>Sun, 18 Feb 2024 14:15:24 GMT</pubDate>
      <guid>https://community.jmp.com/t5/Discussions/Subset-creation-keeping-original-distribution/m-p/724626#M90756</guid>
      <dc:creator>peng_liu</dc:creator>
      <dc:date>2024-02-18T14:15:24Z</dc:date>
    </item>
    <item>
      <title>Re: Subset creation keeping original distribution</title>
      <link>https://community.jmp.com/t5/Discussions/Subset-creation-keeping-original-distribution/m-p/724634#M90759</link>
      <description>&lt;P&gt;Thanks for your answer.&lt;/P&gt;&lt;P&gt;I have realized that I don't need the add-in; I can use "Make K-fold Validation Column" under Predictive Modeling &amp;gt; Make Validation Column.&lt;/P&gt;</description>
      <pubDate>Mon, 19 Feb 2024 08:37:41 GMT</pubDate>
      <guid>https://community.jmp.com/t5/Discussions/Subset-creation-keeping-original-distribution/m-p/724634#M90759</guid>
      <dc:creator>bbenny7</dc:creator>
      <dc:date>2024-02-19T08:37:41Z</dc:date>
    </item>
    <item>
      <title>Re: Subset creation keeping original distribution</title>
      <link>https://community.jmp.com/t5/Discussions/Subset-creation-keeping-original-distribution/m-p/724635#M90760</link>
      <description>&lt;P&gt;Thanks for your answer.&lt;/P&gt;&lt;P&gt;I have realized that if more than 3 subsets are needed, you can use&amp;nbsp;&lt;SPAN&gt;"Make K-fold Validation Column" under Predictive Modeling &amp;gt; Make Validation Column.&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 19 Feb 2024 08:39:23 GMT</pubDate>
      <guid>https://community.jmp.com/t5/Discussions/Subset-creation-keeping-original-distribution/m-p/724635#M90760</guid>
      <dc:creator>bbenny7</dc:creator>
      <dc:date>2024-02-19T08:39:23Z</dc:date>
    </item>
  </channel>
</rss>

