<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: How good is K-fold cross validation for small datasets? in Discussions</title>
    <link>https://community.jmp.com/t5/Discussions/How-good-is-K-fold-cross-validation-for-small-datasets/m-p/692044#M87767</link>
    <description>&lt;P&gt;Hello&amp;nbsp;&lt;a href="https://community.jmp.com/t5/user/viewprofilepage/user-id/1253"&gt;@tnad&lt;/a&gt;&amp;nbsp; &amp;nbsp; I just came across this&amp;nbsp;post and hope you don't mind an additional reply 3.5 years later!&amp;nbsp; &amp;nbsp; &lt;BR /&gt;&lt;BR /&gt;I think setting K = 154 for Stepwise in this case is not recommended since the algorithm searches for the best fold, which in this case would consist of only a single observation.&amp;nbsp; &amp;nbsp;In general it is important to perform stepwise nested within each fold to avoid overfitting, and this is available in the Nested K-Fold option in Model Screening.&amp;nbsp; &amp;nbsp;Also I think 154 observations is plenty for k-fold and that repeated nested k-fold is a great way to compare models.&amp;nbsp;&lt;BR /&gt;&lt;BR /&gt;Model Screening reveals that Neural Boosted works better than Stepwise for these data with Test R2 = 0.87 vs 0.77 using Log10(KV40) as Y, K = 5, and L= 4.&amp;nbsp; I also tried the new Torch Deep Learning add-in (available by request at &lt;A href="https://jmp.com/earlyadopter" target="_self"&gt;JMP Early Adopter&lt;/A&gt;&amp;nbsp;) and with a little tuning am getting similar if not better results with its 5-fold cross-validation.&lt;BR /&gt;&lt;BR /&gt;In general, Neural, Torch, XGBoost, and other platforms in JMP can be valuable for QSAR predictive modeling.&amp;nbsp;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
    <pubDate>Mon, 30 Oct 2023 13:04:30 GMT</pubDate>
    <dc:creator>russ_wolfinger</dc:creator>
    <dc:date>2023-10-30T13:04:30Z</dc:date>
    <item>
      <title>How good is K-fold cross validation for small datasets?</title>
      <link>https://community.jmp.com/t5/Discussions/How-good-is-K-fold-cross-validation-for-small-datasets/m-p/250294#M49129</link>
      <description>&lt;P&gt;I have a small data set with 154 molecules (attached). I'm trying to predict KV40 using 6 factors. From looking at similar studies in the literature, many use the leave-one-out validation method to build models for such small data sets, so to replicate this in JMP, I did the following:&lt;/P&gt;&lt;P&gt;Fit Model: Stepwise regression&lt;BR /&gt;- used response surface for factors&lt;BR /&gt;- used k (k-fold cross-validation) = number of samples = 154&lt;BR /&gt;- left everything else as default&lt;/P&gt;&lt;P&gt;Result:&lt;BR /&gt;r2 = 0.84; r2 (k-fold) = 0.72&lt;/P&gt;&lt;P&gt;After Box-Cox transformation: r2 = 0.87 (r2_adj = 0.86)&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I tried to do this with a validation column (70:30) and did not get a result as good as this. I have JMP Pro. My concern is: am I creating a fake or overfit model when I do this type of cross-validation? Is there a better way to do this? Is there anything I should watch or test for? Thanks.&lt;/P&gt;</description>
      <pubDate>Tue, 03 Mar 2020 04:01:07 GMT</pubDate>
      <guid>https://community.jmp.com/t5/Discussions/How-good-is-K-fold-cross-validation-for-small-datasets/m-p/250294#M49129</guid>
      <dc:creator>tnad</dc:creator>
      <dc:date>2020-03-03T04:01:07Z</dc:date>
    </item>
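The fold arithmetic behind the question above can be sketched in a few lines. (Python is used purely for illustration; this is not JSL, and `kfold_indices` is a hypothetical helper, not a JMP function.) With 154 rows, setting K = 154 makes every validation fold a single observation, which is exactly leave-one-out:

```python
def kfold_indices(n, k):
    """Partition row indices 0..n-1 into k near-equal folds (illustrative helper)."""
    base, extra = divmod(n, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

folds = kfold_indices(154, 5)
print([len(f) for f in folds])   # five folds of 30-31 rows each
loo = kfold_indices(154, 154)
print(max(len(f) for f in loo))  # 1: K = N reduces to leave-one-out
```

Real implementations shuffle (or stratify) rows before assigning folds; the contiguous split here only shows the index bookkeeping.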
    <item>
      <title>Re: How good is K-fold cross validation for small datasets?</title>
      <link>https://community.jmp.com/t5/Discussions/How-good-is-K-fold-cross-validation-for-small-datasets/m-p/250371#M49138</link>
      <description>&lt;P&gt;Right, honest assessment for model selection with cross-validation is always valid, but using hold-out sets for validation and testing is only practical and rewarding if you have large data sets. (There is also an issue of rare targets even with large data sets.) That is why K-fold cross-validation was invented. Leave-one-out is just taking K folds to the extreme. I would not expect the approach with a 70:30 hold-out to work well with this small sample size.&lt;/P&gt;</description>
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;You have used cross-validation for model selection. You will need new data to test the selected model to see if the training generalizes to the larger population.&lt;/P&gt;</description>
      <pubDate>Tue, 03 Mar 2020 13:16:53 GMT</pubDate>
      <guid>https://community.jmp.com/t5/Discussions/How-good-is-K-fold-cross-validation-for-small-datasets/m-p/250371#M49138</guid>
      <dc:creator>Mark_Bailey</dc:creator>
      <dc:date>2020-03-03T13:16:53Z</dc:date>
    </item>
    <item>
      <title>Re: How good is K-fold cross validation for small datasets?</title>
      <link>https://community.jmp.com/t5/Discussions/How-good-is-K-fold-cross-validation-for-small-datasets/m-p/251396#M49354</link>
      <description>&lt;P&gt;&lt;a href="https://community.jmp.com/t5/user/viewprofilepage/user-id/5358"&gt;@Mark_Bailey&lt;/a&gt;&amp;nbsp; &amp;nbsp;When you mentioned, &amp;nbsp; "You have used cross-validation for model selection. You will need new data to test the selected model to see if the training generalizes to the larger population."&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Is this because cross-validation tends to overfit?&lt;/P&gt;</description>
      <pubDate>Mon, 09 Mar 2020 18:09:33 GMT</pubDate>
      <guid>https://community.jmp.com/t5/Discussions/How-good-is-K-fold-cross-validation-for-small-datasets/m-p/251396#M49354</guid>
      <dc:creator>Byron_JMP</dc:creator>
      <dc:date>2020-03-09T18:09:33Z</dc:date>
    </item>
    <item>
      <title>Re: How good is K-fold cross validation for small datasets?</title>
      <link>https://community.jmp.com/t5/Discussions/How-good-is-K-fold-cross-validation-for-small-datasets/m-p/251438#M49361</link>
      <description>&lt;P&gt;Over-fitting is the concern, but the reason for my statement is based on the whole scheme of 'honest assessment.' One uses some data to train the model and separate, entirely new data to validate or select the model (two hold-out sets). This approach is valid but is still limited to the data already seen during learning and selecting. It cannot speak to generalization to new data. New data are needed to test the chosen model. These data can be a third hold-out set or future data. The risk of waiting for future data depends on the situation, so if a large amount of data is available, the third hold-out set is preferred.&lt;/P&gt;</description>
      <pubDate>Mon, 09 Mar 2020 20:52:18 GMT</pubDate>
      <guid>https://community.jmp.com/t5/Discussions/How-good-is-K-fold-cross-validation-for-small-datasets/m-p/251438#M49361</guid>
      <dc:creator>Mark_Bailey</dc:creator>
      <dc:date>2020-03-09T20:52:18Z</dc:date>
    </item>
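The 'honest assessment' scheme described above (train to fit, validate to select, test to check generalization) amounts to a three-way random partition of the rows. A minimal sketch in Python, with hypothetical proportions (60/20/20) and a hypothetical `three_way_split` helper, neither of which comes from JMP:

```python
import random

def three_way_split(n, train=0.6, valid=0.2, seed=0):
    """Randomly partition row indices 0..n-1 into train/validation/test hold-out sets."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_train = int(n * train)
    n_valid = int(n * valid)
    return (idx[:n_train],                   # fit candidate models
            idx[n_train:n_train + n_valid],  # select among them
            idx[n_train + n_valid:])         # test the single chosen model

tr, va, te = three_way_split(154)
print(len(tr), len(va), len(te))  # 92 30 32
```

With only 154 rows, each hold-out set is small, which is the thread's point: k-fold reuses all rows for both fitting and checking instead of spending them on fixed partitions.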
    <item>
      <title>Re: How good is K-fold cross validation for small datasets?</title>
      <link>https://community.jmp.com/t5/Discussions/How-good-is-K-fold-cross-validation-for-small-datasets/m-p/294092#M55673</link>
      <description>&lt;P&gt;Hello! May I ask why k = 154 was chosen for your cross-validation?&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Normally I put 5 for k (20% for validation) depending on the dataset, although they were all small enough for normal training-validation-testing.&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Thanks!&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;In the future, it would be desirable for JMP Pro to implement an option to visualize the fold distribution while processing data this way.&lt;/P&gt;</description>
      <pubDate>Sat, 29 Aug 2020 02:58:43 GMT</pubDate>
      <guid>https://community.jmp.com/t5/Discussions/How-good-is-K-fold-cross-validation-for-small-datasets/m-p/294092#M55673</guid>
      <dc:creator>Nazarkovsky</dc:creator>
      <dc:date>2020-08-29T02:58:43Z</dc:date>
    </item>
    <item>
      <title>Re: How good is K-fold cross validation for small datasets?</title>
      <link>https://community.jmp.com/t5/Discussions/How-good-is-K-fold-cross-validation-for-small-datasets/m-p/294288#M55678</link>
      <description>&lt;P&gt;Honest assessment by cross-validation is commonly implemented in one of three ways: hold out sets (train, validate, test), k-fold, or leave one out. Using k = N (sample size) is a way of using k-fold CV to achieve the last approach.&lt;/P&gt;</description>
      <pubDate>Sat, 29 Aug 2020 11:54:20 GMT</pubDate>
      <guid>https://community.jmp.com/t5/Discussions/How-good-is-K-fold-cross-validation-for-small-datasets/m-p/294288#M55678</guid>
      <dc:creator>Mark_Bailey</dc:creator>
      <dc:date>2020-08-29T11:54:20Z</dc:date>
    </item>
    <item>
      <title>Re: How good is K-fold cross validation for small datasets?</title>
      <link>https://community.jmp.com/t5/Discussions/How-good-is-K-fold-cross-validation-for-small-datasets/m-p/294310#M55680</link>
      <description>Wow! It is quite surprising for me, as the idea of K-fold cross-validation rests on dividing a dataset into K folds, where K-1 folds are used for training and 1 for validation. I may be confused, though. From your post it looks like this strategy is attributed more to "leave one out", isn't it?</description>
      <pubDate>Sat, 29 Aug 2020 12:30:06 GMT</pubDate>
      <guid>https://community.jmp.com/t5/Discussions/How-good-is-K-fold-cross-validation-for-small-datasets/m-p/294310#M55680</guid>
      <dc:creator>Nazarkovsky</dc:creator>
      <dc:date>2020-08-29T12:30:06Z</dc:date>
    </item>
    <item>
      <title>Re: How good is K-fold cross validation for small datasets?</title>
      <link>https://community.jmp.com/t5/Discussions/How-good-is-K-fold-cross-validation-for-small-datasets/m-p/295388#M55693</link>
      <description>&lt;P&gt;I just figured this out...&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;cooking: fold the egg into the mixture&lt;/P&gt;&lt;P&gt;geology: intense folding in the earth's mantle&lt;/P&gt;&lt;P&gt;business: the club folded after a year&lt;/P&gt;&lt;P&gt;sports: the runner folded after a mile&lt;/P&gt;&lt;P&gt;cards: know when to fold 'em&lt;/P&gt;&lt;P&gt;geography: the town lies in a fold in the hills&lt;/P&gt;&lt;P&gt;web design: above the fold text doesn't require scrolling&lt;/P&gt;&lt;P&gt;change: a 10-fold increase in accidents&lt;/P&gt;&lt;P&gt;paper: origami folds to make a duck&lt;/P&gt;&lt;P&gt;idiom: fold your hands&lt;/P&gt;&lt;P&gt;farming: put the sheep in the fold&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;It has to be the farming version! It means pen! The data is divided up into K pens/subsets/folds.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Sun, 30 Aug 2020 18:01:17 GMT</pubDate>
      <guid>https://community.jmp.com/t5/Discussions/How-good-is-K-fold-cross-validation-for-small-datasets/m-p/295388#M55693</guid>
      <dc:creator>Craige_Hales</dc:creator>
      <dc:date>2020-08-30T18:01:17Z</dc:date>
    </item>
    <item>
      <title>Re: How good is K-fold cross validation for small datasets?</title>
      <link>https://community.jmp.com/t5/Discussions/How-good-is-K-fold-cross-validation-for-small-datasets/m-p/296120#M55736</link>
      <description>&lt;P&gt;Yes, K = N produces 'leave one out' cross-validation.&lt;/P&gt;</description>
      <pubDate>Mon, 31 Aug 2020 20:56:36 GMT</pubDate>
      <guid>https://community.jmp.com/t5/Discussions/How-good-is-K-fold-cross-validation-for-small-datasets/m-p/296120#M55736</guid>
      <dc:creator>Mark_Bailey</dc:creator>
      <dc:date>2020-08-31T20:56:36Z</dc:date>
    </item>
    <item>
      <title>Re: How good is K-fold cross validation for small datasets?</title>
      <link>https://community.jmp.com/t5/Discussions/How-good-is-K-fold-cross-validation-for-small-datasets/m-p/692044#M87767</link>
      <description>&lt;P&gt;Hello&amp;nbsp;&lt;a href="https://community.jmp.com/t5/user/viewprofilepage/user-id/1253"&gt;@tnad&lt;/a&gt;&amp;nbsp; &amp;nbsp; I just came across this&amp;nbsp;post and hope you don't mind an additional reply 3.5 years later!&amp;nbsp; &amp;nbsp; &lt;BR /&gt;&lt;BR /&gt;I think setting K = 154 for Stepwise in this case is not recommended since the algorithm searches for the best fold, which in this case would consist of only a single observation.&amp;nbsp; &amp;nbsp;In general it is important to perform stepwise nested within each fold to avoid overfitting, and this is available in the Nested K-Fold option in Model Screening.&amp;nbsp; &amp;nbsp;Also I think 154 observations is plenty for k-fold and that repeated nested k-fold is a great way to compare models.&amp;nbsp;&lt;BR /&gt;&lt;BR /&gt;Model Screening reveals that Neural Boosted works better than Stepwise for these data with Test R2 = 0.87 vs 0.77 using Log10(KV40) as Y, K = 5, and L= 4.&amp;nbsp; I also tried the new Torch Deep Learning add-in (available by request at &lt;A href="https://jmp.com/earlyadopter" target="_self"&gt;JMP Early Adopter&lt;/A&gt;&amp;nbsp;) and with a little tuning am getting similar if not better results with its 5-fold cross-validation.&lt;BR /&gt;&lt;BR /&gt;In general, Neural, Torch, XGBoost, and other platforms in JMP can be valuable for QSAR predictive modeling.&amp;nbsp;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Mon, 30 Oct 2023 13:04:30 GMT</pubDate>
      <guid>https://community.jmp.com/t5/Discussions/How-good-is-K-fold-cross-validation-for-small-datasets/m-p/692044#M87767</guid>
      <dc:creator>russ_wolfinger</dc:creator>
      <dc:date>2023-10-30T13:04:30Z</dc:date>
    </item>
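The last reply's point about nesting selection inside each fold can be sketched as index bookkeeping: the outer folds are reserved for assessment, and any model selection (stepwise, tuning) sees only the inner folds built from the outer training rows. A Python sketch of that structure, with hypothetical helpers (`kfold_indices`, `nested_kfold`); this illustrates the idea, not JMP's Nested K-Fold implementation:

```python
def kfold_indices(n, k):
    """Partition row indices 0..n-1 into k near-equal folds (illustrative helper)."""
    base, extra = divmod(n, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def nested_kfold(n, outer_k, inner_k):
    """Yield (outer_train, outer_test, inner_folds).

    Model selection uses only the inner folds, so the outer test rows
    never influence which model is chosen -- that is what avoids the
    optimistic bias of selecting and assessing on the same folds."""
    for test in kfold_indices(n, outer_k):
        held_out = set(test)
        train = [i for i in range(n) if i not in held_out]
        inner = [[train[j] for j in f] for f in kfold_indices(len(train), inner_k)]
        yield train, test, inner

for train, test, inner in nested_kfold(154, 5, 4):
    # no outer-test row ever appears in the folds used for selection
    assert not set(test) & set(sum(inner, []))
```

Repeating this whole procedure with different random fold assignments (repeated nested k-fold, as suggested above) then averages out the luck of any one partition.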
  </channel>
</rss>

