This question may be quite philosophical, and I would like to hear specialists' opinions in order to avoid speculation.
For small datasets, say on the order of 80-300 instances (rows), is it generally appropriate to apply neural networks? If yes, then:
- which metrics are most critical for categorical and continuous responses, respectively?
- which validation strategies are preferable besides K-fold cross-validation? Or is K-fold cross-validation the only option?
I would be very thankful for your advice.
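For concreteness, one commonly suggested strategy at this sample size is repeated (stratified) K-fold cross-validation: repeating the split with different shuffles reduces the variance of the score estimate. Below is a minimal sketch, assuming scikit-learn; the dataset is purely synthetic and the model and metric are placeholders, not recommendations from this thread:

```python
# Repeated stratified K-fold CV on a small synthetic classification set.
# Everything here (dataset, model, metric) is illustrative only.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

X, y = make_classification(n_samples=150, n_features=8, random_state=0)

# 5 folds x 10 repeats = 50 held-out scores; the spread across repeats
# shows how unstable a single K-fold estimate would be at this n.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         scoring="roc_auc", cv=cv)
print(f"ROC AUC: {scores.mean():.2f} +/- {scores.std():.2f} "
      f"over {len(scores)} folds")
```

The same pattern works for continuous responses by swapping the model and using a scorer such as `"neg_root_mean_squared_error"` or `"r2"`.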
I'll leave the cross-validation conversation to others. Here's my take on the applicability of neural networks... and I'll be philosophical as well. I think applicability can't be judged on outright sample size alone: the practical problem at hand, the data (is the data set wide or narrow, multicollinear, etc.), the risks (both statistical and representational), and the end use case for the model will all have a greater influence on the applicability of the method than sample size all by itself.
For example: What's the purpose of the study? Prediction or exploration? Is there the opportunity for additional study...perhaps in the context of designed experiments? And on and on.
For some insightful commentary on the prediction vs. exploration construct, I suggest taking a look at this:
That is an excellent interview, thanks @P_Bartell. I will also be philosophical... I think that predictive modeling requires understanding the causal relationships of the "phenomena" of interest. A rational order of thought may look like this: first, figure out how to quantify the phenomena of interest (and whether the measurement systems are adequate). Then construct hypotheses as to why the phenomena vary (I like to think these hypotheses have their roots in the natural sciences). This leads to testing our hypotheses through directed sampling or experimentation. Yes, the hypotheses are constructed before the data are collected. If you have no hypotheses, then you may use "data mining" techniques to stimulate your mind to create/develop hypotheses. From my perspective, a model that merely describes the data set in hand (e.g., a neural network) without providing insight and understanding as to what the causal relationships are is not going to be an effective predictive model.
Words to live by: All questions that can be answered, tools to use for analysis, confidence in conclusions to be drawn and ability to extrapolate the results DEPEND ON HOW THE DATA WAS ACQUIRED.
Regarding your question about neural networks in particular, I have not had much success with NN for small data sets - I think they generally work best when you have a lot of data. However, regardless of that, I would suggest you run a variety of predictive models - decision trees, boosted trees, random forests, logistic/multiple regression, etc. I would compare these in terms of their accuracy (using ROC, R2, particular misclassifications of interest, etc.) but also in terms of what variables they identify as the most important. While the prediction vs understanding issue is important, I think that prediction absent understanding will ultimately be less valuable than having a model that makes sense. There is no need to pick one modeling technique - it is easy to run them all on the same data and compare/contrast what you get.
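The suggestion above (run several model families on the same data and compare both accuracy and which variables they flag as important) can be sketched as follows. This assumes scikit-learn, a synthetic dataset, and ROC AUC as the comparison metric, all purely for illustration:

```python
# Fit several model families on the same small dataset and compare
# cross-validated ROC AUC, then inspect one model's variable importances.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=200, n_features=10, n_informative=4,
                           random_state=1)
models = {
    "logistic":      LogisticRegression(max_iter=1000),
    "random forest": RandomForestClassifier(n_estimators=200, random_state=1),
    "boosted trees": GradientBoostingClassifier(random_state=1),
}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
for name, model in models.items():
    auc = cross_val_score(model, X, y, scoring="roc_auc", cv=cv).mean()
    print(f"{name:14s} ROC AUC = {auc:.2f}")

# Compare what each model considers important; shown here for the forest.
rf = RandomForestClassifier(n_estimators=200, random_state=1).fit(X, y)
top = np.argsort(rf.feature_importances_)[::-1][:3]
print("random forest top features:", top)
```

Agreement across model families on both accuracy and variable rankings is reassuring; disagreement is itself informative about how fragile the small-sample fit is.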
You are right. And now a new issue is emerging: is it worth creating a set of models for small datasets in general? Not just NN. I am afraid to propose predictive modeling in such cases.
Let me re-phrase the question: how many instances (rows) and/or properties (columns) are enough to perform predictive modeling without worrying about the dataset's size?
This is a vast subject and not an easy one to respond to. I would just say that it is always worth trying to model your data - but the smaller the data set, the more humble you need to be about anything you think you find. There is certainly no critical number of observations that can be derived from a given number of variables - and clearly the more variables you have, the more observations you need to make sense of the data.

In fact, the reverse is worth pondering - if you only have a few variables, you don't need as many observations to build a "sensible" model, but that should not be a source of confidence. Quite the reverse - if you only have a few variables, then you are missing much (or most) of the picture.

One critical piece of the puzzle concerns how much theoretical structure you can impose on your data. In the physical sciences, there is often a well-developed theory that leads more or less directly to a structure for modeling the data. In the social sciences, this is far less true. But the bottom line for me is that there is no alternative but to try to build models (if you don't build any models, on what are decisions going to be based?). The key (for me) is to not overstate the power of your models, and this is related to how many variables and observations you have.
You ALWAYS have to worry about the size of the data set so long as the data set is a sample from some population(s) of interest. But it's not just the size... size influences statistical risk, but contributes little to nothing with respect to representation risk.
Why do you always have to worry about size? Because there will always be a risk of Type I or Type II errors regardless of sample size. I remember years ago, an advertising brochure for the Master of Science in Applied Mathematical Statistics program at Rochester Institute of Technology defined 'statistics' as the 'science of decision making in the face of uncertainty.' It's this uncertainty that is made up of two components: statistical risk, which is influenced/controlled in part by sample size, and representation risk. Sample size speaks only to statistical risk... not representation risk.
A small but illustrative example of the interplay of statistical and representation risk: suppose we have oodles (that's an unscientific term for a LARGE amount of data) of observations from pilot-scale equipment, and we've got p-values to die for... all but absolute certainty that we have ALL the key x variables and none of the non-influential variables in our model. BUT, and here's a big BUT, we're now going to make PREDICTIONS about manufacturing processes at manufacturing scale... say going from beaker (pilot) scale to rail car (manufacturing) scale. And our results turn out to be not even close to the predictions. Then sample SIZE meant nothing. Representation risk bit your behind big time.
Why people focus on sample size and all but ignore representation risk is just mind boggling to me. I get it...it's easy to talk about statistical risk, alpha, beta, difference to detect, Type I and II errors, all easy peasy to get your mind around and we've now got computers to tell us what we need to know...no longer need those pesky ASQ/ANSI look up tables.
But NEVER forget about representation risk...rant over.
The above 2 replies are absolutely EXCELLENT! I almost always say size doesn't matter (really to emphasize the representation issues). I give a conceptual example. I ask: which is the better sample size?
a. 10,000, or
b. 5?
Now the problem that needs to be uncovered is day-to-day variation. The 10,000 samples are all taken in one day; the 5 are one sample a day for 5 days. So which is better?... obvious.
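The 10,000-in-one-day vs. 5-over-five-days example can be made concrete with a toy simulation. Everything here is invented for illustration: I assume a random day effect on top of within-day noise, with made-up variance components:

```python
# Toy simulation of representation risk: a day effect shifts all
# measurements taken on the same day. Numbers are invented.
import numpy as np

rng = np.random.default_rng(0)
true_mean = 10.0   # long-run process mean we want to estimate
day_sd = 2.0       # day-to-day variation (the thing we need to uncover)
within_sd = 0.5    # within-day measurement noise

# Scheme A: 10,000 measurements, all on one day. The whole sample shares
# a single draw of the day effect, so the huge n buys precision around
# the wrong (day-shifted) mean.
day_a = rng.normal(0.0, day_sd)
sample_a = true_mean + day_a + rng.normal(0.0, within_sd, 10_000)

# Scheme B: 1 measurement per day for 5 days. Each observation gets its
# own day effect, so day-to-day variation is represented in the sample.
days_b = rng.normal(0.0, day_sd, 5)
sample_b = true_mean + days_b + rng.normal(0.0, within_sd, 5)

print(f"scheme A mean (n=10,000, 1 day): {sample_a.mean():.2f}")
print(f"scheme B mean (n=5, 5 days):     {sample_b.mean():.2f}")
```

Over repeated runs, scheme A's estimate is precisely wrong (it hugs whatever that one day happened to be), while scheme B's is noisier but centered on the long-run mean: representation beats raw n here.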