You ALWAYS have to worry about the size of the data set, so long as the data set is a sample from some population(s) of interest. But it's not just the size. Size influences statistical risk, but contributes little to nothing with respect to representation risk. Why do you always have to worry about the size? Because there will always be some risk of Type I or Type II errors, regardless of sample size.

I remember years ago, an advertising brochure for the Master of Science in Applied Mathematical Statistics program at Rochester Institute of Technology defined 'statistics' as the 'Science of decision making in the face of uncertainty.' That uncertainty is made up of two components: statistical risk, which is influenced/controlled in part by sample size, and representation risk, which is the risk that the data you sampled don't represent the population or conditions you actually want to draw conclusions about. Sample size speaks only to statistical risk...not representation risk.

A small but illustrative example of the interplay of statistical and representation risk: Suppose we have oodles (that's an unscientific term for a LARGE amount of data) of observations on pilot equipment at pilot scale, and we've got p-values to die for...all but absolute certainty that we have ALL the key x variables and none of the non-influential variables in our model. BUT, and here's a big BUT, we're now going to make PREDICTIONS about manufacturing processes at manufacturing scale...say going from beaker (pilot) scale to rail car (manufacturing) scale. And the actual results turn out to be not even close to the predictions. Then sample SIZE meant nothing. Representation risk bit your behind big time.

Why people focus on sample size and all but ignore representation risk is just mind-boggling to me. I get it...it's easy to talk about statistical risk: alpha, beta, difference to detect, Type I and II errors. It's all easy peasy to get your mind around, and we've now got computers to tell us what we need to know...no more need for those pesky ASQ/ANSI lookup tables.
But NEVER forget about representation risk...rant over.
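If you want to see the beaker-to-rail-car problem in numbers, here is a minimal simulation sketch. Everything in it is made up for illustration: the `true_yield` function (roughly linear at small scale, saturating at large scale), the scale ranges, the noise level, and the sample size are all assumptions, not data from any real process. It fits a straight line to a huge pilot-scale sample, gets a slope with an enormous t-statistic, and then extrapolates far outside the sampled range.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_yield(scale):
    # Hypothetical process: nearly linear at pilot scale, saturating at large scale.
    return 10.0 * scale / (1.0 + 0.02 * scale)

# "Oodles" of pilot-scale observations, all taken from a narrow range of scale.
n = 100_000
pilot_scale = rng.uniform(0.5, 2.0, size=n)
pilot_y = true_yield(pilot_scale) + rng.normal(0.0, 0.5, size=n)

# Ordinary least squares fit of a straight line: y = b0 + b1 * scale.
X = np.column_stack([np.ones(n), pilot_scale])
beta, *_ = np.linalg.lstsq(X, pilot_y, rcond=None)

# Standard error and t-statistic for the slope (statistical risk looks tiny).
resid = pilot_y - X @ beta
sigma2 = resid @ resid / (n - 2)
cov = sigma2 * np.linalg.inv(X.T @ X)
slope_se = np.sqrt(cov[1, 1])
t_stat = beta[1] / slope_se

def predict(scale):
    return beta[0] + beta[1] * scale

def relative_error(scale):
    return abs(predict(scale) - true_yield(scale)) / true_yield(scale)

pilot_rel_error = relative_error(1.0)    # inside the sampled range
plant_rel_error = relative_error(500.0)  # far outside the sampled range

print(f"slope t-statistic:               {t_stat:.0f}")
print(f"relative error at pilot scale:   {pilot_rel_error:.2%}")
print(f"relative error at plant scale:   {plant_rel_error:.0%}")
```

Bumping `n` up by another factor of ten shrinks the slope's standard error further but does nothing for the plant-scale prediction, because none of those extra observations were taken anywhere near plant scale.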