We have an alloy database, say a steel database, that contains information on 100 steels.
For each steel, the database records the composition (i.e., the concentration of each element) and the properties of the alloy. An example is shown below.
I want to use this data to build a model that can predict the hardness of new steels. However, I am afraid that the amount of data we have right now is not sufficient to represent the kind of steel we are investigating.
Is there a statistical way to quantitatively evaluate the sampling sufficiency? Thank you.
I think that the best way to determine if you have a sufficient sample is to perform the analysis and evaluate the model diagnostics. By 'sufficient,' I mean both quality and quantity. The sufficiency depends on many aspects of the data, so I think that a simple sample size calculation, for example, would not be helpful or appropriate in this case.
To elaborate on Mark's point: run your model and save the prediction formula and any other relevant output. Also, examine the model output in all the relevant dimensions. How accurate is the model on average? How wide are the prediction intervals? Are you concerned with particular kinds of estimation errors?
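For anyone who wants to try these checks outside JMP, here is a minimal Python sketch of two of them: average accuracy (as a leave-one-out RMSE) and the approximate width of a prediction interval. Everything in it is an assumption made for illustration. The data are synthetic stand-ins for the steel table, the function names are made up, and the interval uses the normal quantile 1.96 rather than the exact t quantile.

```python
import numpy as np

def make_synthetic_steels(n=100, seed=0):
    """Synthetic stand-in for the steel table (NOT real data):
    three made-up composition columns and a hardness column."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(0.0, 1.0, size=(n, 3))
    noise = rng.normal(0.0, 10.0, size=n)
    y = 150.0 + 200.0 * X[:, 0] + 40.0 * X[:, 1] - 25.0 * X[:, 2] + noise
    return X, y

def loo_rmse(X, y):
    """Leave-one-out RMSE of an ordinary least squares fit.
    Uses the PRESS shortcut e_i / (1 - h_ii), so no refitting is needed."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    # Diagonal of the hat matrix A @ pinv(A).
    leverage = np.einsum("ij,ji->i", A, np.linalg.pinv(A))
    press_resid = resid / (1.0 - leverage)
    return float(np.sqrt(np.mean(press_resid ** 2)))

def approx_prediction_halfwidth(X, y, x_new, z=1.96):
    """Approximate 95% prediction interval half-width at x_new.
    Assumption: normal quantile z instead of the exact t quantile."""
    A = np.column_stack([np.ones(len(y)), X])
    n, p = A.shape
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    s2 = resid @ resid / (n - p)          # residual variance estimate
    a0 = np.concatenate([[1.0], x_new])   # new point with intercept term
    var_factor = 1.0 + a0 @ np.linalg.pinv(A.T @ A) @ a0
    return float(z * np.sqrt(s2 * var_factor))

X, y = make_synthetic_steels()
rmse = loo_rmse(X, y)
halfwidth = approx_prediction_halfwidth(X, y, np.array([0.5, 0.5, 0.5]))
print(f"leave-one-out RMSE: {rmse:.1f}")
print(f"approx. 95% prediction interval: +/- {halfwidth:.1f}")
```

Whether those two numbers are acceptable is a judgment about the application (how close a hardness prediction needs to be), not something the statistics can decide on their own.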
The accuracy of your predictions will depend on sample size, but it also depends on how well the model captures the underlying relationship. So, you can increase the sample size and/or improve the model. I don't think there is a general formulaic approach for deciding which is better; it depends on the particular data you are working with and the context of your problem. If you are willing to post your data, I'm sure some people will be willing to explore this further (I am).
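One informal way to see how much sample size is limiting you is a learning curve: fit the model on random subsets of increasing size and watch the hold-out error. If the error is still dropping at the largest size, more steels would probably help; if it has flattened, improving the model is likely the better investment. The sketch below is only an illustration under assumptions: the data are synthetic, a plain linear model is assumed, and `learning_curve` is a hypothetical helper, not a JMP feature.

```python
import numpy as np

def fit_ols(X, y):
    """Ordinary least squares with an intercept; returns the coefficients."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta

def predict_ols(beta, X):
    return beta[0] + X @ beta[1:]

def learning_curve(X, y, train_sizes, n_test=20, n_repeats=200, seed=0):
    """Average hold-out RMSE for each training size.
    Each repeat draws a fresh random test set and a disjoint training subset."""
    rng = np.random.default_rng(seed)
    n = len(y)
    curve = {}
    for m in train_sizes:
        errs = []
        for _ in range(n_repeats):
            perm = rng.permutation(n)
            test, train = perm[:n_test], perm[n_test:n_test + m]
            beta = fit_ols(X[train], y[train])
            err = y[test] - predict_ols(beta, X[test])
            errs.append(np.sqrt(np.mean(err ** 2)))
        curve[m] = float(np.mean(errs))
    return curve

# Synthetic stand-in for 100 steels (NOT real data): three made-up
# composition columns and a hardness value with noise of std. dev. 10.
rng = np.random.default_rng(1)
X = rng.uniform(0.0, 1.0, size=(100, 3))
y = 150.0 + 200.0 * X[:, 0] + 40.0 * X[:, 1] - 25.0 * X[:, 2] + rng.normal(0.0, 10.0, size=100)

curve = learning_curve(X, y, train_sizes=[10, 20, 40, 80])
for m, rmse in curve.items():
    print(f"train size {m:3d}: mean hold-out RMSE {rmse:.1f}")
```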
Thanks to Dale for pointing out some specific suggestions. To take that direction even further, I suggest that you see Help > Books > Fitting Linear Models > Fit Least Squares. (I assume that you intend to use a linear model.) There are numerous features under the platform menu called Row Diagnostics. These tools are meant to help you understand the strengths and weaknesses of the model and the data.
The Actual by Predicted plot is a way to understand bias, if any, in the model.
The Leverage Plots will help you understand data problems within each variable such as collinearity with other variables and unusual influence of some observations.
The plots based on residuals are useful to assess how well your data meet the model assumptions. A pattern of bias in the Actual by Predicted plot or in one of the residual plots suggests a lack-of-fit problem, which can be addressed either by adding terms that are transforms of the existing variables (e.g., powers or cross terms) or by adding new variables.
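The quantities behind these plots are easy to compute by hand if you want to see what they measure. The sketch below is not JMP's implementation, just the textbook formulas for a least squares fit: leverage (the diagonal of the hat matrix), internally studentized residuals, and Cook's distance. The data are synthetic, with one deliberately unusual row planted at index 0, and `ols_diagnostics` is a hypothetical name.

```python
import numpy as np

def ols_diagnostics(X, y):
    """Row-level diagnostics for an ordinary least squares fit with intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    n, p = A.shape
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    fitted = A @ beta
    resid = y - fitted
    # Leverage: diagonal of the hat matrix A @ pinv(A).
    leverage = np.einsum("ij,ji->i", A, np.linalg.pinv(A))
    s2 = resid @ resid / (n - p)
    # Internally studentized residuals.
    student = resid / np.sqrt(s2 * (1.0 - leverage))
    # Cook's distance: combines residual size and leverage.
    cooks = student ** 2 * leverage / (p * (1.0 - leverage))
    return {"fitted": fitted, "resid": resid, "leverage": leverage,
            "student": student, "cooks": cooks}

# Synthetic stand-in for 100 steels (NOT real data).
rng = np.random.default_rng(2)
X = rng.uniform(0.0, 1.0, size=(100, 3))
y = 150.0 + 200.0 * X[:, 0] + 40.0 * X[:, 1] - 25.0 * X[:, 2] + rng.normal(0.0, 10.0, size=100)

# Plant one unusual steel: composition far outside the rest, hardness off-trend.
X[0] = [3.0, 3.0, 3.0]
y[0] = 150.0 + 200.0 * 3.0 + 40.0 * 3.0 - 25.0 * 3.0 + 100.0

diag = ols_diagnostics(X, y)
worst = int(np.argmax(diag["cooks"]))
print(f"most influential row: {worst}")
print(f"its leverage: {diag['leverage'][worst]:.2f}, Cook's distance: {diag['cooks'][worst]:.2f}")
```

Plotting `fitted` against `y` gives the Actual by Predicted view, and plotting `resid` against `fitted` gives the basic residual plot; curvature in either one is the lack-of-fit signal described above.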
What kind of model do you plan to use?
The linear models have many advantages but they are only so flexible (limited Taylor series, after all). The Neural platform might produce better fits for complex relationships. You can use Actual by Predicted and residual plots for your assessment.
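Whichever platform you choose, the comparison between a simpler and a more flexible model is best made on data the model has not seen. The sketch below illustrates that idea with a swapped-in technique: instead of a neural network, it compares a main-effects linear model against a second-order model (squares and cross terms) using 5-fold cross-validation. The data are synthetic with curvature built in on purpose, and the helper names are hypothetical.

```python
import numpy as np
from itertools import combinations

def quadratic_features(X):
    """Main effects plus squares and all pairwise cross terms."""
    squares = X ** 2
    crosses = np.column_stack(
        [X[:, i] * X[:, j] for i, j in combinations(range(X.shape[1]), 2)]
    )
    return np.column_stack([X, squares, crosses])

def kfold_rmse(F, y, k=5, seed=0):
    """k-fold cross-validated RMSE of a least squares fit on features F."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    sq_errs = []
    for test in folds:
        train = np.setdiff1d(np.arange(len(y)), test)
        A = np.column_stack([np.ones(len(train)), F[train]])
        beta, *_ = np.linalg.lstsq(A, y[train], rcond=None)
        pred = beta[0] + F[test] @ beta[1:]
        sq_errs.append((y[test] - pred) ** 2)
    return float(np.sqrt(np.mean(np.concatenate(sq_errs))))

# Synthetic stand-in for 100 steels (NOT real data), with deliberate
# curvature in the second column and an interaction between columns 0 and 2.
rng = np.random.default_rng(3)
X = rng.uniform(0.0, 1.0, size=(100, 3))
y = (150.0 + 200.0 * X[:, 0] + 400.0 * (X[:, 1] - 0.5) ** 2
     + 100.0 * X[:, 0] * X[:, 2] + rng.normal(0.0, 10.0, size=100))

rmse_linear = kfold_rmse(X, y)
rmse_quadratic = kfold_rmse(quadratic_features(X), y)
print(f"main-effects model, CV RMSE:  {rmse_linear:.1f}")
print(f"second-order model, CV RMSE:  {rmse_quadratic:.1f}")
```

With only 100 rows, a flexible model can also do worse out of sample than a simple one, which is why the cross-validated number, not the training fit, should drive the choice.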