Ronald Snee, PhD, President, Snee Associates
Richard De Veaux, C. Carlisle and Margaret Tippit Professor of Statistics, Williams College
Roger Hoerl, PhD, Assistant Professor, Union College
JMP offers a number of advanced tools for analyzing massive data sets, including partition, neural, and k-means clustering. These tools make high-powered statistical methods available not only to professional statisticians but also to casual users. As with any tool, the results to be expected are proportional to the knowledge and skill of the user. Unfortunately, much of the data mining, machine learning, and "big data" literature may give casual users the impression that a powerful enough algorithm and a lot of data guarantee good models and good results. This session will focus on three important principles that, in our opinion, have been underemphasized in the literature: the importance of using sequential approaches to scientific investigation, as opposed to "one-shot studies"; the need for empirical modeling to be guided by subject-matter theory, including interpretation of data within the context of the process and measurement system that generated it; and the need to question the typically unstated assumption that all data are created equal, and therefore that data quantity is more important than data quality. We will discuss the problems that can arise when these fundamentals are ignored, and share our thoughts and experiences on how to improve data mining projects.