Synergies of Designed Data Collection and Big Data

If you missed last week’s Statistically Speaking with the widely published and award-winning Christine Anderson-Cook, retired Research Scientist at Los Alamos National Laboratory (LANL), you missed a fascinating tour of designed data collection, big data, and their synergies. The potential for designed data collection to add more value to projects involving big data is significant. Indeed, attendees were so keen to learn more about this topic that we didn’t have time to answer all of their many interesting questions. So, we sent them to Christine, who kindly took the time to answer them for this post. You can see her talk and hear the questions she did address in the on-demand recording.

 

Do you think too much emphasis is put on observational (and transactional) data, which is typically information-sparse, versus intentionally gathering data that may better inform decisions? Is this primarily due to the “cost” of getting more information-rich data?

I think that early on for many companies, there was a sentiment that “lots of data can solve all problems.” That may lead to thinking that just collecting data is a solution in itself. However, now that big data has been around for a while, many companies may have had some unsuccessful experiences, where the data have not been able to give them the answers that they need. So, unfortunately, as reality creeps in, experience might be one of the most compelling teachers that low-cost, convenient data is not necessarily more desirable than more expensive, intentionally chosen data.

In the discussion portion of the QREI paper [1], Laura Freeman had a wonderful quote about what often happens when data are collected without intention: “The end result is typically an exhausted analyst and a disappointed decision-maker, because the data was never designed to answer the problem they are trying to address.”[2] For answering many specific and important questions, there is no substitute for matching the data to the question of interest before it is collected.

The bottom line is that dependence on observational or transactional data can have mixed results. If it is collected without forethought or simply because it is convenient, then there is a higher chance of disappointment than if there is awareness and planning from the experts on the data collection team about what the data will be used for.

 

Do you know of instances where the cost to measure something exceeded its benefit?

I have worked on a few projects at LANL where the cost to get the data was very high, usually because the way to collect the desired data was unknown at the start. In these cases, it sometimes takes some research and development to be able to get the data that you want. Of course, there may be some occasional disappointments about what you learn from that data once you are able to collect it. However, that investment might have a payoff down the road for another application once the new capability has been developed and the platform for data collection has been established. If it looks like you might be in a situation like this, where resources will need to be spent to get a new kind of data, it is important to really think through its possible benefits for the current project and over the long term, the risks of disappointment, and whether there might be an alternate surrogate from which similar information can be obtained more easily or cheaply.

 

Many newer companies are promoting prescriptive models to tell experimentalists what to do next. Can you comment on this approach to experimentation versus setting up designed experiments?

Prescriptive models are definitely emerging as a go-to approach in a number of different areas. For those not familiar with what they involve, here are two loose definitions that might help to clarify the distinction between more traditional models and these particular ones:

  • Predictive models forecast results for new input combinations based on a model estimated using previous data.
  • Prescriptive models make specific actionable recommendations about how to respond to new data.

Predictive tools are designed to help with understanding, while prescriptive models take that understanding and automate it into planned actions.
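To make the distinction a little more concrete, here is a minimal sketch with made-up data, a hypothetical target value, and a simple linear fit: the predictive piece only forecasts a response, while a thin prescriptive layer turns that forecast into an automated recommendation.

```python
# Minimal sketch (hypothetical data and target) contrasting predictive and prescriptive models.
import numpy as np

# --- Predictive: forecast a response for new input settings from past data ---
past_x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])     # e.g., a process setting
past_y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])    # observed response
slope, intercept = np.polyfit(past_x, past_y, 1)  # simple fitted model

def predict(x):
    """Predictive model: returns a forecast and leaves the decision to a human."""
    return slope * x + intercept

# --- Prescriptive: turn the forecast into an automated, actionable recommendation ---
def recommend(candidate_settings, target=8.0):
    """Prescriptive layer: picks the setting whose forecast is closest to a target."""
    forecasts = [predict(x) for x in candidate_settings]
    best = min(range(len(candidate_settings)), key=lambda i: abs(forecasts[i] - target))
    return candidate_settings[best], forecasts[best]

setting, forecast = recommend([2.5, 3.5, 4.5])
print(f"Recommended setting {setting} with forecast {forecast:.2f}")
```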

I think it is important to remember that the algorithms behind prescriptive models are only as good as the people who create them. Hannah Fry[3] does a masterful job of talking about the strengths and weaknesses of automation with algorithms and efforts to take humans out of the loop.

It might be advantageous to consider prescriptive models for situations where the volume or speed required for making the decision (think adaptive design, where new subjects waiting to be assigned to treatments are rolling in continuously and frequently) makes it impractical for a human decision maker to keep up. That being said, I think there are huge advantages to keeping design of experiments as a task guided by a thoughtful human. People notice when conditions change from the paradigm that was used to set up the experiment for which the algorithm was designed. Humans are better at extracting knowledge and key aspects of the problem. They are better at finding solutions that appropriately balance multiple objectives in the way that best reflects the priorities of the study. So, my recommendation is that design of experiments should be the starting point for most experiments, but if the experiment is to be repeated many times, or if the speed or volume dictates, then the experience gained from initial experiments could be translated into a prescriptive model to reduce the burden on the humans.

 

For sequential experimentation, how are models built and estimated when data were obtained from several stages, such as a pilot study, exploration, and model building? What are the consequences if new data cannot be obtained for model refinement?

For all of the advantages of sequential experimentation, how to manage the analysis is one of the bigger complications. If the data are collected in groups in a non-randomized order, then the analysis should almost always reflect this situation. An exception to needing to adapt the analysis would be for a computer experiment where the same code is running for all of the input combinations and is agnostic to run order. Sometimes there is no need to combine data into a single analysis, with separate models and interpretations for each stage. For example, if a measurement system analysis is performed at the pilot stage, there might not be a direct link between it and subsequent data designed to focus on exploration and model building.

Generally, treating groups of runs from different stages of the experiment (pilot, exploration, model building, etc.) as blocks provides a way of acknowledging that things may have changed between the phases. Exploratory methods can help to examine the impact of the blocks and see if there appear to be substantial changes in the responses between blocks. In the nuclear forensics example, the three phases of the larger experiment were analyzed with a block effect added. From examination of this effect, we were able to identify some differences in the measurement of some of the characteristics over time.
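As one way to carry out this idea, here is a minimal sketch with hypothetical factors and responses (using Python's statsmodels rather than any particular JMP platform) that treats the experimental stage as a categorical block effect alongside the factors of interest.

```python
# A minimal sketch (hypothetical data) of adding a block effect for the stages
# of a sequential experiment, then checking whether the blocks differ.
import pandas as pd
import statsmodels.formula.api as smf

# Simulated runs from three stages of a sequential experiment
df = pd.DataFrame({
    "stage": ["pilot"]*4 + ["exploration"]*4 + ["model_building"]*4,  # block factor
    "x1":    [-1,  1, -1,  1, -1,  1,  0,  0, -1,  1, -1,  1],
    "x2":    [-1, -1,  1,  1,  0,  0, -1,  1, -1, -1,  1,  1],
    "y":     [5.1, 7.2, 6.0, 8.3, 5.6, 7.9, 6.4, 7.1, 5.9, 8.0, 6.7, 8.8],
})

# Treat the stage as a categorical block effect alongside the factors of interest
model = smf.ols("y ~ C(stage) + x1 + x2", data=df).fit()
print(model.summary())   # the C(stage) terms show shifts between phases
```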

If new data cannot be obtained for model refinement, then the precision of the initial model cannot be further improved. In many cases, the goal of the model refinement is to target areas where the current estimation or prediction from the model is weak. The augmentation with additional runs can help to moderate this weakness and improve overall performance. If there is no budget for this model refinement, then estimation and prediction are dictated by what data are available. This might be a lost opportunity – so whenever possible, it is good to think about efficient resource allocation across the different stages.
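As a rough illustration of targeting weak regions when augmentation is possible, here is a sketch with a hypothetical two-factor design and first-order model (not any specific JMP augmentation criterion) that scores candidate follow-up runs by the current design's relative prediction variance and picks the locations where prediction is weakest.

```python
# A minimal sketch (hypothetical factors) of augmentation: add follow-up runs where
# the current design predicts least precisely (largest relative prediction variance).
import numpy as np

def model_matrix(x):
    """First-order model with interaction in two factors."""
    x1, x2 = x[:, 0], x[:, 1]
    return np.column_stack([np.ones(len(x)), x1, x2, x1 * x2])

initial = np.array([[-1, -1], [1, -1], [-1, 1], [1, 1], [0, 0]])  # initial runs
grid = np.array([[a, b] for a in np.linspace(-1, 1, 11) for b in np.linspace(-1, 1, 11)])

X = model_matrix(initial)
XtX_inv = np.linalg.inv(X.T @ X)

F = model_matrix(grid)
pred_var = np.einsum("ij,jk,ik->i", F, XtX_inv, F)   # f(x)' (X'X)^{-1} f(x)

augment = grid[np.argsort(pred_var)[-4:]]            # four runs where the model is weakest
print(augment)
```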

 

Now that we have a lot of data from electronic medical records in hospitals, those collecting and analyzing the data may have limited skills in DOE. What are some ways to apply some of the great points you shared with us?

Hospital records contain a wealth of information, but it may well be hidden in the volume of records to be processed. Given the nature of medical treatment and ethical considerations, it is most likely going to be very difficult to change the observational nature of the data. As a result, we may be unable to control the “inputs” for the data gathered, and we are more likely to be restricted during the big data phase to strategic sampling of records to help us answer questions.

Here are a few thoughts on where there might be opportunities:

  • If new types of data are going to be added to the database, there may be an opportunity to perform a measurement system analysis to help understand the sources of variability. For example, are the people administering the tests a substantial part of the variation? Does it matter which machine is used for tests or are they all similarly calibrated? Given that the volume of data to be collected is likely large, a small study to understand the variability sources might be very informative about how results should be interpreted.
  • If measurements are costly, in the stages prior to big data, there may be opportunities to use a smaller data set to examine correlations between measurements to understand which data are most closely connected to results and whether there are costly redundancies of information.
  • A key aspect of answering different queries is making sure that there are easy-to-implement ways of partitioning the data into relevant and less relevant categories. Being able to sample records for subpopulations based on key fields (see the sketch after this list) will allow more precise comparisons to be made with similar subjects.
  • Sampling the data by date or location might allow for comparison of results across time or equipment. Doing so may highlight trends or reduce the number of false conclusions that are attributable to external factors and not changes in the patient.
  • Finally, in the stages following big data, there may be opportunities to use some small, designed experiments in the form of confirmation experiments to verify hypotheses or check whether effects translate across broader populations.
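For the sampling idea in the third bullet above, a minimal sketch of pulling a stratified sample of records by key fields might look like the following; the file name and field names are hypothetical placeholders for whatever the records system actually contains.

```python
# A minimal sketch (hypothetical file and field names) of stratified sampling of records
# so that comparisons are made within similar subpopulations.
import pandas as pd

# Hypothetical extract of electronic medical records
records = pd.read_csv("emr_extract.csv")   # assumed columns: age_group, site, treatment, outcome

# Sample up to 200 records per age_group x site stratum for more precise comparisons
sample = (
    records
    .groupby(["age_group", "site"], group_keys=False)
    .apply(lambda g: g.sample(n=min(len(g), 200), random_state=1))
)

print(sample.groupby(["age_group", "site"]).size())
```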

 

Doesn’t the adaptive design (A/B testing) approach bias the results toward whichever model gets a head start at the beginning? If A gets shown 75% of the time, the baseline expectation of the number of hits for A becomes higher than for B.

The field of adaptive design has many options for implementing the algorithm that decides how to respond to new observations. Typically, the experimenter has choices about how quickly the algorithm will change the proportion of new subjects receiving each treatment. If you evolve too slowly, then there is an opportunity cost of potentially having a larger proportion of the subjects receiving the less desirable option. If you evolve too quickly, then with an “unlucky” (non-representative) start to the sequence, you may begin by directing too many subjects to the less desirable option.

As with so many decisions, there are two types of errors, and the experimenter needs to weigh their relative risks and consequences. In many implementations, the balance between exploration and exploitation is chosen to minimize the chance of evolving too quickly to the wrong solution, while tolerating some lost opportunities by making a cautious choice. The algorithms are typically designed to work with the proportion of successes for each option, not the absolute number of observed successes.
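To make the proportion-based behavior concrete, here is a minimal Thompson-sampling-style sketch with hypothetical conversion rates; it is one of many possible adaptive algorithms, not necessarily the one used in any particular platform. The assignment rule depends on each option's observed success proportion, and a stronger prior makes the assignment probabilities evolve more cautiously.

```python
# Minimal Thompson-sampling sketch (hypothetical rates). The rule uses each option's
# success proportion via its Beta posterior; a larger prior slows adaptation.
import random

true_rates = {"A": 0.10, "B": 0.14}   # hypothetical conversion rates
successes = {"A": 0, "B": 0}
failures  = {"A": 0, "B": 0}
prior = 5                              # larger prior -> more cautious adaptation

random.seed(42)
for _ in range(2000):
    # Sample a plausible rate for each option from its Beta posterior
    draws = {arm: random.betavariate(prior + successes[arm], prior + failures[arm])
             for arm in true_rates}
    arm = max(draws, key=draws.get)    # assign the next subject to the better-looking option
    if random.random() < true_rates[arm]:
        successes[arm] += 1
    else:
        failures[arm] += 1

for arm in true_rates:
    n = successes[arm] + failures[arm]
    print(arm, "assigned", n, "times, observed rate", round(successes[arm] / max(n, 1), 3))
```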

 

Regarding the data competition data set example, is dividing into three stages from the overall design something that JMP can do to select which runs are for which stage, or was that just arbitrarily determined?

Instead of creating a bigger data set and dividing it into training, public test, and private test data, we used a bottom-up approach: generating smaller data sets with different characteristics for different scenarios and then combining them to form the final competition data.

Here are a few more details about the construction of the training, public test, and private test data sets. A dense set of candidate points was constructed for each scenario (think of one of the radioactive sources to be discovered, shielded or unshielded). Since the highest consequences of the competition lay with the results of the private test data set, it was constructed first, using a non-uniform space-filling (NUSF) design [4], where the non-uniformity was chosen to emphasize the most difficult regions (here, the fast-moving detector looking for a small amount of the radioactive material), but still with some coverage throughout the entire space. Once this design was finalized, the reduced candidate set (the original minus the points already selected) was used to construct the public test data set. In this case, the emphasis was on moderate levels of difficulty, but again with some coverage throughout the entire space. The process was repeated a final time to construct the training data, with a further reduced candidate set (the points selected for both the public and private test data sets were removed from the original candidate set). This allowed for the creation of three designed data sets for each radioactive source × shielding combination.
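To illustrate only the candidate-set reduction workflow just described (this is not the NUSF construction of [4]; the difficulty score, weighting scheme, and set sizes are invented for the sketch), the sequence of selections might look like this:

```python
# A minimal sketch (not the NUSF algorithm of [4]) of the candidate-set reduction
# workflow: select the private test set first with weights emphasizing the hardest
# region, remove those points, then repeat for the public test and training sets.
import numpy as np

rng = np.random.default_rng(7)
candidates = rng.uniform(size=(5000, 2))       # dense candidate set (speed, source strength)

def difficulty(points):
    """Hypothetical difficulty score: fast detector + weak source = harder."""
    speed, strength = points[:, 0], points[:, 1]
    return speed * (1.0 - strength)

def weighted_pick(points, n, emphasis):
    """Pick n points without replacement, weighting by difficulty**emphasis."""
    w = difficulty(points) ** emphasis + 1e-9   # keep some coverage everywhere
    return rng.choice(len(points), size=n, replace=False, p=w / w.sum())

# Private test first (strongest emphasis on the difficult region) ...
private_idx = weighted_pick(candidates, 300, emphasis=3.0)
remaining = np.delete(candidates, private_idx, axis=0)

# ... then the public test from the reduced candidate set (moderate emphasis) ...
public_idx = weighted_pick(remaining, 300, emphasis=1.5)
remaining = np.delete(remaining, public_idx, axis=0)

# ... and finally the training data from what is left.
training_idx = weighted_pick(remaining, 600, emphasis=1.0)
print(len(private_idx), len(public_idx), len(training_idx))
```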

Next, these designs were combined across all scenarios and randomized into three final data sets (overall training, overall public test, and overall private test), which were then provided to the competitors. There are some papers that I wrote with my colleagues about how to think about constructing data competition data sets [5] and how we implemented these strategies for the radiation detection problem [6]. Currently, JMP does not have an automatic feature for constructing NUSF designs directly, but it is possible to build one that uses a sequential approach to vary the degree of uniformity of the design [7] with the fast flexible space-filling designs [8] in JMP. Alternatively, there is R code available to construct these designs in the supplementary materials of my paper [4].

 

I have a question on handling the curation of outliers prior to training the algorithms. There is a danger in excluding them, particularly if there is sparsity in that state space. Would you potentially use a design of experiments for adaptive measurements to differentiate whether to include or exclude those outliers? Are there other approaches for determining outlier curation?

Outliers and high leverage points are noteworthy for their disproportionate impact on results and on our interpretation of patterns. To define these notions more precisely: an outlier is an observation whose response(s) does not follow the trend of the other data, while a high leverage point is an observation whose input combination is extreme relative to the rest of the data set (which can be difficult to assess as the dimension of the input space gets higher).
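As one common way to flag both kinds of points before deciding how to handle them, here is a minimal sketch on simulated data using standard regression diagnostics (leverage values and externally studentized residuals); the thresholds shown are conventional rules of thumb, not a prescribed rule.

```python
# A minimal sketch (simulated data) of flagging outliers (extreme residuals)
# and high-leverage points (extreme input combinations) before modeling.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
X[0] = [4.0, 4.0]                  # an extreme input combination -> high leverage
y = 2.0 + X @ np.array([1.5, -0.8]) + rng.normal(scale=0.5, size=50)
y[1] += 5.0                        # a response off the trend -> outlier

fit = sm.OLS(y, sm.add_constant(X)).fit()
influence = fit.get_influence()

leverage = influence.hat_matrix_diag                   # large values -> high leverage
student_resid = influence.resid_studentized_external   # |value| > ~3 -> possible outlier

p, n = X.shape[1] + 1, len(y)
print("High leverage points:", np.where(leverage > 2 * p / n)[0])
print("Possible outliers:   ", np.where(np.abs(student_resid) > 3)[0])
```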

Design of experiments might be helpful in different ways for each of these. For an outlier, a small local experiment in the region where the extreme response value was observed might help to clarify whether this is a region of different behavior or perhaps one with higher than typical variation. Each of these, when confirmed, would require a different reaction. For a high leverage point, intentional experimentation in the region connecting the high leverage point to the rest of the input space can help to characterize the relationship between inputs and responses without big gaps, providing information about the transition between the two regions of the input space. Both of these strategies, of course, depend on the degree of control over the input settings that the experimenter has.

I am rather old-school about removing outliers. Unless there is subject matter expertise to guide inclusion or exclusion, I am extremely reluctant to remove these data points. Ensemble models (which combine a model based on the entire data set with one based on the data with the outliers removed) can provide an alternative way of calibrating the impact of these observations on our understanding of variability and model uncertainty.
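As a rough sketch of that calibration idea (simulated data, an ordinary least squares model, and an assumed flagging rule), fitting the model with and without the suspect points and comparing the results shows how much those points actually move the answer.

```python
# A minimal sketch (simulated data, assumed flags) of calibrating outlier impact by
# comparing a fit on all data with a fit that excludes the flagged points,
# rather than silently removing them.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 2))
y = 1.0 + X @ np.array([2.0, -1.0]) + rng.normal(scale=0.4, size=60)
y[:3] += 6.0                                  # three suspect observations

flagged = np.zeros(len(y), dtype=bool)
flagged[:3] = True

full = LinearRegression().fit(X, y)                          # model on the entire data set
trimmed = LinearRegression().fit(X[~flagged], y[~flagged])   # model with outliers removed

x_new = np.array([[0.5, -0.5]])
preds = np.array([full.predict(x_new)[0], trimmed.predict(x_new)[0]])
print("Ensemble (average) prediction:", preds.mean())
print("Spread due to the suspect points:", preds.max() - preds.min())
```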

 

We appreciate Christine taking the time to share more of her wisdom on this strategic topic. Thanks to everyone who attended and asked such great questions! We applaud all efforts to gather information-rich data!  We also highly recommend Christine's paper, The Why and How of Asking Good Questions.

 

References:

[1] Anderson-Cook, C.M., Lu, L. (2023) “Is Designed Data Collection Still Relevant in the Big Data Era?” Quality and Reliability Engineering International 39(4) 1085-1101.

[2] Freeman, L. (2023) “Review: Is design data collection still relevant in the big data era? With extensions to machine learning” Quality and Reliability Engineering International 39(4) 1102-1106.

[3] Fry, H. (2018) Hello World: Being Human in the Age of Algorithms, W.W. Norton & Company.

[4] Lu, L., Anderson-Cook, C.M., Ahmed, T. (2021) “Non-Uniform Space-Filling Designs” Journal of Quality Technology 53(3) 309-330.

[5] Anderson-Cook, C.M., Lu, L., Myers, K., Quinlan, K., Pawley, N. (2019) “Improved Learning from Data Competitions through Strategic Generation of Informative Data Sets” Quality Engineering 31(4) 564-580.

[6] Anderson-Cook, C.M., Myers, K., Lu, L., Fugate, M.L., Quinlan, K., Pawley, N. (2019) “Data Competitions: Getting More from a Strategic Design and Post-Competition Analysis” Statistical Analysis and Data Mining 12 271-289.

[7] Lu, L., Anderson-Cook, C.M. (2022) “Leveraging What You Know: Versatile Space-Filling Designs” Quality Engineering https://doi.org/10.1080/08982112.2022.2161394

[8] Lekivetz, R., Jones, B. (2015) “Fast Flexible Space-Filling Designs for Nonrectangular Regions” Quality and Reliability Engineering International 31(5) 829-837. doi: 10.1002/qre.1640.
