SDF1
Super User

Simulate vs Bootstrap methods in JMP: which to use when and why

Hi All,

 

  I'm hoping someone from JMP might be able to provide an explanation as to the differences/similarities of the Simulate vs Bootstrap approaches when evaluating model estimates. (JMP Pro)

 

  Some links to previous Discovery Summit presentations from Gotwalt and others can be found here, here, and here. One clear difference between the methods is that Simulate requires you to swap out a column formula with another column formula, whereas bootstrapping does not. I understand the concept of bootstrapping when recalculating the parameter estimates, but it's unclear to me why the fractional random weighted column needs to be swapped out when simulating the same data. See for example the first link above and corresponding presentation that explains this approach using an autovalidation setup (in this case they were working with DOE results).
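  In case it helps frame my question, here is a minimal Python sketch (not JSL, and not necessarily what JMP does internally) of my understanding of a fractional-random-weight resample; the Exponential(1) weights are my assumption:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50                                # rows in a toy data table
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(size=n)

# One fractional-random-weight resample: every row stays in the fit, but each
# receives a continuous random weight instead of an integer resampling count.
# Exponential(1) draws rescaled to sum to n is one common choice (the
# Bayesian bootstrap of Rubin, 1981).
w = rng.exponential(scale=1.0, size=n)
w *= n / w.sum()

# Refit the model as weighted least squares under these weights; repeating
# this many times traces out a sampling distribution for the slope.
x_c = x - np.average(x, weights=w)
y_c = y - np.average(y, weights=w)
slope = np.sum(w * x_c * y_c) / np.sum(w * x_c ** 2)
print(slope)
```

  My (possibly wrong) reading is that the swapped-in column formula is what regenerates these weights on each round.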

 

  What I'd like to better understand is the benefits/drawbacks of either approach and when is one approach better than another. I have a wide and tall data table where I have several process variables and product quality measurements. I want to narrow down the set of possible process variables that might be influencing the outcome to provide production some idea of where to investigate. To do this, I take the Null Factor idea from their presentations and include that as a truly random variable in the model. I then run a very basic GenReg model on the factors and Y output, and then right click the column of estimates and run either the bootstrap or simulate option on the parameter estimates.

 

  I do not get the same results running both ways, and I'd like to understand why. Also, is one approach more appropriate over another since this isn't a DOE?

 

 Examples of the results are below on a Pareto Plot, where any factors that are introduced into the model fewer times than the Null Factor are simply removed. The first image shows the results of using the Simulate method, where the factor on the far right of the Pareto plot is the Null Factor. Similarly, the second image shows the results using the Bootstrap method, where again the Null Factor is at the far right.

[Image: Pareto plot of estimates from the Simulate method; the Null Factor is at the far right.]

[Image: Pareto plot of estimates from the Bootstrap method; the Null Factor is again at the far right.]

  Why do I obtain two very different results for the two methods? Is one more "trustworthy" than the other? Why would I use one method over the other? Is one method more robust than the other? If you do get different results, why does JMP Pro have the two options?

 

  Any insights to this are much appreciated.

 

Thanks!

DS

 


Re: Simulate vs Bootstrap methods in JMP: which to use when and why

Bootstrapping, or resampling, uses the empirical distribution of the sample in place of the unknown underlying population distribution. Bradley Efron is generally credited with inventing and developing this methodology; he has also explored its shortcomings. The bootstrap samples can then be used to empirically determine sample statistics and intervals. No population distribution is assumed. The accuracy and stability of these estimates depend on the size of the original sample and the number of resamples.
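As a minimal sketch (Python, with made-up data; illustrative, not JMP's implementation), the nonparametric bootstrap of a sample mean looks like this:

```python
import numpy as np

rng = np.random.default_rng(0)
sample = rng.lognormal(size=30)   # stand-in for any observed data

# Nonparametric bootstrap: resample the observed rows with replacement and
# recompute the statistic. The spread of the recomputed values is an
# empirical estimate of its sampling distribution -- no population
# distribution is assumed anywhere.
boot_means = np.array([
    rng.choice(sample, size=sample.size, replace=True).mean()
    for _ in range(2500)
])
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f"mean = {sample.mean():.3f}, 95% interval = ({lo:.3f}, {hi:.3f})")
```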

The simulation method assumes a distribution model from which resamples are generated. The estimates depend heavily on the model assumption.
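A companion sketch of the simulation idea, under the same caveats: here the resamples come from a fitted distribution model rather than from the data themselves.

```python
import numpy as np

rng = np.random.default_rng(0)
sample = rng.lognormal(size=30)   # same stand-in data as above

# Parametric simulation: assume a distribution model (here, normal), fit it,
# then generate entirely new samples from the fitted model and recompute the
# statistic. Because these data are lognormal, the normal assumption is
# wrong, and the resulting interval inherits that error.
mu, sigma = sample.mean(), sample.std(ddof=1)
sim_means = rng.normal(mu, sigma, size=(2500, sample.size)).mean(axis=1)
lo, hi = np.percentile(sim_means, [2.5, 97.5])
print(f"95% simulation interval = ({lo:.3f}, {hi:.3f})")
```

Comparing the two intervals on the same data is one way to see how strongly the model assumption drives the simulation result.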

This answer is high-level and not detailed. You can imagine instances of both methods and how they might succeed or fail spectacularly.

SDF1
Super User

Re: Simulate vs Bootstrap methods in JMP: which to use when and why

Hi @Mark_Bailey ,

 

  Thanks for your feedback and thoughts.

 

  So, it sounds like (from a high level) the primary difference comes down to how either simulate or bootstrap assumes or doesn't assume a population distribution. Because of this difference in population distribution requirements, the two methods approach the process of estimation differently, which leads the simulate method to require another column to swap in and out.

 

  I still would like to understand better whether one method is more robust than the other and when to use one method over the other. For example, what are the population distribution requirements for the simulation method? Are there limitations/restrictions where you shouldn't use one method over the other?

 

  In my example above, I get very different answers with each method for the factors that come into the model more frequently than the Null Factor. Which method actually provides the most accurate estimate of which factors should remain in the model vs. those that are not actually useful? Which result can I "trust"?

 

  The JMP online help has limited explanation and doesn't really discuss using one method or another.

 

Thanks!

DS

Re: Simulate vs Bootstrap methods in JMP: which to use when and why

Neither method is superior in any statistical sense. That is to say, neither one always performs better than the other in every case.

The accuracy of the result must be verified independently, by making a new prediction (i.e., the estimate) and checking it against a new sample.

The literature on bootstrapping and simulation is vast. I can't easily answer a broad question here about limitations and shortcomings in a small space. For example, will the sample sufficiently capture population features if the population has broad tails or exhibits outliers? Will it generate bootstrap samples like real samples from the population? On the other hand, to what degree do you believe the distribution model used for simulation?

Trust is subjective because two investigators might require different kinds and amounts of evidence to trust one method over the other.

Software documentation is not intended to serve as an education. You must use other sources to learn about the two approaches in this case. 

statman
Super User

Re: Simulate vs Bootstrap methods in JMP: which to use when and why

Pardon my confusion....

Before I offer my thoughts, which are likely controversial, what is the ultimate objective of your post? Some of your questions (e.g., Is one more "trustworthy" than the other? Why would I use one method over the other? Is one method more robust than the other?) suggest one objective. Others (e.g., Why do I obtain two very different results for the two methods? Why does JMP Pro have the two options?) suggest something different. Are you trying to determine which method will arrive at a model that explains the variation that already exists (in a process or data set), or a model that is useful for prediction (analytical in Deming's terms)? OR are you just interested in a technical discussion about the two methods?

"All models are wrong, some are useful" G.E.P. Box
SDF1
Super User

Re: Simulate vs Bootstrap methods in JMP: which to use when and why

Hi @statman ,

 

  Sorry for the confusion; to be honest, it's a little bit of all of it, but not so much about arriving at the "final" model yet. It's more about arriving at the most likely set of factors to investigate further -- kind of like the Predictor Screening platform.

 

  I'm using a modified method similar to what was presented at the 2018 Discovery Summit in Cary -- the first link in my original post -- to try and determine from a very large (>130) set of possible production tags (things like pump speeds, flow rates, oven temp and air flow), which ones contribute to the outcome, Y, that is measured in QC. Using the GenReg approach (e.g. the Forward Selection or Pruned Forward Selection) you can get estimates for the Terms in the model -- and this is a VERY simplistic model looking only at the main factors, no interactions, nothing else. Many of those estimates will be zero, or near zero for their coefficients.

 

  You can then perform either a simulate or bootstrap analysis of the estimates -- this is normally done to get a better confidence of what the range in possible values is for said estimate. However, another thing you can get out of the distribution statistics from either method is the proportion of times the value for that estimate is non-zero. Comparing each estimate's proportion non-zero value to the Null Factor gives you an idea of whether or not the factor appears in the model more frequently than the Null Factor. Since the Null Factor is completely random and has no correlation with the actual production tags, if a tag shows up less frequently than the Null Factor, it can be thrown out of future modeling efforts. This allows you to shrink the >130 possible factors to a more manageable set of perhaps a dozen or so. Further investigation, experiments, or modeling can be done at this point on a much smaller set of factors than previously.
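  To make the procedure concrete, here is a rough Python sketch of the screening rule (lasso stands in for GenReg's forward selection, and all data and tuning values are made up):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 200, 10                        # stand-in for the >130 production tags
X = rng.normal(size=(n, p))
y = 1.5 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(size=n)

# Append a Null Factor: pure noise, uncorrelated with y by construction.
X = np.column_stack([X, rng.normal(size=n)])
null_idx = X.shape[1] - 1

B = 500                               # resamples (JMP Pro defaults to 2500)
nonzero = np.zeros(X.shape[1])
for _ in range(B):
    rows = rng.integers(0, n, size=n)            # one bootstrap resample
    # Lasso stands in for GenReg's sparse selection; alpha is arbitrary here.
    fit = Lasso(alpha=0.1).fit(X[rows], y[rows])
    nonzero += fit.coef_ != 0

prop = nonzero / B
keep = prop > prop[null_idx]          # the screening rule described above
print("proportion nonzero:", np.round(prop, 2))
print("tags entering more often than the Null Factor:", np.where(keep[:null_idx])[0])
```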

 

  What I would like to better understand is why the simulate and bootstrap methods produce different results. Is it the distribution issue Mark Bailey was talking about? Is it that the default of 2500 resamples is not enough and I need more? Or is there some more fundamental reason causing the two methods to provide different results?
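  One thing I can at least try to rule out myself: if my math is right, the Monte Carlo noise at 2500 resamples is small. A quick sanity check (hypothetical numbers):

```python
import numpy as np

# Monte Carlo standard error of a "proportion nonzero" estimate: with N
# resamples and true proportion p, the standard error is sqrt(p * (1 - p) / N).
p = 0.5                               # worst case
for N in (500, 2500, 10000):
    print(N, round(float(np.sqrt(p * (1 - p) / N)), 4))
# At N = 2500 the proportion is pinned down to roughly +/- 0.01, so the large
# disagreement between the two methods doesn't look like resampling noise.
```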

 

  I would like to better understand the technical side of the two methods in order to make a more well-informed decision about which method to use and when. If the two methods do not converge on a similar solution (not that they need to), then which solution is the "correct" one?

 

  There is not a great deal of information in JMP's online help, and I'd like to better understand these tools in order to effectively utilize them.

 

Thanks!

DS

P_Bartell
Level VIII

Re: Simulate vs Bootstrap methods in JMP: which to use when and why

@SDF1 I'm going to offer my two cents/opinions as well. I concur with everything @Mark_Bailey and @statman have contributed so far. The pathway of your investigation sounds eerily similar to what I was repeatedly tasked with in my working days in the chemical process industry. We'd have lots of real-time production/manufacturing data that we'd try to leverage for process/product variation understanding, or for corrective action when the magic of physics, chemistry, or biology didn't come together during a production run last night. Then it's all hands on deck to figure out 'why?'. It almost sounds like you are trying to come up with a 'final' model that explains everything.

I found that using production data for this was just not, with respect to solving the problem, an efficient approach. The data was too messy, too incomplete, or missing often-impactful noise/nuisance factors to be useful and representative of what we could expect to encounter in the future. So what did we do? We'd use the data at hand to identify LIKELY important factors, then go 'offline' and conduct thoughtful, process-space-encompassing designed experiments to induce factor variation in a sequential way, leading to an efficient (resource-wise, not model/statistic-wise) path to ultimate problem resolution. At the end of the day, as one of the engineers I worked with often said, "You don't know the root cause of a failure mode until you can turn the failure mode off and on." The only efficient way I know of to gain that knowledge is through DOE... not torturing happenstance data into a model. Just my opinion. Take it for what it's worth.

SDF1
Super User

Re: Simulate vs Bootstrap methods in JMP: which to use when and why

Hi @P_Bartell ,

 

  Thanks for your reply. Your description of what you've experienced in the past is the closest to what I have tried to explain in my previous posts. The goal of this approach is not to find a perfectly functioning model, but to help production in weeding out those process data that DON'T matter. Our process doesn't really allow for performing true DOEs in production -- WAY too expensive! Instead, if we can narrow down the possible production data streams to ones that are the most LIKELY influencers, then we can do DOEs in the lab, just as you pointed out. This is what we do.

 

  We know from our R&D lab that certain factors affect certain outcomes, how and why. However, the small batch-wise (a few kg/day) and highly controlled environment of making our product in the lab is FAR different from a continuous full scale production, making over 500 tons of material a day. The continuous production equipment influences the outcomes differently than how things are done in the small lab batch setting. And, with our process, there's no "scale-up" or "scale-down" factor that can simply translate the batch to the continuous process, or vice versa. So, it's much harder with the continuous production to find which process might be drifting or fluctuating and leading to out of spec product.

 

  So in coming back to what I was trying to learn about originally, I would like to learn more about similarities and differences in the two methods by a technical discussion. Are there cases where one method is preferred over another, and why? In doing so, I'm hoping to learn why the two methods provide such different results so that I can come up with a plan on how to find a more reliable and robust approach to determining those most LIKELY factors that we need to investigate further. This last step is something I'd work on with my group at my company, not via the JMP community.

 

  Again, the JMP documentation has limited information on these methods (perhaps more on simulate) as well as limited information on where to find more details of each method. Not being an expert in these methods, I thought I'd go to the discussion forum for ideas/input on the simulate vs bootstrap methods. So far from this discussion, I've learned that bootstrap doesn't assume any distribution on the resamples, whereas simulate does assume a distribution of the resamples, and that neither is superior. None of this really addresses what I was hoping to learn. The JMP help had some helpful information but not all, and as is advertised on the bottom of the JMP help pages:

[Image: note from the bottom of a JMP Help page.]

  Hence, coming to the community discussion, hoping to learn more. Again, I am not an expert in these methods, but was hoping someone among the community pages might be, and could either provide more information or resources that I could read to learn more.

 

Thanks!

DS

P_Bartell
Level VIII

Re: Simulate vs Bootstrap methods in JMP: which to use when and why

WOW! We must have worked at the same company... you've described exactly how I spent half of my working career at Eastman Kodak Company. DOE on production film/paper coating processes/machines was verboten due to cost and capacity. The supposed scalability from lab bench to pilot to production was a pipe dream written on paper. Our saving grace was that there were lots of smart people trying their best. I'm going to send you a private message using the Community messaging capability with some additional thoughts.

statman
Super User

Re: Simulate vs Bootstrap methods in JMP: which to use when and why

I will offer some of my thoughts, which may or may not be useful regarding the questions posed by DS. My thoughts align with Pete's, as I have similar experiences. In the grand scheme of trying to understand causality, there are many different approaches. In ALL cases, iteration is required. I will oversimplify two of those approaches:

1. A statistical approach: This is an attempt to look for clues in existing historical/observational data (AKA data mining). It often entails some kind of regression approach (of which there are many), ultimately looking for patterns in the response variables (i.e., Y's) and trying to correlate those with patterns in the input variables (i.e., x's). Instead of just a quantitative look (e.g., bootstrapping, simulating, etc.), use graphical methods to look for patterns. The correlation of patterns should inspire explanations as to why those patterns exist. These explanations are hypotheses. Once hypotheses have been formulated, gather data via directed sampling (e.g., DOE) to build confidence in causal relationships.

2. A scientific/engineering approach (AKA scientific method):  This approach starts with hypotheses about the potential effects of factors on response variables.  These hypotheses are a function of SME experience, intuition, education, understanding of accepted scientific theory, etc.  When there is a large number of variables (e.g., >15), and hypotheses are general, directed sampling can be used to separate and partition the components of variation into smaller subsets. I would strongly recommend this approach for your production processes as it does not "disturb" the process. As the number of potential influential factors is reduced, experimentation can be used to build confidence in causal relationships (as you indicate).

Which approach will be more effective and efficient is always situation dependent.  When we are extremely low on the knowledge continuum (e.g., there is a lack of subject matter knowledge), perhaps approach 1 will be advantageous. If hypotheses already exist, approach 2 may be more advantageous.

Regarding the inability to experiment on the production process (without getting into an argument about short-term vs. long-term thinking), this can add complexity to the investigation. The issue is inference space. What is wanted is for the pilot line to mirror the production process. This is a challenge because the pilot line is in a lab and is often a completely different inference space. Often this means noise will need to be added to the pilot line (e.g., vary ambient conditions, use multiple lots of raw materials). When doing studies on a small scale it is important to exaggerate effects -- both the factor effects (e.g., bold level settings) and the noise effects. The exaggerated noise effects can be accommodated in experiments using blocking and split-plot strategies.

IMHO, simulating and bootstrapping (or any quantitative method) are completely dependent on how you acquired the data and how representative the data is of future conditions.  If the data used for simulating or bootstrapping is not representative, neither method will be useful.

"All models are wrong, some are useful" G.E.P. Box