Tracking Down Faults in Complex Systems: A DOE and BayesFLo Case Study (2025-US-30MP-2561)

Testing plays a crucial role in ensuring the reliability and robustness of engineered systems, especially in safety-critical domains like aviation. In this talk, we present a case study of the Traffic Alert and Collision Avoidance System (TCAS) and illustrate how JMP can be used to facilitate the testing of complex engineered systems. By applying design of experiments (DOE) and Bayesian fault localization (BayesFLo) techniques, we demonstrate how to construct rigorous test cases and identify software faults using JMP. Additionally, we show how Bayesian methods can use the prior knowledge of domain experts to analyze the outcomes from testing complex engineered systems.

Hello, everyone. My name is Irene Ji. I'm a Research Statistician Developer in JMP's DOE and reliability group. Today I'm going to talk about a new feature in JMP 19: BayesFLo, or Bayesian fault localization.

First, let me formulate the problem we are trying to solve. The data we have consists of inputs and outputs. The inputs are categorical factors or discretized continuous factors. The outputs are categorical outcomes with no noise. Linear regression is not suitable for this kind of data.

Our goal here is to find input combinations that are responsible for one outcome out of the multiple categorical outcomes.

Let me first use a simple coffee shop example to illustrate this problem. As you see here, we have a coffee shop on the left, and for one day we have recorded the customers, their characteristics, and whether each person bought any coffee in this coffee shop.

Some come in the morning and some come in the evening. Some are members, some are nonmembers, some have discounts, and some don't have any discounts. Three out of the five customers bought coffee, and two didn't buy.

The question we are trying to answer here is: what factors are preventing people from buying coffee? As I mentioned, this cannot be solved using linear regression, but it is similar to linear regression in terms of the confounding problem.

For example, in this problem, we may find that the combination of nonmember and no discount explains the two people who didn't buy coffee. The combination of morning and nonmember can also explain one of the customers who didn't buy coffee. Which one is more likely to be the real reason why people don't buy coffee? That's the problem we are trying to solve.
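The confounding described above can be sketched in a few lines of Python. The five customer records below are made up for illustration (the talk does not list the actual rows); they are chosen so that three customers buy, two do not, and the two explanations mentioned above both survive. A combination is kept as a candidate only if it appears in at least one non-buying row and in no buying row.

```python
from itertools import combinations

# Hypothetical records: (time, membership, discount, bought).
# These rows are assumptions for illustration, not data from the talk.
FACTORS = ("time", "membership", "discount")
CUSTOMERS = [
    ("morning", "member", "no", True),
    ("evening", "member", "no", True),
    ("evening", "nonmember", "yes", True),
    ("morning", "nonmember", "no", False),
    ("evening", "nonmember", "no", False),
]


def candidate_combinations(rows, factors, max_size=2):
    """Return {combination: failure count} for combinations of up to
    max_size factors that occur in failing rows but in no passing row."""
    passing = [r[:-1] for r in rows if r[-1]]
    failing = [r[:-1] for r in rows if not r[-1]]
    counts = {}
    for size in range(1, max_size + 1):
        for idx in combinations(range(len(factors)), size):
            # Level combinations on these columns that some buyer has.
            seen_in_pass = {tuple(p[i] for i in idx) for p in passing}
            for f in failing:
                levels = tuple(f[i] for i in idx)
                if levels in seen_in_pass:
                    continue  # a buyer had this combination, so rule it out
                key = tuple((factors[i], f[i]) for i in idx)
                counts[key] = counts.get(key, 0) + 1
    return counts


candidates = candidate_combinations(CUSTOMERS, FACTORS)
for combo, n_failures in sorted(candidates.items(), key=lambda kv: -kv[1]):
    print(combo, n_failures)
```

With these made-up rows, two candidates remain: nonmember with no discount (in both non-buying rows) and morning with nonmember (in one). The data alone cannot say which is the real cause.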

Now, besides this simple illustration, we also have a sample use case which is more realistic. It's the fault localization for software testing problems. Fault localization is trying to find input combinations that are responsible for the failures. The failure is the one outcome we are interested in.

There are four steps in fault localization problems. The first step is to identify the input factors. The second step is to construct test cases. The third step is to identify failures, and the fourth step is to locate root causes.

Now I'm going to use a case study for testing JMP's Easy DOE platform to illustrate the whole process and how we use JMP to help do this analysis.

Many of you may have seen or even used Easy DOE. On the left, we see its user interface. The first step of our analysis is to identify the input factors we are interested in. There are 12 of them, all shown on the right. Let me go through them one by one.

The first one is mode: whether you want to use Easy DOE in guided mode or flexible mode. Second, there are four response-related factors. You may add a response with a maximize goal, match target goal, minimize goal, or no goal.

The third section is the factor section. We have all three kinds of factors: continuous, discrete numeric, and categorical factors. For the last two categories, we can also specify the number of levels we want.

That gives five factor-related factors. Of the remaining two factors, one is the model type, ranging from main effects up to RSM, and the other is the number of runs, expressed as n extra runs: the number of extra runs you add to the default number of runs. The first step is done; we have defined our input factors. Now we need to construct our test cases.

The way we construct our test cases is to use a covering array. Using a strength-2 covering array, we make sure every two-factor combination is covered at least once. This is a screenshot of the JMP table. We see there are 36 runs. The response is the first column: 35 runs have a 1, and there's only one 0. This 0 is the failure.

Before we go into the failure detection, let me first give you a quick overview of the covering arrays we are using here. A covering array is a design structure. Use this table as an example. This is a 10-factor, strength-3 covering array. For any choice of three columns, each possible combination of levels in these columns occurs at least once.

If you look at this table, we have circled some rows and some columns in three different colors: red, yellow, and green. For each set of circled columns, every combination of 1s and 0s appears at least once.

Using this covering array, we are able to guarantee the coverage of all possible combinations involving at most three factors. We think this is very beneficial for the fault localization problem. That's why we used a covering array to construct our test cases.
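The coverage property described above can be checked directly. The following is a minimal sketch, not the JMP Covering Array platform: the function name is hypothetical, the four-run binary array is a small textbook-style example rather than the table on the slide, and the set of levels for each column is inferred from the levels that appear in the data.

```python
from itertools import combinations, product


def is_covering_array(rows, strength):
    """Check that every choice of `strength` columns contains every
    combination of the levels observed in those columns at least once."""
    n_cols = len(rows[0])
    # Assumption: a column's levels are exactly those seen in the data.
    levels = [sorted({row[c] for row in rows}) for c in range(n_cols)]
    for cols in combinations(range(n_cols), strength):
        seen = {tuple(row[c] for c in cols) for row in rows}
        needed = set(product(*(levels[c] for c in cols)))
        if not needed <= seen:
            return False
    return True


# Illustrative 3-factor binary array with 4 runs (not from the talk).
ROWS = [
    (0, 0, 0),
    (0, 1, 1),
    (1, 0, 1),
    (1, 1, 0),
]

print(is_covering_array(ROWS, strength=2))
print(is_covering_array(ROWS, strength=3))
```

The four runs cover every pair of levels in every pair of columns, so the array has strength 2, but four runs cannot cover all eight three-factor combinations, so it does not have strength 3.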

Now coming back to this table, in the test cases we identified one failure: Run 14. This is where Easy DOE generated a design, but with an unexpected sample size.

Now we can move on to the analysis part. Before JMP 19, we could use this deterministic covering array analysis. The result is shown here.

It first identifies how many failures there are in these 36 test cases. Then it finds which combinations appear only in this one failure. We have eight combinations. It also counts the number of failures they appear in. But because there's only one failure, the failure counts in this case are all just 1.

Now the question becomes: we have eight tied combinations. Other than this failure count, we cannot distinguish between them. We cannot tell whether any of them is more suspicious than the others. We are not sure which one is more likely to be the actual root cause.

That's why we developed this Bayesian fault localization method, and we are using it in JMP 19. It starts from some initial test cases. We identify some failures, and then we do this Bayesian analysis to get a more informative ranking of these combinations.
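A quick count shows how much the passing runs already narrow things down. This sketch assumes that the eight tied combinations are all two-factor combinations, which matches the strength-2 setting of this case study but is not stated explicitly in the talk.

```python
from math import comb

n_factors = 12   # input factors in the Easy DOE case study
strength = 2     # strength of the covering array used for the test cases
n_tied = 8       # tied combinations reported by the deterministic analysis

# Each run fixes one level per factor, so a single failing run contains
# one level combination for every pair of factors.
pairs_in_one_run = comb(n_factors, strength)

# Assumption: the tied combinations are all two-factor combinations, so the
# rest were ruled out because they also appear in at least one passing run.
ruled_out = pairs_in_one_run - n_tied

print(pairs_in_one_run, ruled_out)
```

Under that assumption, the passing runs eliminate most of the two-factor combinations in the failing run, but the failure count alone cannot separate the ones that remain.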

Here we are not using failure counts to rank the combinations; instead, we are using probabilities. The main idea is to use domain knowledge to rank the input combinations. We want to make sure the most suspicious one is the top-ranked combination. I will skip a lot of technical details. If you're interested, you can go to our arXiv paper at this link.

Now, before I use JMP to do a demo, let me first explain what the domain knowledge is for this case study. Look at our 12 factors, and also consider that the nature of the failure is an incorrect sample size.

We are able to categorize our factors into three groups. The first group: response-related factors. We think they have low relevance to this failure, because usually we first get the sample size, get the design, and then get a different response.

The second type is the factor-related factors, these five factors. They have medium relevance to this failure, because we have to define the number of factors we need and the number of levels for each factor before we can create a design.

The last type is the model and run size related factors. We think they are highly relevant, because for different models and for different number of extra runs, you may have different designs.

Now we are moving on to JMP to do a demo. To do this Bayesian fault localization analysis in JMP, you go to DOE, then Special Purpose, then Covering Array. If you just have a data table, as shown here, you can load it into the covering array platform using the load design method.

Now let's move on to JMP. Here's the JMP table we have. As shown here, we have this one failure. There are many table scripts here, and we just click the analysis table script. What's shown here are all the results from the deterministic analysis. As mentioned before, we only have this failure count, so the eight combinations are all tied.

If I want to enable the Bayesian analysis, I just check the rank by probability option. There are three options. The first option assumes we don't have any prior knowledge. Intuitively, it will give the same ranking as the deterministic approach, because every probability is the same.

The second and last options are for cases where we do have some prior knowledge. I'm going to use the third one as a demonstration, because we can easily just convert our ranking of the different factors into this table.

I first select all the factors and then do a quick entry. As mentioned, I think the model-related and run size-related factors are most relevant, so I set them to be most relevant. Now I've updated 3 of the 12 rows in this table. Then I select all the factor-related factors and set their relevance to intermediate. Again, I update the priors and close it.

This has reflected my [inaudible 00:10:24] of all the factors. Now I would just hit Continue. We are going to translate the rankings into the values. The values are also predefined here, so I didn't change them. They're just 4, 2, 1.
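To make the idea of turning relevance levels into a ranking concrete, here is a simplified sketch. It is not the BayesFLo computation that JMP performs. It assumes that (a) exactly one of the tied combinations is the root cause, (b) a combination's prior weight is the product of its factors' weights, and (c) all tied combinations are equally consistent with the test results, so the posterior is proportional to the prior. The factor names and candidate list are hypothetical; only the weights 4, 2, 1 come from the demo.

```python
# Weights for the three relevance levels, as shown in the demo.
RELEVANCE_WEIGHT = {"high": 4, "medium": 2, "low": 1}

# Hypothetical factor names and relevance assignments for illustration.
FACTOR_RELEVANCE = {
    "model_type": "high",
    "n_extra_runs": "high",
    "n_continuous": "medium",
    "n_categorical": "medium",
    "response_maximize": "low",
    "response_minimize": "low",
}

# Hypothetical tied candidates (not the eight combinations from the talk).
CANDIDATES = [
    ("model_type", "n_extra_runs"),
    ("model_type", "n_continuous"),
    ("n_continuous", "n_categorical"),
    ("n_extra_runs", "response_maximize"),
    ("response_maximize", "response_minimize"),
]


def rank_candidates(candidates, relevance, weights):
    """Rank candidates by normalized prior weight (product of factor weights).

    Under the single-root-cause assumption with equally consistent
    candidates, these normalized weights are also the posterior probabilities.
    """
    scores = {}
    for combo in candidates:
        score = 1
        for factor in combo:
            score *= weights[relevance[factor]]
        scores[combo] = score
    total = sum(scores.values())
    ranked = [(combo, score / total) for combo, score in scores.items()]
    return sorted(ranked, key=lambda item: item[1], reverse=True)


ranked = rank_candidates(CANDIDATES, FACTOR_RELEVANCE, RELEVANCE_WEIGHT)
for combo, prob in ranked:
    print(combo, round(prob, 3))

# With equal weights (no prior knowledge) every candidate stays tied.
uniform = rank_candidates(
    CANDIDATES, FACTOR_RELEVANCE, {"high": 1, "medium": 1, "low": 1}
)
```

In this sketch the combination of two highly relevant factors rises to the top, while equal weights reproduce the tie from the deterministic analysis.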

Moving on, if I have any information about certain combinations, for example if I suspect a combination of some categorical levels and some discrete numeric levels, I can select them and then update my prior for this combination. But now I don't have any, so I just select no for this question.

Because I want to keep the same scope for the analysis, I don't change the strength to a larger number; I just look at these eight combinations. Then I update the ranking.

As you see here, the results have changed. Now we do have a more informative ranking of the combinations. We have the first one, the main model type: main effects interaction uncorr, and n extra runs. This one is actually the true root cause of the failure we observe. Now let's move back to the slides.

Next slide. Finally, let's compare some results. Deterministic analysis, only comparing the failure counts, may not be sufficient for this analysis because there are too many confounding effects. Using our Bayesian fault localization, we have a more informed ranking of these eight combinations.

To summarize, this new method gives us more informative rankings, and it can also identify the true root cause in this case, which is the first one.

If I want to do some more testing, this may also give me some idea of where I should continue my test, so we can also guide subsequent testing.

To summarize, we have categorical or discretized continuous factors and categorical outcomes with no noise. This data is not suitable for linear regression. Our goal is to find input combinations that are responsible for one outcome. Our solution is to use this Bayesian fault localization method, which is a Bayesian framework that integrates domain knowledge into the analysis.

Here are the references I used throughout the slides. The second one is our arXiv paper on BayesFLo. Thank you all for your attention. If you have any questions, please feel free to contact me by e-mail.

Presenter

Irene Ji

Skill level

Intermediate

Austin, TX October 21-24

Design of Experiments