richnewman

A JMP Script That Determines a Simultaneous 95% Bound Using a K-Nearest Neighbor Approach (2021-US-30MP-911)

Level: Intermediate

 

Rich Newman, Statistician, Intel
Don Kent, Data Analytics and Machine Learning Manager, Intel

 

We have a set of responses that follow some continuous, unknown distribution with responses that are most likely not independent. We want to determine the simultaneous 95% upper or lower bound for each response. As an example, we may want the lower y1 and y2 and upper y3 and y4 bounds such that 95% of the data is simultaneously above y1 and y2 and below y3 and y4. Finding the 95% bound for each response leads to inaccurate coverage. The solution: a method to calculate the simultaneous 95% upper or lower bound for each response using nearest neighbor principles by writing a JMP script to perform the calculations.

 

 

Auto-generated transcript...

 


Speaker

Transcript

rich n Hello, my name is Rich Newman, and I'm a statistician at Intel. And today I'll be presenting on a JMP script that determines a simultaneous 95% bound using a K-nearest neighbor approach.
  This presentation is co-authored by Don Kent, who's also at Intel, and both of us are located in Rio Rancho, New Mexico.
  I'd like to start today's presentation by motivating the problem, and from there, I'll share some possible solutions and ultimately, land on the solution that we went with.
  Along the way, I'll provide some graphs to help further illustrate the points and then finally, I'll show the JMP add-in that we use to solve the problem and some screenshots illustrating the script.
  We're designing a device, and we need to know the worst-case set of four resistance and four capacitance values that we see.
  And worst case for us is defined as the low resistance/high capacitance and the high resistance/low capacitance combinations. So for clarity,
  we have eight variables, and we may need the simultaneous bounds of the four low resistance values and the four high capacitance values, so we can use them to help us design the device. And worst case may be defined as 95% confidence or 99% confidence, and ultimately, that's up to the user.
  Alright, to illustrate this problem, let's use just one resistance and one capacitance value. So I have resistance on the X axis, capacitance on the Y axis, and we want to know two things. We want to know that yellow star, the worst
  low resistance/high capacitance combination, and we want to know that purple pentagon, the high resistance/low capacitance combination.
  Alright, with respect to our problem, I want to point out that we recognize that these eight variables, these eight responses, are not independent. There are some correlations among them.
  Furthermore, each of these responses may or may not follow a normal distribution or the multivariate normal distribution.
  We ask these types of questions frequently. So in other words, we do not want to solve this once and be done with it, and never deal with it again. We get asked these questions
  often, so we really need a robust solution that's easy to use. In our case, we're very fortunate that we tend to have relatively large data sets, at least 400 points, and we typically have
  1,000 points. And for us, a practical solution works as long as it has some statistical methodology behind it. So if I go back to that previous graph, it's not like I'm going to throw a dart on that graph and say, wherever it lands,
  it's going to be our worst case bound. You really want a little more meat behind that, but I do want to point out, we don't necessarily have a definition of worst case, whether it's 99% or 95%.
  And we just know it's better to be a little bit conservative, to make sure we're designing a device that's really going to work and not have any issues in the future.
  Okay, I want to share a completely made-up example, just to illustrate that this type of problem can happen in any industry.
  And so, imagine we made adjustable desks for the classroom, and we want our desks to work for 99% of the population
  based on two variables: a person's height and a person's weight.
  Now, when you have JMP, it comes with some sample data, and one of those data sets is called BigClass. And in BigClass, it has
  some students in there and their heights and weights. And so we can use that data set to help us determine the height and weight bounds that capture 99% of the population.
  So if I look at this graph here on the bottom right, each point represents one student's height and weight combination.
  Okay, if I go back to our problem,
  our current approach, which we believe can be improved, is we independently find 95% or 99% prediction bounds for resistance and capacitance.
  And in this example where I'm just looking at one resistance and one capacitance, we would find two separate bounds.
  So as an example, we would find the 95% prediction bound for the resistance, which is 4.52 and 4.97 and that's designated by the darker blue lines.
  And then we would find the 95% prediction bound for capacitance, which is 15.6 and 16.5 and is designated by the greenish-blue lines, and then we find the combinations to get us that yellow star and purple pentagon, which gets us our worst case.
  Now we have a couple of concerns with this approach. And the first concern is around the Type I error rate. So when I find a 95% bound for resistance 1 and a 95% bound
  for capacitance 1, keep in mind, I have eight variables, so overall I know my confidence level is not 95%. What it actually is will depend on the correlation among the variables. (If the eight responses were independent, for example, the joint coverage would be about 0.95^8, or roughly 66%.)
  Now we can get over this hurdle by making an alpha adjustment, but there's another hurdle that I wanted to discuss that is a bit of a bigger concern for us.
  All right, what if we were interested in the high/high combination, which is designated by this yellow circle?
  In this particular example, you can see we don't have any data near this worst case bound, so if we were to use this, this is extremely conservative.
  And when we go to design our device it's going to have a cost and a time element associated with it, so we want to be a little bit conservative, but we wouldn't want to be so conservative that we would use this yellow circle because we're really not getting any data points around it.
  Okay, there are some alternative approaches that are easily done in JMP that we wanted to consider as solutions to this problem, and the first one is density ellipses.
  And this is found in the Fit Y by X platform, so if I hit that red triangle on the bivariate fit, I can choose density ellipse, and in this case, I chose a 95% ellipse. And I get that red ellipse on the graph.
  Well, when JMP provides this density ellipse, if you look at the bottom right hand corner of the presentation, it presents the mean, standard deviation, and correlation of the two variables.
  What JMP does not provide is the equation of the ellipse. Now, this is a hurdle we can get over. We just have to do some math to be able to solve it.
  But the bigger hurdle is what happens if you have more than two variables? In this case, JMP doesn't have an easy option
  for us to solve this problem. So we could do pairwise ellipses, or just two variables at a time, but we're going to have the same alpha problem, and it's going to be pretty difficult to pick out what points we want to use as our worst-case bounds.
  Now, there's also one other very minor concern with this approach: what if we were interested in that high/high corner, that yellow circle? What point on the ellipse do we choose?
  And again, I think that's a hurdle we can get over, but when we have more than two variables, that's a hurdle that's pretty tricky. We're not sure it is easily solved.
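  (For reference, a 95% density ellipse like the one described can also be requested with a short JSL call; the Big Class columns below are just stand-ins for the real resistance/capacitance data.)
      // Hedged sketch: request a 95% density ellipse from the Bivariate (Fit Y by X) platform.
      // Big Class columns are placeholders for the actual data.
      dt = Open( "$SAMPLE_DATA/Big Class.jmp" );
      dt << Bivariate( Y( :weight ), X( :height ), Density Ellipse( 0.95 ) );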
  All right, there's another approach that's easily done in JMP, and that's principal components.
  And what principal components does is it creates new variables, and JMP will label them Prin 1 and Prin 2,
  such that the new variables are orthogonal to each other. And the fact that they're orthogonal we can use to help us solve what's our worst case bounds.
  And this is found in the Multivariate Methods platform, so if you go to Analyze > Multivariate Methods > Principal Components, we can ask for these principal components.
  Now the concern with the principal components approach is that the math is extremely difficult when there are more than two variables.
  Furthermore, in theory, principal components tries to reduce the dimensionality. So, in other words, I may have eight variables that I want to find these worst-case simultaneous bounds on,
  but JMP may come back and say, okay, we found three principal components that really help explain what's going on.
  In that case, we have three equations and eight unknowns, and it really puts us in a difficult place to solve the problem. And so, for that reason, we wouldn't use this principal components approach.
  So where does that leave us? Our goal is to find the simultaneous worst case bound, that high/high, low/high, high/low, or low/low combination.
  We'd like to use JMP to help us solve the problem. It has to be able to handle three or more variables.
  Each variable may or may not be normal. We expect some correlations. The good news is we tend to have relatively large data sets.
  We want to make sure that if we asked for this corner, if you will, that there's data around there and we're not stuck in a situation where there's no data. And again, an easy practical solution may be sufficient.
  Okay, so what I want to do now is explain a concept, and then I'm going to show you how that concept is used to solve our problem.
  So there's a concept out there called the K-nearest neighbors approach and the idea is, you have a point and you find a distance from that point to every other point.
  And then you determine the point's k-nearest neighbors, or the k points with the shortest distances. So to make sense out of this, let's look at an example. Let's focus on that point that's highlighted in dark red, and its coordinates are
  4.51 and 15.66. If I take that point in blue, whose coordinates are 4.67 and 15.54, I can find the distance between those points.
  The idea of nearest neighbors is that, for that point in red, I can find the distance from that point to every other point in the data set.
  And then I can sort the distances from smallest to largest and then I can pluck off the ones that I want.
  So, for example, if I wanted to know, when k=2, the two points that are the nearest neighbors to the one in red, I see they're the two points in pink, and they're the ones that are the closest to the red one.
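  (As a minimal worked example of that distance, using the two points read off the slide:)
      // Euclidean distance between the red point (4.51, 15.66) and the blue point (4.67, 15.54).
      // The coordinates are illustrative values from the slide, not real data.
      d = Sqrt( (4.51 - 4.67) ^ 2 + (15.66 - 15.54) ^ 2 );
      Show( d );  // d = 0.2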
  Okay, so what I want to do now is talk through the solution at a high level, and then I'm going to slow down and walk through the steps. So our solution was based on: if you have a large data set,
  we can first find the median +/-3 standard deviations for each variable (and I want to point out, you could also use the mean),
  and in doing so, we define what we call our targeted corners, or our desired corners, based on the low or the high point of each variable.
  And then, what we're going to do is we're going to find the distance from each point to that targeted corner, we'll sort the distances from smallest to largest,
  and, in our case, for our needs (and I'll explain this a little bit more in an upcoming slide), we take the k-nearest neighbors to the targeted corner. And in general, you can collect the k neighbors that represent the desired confidence. Again, I'll explain this a little bit more in just a second.
  Then what we do is, we take the average of the k-neighbors and that becomes our solution.
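  (For readers who want to reproduce the idea, here is a minimal JSL sketch of those steps for a single low/high targeted corner. The column names r and c, the 3-sigma multiplier, and k = 5 are placeholders, not the production add-in.)
      // Minimal sketch of the k-nearest-neighbor bound for one targeted corner (low r, high c).
      dt = Current Data Table();
      x = dt:r << Get Values;                      // variable wanted on the low side (placeholder name)
      y = dt:c << Get Values;                      // variable wanted on the high side (placeholder name)
      corner = {Median( x ) - 3 * Std Dev( x ), Median( y ) + 3 * Std Dev( y )};
      d = Sqrt( (x - corner[1]) :* (x - corner[1]) + (y - corner[2]) :* (y - corner[2]) );
      k = 5;
      idx = Rank( d );                             // row numbers ordered from closest to farthest
      nearest = idx[1 :: k];                       // the k points closest to the targeted corner
      solution = {Mean( x[nearest] ), Mean( y[nearest] )};  // average of the k neighbors
      Show( corner, nearest, solution );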
  Okay, so here's the idea: we start with our two variables, and we find the median +/-3 standard deviations (we can also use the mean).
  And in doing so, we call those our targeted corners. So if I was interested in the low/high corner and the high/low corner, that yellow star and the purple pentagon, those start as what we call our defined targeted corners.
  The next step is we take all the data points and we find the distance from every point in the data set to those targeted corners.
  And once we get those distances we sort them from smallest to largest.
  And then, in this particular example, we find the k-nearest neighbors closest to the targeted corners. So just as an illustration, you can see the five points in yellow
  are the five points closest to the yellow star, and the five points in purple are the ones that are closest to the purple pentagon.
  Then what we do is, we take the average of those five points, respectively, and in doing so, these black ellipses represent what we would use as our worst case value.
  And then we will use that to help us design our device.
  Now I want to show you those points relative to the density ellipse and relative to these targeted corners; the yellow star and the purple pentagon are what our original method gave us.
  And we can see the density ellipses aren't bad; they do a little bit better. And what's really nice about this particular solution is we have data points near them, and that's exactly as it is designed.
  Furthermore, it's not too conservative for us, so we don't have to pay this extra cost, if you will, when we design the device, and we didn't have to worry about the distributions of the data. The correlations are really not a concern in how we solve our problem.
  All right, let's say you were interested in the high/high and low/low bounds, that light blue pentagon and that darker blue star. This method works as well.
  And what you see in the black ellipses, our solution, is that again we have data points here.
  And so, for us, this is a wonderful approach because, especially relative to our current approach, it's not too conservative. It may be a little bit conservative but it's not as conservative as this pentagon and the star.
  Okay, earlier I made some comments about how we approach it, and there are some choices, so I want to discuss now the choice of k and whether we should average.
  And to me, that k may be based on your confidence level, your sample size, and kind of your philosophy, and let me explain.
  As an example, let's say I had 1,000 data points and I wanted to be 95% confident.
  In that case, I can take the 25th closest distance for each of the two corners, and that would be 2.5% out on the low/high side and 2.5% out on the other side, and together I've captured my 95% confidence.
  So I could just take the 25th closest distance and be done. That's one approach. Or I can take the 23rd, 24th, 25th, 26th, and 27th distances,
  and take the average of those five values and use that as the approach. So there are a couple of different ways you can handle it.
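  (The bookkeeping being described is simple; here is a sketch, assuming n = 1,000 points and 95% coverage split evenly across the two corners:)
      // Hypothetical choice of k: 1,000 points, 95% simultaneous coverage, 2.5% per corner.
      n = 1000;
      alphaPerCorner = 0.05 / 2;
      kIndex = Ceiling( n * alphaPerCorner );      // = 25, so take the 25th closest distance...
      window = (kIndex - 2) :: (kIndex + 2);       // ...or average the 23rd through 27th neighbors
      Show( kIndex, window );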
  For our particular needs, again, we have very large sample sizes, we want to be a little conservative, and we're not driven by 95% or 99% confidence. So just for illustration purposes,
  those orange circles on the graph on the right may represent, as an example, the 95% confidence bound; that may just be the average of five points or maybe that 25th closest distance.
  And what we would do is instead of using that approach, we would actually take the average of the first 25 points, and in doing so we'd end up with the black ellipses and you can see they're moving out and it's making it a little more conservative.
  And so we do that by design, a little more conservative.
  And again, it's a choice, and for us what's nice is it's not as conservative as those desired corners, our current approach, so we get a little conservative nature in there, but we're not grossly conservative.
  Alright. So what are the pros and cons of this approach? The positives are we do not need to know the distribution of the variables.
  We can easily handle some correlation or dependent variables, we can easily handle multiple variables, especially more than two.
  We know there's data close to that solution and part of that's dependent on that large data size and we can build a script in an add-in in JMP to easily perform the calculations.
  The negative is that it does require a decent-sized data set, because if you want that 99% confidence level, as an example, or other really high confidence levels, you really need lots of data.
  All right, so this is what our add-in looks like, this user interface, and so you have the possible variables in the upper left hand corner.
  And then on the high side you enter in, for example, that we want the high values of the resistance, and that's in that green highlighting. If I go to the purple highlighting, we can add in some values on the low side, and so in this example, we'd say we want the low combination of capacitance.
  Then the next thing you have to do is enter in your confidence that you want.
  And we have a recall button, which is nice for convenience for people.
  And then we have our team logo, which makes the item look nice and professional.
  Alright, once you run this add-in, it will trigger the scripts and the output will be a JMP table.
  And in this JMP table, the first thing I want to point out, in this green highlighting, is that for all eight variables, we're getting the median, the standard deviation, and whether it was the low side or high side we were interested in.
  And so from there we'll build the desired corner, so it would be the median minus three standard deviations for the low side and the median plus three standard deviations on the high side. And again, in purple now, this is our targeted corner.
  Then the next thing we do is find the distances for all the points
  to that desired corner. And then we're going to find, in this example, the five points that are closest to that desired corner, and those are the neighbor values; they're going to be our vector of five values. The neighbor indices I'll explain a little bit more on the next slide.
  Then, in blue, we take the average of those five nearest neighbors, and that becomes our solution to the problem. So that column in blue holds the worst case values that we would use to help us design the device.
  Now we also have a column in here called Neighbor Z-score, and what that is, it takes that neighbor average, our solution, and kind of works backwards and sees how many standard deviations away it is from the median.
  And the reason why we do that is because our original approach was to take roughly this median plus or minus three standard deviations.
  And what we're finding is, to get what we want, we can actually use a much smaller multiplier. So this was just helping us know how conservative or overly conservative our current method is. It's not being used in any calculations, other than just helping us understand.
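  (In other words, the Neighbor Z-score is just a back-calculation; here is a sketch with made-up numbers:)
      // Placeholder numbers, only to show how the Neighbor Z-score is defined.
      med = 4.75;  sd = 0.12;  neighborAvg = 4.53;
      neighborZ = (neighborAvg - med) / sd;        // about -1.8 sigma, versus the -3 used for the corner
      Show( neighborZ );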
  All right, I mentioned the neighbor indices. So in the upper right, I highlighted in purple the 109 and 126.
  That corresponds to the rows of the data. So when you run this add-in, you get your JMP table.
  It'll tell you which five rows represent the five nearest neighbors,
  and it also selects them in your original data set. And what's nice about that is, it makes it easy to color code. So earlier, I showed you the example with the five yellow and the five purple points.
  And it's easy, once you run this add-in, to change the colors right after running it.
  Alright, so this is what the output looks like for our eight variables.
  And you can see, the green points represent the low resistance/high capacitance values, the red points represent the high resistance/low capacitance values. And I just want to point out in the bottom right part of this graph that I've highlighted in purple,
  that you can see, for example, that the green and red points are not the most extreme points for any given variable, and that's the simultaneous aspect of this problem.
  It's really solving the problem across all eight variables. And so a point may be extreme for some variables and not for others, and that's fine with us.
  And in doing so, that's helping us understand the relationships between our variables, and this again would be a graphical display of what our final solution would be, our worst case values.
  Okay, let me go back to that BigClass data set just to illustrate the add-in, and I want to recognize that this is a small data set, just for illustration purposes. I can run the script and ask for the high/high side, and I can run it and ask for the low/low side.
  In doing so again, like before, I'm going to get the median and the standard deviation. I'm going to use it to find the bound.
  And that's my desired corner, then I'm going to find the five neighbors that are closest to that bound. And you can see in green, those are the actual values, and those are plotted on the graph in yellow and purple.
  The neighbor indices refer to the rows and that's what allows me to color code my data quickly.
  The neighbor average is our yellow star and our purple pentagon, and that would be our solution to the problem. And again, we do the Z-score just so we have an idea internally for how conservative this method is.
  Okay, at this point, I just want to highlight some points of our script. So when we built our script, you can see in the upper right, you know, we built this panel box.
  In light blue, that's where we ask the user to input the variables to be found on the high side and the variables to be found on the low side.
  In that copperish color, that's where we're getting the input of the percentile, and that's actually a number.
  And you can also see, on lines 222 and 223, that's where we're building a recall function.
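  (A stripped-down sketch of a dialog along those lines is shown below; the layout, names, and recall logic of the real add-in are not reproduced here, and a data table is assumed to be open.)
      // Stripped-down sketch of the user interface; names and layout are illustrative only.
      dlg = New Window( "Simultaneous Bound",
          Panel Box( "Select Columns", allCols = Col List Box( All ) ),
          Panel Box( "High Side", highCols = Col List Box( width( 200 ), nLines( 5 ) ) ),
          Panel Box( "Low Side", lowCols = Col List Box( width( 200 ), nLines( 5 ) ) ),
          Panel Box( "Nearest Neighbor Percentile", pctBox = Number Edit Box( 95 ) ),
          H List Box(
              Button Box( "Add High", highCols << Append( allCols << Get Selected ) ),
              Button Box( "Add Low", lowCols << Append( allCols << Get Selected ) ),
              Button Box( "OK",
                  highList = highCols << Get Items;   // variables to bound on the high side
                  lowList = lowCols << Get Items;     // variables to bound on the low side
                  pct = pctBox << Get;                // nearest neighbor percentile, as a number
              )
          )
      );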
  Alright, one of the things that we like to do is data quality checks, because we want this to be as mistake-proof and error-proof as we possibly can.
  And so, for example, you have to input that confidence level as a number, and then we do a check to make sure it's between zero and 100, or the user gets an error message that they need to change their input.
  Likewise on the bottom, we need something in the low list or the high list in order to run this, so we have these data checks to make sure that something's been input.
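  (Continuing the dialog sketch above, those checks might look roughly like this; the wording of the messages is a placeholder.)
      // Minimal input checks in the spirit of the ones described; messages are illustrative.
      If( Is Missing( pct ) | pct <= 0 | pct >= 100,
          Throw( "Please enter a nearest neighbor percentile between 0 and 100." )
      );
      If( N Items( lowList ) == 0 & N Items( highList ) == 0,
          Throw( "Please add at least one column to the low list or the high list." )
      );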
  Alright, the next thing we have is something called the sumcols function.
  And what the sumcols function does is it loops through the passed columns. So, in other words, it's taking the lows and highs that you've input
  and returns a dictionary, or associative array, with some information that's important to us. So if I look on the bottom right,
  for example, in orange, it's going to collect information on the data type,
  how many data points there were, what's the median, what's the standard deviation. In purple it's going to do that calculation and get us our desired
  corner, or targeted corner, that we are going to call our bound. And then in blue, that AA, that's our associative array, and that's storing all the information that we need for the next step.
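  (Something in that spirit is sketched below; this is not the production sumcols function, and the names, keys, and 3-sigma multiplier are placeholders.)
      // Sketch in the spirit of sumcols: loop over the passed columns and store
      // per-column summaries (n, median, std dev, and the 3-sigma bound) in an associative array.
      summarizeCols = Function( {dt, colNames, side},   // side is "Low" or "High"
          {aa, i, v, bound},
          aa = Associative Array();
          For( i = 1, i <= N Items( colNames ), i++,
              v = Column( dt, colNames[i] ) << Get Values;
              bound = If( side == "Low",
                  Median( v ) - 3 * Std Dev( v ),       // targeted corner on the low side
                  Median( v ) + 3 * Std Dev( v )        // targeted corner on the high side
              );
              aa[colNames[i]] = Eval List( {N Row( v ), Median( v ), Std Dev( v ), bound} );
          );
          aa;                                           // returned for the distance step
      );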
  All right, what we do from here is we start to calculate those distances. And so you can see in the blue highlight,
  and that's on the low side, and if you go right under it, that's on the high side. We start finding distance from each point to that targeted corner.
  And we have to do the math that was shown earlier, where we're squaring and then taking the square root, but in essence we're just finding the distance from each point to that three-sigma corner.
  And then on the bottom, we're going to go through the process of sorting them, ordering those distances, and then plucking off the ones we need.
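  (That distance-and-sort step, sketched with a tiny stand-in matrix; in the real script, X and bound come from the user's column choices and the sumcols output.)
      // Tiny stand-in data: X has one column per chosen variable; bound holds each
      // variable's targeted corner (low side for the first, high side for the second, here).
      X = [4.5 15.7, 4.6 15.5, 4.8 16.1, 4.4 15.9];
      bound = [4.3 16.4];
      n = N Row( X );
      diff = X - J( n, 1, 1 ) * bound;                       // subtract the corner from every row
      dist = Sqrt( (diff :* diff) * J( N Col( X ), 1, 1 ) ); // row-wise Euclidean distance to the corner
      ordering = Rank( dist );                               // row numbers, closest to the corner first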
  And I want to point out the orange highlight on lines six, seven, and eight.
  Whenever we write our scripts, we do our best to have comments. And sometimes when one person is working on a script, unfortunately, they get pulled to something else, and someone else can
  finish it, so it's really nice to have these comments so someone else can take over and really understand what's going on.
  Furthermore, these comments are nice even if you're the only one working on it; if you have to go back to it, you know, years later, you remember what each step is doing. So we do our best to put in comments.
  All right, finally, what we do is, you know, we run through the low and the high side and we return some values in a dictionary. And you can see in blue, what we do is we pluck off the minimum neighbors, and that's the vector that came out in the output.
  Then we take the average of them, and that's our solution; that's happening in orange. And just as an example, in purple, that's where we find our Z-score. So once we have all those distances and have sorted them,
  we then pluck off the information we need to build the table that's our output.
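  (Continuing the sketch above: the last step is a subscript, an average, and the back-calculated Z-score.)
      // Continuing the sketch above (k is tiny only because the toy X has four rows).
      k = 2;
      neighborIdx = ordering[1 :: k];                   // rows of the k nearest neighbors
      neighborVals = X[neighborIdx, 0];                 // their values, one row per neighbor
      neighborAvg = (J( 1, k, 1 ) * neighborVals) / k;  // column-wise average: the reported worst-case values
      neighborZ = (neighborAvg[1] - Median( X[0, 1] )) / Std Dev( X[0, 1] );  // Z-score for the first variable
      Show( neighborIdx, neighborAvg, neighborZ );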
  Alright, so putting all this together. Our motivating problem is we wanted to find these simultaneous worst case bounds to help us design a device and our current solution is too conservative; it costs us money and time.
  And when we go to solve this problem, we know the data may or may not follow the multivariate normal distribution, our data is not independent.
  And the frequency of this really requires a simple solution, and preferably with JMP, and so our solution
  was to build a JMP add-in that's easy to use, it uses the k-nearest neighbors concept, and the output is easy to understand, and it helps us quickly build those graphs that we can color code, so we can show others.
  All right, thank you very much.
Comments

Hello @richnewman, Interesting approach!  Thank you for sharing this with the community and @Discovery Summit Americas 2021.  I am wondering if you have an example add-in that you can share for those of us interested in trying out this approach in the context of our own application?  Thank you in advance for your consideration. 

Also, I have one question about the choice of the k sigma multiplier that you are using here (in this case k = 3 to cover 99.865% of the population on either side of the mean under the assumption of normality), which we can get to in JMP with the following:

Normal Distribution(3)

 

PatrickGiuliano_0-1635812177822.png

Is the choice "coverage" in this context really Confidence Level? To me this speaks more to the "coverage" (proportion covered under the standard normal distribution assuming a mean and sigma calculated under the normality assumption; or in your case a median to handle the "non-normal" case) and not to a Confidence Level.  Is this how we refer to it in the context of "K Nearest Neighbors"? 

Thanks, @PatrickGiuliano 

 

richnewman

Hi Patrick (@PatrickGiuliano),  

 

1) With respect to the add-in, I unfortunately cannot send you what we use due to IP reasons.  We did try to include the "meat" of the script in the presentation so someone could reproduce it themselves.  While I recognize that this is not the ideal situation, hopefully you can use what was provided to build something similar.

 

2) There are two "confidence" or "coverage" type ideas going on. 

a) The first is the mean (or median) +/- 3 std devs.  This generates our desired corners.  We use the desired corners in the equation to find the distance.  In other words, for each observed point, we find the distance from that point to the desired corner.  Thus, the mean +/- 3 std devs is really for direction and not for a particular confidence or coverage.  I tried using mean +/- 4 std devs, +/- 5 std devs, ... -- all gave similar results.

b) The second is the "nearest neighbor percentile" that is inputted in the dialog box (that you show above -- in copper / brown coloring).  This value does have a coverage or confidence type definition to it.  This value drives the choice of k.  For example, if I have 1000 observations and I input a percentile of 99%, then I want the 10th smallest distance to the desired corner.  In theory, we expect 1% of distribution to be "outside" this point.  Instead of calculating one point, I may take an average of 5 neighbors.  For example, I may take the average of the 8th, 9th, 10th, 11th, and 12th nearest neighbor to represent the point that has 99% coverage (or where 1% of the distribution is expected to be "outside" this point).

 

Hopefully this makes sense.  If not, I would be happy to walk through a data set with you.

 

Regards,

Rich