
Analyzing Functional Data with Direct Functional PCA in Functional Data Explorer (2021-US-30MP-913)

Level: Advanced

 

Ryan Parker, Sr Research Statistician Developer, JMP

 

In JMP Pro 16, the Direct Functional PCA (DFPCA) modeling option has been added to the Functional Data Explorer (FDE) to provide a way to perform functional PCA without first fitting basis function models. This approach not only makes larger functional data more tractable, but it also provides a more hands-off approach to analyzing functional data. This presentation details how DFPCA works and presents examples that highlight how and when to use DFPCA to analyze functional data.

 

 

Auto-generated transcript...

 



Brendan Leary You're Ryan?
Ryan Parker, JMP You're...
  here.
  I'm good.
  You can hear me OK?
Brendan Leary I can hear you okay; you can hear me okay?
  Yep, all right, good, you got the audio good. Nice to meet you.
  Obviously, doing this, you work for Chris, or...
  Awesome. I'm...
Ryan Parker, JMP Chief data scientist, right? So I have that right?
Brendan Leary That's a big title
  right there.
  Yeah.
Brendan Leary And you're a senior research statistician. I look around, and I think I've seen your face around, but I haven't had the chance to meet you, so nice to meet you.
Ryan Parker, JMP You too. So where are you?
Brendan Leary Based out of New Jersey. I'm going to take the New York and New Jersey projects.
  Awesome.
Brendan Leary I have fun in the field, and I get to show off all the good stuff that you guys build, so...
  I think you're perfect.
  So, as we get started here, just a couple of things.
  Let's see, first things first, your microphone looks good. Are all your cell phone and computer notifications, all that jazz, turned off?
Ryan Parker, JMP Like to...
  be on do not disturb. Yeah, let me double-check the computer.
  Yeah, do not disturb on everything.
Brendan Leary Okay. I don't hear anything, like if you have a fan going or anything.
Ryan Parker, JMP I do, but you can't hear it? Okay.
  No.
Brendan Leary Any pets? Even put away? So, no? No.
Ryan Parker, JMP That's just...
  just little kids, but they will not be around.
Brendan Leary Had a dog incident the last one I did, that's all I'm gonna say.
  All right, alright. Let's see, your display: are you on Windows or Mac?
  Mac? Well, do you know your display settings? It's probably fine if it's on a Mac; usually they ask for 1920 by 1080.
Ryan Parker, JMP The.
Brendan Leary Just for a resolution, but I imagine it's fine.
Ryan Parker, JMP yeah should be OK OK, I.
Brendan Leary want to show what you're going to use it just to JMP journal, or is it a PowerPoint as well.
Ryan Parker, JMP So I'll just share the desktop is that the best way to do it.
Brendan Leary yeah.
Ryan Parker, JMP Or do you like to share.
  One at a time it's probably the easiest way to hide zoom.
  So let's start with this see how this looks.
Brendan Leary Great.
Ryan Parker, JMP Okay, this slide... I'll flip through them all real quick just to make sure nothing weird pops up.
Brendan Leary No funky animations; you're good.
Ryan Parker, JMP Yeah, no weird animations, and then I'll just go to JMP and open this data set up and
  launch from there.
Brendan Leary Perfect. Okay, look, I think we're good. I'd say the only other thing I'd bring up is, before you jump in and you start, we're going to record it. The detail is, you know,
  if you make a mistake, or you want to, we would have to start over.
  So just give a couple of seconds before you start; maybe count down from three or five in your head.
  Okay, a little bit of white space at the beginning and the end, so we can clip it.
Ryan Parker, JMP Right, sure, okay.
Brendan Leary And that's all I got, so I'm going to go ahead and hop on mute. And lastly, I work for JMP, so I don't think I have to, but I'm going to read it just to be safe.
  You understand that this recording is for use by the JMP Discovery Summit conference and will be publicly available in the JMP User Community. Do you give permission for this recording and use?
Ryan Parker, JMP I do, please share it
Brendan Leary widely. Recorded, just in case, so no one can get mad at me; I asked.
Ryan Parker, JMP Yeah, you know, somebody's spouse is like, hey, you did what?
Brendan Leary You know, I know, I know.
  Cool. Ryan, thank you. I'm going to turn off my video and hop on mute; count down in your head and, when you're ready, go ahead and start.
Ryan Parker, JMP Okay, and just so you know, I timed myself to be right about 20 minutes.
  Because I think it's a total 30, right, and then they wanted to give some
  extra time.
Brendan Leary So yeah, brevity is perfect. You're giving time for Q&A; that sounds ideal.
Ryan Parker, JMP Okay, great.
  Sounds great.
  Well, thank you for coming today. My name is Ryan Parker. I'm a
  senior research statistician developer at JMP, and today I'm going to talk about direct functional PCA.
  It's a new way of analyzing functional data. And I just want to acknowledge that our chief data scientist, Chris Gotwalt, has played a major role in not only the development of FDE but
  also this new tool, as well as our test team of Rajneesh, whose work, you know, usually goes unnoticed but is a big part of why we are where we are today.
  And so I'm not going to assume that you have even used FDE, or that maybe you don't even know what functional data is.
  So let's just kind of start off at the beginning to make sure we're all on the same page. Functional data...really we think of it as anything that has an input X. In this case we have some temperature data. Our input is the week, so these are measured
  every week of the year. And our output Y is this temperature.
  And so, you know, you could sample at a finer resolution, maybe every day, or every hour, but the general idea is you haven't completely sampled, you know,
  the whole function. You've got some way to work with some sampling of it. And really in all the cases
  that we care about, although you can use one function, we also want to think about having multiple functions. So we've got multiple weather stations where, every week, we've captured the temperature.
  So, although we have every week filled in here, we don't necessarily have to have that; a feature of FDE is that you can have, you know, some missing sample points, or they can be sampled at different time locations.
  But it also doesn't have to be, you know...your input doesn't have to be time in the traditional sense. Maybe it's temperature, and you've got this clarity measurement as your output.
  You know, so we support that.
  It's really this mapping: you have an input, you have an output. Or maybe you're in a spectral setting where your input is the wavelength and your output is the intensity.
  In an example I'll show to illustrate direct functional PCA, we have multiple streams of data, multiple functions. So not only do we have, you know,
  different functions, but we've got different outputs. We've got, you know, a charge, piston force, a voltage, and a vacuum, and we want to bring all of these together
  and analyze them.
  And so, these kinds of data help motivate the development of Functional Data Explorer.
  And there are two primary questions that we
  usually use it to answer. The first is in a functional design of experiments setting. In this setting, we have factors that we want to try and relate
  to our response function. So here in this case, we're really interested in how we get this response function to be shaped a certain way.
  So we can link up these factors to the function and, in this case, we wanted our function to remain in the green specification area for as long as we can, and with FDE, we can help do that.
  The other common case is what we call functional machine learning, and in this case, we want to think more of our functions as being
  inputs to something. So in our process, maybe we have the final results or a final yield, in the case of this fermentation process data.
  So we want to summarize the shapes of these functions and use them as inputs to a predictive model to help figure out what is going to give me the best result.
  And so, really, the big game that you try to play with this is functional PCA, and what functional PCA is doing is summarizing our data.
  So I have here a really simple example where it's probably pretty easy to see that they all have different slopes.
  And it may be a little harder to see that the means are a little different for each one. So our goal, to motivate this, is to try and summarize these, you know...
  use a simple case to see how we can expand this to more complex situations.
  And so, when we do the eigendecomposition, we'll get orthogonal eigenfunctions that are going to explain as much of the function-to-function variation as possible.
  And, once we have those, we can use them to extract summaries from all these functions that we
  can use in predictive modeling. So we take, you know, some really complex shape...in this case it's not super complex, but we're going to summarize it to a couple of data points from, you know, the 10 or so that we have here.
  This is what an eigenfunction looks like
  for these data. So we pulled out two. The first is explaining around 77% of the variation, and it's giving the most weight to the very beginning and to the very end. And
  this is, as you might expect, summarizing...it's quantifying the slope of these data. Whereas the second eigenfunction is giving equal weight over the whole input, and that's quantifying that difference in the mean.
  So now we want to take these eigenfunctions, use them with our data, and get a quick summary that we can use to explain the differences between these functions. So,
  taking Function 2 as an example, we multiply this function times this first eigenfunction. We're going to get a function that has a lot of negative numbers.
  So if you kind of think about, you know, taking the integral of this, just kind of adding up all those negative numbers, you're going to get a fairly large negative number.
  So we can see how this first component is really capturing the differences. So number 10, right, was a large positive number, and we can kind of see back in our data that it also had that kind of large positive slope.
  And it's a similar idea for the second component, where we have, you know, higher versus lower overall averages.
  And, once we have these two things, we can then go back to our original function.
  Adding in an overall mean, we just take the first functional principal component score, looking here at Function 1, and multiply it by the first eigenfunction;
  then we add in the second functional principal component score multiplied by the second eigenfunction, and
  that's going to recreate this first one. A similar process, using the scores with the second function, allows us to
  reconstruct or approximate those functions. So you can kind of see where, okay, if we are able to build models for these FPC scores,
  we can, you know, understand how DOE factors change them, in which case that's changing the shape of our output function, in sort of that first scenario we looked at.
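To make that walkthrough concrete, here is a minimal NumPy sketch of the same pipeline: decompose, score, reconstruct. This is not JMP's implementation; the toy linear functions, the grid size, and the two-component choice are all assumptions made for illustration.

```python
import numpy as np

# Toy data: 10 functions on a common regular grid, each roughly linear
# with its own slope and mean, like the simple example in the slides.
rng = np.random.default_rng(0)
t = np.linspace(0, 1, 50)
slopes = rng.normal(0, 2, size=10)
means = rng.normal(0, 1, size=10)
Y = means[:, None] + slopes[:, None] * (t - 0.5) + rng.normal(0, 0.05, (10, 50))

# Center first: FPCA explains variation *around* the overall mean function.
mu = Y.mean(axis=0)
Yc = Y - mu

# SVD of the centered matrix gives discrete eigenfunctions (rows of Vt)
# and FPC scores (U * s); on a regular grid this is ordinary PCA.
U, s, Vt = np.linalg.svd(Yc, full_matrices=False)
eigenfunctions = Vt                 # eigenfunctions[k] = k-th eigenfunction
scores = U * s                      # scores[i, k] = function i's k-th score
print("variation explained:", np.round(s**2 / np.sum(s**2), 2)[:2])

# A score is (up to grid spacing) the integral of a centered function times
# an eigenfunction -- the "add up all those negative numbers" idea above.
print(np.isclose(Yc[1] @ eigenfunctions[0], scores[1, 0]))   # True

# Reconstruction: overall mean + score-weighted eigenfunctions (2 components).
Y_hat = mu + scores[:, :2] @ eigenfunctions[:2]
print("max approximation error:", np.abs(Y - Y_hat).max())
```

With clean slope-plus-mean functions like these, the first two components recover essentially all of the function-to-function variation, which mirrors the story above.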
  So let's kind of go into why direct functional PCA. Again, what motivated us to do this? So if you have used FDE before,
  the modeling options we have are considered basis function models. And in these cases, they're really smoothing the data first: we're going to fit the model, and we're going to get a smooth function that we can operate with. But part of the problem is we may have
  a lot of things that we have to kind of tune. So in the case of a B-spline, we have to pick, you know, what's the best
  degree of the polynomial to use in the spline, or how many knots should we have? And we kind of give you some defaults for that, but, you know, we also allow you to change the locations of those knots. And this is great and works for a lot of cases,
  but as you get larger sample sizes or more complex functions, this could take a lot of time, and it may be intractable to really tune all the locations of these knots.
  But the way FDE works now is we fit the model, we'll perform FPCA on the coefficients of those models, and there's this nice relationship where,
  you know, now our eigenfunctions are in the same form as the model of our data. So there are a lot of nice things that come with it, but whether it be costly computing or the models just not doing well, we needed another approach.
  And so,
  where the previous approach is smoothing the data first, now we're thinking, okay, let's just take the data as they are. Let's operate directly on that.
  And then from there, let's smooth the eigenfunctions that we get. So this isn't a fair, you know, apples-to-apples comparison, but with B-splines compared to direct FPCA, you'll tend to notice that the
  eigenfunctions you get are a little smoother with direct FPCA, and that's by design; we wanted it to be a little smoother. You know, is this little artifact here really that important?
  Direct FPCA says maybe it isn't. I think it's really captured in the last eigenfunction, where it doesn't explain a lot of the variation; we're kind of getting some weird bumps here.
  You know, an expert maybe ought to analyze this and say, no, that's actually real, but most likely we shouldn't be giving a lot of weight to this eigenfunction.
  Probably not using it.
  And so the algorithm we use to fit a direct functional PCA model is similar in spirit to the Rice and Silverman method, mainly in that it's an iterative process. So what we do is, you know...
  first I should mention that the data need to be on a regular grid. So if you do not have your data on a regular grid, we will interpolate it onto one for you. We also have some data processing step options, called Reduce, that you can
  apply to kind of finely control the grid that we operate on.
  But in our procedure, we'll take one eigenfunction...
  we can just ask for the first component...
  fit a smoothing model to that, and then ask for the next one.
  And once we've smoothed the next one, we need to make some adjustments so that we get, you know, the orthogonality properties that we want.
  But the idea here is that we've taken a problem where we were trying to smooth a lot of different functions to first get models to work with, and now we're going to
  really focus our efforts on smoothing these eigenfunctions one at a time. And we have a much smaller number of them, so this makes this technique much faster for large data sets than existing solutions in JMP.
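The transcript doesn't give JMP's exact algorithm, but the loop just described — pull out one component, smooth it, re-orthogonalize, then deflate and repeat — can be sketched roughly in Python. The Savitzky-Golay smoother below is a stand-in assumption for whatever smoothing model FDE actually fits to each eigenfunction.

```python
import numpy as np
from scipy.signal import savgol_filter

def smoothed_fpca(Y, n_components=2, window=11, polyorder=3):
    """Rough sketch of one-at-a-time FPCA with smoothed eigenfunctions.

    Y is an (n_functions, n_grid) array on a regular grid.
    """
    Yc = Y - Y.mean(axis=0)        # work with centered functions
    phis, residual = [], Yc.copy()
    for _ in range(n_components):
        # Leading right-singular vector of the residual = next raw eigenfunction.
        _, _, Vt = np.linalg.svd(residual, full_matrices=False)
        phi = savgol_filter(Vt[0], window_length=window, polyorder=polyorder)
        # Re-orthogonalize against the eigenfunctions already kept
        # (Gram-Schmidt), since smoothing breaks exact orthogonality.
        for prev in phis:
            phi = phi - (phi @ prev) * prev
        phi = phi / np.linalg.norm(phi)
        phis.append(phi)
        # Deflate: remove this component before extracting the next one.
        residual = residual - np.outer(residual @ phi, phi)
    phis = np.array(phis)
    return phis, Yc @ phis.T       # eigenfunctions and FPC scores
```

The point of the design is visible in the loop: the smoothing effort goes into a handful of eigenfunctions rather than into pre-smoothing every one of possibly thousands of curves, which is where the speed comes from.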
  So the example I'm going to go over is in a manufacturing process, and just to kind of give you an idea of the speed-ups:
  in sort of the best-case scenario, where you knew the exact P-spline model that you wanted to use for these data, it would take you three times as long to fit those models as the out-of-the-box direct FPCA solution.
  And the difference is, you don't necessarily know what those models are going to be, so you have to take time to fit multiple models, and now you've really taken a lot of time relative to what direct FPCA is going to do.
  So this example
  looks specifically at a step where we're bonding glass to a wafer.
  In this process, you know, there's a vacuum surrounding it, there are some tools, it's all sitting on this chuck, and
  the process runs and, unfortunately, about 10% of them get destroyed. But this is just sort of in the middle of the process, and you don't get to know until weeks later
  that they were destroyed. So our goal is...we have sensors that are collecting data through this process, and we want to try to use that to identify wafers that we can get rid of
  early. Maybe there's a subset that we can get rid of, so we don't spend any more money on them, and so the goal is to try and identify that using our sensor stream data.
  So now I'll go to JMP.
  And I have a journal, so all these resources will be available on the Community page, but I'll open up this data set.
  And I'll launch
  Functional Data Explorer from the Analyze > Specialized Modeling menu.
  So let's go through each of these columns. We have the wafer ID, so this is just sort of what
  groups our different functions.
  The condition: this is, you know, was it good or bad? We want to
  keep that, we want to use it later, so we'll put it in as a supplementary variable, and that tells FDE, you know, when we save things in the future, go ahead and bring that along with it.
  Then the charge, the flow, piston force, vacuum, and this voltage,
  all part of this process.
  Once we launch FDE,
  we can just kind of scroll through and see...you know, I showed these earlier, but these are all the different types of data we have, and
  it may be possible that not every model would fit each one, you know. The same model that fits charge maybe doesn't fit flow as well.
  But direct FPCA looks at them individually and kind of handles that for us. Before I go to that, I want to show this Reduce option that I mentioned. So by default...
  well, there are three tabs here. You can directly put it on an evenly spaced grid, you can bin the observations, or you can remove every
  nth observation to kind of thin it out. These data are on a grid, so we don't really have to do anything, but we could, you know, say all right, let's do this and, by default, it gives you half of the original data set.
  So now you've taken it down, and the shapes are still fairly the same. In this case we don't have to do it, it's still fast, but if your data are either
  not already on a grid, or you just have a lot of it, by using Reduce you're still able to capture the key features. It's really something worth exploring.
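If you wanted to do the equivalent preparation outside JMP, the idea behind Reduce is just putting every stream on a common, coarser regular grid. A minimal sketch with hypothetical column names:

```python
import numpy as np
import pandas as pd

# Hypothetical long-format sensor table: one row per (wafer, time) reading.
df = pd.DataFrame({
    "wafer_id": ["A", "A", "A", "B", "B", "B", "B"],
    "time":     [0.0, 0.7, 2.0, 0.0, 0.5, 1.1, 2.0],
    "charge":   [1.0, 1.4, 2.1, 0.8, 1.0, 1.5, 1.9],
})

# Common regular grid at roughly half the original sampling density --
# the same spirit as Reduce's default of halving the data set.
grid = np.linspace(df["time"].min(), df["time"].max(), 4)

# Linearly interpolate each wafer's stream onto the shared grid.
reduced = {wid: np.interp(grid, g["time"], g["charge"])
           for wid, g in df.groupby("wafer_id")}
print(pd.DataFrame(reduced, index=grid))
```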
  So, since we have multiple functions, I'll go to this FDE group option
  and launch direct functional PCA.
  And so this takes a few seconds, but it's fitting a model for each one.
  First, charge. We're able to fit it, it works reasonably well, and we have diagnostics available.
  We have a model selection option to let you, you know, change the number of FPC scores. We've identified four as being best for this particular case; for most of the others it actually picks just one.
  If you have used FDE before, you know, you'll see these familiar functional summaries, but we don't have anything else. There's no prior model, right? This kind of...
  functional PCA is the model, and we're focused entirely on that, instead of other things that we necessarily had to tune before,
  along with your score plots and profilers. Scrolling through these, we can see that we are, you know, seemingly fitting these fairly well.
  Piston force...and as I said, you know, a lot of these end up just picking one component. I mean, it's saying this is explaining almost all the variation in that case.
  Okay.
  Good. And again, voltage, so, the last one.
  So if we go back to this group option, we can save the summaries for all of these functions. Now we've effectively, fairly quickly, used FDE to
  load our data and take all of those functions and summarize them down into just a few summaries for each one.
  But the main ones we
  are most interested in, primarily because they're summarizing the variation in the shapes, are these FPC scores.
  But there is still information in the things people used, you know, pre-FDE. What would they do? You take the mean, or you look at standard deviations or other summaries, and these things still have value, so I think it's good that we, by default, bring those along.
  We have some scripts. Every script you had in your original data table is going to be brought along as well, but we also add these profiler scripts, so you can launch those and see, okay, as I'm changing my FPC scores, what does that mean?
  And that helps build some insight into that.
  And so what we want to do now is try to predict that condition, good or bad. So I'm going to use generalized regression, just because I think it's, you know, a really good
  method for not only fitting these models, but also interpreting them.
  But you could really use anything. You could fit a neural network, you could do any other thing. Once you've saved this table, you're free to use the rest of JMP
  however you feel like you could model it best. So I'll take all of these summaries, and I'm going to just do a factorial to degree two.
  And so we're trying to predict the final condition, and we'll use the validation data set.
  And we're targeting, was it bad or not?
  And so, really, this is probably the longest computing section of this. We'll do a Lasso by default using this validation column. I mean, it takes, I
  think, around, you know, 5-10 seconds.
  You can always stop it early if you want to, but we're giving it quite a good set of features and looking at interactions between them to try and figure out,
  you know, the best way to try and predict this condition.
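I can't reproduce GenReg's Lasso outside JMP, but the analogous open-source model is an L1-penalized logistic regression on the saved summaries plus their two-way interactions. A sketch on made-up data, standing in for the saved FDE summaries table:

```python
import numpy as np
from sklearn.linear_model import LogisticRegressionCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(1)

# Fake summaries table: columns stand in for FPC scores and simple
# summaries (e.g., charge FPC 1-3, flow mean, flow std dev, ...).
X = rng.normal(size=(200, 6))
y = (X[:, 1] + 0.5 * X[:, 2] * X[:, 4] + rng.normal(0, 0.5, 200)) > 1

model = make_pipeline(
    # "Factorial to degree two": main effects plus all two-way interactions.
    PolynomialFeatures(degree=2, interaction_only=True, include_bias=False),
    # Centered and scaled terms, like viewing GenReg estimates on a
    # centered/scaled basis to compare magnitudes.
    StandardScaler(),
    # L1 (Lasso-style) penalty zeroes out uninformative terms; cross-
    # validation plays the role of the validation column in the demo.
    LogisticRegressionCV(penalty="l1", solver="saga", max_iter=5000),
)
model.fit(X, y)
p_bad = model.predict_proba(X)[:, 1]   # analogous to "probability bad"
```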
  The model is
  almost done.
  It's just building the report now, and once you get this model, you'll also see the things that it felt didn't matter.
  I'm personally a fan of looking at the parameters on a centered and scaled basis to, you know, help better understand magnitude differences.
  But some of these, you know, like this FPC score, are really not informative, whereas FPC 2 and 3 for the charge
  are more helpful, so you can kind of sort them and see. And like I described, things like
  simple summaries of means or minimums or standard deviations are giving information, but we also see there's definitely a lot of information in the summaries of the
  shapes of these functions, via these FPC scores. On the other side, this flow standard deviation seems to be very important, and I think,
  you know, another reason why I like this is, if you are in charge of this process and you have control over the flow standard deviation,
  this can help you. You know, maybe we have a part of this we can actually
  improve before we build a model to say, okay, now let's just discard these. So let's say we've done that, where this is our model; we want to try and just build a heuristic:
  good or bad, at what point do I just want to go ahead and discard it? So let's save the columns for the prediction formula, and this will be saved back to that summaries table. So GenReg is going to give us probability bad, probability good, and the most likely condition.
  And so we could just stop here and say, okay, if it's above .5, let's look closer at it, or if it's above, I don't know, .75, that feels good. Maybe it's not worth it just due to the
  cost of the rest of the process. You know, you kind of pick that probability based on the real-world implications.
  Or we could let a partition model help us figure that out. So what we're going to use now is this probability as a factor. We're just kind of saying, okay,
  I could kind of look at this by hand, but let's have a model just help me try and figure out, you know, how it's going to group up these conditions. And so we'll take this and maintain our validation data sets so we're not kind of double dipping.
  So now all of our blues are the bads and the reds are the goods; thankfully, in general, most of them are good.
  But let's do a split. So if this probability of being bad is less than .1, very likely it's good. Most of our bads are in this greater-than-.1 group. It makes sense. Split again. Now we're really looking at...a lot of them are really in this
  probability-over-.25 group. We'll split one more time, and,
  at least for the training data, it fits pretty well that, hey, all of them above .6 are bad. You know, you really don't expect that to happen
  in reality, and this validation R squared does highlight that, you know, like with most models, you can do better on the training set than you can on the validation data set.
  But it's giving us something. You know, I kind of felt like .75 was good, but really this is saying maybe we need to focus on these that are .6 or higher.
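The same thresholding idea can be mimicked outside JMP by fitting a shallow decision tree to the single predicted-probability column and reading the split points off as candidate discard cutoffs. A sketch on simulated values (the cuts it finds are whatever the data support, not the .1/.25/.6 from the demo):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(2)

# Stand-ins for the saved "probability bad" column and the true condition.
p_bad = rng.uniform(0, 1, 500)
is_bad = rng.uniform(0, 1, 500) < p_bad        # bads cluster at high p_bad

# A depth-3 tree on the one probability feature mimics the three splits;
# its thresholds are candidate "discard above this" cutoffs.
tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=25)
tree.fit(p_bad.reshape(-1, 1), is_bad)
print(export_text(tree, feature_names=["p_bad"]))
```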
  And so, this is, you know, kind of...
  I guess now we've gotten to the point where, you know, these were simulated data. In the real-world case, this was very helpful, and things like interpreting, you know, can we improve our process were also helpful, not only in this but in other sample data sets that we have.
  Let's go back to the slides.
  Okay, so I went through and summarized what functional PCA is and, kind of, what motivated this new direct FPCA approach, and showed you how to use it in an example where we're trying to, you know, discard things early in this manufacturing process.
  So, really, just some final tips. You know, the fast computing makes this great for large data sets. In some ways, you can just start there
  and say, okay, what does direct FPCA think? It's so fast, I don't have to, you know, fiddle with model controls. If you have large data, it's a great place to start.
  But it's not perfect, you know. I showed some diagnostics, and you can see if it's not fitting well; like any model, just because it was fast doesn't mean it was good. So just kind of make sure that you're
  identifying possible issues; maybe you need a different approach. I mean, we have, you know, other basis function models that we're
  working on for very particular types of data that even this approach doesn't necessarily do as well on as, you know, these very specific models.
  And, you know...part of this is the data must be on a grid, so try to use Reduce to help you control that grid. If it seems like things are either too slow or don't really, kind of, make sense, maybe what we're doing by default isn't as good for your data
  as what you can do yourself.
  Thank you so much for coming and I'll answer any questions if anyone has anything.
Comments

Thanks @RyanParker.

If I had known this approach, I probably would have been much faster with my modeling of online fermentation data!

I spent hours adjusting the splines.

I will try the DFPCA approach on the data set.

Tedwang

Thanks, that is the thing I want.

Thanks! I have a question about the diagnostic plots. Is the 'predicted' vs. 'actual data' comparison each model's actual data vs. that model's own fit (the red line of each graph)? Or each model's actual data vs. the overall model (the red line of the 'order' vs. 'charge' graph right under the Model Selection tab)?

elb89

Hi, this is a wonderful tool! Which of the many plots would be most useful and informative in interpreting the data? Do you interpret the score plot as you would in a regular PCA? And I couldn't find how to generate the loadings plot, etc. Any help would be appreciated!