Classify and Parameterize Time-Varying Curves with Functional Data Explorer
There is often a need to classify time-varying curves into different categories. For example, cell growth curves may have different shape types (e.g., diauxic growth) that characterize the growth behavior of cells. These different curve types make it difficult to automatically and accurately pull characterizing parameters (e.g., lag to start-of-growth, growth rates, final growth level, etc.) out of the curves.
JMP Pro's Functional Data Explorer (FDE) can be used to classify these curves into certain types. Once classified, FDE filters noise out of the curves, greatly simplifying the assessment of critical growth parameters.
Real-world data is used to demonstrate how FDE in JMP Pro can perform these operations.
Hi. My name is Jerry Fish. Welcome to this talk, where my colleague Kyle Probst and I are going to talk about using Functional Data Explorer to classify curve shapes. I'm a JMP employee; I've been with JMP for 7 years now, supporting our customers in the Midwest. If you happen to be in the Midwest, give me a shout, and I'd be happy to help you out with any of your JMP needs.
This is a typical curve of the kind we're talking about characterizing. Kyle is going to explain a little more of this today, but we're talking about cell growth curves. The growth curves have to do with how quickly cells grow when you give them certain foods, for instance. You might have a curve that looks something like this, where you start off with no growth, then suddenly the cells begin to grow, and they plateau out here at the end. This might be a typical curve. Kyle will talk a little bit about the critical parameters that are of interest.
The trouble is we've got different kinds of curves that we need to classify. Here on the upper left is that curve I just showed you. We're going to call that a monophasic curve: it's got a single growth pattern, and then it's sustained afterwards, so the cells continue to survive out here. This curve on the upper right is also monophasic. It just has the one transition, but then it tends to die off out here at the end; the cells begin to die. Perhaps that's important; in your business, that might be a critical characteristic you're interested in.
This one we call diauxic; it has two growth regions. There's a first growth region, things plateau out, and then we have a second growth region. Then finally, we've got some situations where there is no growth, and we would, of course, be interested in those as well.
Now, these things can come in hundreds of tests, and we don't want to have to sort through these one at a time. If we knew what each curve was, this curve, for instance, could be modeled with Fit Curve within JMP. This curve and this one, the diauxic curve, might be better to model in the nonlinear platform where we've got more flexibility in what these curves look like. What we don't want to have to do is look at each and every curve to decide which one of those to apply to it. We'd like to automate that system if we could.
At this point, I'll turn it over to Kyle to introduce himself and to talk a little bit about this curve and some of the background of why this is important and what we're doing. Kyle?
Thanks, Jerry. Nice introduction there. As Jerry alluded to, we do a lot of work here at Kerry within the pharma cell nutrition business to evaluate microbial growth. Just a little background on myself: my name is Kyle Probst. I'm a Senior Scientist at Kerry, based out of our North America headquarters in Beloit, Wisconsin. I support our cell nutrition area of the business, where we supply food for microbes, mammalian cells, and other cells in the form of protein hydrolysates and yeast extracts. A lot of these are used in the biopharma industry to produce life-saving drugs.
The data that Jerry is using here comes from a particular assay that we tend to do in the laboratory. This is from high-throughput screening, where we're using microtiter plate readers to measure microbial growth curves like the one he's showing here. Growth is often measured using optical density at 600 nanometers, known as OD600, which is essentially correlated to growth. It's an easy thing to measure, just turbidity, and it's the standard metric used in the industry for quantifying microbial growth.
We use it to compare and select the best ingredients based on certain key performance indicators, or KPIs, that we can determine from those growth curves. Some of the key KPIs that we are interested in measuring and quantifying from these growth curves include growth rate, lag time, final cell density, and many others. What Jerry is working on here is automating how we pull these quantitative values out of the curves, in a way that prevents us from having to go through them one by one while still giving us a good fit.
That's been one of the biggest challenges we've had when we do this using some of the standard processes: you don't always get a very good fit, and it depends on how the curve looks. Unfortunately, they're not all perfectly monophasic like the one we have here in the demonstration. That's why Jerry has been interested in helping us out with some of this.
Anyway, for the curve you're looking at here, I'll explain what these things are. The KPIs we are often interested in include the lag time, which is just how long the cells are dormant before they start to grow. That's important for us to understand. Next is the time to inflection point; that's the time until the cells are at maximum growth rate. Then there's the slope at the inflection point, which is the maximum growth rate. That's another important parameter that we tend to measure. Then the final asymptote is the final cell density that is achieved, which is related to how well the microbes grow.
Before we do a lot of this work, we actually have to process the raw data into a format that allows us to create these curves, just to make it easier for us to make comparisons, either visually or quantitatively, like the work Jerry is showing us here. We actually have another paper, which I presented, showing how we process that raw data using some of JMP's nifty tools.
Even for somebody like myself who is a novice at scripting, I was able to set up some approaches that really saved us a lot of time and effort. That one is under the title JMP Add-in Builder for Automating Microbial Growth Characterization. Be sure to check it out.
Jerry, I'll kick it back over to you. I feel like I've talked enough. I've probably talked a lot more than you have already.
That's perfect.
Thanks, Jerry.
That's perfect, Kyle. Thank you for that background. I would encourage you to look up the paper that Kyle's the primary author on, to find out how we pre-processed the data into this format: things like shifting everything up so it always started at 0, and truncating things here at 30 hours, which we arbitrarily chose to do. Thank you, Kyle.
Kyle and Kerry conveniently provided some 330 curves for us to analyze in this data set. I went through and tried to classify them by eye to see how these curves group themselves. I called all of these curves up here monophasic sustained, where they climb and reach some asymptotic value. These I called monophasic dying, where the curves come up, reach some peak, and then tend to die off: some severely, some not so severely, some with wiggles in the data. Odd behavior, but I classified all of these as monophasic dying.
These I classified as diauxic. All of these curves have a growth period and an asymptote, followed by another growth period. Maybe it's subtle, maybe it's big, but that's what these are. Then, of course, these are the no growth curves over here.
Let's go take a look at the data and see what we can learn about it and how FDE, Functional Data Explorer, might be used to help us out. FDE is a JMP Pro feature; the tools that we're going to be talking about from here on are largely JMP Pro tools. In the data itself, we've got a column here with the test number, starting with test number 1_1. Here's our time. If we go down here to 30 hours, we have collected data in the OD600 column. This first one I personally identified, by eye, as a no growth data set.
That's what happens. There's test 1_1, followed by 1_2, and so forth. We've got roughly 40,000 rows in this data set comprising about 330 different tests.
Now, I've also got a column over here called validation. With JMP Pro, you have the option to validate your models. Validation means something a little bit different in FDE than it does in other platforms in JMP. We'll get into that a little bit later. But I wanted to show you very quickly how we set this validation column up. It's under Analyze, Predictive Modeling, Make Validation Column.
Now, we need to tell JMP that, first of all, there are grouping columns. For all of the 1_1 data, we want that grouped as either training or validation. We don't want a mix of training and validation within a particular test. We can do that by saying the test number is our grouping column. Then I'd also like to make sure that within the training set and the validation set, each of those has an equal number or an equal percentage of no growth, monophasic dying, monophasic sustained, and diauxic curves. That's called stratification. If I just take my type and put it here in stratification, that's all I need to do.
Then I specify here. I chose to do 60% of the runs in the training set and the remaining 40% in the validation set. Then we simply say Go, and we get this column out here, which should have about 60% of the trials as training and 40% as validation.
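For readers who want to script this step, here is a do-it-yourself sketch of a grouped, stratified 60/40 split in core JSL. This is not the script from the talk, and the Make Validation Column platform is the easier route; the column names (Test Number, Type) follow the demo table, and "Validation DIY" is a hypothetical output column name:

    dt = Current Data Table();

    // One type label per test (grouping: a test never straddles the two sets)
    typeOfTest = Associative Array();
    For( r = 1, r <= N Rows( dt ), r++,
        typeOfTest[Char( dt:Test Number[r] )] = Char( dt:Type[r] )
    );

    // Invert the map: curve type -> list of its test IDs (stratification)
    testsOfType = Associative Array();
    ids = typeOfTest << Get Keys;
    For( k = 1, k <= N Items( ids ), k++,
        t = typeOfTest[ids[k]];
        If( !Contains( testsOfType << Get Keys, t ), testsOfType[t] = {} );
        testsOfType[t] = Insert( testsOfType[t], ids[k] );
    );

    // Within each type, randomly send 60% of the tests to Training
    Random Reset( 2345 );                  // fixed seed, for reproducibility only
    split = Associative Array();
    types = testsOfType << Get Keys;
    For( k = 1, k <= N Items( types ), k++,
        lst = testsOfType[types[k]];
        n = N Items( lst );
        order = Random Shuffle( 1 :: n );  // random permutation of 1..n
        For( i = 1, i <= n, i++,
            split[lst[order[i]]] = If( i <= Ceiling( 0.6 * n ), "Training", "Validation" )
        );
    );

    // Write the assignment back as a nominal column
    col = dt << New Column( "Validation DIY", Character, "Nominal" );
    For( r = 1, r <= N Rows( dt ), r++,
        col[r] = split[Char( dt:Test Number[r] )]
    );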
Now, I'm going to delete that column that I just made. It randomly assigns training and validation to the different runs, and I want to make sure that I have the same runs that I did for the analysis that's shown in the accompanying paper. I will just delete that column that I just built, but that's how I got this column.
We're ready to go into Functional Data Explorer now: Analyze, Specialized Modeling, FDE. Here we have to describe to JMP what the different columns mean. Test number is the ID. Time is our X variable. OD600, the measurement of the cell growth, is our Y output. I'm going to carry type along as a supplementary variable. It's not going to be used in the current analysis; I'll use it in a later step, so I'm just carrying it through at this point. Then I'll put validation down here, and we'll say OK.
Here is the Functional Data Explorer window. This top plot is simply all of the curves overlaid. If I expand this, down here are all of the individual curves. Now, the first thing we need to do in Functional Data Explorer is fit functions to each of these data curves. That's done under the red hotspot's Models menu. We have these different options to choose from: B-Splines, P-Splines, Fourier, and Wavelets.
I'm going to do B-Splines. They're fairly simple to understand, but there's no reason I couldn't have done this with P-Splines. A Fourier basis would be appropriate if each of my signals down here had repeating signatures. Wavelets are becoming more and more popular as a way to fit data like this, or even spectral data that's got sharp peaks in it; wavelets are very good for that. But I'm just going to stick with B-Splines today.
JMP is going to do two things. It's going to sort through all of the data and choose two parameters for me. One of those parameters is the degree: either a step function or a series of linear, quadratic, or cubic segments that are used to break this signal up into splines.
Now, the splines fit in between these things called knots. That's the other thing JMP determines: the number of knots and their locations, to best fit all of these functions down here. In this case, JMP has chosen a cubic spline fit with 17 knots. You have control over that; you can change it if you'd like, to try to get an even better fit. I'm going to stick with what JMP automatically brought up for me.
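In symbols (this is the standard B-spline setup, not anything JMP displays on screen), each fitted curve is a linear combination of basis functions,

    \hat{y}(t) = \sum_{k} \beta_k \, B_k(t),

where each B_k(t) is a piecewise-cubic basis function defined on the 17 knots, and the coefficients \beta_k are chosen by least squares.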
What does JMP do with that? Well, the first thing it does is determine an overall average curve. This is the overall average of all the curves that we just looked at. Then there are these shape functions; they're also called eigenfunctions. Imagine we had a coefficient that we multiplied by shape function 1, maybe plus 2; another coefficient, maybe minus 1, that we multiply by shape function 2; shape function 3 we multiply by one half; and shape function 4 we multiply by 5. If we add all of those up, the coefficient times each of these functions, and add the total onto the mean function, we would get some sort of a curve up here.
The question is, what are the coefficients that go in front of each of these functions to reconstitute each of these curves? Since each curve is different, you'll get different coefficients; they're called functional principal components, or FPCs: 1, 2, 3, and 4. In this case, JMP says, "You need four functional principal components in order to characterize each one of the curves."
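Written out, the reconstruction being described is

    \hat{y}_i(t) = \mu(t) + \sum_{j=1}^{4} c_{ij} \, \phi_j(t),

where \mu(t) is the overall mean function, \phi_1(t) through \phi_4(t) are the shape functions (eigenfunctions), and c_{i1} through c_{i4} are the functional principal component scores for curve i. With the example coefficients above, the curve would be \mu(t) + 2\phi_1(t) - \phi_2(t) + 0.5\,\phi_3(t) + 5\phi_4(t).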
Now, that's a mouthful, but the bottom line is that these four coefficients right here characterize this curve. I'm going to take these four coefficients, along with the knowledge that we've got a third-degree spline fit and 17 knots breaking up the splines. If I know all that, I can reconstitute the curve. Curve 1 is a no growth curve, but here's Curve 2, here's Curve 2_1.
If you notice—we'll make this a little bit bigger—I'm actually showing here the raw data and the fit data. The raw data is in black. You can see a little bit of it here. I hope you can see it. The red data is the fit, the spline fit. You can see that we're doing a really good job fitting all of these curves as we go along.
We need these functional principal components in order to continue the modeling, so I'm going to go to the red hotspot and say Save Summaries. That creates a new data table that has one row for each of our tests. Here's test number 1_1. I identified it as no growth, so the type, remember, we carried that through as a supplementary variable, and that's where it ends up. That's a no growth curve, according to my eye. It was in the training set, not validation. And here are the first four functional principal components for that test.
Then we've got a lot of other stuff out here: the mean, the standard deviation, the median for each of those curves. Those may be useful to us, but I'm not going to worry about them for today's talk. I'm just going to delete all of those, leaving us with just these four guys.
I've got one of those for each of the tests, and I've got my assessment by eye of what each of those are. Then I've got a column here that says Training. If I come down far enough, somewhere down here, there we go, there's a bunch of validation rows as well.
Now the question is, "Can I come up with a model that uses these FPCs that we just determined in order to predict the type?" If I can, then that'd be great. All we have to know is for a new curve coming in, "What are the FPCs?" We can plug them into this magical formula, and we're going to end up with a prediction of what the type is.
Since we're using FDE, we have access to JMP Pro, so we can do something called Model Screening. Here, our Y response is going to be the type; I want to know if I can predict the type. I want to use validation down here. Let's see. I want to use the FPCs as my X inputs. With the Model Screening tool, JMP Pro is going to run all of these different models and compare them to see which is the best at predicting.
Now, I'm going to add a few options. Some of the models allow two-way interactions and quadratics, so I'll put those in. I'm also going to set a random seed here, 2345, just so that the results agree with what's in the paper. Normally I would not set that; I'd just let JMP use an arbitrary seed. We say OK.
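For scripters, a hedged JSL sketch of that launch is below. The platform, roles, and seed come from the talk, but the FPC column names and the option names are my best recollection of the JSL, so verify them against the Scripting Index entry for Model Screening before relying on them:

    // Run on the saved-summaries table (one row per test)
    summaries = Current Data Table();
    summaries << Model Screening(
        Y( :Type ),                            // curve class assigned by eye
        X( :Name( "FPC 1" ), :Name( "FPC 2" ),
            :Name( "FPC 3" ), :Name( "FPC 4" ) ),
        Validation( :Validation ),
        Set Random Seed( 2345 )                // the seed used in the talk
    );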
Now JMP goes through and runs each one of these models and compares them. Here, just that quickly, is a comparison of all the models it ran, rank ordered according to the root average square error. The minimum root average square error turns out to belong to a neural-boosted model, which happens to be a JMP Pro model as well. Things like the decision tree and nominal logistic are standard JMP models, but the neural-boosted model is a good one. It also says the misclassification rate here is 5%. That's the best we had of all these models. We'll talk about that in just a minute.
I'm going to run that neural-boosted model now to see what the results look like. Here they are. Let's talk about the confusion matrix. The left side is for the training data, the right side is for the validation data. Let's just look at the validation data.
When I actually said that a trial was diauxic, the neural-boosted model predicted diauxic eight times. It also predicted monophasic sustained one time. I got one of those wrong out of a total of nine. That's what we're doing here in the confusion matrix. You can see what you'd like to have is all of these numbers along the main diagonal. We got a few that are not quite where they need to be. The model is not perfect, but it's pretty good. We've got a 95% success rate.
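As a quick check of that arithmetic: the diauxic row of the validation matrix has 8 right and 1 wrong, so the per-class error rate there is

    1/9 \approx 0.11,

while pooling the off-diagonal counts over all the validation rows gives the overall misclassification rate of 5%, i.e. an accuracy of 1 - 0.05 = 0.95.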
If you'd rather look at fractions instead of counts, that's what this table is. Then the receiver operating characteristic curves are down here that you may be familiar with. Interesting, but we do make some mistakes. Now, those mistakes… If you'll pardon me just a second. I need to get those numbers, so I know which ones to talk about with you.
If I go back, here we've got… Let's see, I predicted monophasic dying, and three times my algorithm, my neural-boosted algorithm, said that it was diauxic. Thirty-nine times it got it right, three times it said it's diauxic. Let's take a look at those three trials.
If I go back to my main data set, I've got a script here that brings up Graph Builder for me. Here are all my different curves, all superimposed. The particular curves I want are 124_2, 144_2, and 176_2. These three curves, when I was going through them originally by eye, I said were monophasic. They just had a peak, and then they died off. They died a little bit, not a lot, but I characterized them as monophasic dying.
The routine said, "No, these three curves were diauxic." Let's take a closer look. The green curve here comes up, and sure enough, there's a little wiggle in it right there. I missed that by eye, but JMP has said, "Hey, I think that's a diauxic curve." Is that important? I don't know. People in Kyle's business would have to tell us about that. But it picked this one out and said, "No, I think that's diauxic."
These two curves, if you look all the way back here at the back, there seems to be some growth right away, plateaus, and then there's a second larger growth curve. Maybe those are diauxic. Maybe JMP is actually doing better than what I could do by eye.
Here's another curve. This is number 60_2. Where is it? There it is. This curve, I said, was monophasic dying. Sure enough, it comes up and then it falls off back here. It starts to grow and then it quits. I'm sure you've noticed this is very steppy; that gets down to the resolution of the instrumentation. But the bottom line is that we're way low on the scale here. This curve never really got started, so JMP has said, "No, this is a no growth curve." That's pretty good. It missed the dying behavior, but it missed it because the growth is so tiny that it's not practical. I thought those were interesting results.
We've got a model now, so I can close Graph Builder. We have this neural-boosted model that allows us to look at the misclassification rates. But what do we do when we have new data? Something I didn't tell you about earlier is that in the original data set, down here at row 190, come on, there we go, 191, I meant to have this as hidden and excluded, but that's all right; it is excluded data.
Let's say that the data in the 191_2 run are new data. These were not included in the previous analysis. I'm going to select Matching Cells, and I'm going to clear that. Let's see. There we go. Now I want to essentially repeat the process. This is now new data as far as the algorithm is concerned. It's monophasic dying, by eye anyway. Let's see what JMP does with it. How am I going to do that? I'm going to go through and do the same thing.
The one thing I need to be careful of is making sure that all the training rows are still in here as they were originally, because I'm going to go to Analyze, Specialized Modeling, FDE. I'll just hit Recall here; everything is the same in the setup. I'll say OK, and I'll do that same fit of B-Splines. When I do that, I get exactly the same results over here, so the fit is only based on the training runs.
Now, what I need to get out of that are these functional summaries, the FPCs. I need to get that for just that… Where did it go? Oh, it's way on down here. 191_2. I need these functional principal components. If I know those, I can plug them into that neural net formula and get a prediction out. Rather than try to copy and paste these or whatever, I'm just going to do another Save Summaries.
Then I'll go down here to the bottom to 191_2, which is right here, this row; I want these four numbers. These are the functional principal components for that new trial. I'll copy those, CTRL+C. Oh, let's see, I've forgotten to do something; I left out a step. After we have this model, we also need to save it. I'll save the formulas back to the data table. Now here in the data table, wherever it went, I have a lot of extra columns.
This is what we started with, the four FPCs, and now we've got all these extra columns out to the right. They contain the neural net formulas. They're little calculators in each column, and they're summarized over here, grouped together, to get a probability of each type of curve. Then finally, whichever type is most likely is what you end up with out here.
All I need to do is take these numbers for 191_2, copy them, and come back over to this data set. Come to the bottom; this will be 191_2. This is a new data set, so I wouldn't normally know what it is. All I have to do is paste those numbers in here, and JMP calculates through all these different columns and finally says that this is a monophasic dying data set, which in fact it was; that's how I assessed it by eye. You could presumably take as many new data sets as you want, get the functional principal components, put them right in here, and you're done.
Let's go back to the journal and wrap things up. Our summary: Functional Data Explorer, in combination with other modeling tools in JMP Pro, can be used to characterize curves. I got it to 95% accuracy, but there are things we can do to potentially improve that even further. The resulting modeling process can be used to predict the types of new curves, saving time and preventing errors, and the steps are all outlined in the accompanying paper.
Possible models for each… No, I'm sorry, I've got a couple of slides out of order. Suggestions for future work would be to prescreen your data for no growth: if OD600 doesn't exceed some number, say 0.05, then that's obviously no growth, and we can just throw those out. We don't even need to consider them.
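For concreteness, here's a minimal JSL sketch of that prescreen. The column names (Test Number, OD600) follow the demo table, the "Prescreen" output column is hypothetical, and the 0.05 threshold is the example value from the talk:

    dt = Current Data Table();

    // One row per test, with the peak OD600 that test ever reached
    peaks = dt << Summary(
        Group( :Test Number ),
        Max( :OD600 )
    );

    // Flag tests whose peak never clears the threshold as no growth
    peaks << New Column( "Prescreen",
        Character,
        Formula( If( :Name( "Max(OD600)" ) < 0.05, "No Growth", "Analyze" ) )
    );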
Experiment with different types of splines, different complexities, different numbers of knots, and then create a script to automate all these steps. It's very conceivable that we could do that as well. We could really speed this up.
Now, once you have those curves classified, if you have a monophasic sustained curve like the very first one we showed, you can fit it in JMP with the Fit Curve platform. This is the equation you would get, and you would pull the lag time, the maximum growth rate, and the asymptote out of the A, B, and C parameters.
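The slide's exact equation isn't reproduced in this transcript; as a hedged stand-in, one sigmoid of the kind Fit Curve offers is the three-parameter Gompertz,

    f(t) = A \, e^{-e^{-B (t - C)}},

where A is the final asymptote, B governs the maximum growth rate, and C locates the inflection point, from which a lag time can be derived.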
If you had a diauxic curve, you might do this. It's actually two of the monophasic curves just added together. You can use JMP's nonlinear platform to solve that. No growth, like I said, you just put a limit on there. If OD600 is less than 0.05 at all points in the run, then call it a no growth.
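Going back to the diauxic case, and reusing the hedged Gompertz form from above, "two of the monophasic curves just added together" would look like

    f(t) = A_1 \, e^{-e^{-B_1 (t - C_1)}} + A_2 \, e^{-e^{-B_2 (t - C_2)}},

with the Nonlinear platform estimating all six parameters at once.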
Then for monophasic dying, there are several different ways, I suppose, that you could do this. One way that I did it was to take that original monophasic sustained form that approaches an asymptote and multiply it by a linear, downsloping function of time. That tends to put a downward slope on the tail. Again, let the nonlinear solver solve for M along with A, B, and C.
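One hedged way to write that product, assuming the linear factor takes the form 1 - Mt, is

    f(t) = (1 - M t) \; A \, e^{-e^{-B (t - C)}},

where M is the downslope rate, and the Nonlinear platform solves for M, A, B, and C together.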
That concludes our show. Like I said, there will be an accompanying PDF file with all the steps involved. We hope you find it valuable. Thanks very much to my co-presenter, Kyle. We'll see you next time. Stay JMPy.