The biopharmaceutical industry has generated many valuable products. Yet if the analytical tools used are difficult to operate or don't work in concert, practitioners can struggle to understand the complex systems they employ to achieve those Eureka moments.
We demonstrate a real-world example of the powerful combination of routine PAT (process analytical technology) using chemical analytics performed by LC-MS (liquid chromatography-mass spectrometry), data analytics, and graphics bundled in intelligent workflows, progressing to understanding and breakthroughs in bioproduction development. By integrating high-end hardware and software, bioprocess engineers can quickly decipher the parameters that they control or influence to optimize productivity, yield, and quality.
This presentation represents a three-way collaboration between Waters, CSL Seqirus, and JMP. Waters Corporation's LC-MS BioAccord system measured very detailed profiles in a high-throughput bioreactor system during a DoE study. The BioAccord was used to measure product titer and quality attributes of the protein biotherapeutic, as well as feed and metabolic components, to enable a rich daily snapshot of the process profiles. These analytics enable users to see the impact of process changes on stress markers, productivity, and quality at an unparalleled level.
The Waters Bioprocess Monitor Add-In, available in the JMP Marketplace, gathers data from the BioAccord platform and makes it available graphically in JMP; it is being used successfully at CSL Seqirus. By applying a series of analyses in JMP, we can understand relationships between parameters and, thus, guide discovery toward maximizing yield, purity, and process reliability of the target substance with a robust, well-characterized process.

At Waters, we've been talking to scientists and engineers in bioprocess labs, and they have expressed interest in using advanced analytics to get insights for bioprocess development. The main obstacle they've described to us is the time it takes to get the data and to get the insights from those advanced analytics. When we talk about bioprocessing, we're really focused on upstream bioprocessing, and we appreciate that many in the audience won't know too much about that field, so I thought some photos would help illustrate the kind of labs we're talking about and the activities that generate the samples.
The labs in question are upstream bioprocess development labs and clone selection labs, and they're typically using either automated bioreactors, as in that top left-hand photo, or benchtop bioreactors, maybe a couple of liters, sometimes 5 liters in size. As for the day in the life of a bioprocess engineer running these experiments: a typical experiment lasts about two weeks, and every morning they'll come in and take samples from those bioreactors, as you can see in those pictures, and then analytics will be performed on those samples.
Now, when it comes to routine types of analysis, so basic chemistries like pH, metal ions, that kind of thing, they're done immediately by the bioprocess engineer themselves. But advanced analytics are typically sent away to a core analytical lab. Then there's this delay we talked about of several weeks before they get the results and the data back. It doesn't really fit the cadence for designing new experiments.
What we're trying to do at Waters is change that, so that advanced analytics can be performed by the process engineer themselves, ideally during the cell culture run, so they can use the insights that data gives them to design the next experiment and benefit from that. At Waters, we've been developing new LC-MS systems that can be used by non-analysts, so engineers, chemists, etc. The particular one that we believe is suitable for bioprocess labs is this BioAccord LC-MS system.
Now, if they do that in a bioprocess lab, the advantage they'll get is a much more comprehensive picture of what's going on during their experiments. In addition to the routine chemistries and cell measurements that they're making, they'll also get a lot more information about what's happening in those cells in the bioreactors, in terms of what they're consuming a lot of, what they're consuming less of, and what metabolites and byproducts they're producing, plus a lot more information about the products and product quality during the cell culture process.
There's a lot of extra data they'll be able to get that we believe has a lot of value. The downside, though, is that the more different instruments and methods they're using, the more data they've got to clean up and aggregate before they can get insights from it. I'll hand over to my colleague at CSL.
Hi, I'm Lalit. I am part of CSL Seqirus, and I work in the upstream team as a scientist. As a user of the BioAccord, I would like to share my experience. In CSL, we use a data-driven, or DoE-driven, approach to our experiments. The Ambr 15 and Ambr 250 are the two high-throughput instruments that we use. The Ambr 15 has 48 bioreactors with a working volume of 15 ml; the Ambr 250 has 24 reactors with a working volume of 200 ml. With the DoE designs, we generate a lot of data, and that gets fed through the BioAccord LC-MS. It's very handy in terms of identifying and quantifying components, with the capacity to detect more than 200 components from its library. That makes it much easier, since we don't need to use any other instrument.
At the same time, it can be pretty challenging with the amount of data that you get using the BioAccord LC-MS. The data is very useful, and you can look at all of it, but it's very hard to make an inference with that volume. That's where the Waters JMP add-in comes in pretty handy. You can feed the data through the add-in, and it is very capable of separating the metabolites based on the pathways. That makes it easier to look at each of those components and tie them to the DoE process parameters, which can be used to further improve or optimize the process. In general, if I look at the value of the BioAccord LC-MS and the JMP Waters add-in, it is very useful as far as an upstream scientist is concerned.
The data visualization and analysis that my colleagues from JMP will demonstrate was generated in an experiment of which we've given a high-level overview here. The cell culture experiments were performed on the Ambr 250 system that I showed you a picture of earlier. That was an eight-way system, so we had eight different bioreactors. It was a 14-day study; samples were taken from each bioreactor every morning for analysis on the Nova BioProfile FLEX2 system and the BioAccord LC-MS system.
The design of the experiment was quite a simple one: it varied the gas sparge rate, the air overlay rate, and the buffer concentration. The feeding structure was identical for each bioreactor; only those three factors changed: sparge, overlay, and buffer. The Ambr 250 system has onboard sensors for online temperature, gases, and pH. The FLEX2 system was used to measure things like cell count, cell viability, ammonia, pH, etcetera. Then the BioAccord system was used to measure titer (the gold-standard way of doing that is with a Protein A chromatography method), aggregates, which are high-molecular-weight species impurities, and low-molecular-weight species impurities as well.
Glycoprofiles, so we were seeing which glycoforms are being made during the cell culture experiment. I should have said it was a CHO cell line that was engineered to express a monoclonal antibody; the different glycoforms of that antibody are quite important quality attributes. Then the spent media: this is the media surrounding the cells that contains the nutrients and the metabolites that have been expelled by the cells. In there we measure things like amino acids, vitamins, TCA metabolites, that's the Krebs cycle metabolites, and then metabolites from other metabolic pathways as well. It's that data that our colleagues from JMP will show you being analyzed and visualized.
Now, to facilitate the data analysis and visualization, with the help of our colleagues from JMP, we wrote a Bioprocess Monitor add-in for JMP, which is available in the JMP Marketplace. Using that add-in, you can import the data files from the different analyzers, and it will automatically add the metadata, clean up the data, and aggregate it all into one table that's then used to create visualizations.
The add-in creates visualizations to get you started, in bar chart and line series chart formats. Then, to get you ready for multivariate analysis, the final thing it does is create an MVA table for you, which is reshaped to be better suited to multivariate analysis, with one column per analyte. Now I'll hand over to my colleague from JMP.
Thanks, Pat. My name is Greg Flexman. I'm a Systems Engineer with JMP, and I'm going to start the part where we look at driving JMP. The Waters Bioprocess Monitor is available from the JMP Marketplace, and it does have a full set of PDF documents with instructions on how to import data and go through these steps to make the files. Pat has made a very handy video, but I will pick it up from midstream.
After importing all the data, you're then given a series of reports. If you like the bar graph format or the line format, they're both available. This is where the broad range of analytes that this BioAccord can measure really starts to show because we've got a DoE here with eight bioreactors. We can see not only the titer at all these time points, but all these things that are going on that contribute to the productivity of the cells.
We can see the fragments and aggregates as well as the main species. We can see all these different pathways: DNA synthesis, glycolysis. It's really valuable to browse your data. As you continue on, that was in a tall data format. We'll look at this really quickly: 12,000 rows, a lot fewer columns, and it has the appearance of a LIMS database, getting this wide variety of analytes measured and their values into a tall table.
The add-in also generates a wide table they call the multivariate analysis (MVA) table. Out of the box, the script will end with this. This gives you the option to make scatter plots; for example, we could look at titer by metabolic stress and start to browse some of the relationships that are present in our data. This is also a really great launching point for further analysis.
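Conceptually, the MVA table is a tall-to-wide pivot of the aggregated analyte table. Here's a minimal sketch of that reshape outside JMP, in pandas, with hypothetical column names (bioreactor, day, analyte, value), not the add-in's actual schema:

```python
import pandas as pd

# Hypothetical tall table: one row per (bioreactor, day, analyte) measurement,
# mimicking the aggregated table the add-in produces.
tall = pd.DataFrame({
    "bioreactor": ["V1", "V1", "V1", "V2", "V2", "V2"],
    "day":        [1, 1, 2, 1, 1, 2],
    "analyte":    ["titer", "glucose", "titer", "titer", "glucose", "titer"],
    "value":      [0.12, 5.4, 0.30, 0.10, 5.1, 0.25],
})

# Pivot to the wide "MVA" shape: one row per bioreactor-day,
# one column per analyte.
wide = tall.pivot_table(index=["bioreactor", "day"],
                        columns="analyte", values="value").reset_index()
print(wide)
```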
I'm going to start with this wide table, and I've previously taken this a little further and added some formula columns. We can start graphing this. We could look at titer over time. Again, this is where we're trying to get to: understanding those relationships. I did make a calculated value for the incremental titer. This is the proportion of growth day over day, so we see some growing up to 80%, and in later days we are only growing, say, about 10%, a factor of about 1.1 per day.
I did notice this big blip here on day seven. Just for grins, I'm going to define an early and a late phase. What I'm going to do is also calculate this relative titer increment per time, where I averaged up all these time points. We can see that our poorest-performing bioreactor was much lower than the others in this early phase and also low in the late.
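For readers who want the arithmetic: the incremental titer described here is just a day-over-day fractional change, averaged per phase. A minimal pandas sketch with made-up numbers and hypothetical column names:

```python
import pandas as pd

# Hypothetical daily titer series for two bioreactors.
df = pd.DataFrame({
    "bioreactor": ["V1"] * 4 + ["V2"] * 4,
    "day":        [1, 2, 3, 4] * 2,
    "titer":      [0.10, 0.18, 0.30, 0.33, 0.08, 0.14, 0.22, 0.24],
})
df = df.sort_values(["bioreactor", "day"])

# Incremental titer: fractional growth day over day
# (0.80 means 80% growth; 0.10 means a factor of 1.1 per day).
df["titer_increment"] = df.groupby("bioreactor")["titer"].pct_change()

# Average increment per bioreactor over a defined early phase (days 1-3 here).
early = df[df["day"] <= 3].groupby("bioreactor")["titer_increment"].mean()
print(early)
```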
Our best performing, we can see, did pretty well early on; that's where it really shone. Then it was a little closer to the center of the pack later. But this V7 is pretty interesting: it's showing very strong early on, and then it dropped down later. I wonder what its potential would have been if the right conditions were present to allow those cells to keep producing.
JMP gives us a bunch of tools to understand these various process conditions. We could do Fit Y by X, but there are just so many parameters. We're going to apply a screening tool called Response Screening. I will first show you what this is based on, and then we will invoke that analysis platform. If we were to do a whole bunch of Fit Y by X fits, I have just two here, with a bivariate fit for early and late. We can see this isocitric acid: very strong R² early, very high significance early, but less so later.
We could also look at this other one, nicotinamide: very strong in the late phase, very strong statistical significance late, less so early. The values R² and p tell us a lot about the strength of a relationship. If we're really just trying to understand the leverage of these many parameters, this is going to be very helpful. I'm now going to execute this Response Screening platform, and we will see the R² and the p-value for many of these. Now, when we're doing so many comparisons, we are going to expect some false hits.
There's an adjustment called the false discovery rate, which gives us these FDR p-values. Because it's very hard to look at such small numbers, it's convenient to look at the LogWorth, the negative log10 of the p-value; taking the log makes it very easy to see. A p-value of 0.01 translates to a LogWorth of 2, and we can see the boundary between what is most significant. There's that isocitric for the early phase, and we can also look at this for the late phase. You'll notice that the rankings for early and late are different.

What I'm going to do now is prepare a parallel plot. I'll do that with a script, but to give you a hint, I did make a combined data table and then did a split operation. Here we go, here's the punchline that we got to; we're going to redo that here with the script. This is going to give us a parallel plot. Then I did a Table Update from our summary to get the analyte class rolled in.
This is going to go through. Now we can see, for the early and late phases, relative to this LogWorth of 2 where there's significance, that in the early phase the TCA cycle is the most statistically significant. The parameter within that that is the strongest, again, is that isocitric. We can see in the late phase that these relationships change a lot; it's a completely different ballgame. There's that nicotinamide, as well as some other vitamins that aren't as important. That's just an example of some of the graphical analysis that can be done here.
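Outside JMP, the core of this screening step plus the FDR and LogWorth computations can be sketched in a few lines. This is a simplified, assumed version of the computation (simple linear fits per analyte, Benjamini-Hochberg FDR, LogWorth = -log10 p), not the platform's exact internals, and all data are simulated:

```python
import numpy as np
import pandas as pd
from scipy import stats
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(1)

# Hypothetical wide table: titer plus many analyte columns.
n = 60
analytes = {f"analyte_{i}": rng.normal(size=n) for i in range(30)}
df = pd.DataFrame(analytes)
df["titer"] = 2.0 * df["analyte_0"] + rng.normal(size=n)  # one real signal

# Simple linear fit of titer against each analyte, screening-style.
rows = []
for col in analytes:
    res = stats.linregress(df[col], df["titer"])
    rows.append({"analyte": col, "rsq": res.rvalue**2, "p": res.pvalue})
screen = pd.DataFrame(rows)

# False-discovery-rate adjustment, then LogWorth = -log10(p).
screen["fdr_p"] = multipletests(screen["p"], method="fdr_bh")[1]
screen["fdr_logworth"] = -np.log10(screen["fdr_p"])
print(screen.sort_values("fdr_logworth", ascending=False).head())
```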
Now I do want to introduce another concept, and I'm going to jump over to a different data table. This is not from the BioAccord, but I'm going to quickly run a teaching example for a JMP Pro feature called Functional Data Explorer. Imagine we had a very simple bioreactor study where we had 20 different runs, and we were looking just at turbidity over time. We want to understand how some of these variables affect it; perhaps we've done a design of experiments for pH and temperature. We've got these runs at different process conditions. You can see the highlighted rows jumping around, so that's evidence of a DoE. I am going to invoke the Functional Data Explorer, and this is really cool. What it's going to do is look at all these curves, use these wavelet models, and try to describe a lot of the variability.
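As a rough analogue of that idea, the sketch below simulates 20 turbidity curves, runs ordinary PCA on the curve matrix as a stand-in for functional PCA (JMP Pro's basis-function fitting, wavelets here, is more sophisticated), and checks whether a supplementary variable moves the leading score. All names and data are made up:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(7)
t = np.linspace(0, 14, 50)          # common time grid

# Simulate 20 turbidity curves whose steepness depends on a made-up
# temperature setting (the "supplementary variable").
temperature = rng.uniform(30.0, 38.0, size=20)
curves = np.array([
    1.0 / (1.0 + np.exp(-(t - 7.0) * (temp - 30.0) / 10.0))
    + rng.normal(scale=0.02, size=t.size)
    for temp in temperature
])

# Stand-in for functional PCA: ordinary PCA on the curve matrix.
pca = PCA(n_components=3).fit(curves)
scores = pca.transform(curves)       # one row of FPC-style scores per run

# Does the supplementary variable move the leading shape score?
print("corr(FPC1, temperature):",
      np.corrcoef(scores[:, 0], temperature)[0, 1])
```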
I clipped this short to three principal components, but you can let JMP figure out where it should be. JMP Pro determines the principal components and gives you a profiler that shows how these principal components affect the shape of the process. Then here's the awesome punchline: now we can see how these supplementary variables affect the shape. I'm going to pause here and hand over to my friend Ben, who does some really cool stuff with summarizing that.
My name is Ben Barroso-Ingham. I'm a Systems Engineer for JMP, and I'm involved in this project because I've got quite a bit of experience working in bioprocesses. The section that I'm focusing on is that, with this Waters add-in, we're getting a lot of those analytes in front of us in a nice structured format. Very often we're going to work with DoEs where we're changing those factors, those conditions. We want to see: do those analytes change, and how do they change?
The challenge with that is that, if we take the example of the many analytes that come out of the equipment, we could have 70-plus analytes to look at. We could, if we wanted, take a very manual approach where we take one of our DoE factors, look at what happens when we move from low to high settings on that factor, and see: does my analyte's shape over time change?
For ones that don't change, we may not consider them important. Ones that do change, that change has been caused by our factors. Maybe it's got an important biological meaning. Maybe it's something that we could intrinsically link to an important production characteristic at a larger scale, something that we want to look out for.
Doing that by eye is very long-winded. If we're doing that with 70-plus analytes, we're going to be there for weeks trying to get through it. I saw this data set and immediately started thinking about JMP Pro's capabilities with Functional Data Explorer. That is essentially a very quick way of looking at shapes, narrowing down that list, and finding what are ultimately the most important changes in the process.
What I'm going to go to is the end goal. If I have all these many analytes, what I want is really two tools. The first one is I want just a narrowed list of these are the analytes that have changed when a factor has been altered, and the other thing is we want to just look at those analytes and how they change in terms of their shape. What happens when we do alter those factor settings in that process?
Using JMP Pro, what we can get to is something like this, where I have all of my DoE factors along the bottom; in this case, the air overlay rate, the buffer concentration, the interactions of those effects, and then the analytes that appear to have been significantly affected by those factors. I have 70-plus analytes. What you can see here is a narrowed list with how much each shape has changed in the process.
What I could do is, for example, look at glucose here. When the air overlay rate has changed, the shape has changed. We can hover over and see, yes, the shape has changed in that process. Maybe we can look at the isocitric acid. We can see that's actually affected by both buffer concentration and air overlay rate.
One nice part of using FDE is that we can get this kind of analysis and visualization, similar to a LogWorth graph in a Standard Least Squares fit, for finding out which factors matter; in this case, which analyte has changed. As a companion to that, we can also use a functional DoE. We're probably all very familiar with the profiler tool; we can actually use a profiler to look at shapes and how something changes over time.
In this case, I have an analyte, 2-hydroxyglutaric acid, and the shape of it changing over time at set conditions. What does it look like when I change my overlay rate or my buffer concentration? What is that doing to the profile? I can pair this up quite nicely with this visualization that I've made here, and I can say: right, my glucose has changed here because of the air overlay rate.
I can filter down in a list of all the analytes and have that appear here. Now I'm looking at the glucose, and I can see what my air overlay rate does to that process. I can also see what has no effect: when I'm changing buffer concentration, it's not doing anything to my glucose in that process.
The way this is working is much as I described: we're visually asking, is there a shape change or is there not? JMP Pro uses FPC scores, which are a quantitative way of measuring how much a shape changes. If we have an FPC score of 0 and it suddenly changes to 10, that's an indicator of a shape change in the process. We can use JMP and take advantage of its ability to make these scores to really quickly get a shape-change list and an idea of what has been affected when we have tons of analytes to look at.
The flow of that is that for each analyte coming out of, for example, the Waters add-in, we're using Functional Data Explorer to fit those shapes. We're getting those scores out, the basis of the shapes. Then we're doing as we would with a standard DoE: we're asking, do those DoE terms do anything to the FPC scores? Do they change them? Is there a significant effect?
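That last step, testing whether the DoE terms move the FPC scores, is an ordinary linear model on the per-bioreactor scores. A hedged sketch with hypothetical factor columns and a made-up FPC1 score:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)

# Hypothetical per-bioreactor table: coded DoE settings plus the FPC1
# score extracted from each bioreactor's analyte curve.
doe = pd.DataFrame({
    "air_overlay": [-1, -1, -1, -1, 1, 1, 1, 1],
    "buffer_conc": [-1, -1, 1, 1, -1, -1, 1, 1],
})
doe["fpc1"] = 5000 * doe["air_overlay"] + rng.normal(scale=800, size=8)

# Is the shape change (the FPC score) driven by the DoE terms?
fit = smf.ols("fpc1 ~ air_overlay * buffer_conc", data=doe).fit()
print(fit.summary().tables[1])   # look for significant terms
```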
As a quick example, I've chosen one of the analytes that's been investigated for this bioreactor, and I'm going to show you the end goal with FDE: we take that analyte for each bioreactor in the process and add a line of fit to it here. Each bioreactor gets fit as closely as possible to the data that has come out of that process.
Then this is getting converted into FPC scores: values for each bioreactor, where it says the shape score for this one is 0, the shape score for that one is 100, or whatever it may be. We can look at the function summaries and see that I have an FPC score from the first bioreactor of minus 6,000, while the other one is 13,000 for bioreactor 2. There's a difference in that shape.
We use the DoEs then to link that to say, well, is that shape caused by the factors that we're using? We can actually go in and by default, when we run a functional DoE, we're testing to see does the term have a significant effect on that FPC. When I look at FPC 1 in this indicator and run a model, a functional DoE, it's telling me, yes, there is a significant effect. Air overlay rate is changing that score. It's not just that the shape is changing, it's that the actual change is caused by one of our DoE factors.
That is essentially what I've done at the start. That whole table is me doing this for every single analyte, collecting all of the information on the active terms, and then just visualizing it: if there's a term there, it's something of significance in that process. Then what we also get automatically is a prediction formula, where we can look at, as I showed before, the effects of each factor on the shape of each analyte.
All of this information is collected for each analyte to put together the end tool that was shown: the visualization and the profiler. Ultimately, this allows a really quick way of screening out where those analytes have changed, and it gives you the ability to narrow down to where factor changes have caused an alteration of the analytes that may have an impact on the production process.
Maybe we're looking at a change in an analyte relationship we didn't expect. Maybe there's a key factor that has changed one of our analytes that also changes our production process. We can use this tool to pull that apart. The journal will be available if you're interested, and you can have a look at how this has been done; some extra details will be added to it as well.
Thank you so much, and boy, this data set to me is just so rich; it has a lot of interesting things, and I've got a couple more things I want to add to our workflow. So far we've seen how the Waters add-in can assemble the data from the instrument and get it into a nice form for analysis in JMP. Then we've seen some basic ways you can create nice graphics, start to tell a story, and discover what's happening, the different relationships between the variables.
Then Ben also gave us a really nice dive into more of a functional, curve-based approach, which lets us see even deeper into potential relationships, especially between the DoE factors and everything else that we're measuring. To round things out, I wanted to present a couple of additional things that I think are extremely powerful in the JMP framework and could be very valuable, both for learning about what's happening in the system and for actually making key decisions about, say, changing some of the factors, or predicting things and then making go/no-go decisions at certain key points in the process.
Let's begin here with the… This is the table that comes out of the Waters add-in. The one big thing I wanted to mention, my first main topic, is the rows: we've got 120 rows in this table. But notice that we've got a grouping variable, the bioreactor. These first 15 rows I just highlighted all come from the first bioreactor, and this is what's known in statistical parlance as a split-plot design, because of the DoE factors, which are here: air overlay rate, sparge rate, and buffer concentration.
They are constant over the whole bioreactor; they're set for the whole duration, and the 15 points correspond to 15 days. This could also be viewed from a time series perspective; that's what Ben was getting at with curves. But just from a sheer experimental design perspective, we've got to be careful about the design structure when we go to do statistical modeling.
In particular, JMP is really strong here, building on many years of tradition at SAS: we want to use what's known as a mixed model. This is a model that consists of both so-called fixed and random effects. We don't have time today to dive into deep mixed model theory, but the basic takeaway is that bioreactor is our mixed model random effect, our so-called whole plot in a split-plot scenario, also known as a hard-to-change factor in our DoE platforms.
The concept is that if we want to make conclusions, say causal inferences, about these DoE factors, it's critical to use a mixed model setup to make sure the statistical conclusions correctly reflect the experimental design. There's an easy pitfall here to fall into. Let's say you ignore that and just do a simple ANOVA on the DoE factors versus titer or whatever outcome you might be interested in, a naive one-way ANOVA, say, or maybe a factorial.
The inferences from that are actually statistically biased, because you haven't accounted for the fact that all these groups of measurements are correlated, or clustered, because they occur within the same bioreactor. The way to handle it appropriately is with the mixed model. Again, it's somewhat of a deep topic, but we'll just do a very simple one here today.
For further reference, there is a nice book called JMP for Mixed Models that I'd recommend if you really want to get into it. It's very practical, and once you start seeing this, mixed models tend to pop up everywhere, just because data and experiments tend not to be the classic independent observations that we assume in statistics 101. Usually there's some kind of clustering factor going on, as here with the bioreactor.
To set this up, let's just do a really basic mixed model in JMP. You can actually fit it in either JMP or JMP Pro; the Pro platform has a few more features, so let's go with it. You want to go into Fit Model. Let's model titer as our response, and then we've got our factors that we want to put in as our so-called fixed effects. Let's go ahead and add some factorial terms in there: we'll do all the main effects and then all the two-way interactions.
Then here's the key part. To switch into mixed model mode, you want to choose the Mixed Model personality. Notice that triggers these extra tabs. There are two kinds of mixed models: the one we're going to use today, where we use random effects, and a whole other class where we do repeated measures with different covariance structures. We won't do the latter type today, but you can see this is where some of the complexity comes in.
For now, though, let's just do arguably the simplest possible mixed model, which is appropriate for a split-plot setup like this: we take our bioreactor, which, notice, I've set to be a nominal variable, not continuous, and add it in as a random effect. We've got three fixed effects, our DoE factors, along with their interactions, and then this random effect. This is the setup that you want to use for any kind of statistical modeling of data like these.
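For reference, roughly the same split-plot model can be sketched outside JMP with statsmodels' MixedLM: a random intercept for bioreactor plus the DoE factors as fixed effects. Column names and data here are hypothetical, and note that statsmodels constrains the variance component to be non-negative, so it can't reproduce the negative estimate discussed next:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(5)

# Hypothetical long table: 8 bioreactors x 15 days, with the DoE factors
# constant within each bioreactor (the split-plot / whole-plot structure).
reactors = pd.DataFrame({
    "bioreactor":  [f"V{i}" for i in range(1, 9)],
    "air_overlay": [-1, -1, -1, -1, 1, 1, 1, 1],
    "sparge":      [-1, -1, 1, 1, -1, -1, 1, 1],
    "buffer":      [-1, 1, -1, 1, -1, 1, -1, 1],
})
df = reactors.loc[reactors.index.repeat(15)].reset_index(drop=True)
df["day"] = np.tile(np.arange(1, 16), 8)
df["titer"] = (0.05 * df["day"] + 0.10 * df["air_overlay"]
               + rng.normal(scale=0.05, size=len(df)))

# Mixed model: fixed DoE main effects, random intercept per bioreactor.
# (The talk adds all two-way interactions; with only eight reactors those
# consume nearly all the whole-plot degrees of freedom.)
model = smf.mixedlm("titer ~ air_overlay + sparge + buffer",
                    data=df, groups="bioreactor")
print(model.fit(reml=True).summary())
```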
Let's go ahead and run this. It runs quickly. Just to give you an idea of the kinds of things you'll see: at this point you can start to treat it like an analysis of variance, but you'll notice there are some extra pieces of output. For example, here we get a variance component estimate for the bioreactor, which in this case happens to be negative. It's somewhat unusual, but that actually indicates a bit of a competitive scenario within each bioreactor.
Maybe that makes sense, because there are nutrients being consumed, and if one goes up, the other one is likely to go down; it's a fixed, enclosed scenario. Typically, this will be positive if the measurements are positively correlated; it's a way to measure correlation within a bioreactor. Then typically the main focus will be on the DoE factors. Here we can start to look at tests, and in this case we're not seeing any significant differences for the titer, just for this simple model.
But this is just a starting point. You would likely want to add in extra covariates and fit a more complete model that accounts for the system. For today, I just wanted to make this key point about using a mixed model for cases like this, where you do have some split-plot setup or repeated measures within some blocking mechanism.
That's the first analysis I wanted to show. Now let's switch gears to more of a predictive modeling concept. If you notice, each of the little segments we've talked about addresses slightly different questions. That's the power and flexibility of JMP, and this is such a nice data set, with many different questions we can ask.
We're going to shift our question now from the DoE factors to more of a predictive idea, and let's set up a basic idea here. Let's say what if we try to predict? Let's assume we're running this experiment live, and we just completed day 13, and we want to predict what's going to happen tomorrow on day 14, but we only have data up through today. How would we set up a prediction problem like that?
The first key step in JMP is a little bit of data manipulation: we want to do a so-called split of the data, where we're now thinking of a time series within each bioreactor. By running Tables > Split, you can create the so-called wide table that looks like this, where we only have eight rows, one row per bioreactor, and all the measurements have been strung out into different columns.
You can see here are the titer ones, with one column per day, and similarly for all of the other variables that we measured. In time series terminology, these are known as lag variables. We now have this so-called wide table that's suited for the predictive modeling problem. With only eight rows, we have to be really careful here; it's very easy to overfit data like this.
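That Tables > Split step is equivalent to pivoting the daily long table so each bioreactor becomes one row and each day becomes a column. A minimal pandas sketch with hypothetical names:

```python
import pandas as pd

# Hypothetical long table: one row per bioreactor-day.
daily = pd.DataFrame({
    "bioreactor": ["V1", "V1", "V1", "V2", "V2", "V2"],
    "day":        [12, 13, 14, 12, 13, 14],
    "titer":      [1.00, 1.10, 1.20, 0.80, 0.90, 0.90],
})

# One row per bioreactor, one column per day (titer_12, titer_13, titer_14).
# In time series terms, the per-day columns serve as lag variables.
wide = daily.pivot(index="bioreactor", columns="day", values="titer")
wide.columns = [f"titer_{d}" for d in wide.columns]
print(wide.reset_index())
```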
Usually in practice you'll likely have a lot more data than this, but for this small example, let's just walk through a couple of quick things for illustrative purposes, to show how I would recommend doing the predictive modeling. There are many different ways of doing it, from classic time series methods to other predictive platforms in JMP. I want to show what I consider maybe the most modern and powerful way to go about prediction, which for me was born out of many years of doing Kaggle competitions; those tend to all be predictive in nature, and you learn some things the hard way going through all those competitions.
What I'm going to show you today I consider to be close, if not right on, to what might be considered best practices for prediction. The first thing you typically want to do is set up some folds. The idea is to do so-called K-fold cross-validation, where we take a fraction of the rows in our table and set them aside; we train a model on one portion of the data and then test it on the held-out set.
As you know, we've had this kind of thing in JMP for many years, although traditionally we've used what's known as a three-way split with training, validation, and test. Here we're just going to do a series of two-way splits, which we call training and validation. You've got to be careful about so-called leakage: whatever you hold out, you don't want the algorithm to ever look at it during training, because then you're going to risk biasing the results.
The idea now with K-fold is, let's take this variable here called fold A. We only have eight rows. This one happens to be set up to do five-fold, where we assign numbers randomly from 1 through 5; since there are only eight rows, there are a couple of places where only one row is held out, and in the other three folds we have two.
It might be a little cleaner for this case to do four-fold, where we would hold out two rows at a time, cycle through those, and fit the model four times. That's the basic training scenario that I would recommend even for small data sets like this, maybe even especially, because we have to guard very carefully and strongly against overfitting the data, especially because we're going to use what I consider the two most powerful methods: boosted trees, as implemented in the XGBoost add-in, and a modern deep neural network using the Torch Deep Learning add-in. Both of these add-ins are available in the JMP Marketplace.
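Creating a fold column like fold A is a few lines outside JMP as well. A sketch with scikit-learn's KFold for eight rows; with five folds, some folds hold out one row and others two, exactly as described:

```python
import numpy as np
from sklearn.model_selection import KFold

n_rows = 8                              # one row per bioreactor
folds = np.empty(n_rows, dtype=int)

# Assign each row a random fold number 1..5. With eight rows, some folds
# hold out a single row; use n_splits=4 to hold out two rows per fold.
kf = KFold(n_splits=5, shuffle=True, random_state=42)
for fold_id, (_, val_idx) in enumerate(kf.split(np.arange(n_rows)), start=1):
    folds[val_idx] = fold_id

print(folds)                            # a "fold A"-style column, values 1-5
```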
Let's take a look at XGBoost first. Once you install the add-in, it becomes an option on your menu, and you can use its Make K-Fold Columns utility to create folds. I've already done that, so let's just move directly into the platform.
Here the setup is kind of like before, but now we're thinking about predicting. I've got grouping variables here, so let's expand these out. Let's say we want to predict titer 14, which I've got here as our Y, and we've got a lot of choices now for what to use as inputs. For sure, let's put our DoE factors in there; then, as a very basic lag-style model, let's just take the previous values of titer, of which we've got 13, and put those in as our Xs. Now, there are literally over 1,000 more potential predictors here, but again, for the sake of illustration, let's keep it simple.
If you had more data, you could definitely include these in your model. But here, with only eight rows, we're just asking for trouble if we start to blow this out too much. Typically, these lag variables tend to be very powerful predictors, as you might guess: if you know what happened today, tomorrow is likely going to be similar, and if you look at the prior trajectory, those are typically going to be very good predictors. We do want to see how they stack up against our DoE factors as well. Then let's put in some kind of cross-validation variable; let's use our fold A variable there. We'll do five-fold. Again, you might want to do four-fold; that doesn't matter too much.
You've got to be a little careful with tiny data like this, but let's not get too bogged down with that. Then here's the basic setup for XGBoost. There are many hyperparameters that you can tune; we don't have time for that today, so let's just run the default model, which often works pretty well, and see what happens. The model was actually just run five times.
Look what happened here. This is the so-called validation loss. Notice it went down and then back up, so this is a warning sign that we're starting to overfit the data after iteration number five or six. You would not want to use this model in practice; you would want to stop it earlier. We can just run it one more time and stop after five iterations; that's one of the simplest things to do.
Another idea, if you want to get fancier, is that you could regularize the model a little more or do other tuning to get it right. But the idea now is, again, we are predicting; our goal is to predict that day 14. Notice that for the training data we actually fit the data perfectly, but that's maybe not so surprising, as there are only a small number of rows. For the validation set, though, these are the so-called out-of-fold predictions.
These are what indicate how the model will likely do on future data. You can see the model is starting to get there. It's certainly far from perfect, only an R² of 0.3 in this case, a correlation of 0.65. The model is picking up a little lift, I'd say, and I think with some tuning you could do a little bit better. But again, with only eight rows, you've got to be very careful; I'm showing this mostly for illustration. You typically would want more data to really get a good model that you want to use later.
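To make the early-stopping idea concrete, here's an assumed sketch with the open-source xgboost package: K-fold training on a tiny made-up matrix, stopping each fold's boosting when the held-out loss turns back up, and scoring the out-of-fold predictions. This mirrors the workflow described, not the add-in's exact settings:

```python
import numpy as np
import xgboost as xgb
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)

# Hypothetical tiny design matrix: 3 DoE factors plus 13 lagged titers,
# predicting day-14 titer. Real use would have many more rows than this.
X = rng.normal(size=(8, 16))
y = 0.5 * X[:, 0] + X[:, 15] + rng.normal(scale=0.1, size=8)

oof = np.empty(8)
params = {"objective": "reg:squarederror", "max_depth": 2, "eta": 0.3}
for train_idx, val_idx in KFold(n_splits=4, shuffle=True,
                                random_state=1).split(X):
    dtrain = xgb.DMatrix(X[train_idx], label=y[train_idx])
    dval = xgb.DMatrix(X[val_idx], label=y[val_idx])
    # Stop boosting when the held-out loss turns back up, instead of
    # letting the model overfit the training rows.
    booster = xgb.train(params, dtrain, num_boost_round=50,
                        evals=[(dval, "val")], early_stopping_rounds=5,
                        verbose_eval=False)
    oof[val_idx] = booster.predict(dval)

print("out-of-fold R^2:", r2_score(y, oof))
```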
Then XGBoost comes with some really nice extra features to help you interpret the model, for example, these variable importances. We can see here that our air overlay rate is coming in as one of our best predictors, even over lag values of the titer itself. That's kind of interesting, so we can actually start to make some indirect inferences about our DoE factors. But again, our focus here is on prediction, not on direct statistical inference.
Now, as an alternative style of model, let's do a neural network using the Torch add-in. The setup is exactly the same: you install the add-in, launch the platform, and we're going to put literally the same variables in each slot. We want to predict day 14 using our DoE factors and our lag titers, with fold A; again, we're going to do five-fold. Let's just run the default multilayer perceptron model, which is a two-layer model. We can see this model actually is not doing too well at all; in fact, it's got a negative R². Out of the box, XGBoost is doing a little bit better, I'd say.
But what I found is that with a little bit of tuning of each, they often have similar performance, and they also tend to complement each other quite well. If you really want your absolute best predictions, one trick is to ensemble the two models together, which you can do by saving the out-of-fold predictions and averaging them. That's a classic trick you see in almost every Kaggle competition; that's what folks tend to do to really squeeze out maximal performance.
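That ensembling trick is literally an average of the saved out-of-fold prediction columns. A toy sketch with made-up numbers whose errors partly cancel:

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical saved out-of-fold predictions from the two models.
y_true      = np.array([1.0, 1.2, 0.9, 1.4, 1.1, 0.8, 1.3, 1.0])
oof_xgboost = np.array([1.1, 1.1, 1.0, 1.3, 1.0, 0.9, 1.2, 1.1])
oof_torch   = np.array([0.9, 1.3, 0.8, 1.5, 1.2, 0.7, 1.4, 1.0])

# Classic Kaggle-style ensemble: average the out-of-fold predictions.
oof_ensemble = (oof_xgboost + oof_torch) / 2.0

for name, oof in [("xgboost", oof_xgboost), ("torch", oof_torch),
                  ("ensemble", oof_ensemble)]:
    print(name, round(r2_score(y_true, oof), 3))
```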
Anyway, to me, this is a very powerful, flexible approach that you can adopt in JMP Pro. You can add more predictors, you can fit these models very quickly, and it really helps answer very practical questions. Say you're near the end of a batch and want to decide: do I go another day or not? Why not run this really quickly, see what things are looking like for the next day in terms of a forecast, and then make a decision on that, rather than just waiting or using old-fashioned rules? I think the power and flexibility of these kinds of modern predictive models is really compelling.
Thanks so much for listening, and we hope you can get a lot of meaning out of this example. We've thoroughly enjoyed collaborating on it; all of us come from somewhat different backgrounds. It's been a really fun project, and we see so much potential here for different things that we can do moving forward. Please feel free to reach out with questions on anything that we've shown today; we're happy to talk with you further and continue the dialog.