Data Driven Selection as a First Step for a Fast and Future Proof Process Development (2023-EU-30MP-1260)

Egon Gross, Research & Technology Manager, Symrise AG

Bill Worley, JMP Senior Global Enablement Engineer, JMP
Peter Hersh, Senior Systems Engineer, JMP

This talk will focus on how JMP® helped drastically reduce the cultivation experimentation workload and improved response from four up to 30-fold, depending on the target. This was accomplished by screening potential media components, generally the first, and sometimes tedious, step in fermentation optimizations. Taking characteristic properties such as the chemical composition of complex components like yeast extracts enables flexibility in the case of future changes. We describe the procedure for reducing the workload using FT-MIR spectral data based on a DSD setup of 27 media components. After several standard chemometric manipulations, enabled by different Add-ins in JMP® 16, the workload for cultivation experiments could be drastically reduced. In the end, important targets could be improved up to approximately 30-fold as a starting point for subsequent process optimizations. As JMP® 17 was released in the fall of 2022, the elaborate procedure in version 16 will be compared to the integrated features. It might give space for more inspiration – for developers and users alike.

Hello everyone, nice to meet you. I'm Egon Gross from Symrise, Germany. By profession, I'm a biotechnologist, and I'm looking forward to giving my presentation to you.

Hello everyone. I'm Bill Worley. I am a JMP systems engineer working out of the chemical territory in the central region of the US.

Hi. My name is Peter Hersh. I'm part of the Global Technical Enablement Team, working for JMP out of Denver, Colorado.

Welcome to our presentation, Data-Driven Selection as a First Step for a Fast and Future-Proof Process Development. First I want to introduce the company I'm working for. We are located in Holzminden, more or less in the center of Germany, and there are two sites coming from our history.

Globally, our headquarters are in Holzminden. We have big subsidiaries in Peterborough, in Sao Paulo, and in Singapore, and there are quite a lot of facilities in France, also due to our history.

Coming to the history, Symrise was created in 2003 out of a merger of Haarmann & Reimer, which was founded in 1874, and Dragoco, which is the other site of our facility and was established in 1919. Over the years there have been quite a few acquisitions, including in 2014 the acquisition of Diana, which is mainly located in France; that's the reason why there are so many different research and development hubs there.

Our main products come from agricultural products or from chemical processes, and there are quite a lot of diverse production capacities and production possibilities for our main customers, be they human or pet. As it is so diverse, we are dealing with food for human consumption, for pet consumption, and also with health benefits.

On the other side, the segment Scent and Care is dealing with fragrances, from fine fragrances to household care to laundry, whatever thing you can imagine that smells nice.

As I said in the beginning, I'm a biotechnologist by training, and I'm dealing a lot with fermentation processes to optimize them and to scale them up or down. One major issue when it comes to fermentation is the broth, the liquid composition of the media, which will then feed the organisms. No matter which organisms they are, they need carbon sources, they need nitrogen sources, they need minor salts, major salts, pH values, and other things.

It is often important which kind of media one has. When it comes to media composition, there are two big branches. One is the branch of synthetic media, where all components are known in their exact amount and composition. The other is complex media, for example, having a yeast extract or a protein extract or whatever, where it's a complex mixture of different substrates, different chemical substances. A third approach would be a mixture of both.

One of the advantages of these complex media is that they are quite easy to work with. But on the other hand, their composition can change over time, as some vendors tend to optimize their processes and their products for whatever region or whatever target. Some customers learn of those changes, some don't.

Another issue might be the availability if it's a natural product like [inaudible 00:04:38] or whatever. You might know some ingredients, but you will surely not know all ingredients, and there might be promoting or inhibiting substances within those mixtures.

At the beginning of a process development, the media is of main importance. Therefore I tried to look at carbon sources, nitrogen sources, salts, trace elements, and so on, as my different raw materials. While growing the organisms, one has to take care of different temperatures, stirring velocities to get oxygen into the liquid, and cultivation time, and there are a lot of unknown variables when trying to get an idea of what the effect might be on the cell dry weight, for example, or on the different target compounds one has in mind.

For this setup, I then used the definitive screening design. As most of you know, these designs are quite balanced and have a particular shape, which is reflected in this three-dimensional plot. You can see the definitive screening design is somehow walking around certain edges and has a center point. Due to the construction of the definitive screening design, one can estimate interactions and quadratic effects. These interactions are not confounded with the main factors, and the main factors themselves are also not confounded with each other. This is a very nice feature of definitive screening designs, and therefore they are very efficient when it comes to the workload compared to earlier screening designs.
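
The structure described above can be checked on a small case. The following is a minimal sketch, not the 27-component design from the talk: it assumes a 6-factor definitive screening design built from a standard order-6 conference matrix, stacked with its mirror image and one center run, and verifies that main effects are orthogonal to each other and to all second-order terms.

```python
# Illustrative 6-factor definitive screening design (not the design from the talk).
# C is an order-6 conference matrix: zero diagonal, +/-1 elsewhere, orthogonal columns.
C = [
    [0, 1, 1, 1, 1, 1],
    [1, 0, 1, -1, -1, 1],
    [1, 1, 0, 1, -1, -1],
    [1, -1, 1, 0, 1, -1],
    [1, -1, -1, 1, 0, 1],
    [1, 1, -1, -1, 1, 0],
]

def build_dsd(conference):
    """Stack the matrix, its mirror image (foldover), and one center run."""
    mirrored = [[-v for v in row] for row in conference]
    center = [[0] * len(conference[0])]
    return conference + mirrored + center

def column(design, j):
    return [row[j] for row in design]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

design = build_dsd(C)
k = len(design[0])

# Main effects against each other.
main_vs_main = [dot(column(design, i), column(design, j))
                for i in range(k) for j in range(i + 1, k)]

# Main effects against every two-factor interaction (j < l) and quadratic term (j == l).
main_vs_second_order = []
for i in range(k):
    for j in range(k):
        for l in range(j, k):
            second = [a * b for a, b in zip(column(design, j), column(design, l))]
            main_vs_second_order.append(dot(column(design, i), second))

print(len(design), "runs for", k, "factors")
print(all(v == 0 for v in main_vs_main), all(v == 0 for v in main_vs_second_order))
```

The foldover is what removes the confounding: every run appears with its sign-flipped twin, so any product of three factor columns sums to zero over the design.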

There are also some disadvantages. One big disadvantage is that if about 50% of the factors or more are active, meaning they have a high influence, you get significant confounding, which you have to take care of. In this particular case, although it's the leanest possible design I found, the practical workload would require five to six months just for screening. This is far too long when it comes to a master's thesis.

The alternative was then to build another design or to build another process. I was inspired in Copenhagen in 2019 by a contribution from Bill where he talked about infrared spectroscopy, and I thought, well, that might be a good idea: using the chemical information hidden in a near-infrared spectrum to describe the chemical composition of the mixtures.

Therefore I established this workflow. First, the media preparation was done for all 65 mixtures. Then the near-infrared spectrum was measured, and some chemometric treatments were performed. Afterwards, the space of variation could be held constant at a maximum, but the number of experiments could be reduced quite significantly.

To show you how the workflow goes, I started, as I said, with spectral information. One of the first things one has to do is to standardize the spectra to avoid baseline shifts and things like that. This is one way to do it: introducing a new formula to standardize. Or, what I did, I used an add-in to preprocess and calculate the standard normal variate, which, when it comes to the digits, is the same as the standardization, as we see here.
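
As a minimal sketch of the standardization step, the standard normal variate centers each spectrum on its own mean and divides by its own standard deviation. The absorbance values below are made up, and the use of the sample standard deviation (n - 1) is an assumption; the add-in used in the talk may differ in that detail.

```python
from statistics import mean, stdev

def snv(spectrum):
    """Standard normal variate: center one spectrum on its own mean
    and scale it by its own (sample) standard deviation."""
    m = mean(spectrum)
    s = stdev(spectrum)
    return [(a - m) / s for a in spectrum]

raw = [0.52, 0.61, 0.75, 0.70, 0.58]   # made-up absorbances
shifted = [a + 0.30 for a in raw]      # same spectrum with a baseline offset
scaled = [2.0 * a for a in raw]        # same spectrum with a multiplicative effect

snv_raw = snv(raw)
snv_shifted = snv(shifted)
snv_scaled = snv(scaled)
print([round(v, 3) for v in snv_raw])
```

All three inputs give the same standardized spectrum, which is why the step removes baseline shifts and scaling differences between measurements.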

With these standardized spectra, one for each measurement, I then continued and first compiled all the spectra. What you see here on the top is the absorption of every sample. We have an aqueous system, so we took water as a reference. After taking the difference between the absorption and the water, we then looked deeper and saw differences within the spectra.

One of the big questions was: do I first calculate the difference between the absorption of the sample and the water and then calculate the standard normal variate? Or do I first calculate the standardization and then subtract the standardized water background from the standardized values?

One could think the procedure is the same, but the outcome is different. As you see here, on the right-hand side of the dashboard, I zoomed into this area, and in the lower part, the curves have a different shape and a different distance from each other than in the upper part. This might then have an influence on the subsequent regression analysis. Therefore, I chose to do the standardizing first and then the difference calculations.
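
A small numeric sketch shows why the order matters. The two spectra below are made-up values, not data from the talk, and `snv` again assumes the sample standard deviation.

```python
from statistics import mean, stdev

def snv(spectrum):
    m, s = mean(spectrum), stdev(spectrum)
    return [(a - m) / s for a in spectrum]

def subtract(a, b):
    return [x - y for x, y in zip(a, b)]

water = [1.00, 1.20, 1.50, 1.30, 1.10]    # made-up reference spectrum
sample = [1.05, 1.30, 1.58, 1.42, 1.12]   # made-up sample spectrum

# Order chosen in the talk: standardize both spectra, then take the difference.
standardize_then_subtract = subtract(snv(sample), snv(water))

# Alternative order: take the difference first, then standardize it.
subtract_then_standardize = snv(subtract(sample, water))

largest_gap = max(abs(a - b) for a, b in
                  zip(standardize_then_subtract, subtract_then_standardize))
print(round(largest_gap, 3))
```

Standardizing the difference forces its standard deviation to one, so small sample-to-water differences get stretched to the same scale as large ones. Standardizing first keeps the size of the difference, which is one plausible reason the curves in the two panels sit at different distances from each other.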

After I did these first steps, then came the chemometric part, that is, the smoothing, the filtering, and the calculation of the derivatives. This is a standard procedure using an add-in which is available. You can imagine that the signal might have some noise. This is seen here in the lower part: the red area is the real data, and the blue curves are the smoothed data. On the upper left side, you see the first derivative. On the upper right side, the second derivative of these functions. When it comes to polynomial fits, the result depends on the procedure: what you are fitting, what the polynomial order is, and how broad the window is in which you make the calculations.
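
The smoothing and derivative step can be sketched as a local polynomial fit in a moving window, which is the idea behind Savitzky-Golay filtering. This is a plain-Python illustration under simple assumptions (evenly spaced points, edge points dropped), not the add-in's implementation; `half_width` corresponds to the number of points to the left and to the right of the center, and `order` to the polynomial order.

```python
from math import factorial

def solve(matrix, rhs):
    """Solve a small linear system by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    aug = [list(map(float, row)) + [float(b)] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def local_poly_fit(window, order):
    """Least-squares polynomial coefficients for points at offsets -m..m."""
    m = len(window) // 2
    offsets = range(-m, m + 1)
    # Normal equations (A^T A) c = A^T y with A[i][p] = offset_i ** p.
    ata = [[sum(t ** (p + q) for t in offsets) for q in range(order + 1)]
           for p in range(order + 1)]
    aty = [sum((t ** p) * y for t, y in zip(offsets, window))
           for p in range(order + 1)]
    return solve(ata, aty)

def savgol(signal, half_width, order, deriv=0):
    """Smoothed values (deriv=0) or derivatives at the interior points.
    Derivatives are per index step; edge points are simply dropped."""
    out = []
    for i in range(half_width, len(signal) - half_width):
        coeffs = local_poly_fit(signal[i - half_width:i + half_width + 1], order)
        out.append(factorial(deriv) * coeffs[deriv])
    return out

signal = [0.5 * x * x - 2.0 * x + 1.0 for x in range(15)]   # noise-free quadratic
smoothed = savgol(signal, half_width=3, order=2)
first_derivative = savgol(signal, half_width=3, order=2, deriv=1)
print(smoothed[:3], first_derivative[:3])
```

A second-order fit reproduces a noise-free quadratic exactly, so any change in a real spectrum comes from the chosen order and window width, which is why several combinations were tried in the talk.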

If we take here only a second-order polynomial, you see that it might change. Now, this is not a two, this ought to be a 20. Then the curve smooths out. Although it's smooth, you can see differences in height and in shape. To get hold of those data, one has to save the smoothed data to separate data tables. Then I tried different settings for the smoothing process, because I did not know from the beginning which one would best fit my desired outcome of the experiment at the end.

After creating quite a lot of smoothing tables, which was done manually, I then concatenated the tables. These are all the tables we just made. I'm going to the first one and say, please concatenate all of the others. The nice thing is that you then have, at the end, these different distances coming from the smoothing effect. I had a second polynomial order, a third polynomial order with 20 points to the left and to the right for the smoothing process, and 30, and so on.

This is just a small example to show you the procedure. I did quite a lot more. What I did was this number of treatments. I had [inaudible 00:15:01] for a second, third, or fifth polynomial order with 10, 20, or 30 points. Now came the big part: to decide which particular procedure represents my space best. For this, I made a principal component analysis of all the treatments I did.

This is a general overview. The numbers represent each individual experiment, so that you can follow them in the different score and loading plots. The loading plot is… That's a regular picture of a loading plot when it comes to spectral data. If you take into account that we are coming from a design, this value of 24% explained variation for the first component is very high.

Why? Because the factors of the definitive screening designs are orthogonal to each other and independent from each other. One would expect lower values for the principal components. After this treatment, the first derivative with second order polynomial and 10 points to the left and to the right for the smoothing, it looks very evenly distributed. You might think of a cluster here on top or below.

I went through all of these particular treatments and then selected a favored one, where I saw that the first principal component has a very low explanatory power for the variation. That's then the way I proceeded.
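
The selection criterion above, how much of the total variation the first principal component explains, can be sketched with a short power iteration. This is an illustration on toy data, assuming a PCA on the covariance matrix; JMP's platform and its defaults may differ.

```python
def first_pc_share(rows, iterations=500):
    """Share of the total variance carried by the first principal component,
    found by power iteration on the covariance matrix."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(p)]
           for a in range(p)]
    total = sum(cov[j][j] for j in range(p))
    v = [1.0 / (j + 1) for j in range(p)]   # arbitrary non-zero start vector
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(p)) for a in range(p)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    eigenvalue = sum(v[a] * sum(cov[a][b] * v[b] for b in range(p))
                     for a in range(p))
    return eigenvalue / total

# Toy data: perfectly correlated columns versus independent, balanced columns.
correlated = [[float(t), 2.0 * t, -1.0 * t] for t in range(6)]
balanced = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]

share_correlated = first_pc_share(correlated)
share_balanced = first_pc_share(balanced)
print(round(share_correlated, 3), round(share_balanced, 3))
```

With fully correlated columns the first component carries all of the variation, while with independent, balanced columns it carries only an equal share. A preprocessing treatment that keeps the first component's share low is therefore closer to what one expects from an orthogonal design.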

After selecting the particular preprocessed data, I still have my 65 samples. But as we heard at the beginning, 65 is far too many for the workload. If you ask yourself why there are 132 samples, that is because I copy-pasted the design below the original design for the spectral treatment I then used.

If you then want to select the runs you are able to make, due to time reasons or cost reasons or whatever, this is one process you can use: the coverage and the principal components. This is the design space which is available, accounting for all the variation inside it. But as you see, we would need to make 132 experiments. If we then just select all the principal components and say, please make only the ones that are possible, you have the ability to type in any number you want.

At this stage, I selected several smaller or bigger designs and looked at how far down I could go while still reaching at least a good descriptive power. I settled on these 25 experiments and let JMP select them. The nice thing with this procedure is that when you come back to your data table, they are selected. But I didn't use this procedure right at the beginning. At the beginning, I made a manual selection.

How did I do that? I took the score plot of the particular treatment and then manually selected the outer points as well as possible. Not only in the picture of the first and second principal component, but I went deeper. This, for example, is the comparison of the selection method I just showed you, the DOE with the constrained factors, with the manual selection, just to show you some of the differences.
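
Picking the outer points in score space can also be automated. The sketch below uses a greedy maximin rule in the style of the Kennard-Stone algorithm; this is a swapped-in technique for illustration, not the coverage-based selection JMP performs and not the manual selection from the talk, and the score values are made up.

```python
from itertools import combinations

def dist2(a, b):
    """Squared Euclidean distance between two score vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kennard_stone(scores, n_select):
    """Greedy maximin subset: start from the two most distant points, then
    repeatedly add the candidate that is farthest from its nearest selected point."""
    first = max(combinations(range(len(scores)), 2),
                key=lambda pair: dist2(scores[pair[0]], scores[pair[1]]))
    selected = list(first)
    while len(selected) < n_select:
        remaining = [i for i in range(len(scores)) if i not in selected]
        nxt = max(remaining,
                  key=lambda i: min(dist2(scores[i], scores[s]) for s in selected))
        selected.append(nxt)
    return selected

# Made-up scores on two principal components for six candidate runs.
scores = [(0, 0), (1, 0), (2, 0), (10, 0), (5, 0), (5, 5)]
chosen = kennard_stone(scores, 4)
print(chosen)
```

Unlike the DOE-based selection, which can return different runs each time it is repeated, this rule is deterministic, and it favors the edges of the score cloud in the same spirit as the manual selection.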

If you run this DOE selection several times, don't be confused if you don't always get the same numbers, the same experiments, which might be important. With this approach, I then reduced the workload from 64 experiments to 25 experiments. All of the raw materials I had from the beginning were included in these experiments. I didn't leave any raw material out, and it was very nice to see that I could retain the space of the variation.

After the cultivation in two blocks, which took a time frame of three weeks for each block, we then analyzed our metabolome and the supernatant and determined our cell dry mass. For time's sake, I'll show you only the results and the procedure for the cell dry mass. For other molecules, the same procedure can be followed.

The next issue I had was that there is confounding. I had to expect the confounding because I had only 25 experiments for 27 mixtures, coming out of a design where I expected to have interactions and quadratic effects. These interactions are nothing new when it comes to media composition. Quadratic effects were nice to see.

Then came the next nice thing, which was introduced by Pete Hersh and Phil K. It's the SVEM process, the Self-Validated Ensemble Model. In this sketch, you see the workflow, and we will go through that in JMP. The first thing was to look at the distribution of my target value. After making a log transformation, I then saw that it's normally distributed. So we have a log-normal distribution. That's nice to know.

The first thing was to download this add-in, Autovalidation Setup, and hit the run button. We then get a new table. The new table has 50 rows instead of the 25 rows of our original table. Why is that so? The reason is that when hitting the button, the data table gets copy-pasted below itself, and we get a differentiation into the validation set and the training set, as you see here. The nice feature of this autovalidation table is that you can find out, through simulation, which parameters, which factors, have an influence.

This happens through the paired fractionally weighted bootstrap weights. If you look, for example, the second experiment has a value of 1.8 in the training set, and the same sample has a value of 0.17 in the validation set. This gives one the ability to have a bigger weight for some samples in the training set and, vice versa, a smaller one in the validation set. When they have a bigger weight in the training set, they have a lower weight in the validation set.
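
The paired weights can be sketched as follows. This assumes the weighting scheme described in the published SVEM literature as I understand it, where a single uniform draw `u` per run gives a training weight of `-ln(u)` and a validation weight of `-ln(1 - u)`; the add-in's exact implementation is not shown in the talk.

```python
import math
import random

def paired_weights(n_runs, rng):
    """Anti-correlated fractional weights: one uniform draw per run feeds
    both the training copy and the validation copy of that run."""
    pairs = []
    for _ in range(n_runs):
        u = rng.random()
        while u == 0.0:            # guard against log(0)
            u = rng.random()
        pairs.append((-math.log(u), -math.log(1.0 - u)))
    return pairs

def validation_weight_from_training(train_weight):
    """Validation weight implied by a given training weight under this scheme."""
    u = math.exp(-train_weight)
    return -math.log(1.0 - u)

rng = random.Random(2023)          # fixed seed only to make the sketch repeatable
weights = paired_weights(25, rng)  # 25 runs, as in the reduced design
implied = validation_weight_from_training(1.8)
print(round(implied, 2))
```

Under this assumption a training weight of 1.8 implies a validation weight of roughly 0.18, which is close to the pair of values read off the table in the talk.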

To analyze this, it's necessary to have JMP Pro to run a generalized regression. As we took the log value of our cell dry weight, I can then select a normal distribution, and then it's recommended to run a lasso regression. From the first lasso regression, we get a table of the estimates, and now comes the nice part. We run simulations, changing the paired fractionally weighted bootstrap weights each time.

For time's sake, I'm just making 50 simulations. From these 50 simulations, we then get, for each factor we had in the design, the proportion of simulations in which it entered the regression equation or didn't enter the regression equation. This pattern is due to the randomization of the bootstrap weights. From this distribution we go to the summary statistics and customize them; we are only interested in the proportion nonzero. This proportion nonzero is, finally, the share of the 50 simulations in which this particular variable went into the regression equation.

From this, we make a combined data table and have a look at the percentage of each variable being in a model or not being in a model. This looks a little bit confusing. If we order it by column two, descending, we then see a nice pattern.

Now you can imagine why I introduced at the beginning this null factor and these random uniform factors. The uniform factors were introduced manually. The null factor was introduced by running the Autovalidation Setup add-in. What do these points mean? These points mean that, down to the null factor, these variables have a high potential because they were quite often within the model-building processes. These at the bottom were quite seldom within the model-building processes, so you can reduce your complexity by just discarding these. Here in the middle, one has to decide what to do.
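
The tally behind this table can be sketched in a few lines. The coefficient values and the term names `Glucose` and `Peptone` are made up for illustration; only the logic follows the talk: count how often each term has a nonzero estimate across the simulations and compare that with the null factor.

```python
def proportion_nonzero(simulated_estimates):
    """simulated_estimates: one dict per simulation, mapping term -> coefficient.
    Returns the share of simulations in which each term entered the model."""
    terms = simulated_estimates[0].keys()
    n = len(simulated_estimates)
    return {t: sum(1 for est in simulated_estimates if est[t] != 0.0) / n
            for t in terms}

def split_by_null(proportions, null_name="Null Factor"):
    """Terms that entered more often than the null factor, and the rest,
    each sorted by descending proportion."""
    cutoff = proportions[null_name]
    ranked = sorted(proportions.items(), key=lambda kv: kv[1], reverse=True)
    keep = [t for t, p in ranked if p > cutoff and t != null_name]
    drop = [t for t, p in ranked if p <= cutoff and t != null_name]
    return keep, drop

# Four made-up lasso fits; a coefficient of 0.0 means the term was not selected.
simulations = [
    {"Glucose": 0.8, "Peptone": 0.0, "Null Factor": 0.0},
    {"Glucose": 0.5, "Peptone": 0.0, "Null Factor": 0.1},
    {"Glucose": 0.7, "Peptone": 0.2, "Null Factor": 0.0},
    {"Glucose": 0.0, "Peptone": 0.0, "Null Factor": -0.3},
]

proportions = proportion_nonzero(simulations)
keep, drop = split_by_null(proportions)
print(proportions, keep, drop)
```

A hard cut at the null factor is a simplification. As noted in the talk, terms that land close to the null factor or the random uniform factors still need a judgment call.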

After this extraction, without losing information and without losing variation, one can then think of different regression processes: making a response surface model, or stepwise regression, or whatever regression you have in mind. It's wise to compare different regression models, looking at what's feasible and what's meaningful. That was the procedure I used in JMP 16. Now I hand over to Pete and Bill, who will describe something else.

Thank you, Egon. That was a great overview of your workflow. Thank you. What's new in JMP 17 that might have helped Egon a little bit with the tools he was working with? I'm going to start off with a little bit of a slide show here. I'm going to be talking about Functional Data Explorer, which is in JMP Pro, and about the preprocessing and Wavelet modeling that are built into Functional Data Explorer now.

All right, so let me slide this up a little bit so you can see. What's new in JMP 17? We've added some tools that allow for a better chemometric analysis of spectral data. Really any multivariate data that you might have that you can think of, these tools are there to help. First is adding the preprocessing methods that are built into FDE now.

We've got standard normal variate, which Egon showed you. We've got multiplicative scatter correction, which is a little bit more powerful than the standard normal variate. Both of these will not disrupt the character of your spectra. That's not the story with Savitzky-Golay. It does alter the spectra, which will then make it a little bit harder to interpret the data. The key thing is it still helps. Then we have something called polynomial baseline correction, which is another added tool if you need that.
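
For comparison with the standard normal variate, multiplicative scatter correction can be sketched as a per-spectrum regression on a reference spectrum. This is a generic textbook formulation with made-up numbers, not the Functional Data Explorer implementation; using the mean spectrum as the default reference is an assumption.

```python
def msc(spectra, reference=None):
    """Multiplicative scatter correction: regress each spectrum on a reference
    (default: the mean spectrum) and remove the fitted offset and slope."""
    n_points = len(spectra[0])
    if reference is None:
        reference = [sum(s[j] for s in spectra) / len(spectra)
                     for j in range(n_points)]
    ref_mean = sum(reference) / n_points
    ref_ss = sum((r - ref_mean) ** 2 for r in reference)
    corrected = []
    for s in spectra:
        s_mean = sum(s) / n_points
        slope = sum((r - ref_mean) * (x - s_mean)
                    for r, x in zip(reference, s)) / ref_ss
        intercept = s_mean - slope * ref_mean
        corrected.append([(x - intercept) / slope for x in s])
    return corrected

reference = [0.2, 0.5, 0.9, 0.4]               # made-up reference spectrum
distorted = [[2.0 * r + 3.0 for r in reference],  # scaled and offset copies
             [0.5 * r - 1.0 for r in reference]]
restored = msc(distorted, reference)
print([[round(v, 3) for v in row] for row in restored])
```

Both distorted copies map back onto the reference, which illustrates why the correction removes additive and multiplicative scatter effects without changing the shape of the spectrum.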

The next step would then be to save that preprocessed data for further analysis, like principal component analysis, partial least squares, and so on and so forth, so you can do some analysis there.

The Wavelet modeling is a way to look at the chemometric data similar to principal component analysis. We're fitting a model to the data to determine which is the best overall fit for, in this case, 25 spectra. That's the goal here. It's an alternative to spline models. It's typically better than spline models, but not always. You get to model the whole spectrum, not point by point, which you would do with other analysis types.

Then you get to discern these things called shape functions that make up the curve. These shape functions are, again, similar to principal component analysis in that they are helping with dimension reduction. Then, as I said before, these are excellent tools for spectral and chromatographic data, but virtually any multivariate data is fair game.

These are just an example of the Wavelet functions that are built in. I could try and pronounce some of these names, but I'll mess them up, but know that these are in there. There is a site here that you can look up what these Wavelets are all about. I got the slide from Ryan Parker so thank you, Ryan.

Almost last but not least, what we're doing with this functional principal component analysis is trying to determine, again, what's the best overall fit for these data and then compare the curves as needed. What comes out of the Wavelet modeling is a Wavelet DOE, and we determine which wavelets have the highest energy for any given spectrum or whatever we're looking at.
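
The idea of wavelet coefficients and their energy can be illustrated with the simplest wavelet, the Haar wavelet. The sketch below is a plain orthonormal Haar decomposition on a made-up step signal; it is meant to show what "energy" refers to, not how Functional Data Explorer fits or selects its wavelet models.

```python
import math

def haar_transform(signal):
    """Full orthonormal Haar decomposition of a signal whose length is a
    power of two. Returns [approximation, coarsest detail, ..., finest details]."""
    approx = list(signal)
    details = []
    while len(approx) > 1:
        pairs = list(zip(approx[0::2], approx[1::2]))
        details = [(a - b) / math.sqrt(2) for a, b in pairs] + details
        approx = [(a + b) / math.sqrt(2) for a, b in pairs]
    return approx + details

def energy_share(coefficients):
    """Fraction of the total squared magnitude carried by each coefficient."""
    total = sum(c * c for c in coefficients)
    return [c * c / total for c in coefficients]

step = [4.0, 4.0, 4.0, 4.0, 0.0, 0.0, 0.0, 0.0]   # made-up step-shaped signal
coefficients = haar_transform(step)
shares = energy_share(coefficients)
print([round(s, 3) for s in shares])
```

For this step signal, two coefficients carry all of the energy, so the curve can be described with far fewer numbers than it has points. That is the dimension-reduction role the high-energy wavelet coefficients play when they are used as inputs to a later model.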

These Wavelet coefficients can then be used to build a classification or quantification model. That's up to you. It depends on the data and what supplemental variables you have built in. In this case, this is a different example where I was looking at percent active based on some near IR spectra.

Let's get into a quick example. All right. This is Egon's data. I've taken the data that was in the original table, this absorption minus the water spectra, and I've transposed that into a new data table where I've run Functional Data Explorer. I'm just going to open up the analysis here. It does take a little bit to run, but this is the example that I wanted to show.

We've done the preprocessing beforehand. We've applied the multiplicative scatter correction in this case and then the standard normal variate, and then built the model off of that. After these preprocessing tools, which are found over here, I'm going to save that data out, and then that data is going to be used for further analysis as needed.

To build on the story here, we've got the analysis done. We built the Wavelet model. After we've gotten the mean function and the standard deviation for all of our models, we build that Wavelet model and we get the fit that you see here. What this is telling us is that the Haar Wavelet is the best overall based on the lowest Bayesian Information Criterion score. Now we can come down here and look at the overall Wavelet functions, the shape functions, and get an idea of which Wavelets have the highest energy and which shape functions are explaining the most variation that you're seeing between curves. You can also reduce the model or increase the model with the number of components that you select here.

One thing that comes out of this is a Score Plot, which allows you to see groupings of different, in this case, spectra. One that you're going to see down here is this. This could be a potential outlier. It's different from the rest. If you hover over the data point, you can actually see that spectrum. You can pin it to the graph, pull that out, and then let's just pick another blue one here, and we'll see if we can see where the differences are.

It looks like it might be at the beginning. If we look at this right here, that's a big difference, so maybe that just didn't get subtracted out or preprocessed the same way in the two spectra. I don't have an example of the Wavelet DOE for this setup, but just know that it's there. This has been a fairly quick overview, but if you're interested in this, please contact us, and we will find a way to help you better understand what's going on with Wavelet DOE and preprocessing built into JMP Pro. Pete, I will turn it over to you.

All right. Well, thanks, Bill and Egon. Just like Bill, I am going to go over how Self-Validating Ensemble Models changed in JMP 17. Bill showed how what Egon did in 16 can be done much more easily in 17 using Functional Data Explorer. For me, I'm going to take that last bit that Egon showed with the add-in: creating that SVEM setup, using those fractionally weighted bootstrap columns, and then also making that validation column and the null factor. I'm going to just show how that's done now in JMP® 17. This is much easier to do in JMP 17. Just like the spectral data processing with FDE, this is built into JMP 17.

If you remember, Egon looked at all those spectra, extracted the meaningful area, looked at smoothers and the standard normal variate, and did a bunch of different preprocessing steps. Then he took those preprocessing steps and selected a subset of those runs to actually run, and he came up with 25. Here are those 25 runs. From this step, what he did is that Self-Validating Ensemble Model, or SVEM.

In 16, this took a little bit of doing. You had to make that model, then you had to simulate, then you had to take those simulations, and run a distribution on each one of them, and then get the summary statistics, and then extract that out to a combined data table, and then graph that or tabulate that and see which ones happen the most often.

That was a lot of steps and a lot of clicks to do, and Egon has clearly done this a bunch of times because he did it pretty quickly and smoothly, but it took a little bit of doing to learn. Clay Barker made this much easier in JMP 17. Same 25 samples here, and instead of running that Autovalidation Setup add-in that Egon showed, we're going to just go ahead and go to Analyze and Fit Model.

We'll set up our model. If you remember, we're taking this log of the dry weight here. We're going to add a couple of random variables along with all of our factors into the model, and then we're going to make sure that we've selected generalized regression. This is the setup for our model. We're going to go ahead and run it, and in JMP 17, we have two new estimation methods.

These are both Self-Validating Ensemble Model methods. The first one is a forward selection. I'm going to go ahead and use SVEM Lasso because that's what Egon used in his portion, and here you just put in how many samples you want. He had selected 50. I'm going to just go with the default of 200. Hit go, and you can see now it's running all of that behind the scenes where you would have simulated, recalculated those proportional weights, and then at the end here, we just have this nice table that shows us what is entering our model most often up here.

Then we hit something like a random variable. Just out of randomness, something like that is entering the model maybe about half the time. For things that are entering more regularly than a random variable, we have pretty high confidence that those are probably variables we want to look at. Then we would go from here and launch the Profiler. I've already done that over here, so we don't have to wait for it to launch or assess variable importance.

But here, this shows us which of these factors are active. We can see the most active factors. It's not making a super accurate model because, again, if you remember, we are taking 25 runs to try to estimate 27 different factors. But if you take a look here at the most prevalent ones, this can at least give you an idea of the impact of each one of these factors. All right, so that basically sums up what Egon had done. It's just much easier in JMP 17, and we are continuing to improve these things and hope that this workflow gets easier with each release of JMP. Thank you for your attention, and hopefully you found this talk informative.