Synthetic Chromatograms: A New Approach to Chromatographic Modelling (2025-EU-30...

Chromatographic methods such as HPLC, GC, and CGE are essential for analytics across various industries. Optimizing these methods to ensure high accuracy and precision is crucial but challenging due to numerous parameters and complex chromatograms. Often, chromatographic targets (e.g., resolution, peak-to-valley) are extracted and modeled, but interpreting these results and their impact on the chromatogram is difficult.

In collaboration with Chris Gotwalt at JMP, we have developed a novel approach to model synthetic chromatograms in silico based on design of experiments (DOE). We demonstrate how individual peaks in chromatograms can be identified using JMP Functional Data Explorer and modeled via the Generalized Regression platform. Subsequently, the synthetic chromatograms are visualized and optimized in the Profiler.

This innovative approach allows the impact of various DOE parameters to be simulated on complete chromatograms for the first time in JMP. It showcases JMP’s interactive capabilities, offering a new understanding of chromatographic methodologies and addressing new regulatory requirements, such as ICH Q14. We demonstrate the potential of this feature, which is expected to be rolled out in JMP 19, with two real-world examples.

Hi, everyone. My name is Marco Kunzelmann, and I work as a principal statistician at Boehringer Ingelheim. We have a great collaboration with Chris Gotwalt from JMP on a topic where we want to combine DOEs on the one hand with chromatographic data on the other. While this is still work in progress, we thought that the project had already reached a phase where we can show you current results. Just keep in mind that a lot of the steps in this presentation are still performed manually, but will be automated in future JMP versions.

The motivation behind this project was that, of course, efficiency increase and cost reduction are fundamental goals in industry. Also, authorities and regulatory guidances have been pushing the topic of multivariate design space characterization and investigation further and further over the past years. Just to name a couple of examples, there are ICH Q8 and the recently released ICH Q14. One established way to meet these requirements of the authorities is design of experiments, or DOE for short.

At the latest with ICH Q14 becoming effective in June 2024, statistical modeling via DOEs has found its way into analytical development. However, chromatograms are usually very complex, and people who want to combine DOEs with analytical method development typically extract features from the chromatograms and model them as a function of the DOE input parameters.

What do I mean by that? Well, extracted features are pieces of information from the chromatogram, like the peak height, the peak retention time, or even the analytical resolution, which describes how well two or three peaks are separated from each other. The problem is that feature extraction is tedious work, because you need software for the peak detection in the first place. Then peak integration is usually required, and then you need to export all this information from your analytical software to JMP, where you can afterwards perform the modeling in the usual DOE way.

On top of that, you need one model for each extracted feature. In my example before, where we have seen nine different peaks, we need nine models just to describe the peak retention time. You then need nine more models to describe the resolution, nine for the peak height, and so on. You can imagine that when you want to optimize all these individual models, the interpretation is really difficult in the end, because you just get information about retention times, peak heights, resolutions, and so on, but you never get the whole picture. You never see the complete chromatogram.

This is basically the goal we wanted to achieve in this project. Our idea was to have peak detection right away in JMP within the Functional Data Explorer. Then we wanted to have a numerical parameterization of the complete chromatogram, also, of course, within JMP. A direct modeling should be applied to the complete chromatogram with all peaks included. Then finally, of course, we wanted to use the interactive capability of JMP. That means we wanted to have an interactive chromatogram right away in the Profiler, where you can really work and see how different DOE factor settings affect the chromatogram itself.

Chris and I thought we could best illustrate our process by showing you two example datasets that we have worked with over the past year. Let's get started with the first example, with gas chromatographic data. Gas chromatography, or GC for short, is an analytical method which is widely used. Simply put, you evaporate your sample, and then in the gas phase, you have an interaction between the mobile gas phase and the stationary phase, which is on the capillary itself. These interaction effects are different for each of the components within your analytical sample. Based on that, you can separate these different components from each other and detect them in the end via a detector.

Here in this example, we used a different GC instrument, which is called a hyperfast GC. The separation principle is still the same as in normal GC. However, the main difference is that the heating rate is tremendously increased, and due to that, we could create chromatograms with a total run time of 1 minute or less. This is, of course, highly beneficial for this project because we could easily run a lot of different settings without losing much time.

Our DOE here was based on four different DOE factors. That means we had a start temperature, an end temperature, the hold time, and the delta time. Rather than showing this information just as a data table, I also show it to you in the form of a temperature profile here on the right side. This was the initial temperature we had. We varied this initial temperature, and we held it for different hold times.

Applying different delta t's allowed us to apply different heating rates, also in combination with different end temperatures, which is the T₂. Because we could easily run a lot of experiments on this hyperfast GC, we decided to use a full factorial design with 81 experiments and seven center point runs to really get all the information we can get out of that system.

With that, we were investigating in this DOE the impact of the four DOE factors on nine different peaks in the chromatogram. Here on the lower part, we see one example of a GC chromatogram, where we see the nine individual peak components in the form of raw data. You can imagine that we get such a chromatogram for each of the 88 DOE experiments in the end.
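
As a small illustration of the run count, here is a minimal Python sketch of such a design. It assumes three coded levels per factor (which is what 81 = 3^4 runs implies); the factor names and the coding are placeholders, not the actual JMP design table.

```python
from itertools import product

# Placeholder names for the four GC factors; coded levels -1, 0, +1 are an assumption.
factors = ["start_temp", "end_temp", "hold_time", "delta_time"]
levels = (-1, 0, 1)

# Full factorial: every combination of three levels across four factors.
factorial_runs = [dict(zip(factors, combo)) for combo in product(levels, repeat=len(factors))]

# Seven additional center-point runs, all factors at their mid level.
center_runs = [dict.fromkeys(factors, 0) for _ in range(7)]

design = factorial_runs + center_runs
print(len(factorial_runs), len(design))  # 81 88
```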

This is basically the part where I hand over to Chris, and he's going to show you in more detail how he extracted the information out of the chromatograms in order to create the synthetic chromatograms in the end. Please, Chris.

Thanks, Marco. Here is the ultimate visual representation of the synthetic or predicted chromatogram in a profiler along with the four DOE factors. The synthetic chromatogram is on the left with a yellow background. There are nine specific peaks of interest that we needed to model.

The approach that I'll be showing creates an easy-to-understand visualization with profiler controls, so the practitioner can dial in the DOE factor settings that give the best chromatogram. In this case, the one with the best separated or highly resolved peaks. One can go from this setting at the center of the design space, which has several of the peaks clustered together to these settings, which achieved the best separation we were able to find with this study.

To do this, there were a number of challenges we had to overcome. First, we had to remove the uninteresting solvent peak and remove the blind baseline run. From there, we used the peak modeling capabilities in the Functional Data Explorer in JMP Pro 18 to identify and label the peaks. We extracted the peak heights, half-height widths, and elution times for each of the peaks for all of the chromatograms. We then modeled these using Generalized Regression and assembled the resulting prediction formulas into an overall synthetic chromatogram formula.

When we started doing this last year, it took a lot of manual work. Ryan Parker and Clay Barker have done a lot of work to streamline the process in JMP Pro 18. As you'll see in my demo, there are still a lot of steps, but the process has gone from several days of manual work to about an hour, and we are committed to streamlining the process further in coming JMP Pro releases.

We start off by going into FDE in the usual way. One prior step that we needed was to use Tabulate to compute the original start and stop times for each batch. This is needed to de-align the data and scale the model for the blind run so that it can be subtracted from each chromatogram. Here we see the raw chromatograms. We want to remove the solvent peak that dominates the picture. To do this, we use Align Maximum.

Now we can find the aligned time that is after the solvent peak, but before the first peak of interest. We do this by using the crosshairs tool, and we find that 0.26 is a time between all the solvent peaks and the first peak of interest. We filter out all observations before 0.26. Now we have just the peaks of interest in the data, and we save this out as a new dataset so that we know what we are working with.

Now there was a baseline blind run that was done with no product in it. We want to remove this signal from the data since it contaminates our model and biases our results. Here I fit just the blind run with a P-spline and saved the summaries. Then I copied the formula to the data with the solvent peak removed. Then I aligned the chromatograms with the blind run formula manually and subtracted the blind run when I created the blind-corrected intensity formula column, which was back on the absolute timescale.

Now that we have the chromatogram data in the form we want, we can return to FDE. We will use the new peak finding capability in JMP Pro 18. The first thing we do is run Automatic Peak Detection. This, of course, finds a lot of peaks in the data and gives ranges and central locations for each candidate peak it finds. We can click on the Summaries tab to see the most important derived peak metrics.

Here you can see the AUCs, resolutions, peak heights, and other derived peak summary measures. We notice a problem because many of the chromatograms have more than nine peaks in the summaries, but we know that in actuality there are only nine peaks per spectrum.

So we return to the peak finding user interface where we investigate the spectra with more than nine peaks. Doing a little searching around, we find that all the supposed peaks with heights less than 500 are just due to little flukes in the data. An example of one of these spurious peaks is highlighted here. We can use the Automatic Peak Removal tool to get rid of these peaks by clicking it and entering 500 as the lower limit for what we'll consider to be a valid peak height.
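
To make the idea of a minimum peak height concrete, here is a minimal Python sketch. It is not the algorithm used by the Functional Data Explorer; it simply flags local maxima and then applies a height threshold, and the intensity values are invented for illustration.

```python
def find_peaks(signal, min_height):
    """Return indices of local maxima whose height is at least min_height."""
    return [
        i for i in range(1, len(signal) - 1)
        if signal[i - 1] < signal[i] >= signal[i + 1] and signal[i] >= min_height
    ]

# Invented intensity trace: two real peaks and one small spurious bump.
signal = [0, 50, 900, 60, 20, 120, 30, 10, 2500, 400, 0]

print(find_peaks(signal, min_height=0))    # [2, 5, 8]
print(find_peaks(signal, min_height=500))  # [2, 8]
```

With the threshold of 500, the small bump at index 5 is no longer counted as a peak, which mirrors what the removal step does to the spurious detections.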

After removing the peaks that are lower than 500, we now have a clean and correct extraction of the peak summary statistics that we need to create our synthetic chromatogram model. We save these summaries to a new table, then we go into Generalized Regression and fit lognormal quadratic response surface models to all the extracted metrics. We use the lognormal distribution because all the metrics are strictly positive.

In practice, we have to evaluate our models to make sure they fit well, just like in any other modeling exercise. I'm going to skip that here, but once we are satisfied, we can hold down the Control key and select the Save Prediction Formula option to save the prediction formulas for all of the responses in one fell swoop. Now we have models that use the DOE factors to predict each feature of the chromatograms, and I used a formula column to assemble a prediction formula for the synthetic chromatogram, combining the individual DOE models for every peak's location, half-height width, and height.
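
The assembly step can be sketched in Python as follows. This is a minimal sketch that assumes each peak is Gaussian, which the talk does not state explicitly for the GC example; the three peaks and their values are invented. In the actual workflow, each height, elution time, and half-height width would come from its own prediction formula in the DOE factors.

```python
import math

def gaussian_peak(t, height, center, fwhm):
    """Gaussian peak parameterized by height, elution time, and half-height width."""
    return height * math.exp(-4.0 * math.log(2.0) * ((t - center) / fwhm) ** 2)

def synthetic_chromatogram(t, peaks):
    """Sum the individual peak functions at time t."""
    return sum(gaussian_peak(t, *peak) for peak in peaks)

# Invented predictions (height, elution time, half-height width) for three peaks.
peaks = [(1200.0, 0.30, 0.010), (800.0, 0.42, 0.012), (1500.0, 0.55, 0.015)]

# Evaluate the synthetic chromatogram on a time grid from 0.25 to 0.65.
times = [i / 1000 for i in range(250, 651)]
signal = [synthetic_chromatogram(t, peaks) for t in times]
```

Because the whole chromatogram is one function of the peak parameters, replacing the fixed numbers with model predictions gives a single formula that a profiler can evaluate for any factor setting.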

To verify our model, we look at overlay plots of the synthetic chromatograms, which are drawn in red, and the actual chromatograms, which are drawn in blue. Overall, we see that the model is not perfect, but it captures the main features in the actual spectra reasonably well. From here, we simply use the Profiler to find factor settings that improve the overall resolution of the peaks, going from the center of the design to the settings which give us an analytical method that separates the components of the production process as well as possible for all the peaks.

Now we have a simple, easy to understand visualization that shows how to set up the process in a way that's easy to communicate to people working in analytical method development. I'm going to hand it back over to Marco.

Thank you so much, Chris. Chris has shown us how in silico chromatograms work with GC data. We have already applied this methodology to another example, capillary gel electrophoresis. CGE, for short, is just another analytical technique. However, the separation principle is a little bit different, which also leads to different peak shapes, but we will come to that in a second. The main motivation behind the CGE method development was that we needed to separate two different peaks from each other, specifically Peak 1 from Peak 2.

You see that this is quite challenging because Peak 1 is super small in comparison to Peak 2, and these two peaks are also overlapping each other extremely. We tried again to find optimal settings to separate these two components from each other. This time we had five different DOE factors, mostly on three different levels, except for the end concentration, which was investigated on four different levels.

Usually, if you want to investigate all combinations here, 324 experiments would be necessary in a full factorial design. This is, of course, the main reason why we used an I-optimal design in the end, and that way we could save about 80% of the experiments because we created a design of just 64 experiments. This design was capable of resolving not only main effects, two-factor interactions, and quadratic effects; we could also resolve three-, four-, and five-factor interaction effects, as well as quadratic effects in combination with main effects.
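
As a quick check on the run counts, assuming four factors at three levels and one factor (the end concentration) at four levels:

```latex
N_{\text{full}} = 3^{4} \times 4 = 324, \qquad
1 - \frac{64}{324} \approx 0.80
```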

Why did we do that? Well, in the GC example, we had actually observed that a lot of analytical DOE factors are interacting with each other, and we wanted to investigate this more closely and get a better understanding of how complex the models have to be in the end to have good precision and accuracy.

Usually, as we have already outlined, you could also model the chromatogram by extracting the features in the first place. However, this was not possible at all in this situation, because the two peaks are overlapping each other so extremely that the analytical software was no longer capable of calculating the analytical resolution for these two components. This is another reason why we needed to find another way to describe the chromatogram mathematically.

What did we do here? Well, as Chris has already outlined, we can, of course, describe a peak in the form of a mathematical function of the peak height, the retention time, the width of the peak, and so on. A symmetric peak can be described by this formula here on the left-hand side.

In CGE, unfortunately, things are more complex because CGE always leads to asymmetric peaks. You see this here as well: the second component, the second peak, is not really symmetric at all. We need to find another formula which can describe the asymmetry of this peak.

What we found in the literature was this nice formula here. The parameter a is the parameter which allows us to describe the asymmetry of the peak. You can see this really nicely here on the left and on the right-hand side, where you see that a value of 0 still leads to a symmetric peak. The higher the value of a becomes, the more asymmetric the peak is. The other parameters here are basically nothing else than the retention time of the peak, the peak height, and so on. This is basically the same as before. It just looks a little bit more complex.
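
The formula itself is not reproduced in this transcript. One function from the literature that behaves as described (a height, a location, a width, and an asymmetry parameter that gives a symmetric Gaussian at 0) is the Fraser-Suzuki function. The Python sketch below uses it purely as an illustration; it is an assumption that it resembles the formula shown on the slide, and the argument names are placeholders rather than the H2, m2, d2, and a2 parameters used later.

```python
import math

def asymmetric_peak(t, height, center, width, asym):
    """Fraser-Suzuki-type peak; asym = 0 gives a symmetric Gaussian with FWHM = width."""
    z = 2.0 * (t - center) / width
    if abs(asym) < 1e-9:
        # Limit for asym -> 0: ordinary Gaussian.
        return height * math.exp(-math.log(2.0) * z * z)
    arg = 1.0 + asym * z
    if arg <= 0.0:
        return 0.0  # outside the support of the function
    return height * math.exp(-math.log(2.0) * (math.log(arg) / asym) ** 2)

# Invented example: the same distance left and right of the apex.
left = asymmetric_peak(26.8, 100.0, 27.0, 0.4, 0.5)
right = asymmetric_peak(27.2, 100.0, 27.0, 0.4, 0.5)
print(left < right)  # True
```

With a positive asymmetry value, the peak falls off more slowly on the right than on the left, which is the tailing shape discussed above.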

How can we now transfer this formula into JMP? What we did is we created a new formula column within JMP. To do that, simply go to Columns, Formula, and then you end up in the Formula Editor. Go to Parameters here on the lower left side, click on New Parameter, and simply add all the parameters you need.

When you do so, you can create the two peak formulas as we did here within the green rectangles. We have the formula for Peak 1, where we assume a symmetric peak, and we have the more complex formula for the second peak, where we also want to describe the asymmetric part of the peak. You can also see all the parameters we inserted here: h1, w1, and t1 for the first peak, and m2, d2, a2, and H2 for the second peak. In the next step, we need to find the right values for all these parameters.

To do so, first of all, load all your raw data from your analytical instrument into JMP. What we have here is one column for the migration time and another column for the intensity. We separate all the individual electropherograms by a third column, which is simply called Run, and this is the DOE run number.

Afterwards, you can go to Analyze, Specialized Modeling, and then the Nonlinear platform. What we do there is we simply plug in the intensity as the Y response, and for the predictor formula, we use our newly created formula column, of course. We separate everything by the Run column I have shown here, which identifies all the individual DOE experiments.

Afterwards, you click on OK, and you end up with a view like the one we see here. Here we have the second DOE experiment, and I zoomed in a little bit so that we just see the data from 26 to 28 minutes. The points are the real raw data. The solid line is the parameterized function with the parameter estimates shown here below. You see the initial starting values are, of course, not the optimal ones.

What we need to do in the next step is to let JMP find the right estimates for all these individual parameters in order to fit the raw data as well as possible. This is an iterative step. Every time we run one iteration, we come closer and closer to the optimal result. I tried to illustrate this here in this little diagram with just four or five iterations. In reality, of course, we have 250 iterations here, and we need to repeat this process for all of the 64 experiments.

It is crucial to visually inspect whether the optimizer really found good estimates for all the parameters. For this to be possible at all, we need to find good starting values. Otherwise, we run into a local optimum, or the optimizer does not converge at all, which is a problem, of course. In order to find good starting values, it makes sense to give the algorithm some lower and upper bounds and also to look at the chromatograms in the first place to have at least rough estimates of the peak height, the retention time, and so on. At the moment, this is really a tedious step and a lot of work, but it will be automated within JMP 19.
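
One simple way to get rough starting values is to read them directly off the raw trace. The Python sketch below is a heuristic for illustration only, not what JMP does: it assumes a window that contains a single dominant peak, and both the data and the bound factors in `simple_bounds` are invented.

```python
def starting_values(times, intensities):
    """Rough starting values for one peak, read directly off the raw trace."""
    i_max = max(range(len(intensities)), key=intensities.__getitem__)
    height = intensities[i_max]
    half = height / 2.0
    # Walk outwards from the apex to the first sample below half height on each side.
    left = i_max
    while left > 0 and intensities[left] >= half:
        left -= 1
    right = i_max
    while right < len(intensities) - 1 and intensities[right] >= half:
        right += 1
    # The bracketing samples slightly overestimate the true half-height width.
    return {"height": height, "location": times[i_max], "width": times[right] - times[left]}

def simple_bounds(start):
    """Hypothetical heuristic bounds around the starting values."""
    h, m, w = start["height"], start["location"], start["width"]
    return {"height": (0.5 * h, 1.5 * h), "location": (m - w, m + w), "width": (0.25 * w, 2.0 * w)}

# Invented trace with one peak.
times = [26.0, 26.5, 27.0, 27.5, 28.0, 28.5, 29.0]
intensities = [0.0, 1.0, 6.0, 10.0, 6.0, 1.0, 0.0]

start = starting_values(times, intensities)
print(start)  # {'height': 10.0, 'location': 27.5, 'width': 2.0}
print(simple_bounds(start))
```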

While this was already a lot of work, our challenge became even bigger because nine out of these 64 experiments led to an edge of failure. That means the migration time of our two peaks was unexpectedly long for some DOE factor combinations. Since the method automatically stops recording after 60 minutes, we didn't receive any data about these peaks in nine of the experiments. Now we have two options. Either we simply call these experiments lost, or we use a technique which is already implemented in JMP to use at least some of the information about these peaks.

The methodology is called data censoring. We know, for instance, for the migration time of the peaks, that the peaks come after 60 minutes at the earliest, and subject-matter experts could at least tell us that they should be somewhere between 60 and 62 minutes. We can use this information, and we can do the same thing for the other peak parameters like the peak height, the peak width, and so on.

How did we use this information within JMP? You simply create two new columns, a lower and an upper one, for each of your parameters. In this example, I used the easiest parameter, M2, which is the retention time, or migration time, of the second peak. We have here M2_lower and M2_upper.

For all experiments where we received data, we can simply insert the same retention time for the lower and the upper column. This is the case for the second run, the third run, and so on. But for all the peaks where we didn't receive any migration time, these edge-of-failure runs, we use the lowest expected migration time for that peak, which for Experiment 1 is 60 minutes, and the highest expected migration time, which is 62. We proceed with the other peak parameters like a2, d2, and so on in the same way.
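
The construction of the lower and upper columns can be sketched in Python like this. It is a minimal illustration: the observed values for Runs 2 and 3 are invented, and `None` is used here as a marker for an edge-of-failure run.

```python
def censoring_bounds(observed, expected_range):
    """Return (lower, upper): identical when observed, an expected range when censored."""
    if observed is not None:
        return observed, observed
    return expected_range

# Invented M2 values per run; None marks an edge-of-failure run with no recorded peak.
m2_observed = {1: None, 2: 41.3, 3: 38.7}

m2_bounds = {run: censoring_bounds(value, (60.0, 62.0)) for run, value in m2_observed.items()}
print(m2_bounds[1], m2_bounds[2])  # (60.0, 62.0) (41.3, 41.3)
```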

The next step is pretty simple. You go to the Fit Model platform and choose Generalized Regression. You insert M2_lower and M2_upper and, of course, all your DOE factors, factor combinations, interaction effects, and so on. Simply click on Run, and then JMP automatically recognizes that you have used interval-censored data and asks you whether this is correct or not. You simply click on Yes here. By doing that, you can use the information from the upper and lower limits for all your data. This is actually much better than simply removing the data because we have some information about the lost data, and we use it here for the modeling.
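
To see why the censored runs still carry information, it helps to look at the standard likelihood for interval-censored data. The notation below is generic and not taken from the talk: $f$ and $F$ are the density and cumulative distribution function of the assumed response distribution with parameters $\theta$, and $[l_i, u_i]$ are the lower and upper columns. How JMP implements this internally is not described in the presentation.

```latex
L(\theta) = \prod_{i \in \text{observed}} f(y_i \mid \theta)
\;\times \prod_{i \in \text{censored}} \bigl[ F(u_i \mid \theta) - F(l_i \mid \theta) \bigr]
```

Each exactly observed run contributes its density, while each edge-of-failure run contributes the probability of falling inside its expected interval.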

We proceed in the same way for all the other parameters, like H2, the peak height, and a2, the asymmetry parameter, and we do the same for the first peak. In the next step, after the model fitting and the model selection, for which I used the Pruned Forward Selection method in the Generalized Regression platform, you save a column for each of the individual parameters with Save Prediction Formula.

The next step is actually pretty straightforward. You repeat this process for every parameter: h1, t1, w1, et cetera. Now we reassemble the chromatogram formula by simply replacing the parameters with the corresponding prediction formulas.

If you remember, we had this complex formula here in the beginning for Peak 1 and for Peak 2, and now we're replacing the individual parameters with the prediction formulas. So h1 no longer appears in the formula column. Instead, we have here h1_upper Prediction Formula, and this process is repeated for all the other parameters. Since we get a super, super long formula by doing that, it does not fit on my screen, but you can imagine how this continues for the complete formula in the end.

This is the result. Now we want to have the result in an interactive way in the JMP Profiler. What you do in the next step is you simply go to Graph, Profiler, and you take the Intensity formula, which is now the reassembled formula we've seen here. You check Expand Intermediate Formulas, put in the Intensity formula as the Y, Prediction Formula, and simply press OK.

I would like to give you a brief demo within JMP of what the result looks like. Here we have the JMP Profiler platform, and you see here on the left side the buffer pH DOE factor, the buffer SDS DOE factor, and all the other DOE factors we have investigated.

The interesting part is here on the right side, the electropherogram, where we have the migration time on the x-axis and the intensity on the y-axis. I've already zoomed in a little bit, and this is the reason why we have a view from just 35 to 47 minutes here. But this makes it much easier to see how well these two peaks are separated from each other. Now I have the possibility to change the DOE factors directly in the Profiler and see what is happening to the chromatogram itself. By changing the end concentration, the peak height is changing, of course.

But by changing the separation voltage, the migration time of these two peaks is changing as well. I can repeat this process with the capillary temperature and also with the buffer SDS. If you look closely, sometimes the separation between the two peaks is better, and sometimes, like here, it is much worse because now no separation is possible anymore. But reducing the buffer pH leads to a much better separation of these two components.

This is one part. Now we have the interactivity in JMP combined with our synthetic chromatogram. But I wanted to go even one step further than that. Let me zoom in a little bit into the electropherogram here with the chosen settings, and I tweaked the prediction and confidence interval formula. What I did is I replaced the confidence and the prediction interval formula by the formula of Peak 1 and Peak 2. The confidence interval formula shows us now just the Peak 1 component.

Let's make this a little bit larger. The prediction interval formula now simply shows us the second peak component. What is now possible is pretty interesting. Even though these two peaks are overlapping each other, we still see where each individual peak lies and how well they are separated from each other.

Now I can really track down where my Peak 1 component sits below my second peak component. This, of course, helps me a lot in understanding how well the separation is performed and which settings are the best ones in the end.

Something you should always do after modeling is to compare the real raw data with the model prediction, and this is what I did here in the form of predicted-versus-measured plots. In red, we see the real data, and in blue, you can see the model prediction. I've selected a couple of experiments here, just to show you how well the model is performing.

Please don't get confused. I start here with Run 2 because Run 1 was one of the experiments where we had an edge of failure, and due to that, we don't have any data with peak components, and comparing nothing with a model prediction is not really helpful. But for all the other experiments here, from Run 2 to Run 11, we can see that the model is actually performing quite well. Sometimes we have a little shift in the retention time, but the peak height is usually fitted quite well. Also, the peak width is pretty good.

Here, we have a little issue with the baseline, but this is something Chris and his team are now really tackling and trying to improve with baseline modeling techniques. Overall, if we see here how the model is performing, I would say we have really good agreement between real data and model.

One last thing we wanted to do in the CGE example was, of course, to find the best DOE factor settings in order to separate the peaks as well as possible. What you usually do is calculate the analytical resolution for two peak components. Simply subtract the retention time, or in this case the migration time, of the first peak from that of the second peak and divide it by the widths at half maximum of the first and the second peak.
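
As a small Python sketch of this calculation: the version below uses the common pharmacopoeial convention for half-height widths, which divides by the sum of the two widths and includes a factor of 1.18. The talk only describes the difference divided by the widths, so the factor and the use of the sum are assumptions about the convention, and the numbers are invented.

```python
def resolution(t1, t2, w1_half, w2_half):
    """Resolution of two peaks from migration times and widths at half maximum.

    Uses the half-height convention Rs = 1.18 * (t2 - t1) / (w1 + w2);
    the factor 1.18 is an assumption, not stated in the talk.
    """
    return 1.18 * (t2 - t1) / (w1_half + w2_half)

# Invented values: migration times in minutes, widths at half maximum in minutes.
print(round(resolution(40.0, 41.0, 0.30, 0.29), 3))  # 2.0
```

Since every input already has a prediction formula in the DOE factors, plugging those formulas into this expression gives the resolution itself as a function of the factors.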

We already have models for all these parameters as a function of our DOE factors. We simply need to plug this information together, and doing that allows us to create a so-called helper model with the resolution as a function of our investigated DOE factors. Here we can now see, beside our chromatogram on the upper right side, the resolution as a function of the DOE factors in the second row.

Now we simply use the capability of JMP to optimize the settings. That means I simply tell it to maximize the resolution. As a boundary condition, the subject-matter experts didn't want to have migration times longer than 50 minutes. So we set the M2 Prediction Formula, which we have anyway, to an upper limit of 50 minutes. Then we run the algorithm to optimize everything, and we simply get the results. That's what we see here. With these DOE factor settings, we get a good separation between these two peak components, and we could solve the problem of the CGE method development.
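
The idea of maximizing the resolution under a migration-time limit can be sketched in Python as follows. This is a brute-force grid search, not the optimizer used by JMP, and the two model functions are invented stand-ins for saved prediction formulas with only two coded factors.

```python
from itertools import product

# Hypothetical stand-ins for two saved prediction formulas, in coded units (-1 to +1).
def predict_m2(voltage, ph):
    """Toy model for the migration time of Peak 2 in minutes."""
    return 45.0 - 8.0 * voltage - 3.0 * ph

def predict_resolution(voltage, ph):
    """Toy helper model for the resolution between Peak 1 and Peak 2."""
    return 1.2 - 0.4 * voltage - 0.6 * ph

grid = [i / 10 for i in range(-10, 11)]
candidates = list(product(grid, repeat=2))

# Keep only settings that respect the 50-minute limit, then pick the best resolution.
feasible = [p for p in candidates if predict_m2(*p) <= 50.0]
best = max(feasible, key=lambda p: predict_resolution(*p))
print(best, predict_resolution(*best), predict_m2(*best))
```

In this toy setup, the unconstrained optimum would violate the 50-minute limit, so the search settles on the best setting that still keeps the second peak inside the allowed time window.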

This was a great achievement. To sum up the important points: we now have peak detection within JMP, and in the future, the numerical parameterization and optimization of the chromatograms will also be automated directly in JMP. For now it's still a manual process, which you can do all by yourself if you want. It's still tedious work at the moment, but that will change in JMP 19.

We now have the cool feature that we can directly model the complete chromatogram with all peaks included and visualize these results directly in the JMP Profiler. From my experience, this was a great benefit for the subject-matter experts in understanding how DOEs work and how they can optimize their methodology. This is really a cutting-edge technology to close the gap between the statisticians on the one hand and the people in the lab and the analytical chemists on the other.

With that, I would like to finish, and I'm really happy that you listened to this presentation. Thank you so much.
