
Integrating JMP Data Exploration and Python Machine Learning Capabilities

The quality of biopharmaceuticals is influenced by many factors. Understanding these factors is a prerequisite for delivering the highest-quality products to patients and for complying with regulatory requirements. JMP is straightforward to use for data exploration. However, correlations within or between larger data sets are frequently complex, requiring tools not available in the standard JMP environment.

On the other hand, Python's machine learning capabilities offer solutions for almost every data science question. Unfortunately, writing Python code is difficult for many subject matter experts. To address this tension, we show how to incorporate Python for machine learning questions as an alternative to extended JMP Pro features.

We use JMP journals to guide the user through a machine learning workflow. Data preparation steps such as data interpolation and other measures to clean up the data are performed with JMP. The machine learning algorithms and their validation are run in Python using scikit-learn pipelines. Finally, the results are visualized in JMP again, so that the user can easily adapt the plots.

We show how this tool can be used to identify the factors with the largest impact on a certain output parameter.

 

Today I would like to take the opportunity to talk to you about integrating Python and JMP, so that we can combine JMP's data exploration capabilities with Python's machine learning capabilities. But first, let me give you an introduction to why we want to do that.

I work for Rentschler Biopharma; as a CDMO, we produce biopharmaceuticals. Our aim is, of course, to understand and improve the production processes, because only when we understand a process can we deliver the best possible quality.

When you think about a typical production process or process step, you can see one here on the left: a chromatography column, one typical step we use to purify a protein. We know quite a few factors that influence this process step, say, load density, flow speed, pH, or temperature.

In addition, every production process consists of many such steps: a bioreactor, a filtration, several chromatography steps, or a diafiltration. Every single process step is influenced by many factors.

During process characterization, at the beginning of any development project, we would usually use a design of experiments approach. We would, for example, vary flow speed, pH, and load density to see how they interact and how they, together or alone, impact product quality. That is a great starting point for a PPQ campaign or a commercial process.

However, after a while you will have generated a lot of data from commercial manufacturing. This is, of course, just random data, but it illustrates what happens once you have, say, 100 batches with a range of product quality results. You then want to understand how product quality is affected by some effect, perhaps one you missed during process characterization. For that, machine learning is an ideal tool.

However, most SMEs, the experts who understand the process, are not familiar with machine learning and not necessarily comfortable with writing code. On the other hand, informaticians are frequently not familiar with biological processes, which can lead to some very odd predictions.

The idea here is to use JMP, which is easy for SMEs to use, to get the most out of the data and to apply machine learning without anybody noticing the code behind it.

For that, I use a JMP journal, which is easy to use and which I will show you in a few minutes. As for the workflow, we use JMP for some of the steps and Python in the background for others, but the user never sees any Python.

In JMP, for example, you explore the data, clean it up, scale it, or create features, all the typical steps that consume most of the time of any machine learning scientist.

This data is then handed over directly to Python, where typical machine learning tasks such as model selection and hyperparameter tuning are performed. The results then go back to JMP, where they are visualized. And if you want to write a report from that, it can also be generated easily by Python, without the user having to write it themselves.
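As a rough sketch of this hand-off (not the actual code behind the journal), the Python side could look like the following; the file name prepared_data.csv and the column name Result are assumptions made for illustration:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Table prepared in JMP and exported for Python (file and column names assumed)
data = pd.read_csv("prepared_data.csv")
X = data.drop(columns=["Result"])   # transformed factor columns
y = data["Result"]                  # response to be predicted

# A simple scikit-learn pipeline; model selection and tuning would happen here
model = make_pipeline(StandardScaler(), LinearRegression())
scores = cross_val_score(model, X, y, cv=5)   # validation

# Fit and hand the predictions back to JMP for visualization
model.fit(X, y)
data["Predicted"] = model.predict(X)
data.to_csv("results_for_jmp.csv", index=False)
```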

One example is parameter scaling: a machine learning algorithm would prefer everything to be on the same scale, otherwise one factor might dominate your model. Of course, that depends on the algorithm you use. But the general idea is that scaling makes everything comparable, so that no factor gets an unfair advantage in the model.
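To illustrate what standardization does, here is a minimal example; the two factors and their values are made up:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two factors on very different scales (made-up values):
# load density in g/L and temperature in °C
X = np.array([
    [20.0, 36.5],
    [35.0, 37.0],
    [50.0, 37.5],
])

# Standardization subtracts the mean and divides by the standard deviation,
# so both factors end up with mean 0 and standard deviation 1
X_scaled = StandardScaler().fit_transform(X)
print(X_scaled)
```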

For that, we use a JMP journal, which I would like to show you now. The journal looks like this. We start with the data: either real product quality data or, as here, an example table of random data.

What you frequently see is that not all data types are correct. For example, here the first three columns are categorical, and you might want to convert them to numeric data for your algorithm. On the other hand, you may have factors that are continuous but that you would like to treat as categorical. That can also be done very easily.
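In the journal this is a single click; in Python it would correspond to something like the following (column names are placeholders):

```python
import pandas as pd

# Made-up table: one column stored as text, one stored as numbers
df = pd.DataFrame({
    "Flow Speed": ["1.2", "1.5", "1.8"],   # categorical/text, should be numeric
    "Resin Lot": [101, 102, 101],          # numeric, but really a categorical label
})

# Convert text to numbers for the algorithm
df["Flow Speed"] = pd.to_numeric(df["Flow Speed"])

# Treat a continuous-looking column as categorical instead
df["Resin Lot"] = df["Resin Lot"].astype("category")
```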

Something else that frequently happens is missing data. Here in the pH column, you can see that some values are missing, and you may wonder how to deal with that. You can fill them in easily. You could fill them with predictions, but let's keep it simple and fill them with the mean or median of the values. Let's take the mean; it is filled in automatically.
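On the Python side, mean or median imputation would look like this (the pH values are made up):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# pH measurements with one missing value
ph = np.array([[7.0], [np.nan], [7.4], [7.2]])

# Replace missing entries with the column mean (use strategy="median" for the median)
imputer = SimpleImputer(strategy="mean")
ph_filled = imputer.fit_transform(ph)
print(ph_filled.ravel())   # [7.0, 7.2, 7.4, 7.2]
```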

Finally, we come to the scaling approach. Say you would like to put these factors on the same scale; that is standardization. These factors over here you would like to one-hot encode; again, you just click. Then you might have other factors that are logarithmic; same idea, you just apply a log transformation. And a concentration you might also like to scale.
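In scikit-learn, these per-column transformations can be combined into one preprocessing step, for example with a ColumnTransformer; the column names below are placeholders:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, StandardScaler

# Small made-up table; column names are only placeholders
df = pd.DataFrame({
    "pH":            [7.0, 7.2, 7.4],
    "Temperature":   [36.5, 37.0, 37.5],
    "Resin Lot":     ["A", "B", "A"],
    "Concentration": [0.1, 1.0, 10.0],
})

# One transformer per kind of column: standardize, one-hot encode, log-transform
prep = ColumnTransformer([
    ("standardize", StandardScaler(), ["pH", "Temperature"]),
    ("one_hot", OneHotEncoder(), ["Resin Lot"]),
    ("log", FunctionTransformer(np.log10), ["Concentration"]),
])

X_transformed = prep.fit_transform(df)
```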

In the end, you still have the result column, of course, plus a lot of transformed columns that you can use for any model. Then you can go ahead and hand it over to Python. You just define what you would like to predict. Maybe elastic net is decent, or maybe another regression is better; let's use both, because you might want to see if there is a difference. Hyperparameter tuning then runs in the background, and you see how the models perform. This is, of course, a little more labor-intensive for the computer.
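A minimal sketch of what such a comparison might look like in Python; the second model (a random forest) and all file and column names are assumptions, since the journal only shows the results:

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV

data = pd.read_csv("prepared_data.csv")              # assumed export from JMP
X, y = data.drop(columns=["Result"]), data["Result"]

# Candidate models and their hyperparameter grids
candidates = {
    "elastic_net": (ElasticNet(max_iter=10_000),
                    {"alpha": [0.01, 0.1, 1.0], "l1_ratio": [0.2, 0.5, 0.8]}),
    "random_forest": (RandomForestRegressor(random_state=0),
                      {"n_estimators": [100, 300], "max_depth": [3, 5, None]}),
}

for name, (model, grid) in candidates.items():
    search = GridSearchCV(model, grid, cv=5)         # hyperparameter tuning
    search.fit(X, y)
    best = search.best_estimator_

    # Rank factors by influence: coefficients for the linear model,
    # feature importances for the tree ensemble
    if hasattr(best, "coef_"):
        influence = pd.Series(abs(best.coef_), index=X.columns)
    else:
        influence = pd.Series(best.feature_importances_, index=X.columns)

    print(name, search.best_score_)
    print(influence.sort_values(ascending=False).head(3))
```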

With these two models, you can see that both identified three factors that play a role in predicting the final column, such as pH A and two time parameters. Both models find essentially the same thing, which is very nice. These would be the three factors you might be interested in and would want to investigate further. With this, I would like to thank you for your attention.