Wednesday, November 15, 2023
※ Note: If this site is displayed in a language other than English or Japanese, please select "日本語" (Japanese) in the language settings at the top right of the screen, or from "私の設定" (My Settings) > "Preference".
This site publishes the handout materials for Discovery Summit Japan 2023. Clicking any of the files at the top of the site opens the material in a pop-up; click the down-arrow (↓) icon to download it.
If you would like to download all materials at once, download "DSJ2023-handsout-all.zip" and unzip it. Please note that printed handouts will not be distributed at the venue this year, so please view them on a PC or mobile phone, or print them yourself.
[File names and session information for the published handouts] *Some sessions have no handouts.
A-1_HOSOJIMA: Improving the persuasive power of multivariate analysis with drill-downs using Graph Builder and data filters (東林コンサルティング, 細島 章)
A-2_TAKYU: Streamlining the generation of query scripts from JMP to a MySQL server (国士舘大学, 田久 浩志)
A-3_KAWAGUCHI: Tips for building applications with JMP/JSL (日本ゴア合同会社, 河口 雅彦)
A-4_KUROSAWA: Statistical verification of the improvement in hazard sensitivity produced by safety education in machine-shop training (筑波大学, 黒澤 拓未)
A-6_NORIO: Learning manufacturing, data science, and problem solving at Toyota Motor Kyushu through a clay-production exercise (トヨタ自動車九州株式会社, 則尾 新一)
A-7_OTA: Initiatives to spread statistics and data analysis with JMP in an R&D division (株式会社レゾナック, 太田 浩司 et al.)
B-1_Y-TAKAHASHI: Exploratory variable selection that accounts for interactions in multifactor experimental data with covariates (BioStat研究所株式会社, 高橋 行雄)
B-2_SAEKI: Factors associated with financial toxicity in breast cancer patients in Japan: a comparison of patient and physician perspectives (がん研究会 有明病院, 佐伯 澄人)
B-3_ODAI: Sensory evaluation and the use of JMP (キリンホールディングス株式会社, 小田井 英陽 et al.)
B-4_HONDA: Examples of using JMP in the pharmacovigilance field (小野薬品工業株式会社 本田 主税 / イーピーエス株式会社 小柳 将)
B-5_OGASAWARA: Ready to use tomorrow! JMP tips drawn from inquiries to JMP Technical Support (JMP, 小笠原 澤)
Despite the remarkable advances with AlphaFold2 in predicting protein structure and structure determination by cryogenic electron microscopy (cryoEM), protein crystallography is still more likely to provide higher-resolution data and more reliable structures for drug discovery. The bottleneck in crystallographic studies is the search for the chemical cocktail that promotes crystallization. Each lead condition contains two to seven chemicals and is discovered by sampling hundreds of solutions from sparse matrix screens. Once a good lead's composition is optimized, protein crystals are grown in large numbers for structural studies. Protein crystal growth often gives quadratic responses to rising levels of some factors, and two-way interactions are abundant. Two-level designs miss these features. New two-level modified one-factor-at-a-time (MOFAT) experiments can detect two-way interactions in simulation studies (Yu 2022). The extension of MOFAT to three levels would interest crystal growers. We hypothesize that a three-level extended MOFAT (EMOFAT) design can detect two-way and quadratic effects. This design retains the baseline run of standard OFAT designs and adds runs with all factors at their lowest or highest levels (2m + 3 runs, where m is the number of factors). We conducted simulation studies using JSL scripts to explore the ability of these designs to detect main effects and higher-order interactions, including three-way interactions. These designs may be attractive to protein crystal growers because the three levels can detect nonlinear responses and promise to reveal two-way interactions before moving on to more sophisticated optimization experiments.
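To make the run structure concrete, here is a minimal Python sketch of the 2m + 3 layout as described in the abstract (an illustration only, not the authors' JSL simulation code), coding each factor as low, baseline, or high:

```python
# Illustrative sketch: build a three-level EMOFAT-style design matrix with
# 2m + 3 runs, coding factor levels as -1 (low), 0 (baseline), +1 (high).
import numpy as np

def emofat_design(m: int) -> np.ndarray:
    """Return a (2m + 3) x m design matrix: one all-baseline run, one low and
    one high run per factor (other factors at baseline), plus all-low and
    all-high runs."""
    runs = [np.zeros(m)]                      # baseline run
    for j in range(m):
        low, high = np.zeros(m), np.zeros(m)
        low[j], high[j] = -1, +1              # vary one factor at a time
        runs += [low, high]
    runs += [-np.ones(m), np.ones(m)]         # all-low and all-high runs
    return np.vstack(runs)

if __name__ == "__main__":
    X = emofat_design(4)                      # 4 factors -> 11 runs
    print(X.shape)                            # (11, 4)
    print(X)
```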
Cerba Research is a globally renowned company that specializes in delivering top-notch analytical and diagnostic solutions tailored for clinical trials worldwide. At Cerba Research Montpellier, our dedicated team customizes immunohistochemistry protocols to detect specific target expressions within patients' tissue sections. To address the escalating demand for protocol development and enhance process profitability, we recognize the vital need to streamline development timelines and reduce costs. Given the diversity of custom protocols to be developed, the conventional OFAT (one factor at a time) approach is no longer sufficient. We have therefore undertaken an in-depth evaluation, comparing various design of experiments (DOE) methodologies, including custom design and space-filling design, using JMP. These DOE approaches are evaluated against previously developed OFAT protocols. We present data illustrating the comparative advantages of OFAT and DOE approaches in terms of cost-effectiveness and quality.   Hello. I'm Marie Gérus-Durand, and I'm working at Cerba Research as a validation engineer. Today, I will show you how we can cut costs and elevate our quality by using design of experiments when setting up immunohistochemistry clinical protocols. As an introduction, Cerba Research is a worldwide company with activities on all continents. I highlighted in yellow the department I'm working for: the pathology and IHC department. The main research department for IHC is based in Montpellier, where I am working. What is immunohistochemistry? Immunohistochemistry is the study of proteins or various targets in tissue. When we have biopsies, we cut thin sections of the tissue onto slides and use antibody-based technology to detect different targets of interest on the tissue, which we can see under a microscope or, with a scanner, on digitized images. Here, for example, you have a skin tissue where you see a nice target stained in red. More than a quarter of our activity at Cerba Research Montpellier is protocol setup for our clients. We have more than 50 clients, each with their own favorite targets. It's crucial for us to have a fast, effective development strategy in order to stay competitive in the whole area of protocol setup for clinical trials. Here is an example of another staining, and this is the staining I will use in the first example I will show you. It's the detection of a target, in brown here, by IHC, so this protocol detects only one target. We call it a simplex, and it's a lot of what we are doing. Our current approach is one factor at a time: we first evaluate the antigen accessibility of the tissue for the targets to be recognized, then we do a titration of the antibody to find the condition giving the best signal-to-background ratio, and we do further optimization if required. Each arrow here is a step in the development protocol, each step is at least one automaton cycle, and each cycle lasts at least four days from the request to the test results. To improve our profitability and the number of protocols developed, we need to reduce the time and the number of tests needed to arrive at the optimized protocol. Here I schematize our strategy: we test two antigen accessibility conditions with no antibody or a defined antibody concentration, and so we have these four points.
Then we choose the best one, so here it's condition two, to do a titration of the antibody with more points. But what if, in fact, the optimal point was somewhere with condition one, outside of the range tested? When we look at this, it reminds me a lot of the design of experiments webinars that I saw with JMP. This is an example slide from one of those webinars, where you compare the OFAT approach, one factor at a time, which is the one we are using now, to the DOE, and you see that you get better coverage of your experimental area. That could be a great improvement in our setup, so I decided to give it a try. First, I started with the simplest thing, which is simplex protocol development as I showed you, detecting only one target on the tissue. We first have to define our constraints and parameters. The automaton I'm using for this experiment has independent positions, which means that one test from the design of experiments is one slide in the automaton, and we have to define the responses and the factors. The response we want to analyze is the signal intensity, which we want to maximize, and we want to minimize the background. The factors we can play with at this stage are the antigen retrieval, which is a categorical factor, and the primary antibody concentration, which is a continuous variable. Let's go with the design. I chose to compare different designs; I will show you the Custom Design and the Space Filling Design. Let's set up this design together. I just go to the DOE platform and Custom Design, and we have to enter the responses and the factors. The first response is the signal intensity that we want to maximize. I assume an arbitrary evaluation of it between zero, which is no signal, and three, which is a strong signal, and I add another response here: the background, which we want to minimize, so I choose minimize this time and put the background here. In the same way, zero is no background and three is a very high background. Then we add the factors. I'll just show you, because it was a bit hidden here. As factors, we have the categorical factor, which is two-level: the antigen retrieval. It's quite easy to change, and we have two pHs, usually pH 6 and pH 9, which are named differently depending on the automaton, but that doesn't really matter. We have the continuous factor, which is the antibody concentration. Usually we test 10 as a maximum, so I make it vary from 0 to 10. Once we have set this, I have no covariates; as I told you, I chose the simplest case. We continue with the model. I want to see all the possible interactions, so I do an RSM. I don't add replicates or center points; I already went for the simplest option, and you see that by default it says you should do 11 runs. Eleven runs fits with what we usually do, so it's fine. To take advantage of the Design Explorer that is in JMP 17, I just click here and say, okay, let's explore some different designs. You see that when you click, you have the factors again, the model here, and different options: on the left, you explore design by design; on the right, you can do a combination. For an RSM we usually use I-optimality, and let's say I want to try different numbers of runs: a minimum of five, going up from the 11 we would start with to, say, 15, with a step of one. Center points, I don't really care about, but we can try and see what they do, and the same for replicates.
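As a rough cross-check of that default run size (a Python sketch under my own assumptions, not part of the JMP workflow shown in the talk), the RSM for one two-level categorical factor and one continuous factor has five model terms, so around ten or eleven runs leave a few degrees of freedom for error:

```python
# Sketch of the RSM model behind the custom design: one two-level categorical
# factor (antigen retrieval, coded -1/+1) and one continuous factor (antibody
# concentration, 0-10), with main effects, the interaction, and the quadratic.
import numpy as np

def rsm_model_matrix(ar, conc):
    """Columns: intercept, AR, conc, AR*conc, conc^2."""
    ar, conc = np.asarray(ar, float), np.asarray(conc, float)
    return np.column_stack([np.ones_like(conc), ar, conc, ar * conc, conc**2])

# Five model parameters, so roughly 10-11 runs leave a handful of degrees of
# freedom for error, which matches the default run size mentioned above.
ar   = [-1, -1, -1, -1, -1, +1, +1, +1, +1, +1]
conc = [ 0,  5, 10,  0, 10,  0,  5, 10,  0, 10]
X = rsm_model_matrix(ar, conc)
print(X.shape)                          # (10, 5)
print(np.linalg.matrix_rank(X))         # 5 -> all RSM terms estimable
```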
Let's also see whether it's better if we do replicates. I click Generate All Designs, and it builds the designs with the different parameters. I chose I-optimality, so it's I-optimality everywhere. I chose run numbers five to 15, and then you have the replicates. Here there is a replicate for five; I don't know why, but it's okay. You see for seven, for example, I have two replicates, one, or zero, and it's the same for all of them. For center points as well, I have one or two center points when possible. At the bottom, you can make a table of all these designs, and it adds a nice column; it just says Custom Design for all of them. If you choose a design in this table and click Custom Design, it opens that design, which is nice. From the table you can do some graphs and say, okay, let's look at the efficiency depending on the number of runs. You see that there is a strong correlation between the two of them. We can play and add other variables to look at with the column switcher: the runs, and the center points and replicates as well. Sorry, I forgot to select the runs, but anyway, here we have the runs. Center points don't seem to have a big impact, and replicates neither, so it looks like we should mainly play with the number of runs. If I look at the number of runs, I come back to this, exclude the center points and replicates, and look just at the run numbers. I didn't remove everything; you should remove everything first, otherwise it just adds all of them. I removed everything and just selected; sorry, it didn't take. So let's say I choose what was selected, 12 runs. I go back here and ask, is the default 11? In fact, I want to be sure to have my two negative controls, which are no antibody at all or an isotype control, so I say 10. I make a design with 10 runs and then manually add my two controls. Here you see it's a small design; it takes only 10 seconds, and here is the table. In fact, each time we run it, it will give different numbers here, so I will use the one I did before by just running its script, and I will have exactly the same data as the one I'm showing you. Here is the table; you see it was different. I have only 0, 5, or 10 for the concentration, but let's keep this table. In the table, you see you already have the model, the evaluation of the design, and you can go back to the DOE dialog box. I will not do the evaluation now, because I want to compare two designs; I will do it at that point. I just plotted the design I have: it has duplicates for some points at the end, even though I did not ask for duplicates, and it covers this area of the experimental space. Then I try a space-filling design. It's the same: go to DOE, but this time Special Purpose, Space Filling. What I want to show you is that I need to go back to my Custom Design, because I want to compare the designs, and the responses and factors should be the same, so I can just save my responses and my factors: save responses, and in the same way, save factors. Then, if I go back to the Space Filling Design, I can load the responses. Sorry, you have to have the window selected so you find it. The same for the factors: I load the factors. No constraints. Once again, the default number of runs is 20, but I want to compare the two designs, so I will do 10 as well. I have no choice of method here; it's just Fast Flexible Filling.
It generates this design, and you see that this time the concentration is not restricted to values like 0, 5, or 10; it's a whole range of concentrations, and you can make the table in the same way. I can close all of this. As for the Custom Design, I rerun the one I did so I have exactly the same numbers, and then I wanted to compare the designs. How do you compare the designs? Oh, first, sorry, I forgot: you see that the variation in antibody concentration is not the same. I don't have any duplicates except for this control here, and it covers a broader range of antibody concentrations, which is actually good for what we are looking for. I wanted to compare the designs. First, I can do it graphically, because the area covered by each design is different, and from my point of view the space filling will let me try more antibody concentrations, which may be a good point. But I can also compare the designs formally: in DOE, you have Design Diagnostics, Compare Designs, and you see I have both of them. I already have the Fast Flexible open, so I add the Custom Design. The column names are the same, so I don't need to match many columns; it does it for you and recapitulates the factors. For the Fast Flexible, it doesn't start exactly from zero and doesn't go up to 10, but it's nearly there. For the model, I cannot have the categorical factor squared, but I have antibody by antibody and antibody by antigen retrieval. I cannot get antigen retrieval to the power of two, but that's okay, and we go for the design evaluation. That's why I didn't do it one by one; I wanted to compare. In blue here, if you look at the power plot, is the Fast Flexible Design, and in orange is the Custom Design. You see that [inaudible 00:17:23] the Custom Design looks better in terms of power for determining the model than the Fast Flexible Design. If you go to the fraction of design space plots, it's the same: the Custom Design seems to fit better. If we compare the design diagnostics, they are relative to the Custom Design, meaning the Custom Design has a value of one, and if I look at the efficiency, the Fast Flexible Design is at 0.7, so it's roughly 30% less efficient, let's say, than the Custom Design. These are just the three diagnostics I'm showing you here, and it seems like the Custom Design is better than the space-filling one [inaudible 00:18:18], but let's see what the results tell us. The profilers are obtained after entering [inaudible 00:18:26], so just to show you an example of the data: I have my DOE table here. I just added the signal intensity and background responses, so I look at my images and say, okay, I have no intensity, I have some, I have more, and so on, and I did the same for both designs. These were two runs, so two different data sets. At the end, I fit the model; just click on the model script. You see it's standard least squares with effect screening, with all the interactions here, and I fit everything together. I have some factors that I could remove, but I decided to keep everything. Then I go to the profiler. If we look at the best condition, I maximize the desirability as it was defined, maximizing my intensity, and you see it goes to the high point here, and minimizing background, and it found this condition. After this, I realized that maybe I don't want to maximize signal intensity. I just want to match a target, because maybe the sample I'm using here is not a sample with low expression, and I want to be able to detect low expression of my targets.
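For readers without JMP at hand, here is a small Python sketch of the two ideas just used: a space-filling sample (via a Latin hypercube, a common space-filling method, not JMP's Fast Flexible Filling algorithm) and a relative efficiency number. The toy designs and the resulting ratio are illustrative stand-ins, not the designs or the I-efficiency computed in the talk.

```python
# Illustrative only: compare a space-filling-style design against an
# RSM-style design for the same model using relative D-efficiency.
import numpy as np
from scipy.stats import qmc

def rsm_matrix(ar, conc):
    """Model matrix: intercept, AR, conc, AR*conc, conc^2."""
    ar, conc = np.asarray(ar, float), np.asarray(conc, float)
    return np.column_stack([np.ones_like(conc), ar, conc, ar * conc, conc**2])

def d_criterion(X):
    n, p = X.shape
    return np.linalg.det(X.T @ X / n) ** (1.0 / p)

# Space-filling-style runs: Latin hypercube over concentration 0-10,
# alternating the two antigen-retrieval levels (coded -1/+1).
lhs = qmc.LatinHypercube(d=1, seed=7)
conc_sf = qmc.scale(lhs.random(n=10), [0], [10]).ravel()
ar_sf = np.tile([-1, 1], 5)

# RSM-style runs: concentrations pushed to 0/5/10, crossed with both AR levels.
conc_rsm = [0, 5, 10, 0, 5, 10, 0, 10, 5, 5]
ar_rsm   = [-1, -1, -1, 1, 1, 1, 1, -1, -1, 1]

ratio = d_criterion(rsm_matrix(ar_sf, conc_sf)) / d_criterion(rsm_matrix(ar_rsm, conc_rsm))
print("relative D-efficiency (space-filling vs RSM-style):", round(ratio, 2))
```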
I change it a bit and say, okay, I want to match a target, which is a middle intensity, not the highest one but the middle one. I still want to minimize background, because background is not good. Now, if I maximize desirability, you see that it changed the concentration of antibody to use. This is nearly what we obtained here; maybe the number is a little different, but it's okay. I did the same for the Fast Flexible Design. I obtained two different conditions, one at nearly 3.8 of antibody in CC2 and one at 2 in CC2. Here it's not pH 6 and pH 9, because on this automaton it's called CC1 and CC2, but it's the same. Then I say, okay, I have these two conditions, and I compare them to the initial protocol, meaning our reference, which is our standard approach, which gave the images I showed you before. Here you have the Custom Design conditions and the Space Filling Design conditions. You see that these two conditions give data that is at least as good, I would say even better, than the one we defined with our standard approach. For me, the Space Filling Design allowed testing more antibody concentrations, which is useful when you have difficult targets. I will go for this because, in addition, these two conditions you see here were in fact present in the Space Filling Design, so I didn't have to rerun the conditions; I had the data in the Space Filling Design and the images to double-check that it works well. I chose a Space Filling Design to test on another protocol, just to see if it's working. I won't go through everything; I just take the same setup and change my responses for the new target. I obtained this model where, to match the target of two for signal intensity and to minimize the background, it said you should use 2.8 micrograms per milliliter of antibody in the CC2 condition. If I compare it to the protocol we developed, that one said CC1 and 5 micrograms, so it's not the same; both changed. But as you can see on the image, the protocol defined by the Space Filling Design is at least as good as the one we usually develop. Now, what is missing to convince operations and managers is, for sure, the cost and time of our OFAT approach compared to the designs here. First, I compared the Custom Design and the Space Filling Design, including the time it took me to design everything, to our standard approach. Our standard approach is in blue in this graph, Space Filling in red, Custom Design in green, and I compared the number of cycles, which is the time-consuming part, the number of slides, the time from design to protocol, and the time from technical request to results. If you look at the time from technical request to results, even having to design everything, and even though it's a new way of working for the technical team, we shortened the time to results. Now, with the second example, I did not have to design everything again. It's more like what we will do at the next step if we take this approach as our new standard approach, so I compared the standard approach used for this project, in blue, to the DOE setup plus the comparison cycle, in striped red. Because we have a reference protocol, I did a comparison cycle, but if this becomes the new approach we are using, then we will not do a comparison cycle, because we will not have a reference; we would just do the DOE setup, and that's in red. As you can see, it decreased the number of cycles, the number of slides, and the amount of antibody we used, and the time to final results in days is shortened by half, which is quite nice.
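To make the match-target desirability step from the profiler discussion above concrete, here is a minimal Python sketch. The response curves below are invented for illustration, not the fitted JMP models from the talk; the point is only the mechanics of combining a target-matching desirability with a minimizing one.

```python
# Minimal desirability sketch: target-matching for signal intensity,
# minimizing for background, combined as a geometric mean and scanned
# over antibody concentration.
import numpy as np

def d_target(y, low=0.0, target=2.0, high=3.0):
    """Desirability = 1 at the target, falling to 0 at the limits."""
    y = np.asarray(y, float)
    d = np.where(y <= target, (y - low) / (target - low), (high - y) / (high - target))
    return np.clip(d, 0.0, 1.0)

def d_minimize(y, low=0.0, high=3.0):
    """Desirability = 1 at no background, 0 at the worst background."""
    return np.clip((high - np.asarray(y, float)) / (high - low), 0.0, 1.0)

# Hypothetical response curves versus antibody concentration (0-10 ug/mL)
conc = np.linspace(0, 10, 101)
intensity  = 3.0 * conc / (conc + 2.0)          # saturating signal (toy model)
background = 0.02 * conc**2                     # background grows with concentration

overall = np.sqrt(d_target(intensity) * d_minimize(background))
best = conc[np.argmax(overall)]
print(f"best concentration under this toy model: {best:.1f} ug/mL")
```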
In conclusion, I like this quote from Steve Jobs: "Start small, think big." At the beginning, when I saw all these [inaudible 00:24:53], I thought, oh, it's nice, but can I apply it? I struggled a bit, but I said, okay, I will start the simplest way, and I set it up for this simple IHC, and I convinced my manager even more rapidly than I thought. Then the operational team, and eventually other leaders, were convinced that the approach can be applied to IHC. We cannot guarantee that we won't need optimization, as we sometimes do now, because you don't know at the beginning whether it will be an easy development or not. The next step I would like to introduce to the lab is the multiplex setup: detecting multiple targets on the same tissue, starting with two targets instead of the simplex one and going up to five targets. This is ongoing. Finally, I want to say thank you. I want to thank the steering committee for having selected my abstract. I want to thank the Cerba Research Montpellier team, including the lab technicians, because it's sometimes hard to follow my crazy ideas; my manager, who is always supportive of new strategies and new ways of seeing things; and all of the conception and validation team I am a part of, for following me in this test too. I want to thank the JMP French team, and Stephane for being supportive, answering all my questions, and allowing me to load the new JMP 17 version when I was stuck with 16, so I could use the nice Design Explorer platform. And I thank you all for your attention.
In biopharmaceutical process development, the characterization of critical process parameters (CPPs) and controllable process parameters is crucial. As in many industries, process development and characterization studies start at the laboratory scale, and the production process is subsequently upscaled to the final production facility. JMP plays a central role as a tool to support these studies, from DOE to process modeling. The use of online sensor data is a potential source of information that is currently underutilized in process development and characterization studies. In particular, the comparison of microbial metabolic profiles between production scales is mostly performed exploratively. The statistical analysis of this data is an option to understand differences and can help to enable better process control strategies. In this talk, we explore: how Functional Data Explorer in JMP Pro can give a better understanding of scale differences; the current usage of JMP for process characterization studies at Lonza; and the utilization of Functional Data Explorer in JMP Pro to support process understanding and optimization.   Hello and thanks for joining my talk. I am Anne-Catherine Portmann, a Project Leader in the Microbial Process Development Department at Lonza. Today, I am here to talk about using microbial metabolic profiles to improve scale down model qualification and process characterization studies. I want to briefly start with this disclaimer. I will not read out the whole text, but I want to mention that I believe all the information I share with you today is correct. For confidentiality reasons, I specifically want to mention that all the data is normalized and anonymized. Here is the agenda of today's talk. We'll start with an introduction to Lonza, and I will give you some background information on bioprocesses and process characterization studies. This will help you understand the three following examples where I use JMP. In the first example, I will show you how I applied JMP Functional Data Explorer to compare data from batches at different scales. In the second example, I will use JMP in a very standard way to determine the proven acceptable range of a parameter. The third example will be on the utilization of Functional Data Explorer to minimize product-related impurities. Finally, I will conclude this talk and answer your questions. I'd like to start by giving you a quick introduction to the company I work for. Lonza is a multinational manufacturing company for the pharmaceutical, biotechnology, and nutrition sectors. On the right, you have a picture of the site of Visp in Switzerland, where I am working. Let's have a look at some numbers about Lonza. You most probably already know that Lonza is a global company of about 18,000 employees with a long history of more than 100 years. We have more than 35 sites worldwide, supporting our customers in manufacturing innovative medicines. Visp is the biggest site of Lonza. In Visp, the microbial production capabilities range from 70 liters to 15,000 liters. When a new process arrives at Lonza, it is first transferred from the customer to the process development department. The building where I am working, in process development, is really central to the site, and we have all the manufacturing production around us. That helps a lot to have good collaboration between process development and manufacturing scale.
In the process development department, we test and adapt this new process to derisk the upscaling to the target manufacturing scale. I know that many of you come from different kinds of industries and are not familiar with microbial bioprocesses. Therefore, I will describe in the next couple of slides a typical microbial bioprocess and the typical steps to characterize it. A bioprocess is composed of an upstream part and a downstream part. During the upstream part, the protein of interest, here in orange, is produced in microorganisms in the fermentation. Then the cells are broken down, and the cell debris is removed during the separation. During the downstream part of the process, we remove other proteins and purify the protein of interest until the final product, which is very pure. The fermentation is an important step, as it produces the protein of interest. The amount of product is defined by this step, as well as the product quality. Any mistake in the protein chain cannot be corrected later in the process. Therefore, it's very important to correctly regulate the input parameters, such as temperature, pH, dissolved oxygen, or other kinds of parameters, and to monitor the output attributes really well, which are, for example, the concentration of the product, the titer, the purity of the product, or the biomass, which corresponds to the amount of cells in the bioreactor during the fermentation. During a PC study, a process characterization study, we want to understand the impact of the input parameters on the output attributes and define the parameter ranges where the attributes are within specification. Understanding the dependency of input parameters and output attributes is key to comprehending bioprocesses. What is a process characterization, the PC? The process characterization is part of the process validation, and it contains four parts. In the first part, we have the risk assessment, where we select the parameters to investigate further. We select these parameters based on process knowledge, on historical data, and on the experience and expertise of the people involved in these risk assessments. The second part is the qualification of the scale down model. We need a scale down model because we cannot perform all the experiments to investigate these parameters at scale. It would be too expensive, and the efficiency of those experiments would not be very high. At the laboratory scale, we have the possibility to run multiple reactors in parallel, and that helps with having efficient experiments in the lab. However, the instrument at the lab scale has to perform in exactly the same way and give the same results as we obtain at the manufacturing scale. If it does not, we have to evaluate the difference between these two scales, look at these scale-dependent differences in more detail, and be able to explain them. The next step during a PC study is preparing the design of experiments using JMP to optimize the number of runs, and the experiments are then performed in the lab. From these experiments, we collect all the output data, generate a data table in JMP, and analyze the data to evaluate the interactions between the parameters and the attributes, and to define the parameter ranges and impacts. Why should a process be characterized? A process should be characterized to ensure that we deliver a constant product quality and a reproducible yield during manufacturing.
When you are taking a medicine, you always want to have the same effect. If you take a painkiller, you always want to have no more pain after you take it. You don't want it to be that once you take it, you don't have any more pain, but two days later, when you take the same medicine again, it has almost no effect, and the day after that, when you take it, the effect is too big. That is really the kind of example of why you want your medicine to have constantly the same quality and the same efficiency. To do that, we perform a process characterization and a process validation. Now, I will show you some examples. I will start with the qualification of a scale down model. As I told you, the scale down model qualification is a very key step during the PC study to ensure that the experiments are performed in a representative instrument. The fermenter, or fermentation reactor, has many sensors connected to it, which record a lot of data continuously during the experiment. This is a large source of information that is really valuable for comparing differences between scales. I will show you how we explore these scale differences using this kind of data. As I have said many times now, it's really cheaper to use an instrument in the laboratory. We can use a high-throughput device at laboratory scale, and that allows us to run a lot of experiments more cheaply than doing all these experiments at manufacturing scale. By qualifying the scale down model, we are also able to determine the scale-dependent differences. Until now, we have really been using the offline data. This is how we do it: we have the data at lab scale and the data at manufacturing scale in a one-way analysis, and we use an ANOVA or a mean comparison to determine whether these groups of batches at lab scale and manufacturing scale are comparable or not. The thing is, when we have online data, we cannot use this kind of graph to make a comparison. The data look like the right part of the graph. Some people just compare these curves by expert eye and say, okay, that is comparable, or these data are completely different. The problem with this kind of approach is that the people comparing the curves could leave the company, or some colleagues look at the data and say, "Oh, no, I don't see a difference," while you do see it. It's not really a good statistical way to compare data. Thanks to JMP Functional Data Explorer, we are able to compare these data statistically in an appropriate manner and even to determine whether we are getting some clusters. Qualifying the scale down model is key to translating the ranges from the PC study to the manufacturing control strategy. In case we have a difference, we are able to translate back this difference. I will show you very quickly how the JMP Functional Data Explorer works and how we use it for the example that will come later. The Functional Data Explorer is a very good tool to analyze continuous data. I will not explain in great detail how it works; I think a lot of talks have already been given, and are being given currently, about how it really works. I will focus on the main parts that are useful for my analysis. The first step, when we have our data, is to fit a model. This model has to be chosen between the B-spline, the P-spline, the Fourier, or the wavelet basis. Then we check which one of the models best fits our data. As a second step, we have a functional principal component analysis.
If you have already worked with principal component analysis, it's the same approach, but with functional data instead of data points. The idea is to transform the space to decorrelate the functions, giving the shape functions that we have here. These shape functions explain the variability of the batches' functions around the mean function. For example, here, the first shape function explains 56% of the variation in the attribute trend between the batches. Another graph that we obtain is the score plot. In the score plot, we can choose which components we want to compare. In this case, we have the choice of five components, the five shape-function components, which we can plot on the X- and Y-axes. Here we can look at whether the data group along the X-axis and the Y-axis, so along the first or second component in this case. By looking at them, we can already see that along component 1 we have, with high probability, two groups, one at the left part of the X-axis and one at the right part. When we want to confirm this, we can also use the control chart. With the control chart, we generate one control chart per functional principal component and then add the different scales at the top. We can look at the means generated for the scales and evaluate the scale differences and whether the means are really the same. Functional Data Explorer allows us to cluster comparable batches and to compare scale means. It's exactly what we were looking for to compare these attribute data. Let's have a look at an example. These data are real, but for confidentiality reasons, we anonymized and normalized the data in this presentation. In this example, we would like to compare the two lab scales, lab scale 1 and 2, which are in purple and in blue, with the manufacturing scale for a specific attribute, which was a time course at startup. We would like to know which of these two scales would be the better scale down model of this manufacturing scale. We ran the Functional Data Explorer and got some results. If we look at the eigenvalues of these functional principal components, we see that the first one already explains more than 78% of the variability between batches. The second one is more than 9%, so we will concentrate the rest of the analysis on these two components. We look at the score plot. In the score plot, we directly see, along the X-axis for functional component 1, which was the one explaining most of the variability between batches, that we have two clusters: one on the left having only the lab scale 1 data and one on the right having the lab scale 2 data and the manufacturing scale, the blue and green dots. We have some outliers. Along component 2, we cannot really define two groups. To confirm these groups, we look at the control chart. Here, we really see that the green lines corresponding to the mean values of the batches are really similar between lab scale 2 and the manufacturing scale, the blue and green dots. Lab scale 1 had a really different mean, and so this lab scale was not really optimal for us. Lab scale 2 is identified by Functional Data Explorer as the more representative scale down model for this attribute. It's important to say that this is for this attribute; for another attribute, it could be another scale down model. We have to check that for all the attributes that we have decided to explore in our analysis.
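For readers who want to see the underlying idea outside JMP, here is a small Python sketch of grid-based functional PCA (an illustration only; it is not the Functional Data Explorer implementation, and the simulated curves and scale labels are invented): each batch's trace is sampled on a common time grid, the mean curve is removed, and an SVD yields shape functions and per-batch scores whose clustering by scale mirrors the score plot just described.

```python
# Conceptual FPCA sketch on a common grid (simulated data, illustrative only).
import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0, 1, 200)                          # common time grid

def batch_curve(scale_shift):
    """Simulated attribute trend for one batch."""
    return np.tanh(6 * (t - 0.4 - scale_shift)) + 0.05 * rng.standard_normal(t.size)

curves = np.array([batch_curve(0.00) for _ in range(6)] +    # "manufacturing"
                  [batch_curve(0.10) for _ in range(6)])     # "lab scale 1"
labels = ["mfg"] * 6 + ["lab1"] * 6

centered = curves - curves.mean(axis=0)             # subtract the mean function
U, s, Vt = np.linalg.svd(centered, full_matrices=False)
explained = s**2 / np.sum(s**2)                     # variance explained per FPC
scores = U * s                                      # FPC scores per batch

print("variance explained by FPC1:", round(explained[0], 3))
for lab, sc in zip(labels, scores[:, 0]):
    print(lab, round(sc, 2))                        # FPC1 separates the scales
```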
Now that we have defined the scale down model, we move to the next part of the PC study: designing the experiments, performing them, looking at the interactions between the parameters and the attributes, and analyzing the data to determine the parameter ranges and impacts. For these two steps, we are not using Functional Data Explorer, but JMP with a very standard DOE approach. Why do we use a DOE? DOE allows us to relate parameters and attributes and to see the correlations between them. Indeed, in the design of experiments, we can identify different kinds of effects: the main effects, the quadratic effects, and the interaction effects. We are also able to optimize the attributes. For example, we can maximize a titer, minimize some impurities, maximize some purities or quality attributes, or reach a certain target. We have many options that can really help us. In the report at the end, we also have the effects of the parameters on the attributes, and thanks to the p-values, we can determine whether they are significant or not. To design the right model, we can use the JMP DOE menu and choose the design that we want. If we have doubts about how to do it, we always have the Easy DOE option, which is super convenient and helps us go step by step in designing our model. In our case, we use the classical response surface with a central composite design. In this design, we have a center point that corresponds to our set point. Let's say we have 35 degrees as a set point. Then we have what we call the operating range. It's the range including the accuracy of the probe as well as the small variations that can occur in a fermentation. For example, the 35 degrees will not be a straight line; you will have small oscillations due to the accuracy of the probe, the nutrients we are adding, and how the fermenter maintains the temperature. It is known that within this operating range, we will be within specification. Now, we want to know what happens if we enlarge the range: in case something happens during the process that causes, for a short time or maybe for the whole process, a higher or a lower temperature than the set point, are we, during this change of temperature, still within the specification for our product quality? JMP DOE helps us to optimize the number of experiments performed in the lab, but also to understand the interactions between the parameters and the attributes. Then we perform the experiments in the lab, collect all the data in a JMP data table, and build a model. When we have all the data in the table, we organize it following the tidy data principle: one row per experiment, one column for each parameter or each attribute. A very standard way to use JMP. Then we can fit the model with the different options. For example, we can put the different parameters in a response surface model. We use the Y here to add the attributes that we want to explore. We can choose different personalities, for example, the standard least squares approach or the stepwise approach, and so on. After running this model, we get a report. In the report, we have the tendency to scroll down and see the result in the prediction profiler, to see the interactions between the parameters and the attributes.
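As a concrete picture of the central composite layout described above, here is a short Python sketch. The factor count, set point, and range are placeholders for illustration, not Lonza's actual PC-study parameters.

```python
# Sketch of a central composite design: factorial corners, axial (star)
# points, and replicated center points around the set point.
import numpy as np
from itertools import product

def central_composite(k, alpha=None, n_center=3):
    """Rotatable CCD in coded units for k factors."""
    alpha = alpha or (2**k) ** 0.25                  # rotatable axial distance
    factorial = np.array(list(product([-1, 1], repeat=k)), float)
    axial = np.vstack([a * np.eye(k)[i] for i in range(k) for a in (-alpha, alpha)])
    center = np.zeros((n_center, k))
    return np.vstack([factorial, axial, center])

coded = central_composite(k=2)
# Map coded units to, e.g., a hypothetical temperature set point of 35 C +/- 2 C
temperature = 35 + 2 * coded[:, 0]
print(coded.shape)                  # (4 factorial + 4 axial + 3 center, 2) = (11, 2)
print(np.round(temperature, 2))
```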
But to do that, and to ensure that we have the right model for it, it's very important to concentrate first on the evaluation of the model. That is the first step, and it's why JMP gives us a lot of output in this report. We have to really understand the results of this evaluation. We have, for example, the lack of fit, the studentized residuals, the summary of fit, and so on. I really advise you to take a deeper look at these outputs, really understand what they mean, and get the best model for your data. Performing a proper analysis in JMP and verifying the quality of the fitted model ensures a reliable PC study and drug product. The relationships between attributes and parameters are in the prediction profiler, as I told you before. We have a set point in the middle; normally, it sits near the top of the prediction curve. We have the operating range, which is where the data stay within the attribute range. You get a larger range at the edges, where the attribute range intersects the prediction curve; those are the extreme points where we are still within specification. That is the proven acceptable range that we want to find with a PC study. The PC study also gives us the impact, and that depends on the shape of the curves in the prediction profiler: the flatter the curve, the lower the impact; the narrower the curve around the set point, the higher the impact. The attribute specification is only met within a parameter's proven acceptable range. Now, I will show you an example where we used that. In this example of PC study data from a fermentation, we created a response surface model with a stepwise approach and the all-possible-models option. We met quite a lot of criteria in our model evaluation; we were just not passing the lack of fit. We had a lack of fit, the model was not fitting well, and we were wondering why. We discussed internally with a lot of experts and looked really deeply into our data. What we found was that we probably had a plateau effect that was not captured by a response surface model, because it's a second-degree model. We had to add a third-degree polynomial term for the parameter. That is what we did, and we reran the analysis. There, we met all our criteria. Here you have the first model and the second model. You see at the end that the parameter that changed was the first parameter, and we have a slightly different curve in the prediction profiler. Indeed, we had a plateau effect here at the end. At this point, we had found a model that was correct and met all our criteria. We were able to define the proven acceptable ranges and the impacts. The impact, as you see for the first two parameters, was higher than for the third parameter. This is now applied in our production, and some runs are performed at production level with these ranges. At the end of fermentation, some product-related impurities can be detected in the analytical method's chromatogram as a post-peak shoulder, and that is really something that is complicated to remove in the downstream process. I will show you how, by using Functional Data Explorer, we were able to minimize these product-related impurities. As you see here, we have the chromatogram. What we were expecting is that the curve that goes up here would go down with the same shape on the right side. But we have this bumpy side going down.
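To illustrate how a proven acceptable range can be read off a fitted prediction curve (a toy Python sketch; the coefficients and the specification limit are made up, not PC-study results, and a real study would also consider the upper limit and the operating range):

```python
# Toy PAR calculation: find where a fitted quadratic prediction curve
# crosses the lower specification limit.
import numpy as np

b0, b1, b2 = 1.00, 0.08, -0.020       # fitted quadratic: y = b0 + b1*x + b2*x^2
spec_lower = 0.90                      # attribute must stay above this value

# Solve b2*x^2 + b1*x + (b0 - spec_lower) = 0 for the parameter (coded units)
roots = np.roots([b2, b1, b0 - spec_lower])
par_low, par_high = sorted(roots.real)
print(f"proven acceptable range (coded units): [{par_low:.2f}, {par_high:.2f}]")
```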
That means we have impurities in this part of the product, product-related impurities, which are hard to remove in the downstream process. This issue could become even bigger: if we don't remove them, or don't find a way to remove them, it could be that at the end of the process we are not within the quality specification of the product. We now have a curve that contains some impurities. We know that with JMP Functional Data Explorer we can better understand the curve shape. We want to understand how some parameters, from the DOE where we tested different parameters as you saw before, can maybe be used to explain the impurities and perhaps reduce them. The problem for us was that the Functional Data Explorer gives us principal components that are not real parameters; they are not concrete, real parameters that we can change in the lab, not a temperature that we can change by a few degrees. But in JMP Functional Data Explorer, we have an option that is super useful, and it converts the functional principal components into real parameters: the functional DOE. By adding, when you are building your model in the Functional Data Explorer, the DOE analysis that you performed, JMP is able to translate the FPCs back into real parameters in the profiler. So we were able to convert them back, and our hope was that, with this approach, we would be able to remove or decrease the impurities. Here is the example of how we concretely did it. We have this shoulder; here I zoomed in on that part of the peak to have a better look. We have this post-peak shoulder of 0.11, which was rather high. That is the amount when all the parameters are at their set points. Then we optimized it by moving each parameter in the functional DOE profiler. We found that by decreasing the first parameter and increasing the next two parameters, we reduced the amount of these impurities by half. That is really a great result for us, because it's so difficult to remove these impurities in the DSP part that if we can do it really early in the process by optimizing some parameters, that saves us a lot of time and money. It is really a beneficial solution for us and for our customers. That said, it is a result that we still have to test in the lab; it's not yet done. We also have to evaluate whether it would not create other issues, reducing too much other attributes that we don't want to reduce, or increasing other aspects. During this presentation, you probably realized the importance of statistics to support a process characterization. Today, I presented how we can improve process characterization and process transfer from small scale to manufacturing scale thanks to JMP. As take-home messages, you can remember these four aspects of my presentation. First, a process characterization ensures a constant product quality and reproducible yield during at-scale production. Second, with JMP Functional Data Explorer, we are able to cluster continuous data to detect scale differences. Third, JMP offers the option to identify the best fitting model, even for complex data such as PC study data. Finally, JMP functional DOE translates the FPCs into operational parameters, which supports process understanding and optimization. With that, I would like to warmly thank all my colleagues who helped me prepare this presentation.
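To illustrate the back-translation idea behind the functional DOE step, here is a conceptual Python sketch with simulated curves and made-up factor settings (not JMP's implementation and not Lonza's data): FPC scores are regressed on the DOE factors, and a predicted curve at a new factor setting is rebuilt from the mean curve plus the predicted score times the shape function.

```python
# Conceptual "functional DOE" sketch: link FPC1 scores to DOE factors,
# then predict a whole curve at a new factor setting.
import numpy as np

rng = np.random.default_rng(1)
t = np.linspace(0, 1, 100)
factors = rng.uniform(-1, 1, size=(12, 2))          # hypothetical DOE settings

# Simulated curves whose shape depends on factor 1 (illustrative only)
curves = np.array([np.exp(-t / (0.3 + 0.1 * f1)) + 0.02 * rng.standard_normal(t.size)
                   for f1, _ in factors])

mean_curve = curves.mean(axis=0)
U, s, Vt = np.linalg.svd(curves - mean_curve, full_matrices=False)
fpc1, scores1 = Vt[0], (U * s)[:, 0]                 # first shape function and scores

# Linear model: FPC1 score ~ intercept + factor1 + factor2
X = np.column_stack([np.ones(len(factors)), factors])
beta, *_ = np.linalg.lstsq(X, scores1, rcond=None)

new_setting = np.array([1.0, -0.8, 0.2])             # intercept, f1, f2
predicted_curve = mean_curve + (new_setting @ beta) * fpc1
print(predicted_curve[:5])
```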
A very special thank you to Sven, who had the idea for this presentation topic, but also to Claire and Ludovic, who shared their statistical knowledge; to Jonas for his support in preparing this presentation; to Romain for sharing some data with me; and to all the colleagues for their input and thoughts. Finally, a big, big thank you to Florian, who has supported our department with any JMP question, from the most complicated one to the most stupid one, over the last years. Thanks also to you for your attention. I hope you found this informative, and I am ready to answer any questions you have. Also, in Manchester, if you have more, we can still discuss them. Thank you.
Innovating more sustainable, higher-performing products is the foundation of our ambition for a Clean Future in Home Care at Unilever. Scaling up new technologies from laboratory to factory brings considerable and exciting challenges, so how do we approach innovation to deliver for our consumers and our planet? In Process Development, we believe the most value is created if DOE and modelling are established as key skills in every process engineer. This is why we have built a globally active community of practice through a 70:20:10 approach to digital upskilling, delivering impactful innovations through DOE and modelling on high-value projects. Embracing a "digital mindset" has empowered engineers to deliver impact and value as individuals, developing deep technical expertise in new-generation technologies through structured data capture and statistical modelling. This approach has enabled the introduction of sustainable biosurfactants and low-CO2 formulations straight to factory, with cost and complexity reductions across supply chains. New efficient process routes, optimised through modelling, have resulted in double-digit million-euro savings and product performance improvements throughout our Home Care portfolio. From formulation to factory, our approach to process development is helping to deliver the Clean Future revolution through a digital approach to innovation.   Hi, I'm Ewan, and I'm a Process Development Engineer working in Home Care at Unilever. Today, I'll be talking you through how we built an active global community for a digital-first approach to innovation and sustainability. Now, whether you know the brands or the company itself, Unilever is one of the world's largest consumer goods companies, with a portfolio of leading purposeful brands: home care brands like Persil or OMO and Comfort, and other brands you'll know, like Ben & Jerry's and Dove. We have an unrivaled presence in future growth markets. As a business, we have a determinedly commercial focus to be sustainable. Now, this focus as a sustainable business helps drive a real impact on the planet through the 3.4 billion people that are using our products every day across over 190 countries. The reach we have as a sustainable business resulted in over €60 billion in turnover in 2022. Now, our purpose and ambition as a business really shine through in our Clean Future strategy. This is the delivery of products that are unmissably superior in terms of product performance, products that provide great value across all price tiers in all of our brands, and products that are sustainable. It's really the combination of all three in every product that we deliver to our consumers, who are at the heart of our strategy, that makes this such a progressive strategy. There are over 50 proof points of the real impact and business power that this strategy has enabled. In Home Care, it's Process Development that delivers our Clean Future strategy at factory scale, taking the ambition of formulation scientists and marketeers and scaling up from lab concepts all the way to optimized factory-scale production around the world. Typically, we do this in four key steps, the first of which is building an understanding of how the formulation and the process interact.
Whether that's the effect of temperature making the product thicker and harder to pump, or effects on product clarity, so whether it looks clear or hazy, we really need to build an understanding of these interactions up front so they can feed into all the subsequent steps. The next step is process route design: starting to understand and build an idea of how we want to process this, both at pilot scale, so small scale, and also at large scale, in our factories around the world. From these two steps, we can then start building scale-up rules. This is the pilot plant scale-up work. Whether we're making product in 50-kilogram batches, 100-kilogram batches, 200-kilogram batches, or even more, this is where we really start to develop rules that will apply at factory scale, where we're producing on the order of tons per batch. Resulting from these is the main plant trial: testing our product in live factories around the world. In each of these steps, we generate a lot of data and a lot of understanding to build into the next step. We want to be as exploratory as we can to really push the innovation process, to innovate products that are conceptually new and that our consumers want. Our ambition here is: can we link these data and our ambition to be exploratory into a digitally driven innovation pipeline throughout the whole process? This is what we tried. We wanted to pilot a hypothesis-driven approach to design of experiments, DOE, and we had a new formulation that we needed to launch for one of our popular home care brands. We needed to launch this in over eleven factories worldwide, comprising over 25 different types, sizes, and scales of mixer. Now, this is a big challenge, and it's essentially a global rollout. How can we optimize this for unmissably superior product performance, to provide great value to our consumers, and also to provide a sustainable product? Because every optimization we do at this step can benefit our 3.4 billion consumers around the world, and essentially our planet as well. We started by forming a hypothesis, which consisted of which process parameters we thought might influence our product, whether the product quality parameters themselves or the process. We trialed a design of experiments approach at pilot scale. Doing this at smaller scale before we scale to a factory allows us to minimize cost and raw materials, and actually expedite the process as well. From these data, we were able to start modeling some of our quality parameters. An actual example is shown here. This enabled us to identify the actual critical control points, not just in the process itself, but in some of the formulation parameters as well. What we see here is the prediction of our product viscosity, depending on temperature, another quality parameter (parameter 2), and materials A and B. Now, parameter 2 and material A don't really have much of an impact on viscosity, but the really interesting part is in the temperature and material B. If we reduce the temperature even further, we remain in the green zone, within specification. If we reduce material B, it's the same. There's a potential for energy savings through temperature reduction and a potential for cost reduction through reducing material B. Both of these will actually result in a greenhouse gas reduction. This was the business benefit that we managed to provide through this approach: double-digit million euros in material savings and a global energy reduction for sustainability, all for a Clean Future.
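To make the "green zone" reasoning above concrete, here is a toy Python sketch. The viscosity model and specification limits are invented for illustration; they are not Unilever's fitted model. The idea is simply to scan temperature and material B levels and keep the settings whose predicted viscosity stays within specification.

```python
# Toy "green zone" scan over temperature and material B (hypothetical model).
import numpy as np

def predicted_viscosity(temp, mat_b):
    """Hypothetical fitted model: viscosity falls with temperature, rises with material B."""
    return 1200 - 8.0 * temp + 150.0 * mat_b

spec_low, spec_high = 800, 1100          # invented specification limits
temps = np.linspace(20, 50, 7)
mat_b = np.linspace(0.5, 2.0, 7)

for t in temps:
    for b in mat_b:
        v = predicted_viscosity(t, b)
        if spec_low <= v <= spec_high:
            print(f"T={t:4.1f}  B={b:3.1f}  viscosity={v:6.1f}  within spec")
```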
Now, this was the work of one engineer or a small team. If this is what we can achieve, why wouldn't we want to make these key skills in every engineer? That's what we set out to do. We wanted to make design of experiments and modeling key skills for every engineer in Home Care process development around the world. These engineers have different first languages, live in different countries and time zones, and are at different stages in their development. The aim here was to ensure that everybody could benefit from one program. Different teams work on different products, so they develop specializations; maybe they want to use certain features of JMP. We created a global community of practice where engineers can get together, share their learnings, and upskill together. The approach we took for this was the 70:20:10 approach. Ten percent of the time is spent on structured training, led by JMP champions, engineers with a higher proficiency in the use of JMP within our team, working with other engineering teams to upskill them. Twenty percent is shared learnings: regular community of practice sessions with engineers as participants and their mentors, so we can share our learnings, our struggles and challenges, and how we've progressed with JMP, with the aim of upskilling everybody. The 70% is the most crucial part of this program. That's delivering impact through hands-on work on key technologies in high-value projects. These projects are business big bets that we want to launch. Using DOE can help expedite timelines, can help optimize products, and can help us understand the behavior of the product to a level we never had before. If we made DOE and modeling optional, nothing would happen. We're trying to change ways of working here, so we have to integrate this as the approach for our high-value projects. We started by trying to cement good, structured data capture, forming a foundation for all of our future modeling work. We wanted, again, to cement a DOE-first approach to our exploration. We typically used a custom design in JMP for this, using a response surface. This was complex enough to accurately represent the formulations and the processes, but not so complex that it took a very long time or would be impossible for our engineers around the world to understand. Building on this, we wanted to build data analysis and modeling skills in our engineers, so they can start developing insights about the formulations and the processes. Again, this is straightforward and simple: linear regression using standard least squares. We wanted to ensure that there is no multicollinearity in our models. If we see an effect of temperature, we want to be 100% sure that it's temperature and not an artifact of another interaction. The real ambition of this journey is that we can build expert engineers that can create value-adding technology insights, moving on from custom designs and basic regression to other functionalities like simulations, whether we can simulate product specifications, for example, using desirability functions to optimize our products for batch cycle time or cost, and moving on to more complex ways of modeling, away from numerical modeling for viscosity and starting to consider other factors like foam. This is what we managed to do.
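As a small, hedged example of the multicollinearity check mentioned above (the factor names and data below are invented; this is not Unilever's data, and it may not be their exact diagnostic), variance inflation factors close to 1 indicate that an estimated temperature effect is not an artifact of correlation with the other factors:

```python
# Multicollinearity check via variance inflation factors (illustrative data).
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(42)
X = pd.DataFrame({
    "temperature": rng.uniform(20, 60, 30),
    "material_B":  rng.uniform(0.5, 2.0, 30),
    "parameter_2": rng.uniform(0, 1, 30),
})
X_design = np.column_stack([np.ones(len(X)), X.values])   # add intercept column

for i, name in enumerate(X.columns, start=1):
    print(name, round(variance_inflation_factor(X_design, i), 2))
```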
I'm going to talk you through a couple of case studies where we've managed to develop superior products, where we've managed to provide great value products and deliver the savings to our consumers, and where we've managed to deliver sustainable products to market, all three at the same time in our products. The first is an example of a product that's one of the most complex products we've ever developed. It all started with characterization through DOE. Modeling the formulation space helped us really build an in-depth understanding of the formulation. From this understanding, we were able to optimize the formulation. This is key, because with our approach to clean future, we want to reshape our formulations, not just for liquids, but powders, gels, capsules, creams, and bars. We're having to learn how we can produce the most sustainable formulations possible and deliver these with great performance and great value. This is why characterization is critical. This product was very complex, and we have to be able to make it in a factory. What we did is work with the factory teams to incorporate factory data into our model, and we were able to simulate how the formulation behaves in our factories, building confidence not just in the formulation team and the process development team, but also in factory teams and marketing teams. The results of our modeling enabled us to save multiple millions of euros in capital investment. The situation we're in now is that we can de-risk the formulations in the formulation development phase, before scale-up work and before factory launch, because we can build confidence in right-first-time scale-up of higher-performing-than-ever, better-value-than-ever products. This product was 100% sustainable. Through DOE and modeling, we were able to produce superior products, great value products, and sustainable products. The second example is extremely similar. This is a product you probably use very frequently. It's very well known. Again, it started with characterizing not just the formulation, but the process as well. Investigating the interactions here led to new understanding, understanding we did not have before, on both the formulation and the process. With this understanding and with this model, we were able to reduce the time it takes to make every single batch by 26% at factory scale, a 26% batch cycle time reduction. This is not just a time saving, because all that saved time is now time in which we can make other products, this product and all of our other brands, that we can deliver to consumers. We're unlocking factory make capacity. Building on our model here, beyond batch cycle time, we were able to incorporate cost and other quality parameters to start modeling and profiling multiple parameters at once. This enabled us to maintain a high performance, reduce cost that we can pass on as savings to the consumer, and improve product sustainability through a 21% reduction in our polymers, which are typically non-biodegradable. Again, we were able to maintain performance, provide great value to our consumers, and deliver sustainable products, not just for our consumers, but for our planet as well. To share a few key learnings from our journey: it was really the focus on high-value projects that helped us to create the real business impact we see here. It enabled us to bring stakeholders on board with the digital-first approach to innovation and really helped us start to expand this, not just in formulation, but to process development, supply chain, and other teams. 
We've moved on from process development, and we're expanding this now. The second is that frequent presentations, the shared learnings that made up 20% of our journey, helped maintain the team's development. They helped maintain the pace of value creation for the business. Having written these presentations and given these presentations, it then became much easier to transfer this understanding to other teams as well, and also to stakeholders. It helped stakeholder buy-in as well. The third learning is that one-to-one mentoring was a huge investment of resource at the beginning, but it was by far the most valuable for us. That early-stage investment really helped quickly upskill engineers, and having engineers that are proficient with JMP work with other engineers helped us maintain our focus on what really mattered from a formulation and process perspective. Overarching all of this is a focus on the journey, and long-term commitment is key if we want to successfully change our ways of working. This is not overnight. It's a long journey, and it has some sacrifices as well. Your productivity is slightly lower when learning new tools; when starting to work in different ways, your output is lower. But the results really come when that productivity begins to increase, because every technical insight you can pull out of your models results in a new space to explore, which then results in more insights. You start here to develop a circular innovation process. That circular innovation process is what we are now doing, not just in process development, but all across Home Care. I'm Ewan, and I've just taken you through how we built an active global community for a digital-first approach to innovation and sustainability at Unilever. Thank you very much.
Thursday, March 7, 2024
Whitworth 2
It is not unusual for individuals unfamiliar with how to properly create and analyse pass/fail experiments to treat the data as if they were continuous. Incorrectly assuming binary responses can be handled in such a fashion can lead to disastrous results. Failing to consider an underlying model consistent with these types of data, it is easy to create a design with too few or too many runs, not knowing how to properly estimate the sample size needed to achieve a certain power. Analysing results as if they were continuous fails to consider the fundamental nature of the data and how it might affect model assumptions. Doing so might produce unrealistic results, such as probabilities below zero or above one. This session focuses on designing, evaluating, and analysing experiments for binary responses such as pass/fail. Using two of the most common functions for modelling binary data, the logit and probit, various functionality in JMP and JMP Pro is illustrated, including the use of simulation to estimate power and building prediction formulas. Analysis options for fitting these types of models will also be explored.   Okay, so let's talk about designing and analyzing experiments when we have a pass/fail response. I'm going to talk about pass/fail responses throughout this session. However, this really applies to any case where we're dealing with a response that has two categories, that is binary, where you can be in one of the two categories or the other. They're mutually exclusive and exhaustive: pass/fail, go/no-go, yes/no, as long as that response has got two mutually exclusive and exhaustive categories. We'll break the session up into four different parts. We'll start with an example. Hopefully, it'll look familiar, something you might have seen in the past in terms of dealing with an experiment with pass/fail responses. We'll get into a little bit of things that can go wrong. After that, we're going to talk about the model, because everything depends on the underlying assumptions I'm making about the model, and that drives the direction in terms of the analysis that I pick and in terms of sizing the experiment. Once we're through with the PowerPoint slides, we're going to actually see how I would analyze the experimental data, the example data that I'm going to present. We'll talk about some of the options that are available in both JMP and JMP Pro. Then finally, we'll wrap things up with how you size an experiment. As it turns out with pass/fail experiments, it's not just the number of experimental runs, it's the number of trials per experimental condition that matters. We'll talk about how we set up the simulation in JMP Pro that will allow us to determine whether this number of trials is good enough. Let's talk about our example: we're making widgets. We have a process where we're not particularly happy with the failure rate. We've identified four different factors that we think are going to help us improve our failure rate for these widgets. We've got an additive, and we've got a compression step where there are three factors that we're going to vary: temperature, pressure, and hold time. The question is, does the widget have a defect? We process the widget; yes, it has a defect, or no, it doesn't have a defect. If it has a defect, we're calling it a failure. The current defect rate is 15%, and we're going to use that in the end to size our experiment. 
Our goal is to bring the failure rate down to five percent. However, we would be happy if we were able to identify a difference of about three to five percent. If we were able to detect that our defect rate goes down to, let's say, ten percent, then we would be happy. We would want to be able to detect that. The other thing I want to note is that each run takes about 10 minutes. I'm not unlimited in terms of how many runs I can do; I can do about six runs per hour. All right, so let's say we put these factors into JMP. We open up the custom designer. We decide we want a response surface design. We're going to add three center points, and this is the design that we come up with. So far so good. We take the design, we send the design to the operator who's going to run the experiment, and they return this. We've got 24 runs, about 4 hours to run this experiment. We're told that those ones and zeros represent an observation with a defect or an observation without a defect. Now, I go to analyze the data, and I put it into JMP. Now, I know that those ones and zeros aren't continuous data. It's really categorical data. It's binary data. I'm going to treat it as such. Maybe I know enough that what I'm dealing with is called logistic regression. Because I've designed this experiment in JMP, the response surface model comes over into my Fit Model dialog box. I go to fit the analysis, and right off the bat, I should be worried. Next to those parameter estimates, I get this message saying that the parameter estimates are unstable. Let's say I talk this over with somebody who's a little bit more knowledgeable about these types of experiments, and they tell me, well, the problem is you only did one trial per set of treatment conditions. You really need to do more than one to be able to get a good estimate. Let's say I go back and talk this over with the operator, because they don't like the idea of bumping up the size of my experiment from one trial to five trials; now I've got an experiment that's going to be about 20 hours. But I convince them that we really need to do this, and we decide to rerun the experiment with five trials for each set of experimental conditions. I get my data back. I've got two new columns that came back with my experiment. I've got that Y5 column that corresponds to the number of times a failure was seen, and I've got the column with the N in it that corresponds to the number of trials for a set of experimental conditions. Now, this time, rather than working with the count data, I think maybe what I'll do is divide the number of failed trials by the total number of trials, and I'll treat that as continuous data. Here I've got the proportion of failures for a given set of experimental conditions, and I'm going to treat that as continuous data. It makes me more comfortable because I'm more familiar with modeling data this way. It's the same model that we saw last time; however, rather than using logistic regression, I'm just going to use standard least squares. All right, so far so good. I use my favorite modeling technique, maybe I'll use my favorite model reduction technique, and then I come up with this final model. Again, so far so good. The next thing I want to do is optimize. I want to know what conditions are going to drive that error rate down as far as possible. 
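Before moving on to the optimization, here is a minimal JSL sketch of that proportion step, assuming columns named Y5 (failure count) and N (trials per condition) as described above. It runs fine, but as the talk goes on to show, treating this proportion as continuous ignores the binary nature of the data.

    // Hypothetical column names: Y5 = number of failures, N = number of trials per condition.
    dt = Current Data Table();
    dt << New Column( "Proportion Failed", Numeric, Continuous, Formula( :Y5 / :N ) );
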
I use the optimizer within the Profiler, and it tells me to set the hold time at 45 seconds and the temperature at 45 degrees C, and that should give me a predicted error rate below zero. I might be willing to dismiss this because it's close enough to zero to say, well, maybe it's just really close to zero, not below zero. But I know that technically this is not possible. I cannot have error rates below zero. Even though I might run with this, I might ask myself, well, how accurate is that estimate? How close am I really to zero? Have I built a bad model? Everything gets back to the model: the assumptions I've made when I treat the data as continuous, and the assumptions that I should probably be making when I've got pass/fail data. This is our typical model. On the left-hand side, we've got our response, our observations; in our case, it's the probability of a failure. On the right-hand side, we've got our linear model, the relationship we're trying to build between our inputs and our outputs. This is usually what we consider. The typical assumptions in this case are that the relationship is linear, that my response range is unbounded, and that my residuals are normal. My residuals are whatever is left over after I fit a model to the data. I assume that they're normally distributed, which means they're symmetrical around zero, and they've got all these nice properties. That's a part of the problem. If we're dealing with probabilities, our response really looks something like this, where I've got something that's more of an S-shape than a linear relationship. That linear relationship really doesn't hold. The unbounded response range doesn't hold either, because since I'm dealing with probabilities, I can't have anything below zero, and I can't have anything above one. Even the final point, the assumption of normal errors, doesn't work. If you think about it, as my predicted values get closer and closer to zero, my residuals have less and less space to operate; the gap between my fitted value and zero starts to shrink. I cannot have symmetrical errors close to zero. Likewise, I can't have symmetrical errors as I get closer and closer to one. Something has got to give in this case. Now, I would hope that there would be a simple fix to this problem: a simple transformation, probably, and maybe the use of a different distribution. Can we do this in JMP? The answer is we've got three different options. The first option, which we've already seen in the example, is logistic regression. That is, using the logit transformation on the data. In this case, we're not looking at the probabilities as a response, we're looking at the ratio of our two different response levels, the probability of a failure divided by the probability of a pass, and we're taking the log of that. We're taking that log to make the relationship a linear relationship. Another way to look at this, since we've only got two different levels, is as the ratio of the probability of failure divided by 1 minus the probability of failure. Now, if you've ever used these kinds of models before, and you've heard the terms odds and log odds, this is exactly the definition. That very first part of the equation is the definition of what odds are: it's the ratio of the probability of one level to the probability of the other level. 
That's my odds, and then I'm just taking the log of that. That's one option. Another option is to use what's called probit analysis. With probit analysis, I'm just using a different transformation. In this case, I am using the inverse of the normal distribution, or if you've ever built this in the formula editor or using JSL, that's just the Normal Quantile function. Now, there is a third option that's possible in JMP. It takes a little bit of work, and we're not going to go into a whole lot of detail in terms of its use, but I do want to present it because you do see it a lot in the literature, and that's the complementary log-log function. It's not available directly in JMP; we need to use something like the nonlinear platform to build it ourselves. It can be approximated with a distribution called the smallest extreme value distribution, but we'll have to leave that for another day. Really, we're going to focus on the stuff that's available out of the box in JMP. Now, there's a second big assumption that we need to make here, and that is, whichever transformation we pick, whether it's logit or probit, we're going to assume that the error has a binomial distribution. So it's two parts: not only am I doing the transformation, I'm also assuming a different distribution. As I mentioned, the first two are available directly in platforms in JMP, so they're easy. We'll focus on those. The last one can be modeled using the nonlinear platform. The nice thing is that everything is available from the Fit Model platform. I've got three different options within JMP. I can use logistic regression, which, again, we've seen in the example. There's an option called generalized linear models, and there's a strong relationship between nominal logistic regression and generalized linear models. Then for JMP Pro users, we have generalized regression, which gives me a whole bunch of other options as well. Now, there is another platform where you can do simple logistic regression, and that's Fit Y by X. But because we're dealing with an experimental situation, and we usually have more than one factor, I'm not going to talk about that. For simple logistic regression, though, Fit Y by X works as well. Before we get into how I implement this in JMP, we need to talk about data organization, because how we organize the data determines how we set things up in the Fit Model dialog box. Let me give you a quick, simple example. Let's say we have a two-factor experiment, time and temperature. We've done a full factorial, so we have all possible combinations of time and temperature. Let's say we plan to do five trials per experimental treatment. By treatment, I'm talking about a row in this data table, a unique set of factor combinations. I've got three different ways I can set up the data. The first way I could set up the data is just as raw data. Every row in my data table corresponds to one trial. You'll notice that the very first row in the raw data is treatment combination one, with a given time and temperature. In this case, my binary response is whether my response is green or red. That first response was green. The second row is treatment combination one again, and that response was red, and so on. That's the raw format. All three of those approaches can use the raw format. Now, I can summarize. Sometimes it makes more sense to summarize the data. 
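As a quick aside, both of those link transformations are easy to check in JSL, where Log() is the natural log and Normal Quantile() is the inverse of the standard normal distribution.

    p = 0.15;                        // example failure probability
    logit = Log( p / (1 - p) );      // log odds, about -1.73
    probit = Normal Quantile( p );   // about -1.04
    Show( logit, probit );
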
When I summarize the data, I am summarizing over both the treatment condition and the response level. You'll notice that, for the first row in the summarized stacked format, I've got treatment condition one, I've got all the greens, and I've got two cases where I saw green. The second row is my second treatment combination, three observations of green, and so on. I've just aggregated over my treatment combination and my response levels. All right, all three of the approaches can use that summarized stacked format. The third organization, and probably the most natural for a lot of people because it looks a lot like the organization I would use for a linear model, is the summarized split format. In this case, I've aggregated, again, over treatment combination, but only for one of my response levels. You'll notice that in this case, I've counted the number of times I saw a green observation for each treatment, and I've also counted the number of trials. If I were to, let's say, build a DOE using the custom designer, this is the format that I would expect to use if I were inputting data. If I wanted to use the summarized stacked format, I would have to actually duplicate these runs. I would have one for my green, one for my red, and just reorganize things a bit differently. Now, logistic regression cannot use this format. I've got to use one of the other two formats. Generalized linear models and generalized regression can use this format. That's something to note if you plan on using logistic regression. Okay, so let me do this. Enough with the PowerPoint. Let me go into JMP, and let's talk about how I would have analyzed each one of those examples with the experimental data that we saw. Okay, let's do logistic regression. We'll start with the first example, logistic regression. I notice that I've got my summarized stacked format here. I've got a column that has my passes and my fails in it, and the count. This is aggregated data; I've got the count in another column. All right, so pretty straightforward. Everything we're going to talk about is going to be under Fit Model. Because I designed this in JMP, it comes along with my model. This is my response surface model from when I designed the experiment. It has already pre-populated my dialog box here. I just want to point out that my response in this case is nominal; that's pass/fail. I also need those counts in there because I've aggregated my data. I'm going to pick my target level, and this really depends on the orientation in which I want to see my output, whether I want to do things from the pass orientation or the fail orientation. Let me just set this. Let's put it back to pass. Now, I've got a couple of different choices here. Both of them are logistic regression. I've got nominal logistic regression. If I know a little bit about stepwise regression, I can also perform stepwise regression on my logit model, on my logistic regression model. That's an option as well. All right, so let's just stick with the original nominal logistic. I'm going to run my model. Again, I'm going to use some model reduction technique. I might just go in there and manually remove items from the model and so on, or I could have used stepwise regression for this. It's really not important, because I got the right model to begin with. Let me remove one more term from my model. There we go. Let's say this is our final model; we're happy with this model. Let's go into our Profiler. 
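A rough JSL equivalent of that nominal logistic launch is shown below. The column and factor names (Pass Fail, Count, Additive, Temperature, Pressure, Hold Time) are assumptions based on the example, only main effects are listed for brevity, and the exact arguments of a saved script may differ slightly.

    Fit Model(
        Y( :Pass Fail ),                  // nominal two-level response
        Freq( :Count ),                   // counts from the summarized stacked format
        Effects( :Additive, :Temperature, :Pressure, :Hold Time ),
        Personality( "Nominal Logistic" ),
        Run
    );
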
Let me scroll down to the bottom to the Profiler and make this a little bit bigger, so we can see better. Now, one of the benefits of using this model is that my profile of passes and fails is bounded; we have to have probabilities between zero and one. Let me turn on my desirability functions. Now, this is going to look a little bit different because it really is split into two groups. I've got the proportion of passes I want to target and the proportion of fails. I'm just going to drop those failures to zero because I don't want to optimize for any failures whatsoever. I'll increase the desirability on my passes, and let's go ahead and maximize desirability. It tells me, again, the same hold time and temperature. However, I get a better estimate of what that predicted failure rate is, or the predicted pass rate in this case. This is telling me that with those settings, I should have about a 96% pass rate, so approximately a four percent failure rate. Certainly not below zero, and a little bit further away from zero than I might have expected. Okay, so let's move on to generalized linear models, and we're going to use the summarized split format. Again, if you've used linear models before, this format should look familiar. These are likely the observations that came from my designed experiment. I've counted the number of observations in one of my levels, in this case the number of failing items, and then I've got the total number of trials. All right, again, I'm going to go under my Fit Model dialog box. This is set up a little bit differently. For this particular organization, I have two continuous variables. The first one has to be the count of my individual level, and the second has to be my total count. I've already defaulted here to Generalized Linear Model. I pick my distribution; in this case, it's defaulted to Binomial. Here's where I have the option of choosing Logit or Probit. We get slight differences in terms of the probabilities. Let's just go ahead with Logit. Now, by default, the logit link and logistic regression are going to give me the same results. What makes generalized linear models a bit of a benefit is that I have the ability to relax some of the assumptions that we saw earlier. One of those assumptions is that the error term follows a binomial distribution. With binomial errors, the variance is fixed; it's determined entirely by the probability. Turning on overdispersion says, well, the variability might be a little bit bigger than I expect it to be, or the error might be a little bit bigger than I expect it to be. It relaxes that assumption of binomial errors a bit. I can also adjust my estimates for potential bias. I'm just going to turn on my overdispersion. Again, I'm going to run my model. Now, one of the disadvantages of this approach is that you don't have the stepwise ability here. You need to go in there and either do this manually or have some model assumed before you decide which final model you want to work with. Again, I'm just going to do this manually. I'm going to come up with my three important factors. You'll notice, if you were to compare this to the logistic regression, that because I had that overdispersion turned on, my variability is estimated to be a little bit larger, and my p-values are a little bit larger. 
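The corresponding generalized linear model launch can be sketched in JSL roughly as below. The column names are again assumptions, and the option names here simply mirror the Fit Model launch dialog; the exact saved-script syntax may vary by JMP version.

    Fit Model(
        Y( :Failures, :N ),               // count of one level, then total trials (summarized split format)
        Effects( :Additive, :Temperature, :Pressure, :Hold Time ),
        Personality( "Generalized Linear Model" ),
        GLM Distribution( Binomial ),
        Link Function( Logit ),
        // the Overdispersion Tests and Intervals option can be checked in the launch dialog
        Run
    );
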
With those larger p-values, I need to ask myself whether I really believe that these terms, even though they're not significant, are worth keeping in the model, and so on. But the nice thing is that I've got the option of being able to relax that particular assumption in the model. Again, if I want to turn on my Profiler and do my optimization, that's available as well. I've got the same nice abilities. In this case, the organization is a little bit different because I'm not modeling pass/fail, I'm modeling probability. It's the profiler that we're more accustomed to seeing. Obviously, in this case, I don't want to maximize this, because I would be maximizing my failures. I'm going to minimize that. Let's go ahead and optimize that. Again, it's telling me hold time, 45 seconds, and temperature, 45 degrees C. It's going to give me a slightly different estimate, although they're very close. Again, the nice thing is that I am bounded here. It doesn't matter where that slider goes, I am bounded between zero and one with those probabilities. All right, so for the final example, let's talk about generalized regression. Now, this is a JMP Pro only feature. It gives me a couple of additional options. I get the same place to start, Fit Model, and the same setup, because I've got my summarized split data. However, I am going to pick generalized regression. Now, I am going to pick a distribution. In this case, because my response is set up with two different variables here, I've got a binomial response, or I can pick beta-binomial. Beta-binomial is a technique analogous to using that overdispersion option in my generalized linear model. It allows me to account for a variance that changes as a function of where the probability is. We'll stick with the default in this case. Let's go ahead and run the model. If you've used generalized regression before, you know that all of your model reduction techniques are available from this platform. What I would probably do in this case is use one of these techniques. Let me make sure I'm going to enforce heredity. Let's use forward stepwise regression. I've got a whole bunch of different options here. If you've never seen generalized regression before, there are a whole bunch of nice options in terms of model reduction techniques. I'll use forward stepwise, which is the default. Click Go, and here's my final model. Again, very consistent. All three of these models are showing me very consistent results. The parameter estimates are going to be slightly different, and the probabilities are going to be slightly different. Okay, so now the question is, have I run enough trials to be able to really make this experiment worthwhile? That gets us into how I right-size my experiment. Let me switch back to my PowerPoint slide deck for a couple of short slides on setting up the experiment. By the way, the way we're going to size the experiment is a feature that's only available out of the box in JMP Pro. For those non-JMP Pro users, you might be able to write a JSL script for this, but it would be very cumbersome. JMP Pro makes this extremely easy to set up. We're going to start with the custom designer. What we're going to do is build our intended design using the custom designer. Before we generate that final table, we're going to turn on Simulate Responses. That's in the hotspot in the upper left-hand corner. 
Once we create the data table, we're going to get a dialog box where we can change the coefficients. On the next slide, I'll talk a little bit about how we set those coefficients. We're going to change those coefficients to our desired magnitudes. Again, I'll go into a little bit more detail on the next slide. Once we change the coefficients, we're going to set the error to be binomial, we're going to select the sample size, the number of trials that we want to see, and then we're just going to click Apply. Because I picked binomial errors, the Simulate Responses option is going to give me two columns: one column with the number of simulated successes and the other with the total number of trials. That's what we're going to use to do our overall simulation. Let's talk about how we set these coefficients, because this is a crucial step in terms of sizing the experiment. How we set these coefficients is going to depend on the underlying model, that is, whether I pick a logit or a probit model. My example is going to use the data that we just saw, the example with the logit model. That's going to determine how we set those values, starting from the baseline probability. If you remember from our example, we said that at baseline, we've got about a 15% failure rate. That's going to be our baseline. Then we have to ask, at what probability do I want to be able to see a difference? Again, recall, we said that if we were able to see a change to 10%, then we would be happy. That's the sizing that we want for our experiment, the ability to see a change from 15% to 10%. What we're going to do is use the transformation that we saw to calculate values at our baseline and at the probability we want to detect, 10%. We're going to take the difference of those two values, and we're going to use that as our coefficient. If we're using the logit, the baseline is just the log of that 15% divided by 1 minus the 15%. Our baseline is -1.73. If we're interested in seeing a change to a 10% failure rate, again, I'm just going to plug that into the equation: the log of 0.1 divided by 1 minus 0.1. That gives me about -2.2. That makes the coefficient just the difference between those two values. I'm going to use approximately 0.5 as my coefficients. Enough PowerPoint. Let's go see how that is done in practice. I have my experiment up, my experimental factors. I'm just going to use the custom designer. Let me go ahead and load my factors there. I'm going to build the design I want. If you recall, I said I have a response surface design. Let's add three center points. Let's do this in 24 runs, say. I'll just go ahead and click Make Design and let that crank through and determine the best design from the algorithm. Before I generate that very last table, I'm going to go to the hotspot, and I'm going to say Simulate Responses. Now, I'm going to create my table. That table gives me my design, and I'm going to change my coefficients to the values that I calculated. Let me just go ahead and reset those. I said my intercept was -1.73. I'm going to set each one of my coefficients to 0.5. Let me just copy this, so I can paste these. Finally, I'm going to set my error distribution to Binomial. Now, I'm going to click the Apply button. I want you to keep an eye on that table. 
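As a side note before the demo continues, the coefficient arithmetic described above is easy to reproduce in JSL (Log is the natural log):

    p0 = 0.15;                         // baseline failure rate
    p1 = 0.10;                         // failure rate we want to be able to detect
    baseline = Log( p0 / (1 - p0) );   // about -1.73, used as the intercept
    target = Log( p1 / (1 - p1) );     // about -2.20
    coef = target - baseline;          // about -0.47, roughly 0.5 in magnitude
    Show( baseline, target, coef );
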
Clicking Apply is going to generate those two columns that I'm going to use for my analysis. Let me back up one second. I wanted to set my sample size; I can't forget to do that. If you recall, the second time I ran the original experiment, I had five trials per condition. I'm going to reapply that. Now I've got the correct equations in there. These are the columns I'm going to use for the initial analysis and for doing the simulation. Now, if you wanted to, you could actually sign these effects. By signing, I mean making them negative or positive, depending on the relationship you expect to see. For the sake of time, I haven't talked about that. I have a slide that goes over how I might go through these different effects and ask, should that be positive or should that be negative? That can be done as well. Certainly, it will affect the results, though to a lesser extent than the actual magnitudes of the effect sizes will. But inasmuch as most designs, unless they're completely orthogonal, have some correlation among effects, whether an effect is positive or negative might have some influence over my power values. Now that I've actually generated my initial design, I'm going to go into Fit Model. I'm going to analyze the design. I've got to make sure to put in my simulated successes and my trials. With this particular situation, I will typically use generalized regression, and again, I'll use Binomial, because of all of those options in terms of model reduction. The nice thing is that with the simulation, those all carry through. I don't have to worry about writing any scripts to be able to do that. Let's go ahead and run that. Again, I'm going to use forward selection. I'll just click Go. Let me just scroll down to my forward selection model. It's right here. Now that I've got my single observation, what I can do is right-click. I will typically hover over one of the probability values. I right-click, and at the very bottom, there's an option, Simulate. I have to make sure that the formula column with the simulated number of successes is selected in both of these, and by default, it should be. I'll specify the number of simulated samples I want to run. Now, once I click Go, what's going to happen is that on every iteration of the simulation, it's going to recalculate that formula column and give me a new set of responses from my underlying model. It's going to do that for 2,500 runs, and then it's going to aggregate that data into a nice report. Now, rather than spending time watching that crank away, it's quick; it usually only takes about half a minute, maybe a minute. I've actually run the results, and this is what they look like. Here I've got my 2,500 runs, less that very first initial run. I've actually added some additional formulas here, which I'll be happy to share. But these are the two scripts that get added when this report gets generated. The Power Analysis is just a subset of the distribution. Let's take a look at the Power Analysis. What this does is give me a whole bunch of information about the distribution of those p-values across all of those runs. If you recall, in the underlying model in this case, everything was significant. What this tells me is the distribution of the p-values that I saw through my simulation; my predicted value is about 1 out of 10,000, so that's good. I get a calculation of the confidence interval around my predicted p-value. 
Actually, I'm not sure whether to call that a confidence interval or a credible interval, but either way, it is going to bound where that probability value is. An interesting and important piece of information is the number of times I have rejected the null hypothesis. This tells me that if I were to pick an alpha level of 0.01, I would have rejected that null hypothesis about half the time, even though I know that, in fact, I modeled it to be a real effect. If I go to 0.05, I've rejected it about three quarters of the time, and so on. This is the information that I use to ask the question: have I sized my experiment correctly? Will I be able to detect my main effects, my two-factor interactions, and any other effects in my model a sufficient proportion of the time to make me happy with the number of runs that I've picked? What I have often found myself doing in these cases is not only relying on some of the built-in reports, but also counting, for example, the number of correctly identified significant effects. I'll put in a cutoff value and calculate the number of correct responses. Again, this will all be packaged together with the material, with a little bit of explanation and the formulas for how I would build all of these models. Again, I really need to use my own definition of risk to determine whether I have captured a sufficient number of those significant effects to make the experiment worth running at 5 trials or 10 trials or 20 trials and so on. All right, so that wraps up what I wanted to show you in JMP. Let's go back to our summary. To summarize, when I am modeling this data, I'm going to use the Fit Model platform, regardless of which technique I use. I've got logistic regression, in which case I could use either logistic regression or stepwise regression with logistic regression in the background. I've got generalized linear models, which by default will give me the same output as logistic regression, but with the additional ability to relax the assumption on my errors and to correct for bias. Then I've got generalized regression, if I have JMP Pro, which allows me to do what I would see in logistic regression or a generalized linear model, plus the additional benefit of allowing me to do some model reduction as well. When I size the experiment, I'm going to build my experiment using the custom designer and then use JMP's built-in Simulate function to run my simulation-based power analysis. That wraps it up for what I wanted to talk about. There will be additional information in the supplementary slides, which I unfortunately don't have time to go through, but I hope you've enjoyed what you've seen, and hopefully, it's something that you can use going forward, and it's going to be beneficial. Thanks a lot.
Fujifilm Diosynth Biotechnologies is a contract development and manufacturing organisation (CDMO) that has a dedicated Process Characterisation department focused on performing process characterisation studies (PCS). The aim of PCS is to demonstrate that our customer processes are robust to changes in parameter settings or their normal operating ranges. PCS commonly employ design of experiments (DOE) to investigate the effects that process inputs have on quality attributes (QA) and process performance indicators (PPI). DOE analysis is a useful tool to identify the inputs that have an effect on the QA/PPIs, but it is mainly quantitative. In addition to the traditional DOE analysis, calculation of the impact ratio (IR) for each input provides a quantitative and qualitative assessment and can aid in the assignment of a parameter as being critical or not. The IR provides a measurement of the effect size relative to an acceptable range. Doing the calculations manually is time-consuming and prone to error. We will present an automation tool that can extract the required information from a DOE model and compute the IR. An interface allows the user to customise how the results are calculated.   Hello, everyone. I'm really happy to be here today with my colleague Sam. Today, we'll be talking about impact ratios: basically, what is the impact of your DoE factors on your DoE outputs or responses? I'll introduce the concept, the calculations, and examples in JMP, and then I'll pass things on to Sam, who will take you through automating that process using JSL and show you how he's done this in an add-in. First of all, confession time. I really wanted to call this talk, What Do Process Characterization Scientists Have in Common with NASA Scientists? But the link was too tenuous, and maybe it would have been a bit mystifying. Nonetheless, I still use the concepts here. This is why we have meteor pictures on the first slide. This is my metaphor: I'm looking at impact craters, and I'm comparing them with impact ratios, because we need to have a little bit of fun. What is happening here? We have a very similar meteor, and it could fall on the Earth or Mars or the Moon, and the conditions would be different. Here you have, first of all, an atmosphere, very little atmosphere, or no atmosphere at all, which changes the blast, which changes the behavior of the meteor. If you are on Earth, the impact crater that forms could have a really massive effect on what's going on around it, because on Earth, we have people living nearby. With this in mind, what are we going to go through today? We'll look at yet another statistical ratio; statisticians like their ratios. Then we'll look at what the impact ratio is specifically measuring. Then we'll answer this question, "Can I calculate this in JMP?" The answer is, obviously, you can, and you can do that in a number of ways. You can do it the painful way, and I will take you through this manually. We have also created a workflow to make things a little bit easier with table management, but I will skip over that step because we don't have time. Then I'll pass things on to Sam, who will show you the happy way of calculating the impact ratio by clicking on his JMP add-in. Here we have a bad hand drawing that I did myself to show you something that you have probably seen many times, because that's a control chart, and they are quite omnipresent in the statistician's world. Here we have an output, and it's changing over time as we create more batches, for example. 
We have zones on this chart. In the bluish purple zone, we have where our output falls. Because this is normally distributed data, we can make a prediction about where it will fall as time goes on. Here we have statistics giving us a prediction of how wide this blue zone might become. Then in green, between two specifications, we have a customer-type safe zone. Customers tend to give us specifications, or we might have calculated an acceptance criterion from developmental work, for example. What we want to do is make sure that our process is sitting in a zone that is well contained within the safe zone when we are doing process control. One ratio that's used for this is the capability index, like the Cpk or the Ppk, which is basically ratioing this blue zone against this green zone. This is the most similar thing I found to the impact ratio. Here we have almost the same graph. You can tell I have reused the same drawing. The only thing that changes here is that this time, instead of having statistics predicting the spread of the data, we have statistics predicting the shape of the data. We're modeling what happens to one specific output when you change the input and move its value from low to high. This is what you would do if you were doing a DoE and you changed the operating condition systematically. More particularly for process characterization, this low to high would be a range that you want to prove is acceptable. Here we have an equation at the end, and we have a blue zone, the minimum to maximum predicted for our response, and we still have a green zone, what's acceptable to give us quality product at the end of the process. We are checking, again, whether the deviation occupies an acceptable proportion of our safe or play area. This impact ratio can help us compare the impact of different factors, because we will get such a model for all of our factors. It could also be a criterion for classifying those factors if this is part of a DoE. What is this impact ratio really measuring? You had a clue on the slide before, but we're back to meteorites. Our NASA scientists have calculated two things: the place where our meteorite will impact the planet and the radius of the impact crater. They have said, as long as we are well within this green safe area, whoever lives outside it will be fine. Of course, that would not be true for a meteorite, but just stay with me here for a second. What we're ratioing here is basically the radius of the crater to the radius of the safe area. We're hoping that this number is much smaller than this one, that the safe area is very big compared to the impact crater. Now, what could happen is that we are off target, so the NASA scientists predicted that the meteorite would fall here, but it fell here. Even though the safe area was probably big enough for a centered meteorite, or process, in this case here we actually have the impact crater well outside of the safe area. I hope you don't live around here. Another possibility, and there are lots of other possibilities, would be that we are on target here, but we have miscalculated the crater. The crater is much bigger than what we thought it would be. Again, it's outside of the safe area, and the radius of the safe area is actually smaller than that of the crater. This is just a picture, so let's see how this would look on a graph. Same graph again, with some extra arrows here and some equations. 
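For comparison, the conventional capability index Gwen mentions can be written as the smaller of two one-sided ratios, and the impact ratio described next has a very similar structure. Here is a minimal JSL sketch with made-up numbers, purely to show the form of the formula:

    // Standard Cpk calculation; the mean, sigma, and specification limits are made-up values.
    mean = 10.2;
    sigma = 0.5;
    lsl = 8;
    usl = 12;
    cpk = Min( (usl - mean) / (3 * sigma), (mean - lsl) / (3 * sigma) );
    Show( cpk );   // 1.2 with these numbers
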
In plain English, your impact ratio is ratioing the effect size over the distance to your specification or acceptance criterion. The effect size here is those skinny blue and orange arrows: the distance from the minimum prediction to the center, or from the maximum prediction to the center. We are ratioing this to the distance from the center to your specification on either side. In this case here, this distance occupies 70% of our safe zone. That is not a good impact ratio. Here, it's only 40%, which is still a pretty high figure. But to be fair, we're going to have to take the maximum of those two values, so we would consider 70% here. This is my hand drawing; let's go to an actual JMP chart. This is a JMP chart, and it's just one part of a profiler that you would get at the end of a DoE analysis. In this case, we have a bit of curvature, but the zones are the same. If we want to calculate our impact ratio toward the minimum here, the prediction is that the minimum-to-center distance takes up about 42% of the center-to-specification distance. On the other side, the center-to-maximum prediction distance takes up only about 7% of our safe zone. Nonetheless, we're going to have to report the larger number, so it's still a pretty high number here. The question we're trying to answer is: are we on target, with a small impact compared to what is acceptable? Now, I will exit from here and go into JMP to show you how to carry this out manually, the painful way. Here we have a bog-standard DoE, five factors. We've only shared four responses here. It's all anonymized, so I'm very sorry, but the numbers look a bit funny because they're all between minus one and one. What we're looking for here is that we have response limits specified so that we can play around with the goal for those responses: maximize for most of them, while something like impurities might be minimized, or we might not even be interested in how high they go. I'm not going to go through the modeling because that's not what we're here for. Sorry about that. Wrong script. There we go. We've already fitted this, and this is the example I had in the notes, so I'm going to keep with this one. Here we are using the Profiler to get all of the numbers we need for our equation. We're setting this at the center point, and we're going to record that number. There we go. Those are the center point conditions. Everything is kept at the center point here. I'm going to ask JMP to remember this setting. I'm going to call this Load Center Point because I'm interested in the load here. That's my first one. Then, manually, I wouldn't really need to ask JMP for help here. Everything is at the center point for those two. I could just move this to the minimum, and I can record this again as the minimum value. That's my load min here. For the maximum, it's a little bit trickier. I could try to do this: I could make this bigger and try to make sure I get to the maximum here. But I'm going to trust JMP for this one. I'm going to press Control + Alt and click on the other two graphs here, and I'm going to lock the factor settings at the center point, because I'm only interested in what the load is doing and what the maximum is for the load when everything else is kept at the center point. Then I'm going to use the maximizer in JMP. There we go. I could have reached that by hand, but now I know it's exactly at the maximum. I'm going to save this one as well. We have the max. Now I have almost everything I need. 
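With those three predictions in hand (center point, minimum, and maximum), the ratio itself is simple arithmetic. Here is a JSL sketch with made-up numbers chosen to reproduce the 42% and 7% figures from the Profiler example above:

    cp = 0.50;     // prediction with every factor at its center point
    pMin = 0.29;   // minimum prediction over the factor's range
    pMax = 0.55;   // maximum prediction over the factor's range
    lsl = 0.00;    // lower acceptance criterion
    usl = 1.20;    // upper acceptance criterion
    irLower = (cp - pMin) / (cp - lsl);   // 0.42, i.e. 42% of the lower half of the safe zone
    irUpper = (pMax - cp) / (usl - cp);   // about 0.07
    irOverall = Max( irLower, irUpper );  // the value that gets reported
    Show( irLower, irUpper, irOverall );
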
If you're doing this manually, you're doing it for every factor or equation parameter on this list, for every response. If you have fairly big models, and you have maybe 5–10 responses, this becomes a lot of work very quickly. Then you would have to export all this data into a table. You would use this option if you had many remembered settings, but I only have one, so I'm going to do this. Here you have all the info you need, in the wrong format. What we need to keep is the settings here, and then we want this number here. Now, you would use Transpose under Tables here, but I'm not going to, because I've already done it ahead of time. I just want to quickly share that with you. Why is it not happy? Here we have the transposed data. We have all the labels from the table here. Manually, we would have to add our lower and upper spec. From these, the min, the max, and the CP, we can calculate the numerator for the ratios, the denominator for the ratios, and the ratios themselves. The small distance over the large distance gives you your lower impact ratio. Then you simply use a formula to get the maximum of the two, and you'd have to repeat that for every one of them. I'll leave the floor to Sam now, who will show you how much easier this is to do once you have an add-in for it. Sam? Thanks, Gwen. As Gwen mentioned, we decided to create an add-in to automate this task. I'll just give a quick demonstration now of how the add-in works. I'm using the same data table that Gwen has just been working with. The reason why we decided to make an add-in was just because it makes things easier than running a script directly from a script window. The nice thing about having an add-in as well is that when you hover over the add-in name, you can add a tooltip. In this case, the tooltip just tells you the correct window that the add-in has to be run from. In fact, if I try clicking it, you'll see that nothing happens, because I'm not in the right place. If I now open up the report with those models that we made earlier, you can see that I can run the add-in, and it will run properly. First of all, the user is presented with some windows which have some instructions and then, later on, some areas for user input. The first window just has some instructions about requirements for the underlying data table. For example, the factor columns have to be coded, and any units have to be input as column properties as well. If that's not the case for any of the columns, you can just hit Cancel and go back and make those changes. But in this case, I know that they are all coded correctly. I'll hit Run again and then click OK. The next window has the area for inputting the factor settings. You'll notice that we have an input here for categorical factors. It's not possible to calculate an impact ratio for a categorical factor. However, any categorical factors contained within the models have to be fixed at a particular setting, so this just allows the user to input that here. For the remaining factors, the continuous factors that were evaluated in the DoE, you can see that we have an area for inputting those settings. You can see that the script has automatically read in the minimum and maximum values from the table. What it has also done is calculate the center point as the midpoint between the minimum and maximum. However, in some cases, the center point might not be at the exact center of the range. That was the case for load here, so I'll change that to 20. 
It's possible to edit any of these values. The benefit of being able to change the input here is that if we run the add-in and get impact ratios that are too high for some of the factors, we can just run the add-in again, change the values here, and try evaluating a reduced factor range, for example, and then see if the impact ratio looks any better over that new range. But I'll keep the rest of the values the same and click OK. Then the final window has the input for the response acceptance criteria. I'll put those values in now. If there is only one criterion, then you can just put one in and leave the other blank. If you have no criteria, you can leave both boxes blank, and it will still be able to calculate the impact ratio. In that case, it's just the percentage difference between the set point response and the minimum or maximum prediction. It doesn't quite offer the same measure of practical significance, but it will still be calculated by the script. Now, if I click OK, you can see that the summary table has been generated. But if I go back to the window where I ran the add-in from, you can see that underneath each prediction profiler, the settings have been remembered. These are the settings that were used to obtain the minimum and maximum prediction for each factor in the model. This is useful so that you can go back and see how the calculations were made. It's good to be able to review that. Now I'll just give a very general overview of how the script actually works. First of all, the script loops through each response model in the window. It then sets the desirability to maximize the response. The script then loops through each term in the model. For the first term, which is load in this case, it unlocks that factor's setting, so it is free to move. All of the other factors are then locked at their set point settings. The script then executes the Maximize and Remember function. You can see that the setting has now been saved underneath the Profiler, and you can see that we've maximized this response by changing only the load factor in this case. It continues that operation for each term in the model, then the desirability is set to minimize the response, and that process is repeated: each term is evaluated again to get the minimum prediction. Finally, all of those remembered settings are updated so that the names are meaningful; we show the factor name and whether the goal was to minimize or maximize the response. Then this entire process is repeated for each response model contained in the window. That's how the script works. I'll just return now to the summary table that was output. You can see here that all the data gets collected and output into this table. Each row contains one particular factor for each response model that was evaluated. We have columns for the acceptance criteria that were input by the user. There are then columns showing the prediction at the set point settings, and then the minimum and maximum prediction for each factor contained within that response model. Then the remaining columns are just formulas. Here we've got the difference between the minimum prediction and the set point prediction, and then, similarly, for the maximum prediction and the set point prediction. Then the last two columns are just the calculated impact ratios: we have the lower impact ratio and the upper impact ratio. 
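Those formula columns amount to very little JSL. A minimal sketch, assuming the summary table has columns named Set Point, Min Pred, Max Pred, LSL, and USL (hypothetical names, not necessarily what the add-in uses), could look like this:

    dt = Current Data Table();
    dt << New Column( "IR Lower", Numeric, Formula( (:Set Point - :Min Pred) / (:Set Point - :LSL) ) );
    dt << New Column( "IR Upper", Numeric, Formula( (:Max Pred - :Set Point) / (:USL - :Set Point) ) );
    dt << New Column( "IR Overall", Numeric, Formula( Max( :IR Lower, :IR Upper ) ) );
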
You'll notice that where we only have one criterion, as for impurity in this case, since we only have an upper acceptance criterion, we only calculate an upper impact ratio. Then the final column is just the overall impact ratio, which is simply the maximum of the lower and upper impact ratios. I'll just finish off this part of the talk by summarizing the benefits of using an add-in to automate this task in JMP. Firstly, using the add-in is much quicker than doing it manually. Secondly, it allows the task to be repeated much more easily. As I mentioned, if you run the add-in and get impact ratios that are considered to be too high, you can then just rerun the add-in, change the factor settings, and see if you can obtain acceptable impact ratios. Then finally, there's much less chance of any errors occurring because you don't need to do any data transcription or manipulation of the data. The script collects all that data together and then puts it all into this table that gets output at the end. Okay, thank you. That concludes my section of the talk. I'll hand over to Gwen to summarize things. Thank you. Just in addition to what Sam said, a big advantage of the add-in is that you can repeat the calculations and change what you input in the first couple of input windows. You could change the factor settings if you wanted, for example, to show where your proven acceptable range is, which presumably sits a little bit outside of your normal operating range. The other thing you could do is revise your specification or acceptance criteria. If your impact ratios were really high, then your safe zone would have been maybe a bit too small to be comfortably operating in. You could push the specifications out and check whether all your impact ratios become a bit smaller. You could actually probably use this as a justification for changing those acceptance criteria. Another thing that you can do with those impact ratios is use them collectively as a criterion to classify your tested process parameters. A process parameter could be classified as highly critical, critical, or non-critical at the end of the DoE, for example because its impact turns out to be very small. I think that concludes what we had to say about impact ratios today. We're both available to answer questions if you have any. Thank you very much.
In the pharmaceutical development of tablets, most active substances are difficult to process or dissolve. There are also many process steps and functional components that need to be included to solve all the issues that appear along the way. To narrow the focus, it is important to recognize which of the many potential factors are the most important for the responses of interest. Definitive screening designs are often considered to be most appropriate for experimentation with four or more factors. Whenever there are available results of experiments that are not part of a specific design, it is good to use tools such as advanced predictive modelling techniques to help capture valuable information. The aim of this project was to apply different analytical techniques to evaluate the effects of input factors on responses. Another goal was to find the balance between the factors that contribute to tablet appearance and mechanical resistance and the factors that enable quick active substance dissolution, which is important for product in-vivo performance. By using a combination of analytical tools, valuable insights were obtained regarding the effect of formulation and process factors on tablet characteristics. Optimal settings were then defined to maximise dissolution.   Hello, my name is Tijana Miletić, and I work in Hemofarm, part of STADA Group, in product development. Today, I'm going to show you how we used a definitive screening design and advanced predictive modeling as useful tools in product development. The main goal of product development here was to find optimal formulation and process settings for several quality attributes. At the beginning, we suspected what the potential effects of formulation and process variables on tablet properties could be. This is how we selected the factors for our experimental study. But we did not know what the actual relationships between these variables would be for this specific system. Here we used experimental design as a way to extract the most information from a limited number of experiments. Our main challenge in this study was to achieve maximum dissolution while maintaining the mechanical resistance of the tablets at the same time, and, of course, to decide which ranges and which factors we were going to use and which experimental design we were going to select. Because we suspected that we would have some significant quadratic effects and interactions, we selected the definitive screening design. Overall, the impact that we achieved with this study was positive for our product development because we managed to identify the most important factors and the optimal values to achieve the desired responses. Here, the main response was dissolution because it is an important in vitro result, which is considered prior to going into costly clinical studies. Tablet hardness, on the other hand, is a good indicator of the mechanical resistance of tablets, which tablets need in order to withstand the manufacturing and packaging process. All in all, moving ahead in product development with the right decisions being made is something that saves time and resources in all stages of product development. Here we were happy with the value we achieved because we got some direction in product development. Instead of performing a full factorial design with 64 runs for six factors, we managed to execute the definitive screening design with 13 runs, in about a quarter of the time it would have taken for the 64 runs.
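For context on those run counts: a two-level full factorial in six factors needs 2^6 = 64 runs, while a definitive screening design needs roughly 2m + 1 runs for m continuous factors, which for m = 6 gives 2·6 + 1 = 13 runs.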
Besides that, we also used some additional experiments and advanced modeling techniques to get even more insights into the factor effects and their significance, which in the end resulted in having our development goals achieved. Our main data analysis was executed with three main activities, which we present in our poster. We used JMP 17 for this data analysis. Here we present visual data exploration. We also used the Fit Definitive Screening platform and the Model Screening platform. At first, we used a scatterplot matrix, which provided a quick assessment of the relationships between multiple variables at the same time. In order to better understand the nature of the relationships within our variables, we looked into our models in more detail. From the definitive screening, we recognized that the most significant factor for hardness was compression force, and for dissolution, the amount of disintegrant and compression force. We did not observe significant quadratic effects, and the only significant interaction was between the amount of binder and compression force for the disintegration response. Going back to our visual data exploration, we saw that there is not that much of a connection between disintegration and dissolution, meaning that this result is not that important until we see whether we like our dissolution results, so we did not focus too much on this interaction at the moment. We used run 14 to evaluate the predictability of the model created here, and we obtained 78 versus 80% and 93 versus 94% for dissolution, which is considered a good match for this type of test. We obtained an exact match for hardness of 54 newtons. Being happy with this, we then used the prediction profiler and maximized desirability, meaning we wanted to see how we could maximize dissolution, and we learned that we need to use 8% of disintegrant, 3% of binder, and compression force at the lowest level. Here we were worried that with the lowest compression force we would compromise the mechanical resistance of the tablets, knowing that it could lead to lower hardness. So we used a bigger data set of 27 runs to graphically evaluate whether there could be a trend there, so that we could rely on the amount of binder to get a positive effect on tablet hardness. We also performed model screening for this response, and we picked the Fit Least Squares method to develop the best models to evaluate possible effects (a generic sketch of such a model launch follows after this paragraph). We reduced the models so as to get the highest possible significance of effects. Here it was confirmed that binder could have a positive effect on hardness. We also performed model screening for the dissolution response, and here we present, with two methods and two models, the highest possible dissolution prediction, which in both cases is about 87%, and we also confirmed the significance of disintegrant amount and compression force for this response. Here we saw some difference in what the optimal level of lubricant could be, and this is not surprising because we know that lubricant can have a potentially negative effect on dissolution and hardness. But here it seems that this effect is not significant. We could go ahead with a higher level in order to decrease the risk of sticking, or perform additional experiments to better explore this effect. Looking at all these insights, we were satisfied with the conclusions that we made. Definitely, the key ingredient for the desired dissolution was the disintegrant.
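As an illustration of the kind of reduced least squares fit described above, and not the exact script used in the study, such a model for hardness could be launched in JSL along these lines, with hypothetical column names:

// Hedged sketch: hypothetical column names, reduced model for hardness
Names Default To Here( 1 );
dt = Current Data Table();
dt << Fit Model(
	Y( :Hardness ),
	Effects( :Compression Force, :Binder ),
	Personality( "Standard Least Squares" ),
	Emphasis( "Effect Leverage" ),
	Run
);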
We were happy to learn that the amount of binder did not have a negative impact on dissolution, but could have a positive impact on tablet hardness. We also learned that the glidant does not have a significant impact on the evaluated responses. Of course, compression force is a key process parameter, and it should be carefully set. Overall, the definitive screening design provided direction in our formulation development, which led to the desired results in less time, and we were happy with that. With advanced modeling techniques, we managed to get additional insights. Of course, in order to have better predictability of the models, or to investigate the relationships between variables in more detail and with more precision, we would need to generate more data. But based on this experimental study, we got results that made us feel confident enough to go ahead with our product development. That is all that I prepared for today, and thank you for your attention.
908 Devices released a JMP add-in tool that facilitates the direct analysis and trending of amino acid and vitamin concentrations generated at-line by the REBEL media analyzer. In media development and adjustment, various parameters are tested over time, which leads to a high number of samples and generated data. The REBEL analyser, in combination with the JMP add-in function, allows data sets to be visualized immediately in a customized view. Alvotech presents a case study demonstrating how JMP enables a fast comparison of amino acid levels in different bioreactor runs with different media formulations, leading to improved process understanding. In a complex experimental setup, various media formulations were tested over 13 days of multiple bioreactor runs by analysing amino acid and vitamin concentrations at-line with REBEL in three different dilutions in duplicates to evaluate batch performances. The results of this large data set of all 21 amino acids and six vitamin levels were visualized with JMP in a simple way that still provided various setups for comparing data and determining measurement accuracy.   Hi, my name is Ildiko. I'm representing Alvotech Germany at this meeting together with my colleague, Thomas. We are both members of the Process Innovation Team at Alvotech, and we are working mostly on media development projects. We are active JMP users, mostly using different DOEs, complicated DOEs, but this time, I would like to present a small, smart tool for data visualization, which is for REBEL users. REBEL is an at-line spent media analyzer, manufactured by 908 Devices. Last year, they released a JMP add-in tool which facilitates the direct analysis and trending of amino acid and vitamin concentrations. With these measurements, really large data sets are generated, which can be immediately visualized in a customized view with this JMP add-in. We would like to present the workflows to compare different results, different media formulations, and different bioreactor runs, and to show trending over time, to show how this smart add-in tool can lead to improved process understanding and how it can help us in our everyday work. A few words about the REBEL. What is REBEL? It's a spent media analyzer, a capillary electrophoresis-based small mass spectrometry instrument working with kits, and it has the capacity to analyze 33 different metabolites in one sample in approximately 7–10 minutes, so it's really fast. We use this platform, as I mentioned, in media development for different purposes. We compare vendors for the formulation of a specific base medium, and we define which vendor or medium performs the best. As I mentioned, we are doing complex DOEs, and we need to follow the analyte trends in these different media conditions. We have to make fast decisions day by day to define the feeding strategy, or we just need to see these conditions, the analyte trending, across a time course. We also use this tool to compare different conditions in bench-scale bioreactor runs, for example if there is a product lifecycle management project and we would like to implement a new medium and compare it with the other setup. When we generate the large data sets, we have this JMP add-in tool for really fast data processing and visualization. I would like to note here that it works only with JMP 16 or above.
Going back to the full design and the methods, as I mentioned, I would like to present three small comparison workflows on this poster through three case studies. I would like to show some short demo videos about how these workflows go from data import to the customized dashboard view, which can then be further processed, saved, sent, or exported according to our needs. The analysis really takes just 1–2 minutes for a large data set with the prepared data filters. This is a really useful tool for REBEL users. How does it work? First, the installation of the add-in is really easy; you just have to install the add-in file. We at Alvotech have a remote version of JMP because this is a GMP-validated environment with many restrictions. After the add-in is installed, we see the three workflows: the sample comparison workflow, which I would like to show in study one, and the two time course workflows, for analyte trending and condition trending, in study two and, briefly, study three. The result files that the machine generates are CSV files coming from a so-called REBEL batch, which is based on the batch run sheet. This is an Excel template in which the sample labels are well defined. When the result sheet is generated, it can be directly imported into JMP. There is an optional file, the so-called sample label file, which is very useful because it has the option to correlate samples to custom metadata. This is important in the trending workflows when we would like to map, for example, a bioreactor name to a sample label name to see which conditions are in which bioreactor. JMP recognizes this sample label and its components. I will show that later in the demo videos; it will be clearer there. Finally, when the results are imported into JMP, it first generates a default report dashboard, which can be customized later with data filters. This is the REBEL workflow. Now we go to the first comparison workflow. I would like to show how this customized view was generated, and I would like to show that in a short video. First we open the sample comparison workflow. We select the folder where this file is, which is a little bit complicated for us as we are working with a remote JMP. We find the folder. It looks empty at first when we select it. We have to select the sample label file and open it. We adjust the header, which starts at row number one. This is the imported file, from which JMP generates a data table. It takes a little time. We go directly to the sample comparison dashboard view. We can see here all the analytes in different bioreactors with different media. The tabulate shows the mean values of the analytes because we are doing duplicates or triplicates. On the left side, in the filter panel, we can select the sample label because in this case we would like to show only the medium types. It also takes some time to open. This is the medium A composition. All the other media should appear one after the other. The other workflows are much faster; somehow, the media comparison workflow is really, really slow. These are the same media prepared by different vendors, and you can see the differences. We can also see the concentration pattern of the analytes, and we can decide whether the dilutions of the samples were good or not. Also in the control panel, we can select confidence intervals.
We can select the standard deviation for the values to figure out whether these are in the defined range that we would still accept. Once we have the final customized dashboard, we can save it. After saving, it should appear when we click on the data table. Yes, it's in the control panel on the top left. The modified dashboard is there; we just click on it and it appears again. Of course, we need to save the data table to make it appear again next time. We can also export the data in any format, of course. This was the first workflow. The other workflows, which we use the most in media development, are the analyte trending workflow and the condition trending workflow. Going to the analyte trending, I also have a video for that. But before that, I would like to mention a very important action that needs to be taken: defining the time course component. It works in such a way that in the original results table, the Excel table, we have the first entry of the sample label column. In this case, it's culture station one, vessel number two, on day three. JMP needs to know which is the first time component, which here is the number three. It works with delimiters to define the time course component. A delimiter, as shown here, is the character that separates the strings of text. In this case, it's a double space, based on this naming convention (a tiny JSL illustration of this parsing follows after this paragraph). The time course component I want to set as the first time point is the number three, which is the third component of the sample label. We can just update the preview here, and it shows that three will be the first time component, and the sample label is that one, which is the same as what I defined in the optional sample label file, corresponding to the batch record position that I would like to see in the analyte trending data set. Once we have this, the JMP import shows the dashboard. It's better to show that in the video as well, step by step, which makes more sense to see. We open the second workflow, analyte trending. We pass the information from a sample label, as I mentioned before, because I created an optional sample label file. The time course sample label file is there; JMP recognizes it immediately and imports it. This is how I define the time course component with the double space and the number 3, which shows that the time starts with 3, which is day 3. Now the data table appears, and also the analyte trending dashboard view, which shows all the analytes. It's important to say that this analyte trending shows the mean of everything over a time course. All the days for which we have samples are defined here. We can set the standard error range, and we also have the filters to show only one culture station, because otherwise it won't show the right trending. This analyte trending shows only one condition. For all analytes, we can see the outliers here. We can see any unusual behavior of an analyte. We can select analytes to show just one in a customized view. We can also see if values were not measured. We can add the confidence intervals, and we can add the standard deviations if we need them, and check the different analytes one by one if we would like to. And we save the dashboard to the data table. Note that we can save any number of modified dashboards that we would like to export later. Here, there is an unusual thing. Actually, a bug was identified in this data set.
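To illustrate what the delimiter logic is doing, here is a tiny JSL sketch using a simplified, hypothetical sample label rather than the real naming convention. Note that JSL's Words() function treats every character of the delimiter argument as a separator, so this sketch assumes the individual label components contain no internal spaces.

// Simplified, hypothetical label: culture station, vessel, day, separated by double spaces
label = "CS1  V2  3";
parts = Words( label, " " );     // expected result: {"CS1", "V2", "3"}
day = Num( parts[3] );           // the third component is the time course component (day 3)
Show( day );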
I'm showing that bug on the next slide with the condition trending workflow, which is generated from the same data set, at least the first one. Here, I'm not showing the video this time because the data import would be the same as last time. As I mentioned, this is a DOE for media conditions. It's an AMBR15 small bioreactor setup. This is my favorite workflow, the condition trending workflow, where we select one analyte throughout the time course in all the conditions we have. Here, there is this bioreactor position with question marks, which was not in the sample label file. This is a bug in the system. In this case, when checking the data table, the time course components and the name of the sample label were not correctly separated. This was identified as a bug. We can see here that values are missing on specific days, like here and here, day 11 and day 6. We can also see the missing values in the data table. This is a bug which can be reported to 908 Devices; the investigation of what happened is still ongoing. The reason why I find this time course condition trending workflow really good and useful is that here we can see, for example, how unbalanced the different vessels and different conditions are, because it's a DOE with all kinds of conditions, and we have no idea about the outcome. The media are not balanced. Amino acids really show different trending. Alanine, for example, shows a completely different pattern in how the cells are growing and dying after a specific day. It shows really different trends, which is very valuable and important information for us when concluding the study. Also, if we would like to make an emergency feeding strategy change day by day, the REBEL can measure all the analytes in all conditions every day, and we can see that really fast with this workflow. This is also condition trending, but the last study I would like to show is the lifecycle management of a mAb product, where we would like to show how a new media formulation performs compared to the original media formulation. We just follow the cells growing and the product produced throughout this 14-day time course. We followed, of course, the analyte trends. A large data set was generated, and I created the sample label file, which means that after the time course component was defined, the same way as last time, the batch record positions were added with the same names to make JMP able to identify the batch records. I'm showing you this last video. These are concatenated files: two different batches under the same folder can be visualized by JMP and imported into JMP. This is also a really good feature of this add-in tool: different result files for the same study can be opened at the same time, and the data are concatenated. The sample label file is defined and read by JMP immediately. Here the definition is a little bit different because the first sample label column is also different; I identify it as day zero here. Now we see that, by default, all analytes under all conditions appear. That's why we need to apply the local data filter here. Let's see arginine, for example, which is on the slide as a screenshot. We can see clearly how balanced the media is; it is an already defined condition. The trending is the same between the two different media types, at least for this analyte. The [inaudible 00:22:57] positions are perfect here. No bug in this case. Once it's done, we can save the dashboard as before.
Of course, it can be further customized with the Graph Builder options for more advanced JMP users. This is how you can export it: by selecting an export folder, it can be exported as a picture file, or the result table can be exported into Excel. There are many export options here already. Now let's go back and conclude about this really useful tool. We can say that these complex data sets can be visualized in just a few clicks by adding a time course component and identifying the sample label components, which is really easy. It is really useful for non-advanced JMP users, because I don't believe that much JMP knowledge is needed at all to use this add-in tool. The benefit is great because it allows really fast data-driven decisions in a very short time frame, which is really important. I have a notification from 908 Devices that the upgraded statistical analysis tool, version 2.6, is about to launch. In this improved version, the separate sample label file is not needed; the sample label can be a column in the original result file, and there is no need to add an optional sample label Excel file when defining the time course and opening the trending workflows. That's all. I hope you liked it. Thank you so much for your attention.
Thursday, March 7, 2024
Ballroom Ped 4
Pharmaron has developed a platform process to generate adeno-associated virus (AAV) gene therapies with a highly adaptive toolbox to manage varying AAV products and serotypes. Our toolbox can rapidly assess a product's compatibility with our platform through a manufacturing feasibility assessment and finely tune a number of parameters for targeted process optimisation. One essential tool for Pharmaron's approach to optimisation is DOE (design of experiments). We show how a central composite DOE approach can maximise the recovery of monomeric AAV by identifying the optimal residence time, loading density, and load pH for the initial AAV purification process's capture step. The optimal loading conditions were measured using titre by Capsid ELISA and multi-angle dynamic light scattering (MADLS) and monomer percentage by DLS. DOE analysis showed a strong link between loading density and monomer content, whereby a higher loading density resulted in a higher yield of monomeric virus. Load pH and residence time had negligible effects on recovery and monomericity. Since it facilitates the analysis of multiple parameters in a fraction of the time, DOE has enabled Pharmaron to rapidly identify the optimal conditions for affinity capture. It significantly improves process performance and drives generation of a highly pure, monomeric virus.   Hello, my name is Damon Ho, and I am a Scientific Associate III at Pharmaron Biologics UK, based in Liverpool. Thank you for tuning into this talk. I'm really proud to have completed this work during my student placement via Liverpool John Moores University at Pharmaron Biologics, and even happier to announce that JMP was a huge part of the success of the optimization of our capture chromatography step. Moving on to the first section of the poster. To introduce Pharmaron to individuals who may not have heard of us before, we are a leading pharmaceutical research and development services provider with worldwide operations. In Liverpool, we currently work on viral vector-based gene therapy development and clinical manufacture, currently focusing on adeno-associated viruses, or AAVs. At Pharmaron, we have an impressive platform process that can manufacture multiple AAV serotypes and products, which is illustrated in Figure 1. This process utilizes depth filtration, followed by capture chromatography, then intermediate polishing chromatography, followed by polishing chromatography, on to our formulation steps, and finally, sterile filtration. Now, I'll introduce you to our robot and the use of JMP. We are immensely proud to share that we have a state-of-the-art high-throughput process development robot called the Beckman Biomek i7, as you can see in Figure 2 and Figure 3, that can rapidly assess a product's fit with our platform and finely tune parameters for its seamless integration into our process. Utilizing this robot for HTPD alongside the use of JMP for DoE, we can generate a high-yielding, high-purity drug product in a fraction of the time compared to conventional methods. This poster focuses on optimizing the initial capture chromatography step in our platform process for an AAV product. A DoE was designed using JMP software to determine the optimal loading conditions for processing. Three factors were chosen to be optimized, which were the residence time, loading density, and loading pH, with viral titre and viral aggregation to be measured as outputs. Now, for some quick information on how the DoE was designed in JMP.
A central composite design, or CCD, was chosen to create the DoE for a number of reasons. Firstly, a minimal number of factor combinations is required to estimate main effects, and it is able to analyze two-factor interactions and quadratic effects. The CCD also has good lack-of-fit detection, which can easily show which factors affect the chosen outcome, and graphical analysis is possible through various tools available in JMP, as you can see throughout the poster. The CCD was created in JMP by first selecting a response surface design, entering the parameter names of loading pH, loading density, and residence time, and then inputting the predefined high and low ranges. The goal was to maximize the monomeric virus content and viral capsid recovery, so these were input in the responses section. The software is very flexible, and you can always add more responses in the finished model. In this instance, as this was a conventional CCD, on-face was selected. Triplicate measurements were selected for the center point. The CCD was then ready to be generated. In total, 17 different experimental conditions were generated via the CCD model with varying low, medium, and high parameters; the breakdown of that run count is shown after this paragraph. Center points are defined as all parameters at their medium settings, and they form the foundation of the CCD model, with the low and high parameters acting as probes to test how the factors interact and influence each other. This forms a 3D response surface that can quite accurately predict the interactions of factors. Using JMP, optimal conditions can then be hypothesized and tested in a subsequent confirmation run. Our HTPD robot was utilized to perform the capture chromatography at a very small scale, allowing multiple conditions to be run at the same time, which would have taken considerably longer at lab scale. This system allows for a highly accurate and reproducible process due to its automatic nature. Once all of the conditions were run on the Biomek i7 platform, we employed our world-class analytics to measure AAV capsid titre by an enzyme-linked immunosorbent assay, shortened to ELISA, and by multi-angle dynamic light scattering, also known as MADLS. Monomeric virus content was measured by conventional dynamic light scattering, or DLS. Not only is JMP used in the design of an experimental study, but also in the analysis of the results. Once analytical data was available, it was entered into JMP, with the data able to be visualized in a number of different ways, including the contour plots displayed in Figures 4a, 4b, and 4c. A simulation was run to generate tens of thousands of virtual results, which helps build an optimized model and also allows comparison against real-world experimental data to build a confidence level in the model. This is shown in Figure 5. The contour plots highlight the impact of a higher loading density, which increases viral monomer content, shown in Figure 4a in brown. However, upon analyzing capsid ELISA recovery, which is highlighted in Figure 4b, it showed that a lower loading density, also shown in brown here, returns the highest capsid yield. This was challenged by the MADLS data shown in Figure 4c, which provided the evidence to support the impact of a higher load density in increasing viral monomer content, shown as the brown, red, and orange here in the bottom right corner. An important consideration is that MADLS does not measure aggregates, thus confirming the DLS findings.
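To unpack that run count under the face-centered layout described above: a three-factor CCD has 2^3 = 8 factorial corner runs plus 2·3 = 6 axial (on-face) runs, and adding the triplicate center point gives 8 + 6 + 3 = 17 conditions.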
pH and residence time did not have a significant impact on capsid titre, MADLS, or DLS results, so those contour plots are not included in this poster. Now, we can move on to the interpretation of the findings. Of the three loading conditions that were analyzed, loading density had the most significant impact on capsid recovery and the monomeric nature of the AAV, which is visualized by the steep gradient of the line in the left graph of the prediction profiler shown in Figure 6. Residence time, shown in the central graph, did not have any notable impact on either monomeric virus or capsid ELISA recovery, which is indicated by the near-horizontal line of the prediction profiler. Loading pH, shown in the right graph, also did not significantly impact the capsid recovery or monomeric virus, which also has a near-horizontal line. The optimum condition suggested by the DoE is a higher loading density with a shorter residence time and loading at pH condition A. This is an ideal outcome of the model, as with a higher loading density, less capture chromatography media is used to achieve the same target loading density. A shorter residence time also results in a faster process, which streamlines the AAV loading stage. Loading at pH condition A is also ideal as it does not require pH adjustment of the load material, thus saving time during load material preparation. Using these optimized loading conditions determined by the DoE, a confirmation run can then be performed to verify the DoE output. Once confirmed, these optimum conditions can then be used in a scaled-up run, which is specifically designed to achieve a high-quality, high-purity, and low-impurity AAV product. This approach led to a rapid, low-resource-demand, and low-cost assessment to optimize the capture chromatography step for our Pharmaron platform process. In total, the optimization experiments took place over only three days. This would have taken many, many weeks to complete if it were at lab scale, along with using much more AAV material and incurring the much higher costs associated with increasing the scale, as only one condition can be run at a time. The use of JMP software in tandem with our HTPD capabilities allows us to perform robust optimization of our platform process to suit specific gene therapy product needs. This greatly speeds up the feasibility and optimization stages, which are often the most time-consuming phases of the process development pathway. Ultimately, this drastically reduces the time taken from product development to the successful delivery of a gene therapy product to a patient in the clinic. I would like to end by thanking you for listening to my poster presentation, and I hope you have learned a thing or two about what we can do at Pharmaron, plus the importance of JMP in our product development pathway. Thank you very much.
An experimental design was created to study the formation of an unwanted byproduct in an esterification reaction. Four mixture component factors plus one process-based factor were used to generate a 26-run space-filling experimental matrix, specifically for analysis using Self-Validated Ensemble Modelling (SVEM). This approach was selected over a traditional mixture design intended for a polynomial Scheffe model. The resulting predictive model was an excellent fit to the data, clearly identifying the impact of each factor on the level of byproduct formed. This information was used to accelerate the development of a kinetic model and scale up the process.   Hello, I'm Andrew Fish. I'm a Senior Principal Researcher at Johnson Matthey. I'm a chemist by background, and I've been working with JMP since 2016. My poster today is a mixture-process experimental design and SVEM analysis for an esterification reaction. As a way of introduction, within the catalyst technologies business of Johnson Matthey, we sell catalysts and flowsheets for a range of different technologies, and that lends itself quite well to design of experiments and to advanced data analytics. We develop catalyst formulations, and we optimize process conditions for chemical reactions. We work at small scales in the laboratory, and we translate that to commercial-sized reactors. The reaction which we're going to focus on for this poster is an application of the Fischer esterification reaction. I don't want to dwell on the chemistry because that's not the focus of this poster. But just by way of background, what we have is a reaction where we are taking an acid and reacting it with an alcohol in the presence of an acid catalyst to form an ester product and water. This is a reversible reaction, so it can go both ways. If we don't get the conditions right, we're not going to end up maximizing the amount of our ester product. The added complication we have here is that we have a side reaction. This is reaction 2, where the alcohol can react with the same acid catalyst to produce an ether (both reactions are written out generically after this paragraph). That reaction is an irreversible reaction, which means that if the ether is formed, we can't get our alcohol back. We've consumed our reactants, and it means the reaction just isn't as efficient as it can be. What we're trying to do in this process is minimize the amount of this ether byproduct and try to maximize the amount of the ester that's formed under these conditions. To do that, we used a design of experiments. Normally, we would tend to use a generic mixture design with one process factor; we're going to take a slightly novel approach of using a space-filling design here. To summarize the factors, we have our continuous factor, which is temperature. We then have four traditional mixture components, which are the amount of alcohol, the amount of acid catalyst, the amount of ester, and the amount of water. The final component is the original acid, but we're going to fix that value at 25% of the mixture, which means the remaining four mixture components have to sum to a total of 75%. Then what we're looking for is the amount of ether that we're producing, and we're going to try to minimize that. This is a homogeneous equilibrium reaction, so all of these components are going to be present at the same time. We're going to heat it up to temperature, and then we're going to measure after 30 minutes the amount of ether that's formed.
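Written generically (the actual acid, alcohol, and catalyst are not named in the poster), the two competing reactions are: R–COOH + R'–OH ⇌ R–COO–R' + H2O (esterification, reversible), and 2 R'–OH → R'–O–R' + H2O (ether formation, irreversible).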
In a normal, traditional setup, and I'm just going to come out of JMP here, we would have a normal mixture design, which is one I've prepared before; I'll just reload this. In a traditional design, we would introduce the temperature as a continuous factor. We would introduce our four mixture components as mixture components. Then we would use a Scheffe cubic type model, so we'd add all the terms in to make that occur. That would suggest that we need around about 40 experimental runs to be able to create a data set large enough to analyze. If we look at how that looks in a traditional ternary plot, we can see here the factor combinations of the different mixture factors and the experiments we would run over those 40 runs, and you can see it's exploring the space quite well. The issue that we have, potentially, is that with the temperature we're only really looking at high and low temperatures, and we're still at the extremes of the mixture factor settings. What I'm going to do now, in the way we design this experiment, is use a space-filling design instead, and we're going to use SVEM to analyze the results of that. To build the space-filling design, I'm going to go into DoE and Special Purpose and Space Filling Design. I'm going to load in my factors, which I've prepared earlier. I'm also going to load in my response. Okay, so I've got my ether as a response, which I'm trying to minimize. I've got temperature as a continuous factor. The difference here is that instead of introducing these four mixture components as mixture factors, I'm going to leave them as continuous. I'm still going to specify the range. What I'm going to do this time is specify some constraints, so I'm going to load in my constraints. What that says is that the sum total of the four mixture components must be less than 75%, or 0.75 as a mole fraction. I'm going to put in the equivalent negative constraint as well, just to make the maths work in the background; the two inequalities are written out after this paragraph. When that's done, I can then… I'm not restricted by a polynomial model in terms of the design space and how that's followed through. I can specify how many runs I want. In this case, I've only got enough time to do 25 runs, so I'm going to select 25 runs, and then we're going to make the design. We have a 25-run design here to a lot of decimal places, which isn't going to be achievable exactly, but we're going to target these values in our experiments. If we look at the results of this versus what we did before in the ternary plot, we can see it's very similar. We're covering a lot of the space, but instead of being at the edges, there's a lot more in the middle. Again, in our multivariate plot, we can see that for temperature we're now covering a lot more of the middle of the space rather than the edges, versus the original traditional mixture, Scheffe cubic type design, and we're doing this in 15 fewer runs as well. The aim of this, really, is that once we've collected this data, we can then apply a machine learning type neural network algorithm to it instead of a traditional polynomial model, and hopefully increase the resolution and the understanding that we get out of this system. I'll head back to the poster. As I've already said, there are going to be 25 experiments. We actually made this 26, as we included a repeat. The red dot that you now see in these plots is the repeat test, and that was just to ensure that there was good reproducibility in our measurement of the ether.
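Spelling out the constraint pair mentioned above, with the mole fractions of the four free components written generically as x_alcohol, x_catalyst, x_ester, and x_water: x_alcohol + x_catalyst + x_ester + x_water ≤ 0.75, and −x_alcohol − x_catalyst − x_ester − x_water ≤ −0.75. Together, the two inequalities force the sum to equal 0.75, which is how the fixed 25% acid is respected without declaring the factors as formal mixture components.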
As I said before, the mixture factors have been treated as continuous, and the mixture sum is 75%. We're carrying out these experiments in mini autoclaves. We're going to get them up to temperature, leave them for 30 minutes, and then use an analytical technique called gas chromatography to measure the amount of ether after those 30 minutes, and that's going to be our response for these 26 experiments. We carried on and did those experiments; it didn't take too long. We then did some slight modifications to the data set. Up to this point, I've talked about these factors in terms of percentages. It's easier to work with later on if we transform this to a mole fraction. Essentially, I've just divided the percentage by 100, so we get a number that sums to 0.75 instead of to 75%. The second problem, and I'll come out of JMP again here and go to my results, is that these are our components, and in some cases the sum total adds up to more than 0.75, or more than one if we add the fixed concentration of the acid. The reason for that is that these factors have been measured as part of the experiment; we can't fully achieve what we wanted to achieve. What I've done in this case is a bit of a manual tweak. I've taken the largest component in the mixture, which happens to be the ester, and I've just adjusted that, as you can see here. I've adjusted it by 0.01 just so that everything sums to one, and that just helps the maths work in the background. For the 26 experiments in the data set, that wasn't a big issue to do. With those small adjustments made, we also confirmed that the repeat run gave us a result within experimental error in terms of the ether concentration. All good to go and progress. We then started looking at the modeling of the data set. As I've mentioned before, the reason we use a space-filling design over a traditional Scheffe cubic polynomial one is that it potentially needs fewer runs, and we can apply neural networks to it and hopefully increase the resolution, not being restricted by quadratic or cubic terms, which are very limited functions. Now, these neural networks are generally applied to really large data sets; you need a lot of data. You can't really apply them to small DoE-type data sets. That's because every run in a DoE is important. You can't afford to discount certain runs as part of a validation or a test set, because every run in the DoE counts and you would violate the design structure. Whereas you can if you use SVEM, and I don't have time, unfortunately, to go into the background of SVEM; I'm just going to show how you apply it. But a little bit of background: SVEM stands for self-validated ensemble modeling. The self-validation part: normally, when you're fitting a neural network, you would divide your data into a training set, a validation set, and a test set, so you partition your data. Within self-validation, what we're going to do is replicate the data set, and we're going to use something called paired fractionally weighted bootstrapping with a gamma distribution. Effectively, you get anti-correlated pairs of weights for each data point, one with a high weight, one with a low weight; one is used in the training set, one is used in the validation set, and you build your neural network using that. Then, where the ensemble modeling comes in, is that you build lots of those models using different weightings from the gamma distribution.
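As a purely illustrative sketch of the fractional weighting idea just described, and not the Predictum add-in's actual implementation, one common formulation draws a uniform number u for each row and uses -ln(u) as the training weight and -ln(1 - u) as the validation weight; both are gamma-distributed (shape 1) and anti-correlated, as described above. In JSL, with hypothetical column names:

// Illustrative only: one way to generate anti-correlated fractional weight pairs
Names Default To Here( 1 );
dt = Current Data Table();
dt << New Column( "u", Numeric, Continuous, Set Each Value( Random Uniform() ) );
dt << New Column( "Training Weight", Numeric, Continuous, Formula( -Log( :u ) ) );
dt << New Column( "Validation Weight", Numeric, Continuous, Formula( -Log( 1 - :u ) ) );
// A neural model would then be fit with the training weights as a weight column,
// assessed against the validation weights, and the whole procedure repeated
// (e.g., 50 times) before averaging the saved prediction formulas.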
Then you average the final models. That's effectively how SVEM works. In theory, it gives you a nicer model with higher resolution than a traditional polynomial fit. It can be really applicable for mixture designs in particular. That's what we did. This is our resulting SVEM model. It was a neural network algorithm; I used 50 models, which were bootstrapped, and averaged those. It was quite a simple neural network: there was only one layer with three hyperbolic tangent functions. Again, I'm going to come out of the presentation and just go into JMP and show you how this works. This was all done, the SVEM, via an add-in made by a company called Predictum. This is a licensed add-in. We're going to build a neural network. What I'm going to do is select my factors, which are the temperature, the acid catalyst, the alcohol, the water, and the ester, which I've transformed with the adjustment of 0.01 to make everything sum up. I'm going to select my ether as my response. I'm going to click Run, and I'm going to end up with a dialog to launch the SVEM. I can select how many bootstrap models are going to be averaged; I'm just going to leave it at 50. I'm going to leave it fairly simple and have three tanh functions in the first layer. You can modify these as much as you like and bring in linear and Gaussian functions; I'm just going to leave it as it is. It's going to go ahead and run the SVEM, and this will be different every time you do it because the fractional bootstrapping will be different each time. We should get something similar to what I had before. Here is my actual-by-predicted plot, where this is the actual value of my ether response, and this is the predicted value from the SVEM. You can see I've got a really fantastic R² of 0.99 with quite a low error associated with it. I would then save my columns to the table, which I've already done. This is here just to show you what this formula looks like versus a standard formula. Effectively, the ether concentration, or response, is a combination of these tanh functions multiplied by coefficients, multiplied by our different mixture and process factors. This here is one model, this is the second model, this is the third model, and so on, down to a total of 50 models, and then all of those models are averaged, and that gives us our prediction formula. That's effectively how SVEM works, and it gives us that really nice prediction. Now I'll go back to the poster. By way of comparison, even though it wasn't particularly designed for it, what I did was build a least squares regression model using the Scheffe cubic terms, and I used stepwise parameter selection to narrow down the terms in the model. I built that model and compared it directly against the SVEM model. Here you can see the model comparison: the SVEM predictions are in red, and the least squares regression predictions are in blue. There is a higher R² value for the SVEM model, with a much lower error associated with it as well. Also, on the residuals plot, again with the same colors, the least squares regression model has much higher residuals for certain data points. So SVEM produces a better model from what is effectively a 25-run data set; normally it's very, very difficult to apply neural networks to a DoE. I'm just going to go back and look at how that impacts the overall purpose of this work, which was to minimize the ether formation, and we can do that by looking at the prediction profiler. I just exported that prediction formula into the prediction profiler.
Again, I'm comparing the SVEM model versus the least squares regression model, with SVEM colored in red at the top. What you can see here is that there are differences between the two. We have much higher resolution with the SVEM model, whereas the least squares regression is limited to polynomial curves. Essentially, for a low amount of ether byproduct, we need a low-temperature reaction, a low level of acid catalyst in the mix, mid-range alcohol, a high level of water, and a low level of ester. There are some surprising results there for us, but we're fairly confident, after looking at that actual-by-predicted plot, that we've modeled the system quite well, and you can see differences between the two models. For example, here the acid catalyst hasn't been picked out as an important factor in the least squares regression model, and the direction of the trend versus ester is completely different. That explains the differences in the R² value and the predictions of the least squares regression. You can also see some nice features here, for example on the alcohol, where we've got a dip in the middle, versus a standard polynomial, which doesn't really pick that up as much in the least squares regression. To summarize, instead of a traditional mixture-type design, we've used a space-filling design of experiments. We've treated the mixture components, or the mixture factors, as continuous factors with constraints built in, and that's how we've accounted for the sum total of the mixture. We applied SVEM as a modeling technique to maximize the information we've got from a very small data set. We've increased the resolution of that prediction versus a traditional polynomial type model, and that's really helped us to understand the conditions to minimize ether formation in this chemical reaction. It's also accelerated the time for us to start building a kinetic model for this whole system. In terms of future work, we do also have time series data from these experiments. Instead of just having the data point at 30 minutes, we also have data points at 5, 10, 15, and 20 minutes. We can do a bit more processing and try to integrate that to calculate actual chemical rates and build a more developed kinetic model. That's where our attention is going to focus in the future. Also, many thanks to the scientists and engineers at Johnson Matthey on the technology team who contributed to this work, and finally to Predictum, who provided training in the use of SVEM and also the licensed use of the add-in. Thank you for listening to this poster presentation.
My research aims to enhance the efficiency of early-stage process development with mammalian cells in the biopharmaceutical industry by applying an intensified design of experiment (iDoE) approach. Unlike classical design of experiment, iDoE involves intra-experimental variations of critical process parameters (CPPs). This approach not only increases data-generation efficiency but also enables the consideration of temporal process dynamics through stage-wise optimization. However, a potential limitation is that previous CPP settings may (irreversibly) impact the cells and affect their response behavior to subsequent CPP set points. To address this issue, my research focuses on developing guidelines for planning and evaluating iDoEs robustly, considering the impact of process history. The focus of the presented simulation study is to investigate the impact that different effect sizes of interaction terms associated with the process history have on our regression models. Subsequently, the beta estimates and variance components of these models are compared to evaluate the impact of not explicitly considering the process history. This research has the potential to significantly impact the biopharmaceutical industry by innovating the way process optimization in early-stage development is performed, considering the dynamic nature of these processes.   Hello and welcome everyone to my short talk on the simulation study that I performed using JMP, where I simulated the impact of the process history on our iDoE regression models. Let me start with a really brief introduction. We are working with bioprocesses, and those are usually performed within bioreactors. Within these bioreactors, cells are grown, and these cells then produce our final product, which is usually an antibody. These processes depend on many different process parameters, such as temperature, dissolved oxygen, and many others. Usually, we would use design of experiment to identify ideal process parameters to maximize our response, for example the yield of our antibody or the viable cell density that we obtain within our process. These processes can be divided into different classical stages that are typical for such a cell growth process, namely the growth, the transition, and the production phase. Here on this next slide, I show two different bioreactor growth curves, the green one and the blue one. As you can see, the blue bioreactor, in this case, has three times the titer of the green bioreactor. How do they differ? They differ, as you can see now, based on the temperature profile that has been executed in the respective processes. As you can see, the blue growth curve is a lot higher; the viable cell density is a lot higher than in the green process, which leads to this increased titer. This just highlights that by understanding the process dynamics, we can potentially vastly improve the process performance of our bioprocesses. This is exactly the idea behind intensified design of experiment. In intensified design of experiment, we would divide our process into separate stages, here stage one, stage two, and stage three, and perform intra-experimental parameter shifts over time.
We would change the process parameters from the growth to the transition to the production phase, and thereby we are able to optimize the parameter settings within each of these separate stages instead of having one set of parameters that we would use for the whole process, so we can optimize these process dynamics. A certain challenge in this approach, though, is how to properly consider the effect of the process history on our regression models. For example, how do the process parameters of stage 1 influence the response behavior of our cells to the process parameter shifts in stage 2 or in stage 3? Exactly this is the aim of this simulation study: to investigate the impact of process history on our regression models. How did I simulate this? I simulated the data for an intensified design of experiment with two process parameters, in this case temperature and dissolved oxygen, and three stages, stage 1, stage 2, and stage 3. The effect of each process parameter within each stage is described explicitly, as you can see here: temperature 1 and dissolved oxygen 1 for the first stage, then for the second stage, and for the third stage. All of these process parameters are modeled as hard-to-change factors, and at the same time we are also investigating the culture duration as an easy-to-change factor, which is the temporal component. We also have the bioreactor, which is modeled as a whole plot to accommodate random offsets between the bioreactors. We model this as a linear mixed model, as sketched after this paragraph. Where can we find the process history within this? The process history can be found within these across-stage interactions, meaning: what is the impact of the process parameter settings of the first stage on the response behavior of our cells to the process parameter changes in the second or the third stage, here highlighted for the temperature in stage 1. These across-stage interactions can be used to get an idea of the influence of the process history. This brings me to the setup of my simulation. I used the Simulate Responses platform in JMP. I used exactly the design that I just showed to you on these last slides to create a model and to get an idea of what the beta coefficients in this model could potentially look like. I used knowledge from historic regression models to create this base model. Within this base model, I explicitly simulated the across-stage interactions, which I used to model the impact of the process history. These across-stage interactions are what gives us an idea of the impact of the process history on the regression models. To also get a visual idea of what the data looks like, I have plotted here the viable cell density over the culture duration for this simulated base model, colored by the effect of temperature in stage 1. But since we are interested in the impact of process history, I derived variations of this base model, creating alternative models where I varied the magnitude of these across-stage interactions. You can see here that, in this case, I doubled the coefficients of the across-stage interactions, then I halved them, and I quartered them to simulate different scenarios with a different impact of the process history. As you can see in those plots, these also result in perfectly valid-looking viable cell density curves. I used these four different models that I just showed to you as the fixed effects in my linear mixed models. I also introduced a random whole plot error and a random residual error. What I did then was to fit these models.
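As a rough sketch of what such a linear mixed model launch can look like in JSL, with hypothetical column names and only a few example across-stage interaction terms shown (not the exact model from the study):

// Hedged sketch: REML fit with the bioreactor as a random whole plot effect
Names Default To Here( 1 );
dt = Current Data Table();
dt << Fit Model(
	Y( :VCD ),
	Effects(
		:T1, :DO1, :T2, :DO2, :T3, :DO3, :Duration,
		:T1 * :T2, :T1 * :T3, :DO1 * :T2   // example across-stage interactions
	),
	Random Effects( :Bioreactor ),
	Personality( "Standard Least Squares" ),
	Method( "REML" ),
	Run
);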
First, the full model: here you can see all the eligible terms that I had in my model, all the terms that I simulated, including these across-stage interactions, which I use as a proxy for process history. The second model that I fit was a model that did not contain these across-stage interactions, so it had no way of accommodating the process history. Those two models I then compared based on their beta estimates, that is, how well they are able to estimate the beta coefficients which I knew from my simulation, as well as on the variance components. I repeated this fitting 1,000 times and afterwards looked at the respective distributions. This is what I would like to show to you next in the results section. To give you an idea of what the data actually looks like, here, for one beta, I visualized the beta estimate distribution for the full model with across-stage interactions, considering the process history, and for the reduced model without across-stage interactions, not considering the process history. As you can see, for big across-stage interactions, here times two, we have a rather big difference between the means of the beta estimate distributions. Whereas for really small across-stage interaction effects, we have a really small difference between the mean estimates of both beta estimate distributions from these 1,000 fittings that I performed. If the difference between the mean estimates of both beta distributions is zero, it means that both models' mean estimates are the same. On the next slide, I calculated the mean difference between the beta estimate of the full model, considering the process history, and the reduced model without across-stage interactions, which doesn't consider the process history. As we can see, for big effect sizes of these across-stage interactions, there's quite a big difference in the mean beta estimates between the full model and the model without across-stage interactions. Whereas for small effect sizes of these across-stage interactions, there's a really marginal difference. What I would like to show to you on the next slide are the variance components. Again, the comparison starts with the full model, considering process history, considering these across-stage interactions, and we can see that the variance components are estimated correctly here at 0.25, independent of the effect size of these across-stage interactions. Whereas for the reduced model without across-stage interactions, we can see that we have a very inflated whole-plot variance compared to the residual variance. The smaller the effect size of these across-stage interactions, the closer these variance components get to the actual simulated values of the variance components. This inflated whole-plot variance is probably due to the model compensating for the missing interaction terms by inflating the whole-plot variance term. This brings me to the conclusion, so to the final part of my presentation. What we saw was that for very small across-stage interactions, meaning a small effect of process history, the model without across-stage interactions behaves very much like the full model, which accommodates the process history, meaning that these across-stage interactions could potentially be neglected in modeling. Whereas when we have big across-stage interactions, or a high impact of process history, we can see quite a big impact on the beta estimates, visible as the difference between the mean beta estimate of the full model and that of the model without across-stage interactions.
This is partly compensated in the model without across-stage interactions by an inflated whole-plot variance. To give a really brief outlook, we would like to generalize these results, perhaps by quantifying the magnitude of across-stage interactions relative to non-across-stage interactions, to get a way to compare them to actual experimental results that we will hopefully obtain in the future. Furthermore, we would like to extend this benchmarking to further modeling approaches such as stage-wise regression models, as well as using the Functional Data Explorer for analyzing iDoE data. That's the end of my presentation. I want to thank you all very much for listening, and I want to especially thank Verena and Beate for the supervision and the guidance during my PhD project.
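For readers who want to replay the comparison described in this talk, the sketch below is a simplified Python stand-in (using statsmodels rather than the presenter's JMP/JSL setup): it simulates a small iDoE with one across-stage interaction and a random bioreactor offset, then repeatedly fits a full and a reduced mixed model and compares the beta estimates and whole-plot variance. All factor settings, coefficients, and variance values are invented for illustration and do not reproduce the study's results.

# Minimal sketch: full vs. reduced linear mixed model on simulated iDoE-like data
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n_reactors, n_times = 12, 5
T1 = rng.choice([-1.0, 0.0, 1.0], n_reactors)   # stage-1 setting (hard to change)
T2 = rng.choice([-1.0, 0.0, 1.0], n_reactors)   # stage-2 setting (hard to change)

def simulate(beta_hist):
    rows = []
    for r in range(n_reactors):
        wp = rng.normal(0, 0.5)                  # random bioreactor (whole-plot) offset
        for t in np.linspace(0, 1, n_times):     # culture duration (easy to change)
            y = (10 + 2 * T1[r] + 1.5 * T2[r] + 3 * t
                 + beta_hist * T1[r] * T2[r]     # across-stage interaction = process history
                 + wp + rng.normal(0, 0.5))      # residual error
            rows.append(dict(reactor=r, T1=T1[r], T2=T2[r], time=t, y=y))
    return pd.DataFrame(rows)

full_b, red_b, full_wp, red_wp = [], [], [], []
for _ in range(100):                             # the talk repeated the fitting 1,000 times
    df = simulate(beta_hist=2.0)                 # a "big" process-history effect
    full = smf.mixedlm("y ~ T1 + T2 + time + T1:T2", df, groups=df["reactor"]).fit()
    red  = smf.mixedlm("y ~ T1 + T2 + time", df, groups=df["reactor"]).fit()
    full_b.append(full.params["T1"]); red_b.append(red.params["T1"])
    full_wp.append(full.cov_re.values[0, 0]); red_wp.append(red.cov_re.values[0, 0])

print("mean beta(T1)       full vs reduced:", np.mean(full_b), np.mean(red_b))
print("whole-plot variance full vs reduced:", np.mean(full_wp), np.mean(red_wp))

Comparing the whole-plot variance of the two fits mirrors the talk's observation that omitting the across-stage interactions (which are constant within a bioreactor) tends to push their contribution into the whole-plot variance term.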
Thursday, March 7, 2024
Ballroom Ped 7
Aerospace-grade formulations are often composed of several ingredients whose ratios and interactions will impact one or more properties of the final component. Theory and experience can help with the design of these formulations, but sometimes there are interactions or synergies that have not been discovered yet. Therefore, it can be useful to explore a wide experimental space to discover the unexpected. In this presentation, I share the results and insights obtained after running a mixture design, including how to visualize, normalize, and analyse the data. I also discuss ternary plots, how to communicate technical information to a nontechnical audience, the challenges encountered, and what could have been done better.   Hello, my name is Carlo Campanelli, and this poster is about the use of a mixture design to optimize an aerospace formulation. Resin systems for aerospace applications typically comprise multiple components that need to be balanced to simultaneously meet thermomechanical, safety, and regulatory requirements. The testing process can be long and expensive, with multiple sources of variability. Furthermore, there are often time constraints which limit the number of formulations, tests, conditions, samples, and repeats that can be done. The objective of this work is to improve specific material properties while keeping all other product characteristics and performance unaltered. This is often challenging as it can be a zero-sum game, and generally it is a matter of finding the best compromise rather than finding the best formulation. The best compromise can change depending on the specific customer or the specific application. The approach for this work is to use a mixture design with 3 variables, 15 runs, and 1 repeat. Here on the top right, we can see a picture showing the experimental space of the mixture design containing the 15 formulations tested. The components, or variables, are X1, X2, and X3, and their sum is always 100%. Here on the right, we can see an image showing how to read the ternary plot and how the sum of the three components X1, X2, and X3 is indeed 100%. For this work, I've tested several properties, but I've reported three of them in the form of color-coded ternary plots, where green and red mean good and bad values, respectively. Starting from property one, we can see how the formulations at the bottom of the ternary plot have a good value, but this property tends to decrease by going up in the ternary plot. Going up means that we are increasing the amount of the X1 component, while going from left to right, so changing the amounts of the X2 and X3 components, doesn't have an impact on this specific property. This is an example of a property that is dominated mainly by one component, X1 in this case. Now, looking at the second property, we can see a similar but opposite trend. By going up, this property gets better: by increasing the amount of X1, we're improving this specific property. But in this case, going from left to right, we have a decrease in value. For this specific property, we can see that it is influenced mainly by X1 and X3. This causes us to have three sections where we have good properties, average properties, and bad properties. Regarding the third property, according to theories extrapolated from the literature and direct technical experience, property three should be better for formulations at the bottom of the ternary plot. This is generally true, as we can see here, but there is an exception.
It is this top-right formulation, which is a little bit of an outlier. This highlights the importance of screening a wide space and not having a bias toward certain parameters, because if we were to follow the theory and only test or analyze the bottom formulations, we would likely have missed this top-right formulation. Finally, the Prediction Profiler is useful to compare multiple properties and trends at the same time, to use the desirability functions to maximize the most desired properties, and to visualize the confidence intervals. In this case, I've plotted seven properties against the three variables, and we can observe all the trends at the same time. We can also see that for some properties the confidence intervals are very narrow, while for some other properties they are very wide. We should be careful when we draw conclusions from this data. Now, about the learnings and future work. This mixture design has highlighted several trends and some unexpected results. This will help with the optimization and tailoring of several products. It would have been beneficial to have more repeats, as some properties have quite wide confidence intervals, and it is important to identify and understand the main sources of variability. A more in-depth study of the data generated is needed to find the correlations between the measured properties, explain outliers, and make connections with previous studies to confirm or disprove theories and see the bigger picture. In fact, I believe that it is important to make full use of the work and the data that we have generated. This is everything for this work. Thank you for your attention.
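As a companion to the ternary-plot discussion above, here is a small, self-contained Python sketch of the kind of model that typically sits behind such plots: a Scheffe quadratic mixture model fitted without an intercept, used to predict a property over candidate blends. The data, coefficients, and component behavior are entirely made up; the real study's properties and formulations are not reproduced here.

# Minimal sketch: Scheffe quadratic mixture model for a three-component blend (X1 + X2 + X3 = 1)
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
# hypothetical mixture runs (each row sums to 1) and a measured property
X = rng.dirichlet(alpha=[2, 2, 2], size=15)
y = 60 * X[:, 0] + 40 * X[:, 1] + 30 * X[:, 2] + 25 * X[:, 0] * X[:, 2] + rng.normal(0, 1, 15)

def scheffe_terms(X):
    # pure-component terms plus pairwise blending terms; no intercept in a mixture model
    x1, x2, x3 = X[:, 0], X[:, 1], X[:, 2]
    return np.column_stack([x1, x2, x3, x1 * x2, x1 * x3, x2 * x3])

model = LinearRegression(fit_intercept=False).fit(scheffe_terms(X), y)
print(dict(zip(["b1", "b2", "b3", "b12", "b13", "b23"], model.coef_.round(2))))

# predict the property over many candidate blends to locate a sweet spot in the simplex
grid = rng.dirichlet(alpha=[1, 1, 1], size=2000)
pred = model.predict(scheffe_terms(grid))
print("best blend:", grid[pred.argmax()].round(3), "predicted property:", pred.max().round(1))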
Design of experiments (DOE) has always had an intrinsic contribution toward sustainability. Simply by minimising the number of experiments to reach the target desired, significant savings in resources can be obtained. However, it is not only about using DOE, but also combining it with the Prediction and Contour Profilers. These profilers enable scientists and engineers to reach optimal products and processes, generating some “secondary” contribution to sustainability. For example, by making a process more efficient, it's possible to generate less waste and/or spend less energy. Achieving a better or more efficient product could save resources in the application of that product or enhance its lifetime by again generating less waste. In this paper, we show how at Johnson Matthey, a global leader in sustainable technologies, we consider sustainability, not only in relation to the application of our products but from the very beginning at their formulation. We explain how we use DOE and the profilers to enable us to formulate taking not only performance into account but also trying to minimise the footprint of the formulation itself, and therefore including sustainability in multiple aspects of the product.     Okay. Hello, everyone. Today, I'm going to go through our contribution to the Discovery Summit. The title is Utilizing Design of Experiments and the Prediction Profiler for Creating Sustainable Formulations. But first of all, who are we? My name is Pilar. I'm a research scientist at the speciality chemicals company Johnson Matthey, and I'm the lead of a group called the Statistical Thinking Team, whose aim is to extend the use of statistical tools around Johnson Matthey. Here I have my colleagues, Patricia and Jenny; they are going to introduce themselves. Hello, I'm Patricia Blanco. I also work at Johnson Matthey. I'm based at the research center. I am a chemist by background, and I've been working in the field of coatings and formulations for about 20 years. Hello, I'm Jenny. I work at the Technology Center, and I work on LCA within the sustainability team, helping scientists to make sustainable decisions during their research projects. For those that don't know Johnson Matthey, I'm going to introduce it a little bit. As you know, the world is facing some of its biggest challenges yet, and we need to change, to transition to net zero and ensure a more sustainable future. That's where Johnson Matthey comes in. Through our inspiring science and continued innovation, we are catalyzing the net zero transition for millions of people every day. We have a 206-year history and over 12,000 employees all around the world. Our technology is based on advanced metals, chemistry, catalysis, and process engineering. This expertise has been established over many years and underpins all our businesses. In JM, we are committed to sustainability, and Jenny is going to tell us all about it. One of our core values at Johnson Matthey is to protect the planet and the people. The way that we're looking at achieving this is by looking at the way that we develop our portfolio of technologies, at the operations we use to produce those technologies, and at how we can work with customers as well as suppliers to create a sustainable product within our portfolio. We're looking at protecting the climate. We want to protect nature and advance the circular economy, and we're looking at promoting a safe, diverse, and equitable society.
We're using ESG ratings to show that we're tracking how we are becoming a more sustainable company. Thank you. Today, we are going to talk about how we use Design of Experiments and sustainability. For us, Design of Experiments has an intrinsic contribution to sustainability, since just by reducing the number of experiments, you are saving resources, materials, and energy. Also, if you're applying Design of Experiments, for example, to make one of your processes more efficient, then you are producing less waste and using less energy. If you are applying your design of experiments to make a more efficient product, then you might be saving resources in the application of that product, or you might be increasing its lifetime and therefore decreasing the amount of waste. All these applications of DoE to sustainability are more directly related to the application of the products. In this presentation, we wanted to go further and talk about DoE and sustainability through the formulation of that product, going backwards to before that product even exists. Formulations in JM. Product formulation is one of JM's core capabilities, and it requires a multidisciplinary approach where chemistry, modeling, manufacturing, processing, and engineering need to work together to be able to deliver JM science in a form that is suited to the customer's needs. As I said in the introduction, I'm part of the Formulations team at JM's Technology Center in the UK. As a team, we provide support to all of the JM businesses in a range of projects that go from fundamental long-term understanding to fast critical support for product manufacturing. We also build and extend our science base by developing new techniques and skills in our labs, and also by collaborating with external experts. We use statistical tools to design our experiments and to get the most value out of our data. Most of JM's products are formulated, and I'm just showing a few pictures here as examples. Starting from the left, we have catalytic converters, which are coated with a layer that contains an active catalyst. The black picture in the middle is a fuel cell ink; both fuel cell and electrolyzer catalysts are deposited to produce coated layers from inks. Then powders are really important as well, because they are used to produce granulated or pelletized catalysts that are used in the production of chemicals. In all of these examples, it's crucial to understand the materials and how they interact with each other. Formulations are complex systems, and using JMP tools like Mixture Design of Experiments is important; they help us extract the most value and understanding when studying formulated products. In the demonstration that we are going to do, we have run a Mixture DOE on a formulation, looking at the performance of the formulation, which will be introduced into the Profiler in JMP. But we will also introduce a model related to sustainability, based on life cycle analysis concepts. Both of them are looked at together in a Profiler to obtain an optimized formulation that takes into account not only the performance, but also the sustainability of the formulation. If you haven't heard much about life cycle analysis before, Jenny will go through it. Life cycle analysis is a method we can use to quantify any direct or indirect environmental impacts that may be associated with a full production chain, so that's from Cradle-to-Gate or Cradle-to-Grave.
What LCA does is help to promote the design of sustainable products or processes, and identify any red flags that may appear, so we can do something about them sooner. This will lead to us having less environmental impact and fewer toxic materials released into the environment. LCA specifically focuses on three major targets: human health, ecosystem health, and resource availability. What we're doing at the moment is focusing on the carbon dioxide equivalent, so that we can look at the carbon footprint of products from Cradle-to-Gate with Pilar in the DOE. I'm going to leave the slides here and do a demonstration directly in JMP. We have a mixture of three components: a main component, a modifier, and a solvent. Let me open the ternary plot and see how it looks. It's this blue one. This is our ternary plot, with the main component, modifier, and solvent. Obviously, formulation systems are a lot more complex than that, but we have done this little example with just three components so they can easily be visualized in the ternary plot. As is very typical in formulations, the experimental space is very small compared to all the possible combinations, so we normally need to work within a very small space. But although it looks like a very small space, very small changes in the formulation within this space can cause very drastic changes in terms of performance. We designed a mixture design, in this case based on a space-filling design with constraints, as you can see, which gives us a good distribution of the data points around the experimental space. The data that we collected is a property related to the performance that you can see, and the data was modeled. We now have a model of how different formulations give us different performance properties. At the same time, we also have a model for the carbon footprint, the kilograms of CO₂ equivalent produced per kilogram of formulation, taking into account the carbon footprint of each individual ingredient. We are going to visualize each of these models in a ternary plot. If we start with the carbon footprint, for example, here I've zoomed in on the experimental space, shown in white. You can see, through the contour lines in red, how the carbon footprint increases going towards the left. If we want to minimize the carbon footprint of this formulation, of this product, we need to move towards the right, in this direction. If we look at the model of the performance property, you can see that the model is more complex, and the values move between 20 and 60. You can see the 20 and the 30, and we also have the areas for the 40 on the dotted lines. If we want a performance property of 40, then we need to work there. We also have the 50: if we want to work around 50, we need formulations that are around this area, and we also have a sweet spot of 60. Say we are looking at maximizing this property; then we need to choose formulations within these spaces, and we will be especially interested in this area for the formulations. If we look at it all together, we have the contour lines for the performance property in red and the carbon footprint in blue.
Seeing it all together allows us to choose formulations that sit as close as possible to the area of interest to maximize the performance property, while also trying to minimize the carbon footprint. We will probably choose a formulation on the right side of this area to be as close as possible to lower values of carbon footprint. This is obviously just visual, and the visualization gets more complicated when you have more components. But you can do it all through the Mixture Profiler, and you can also use other tools in JMP such as the Prediction Profiler or the Design Space Profiler to help you optimize the system, taking into account different outputs, as in this case the performance but also the carbon footprint, and find a solution that is a compromise between the two aspects we are looking into for the formulation of a product. In summary, it is highly important to provide scientists with tools to be able to consider sustainability in early-stage research; otherwise, these decisions are very difficult to make. JMP, with design of experiments and the profilers, has proved to be a very useful tool for scientists to optimize the performance, the sustainability, and many other aspects we might consider in a product or a formulation, such as cost, for example, in a simultaneous way to obtain the best overall solution. This is all from us. Thank you very much for your attention.
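To make the performance-versus-footprint trade-off concrete, the following Python sketch mimics the idea on a toy three-component mixture: a hypothetical fitted performance model, a carbon footprint computed as a linear blend of assumed per-ingredient kg CO2e values, and a simple search for the lowest-footprint blend that still meets a performance target. None of the numbers come from the Johnson Matthey study.

# Minimal sketch: trade off a performance model against a carbon-footprint model over a mixture
import numpy as np

rng = np.random.default_rng(3)
# kg CO2e per kg of each ingredient (assumed values for illustration)
footprint_per_kg = np.array([4.0, 9.0, 1.5])    # main component, modifier, solvent

def performance(x):
    # hypothetical fitted Scheffe-type performance model (coefficients assumed)
    x1, x2, x3 = x[:, 0], x[:, 1], x[:, 2]
    return 30 * x1 + 55 * x2 + 20 * x3 + 40 * x1 * x2 + 15 * x2 * x3

# candidate blends inside a narrow region of the simplex, as is typical in formulations
cand = rng.dirichlet([8, 3, 6], size=5000)
perf = performance(cand)
co2 = cand @ footprint_per_kg                   # linear blending of ingredient footprints

ok = perf >= 50                                 # performance target (assumed)
best = np.argmin(np.where(ok, co2, np.inf))     # lowest footprint among in-target blends
print("chosen blend:", cand[best].round(3),
      "performance:", perf[best].round(1),
      "kg CO2e/kg:", co2[best].round(2))

This is the same logic the Mixture or Prediction Profiler supports interactively: overlay the two responses and pick the compromise point, rather than optimizing performance alone.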
In this presentation, we introduce a novel design of experiments (DOE) methodology developed in-house. Named "Dynamic DOE," it is specifically tailored for time-dependent DOEs in chemical development using kinetic reaction data. The development of this innovative approach addresses the challenges faced in traditional DOE methods when dealing with time-sensitive chemical processes. We present benchmark data comparing different DOE designs and their performance in combination with various regression techniques. This comprehensive analysis demonstrates the advantages of the Dynamic DOE methodology in terms of accuracy, efficiency, and adaptability. Furthermore, we showcase real-life application examples from late-stage chemical development at Boehringer Ingelheim. These case studies illustrate the successful implementation of the Dynamic DOE technique in combination with high-throughput automated lab reactors, highlighting its practical benefits and potential for widespread adoption in the industry. Join us to learn more about chemical development advancements through the Dynamic DOE methodology, an innovative technique that seeks to change the way we utilize time-dependent experiments in the field.   Welcome, everybody, to our JMP Discovery Summit talk with the title Time-Dependent Design of Experiments in Chemical Development. Before I start, some words about the company we're working for, Boehringer Ingelheim. We are a global pharma company, founded quite some time ago, in 1885, by the Boehringer family, and we are, to this day, family-owned, so not listed on the stock market. Our business focuses on three areas: first, the human pharma business, which is our biggest business, then animal health, and we also have a small biopharma business. We have more than 50,000 employees distributed over 16 sites worldwide. In 2022, we made a revenue of nearly €25 billion. With that, on to what Jonas and I are doing: we both work in chemical development, and we develop chemical processes for supplying clinical studies and, later on, the market. In more detail, that means we develop new chemical routes to synthesize our active pharmaceutical ingredients. We develop robust, scalable, and sustainable processes, first in the lab, and later on we scale them up to pilot plant scale and then to plant scale. Doing that, we supply drug substance for all clinical phases from one to three, and we generate all data necessary for later market submission. As we always try to accelerate timelines to meet patient needs faster, we actively develop new technical and digital solutions to speed up the development process. With that, let's come to our main topic, chemistry. Chemistry is more or less the science or art of combining or changing molecular fragments to build up bigger and more complex molecular fragments, which then finally become our active ingredient. To do so, we have chemical processes which consist of a large number of parameters we can change, for example, reagents, solvents, the stoichiometry of all those things, and then physical parameters like temperature, time, et cetera. As you can see, it is a multifactorial problem with quite a large range of target variables we want to influence. Most prominent is always the process yield: we want to get as much as possible out of our processes. At the same time, we want good purity, so we want to decrease any impurities that might be formed. Furthermore, we are highly interested in reducing the environmental impact of our processes.
So we want to use less solvent and fewer reagents, we want to utilize green solvents and reagents, et cetera. And last but not least, we have to develop robust processes, meaning we have to make sure that we know in which range we can vary our parameters without affecting product quality, and we want to be able to set up a design space. Statistically speaking, this can be considered a multifactorial, multi-response black-box optimization problem, which we want to optimize as efficiently as possible while making sure we find our global optimum. Obviously, a perfect use case for DOE. We're actively using DOE in multiple settings. In early phases, when it comes to screening, we use screening DOEs to efficiently find, within a large range of parameters, our influencing parameters and a possible sweet spot. Later on, we go to optimization DOEs, Response Surface Designs, to really hit the sweet spot and the global optimum. Later still, when it comes towards market submission, we characterize our processes using, again, quite efficient DOEs to screen all parameters with respect to a possible influence on drug substance quality. DOE is a great methodology. We really like it, but it has some challenges we came across. First of all, design robustness, especially when it comes to screening DOEs. What we quite often see is that the high/low settings for influencing parameters are not set correctly by our experts in the labs, meaning low was set too low, or high was set too high, so at those settings our reactions fail completely, or we are operating in an unstable region where we don't get a stable response. Secondly, DOEs are highly efficient; however, if you want to optimize a large range of parameters, you still tend to need quite a high number of experiments, and that's a problem especially for late development, as there we're working on quite a big scale in the labs, meaning we can't perform many reactions at once, and furthermore, reactions are quite costly on that scale. That's sometimes a problem. We wanted to tackle those problems, and we wanted to address them with multiple things. First of all, the problem of design robustness. If performing all experiments at either the low or the high setting means that 50% of your reactions might fail, the obvious solution is simply to distribute your experiments more evenly across the complete parameter range, so not only at low and high, but also in the middle of your design space. Maximizing efficiency: there we got inspiration from the literature. Chemical processes are governed by chemical rate laws, so physical differential equations, meaning there is a lot of information in the time course of a reaction, and the idea was to incorporate this time information better into our DOEs by sampling multiple time points for one reaction, thereby utilizing the full information of each reaction and getting the most out of every reaction. Last but not least, automation. DOEs are highly repetitive and therefore the perfect candidate for a high degree of automation. We wanted to optimize the execution of DOEs in the lab as much as possible. With that, let's go to our project scope. In this project, we wanted to investigate: can we use different DOE types to overcome this problem of design robustness? Furthermore, are different DOE types more efficient when it comes to using this new kinetic data? Second, regression methodology.
Obviously, we wanted to incorporate what is for us a new kind of data, and we wanted to see whether some regression methodologies work better than others. Third, how much data is needed: can we reduce the number of experiments needed for conducting one DOE? Last but not least, obviously, how to bring this approach to our labs in an automated fashion. When we wanted to evaluate all those possible combinations, we quite early came across the problem of benchmark data. Obviously, we need data to try different combinations. If you just look at one DOE, and you want to consider nine different design types, 18 reactions each, with 10 time points per reaction, you get quite a high number of DOEs to perform, with an even higher number of reactions. We came to about 25,000 reactions. That's obviously not feasible for us; we can't conduct that many experiments. Therefore, we had to come up with a different approach. We went back to the fact that chemical reactions are governed by the underlying rate law, a differential equation with which you can describe the chemical process more or less exactly, and we used that. We took a chemical reaction, measured kinetics, and solved this rate law, so we had a more or less in silico representation of one specific reaction. Now, using this rate law, we were able to simulate the outcome of different DOEs. We just performed many different DOEs, screened different DOE types, and used the rate law to compute the outcome of the reaction in a time-dependent fashion. We calculated responses for multiple time points for each reaction. The outcome you can see here: we get nice kinetic profiles. They vary quite significantly, from reactions that perform well to reactions where nothing happens. This data was then transferred to a normal JMP table, and just to give you a feeling of what those tables look like: we just stacked all the experiments. Experiment one consists of three time points; our influencing parameters are the same, only time changes, and our response, in this example the yield, varies. With this approach, we were now able to pursue an in silico approach. Obviously, we had to use some experimental reactions as a basis. We selected two different reaction types, depicted here. The first reaction is a reagent-mediated coupling. What does that mean? We have two substrates, shown in green. Substrate A is activated by the reagent, and an activated complex is formed. This complex then reacts with our starting material two, forming the product. We also have the formation of an impurity, here depicted in gray. In this reaction, we investigated five different parameters: temperature, concentration, and stoichiometries, plus time. The second example is an enzymatic reaction, where, again, we couple two substrates. Substrate A is activated by the enzyme, forming an activated complex. Substrate two is activated by a chemical reagent, and we form product and impurity. All in all, we generated four different responses, two where the response is the product and two where the response is the yield: all in all, four different data sets. Then we set up a workflow in JSL. The workflow consisted of two parts. The first part was the selection of different DOEs, compiling those DOEs with different numbers of reactions. There we screened everything from six reactions per DOE to 24 reactions per DOE. We also investigated different numbers of time points sampled for each individual reaction.
There we tried everything from three samples per reaction to 12 samples per reaction. Those DOEs were then set up. We simulated the responses using our pre-fitted kinetic rate laws, and then we took all that data and regressed it with different regression techniques available in JMP Pro. There we not only tried different regression techniques, but we also added different levels of normal noise to our responses to check whether noise has an influence on the combination of regression methodology with DOE type. The obtained models were then validated using an external validation set consisting of 10,000 samples, and we calculated the RMSE for each model. Then, to be able to compare the RMSEs between our four different data sets, we normalized each RMSE to the RMSE of a mean-prediction model. Finally, we calculated the mean normalized RMSE over all four data sets. That's what we show in the next slides. First, we started by investigating which regression methodology works best with which type of DOE with respect to prediction performance. For this exercise, we always consider 18 reactions per DOE, with 12 samples per reaction and no noise. The results can be seen in this heat map. What can you see in this heat map? Here, the mean normalized RMSE is plotted, the mean of the normalized RMSE over our four different reactions. On the Y axis, you can see the different DOE types we have been investigating, and on the X axis, you can see the regression techniques. Let's first come to the DOE designs. In the bottom part, the more classical design types are shown, D-optimal and I-optimal. You can see a lot of gray or dark orange, meaning we don't get good performance from the final models there. However, coming to Space Filling Designs, where you have a more even distribution of your points across the complete parameter space, we see much better and much more consistent performance. Especially Fast Flexible, Uniform, and Latin Hypercube designs show a lot of green or light yellow. Coming to the regression techniques, we see more or less three different techniques that work quite nicely. First of all, Functional Data Analysis, and that's not really surprising, as this type of analysis has been specifically made for analyzing dynamic data; that was something we expected. Second, Boosted Neural Networks perform quite nicely in combination with IMSE Optimal designs. However, the consistent top performer across all four data sets is Gaussian Process regression. We investigated two different types, the fast version and the normal version. They perform significantly better than all other regression techniques. That quite surprised us, but on the other hand, it is quite nice, as GPs are fitted really easily in JMP Pro. You can see they work best with Fast Flexible, Latin Hypercube, or Uniform designs. We identified a combination of regression technique and design type which seems to work best, and we now wanted to analyze that in more detail. First, we wanted to know how many reactions are really needed to get predictive models, and what the influence of sampling multiple time points for each reaction is. Let's first cover how many reactions are needed. In this plot, you can again see the mean normalized RMSE versus, in this case, the number of experiments per DOE. What can be seen is not really surprising: the more reactions you perform, or the larger the DOEs are, the better the model performance will be.
However, what can be seen is that for GPs, shown in yellow, we see a plateau starting at around 20 reactions for those four data sets, and that's something we observe in reality as well. Normally, four times the number of parameters being investigated is a good indicator for the number of experiments you should perform to get well-performing models. For Functional Data Analysis, shown in gray, and Boosted Neural Networks, shown in green, we don't see any plateau, so here we should perform more experiments than plotted here. The next question was: does time point sampling really improve our model performance? If you look at the yellow line for Gaussian Process, the bold line shows the use case with 12 samples per reaction versus the dotted line where only three samples per reaction were taken, and you can't see any difference, meaning time point sampling doesn't improve the performance of our model, which was quite disappointing for us. However, going to Functional Data Analysis and Neural Networks, there you see a difference: there, 12 samples increased model performance. In the second step, we wanted to see if noise influences this analysis, as up to now we had always looked at noiseless data, which is not realistic for our lab use case, and we wanted to see, does that influence the results? And yes, it does. Now we see a clear difference between 12 samples per reaction, in bold, versus only three samples per reaction, as a dotted line. Meaning that by using more time point samples per reaction, we can boost the performance of our models, or dampen the influence of noise on the final performance of our model, which is really great news for us, as taking more samples per reaction doesn't cost anything. We get that more or less for free. That's an easy way to boost model performance without having to pay anything. The same holds again for Boosted Neural Networks and Functional Data Analysis. However, in all analyses, GPs were always much better than any other combination. Okay, so we could show in this first analysis that GPs in combination with Space Filling Designs seem to work well. However, the question was, do they really work better compared to conventional approaches? That's why we simulated those approaches as well. We set up two different approaches. First, No Sampling but Time Prediction, meaning a classic DOE approach, considering time as a normal influencing factor, meaning we only take one time point sample per reaction. The second approach would be No Sampling and Fixed Time Prediction, meaning setting up a DOE without considering time at all; we set up a DOE model for one specific time point during our reaction. Again, we simulated those approaches for all combinations of regression methodologies with design types, again, in this case, plotted for 18 reactions with 12 samples and no noise. What can you see? On the very left, we show our new Dynamic DOE approach, and yes, GPs in combination with Space Filling Designs work best. The second approach, time as a normal factor, so only one sample per reaction, works comparably well. Interestingly, GPs and Space Filling Designs work best here as well. Last but not least, the approach with no time prediction at all, a DOE model for one specific time point, works best. That's not really surprising. What is surprising is that here as well, GPs in combination with Space Filling Designs seem to work best, which is surprising as this is a standard DOE approach.
In the second step, again, we wanted to investigate the influence of noise, as that is what we ultimately have to deal with. What we observed here was quite nice: now our Dynamic DOE approach works best compared to all other approaches. Time as a normal factor doesn't perform at all; we don't get any predictive models that outperform a mean model. Even a DOE model specifically made for one time point is worse than our Dynamic DOE approach, clearly showing that using multiple time points for one reaction really improves model performance, especially when it comes to noisy data. With that, I'm at the end of my part, and I want to hand over to Jonas. Great. Thanks, Robert. Now, I would like to illustrate the implementation of this Dynamic DOE approach within chemical development at BI from a technical and practical perspective, as well as show the performance of this approach on one example reaction from BI's development portfolio. First of all, the experimental realization started in a conventional wet lab in our laboratories, which is the most conventional setup for an organic lab, where reactions are conducted in a typical glass reactor or glass flask and only one reaction at a time can be conducted. Every step is operated manually, including the sampling of the reaction. This approach not only limits the throughput of reactions, but also creates deviations in the generated data due to varying operators and equipment. Therefore, we opted for a semi-automated system that automates single operations like dosing or sampling of the reaction. Although some of the steps are still done manually and only one reaction at a time can be conducted, this approach still accelerates the overall throughput and increases reproducibility of the generated data due to harmonization of some operations. To further increase turnover and reproducibility, and to decrease development timelines to meet patients' needs quicker, we opted for a fully autonomous system that can conduct chemical reactions completely autonomously, without human or manual interactions, and is accelerated even more due to parallelization. It also ensures reproducibility due to harmonized procedures and equipment. How this system works is shown in the next slides. A bit of technical insight into this platform: it is a fully autonomous system that can conduct parallel experiments at a 100 milliliter scale, with different reaction conditions for all parallel experiments. This system allows us to record reaction kinetics in the way the specific reaction requires. Therefore, the whole system enables reliable Dynamic DOE conduction and reliable data that is highly reproducible. The system features six 100 milliliter stainless steel reactors, here in the front, that are heatable, coolable, and can be stirred under an inert atmosphere to conduct the reactions in. To set up the reactions and operate those reactors, we have liquid and solid handling tools to dispense reactants, solvents, and reagents directly into the reactors. The liquid handling system can also be used to take samples, prepare the samples, and inject them into analytical instruments. Both tools directly dispense all necessary substances into the reactors, where the reactions then take place. This is the operative setup; now to the realization of the whole Dynamic DOE workflow with an example from BI's development portfolio. First of all, how is the Dynamic DOE set up?
First of all, the design for the factors of interest is created based on the results Robert just showed you. We opt for a Latin Hypercube screening that allows us to cover the parameter space sufficiently within a reasonable number of single runs. As Robert said, we usually opt for three to four experiments per parameter. In the example reaction, those would be 10 factors with these limits. We started with 30 runs and set up a Latin Hypercube design. Next, a separate design for the sampling times is created. The sampling times are distributed pseudo-randomly over the whole reaction duration to avoid sampling every reaction at the same time points, which would lead to overrepresentation of those time points and, later on, to overfitting. We rather opt for an evenly distributed data landscape over the whole reaction duration. In this example case, the reaction duration was 185 minutes, which was split into 10 samples with windows of 18.5 minutes. Additionally, we defined constraints to avoid sampling times being too close together, and we set it up as a Fast Flexible Filling design; a small sketch of this sampling-time idea follows below. Afterwards, both designs are combined, and the single runs are stacked by means of their sampling times, as Robert already showed. Because this is very tedious and click-intensive work, we created a JMP app that takes over all these steps and allows us to enter factors like temperature, add further factors, set the low and high factor limits, specify the design for the parameters of interest, specify the number of time points, the number of runs, and the minimum and maximum reaction time, with all these steps now done in the background. We end up with two tables: one which includes the runs and their conditions, and one that spreads out every run, including the sampling times. Both can now be used to conduct this DOE on the robotic platform that I just showed you. The results are then filled into the tables. The time points suggested or defined previously are replaced with the actual time points, which can vary slightly for technical reasons in conducting the reactions; the same is true for reaction temperature. After those tables are filled completely with all the results, including conversion of starting materials to product and formation of side products, we can move on to the regression of this whole process to obtain models that predict or describe the underlying process sufficiently, and to obtain prediction formulas which allow us to do a Design Space Analysis. I would like to show this using the example of a Suzuki reaction, which was part of one of BI's current development projects and is a very common reaction in pharmaceutical chemistry processes. This reaction involves the catalytic transformation of two starting materials with a palladium catalyst to product. In this case, one side product was formed from one starting material, and under those catalytic conditions, two further side products were formed from both starting materials. The Dynamic DOE approach was applied to screen for optimized reaction conditions and investigate process robustness. As shown previously, 10 factors were investigated within 30 runs, and 10 samples per reaction were taken, which were analyzed via HPLC; the respective area percent values were used for the analysis. Those are the factors and their limits, as I already showed in the JMP window.
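The sampling-time idea just described (one pseudo-random time per window, with a minimum spacing constraint) can be sketched in a few lines of Python. The talk itself uses JMP's Fast Flexible Filling design and a custom JMP app, so this is only an illustrative stand-in; only the 185-minute duration and the 10 samples per run come from the example, the minimum gap is assumed.

# Minimal sketch: pseudo-random sampling times, one per window, with a minimum gap
import numpy as np

def sampling_times(total_min=185.0, n_samples=10, min_gap=5.0, seed=0):
    rng = np.random.default_rng(seed)
    window = total_min / n_samples              # 18.5-minute windows in the example
    while True:                                 # rejection sampling until the gap constraint holds
        times = np.array([rng.uniform(i * window, (i + 1) * window)
                          for i in range(n_samples)])
        if np.all(np.diff(times) >= min_gap):
            return np.round(times, 1)

# a different pseudo-random schedule per DOE run, so runs are not all sampled at the same times
for run in range(3):
    print(f"run {run + 1}:", sampling_times(seed=run))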
After setting up this whole Dynamic DOE, conducting the reactions, and analyzing the experiments, we end up with a data table that looks like this: we have the factors, we have all the time points, and we also have the final responses and the analytical data. With these, we can do a Gaussian Process regression, which allows us to investigate the model quality, study factor significance and, very importantly, study the influence of the single factors on the responses as well as their interactions. We can see that, of course, time has a very significant impact, as does temperature, both on conversion and on impurity formation. By defining desirabilities, where we want to maximize product formation and obviously minimize side product and starting material occurrence, we can optimize the reaction conditions, which then leads to optimized conditions where we can see a clearly favored time point and reaction temperature. From the prediction formulas we obtained from the regression, we can then create a second profiler, which allows us to do a Design Space Analysis. That helps us to identify factor limits within which the underlying process would furnish in-spec results. For this, we first need to specify specification limits for all responses. After creating a random table connected to this Design Space Profiler, we get a plot that shows us within which factor ranges the process furnishes in-spec results. In this case, 80% of the results would be in spec, and we can see that there are factors that can be applied over a very broad range, like potassium carbonate, isopropanol volume, the temperature ramp in this case, or the ligand. On the other hand, there are factors that need to be kept within a very specific range to obtain in-spec results. Of course, one of these is time, because the conversion only becomes high enough after a specific reaction time, and in this case temperature also seems to be very sensitive. That allows us to reliably optimize the reaction conditions and investigate process robustness within a very short time frame, conducting only 30 experiments to get sufficient information about 10 parameters, which then allows us to make reliable statements about the underlying process. With that, I will summarize: we showed you the method screening based on in silico data generated from chemical rate laws. This in silico data was used to evaluate combinations of different design types and regression methods, which ultimately led to the result that the best-performing combination is Gaussian Processes with Space Filling Designs. This approach was implemented as the standard method for DOE-based optimization of chemical reactions in chemical development at BI. For the experimental conduction of this statistical approach, robotic autonomous systems are used, and on one reaction from one of BI's current development projects the approach was demonstrated to be an efficient and very accurate method for reaction optimization and process robustness screening. With that, we'd like to thank you for your attention and for the possibility to present these results here.
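For readers who want to see the overall in-silico benchmarking idea in code, the sketch below is a heavily simplified Python analogue of the workflow described in this talk: a toy rate law integrated over time stands in for the fitted kinetics, a Latin hypercube design with several sampling times per run stands in for the Dynamic DOE, a Gaussian process is fitted, and the RMSE is normalized against a mean-prediction model. The rate law, factor ranges, and noise level are all invented; the real work was done in JMP Pro with JSL on measured kinetics.

# Minimal sketch: in-silico Dynamic DOE benchmark with a toy rate law and a Gaussian process
import numpy as np
from scipy.stats import qmc
from scipy.integrate import solve_ivp
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern, WhiteKernel

rng = np.random.default_rng(7)

def yield_at(temp, conc, times):
    # toy rate law: A + B -> P with an Arrhenius-like rate constant (assumed numbers)
    k = 0.05 * np.exp(0.08 * (temp - 20.0))
    sol = solve_ivp(lambda t, y: [-k * y[0] * y[1], -k * y[0] * y[1], k * y[0] * y[1]],
                    (0.0, times.max()), [conc, 1.0, 0.0], t_eval=times)
    return sol.y[2] / conc * 100.0               # percent yield of product P

def build_doe(n_runs, n_times, noise_sd):
    # Latin hypercube over two factors (temperature, concentration), several samples per run
    design = qmc.scale(qmc.LatinHypercube(d=2, seed=1).random(n_runs),
                       l_bounds=[20.0, 0.5], u_bounds=[60.0, 1.5])
    X, y = [], []
    for temp, conc in design:
        times = np.sort(rng.uniform(5, 180, n_times))
        ys = yield_at(temp, conc, times) + rng.normal(0, noise_sd, n_times)
        X += [[temp, conc, t] for t in times]
        y += list(ys)
    return np.array(X), np.array(y)

X_train, y_train = build_doe(n_runs=18, n_times=12, noise_sd=2.0)
X_test, y_test = build_doe(n_runs=60, n_times=5, noise_sd=0.0)   # noiseless validation set

gp = GaussianProcessRegressor(kernel=Matern(length_scale=[10, 0.3, 30], nu=2.5) + WhiteKernel(),
                              normalize_y=True).fit(X_train, y_train)
rmse = np.sqrt(np.mean((gp.predict(X_test) - y_test) ** 2))
rmse_mean = np.sqrt(np.mean((y_train.mean() - y_test) ** 2))      # mean-prediction baseline
print("normalized RMSE:", round(rmse / rmse_mean, 3))             # below 1 beats a mean model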
Thursday, May 30, 2024
Conference Room 1, 2, or 3
Speaker: 沈佳苹, Process Support Engineer, Applied Materials. Topic: Application of JMP DOE analysis and modeling optimization to predicting epitaxial silicon growth rate. Abstract: In semiconductor manufacturing, epitaxial silicon is widely used in wafer substrates, pMOS SiGe, nMOS SiP, trench filling, and other applications. The growth rate of the epitaxial silicon layer is affected by multiple steps and multiple process parameters. To build such a strongly interacting predictive model more efficiently, we need a thorough and comprehensive DOE evaluation and a highly orthogonal DOE. This project starts from historical data on epitaxial silicon growth rate and performs a design evaluation. For building a predictive model, this historical data structure has weak power, low D-efficiency, poor design uniformity, and high effect correlation and prediction variance. During modeling, the stepwise RSM algorithm was unstable due to the lack of model degrees of freedom, resulting in poor model repeatability and overly wide confidence intervals at the optimal design point. In addition, two pairs of strong interactions were observed in the RSM model: one interaction may be attributed to high correlation between effects, while the other is related to competing mechanisms in the process. Robust design and Monte Carlo simulation were then used for tolerance design, and the Design Space Profiler was used for tolerance allocation analysis to simulate future technology requirements. To improve the existing DOE structure at minimal cost and thereby optimize the predictive model, instead of running a completely new DOE we used the Augment Design approach, improving the existing DOE structure in three ways: (1) removing non-orthogonal data; (2) the default augmentation algorithm; (3) a center-point augmentation algorithm. Through this very comprehensive augmented design, JMP suggested the best four augmentation points, which ultimately improved the DOE structure, greatly increased the efficiency of using historical data, and significantly shortened process development cycle time and cost.
Tuesday, October 22, 2024
Executive Briefing Center 8
Once you’ve learned how easy it is to design an experiment in JMP, you never look at the world around you the same. Everything becomes an opportunity for an experiment! This presentation uses a practical example to demonstrate the process of design of experiments (DOE), including designing the experiment, modeling the results, and optimizing the inputs to provide the most desirable output. Attendees at last year’s Discovery conference were treated to an evening of unique fun: hitting glow-in-the-dark golf balls on the driving range at Indian Wells Golf Resort. The driving range has Toptracer technology that monitors each shot. Total distance, carry, ball speed, launch angle, and curve are some of the variables reported with each shot. A driving range that provides so much data provided a perfect opportunity to design an experiment using JMP! After an evening with fellow JMP users and friends, an experiment was designed using the Custom Designer in JMP. The design took only minutes to create. Input variables based on the golf stance setup were used in the design. These included variables such as grip, club head alignment, stance width, and ball location. The designed experiment was executed on the driving range, a model was created, and optimum settings to create the longest and straightest shot were discovered. The modeling and optimization were completed in minutes, while still on the driving range! This allowed confirmation runs to be performed immediately. The benefits were later transferred to the golf course as well.
Tuesday, October 22, 2024
Executive Briefing Center 9
There have been numerous studies showing the efficacy of strategies in process optimization. The common comparisons made are usually between ‘one factor at a time’ (OFAT) experiments and a design of experiments (DOE) approach. When faced with an unfamiliar, high-dimensional process space (e.g., >10 factors), researchers often resort to OFAT methods as they are easy to interpret. Generally, it would be cost-prohibitive and logistically challenging to run multiple experiments geared towards the same objective just to evaluate which strategy outperforms others. To circumvent these issues, we used a Polymerase Chain Reaction (PCR) simulator with 12 unfamiliar continuous and categorical factors to explore these questions. Our team has decades of experience in process optimization in the electronic materials industry (former employees of Apple and others). We intentionally sought and selected a simulator from a research area completely unknown to us that has the ability to simulate a large number of factors and their complex interactions on many responses. To automate experimentation, we used a Python web automation script. By using a simulator and our script, we can run through many experiments while mimicking real-life constraints and experimental budgets as seen in our own professional careers. While adhering to run budget rules, we compare the efficiency and accuracy of four strategies: two OFAT-type strategies as commonly used in industry, and two strategies from the DOE and advanced DOE genre. JMP is used for all experimental analyses and modeling, and an objective attempt is made to compare the strategies.
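The simulator and web-automation script from this study are not reproduced here, but the comparison logic can be sketched against a generic black-box function: run an OFAT sweep and a space-filling design under the same run budget and compare the best responses found. The factor count, budget, and black-box coefficients below are invented for illustration, and the space-filling arm is shown without the follow-up modeling that a full DOE strategy would include.

# Minimal sketch: OFAT sweep vs. space-filling design under an equal run budget
import numpy as np
from scipy.stats import qmc

rng = np.random.default_rng(11)
N_FACTORS, BUDGET = 12, 60

def black_box(x):
    # stand-in for the simulator: main effects plus interactions, with noise
    return (3 * x[0] + 2 * x[3] - 1.5 * x[7]
            + 4 * x[1] * x[5] - 3 * x[2] * x[9] + rng.normal(0, 0.3))

# Strategy 1: OFAT, vary one factor at a time around a baseline and keep the best level
baseline = np.full(N_FACTORS, 0.5)
best_x = baseline.copy()
for j in range(N_FACTORS):
    levels = np.linspace(0, 1, BUDGET // N_FACTORS)     # 5 runs per factor here
    scores = []
    for lv in levels:
        x = best_x.copy(); x[j] = lv
        scores.append(black_box(x))
    best_x[j] = levels[int(np.argmax(scores))]
ofat_best = black_box(best_x)                           # confirmation run at the chosen settings

# Strategy 2: space-filling design over all factors with the same budget
design = qmc.LatinHypercube(d=N_FACTORS, seed=2).random(BUDGET)
sf_best = max(black_box(x) for x in design)

print("OFAT best response:", round(ofat_best, 2), "| space-filling best:", round(sf_best, 2))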
Wednesday, October 23, 2024
Executive Briefing Center 7
Design of experiments (DOE) is a statistical method that guides the execution of experiments, analyzes them to detect the relevant variables, and optimizes the process or phenomenon under investigation. The use of DOE in product development can result in products that are easier and cheaper to manufacture, have enhanced performance and reliability, and require shorter product design and development times. Nowadays, machine learning (ML) is widely adopted as a data analytics tool due to the increasing availability of large and complex sets of data. However, not all applications can afford big data. For example, in the pharma and chemical industries, experimental data sets are typically small due to cost constraints and the time needed to generate the valuable data. Nevertheless, incorporating machine learning into experimental design has proved to be an effective way to optimize formulations with a small data set that can be collected more cheaply and quickly. There are three parts in this presentation. First, the literature relevant to machine learning-assisted experimental design is briefly summarized. Next, an adhesive case is presented to illustrate the efficiency of combining experimental design and machine learning to reduce the number of experiments needed for identifying the design space with an optimized catalyst package. In the third part, which pertains to an industrial sealant application, we use response surface data to compare the prediction error of the RSM model with models from various machine learning algorithms (RF, SVR, Lasso, SVEM, and XGBoost) using validation data runs within and outside the design space.
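As a rough illustration of the kind of comparison described in the third part, the Python sketch below fits a quadratic RSM model and two machine-learning regressors to synthetic response-surface data and reports validation RMSE inside and outside the design region. The data, factor ranges, and model settings are invented and do not represent the adhesive or sealant studies or their results.

# Minimal sketch: RSM vs. machine-learning regressors on synthetic response-surface data
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression, Lasso
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(4)

def truth(X):
    # assumed underlying response surface (quadratic with one interaction) plus noise
    return (5 + 3 * X[:, 0] - 2 * X[:, 1] + 1.5 * X[:, 0] * X[:, 1]
            - 2 * X[:, 0] ** 2 + rng.normal(0, 0.2, len(X)))

X_train = rng.uniform(-1, 1, (20, 3)); y_train = truth(X_train)    # design-region runs
X_in    = rng.uniform(-1, 1, (50, 3)); y_in    = truth(X_in)       # validation inside region
X_out   = rng.uniform(1, 1.5, (50, 3)); y_out  = truth(X_out)      # validation outside region

models = {
    "RSM (quadratic)  ": make_pipeline(PolynomialFeatures(2), LinearRegression()),
    "Random forest    ": RandomForestRegressor(n_estimators=300, random_state=0),
    "Lasso (quadratic)": make_pipeline(PolynomialFeatures(2), Lasso(alpha=0.01, max_iter=50000)),
}
rmse = lambda m, X, y: np.sqrt(np.mean((m.predict(X) - y) ** 2))
for name, m in models.items():
    m.fit(X_train, y_train)
    print(f"{name} RMSE inside: {rmse(m, X_in, y_in):.2f}  outside: {rmse(m, X_out, y_out):.2f}")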