Level: Intermediate  A survey intended to improve the current situation needs questions about both outcomes and causes. If the causal relationship between the two is then captured by regression analysis, the items that should be acted upon can be selected. Items that were not asked cannot be recovered after the survey, whereas items that turn out to be unnecessary can simply be ignored afterward. For this reason, when a comprehensive set of question items is prepared while taking care not to overburden respondents, the number of items becomes large and high correlations appear among them. To address this problem, Takahashi and Kawasaki have proposed "multi-group principal component regression analysis." Its essence is as follows: under a rational grouping constructed so that correlations are high within groups and low between groups, principal components are computed for each group; the important principal components are selected by principal component regression with these components as explanatory variables; and the items with large absolute factor loadings on the selected components are chosen as the main items on which countermeasures should be taken. These main items sometimes cluster together, which can be handled with factor analysis. Causal analysis using principal components is front-side causal analysis, causal analysis using factors is back-side causal analysis, and the combination of the two is two-sided causal analysis. This presentation introduces the theory behind this approach and concrete methods for carrying it out in JMP.  The author has spent half a century on the research and practice of QM (quality management), SQM (statistical quality management), and design methodology, including joint research with many companies and management guidance. Since the 1990s he has proposed Hyper Design, a new design paradigm; studied its underlying mathematics, HOPE theory (Hyper Optimization for Prospective Engineering); and jointly developed its supporting software, the HOPE Add-in for JMP, with SAS. Together these form a trinity that realizes a new design method: Hyper Design as the way of thinking, HOPE theory as the statistical mathematics, and the HOPE Add-in for JMP as the supporting tool.  Because design has a high barrier to entry, it is often misunderstood as a special activity for special people. To break down this barrier and enable many people to master design, the author has developed new educational methods alongside his theoretical research: hands-on education using physical teaching materials (paper helicopters, paper gliders, coin shooting, and others) and virtual materials (a ball-flight simulator and others). Because this educational program, which runs from basic statistics through Hyper Design, uses JMP for visualized analysis and design in many situations, it is both easy to understand and enjoyable. It has been delivered for more than 30 years at universities in Japan and abroad (Keio University, Yale University, Tokyo University of Science, University of Tsukuba, and others) and at many companies, and its effectiveness has been confirmed.  * A paper is available for those who would like a more detailed understanding. If interested, please contact the presenter, Takenori Takahashi, directly.
Level: Advanced  Because Hyper Design is a general-purpose and comprehensive design method, it allows a wide range of practical applications. For robust design in particular, it enables flexible designs that go far beyond the framework of conventional robust design. Until now, most robust design has treated noise factors as qualitative factors. In many cases, however, the noise factors are essentially quantitative and are treated as qualitative only for the convenience of the design theory or because of constraints in measurement methods (often a lack of ingenuity). Because Hyper Design can treat quantitative noise factors as quantitative variables, it greatly advances conventional tactical robust design toward strategic, policy-level robust design.  In robust design with quantitative noise factors, the key is the advanced use of composite functions (functions of functions). Using multiple, multi-layered composite functions systematically makes design through far-reaching optimization possible. Handling composite functions with complex nested structures requires sophisticated software, and JMP makes this easy. Under its new paradigm, Hyper Design realizes designs that go beyond the conventional, tactical way of thinking. In this presentation, the theory of robust design with quantitative noise factors and its applications are discussed concretely using data from real cases taken up in papers and books.  The author has spent half a century on the research and practice of QM (quality management), SQM (statistical quality management), and design methodology, including joint research with many companies and management guidance. Since the 1990s he has proposed Hyper Design, a new design paradigm; studied its underlying mathematics, HOPE theory (Hyper Optimization for Prospective Engineering); and jointly developed its supporting software, the HOPE Add-in for JMP, with SAS. Together these form a trinity that realizes a new design method: Hyper Design as the way of thinking, HOPE theory as the statistical mathematics, and the HOPE Add-in for JMP as the supporting tool.  Because design has a high barrier to entry, it is often misunderstood as a special activity for special people. To break down this barrier and enable many people to master design, the author has developed new educational methods alongside his theoretical research: hands-on education using physical teaching materials (paper helicopters, paper gliders, coin shooting, and others) and virtual materials (a ball-flight simulator and others). Because this educational program, which runs from basic statistics through Hyper Design, uses JMP for visualized analysis and design in many situations, it is both easy to understand and enjoyable. It has been delivered for more than 30 years at universities in Japan and abroad (Keio University, Yale University, Tokyo University of Science, University of Tsukuba, and others) and at many companies, and its effectiveness has been confirmed.  * A paper is available for those who would like a more detailed understanding. If interested, please contact the presenter, Takenori Takahashi, directly.
Level: Intermediate  With work-style reform being widely discussed these days, making experiments and analyses more efficient is increasingly important. A good way to demonstrate the power of JMP for streamlining experimentation is to show that an existing experiment can be replaced by a definitive screening design (DSD) or a custom design with a greatly reduced number of runs, picking the response data for those runs out of the existing experimental data. When experimental data are found split across multiple tables, combine them into a single table, run a multivariate analysis, and visualize it with the profiler so that people notice the pitfalls of the OFAT (One Factor at a Time) approach. For the practice of analyzing replicated experiments using only the means, show that alternatives exist, such as stacking the data and performing multi-objective or robust optimization using both means and variances.  When design of experiments is used in development, the presence of interactions often cannot be predicted in advance, and interactions are by no means rare. A DSD has no confounding between main effects and two-factor interactions (2FIs), nor among 2FIs, and requires only about twice as many runs as factors; this is a major advantage. I will report on what I have learned from actually using DSDs, including the breakdown that occurs when the number of main effects plus interaction terms approaches the number of factors, how augmented designs resolve it, and the points that seem most important in practice from DSD-related papers obtained from the JMP Community and ASQ.  After serving at Yamatake-Honeywell (now Azbil) as General Manager of FA Development, Director of the R&D Division, and Director of the Quality Assurance Promotion Division, and as Advisor at Azbil Kimmon, the author founded Tohrin Consulting. His specialties include yield and quality improvement through production data analysis; statistical problem solving in general, including field-failure prediction, robust design, design optimization, and design of experiments; and on-site guidance in design review, root cause analysis (RCA), prevention of human error, and process improvement. His books include "The Essence of Net Business" (JUSE Press, 2001, co-authored; Telecom Social Science Award), "Practical Success Strategies for Venture Companies" (Chuokeizai-sha, 2011, co-authored), and "An Easy-to-Understand Book on Problem Solving" (Nikkan Kogyo Shimbun, 2014, sole author). A representative paper is "A proposal of workshops to promote the reporting and acceptance of near-misses and feelings that something is off on production lines" (Japanese Society for Quality Control, 2016; 2016 Quality Technology Award). Recent presentations include "A proposal of a mechanism to visualize the organizational factors that induce work errors and promote improvement" (Discovery-Japan 2018) and "Solving quality problems with JMP: defect analysis and reliability prediction in manufacturing" (Discovery-Japan 2019).
Level: Intermediate  From the standpoint of preventing lifestyle-related diseases, Japan's specific health checkup program targets people aged 40 to 74 and aims to reduce the number of people who meet the criteria for metabolic syndrome. A summary of checkup results for all examinees can serve as a benchmark for comparing an individual against the population as a whole, and should therefore be useful for the personal health management of middle-aged and older people.  The NDB Open Data published by the Ministry of Health, Labour and Welfare provides, for each fiscal year, the mean values and class-interval distributions of the checkup items (waist circumference, blood glucose, blood pressure, and so on). For several of the items used to judge metabolic syndrome, the number of people outside the reference range and the number examined can be obtained for each attribute, such as sex and age group. Graphing the proportion outside the reference range for each item by attribute yields interesting results; for some items, for example, no clear trend by age group appears.  In this presentation, along with these graphs, I show the results of fitting a generalized linear model to the proportion outside the reference range for each checkup item, with fiscal year, prefecture, sex, and age group as factors. This model makes it possible to predict the out-of-range proportion for a given combination of attributes (sex, age group, and prefecture of residence), giving a deeper understanding of the specific health checkup population as a whole.  A technical engineer in the JMP Japan division, currently engaged mainly in pre-sales of JMP products for pharmaceutical and food companies. Although he introduces JMP to customers, he strongly identifies as a JMP user himself. In recent years he has been posting JMP analyses of topics in the news to his blog and to JMP Public, a site for sharing analysis reports. https://public.jmp.com/users/259
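For readers who want to try a comparable model outside JMP, below is a minimal Python sketch of a binomial generalized linear model with fiscal year, prefecture, sex, and age group as factors. The file name and column names are assumptions for illustration, not the presenter's actual data.

```python
import pandas as pd
import statsmodels.api as sm

# Hypothetical tidy table built from NDB Open Data: one row per
# (year, prefecture, sex, age_group) combination for a single checkup item,
# with counts of examinees outside and inside the reference range.
df = pd.read_csv("ndb_waist_summary.csv")
df["n_in"] = df["n_examined"] - df["n_out"]

# Binomial GLM: the response is (successes, failures) = (out of range, in range).
y = df[["n_out", "n_in"]]
X = pd.get_dummies(df[["year", "prefecture", "sex", "age_group"]].astype(str),
                   drop_first=True).astype(float)
X = sm.add_constant(X)

result = sm.GLM(y, X, family=sm.families.Binomial()).fit()
print(result.summary())

# Predicted out-of-range proportion for each attribute combination.
df["pred_prop"] = result.predict(X)
```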
Caleb King, JMP Research Statistician Tester, SAS   In this talk, we illustrate how you can use the wide array of graphical features in JMP, including new capabilities in JMP 16, to help tell the story of your data. Using the FBI database on reported hate crimes from 1991-2019, we’ll demonstrate how key tools in JMP’s graphical toolbox, such as graphlets and interactive feature modification, can lead viewers to new insights. But the fun doesn’t stop at graphs. We’ll also show how you can let your audience “choose their own adventure” by creating table scripts to subset your data into smaller data sets, each with their own graphs ready to provide new perspectives on the overall narrative. And don’t worry. Not all is as dark as it seems...     Auto-generated transcript...   Speaker Transcript Caleb King Hello, my name is Caleb King, I am a developer at the at the JMP software. I'm specifically in the design of experiments group, but today I'm going to be   a little off topic and talk about how you can use the graphics and some of the other tools in JMP to help with sort of the visual storytelling of your data.   Now the data set I've chosen is a...to illustrate that is the hate crime data collected by the FBI so maybe a bit of a controversial data set but also pretty relevant to what what's been happening.   And my goal is to make this more illustrative so I'll be walking through a couple graphics that I've created previously   for this data set. And I won't necessarily be showing you how I made the graphs but the purpose is to kind of illustrate for you how you can use JMP   for, like I said, visual storytelling so use the interactivity to help lead the people looking at the graphs and interacting with it   to maybe ask more questions, maybe you'll be answering some of their questions, as you go along. But kind of encourage that data exploration, which is what we're all about here at JMP.   So with that said let's get right in.   I'll first kind of give you a little bit of overview of the data set itself, so I'll kind of just scroll along here. So there's a lot of information about   where the incidents took place.   As we keep going, and the date, well, when that incident occurred. You have a little bit of information about the offenders   and what offense, type of offense was committed. Again some basic information, what type of bias was presented in the incident. Some information about the the victims.   And overall discrimination category, and then some additional information I provided about latitude and longitude, if it's available, as well as some population that I'll be using other graphics. Now just for the sake of, you know, to be clear,   the FBI, that's the United States Federal Bureau of Investigation, defines a hate crime is any   criminal offense that takes place that's motivated by biases against a particular group. So that bias could be racial, against religion, gender, gender identity and so forth.   So, as long as a crime is motivated by a particular bias, it's considered a hate crime, and this data consists of all the incidents that have been   collected by the FBI, going back all the way to the year 1991 and as recent as 2019. I don't have data from 2020, as interesting as that certainly would be,   but that's because the FBI likes to take its time and making sure the data is thoroughly cleaned and prepared before they actually create their reports. 
So   you can rest assured that this data is pretty accurate, given how much time and effort they put into making sure it is.   Alright, so with that let's kind of get started and do some basic visual summary of the data. So I'll start by running this little table script up here. And all this does is basically give us   a count over over the days, so each day, how many incidents occurred, according to a particular bias. From this I'm going to create a basic plot, in this case it's simple sort of line plot   here, showing the number of incidents that happen each day over the entire range. So you get the entire...you can see the whole range of the data 1991 to 2019   and how many incidents occurred. Now this in and of itself would probably a good static image, because you get kind of get a sense of   where the the number of incidents falls. In fact here I'm going to change the axis settings here a little bit. Let's see, we got increments in 50s, let's do it by 20s. There we go.   So there's a little bit of interactivity for you... interaction. We changed the scales to kind of refine it and get a better sense of   how many incidents, on average, there are. I ran a bit of a distribution and, on average, around 20 incidents per day   that we see here. Now of course, you're probably wondering why I have not yet addressed the two spikes that we see in the data.   So yes, there are clearly two really tall spikes. And so, if this were   any other type of software, you might say, okay, I'd look like to look into that. So you go back to the data you try and, you   know, isolate where those dates are and maybe try and present some plots or do some analysis to show what's going on there. Well, this is JMP and we have something that can   help with that, and it's something that we introduced in JMP 15 called graphlets and it works like this. I'm just going to hover and boom.   A little graphlet has appeared to help further summarize what's going on at that point. So in this case there's a lot of information.   We'll notice first the date, May 1, 1992. So if you're familiar with American history, you might know what's happening here, but if not,   you can get a little bit of an additional clue by clicking on the graph. So now you'll see that I'm showing you   the incidents by the particular bias of the incident. So we see here that most of the incidents were against white individuals and then next group is Black or African-American and it continues on down.   I kind of give away the answer here, in that the incidents that occurred around this time where the Rodney King riots in California. Rodney King,   an African-American individual who was unfortunately slain by a police officer and that led to a lot of backlash and rioting   around this time. So that's what we're seeing captured in this particular data point, and if you didn't know that, you would at least have this information to try to start and go looking there...looking online to figure out what happened.   We can do the same thing here with a very large spike. And again, I'll use the hover graphlet, so hover over it. I'll pause to let you look. So we look at the date, September 12, 2001. That's in it of itself a very big clue   as to what happened. But if we look here at the breakdown, we can see that most of the incidents were against individuals of Muslim faith, of   Arab ethnicity or some other type of similar ethnicity or ancestry. 
In this case, we can clearly see   that, after the unfortunate events of September 11, the terrorist attacks that occurred then, there was on the following day, a lot of backlash against   members who were of similar ethnicity, similar faith and so forth, so we had an unfortunate backlash happening at that time.   So already with just this one plot and some of that interactivity, we've been able to glean a lot of information, a lot of high level information in areas where you might want to look further.   But we can keep going. Now something new in JMP 16 is, because we have date here on the X axis, we can actually bin the dates into a larger category, so in this case let's bin it by month.   And we see that the plot disappears. So here's what I'm going to do. I'm going to rerun it and let's see.   There we go.   You never know what will happen. In this case, so this is what's supposed to happen; don't worry.   So we've binned it by month and we noticed an interesting pattern here. There seems to be some sort of seasonal trend occurring, and let's use the hover graphlets to kind of help us identify what might be happening. So I'm going to hover over the lower points. So if I do that,   we see okay, January, December, okay. Interesting. Let's hover over another one, December.   And yet another one, December. Ah, there might actually be some actual seasonal trend in this case going on. We seem to hit low points around the the winter months.   And in fact, if I go back to my data table, I've actually seen that before. It was something I kind of discovered while exploring that technique and I've already created a plot to kind of address that.   So this was something I created based off of that, kind of, look at, you know, what's the variation in the number of incidents over all the years within this month.   And here we can see them the mean trends, but we also see a lot of variation, especially here in September because of that huge spike there.   So maybe we need something a little more appropriate. So I'll open the control panel and hey, let's pick the the median. That's more robust and maybe look at the interquartile range, so that way we have a little bit more robust   metrics to play with.   And so, again, we see that seasonal trends, so it seems that there's definitely a large dip within the winter months   as opposed to peaking kind of in the spring and summer months. Now   this might be something someone might want to look further into and research why is that happening.   You might have your own explanations. My personal explanation is that I believe the Christmas spirit is so powerful that overcomes whatever whatever hate or bias individuals might have in December.   Again that's just my personal preference, you probably have your own. But again, with just a single plot, I was able to discover this trend and   make another plot to kind of explore that further. So again with just this one plot, I've encouraged more research. And we can keep going.   So let's see, let's bin it by year, and if we do that, we can clearly see this kind of overall trend.   So we see a kind of peak in the late '90s around early 2000s before dropping, you know, almost linear fashion,   until it hits its midpoint about in the mid 2010s before starting to rise again. So keep that in mind, you might see similar trends in other plots we show. But again, let's take a step back and just realize that in this one plot we've seen different   aspects of the data. 
We even   answered some questions, but we've also maybe brought up a few more. What's with that seasonal trend?   And if you didn't know what those events were that I told you, you know, what were those particular events? So that's the beauty of the interactivity of JMP graphics is it allows the user to engage and explore and encourages it all within just one particular medium.   All right.   Let's keep going. So I mentioned, this is sort of visual storytelling, so you can think of that sort of as a prologue, as sort of the the overall view. What's...what's   what's the overall situation? Now let's look at kind of the actors,   that is, who's committing these types of offenses? Who are they committing them against? What information can we find out about that? So here I've   created, again, a plot to kind of help answer that. Now this might be a good start. Here I've created a heat map   that then emphasizes the the counts by, in this case, the offender's race versus their particular bias. So we see that   a lot of what's happening, in this case I've sorted the columns so we can see there's a there's a lot going on. Most of its here in this upper   left corner and not too much going on down here, which I guess is good news. There's a lot of biases where there's not a lot happening, most of it's happening here in this corner.   Now, this might be a good plot, but again there's a lot of open space here. So maybe we can play around with things to try and emphasize what's going on. So one way I can do that is I'll come here to the X axis and I'm going to size by the count.   Now you'll see here, I had something hidden behind the scenes. I'd actually put a label, a percentage label on top of these.   There was just so much going on before that you couldn't even see it, but now we can actually see some of that information. So kind of a nice way to summarize it as opposed to counts.   But even with just the visualization, we can clearly see the largest amount of bias is against Black or African American citizens and then Jewish and on down until there's hardly any down here. So just by looking at the X axis, that gives you a lot of information about what's going on.   We can do the same with the Y, so again, size by the count.   And again, there's a lot of information contained just within the size and how I've adjusted the the axes.   And this case we include...we've really emphasized that corner, so we can clearly see who the top players are. In this case, most of it is   offenders are of white or unknown race against African-Americans, the next one being against Jewish, and then   anti white and then it just keeps dropping down. So we get a nice little summary here of what's going on. Now, you may have noticed that as I'm hovering around, we see our little circle. That's my graphlet indicator, so again I've got a tool here.   We've we've interacted a little bit and again, this could be a great static image, but let's use the power of JMP, especially those graphslets,   to interact and see what further information we can figure out. So in this case, I'll hover over here.   And right here, a graph, in this case, a packed bar chart, courtesy of our graph guru Xan Gregg. In this case, not only can you see, you know,   what people are committing the offenses and against whom, your next question might have been, you know, what types offenses are being committed? Well, with a graphlet, I've answered that for you.   
We can see here the largest...the overwhelming type of offense is intimidation, followed by simple and aggravated assault, and then the rest of these, that's the beauty of the packed bar chart.   We can see all the other types offenses that are committed. If you stack them all on top of each other, they don't even compare. They don't even break the top three.   So that tells you a lot about the types of...these types of offenses, how dominant they are.   Now, another question you might have is, okay, we've seen the actors, we've seen the actions they're taking,   but there's a time aspect to this. Obviously this is happening over time, so has this been a consistent thing? Has there been a change in the trends? Well,   have no fear. Graphlets again to the rescue. In this case, I can actually show you those trends. So here we can see how has the   types of...the number of intimidation incidents changed over time? And again, we see that the pattern seems to follow what that overall trend was.   A peak in the like, late 90s, and then the steep trend...almost linear drop until about the mid 2010s, before kind of upticking again more recently.   And again we can maybe see that trend and others. I won't click to zoom in, but you can just see from the plot here, those trends in simple assault here and aggravated assault as well, a little bit there.   And you can keep exploring. So let's look at the unknown against African-Americans and see what difference there might be there. In this case, we can clearly see   that there are two types of offenses that really dominate, in this case, destruction or damage to property (which, if you think about it, might make sense; if you see your property's been damaged, there's a good chance you may not know who did it)   and intimidation are the dominant ones. And again, you can...the nice thing about this is the hover labels kind of persist, so you can again look and see what trends are happening there.   So in this case, we see with damage, there's actually two peaks, kind of peaked here in the late '90s early 2000s, before dropping again. And with intimidation, we see a similar trend as we did before.   Again within just one graphic, there's a lot of information contained and that you, as the user, can interact with to try and emphasize certain key areas, and then you, as the user, just visualize...just looking at this and interacting with it, can play around and glean a lot of information.   All right.   And let's keep going. Now you'll notice that amongst the reporting agencies, so, most of them are city/county level   police departments and so forth, but there's also some universities in here. So there might be someone out there who might be interested in seeing, you know, what's happening at the universities.   And so, with that, I've created this nice little table script to answer that. Now this time,   I've been just running the table scripts and I mentioned, I won't go too much behind the scenes, this is more illustrative.   Here I'm going to let you take a peek, because I want to not only show you the power of the graphics but also the power of the table script. Now if you're familiar with JMP,   you might know, okay, the table script's nice because I can save my analyses, I can save my reports, I can even use it to save graphics like I...like I did in the last one, so you may not have noticed that you can also save   scripts to help run additional tables and summary tables and so forth. 
So let me show you what all is happening behind here, in fact, when I ran the script, I actually created two data tables.   You only saw the one, so in this case I first created the data table that selected all the universities and then from that data table it created a summary and then I close the old one.   And then I also added to that some of the graphics. So I won't go into too much detail here about how I set this up, because I want to save that for after the next one. I'll give you a hint. It's based off of a new feature in JMP 16 that will really amaze you.   All right, let's go back to...excuse me...university incidents.   And here again I've saved the table script. This one that will show us a graphic.   So here we can see again is that packed bar chart, and here I'm kind of showing you which universities had the most incidents. Now again, this in and of itself might be a pretty good standard graphic.   You can see that, you know, which university seem to have the most incidents happening and again it's kind of nice to see that there's no real dominating one. You can still pack the other universities   on top of them, and nobody is dominating one or the other. So that in and of itself is kind of good news, but again there's a time aspect to this. So   have these been maybe... has the University of Michigan Ann Arbor, have they had trouble the entire time? Have they...would they have always remained on top? Did they just happen to have a bad year? Again, graphlets to the rescue.   In this case,   you'll see an interesting plot here. You might say, you know, what what is this thing? This looks like it belongs in an Art Deco museum. What...   what kind of plot is this? Well, it's actually one we've seen before. I'm just using something new that came out in JMP 16, so I'm going to give you a behind the scenes look.   And in this case, we can see, this is actually a heat map. All I've done   is I do a trick that I often like to do, which is to emphasize things two different ways, so not only emphasizing the counts by color, which is what you would typically do in a heat map, the whites are the missing entries, I can also now in JMP 16 emphasize by size.   And so I think this again gets back to where we size those axes before. It emphasizes...helps emphasize certain areas. So here we can see now maybe there's a little bit of an issue with incidents against African-Americans,   that has been pretty consistent, with an especially bad year in apparently 2017, as opposed to all of the other incidents that have been occurring.   Now there's no extra hover labels here.   All I'll do is summarize the data, but that's okay.   This in and of itself gives you a lot of information, so this is a new thing that came out in JMP 16 that can again help with that emphasis.   And again, we can keep going. We can look at other universities, so here, this might be an example of a university where they seem to have a pretty bad period of time,   the University of Maryland in College Park, but then there was an area where   things were really good, and so you might be interested in knowing, well, what happened to make this such a great period?   Is there something the university instituted, what they did that seemed to cause the count, the number of incidents to drop significantly? That might be something worth looking into.   
And you can keep going and looking again to see whether it's a systemic issue, whether like, in this case, it seemed there's just a really bad year that dominated, overall they were just doing okay.   They were doing pretty good. Again, this might be another one. They had a really bad time early on, but recently they've been doing pretty good,   and so forth. So again, kind of highlighting that interactivity yet again,   and in this case, with some of the newer features in JMP 16. Now, before we transition to the last one, I have a confession. I'm a bit of a map nerd, so I really like maps and any type of data analysis that, you know, relates to maps.   I don't know why. I just really like it and so I'm really excited to show you this next one, because now we look at the geography of the incidents.   But I'm also excited because this really, I believe, highlights the power of both the table scripts and the JMP graphics, especially the hover graphs.   So hopefully that got you excited as well, so let's run it. Now this one's going to take a little while because there's actually a lot going on with this table script. It's creating a new table. It's also doing a lot of functions in that table   and computing a lot of things. So here we've got not just, you know, pulling in information but also there's a lot of these columns here near the end that have been calculated behind the scenes.   Now I have to take a brief moment to talk about a particular metric I'm going to be using. So a while back, I wrote a blog entry called the Crime Rate Conundrum on on the JMP Community (community.jmp.com),   so shameless plug there, but in that I talked about how, you know, typically when you're reporting incidents, especially crime incidents,   usually we kind of know that you don't just want to report the raw counts, because   there might be a certain area where it has a high number of counts, a high number of incidents, is that just because that's...there's a problem at problem there?   Or is it because there's just a lot of people there? And so we, of course, would expect a lot of incidents because there's just a lot of people. So of course people   report incidents rates. Now that's fine because everybody's now on a level playing field but one side effect of that is it tends to elevate   places that have small populations. Essentially, you have, if you have small denominator, you will tend to have a larger ratio just because of that.   And so that's sort of an unfortunate side effect, and so there, I talk about an interesting case where we have a place with a really small population that gets really inflated.   And how some people deal with that. One way I tried to address that was through this use of a weighted incident rate, essentially, the idea is   I take your incident rate, but then I weight you by, sort of, what proportion...   excuse me...basically a weight by how many people you have there. In this case, I have a particular weight, I basically rank the populations,   so that the the largest place would have rank of of the smallest. However, in this case there's 50 states, so the state with the largest population would have a   rank of 50 and the smallest state a rank of one. 
If you take that and divide that by you know the maximum rank, that's essentially your weight so it's it's a way to kind of put   a weight corresponding to your total population and the idea here is that, if your incident rate is such that it overcomes this weight penalty, if you will,   then that means that you might be someone worth looking into. So it tries to counteract   that inflation, just due to a small population. If you are still relatively small, but your incident rate is high enough that you overcome your weight essentially,   we might want to look into you. So hopefully that wasn't too much information, but that's the metric that I'll be primarily using so I'll run the script   and   here we go. So first I've got a straightforward line plot that kind of shows the weighted incident rates over time for all the states.   Now I'll use a new feature here. We can see here that New Jersey seems to dominate. Again interactivity, we can actually click to highlight it.   There's some new things that we do, especially in JMP 16. I'm going to right click here and I'm going to add some labels. So let's do the maximum value and let's do the last value   just for comparison.   So here we can see this...the peak here was about 11.4 incidents per 1,000 (that's a weighted incident rate) here in sort of the early '90s.   And then we see a decreasing trend, again it seems to drop about the same that all the the overall incident rate did before starting to peak again here in   2019. So again with just some brief numbers again this, in and of itself, would be an interesting plot to look at, but as you could see, my little graphlet indicator is going, so there's more.   Here's where the the map part comes in. So I'm going to hover over a particular point.   In this case, not only   can you see sort of the overall rate, I can actually break it down for you, in this case by county. So here I've colored the   counties by the total number of incidents within that year. And again, there's that time aspect, so this shows you a snapshot for one particular year, in this case 2008.   But maybe you're interested in the overall trend, so one one way you could do that is, hey, these are graphlets. I could go back, hover over another spot, pull up that graph, click on it to zoom,   repeat as needed. You could do that or you could use this new trick I found actually while preparing this presentation.   Let's unhide...notice over here to the side, we have a local data filter. That's really the key behind these graphlets.   I'm going to come here to the year and I'm going to change its modeling type to nominal, rather than continuous, because now, I can do something like this. I can actually go through   and select individual years or, now this is JMP, we can do better.   Let me go here and I'm going to do an animation. I'm going to make it a little fast here. I'm going to click play, and now I can just sit back relax and, you know, watch as JMP does things for me.   So here we can see it cycle through and getting a sense of what's happening. I'll let it cycle through a bit. We see...already starting to see some interesting things happening here.   Let's let it cycle through, get the full picture, you know. We want the complete picture, not that I'm showing off or anything.   Alright, so we've cycled through and we noticed something. So let's let's go down here to about say 2004, 2005. So somewhere around here, we noticed this one county here, in particular, seems to be highlighted.   
And in fact, you saw my little graphlet indicator. So again, I can hover over it, and here   yet another map. Now you can see why I'm so excited.   Again, in this case, I can actually show you at the county level, so the individual county level...   Excuse me, let me...let's move that over a bit. There we are. Some minor adjustments and again, you can see my trick of emphasizing things two ways by both size and color.   We can kind of see dispersion within the ???, this is individual locations and because there's that time aspect again,   we know...now we know better, we don't have to go back and click and get multiple graphs, we can again use the local data filter tricks. So I can go back. I'll do   the year, and so in this case, we can again click through. Here I'm just going to use the arrow keys on my keyboard to kind of cycle through.   And just kind of get a sense of how things are varying over time. In this case, you see a particular area, you've probably already seen it, starting about 2006, 2007ish frame.   There's this one area...this here.   Keansburg, which seems to be highlighted and you'll notice yet another graphlet. How far do you want to go?   Graphception, if you will.   We can   keep going down further and further in. In this case, I get...I break it out by what the bias was, and again I could do that trick if I wanted to, to go through and cycle through by year.   So, again so much power in these graphs. With this one graphlet, I was able to explore geographical variation at county level and even further below, and so it might be   allowing you to kind of explore different aspects of the data, allowing you to generate more questions. What was happening in Keansburg around this time to make it pop like this?   That's something you might want to know.   So that's all I have for you today, hopefully I've whet your appetite and was able to clearly illustrate for you how powerful the the JMP visualization is in exploring the data.   If you want to know more, there's going to be a plenary talk on data viz. I definitely encourage you to explore that and it kind of helps address different ways of visualization and how JMP can help out with that.   But I did promise you, at one point, to give you a peek as to how I was able to create these pretty amazing table scripts and I'll do that right now.   It's called the enhanced log now in JMP 16. This is one of the coolest new features in JMP 16. Enhanced log actually follows along as you interact and it keeps track of it. And so whenever I closed, in this case, closed a data table, opened a data table, ran a data table,   if I added a new column, if I created a new graph, it gets recorded here in the log.   This is something that John Sall will be talking about in his plenary talk. It's, again, one of the most new amazing features here.   And this is the key to how I was able to create these tables scripts. I can honestly say that if this hadn't been present, I probably wouldn't have been able to create these pretty cool table scripts, because it'd be a lot of work to do.   So again, this is a really cool feature that's available in JMP 16. So I hope I was able to convince you that JMP is a great tool for exploring data, for creating awesome visualizations, interactive visualizations. And that's all I have. Thank you for coming.
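The weighted incident rate described in this talk is straightforward to compute outside JMP as well. Below is a minimal Python sketch of that metric under the ranking scheme the speaker outlines; the column names and the two example rows are illustrative assumptions, not values from the FBI data.

```python
import pandas as pd

def weighted_incident_rate(df):
    # Incidents per 1,000 residents.
    rate = df["incidents"] / df["population"] * 1000
    # Rank populations so the largest state gets the highest rank (50 of 50)
    # and the smallest state gets rank 1; the weight is rank / max rank.
    rank = df["population"].rank(method="min")
    weight = rank / rank.max()
    return rate * weight

states = pd.DataFrame({
    "state": ["New Jersey", "Wyoming"],
    "incidents": [1000, 30],
    "population": [9_000_000, 580_000],
})
states["weighted_rate"] = weighted_incident_rate(states)
print(states)
```

The weight penalizes small populations, so a sparsely populated state only stands out if its raw rate is high enough to overcome that penalty, which is exactly the behavior the speaker describes.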
Ross Metusalem, JMP Systems Engineer, SAS   Text mining techniques enable extraction of quantitative information from unstructured text data. JMP Pro 16 has expanded the information it can extract from text thanks to two additions to Text Explorer’s capabilities: Sentiment Analysis and Term Selection. Sentiment Analysis quantifies the degree of positive or negative emotion expressed in a text by mapping that text’s words to a customizable dictionary of emotion-carrying terms and associated scores (e.g., “wonderful”: +90; “terrible”: -90). Resulting sentiment scores can be used for assessing people’s subjective feelings at scale, exploring subjective feelings in relation to objective concepts and enriching further analyses. Term Selection automatically finds terms most strongly predictive of a response variable by applying regularized regression capabilities from the Generalized Regression platform in JMP Pro, which is called from inside Text Explorer. Term Selection is useful for easily identifying relationships between an important outcome measure and the occurrence of specific terms in associated unstructured texts. This presentation will provide an overview of both Sentiment Analysis and Term Selection techniques, demonstrate their application to real-world data and share some best practices for using each effectively.     Auto-generated transcript...   Speaker Transcript Ross Metusalem Hello everyone, and thanks for taking some time to learn how JMP is expanding its text mining toolkit in JMP Pro 16 with   sentiment analysis and term selection.   I'm Ross Metusalem, a JMP systems engineer in the southeastern US, and I'm going to give you a sneak preview of these two new analyses, explain a little bit about how they work, and provide some best practices for using them.   So both of these analysis techniques are coming in Text Explorer, which for those who aren't familiar, this is JMP's   tool for analyzing what we call free or unstructured texts, so that is natural language texts.   And it's a what we call a text mining tool, so that is a tool for deriving quantitative information from free text so that we can   use other types of statistical or analytical tools to derive insights from the free text or maybe even use that text as inputs to other analyses.   So let's take a look at these two new text mining techniques that are coming to Text Explorer in JMP Pro 16, and we'll start with sentiment analysis.   Sentiment analysis at its core answers the question how emotionally positive or negative is a text.   And we're going to perform a sentiment analysis on the Beige Book, which is a recurring report from the US Federal Reserve Bank.   Now apologies for using a United States example at JMP Discovery Europe, but the the Beige Book does provide a nice demonstration of what sentiment analysis is all about.   So this is a monthly publication, it contains national level report, as well as 12 district level reports, that summarize economic conditions in those districts, and all of these are based on qualitative information, things like interviews and surveys.   And US policymakers can use the qualitative information in the Beige Book, along with quantitative information, you know, in traditional economic indicators to drive economic policy.   So you might think, well, we're talking about sentiment or emotion right now. I don't know that I expect economic reports to contain much emotion, but the Beige Book reports and much language does actually contain words that can carry or convey emotion.   
So let's take a look at an excerpt from one report. Here's a screenshot straight out of the new sentiment analysis platform. You'll notice some words highlighted, and these are what we'll call sentiment terms, that is, terms that we would argue have some emotional content to them. For example, at the top, "computer sales, on the other hand, have been severely depressed," where "severely depressed" is highlighted in purple, indicating that we consider that to convey negative emotion, which it seems to; if somebody describes computer sales as "severely depressed," it sounds like they mean for us to interpret that as certainly a bad thing. If we look down, we see in orange a couple of positive sentiment terms highlighted, like "improve" or "favorable." So we can look for words that we believe have positive or negative emotional content and highlight them accordingly: purple for negative, orange for positive. Some sentiment analysis keeps things at that level, so just a binary distinction, a positive text or a negative text. There are additional ways of performing sentiment analysis and, in particular, many ways try to quantify the degree of positivity or negativity, not just whether something is positive or negative. So consider this other example, and I'll point our attention right to the bottom here, where we can see a report of "poor sales." And I'm going to compare that with where we said "computer sales are severely depressed." Both of these are negative statements, but I think we would all agree that "severely depressed" sounds a lot more negative than just "poor" does. So we want to figure out not only whether the sentiment expressed is positive or negative, but how positive or negative it is, and that's what sentiment analysis in Text Explorer does. So how does it do it? Well, it uses a technique called lexical sentiment analysis that's based on some sentiment terms and associated scores. What we're seeing right now is an excerpt from what we'd call a sentiment dictionary that contains the terms and their scores. For example, the term "fantastic" has a score of positive 90 and the term "grim" at the bottom has a score of -75. So what we do is specify which terms we believe carry emotional content and the degree to which they're positive or negative on an arbitrary scale, here -100 to 100. And then we can find these terms in all of our documents and use them to score how positive or negative those documents are overall. If you think back to our example "severely depressed," that word "severely" takes the word "depressed" and, as we call it, intensifies it. It is an intensifier, or a multiplier, of the expressed sentiment, so we also have a dictionary of intensifiers and what they do to the sentiment expressed by a sentiment term. For example, we say "incredibly" multiplies the sentiment by a factor of 1.4, whereas "little" multiplies the sentiment by a factor of 0.3, so it actually attenuates the sentiment expressed a little. Now, finally, there's one other type of word we want to take into account, and that is negators, things like "no" and "not," and we treat these basically as polarity reversals. So "not fantastic" would be taking the score for "fantastic" and multiplying it by -1. And so this is a common way of doing sentiment analysis, again called lexical sentiment analysis. 
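To make the arithmetic concrete, here is a minimal sketch of this kind of lexical scoring. The dictionary values for "depressed" and "severely" and the simple one-word-lookback rule are assumptions for illustration; this is not JMP's implementation, only the general idea of terms, intensifiers, and negators.

```python
# Minimal sketch of lexical sentiment scoring: terms, intensifiers, negators.
# The dictionaries and the one-word-lookback rule are illustrative assumptions.
SENTIMENT = {"fantastic": 90, "favorable": 60, "improve": 40,
             "depressed": -60, "poor": -50, "grim": -75}
INTENSIFIERS = {"incredibly": 1.4, "severely": 1.4, "little": 0.3}
NEGATORS = {"no", "not"}

def doc_score(text):
    words = text.lower().split()
    scores = []
    for i, w in enumerate(words):
        if w in SENTIMENT:
            s = SENTIMENT[w]
            prev = words[i - 1] if i > 0 else ""
            if prev in INTENSIFIERS:      # e.g. "severely depressed" -> -60 * 1.4
                s *= INTENSIFIERS[prev]
            elif prev in NEGATORS:        # e.g. "not fantastic" flips the sign
                s *= -1
            scores.append(s)
    # Document score = average over all sentiment terms found (None if none).
    return sum(scores) / len(scores) if scores else None

print(doc_score("computer sales have been severely depressed"))        # -84.0
print(doc_score("conditions continue to improve and look favorable"))  # 50.0
```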
So what we do is we take sentiment terms that we find, we multiply them by any associated intensifier   or negators and then for each document, when we have all the sentiment scores for the individual terms that appear, we just average across all them to get a sentiment score for that document.   And JMP returns these scores to us in a number of useful ways. So this is a screenshot out of the sentiment analysis tool and we're going to be, you know, using this in just a moment.   But you can see, among other things, it gives us a distribution of sentiment scores across all of our documents. It gives us a list of all the sentiment terms and how frequently they occur.   And we even have the ability, as we'll see, to export the sentiment scores to use them in graphs or analyses.   And so I've actually made a couple graphs here to just try to see as an initial first pass, does the sentiment in the Beige Book reports actually align with economic events in ways that we think it should? You know, do we really have some validity to this   sentiment as some kind of economic indicator? And the answer looks like, yeah, probably.   Here I have a plot that I've made in Graph Builder; it's sentiment over time, so all the little gray dots are the individual reports   and the blue smoother kind of shows the trend of sentiment over time with that black line at zero showing neutral sentiment, at least according to our scoring scheme.   The red areas are   times of economic recession as officially listed by the Federal Reserve.   So you might notice sentiment seems to bottom out or there are troughs around recessions, but another thing you might notice   is that actually preceding each recession, we see a drop in sentiment either months or, in some cases, looks like even a couple years, in advance. And we don't see these big drops in sentiment   in situations where there wasn't a recession to follow. So maybe there's some validity to Beige Book sentiment as a leading indicator of a recession.   If we look at it geographically, we see things that make sense too. This is just one example from the analysis. We're looking at sentiment in the 12   Federal Reserve districts across time from 1999 to 2000 to 2001. This was the time of what's commonly called the Dotcom Bust, so this is when   there was a big bubble of tech companies and online businesses that were popping up and, eventually, many of them went under and there were some pretty severe economic consequences.   '99 to 2000 sentiment is growing, in fact sentiment is growing pretty strongly, it would appear,   in the San Francisco Federal Reserve district, which is where many of these companies are headquartered. And then in 2001 after the bust, the biggest drop we see all the way to negative sentiment in red here, again occurring in that San Francisco district.   So, just a quick graphical check on these Beige Book sentiment scores suggests that there's some real validity to them in terms of their ability to track with, maybe predict, some economic events, though of course, the latter, we need to look into more carefully.   But this is just one example of the potential use cases of sentiment analysis and there are a lot more.   One of the biggest application areas where it's being used right now is in consumer research, where people might, let's say, analyze   some consumer survey responses to identify what drives positive or negative opinions or reactions.   
But sentiment analysis can also be used in, say, product improvement where analyzing product reviews or warranty claims might help us find product features or issues that elicit strong emotional responses in our customers.   Looking at, say, customer support, we could analyze call center or chats...chat transcripts to   find some support practices that result in happy or unhappy customers. Maybe even public policy, we analyze open commentary to gauge the public's reaction to proposed or existing policies.   These are just a few domains where sentiment analysis can be applied. It's really applicable anywhere you have text that convey some emotion and that emotion might be informative.   So that's all I want to say up front. Now I'd like to spend a little bit of time walking you through how it works in JMP, so let's move on over to JMP.   Here we have the Beige Book data, so down at the bottom, you can see we have a little over 5,000 reports here, and we have the date of each report from   May 1972 October 2020, which of the 12 districts it's from, and then the text of the report. And you can see that these reports,   they're not just quick statements of you know, the economy is doing well or poorly, they they can get into some level of detail.   Now, before we dive into these data, I do just want to say thank you to somebody for the idea to analyze the Beige Book and for actually pulling down the data and getting it into JMP, in a   format ready to analyze. And that thanks goes to Xan Gregg who, if you don't know, Xan is a director on the JMP development team and the creator of Graph Builder, so thanks, Xan,for your help.   Alright, so let's let's quantify the degree of positive and negative emotion in all these reports. We'll use Text Explorer under the analyze menu.   Here we have our launch window. I'll take our text data, put it in the text columns role. A couple things to highlight before we get going.   Text Explorer supports multiple languages, but right now, sentiment analysis is available only in English, and one other thing I want to draw attention to is stemming right here.   So for those who do text analysis you're probably familiar with what stemming is, but for those who aren't familiar, stemming is a process whereby we kind of collapse multiple...   well, to keep it nontechnical...multiple versions of the same word together. Take "strong," "stronger," and "strongest." So these are three   versions of the same word "strong" and in some text mining techniques, you'd want to actually combine all those into one term and just say, oh, they all mean "strong" because that's kind of conceptually the same thing.   I'm going to leave stemming off here actually, and it's because...take "strongest," that describes as strong as something can get   versus "stronger," which says that you know it's strong, but there are still, you know, room for it to be even stronger.   So "strongest" should probably get a higher sentiment score than "stronger" should, and if I were to stem, I would lose the distinction between those words. Because I don't want to lose that distinction, I want to give them different sentiment scores. I'm going to keep stemming off here.   So I'll click OK.   And JMP now is going to tokenize our text, that is break it into all of its individual words and then count up how often each one occurs.   And here we have a list of all the terms and how frequent they are. So "sales" occurs over 46,000 times and we also have our phrase list over here. 
So the phrases are   sequences of two or more words that occur a lot, and sometimes we actually want to count those as terms in our   analysis. And for sentiment analysis, you would want to go through your phrase list and, let's say, maybe add "real estate," which is two words, but it really refers to, you know, property.   And I could add that. Now normally in text analysis, we'd also add what are called stop words,   that's words that don't really carry meaning in the context of our analysis and we'd want to exclude. Take "district." This happen...or the Beige Book uses   the word "district" frequently, just saying, you know, "this report from the Richmond district," something like that, it's not really meaningful.   But I'm actually not going to bother with stop words right here and that's because, if you remember,   back from our slides, we said that all of our sentiment scores are based on a dictionary, where we choose sentiment words and what score they should get.   And if we just leave "district" out, it inherently gets a score of 0 and doesn't affect our sentiment score, so I don't really need to declare it as a stop word.   So once we're ready, we would invoke text or, excuse me, sentiment analysis under the red triangle here.   So what JMP is doing right now, because we haven't declared any sentiment terms or otherwise, it's using a built-in sentiment dictionary to score all the documents. Here we get our scores out.   Now before navigating these results, we probably should customize our sentiment dictionary, so the sentiment bearing words and their scores. And that's because in different domains,   maybe with different people generating the text, certain words are going to bear different levels of sentiment or bear sentiment in one case and not another. So we want to   really pretty carefully and thoroughly create a sentiment dictionary that we feel accurately captured sentiment as it's conveyed by the language we're analyzing right now.   So JMP, like I said, uses a built-in dictionary and it's pretty sparse. So you can see it right here, it has some emoticons,   like these little smileys and whatnot, but then we have some pretty clear sentiment bearing language, like "abysmal" at -90.   Now it's it's probably not the case that somebody's going to use the word "abysmal" and not mean it in a negative sense, so we feel pretty good about that. But, you know, it's not terribly long list and we may want to add some terms to it.   So let's see how we do that, and one thing I can recommend is just looking at your data. You know, read some of the documents that you have and try to find some words that you think might be indicative of sentiment.   We actually have down here a tool that lets us peruse our documents and directly add sentiment terms from them. So here, I have a document list. You can see Document 1 is highlighted and then Document 1 is displayed below. I could select different documents to view them.   Now, if we look at Document 1, right off the bat, you might notice a couple potential sentiment terms like "pessimism" and "optimism."   Now you can see these aren't highlighted. These actually aren't included in the standard sentiment dictionary.   And a lot of nouns you'll find actually aren't, and that's because nouns like "pessimism" or "optimism" can be described in ways that suggests their presence or their absence, basically. So I could say, you know, "pessimism is declining" or   "there's a distinct lack of pessimism," "pessimism is no longer reported."   
And, in those cases, we wouldn't say "pessimism" is a negative thing. It's...so you want to be careful and think about words in context and how they're actually used before adding them to a sentiment dictionary.   For example, I could go back up to our term list here. I'm just going to show the filter,   look for "pessimism" and then show the text to have a look at how it's used. So we can see in this first example, "the mood of our directors varies from pessimism to optimism."   And the next one   "private conversations a deep mood of pessimism." If you actually read through, this is the typical use, so actually in the Beige Book, they don't seem to use the word pessimism in the ways that I might fear,   "optimism is increasing."   So I actually feel okay about adding "pessimism" here, so let's add it to our sentiment dictionary.   So if I just hover right over it,   you can see we bring up some options of what to do with it. So here I can, let's say, give it a score of -60.   And so now that will be added to our dictionary with that corresponding score, and it's triggering updated sentiment scoring in JMP. So that is, it's now looking for the word "pessimism" and adjusting all the document scores where it finds it.   So let's go back up now to take a look at our sentiment terms in more detail. If I scroll on down, you will find "pessimism"   right here with the score of -60 that I just gave it. Now I might want to actually...if you notice, "pessimistic" is, by default, has a score of -50, so maybe I just type -50 in here to make that consistent.   And I could but I'm not going to, just so that we don't trigger another update.   You'll also notice, right here, this list of possible sentiment terms. So these are terms that JMP has identified as maybe bearing sentiment, and you might want to take a look at them and consider them for inclusion in your dictionary.   For example, the word "strong" here, if you look at some of the document texts to the right, you might say, okay, this is clearly a positive thing. And if you've looked at a lot of these examples, it really stands out that   the word "strong" and correspondingly "weak" are words that these economists use a whole lot to talk about things that are good or bad about the current economy.   So I could add them, or add "strong" here by clicking on, let's say, positive 60 in the buttons up there. Again, I won't right now, just for the sake of expediting our look at sentiment analysis.   So we could go through, just like our texts down below, we could go through our sentiment term list here to choose some good candidates.   Under the red triangle, we also can just manage the sentiment terms more directly, so that is just in the full   term management lists that we might be used to for a Text Explorer user, so like the phrase management and the stop word management.   You can see we've added one new term local to this particular analysis, in addition to all of our built-in terms. Of course, we could declare exceptions too, if we want to just maybe not actually include some of those.   And importantly, you can import and export your sentiment dictionary as well. Another way to declare sentiment terms is to consult with subject matter experts. You know,   economists would probably have a whole lot to say about the types of words they would expect to see that would clearly convey   positive or negative emotion in these reports. 
And if we could talk to them, we would want to know what they have to say about that, and we might even be able to derive a dictionary with them in, say, a tab-separated file and then just import it here. And then, of course, once we make a dictionary we feel good about, we should export it and save it so that it's easy to apply again in the future. So that's a little bit about how you would actually curate your sentiment dictionary. It would also be important to curate your intensifier terms and your negation terms, and for the negators you don't see scores here, because these are just polarity reversals. Just to show you a little bit more about what that actually looks like, let's take a look at sentiment here, so we can see instances in which JMP has found the word "not" before some of these sentiment terms and actually applied the appropriate score. So at the top there, "not acceptable" gets a score of -25. I show you that just to draw your attention to the fact that these negators and intensifiers are being applied automatically by JMP. But anyway, let's move on from talking about how to set the analysis up to actually looking at the results. So I'm going to bring up a version of the analysis that's already done, that is, I've already curated the sentiment dictionary, and we can actually start to interpret the results that we get out. So we have our high-level summary up here: we have more positive than negative documents, and as we discussed before, we can see how many of each. In fact, at the bottom of that table on the left, you see that we have one document that has no sentiment expressed in it whatsoever. We also have the list of sentiment terms, with "strong" occurring 14,000 times and "weak" occurring approximately 4,500 times, and looking at these either by their counts or their scores, looking at the most negative and positive, even looking at them in context, can be pretty informative in and of itself. Especially in, say, a domain like consumer research, if you want to know when people are feeling positively or expressing positivity about our brand or some products that we have, what type of language are they using? Maybe that would inform marketing efforts, let's say. This list of sentiment terms can be highly informative in that regard. Now, these reports cover a number of different areas of the economy: manufacturing, tourism, investments. And sometimes we want to zero in on one of those subdomains in particular, what we might call a feature. And if I go to this features tab in sentiment analysis, I'll click search. JMP finds some words that commonly occur with sentiment terms and asks if you want to maybe dive into the sentiment with respect to that feature. So take, for example, "sales." We can see "sales were reported weak," "weakening in sales," "sales are reported to be holding up well" and so forth. So if I just score this selected feature, now what JMP will do is provide sentiment scores only with respect to mentions of "sales" inside these larger reports, and this is going to help us refine our analysis or focus it on a really well-defined subdomain. And that's particularly important. It could be the case that the domain of the language that we're analyzing isn't so well restricted. Take, for example, product reviews. You're interested in how positive or negative people feel about the product, but they might also talk about, say, the shipping, and you don't necessarily care if they weren't too happy with the shipping, mainly because it's beyond your control. 
You wouldn't want to just include a bunch of reviews that also comment on that shipping. And so it's important to consider the domain of analysis and restrict it appropriately and the feature finder here is one way of doing that.   So you can see now that I've scored on "sales," we have a very different distribution of positive and negative documents. We have more documents that don't have a sentiment score because they   simply don't talk about sales or don't use emotional language to discuss it, and we have a different list of sentiment terms now capturing sales in particular.   Let me remove this.   One thing I realized I forgot to mention, I mentioned it briefly, is how these overall document scores that we've been looking at are calculated, and I said that they're the average of all the sentiment terms of...   that occurred in a particular document. So let's look at Document 1. I'd just like to show you that   if you're ever wondering where does this score come from, let's say, -20, you can just run your mouse right over and it'll show you a list of all the sentiment terms that appeared.   And you can see, here we have 16 of them, including all at the bottom, "real bright spot," which was a +78 and then, if you divide...add all those scores up. divide by 78...   or divide by 16, excuse me, then you get an average sentiment of -20. And this is one of two ways to provide overall scores. Another one is a min max scoring, so differences between minimum and maximum   sentiments expressed in the text.   Now we can get a lot of information from looking at just this report, you know, obviously sentiment scores, the most common terms.   But oftentimes we want to take the sentiments and use them in some other way, like   look at them graphically, like we did back in the slides. So when it comes time for that part of your analysis, just head on up to the red triangle here   and save the document scores. And these will save those scores back to your data table so that you can enter them into further analyses or graph them, whatever it is you want to do.   So that's a sneak preview of sentiment analysis coming to Text Explorer in JMP Pro 16. The take-home idea is that sentiment analysis uses a sentiment dictionary that   you set up to provide scores corresponding to the positive and negative emotional content of each document, and then from there, you can use that information in any way that's informative to you.   So we'll leave sentiment analysis behind now and I'm going to move on back to our slides to talk about the other technique coming to Text Explorer soon.   And that is term selection, where term selection answers a different question, and that is, which terms are most strongly associated with some important variable or variable that I care about?   We're going to stick with the Beige Book.   We're going to ask which words are statistically associated with recessions. So in the graph here, we have over time, the percent change   GDP (gross domestic product) quarter by quarter, where blue shows economic growth, red shows economic contraction. And we might wonder, well, what   terms, as they appear in the Beige Book, might be statistically associated with these periods of economic downturn? For example, a few of them right here.   You know, why would we want to associate specific terms in the Beige Book with periods of economic downturn?   Well, it could potentially be informative in and of itself to obtain a list of terms. 
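To make that scoring rule concrete, here is a minimal sketch (in Python, outside JMP; the term scores are hypothetical) of the two overall document summaries just described: the mean of all sentiment term scores found in a document, and a min/max style summary based on the most negative and most positive sentiments expressed:

```python
# Hypothetical sentiment term scores found in one document
term_scores = [-50, -25, 10, 78]

mean_score = sum(term_scores) / len(term_scores)   # average sentiment for the document
extremes = (min(term_scores), max(term_scores))    # most negative and most positive sentiment

print(round(mean_score, 2))   # 3.25
print(extremes)               # (-50, 78)
```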
You know, I might find some potentially, you know, subtle   drivers of or effects of recessions that I might not be aware of or easily capture in quantitative data.   I might also find some words for further analysis. I might...I might find some   potential sentiment terms, some terms that are being used when the economy is doing particularly poorly that I missed my first time around when I was doing my sentiment analysis.   Or maybe I could find some words that are so strongly associated with recessions that I think I might be able to use them in a predictive model to try to figure out when recessions might happen in the future.   So there are a few different reasons why we might want to know which words are most strongly associated with recessions.   So, how does this work in JMP? Well, we we basically build a regression model where the outcome variable is that variable we care about, recessions, and the inputs are all the different words.   The data as entered into the model takes the form of a document term matrix, where each row corresponds to   one document or one Beige Book report, and then the columns capture the words that occur in that report. Here we have the column "weak" highlighted and it says "binary," which means that   it's going to contain 1s and 0s; a 1 indicates that that report contained to the word "weak" and 0 indicates that that report didn't contain the word "weak." And this is one of several ways we could kind of score the documents, but we'll we'll stick with this binary for now.   So we take this information and we enter it into a regression model. So here's the what the launch would look like.   We have our recession as our Y variable and that's just kind of a yes or no variable, and then we have all of these binary term predictors entered in as our model effects.   And then we're going to use JMP Pro's generalized regression tool   in order to build the model, and that's because generalized regression or GenReg, as we call it, includes automatic variable selection techniques. So if you're familiar with   regularized regression, we're talking like the Lasso or the elastic net. And if you don't know what that means, that's totally fine. The idea is that it will automatically   find which terms are most strongly associated with the outcome variable "recession," and then ones that it doesn't think are associated with it, it will zero those out.   And this allows us to look at relationships between "recession" and perhaps you know hundreds, thousands of possible words that that would be associated with them.   So what do we get when we run the analysis?   We get a model out. So what we have here is the equation for that model. Don't worry about it too much. the idea is that we say   the log odds of recession, so just it's a function of the probability that we're in a recession and when the Beige Book is issued is a function of   all the different words that might occur in the Beige Book report.   And you can see, we have, you know, the effect of the occurrence of the word "pandemic" with a coefficient of 1.94.   That just means that the log odds of "recession" go up by 1.94 if the Beige Book report mentions the word "pandemic." Then we see minus 1.02 times "gain." Well, that means if the Beige Book report mentions the word "gain," then the probability of recession...   or the log odds of recession drops by 1.02.   
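The modeling step described above can be sketched outside JMP as well. JMP Pro does this with its Generalized Regression platform; in the hedged example below, scikit-learn's L1-penalized logistic regression stands in for it, the documents and recession labels are invented, and a binary CountVectorizer plays the role of the binary document term matrix:

```python
# A hedged sketch of term selection; not JMP's implementation.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

docs = [
    "sales weakened and foreclosures rose during the pandemic",
    "manufacturing reported gains and strong sales",
    "activity was postponed and cancellations increased",
    "tourism strengthened with solid gains in investment",
]
recession = [1, 0, 1, 0]   # 1 = recession, 0 = no recession (hypothetical labels)

vec = CountVectorizer(binary=True)   # binary weighting: did the term occur at all?
X = vec.fit_transform(docs)          # the document-term matrix

# The L1 (lasso) penalty zeroes out terms with no detectable association,
# which is the "selection" part of term selection; C is loosened for this tiny example.
model = LogisticRegression(penalty="l1", solver="liblinear", C=10.0)
model.fit(X, recession)

for term, coef in zip(vec.get_feature_names_out(), model.coef_[0]):
    if coef != 0:
        print(f"{term}: {coef:+.2f}")   # positive coefficient -> higher log odds of recession
```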
So we get out of that are a list of terms that are strongly associated with an increase in the probability of recession, things like "pandemic," "postponement," "cancellation," "foreclosures."   And we also get a list of terms that are associated with a decreasing probability of recession, so like "gain," "strengthen," "competition."   We also see "manufacturing" right there, but it's got a relatively small coefficient, about -.2.   And you'll actually notice here, and if we if we look at a graphical representation of all the terms that are selected in this analysis, you don't see too many specific domains like "manufacturing,"   "tourism," "investments" and all that. That's because those things are always talked about, whether we're in a recession or not, so what we're really looking for words that are used,   you know, when we're in a recession more often than you would expect by chance. So we have...for example, those are   "pandemic" being the most predictive. Makes a lot of sense. We're not talking about pandemics at all until pretty recently and then we've also experienced the recession recently, so we've picked up on that pretty clearly.   Then we have a few others in this, kind of, second tier, so that's "postponed" "cancel," "foreclosed," "deteriorate," "pessimistic."   And it's kind of interesting, this "postponement" and "cancellation" being associated with recessions. It makes sense, you really want to talk about postponing, say, economic activity   when a recession is happening, or at least that's perhaps a reliable trend. It's...so that that's insight, in and of itself. In fact, I   mean, I couldn't tell you how the Federal Reserve tracks postponing or canceling of economic activities, but the the fact that those terms, get flagged in this analysis suggests that's something probably worth tracking.   Alright, so that's term selection. We actually get this nice list of terms associated with recessions out and we can see the relative strength of association. Now let's actually see that briefly here, how it's done in JMP.   So I'm gonna head back on over to JMP and what we're going to do is pull up a slightly different data table. It's still Beige Book data, though, now we have just the national reports.   And we have this accompanying variable whether or not the US was in a recession at the time. And of course there's some auto correlation in these data. I mean, if we're in a recession last month, it's more likely we're going to be in a recession this month than if we weren't.   And you know that typically could be an issue for regression based analyses, but this is purely exploratory. We're not too too concerned with it.   So I'm going to just pull Text Explorer up from a table scripts just because we've kind of seen how it's launched before.   Note though that I've done some phrasing already, as we did before. I've also added some stop words, you can't see here, but words that I don't want them necessarily returned by this analysis.   And I've turned on stemming, which is what those little dots in the term list mean. For example, this for "increase" now is actually collapsing across "increases," "increasing," "increasingly."   And that's because now I, kind of, consider those all the same concept, and I just want to know if, you know, that concept is associated with recessions.   So to invoke term selection, we'll just go to the red triangle, and I'll select it here.   
We get a control window popping up first, or I should say section, where we select which variable we care about, that's recessions. Select the target level, so I want this to be in terms of the probability of recession, as opposed to the probability of no recession.   I can specify some model settings. If you're familiar with GenReg, you'll see that you can choose whether you want early stopping, whether you want one of two different   penalizations to perform variable selection, what statistic you want to perform validation. And if that stuff is new to you, don't worry about it too much. The default settings are good way to go at first.   We have our weighting, if you remember, we had the 1s and 0s in that column for "weak," just saying whether the word occurred in a document or not, but you can select what you want. So   for example, frequency is not, did "weak" occur or not, it's how many times did it occur. And this kind of affects the way you would interpret your results. We're going to stick with binary for now.   But I'm going to say, I want to check out the 1,000 most frequent terms, instead of the 400 by default, which you can see, that's a lot more than 436, and normally you can't fit a model with 1,000 Xs but only 436   observations, but thanks to the   automatic variable selection in generalized regression, this isn't a problem. So once again it selects which of these thousand terms are most strongly related, hence the name term selection.   So I'm gonna run this. You can see what has happened is JMP has called out to the generalized regression platform and returned these results, where up here, we see some information about the model as it was fit. For example, we have 37 terms that were returned.   Let me just move that over. Because over here on the right is where we find some really valuable information. This is the list of terms most strongly associated with recession.   Now I'll sort these by the coefficient to find those most strongly associated with the probability of recession, so once again that's "pandemic" "postponement" "cancellations" and, as you might expect, at this point if I click on one of these, it'll update these previews   or these text snippets down below, so we can actually see this word in context.   So this list of terms in itself could be incredibly valuable. You, you might learn some things about specific terms or concepts that are important that you might not have known before. You can also use these terms in predictive models.   Now a few other things to note.   You can see down here, we have once again a table by each individual document, but instead of sentiment scores, we now have basically scores from the model. We have for each one   the...   what we call the positive contribution, so this is the contribution of positive coefficients predicting the probability of recession. Here we have the ones on the negative end.   And then we even have the probability of recession from the model, 71.8% here and then what we actually observed.   And we're not building a predictive model here necessarily, that is, I'm not going to use this model to try to predict recessions. I mean, I have all kinds of economic indicator variables I would use   too, but this is a good way to basically sanity check your model. Does it look like it's getting some of the its predictions right?   Because if it's not, then you might not trust the terms that it returns. You also have plenty of other information to try to make that judgment. 
I mean, we have some fits statistics, like the area under the curve up here.   Or we can even go into the generalized regression platform, with all of its normal options for assessing your model there further as well.   I'm not going to get into the details there, but all of that is available to you so that you can assess your model, tweak your model how you like, to make sure you have a list of terms that you feel good about.   Now you see right here, we have this, under the summary, this list of models and that's because you might actually want to run multiple models. So if I go back to the model...oh, excuse me...if we go back to our   settings up here, I could actually run a slightly different model. Maybe, for example, I know that the BIC is a little more conservative than the AICc and I want to return fewer terms, maybe did an analysis that returned 900 terms and you're a little overwhelmed.   So I'll click run and build another model using that instead.   And now we have that model added to our list here, and I can switch between these models to see the results for each one. In this case, we've returned only 14 terms, instead of 37 and I would go down to assess them below.   So two big outputs you would get from this, of course, is this term list. If you want to save that out, these are just important terms to you and you want to keep track of them, just right click and make this into a data table. Now I have a JMP table   with the terms, their coefficients and other information.   And   if what you want to do is actually kind of write this back to your original data table, maybe, so that you can use the information in some other kind of analysis or predictive model,   just head up to term selection and say that you want to save the term scores as a document term matrix, which if I bring our data table back here, you can see I've now written to their   columns for each of these terms that have been selected. In this case filled in with their coefficient values, and now I can use this   quantitative information however I like.   That's just a bit then about term selection. Again, the big idea here is I have found a list of terms that are related to a variable I care about and I even have, through their coefficients, information about how strong that relationship might be.   So let's just wrap up then. We've covered two techniques. We just saw term selection, finding those important words. Before that we reviewed sentiment analysis, all about   quantifying the degree of positive or negative emotion in a text. These are two new text mining techniques coming to JMP Pro 16's Text Explorer. We're really excited for you to get your hands on them and look forward to your feedback.
Bill Worley, JMP Senior Global Enablement Engineer, SAS   Pre-processing spectroscopic data is an important first step in preparing your data for analysis in any data analysis tool. We will review and demonstrate Standard Normal Variate (SNV) and Savitzky-Golay (SG) smoothing using JMP. These pre-processing tools are fairly standard throughout industry and are a first step in building predictive models from raw spectroscopic data. Special thanks to Ian Cox for sharing his SG add-in.     Auto-generated transcript...   Speaker Transcript Bill Worley Hello everyone, this is Bill Worley, JMP systems engineer. I've been with JMP almost seven years. Today we'll be talking about pre-processing spectroscopic data for analysis. I'm calling this a prequel because I've already given a talk on analyzing spectral data within JMP, but the pre-processing is an important first step and we will show you why. Let's get into it. So let's talk about the abstract a little bit here. Pre-processing spectroscopic data is really an important first step in preparing your data, any data, for analysis, really. We're going to review and demonstrate the standard normal variate and the Savitzky-Golay smoothing tools in JMP. These pre-processing tools are fairly standard throughout the chemical world, the chemical industry, and really anybody who's doing spectroscopy and analyzing spectroscopic data would know about the standard normal variate, the Savitzky-Golay, and how they're used. So again, it's an important first step in helping you build predictive models from your raw spectroscopic data. So, with a little background: spectral data can be very messy. Pre-processing is used to help smooth, filter, and baseline correct prior to building these predictive models, or any predictive models you might want to build. The Savitzky-Golay filtering and standard normal variate are tools that are used in JMP. The Savitzky-Golay was an add-in that was developed by Ian Cox, so thanks to Ian for that. Standard normal variate requires some sort of formula and we'll show you how to build that into JMP. And then the last line: the Graph Builder is also an invaluable tool for visualizing your spectroscopic data and can be used as an alternative platform for smoothing your data. But again, we'll get into that as we get further into the talk. Why do we do this? Why would you want to do any kind of pre-processing? Well, if you look at that first, top group of spectra right there, that is the raw data from 100 spectra and you can't really tell any difference between, you know, a mixture of 100% starch or 100% gluten. You can see some differences, you just can't tell what's what. Well, what we do from there is the Savitzky-Golay smoothing and filtering, and then you take that first derivative and that gives you a much cleaner set of spectra, nicely grouped, but still not completely where you'd like to be. And then, finally, after you do that standard normal variate smoothing, the spectra are much cleaner, much easier to see where the peaks are. And you can see where the differences are in the different groupings of spectra. So again, there are 20 spectra for each one of these red lines to gray lines to blue lines, and it's definitely much cleaner. You can see where the peaks are for the different mixtures.
Okay, so with Savitzky-Golay filtering, it's a digital filter supplied to your spectral data for the purpose of smoothing. The nice thing about it is it doesn't alter the overall spectra itself, the where the peaks are and everything like that. So it's a nice tool there.   It uses convolution for smoothing and that's based on a linear least squares formula.   That first and second derivatives are also...you're also able to tame those from the smooth signal.   And filtering of any unimportant data on either end of the spectra can be done as well. If you get into situations where, you know, you have something at the beginning, it's just unimportant and it's only going to mess you up as you go forward and same thing, on the other end.   Graph Builder allows you to do some even internal smoothing or internal filtering that you cannot do with the Savitzky-Golay.   add-in. So you can use the local data filter with the Graph Builder and further define areas of interest.   So,   let's go on. Savitzky-Golay   analysis is that we've got the...we've done...if we do that analysis, this is a grouping of spectra. So we've got the smooth,   the first derivative, now stretched out a little bit. And we put the zero line in there. And with the first derivative, you can   tell where the peaks are, based on the zero points are or crossing of the zero points, right, so that gives you an idea of where the peaks are for any given   group of spectra, set of spectra. And for the second derivative, now that's a little bit different. It helps flatten the baseline and then, if you have peaks that are overlapping, it helps you   further develop or better see where those are overlapping and you go from a positive to a negative, which is an indication of overlapping peaks and where they might be. Okay, and that's how you use that tool.   The standard normal variate itself is again the standardization of the data with respect to the individual spectra. It can be used alone or as an added smoothing tool and we're using it as an added tool today.   It's used...   it's used for baseline shift correction and correcting for global intensity variations. The nice thing is, it removes baseline variation without altering the shape of the spectra, right.   And the data, the thing about this is, and you'll see this when we do it, is that the data must be stacked, okay.   All right, this is just one of the formulas that   might be used in setting up the standard normal variate formula, and I'll show you how to do that. There's a couple of different ways, but it gets a little bit involved and I'll show you how to do that. The important part here is that   we have the standard formula here for the column standardized, but we have to make sure that we use it a by variable and that's the spectra itself. So   there's a little bit of a trick to adding that.   Right, so if it did the standard normal variate on the raw data only, this is what it would look like.   We can see the groupings, we can see, you know, kind of how things break out, but if we did that standard normal variate after we do that first derivative, it's much cleaner. You get a much better idea of where the spectra or where they overlap, where they, you know,   where you might have to do a little bit...where that second derivative would come into play, where you have overlapping peaks maybe here, right,   and maybe back over here. So   that's where the this really helps build the better vision of what you're seeing. 
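As a rough sketch of the two pre-processing steps being described (done here in Python with NumPy and SciPy on simulated spectra, not in JMP), Savitzky-Golay is a moving-window least-squares polynomial smoother that can also return derivatives, and SNV is simply a per-spectrum standardization to mean 0 and standard deviation 1:

```python
# A sketch of Savitzky-Golay smoothing/derivatives and SNV on simulated spectra.
import numpy as np
from scipy.signal import savgol_filter

rng = np.random.default_rng(0)
wavelengths = np.linspace(1100, 2500, 200)
spectra = np.sin(wavelengths / 150.0) + 0.05 * rng.standard_normal((20, 200))  # 20 noisy spectra

# Savitzky-Golay: local least-squares polynomial fit in a moving window.
# deriv=1 returns the smoothed first derivative; peak positions are not shifted.
first_deriv = savgol_filter(spectra, window_length=15, polyorder=2, deriv=1, axis=1)

# SNV: standardize each spectrum (each row) to mean 0 and standard deviation 1,
# which removes baseline offsets and global intensity differences.
snv = (first_deriv - first_deriv.mean(axis=1, keepdims=True)) / first_deriv.std(axis=1, keepdims=True)
print(snv.shape)  # (20, 200)
```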
And that zero line, again: as you cross that zero line, it's an indication of peaks, of where the peaks are, using the first derivative. Alright, so let's get into a demo. Alright, get out of there. So let's talk about visualizing the data first. Remember, whenever you get a new set of data, the best thing to do is pull it in and visualize it. You know, plot that data, get a better idea of what it looks like and what you're up against. And let's go down here and pull up our gluten raw data. Right, so we've imported the data. Now we need to do some visualization on it, so let's go to Graph Builder. Right, and the first thing we're going to do, move this up a little bit, is go to raw data. Right, now we're gonna pull that in. And this is the point data, so this is what we saw before we did any smoothing, before we did anything else. That was that line data and, if I change that to a parallel plot, right, that gives me the line data for the spectra. Again, we can't see a lot of what's going on there, but it gives us a better idea of where things are falling out. We can do some things down here if we need to, something called parallel merged, and you'll see this with other data. But that didn't change anything. It didn't pull anything together like you'll see with some of the other data, so I'm going to close that. Right, and one of the other tools that I like to use when I'm visualizing the data is the model driven multivariate control chart, so let's do that first. So we go to Analyze, Quality and Process, and Model Driven Multivariate Control Chart. Move this up. Right, I don't have any historical data, so I say Okay. And what this allows us to do is look at the data and say, okay, we can see groupings of spectra, right. And for the most part, and I just want to point this out, we're able to explain most of the variation that we're seeing, at least up to 85% of the variation, with two principal components, okay. And that's using this Hotelling's T squared information we've got here. And this also tells us where things might be out of control for a given spectrum, right. Why is it different than the rest of the group? So if I hover over one of these points, I can pull that out, right. If I hover over one of the out of control points, again, pull that out, and now we can see, wow, those really are different, you know. There's something different about this spectrum that we have to, you know, maybe better understand. In this case, we already have a better understanding, because we know that they're mixtures, but we want to see that. And let's pull this one out too. Alright, so that's one more again. So between those three spectra, we've got quite a variety of, you know, the mixture. So we've got three of the groups and we could put the other two in there as well, just to see. But we'll leave it at that for now. Okay, so that's the model driven multivariate control chart. The other tool that I like to use is under multivariate methods and it's called multidimensional scaling. And the reason I like to use this, especially for systems like this, is that we've got our raw data again, right, and we need to... let me set this up first and I'll tell you.
We've got two dimensions here, right, and that's just based on what we saw before, but the reason I like to use is this allows us to do some grouping or visualize some grouping of the data itself. Alright,   let's do that. This takes a second to come up.   And we can see that we've got five distinct groups of spectra, right, so each one of those has to do with a group of our mixture of starch and gluten. If I highlight this grouping right here, that's the first group, which is all, you know, 100% starch.   The middle group is a 50/50 blend and this group down here is 100% gluten. Right, so what that does is it also highlights the data back in the data table, just like shown there, so that   interactivity is also nice. What this also shows us is that we're doing a fairly good job of already...already of being able to break out what the different groups are. If we look at the Shepard Diagram,   we have R square of 1 and   straight line, alright. So again, a good way to visualize it, and there's more you can do with this, but we'll leave that for another time.   Okay, so far,   we've brought the data in, we visualized it, right, and now we need to do some pre processing.   And what we're going to do, I've got these, more or less, here's placeholders so we're going to do some Savitzky-Golay analysis and we're going to do some standard normal varite, at least on the first derivative, okay.   Alright, so let me clean this up a little bit.   Right, close that up.   And again we're going to take that raw data and we're going to use that Savitzky-Golay add-in to do the analysis. So that's under add-ins for me.   And this is a free add-in, so you can get it and we'll put it out there for you. So this is again, we're going to take the raw data,   right, and if you want to learn more about what Savitzky-Golay is all about, we have a Wikipedia link here so you can use that. But we're going to use this to help smooth the data at every wavelength, right, and then take that first and second derivative. And I say, okay.   And there we go, and we've got our smooth data here.   And one thing, you know, you can adjust this as needed, but what we want to do is kind of take this,   widen out that window a little bit.   Alrigh.   Do that one.   And you can play around with the polynomials here, the order of polynomial fit.   In this case, it really doesn't make much difference to go to, what, an eighth order polynomial. A second one...the second order does just fine. Alright, and then we can take that data from our first derivative.   Say save smooth data, right, and we can save that back to the data table. And I've already done that, so I don't have to show you that process, but just know that the data is then taken back to the data table from there, okay.   Let's close that out.   Alright, and we're good to go now, but now we need to do that last step, right. So let's look at...before we go there, let's go back and look at   our Savitzky-Golay data. So this is the first derivative data. I would again want to look at this under the   Graph Builder, right. And see that's...just doesn't look very promising there, but let's do parallel plot,   makes it a little better. And clean it up and then let's do this combined scales parallel merge, and this brings the data into a much more   visual...a better visual than we had before, right. We can play the...play with that and again we could add our own zero line here to get a better understanding of where the peaks cross.   But that's that first derivative data. 
It's...you can see the groups, but they're not, you know, they're not completely separated like you'd really like to have them to get a better idea, okay.   So we're set there. That data is back in the data table. I'm going to close that out, and then remember I said, you have to stack the data for the...to do the standard normal variate, so we need to do that. Let's subset out some data here first.   Alright, so   subset all of these out.   Tables. Subset.   All rows. Selected columns, and say okay. Alright, so now we're good to go there.   Right.   Well, at least, we are...we've got the data separated out, right. Now we have to do...stack the data, right. And again, this is to be able to put the filter to all the data that we need for the formula.   So under Tables. Stack.   And these are our columns that we're stacking.   Right, and stack by row.   I'm going to select the columns that we want to keep.   On top of that. And after,   I'm sorry...I'm just clicking around just to get this setup, but you'll get the idea. I'll leave this up for a second for folks to see what it takes to get there. So I've saved this.   You know, see the untitled 12 now. That's our stacked data, right, and now we can perform that standard normal variate. A quick way to do that is to make a new column,   right click.   And I have to build this formula from here, from the formula editor. I'm going to use this statistical tool called   column standardize, right. And if I hover over this, over here, this gives you an idea of what what you're doing with that. What does standardized means, right. So we're going to...   a mean of 0 and a standard deviation of 1, so that's what we're trying to get with the data and with respect to all the given spectra. So we need to take the data   here and then, remember I said we had to add that bivariable, which is, in this case, going to be the spectra. I'm going to say okay.   Alright, and that gives us a data set. Now there's a simpler way to do this, or what I what I think is a simpler way, but it can also be...   it's a little bit more time-consuming sometimes. But if I right-click on the spectra right here, and again, remember i'm doing this because we need a bivariable. So I'm going to go to new column info and make that group by.   And I'm going to take my data column and I'm going to right-click on that and I'm going to do a new formula column.   We're going to slide over to distributional and remember, standardized. We're...so we want that   mean of 0; standard deviation of 1. Select that and then that builds out that new column. The nice thing about this is that the data matches, so I can breathe a sigh of relief that I got it right.   Right, so i've got that. Now, this would require from here, I'm going to click the that column,   that we take this data, split it and then put it back into the regular data table or the initial data table. One thing I want to show you with this is that   once you have the data stacked, you can do some other visualizations, right. So if I go to the...back to the Graph Builder right, I want to take this data   into the   Y, and then i'm gonna pull the label over, and this is a cool way to see where all the variation is. So these are all the,   you know, box plots for where the...you're seeing the variation in the data, so that's just a really nice way of being able to visualize what's going on.   Another powerful way of using the Graph Builder, and the data has to be stacked in this case to do that. 
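For readers who want the same "stack, then standardize within each spectrum" step outside JMP, here is a minimal pandas sketch; the column names and values are made up, and it parallels the Col Standardize formula with the spectrum as the by-variable described above:

```python
# A minimal pandas sketch of SNV on stacked data: standardize within each spectrum.
import pandas as pd

stacked = pd.DataFrame({
    "spectrum": ["s1"] * 4 + ["s2"] * 4,
    "absorbance": [0.10, 0.20, 0.40, 0.30, 1.10, 1.30, 1.70, 1.50],
})

# Within each spectrum, subtract that spectrum's mean and divide by its standard deviation
stacked["snv"] = stacked.groupby("spectrum")["absorbance"].transform(
    lambda x: (x - x.mean()) / x.std()
)
print(stacked)
```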
Now, what you can also do is   if you pull in the point data, let's take off that, put the point data in there and then we could do some smoothing. Now that's an individual smooth so that's an average...for every...for all the data. What we would like to do is do that   Savitzky-Golay smoothing, so I need to pull in   the overlay, right. So that's all the data. Now it's going to be hard to see, but I'm going to do this, hopefully you can see this, if I select under smoother, we're going to switch to Savitzky-Golay.   Right, and I think you're...hopefully, you saw that that that now fits that line, that smoothing line right there.   That's another way of smoothing the data, right, using the Savitzky-Golay. And then you can change this to a quadratic, you can do some local trimming here, and then ultimately, you could save that back to the data table.   I'm going to leave it at that for now.   Right and remember, we had to split this data, save it back to the data table. I'm going to forego the splitting step and then just go back to the data...the original data table.   I don't need that.   Right and then when I   look at the data again under Graph Builder,   I may take this standard normal variate group,   drop that in. Again, doesn't look very promising but we'll do the parallel plot and then right-click.   Combine those scales.   And that's basically our finished smoothing, alright. So we've got it all set up. Again we could add a zero line here,   if needed. Let's go ahead and add that.   Zero and we'll make that some sort of green,   there's that, add it.   Say Okay, and then again, you would use this to help you figure out where your peaks are.   Based on the peak maximum as I crossed that zero point, okay.   So pretty much, that's it. Let's go back to our   PowerPoint.   So in conclusion, pre processing...we needed to pre process the data...pre process the data to get it cleaner, to get it, you know, much easier to work with.   Where also with JMP...   JMP is making this...   we're working to make this a much easier process. Pre processing in JMP 16 is going to be supported by add-ins and built-in capability.   We're looking forward to JMP 17 where pre processing, what's going to be offered in the Functional Data Explorer with much more sophisticated spectral and baseline smoothing options, along with some peak detection and selection.   For analyzing spectral data, please, there...there was a talk that I did back in 2020 for the Discovery   Munich. That was a virtual talk, but that will give you a good idea of how to use JMP to analyze your spectral data.   And with that, I'll say thank you and, hopefully, you all found...find this useful. Please let me know.
Christopher Gotwalt, JMP Director of Statistical R&D, SAS   There are often constraints among the factors in experiments that are important not to violate, but are difficult to describe in mathematical form. These constraints can be important for many reasons. If you are baking bread, there are combinations of time and temperature that you know will lead to inedible chunks of carbon. Another situation is when there are factor combinations that are physically impossible, like attaining high pressure at low temperature. In this presentation, we illustrate a simple workflow of creating a simulated dataset of candidate factor values. From there, we use the interactive tools in JMP's data visualization platforms in combination with AutoRecalc to identify a physically realizable set of potential factor combinations that is supplied to the new Candidate Set Design capability in JMP 16. This then identifies the optimal subset of these filtered factor settings to run in the experiment. We also illustrate the Candidate Set Designer's use on historical process data, achieving designs that maximize information content while respecting the internal correlation structure of the process variables. Our approach is simple and easy to teach. It makes setting up experiments with constraints much more accessible to practitioners with any amount of DOE experience.       Auto-generated transcript...   Transcript Hello, Chris Gotwalt here. Today, we're going to be constructing the history of graphic paradoxes and... oh wait, wrong topic. Actually we're going to be talking about candidate set designs, tailoring DOE constraints to the problem. So industrial experimentation for product and process improvement has a long history with many threads that I admit I only know a tiny sliver of. The idea of using observation for product and process innovation is as old as humanity itself. It received renewed focus during the Renaissance and Scientific Revolution. During the subsequent Industrial Revolution, science and industry began to operate more and more in lockstep. In the early 20th century, Edison's lab was an industrial innovation on a factory scale, but it was done, to my knowledge, outside of modern experimental traditions. Not long after R.A. Fisher introduced concepts like blocking and randomization, his associate and then son-in-law, George Box, developed what is now probably the dominant paradigm in design of experiments, with the most popular book being Statistics for Experimenters by Box, Hunter and Hunter. The methods described in Box, Hunter and Hunter are what I call the taxonomical approach to design. So suppose you have a product or process you want to improve. You think through the things you can change, the knobs you can turn, like temperature, pressure, time, ingredients you can use or processing methods that you can use. These things become your factors. Then you think about whether they are continuous or nominal, and if they are nominal, how many levels they take, or the range over which you're willing to vary them. If a factor is continuous, then you figure out the name of the design that most easily matches up to the problem and resources, that fits your budget. That design will have a name like a Box-Behnken design, a fractional factorial, a central composite design, or possibly something like a Taguchi array.
There will be restrictions on the numbers of runs, the level...the numbers of levels of categorical factors, and so on, so there will be some shoehorning the problem at hand into the design that you can find. For example, factors in the BHH approach, Box Hunter and Hunter approach, often need to be whittled down to two or three unique values or levels. Despite its limitations, the taxonomical approach has been fantastically successful. Over time, of course, some people have asked if we could still do better. And by better we mean to ask ourselves, how do we design our study to obtain the highest quality information pertinent to the goals of the improvement project? This line of questioning lead ultimately to optimal design. Optimal design is an academic research area. It was started in parallel with the Box school in the '50s and '60s, but for various reasons remained out of the mainstream of industrial experimentations, until the custom designer and JMP. The philosophy of the custom designer is that you describe the problem to the software. It then returns you the best design for your budgeted number of runs. You start out by declaring your responses along with their goals, like minimize, maximize, or match target, and then you describe the kinds of factors you have, continuous, categorical mixture, etc. Categorical factors can have any number of levels. You give it a model that you want to fit to the resulting data. The model assumes at least squares analysis and consists of main effects and interactions in polynomial terms. The custom designer make some default assumptions about the nature of your goal, such as whether you're interested in screening or prediction, which is reflected in the optimality criterion that is used. The defaults can be overridden with a red triangle menu option if you are wanting to do something different from what the software intends. The workflow in most applications is to set up the model. Then you choose your budget, click make design. Once that happens, JMP uses a mixed, continuous and categorical optimization algorithm, solving for the number of factors times the number of rows terms. Then you get your design data table with everything you need except the response data. This is a great workflow as the factors are able to be varied independent from one another. What if you can't? What if there are constraints? What if the value of some factors determine the possible ranges of other factors? Well then you can do....then you can define some factor constraints or use it disallowed combinations filter. Unfortunately, while these are powerful tools for constraining experimental regions, it can still be very difficult to characterize constraints using these. Brad Jones' DOE team, Ryan Lekivetz, Joseph Morgan and Caleb King have added an extraordinarily useful new feature that makes handling constraints vastly easier in JMP 16. These are called candidate or covariate runs. What you can do is, off on your own, create a table of all possible combinations of factor settings that you want the custom designer to consider. Then load them up here and those will be the only combinations of factor settings that the designer will... will look at. The original table, which I call a candidate table, is like a menu factor settings for the custom designer. This gives JMP users an incredible level of control over their designs. 
What I'm going to do today is go over several examples to show how you can use this to make the custom designer fulfill its potential as a tool that tailors the design to the problem at hand. Before I do that, I'm going to get off topic for a moment and point out that in the JMP Pro version of the custom designer, there's now a capability that allows you to declare limits of detection at design time. If you want a non missing values for the limits here the custom designer will add a column property that informs the generalized regression platform of the detection limits and it will then automatically get the analysis correct. This leads to dramatically higher power to detect effects and much lower bias in predictions, but that's a topic for another talk. Here are a bunch of applications that I can think of for the candidate set designer. The simplest is when ranges of a continuous factor depend on the level of one or more categorical factors. Another example is when we can't control the range of factors completely independently, but the constraints are hard to write down. There are two methods we can use for this. One is using historical process data as a candidate set, and then the other one is what I call filter designs where you create...design a giant initial data set using random numbers or a space filling design and then use row selections in scatter plots to pick off the points that don't satisfy the constraints. There's also the ability to really highly customize mixture problems, especially situations where you've got multilayer mixturing. This isn't something that I'm going to be able to talk about today, but in the future this is something that you should be looking to be able to do with this candidate set designer. You can also do nonlinear constraints with the filtering method, the same ways you can do other kinds of constraints. It's it's very simple and I'll have a quick example at the very end illustrating this. So let's consider our first example. Suppose you want to match a target response in an investigation of two factors. One is equipped...an equipment supplier, of which there are two levels and the other one is the temperature of the device. The two different suppliers have different ranges of operating temperatures. Supplier A's is more narrow of the two, going from 150 to 170 degrees Celsius. But it's controllable to a finer level of resolution of about 5 degrees. Supplier B has a wider operating range going from 140 to 180 degrees Celsius, but is only controllable to 10 degrees Celsius. Suppose we want to do a 12 run design to find the optimal combination of these two factors. We enumerate all possible combinations of the two factors in 10 runs in the table here, just creating this manually ourselves. So here's the five possible values of machine type A's temperature settings. And then down here are the five possible values of Type B's temperature settings. We want the best design in 12 runs, which exceeds the number of rows in the candidate table. This isn't a problem in theory, but I recommend creating a copy of the candidate set just in case so that the number of runs that your candidate table has exceeds the number that you're looking for in the design. Then we go to the custom designer. Push select covariate factors button. Select the columns that we want loaded as candidate design factors. Now the candidate design is loaded and shown. Let's add the interaction effect, as well as the quadratic effect of temperature. 
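Because the candidate table in this first example is small, it can be written down directly. Here is a hedged sketch in Python (pandas; the column names are mine) of enumerating the 10 feasible supplier-by-temperature combinations that serve as the "menu" loaded into the custom designer as covariate/candidate runs:

```python
# A sketch of writing the candidate table down directly.
import pandas as pd

candidates = pd.DataFrame(
    [("A", t) for t in range(150, 171, 5)] +    # Supplier A: 150-170 C, controllable in 5 C steps
    [("B", t) for t in range(140, 181, 10)],    # Supplier B: 140-180 C, controllable in 10 C steps
    columns=["supplier", "temperature"],
)
print(candidates)   # 10 candidate rows: the "menu" of allowed factor combinations
```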
Now we're at the final step before creating the design. I want to explain the two options you see in the design generation outline node. The first one will force in all the rows that are selected in the original table or in the listing of the candidates in the custom designer. So if you have checkpoints that are unlikely to be favored by the optimality criterion and want to force them into the design, you can use this option. It's a little like taking those same rows and creating an augmented design based on just them, except that you are controlling the possible combinations of the factors in the additional rows. The second option, which I'm checking here on purpose, allows the candidate rows to be chosen more than once. This will give you optimally chosen replications and is probably a good idea if you're about to run a physical experiment. If, on the other hand, you are using an optimal subset of rows to try in a fancy new machine learning algorithm like SVEM (a topic of one of my other talks at the March Discovery conference), then you would not want to check this option. Basically, if you don't have all of your response values already, I would check this box, and if you already have the response values, then don't. Reset the sample size to 12 and click make design. The candidate design in all its glory will appear just like any other design made by the custom designer. As we see in the middle JMP window, JMP also selects the rows in the original table chosen by the candidate design algorithm. Note that 10, not 12, rows were selected. On the right we see the new design table; the rightmost column in the table indicates the row of origin for that run. Notice that original rows 11 and 15 were chosen twice and are replicates. Here is a histogram view of the design. You can see that different values of temperature were chosen by the candidate set algorithm for different machine types. Overall, this design is nicely balanced, but we don't have 3 levels of temperature in machine type A. Fortunately, we can select the rows we want forced into the design to ensure that we have 3 levels of temperature for both machine types. Just select the rows you want forced into the design in the covariate table. Check the include all selected covariate rows in the design option. And then if you go through all of that, you will see that now both levels of machine have at least three levels of temperature in the design. So the first design we created is on the left and the new design, forcing there to be 3 levels of machine type A's temperature settings, is over here to the right. My second example is based on a real data set from a metallurgical manufacturing process. The company wants to control the amount of shrinkage during the sintering step. They have a lot of historical data and have applied machine learning models to predict shrinkage and so have some idea what the key factors are. However, to actually optimize the process, you should really do a designed experiment. As Laura Castro-Schilo once told me, causality is a property not of the data, but of the data-generating mechanism, and as George Box says on the inside cover of Statistics for Experimenters, to find out what happens when you change something, it is necessary to change it. Now, although we can't use the historical data to prove causality, there is essential information about what combinations of factors are possible that we can use in the design.
We first have to separate the columns in the table that represent controllable factors from the ones that are more passive sensor measurements or quantities that cannot be controlled directly. A glance at the scatter plot of the potential continuous factors indicates that there are implicit constraints that could be difficult to characterize as linear constraints or disallowed combinations. However, these represent a sample of the possible combinations that can be used with the candidate designer quite easily. To do this, we bring up the custom designer. Set up the response. I like to load up some covariate factors. Select the columns that we can control as DOE factors and click OK. Now we've got them loaded. Let's set up a quadratic response surface model as our base model. Then select all of the model terms except the intercept. Then do a control plus right click and convert all those terms into if-possible effects. This, in combination with the response surface model chosen, means that we will be creating a Bayesian I-optimal candidate set design. Check the box that allows for optimally chosen replicates. Enter the sample size. It then creates the design for us. If we look at the distribution of the factors, we see that it has tried hard to pursue greater balance. On the left, we have a scatterplot matrix of the continuous factors from the original data and on the right is the hundred-row design. We can see that in the sintering temperature, we have some potential outliers at 1220. One would want to make sure that those are real values. In general, you're going to need to make sure that the input candidate set is clear of outliers and of missing values before using it as a candidate set design. In my talk with Ron Kenett in the March 2021 Discovery conference, I briefly demo how you can use the outlier and missing value screening platforms to remove the outliers and replace the missing values so that you could use them at a subsequent stage like this. Now suppose we have a problem similar to the first example, where there are two machine types, but now we have temperature and pressure as factors, and we know that temperature and pressure cannot vary independently and that the nature of that dependence changes between machine types. We can create an initial space filling design and use the data filter to remove the infeasible combinations of factor settings separately for each machine type. Then we can use the candidate set designer to find the most efficient design for this situation. So I've been through this now, and I've created my space filling design. It's got 1,000 runs and I can bring up the global data filter on it and use it to shave off different combinations of temperature and pressure so that we can have separate constraints by machine type. So I use the Lasso tool to cut off a corner in machine B. And I go back and I cut off another corner in machine B, so machine B is the machine that has, kind of, a wider operating region in temperature and pressure. Then we switch over to machine A. And we're just going to use the Lasso tool to shave off the points that are outside its operating region. And we see that its operating region is a lot narrower than machine B's. And here's our combined design. From there we can load that back up into the custom designer. Put an RSM model there, then set our number of runs to 32, allowing covariate rows to be repeated. And it'll crank through.
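The filtering workflow just demonstrated can be mimicked outside JMP to see the idea: generate a large random or space-filling candidate set, then keep only the rows that satisfy each machine type's constraints. The sketch below is Python with invented constraint boundaries, standing in for the interactive Lasso-tool filtering shown in JMP:

```python
# A sketch of the filter-design idea; constraint boundaries are made up for illustration.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 1000
cand = pd.DataFrame({
    "machine": rng.choice(["A", "B"], size=n),
    "temperature": rng.uniform(140, 180, size=n),
    "pressure": rng.uniform(1.0, 5.0, size=n),
})

# Machine A: narrow operating window; machine B: wider window with one corner removed.
feasible_a = (cand["machine"] == "A") & cand["temperature"].between(150, 170) & cand["pressure"].between(2.0, 4.0)
feasible_b = (cand["machine"] == "B") & ~((cand["temperature"] > 175) & (cand["pressure"] > 4.5))

filtered = cand[feasible_a | feasible_b]   # this filtered table becomes the candidate set
print(len(filtered), "feasible candidate rows")
```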
Once it's done that, it selects all the points that were chosen by the candidate set designer. And here we can see the points that were chosen. They've been highlighted, and the original set of candidate points that were not selected are gray. We can bring up the new design in Fit Y by X and we can see a scatterplot where the machine A design points are in red. They're in the interior of the space, and then the Type B runs are in blue. It had the wider operating region and that's how we see these points out here, further out, for it. So we have quickly achieved a design with linear constraints that change with a categorical factor, without going through the annoying process of deriving the linear combination coefficients. We've simply used basic JMP 101 visualization and filtering tools. This idea generalizes to other nonlinear constraints and other complex situations fairly easily. So now we're going to use filtering and Multivariate to set up a very unique new type of design that I assure you, you have never seen before. Go to the Lasso tool. We're going to cut out a very unusual constraint. And we're going to invert selection. We're going to delete those rows. Then we can speed this up a little bit. We can go through and do the same thing for other combinations of X1 and the other variables, carving out a very unusually shaped candidate set. We can load this up into the custom designer. Same thing as before. Bring our columns in as covariates, set up a design with all high-order interactions made if possible, with a hundred runs. And now we see our design for this very unusual constrained region that is optimal given these constraints. So I'll leave you with this image. I'm very excited to hear what you were able to do with the new candidate set designer. Hats off to the DOE team for adding this surprisingly useful and flexible new feature. Thank you.
Vince Faller, Chief Software Engineer, Predictum  Wayne Levin, President, Predictum   This session will be of interest to users who work with JMP Scripting Language (JSL). Software engineers at Predictum use a continuous integration/continuous delivery (CI/CD) pipeline to manage their workflow in developing analytical applications that use JSL. The CI/CD pipeline extends the use of Hamcrest to perform hundreds of automated tests concurrently on multiple levels, which factor in different types of operating systems, software versions and other interoperability requirements. In this presentation, Vince will demonstrate the key components of Predictum’s DevOps environment and how they extend Hamcrest’s automated testing capabilities for continuous improvement in developing robust, reliable and sustainable applications that use JSL: Visual Studio Code with JSL extension – a single code editor to edit and run JSL commands and scripts in addition to other programming languages. GitLab – a management hub for code repositories, project management, and automation for testing and deployment. Continuous integration/continuous delivery (CI/CD) pipeline – a workflow for managing hundreds of automated tests using Hamcrest that are conducted on multiple operating systems, software versions and other interoperability requirements. Predictum System Framework (PSF) 2.0 – our library of functions used by all client projects, including custom platforms, integration with GitLab and CI/CD pipeline, helper functions, and JSL workarounds.     Auto-generated transcript...   Speaker Transcript Wayne Levin Welcome to our session here on extending Hamcrest automated testing of JSL applications for continuous improvement. What we're going to show you here, our promise to you, is we're going to show you how you too can build a productive cost-effective high quality assurance, highly reliable and supportable JMP-based mission-critical integrated analytical systems. Yeah that's a lot to say but that's that's what we're doing in this in this environment. We're quite pleased with it. We're really honored to be able to share it with you. So here's the agenda we'll just follow here. A little introduction, my self, I'll do that in a moment, and just a little bit about Predictum, because you may not know too much about us, our background, background of our JSL development, infrastructure, a little bit of history involved with that. And then the results of the changes that we've been putting in place that we're here to share with you. Then we're going to do a demonstration and talk about what's next, what we have planned for going forward, and then we'll open it up, finally, for any questions that that you may have. So I'm Wayne Levin, so that's me over here on the right. I'm the president of Predictum and I'm joined with Vince Faller. Vince is our chief software engineer who's been leading this very important initiative. So just a little bit about us, right there. We're a JMP partner. We launched in 1992, so 29 years old. We do training in statistical methods and so on, using JMP, consulting in those areas and we spend an awful lot of time building and deploying integrated analytical applications and systems, hence why this effort was very important to us. We first delivered JMP application with JMP 4.0 in the year 2000, yeah, indeed over 20 years ago, and we've been building larger systems. Of course, since back then, it was too small little tools, but we started, I think, around JMP 8 or 9 building larger systems. 
So we've got quite a bit of history on this, over 10 years easily. So just a little bit of background...until about the second half of 2019, our development environment was really disparate; it was piecemeal. Project management was there, but again, everything was kind of broken up. We had different applications for version control and for managing time, you know, our developer time, and so on, and just project management generally. Developers were easily spending, and we'll talk about this, about half their time just doing routine mechanical things, like encrypting and packaging JMP add-ins. You know, maintaining configuration packages and separating the repositories, or what we generally call repos, for encrypted and unencrypted scripts. There was a lot we had to think about that wasn't really development work. It was really work that developer talent was wasted on. We also had, like I said, we've been doing it a long time, even in 2019 we had easily over 10 years of legacy framework going all the way back to JMP 5, and it was getting bloated and slow. And we know JMP has come a long way over the years. I mean, in JMP 9 we got namespaces, and JMP 14 introduced classes, and that's when Hamcrest began. And it was Hamcrest that really allowed us to go forward with this major initiative. So we began this major initiative back in August of 2019. That's when we acquired our first GitLab licenses, and that's when the development of our new development architecture, there you go, started to take shape, and it's been improving ever since. Every month, basically, we've been adding and building on our capabilities to become more and more productive as we go forward. And that's continuing, so we actually consider this, if you will, a Lean type of effort. It really does follow Lean principles, and it's accelerated our development. We have automated testing, thanks to this system, and Vince is going to show us that. And we have this little model here, test early and test often. And that's what we do. It supports reusing code, and we've redeveloped our Predictum System Framework; it's now 2.0. We've learned a lot from our earlier effort. Pretty much all of it's gone, and it's been replaced and expanded. And Vince will tell us more about that. Easily we have over a 50% increase in productivity, and I'm just going to say the developers are much happier. They're less frustrated. They're more focused on their work, I mean the real work that developers should be doing, not the tedious sort of stuff. There's still room for improvement, I'm going to say, so we're not done, and Vince will tell us more about that. We have development standards now, so we have style guides for functions, and all of our development is functionally based, you might say. Each function requires at least one Hamcrest test, and there are code reviews that the developers share with one another to ensure that we're following our standards. And it raises questions about how to enhance those standards, make them better. We also have these sort of fun sessions where developers are encouraged to break code, right, so they're called, like, break code challenges, or what have you. So it's become part of our modus operandi and it all fits right in with this development environment. It leads to, for example, further Hamcrest tests being added. 
We have one small, fairly small project that we did just over a year ago, and we're going into a new phase of it. It's got well over 100 Hamcrest tests built into it, and they get run over and over again through the development process. Some other benefits: it allows us to assign and track our resource allocation, like which developers are doing what. Everyone knows what everyone else is doing. With continuous integration and continuous deployment, code collisions are detected early, so if we have, and we do, multiple people working on some projects, and somebody's changing a function over here that's going to collide with something someone else is doing, we're going to find out much sooner. It also allows us to improve supportability across multiple staff. We can't have code dependent on a particular developer; we have to have code that any developer or support staff can support going forward. So that was an important objective of ours as well. And it does advance the whole quality assurance area generally, including supporting, you know, FDA requirements concerning QA, things like validation, the IQ/OQ/PQ. We're automating or semi-automating those tasks as well through this infrastructure. We use it internally and externally; you may know we have some products out there, (???)Kobe sash lab but new ones spam well Kobe send spam(???), that are also talked about elsewhere at the JMP Discovery Europe conference in 2021. You might want to go check them out. They're fairly large code bases, and they're all developed this way; in other words, we eat our own dog food, if you know that expression. But we also use it with all of our client development, and this is something that's important to our clients, because we're building applications that they're going to be dependent on. And so we need to have the infrastructure that allows us to be dependable; that's a big part of this. I mentioned the Predictum System Framework. You can see some snippets of it here. It's right within the Scripting Index, and you know, we see the arguments and the examples and all that. We built all that in, and over 95% of them have Hamcrest tests associated with them. Of course, our goal is to make sure that all of them do, and we're getting there. This framework is actually part of our infrastructure here; that's one of the important elements of it. Another is Hamcrest, the ability to do the unit testing. There's a slide at the end which will give you a link into the Community where you can learn more about Hamcrest. This is a development that was brought to us by JMP back in JMP 14, as I mentioned a few minutes ago. GitLab is a big part of this; that gives us the project management, the repositories, the CI/CD pipeline, etc. And also there's a Visual Studio Code extension for JSL that we created. You see five stars there because it was given five stars on the Visual Studio... I'm not sure what we call that. Vince, maybe you can tell us: the store, what have you. It's been downloaded hundreds of times, and we've been updating it regularly. So that's something you can go and look for as well. I think we have a link for that in the resource slide at the end. So what I'm going to do now is pass this over to Vince Faller. 
Vince is, again, our chief software engineer. Vince led this initiative, starting in August 2019, as I said. It was a lot of hard work, and the hard work continues. We're all, in the company, very grateful for Vince and his leadership here. So with that said, Vince, why don't you take it from here? I'm gonna... I'm... Vince Faller Sharing. So Wayne said Hamcrest a bunch of times. For people that don't know what Hamcrest is, it is an add-in created by JMP; Justin Chilton and Evan McCorkle were leading it. It's just a unit testing library that lets you run tests and get the results in an automated way. It really got the ball rolling on us being able to even do this, hence why it's called extending. I'm going to be showing some stuff with my screen. I work pretty much exclusively in the VSCode extension that we built. This is VSCode. We do this because it has a lot of built-in functionality, or extendable functionality, that we don't have to write, like Git integration and GitLab integration. Here you can see this is a JSL script and it reads it just fine. If you want to get it, if you're familiar with VSCode, it's just a lightweight text editor; you just type in JMP and you'll see it. It's the only one. But we'll go to what we're doing. So, for any code change we make, there is a pipeline run. We'll just kind of show what it does. So if I change the README file to "this is a demo for Discovery 2021," I'm just going to commit that. If you don't know Git, committing is just saying I want a snapshot of exactly where we are at the moment, and then you push it to the repo and it's saved on the server. Happy day. Commit message: more readme info. And I can just do git push, because VSCode is awesome. Pipeline demo. So now I've pushed it, and there is going to be a pipeline running. I can just go down here and click this and it will give me my merge request. So now the pipeline has started running, and I can check its status. What it's doing right now is going through and checking that it has the required Hamcrest files. We have some requirements that we enforce so that we can make sure that we're doing our jobs well. And then it's done. I'm going to press encrypt. Now encrypt is going to take the whole package and encrypt it. If we go over here, this is just a VM somewhere. It should start running in a second. So it's just going through all the code, writing all the encryption passwords, going through, clicking all that stuff. If you've ever tried to encrypt multiple scripts at the same time, you'll probably know that that's a pain, so we automated it so that we don't have to do this, because, as Wayne said, it was taking a lot of our time. Like, if we have 100 scripts and have to go through and encrypt every single one of them every time we want to do any release, it was awful. And we have to have our code encrypted because... yeah, sorry, opinion, all right, I can stop sharing that. Ah. So that's gonna run; it should finish pretty soon. Then it will go through and stage it, and the staging basically takes all of the sources of information we want, as in our documentation, as in anything else we've written, and renders them into the form that we want in the add-in, because, much like the rest of GitHub and GitLab, most of our documentation is written in Markdown and then we render it into whatever we need. I don't need to show the rest of this, but yeah. So it's passing. It's going to go. We'll go back to VSCode. So. 
If we were to change... so this is just a single function. If I go in here, like, if I were to run this... JSL, run current selection. So, you can see that it came back; all that it's trying to do is open Big Class, run a fit line, and get the equation. It's returning the equation. And you can actually see it ran over here as well. But... so this could use some more documentation. And we're like, oh, we don't actually want this data table open. But let's just run this real quick. And say, no, this isn't a good return; it returns the equation in all caps, apparently. So if I stage that. Better documentation. Push. Again, back to here. So, again it's pushing. This is another pipeline. It's just running a bunch of PowerShell scripts in order, depending on however we set it up. But you'll notice this pipeline has more stages. In an effort to help this scale, we only test the JSL minimally at first, and then, as it passes, we allow it to test further. And we only test if there are JSL files that have changed. But we can go through this. It will run and it will tell us where it is in the testing, just in case the testing freezes. You know, if you have a modal dialog box that just won't close, obviously JMP isn't going to keep doing anything after that. But you can see, it did a bunch of stuff, yeah, awesome. I'm done. Exciting. Refresh that. Get a little green checkmark. And we could go, okay, run everything now. It would go through, test everything, then encrypt it, then test the encrypted version, basically the actual thing that we're going to make the add-in of, and then stage it again, package it for us, and create the actual add-in that we would give to a customer. I'm not going to do that right now because it takes a minute. But let's say we go in here and we're like, oh, well, I really want to close this data table. I don't know why I commented it out in the first place. I don't think it should be open, because I'm not using it anymore; we don't want that. We'll say okay: close the dt. Again, push. Now, this could all be done manually on my computer with Hamcrest. But you know, sometimes a developer will push stuff and not run all of their Hamcrest tests for everything on their computer, and the entire purpose of this is to catch that. It forces us to do our jobs a little better. And yeah. Keep clicking this button. I could just open that, but it's fine. So now you'll see it's running the pipeline again. Go to the pipeline. And I'm just going to keep saying this for repetition: we're just going through, testing, then encrypting, then testing again, because sometimes encryption introduces its own world of problems, if anybody's ever done encrypting. Run, run, run, run, run. And then, oh, we got a throw. Would you look at that? I'm not trying to be deadpan, but you know. So if we were to mark this as ready and say, yeah, we're done, we'd see, oh, well, that test didn't pass. Now we can download the reason the test didn't pass in the artifacts. And this will open a JUnit file that I'm just going to pull out here. It will also render it in GitLab, which might be easier, but for now we'll just do this. Eventually. Minimize everything. Now come on. So, we can see that something happened with R squared and it failed, inside of boo. So we can come here and say, why is there something in boo that is causing this to fail? We see, oh, somebody called our equation function and then just assumed that the data table was there. 
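For readers who haven't used Hamcrest for JSL, a test of the kind the pipeline runs looks roughly like the sketch below. The function and expected value are made up for illustration; the ut-prefixed calls follow the Hamcrest add-in's style as I recall it, so check the add-in's documentation on the JMP Community for the exact signatures.

// Hypothetical function in the spirit of the demo: it opens Big Class itself
// and, per the fix discussed above, closes the table before returning.
foo = Function( {},
	{Default Local},
	dt = Open( "$SAMPLE_DATA/Big Class.jmp", Invisible );
	n = N Rows( dt );
	Close( dt, No Save );   // don't leave the table open for other code to stumble over
	n;
);

// Hamcrest-style assertion: Big Class ships with 40 rows, so this should pass
ut assert that( Expr( foo() ), ut equal to( 40 ) );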
So because something I changed broke somebody else's code, as if that would ever happen. So we're having that problem. Where did you go? Here we go. So that's the main purpose of everything we're doing here: to be able to catch the fact that I changed something and I broke somebody else's stuff. So I could go through, look at what boo does, and say, oh well, maybe I should just open Big Class myself. Yeah, cool. Well, if I save that, I should probably make it better. Open Big Class myself. I'll stage that. Open Big Class. Git push. And again, just to show the pipeline. Now this should take not too long. So we're going to go in here. Automatically, we only test on one JMP version at first. Then it waits for the developer to say, yeah, I'm done and everything looks good, continue. We do that for resource reasons, because these are running on VMs that are just chugging along all the time, and we have multiple developers who are all using these systems. We're also... you can see, this one is actually a Docker system; we're containerizing these. Well, we're in the process of containerizing these. We have them working, but we don't have all the versions yet. But at least for this project we run 14.3, 15, and 15.1, and that should work. Let's just revert things, because we know that works. Probably should have done a classic... but it's fine. So yeah, we're going to test. I feel like I keep saying this over and over: we're going to test everything. We'll actually let this one run to show you kind of the end result of what we get. It should only take a little bit. And so we'll test this, make sure it's going, and you can see the logs. We're getting decent information out of what is happening and where it is; like, it'll tell you the runner that is running. I'm only running on Windows right now. Again, this is a demo and all that, but we should be able to run more. While that's running, I'll just talk about VSCode some more. In VSCode, there are also snippets and things, so if you want to make a function, it will create all of the function information for you. We use Natural Docs (again, that was stolen from the Hamcrest team) as our development documentation. So it'll just put everything in Natural Docs form. So again, the idea is helping us do our jobs and forcing us to do our jobs a little better, with a little more gusto. Wayne Levin For the documentation? Vince Faller So that's for the documentation, yeah. Wayne Levin As we're developing, we're documenting at the same time. Vince Faller Yep, absolutely. You know, it also has for loops, while loops, For Each Row, stuff like that. Is this done yet? It's probably done, yep. So we get our green checkmark. Now it's going to run on all of the systems. If we go back to here, you'll just see it open JMP, run some tests, probably open Big Class, then close itself all down. Wayne Levin So we're doing this largely because many of our clients have different versions of JMP deployed and they want a particular add-in, but they're running it on, you know, just different versions out there in the field. We also test against the early adopter versions of JMP, which is a service to JMP because we report bugs. But it's also helpful for the clients, because then they know that they can upgrade to the new version of JMP; they know that the applications that we built for them have been tested. 
And that's just routine for us. Good. Vince Faller You're done. You're done. You're done. Change to... I can talk about... And this is just going to run; we can movie-magic this if you want to, Meg, just to make it run faster. Basically, I just want to get to staging, but it takes a second. Is there anything else you have to say, Wayne, about it? Cool. I'll put that... Something I can say: when we're staging, we also have our documentation in MkDocs. So it'll actually run the MkDocs build, render it, put the help into the help files, and basically be able to create a release for us, so that we don't have to deal with it, because creating releases is just a lot of effort. Encrypting. It's almost done. Probably should just have had one preloaded. Live demos, what are you gonna do. Run. Oh, one thing I definitely want to do: the last thing that the pipeline actually does is check that we actually logged our time, because, you know, if we don't record our time spent, we don't get paid, so it forces us to do it. Great. Vince Faller So the job would have failed without that. I can just show some jobs. Trying. That's the Docker one; we don't want that. So you can see that gave us our successes: no failures, no unexpected throws. That's all stuff from Hamcrest. Come on. One more. Okay, we got to staging. One thing that it does is create the repositories fresh every time, so it tries to keep things in a sort of stateless way. Okay, we can download the artifacts now. And now we should have this pipeline demo. I really wish it would have just gone there. What... why is Internet Explorer up? So now you'll see the pipeline demo is a JMP add-in. If we unzip it (if you didn't know, a JMP add-in is just a zip file) and look at it now, you can see it has all of our scripts in it; it has our foo, it has our bar. If we open those, you can see it's an encrypted file. So this is basically what we would be able to give to the customer without so much mechanical work. Wayne mentioned less frustrated developers, and personally, I think that's an understatement, because doing this over and over was very frustrating before we got this in place, and this has helped a bunch. Wayne Levin Now, about the encryption: when you're delivering an add-in for use by users within a company, for security reasons and so on, you typically don't want anyone to be able to go in and deal with the code, that sort of thing. So we may deliver the code unencrypted so that the client has their own code unencrypted, but for delivery to the end user, you typically want everything encrypted, just so it can't be tampered with. Just one of those sort of things. Vince Faller Yep, and that is the end of my demo. Wayne, if you want to take it back for the wrap-up. Wayne Levin Yeah, terrific. Sure, thanks very much for that, Vince. So there are a lot of moving parts in this whole system. It's, you know, basically making sure that we've got code being developed by multiple users that is not colliding. We're building in the documentation at the same time. And actually, the documentation gets deployed with the application, and we don't have to weave that in; we set the infrastructure up so that it's automatically taken care of. We can update that along with the code, comprehensively and simultaneously, if you will. 
The Hamcrest tests that are going on: for each one of those functions that are written, there are expected results, if you will. So they get compared, and we saw, briefly, there was, I guess, some problem with that equation there. An R square or whatever came back with a different value, so it broke, in other words, to say hey, something's not right here; I was expecting this output from the function for a use case. That's one of the things that we get from clients: we build up a pool of use cases that get turned into Hamcrest tests and away we go. There are some other slides here that are available to you if you go and download the slides. So I'll leave those for you, and here's a little picture of the pipeline that we're employing and a little bit about code review activity for developers too, if you want to go back and forth with it. Vince, do you want to add anything here about how code review and approval takes place? Vince Faller Yeah, so inside of the merge request it will have the JSL code and the diffs of the code. And again, a big thank you to the people who did Hamcrest as well, because they also started a lexer for GitHub and GitLab to be able to read JSL, so actually this is inside of GitLab, and it can also read the JSL. It doesn't execute it, but it has nice formatting; it's not all just white text, it's beautiful there. We just go in, like in this screenshot: you click a line, you put in the comment that you want, and it becomes a reviewable task. So we try to do as much inside of GitLab as we can for transparency reasons, and once everything is closed out, you can say yep, my merge request is ready to go; let's put it into the master branch, main branch. Wayne Levin Awesome. So it's really helping: we're really defining coding standards, if you will, and I don't like the word enforcement, but that's what it amounts to. And it reduces variation. It makes it easier for multiple developers, if you will, to understand what others have done. And as we bring new developers on board, they come to understand the standard and they know what to look for, they know what to do. So it makes onboarding a lot easier, and again, everything's attached to everything here, so supportability and so on. This is the slide I mentioned earlier, just for some resources. We're using GitLab, but I suppose the same principles apply to any Git setup generally, like GitHub or what have you. Here's the Community link for Hamcrest. There was a talk in Tucson, that was in 2019, in the old days when we used to travel and get together. That was a lot of fun. And here's the marketplace link for the Visual Studio Code extension. So as Vince said, yeah, we make a lot of use of that editor, as opposed to using the built-in JMP editor, just because it's all integrated; it's just all part of one big application development environment. And with that, on behalf of Vince and myself, I want to thank you for your interest in this, and again, we really want to thank the JMP team; Justin Chilton and company, I'll call out to you. If not for Hamcrest, we would not be on this path. That was the missing piece, or the enabling piece, that really allowed us to take JSL development to, basically, the kinds of standards you expect in code development generally in industry. 
So we're really grateful for it, and I know that it's propagated out with each application we've deployed. And at this point, Vince and I are happy to take any questions. You can send them to info@predictum.com and they'll get forwarded to us and we'll get back to you. But at this point, we'll open it up to Q&A.  
Jeremy Ash, JMP Analytics Software Tester, SAS   The Model Driven Multivariate Control Chart (MDMCC) platform enables users to build control charts based on PCA or PLS models. These can be used for fault detection and diagnosis of high dimensional data sets. We demonstrate MDMCC monitoring of a PLS model using the simulation of a real-world industrial chemical process: the Tennessee Eastman Process. During the simulation, quality and process variables are measured as a chemical reactor produces liquid products from gaseous reactants. We demonstrate how MDMCC can perform online monitoring by connecting JMP to an external database. Measuring product quality variables often involves a time delay before measurements are available, which can delay fault detection substantially. When MDMCC monitors a PLS model, the variation of product quality variables is monitored as a function of process variables. Since process variables are often more readily available, this can aid in the early detection of faults. We also demonstrate fault diagnosis in an offline setting. This often involves switching between multivariate control charts, univariate control charts and diagnostic plots. MDMCC provides a user-friendly interface to move between these plots.     Auto-generated transcript...   Speaker Transcript   Hello, I'm Jeremy Ash. I'm a statistician in JMP R&D. My job primarily consists of testing the multivariate statistics platforms in JMP, but I also help research and evaluate methodology. Today I'm going to be analyzing the Tennessee Eastman process using some statistical process control methods in JMP. I'm going to be paying particular attention to the Model Driven Multivariate Control Chart platform, which is a new addition to JMP 15. These data provide an opportunity to showcase a number of the platform's features. And just as a quick disclaimer, this is similar to my Discovery Americas talk. We realized that Europe hadn't seen a Model Driven Multivariate Control Chart talk due to all the craziness around COVID, so I decided to focus on the basics, but there is some new material at the end of the talk. I'll briefly cover a few additional example analyses that I put on the Community page for the talk. First, I'll assume some knowledge of statistical process control in this talk. The main thing it would be helpful to know about is control charts. If you're not familiar with these, these are charts used to monitor complex industrial systems to determine when they deviate from normal operating conditions. I'm not gonna have much time to go into the methodology of Model Driven Multivariate Control Chart, so I'll refer you to these other great talks that are freely available on the JMP Community if you want more details. I should also say that Jianfeng Ding was the primary developer of the Model Driven Multivariate Control Chart platform, in collaboration with Chris Gotwalt, and that Tonya Mauldin and I were testers. The focus of this talk will be using multivariate control charts to monitor a typical real-world process; another novel aspect will be using control charts for online process monitoring. This means we'll be monitoring data continuously as it's added to a database and detecting faults in real time. So I'm going to start off with the obligatory slide on the advantages of multivariate control charts. So why not use univariate control charts? 
There are a number of excellent options in JMP. Univariate control charts are excellent tools for analyzing a few variables at a time. However, quality control data are often high dimensional, and the number of control charts you need to look at can quickly become overwhelming. Multivariate control charts can summarize a high dimensional process in just a couple of control charts, so that's a key advantage. But that's not to say that univariate control charts aren't useful in this setting. You'll see throughout the talk that fault diagnosis often involves switching between multivariate and univariate charts. Multivariate control charts give you a sense of the overall health of the process, while univariate charts allow you to monitor specific aspects of the process, so the information is complementary. One of the goals of Model Driven Multivariate Control Chart is to provide some useful tools for switching between these two types of charts. One disadvantage of univariate charts is that observations can appear to be in control when they're actually out of control in the multivariate sense, and these plots show what I mean by this. The univariate control charts for oil and density show the two observations in red as in control. However, oil and density are highly correlated, and both observations are out of control in the multivariate sense, especially observation 51, which clearly violates the correlation structure of the two variables. So multivariate control charts can pick up on these types of outliers, while univariate control charts can't. Model Driven Multivariate Control Chart uses projection methods to construct the charts. I'm going to start by explaining PCA because it's easy to build up from there. PCA reduces the dimensionality of the process by projecting the data onto a low dimensional surface. This is shown in the picture on the right. We have P process variables and N observations, and the loading vectors in the P matrix give the coefficients for linear combinations of our X variables that result in score variables with dimension A, where A is much less than P. And this is shown in the equations on the left here: X can be predicted as a function of the scores and loadings, where E is the prediction error. The scores are selected to minimize the prediction error; another way to think about this is that you're maximizing the amount of variance explained in the X matrix. PLS is a more suitable projection method when you have a set of process variables and a set of quality variables. You really want to ensure that the quality variables are kept in control, but these variables are often expensive or time consuming to collect. The plant could be making product with out-of-control quality for a long time before a fault is detected. So PLS models allow you to monitor your quality variables as a function of your process variables, and you can see that the PLS model finds the score variables that maximize the amount of variation explained in the quality variables. The process variables are often cheaper or more readily available, so PLS can enable you to detect faults in quality early and make your process monitoring cheaper. From here on out I'm going to focus on PLS models because that's more appropriate for the example.   
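For reference, the decomposition just described, and the two monitoring statistics built from it that come up next, are usually written as follows. This is the standard notation from the multivariate SPC literature, restated here for convenience; JMP's exact computations (for example, DModX scaling and control-limit estimation) may differ in detail.

X = T P^{\top} + E, \qquad Y = T Q^{\top} + F

where X is the N \times P matrix of process variables, T (N \times A) holds the scores, P and Q are the X- and Y-loadings, and E, F are residuals. For observation i,

T_i^2 = \sum_{a=1}^{A} \frac{t_{ia}^2}{s_{t_a}^2}, \qquad \mathrm{SPE}_i = \sum_{j=1}^{P} \left( x_{ij} - \hat{x}_{ij} \right)^2,

so T squared measures variation within the model plane and SPE measures the distance from it. A commonly used T squared limit for current (test) observations is

T^2_{\mathrm{lim}} = \frac{A\,(N+1)(N-1)}{N\,(N-A)}\; F_{1-\alpha}(A,\, N-A).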
So a PLS model partitions your data into two components. The first component is the model component. This gives the predicted values of your process variables. Another way to think about it is that your data have been projected into the model plane defined by your score variables, and T squared monitors the variation of your data within this model plane. The second component is the error component. This is the distance between your original data and the predicted data, and squared prediction error (SPE) charts monitor this variation. An alternative metric we provide is the distance to the model X plane, or DModX. This is just a normalized alternative to SPE that some people prefer. The last concept that's important to understand for the demo is the distinction between historical and current data. Historical data are typically collected when the process was known to be in control. These data are used to build the PLS model and define the normal process variation so that a control limit can be obtained. Current data are assigned scores based on the model but are independent of the model. Another way to think about this is that we have training and test sets. The T squared control limit is lower for the training data because we expect less variability for the observations used to train the model, whereas there's greater variability in T squared when the model generalizes to a test set. Fortunately, the theory for the variance of T squared has been worked out, so we can get these control limits based on some distributional assumptions. In the demo we'll be monitoring the Tennessee Eastman process. I'm going to present a short introduction to these data. This is a simulation of a chemical process developed by Downs and Vogel, two chemists at Eastman Chemical. It was originally written in Fortran, but there are wrappers for Matlab and Python now. I just wanted to note that while this data set was generated in the '90s, it's still one of the primary data sets used to benchmark multivariate control methods in the literature. It covers the main tasks of multivariate control well, and there is an impressive amount of realism in the simulation. And the simulation is based on an industrial process that's still relevant today. The data were manipulated to protect proprietary information. The simulated process is the production of two liquid products from gaseous reactants within a chemical plant. And F here is a byproduct that will need to be siphoned off from the desired product. And that's about all I'll say about that. So the process diagram looks complicated, but it really isn't that bad, so I'll walk you through it. Gaseous reactants A, D, and E flow into the reactor here. The reaction occurs and the product leaves as a gas. It's then cooled and condensed into liquid in the condenser. Then a vapor-liquid separator recycles any remaining vapor and sends it back to the reactor through a compressor, and the byproduct and inert chemical B are purged in the purge stream to prevent any accumulation. The liquid product is pumped through a stripper, where the remaining reactants are stripped off and sent back to the reactor. And then finally, the purified liquid product exits the process.   
The first set of variables being monitored are the manipulated variables. These look like bow ties in the diagram; I think they're actually meant to be valves, and the manipulated variables mostly control the flow rate through different streams of the process. These variables can be set to any values within limits and have some Gaussian noise. The manipulated variables can be sampled at any rate, but we use the default 3-minute sampling interval. Some examples of the manipulated variables are the valves that control the flow of reactants into the reactor. Another example is a valve that controls the flow of steam into the stripper, and another is a valve that controls the flow of coolant into the reactor. The next set of variables are measurement variables. These are shown as circles in the diagram. They were also sampled at three-minute intervals. The difference between manipulated variables and measurement variables is that the measurement variables can't be manipulated in the simulation. Our quality variables will be the percent composition of the two liquid products, and you can see the analyzer measuring the products here. These variables are sampled with a considerable time delay, so we're looking at the purge stream instead of the exit stream, because these data are available earlier. And we'll use a PLS model to monitor process variables as a proxy for these variables, because the process variables have less delay and a faster sampling rate. So that should be enough background on the data. In total there are 33 process variables and two quality variables. The process of collecting the variables is simulated with a set of differential equations. This is just a simulation, but as you can see, a considerable amount of care went into modeling this after a real-world process. Here is an overview of the demo I'm about to show you. We will collect data on our process and store these data in a database. I wanted to have an example that was easy to share, so I'll be using a SQLite database, but the workflow is relevant to most types of databases, since most support ODBC connections. Once JMP forms an ODBC connection with the database, JMP can periodically check for new observations and add them to a data table. If we have a Model Driven Multivariate Control Chart report open with automatic recalc turned on, we have a mechanism for updating the control charts as new data come in. The whole process of adding data to the database would likely be going on on a separate computer from the computer that's doing the monitoring, so I have two sessions of JMP open to emulate this. Both sessions have their own journal in the materials on the Community; the session adding new simulated data to the database will be called the Streaming Session, and the session updating the reports as new data come in will be called the Monitoring Session. One thing I really liked about the Downs and Vogel paper was that they didn't provide a single metric to evaluate the control of the process. I have a quote from the paper here: "We felt that the tradeoffs among the possible control strategies and techniques involved much more than a mathematical expression." So here are some of the goals they listed in their paper, which are relevant to our problem. 
They wanted to maintain the process variables at desired values. They wanted to minimize variability of product quality during disturbances, and they wanted to recover quickly and smoothly from disturbances. So we'll see how well our process achieves these goals with our monitoring methods. To start off, in the Monitoring Session journal, I'll show you our first data set. The data table contains all of the variables I introduced earlier. The first variables are the measurement variables, the second are the composition variables, and the third are the manipulated variables. The script up here will fit a PLS model. It excludes the last 100 rows as a test set. Just as a reminder, the model is predicting the two product composition variables as a function of the process variables. If you have JMP Pro, there have been some speed improvements to PLS in JMP 16. PLS now has a fast SVD option; you can switch to the classic one in the red triangle menu. There have also been a number of performance improvements under the hood, mostly relevant for data sets with a large number of observations, but that's common in the multivariate process monitoring setting. But PLS is not the focus of the talk, so I've already fit the model and output score columns, and you can see them here. One reason that Model Driven Multivariate Control Chart was designed the way it is, is that... imagine you're a statistician and you want to share your model with an engineer so they can construct control charts. All you need to do is provide the data table with these formula columns. You don't need to share all the gory details of how you fit your model. Next, I'll provide the score columns to Model Driven Multivariate Control Chart and drag them to the right here. So on the left here you can see two types of control charts: the T squared and SPE. There are 860 observations that were used to estimate the model, and these are labeled as historical. And then the hundred that were left out as a test set are your current data. And you can see in the limit summaries the number of points that are out of control and the significance level. If you want to change the significance level, you can do it up here in the red triangle menu. Because the reactor's in normal operating conditions, we expect no observations to be out of control, but we have a few false positives here because we haven't made any adjustments for multiple comparisons. It's uncommon to do this, as far as I can tell, in multivariate control charts; I suppose you have higher power to detect out-of-control signals without a correction. In control chart lingo, this means your out-of-control average run length is kept low. On the right here we also have contribution plots: on the Y axis are the observations; on the X axis, the variables. A contribution is expressed as a proportion. And then at the bottom here, we have score plots. Right now I'm plotting the first score dimension versus the second score dimension, but you can look at any combination of score dimensions using these dropdown menus or the arrow button. OK, so I think we're oriented to the report. I'm going to now switch over to the scripts I've used to stream data into the database that the report is monitoring.   
In order to do anything for this example, you'll need to have a SQLite ODBC driver installed on your computer. This is much easier to do on a Windows computer, which is what you're often using when actually connecting to a database. The process on the Mac is more involved, but I put some instructions on the Community page. I don't have time to talk about this, but I created the SQLite database I'll be using in JMP, and I plan to put some instructions on how to do this on the Community web page. Hopefully that example is helpful to you if you're trying to do this with data of your own. Next I'm going to show you the files that I put in the SQLite database. Here I have the historical data. This was used to construct the PLS model. There are 960 observations that are in control. Then I have the monitoring data, which at first just contains the historical data, but I'll gradually add new data to this. This is the data that the multivariate control chart will be monitoring. And then I've already simulated new data and added it to the data table here. These are another 960-odd measurements where a fault is introduced at some time point. I wanted to have something that was easy to share, so I'm not going to run my simulation script and add to the database that way. We're just going to take observations from this new data table and move them over to the monitoring data table using some JSL and SQL statements. This is just an example emulating the process of new data coming into a database. You might not actually do this with JMP, but this was an opportunity to show how you can do it with JSL. Clean up here. And next I'll show you the streaming script. This is a simple script, so I'm going to walk you through it real quick. This first set of commands will open the new data table that's in the SQLite database; it opens the table in the background so I don't have to deal with the window. Then I'm going to take pieces from this data table and add them to the monitoring data table. I call the pieces bites, and the bite size is 20. Then this next command will connect to the database. This will allow me to send the database SQL statements. And this next bit of code iteratively sends SQL statements that insert new data into the monitoring data. I'm going to initialize K and show you the first iteration of this. This is a simple INSERT INTO statement that inserts the first 20 observations into the data table. This print statement is commented out so that the code runs faster, and then I also have a wait statement to slow things down slightly so that we can see the progression in the control chart; this would just go too fast if I didn't slow it down. So next I'm going to move over to the Monitoring Session to show you the scripts that will update the report as new data come in. This first script is a simple script that will check the database every 0.2 seconds for new observations and add them to the JMP table. Since the report has automatic recalc turned on, the report will update whenever new data are added. And I should add that, realistically, you probably wouldn't use a script that just iterates like this. 
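The streaming loop just described can be sketched in a few lines of JSL. The DSN, table names, and counts below are placeholders rather than the script from the talk; only the pattern (connect once, send INSERT INTO statements in bites, and pause between them) follows the description above.

// Connect so we can send SQL statements to the SQLite database (placeholder DSN)
dbc = Create Database Connection( "DSN=TEP_SQLite;" );

bite = 20;                                  // rows moved per iteration ("bite size")
For( k = 0, k < 48, k++,
	// Move the next bite of rows from the new-data table into the monitoring table
	Execute SQL( dbc,
		"INSERT INTO monitoring_data " ||
		"SELECT * FROM new_data ORDER BY rowid " ||
		"LIMIT " || Char( bite ) || " OFFSET " || Char( k * bite ) || ";"
	);
	Wait( 0.5 );                            // slow down so the chart updates are visible
);
Close Database Connection( dbc );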
Realistically, you'd probably use Task Scheduler on Windows or Automator on Mac to better schedule runs of the script. There's also another script that will push the report to JMP Public whenever the report is updated, and I was really excited that this is possible with JMP 15. It enables any computer with a web browser to view updates to the control chart. You can even view the report on your smartphone, so this makes it really easy to share results across organizations. And you can also use JMP Live if you want the reports to be on a restricted server. I'm not going to have time to go into this in this demo, but you can check out my Discovery Americas talk. Then finally, down here, there is a script that recreates the historical data in the data table if you want to run the example multiple times. Alright, so next... make sure that we have the historical data... I'm going to run the streaming script and see how the report updates. So the data is in control at first and then a fault is introduced, but there's a plantwide control system implemented in the simulation, and you can see how the control system eventually brings the process to a new equilibrium. Wait for it to finish here. So if we zoom in, it seems like the process first went out of control around this time point, so I'm going to color it and label it so that it will show up in other plots. And then in the SPE plot, it looks like this observation is also out of control, but only slightly. And then if we zoom in on the time point in the contribution plots, you can see that there are many variables contributing to the out-of-control signal at first. But then, once the process reaches a new equilibrium, there are only two large contributors. So I'm going to remove the heat maps now to clean up a bit. You can hover over the point at which the process first went out of control and get a peek at the top ten contributing variables. This is great for giving you a quick overview of which variables are contributing most to the out-of-control signal. And then if I click on the plot, this will be appended to the fault diagnosis section. As you can see, there are several variables with large contributions, and I've just sorted on the contribution. For variables with red bars, the observation is out of control in the univariate control charts. You can see this by hovering over one of the bars; these graphlets are IR charts for an individual variable with a three-sigma control limit. You can see in the stripper pressure variable that the observation is out of control, but eventually the process is brought back under control. And this is the case for the other top contributors. I'll also show you the univariate control chart for one of the variables that stays in control. So there were many variables out of control in the process at the beginning, but the process eventually reaches a new equilibrium. To see the variables that contribute most to the shift in the process, we can use the mean contribution proportion plots. These plots show the average contribution that the variables have to T squared for the group I've selected. Here, if I sort on these.   
The only two variables with large contributions measure the rate of flow of reactant A in stream one, which is the flow of this reactant into the reactor. Both of these variables are measuring essentially the same thing, except one is a measurement variable and the other is a manipulated variable. You can see that there is a large step change in the flow rate, which is what I programmed in the simulation. So these contribution plots allow you to quickly identify the root cause. In my previous talk I showed many other ways to visualize and diagnose faults using tools in the score plot. This includes plotting the loadings on the score plots and doing some group comparisons. You can check out my Discovery Americas talk on the JMP Community for that. Instead, I'm going to spend the rest of this time introducing a few new examples, which I put on the Community page for this talk. There are 20 programmable faults in the Tennessee Eastman process, and they can be introduced in any combination. I provided two other representative faults here. Fault 1, which I showed previously, was easy to detect because the out-of-control signal is so large and so many variables are involved. The focus of the previous demo was to show how to use the tools to identify faults out of a large number of variables, and not necessarily to benchmark the methods. Fault 4, on the other hand, is a more subtle fault, and I'll show it here. The fault that's programmed is a sudden increase in the temperature in the reactor, and this is compensated for by the control system by increasing the flow rate of coolant. You can see that variable picked up here, and you can see the shift in the contribution plots. And then you can also see that most other variables aren't affected by the fault. You can see a spike in the temperature here that is quickly brought back under control. Because most other variables aren't affected, this is hard to detect for some multivariate control methods, and it can be more difficult to diagnose. The last fault I'll show you is Fault 11. Like Fault 4, it also involves the flow of coolant into the reactor, except now the fault introduces large oscillations in the flow rate, which we can see in the univariate control chart. And this results in a fluctuation of reactor temperature. The other variables aren't really affected again, so this can be harder to detect for some methods. Some multivariate control methods can pick up on Fault 4 but not Fault 11, or vice versa, but our method was able to pick up on both. And then finally, all the examples I created using the Tennessee Eastman process had faults that were apparent in both T squared and SPE plots. To show some newer features in Model Driven Multivariate Control Chart, I wanted to show an example of a fault that appears in the SPE chart but not T squared. To find a good example of this, I revisited a data set which Jianfeng Ding presented in her earlier talk, and I provided a link to her talk in this journal. On her Community page, she provides several useful examples that are also worth checking out. This is a data set from Cordia McGregor's (?) classic paper on multivariate control charts. 
The data are process variables measured in a reactor producing polyethylene, and you can find more background in Jianfeng's talk. In this example, we have a process that went out of control. Let me show you this. It's out of control earlier in the SPE chart than in the T squared chart. And if we look at the mean contribution plots for SPE, you can see that there is one variable with a large contribution, and it also shows a large shift in the univariate control chart; but there are also other variables with large contributions that are still in control in the univariate control charts. And it's difficult to determine from the bar charts alone why these variables had large contributions. Large SPE values happen when new data don't follow the correlation structure of the historical data, which is often the case when new data are collected, and this means that the PLS model you trained is no longer applicable. From the bar charts, it's hard to know which pair of variables have their correlation structure broken. So, new in 15.2, you can launch scatterplot matrices. And it's clear in the scatterplot matrix that the violation of correlations with Z2 is what's driving these large contributions. OK, I'm gonna switch back to the PowerPoint. And real quick, I'll summarize the key features of Model Driven Multivariate Control Chart that were shown in the demo. The platform is capable of performing both online fault detection and offline fault diagnosis. There are many methods provided in the platform for drilling down to the root cause of faults. I'm showing you here some plots from a popular book, Fault Detection and Diagnosis in Industrial Systems. Throughout the book, the authors demonstrate how one needs to use multivariate and univariate control charts side by side to get a sense of what's going on in a process. And one particularly useful feature in Model Driven Multivariate Control Chart is how interactive and user-friendly it is to switch between these two types of charts. And that's my talk. Here is my email if you have any further questions. And thanks to everyone that tuned in to watch this.
Mia Stephens, JMP Principal Product Manager, SAS   Predictive modeling is all about finding the model, or combination of models, that most accurately predicts the outcome of interest. But, not all problems (and data) are created equal. For any given scenario, there are several possible predictive models you can fit, and no one type of model works best for all problems. In some cases a regression model might be the top performer; in others it might be a tree-based model or a neural network. In the search for the best-performing model, you might fit all of the available models, one at a time, using cross-validation. Then, you might save the individual models to the data table, or to the Formula Depot, and then use Model Comparison to compare the performance of the models on the validation set to select the best one. Now, with the new Model Screening platform in JMP Pro 16, this workflow has been streamlined. In this talk, you’ll learn how to use Model Screening to simultaneously fit, validate, compare, explore, select and then deploy the best-performing predictive model.     Auto-generated transcript...   Speaker Transcript Mia Stephens ...model screening. If you do any work with predictive modeling, you'll find that Model Screening helps you to streamline your predictive modeling workflow. So in this talk I'm going to provide an overview of predictive modeling and talk about the different types of predictive models we can use in JMP. We'll talk about the predictive modeling workflow within the broader analytics workflow, and we'll see how Model Screening can help us to streamline this workflow. I'll talk about some metrics for comparing competing models using validation data, and we'll see a couple of examples in JMP Pro. First let's talk a little bit about predictive modeling: what is predictive modeling? You've probably been exposed to regression analysis, and regression is an example of explanatory modeling. In regression we're typically interested in building a model for a response, or Y, as a function of one or more Xs, and we might have different modeling goals. We might be interested in identifying important variables. So what are the key Xs, or key input variables, for example, in a problem-solving setting, that we might focus on to address the issue? We might be interested in understanding how the response changes, on average, as a function of the input variables; for example, a one-unit change in X is associated with a five-unit change in Y. So this is classical explanatory modeling, and if you've taken statistics in school, this is probably how you learned about regression. Now, to contrast, predictive modeling has a slightly different goal. In predictive modeling our goal is to accurately predict or classify future outcomes. So if our response is continuous, we want to be able to predict the next observation, the next outcome, as precisely or accurately as possible. And if our response is categorical, then we're interested in classification. Again, we're interested in using current data to predict what's going to happen at the individual level in the future. And we might fit and compare many different models, and in predictive modeling we might also use some more advanced models; we might use some machine learning techniques like neural networks. Some of these models might not be as easy to interpret, and many of them have a lot of different tuning parameters that we can set. And as a result, with predictive modeling we can have a problem with overfitting.   
What overfitting means is that we fit a model that's more complex than it needs to be.   So with predictive modeling we generally use validation and there are several different forms of validation we can use.   We use validation for model comparison and selection, and fundamentally, it protects against overfitting but also underfitting.   And underfitting is when we fit a model that's not as complex as it needs to be. It's not really capturing the structure in our data.   Now, in the appendix at the end of the slides, I've pulled some some slides that illustrate why validation is important, but for the focus of this talk I'm simply going to use validation when I fit predictive models.   There are many different types of models we can fit in JMP Pro, and this is not by any means an exhaustive list.   We can fit several different types of regression models. If we have a continuous response, we can fit a linear regression model.   If our response is categorical, a logistic regression model, but we can also fit generalized linear models and penalize regression methods, and these are all from the fit model platform.   There are many options under the predictive modeling platform, so neural nets, neural nets with boosting, and different numbers of layers and nodes,   classification and regression trees, and more advanced tree based methods, and several other techniques.   And there are also a couple of predictive modeling options from the multivariate methods platform, so discriminate analysis   and partial least squares are two differen...two additional types of models we can use for predictive modeling. And by the way, partial least squares is also available from fit model.   And why do we have so many models?   In predictive modeling, you'll often find that that no one model   or modeling type always works the best. In certain situations, a neural network might be best and neural networks are generally are pretty...pretty good performers.   But you might find in other cases that a simpler model actually performs best, so the type of model that fits your data best and predicts most accurately is based largely on your response, but also on the structure of your data.   So you might fit several different types of models and compare these models before you find the model that fits most accurately or predicts most accurately.   So, within the broader analytic workflow, where you start off with some sort of a problem that you're trying to solve, and you compile data, you prepare the data, and explore the data,   predictive modeling is down in analyze and build models. And the typical predictive modeling workflow might look something like this, where you fit a model with validation.   Then you save that formula to a data table or publish it to the formula depot.   And then you fit another model and you repeat this, so you may fit several different models   and then compare these models. And in JMP Pro, you can use the model comparison platform to do this, and you compare the performance of the models on the validation data, and then you choose the best model or the best combination models and then you deploy the model.   And what's different with model screening is that all of the model fitting and comparison, this selection is done within one platform, the model screening platform.   So we're going to use an example that you might be familiar with, and there is a blog on model screening using these data that's posted in the Community, and these are the diabetes data.   
So the scenario is that researchers want to predict the rate of disease progression, one year after baseline.   So there are several baseline measurements and then there are two different types of responses.   The response Y is the quantitative measure, a continuous measure, so this is the rate of disease progression and then there's a second response, Y Binary, which is high or low.   So Y Binary can represent a high high rate of progression or a low rate of progression.   And the goal of predictive modeling here is to predict patients who are most likely to have a high rate of disease progression, so that the the corrective actions can be taken   to prevent this. So we're going to...we're going to take a look at fitting models in JMP Pro for both of these responses, and we'll see how how to fit the same models using model screening.   And before I go to JMP, I just want to talk a little bit about how we compare predictive models.   We compare predictive models on the validation set or test set. So so basically what we're doing is we fit a model to a subset of our data called our training data.   And then we fit that model to data that were held out (typically we call these validation data) to see how well the model performs.   And if we have continuous responses, we can use measures of error, so root means square error (RMSE) or RASE, which is route average squared error, so this is the measure of prediction error.   AAE, MAD, MAE, these are these are measures of average error and there's different R square measures we might use.   To get categorical responses, we're most often interested in measures of error or misclassification.   We might also be interested in looking at an ROC curve, look at AUC (area under the curve) or sensitivity, or specificity, false positives, false negative rate.   the F1-Score and MCC, which is Matthews correlation coefficient.   So let me switch over to to JMP.   Tuck that this away and I'll open up the diabetes data.   And let me make this big so you can see it.   So these are the data. There are...there's information on 442 patients, and again, we've got a response Y, which is continuous and this is the the amount of disease progression after one year. And we'll start off looking at Y, but then we also have the second variable, Y Binary.   We've got baseline information.   And there's a column validation. So again validation, when we fit models, we're going to fit the models only using the training data and we're going to use the validation data to tell us when to stop growing the model and to give us measures of how good the model actually fits.   Now to build this column, there is a utility under predictive modeling called model validation column.   And this gives us a lot of options for building a validation column, so we can partition your data into training validation and a test set.   And, in most cases, if we're using this sort of technique of partitioning our data into subsets,   having a test set is recommended. Having a test set allows you to have an assessment of model performance on data that wasn't used for building the models or for stopping model growth, so so I'd recommend that, even though we don't have a test set in this case.   So let's say that I want to...I want to find a model that most accurately predicts the response. So as you saw there are a lot of different models choose from.   I'll start with fit model.   And this is usually a good starting point.   
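Before moving into the model fitting, two of the metrics mentioned above that may be less familiar can be written out explicitly (standard definitions, added here for clarity rather than quoted from the talk):

\mathrm{RASE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}, \qquad \mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}

where the RASE is computed over the n rows of the validation (or test) set, and TP, TN, FP and FN are the counts of true positives, true negatives, false positives and false negatives at the chosen classification cutoff. The F1 score mentioned alongside MCC is the harmonic mean of precision and recall, F_1 = 2TP / (2TP + FP + FN).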
So I'm going to build a regression model with Y as our response, the Xs as model effects, and I'll only put in main effects at this point. And I'm going to add a validation column.   Now from fit model, the default here is going to be standard least squares, but there are a lot of different types of models we can fit.   I'm simply going to run least squares model.   A couple of things to point out here. Notice the marker here, V.   Remember that we fit our model to the training data, but we also want to see how well the model performs on the validation data, so all of these markers with a V are observations in the validation set.   Because we have validation, there is a crossvalidation section here, so we can look at R square on the training set and also on the validation set and then RASE.   And oftentimes what you'll see is that the validation statistics will be somewhat worse than the training statistics, and the farther off, they are, the better indication that you model is overfit or underfit.   I want to point out one other thing here that's really beyond the scope of this talk, and that's this prediction profiler.   And the prediction profiler   is actually one of the reasons I first started using JMP. It's a really powerful way of understanding your model.   And so I can change the values of any X and see what happens to the predicted response, and this is the average, so with   these models, we're predicting the average. But notice how these bands fan out for total cholesterol and LDL, HDL, right. And this is because we don't have any data out in those regions.   So that the new feature, I want to point out really quickly and again this is beyond the scope of this talk, is this thing called extrapolation control. And if I turn on a warning and drag   total cholesterol, notice that it's telling me there's a possible extrapolation. This is telling me I'm trying to predict out in a region where I really don't have any data, and if I turn   this extrapolation control on, notice that it truncates this bar, this line, so it's basically saying you can't make predictions out in that region. So it's something you might want to check out if you're fitting models. It's a really powerful tool.   So so let's say that I've done all the diagnostic work. I've reduced this model, and I want to be able to save my results.   Well, there are a couple of ways to do this. I can go to the red column, save columns, save the prediction formula.   So this saves the linear model I've just built out to the data table. So you can see it here in the background. And then, if I add new values to this data table, it'll automatically predict the response.   But I might want to save the model out in another form, so to do this, I might publish   this formula out to the formula depot. And the formula depot is actually independent of my data table, but what this allows me to do is I can copy the script and paste it into another data table with new data to score new data.   Or I might want to generate code in a different language to allow me to deploy this within some sort of a production system.   I'm going to go ahead and close this. This is just one model. Now is it possible, if I fit a more complicated or sophisticated model, that it might get better performance?   So I might fit another model. So, for example, I might...I'm just gonna hit recall. I might change the personality from standard least squares to generalize regression.   And this allows me to specified different response distributions. 
And I'll just stick with normal and click run. So this will allow me to fit different penalized methods and also use different variable selection techniques. And if you haven't checked out generalized regression, it's a super powerful and super flexible modern modeling platform. I'm just going to click go. And let's say that I fit this model and I want to be able to compare this model to the model I've already saved. So I might save the prediction formula out to the data table. So now I have another column in the background in the data table. Or I might again want to publish this to the formula depot, so now I've got two different models here. And I can keep going. So this is just one model from generalized regression. I can also fit several different types of predictive models from the predictive modeling menu, so for example, neural networks or partition or bootstrap forest or boosted trees. Now, typically what I would have to do is fit all these models, save the results out either to the data table or to the formula depot, and if I save them to the data table, I can use the model comparison platform to compare the different competing models. And I might have many models; here I only have two. And I don't actually even have to specify what the models are, I only need to specify validation. And I actually kind of like to put validation down here in the by field. So this gives me my fit statistics for both the training set and the validation set, and I'm only going to look at the statistics for the validation set. So I would use this to help me pick the best performing model. And what I'm interested in is a higher value of R square, a lower value of RASE (the root average squared error), and a lower average absolute error. And between these two models, it looks like the least squares regression model is the best. Now, if I were to fit all the possible models, this can be quite time-consuming. So instead of doing this, what's new in JMP Pro 16 is a platform called model screening. And when I launch model screening, it has a dialogue at the top, just like we've seen, so I'll go ahead and populate this. And I'll add validation, but over on the side, what you see is that I can select several different models and fit these different models all at one time. So decision tree, bootstrap forest, boosted tree, K nearest neighbors, right, I can fit all of these models. And it won't run models that don't make sense, so although logistic regression is one of my options, it won't run a logistic regression model with a continuous response. Notice that I've also got this option down here, XGBoost. And the reason that appears is that there is an add-in that uses open source libraries, and if you install this add-in (it's available on our JMP user community, it's called XGBoost and it only works in JMP Pro), it'll automatically appear in the model screening dialogue. So I'm just going to click OK, and when I click OK, what it's going to do is go out and launch each of these platforms. And then it's going to pull all the results into one window. So I clicked okay. I don't have a lot of data here, so it's very fast. And under the details, these are all of the individual models that were fit. And if I open up any one of these dialogues, I can see the results, and I have access to additional options that will be available from that menu.
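To make the fit-many-models-then-compare workflow concrete for readers who want to experiment outside of JMP, here is a minimal sketch in Python with scikit-learn. It is a conceptual parallel only, not the Model Screening implementation, and the candidate models and settings are illustrative assumptions.

import numpy as np
from sklearn.linear_model import LinearRegression, Lasso
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

def screen_models(X_train, y_train, X_valid, y_valid):
    """Fit several candidate model families and compare them on held-out validation rows."""
    candidates = {
        "least_squares": LinearRegression(),
        "lasso": Lasso(alpha=0.1),
        "bootstrap_forest": RandomForestRegressor(n_estimators=200, random_state=1),
        "boosted_tree": GradientBoostingRegressor(random_state=1),
    }
    rase = {}
    for name, model in candidates.items():
        model.fit(X_train, y_train)
        # RASE (root average squared error) on the validation rows; lower is better.
        rase[name] = np.sqrt(mean_squared_error(y_valid, model.predict(X_valid)))
    best = min(rase, key=rase.get)
    return best, rase

The idea is the same as in the demo: every candidate is trained only on the training rows, every candidate is scored on the same validation rows, and the model with the lowest validation RASE (or the dominant model across several metrics) is the one kept for deployment.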
So I'm going to tuck away the details. And by default, although it has computed statistics for the training data, it shows me the results for the validation data, so I can see R square and I can see RASE. And by default it's sorting in order of RASE, where lowest is best. But I've got a few little shortcut options down here, so if I want to find the best models (and it could be that R square is best for some models but RASE is better for others), I'm going to click select dominant. In this case, it selected neural boosted, so across all of these models, the best model is neural boosted. And if I want to take a closer look at this model, I can either look at the details up here under neural or I can simply say run selected. Now I didn't talk about this, but in the dialogue window there's an option to set a random seed. And if I set that random seed, then the results that launch here will be identical to what I see here. So this is a neural model with three nodes using the TanH function, but it's also using boosting. So in designing this platform, the developers did a lot of simulations to determine the best starting point and the best default settings for the different models. So neural boosted is the best. And if I want to be able to deploy this model, now what I can do is save the script, or I could run it, or I can save it out to the formula depot. So this is with a continuous response, and there are some other options under the red triangle. What if I have a categorical response? For a categorical response, I can use the same platform. So again, I'll go to model screening. I'll click recall, but instead of using Y, I'll use Y Binary. And I'm not going to change any of the default methods. I will put in a random seed, so for example 12345, I'm just grabbing a random number. And what this will do is give me repeatability. So if I save any model out to the data table or to the formula depot, the statistics will be the same and the model fit will be the same. A few other options here. We might want to turn off some things like the reports. We might want to use a different cross validation method, so this platform includes K fold validation, but it also offers nested K fold cross validation. And we can repeat this. So, really nice. Sometimes partitioning our data into training, validation and test isn't the best, and K fold can actually be a little bit better. And there are some additional options at the bottom. So we might want to add two way interactions. We might want to add quadratic effects. Under additional methods, this will fit additional generalized regression methods. So I'm just going to go ahead and click OK. OK. It runs all the models and again, this is a small data set. It's very quick. Right, the look and feel are the same, but now the statistics are different. So I've got the misclassification rate. I've got an area under the curve. I've also got some R square measures and then root average squared error. I'm going to click select dominant, and again, the dominant method is neural boosted. Now, what if I want to be able to explore some of these different models? So the misclassification rate here is a fair amount lower than it is for stepwise. The AUC is kind of similar, it's lowest overall but maybe not that much better. And let me grab a few of these. So if I click on a few of these, maybe I'll select these four, these five.
I can look at ROC curves, if I'm interested in looking at ROC curves to compare the models.   And there's some nice controls here to allow me to to turn off models and focus on on certain models.   And a new feature that I'm really excited about is this thing called a decision threshold.   And what the decision threshold allows us to look at,   and it starts by looking at the training data, is it's giving us a snapshot of our data.   And misclassification rate is based on a cut off of .5 for classification. So for each of the models, it's showing me the points that were actually high and low. And if we're focusing in on the high, the green dots were correctly classified as high.   And the ones in the red area were misclassified, so it's showing us correct classifications and incorrect classifications, and then it's reporting all the statistics over here on the side   And then down below we see several different metrics plus a lot of graphs for allowing us to look at false classifications and also true classifications.   I'm going to close this and look at the   validation data.   So why is this useful? Well, you might have a particular scenario where you're interested in maximizing sensitivity   while maintaining a certain specificity. And sensitivity...and there are some definitions over here. Sensitivity is the true positive rate;   specificity is the true negative rate. This is a...this is a scenario where we want to look at disease progression, so we want to make sure we are...we are maintaining a high sensitivity rate   while also making sure that our specificity is high, alright. So what we can do with this is there's a slider here, and we can grab this slider,   and we can see how the classifications change as we change the cutoff for classification.   So I think this is a really powerful tool when you're looking at competing models, because you might have some models that have the overall with a cut off of .5, they might have the best misclassification rate.   But you might also have scenarios where, if you change the cut off of classification, different models might perform differently. So, for example, if I'm in a certain region here, I might find that the stepwise model is actually better.   Now to further illustrate this, I want to open up a different example.   And this example is called credit card marketing.   And if I go back to my slides just to introduce this scenario.   This is a scenario where   we've got a lot of data based on market research on the acceptance of credit card offers.   The response is, was the offer accepted. And this is a scenario where only 5.5% of the offers in the study were actually accepted.   Now there are factors that we're interested in, so there are different types of rewards that are offered,   and there are different mailer types. So this is actually a designed experiment. We're going to, kind of, really, kind of, ignore that that part...that aspect of this study.   And there's also financial information, so we're going to stick on...stick to one goal in this example, and that's the goal of identifying customers who are most likely to accept the offer.   And if we can identify customers that are most likely, in this scenario we might send offers only to the subset that is more likely to accept the offer and ignore the rest. So that's the scenario here.   And I'm going to open these data.   I've got 10,000 observations and my response is is offer accepted.   
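For reference, the two quantities being traded off with the decision threshold slider described above have these standard definitions (written out here for clarity):

\mathrm{Sensitivity} = \frac{TP}{TP + FN}, \qquad \mathrm{Specificity} = \frac{TN}{TN + FP}

and an observation is classified as a positive (a "high" or a "yes") whenever its predicted probability \hat{p} is at least the cutoff c, with c = 0.5 as the default. Moving the slider changes c, which moves counts between the four cells of the confusion matrix and therefore changes both rates at once.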
And I've already saved the script to the data table, so I've got air miles, mailer type, income level, credit rating, and a few other pieces of financial information. I ran the saved script, and it's going through running all the models. And neural, in this case, will take the most time because it's running a boosted neural. It will take a few more seconds. It's running support vector machines. Support vector machines will time out and actually won't run if I have more than 10,000 observations. I'm going to give it another second. I'm using standard validation for this, where I've got a validation column. And in this case, I've got a column of zeros and ones, and JMP will recognize the zeros for the training data and the ones for the validation data. Okay. There we go. Okay, so it ran, and if you're dealing with a large data table, there is a report you can run to look at elapsed times. And for this scenario, support vector machines actually took the longest time, and this is why, at times, they won't run if we have more than 10,000 observations. So let's look at these. So our best model, if I select dominant, is a neural boosted and a decision tree, but I want to point something out here. Notice the misclassification rate. The misclassification rate is identical for all of the models, except support vector machines. And why is this the case? Well, if I run a distribution of offer accepted (let me make this a little bit bigger so we can see it) and just focus in on the validation data, notice that 0.063, or 6.3%, of our observations were Yes, which is exactly the misclassification rate we just saw. And why is it doing this? I'm going to again ask for decision threshold. And focusing on this graph here, which has a lot of uses, in this case what it shows us is that our cutoff for classification is .5, but none of our fitted probabilities were even close to that, right. So as a result, the model either classified the no's correctly as no's or classified the yeses as no's. It never classified anything as a yes, because none of the probabilities were greater than .5. So if I stretch this guy out, right, I can see the difference in these two models. So the top probability was around .25 for the neural boosted, and for decision tree it was about .15. And notice that the decision tree is basically doing a series of binary splits, so I've basically got a few discrete predicted values, whereas neural boosted is showing me a nice random pattern in the points. So let me change this to something like .12. Right, and with the cutoff at .12, in fact, if I slide this around, notice that the lower I get, I actually start getting (I'm going to turn on the metrics here) some true positives. And I start getting some false positives. So as I drag this, you can see it in the bar, but the bar is kind of small, right. For neural boosted, I'm starting to see some true positives and some false positives. And as soon as I get past this first set of points, I start seeing them for decision tree too. So, using a cutoff of .5 doesn't make sense for these data, and again I might try to find a cutoff that gives me the most sensitivity while maintaining a decent level of specificity. In this case, I'm going to point out these two other statistics. F1 is the F1 score, and this is really a measure of how well we're able to capture the true positives.
MCC is the Matthews correlation coefficient, and this is a good measure of how well it classifies within each of the four possibilities. So I can have a false positive, a false negative, a true positive, or a true negative, corresponding to the four boxes here. MCC is a correlation coefficient that falls between minus one and plus one, and it measures how well I'm predicting in each one of those four boxes. So I might want to explore a cutoff that gives the maximum F1 value or the maximum MCC value. And let's say that I drag this way down. Notice that the sensitivity is growing quickly and specificity is starting to drop, so maybe at around .5, right, I reach a point where I'm starting to drop off too far in specificity. I might find a cutoff there, and at the bottom there's this option to set a profit matrix. If I set this as my profit matrix, basically what's going to happen is it will allow me to score new data using this cutoff. So if I set this here and hit okay, right, any future predictions that I make, if I save this out to the data table or to the formula depot, will use that cutoff. And this is a scenario where I might actually have some financial information that I could build into the profit matrix. So, for example, instead of using the slider to pick the cutoff, maybe I have some knowledge of the dollar value associated with my classifications. Maybe if the actual response is a no, but I think they're going to be a yes and I send them an offer, this costs me $5, right, so I have a negative value there. And maybe I have some idea of the potential profit, so maybe the potential profit over this time period is $100, and maybe I've got some lost opportunity. Maybe I say, you know, it's -100 if the person would actually have responded but I didn't send them the offer. So maybe this is lost opportunity, and sometimes we leave this blank. Now if I use this instead, I have some additional information that shows up, so it recognizes that I have a profit matrix. And now if I look at the metrics, I can make decisions on my best model based on this profit. So I'm bringing this additional information into the decision making, and sometimes we have a profit matrix and we can use that directly and sometimes we don't. And this is one of those cases where I can see that the neural boosted model is going to give me the best overall profits.
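As a sketch of how a profit matrix turns the choice of cutoff into an expected-value calculation, here is a small Python illustration. The dollar values are the illustrative ones mentioned above (-$5 for a wasted offer, +$100 for an accepted offer, -$100 for a missed acceptance), and the variable names are placeholders, not part of JMP.

import numpy as np

def total_profit(p_hat, y_true, cutoff,
                 cost_wasted_offer=-5.0, profit_accepted=100.0, cost_missed=-100.0):
    """Total profit of sending an offer to everyone whose predicted probability >= cutoff."""
    send = p_hat >= cutoff
    profit = np.where(send & (y_true == 1), profit_accepted, 0.0)      # offer sent and accepted
    profit += np.where(send & (y_true == 0), cost_wasted_offer, 0.0)   # offer sent, not accepted
    profit += np.where(~send & (y_true == 1), cost_missed, 0.0)        # would have accepted, no offer sent
    return profit.sum()

# Sweep candidate cutoffs on the validation data (p_valid and y_valid are hypothetical arrays)
# and keep the most profitable one for scoring new data:
# best_cutoff = max(np.linspace(0.01, 0.25, 25), key=lambda c: total_profit(p_valid, y_valid, c))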
This decision threshold, if you're dealing with categorical data, allows you to explore cutoffs for classification and also integrates the ability to include a profit matrix. And for any selected model, we can deploy the model out to the formula depot or save it to the data table. So a really powerful new tool. For more information on the classification metrics: before this, the F1 score and the Matthews correlation coefficient were relatively new to me, and to make sense of sensitivity and specificity, the Wikipedia article has some really nice examples and a really nice discussion. There are also some really nice resources for predictive modeling and for model screening. In the JMP user Community, there's a new path, Learn JMP, that has access to videos, including the Mastering JMP series videos. There was a really nice talk last year at JMP Discovery in Tucson by Ruth Hummel and Mary Loveless on which model to use when, and it does a nice job of talking about different modeling goals and when you might want to use each of the models. If you're brand new to predictive modeling, Module 7 of our free online course STIPS (Statistical Thinking for Industrial Problem Solving) is an introduction to predictive modeling, so I'd recommend this. There is a model screening blog that uses the diabetes data that I'll point out, and I also want to point out that there's a second edition of the book Building Better Models with JMP Pro coming out within the next couple of months. They don't have a new cover yet, but they're going to include model screening in that book. So that's all I have. Please feel free to post comments or ask questions, and I hope you enjoy the rest of the conference. Thank you.
Stuart Little, Lead Research Scientist, Croda   This presentation will show how some of the tools available in JMP have been successfully used to visualize and model historic data within an energy technology application. The outputs from the resulting model were then used to inform the generation of a DOE-led synthesis plan. The result of this plan was a series of new materials that have all performed in line with the expectations of the model. Through this approach, a functional model of product performance has been successfully developed. This model, alongside the visualization capabilities of JMP, has allowed for the business to begin to embrace a more structured approach to experimentation.     Auto-generated transcript...   Speaker Transcript Stuart Little Hi everyone, and welcome to this talk around how JMP is being used at Croda to help drive new product development. So what we're going to cover today, and firstly we're going to cover some context to Croda, how we are using JMP and where we are on that journey, and a summary of the problem we're trying to solve. Once you've been covered the problem, we then are going to move to JMP, look at how the tools and platforms in JMP have allowed easy data exploration and easy development of a structure performance model. And then finally we'll wrap this up by discussing the outcomes of this work and how by doing this kind of research, we've been able to increase buy in to the use of data and DOE techniques in and...in the research side of the business. So firstly, who we are as Croda. It's a question that does come up quite a lot, because we're a business-to-business entity. But as a business, Croda are the name behind a lot of high-performance ingredients and technologies, and behind a lot of the biggest and most successful brands across the world, across a range of markets. We create, make, and sell speciality chemicals. From the beginning of Croda, these have been predominantly sustainable materials. So we started by making lanolin, which is from sheep's wool, and we continually build on that sustainability. Last year we made a public pledge to be climate, land, and people positive by 2030 and have signed up to the UN Sustainable Development Goals as part of our push to achieve this and become the most sustainable supplier of innovative ingredients across the world. So, in terms of the markets we serve, we have a kind of very big personal care business where we deal with skincare and sun care and sort of hair care, color cosmetics, and those kind of traditional personal care products. Life sciences business, our products...our products and expertise help customers optimize their formulations and their active ingredient use. I mean, most recently, we in an agreement with Pfizer to provide materials that are going into their COVID-19 vaccine. Our industrial chemicals business, that's...that part of the business is responsible for supplying technically differentiated, predominantly sustainable materials to a huge range of markets. A lot of markets aren't quite...don't quite fit into anything else on this slide. And then finally we've got our performance technologies business. This covers a lot, again, a lot of similar areas, providing high performance answers across across all of these. And then today, in particular, we're talking about our energy technologies business, and specifically, kind of, battery technology in high performance electronics. 
So where we are with JMP at Croda is that we've been using JMP for about two years and we've had a lot of interest internally, but it's been harder to build confidence that these techniques have real value to research. And so to prove this, we've gone away and created a number of case studies that have been pretty successful on the whole. We've demonstrated the potential and some of the pitfalls within that. And all of that has then led to a slightly bigger set of projects, one of which is the one we're going to talk to you about today: how do we improve the efficiency of electrical cooling systems? The primary driver for this project is transport electrification, so that's battery vehicles. How do you maintain the battery properly? How do you make sure the motors are working at their optimum level? And how do you do that without electrocuting anyone? So currently there's a set of cooling methods for these things, and our customers are certainly looking at how they can be improved, because the better the control of your battery cooling, for instance, the better battery capacity you have and the more consistent the range will be. And because this is critical and there are lots of different applications that are broadly similar, the really useful thing for us would be to build an understanding of these fluids by having some sort of data-led model, and that's where JMP came in. So how can we do that? Well, the first thing we looked at was: what are the current cooling methods? For batteries, in the previous generation they're predominantly air cooled or cold-plate cooled. The electronics in the car have the opposite problem to the battery: they tend to get too hot, so we have heatsinks to try and take that energy away. And in electric motors, we're trying to minimize the resistance in there, so they tend to be jacketed with fluid. In all three of these cases, the incoming alternative method of cooling relies on a fluid, so that's direct immersion for batteries and electronics, and then for the electric motors it tends to be more of a flow. So what does that fluid look like? Obviously, we're dealing with high voltages, so we have to have something that's not electrically conductive. It also needs to have a really high thermal conductivity, so that it can pull heat out of the electronics. And because these fluids need to be moved around the system, the viscosity has to be low. So we have practical physical constraints that have been introduced by the application itself. If you look at it in a bit more depth, the ability of the fluids to transfer heat is based predominantly on this equation. And what this tells us is that there is a part that we can control through the fluid, which is the heat transfer coefficient, and then there is a part that is controlled by the engineering solution in the application: what's the area for cooling, and what are the temperatures of the surfaces that you're trying to cool? But in all cases, to get efficient heat transfer, we have to have a high heat transfer coefficient, and as that's the thing we can affect, that's where we looked. That heat transfer coefficient is defined, in a simplistic way, by this equation; there are other terms in there, but predominantly it's a function of the density, the thermal conductivity, the heat capacity, and the viscosity of the system.
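The two relationships being described can be written down explicitly. They are reconstructed here from standard convective heat transfer theory rather than copied from the slides, so the exact form used in the project may differ:

Q = h\,A\,(T_\mathrm{surface} - T_\mathrm{fluid})

so the heat removed Q depends on the heat transfer coefficient h, which the fluid designer can influence, and on the cooling area A and the surface and fluid temperatures, which belong to the engineering solution. For forced convection of a fluid in a channel, a Dittus-Boelter-style correlation gives the dependence of h on the fluid properties:

h \;\propto\; \frac{k^{0.6}\,\rho^{0.8}\,C_p^{0.4}\,v^{0.8}}{\mu^{0.4}\,D^{0.2}}

with thermal conductivity k, density \rho, heat capacity C_p, flow velocity v, viscosity \mu, and channel diameter D. Raising k, \rho and C_p while lowering \mu all raise h, which is exactly the set of targets listed next.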
So, if we look specifically at the applications we're interested in, if we want to optimize our dielectric fluid, we need to increase the density, increase the thermal conductivity, and increase the heat capacity, but alongside that, we need to reduce the viscosity of the fluid. And these match up pretty well with the engineering challenges that we have, which is helpful. So from that, we knew what the target was, and what we really wanted to do was understand the relationship between structure and product performance as a dielectric fluid. So initially we proceeded in a fairly traditional way and started conducting a large-scale study measuring the physical properties of a lot of esters and a lot of other materials. And then, when we saw that and looked at it, we thought, well actually, this data exists, so why don't we use these data sets to try and build some models and see whether we can really understand that physical property to structure to performance relationship. So that's where we're just going to pop into JMP, so just bear with me one second. Okay. So the first thing that we did was collate that mix of historic data and data that was being obtained through targeted testing by the applications teams. And once we'd got that into one place, we examined it in JMP to understand, at a really simple level, whether there is a relationship between the physical properties we're measuring. So, looking at that data set, the first port of call for me, as ever, is the distribution platform in JMP. And it's a really easy way just to see if something that you want has any kind of vague pattern anywhere else. In this case, if you say we want everything that's got a high thermal conductivity, what we see is that those samples are pretty stretched out across the other properties we've measured. So it doesn't really say, oh, there's a brilliant relationship, what you need is this, which is kind of what we expected. But it's nice to have a check. Similarly, if we then plot everything as scatterplots, what we see is a lot of noise. I mean, these lines of fit are just there for reference to show there isn't really any fit. In no way am I claiming any correlation on these. And while that was disappointing, the fact that there isn't an obvious answer was expected. Where it got interesting to us is, we said, well, we were expecting that there wouldn't be a clear relationship between any of these factors, because if there was, it would have been obvious to the experienced scientists doing the work, and we would have known that. So, then, we said, well, what we do know is that these properties all have a relation to the structure of the molecules. What happens if we calculate some physical parameters for these things and combine that with a number of structural identifiers and ways of looking at these molecules? What happens if we take that and add that data to the test data? Can we then build some kind of model that starts being able to estimate structure and performance? So that's exactly what we did. In this case, what we see is that, again, if we use the multivariate platform, just as a quick look to see if there's any correlation among some of these factors, there is clear differentiation in some cases, between up and down, and maybe little hints of correlation, but nothing clear that says this is the one thing that you need.
Again, this is what we expected. So then, what we did was use the regression platforms in JMP to try and understand whether we could build a model, and what that relationship looks like. To do this, we randomly selected a number of rows with the row selection tool in JMP. Generally we pulled out five samples at random, which weren't going to participate in the model, and then iteratively built up these models and refined them that way, so we always had a validation set from the initial data, just to check that what we were doing had any chance of success. So then, if we just look at the 80 degree models, the first model that we came to was this one. Clearly, as we can see, there are a number of factors included in this model that make no sense from a statistical point of view, because they're just overfitting and they are just non-significant. However, these are fairly important in terms of describing the molecules that are in there, so as a chemist, we created this model. So this is a model that allows molecules, if you like, to be designed for this application, even though we know it's overfitted. And we know that it's not really a valid model, because these terms are just driving the R squared up and up and up. We also built the model without those terms. This is a far better model in terms of estimating the performance of these things; the R squared is a touch lower, but all the terms that are in there have a significant impact on the performance. The downside of this model is it doesn't really help us design any new chemistry. But, in both cases, when we look at the predicted values against the actual measured values, we see a reasonable correlation between them. Certainly when we expect things to be high, they are. So that gave us some confidence that this model might actually perform for us. Then, in terms of how good this might be, we simply looked at the percentage difference between the predicted value and the measured value. And what we see is they are almost universally within 10%, predominantly within 5%, for either model. Again, across a range of different types of material, this gave us confidence that what we were seeing might be a real effect. All of which is very nice, but is this just an effect of the data we've measured? So what we did was use the profiler platform in JMP, produce a shareable model that we could send around the project team, and essentially set up a competition and say, look, whoever can find the highest thermal conductivity in this model from a molecule that could actually be made wins. From that we had a list of about 14 materials back that looked promising. We had to cut a few out because it was impossible to source the raw materials, so we ended up with about nine new materials that were synthesized and tested. Now these materials were almost exclusively made up of components the model hadn't really seen before. In some cases, part of the molecule would be the same, but they were quite distinct from the original materials. So once we'd made them and once they had been tested, we put them back into the model just to see what its predictive power was like. So if we have a look at that data, I think, given the differences of these materials, I was fully expecting this to break the model. However,
if we look at the predictions again, what we see is that the highlighted blue ones are the new materials that were made. We deliberately picked a couple that were lower, just out of curiosity, just to check. And all the ones that we picked that we thought would be high were high. So in the overfitted model, which had value from a structure-design point of view, what we see is one outlier. In the model that was statistically reasonable, we actually see a much better fit overall. And that was edifying: we can't yet design a single molecule and say, here you go, off you pop, here's the one thing you need to make, but we can certainly direct synthetic chemists to the right types of materials to really drive projects forward. So then, if we just look again at these residuals, what we see for the statistically good model with no overfitting is that everything was within 10% for all the new materials, which, for what we were trying to achieve, was good enough. There are a few in the overfitted model that were a little bit over 10%, but, again, this is kind of what I would expect to see. And it was nice that they were all in the right range, because it shows that this approach was having value, but it was also reassuring to find that they weren't all exactly right, because I think, had we produced nine materials and they'd all been within 1%, I'm not sure that people would have believed that either. So the fact is, we were getting a similar level of difference from the predictions for the materials we started with and for the new materials that we made. So we started having some real confidence in this model. And then, if we just go back to the slides a second. So what we can say then is that the structure performance relationship of these materials has been captured in JMP using the regression platforms. We've used the visualization tools in JMP to see that there are real benefits to doing this, and the model itself is being used to direct the synthesis of new materials in this project. It's being used to screen likely materials to test from things we already make. And there's an acceptable correlation in the results between the model and the new molecules we're making, all of which has given real confidence to this approach, and it's really allowed us to push this project further and split it out into specific target materials. So, in terms of new molecules, we've directed the synthesis of molecules with higher thermal conductivity. As you can see in this plot, all the new molecules are medium to high on that range of thermal conductivity, which is what we wanted to achieve from them. We demonstrated that we could target an improvement using data, and then verify that in the lab and make it. Where this project then becomes harder still is that we're now trying to build similar models for all the other factors that influence the performance of these dielectric fluids, and then we will be trying to balance those models against each other to find the best outcomes.
So all of that further development is ongoing, but that momentum has come purely from the ease of use of JMP and the platforms in it, to take a data set and, with a bit of domain knowledge, really push that forward and say, yep, here's a model that will help direct the synthesis for this project and subsequent projects in this area for Croda. So then, just in conclusion, data that we've obtained from testing has been used to successfully model the performance of these materials. It's not absolutely perfect, but it's good enough for what we want. The model demonstrates that there is a structure performance relationship for esters (sorry, not sure why my taskbar is jumping around). The model has been used to predict materials of high thermal conductivity. Those predictions were then verified, initially just by exclusion and then latterly by making new materials, really showing that this model holds for that type of chemistry. It's also demonstrated the possibility of tailoring the properties of, in this case, dielectrics, but also of other materials if you build similar models, so that you can start being able to create specific materials for specific applications. And I think most importantly for me, the real success of this work has built internal momentum to demonstrate that JMP is not a nice-to-have, it's a real platform to develop research, to very quickly look at data sets and say, is there something there? And with that, I'd just like to say thank you for watching. Obviously I can't answer any questions on a recording, but if you want to get in touch, feel free to comment in the Community. Yeah, thank you very much.
Philip Ramsey, Professor, North Haven Group and University of New Hampshire Wayne Levin, President and CEO, Predictum Trent Lemkus, Ph.D. Candidate, University of New Hampshire Christopher Gotwalt, JMP Director of Statistical R&D, SAS   DOE methods have evolved over the years, as have the needs and expectations of experimenters. Historically, the focus emphasized separating effects to reduce bias in effect estimates and maximizing hypothesis testing power, which are largely a reflection of the methodological and computational tools of their time. Often DOE in industry is done to predict product or process behavior under possible changes. We introduce Self-Validating Ensemble Models (SVEM), an inherently predictive algorithmic approach to the analysis of DOEs, generalizing the fractional bootstrap to make machine learning and bagging possible for small datasets common in DOE. Compared to classical DOE methods, SVEM offers predictive accuracy that is up to orders of magnitude higher. We have remarkable and profound results demonstrating SVEM capably fits models with more active parameters than rows, potentially forever changing the role of classical statistical concepts like degrees of freedom. We use case studies to explain and demonstrate SVEM and describe our research simulations comparing its performance to classical DOE methods. We also demonstrate the new Candidate Set Designer in JMP, which makes it easy to evaluate the performance of SVEM on your own data. With SVEM one obtains accurate models with smaller datasets than ever before, accelerating time-to-market and reducing the cost of innovation.     Auto-generated transcript...   Speaker Transcript Wayne Levin ...self-validating ensemble models, which we view as a paradigm shift in design of experiments. Just gonna...there we go. Our agenda here is a quick introduction about who's talking, and then the what, why and how of SVEM. So we'll have a quick overview of DOE and machine learning, describe blending them into SVEM, analyze some real world SVEM experiments, review some current research, demonstrate JMP's new candidate set designer, new in JMP 16, and then we'll end with the usual next steps and Q & A and so forth. SVEM is a remarkable new method to extract more insights with fewer experimental cycles and build more accurate predictive models from small sets of data, including DOEs. Using SVEM, we anticipate you're going to have less cost, you'll be faster to market, and you'll get a lot faster problem solving. We're going to explore all of that as we go forward in the session. I'm Wayne Levin and I'm the president of Predictum, and joining me is Chris Gotwalt, who's the chief data scientist at JMP, and also joining us is Phil Ramsey, who's a senior data scientist here at Predictum and associate professor at the University of New Hampshire. So let's just take stock here. JMP's contributions to data science are huge, as we all know, and in the DOE space they go back over 20 years now, to JMP 4 with the coordinate exchange algorithm. And then about eight or nine years ago, we saw definitive screening designs coming out. And we'd like to think that SVEM, as we call it for short, self-validating ensemble modeling (try and say that fast three times), is another contribution, blending machine learning and design of experiments. It overcomes the limitations of small amounts of data, right.
With small amounts of data, you normally can't do machine learning; you have to have large amounts of data to make that happen. Now we've been trying SVEM in the field, we've had a number of companies approach us on this, and we've made a trial version of SVEM available. We at Predictum are the first to create a product in this area and bring it to market, and I just want to share with you some of the things we've learned so far. One is that SVEM works exceptionally well. I mean, when you have more parameters than runs, it works very well with higher order models. It does give more accurate predictions, and it also helps recover from broken DOEs. So, for example, if you have definitive screening designs that for whatever reason can't be analyzed with fit definitive screening. We've had some historical DOEs sent to us that lack power; they just didn't find anything, but with SVEM, actually, it did; it was able to. And we've had cases where there were some missing or questionable runs. Something else we've learned is that it's not going to help you with fractional factorials, things like that. We're interested in predictive models here, and fractional factorials really don't give predictive models; they're not designed for that. So it's not going to give you something that the data is not structured to deliver, at least not to that extent. The other thing we've learned is that we've not yet tested the full potential of SVEM. To do that, we really need to design experiments with SVEM in mind to begin with, and that means looking at more factors. We don't want to just look at three or four factors. How about 10, 15 or even more? We understand from the research that Chris and Phil will be talking about that Bayesian I-optimal designs are the best approach for use with SVEM, and also that mixture designs are particularly useful for SVEM, and I'll have a little something more to say about that later. And as far as SVEM goes, if you'd like to try it out, we'll have some information at the end, after Chris and Phil, to talk about that. And so with that said, I'm going to throw it over to you, Chris, who will go into the what and the why and the how of SVEM. So off to you, Chris. Chris Gotwalt Alright, great, well, thank you, Wayne. Oh wait, am I sharing my screen? Wayne Levin Not yet. There we go. Chris Gotwalt Okay, all right, well, thank you, Wayne. So now I'm going to introduce the idea of self-validating ensemble models, or SVEM. I'm going to keep it fairly basic. I'll assume that you have some experience of design and analysis of experiments. Most importantly, you should understand that the statistical modeling process boils down to fitting functions to data that predict a response variable as a function of input variables. At a high level, SVEM is a bridge between machine learning algorithms used in big data applications and design of experiments, which is generally seen as a small data analysis problem. Machine learning and DOE each have a long history, but until now, they were relatively separate traditions. Our research indicates that with SVEM you can get more accurate models using fewer observations.
In particular, we believe that SVEM has tremendous promise in the pharma and biotech industries, in high technology like semiconductor manufacturing, and in the development of consumer products. In my work here at JMP, I've had something of a bird's eye view of how data is handled in a lot of industries and areas of application. I've developed algorithms for constructing optimal designed experiments as well as related analysis procedures like mixed models and generalized linear models. On the other hand, I'm also the developer of JMP Pro's neural network algorithms, as well as some of the other machine learning techniques that are in the product. Over the last 20 years I've seen machine learning become more prominent for modeling large observational data sets, and I've seen many new algorithms developed. At the same time, the analysis of designed experiments has also changed, but more incrementally over the last 20 years. The overall statistical methodology of industrial experimentation would be recognizable to anyone that read Box, Hunter and Hunter in the late 1970s. Machine learning and design of experiments are generally applied in industry to solve problems that are predictive in nature. We use these tools to answer questions such as: is the sensor indicating a faulty batch, or how do we adjust the process to maximize yield while minimizing costs? Although they are both predictive in nature, machine learning and design of experiments have generally been applied to rather different types of data. What I'm going to do now is give a quick overview of both and then describe how we blended them into SVEM. I'll analyze some data from a real experiment, to show how it works and why we think it's doing very well overall. I will go over results of simulations done by Trent Lemkus, a PhD student at the University of New Hampshire, which have given us confidence that the approach is very promising. Then I'll go over an example that shows how you can use the new candidate set designer in JMP 16 to find optimal subsets of existing data sets so you can try out SVEM for yourself. I'll highlight some of our observations, mention some future questions that we hope to answer, and then I'll hand it over to Phil, who will go through a more in-depth case study and a demo of the SVEM add-in. Consider the simple data on the screen. It's data from a metallurgical process, and we want to predict the amount of shrinkage at the end of the processing steps. We have three variables on the left that we want to use to predict shrinkage. Modeling amounts to finding a function of pressure, time, and temperature that predicts the response (shrinkage) here. Hopefully this model will generalize to new data in the future. In a machine learning context, the standard way to do this is to partition the data into two disjoint sets of rows. One is called the training set, and it's used for fitting models; to make sure that we don't overfit the data, which leads to models that are inaccurate on new observations, we have another subset of the data, called the validation set, that is used to test out various models that have been fit to the training set. There's a trade off here, where functions that are too simple will fit poorly on the training and validation sets because they don't explain enough variation.
Whereas models that are too complex can have an almost perfect fit on the training set but will be very inaccurate on the validation set. We use measures like validation R squared to find the goldilocks model that is neither too simple nor too complicated. This goldilocks model will be the one whose R squared is the highest on the validation set. This hold-out style of model selection is a very good way to proceed when you have a quote unquote reasonably large number of rows, often in the hundreds to millions of rows or more. The statement "reasonably large" is intentionally ambiguous and depends on the task at hand. That said, the 12 rows you see here is really too small and is used just for illustration. In DOE, we're usually in situations where there are serious constraints on time and/or resources. A core idea of designed experiments, particularly the optimal designed experiments that are a key JMP capability, is obtaining the highest quality information packed into the smallest number of rows possible. Many brilliant people over many years have created statistical tools like F tests, half-normal plots, and model selection criteria like the AIC and BIC that help us make decisions about models for DOE data. These are all guides that help us identify which model we should base our scientific, engineering and manufacturing decisions on. One thing that isn't done often is applying machine learning model selection to designed experiments. Now why would that be? One important reason is that information is packed so tightly into designs that removing even a few observations can cause the main effects and interactions to collapse, so that we're no longer able to separate them into uniquely estimable effects. If you try to do it anyway, you'll likely see the software report warnings like lost degrees of freedom or singularity details added to the model report. Now we're going to conduct a little thought experiment. What if we tried to cheat a little bit? What if we copied all the rows from the original table, the Xs and Ys, labeled one copy "training" and labeled the other copy "validation"? Now we have enough data, right? Wrong. We're not fooling anybody with this crazy scheme. It's crazy because if you tried to model using this approach and looked at any index of goodness of fit, you'd see that making models more complicated leads to a better fit on the training set, but because the training and validation sets are the same in this case, the approach will always lead to overfitting. So let's back up and recast machine learning model selection as a tale of two weighted sums of squares calculated on the same data. Instead of thinking about model selection in terms of two data partitions, we can think of it as two columns of weights. There's a training weight that is equal to one for rows in the training set and zero otherwise, and a validation column that is equal to zero for the training rows and equal to one for the validation rows. So from machine learning model selection, we can think of each row as having its own pair of weight values, where both weights are binary zeros and ones. Having them set up in this way is what gives us an independent assessment of model fit.
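One compact way to write the two weighted sums of squares described here (the notation is illustrative, not from the talk): for rows $i = 1, \dots, n$ with response $y_i$, model $f(x_i;\beta)$, training weights $w_i^{\mathrm{train}}$ and validation weights $w_i^{\mathrm{valid}}$,

$$\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{n} w_i^{\mathrm{train}} \bigl( y_i - f(x_i;\beta) \bigr)^2, \qquad \mathrm{SSE}_{\mathrm{valid}} = \sum_{i=1}^{n} w_i^{\mathrm{valid}} \bigl( y_i - f(x_i;\hat{\beta}) \bigr)^2,$$

where classical hold-out validation is the binary case $w_i^{\mathrm{train}} \in \{0,1\}$ with $w_i^{\mathrm{valid}} = 1 - w_i^{\mathrm{train}}$.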
Models are fit by finding parameter values that minimize the sum of squared errors weighted using the training weights, and the generalization performance is measured using the sum of squared errors weighted using the validation column weights. If we plot these weight pairs in a scatterplot, we see, of course, that the training and validation weights are perfectly anti-correlated. Now, what if, instead of forcing ourselves to use perfectly anti-correlated pairs of binary values, we relax the requirement that the values can take on only zero and one and allow the weights to take on any strictly positive value? To do that, we have to give something up. In particular, to my knowledge, for this to work mathematically we can no longer have perfectly anti-correlated weight pairs, but we can do this in a way that still gives us a substantial anti-correlation. There are many ways this could be done, but one way is to create weights using the same scheme as the fractionally weighted bootstrap, where we use an exponential distribution with a mean of one to create the weights. And there's a little trick using the properties of the uniform distribution that we can use to create exponentially distributed, but highly anti-correlated, weight pairs. When we use this as a two-column weighted validation scheme, we call it autovalidation. Phil and I have given several Discovery papers on this very idea. Using this approach is like having a training column of weights with a mean and variance of one, and a validation column with the same properties. Under this autovalidation scheme, if a row contributes more than average weight to the training sum of squares, that row will contribute less to the validation sum of squares, and vice versa. I want to point out that there's an add-in that Mike Anderson created that sets data sets up in this kind of format, which JMP Pro's modeling platforms can consume. Now recently Phil and I have taken on Trent as our PhD student. Over the spring and fall, Trent did a lot of simulations trying out many different approaches to fitting models to DOE data, including an entire zoo of autovalidation-based approaches. I'll show some of his results here in a little bit. Suffice to say, the approach that worked consistently the best, in terms of minimum average error on an independent test set, was to apply autovalidation in combination with a variable selection procedure such as forward selection, but instead of doing this once, we repeat the process dozens or hundreds of times, saving the prediction formula each time and ultimately averaging the models across all these autovalidated iterations. We call this general procedure self-validated ensemble models, or SVEM. So, to make things a little more concrete, I'm going to give a quick example that illustrates this. We begin by initializing the autovalidation weights. We apply a model selection procedure, such as the Lasso or forward selection, then save the prediction formula back to the table. Here's the first prediction formula. In this iteration, just two linear main effects have been selected. Then we reinitialize the weights, fit the model again using the new set of weights, and save that prediction formula. Here's the second autovalidated prediction formula.
Note that this time a different set of main effects and interactions was chosen, and the regression coefficients are different this time. We repeat this process a bunch of times. And at the end, the SVEM model is simply the average across all the prediction formulas. Here's a succinct diagram that illustrates the SVEM algorithm. The idea of combining the bootstrap procedure with model averaging happens over and over again: after we reinitialize the weights, we save the prediction formula from that iteration. Here the illustration is showing the parameters, but it's the same thing with the formulas. Just save them all out, and then at the end of the day, the final model is simply the average across all the iterations. Once we're done, we can use that SVEM model in the graph profiler with expand intermediate formulas turned on, so that we can visualize the resulting model and do optimization of our process and so forth. Now I'm going to go over the results from Trent's dissertation research. The base designs were Box-Behnken designs and DSDs in four and eight factors. Each simulation consisted of 1,000 simulation replications, and each simulation had its own set of true parameter values for our quadratic response surface model. The active parameters were doubly exponentially distributed, and we explored different percentages of active effects from 50% to 100%. For each of the 1,000 simulation reps, the SVEM prediction was evaluated relative to the true model on an independent test set that consisted of a 10,000-run space filling design over the factor space. We looked at a large number of different classical single-shot modeling approaches, as well as a number of variations on autovalidation, but it's easier just to look at the most interesting of the results since there was too much to compare. The story was very consistent across all the situations that we investigated. This is one of those simulations where the base design was a definitive screening design with eight factors and 21 runs. In this case 50% of the effects were non-zero. This means that the model had just about as many non-zero parameters as there were runs. On the left are the box plots of the log root mean squared error for the Lasso, Dantzig selector, forward selection and pruned forward selection, all tuned using the AICc, the best performing of those methods. The best performing of the methods from the Fit Definitive Screening platform is next, followed by SVEM applied to several different modeling procedures over on the right. There is a dramatic reduction in independent test set root mean squared error when we compare the classical test set RMSEs relative to the SVEM predictions. And note that this is on a log scale, so this is a pretty dramatic difference in predictive performance. Here's an even more interesting result. This was the plot that really blew me away. This is the same simulation setup, except here all of the effects are non-zero. So the true models are supersaturated, not just in the design matrix but in the parameters themselves. In the past we have had to assume that the number of active effects was a small subset of all the effects, and certainly smaller than the number of rows. Here we are getting reasonably accurate models despite the number of active parameters being about two times the number of rows, which is truly remarkable.
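To make the general procedure just outlined concrete outside of JMP, here is a minimal Python sketch of the SVEM idea under some simplifying assumptions of our own: the anti-correlated exponential weight pairs are generated from a shared uniform draw (one natural version of the uniform-distribution trick mentioned above), the selection step is a plain weighted forward selection rather than JMP Pro's Generalized Regression, and the final model is the average of the per-iteration coefficient vectors. It is an illustrative sketch, not the JMP Pro or Predictum implementation, which saves and averages prediction formulas instead.

import numpy as np

def autovalidation_weights(n, rng):
    # One shared uniform draw per row gives exponential(mean 1) weights that
    # are strongly anti-correlated between the training and validation copies.
    u = rng.uniform(size=n)
    return -np.log(u), -np.log(1.0 - u)

def weighted_sse(X, y, w, coef):
    resid = y - X @ coef
    return np.sum(w * resid ** 2)

def weighted_forward_selection(X, y, w_train, w_valid, max_terms):
    # Greedy forward selection: fit by training-weighted least squares and
    # keep adding the term that most improves the validation-weighted SSE.
    n, p = X.shape
    selected = [0]                          # column 0 is assumed to be the intercept
    best_coef, best_valid = np.zeros(p), np.inf
    for _ in range(max_terms):
        candidate, cand_coef, cand_valid = None, None, best_valid
        for j in range(p):
            if j in selected:
                continue
            cols = selected + [j]
            sw = np.sqrt(w_train)
            beta, *_ = np.linalg.lstsq(sw[:, None] * X[:, cols], sw * y, rcond=None)
            coef = np.zeros(p)
            coef[cols] = beta
            v = weighted_sse(X, y, w_valid, coef)
            if v < cand_valid:
                candidate, cand_coef, cand_valid = j, coef, v
        if candidate is None:               # no remaining term improves the validation SSE
            break
        selected.append(candidate)
        best_coef, best_valid = cand_coef, cand_valid
    return best_coef

def svem(X, y, n_iter=200, seed=1):
    # X must already contain the model terms (intercept, mains, interactions, ...).
    rng = np.random.default_rng(seed)
    fits = []
    for _ in range(n_iter):
        w_tr, w_va = autovalidation_weights(len(y), rng)
        fits.append(weighted_forward_selection(X, y, w_tr, w_va, X.shape[1]))
    return np.mean(fits, axis=0)            # the SVEM model: average of all the fits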
Now I'm going to go through a quick illustration of SVEM to show how you can apply it to your own historical DOEs. I'm going to use a real example of a five-factor DOE from a fermentation process, where the response of interest was product yield as measured by pDNA titer. The great advantage with SVEM is that we can get more accurate models with fewer runs. This means we can take existing data sets and use the new candidate set designer in JMP 16 to identify the subset of rows that forms the quote unquote predictive heart of the original design. I'll take the original data, which had 36 rows, load the five columns in as covariates, and tell the custom designer to give me the 16-row subset of the original design that gives the best predictive performance at that size. Then I'll hide and exclude all the rows that are not in that subset design, so that they can later be used as an independent test set. I'll run the SVEM model and compare its performance to what you get if you apply a standard procedure like forward selection plus AICc fit to the same 16 runs. I do want to say that the new candidate set designer in JMP 16 is also a remarkable contribution, and I just want to call out how incredible Brad's team has been in the creation of this new tool, which is going to be profoundly useful in many ways, including with SVEM. So to do this, we take our data set and go to the custom designer. We select covariate rows, which is a new outline node in JMP 16. Then we select the five input factors. Click continue. We press the RSM button, which automatically converts the optimality criterion to an I-optimal, or predictive-type, design. We select all the non-intercept effects, right-click in the model effects list, and set their estimability to If Possible. This is what allows us to have models with more parameters in them than there are runs in our base design, as the design criterion will now be a Bayesian I-optimal design. Now we can set the number of runs equal to 16. Click make design. Once we've done that, the optimizer works its magic, and all the rows that are in the Bayesian I-optimal subset design are now selected in the original table. We can go to the table and then use an indicator or transform column to record the current row selection. I went ahead and renamed the column with a meaningful name for later. We can select the points not in the subset design by selecting all the rows where the new indicator column is equal to zero, and then we can hide and exclude those rows so that they will not be included in our model fitting process. Phil will be demoing the SVEM add-in, so I'll skip the modeling steps and go right to a model comparison of forward selection plus AICc and the SVEM model on the rows not included in the Bayesian I-optimal subset design. You can see that comparison in the red outlined part of the report here. We see that SVEM has an R squared on the independent test set of .5, and the classical procedure has an R squared of .22, so we're getting basically twice the amount of variation explained with SVEM than with the classical procedure. We can also compare the profile traces of the two models when we apply SVEM to all 46 observations. It's clear that we're getting basically the same model when we apply SVEM to all 46 runs as we're getting with just 16 runs under SVEM.
But the forward selection based model is missing curvature terms, not to mention that a lot of interactions are missing as well. This is a fairly simple procedure that you can use to test SVEM out on your own historical data. So overall, SVEM raises a lot of questions. Many of them are centered around the role of degrees of freedom in designed experiments, as we are now able to fit models where there are more parameters than rows, meaning that there would effectively be negative degrees of freedom, if there were such a thing. I think this will cause us to reconsider the role of p-value based thinking in industrial experimentation. We're going to have to work to establish new methods for uncertainty analysis, like confidence intervals on predictions. Phil and I are doing some work on trying to understand what the best family of base models to work from is. This could be quadratic RSMs, and we're also looking at the partial cubic models that were proposed a long time ago but that we now believe are worth reconsidering. What kinds of designs should we use? What sample sizes are appropriate as a function of the number of factors in the base model that we're using? What is the role of screening experiments? And one big unknown is what the role of blocking and split-plot experimentation is in this framework. So now I'm going to hand it over to Phil. He's going to do a more in-depth case study and also demo Predictum's SVEM add-in. Take it away, Phil. Philip Ramsey Okay, thank you, Chris. What I'm going to do is discuss a case study, so let me put this in slideshow mode and I'll do some illustration, as Chris said, of the Predictum add-in that actually automates a lot of the SVEM analysis. What we're looking at is an analytical method used in the biotech industry. This one is for characterizing the glycoprofiling of therapeutic proteins, basically proteins such as antibodies. Many of you who work in that industry know glycoproteins are a very rich source of therapeutics, and you also know that if you work under cGMP, you have to demonstrate the reliability of the measurement systems. And actually, for glycoprofiling, fast, easy-to-use analytical methods are not really fully developed. So the idea of this experiment was to come up with a fairly quick and easy method that people could use that would give them an accurate assessment of the different (I am not a chemist here) sugars that have been attached to the base protein post-transcription. To give you an idea of chromatography (I'm going to assume you have some familiarity with it), basically it's a method where you have some solution, you run it through a column, and then, as the different chemical species go through the column, they tend to separate and come out of the column at different times. They basically form what is called a chromatogram, where the peak is a function of concentration, and the time at which the peak occurs is important for identifying what the species actually is. So in this case, we're going to look at a number of sugars. I'm simply going to call them glycoforms. What is going on here is that the scientist who did this work developed a calibration solution. And we know exactly what's in the calibration solution. We know exactly what peaks should elute, and roughly where.
And then we can compare an actual human antibody sample to the calibration sample. Some of these glycoforms are charged, and it's difficult to get them through the column; they tend to stick to it, so we're using what is sometimes called a gradient elution procedure, and I won't get into the details. What we're doing here is using something called high performance anion exchange chromatography. I'm not an expert on it, but the scientist I've worked with has done a very good job of developing this calibration. The reason we need a calibration solution is that historically the sugars that elute from a human antibody sample are not entirely predictable as to where they're going to come out of a column, so we need something that we can use for calibration. The person who did the work, and I'm going to mention her in a moment, designed two separate experiments. One is a 16-run, three-level design, and then later she came back and did a bigger 28-run, three-level design. So one could be used as a training set and the other as a validation set, but to demonstrate the covariate selector and Bayesian I-optimal strategy that Chris talked about, both designs were combined into a single 44-run design. We then took that design into custom design with the covariate selector and created various Bayesian I-optimal designs. We could have done more combinations, but we only have so much time to talk about this. So there are three designs. One has 10 runs (remember, there are five factors and we're doing 10 runs), then a 13-run design and a 16-run design, and again, we could do far more. So what we want to do is see how these designs perform after we use SVEM to fit models. And by the way, as Chris mentioned, in each one of these designs, the runs that are not selected from the 44 are then used as a validation or test set to see how well the model performed. Okay, so these are the factors: the initial amount of sodium acetate, the initial amount of sodium hydroxide, and then three separate gradient elution steps that take place over the process time, which runs to roughly 40 minutes. So these are the settings that are being used and manipulated in the experiment. There are actually, in this experiment, 44 different responses one could look at. Given the time we have available, I'm going to look at two. One is the retention time for glycoform 3, and this is the key; this is what anchors the calibration chromatogram. We'd like it to come out at about eight and a half minutes because that aligns nicely with human antibody samples. It's actually fairly easy to model. The second response is the retention time of glycoform 10. It's a charged glycoform; it elutes very late, tends to be bunched up with other charged glycoforms, and is harder to distinguish. Okay, so those are the two responses I'll look at. And for those who aren't familiar with chromatography, here are a couple of what are called chromatograms. These are essentially pictures of what the responses look like, and in this case, if you take a look at the picture, we're going to look at the retention time, that is, at what time did Peak 3 come out, and at what time did Peak 10 come out and show up in the chromatogram? And, by the way, notice between the two chromatograms I'm showing how different the profiles are.
So in this experiment, when we manipulated those experimental factors (and by the way, we have 44 of these chromatograms; I'm showing you two of them), you see a lot of differences in the shapes, the retention times, and the resolution of these peaks. So whatever we're doing, we are clearly manipulating the chromatograms. And for those of you who are curious, yes, we are thinking about treating these in the future as functional data and about how functional data analysis might be used. That is for the future, but it's definitely something that we're very interested in; chromatograms really are functional data in the end. Here, though, I'm just going to extract two features and try to model them. Okay. I also want to give credit to Dr Eliza Yeung of Cytovance Biologics, a good friend and excellent scientist. She did all the experimental work, and she was the one who came up with the idea of constructing what she calls the glucose ladder, the calibration solution. And Eliza, besides being a nice person, also loves JMP, so we like Eliza in many ways. Excellent scientist. Okay, so before I get into SVEM, I just want to mention something: Chris mentioned the full quadratic model that's been used since, what, 1950 as the basis of optimization, basically all the main effects, two-factor interactions and squared terms. In point of fact, for a lot of complex physical systems, the kinetics are actually too complex for that model. Cornell and Montgomery, in a really nice paper in 1998, pointed this out and suggested that in many cases what they call the interaction model (we call it the partial cubic model) may work better. And in my experience, using SVEM and applying this model to a lot of case studies, they are right. It does give better predictive models, but there's a problem. These partial cubic models can be huge, and they are a big challenge to traditional modeling strategies, where supersaturated models, as Chris mentioned, are a big problem. How many potential terms are there? Take K squared, multiply by two, and add one for the intercept. So for five factors, the full partial cubic model would have 51 terms. By the way, I'm going to use these models, but I'm only going to use the 40 terms that are usually most important. And then we're going to use self-validating ensemble modeling to fit the models and, as Chris said, SVEM has no problem with supersaturated models. By the way, in the machine learning literature, supersaturated models are fit all the time, and using the right machine learning techniques, they can actually show that you get better prediction performance. This is largely unknown in traditional statistics. So we're going to use the SVEM add-in, and as I mentioned, if you'd like to learn about it, you can contact Wayne Levin, who's already spoken, at Predictum, and I'm sure Wayne would love to talk to you. So let's go over to JMP briefly and I'll show you how the add-in actually works. Let me bring over JMP. I've installed the add-in, so it has its own tab. I click on Predictum. By the way, this is one of the Bayesian I-optimal designs we created, and this one has 16 runs. Select self-validating ensemble modeling. You see what looks like your typical Fit Model launch dialog. I'm going to select my five factors, and I'm just going to do a response surface model at this point for illustration, and I want to model retention time for glycoform 3.
Notice in the background that this set up the autovalidation table, so there's a weighting function and an autovalidation portion. This is all in the background; you don't need to see it. The add-in created it and then basically hides it. You can look at it if you want, but it's hidden because it's not terribly important to see. So now we open up GenReg; this is the GenReg control panel. And for illustration (by the way, SVEM is really agnostic in terms of the method you use with it for model building; in fact, you could use methods that aren't even in GenReg, but that's our primary focus today) we're going to do forward selection. And because we only have so much time, I'm only going to do 10 reps. Click go and we'll create the model. So here is the output for only 10 reps, and you get an actual by predicted plot. But what's really nice about this is that it actually creates the ensemble model average for you. So I click on self-validating ensemble model, save prediction formula. I'm going to go ahead and close this display, come over, and there's my model. I can now take this model and apply it to the validation data. We've got the 16 runs that were selected, so there are another 28 available that can be used as a validation set to see how we actually did. At this point I'm going to go back to the presentation. And, by the way, without the SVEM add-in it can be a little bit difficult, especially if you aren't particularly proficient in JMP scripting, to actually create these models. So that add-in, even if it may not look like it, did an awful lot of work in the background. Okay, so how did we do? Let me put this back on slideshow. As I said, there are three designs: a 16-run design, a 13-run design, and then I picked a 10-run design; we really pushed the envelope. And there are two responses. As I said, glycoform 10 elutes late, tends to be noisy, and can be hard to resolve. So for the 16-run design, I actually fit a 40-term partial cubic model. Keep that in mind: 16 runs, and I fit a 40-term model. The key to this is the root average square error on the validation set. So I fit my 40-term model, and the root average square error on the validation set was low and the validation R squared was actually close to .98, so it fit very, very well. Notice that even for the much noisier glycoform 10, we still got a pretty good model. I'll show you some actual by predicted plots. We did this for the 13-run design and, once again, we got very good results. And then finally for the 10-run design. By the way, for 10 runs I just fit the standard full quadratic model; I felt like I was really pushing my luck, but I could have gone ahead and tried the partial cubic. And notice, once again, even with 10 runs, modeling glycoform 3 retention time, I got an R squared of .94. To me, that is really impressive, and again, my model had twice as many predictors as there were observations, and even with the difficult-to-model glycoform 10, we did quite well. So, here are some actual by predicted plots, and again I picked glycoform 10 because it's really hard to model. I didn't want to game the system too much; I know from experience it's hard to model. So there is an actual by predicted plot, and there's another actual by predicted plot for the 10-run design. It still does pretty well when you consider what we're trying to do here. And then finally, how did we do overall?
So this is a little table I put together. I took the 16-run design as the baseline, and then I'm comparing what happened when we fit models to the 10-run and 13-run designs for the retention time of G10. You'll notice that we got a rather large error for 13 and a much smaller one for 10. I don't know why; I chose to just show what the results were. It's a bit of a mystery, and I have not gone back to explore it. But for retention time 3, notice I go from 16 runs down to 10. Yes, I got some increase in the root average square error, but if you look at the actual by predicted plot, it still does a good job of predicting on the validation set. The point of this, and I work a lot with biotech companies, is the efficiencies you could potentially achieve in your experimental designs. I know for many of you these experiments are costly and, even more important, they take a lot of time. This really can shorten your experimental time, reduce your experimental budget, and get you even more information. So just a quick conclusion to this. This SVEM procedure, as Chris showed, is something we've investigated in simulations and case studies. SVEM is great for supersaturated models. I won't get into it, but from machine learning practice, supersaturated models are known to perform very well; they are used all the time in deep learning, as an example. Basically, as Chris says, SVEM is combining machine learning with design of experiments. And even more important, once you start going down the pathway of SVEM, that, in turn, informs how you think about experimenting. So basically, using SVEM, we can start thinking about highly efficient experiments that can really speed up the pace of innovation and reduce the time and cost of experimentation. With the lead times to develop products and processes getting shorter and shorter, this is not a trivial point, and I know many of you in biotech are also faced with serious budget constraints (by the way, most people are). And then finally, one promising area is Bayesian I-optimal designs. Again, I'll mention that Brad Jones and his group have done a great job introducing these in JMP; it's the only software package I know of that actually does Bayesian I-optimal designs. And we think SVEM is going to open up a window to a lot of possibilities. So basically, that is my story, and I'm going to turn it over to Wayne Levin. Wayne Levin Thanks very much for that, Phil and Chris. I hope everybody can see that an awful lot of work, a lot of research, has gone into this, including some work in the trenches, you might say. I've been in this game for over 30 years now, and I don't mind telling you I'm very excited about what I'm seeing here. I think this can be really transformative; it can really change how industrial experiments are conducted. I was previously very excited by supersaturated designs alone, and that was facilitated with the custom designer; that's on the design side of things. Now, when you bring in SVEM on the analysis side, those are two great tastes that go great together, you might say.
So anyway, just to conclude, I'm going to slide in here that Phil, myself and Marie Goddard are putting together an on-demand course (it should be available next quarter, in April) on mixture designed experiments, and we're going to focus a good amount of time on SVEM in that course. So if that's of interest to you, please let me know so we can keep you informed, or follow us on LinkedIn. And for the SVEM add-in, again, if I may, I'm delighted we were first to market with this. We've been working hard on it for a number of months. We do a new release about every month, so it is evolving as we get feedback from the field, and I want to thank Phil very much for leading this. He's been partnering with one of our developers (???) and it's really been a terrific effort, a lot of hard work, to get to this point, and we want to make that available to you. As Phil mentioned earlier, there are really two ways for you to try this. One is just to try the add-in itself. It works with JMP 14 and 15, and with 16 coming out, I think we're going to have it working for 16 as well. It does require JMP Pro, so if you don't have JMP Pro, there's a link here where you can apply for a JMP Pro trial license. You can get this link, of course, by downloading the slides for this presentation. So that's one way you can work with it and try it yourself. Another way is to just contact us, and we can work with you on a proof-of-concept basis. We could do some comparative analyses and so on, and of course review that analysis with you. We'll put the necessary paperwork in place; we don't mind trying it out. We've done that for a good number of companies, and that may be the easiest way for you, whatever works for you. Okay, for now, what I'd like to do is open it up for any questions or comments at this time. I'll say on behalf of Chris and Phil, thanks for joining us. So okay, questions and comments, please.
Phil Kay, JMP Senior Systems Engineer, SAS Peter Hersh, JMP Technical Enablement Engineer, SAS   Scientists and engineers design experiments to understand their processes and systems: to find the critical factors and to optimize and control with useful models. Statistical design of experiments ensures that you can do more with your experimental effort and solve problems faster. But practical implementation of these methods generally requires acceptance of certain assumptions (including effect sparsity, effect heredity, and active or inactive effects) that don’t always sit comfortably with knowledge and experience of the domain. Can machine learning approaches overcome some of these challenges and provide richer understanding from experiments without extra cost? Could this be a revolution in the design and analysis of experiments? This talk will explore the potential benefits of using the autovalidation technique developed by Chris Gotwalt and Phil Ramsey to analyze industrial experiments. Case study examples will be presented to compare traditional tools with autovalidation techniques in screening for important effects and in building models for prediction and optimization.   We were not able to answer all questions in the Q&A, so here goes...   I see the weights are not between 0 and 1; how are they generated? They are "fractionally weighted". This is the key trick that enables the method. In Gen Reg in JMP Pro we use this in the Freq role. You can use this add-in for JMP Pro 15 to generate the duplicate table with weighting. We can recommend a talk by Chris Gotwalt and Phil Ramsey to explain the deep statistical details of the weighting.   When you duplicate rows, are they duplicated with the same output result? What is the purpose of this duplication? Yes, they are duplicated with the same response. It would be a bad idea just to do that and progress with the analysis. The key innovation here is that the rows are fractionally weighted, such that they have a low weighting in one portion of the validation scheme and a higher weighting in the other. For example, a run with a low weighting in "training" will have a duplicate pair that has a high weighting in "validation". Some duplicate pairs will have more equal weighting in training and validation.   We fit a model using these weightings to determine how much of each run goes into training and how much goes into validation. Then we redraw the weights and fit another model. We repeat this process hundreds of times. We then use the results across all these models, either to screen the important effects from their proportion nonzero, or to build a useful model by averaging all the individual models. Is it possible to run the add-in in the normal, non-Pro, JMP 15 version? Quite possibly. You can run the add-in, but you will need JMP Pro to utilize the validation column in the analysis; this is critical. How does it work with categorical variables? We presented examples with only continuous variables. A nice feature is that it would work for any model and with any variable types. Is this a recognized method now? In a word, no. Not yet. Our motivation is to make people aware of this method and to stimulate those who have an interest to explore it and provide critical feedback. As we made clear in the presentation, we do not recommend that you use this as your only analysis of an experiment. If you like the idea, try it and see how it compares with other methods.
Having said that, the ideas of model averaging and holdback validation are recognised in larger data situations. It seems that this should be beneficial in the world of smaller data that is designed experiments. Do you do the duplicate step manually? No, this is done as part of the autovalidation setup by the add-in, or by the platform in JMP Pro 16. You could easily create the duplicate rows yourself. The harder part is setting up the fractional weighting, without which the analysis will not work.   Bayesian methods also use a lot of simulation (MCMC); would this also be applicable, for example in integrating prior and posterior distributions? Quite possibly. The general idea of fractionally weighted validation with bootstrap model averaging might well be applicable in lots of other areas.   Auto-generated transcript...   Speaker Transcript Phil Kay Okay, so we are going to talk about rethinking the design and analysis of experiments. I'm Phil Kay. I'm the learning manager in the global technical enablement team, and I'm joined by Pete Hersh. Peter Hersh Yeah, I'm a member of the global technical enablement team as well, and excited to talk about DOE and some of the new techniques we're exploring today. Phil Kay So the first thing to say is we're big fans of DOE. It's awesome. It's had huge value in science and engineering for a very long time. Having said that, there are some assumptions that we have to be okay with in order to use DOE a lot of the time, and they don't always feel hugely comfortable as a scientist or an engineer. Things like effect sparsity: the idea that not everything you're looking at turns out to be important, and actually only a few of the things that you're experimenting on turn out to be important. Effect heredity is another assumption. That means that we only expect to see complex behaviors, or higher order effects, for factors that are active in their simpler forms. And then there's just this idea of active and inactive effects: commonly, the sequential process of design of experiments is to screen out the active effects from the inactive effects, and that sometimes feels like too much of a binary decision. It seems a bit crazy, I think, a lot of the time, this idea that some of the things we are experimenting on are completely inactive, when really, I think, we know that everything is going to have some effect. It might be less important, but it's still going to have some effect. And these are largely assumptions that we use in order to get around some of the challenges of designing experiments when you can't afford to do huge numbers of runs. Pete, I don't know if you want to comment on that from your perspective as well. Peter Hersh Yeah, I completely agree. I think that treating something like temperature as inactive is hard; it's hard to imagine that temperature has no effect on an experiment. Phil Kay Yeah, it's kind of absurd, isn't it? So, yeah, if that's in your experiment, then it's always going to be active in some way, but maybe not as important as other things. So I'm just going to skip right to the results of what we've looked at. We've been looking at this auto-validation technique that Chris Gotwalt and Phil Ramsey essentially invented, and using that in the analysis of designed experiments, and it's really provided results that are just crazy. We just didn't think they were possible.
So, first of all, I looked at a system with 13 active effects, analyzing a 13-run definitive screening design, and from that I was able to identify that all 13 effects were active, which is what we call a saturated situation. Commonly, we've talked about definitive screening designs as being effective in identifying the active effects when we have effect sparsity, when there's only a small number of effects that are actually important or active. But in this case I managed to identify all of the 13 active effects. And not only that, I was actually able to build a model with all those 13 active effects from this 13-run definitive screening design. So again, that's kind of incredible; we don't expect to be able to have a model with as many active effects as we have rows of data that we're building it from. And Pete, you looked at some other things and got some other really interesting results. Peter Hersh Yeah, absolutely, and Phil's results are very, very impressive. I think the next step that we tried was making a supersaturated design, which means more active effects than runs, and we tried this with very small DOEs. So a six-run DOE with seven active effects, which, if we used standard DOE techniques, there would be no way to analyze properly. And we looked at comparing that to eight- and 10-run DOEs and how much that bought us. So we got fairly useful models even from a six-run DOE, which was better than I expected. Phil Kay Yeah, it's better than you've got any right to expect really, isn't it? So we've got these really impressive results: the ability to identify a huge number of active effects from a small definitive screening design and actually build the model with all those active effects. And in Pete's case, he was able to build a model with seven active effects from really small designed experiments. So, how did we do this? How does the auto-validation methodology work? Well, it's taking ideas from machine learning, and one of the really useful tools from machine learning is validation. Holdout validation is a really nice way of ensuring that you build the most useful model, a model that's robust. We hold out a part of the data, we use that to test the different models that we build, and basically, the model that makes the best prediction of the data that we've held out is the model that we go with, and that's really tried and tested. It's actually pretty much the gold standard for building the most useful models, but with DOE that's a bigger challenge, isn't it, Pete? It doesn't really obviously lend itself to that. Peter Hersh Yeah, the whole idea behind DOE is exploring the design space as efficiently as possible. And if we start holding out runs, or holding runs out of the analysis, then we're going to miss part of that design space, and we really can't do that with a lot of these DOE techniques like definitive screening designs. Phil Kay Right, right, so it'd be nice if there was some trick where we could get the benefits of this holdout validation and not suffer from holding out critical data. So that brings us to this auto-validation idea, and, Pete, do you want to describe a bit about how this works? Peter Hersh Absolutely, so this was a real clever idea developed by Chris and Phil Ramsey. They essentially take the original data from a DOE and resample it, so you repeat the results.
So if you look at the table here at the bottom of the slide, the runs in gray are the same as the runs in white; they're just repeated. And the way they get away with this is by making this weighting column that is paired together. So basically, if one copy has a high weight, the repeated run of it has a low weight, and so on and so forth. This enables us to use the data with this validation and the weighting, and we'll go into a little bit more detail about how that's done. Phil Kay Yeah, you'll kind of see how it happens when we go through the demos. So we've basically got two case studies of simulated examples that we use to illustrate this methodology. This first case study I'm going to talk through, and I emphasize it's a simulated example. In some ways, it's kind of an unrealistic example, but I think it does a really nice job of demonstrating the power of this methodology. We've got six factors, and to make it seem a bit more real, we've chosen some real factors from a case study where they were trying to grow a biological organism and optimize the nutrients that they feed into the system to optimize the growth rate. So we've got these six different nutrients, and those are our factors; we can add in different amounts of those. I designed a 13-run definitive screening design to explore those factors with this growth rate response. And the response data was completely simulated by me, and it was simulated such that there were 13 strongly active effects. So I simulated it so that all six of the main effects are active, and then, for each of those factors, the quadratic effects are active as well, so we've got six quadratic effects. Plus we've got an intercept that we always need to estimate, so there are 13 effects in total that are active, that are important in understanding the system. The effects were simulated with a strong signal to noise, but that's still a real challenge for standard methodology to model, and we'll come to that in the demo. So, really, the question is, can we identify all those important effects and, if we can, then can we build a model with all those important effects as well? Because, as I said, that would be really quite remarkable versus what we can do with standard methodology. And then case study #2, Pete? Peter Hersh Yeah, absolutely. Very, very similar to Phil's case study; the same idea, where we're feeding different nutrients at different levels to an organism and checking its growth rate. In this case I simplified what Phil had done and broke it down to just three nutrient factors. And this is building a different type of design, an I-optimal supersaturated design, where we're looking at a full response surface in a supersaturated manner, and we looked at six-, eight- and 10-run designs. So, same idea: the effects had a very, very high signal to noise ratio, so we really wanted to be able to pick out those effects if they were active. And just like Phil's, I kept the main effects and the quadratics active, as well as the intercept, and we're trying to pick those out. And same idea: how many runs would we need to see these active effects, and how accurate a model can we make from these very small designs? Phil Kay Yeah, because, like I said, you've really got no right to expect to be able to build a good model from such a small design. Peter Hersh Yeah, exactly. Okay. Phil Kay So I'll go into a demo now of case study #1.
And I'm presenting this through a JMP project; that's a really nice way to present your results, and I'd recommend trying it out. And that's our design: this is our 13-run definitive screening design, where we vary these nutrient factors, and we have the simulated growth rate response. As I said, that's been simulated such that the main effects and the quadratics of all of these factors are strongly active, plus we've got to estimate this intercept. Now, with a definitive screening design, I generally recommend you use Fit Definitive Screening as one of the analyses that you can do on the results. It works really well when the effect sparsity principle holds. So as long as only a few of the effects are active and the rest of them are unimportant, then it will find those few important effects and separate them from the unimportant ones. But in this case I wasn't expecting it to work well, and it doesn't work well. It does not identify that all six factors are active; in fact, it only identifies one of the factors as being active here. So that's not a big surprise; this is too difficult, too challenging a situation for this type of analysis. If somehow we knew that all of these effects are active and we tried to fit a model with all six main effects, all six quadratics and the intercept, then that's a saturated model. We've got as many parameters to estimate as we have rows of data, so we can just about fit that model, but we don't get any statistics. And in any case, aside from the fact that I've simulated this data, in a real-life situation we wouldn't know which ones are active, so we wouldn't even know which model to fit. Now, using the auto-validation method, I was able to very convincingly identify the active effects, and I'll talk through how we did this. And this is just a visualization of my results here; you don't necessarily need to visualize it in this way, this is for presentation purposes. I was able to identify that, first of all, the intercept was active. I've got all my six main effects and my quadratic effects, and then my two-factor interactions, which I simulated to have zero effect. You can see they are well down versus the other ones. And there's actually a null factor here that we use, a dummy factor. So anything less than the null factor we can declare as being unimportant, or inactive, if you like. The metric we're looking at here is something called proportion nonzero, and I'll explain what that means as we go through this. That's the metric we're using here to identify the strength, the importance, of an effect. So a bit about how I went through this. I took my original 13-run definitive screening design and then I set it up for auto-validation, so we've now got 26 rows; we've duplicated the data. There's an add-in for doing this; one of our colleagues, Mike Anderson, created an add-in that you can use to do this in JMP 15. In JMP 16 they're actually adding the capability in the predictive modeling tools, in the validation column platform. And what that does is give us this duplicate set of our data, and then we get this weighting, and as Pete said, each row is in the training set and in the validation set. If it has a low weighting in the training set, it'll have a high weighting in the validation set.
And if it has a high weighting in the training set, it'll have a low weighting in the validation set. These weights have basically been randomly assigned, and what we actually do is reassign them and iterate over this hundreds of times, fitting the models each time and then looking at the aggregated results over many simulation runs. So what you would do is fit the model, and I'm using GenReg here in JMP Pro. You'll need JMP Pro anyway, because you need to be able to specify this validation role, so the train/validation column goes into validation. The weighting goes into frequency, and then we set up everything else as we normally would with our response. Then I've got a model, which is the response surface model here with all these effects in, and I click run. It will fit a model, and we can use forward selection or the Lasso. Here, I've used the Lasso; it's not hugely important in this case. And what's actually happened is that we've identified only the intercept as being important in this case, so we've only actually got the intercept in the model. But if we change the weighting, if we go back to our data table and resimulate these weightings, we will likely get a different result from the model. Weighting different rows of data, different runs in the experiment, changes the model that's fit. So we're going to do that hundreds of times over, and what I'm going to do is use the simulate function in JMP Pro. What we do is switch out the weighting column and switch in a recalculated version of the weighting column. And you can do that a few hundred times; I actually did it 250 times in this case. I'm not going to let that run now, because that would take a minute or two. Once you've done that, what you'll get is a table that looks like this. So now I've got the parameter estimate for every one of those 250 models for each of these effects. In my first model in this run, this was the parameter estimate for the citric acid main effect. In the next model, when we resampled the weighting, the citric acid main effect did not enter the model, so it was zero in that case. And you can run distributions on all of these parameter estimates. One of the things you can do is customize the summary statistics to look at the proportion nonzero. So you can see the intercept here and the estimates that we've had of the intercept. You can see that with citric acid, a lot of the time it's been estimated as being zero, so in those models the citric acid main effect was not in the model, and a lot of the time it's been estimated at around about 3, which is what I'd simulated it to be. So what we look at is the proportion of times that it is nonzero, and we can make a combined data table out of those. I've already done that, and done a little bit of additional augmentation here; I've just added a column for whether it's a main effect or not, and that was how I created this visualization here. So what you're looking at is the proportion of times each of those effects is nonzero, the proportion of times that each of the effects is in our model over all those 250 simulation runs we've done, where we've resimulated the fractional weighting. And that's what we use to identify the active effects, and it's done a remarkable job.
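For readers who want to see exactly what the "proportion nonzero" summary is computing, here is a small, purely illustrative Python sketch (the array and effect behavior below are hypothetical; in JMP Pro this number comes from Simulate plus the customized summary statistics just described):

import numpy as np

def proportion_nonzero(estimates, tol=1e-12):
    # estimates: one row per refit with resimulated autovalidation weights,
    # one column per effect; returns the fraction of refits in which each
    # effect entered the model (i.e., had a nonzero coefficient).
    return np.mean(np.abs(np.asarray(estimates)) > tol, axis=0)

# Hypothetical example: 250 refits of three effects, where the first effect
# is almost always selected and the last one almost never is.
rng = np.random.default_rng(0)
kept = rng.uniform(size=(250, 3)) < [0.95, 0.5, 0.05]
estimates = rng.normal(loc=3.0, size=(250, 3)) * kept
print(proportion_nonzero(estimates))    # roughly [0.95, 0.5, 0.05]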
It's been able to do what our standard methods would not be able to do. It's identified 13 active effects from a 13-run definitive screening design. Now, what would you want to do next? We maybe want to actually fit that model with all those effects, and I've been able to do that. I'm comparing the model that I've fit here versus the true simulated response, and you can see how closely they match up. So I've been able to build a model with all these main effects, all these quadratics and the intercept. I've got a 13-parameter model here that I've been able to fit to this 13-run definitive screening design, which again is just remarkable. And I'm not going to talk through exactly how I got to that part. I'll hand over to Pete now; he's going to talk a bit more about this idea of self-validated ensemble models. Peter Hersh Absolutely. Thank you, Phil. Let's see, I'm going to share my screen here and we'll just take a look at this project. You can see here, in the same flow as Phil, we're looking at a project, and I have started with that six-run supersaturated DOE. Here you can see I have three factors, what my actual underlying model growth rate is, and then what the simulated growth rate was. And then, like Phil mentioned, I create this auto-validation column, which can be done with an add-in in JMP 15 that Mike Anderson developed. Or in JMP 16, it's built right into the software, and you can access that under Analyze > Predictive Modeling > Make Validation Column. So, just like Phil showed, he showed an excellent example of how we can find which factors are active, so a factor screening. That is oftentimes our main goal with DOE, but if we want to take it a step further and build a model out of that, we'd go through the same process, right? So we build our DOE, we get auto-validation added to that DOE, we build our model, just like Phil showed, using generalized regression and one of the variable selection techniques. Phil Ramsey and Chris Gotwalt have looked at many of these different techniques and they all seem to work fairly well. So whether you're using the Lasso or even a two-stage forward selection, they all seem to have similar results and work fairly well. So once you set this up and launch it, you get a model, like Phil had shown, and some of these models will have
David Wong-Pascua, Senior Scientist, CatSci Phil Kay, JMP Global Learning Manager, SAS Ryan Lekivetz, JMP Principal Research Statistician Developer, SAS Ben Francis, JMP Systems Engineer, SAS   Imagine a DOE problem with a huge number of possible treatments: 5 categorical factors with 3, 4, 4, 8 and 12 levels, respectively. The full factorial experiment is 4,608 runs. You have a high-throughput system, so you can carry out 48 runs at the same time. Perhaps the first experiment is obvious: use the JMP Custom Design platform to generate a 48-run design to estimate main effects. Fitting a model to the response (yield), you find that all factors are significant, as expected. So what should you do next? What is an efficient and effective approach to finding the optimum and having confidence in that result?   This is not a typical screening-then-optimisation sequential DOE situation, because there are no unimportant factors that can be dropped after the initial screening design. Also, 2nd-order (and higher) interactions are likely to be important, but estimating those models requires hundreds of runs.   In this paper you will find out how we approached this problem using sequential approaches to seek optimum regions in an overwhelmingly big possibility space, while also balancing that with maximizing learning. We will compare this with alternatives, including a Custom Design approach to characterising the whole system. You will see that the solutions we found were only possible because of the unique interactive and visual nature of JMP and the new Covariate/Candidate Run capability in version 16.   You can find the example data that we used and the reference in this JMP Public post.   Auto-generated transcript...   Speaker Transcript David.Wong-Pascua Okay, so: Sequential and Steady Wins the Race. Hello everyone, I'm David and I'm a senior scientist at CatSci. So what is CatSci? CatSci is an innovation partner for end-to-end chemical process research and development. We collaborate with organizations like JMP UK to ensure that all of our decisions are driven by data. For us it's about getting it right the first time and getting it right every time; formulating bespoke, perfect-for-purpose solutions for every project; and our absolute focus is delivering to the expectations of our clients. We work in a number of areas, including early- and late-stage development. We possess process expertise and a long history of catalysis experience, which has always been a core feature of our identity, and we have excellent capabilities in analytical chemistry and material science. At CatSci, one of the challenges that we face is the selection of categorical variables upstream of process optimization. This difficulty is most prominent in our catalyst screening work. In the images on the slide, we can see some of the apparatus and instruments that we rely upon to perform catalyst screening. On the left, we have premade catalysis kits that contain 24 predosed combinations of metal precursor and ligand. In the center is a hydrogenator that houses a 48-well plate, which can be seen on the right. This allows 48 individual experiments to be conducted simultaneously. These instruments are designed for high throughput and small scale, and it isn't uncommon to run 10 or more of these plates for our clients, which totals almost 500 experiments. So why do we need to run a relatively large number of experiments, and what are the benefits of improving efficiency?
So, first of all, the chemical space that needs to be explored is extremely large.   For any chemical reaction there are many factors that can have an effect, such as temperature, concentration and stoichiometry.   However, what makes exploration particularly challenging for a catalytic reaction is that there are many important categorical factors; these include metal precursor, ligand, reagent, solvent, and additive.   Another characteristic of catalytic reactions is that there are frequently high-order interactions. What this means is that, for a catalyst system (a particular combination of metal precursor and ligand),   there need to be extensive solvent and additive screens to gauge performance.   It isn't correct to rule out a certain combination of metal precursor and ligand just because it performs poorly in a certain solvent; it might be the case that you don't have the right solvent or the right additive for it.   Lastly, one of the easiest ways to explain the benefit of a more efficient screening method is to look at the price of materials.   Many of our catalysis projects involve precious metals, such as platinum, palladium, ruthenium, iridium, and rhodium. Some of these metals are among the rarest elements in the earth's crust, and that's reflected in their price.   From just a quick Google search of their prices, you can see here that rhodium is more than an order of magnitude more expensive than gold.   Another large expense of catalysis screening is the ligands. While cheap ligands do exist, many of the top-performing ligands for asymmetric synthesis are proprietary, such as Josiphos from Solvias,   and one of the most commonly available and cheapest, J001, is still twice the price of gold.   Reducing the number of experiments can therefore decrease the cost of consumables, but the biggest impact comes when we find the optimal solution for a client with a multiton manufacturing process.   In these cases, a tenfold reduction in catalyst loading or the discovery of an alternative, cheaper catalyst (for example, exchanging a rhodium catalyst for a ruthenium one) can save millions of dollars per year.   And now I'm going to hand over to Ben, who will discuss how the JMP UK team have tackled this problem. Ben Francis So thanks very much, David.   It's clear that   CatSci and many other companies like CatSci face big challenges with large formulation spaces,   which can have very high-order interactions. The science is very complex, and how we experiment to unravel that complexity is important.   Unfortunately, the option to just test every single combination isn't a sensible or viable one, due to the costs that David outlined.   Ultimately, we need some form of approach involving data science to understand the relationships numerically.   And in order to apply the chosen formulation in a manufacturing scenario, we need to know it will work each time, as David said.   So we searched the literature, because at JMP we're very interested in applications of DOE, and we felt this was a particular kind of DOE situation.   As we aptly named it, it's a big DOE. We searched for a case study to apply a big DOE approach to, and we found a paper by Perera et al.,   published in 2018, in which they looked at a platform for reaction screening. What was interesting in this scenario is that they had five categorical factors.   
And this may not sound like too much. They had to look at Reactant 1, Reactant 2, ligand, reagent and solvent, but the complexity presented in this paper was in the number of levels of each factor.   Four, three, twelve, eight and four levels, respectively, which combined come to 4,608 distinct combinations.   Now, in the Perera et al. paper they tested every single one, which obviously is not a viable approach for every company in the world, as David outlined before.   You can see from the heat map that they were able to identify some high-yielding regions,   some less high-yielding ones, and some very low ones as well. So there was a lot of testing and a very high experimental cost, and ultimately not much process understanding.   This ended up in a situation where they got to a combination which they were then able to report in the paper. It was right the first time, but was it right every time from that point onwards? If they put that combination into a multiton manufacturing scenario, would it continue to meet the required specification?   That framed the questions we asked of our own approach: does the big DOE solution reduce resource requirements?   Does it get it right first time, and does it get it right every time? The way we measured these questions   was as follows. For reducing resource requirements, we only allowed ourselves to design an experiment with 480 runs in total, mimicking ten 48-well plates; this is around 10% of the runs performed in the paper.   Getting it right first time was measured by looking at how many of the 80 high-yield combinations identified in the paper were either tested or predicted   by the model we generated (high yield here means a value of more than 95%). With 10% of the runs we would expect, by chance alone, to hit about 10% of those 80 combinations, so we set a bar of at least eight high-yield combinations coming out of our DOEs.   Finally, and crucially, getting it right every time was measured by looking at the general diagnostics of the model, the R squared values. Ultimately, can we look at this and see process insight? Do the relationships make sense? Do the combinations and interactions make sense?   To give ourselves a par for the course, we used the full data set of 4,608 runs and found that its R squared was .89 with a boosted tree method.   So now I'm going to hand over to Phil, who will outline the first big DOE approach that we took. Phil Kay Thanks, Ben; thanks, David, for setting the scene so nicely. So we looked at three different approaches   to tackling this, to see if we could take a reduced number of the runs that were actually tested in that paper and still gain the same useful information, as Ben outlined.   I'm going to talk through two of those three approaches, and these two are the extremes: a conservative approach and a greedy approach. Then Ryan will talk through the more hybrid, balanced approach that we thought was probably going to work best ultimately.   The conservative approach is a standard DOE approach: we design a 480-run custom design with 10 blocks, representing 10 plates in the catalyst screening equipment that David showed us.   
And the headline results were, from that 480-run experiment, we were able to correctly predict that 13 of the highest yielding combinations would be high yielding ones.   And we were able to build a model from that 480 run data set that, when we tested it against data that hadn't been used in that experiment,   the R squared was about .73, so pretty good actually. We get a fairly good model, fairly good understanding of the system as a whole from that smaller experiment on...out of the big experiment that was published.   Some really interesting insights we got from this were that we needed to use machine learning to get useful models from this data, and that was a bit of a surprise to us. I'll talk through that in a bit more detail.   And we got some good understanding from this data that we collected using this conservative approach. It gave us a good understanding of the entire factor space, why certain combinations work well.   It enabled us to identify alternative regions, so if we find a high yielding combination, but it's not practical because it's too expensive,   the catalyst are too expensive, or maybe they're dangerous or toxic and not amenable to an industrial process, then we we've got lots of alternatives because we've got an understanding of the full space.   So, how did we do that?   Custom design in JMP. We   designed a 480 run, I optimal design, in this case, in 10 blocks. So again, the 10 blocks representing 10 48 wellplates that David showed us the equipment for.   Of note here, we specify a model in custom design that is for the main effects.   for the second order interactions and the third order interactions. And with the third order interactions and all those levels of these categorical factors,   there's just a huge number of parameters that need to be estimated, so we couldn't actually design an experiment to estimate all of those explicitly. We need to put the third order interactions in as, if possible, so this is a bayesian I optimal design in 480 runs.   You can see from this visual, we're just looking through the different rounds of the experiments, of the different...10 different plates.   And you can see how, with an addition of the next plate in the experiment, we're adding to our exploration of the factor space as a whole.   We're not focusing on any specific regions; we're just filling in the spaces in our knowledge about that whole factor space.   So that's 480 runs that we used in our experiment and the remaining 4,128 rows that we have from that published data set. We can use this test data to see how well our models compare on a heldback data set, this test data.   Now we expected from all our sort of DOE background and our experience that   when we design an experiment for a certain model then, standard regression techniques are going to provide us with the most useful models.   So we were using JMP Pro here and we use generalized regression to fit a full third order model.   Now we actually use a beta distribution here, which bounce predictions between zero and one, so we convert percent yield to the scale of zero to one. It's just a little trick that when you're modeling yield data like this, you might might want to use that when you've got JMP Pro.   And we use forward selection, with AICc as the validation method, so that's going to select the important effects.   
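As an aside for readers who want to see the mechanics of that selection step outside of JMP, here is a minimal Python sketch of forward selection using corrected AIC (AICc) as the stopping rule. It only illustrates the general idea with an ordinary least-squares fit; it is not the Generalized Regression implementation, and the effect columns passed in are assumptions of the sketch.

import numpy as np

def aicc(y, X):
    """Gaussian AICc for an ordinary least-squares fit of y on X (X includes the intercept column)."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(np.sum((y - X @ beta) ** 2))
    p = k + 1                                           # coefficients plus the error variance
    aic = n * np.log(rss / n) + 2 * p
    return aic + 2 * p * (p + 1) / max(n - p - 1, 1)    # small-sample correction (guarded denominator)

def forward_select(y, candidates):
    """candidates: dict mapping an effect name to its column vector. Greedy forward selection on AICc."""
    n = len(y)
    selected, X = [], np.ones((n, 1))                   # start from the intercept-only model
    best = aicc(y, X)
    improved = True
    while improved:
        improved = False
        for name, col in candidates.items():
            if name in selected:
                continue
            trial = np.column_stack([X, col])
            score = aicc(y, trial)
            if score < best:                            # keep the best single addition in this pass
                best_name, best_X, best = name, trial, score
                improved = True
        if improved:
            selected.append(best_name)
            X = best_X
    return selected

In JMP Pro the same choices are simply options in the Generalized Regression launch; the sketch just makes the selection loop explicit.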
And we repeat this after each round, so with the first round of 48 runs, then the second round of 96, then the third with 144, all the way up to the 10th round where we've got the full 480 runs.   Here you can see the the filter set on gen reg there just to look at the first three rounds there.   So for that first three rounds, this is the model that we get using gen reg, forward selection, AICc validation.   And it's not a great model, I think, just intuitively from our understanding of what we'd expect from the chemistry. We're only seeing two of the factors involved in the model so it's only selected two main effects as being significant.   So we can save that model to the data table, and then we can see how well that model predicts the heldback data, our test set,   the remaining 4,128 rows that were not in our experiment. And we can do that for each round. And we did that for each round, so we've got 10 models that we've built using gen reg.   Now, as an alternative, we thought it'd be worth looking at some machine learning methods. And here, we're just looking at the first three rounds again.   And as a really quick way of looking at lots of different modeling techniques, we using model screening in JMP 16 Pro.   We just fill in the yield and the factors in the Y and the X's there. We've got holdback validation column set up to hold back 25% of the data randomly for validation.   Click OK, and it runs through all these different modeling techniques, so we get a report now that tells us   how well these...all these different modeling techniques, including some gen reg models, neural nets and different tree methods.   And we've selected the two best there. And in fact across all the rounds of the experiment, we found that neural...boosted neural net models and bootstrap forest were the best performing models.   So again, we can save the prediction formulas back and use those on the test data set that wasn't involved in the experiment to compare them and see how well our different models are doing against that that test data set.   And we use model comparison and JMP Pro here, so we compare all our gen reg models, our neural net models, and our bootstrap forest models.   And now you're looking at a visual of the R squared against that test data set. How well those models fit against the heldback test data set.   And you can see the gen reg, the standard regression models using forward selection in red there, starting off   in the first round, that's actually the best model, but not really improving, and then suddenly going very bad around five, before becoming really good again in the next round. So very unstable and towards the end getting worse, probably overfitting.   The machine learning methods, on the other hand, are,   after round three, they're they're performing the best, performing better than our standard regression techniques and they're improving all the way. And in fact,   you can see that neurals start off better in green there, and towards the end, the bootstrap forest is outperforming the the boosted neural net models.   So the take home method here is even with this, what you might consider a relatively simple situation, you really need the machine learning methods in JMP Pro to get the best models, the most reliable models for these kind of systems or processes.   Now we're just looking at plots of the actual yield results for all those 4,608 runs that were published versus the predictions of our best model at the end of round 10, after the full 480 runs there. 
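As an aside, the step of holding back every run that was not in the experiment and scoring each candidate model against it can be reproduced in a few lines. This scikit-learn sketch is only an illustration: it assumes the factor columns are already numerically encoded, and the model families and settings are placeholders rather than the Model Screening platform's choices.

import numpy as np
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import r2_score

def compare_on_test(X_train, y_train, X_test, y_test):
    """Fit a few model families on the experimental runs and score each one
    against the held-back runs that were not part of the experiment."""
    models = {
        "random forest": RandomForestRegressor(n_estimators=300, random_state=1),
        "boosted trees": GradientBoostingRegressor(random_state=1),
        "neural net": MLPRegressor(hidden_layer_sizes=(20,), max_iter=5000, random_state=1),
    }
    return {name: r2_score(y_test, m.fit(X_train, y_train).predict(X_test))
            for name, m in models.items()}

The dictionary of test R squared values plays the same role as the model comparison report described above.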
So that was a bootstrap forest on the right.   And it's not bad, you know you can see, the important behaviors it's picking up. It's identifying the low yielding regions and it's identifying the highest yielding regions. So there's,   particularly up in the top right, they're using an iodine(?) reactant and the boronic acid and boronic ester   reactants there.   So, then, we compare this with the other extreme approach. So the first round of 48 runs was exactly the same; we start with the same 48 runs. Then we build a model and we use that model to find out which, out of all the remaining runs,   out of the full remaining 4,560 runs there are, which of those does the model predict would have the highest yield, and then we keep doing that in all the subsequent rounds up to 10.   And in this case we correctly predict more of the high yield and combinations, but when we get the final model after 10 rounds and compare that on the data   That we haven't used in the experiment, the R squared is terrible. It's .33. So really the model we get is not making good predictions about the system as a whole.   So we're getting good understanding of one high yielding region and it's finding combinations, finding settings of the factors that work well, but it's not really enabling us to understand the system as a whole. We're not...   we're not reducing our variability in our understanding of the system as a whole.   So just to show you how that system, that approach works, this is our full 4,608 runs, all the combinations that are possible, And, as I mentioned in Round 1, that's the same as as Round 1 using the conservative approach so that's an I optimal   set of points there, and you can see they're evenly distributed across the factor space.   We fit a model to the data from those 48 runs.   And we use that model to predict the yield of all of the other possible combinations and then we select out of those the 48 highest yielding predictions,   which is those ones. So you can see, they are focused on the higher yielding region of the...we're not...we're no longer exploring the full factor space.   And if we go from Round 1, 2, 3 and up to Round 10, you can see, each time we're adding...we're adding runs to the experiment that are just in that high yielding zone up in the top left. So we're only really focusing on one region of the factor space.   And if we compare the actual versus predicted here, you can see that it's doing a pretty good job in that region in the top left there.   But elsewhere it's making fairly awful predictions, as we might expect; that's probably not that surprising. We haven't experimented across the whole system. We really have just focused on that that high yielding region.   And if you compare it against our conservative approach on the left hand side (so you've already seen the plot on the left hand side there) if you look at how the model   improved through Rounds one to 10, and then each round we were using the model screening platform again to find the best model.   And it turned out that at each stage actually, the boosted neural net was the best model. Well, you can see again it's kind of unstable like the gen reg from the conservative approach.   It's, at times, it's okay, but it's never...it's never competitive with our best models from the conservative experiment.   So those are the two extremes, and I'm going to pass over to Ryan, who's going to talk about a more balanced or hybrid approach to this problem. Ryan Lekivetz So thanks for that Phil. 
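The greedy strategy just described is easy to state as pseudocode. The Python sketch below is a conceptual outline, not the workflow that was actually run in JMP; it assumes a helper fit_model(X, y) that returns a fitted model with a predict method, and a y_all vector that stands in for actually running the selected wells.

import numpy as np

def greedy_rounds(X_all, y_all, first_round_idx, fit_model, n_rounds=10, batch=48):
    """Greedy sequential DOE: after each round, refit on everything run so far and
    then 'run' the 48 untested candidates with the highest predicted yield."""
    tested = list(first_round_idx)                       # round 1 is the shared I-optimal start
    for _ in range(n_rounds - 1):
        model = fit_model(X_all[tested], y_all[tested])  # only rows already run are used for fitting
        untested = np.setdiff1d(np.arange(len(X_all)), tested)
        preds = model.predict(X_all[untested])
        best = untested[np.argsort(preds)[-batch:]]      # the 48 highest predicted yields
        tested.extend(best.tolist())
    return tested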
And   just actually in the middle of a presentation, gotta say thank you to Phil, Ben, and David for allowing me to take part in this. As a developer it's actually kind of fun to play around a bit. So I'm a developer on the DOE platforms in JMP.   So as Phil kind of mentioned here, the approach that I'm looking at here is a hybrid approach.   Alright, so the idea here is we're going to be looking at this idea of optimization versus exploration.   So, if you think of the greedy approach that Phil talked about, that's really that optimization. I'm going strictly for the best, that's all I'm going to focus on. Whereas that conservative approach, all that that cared about was exploration.   And so the idea here is that, can we get somewhere in the middle to   perhaps optimize, while still exploring the space a little bit?   So I'll say what what I had done here...so I actually started with kind of three of the well plates set ahead of time, right. So I did those three blocks of size 48 and I just did main effects with, If Possible 2FIs.   The reason for that is that sometimes, I didn't really trust the models with just 48 runs. I said, well, let's give it a little bit of burn in. Let's try the 144, or three well plates, so then maybe I can trust the model a little bit more.   Because in some sense, even if you think of that greedy approach, if your model was bad, you're going to be going to the terrible regions. It's not going to help you an awful lot.   And again so...so as before, I'm still going to be doing 10 total well plates or 10 total blocks in the end.   And I'll talk about this a little bit in the future slides, but so I was actually using XGBoost throughout. So the way I was predicting was using XGBoost and there's a reason i'll get to   for doing that.   But, so the idea here was to essentially take what the predictive value was. So instead of going strictly after the best 48, I said, I'm going to set a threshold.   Right, so after the first three, I started at 50. So I said, give me the predictive values that are greater than 50 and as I keep going on further well plates, I'm going to increase five each round.   I mean that was pretty arbitrary; part of that was was knowing that we only had 10 rounds and total, and that's also because I could kind of see based on the prediction how big was that set.   Right, so depending on the data, if it looked like that set was still huge, that the predicted was about 50, you could modify it that.   And so the idea here is really just to to slowly start exploring what should hopefully be the higher yield regions.   And so, were we right? Well, so in this, we actually got 27 out of the 80 highest yielding combinations, those above 95, which wasn't that far off of of the greedy approach.   And if you think of the R squared, again, with all that holdout data, to be that the final test R squared there was .69, so which really wasn't even that far off from the from the conservative approach. So just the...   I mean, you may be...you may want to be cautious, right now, just looking at those two numbers, but based on that, this hybrid really actually does seem to be kind of the best of both worlds. I mean, you're trading off on it, but, but you are getting something good.   So just some insights on this approach. Really, I would say, in the end, that we did get a better understanding of those factor spaces associated with high yield.   
I think the greedy, if you can think back to where those points were, they were really concentrated in that upper left so it didn't really do much more exploration of that space.   And you can imagine if...   if something in there suddenly quadruples in price overnight, now you don't know anything else about the rest of the region.   Right, and so with this hybrid approach, we get some more of that how and why, some of the combinations, and it gives us better alternatives   again that, maybe if you have other things to factor in, you're willing to sacrifice a tiny bit of the yield because there's better alternatives, based on other measurements.   And so, really, the idea here...yep, so so let me just get into how this was actually done in the custom design, right. So the approach I took was   using covariates in the custom designer, which I think it's kind of underappreciated at times.   But, so the idea is I just took that full data set, so the whole 4,608 runs,   and for each round...so you can imagine, at first, I took those...I took the 144 run design, and then I said, well, I'm going to take a subset of that full data set. If i've already chosen it in a previous design, well, I need to make sure that that's included.   And then also, give me the predicted yields above that certain value, so again, for the fourth design, that was started at 50 and 55, 60, etc.   So in the data table   now, what I do is I select the design points that were used in previous experiments. So I make sure that those are highlighted. And then custom design, when you go under add factor, there's an option there called covariate. And again, I think that it's...   A lot of people have never really seen covariates before and, I mentioned here, because I think I could probably spend an entire talk just talking about covariates,   but there's going to actually be a couple of blog entries (if they're not up by the time you see this video, probably within a few days by the time you're watching this)   where I'm going to talk a little bit more about how to really use covariates. But the idea is, you may have also heard something called a candidate set before, so the idea is...   here's this data set that was the subset of the previously chosen and everything like that, and it's telling the custom designer you can only pick these design points.   You can't do the standard way of, we're going to keep trying to switch the coordinates and everything like that. It's telling the JMP custom designer, you have to use these runs in your design and, in particular,   one of the options using covariates is that you can force in previously chosen design points, so in this way it's actually like doing design augmentation   using a candidates after using covariates.   And, of course, because in this...in this case, I can't really throw out those design points that I've previously experimented on. Really, all we're wanting to do is to pick the next 48.   And I should just mention as well here, though, if you think of the the custom design...   So in each of these, what I was doing was...so the two factor interactions and then three factor interactions if possible.   But there was...that was really more of a mechanism to explore the design space, right. The model that I was fitting   was using XGBoost after the fact, so I mean that design wasn't optimized for an XGBoost model or anything like that.   
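Here is a rough numpy sketch of the two ideas Ryan combines: restrict the candidate set to rows whose predicted yield clears the round's threshold (starting at 50 and rising by 5), force in the previously run points, and then choose the next 48 rows so the augmented design stays well conditioned. The greedy log-determinant exchange below is only a stand-in for the custom designer's optimal-design algorithm, and the model-matrix encoding F_all is an assumption of the sketch.

import numpy as np

def augment_from_candidates(F_all, predicted, prior_idx, round_no, batch=48):
    """Pick the next `batch` runs from a thresholded candidate subset of the full space.
    F_all: model-matrix encoding of all 4,608 combinations (one row per combination).
    predicted: the current model's predicted yield for every row.
    prior_idx: rows already run, which are forced into the design.
    round_no: 0 for the first augmented plate, 1 for the next, and so on."""
    threshold = 50 + 5 * round_no
    candidates = np.setdiff1d(np.where(predicted > threshold)[0], prior_idx)

    chosen = list(prior_idx)
    ridge = 1e-6 * np.eye(F_all.shape[1])                # keeps F'F invertible in early rounds
    for _ in range(batch):
        M = F_all[chosen].T @ F_all[chosen] + ridge
        best_gain, best_row = -np.inf, None
        for r in candidates:                             # add the candidate that most increases log det(F'F)
            f = F_all[r]
            gain = np.log(1.0 + f @ np.linalg.solve(M, f))
            if gain > best_gain:
                best_gain, best_row = gain, r
        if best_row is None:                             # candidate set exhausted
            break
        chosen.append(best_row)
        candidates = candidates[candidates != best_row]
    return chosen[len(prior_idx):]                       # just the newly selected rows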
This was more as a surrogate to say, explore that design space where the predicted yield look good and, especially, because this was dealing with the categorical factors, it's going to make sure that it tries to get a nice balance. One other thing to mention,   if you wanted to...again, because it's not using that underlying model, the XGBoost or anything like that, you could use the response. So in this case, they predicted yield as another covariate, and what that would do is to try and get a better balance of both high and low yield   responses when it's trying to make the design.   So we didn't explore that here, but I think that would be a valid thing to do in the future.   So, to the predicted yield. So, as I mentioned in...so I think when Phil was talking about some of the other machine learning techniques he had tried   For my ???, I actually focused mainly on XGBoost.   And if you look in the on the left side, one of the reasons I had done that was to try   using K folds cross validation, but where I use the design number, which would essentially be the well plate as each additional fold. So instead of randomly selecting the different K folds,   it would say, I'm going to look at the well plate and I'm going to then try to hold that for out for each one. So if you imagine, for the first one, when I only had three well plates,   it was going to take each of those well plates, hold it out, and then see how well it could predict, based on the remaining two. And so by the time I would get to the tenth...the tenth well plate, it would actually be kind of like a tenfold.   So part of my reasoning for that was to think about trying to make that model robust to that next well plate that you haven't seen.   And, in reality, this would actually also perhaps protect you if something happened with a particular well plate,   then, maybe get a little bit of protection from that as well.   And another thing I had done with XGBoost is when you launch the XGBoost platform, like through the add-in, you'll see there's a little checkbox up at the top for what we call a tuning design.   And so, with that tuning design,   it's actually fitting a space filling design behind the covers and running all of that for the different hyper parameters.   So what I had done for for each time I was doing this,   I would pick a tuning design for 50 runs, but just kind of explore that hyper parameter space. And so the model I would choose that each round   was the one that had the best validation R squared. And so again, the validation R squared in this case is actually using only the design points that I have, right. So this was doing that K fold with the...   with those different well plates. So in this case, the validation R squared is not...I'm not using that data that hasn't been seen yet, because again, in reality, I don't have that data yet.   But this was just saying using that that validation technique, where I was using the well plates. So again, that's kind of the reason that I was using XGBoost for most of this, I was just kind of trying something different that   using that K folds technique, and as well, I really just like using the space filling as well, when it comes to that XGBoost.   And so let's take a look here. So so we see here, these are the points that get picked throughout the consecutive rounds. Now,   if you think of what was happening in the greedy approach, I mean, it was really focusing up in that top left.   
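The plate-wise validation Ryan describes maps directly onto group-based cross validation. The sketch below uses the open-source xgboost package with scikit-learn's GroupKFold so that every fold holds out one whole well plate, and it replaces the add-in's space-filling tuning design with a simple random search; the column layout and the hyperparameter ranges are assumptions, not what was actually run.

import numpy as np
from sklearn.model_selection import GroupKFold
from sklearn.metrics import r2_score
from xgboost import XGBRegressor

def plate_cv_r2(X, y, plate, params):
    """Cross-validate XGBoost where each fold holds out one well plate, so the
    score reflects how well we predict a plate that has not been seen."""
    cv = GroupKFold(n_splits=len(np.unique(plate)))
    scores = []
    for train_idx, test_idx in cv.split(X, y, groups=plate):
        model = XGBRegressor(**params).fit(X[train_idx], y[train_idx])
        scores.append(r2_score(y[test_idx], model.predict(X[test_idx])))
    return float(np.mean(scores))

def pick_hyperparameters(X, y, plate, n_trials=50, seed=0):
    """Stand-in for the tuning design: sample hyperparameter settings at random
    and keep the one with the best plate-wise validation R squared."""
    rng = np.random.default_rng(seed)
    best_score, best_params = -np.inf, None
    for _ in range(n_trials):
        params = dict(
            n_estimators=int(rng.integers(100, 600)),
            max_depth=int(rng.integers(2, 8)),
            learning_rate=float(rng.uniform(0.02, 0.3)),
            subsample=float(rng.uniform(0.6, 1.0)),
        )
        score = plate_cv_r2(X, y, plate, params)
        if score > best_score:
            best_score, best_params = score, params
    return best_params, best_score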
Right, but now, if you see with this hybrid approach,   as it starts to move forward into the rounds, you see it fill in those points a little bit more, but we also get a better... a better sampling of points, especially on that left side.   Right, so if you think of from the perspective, now you have a lot better alternatives for those high yield regions.   I think so...if we take a look now at the the actual versus predicted,   I actually think this was kind of telling as well, I think, because we did better exploration.   In those high yield regions that maybe   didn't get quite as high, I think, if you look to the lower left, I think we've done a much better job in the hybrid model of picking up some of those which may be viable alternatives.   One of the nice things with this method as well that was that I can focus on more than just the yield.   Right, so if you think in...   I could have a secondary response, whether it be some kind of toxicity, it could be cost, anything like that that,   I can focus on more than just the yield when I'm looking at those predicted values. And so, if you think of something like the greedy approach,   if you start adding in secondar...secondary criteria that you want to consider, it's a lot harder to start doing that balancing. Whereas with this subset approach, it's really not difficult at all to add in a secondary response in doing that.   I think I'll hand it over now to Ben for conclusions. Ben Francis So,   thank you very much to David, Phil and Ryan for giving a background of why we're approaching this, and then ultimately, giving us two, three very good solutions and ways to   view how we tackle this problem. And in all of these approaches, it's clear that JMP is providing value to scientists. So we've got straightforward, upfront, we've got tenfold reduction in resources for experimentation. We only used 10% of the runs. And   ultimately, what we were enabling in each situation was the ability to meet specifications and provide a high yielding solution from the experiments.   But in addition to that, we also showed how we could have different levels of deep process understanding, depending on the goal and strategy of the company employing this approach.   Now, you may notice that, we three from JMP, we're very interested in how CatSci approach this problem. We weren't necessarily taking   the problem away from CatSci and solving it ourselves; we're providing the tools in order to tackle this in terms of DOE. And this was a fantastic collaboration between David as a customer,   myself and phil on the side of technical sales, and Ryan within product development, and it's key here that we're all looking at this from a different perspective   to enable that kind of solution, which, as I said before, we can look at kind of deep process understanding in different ways, depending on the objectives of what needs to be achieved.   We found out some things along the way, which is really exciting to us at JMP and will lead to product improvements, which we then hand back to yourselves   as customers. We learned that machine learning approaches are applicable to this big DOE situation and, as you know, with JMP they are straightforward to apply. And we also learned caution in terms of the validation approaches, and we can look into that further.   So, ultimately, we presented here a good volume of work in terms of big DOE challenges, and we're sure there are many companies out there, similar to CatSci, taking on this sort of problem.   
So we invite you to explore this work further. We have two links here: one to the Community, where we'll be posting resources and the video of this presentation,   and one to a JMP Public post, which lets you get hands-on with the data that we used. So from everyone in this presentation, thank you for watching.
Els Pattyn, Non-Clinical Efficacy and Safety Statistician, Ablynx, a Sanofi company   Immunogenicity assays to detect anti-drug antibodies (ADA) in subject samples provide data on drug safety and efficacy and are required for approval by health authorities such as EMA and FDA. Such assays are usually qualitative or semi-quantitative and thus require cut-points as threshold values for distinguishing positive and negative samples. Establishing appropriate cut-points is crucial to ensuring acceptable sensitivity of the assay to detect ADA and as such requires particularly complex statistical calculations. Since one study can have multiple immunogenicity read-outs, these cut-point calculations become very laborious. We developed a JMP script for cut-point determination and assessment, which follows a pre-determined decision tree developed based on industry guidance, white papers and scientific best-practice. The script is designed to be appropriate for validation and use in GxP studies. End-users can select decision trees applicable for their needs, such as type of assay and study. The application accepts Excel files to upload data and makes outcome-dependent decisions – for example, adapting the effects included in the mixed-effects model based on the study-design, selecting the most appropriate normalization/transformation, calculating analyst-specific cut-points in case of significant analyst-specific differences, and adapting down-stream analysis in cases where no second-tier confirmatory data is available. In summary, this JMP script allows immunogenicity cut-points to be calculated quickly and efficiently in a standardized way, including automated reporting that is suitable for regulatory submissions.     Auto-generated transcript...   Speaker Transcript Els Pattyn Hello, my name is Els Pattyn and I'm working as a non clinical efficacy and safety statistician of Sanofi. So I'm happy to present you in JMP an end user tool we have developed for immunogenicity assessments. First short introduction of the company I work for. So originally I was employed by Ablynx. Ablynx is a biopharmaceutical company which is based in Ghent, Belgium. And we are engaged in the discovery and development of nanobodies. Nanobodies are camelid heavy chain entities, as you can see here, and you can discriminates. (OK, I will first my pointer...set my pointer. I don't think you see anything.) Okay, here we are. So nanobodies are entities from camelid heavy chain antibodies and they are smaller than the conventional antibodies. Conventional antibodies have to have two heavy chains, two large chains and camelid antibody you only have the heavy chain. So they are small, they can also be used modular you can have multiple nanobodies and have multiple specificities and they're also easier to manufacture. We were founded in 2001 and then 17 years later, we had our first compounds in the clinic. It was Cablivi. It was and it is targeted for a rare blood disease aTTP. And 2018 was another important year as then we got acquired by Sanofi, which is a large biopharmaceutical company with headquarters in France. So then at that time, we were more than 450 employees, but with the acquisition of Sanofi and we are part of a fairly large company. Sanofi itself has more than ??? employees with around 15,500 employees committed to R & D. And you see we are spread across the three...across three continents. And also therapeutic platform of Sanofi is very wide. 
It's not only the nanobodies; it's really a lot, and it's still growing. Then, about the tool we have developed: it's an immunogenicity tool. About the context of immunogenicity: immunogenicity is the ability of a substance to provoke an immune response. Often it's wanted. It's when your immune system has an appropriate response to a pathogen, but it can also be a wanted response towards a therapeutic agent, as in the case of a vaccine. It can also be an unwanted response to a therapeutic agent, and then the antibodies produced are called anti-drug antibodies, or ADAs. With these ADAs you can have allergic reactions or even anaphylactic shock, and loss of efficacy: you can have antibodies with neutralizing capacity, so that you lose your activity. So it will be no surprise that this has special attention from the regulatory authorities; several guidelines have been issued by the EMA and the FDA regarding ADAs and how to analyze them. A complete ADA determination involves several steps; it's a multi-tiered approach. You have a screening assay, where you determine your sensitivity; then you have a confirmatory assay, where you look at whether the response is specific or not. You also have characterization assays, which are done to determine whether your antibodies have neutralizing capacity, and sometimes you want to determine which isotypes are targeted. So what we have to do is determine cut points, so that we can discriminate the positives from the negatives, and those have to be determined on a so-called blank or naive population. The blank population, of course, should be representative; it should be free of outliers, should not have pre-existing antibodies, and should incorporate all sources of variability. You have biological variability, so the different subjects that you test, the subject IDs, but you also have technical variability, which comes from your analysts, your runs, and plate-to-plate variability. When you want to assess this in a proper way, you prefer to do it with a mixed-effects model fitted by REML, so that you can have these elements as random effects in your model. So there was, within Sanofi, a need for an end-user tool for cut-point setting. The aims were to have a harmonized and standardized approach across all sites; to have a user-friendly end-user tool with state-of-the-art analysis, so it should be a statistical package, because we want to make use of the mixed-effects model. It should also operate in a regulated environment, because we also have lots of GxP studies, and it should include uniform and automatic reporting. So that's where we started, and we explored whether we could do it with JMP, thanks to its scripting language, JSL. As you all know, the script can easily be retrieved when you have a graph: you can just click Save Script, and then you have the script. So that's an easy start. What is also an asset is that it is programmable towards outcome-dependent decisions. So we started by retrieving the scripts from every graph and every analysis we wanted to do, and then it became a really huge analysis. We have one parental code, as you can see here, and we have different child codes that are called by the parental code. 
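To make the statistical core of the talk concrete, here is a rough Python sketch of one common way a screening cut point can be derived from a blank (drug-naive) population: fit a random-intercept model by REML, pool the between-subject and residual variances, and take an upper percentile. This is a generic illustration only, not the validated JSL tool or its decision tree; the column names, the log transformation, the 5% false-positive target, and the statsmodels attribute layout are all assumptions of the sketch.

import numpy as np
import statsmodels.formula.api as smf
from scipy.stats import norm

def screening_cut_point(df, alpha=0.05):
    """df columns (assumed): 'signal' measured on blank/naive samples and 'subject' id.
    Returns a parametric screening cut point targeting an `alpha` false-positive rate."""
    df = df.copy()
    df["log_signal"] = np.log(df["signal"])
    fit = smf.mixedlm("log_signal ~ 1", df, groups=df["subject"]).fit(reml=True)
    mean = fit.params["Intercept"]
    total_var = fit.cov_re.iloc[0, 0] + fit.scale        # between-subject plus residual variance
    cut_log = mean + norm.ppf(1 - alpha) * np.sqrt(total_var)
    return float(np.exp(cut_log))                        # back to the original signal scale

A real tool would add analyst, run and plate as further variance components and would apply the outlier and normalization logic described in the talk before this step.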
And, as I said, it should be programmable towards outcome-dependent decisions, so the code gets quite a bit more complex, because we have to do a lot of statistical analysis and assess whether you have statistical significance or not, and there are different normalizations we want to assess. So those are some extra aspects of the code and what it looks like. Before giving a demonstration: there are, in fact, two steps to perform the analysis. First, there is an Excel template where the raw data should be populated, and then the JSL has to be launched; only the parental code has to be launched. I will go over these two steps. So we have one Excel template. I'll open it so that you can have a look. Here you see an overview, where we can populate it, and here is the data for upload in the Excel file. As I have said, it can be an ADA assay or an NAb assay, and depending on whether it's an NAb or an ADA assay, the header names are different. So here you can say it's an ADA assay, either with or without confirmatory data. The code is adapted so that it can accept either summarized data or raw, replicated data. Say it's replicated data; then you can enter the number of replicates, for instance three replicates. And in order to determine if it's a ??? point, we also relate it to the negative control of a plate, and there should be multiple negative controls on one plate, for instance two, which I call Start and End. When you populate this, we make a worksheet, and then here you have your Excel template that you have to populate; it's ready for upload and has the correct naming for upload. Once you have done that, you can go to JMP. I already have JMP open. What you have to do then is click on the parental JSL code, and automatically an interaction menu appears. Then you have to fill in the fields; I'll just enter "test for JMP" here. There are five types of assays, so let's say we now have an ADA assay, and here you can upload the file. I have here some dummy input data to upload, so we select it. And here there are some numbers you can change. The default numbers are filled out, but if you want different percentage thresholds as acceptance criteria, you can change them. And if your data is reported to a precision other than three decimal places, you can change that; I think here there are no decimal places in the data, so I change it to zero. Then you have a second introduction window here. It's an ADA assay, so this window is typical for the ADA assay; for an NAb assay you get another window. Here you have to select whether it's clinical or nonclinical, because the cut points, I think, are different there. I have confirmatory data here and I want to analyze in a clinical setting, so I click it here. Now you have different outlier removal approaches that you can select. There is also another level of flexibility that you can enter. Just to make it user friendly, we have one BTD approach; BTD is an internal guideline, and when you want this approach, no other window opens. When you want extra options, you can click here, and you can have extra output or calculate, for instance, run-specific differences or analyst-specific differences. 
So when you click here, you have extra options, but I want to stick to the general upload approach. And then you can also have an upfront exclusion. We want to do it here and, once you have entered that, then the analysis starts. So it's a whole chain of analysis. It's an outlier removal approach and it's an iterative way. It checks so that when you have removed outliers, first analytical, than biological or if there are other outliers. So here, you see a series of analysis that's been performed. So it takes a while, because this data has screening data and also also confirmatory data. So now the calculations are going on. And then, at the end, there is a report. That's what you see here, that's still assembling the report now. And it's now outputting the data sets in PDF formats. Generating the outputs so now, and now, it's finished. So in the result parts, all results are automatically output here. That's what you see here. You have the data sets, you have a report and a journal, and also a PDF file. And then data sets, the different data sets that were outputs. And here you see that you have full report with the analysis settings, the methodology. Close it. And you have your descriptives. And of course what is nice for JMP is that you have your interactivity when you want to follow this subject, for instance. I have to double click on 411. You can follow it throughout all all the analysis so it's all linked. So, then you have here, then for my information of the screening cut point factors. It assesses different transformations, different normalizations, then it evaluates the most appropriate blank population. You have an output of all outliers, whether they were analytical or biological. And identities of the outliers. Yeah you have...so you see, you have all analysis that has to be performed, all analysis relevant for the settings. Same also for confirmatory cut points. And here is ADA scoring, whether the score is negative, positive, reactive. So by this, the whole analysis is done and you have the final conclusions. Under the data tables outputs. And the same, what we see here, the same, we also have, of course, in a PDF format. That can be then submitted for... submitted for...it's validated. We are able to validate it so everything we saw there, you see again here in a PDF format. So so, to conclude, what was the effect, what was the aim to do. We were able to generate codes to do an analysis very quickly and efficiently. Normally it took days for the analysis and for the reporting, now we can we have our... our analysis and report in 10 minutes, so it's an automated in a standardized way. It's an automatic reporting and it's also it's performed in a validated environment that's suitable for submission. So yeah I want to talk...to thank my colleagues, my colleagues of the non clinical efficacy and safety teams. I highlighted people that were involved in the development of the end user tool. And also yeah it was, of course, a collaboration between different teams, so I want to thank all people that were involved in that.  
Sunday, March 7, 2021
Bradley Jones, JMP Distinguished Research Fellow, SAS   JMP has been at the forefront of innovation in screening experiment design and analysis. Developments in the last decade include Definitive Screening Designs, A-optimal designs for minimizing the average variance of the coefficient estimates, and Group-orthogonal Supersaturated Designs. This tutorial will give examples of each of these approaches to screening many factors and provide rules of thumb for choosing which to apply for any specific problem type.     Auto-generated transcript...   Speaker Transcript Brad Jones Hello, thanks for joining me. My name is Bradley Jones and I work in the development department of JMP, and today we're going to talk about 21st century screening designs.   JMP has been an innovator in screening experiments over, well, I would say the last decade or so. I'm going to talk about three   different kinds of screening designs and tell you what they are, how to use them, and when you should use one in preference to another.   The three screening designs I'm going to talk about are A-optimal screening designs, definitive screening designs (or DSDs), and group orthogonal supersaturated designs (or GO SSDs).   Let me first introduce A-optimal screening designs. An A-optimal design minimizes the average variance of the parameter estimates for the model.   The way to remember that: the A in A-optimal stands for average.   By contrast, the D-optimal design minimizes the volume of a confidence ellipsoid around the parameter estimates by maximizing the determinant of the information matrix. That's a lot of words,   and it's hard to see why all of that's true from the determinant, but you can remember that the D in D-optimal stands for determinant.   So why am I saying that we should do A-optimal screening designs? D-optimal designs have been the default in the custom designer for 20 years. Why should we change?   Well, my first reason is that the A-optimality criterion is easier to understand than the D criterion.   Everybody knows what an average is,   and the average variance of the parameter estimates is directly related to your capability of detecting a particular active effect.   The D criterion, talking about this global confidence ellipsoid, doesn't tell you about your   capability to estimate any one parameter with precision.   So I think the A-optimality criterion is easier to understand.   But more than that, when the A-optimal design is different from the D-optimal design, the A-optimal design often has better statistical properties. It is true that a lot of the time the A- and D-optimal screening designs are going to be the same.   But there are times when they're different, and when they are different, boy, I really like the A-optimal design better.   So I want to motivate you with an example. For the first example, suppose I have four continuous factors,   and I want to fit a model with all the main effects of these factors and all their two-factor interactions. So there are four main effects and six two-factor interactions, but I can only afford 12 runs.   I'm going to do a JMP demonstration. In that demonstration, I'm going to make a D-optimal design for this case, an A-optimal design for this case, and then compare those two designs.   
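For reference, the two criteria can be written compactly (these are the standard textbook definitions, not anything specific to JMP). With model matrix X, p parameters, and least-squares covariance Var(beta-hat) = sigma^2 (X'X)^{-1}:

\[
\text{A-optimal: } \min_{X} \; \frac{1}{p}\,\operatorname{tr}\!\left[(X^{\top}X)^{-1}\right],
\qquad
\text{D-optimal: } \max_{X} \; \det\!\left(X^{\top}X\right).
\]

The volume of the joint confidence ellipsoid is proportional to det[(X'X)^{-1}]^{1/2}, which is why maximizing det(X'X) shrinks that volume, while the trace criterion directly targets the average of the individual coefficient variances.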
So I'm going to open up my JMP journal here and go to the first case, and that's the four factor with two factor interactions and 12 runs.   And this is a script that makes it where I don't have to actually create the D-optimal design. Just believe that it is the D-optimal. Here here's the D-optimal design. You notice that all of the numbers in the design are either plus or minus one.   And here's the A-optimal design and there's a surprise here. In this first column, you see that X1 has four zeros in it,   whereas X2, 3, and 4 have all plus and minus ones. So that's a bit of a surprise. It's a good thing that, that this is...all these factors are continuous, but let's now compare these two designs.   So to start the comparison, let's look at the relative efficiencies of the estimates of the parameters.   This is the efficiency of the A-optimal design to the D-optimal design. So when the...when the numbers are bigger than one, that means the A-optimal design is more efficient than the D-optimal design.   And there's one parameter that's less efficient for the A-optimal design, but all the rest of the   parameters are at least as well estimated as for the D-optimal design, and many of them are up to 15% better estimated by the A-optimal design.   I'd like to show you one other thing, and that is this.   This is the correlation cell plot of these two designs and you can see that the the D-optimal design has all kinds of correlations on...   off the diagonal here, up to 33.33; that's a third.   And, but a lot of other ones, whereas the A-optimal design only has three   pairwise correlations that are not zero. Everything else in this...in this set of pairwise correlations is actually zero, and the main effects are all orthogonal to each other, and the first main effect is orthogonal to all of the two factor interactions.   Many of the two factor interactions are also orthogonal to each other. All of this orthogonality means that it's going to be easier to make   model selection decisions for the A-optimal design than for the D optimal design. And then one one other thing that I think you might be interested in is the G efficiency of the A-optimal design is more than 87% better than the D-optimal design. What G efficiency is, is the   the maximum...it's a measure of the maximum variance of prediction within the space of the parameters. So the   maximum variance prediction for the D-optimal design is   nearly twice as big as the maximum variance and prediction for the   A-optimal design.   And   also, the the I efficiency of the A-optimal design is is more than 14% better than the D-optimal design. So for all these reasons, I think it's pretty clear cut that the A-optimal design is better than the D-optimal design in this case.   So let me clean up a bit.   Going back to the slides, this is just the picture that you just saw.   But I wanted to show it to you in JMP instead. So let's move on to another another reason why I like A-optimal designs.   A-optimal designs allow for putting differential weight on groups of parameters. So what does that mean? In a D-optimal design, you could,   in theory, weight the different parameters more...some parameters more than others, but   in so doing, you don't change the design that gets created.   The...whatever weighting you use is still going to give you the same...the same design, so that's so that means that weighting in D-optimal design is not useful. 
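One way to see why weighting changes an A-optimal design but not a D-optimal one (this little piece of algebra is mine, not from the talk) is to write both weighted criteria with a diagonal weight matrix W:

\[
\text{weighted A: } \min_{X}\,\operatorname{tr}\!\left[W\,(X^{\top}X)^{-1}\right]
= \min_{X}\,\sum_{i} w_{i}\,\operatorname{Var}(\hat{\beta}_{i})/\sigma^{2},
\qquad
\text{weighted D: } \min_{X}\,\det\!\left[W\,(X^{\top}X)^{-1}\right]
= \det(W)\,\min_{X}\,\det\!\left[(X^{\top}X)^{-1}\right].
\]

The constant factor det(W) does not depend on the design, so the weighted-D optimum is the same design as the unweighted one, whereas the weighted-A criterion genuinely shifts effort toward the heavily weighted coefficients.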
In fact,   for most of the variance optimality criteria, including I-optimal designs, weighting doesn't help you put more emphasis   in the design on on the parameters. But in A-optimal designs when you weight the parameters, you can achieve a different design and that might be useful in some   cases. So here's my weighted A-optimal design example. Suppose I have five continuous factors now, and I want to fit a model with all the main effects and all the two factor interaction. So I have   five main effects and ten two factor interactions and I can only afford 20 runs. So I care more about being able to estimate the main effects precisely, because I figured that   the main effects are going to be bigger than the two factor interactions. I want...I want to get really good estimates of them.   So what I'm going to do is make...again I'm going to make a D-optimal design for this case. I'm going to make a weighted A-optimal design for this case. And then I'm going to compare the designs. So here here's a picture   of the demo. So here's the D-optimal design and again, we see that it's all plus and minus ones.   Here's the A-optimal design, which...in which I have weighted the main effects so that they're 10 times more important than the two factor interactions. And now I want to compare these two designs.   Now, in this case,   I'm now looking at the D efficiency of the D-optimal design relatively...relative to the A-optimal design, and so the A-optimal design is better when   the numbers are less than one. And we see that, for all of the main effects, the A-optimal design is doing a better job of estimating them   than the D-optimal design. Of course, there's a price that you have to pay for that, because in weighting...downweighting that two factor interactions, you get slightly worse precision for estimating those.   So this is a trade off, but you said...you said you were more...or I said that I was more interested in main effects in two factor interactions and so that that's what I got.   But here's here this compared...comparison of correlations is another reason why you might like the A-optimal design preferred over the D-optimal design.   In the in the D-optimal design, there are a lot more pairwise correlations. Notice that in the the A-optimal design, all of the main effects are completely orthogonal to all the two factor interactions.   You can see this whole region of the correlation cell plot has...is white, which means zero correlation.   The main effects are not orthogonal to each other, except that X1 and X2 are both are all orthogonal to X3 and 4 or 5, that that is to say their pairwise correlations are zero.   But there's some small correlations   between   X1 and X2 and between X3, 4 and 5,   correlations of .2.   But there's there's a lot of orthogonality in this plot, way more than there is in   in the   the D-optimal design.   So if you really wanted to, if you were sure that you wanted to be able to estimate the main effects better than you estimate the two factor interactions, again the A-optimal design would be preferred to the D-optimal design.   I don't want to close everything, because that would close my journal too.   Okay, going back to the slides.   So when would I want to use an A-optimal screening design?   Well,   I'm going to tell you about DSDs and GO SSDs   after this, but whenever they're not appropriate, then you would use an A-optimal screening design and that may often be the case in real world situations.   
When you have many categorical factors and and some of the categorical factors may have more than two levels, then you can't use either a DSD or a GO SSD.   If certain factor level combinations are infeasible, for instance, if some corner of the design space it doesn't make sense to run,   then you would have to use the A-optimal design, because that that that is supported by the custom designer in JMP and and that can handle infeasible combinations,   either through disallowed combinations or inequality constraints. Or if there's a non standard model of that you want to fit, for instance, you might   want to fit   a three way interaction   of some factors, and then the DSD or the GO SSD are not appropriate there. And then, finally, when you want to put more weight on some effects than others   that your only choice is to use an A-optimal design.   So, so that concludes the section on A-optimality and I'm going to proceed now to definitive screening designs. These these designs first were published in the literature in 2011 so they're they're now 10 years old.   And this is what they look like. Here's the six factor definitive screen designs, but definative screening designs exist for any number of factors so it's...   they're very flexible.   What the...if we look first at the first and second run, we notice that   every place in Run Number 1 that is plus one, Run Number 2 is minus one, and every place that Row Number 1 is minus one, Row Number 2 is plus one.   So that Run 1 and 2 are like mirror images of each other, and if we look at all these pairs of runs, they're all like that. Every... every odd   run is a mirror image of every even numbered run. And finally we end with with a row of zeroes.   Another thing we might notice is that there are a couple of zeros for each   factor in the design.   So this, this is a useful thing, as it turns out, because it allows us to estimate quadratic effects of all the factors.   This overall center run allows you to estimate all the quadratic effects of all the factors, as well as the intercept term.   So, what are the positive consequences of running a definitive screening design? Well,   if there is an active two factor interaction, it will not bias the estimate of any main effect, so the two factor interactions and the quadratic effects, for that matter, are all uncorrelated with the main effects.   Also, any single active two factor interaction can be identified by this design, as well as any single...single quadratic effect. And it and...the final, very useful   property of this design is that if only three factors are active, you can fit a full quadratic model in those three factors, no matter which three factors there are. So and it and if...   that would also apply, of course, to two factors or one factor being active.   So you might be able to avoid having to do a response surface experiment, as a follow up to the screening experiment.   Now, in interest of full disclosure, there is a trade off that definitive screening designs have to make when comparing them with a D-optimal screening design, and that is that the parameter estimates for the main effects are about 10% longer than parameter estimates for the   the D-optimal screening design. So   so that's a small price to pay, though, for the ability. It also estimates all the...all the quadratic effects and have and have protection against two factor interactions.   So let's look at a small demo of definitive screening design in action.   
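For reference, the mirror-image structure described a moment ago has a compact form, summarized here from the published DSD literature rather than from the talk itself. A DSD for m factors can be built from an m-by-m conference matrix C, which has zeros on the diagonal, plus-or-minus ones elsewhere, and C'C = (m-1)I; the design stacks C, its fold-over -C, and a row of zeros:

\[
D \;=\; \begin{pmatrix} C \\ -C \\ \mathbf{0}^{\top} \end{pmatrix},
\qquad
\operatorname{diag}(C)=\mathbf{0}, \quad C^{\top}C=(m-1)I,
\]

so the design has 2m + 1 runs, and the fold-over pairing is what keeps the main effects orthogonal to every two-factor interaction and quadratic effect.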
So I recently wrote a paper with Chris Nachtsheim on how to analyze definitive screening designs, and when we submitted the original paper, the referees asked us to provide a real example. It's kind of difficult to provide a real example for a new kind of design, because nobody will have ever run one before; they haven't heard of it. But we have a way of instrumenting the custom designer so that we know how long a design is going to take to produce. So we had a set of factors that can make a design take longer to produce; one of them, for instance, is the number of random starts that you use. Here are the times that the custom designer took for all of these examples. Then, under DOE Definitive Screening, there's an automatic analysis tool called Fit Definitive Screening, and if you create a definitive screening design using JMP, this script is always available, so we can see what the analysis is. The analysis goes in two stages: first the main effects are fit, then the second-order effects are fit, and then they're combined. It turns out that this Factor E is a type I error, but all the rest of these effects are actually active, and we know that because we also ran a full factorial design on all of these factors, so we know what the true model is in this case. Factor E didn't show up in the analysis of the full factorial design, but all these other effects, including the squared effect of Factor A and these three two-factor interactions, were real effects that were identified by the definitive screening design. So that example got inserted in the paper, which was eventually published by Technometrics. So when would I use a DSD? DSDs work best when most of the factors are continuous. We do have a paper showing how to add up to three two-level categorical factors, but as you continue adding more two-level factors, the DSD doesn't perform well. So most of the factors need to be continuous. Also, if you expect there to be curvature in your factor effects, for instance if you think there might be an active quadratic effect, then you would want to use a DSD in preference to a two-level screening design. And you can also handle a few two-factor interactions when using a DSD. So those are the times when you would use a DSD. That's all I have to say about them right now, but I want to move along to group orthogonal supersaturated designs, or GO SSDs. So this is a picture of a correlation cell plot for a group orthogonal supersaturated design, and the cool thing about them is that the factors come in groups. Here, one group is W1 through W4. Those factors are correlated with each other, but they're uncorrelated with any other group of factors, so the factors are in groups that are mutually uncorrelated between groups, but a little correlated within groups. And of course, since the design is supersaturated, there has to be some correlation; we're just putting the correlation in very convenient places. So here's the paper on how we construct these group orthogonal supersaturated designs, with a cast of many co-authors, including Ryan Lekivetz, who is another developer in the JMP DOE group. Here are pictures of my co-authors. Here's Ryan. Chris is a long-term colleague of mine, probably 30 years.
Dibyen Majumdar is an associate dean at the University of Illinois in Chicago, and Jon Stallrich is a professor at NC State. So I'm going to talk a little bit about the motivation for doing supersaturated designs, tell you how to construct GO SSDs, how to analyze GO SSDs, and then compare our automatic analysis approach to other analytical approaches. A supersaturated design is a design that has more factors than runs. So, for example, you might have 20 factors that you're interested in, but runs are expensive and you can only afford a budget for 12 runs. So the question is: is this a good idea? One of my colleagues at SAS, when I went and asked him what he thought about supersaturated designs, said, "I think supersaturated designs are evil." I felt my ears pin back a bit, but I went ahead and implemented supersaturated designs anyway. I understand, though, why Randy felt the way he did. The problem with supersaturated designs is that the design matrix is singular, so you can't even do multiple regression, which is the standard tool in Fit Model. Also, there is factor aliasing, because the factors are generally correlated; in fact, in most supersaturated designs, all the factors are correlated with all the other factors, and so there's this feeling that you shouldn't expect to get something for nothing. In the early literature, they were introduced by Satterthwaite in 1959, and then Booth and Cox in 1962 introduced a criterion for choosing a supersaturated design in a kind of optimal way. This criterion basically minimizes the sum of the squared correlations in the design. John Tukey was the first to use the term supersaturated, we think, in his discussion of the Satterthwaite paper; the paper was published with discussion. And a lot of the discussants were very nasty. But in Tukey's discussion, he says, "Of course, Satterthwaite's method, which is called random balance, can only take us up to saturation (one of George Box's well-chosen terms)." A saturated design is a design where all of the runs are taken up by parameter estimates. But Tukey says, "I think it's perfectly natural and wise to do some supersaturated experiments." The other discussants largely panned the idea, though, and nothing happened in this area for 30 years. Then in 1993, two papers got published almost simultaneously, one in Biometrika by Jeff Wu and the other in Technometrics by Dennis Lin. Jeff holds the Coca-Cola Chair at Georgia Tech, and Dennis Lin is the chair of the department at Purdue now. So when would you use a supersaturated design? Well, you certainly want to use it when runs are expensive, because then you can have fewer runs than factors and therefore a much less expensive design than you would need if you were using a standard method, which always requires at least as many runs as there are factors. If you've done brainstorming of experiments, you know that when everybody gets their stickies out and writes down a factor they think is important, you end up with maybe dozens of stickies on the wall, each representing a factor that at least one person thought was important.
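For reference, the Booth and Cox criterion mentioned here is usually written as the average of the squared off-diagonal elements of the information matrix over all pairs of factor columns; this is the standard textbook form, written from memory rather than taken from the slide:

$$ E(s^{2}) \;=\; \frac{\sum_{i<j} s_{ij}^{2}}{\binom{k}{2}}, \qquad s_{ij} = \mathbf{x}_i^{\top}\mathbf{x}_j, $$

where \(\mathbf{x}_i\) is the \(i\)-th factor column and \(k\) is the number of factors; an E(s²)-optimal supersaturated design minimizes this quantity.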
And then what happens subsequently is that only a few of those factors are chosen to be experimented with, and the others are kind of ignored. That seems unprincipled to me: eliminating a bunch of factors in the absence of data seems like a bad idea. Another thing you might want to do is use them in computer experiments, because if you've got a very complicated computer experiment with dozens or even hundreds of tuning parameters, you can do a supersaturated design. That is especially good if the computer experiment takes a long time to run; you wouldn't want to sit and wait for weeks in order to run a very large conventional experiment. So how do we construct these GO SSDs? This is a little math, but we start with a Hadamard matrix (H), and we have another Hadamard matrix (T) from which we've dropped one or more rows, and then we take the Kronecker product (this interesting symbol here is the Kronecker product) of those two matrices, and that creates a GO SSD. When you form the X transpose X matrix of that, what you get is something that looks like this: square blocks down the diagonal, and all the other elements of the X transpose X, or information, matrix are zero. That's what creates the group orthogonality. Here's a small example. This H is an orthogonal Hadamard matrix, and this T is just H with the last row removed. The Kronecker product of H and T just replaces every element 1 in H by the matrix T, and every minus one in H by the negative of T. Since H is four by four and T is three by four, we're putting a three-by-four matrix in place of each single element of H, so what we come up with is a design with 16 columns and 12 rows. So it expands that example. Now, this is what happens when you do what I just talked about, and so we have 12 runs and 16 columns. The first column is the column of ones, which you would use to estimate the intercept, so we don't consider that a factor. And here are the correlations for this design. The first three factors have higher correlations among themselves, and they're also unbalanced: those columns don't have the same number of pluses and minuses. All the other factors have the same correlation structure, and they all have balanced columns. One other interesting thing is that their columns are fold-overs, so the main effects of each set of factors are uncorrelated with the two-factor interactions of those factors. Now, what my co-authors and I have recommended is that we not use these first three columns as factors; instead, because they're orthogonal to all the other factors, you can use these columns to estimate the variance in a way that is unbiased by the main effect of any of the factors. That way we get a relatively unbiased estimate of the variance, which is very helpful in doing modeling.
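The small example described here (a 4 x 4 Hadamard matrix H, and T equal to H with its last row removed) is easy to reproduce. The sketch below is my own illustration of that construction and of the block structure of the resulting information matrix; the particular Hadamard matrix is an assumption, since any 4 x 4 Hadamard matrix would do.

```python
import numpy as np

# 4x4 Hadamard matrix H, and T = H with the last row dropped (3x4)
H = np.array([[ 1,  1,  1,  1],
              [ 1, -1,  1, -1],
              [ 1,  1, -1, -1],
              [ 1, -1, -1,  1]])
T = H[:-1, :]

# Kronecker product: every +1 in H becomes a copy of T, every -1 becomes -T.
D = np.kron(H, T)
print(D.shape)                 # (12, 16): 12 runs, 16 columns

# Column 0 is the all-ones intercept column; the rest are candidate factors.
print(np.all(D[:, 0] == 1))    # True

# The information matrix D'D is block diagonal: 4x4 blocks of correlated
# columns on the diagonal (the "groups"), zeros everywhere else.
info = D.T @ D
for g in range(4):
    off_block = np.delete(info[4 * g:4 * g + 4, :], np.s_[4 * g:4 * g + 4], axis=1)
    print(g, np.all(off_block == 0))   # True: groups are mutually orthogonal
```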
So the design properties of this particular GO SSD are that we have three independent groups of four factors, and each factor group has a rank of three, which means I can estimate the effects of three of its factors. The fake-factor columns in this case have a rank of two, because you're using one of those columns to estimate the intercept, so you can estimate sigma squared with two degrees of freedom. That estimate is unbiased, assuming that second-order effects are negligible. So now, out of each of the three groups of four factors, I can estimate three effects, and I can test each group of factors using a sigma squared estimate that has two degrees of freedom. I've already pointed out that these factor groups are fold-overs, and that provides substantial protection from two-factor interactions. This table appears in the Technometrics paper. I'm happy to send you a copy of that paper if you just write me an email: Bradley.Jones@jmp.com, just my name at JMP.com. So there are lots of choices that you can use for creating these designs; it's reasonably flexible. In the paper, we talked about how to create the designs, but we also talked about how to analyze them. The first group of factors we want to use to estimate the variance, instead of using them as actual factors. We use the orthogonal group structure to identify active groups, and then within the active groups, we do either stepwise or all-possible regressions to identify the active factors within a group. As we go, we can pool the sum of squares for inactive groups into the estimate of sigma squared, which gives us more power for finding active factors in the active groups. And if you can guess the direction of the signs of your effects (often you may not know how big the effects are going to be, but you know which direction they're going, so you can guess whether an effect is going to be positive or negative), then for an effect that you think is negative, you can relabel the pluses to minuses and the minuses to pluses, so that the new effect will be positive. If you do that for all of the negative effects, so that all of the effects in each group are positive, that maximizes your power for detecting an active group. So we recommend that. We did a simulation study to compare our analysis procedure for GO SSDs to standard modern regression analyses, like the Lasso and the Dantzig selector, so we have three different analytical approaches: the Dantzig Selector, the Lasso, and our two-stage approach. We had three different GO SSDs, and we varied the number of active groups, starting from one active group; as you get more active groups, of course, you make it harder for the design to find everything. The power is high for all of the methods, except when all the groups are active. In this design, there are only three groups, and when you make all of them active, the Dantzig Selector and the Lasso have very poor power. But the two-stage analysis that we're proposing has essentially a power of one for estimating everything that's estimable.
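To make the analysis idea more concrete, here is a rough sketch of the flavor of the two-stage approach (estimate sigma squared from the fake-factor columns, test each group, then select factors within active groups). It is my own simplified illustration, using the design from the previous sketch and a placeholder response, and it is not the authors' exact published procedure.

```python
import numpy as np

# Rebuild the 12-run x 16-column GO SSD from the earlier sketch.
H = np.array([[1, 1, 1, 1], [1, -1, 1, -1], [1, 1, -1, -1], [1, -1, -1, 1]])
D = np.kron(H, H[:-1, :])

rng = np.random.default_rng(1)
y = rng.normal(size=12)                      # placeholder response vector

def proj_ss(cols, yy):
    """Sum of squares of the projection of yy onto the span of cols, plus its rank."""
    beta, *_ = np.linalg.lstsq(cols, yy, rcond=None)
    fit = cols @ beta
    return float(fit @ fit), np.linalg.matrix_rank(cols)

yc = y - y.mean()

# Stage 0: estimate sigma^2 from the fake-factor columns (columns 1-3; column 0
# is the intercept).  Because they are orthogonal to every other group, this
# estimate is not biased by active main effects elsewhere.
ss_fake, df_fake = proj_ss(D[:, 1:4] - D[:, 1:4].mean(axis=0), yc)
sigma2 = ss_fake / df_fake                   # df_fake == 2 here

# Stage 1: screen the remaining groups with an F-like statistic (rank 3 each).
for g in range(1, 4):
    cols = D[:, 4 * g:4 * g + 4]
    ss_g, rank_g = proj_ss(cols - cols.mean(axis=0), yc)
    print(f"group {g}: F = {(ss_g / rank_g) / sigma2:.2f}")   # large F suggests an active group

# Stage 2 (not shown): within each active group, run stepwise or all-possible-
# subsets regression to pick the individual active factors, pooling the sum of
# squares from inactive groups into the sigma^2 estimate as you go.
```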
In these two designs, there are seven active groups, and when all of them are active the Dantzig Selector and the Lasso don't perform nearly as well as the two-stage method that we are recommending. What our two-stage method is doing is using the structure of the design to inform the analytical approach. The Dantzig Selector and the Lasso don't care what the structure of the data is; in fact, they're used for arbitrary observational analyses as well as for analyzing designed experiments. So you would expect that a generic procedure might not do as well as a procedure that's constructed for the express purpose of analyzing this kind of design.
Jose Goncalves, Upstream Scientist, Process R&D Department, Oxford Biomedica Rebecca Clarke, Upstream Senior Scientist, Process R&D Department, Oxford Biomedica George Pamenter, Downstream Process R&D Scientist, Oxford Biomedica Thomas Chippendale, Upstream Senior Scientist, Process R&D Department, Oxford Biomedica   A significant growth in the generation of cell culture bioprocess data has been observed at Oxford Biomedica in recent years. This increase in data collection is not only driven by intensive process development and characterization studies, but also due to incorporation of high throughput production systems. Throughout a cell culture process, many online and offline process parameters are recorded. However, due to the complexity of biological systems, one of the main challenges is the identification of genuine factors that can influence process performance or product yield. Holdback validation is often used to avoid overfitting and prevent the inclusion of non-genuine terms in a model, but the use of a single validation column may not be effective in the prediction of the best model. This presentation will demonstrate how refitting models using multiple validation columns can allow the identification of the simplest and most frequently occurring model. Data from 67 bioreactor production runs containing 20 candidate predictors were used for analysis. The simulate function in JMP® Pro was used to randomly generate 100 validation columns.     Auto-generated transcript...   Speaker Transcript Rebecca Clarke Okay, hi everyone. My name is Rebecca and I'm joined today by Jose and George, and today we're going to talk to you about how we used the JMP Pro validation methods to help us stop wasting time investigating false effects. So firstly, I'm just going to introduce us as a company and what we do. We work for Oxford Biomedica. We're pioneers in the gene and cell therapy business, and we have a leading position in lentiviral vector research, development and bioprocessing. We have over 20 years of experience in the field, and we were the first company to treat humans with a lentiviral vector-based therapy. Some of the partners that we work with are listed here in the bottom right, so we work with Novartis, Sanofi, Boehringer Ingelheim, and Orchard Therapeutics, and some of the conditions that we treat include cancer, central nervous system disorders, and cystic fibrosis, just to name a few. Last year, Oxford Biomedica also joined the fight against COVID-19 when we signed up to the UK vaccine consortium. The consortium was run by the Jenner Institute of Oxford University, and the goal was to rapidly develop a COVID-19 vaccine candidate. By joining this consortium, Oxford Biomedica enabled the scaled-up manufacturing of the vaccine by giving the consortium access to our state-of-the-art manufacturing suites. Following on from that, we then signed a clinical and commercial supply agreement with AstraZeneca. With this agreement, we agreed to supply the AstraZeneca COVID vaccine should it prove to be successful in the clinical trials. And as I'm sure you're all aware, the vaccine was indeed successful, and Oxford Biomedica continues to manufacture the COVID-19 vaccine to this day. I also just want to briefly introduce the processes that we use here at OXB. A bioprocess is a process that uses a living cell or its components to obtain a desired product; a well-known example of such a process would be the fermentation of alcoholic beverages.
And during this process, the yeast is the living component, and the yeast plus all the other ingredients needed to brew beer are put into a bioreactor, where the yeast initiates a chemical reaction called fermentation. During fermentation, glucose is converted into ethanol. Once this fermentation reaction is complete, the contents of the bioreactor are harvested and purified via filtration, and we're left with the final product: beer. The process that we utilize here at Oxford Biomedica is similar to fermentation. The living component of our process is cell culture, so into the bioreactors we put the cell cultures and also the other components needed to make the lentiviral vectors. At the end of this production process, the contents of the bioreactor are passed on for downstream processing, where the product is purified, and then it gets bottled and shipped out for use in patients. This slide is just showing the life cycle of drug development. On the left-hand side, at the very bottom, is the research stage. It's at this stage that we identify target genes or proteins that are important in disease mechanisms. We also develop drugs against these targets and show whether the drug is successful in its ability to target the disease. And then, at the very end of the life cycle, we have commercial manufacturing, where the drug is made and bottled for use in patients. Our group, the Process Research and Development group, sits in between these two stages. It's our job to design, optimize, and scale up the production process and move drug product from the research bench through to manufacturing. Some of the processes that we use include small-scale shake flasks, and we work right the way up to 50-liter bioreactors. So, as you can imagine, we generate a lot of data during the course of our work, and the bioreactors have many input parameters that we need to monitor and optimize. The sheer volume of data that we generate makes it quite difficult to discern which of the parameters we should focus on during our optimization. And because of the sheer number of parameters that we have to work with, it's not commercially viable for us to optimize each one individually, so at this point we look to statistical modeling. We use the models to try and narrow down the list of potential parameters that we should optimize. And while these models are useful, we do generate a lot of random noise during our work, and this is down to the living nature of bioprocessing: no two cultures are the same. This random noise obviously gets incorporated into our statistical models, and this can lead to us wasting time looking at non-genuine effects during our optimization experiments. So it was with this in mind that we looked to use JMP Pro, specifically the validation features, which would hopefully help us to further narrow down the list of parameters that we'll look at, remove the non-genuine effects from the models, and also avoid overfitting our models. So now I'm going to hand over to Jose, who's going to talk you through the holdout validation method. Jose Goncalves Right, my name is Jose and I'm going to speak about the holdout validation method. This is a modeling tool only available in JMP Pro, and we were interested to explore how useful it could be at OXB for modeling of bioprocess data. So as Rebecca was saying, we use these stirred-tank
reactors in our process, and basically we grow cells in these bioreactors and produce these viral vectors, which are our main product. That is what we want to maximize and what we want to model, so this is our response variable, really. Bioreactors require many control inputs and produce a lot of data during one production process, so having a tool to help determine which inputs and outputs are worthy of investigation would be of great value for us. It would allow us to save time, free up resources, and allow for a more targeted approach to process investigations. Before jumping into model building, we first had to collect as much data as we could from historical batches. This data gives us indications of how our production process is running, and any abnormalities in the data might indicate a detrimental effect on our vector titers. Here on the left side, you can see that we collected data on 20 variables; unfortunately, we couldn't show the actual names of the variables and had to anonymize the data. So we list them here from X1 to X20, and since ours is quite a dynamic process, we can expect some degree of collinearity between the predictors. In summary, we have a data set that contains 20 variables, or 20 candidate predictors for our model, and we have collected the data from 67 historical batches. Since this was part of a JMP Pro trial, we thought it would be a good thing to begin by using standard JMP to build a model and then use it as a comparison to another model built using the holdout validation method. Here, to build the model without the validation method, we used the standard modeling platform in JMP, and we selected a response surface model because we wanted to allow all main effects, interaction terms, and squared terms to enter the model. In the personality, we chose to do a stepwise regression. On the right-hand side, we have the stepwise regression control panel; we set the stopping rule to be the p-value threshold and left the remaining parameters at their defaults. After running the stepwise regression, this was the final model that was generated. Here you can see that we have eight terms in the model, and seven of them are actually significant. I would like to bring your attention to these first three factors, because they are highly significant in our model, and they will be very important in a couple of slides when I show the results of the approach using the holdout validation method. The overall model here is significant and can explain about 44% of the variation in our data set. So this was our initial approach. We were happy with this first model that was generated, but we were interested to know more about validation methods, because investigating all of these predictors would take a considerable amount of time and laboratory resources. So that was the approach we took next. Here is an overview of how the holdout validation method works. Essentially, our data set will be randomly split into two groups. On the left-hand side, 70% of the data will be assigned to a training set, which is the portion of the data that will be used exclusively to build the model.
And here on the right hand side we have 30% of the data that we will be assigning to a validation set, and this is the proportion of data that will be used to stop our model building process when this R squared of the validation set, which is its maximum. So here at the bottom, we can see, this is a screenshot of the stepwise regression platform, which is essentially the same as we used before, but there was as a stopping rule, which shows this to be, the maximum validation R squared instead of our p value threshold. So and here, this graph at the bottom, it shows we have the training set and we have the validation set here, which is labeled excluded. And so, as you can see, the training set, as we start entering terms into the model, because we have this rule here, this forward direction rule, we start with zero terms into the model, and then we go in a stepwise fashion to add terms into the model. As we start adding those terms, the R squared increases quite fast, up to when we have five terms into the model and afterwards the there is not much of an increase in the improvement in the R squared when we start...when we keep adding more and more terms. And so one of the important things when we use this method is also what happens into the validation set, because this is related to our stopping rule. And so what happens is essentially the same as we start adding terms into the model, the R squared increases quite fast, up to when we reach the maximum R squared and then after that point, as we keep adding more and more terms the R squared decreases. And so, this is exactly what what this validation set does is to...what...we want to use this validation set to stop the model building process at this stage, so we can include all these these terms here that explain most of the variation in our data set and avoid the inclusion of all these terms in the model that might create model overfitting or or they might not actually be genuine at all. So after running the stepwise regression, so this was the final model that was generated and, as you can see, we have we had a reduction from eight or seven significant terms into only three. And, as you recall, these terms here they were highly significant when we didn't use the the the the hold out validation method and so here they don't appear to be significant at all. And actually this model is only able to explain about 4% of the of the variation in our data set. And so, this was quite...we we thought this was quite a dramatic decrease in the number of terms in the model and we start thinking, okay, maybe we were a bit unlucky with the validation set that was generated for for for building this particular model and we thought we... One of the great things in JMP Pro is that we can use this generalized regression platform, which is essentially...the output is essentially it looks the same as in a stepwise regression, but in this generalized regression platform, we are able to use this simulation function. And what this basically does is to rerun the model over and over again, as many times as we want. And then use it for each model that is generated, it uses a different validation set. so this was the approach we took next. And because we couldn't list all of the all of the models that we generated, essentially we just...we tried to generate 100 models, or we did 100 simulations of models. So we couldn't list them all in here. 
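As an illustration of the idea of refitting with many different random validation splits and counting how often each candidate term survives, here is a hedged sketch in Python. JMP Pro does this through the generalized regression platform and its simulate function; the sketch below only mimics the spirit of that workflow, with invented data, a plain forward selection, and a 70/30 split.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 67, 20                                # 67 batches, 20 candidate predictors
X = rng.normal(size=(n, p))
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + rng.normal(scale=3.0, size=n)   # weak signal

def validation_r2(terms, tr, va):
    """Fit OLS on the training rows, return R^2 on the validation rows."""
    A = np.column_stack([np.ones(len(tr))] + [X[tr, t] for t in terms])
    beta, *_ = np.linalg.lstsq(A, y[tr], rcond=None)
    Av = np.column_stack([np.ones(len(va))] + [X[va, t] for t in terms])
    resid = y[va] - Av @ beta
    return 1.0 - resid @ resid / np.sum((y[va] - y[va].mean()) ** 2)

counts = np.zeros(p)
for _ in range(100):                         # 100 random validation columns
    idx = rng.permutation(n)
    tr, va = idx[:int(0.7 * n)], idx[int(0.7 * n):]
    terms, best = [], -np.inf
    while len(terms) < p:                    # forward selection, stopped at the
        scores = {j: validation_r2(terms + [j], tr, va)   # maximum validation R^2
                  for j in range(p) if j not in terms}
        j, s = max(scores.items(), key=lambda kv: kv[1])
        if s <= best:
            break
        terms.append(j)
        best = s
    counts[terms] += 1

print(counts / 100)   # fraction of refits in which each predictor entered the model
```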
And the way we we wanted to to summarize the results is to...we tried to count the number of times that a particular factor was picked up to be included in the model after those 100 simulations. And so, this is exactly what this graph is showing here. So on the right...on the left hand side we have just...these are just the parameter estimates. This is good just to observe the scatter of the data. And here, as you can see, we have the X13, X8 and it's this interaction terms that were actually significant when we didn't use this method. And here if we take as an example, X13, it was only...it was only picked up 46 times to be included in the model or 46% of the time to include in the model, and so, which is quite a low number, as we were expecting since these factors here, they were highly significant when we didn't use this method. So the conclusion that we take on using the the hold out validation method is that it's definitely a good tool to have for scientists and especially when we when we build models and we have these variables that come out as significant and they don't quite fit in our scientific understanding. So another one...another conclusion that we take is that specifically for this study is that we can safely assume that the process attributes that we studied do not significantly impact our vector titers. So I'm going to hand over to George now. He's going to speak about the autovalidation. George Pamenter Okay hi everyone, my name is George. I work in the purification group here Oxford Biomedica. I'm going to be talking a little bit about this autovalidation method that was put to us from JMP and its specific use in design of experiments, like data structure. And I'm going to talk a little bit about how we've used that methodology to actually remove a variable from an optimization model of viral vector purification. Okay, so just to start with a little bit of background on a viral vector purification for those that don't know. So essentially we receive this cell culture material from the bioreactors. And that material is quite crude material, so includes our therapeutic of interest are very viral vectors, that also includes thousands of other species, contaminant species. And they can be anything from unwanted proteins to bacterial DNA or sometimes even ineffective viruses. And so the aim of purification really is to separate our pure viral vector product from all these different contaminants. But these can be incredibly complicated processes, and you know, involving chromatography or enzymatic treatments and a number of different factors that affect their performance. And so we need a way of efficiently selecting the best conditions to reach the best purity. And the way we do that a lot of the time is by use of design of experiments. And design of experiments that are common throughout the bioprocess development timeline, we essentially use them to screen, to optimize, and to fully characterize our bioprocesses. Essentially, as we move through the drug development timeline, as we move towards commercial manufacturing, the size of our experiments goes up. The number of experiments we can do actually goes down, and it becomes harder and harder for us to change things the closer we get to a commercial product. And so we need to be able to make smart decisions early on in the development timeline and the way we do this is by developing effective models. 
So if we have an effective model early on, it affords us the confidence that we've chosen the best conditions as we go through our scale-up procedure. I'm just going to detail how we've used this autovalidation technology on a specific example. We conducted a response surface design of experiments on a viral vector purification step; again, we're not able to share the actual data with you, so we've had to normalize it here. This was a three-factor design space, and our output, or response variable of interest, is this impurity level. The impurity level could be the amount of contaminating DNA or protein, something of that description. The goal of this, really, was to optimize these models for the lowest impurity level possible. And this is how it initially looked in our JMP setup: there are three input variables and this response, impurity. We initially built some models using the stepwise regression platform and also the Lasso regression, which is a JMP Pro feature. And actually, at the beginning, we were quite pleased with what we were seeing: we were explaining quite a lot of the data, we seemed to be explaining it quite well, and we started seeing the variables we would expect to see crop up. But there was one variable that made us less sure, and that was this X1 X3 interaction, which you can see I've highlighted here. If you draw your attention to the top right of the screen, you can see that this X1 profile completely changes depending on the value of X3: at low values of X3 there is essentially no real effect, to be honest, and at high values of X3 we've got this kind of positive correlation. That really confused us, and it was a little bit concerning, because this X1 X3 interaction really didn't fit with what we would understand scientifically. So it made us a little bit confused, and it also made us wonder whether our models were really explaining the data we'd seen properly. It also had significant real-world implications for process development, in the sense that this X3 variable was actually known to change quite a bit. The analogy I would use here is that if we were developing the dosing of two drugs, say X1 and X2, and we discovered that the dosing actually depends on how old the patient is, say X3, that's something that you really have to dig into and characterize. So exploring this interaction would have required significant extra experimentation, in cost and in time as well. We wanted to run some validation methodology on that, so we consulted our colleagues at JMP, and they essentially said to us that, unfortunately, it wasn't really possible to use holdout validation on a design of experiments data structure. The reason for that is, if you can imagine assigning 70% of the runs to a training set, you would be building the model on only part of the design space and completely missing the rest of it. So it's not really applicable to use holdout validation on DOE data sets. So our colleagues at JMP came to us with this autovalidation. And the way it was explained to me was essentially that you resample your entire data set: you use your whole data set for training and your whole data set for validation.
And that might appear at first glance kind of like cheating (it certainly did to me), but I'm told that the way in which this is achieved fairly is by the application of these three extra columns, and really by this paired fractionally weighted bootstrap weight. Essentially, I think the idea here is that if you use a piece of data heavily in the training of your model, you then weight it so that you don't use that same piece of data as much in the validation, and that's how you get around this idea of double-counting your data. So we used this methodology and ran it on our models. Here you can see the setup in JMP; we again use the generalized regression platform, which is a JMP Pro tool. It's pretty much the same as what Jose was explaining, but instead of simulating different training and validation assignments, we simulate this partial weighting. We simulate that a number of times to generate different models, and we actually went through this 500 times. The output, as you can see, is like Jose's before: on the right here is a histogram of how many times a particular variable was included in our model, and on the left are the parameter estimates from the various models. I've drawn this red line here and called it the null factor line. This null factor variable operates sort of like a fake variable. We know it to be a nonsense variable because, if we look at the null factor on the left here, you can see its parameter estimates are quite widely distributed around zero, so it doesn't really know where to go. We draw this line at the first time the null factor appears, and anything below it we would consider quite likely to be a non-genuine effect; this is the range of the non-genuine terms. And I've highlighted in blue, as I'm sure you can see, this X1 X3 interaction, which was the one of concern. You can see here that it came up fewer times than a lot of the null factor effects, so this was the first indication we had that this might not be a genuine effect. We didn't want to stop there, so we ran multiple iterations of this with different autovalidation setups, and I've shown three examples here. You can see that in every example, highlighted in blue, this X1 X3 interaction was consistently coming in below this red null factor line. Based on that, we were pretty confident that what we were seeing probably wasn't a genuine effect, and so we were able to eliminate it from the model. Now, just a comparison of the final outputs we got. On the left here is the initial model we built, which included this X1 X3 interaction, and on the right is the autovalidated model where we removed it. I suppose the first thing we noticed was the reduction in R squared. At first it appears that you're probably explaining less of the data, but actually now we're a lot more confident that what we're explaining is genuine. And you can see here the difference in the minimization optimization conditions that it output. This actually had a number of benefits for us in terms of processing. Remember, these are all real-life variables; they have real things behind them. So this X1 variable, that's actually a very expensive thing for our processes.
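The paired fractionally weighted bootstrap idea described above can be sketched roughly as follows. This is my own illustration of the general concept (anti-correlated training and validation weights drawn for every row, then many weighted refits whose selected terms are tallied), with hypothetical data and an assumed exponential weight distribution; it is not the exact scheme implemented in JMP Pro.

```python
import numpy as np

rng = np.random.default_rng(7)
n, p = 30, 6                                 # small, made-up data set
X = rng.uniform(-1, 1, size=(n, p))
y = 1.0 + 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.5, size=n)

def wls(A, yy, w):
    """Weighted least squares; returns the coefficient vector."""
    sw = np.sqrt(w)
    beta, *_ = np.linalg.lstsq(A * sw[:, None], yy * sw, rcond=None)
    return beta

def wsse(A, yy, w, beta):
    """Weighted sum of squared residuals."""
    r = yy - A @ beta
    return float(np.sum(w * r * r))

counts = np.zeros(p)
n_refits = 500
for _ in range(n_refits):                    # 500 autovalidation refits
    u = rng.uniform(size=n)
    w_train = -np.log(u)                     # every row is used for training ...
    w_valid = -np.log1p(-u)                  # ... and, with anti-correlated weight, for validation
    terms, best = [], np.inf
    while len(terms) < p:                    # forward selection on the training weights,
        cand = {}                            # stopped by the validation-weighted error
        for j in (k for k in range(p) if k not in terms):
            A = np.column_stack([np.ones(n)] + [X[:, t] for t in terms + [j]])
            cand[j] = wsse(A, y, w_valid, wls(A, y, w_train))
        j, sse = min(cand.items(), key=lambda kv: kv[1])
        if sse >= best:
            break
        terms.append(j)
        best = sse
    counts[terms] += 1

print(counts / n_refits)   # how often each factor entered the model across the refits
```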
So we were very...it was good to see that we could reduce that that variable. Again, also as this X2 variable goes down, that's actually also, for scientific reasons, quite a benefit to our process, so we're able to reduce that as well, which was of benefit to us. And these are kind of happy chance coincidences, I guess, but the one thing that we were really pleased with really was that actually, you can see the removing this X1 X3 interaction, we were seeing the actually the profiles of X1 and X2 were constant and that actually really fit with the scientific understanding that we expected, so we were quite happy to observe this. And obviously we've gone away now, and we've tested both of those models with follow up experiments just to see which one was performing the best. And on all occasions, this actually...this autovalidated model has agreed with the observations we've seen in our final final experiments. I'll just hand it back now, Rebecca. Rebecca Clarke I'm just going to do a quick summary of what we were talking about. So we know that non true effects have real world implications for both the development timelines and experimental costs for bioprocessing. And here we looked at two validation methods to help us with our bioprocesses, to highlight a potential non true effects that we can eliminate from our optimization experiments. So the hold out validation tool we used successfully to remove a number of parameters from our optimization list, and then we also use the autovalidation during our DOE type experiments. And, and in these cases the models that were generated are also then later confirmed by experimental work. So overall, we were very pleased with how we could use JMP Pro, particularly, the validation methods within our bioprocesses to help save us a lot of time and resources and not focus on non true effects. And just to finalize, we'd like to acknowledge Robert Anderson and Anna Roper at JMP who helped us throughout this presentation, and that is it for us, and thank you for listening.  
Yassir EL HOUSNI, R&D Engineer/Data Scientist, Saint Gobain Research Provence Mickael Boinet, Smart Manufacturing Group Leader, Saint-Gobain Research Provence   Working on data projects across different departments such as R&D, production, quality and maintenance requires taking a step-by-step approach to the pursuit of progress. For this reason, a protocol based on Knowledge Discovery in Databases (KDD) methodology and Six Sigma philosophy was implemented. A real case study of this protocol using JMP as a supporting tool is presented in this paper. The following steps were used: data preprocessing, data transformation and data mining. The goal of this case study was to evolve the technical yield of a specific product through statistical analysis. Due to the complexity of the process (multi-physics phenomena: chemical, electrical, thermal and time), this approach has been combined with physical interpretations. In this paper, the data aggregation (coming from more than 100 sensors) will be explained. In order to explain the yield, decision tree learning was used as the predictive modelling approach. Decision tree learning is a method commonly used in data mining. The goal is to create a model that predicts the value of a target variable based on several input variables. In our case, a model based on three input variables was used to predict the yield.     Auto-generated transcript...   Speaker Transcript YASSIR EL HOUSNI Hello. I am Yassir El Housni, R&D engineer and data scientist in the smart manufacturing team of Saint-Gobain Research Provence, in Cavaillon, France. We are working for the ceramic materials business units. In this presentation, we have two parts. In the first, we will present the data project life cycle that we propose for manufacturing data projects, and in the second, we will present two use cases from Saint-Gobain ceramic materials industries. Working on data projects across different departments, such as R&D, production, quality and maintenance, requires taking a step-by-step approach to pursue progress. For this reason, we defined a protocol based on the Knowledge Discovery in Databases (KDD) methodology and the Six Sigma philosophy, also known as DMAIC: define, measure, analyze, improve, and control. We define in this infinity loop seven steps to follow in order to manage a data analysis project correctly. In all of them, we make sure to have a good understanding of the process, because we believe it's a key to successful data projects in an industrial world. For example, to detect variation in the process we use a SIPOC or flow chart map, and to detect the causes of variation we use our problem-solving toolbox, which contains tools such as the Ishikawa diagram. The infinity loop also presents a route to achieve continuous improvement. Next, we will detail the approach step by step. Let's start with defining the project. It's necessary to define three elements clearly before starting a data project, and we propose here some questions which we found very useful for defining these three elements. First of all, the business need definition. Frequently, the target of a data project in manufacturing is to optimize a process, maximize a yield, improve the quality of a specific product, or reduce energy consumption. As part of defining the business opportunity, we should know how the result will be used: does the target need just visualization, or analytics? And after that, the expected impact should be a quantified gain for the business. Secondly, data availability and usability.
Here we launch a diagnostic analysis of data quality. This step is important to determine the feasibility of the data project. And then the team setup: a person from the data team, a person from the business unit team, and a person from the plant, a process engineer with a Six Sigma green or black belt. Let's move to the second step: data preparation. With transformation, integration, and cleansing, it's an important step which consumes a lot of time in data projects. For example, we have here different sources of data and we need to centralize them in one table. Mainly we use X for inputs and Y for outputs. In this step we use different tools in JMP, such as missing value processing, the exclusion of constant variables, and of course the JMP data table tools, which ensure the right SQL requests to transform the tables correctly. The third step is about exploring data with dynamic visualization, and with JMP we have a large choice of visualizations: for example, plotting the distribution of a variable and estimating the distribution law it follows, detecting outliers with box plots, nonlinear regression between two variables, and contour or density mapping to determine where each population is concentrated. We have a large choice of ways to plot this. We use these tools regularly in our work and have found them very useful. The fourth step is the development of the model, and it depends on the kind of analysis that we need: is the target to explain, or to predict? The first is about links between variables and serves to explain patterns in data. The second is about a formula and serves to predict patterns in data. Generally we cut our data sets into three blocks, 70% for training, 20% for testing, and 10% for validation, and sometimes, if we have a small amount of data, we use 70% for training and 30% for validation. If the model is good, we request a new set of data in order to drive decision making. We have two approaches, supervised and unsupervised learning. Today at Saint-Gobain we use the standard version of JMP, and we have access to the supervised learning tools, such as linear regression, decision trees and neural networks. We work a lot today with decision trees because they give us relevant results, which help us resolve challenges in the ceramic materials industry. The fifth step is about finding the optimal solution. Sometimes it's just one solution, but in other cases it's a combination of several models. And to ensure that the solution makes sense, we add some constraints, for example the min, max, and step of variation of each variable. The JMP profiler gives us great possibilities to optimize solutions quickly. From here, the next step passes to the plant, with the support of our process engineer with a Six Sigma green or black belt. The sixth step is about implementing the best solution in the plant, governed by only one representative model. For example, we implement control charts for output Y1 and analyze the different variations. In the seventh step, we monitor the model effectiveness and visualize the global gain from working on our data project; for example, in the pie charts we see the real impact on the global yield. And last but not least, the preparation of step one for a new data project, to ensure continuous improvement and the continuity of the infinity loop. That was all about the data project life cycle, and now we will present two real case studies of the protocol using JMP.
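As a generic illustration of the modeling step described above (a decision tree fit on a 70/30 training/validation split to explain a yield from sensor variables), the following sketch uses scikit-learn with simulated data; the variable names, data, and tree settings are invented for illustration, and this is not the Saint-Gobain model itself.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score

rng = np.random.default_rng(42)
n, p = 300, 10                                  # pretend sensor data
X = rng.normal(size=(n, p))
yield_pct = 80 + 5 * (X[:, 0] > 0) - 4 * (X[:, 2] > 0.5) + rng.normal(0, 2, n)

# 70% training / 30% validation, as described in the talk
X_tr, X_va, y_tr, y_va = train_test_split(X, yield_pct, test_size=0.3, random_state=1)

tree = DecisionTreeRegressor(max_depth=3, min_samples_leaf=20)  # keep the tree small and explainable
tree.fit(X_tr, y_tr)

print("validation R^2:", round(r2_score(y_va, tree.predict(X_va)), 2))
print("variable importances:", np.round(tree.feature_importances_, 2))
```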
In the two examples we studied the same process technology, an electric arc furnace, but the two have different products and different targets. In this technology, we have complex multi-physical phenomena: electrical, thermal, chemical, time, and other physical effects. Here we have, for example, more than 100 process variables that come from different kinds of sensors. The business need was to explain the global yield of a specific product, JO7. In it we detect many kinds of defects, Defect #1, #2, up to #N. To prepare the data sets correctly, we used Pareto charts, outlier processing with the Mahalanobis distance, recoding of attributes to correct typing errors, and missing value processing. In step three, explore, we present here just an example of the correlations we study between inputs in order to reduce the number of variables before building models. As our result, we found a decision tree with just three variables; the goal here was to explain why the yield is not at its maximum. So we have a decision tree with just three variables out of more than 100. For the model we used 70% of the data for training, because we don't have a large amount of data, and 30% for validation, and we got good results with a high R square; as you see, it's more than 70%. So the message that we passed to the plant is that, just with the specific settings of X1, X2, and X3, we can explain the global yield, and if we need to maximize Y, the percentage of this yield, we need a specific setting just for X1 and X3, and the global yield should improve rapidly; that's the cluster of points here. And for each project, we also give the plant the physical understanding of each parameter. For the second example, with the same technology but another product, we have just 80 process variables, and the target was to increase the number of pieces with no Defect D1. The need is about explaining the quality of a specific product, so we used the same methodology for our steps. For example, here we studied the same kind of correlations between inputs to reduce the number of variables that we put in the model. As a result, we again used a decision tree, but here we found 12 variables that explain this global yield with good results: as you see, the R square was 84%, the RMSE was 3%, and the sample size was 287. For that, we used the cross-validation method because we have a very small table. The first parameter was very important; as you see, it contributes 50%. It was difficult to explain a model with 12 variables to the plant, so when we plot just the first variable, we see visually that we can define a threshold with the variable X1 alone, and with it the global yield should improve rapidly. Thank you.
Arhan Surapaneni, Student, Stanford OHS Siddhant Karmali, Student, Stanford OHS Saloni Patel, Student, Stanford OHS Mason Chen, Student, Stanford OHS   Our projects include topics ranging from high level analysis of gambling utilizing hypothesis testing tools, probabilistic calculations and Monte Carlo simulation (with Java vs. Python programming) to strategic leadership development through quantification of troop strength in the Empire: Four Kingdoms video game. These projects carefully consider decision-making scenarios and the behaviors that drive them, which are fundamental to domains of cognitive psychology and consciousness. The tools and strategies used in these projects can facilitate the creation of user-interfaces that incorporate statistics and psychology for more informative user decision-making – for example, in minimizing players’ risk of compulsive gambling disorder. The projects are about the game of poker and use eigenvector plots, probability and neural network-esque Monte Carlo simulations to model gambling disorders through a game consisting of AKQJ cards. Offering a subtle analytical approach to gambling, the economic drawbacks are explained through multi-step realistic statistical modeling methods.     Auto-generated transcript...   Speaker Transcript Siddhant Karmali Hi everyone, this is Siddhant Karmali, Mason Chen and Arhan Surapaneni, and we're working on optimizing the AKQJ game for real poker situations. COVID-19 has affected mental health and can worsen existing mental health problems. The stressors involved in the pandemic, namely fear of disease or losing loved ones, may impact people's decision-making ability and can lead them to addictive behaviors. Addiction to gambling is one such behavior that has increased due to an increase in the site traffic of online gambling sites. This project analyzes how different situations in the game of poker affect how people make irrational decisions, including situations that may lead to problem gambling. We developed a simplified model of poker that only uses the ace and the face cards, so A, K, Q, and J. This increases the probabilities of certain winning hands, and we called it AKQJ, for ace, king, queen, jack. The variables in this model are the card value, the number of players, the number of betting rounds, and whether cards are open or hidden. The objective of this model is to simplify the complicated probability calculations for the winning outcomes in a full game of poker. And we extend this objective with the idea that, since poker in real life has more than one betting round, we can show that this model is effective even in variations of poker with different numbers of betting rounds. So this is the outline of the project. First, we researched emotional betting and compulsive gambling: what are the risk factors for compulsive gambling, what do compulsive gamblers think like, and why do they gamble? We found that people will gamble for thrills, just as people who have addictions to drugs use the drug for a high or thrill. So we infer that gambling as an addiction must hit the same chords in the brain that are involved in the reward system. And then we went to our technology, which was using hidden and open cards in realistic cases.
So hidden and open cards are...so open cards are the cards that a player keeps face up and the hidden cards are face down, and only the player knows its identity. The and then we made two separate algorithms. There was a comprehensive algorithm and a worst case algorithm. But comprehensive algorithm is more complicated since all the cards are hidden and it's hard to do calculations, and the worst case algorithm had some open cards so players...or our modeled players could infer whether you take the bet or not. And so this was our engineering part. We used JMP to model players play styles. And we also used Java and Python programming, as Arhan will show, to generate...to randomly generate card situations, and we calculated the probabilities and conducted correlation and regression tests in JMP. So hidden...hidden and open cards. Open cards are, as I mentioned, open cards are the play...cards that a player keeps face up so other players can see it. And hidden cards are facedown and only the player knows its identity. The comprehensive algorithm, which ...which is what...usually what happens in a real game of poker where players have to try and calculate the probability of them winning against another person or them winning against their opponent, based on their current hand. And in a comprehensive algorithm it's hard to do, since all the cards are hidden and you don't know which which card which player has. And the open cards make AKQJ game easier calculation wise. And the number of hidden cards increases with the number of betting rounds. so the first case we did was with one round and six players, which had six hidden cards. Then we have one betting round and five players, which had seven hidden cards and so on. So earlier, we...or in the model, there were six players given labels A through F. We assign them probability characteristics, which are the percentages of confidence they have to make a bet. A's is 0%, B's is 15%, C's is 30%, D's is 45%, E's is 60% and F is 75%. And F's 75% probability means that unless they are...unless they are 75% sure...at least 75% sure that they will win against that person...their opponent, then they will not take the bet, so it means they're very, very conservative with their betting. a general poker case, which is the comprehensive algorithm, and the worst case algorithm. The general method is calculated or, for example, if we're trying to calculate the probability of A winning the poker match, in terms of the general method, we would have to use the probability of A is the probability of A versus A winning versus B times the probability of A winning versus C, all the way to probability of A winning versus F. This takes a very long to calculate and it's cumbersome in a real poker match, since the betting round time can be 45 seconds to a minute and not many people can do this kind of calculation in a minute. So the worst case...so that's where we developed the worst case method. The worst case...we calculated the worst case outcome by seeing which player can make the best hand with the cards they can see, out of four shared cards which are which are open to all players and one hidden card and one open card per player. We use these two algorithms in three different cases. The first case is with one betting round and six players. We have to determine in which cases each player will fold or stay and how many chips they will win or lose. 
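To show the kind of probability calculation the worst-case method is meant to shortcut, here is a hedged Monte Carlo sketch for the simplified AKQJ deck (16 cards: A, K, Q, J in four suits). It estimates, by simulation, how often a player's hidden card beats a single opponent's hidden card; the rules are deliberately simplified and are my own stand-in, not the students' exact algorithm or their Java/Python code.

```python
import random

RANKS = ["A", "K", "Q", "J"]            # AKQJ deck: 4 ranks x 4 suits = 16 cards
VALUE = {"A": 4, "K": 3, "Q": 2, "J": 1}
DECK = [r for r in RANKS for _ in range(4)]

def estimate_win_prob(my_card, n_trials=100_000, seed=0):
    """P(my hidden card beats one opponent's hidden card); ties count as half."""
    rng = random.Random(seed)
    wins = 0.0
    for _ in range(n_trials):
        remaining = DECK.copy()
        remaining.remove(my_card)        # my card is no longer in the deck
        opp = rng.choice(remaining)      # opponent's hidden card
        if VALUE[my_card] > VALUE[opp]:
            wins += 1
        elif VALUE[my_card] == VALUE[opp]:
            wins += 0.5
    return wins / n_trials

# A player with a 75% confidence threshold, like player F, would only bet when
# the estimated probability exceeds 0.75.
for card in RANKS:
    print(card, round(estimate_win_prob(card), 3))
```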
For example, A stays even if they lose chips, because A is the player we modeled as having problem gambling, so according to our condition they had to stay. B wins against E but not against anyone else; C does not win in this case; D wins against B and C and ties with A and E; E does not win; and F wins against all the other players. Because B did not win enough, and because C and E did not win at all, they all fold in the next betting round and lose their chips. Of the players who stay, considering this is a one-betting-round poker match, the one most likely to win is F, since they have the highest worst case winning probability. Going back to the previous slide, in the six-player case player F's overall probability was very close to 80%, so there was a strong correlation between the AKQJ worst case method and the general method. The next case is one betting round and five players. In this case the confidence values change: A's is 0%, B's is 12%, C's is 25%, D's is 38%, and F's is 65%. Player E was removed, since they lost the most chips and had to fold in the previous round. In this round with fewer players there are more hidden cards; the number of hidden cards increases with the number of betting rounds and players. With more hidden cards, the calculation time may take longer, and this may make players more nervous and unwilling to do those calculations, since they could lose money. In this case player F did not win, as shown by their worst case winning probability being less than their confidence percentage, so they are forced to fold. This could be because conservative players may not do well in the later stages of the game: they are too stingy with their money and do not make the right bets even when the stakes are higher. The next case is with four players, and we did this test to confirm whether player F wins or not; if F does not win, we can say with confidence that more conservative players do not do well in the later stages of the game. Note that F has to keep decreasing their winning probability. We also tested whether the worst case algorithm matches for five players. In the general method, B has an 11% chance of winning, D has a 46% chance of winning, and F has a 48% chance of winning. This is very close to the worst case values, so we get a strong correlation with an R squared of 0.998. In this worst case, F wins 50% of the time in the five-player match, and we also tested four players, confirming that F wins 50% of the time in the four-player case. The third case is three betting rounds with six players. In this case the confidence values are the same, and E, who was added back, is at 56%. In the first round F wins; however, as players start folding, like B, C and D do, F has to change their confidence level to match the winning probabilities for a round. F's level changes to 60% after the first round and 54% after the second round.
These are models, but for a real player this change would be involuntary, indicating that there is nervousness in a conservative play style, which contributes to such players losing in later rounds. Players A and F represent the extreme playing styles, which may be indicative of problem gambling. Here is a quick summary of the betting round calculations. In a game with two betting rounds, F only wins two times out of 20: F's possible hand is not good enough to match up against the opponents' possible hands. This happens in both two betting rounds and three betting rounds. It is due to the nervousness, and to the fact that F's required probability was too high; they could not match their confidence level. So perhaps the optimal strategy for doing well in a poker game is to be neither too aggressive nor too conservative with your betting. Be like player D, who had a 38% probability, meaning they would have to be at least 38% sure that they would win against their opponent; based on this, around that spot is a good place to be for poker. We also did the three-player test to confirm that player F has to fold and player D wins in that round. So we can say that player D has arguably the best strategy in this AKQJ poker model with more than one betting round and fewer than six players. We also did the two-betting-round case to show that F does not win there either, and another case, three betting rounds and four players, to test whether the outcome of F losing held throughout the betting rounds. Now, why is this important? We have shown that players can perform simple calculations, like the worst case, to control their urge to bet even on a losing hand. People with a gambling addiction may be very aggressive with their betting, and even though they know they are losing their money, they will still bet in the off chance, in the slight hope, that they will make a big win. This is an emotional style of betting and it falls right into the trap of the gambler's fallacy, which is thinking that whether you are on a lucky or an unlucky streak, you will get lucky the next time. These new cases of fewer players and more betting rounds introduced the idea of nervousness in players, even the most rational ones. F, in the one-betting-round, six-player calculation, looked like a player with experience who was able to wait and had a good strategy. We can apply this to a real poker game because, in humans, the sympathetic nervous system releases stress hormones like adrenaline into the body, eliciting a fight-or-flight response. When you are in a high-stakes situation like a poker game, you know at a superficial level what is at stake, namely your money and your assets, so you will bet to try and increase them.
That's just inherent human nature. During this poker game, more bandwidth is given to the amygdala, an area of the brain that controls emotion, rather than to the prefrontal cortex, which is rational and involved in executive functioning. So the more hidden cards there are in a round, the more nervous players may be about their bets and the worse the bets they may make, even if they are very experienced in poker. This nervousness may be correlated with the blind-betting nature of compulsive gambling disorder, based on the concepts of risk calculation and gambling for thrill. Our conclusions were that overly conservative players like player F may not do as well in realistic poker situations; the ones that do best are on the conservative side but are more willing to bet than a very, very stingy player. Our main takeaway is that gambling disorder may be mitigated if players can understand basic statistical calculations and use them in their games. The future research, which in fact we have already done, is to get more reliable data using Python, and that is what Arhan's presentation is going to show. Thank you for watching. Arhan Surapaneni Millions of people visit the capital of casinos, Las Vegas, every year to party, enjoy luxuries, but most importantly, gamble. Gambling, though offering a false reward of success in winnings, still hides the dark pitfall of financial and social struggles. Our generation is modernizing, with technological advancements rapidly becoming the norm and computer programming languages becoming the subjects needed to thrive in a modern world. Utilizing modern computer programming tools and ???, we are able to take a deeper look into this psychological problem and analyze tools to help solve it. This was done with authors Siddhant Karmali, Mason Chen and Saloni Patel, with the esteemed help of our advisors, Dr. Charles ??? and ???. As previously mentioned, millions of people attend casinos per year, so this affects a large population. With a problem that affects such a large sample size, one of the overarching questions is how we express solutions to the problem of gambling, and how we do this efficiently. Through Siddhant's presentation, we have learned much about original methods that help explain gambling, while also presenting the economic misconceptions about gambling and showing that being calm and cautious provides better results than relying on the luck of the game. All of this was done through Java. This method, however, takes 30 hours for just 92 runs, which is inefficient and practically unusable. Using Python reduces this time to seconds, while also allowing for higher levels of complexity that prove to be beneficial to the overall methods. To recap, the method requires a six-player model, each of whom receives two cards from the 16-card AKQJ pack. One of the cards is hidden, and four cards are placed on the table. Each player has a confidence level determining how often they fold or continue playing. Continuing to play loses three points and folding loses only one point. This is done in an effort to model players ranging from blind gamblers to cautious strategists. Conducting a large-scale experiment by hand is very tedious and time consuming if you want to run enough tests for the data produced to be usable.
An efficient alternative is using computer programming, more specifically the language Python in a Monte Carlo simulation, defined as a broad class of computational algorithms that rely on repeated random sampling to obtain numerical results. The Java program shown in the diagram is very inefficient: it only allows for two random samples, and individually sourcing out each specific sequence is difficult. When we are trying to derive larger sets of data, it is important to change this. Analyzing general differences between the languages, Java is statically typed and compiled, while Python is dynamically typed and interpreted. This single difference makes Java faster at runtime and easier to debug, but Python is easier to use and easier to read. Python's elegant syntax also makes it a very good option for scripting and rapid application development in many areas. This is applied in the following method. One thing to look at is the new randomizer: compared with the full 52-card set, the AKQJ 16-card model allows for more accuracy in statistical ??? rendering, especially when using a Monte Carlo simulation. One beneficial aspect that we can add with Python is characters. As Siddhant covered, we can now use wagers: instead of manually applying different confidence levels to our two-card ??? randomizer, we can use Monte Carlo to make different choices based on a certain percent threshold, with the ability to add or remove wagers, which we will see later on. First, it is important to talk about the deck. Here we see a 16-card array with our shuffling function. This randomizes the cards, similar to the initial function seen in the Java program, and draws two cards from the 16-card total for the different players. With this in mind, there are extensive applications to both the original comprehensive method and our worst case method, which will be developed later. ??? the same deck allows us to add changes or move things around to affect how we compute the worst case method, which is helpful for our end goal; we do not have the same flexibility in our old Java method. Python-specific changes for the worst case method include specific elif/if statements, so that the player with the worst card is marked as the loss, and the formula covered by Siddhant is changed accordingly. This is important because it allows more efficiency in data collection and makes the randomizer's outputs more accurate. You can also add specific names to the separate cards, which is another helpful application in itself. This simplified Monte Carlo simulation allows for more complexity, as it lets us add our new wagers based on the scenario with multiple betting rounds that Siddhant has described. We can change characters by changing the current wager, which affects whether the player stays in or leaves the betting round, this being a key difference from the Java method. The concept in this program has to do with setting a variable that will eventually return its value to the funds, and setting the wager to the initial wager argument. Setting the current wager to zero, we set up a while loop that runs continuously until the current wager is equal to the count. Then, for each set where we get a successful outcome, we increase the value by our wager.
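To make the deck handling concrete, here is a hedged Monte Carlo sketch in the spirit of the Python rewrite described above, not the authors' program: it repeatedly shuffles and deals the 16-card AKQJ deck and estimates how often one player uniquely holds the highest card. The real program compares full poker hands (pairs, full houses and so on); ranking only by the highest card is a deliberate simplification for illustration.

```python
# Repeatedly deal the 16-card AKQJ deck and estimate how often a given player
# alone holds the single highest card. This is a simplified stand-in for the
# full hand comparison used in the actual study.

import random

RANK_VALUE = {"J": 11, "Q": 12, "K": 13, "A": 14}

def deal(rng, num_players=6):
    """Deal two cards per player and four table cards from the 16-card deck."""
    deck = [(rank, suit) for rank in "AKQJ" for suit in "SHDC"]
    rng.shuffle(deck)
    hands = [[deck.pop(), deck.pop()] for _ in range(num_players)]
    table = [deck.pop() for _ in range(4)]
    return hands, table

def high_card(hand):
    """Numeric value of the best card in a two-card hand."""
    return max(RANK_VALUE[rank] for rank, _ in hand)

def estimate_win_prob(player=0, num_players=6, trials=50_000, seed=0):
    """Monte Carlo estimate that `player` alone holds the highest card."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        hands, _ = deal(rng, num_players)
        best = max(high_card(h) for h in hands)
        holders = [i for i, h in enumerate(hands) if high_card(h) == best]
        wins += holders == [player]     # counts only if there is no tie
    return wins / trials

print(estimate_win_prob())
```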
If we add a command telling the character to slow down at a certain wager, then we have a simple way of having a threshold for betting. We can also edit the same form to accept specific sequences, like a full house, only allowing a wager when the sequence is present. These thresholds create the above-mentioned risk management level. We can plot the probabilities in ???, where we append each updated current wager to an array of X values and each updated value to an array of Y values, and then use plt to plot the X and Y values. The key component of the better function is the condition in the if statement that corresponds to a successful outcome. This can be adapted to any outcome needed, including the general scenario and the worst case scenario. When we apply our comparisons with two characters and with multiple characters, we can add statements that make sure the data is compared properly. Using Python, you can make sure that hands like two pairs, or simply higher values, rule out players with a lower value, forcing them out of the game. This allows for more efficiency with the new Python program than with the original Java program: something that originally took 30 hours can now be done in a matter of seconds. After this program is applied, we are able to run a correlation test of the new results against the original Java results. If we look at the red lines for both the general and worst case methods, we see that they are extremely close to one, indicating strong correlations; this is also paired with a high R squared value. We also ran the one-proportion hypothesis test, which tells us that for both methods we fail to reject the null hypothesis. This value, although high, is not as close to one as we would expect for something that should be almost identical, given that this is computer programming. There are two main reasons for this. The first is sample size: 92 seems like a lot, but it is not as strong a trend as one would expect. To fix this we can increase the sample size to 1,000 or even 10,000. One more reason could be the application of Python: as previously mentioned, Python is an interpreted language rather than a static one, which could change the results slightly, though they mainly stay the same. Here we see the program finally applied, with its results presented. With this program we see which cards each player draws, which card is shown and which is hidden, what cards are on the table, how many chips players gain and lose, and finally who wins and how they win. In the diagram below, we see Player 1 wins with a full house. Other benefits include the ability to run multiple characters in one computer and one function, describing the different sequences while applying different numbers of players, showing the probability of different outcomes, and even adding the multiple betting rounds of a regular game of poker, which Siddhant has covered. This is vital to further developments, because it allows people around the world to use this program and method to develop the ideas further, as a learning tool that could be utilized as therapy for gambling addicts. One more development exploits the neural network (AI) aspect, making the program more detailed by adding features like bluffing, enough to make it more like the actual game seen in casinos.
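A minimal sketch, assuming a simplified version of the betting loop and plotting step just described (not the authors' code): a character repeats a fixed wager until a wager count or threshold is reached, wins each round with a placeholder probability, and the running funds are collected into X and Y arrays for plotting with matplotlib. All names and parameter values are illustrative.

```python
# Simplified Monte Carlo betting loop with a threshold "slow down" rule and
# arrays collected for plotting. matplotlib is assumed to be installed.

import random
import matplotlib.pyplot as plt

def better(funds, wager, count, win_prob, threshold=None, seed=None):
    """Return wager indices (x) and running funds (y) for one simulated run.

    threshold, if given, is a simple risk-management cap: betting stops once
    that many wagers have been placed, even if count has not been reached.
    """
    rng = random.Random(seed)
    x_values, y_values = [], []
    current_wager = 0
    while current_wager < count:
        if threshold is not None and current_wager >= threshold:
            break                        # the "slow down / stop" rule
        if rng.random() < win_prob:      # a successful outcome for this round
            funds += wager
        else:
            funds -= wager
        current_wager += 1
        x_values.append(current_wager)
        y_values.append(funds)
    return x_values, y_values

x, y = better(funds=100, wager=3, count=200, win_prob=0.45, threshold=50, seed=1)
plt.plot(x, y)
plt.xlabel("wager number")
plt.ylabel("funds")
plt.show()
```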
Finally, we were able to see that by using Python we get not only higher accuracy but much higher efficiency, turning a program that originally took hours into one that runs in just a few seconds, and ultimately supporting our original hypothesis that being cautious gives more money and more wins. So next time you go to Vegas, or even play a game with friends, remember this project, and remember that being more careful and taking a bit more time with decisions will help you in the long run. Thank you.
Chi-hong Ho, Student, STEAMS Training Group Mason Chen, Student, Stanford OHS   In late February 2020, the COVID-19 pandemic began gaining momentum, and the stock market subsequently crashed. Many factors may have contributed to the fall of stock prices, but the authors believed the pandemic may have been the main cause. The authors' objectives for this project were to learn about stock investments, earn money in the stock market, and find a model to help determine the timing and amount of trading. All of the data was Z-standardized to help eliminate bias and for ease of comparison. Specifically, the authors used three Z-standardization values: Z (within stock), Z (NASDAQ Ratio) and Z (Group NASDAQ Ratio), comparing the current stock price with the previous stock price, the NASDAQ stock price and the group NASDAQ price average respectively. After that, the authors combined the three Z-values into a stock index, which greatly decreases data bias and is better for reducing investment risk. Determining the outliers identifies the timing of investment. The Quantile Range Outlier method is easily affected by skewness, because it assumes an approximately normal distribution; Robust Fit Outliers is a better tool, because it can eliminate the skew factor. The authors established a model to help people invest the right amount of money.     Auto-generated transcript...   Chi-hong Ho Hello everyone, I'm Chi-hong Ho, a junior at Henry M. Gunn High School. My partner is Mason Chen; he is a sophomore at Stanford Online High School. Our project is multivariate statistical modeling of stock investment during the COVID-19 outbreak. In late February, as the coronavirus pandemic spread around the world, the stock markets started crashing. There are several factors behind this year's stock market crash, such as the COVID-19 pandemic, the OPEC/Russia/USA oil price war, the 2.2 trillion dollar bailout package from the US government, and companies laying off their employees. There is more than a 30% unemployment rate due to the pandemic; compare that with the 1929 Great Depression unemployment rate, which was about 25%. Also, in November the US holds its presidential election. And the manufacturing supply chain shut down because the coronavirus pandemic had spread in China earlier. The COVID-19 pandemic influenced the stock market in the way the past crashes of the 1929 Great Depression and 1987 Black Monday did. Because the COVID-19 pandemic had a huge impact on the world, causing many deaths, the stock market should be strongly influenced by the pandemic. In US history, the crashes of the Great Depression of 1929 and Black Monday of 1987 both continued for a long period; COVID-19's did not. This year the stock market decreased by 25% from the peak between March and April. Before March, COVID-19 in the US had not spread as fast as in other countries. After the spread became global, many countries were locked down, and national and global lockdowns affect the stock market. Look at the graph in the left corner: that is the situation that happened in Korea. Asian countries experienced the COVID pandemic before America, so we use the Asian countries' situation to predict what will happen in the US.
Based on this graph, the colored marker is the COVID-19 inflection point for Phase II; it may impact the stock curve significantly, because the case growth speed seems to decrease a little in this short period. On the left side is the plot of cases versus date for the US: cases began to be added in early February and grew strongly through late March and early April. The right graph shows the stock market drop by date, which suggests the two curves are related to each other. When cases grew sharply, the stock market started to crash, and at the lowest point it was more than 35% down. Comparing China, South Korea and the US, we know the Asian countries went through the COVID pandemic before the US, so we can project what will happen in the US over the next few months. Based on that table and the data below, the duration of Phase III is really short, and by then it is safe for us to go back to work; the duration of Phase IV is double that of Phase III, and in that period we would feel even safer going back to work. After looking at the Asian countries' pandemic phases, we can predict the Phase III and Phase IV durations for the US: we estimate that in the best case the US end of Phase III should be around April 15 and the end of Phase IV around May 25, while in the worst case the end of Phase III is around April 30 and the end of Phase IV around June 10. Our project defines a stock investment strategy. Our objectives are to learn and experience stock investment, to earn money in the stock market, and to build a model for judging when to trade or exchange stock. First, we own eight high-technology stocks that were purchased in 2008 and 2009, with an average gain of about 400% as of March. Some of them are among the top 20 of the Standard and Poor's 500 stocks, with an average gain of more than 800%. We want to find a time to sell those stocks and get the money back. Because of the COVID-19 situation and the stock market crash, the stock prices were not as high as in March, so we wanted to sell quickly. After selling the high-tech stocks, we look at 23 COVID-impacted stocks, stocks that lost ground because of the pandemic; we expect those stocks to surge after a few months. Our chosen COVID-impacted stocks should have a minimum market cap of 3 billion. When we get down to trading, we can also make exchanges: we sell one tumbling stock and buy one rising stock for balance, and we need to make sure our stocks will surge in the coming months. We separate the stock transaction decision chart into three levels. The first level is to decide what we will buy, what we will sell and what we will exchange. In level two, we pick the stocks from the selling group, the buying group, and the exchange pair from the two groups. The third level is the tools we will use: the Z index and the outlier detection tools. The function of standardization is to give us an idea of how far the real data points are from the mean. Why do we need it? Because we need to convert the actual data to an index that is easier to compare, and standardization also eliminates the bias of the raw data. On the left side there are blue boxes, purple boxes and red boxes. The real data we collect is the input, in the blue box; the new index after the Z standardization is the output, in the red box; and the Z standardization itself is the tool we use, in the purple box.
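As a rough illustration (not the authors' model) of the three Z-standardization values named in the abstract, Z (within stock), Z (NASDAQ Ratio) and Z (Group NASDAQ Ratio), which are described in detail next, the sketch below computes them with pandas and combines them with a simple average; the column names and the averaging rule are assumptions for illustration.

```python
# Compute three Z indices for one stock and combine them into a single index.
# "price" is the stock's daily closing price, "nasdaq" the NASDAQ composite
# on the same day, and group_mean the average price of the stock group.

import pandas as pd

def zscore(series: pd.Series) -> pd.Series:
    """Standardize a series to zero mean and unit standard deviation."""
    return (series - series.mean()) / series.std()

def stock_indices(df: pd.DataFrame, group_mean: pd.Series) -> pd.DataFrame:
    """Return the three Z indices and a simple combined stock index."""
    out = pd.DataFrame(index=df.index)
    out["z_within_stock"] = zscore(df["price"])                 # vs. its own history
    out["z_nasdaq_ratio"] = zscore(df["price"] / df["nasdaq"])  # vs. the NASDAQ
    out["z_group_ratio"] = zscore(df["price"] / group_mean)     # vs. the group mean
    out["stock_index"] = out[["z_within_stock", "z_nasdaq_ratio",
                              "z_group_ratio"]].mean(axis=1)    # illustrative combination
    return out
```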
In the blue box, the inputs represent NASDAQ stocks, which are popular and which many people invest in; the high-tech stocks have grown a large amount over five years. The range of the Z standardization runs from -3 standard deviations up to +3 standard deviations. After Z standardization, we get Z within stock, the Z NASDAQ ratio, and the Z group NASDAQ ratio. Z within stock compares the stock price with the previous stock prices over the past five years; the Z NASDAQ ratio compares the stock price with the NASDAQ stock price; and the Z group NASDAQ ratio compares the stock price with the group NASDAQ mean. We use the Z standardization to help us assess the risk. In the end we combine all three Z scores into a new stock index, which can help us lower the risk of transactions. Here is the data table after we standardized the raw data; we can follow the stock price index change. US stocks have been trending downward since a peak around mid-February; some stocks are more robust and certain ones are impacted by COVID-19. We established this modeling algorithm on March 7-8 and the database on March 14-15. The red index values shown in this figure represent good times to sell those stocks, when we can earn more money than at the non-red index values. There are also some index values marked in blue; those are times when we can consider buying the 23 COVID-impacted stocks, because we can lower the cost and gain more in the future. The reason we use an outlier algorithm is that the outliers help us determine the timing of trading stocks. One way to determine the outliers is to use quantile range outliers. First we find the interquartile range, which is Quartile 3 minus Quartile 1. The rule is that outliers are values below Q1 - x*IQR or above Q3 + x*IQR, with x equal to 1.5 for regular or 3 for extreme outliers. Why do we choose the extreme outliers? Because the regular outliers cannot show the longer timing that we want. Extreme values are found using a multiplier of the interquartile range, the distance between the two specified quantiles, so extreme outliers give a wider detection level that we can use in the investment, which helps us reduce our risk. Technically, though, the quantile range outlier algorithm is meant for the normal distribution situation; the stock market is not really normally distributed, so the outliers will be influenced by the skew factor. Thus we need a more powerful tool that is not influenced by the skew factor, because for stock performance we care more about the tails than about the center of the distribution. The next tool we use is robust fit outliers, which we use to ignore the skew factor. Outliers and distribution skewness are very much related: if you have many so-called outliers in one tail of a distribution, then you will have skewness in that tail. In quantile range outlier detection the assumption is a normal distribution, so skewness in the distribution will introduce an inaccuracy into the outlier detection methodology. If the distribution is significantly skewed, as it probably is in stock market data, robust fit outliers are a better method to find the outliers accurately, because they tend to ignore the skew factor. Robust fit outliers estimate a robust center and spread; outliers are defined as those values that are more than K times the robust spread from the robust center.
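The two outlier rules just described can be sketched in a few lines of Python; this is an illustration of the formulas from the talk, not JMP's implementation. The quantile range rule is exactly the Q1 - x*IQR / Q3 + x*IQR rule above; for the robust rule, JMP estimates the center and spread with robust fitting, and the median and scaled MAD are substituted here as a simple stand-in.

```python
# Quantile-range and robust outlier rules on a 1-D array of index values.

import numpy as np

def quantile_range_outliers(values, x=1.5):
    """IQR rule: True where a value lies outside [Q1 - x*IQR, Q3 + x*IQR]."""
    v = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(v, [25, 75])
    iqr = q3 - q1
    return (v < q1 - x * iqr) | (v > q3 + x * iqr)

def robust_outliers(values, k=2.7):
    """Robust rule: True where |value - center| exceeds k robust spreads."""
    v = np.asarray(values, dtype=float)
    center = np.median(v)
    spread = 1.4826 * np.median(np.abs(v - center))   # MAD scaled to sigma
    return np.abs(v - center) > k * spread

index = [0.2, 0.1, -0.3, 0.0, 0.4, 2.9, 0.1]
print(quantile_range_outliers(index, x=3.0))   # extreme IQR rule
print(robust_outliers(index, k=2.7))           # regular robust rule
```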
The robust fit outliers option provides several choices for the robust estimate and the multiplier K, as well as tools to manage the outliers found. We use K = 2.7 for regular and 4.7 for extreme outliers. After applying the regular robust fit outliers, we can find the outliers in the selling index data. Look at the right graph: there are many shaded red cells in the F5 and F8 columns, indicating that we can consider selling those stocks to maximize our profit, because the stock price is above average. Each column shows one stock's index changing day by day going down the column. The reason we use the extreme outliers for the buying index is that the buying index is dropping, which makes it really difficult to detect its outliers; the selling index, by contrast, is rising, which makes it easier to determine the outliers. On this page there are some colored blocks in the data table, like B6, B13, B15 and B19, which indicate that we can consider buying those stocks. Lots of people make money by investing in stocks, and most people can choose the right stock to invest in for a reasonable ROI, but investors are challenged to find the right amount of money to invest, and other human psychological factors will bias certain investments. We can determine the amount of stock we buy, sell or exchange based on this model, which can minimize personal investment bias and reduce the overall financial risk. The model provides two ways to judge the amount of investment. The first is the color block analysis: the blocks in dark green are good to sell, and the blocks in orange or red are good to buy or to exchange. At the bottom are the transaction levels we define, where L10 is the smallest investment amount and L1 is the greatest. If we sell stocks, we do not sell too much, using the L5 amount; for an exchange we also choose L5; and for buying we can consider buying more, so we use the L2 amount. Based on this model we can manage the investment and reduce the financial risk. Phase III of the decision chart is the exchanging part. We again use Z standardization to convert the data points into an index, but this time it is the exchange index. We set up an exchange threshold: the Z exchange index should be greater than 15. This is an average index we calculate, which can tell investors the timing. On the left side there is a line chart showing the change of each exchange pair. Based on this line chart, the trend of S5-B1 is about 15.8 and S5-B14 is about 15.16, which means we can consider doing the exchange between S5-B1 or S5-B14. After Z standardization, we get the selling index and the exchange index: the selling index compares the stock price with the stock prices over the past five years, the Z NASDAQ ratio compares the stock index with the average stock price, and the exchange index compares the stock selling index with the stock buying index. We use Z standardization to help us manage risk. We consider about 184 choices, and we need to make sure our investments have the right timing and that we pick the right pair for the exchange. We also use the quantile range outlier algorithm to help us determine the timing; a small value of Q provides a more radical set of outliers than a large value. Look at the table on the right side.
We use the quantile range outlier method and get the top three outliers, whose exchange index values are greater than 19. This is the second time we consider the exchange index. We found the top index values, which are the signal for the best timing to do the exchange: the S5-B14 pair at 19.27, the S5-B13 pair at 19.12, and S5-B12 at 19.07. On the left side we have the timing prediction model, presented in a color-coded, color-box style: blocks in dark green are good for selling, and dark orange or red is good for buying or exchanging. The best times are shown in bold in the graph. April 6 is the best day to do the exchange among the exchange data computed since February 2015. We consider the exchange pair twice, which doubly insures that we can make more money in the stock market and greatly reduces the investment risk. On April 7 the exchange index changed little compared with April 6; on that day the S5-B1 pair had a 19.18 exchange index. The right graph shows the exchanged stock information. On April 8 we sold KLAC stock at $154.32 while the market price was $148.85, so we gained about 3.5% on the sale. On the same day we bought Delta stock at $22.42 while the market price was $22.92, so we gained about 2.2% on the purchase. We sold and bought the same amount of stock for balance; the quantity of stock sold and bought was equal, at 65 units. After one day, the exchange pair had helped us gain 5.7%. All stock buyers focus on their stock trends. My partner Mason and I monitored the NASDAQ daily range outliers from late February 2020 to mid-March 2020. We separated the daily trading window into time slots of 30 minutes each, and we wanted to find the best time for trading. There were 24 peak and valley points detected, and the upper threshold was 2.7%. In the figure in the right corner, we can see the stock price at the open, close, high and low times; we also calculate the price range and rank when we do the stock price peak and valley detection. ??? considered the discrete sample size. Among the 24 peak/valley points we detected, 17 out of 24, about 70%, happened in the first or last hour. We set up a one-proportion test of whether it pays to wake up early and to trade around the open and close rather than in the middle of the session; the null hypothesis assumes a uniform probability distribution across the time slots. In the table in the left corner we can see a p-value of 0.34, which is greater than 0.05, so with four slots among the 13 available slots we cannot reject the null hypothesis. The table and figure in the right corner show the distribution of the peak times and the valley times. In our research we provide a new model to pick the right stocks and the ??? the amount of buying and selling, as well as the exchange index. Timing is a really important factor in investment. This model of stock investment was accurate most of the time during the COVID-19 pandemic; our research group invested in the stock market and gained 2.5% after we finished the project. We may use it to make predictions in the future if the pandemic does not end. Based on our research, early-bird or last-minute stock trading is favored and can earn more money. Thank you.
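A minimal sketch, using SciPy, of the kind of one-proportion test described above: 17 of the 24 detected peak/valley points fell in the first or last trading hour, and the uniform null assumes that 4 of the 13 half-hour slots (p0 = 4/13) would capture them by chance. This is an illustration of the test, not a reproduction of the authors' JMP output or p-value.

```python
# One-proportion (exact binomial) test for the share of peak/valley points
# that fall in the first or last trading hour.

from scipy.stats import binomtest

observed_hits = 17          # peaks/valleys in the first or last hour
n_points = 24               # total peak/valley points detected
p_null = 4 / 13             # 4 of 13 half-hour slots under a uniform null

result = binomtest(observed_hits, n_points, p_null, alternative="two-sided")
print(f"observed proportion: {observed_hits / n_points:.2f}")
print(f"p-value: {result.pvalue:.4f}")
```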
Simon Stelzig, Head of Data-Driven Development, Lohmann   In the course of adhesive formulation and tape development the developer is supported by the data scientist in handling the analysis of the gathered data resulting from design of experiments, the analysis of historical data or preliminary trial and error runs. Hence, a data specialist talks with a non-specialist about the results, facing the need to present complex data analysis in a non-complex way. Especially in the case of the analysis of spectral data, the data is often reduced to a few characteristic numbers for analysis. However, the developer still thinks in spectra rather than numbers, making the visualization and interpretation of the results difficult. The Functional Data Explorer is a great tool to perform such analysis and visualize the results in a spectral-like form. As an example, the dependence of the rheological behavior of a structural adhesive tape, namely the reaction start, on the formulation ingredients is analyzed using the Functional Data Explorer. Another example is the analysis of the dependency of the molecular weight distribution on the process parameters during the synthesis of a pressure-sensitive adhesive. In both cases, the resulting predicted spectra greatly help the developer to gain insight and develop a mechanistic understanding.     Auto-generated transcript...   Simon Stelzig Okay, hello everybody. My name is Simon Stelzig. I'm from Lohmann, the Bonding Engineers, from the research and development department, and I would like to tell you today about the Functional Data Explorer and how you can use it to visualize results for non-data scientists. Before I start, a very short introduction of the company Lohmann. You might not have heard of it, but you might come across its products on a daily basis. Some facts and figures: Lohmann had a turnover of about 600 million euros in 2019, and the part I work in is the Lohmann Tape Group; we are a producer of adhesive tapes, mainly double-sided adhesive tapes, and we are a pure B2B company, so you can purchase our products only for industrial applications. We are quite old: this year marks about 170 years since the company Lohmann was founded. We are basically divided into two parts: the tape products, where I work, focused on adhesive tapes for industrial applications, and our hygiene brand. We are globally active, with 25 sites around the world and roughly 1,800 employees. Maybe most interestingly, 90% of our products are customized, so we have a quite complex product portfolio with a lot of products focused on a very small number of applications for a very small group of customers. We are a top producer of adhesive tapes, mainly double-sided adhesive tapes, with a lot of applications, and these are the market segments we are in. One is industrial applications, mainly for windows and doors or indoor and outdoor applications in the building and construction area. You might also come across Lohmann products in home appliances and electronic devices, for example adhesive tapes that fix the back cover on smartphones, or in the medical segment, mainly for diagnostic applications and wound care.
In the field of transportation, for automotive, we supply tapes for the interior or exterior adhesion of emblems on cars. We are also in the graphical industry, for flexographic printing applications, where we deliver adhesive tapes to fix the printing plate on the sleeves. One subdivision of the Lohmann Tape Group is the hygiene brand, where we deliver closure systems for diapers. So there is a very wide range of applications, and, as I mentioned before, about 90% of our products are customized. Another fact about Lohmann is that our value chain is very deep. If you think about an adhesive tape, you need an adhesive layer which sticks onto the object you want to adhere to; you need a carrier, optionally, which carries your adhesive layer; and, since we are a producer of double-sided adhesive tapes, you have an adhesive layer again on the other side. Lohmann's value chain starts at the very beginning: we can make our own polymer, the base polymer for an adhesive. Then we formulate that polymer into an adhesive formulation, which we coat onto a carrier in the next step, giving an adhesive tape in the form of quite large ??? rolls. And at the very end we can deliver the product in the form the customer wants it: if you need die cuts, maybe emblems for cars, where the adhesive tape needs a specific shape, we deliver that specific shape as the customer requires it. R&D focuses mainly on the first three parts: the polymerization, the formulation and the coating. Since, as I mentioned, a lot of our products are customized, in order to get as quickly as possible to the goal of fulfilling our customer requirements, we apply design of experiments throughout the course of getting to the final product. Doing a lot of designed experiments, and a lot of experiments overall, also gives a lot of data. And being a chemical company, a lot of the data in the experimental results comes as spectra. Most of our experiments do not give you a single value or number; they give you a spectrum. In most cases, I am, let's say, the data scientist delivering the service of planning those experiments, analyzing them and doing the whole data analytics for my colleagues, who in that case are the non-data scientists. They are the developers of the final product, and they use me as a service to do all the data analytics. In most cases you extract from the spectral data key parameters that describe it. In the case of polymerization, that might be the molecular weight distribution, that is, the size of the polymer we get out of the reaction; or it might be the start of a reaction, detected by some means of measurement or something else. So for the data analysis, you extract those key parameters and do the analysis, most of the time a classical DOE analysis, on those key parameters.
As a data scientist, I do this analysis on those key parameters as a service to my colleagues, the developers. They also give me the customer requirements that we want to meet, that is, the optimum that we want the formulation or product to reach. Having done the analysis and the optimization, at some point I go to my colleagues and present the results, and since I did all the analysis on the key parameters, I also present the results in terms of those key parameters. The problem is that my development colleagues and project teams are experts in their own area of expertise, but not experts in data science. So if I do the analysis on the key parameters and also talk in the language of key parameters, they, as the experts in their field, are still thinking in spectra, because that is what they get when the experiments are done: they see spectra where I see numbers. This always leaves the problem that if I present the results and the analysis in key parameters, they still think in spectra. So we are always talking different languages: I talk in the language of numbers, they talk in the language of spectra, which means either they have to translate those numbers into their area of expertise, the spectra, or I have to translate the key parameters into their language. Luckily, the Functional Data Explorer is a kind of universal translator that resolves this problem. What I want to show you today, with three examples, is how the Functional Data Explorer greatly helps in overcoming this language barrier between the data scientist talking numbers and the developer thinking and talking in spectra. The first example is a very simple, almost trivial one, on a measurement that describes the printing quality in a flexographic printing experiment. The second one concerns a measurement that defines the start of a chemical reaction. The third is a measurement that gives you the molecular weight of a polymer during a polymerization reaction. I will stop using PowerPoint now and go over to JMP. The first example, as I told you, is the analysis of the printing quality. You do the measurement, and the ideal, perfect spectrum would be a straight line going through the origin with a slope of 1; that would be the ideal if you reached perfection in this measurement of printing quality. Normally, in a real-life environment, we get something that is not really a straight line but is shifted towards higher values; it is bent upwards. What we want to do with that information is to screen which parameters bend the curve upwards and make it non-ideal. The key parameter we define is basically the difference between the straight line and the actual curve you get; it is simply the sum of the differences at each point.
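A minimal sketch, not Lohmann's script, of the key parameter just described for the printing-quality curves: the ideal curve is the line y = x through the origin, and the key parameter is the summed deviation of the measured curve from that line. The arrays x and y below are illustrative measurement data.

```python
# Key parameter: total deviation of a measured curve from the ideal line y = x.

import numpy as np

def deviation_from_ideal(x, y) -> float:
    """Sum of absolute deviations of the measured curve from the line y = x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.sum(np.abs(y - x)))

x = np.linspace(0.0, 1.0, 11)
y_ideal = x                       # a perfectly straight response
y_bent = x + 0.15 * x**2          # a response bent upwards

print(deviation_from_ideal(x, y_ideal))   # 0.0
print(deviation_from_ideal(x, y_bent))    # > 0, larger the more the curve bends
```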
So if you then do the analysis on the key parameter, it is a very simple analysis. I use a decision tree to screen which parameter most influences this bending away from the ideal spectrum. We have about 22 influencing factors, and there you see one that influences this bending away from the ideal situation. It is called X22; what exactly it is does not really matter. If it is larger than a certain value, you get, let's say, a number of 250; if it moves below that value, you get 196. The problem is, if you show this to a developer, he will ask: is 196 good, is 250 still good, is it very bad, or is it maybe suitable? With the pure number you do not really see that, because he is working in a different area with a different language. Using the Functional Data Explorer can greatly help to answer that question, because if you move over to the Functional Data Explorer and do the same analysis on the data, you get something that looks like this. It shows the developer the graph he is used to, the graph he normally gets from the measurements and experiments he does. Now the same parameter influences the shape of the curve: at the setting that gave 250 in the key parameter analysis, you see the difference, it is not really a straight line. But if it moves down to the lower number, the 196 shown before, it does not look too bad; it is not a perfect straight line, but it is not far from being one. Maybe at that point, if you show this kind of analysis to the non-data scientist, the developer, that is already good enough for him; he can say, okay, it is not a perfect straight line, but it is good for the job. Basically, it fits the customer's need, and that is something he does not see from the pure number you show him. So it is a very trivial, very simple example, but I hope it gives you the essence that using the Functional Data Explorer enables you to talk in the language of the experts, the non-data scientists, in this case the developer of the material. Talking in his area and his language helps him gain the insight; he does not have to translate the number back into a spectrum, he sees it directly and may get the answer he needs. Again, a very trivial, very simple example, but I hope it shows you the essence of what the Functional Data Explorer helps you to achieve. Moving on to the second example: that is the start of a chemical reaction which you want to detect. We do that with the rheological measurement, where you want to see a change in modulus when you heat up a certain chemical formulation.
Normally a typical spectrum looks like this: at a certain point the modulus changes and moves steeply upwards, and that is the start of the chemical reaction. So the key parameter which you extract from that spectrum is the point where the modulus starts to go up sharply. It may be a little more complicated to calculate that point from the original raw data, but nevertheless you can do it, detect the point where the reaction starts, and then do the analysis on the key parameter. What you get from this analysis on the key parameter is a model which allows you to predict the start of the chemical reaction based on, in this example, five parameters. The key parameter, the start of the chemical reaction, depends on those parameters, and you get a range from roughly 70 degrees Celsius, where the chemical reaction starts earliest, up to about 100 degrees Celsius. So again you reduce the total spectrum to one number, giving the developer a clue where the chemical reaction starts. Since you need the raw data anyway to extract the key parameters, you already have the raw data available, so it is not really much more effort to do the analysis within the Functional Data Explorer rather than a standard regression method on the key parameters. Using the same data, the Functional Data Explorer takes just a couple of seconds to calculate the outcome. There it is. Now you can model the whole spectrum. In that particular case I did not show the developers the analysis on the key parameters first; I showed them the results from the Functional Data Explorer. When I started to move the influencing factors around and show them this, they immediately noticed that the point of the reaction changes. I did not have to tell them that the point started to move, because, being the experts on the spectral data, they immediately recognized that the start of the chemical reaction shifts, as you can see here. The other thing you see, because you now model the full spectrum and not only the key parameter, is that the way the spectrum looks changes completely as you move the factors. For example, they saw that the peak started to move and the height of the peak changes; that was also recognized immediately after they saw this analysis. The difference with key parameters is that, having extracted one key parameter, the start of the chemical reaction, you could use other key parameters as well, maybe the position of the maximum peak or the height of the maximum peak, reduce the spectral data into these different key parameters, analyze them one after another, show the results, and calculate an optimum.
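A hedged sketch, not Lohmann's procedure, of extracting the key parameter just described: the reaction start, taken here as the first temperature at which the modulus curve begins to rise sharply. "Sharply" is implemented as the point where the numerical derivative first exceeds a chosen threshold, which is an assumption made only for illustration.

```python
# Detect the reaction start as the first temperature where the slope of the
# modulus curve exceeds a threshold. Curve and threshold are illustrative.

import numpy as np

def reaction_start(temperature, modulus, slope_threshold):
    """First temperature where d(modulus)/d(temperature) exceeds the threshold."""
    t = np.asarray(temperature, dtype=float)
    g = np.asarray(modulus, dtype=float)
    slope = np.gradient(g, t)
    above = np.nonzero(slope > slope_threshold)[0]
    return t[above[0]] if above.size else None

# Illustrative curve: flat modulus that starts climbing around 85 degrees C
t = np.linspace(40.0, 120.0, 81)
g = 1.0 + np.where(t > 85.0, 0.5 * (t - 85.0) ** 2, 0.0)

print(reaction_start(t, g, slope_threshold=5.0))
```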
But the good thing is, if you use the Functional Data Explorer, you get all of this information in one analysis. And, as I mentioned before, you are no longer talking in numbers; you talk in their language of spectral data, and you do not have to translate the key parameters into their world of spectra anymore. So you get much more information out of the Functional Data Explorer, and maybe they gain a deeper insight into the underlying processes. Now I come to the very last example: the polymerization process I was talking about, where the aim is to determine the molecular weight distribution which results from the polymerization reaction. Again, it is typically tricky to reduce a whole distribution to a few key parameters. How does a typical spectrum look? A typical spectrum shows the molar mass of the polymer, that is, the number of repeating units you have in a polymer chain; that is what you get from the chemical reaction and from the measurement. The well-known key parameters are the number-average molar mass, the mass-average molar mass, the Z-average molar mass and the polydispersity, which is the mass-average molar mass divided by the number-average molar mass. That parameter, the PDI or polydispersity, gives you an indication of the broadness, that is, the width, of your molecular weight distribution. Having these four key parameters, calculated from the raw spectral data which we already have available, you can do the analysis again using standard techniques, nothing fancy. You then get a model with, in this case, four influencing parameters and three key parameters; I did not include the polydispersity. What you see is that the number-average molar mass does not really change a lot if you change these four factors, so their influence on the number-average molar mass is not too big. It is quite big and quite significant on the mass-average molar mass, as you can see here: you can move it around quite a bit, and the difference between the lowest and the highest setting is about 150,000 grams per mole in total. This number changes quite significantly, which also means that the polydispersity should change quite a lot. The problem is that you do not see from these numbers what the distribution looks like; they do not give you the shape, and in our case the shape is very important because it affects the processability of the polymer quite dramatically. Having only three numbers, plus maybe the polydispersity, does not tell you whether the distribution is monomodal, bimodal, trimodal or whatever. The polydispersity may give you an indication of the broadness of the distribution, but it does not tell you whether you have one peak, two peaks or three peaks.
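For reference, the key parameters just named can be computed directly from a distribution; the sketch below uses the standard textbook definitions of the number-average (Mn), mass-average (Mw) and Z-average (Mz) molar masses and the polydispersity PDI = Mw / Mn. The input arrays (molar masses and their relative abundances) are illustrative, not Lohmann data.

```python
# Molar mass averages and polydispersity from a molecular weight distribution.

import numpy as np

def molar_mass_averages(m, n):
    """Return Mn, Mw, Mz and PDI for molar masses m with number fractions n."""
    m = np.asarray(m, dtype=float)
    n = np.asarray(n, dtype=float)
    mn = np.sum(n * m) / np.sum(n)                 # number average
    mw = np.sum(n * m**2) / np.sum(n * m)          # mass (weight) average
    mz = np.sum(n * m**3) / np.sum(n * m**2)       # Z average
    return mn, mw, mz, mw / mn

m = np.array([2e4, 5e4, 1e5, 2e5])     # molar masses in g/mol (illustrative)
n = np.array([0.1, 0.4, 0.4, 0.1])     # relative number of chains at each mass

mn, mw, mz, pdi = molar_mass_averages(m, n)
print(f"Mn = {mn:.0f}, Mw = {mw:.0f}, Mz = {mz:.0f}, PDI = {pdi:.2f}")
```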
But that makes quite a large difference in the final material characteristics of the polymer we get, and it changes the performance of the adhesive tape in development. Again, since you already have the raw data available, you can use the Functional Data Explorer to get this insight and extract much more information from all of these data than from the key parameters alone. Once again it takes a couple of seconds, and then you are done. And once again there is nothing fancy about the analysis; operating the Functional Data Explorer is nothing out of the ordinary. You put in your output spectra as Y, you put in your sample ID, then you put in your factors from your DOE, press the start button, and that is basically it; you can then fit it with the model parameters. It is a very simple, very straightforward procedure, and again you get much more insight into the underlying behavior than by reducing all those data to some key parameters. In this case, I showed you that one key parameter does not really change a lot while the other one does, but that does not give you the form, so it gives you no indication or feeling for the distribution. Using this tool gives you that insight. For example, in this case that peak is more like a shoulder, not really a peak, but if you change the influencing factors you come to a point where it really becomes a second peak. That tells you something about the underlying polymerization mechanism which you do not get from the key parameters, from just looking at those three numbers. Here it might give you an indication, or it might trigger something in the developer, because this is his area of expertise: something that helps him understand the underlying mechanism and then plan the next experiments even better than before, compared to just having seen the numbers. So I am at the end of my examples, and I hope I could show you that the Functional Data Explorer really enables the visualization of results to developers in their area of expertise. Keep in mind that, in our case, most developers are non-data scientists; they are not so interested in how you get to the results, they are more interested in the results themselves. The Functional Data Explorer allows you to talk in their language, in their area of expertise, which in our case, being a chemical company, means spectral data, and you do not have to throw away valuable data in order to reduce spectral data into some key parameters and do the analysis on key parameters. Instead of having to explain how the key parameters relate to the form and shape of the spectra, you can just use the data and show them directly the influence of the factors you are studying on the form and shape of the spectra and how the spectra change.
In some cases you might get the same insight, but in a lot of cases the analysis using the FDE gives you much more, a deeper insight into the underlying mechanism. And you do not have to throw away a lot of valuable data in the process of reducing the spectral data to some key parameters.   With that, I would like to thank my colleagues from the R&D department for doing all the experimental work, so that I can stand here in front of you and present these nice results. That concludes my presentation. Thank you very much for your attention. If you have any questions, please feel free to ask. Thank you very much.
Marcello Fidaleo, Professor, Università della Tuscia   The availability of functional data in batch unit operations is becoming more and more common as PAT tools and high-throughput analytical methods are developed. The Quality by Design approach to process development requires the development of a design space, that is the ‘multidimensional combination of raw material attributes and critical process parameters that assure product quality.’ For batch processes, such a design space should be dynamical in nature. In this work, functional data analysis applied to functional designed experiments was used to build the dynamical design space of the refining process of a cocoa and hazelnut paste used in ice cream manufacturing.     Auto-generated transcript...   Speaker Transcript Marcello FIDALEO Hi, my name is Marcello Fidaleo, and my presentation is about the functional data analysis and design of experiments applied to food milling processes. The aim of this research was to study the milling step of hazelnut and cocoa bean base used the in the manufacturing of ice creams. These process is called refining and is typically carried out in stirred ball mills in the batch mode, like this one report to the here on the right. The aim of this process is to reduce the particle size of the solid ingredients, such as cocoa and sugar, so that the final product is not gritty. To this aim we designed our central composite design like one report here on the right. We did three factors, N, D and S, where N the shaft the rotation speed, D is the ball diameter, and S is the overall mass of balls. As the responses, we consider that the size of largest solid particles, so that is the fineness, and the milling energy. Both responses were measured as a function of time, so this was actually a functional designed experiment, because the responses were not ???, but were functions, in this particular case, functions of time. We used JMP Pro to design the experiment and to analyze the results. For the analysis of the results, we followed two approaches. The first approach is the classical analysis of designed experiments. So we use the response surface models to regressive the responses as a function of the process parameters at two different time instance. In the second approach, which is a functional data analysis in design of experiments, we were able to include also the effect of time in the final model, that is, we considered the nature of the functional responses so...that's the functional nature of the responses. So in this presentation, I will talk about the second approach. Functional data analysis in design of experiments requires some intermediate steps to build the final model and also to obtain the design space of the system. Basically, we start with the functional data analysis to smooth the functional data. Then we apply functional principal component analysis to the smoothing functions. And so, at this point, by retaining a just a few components of this system we developed a model for the functional responses. So let's see the results... the results of this case studies. These are the results of the smoothing procedure. We applied the beta splines and we considered the ??? fineness and the energy. We can see that by using data transformation and also data filtering simple fitting functions, we're able to fit well the experimental data. In fact, we used one knot and a cubic spline and the one knot and a linear spline for fineness and for energy. 
Then we applied functional principal component analysis to the smoothing functions. Here on the left we can see that, for fineness, the first two components explained 96% of the variance, while for energy just the first component explained 98% of the variance. So in the final models we used two components for fineness and one component for energy. Here on the right I report the scores calculated for the 16 trials of the experiment. We found from our experience that the scores are useful for understanding the behavior of the system. For example, in this case the first component acted as a grinding-intensity axis, with the high grinding intensity runs on the left and the low grinding intensity runs on the right. At this point we regressed the scores as a function of the process parameters using a response surface model, including the linear, quadratic and interaction effects. These models were also very useful for understanding the effect of the process parameters on the functional responses. Finally, we were able to build the final models, as I said. Here I report fineness and energy as a function of time for a few runs, and we can see that the agreement between the experimental data and the predicted values is really good. As I said before, we used two components for fineness and one component for energy. So, finally, we were able to build the design space and to understand the effect of the process parameters, that is N, D and S, on the functional responses. On the profiler we can see that, under these conditions of N, D and S, we can predict the profile of fineness as a function of time and the profile of energy as a function of time. At the bottom I report two contour plots of the system. The one on the left was obtained with a rotation speed of 55 rpm and a ball diameter of 6.5 millimeters, while the one on the right was obtained with a mass of spheres of 29 kilograms and an operating time of 80 minutes. For example, on the left we can see that, for a given mass of spheres, when we increase the operating time the energy of course increases, while the fineness decreases. The white area in both plots is the design space: it is the area in which fineness is between 20 and 30 microns, so it is the area where the final product is not gritty and not over-milled. From these results we can conclude that functional data analysis applied to functional designed experiments appears to be a straightforward, robust, and easy-to-use approach to build the dynamical design space of a batch process.
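As a compact summary of the workflow just described (the notation is mine, not the speaker's): each smoothed response curve is decomposed into a mean function plus a few principal-component functions, and the scores are then regressed on the process parameters,

$$y_i(t) \approx \mu(t) + \sum_{k=1}^{K} s_{ik}\,\phi_k(t), \qquad s_{ik} = f_k(N_i, D_i, S_i) + \varepsilon_{ik},$$

with K = 2 for fineness and K = 1 for energy in this study, and each f_k a response surface model with linear, quadratic and interaction terms.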
Chi-Feng Ho, Student, STEAMs Training center Mason Chen, Student, Stanford OHS   Age is a big factor for National Basketball Association (NBA) professional players, because an injury could mean a premature end to a player's career. 36 effective candidates have been selected for study. "Effective" does not mean how strong they are; these players all have had long careers. Our goal is to find methods and information that can help coaches, doctors and even superstars understand the effects of aging. Another goal is to discover why these 36 players could play longer than other players. We use the trends of former players to predict the future performance of active NBA players. We selected three categories from the NBA scoreboard to understand how age affects players' total game time and total points per season. The total minutes (Mins) are used to determine a player's health, because only healthy (uninjured) players can have a longer career. The total points (Tpt) and points per minute (PPM) show how a player's performance decreases with age. We then used JMP tools and statistical methods to build models of the relationship between players' performance and age. Finally, we predicted the future performance of current players and the time at which a player should retire.     Auto-generated transcript...   Speaker Transcript Chi-feng Ho NBA aging influence and benchmarking analysis. Mason Chen is the co-author of the project. The main reason we decided to do this project is that our NBA idol Kobe Bryant passed away in a helicopter accident last year. Many people started to pay attention again to players' retirement age; lots of media and fans felt that Kobe could have played more rather than leaving the NBA league. We wanted to know whether such historical players could have kept going for about two to three more years. A team might consider a player's maximum contract based on the player's value, his current and future performance, given his age and previous injury history. There is no difference between a superstar and a bench player: one day they are absolutely going to confront performance decline due to aging factors. If a team's superstar has already begun this kind of decline, should the team still be centered on him? Of course not. The NBA is a business alliance: managers only want to build a strong team to win an NBA championship, because losing brings them no benefit. In our project we want to help coaches, trainers and doctors find out how important the aging factor is for NBA players. From our perspective, we first consider whether the timing was good enough for Kobe to leave the league. Secondly, we consider players who retired too late and some who retired too early, based on our data. Lastly, we work out how important age is for an NBA basketball player and give some clues. For our objectives, we first build statistical models to show how age is a big factor for NBA players, and then we predict active-duty players' performance using former players' career trajectories. In the data selection part, we only consider top players who have completed at least 1,200 games and 15 seasons, and every season counted is after the 1979-1980 season. We exclude the shortened-season records and then remove any season in which a player played fewer than 20 games. We selected three categories: total minutes, total points scored, and points per minute, and standardized each player's year-to-year statistics against his career average. Finally, 36 players qualified.
The Career index making and using the standardization logarithm. Although those players are qualified and counted in our data set, we cannot just use their scores, minutes on stage to to contrast. After doing the standardization, the group ratio of three categories came out and then we derived the peak career index, which is equal to Z TM, plus Z PP plus Z PM. The, Z TM is equal to the total minutes of one players, minus the group average. And then we divided by the standard deviation of total minutes. A method for three categories is the same. You may ask me why we do the Z standardization? Because Z statistics have two benefits. First of all, it will remove any standard deviation bias and it will make sure also to equal weight for players' career average. This is the pattern for 36 players position distribution by bar graph showing the 36 historical data set of power forward position players are qualified most, which contains amount of 11, and a small forward position players had the least, which only qualified three of them. On the right, the top, on top graph shows how do we calculate the combo curve and the bottom graph shows the top three players (Kobe Bryant to LeBron James to Michael Jordan) curve versus the combo. Why do we choose to use three players? Because those players ??? maintain well and they are the best players in history. By using the average of points per minute, total points to the minutes in each age categories, we combine them as the combo curve to use to compare to each of the data. It's easy to point out a peak average on the top graph will be 27 years old. And each categories point out the highest value. Kobe Bryant will be a good example to explain that. On the bottom graph LeBron James shows is an outlier, whose peak age seems to be 21 years old and during the 27 years old, his combo curve dropped to a new low. The peak season position dependency is to find out the golden age of difference position players. We use ANOVA test to find out the five position dependency was age, same or not. The dependent variable is to eight factors. We want to know whether there are any difference. Then we do one-way ANOVA. Is on the figure on the left there is difference between each position of the basketball players; shooting guard and small forward have an early age than other players as well. The different kind of ultimate strategy most focused on small forward and shooting guard. If the rookie has been chosen to train in those two positions they will have lots of shot and time. Although the run-and-gun strategy is famous this year, so small forward and shooting guard was still enter their golden age earlier as well. On the bottom the ANOVA table points out that P value is less than 0.5, which means the probability can reject the null hypothesis to further conclude at least one of the positions has different peak age. On the right graph, the MVP age is about two years after players peak age. At this time, players experienced their golden age range. We usually use MVP to contrast a player's performance. As they become older, their body functions will decrease and the injury might influence their mark in MVP selections. We usee a paired T test along the curve on the age study. We would like to see the connection between players' peak age and their MVP age. We want to check whether the difference between two measure values of the statistic is not zero. Our results shows one players who enter his peak age and after two years on average to receive his MVP title. 
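Written out, the standardization described at the start of this segment takes the usual z-score form (notation mine), with Z_PP (total points) and Z_PM (points per minute) computed analogously:

$$Z_{TM} = \frac{TM_{\text{player}} - \overline{TM}_{\text{group}}}{s_{TM}}, \qquad \text{peak career index} = Z_{TM} + Z_{PP} + Z_{PM}.$$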
The paired test found a significant difference between the peak age and the average MVP age. The prediction model, however, cannot reject the hypothesis, suggesting that the prediction is accurate. We set a judgment rule: if a player's performance falls below 60 percent of his career average, we might consider whether that person needs to decide to retire; if someone in his last year still performs at around 80 percent, we might think the player could still play two to three more years. Kobe Bryant left the league when he was 37, and we saw that his performance in his last year was maintained at about 80% of his ability. The left graph shows a set of players who should have retired earlier but did not; in particular, Juwan Howard's performance after age 34 dropped to about 20 to 40% of his career ability. So we could point out that Kobe could have played more and Juwan should have retired earlier. Our purpose in clustering is to use the clustering results to help coaches and teams see what their players' future performance will look like. We used JMP to partition the 36 players into seven clusters based on six different categories, so that the players collected in each cluster are similar to one another. Let's look at the graph. This graph compares Gary Payton and Derek Fisher. The R-squared value of each against his own cluster is quite high, which means there is a strong correlation; based on the R-squared values, the graph clearly shows that they are quite similar. Gary Payton retired from the league in 2007 and Derek Fisher in 2014, a seven-year difference. We might therefore use Gary Payton's career trajectory to predict the future tendency for Derek Fisher. Age is one of the most important factors for NBA players; as players age, they face retirement. This combo-curve model uses multivariate statistics, clustering and correlation to demonstrate how players' performance changes with age and with their performance trajectory, and the modeling methods could also be applied to other professional sports. Thank you.
Patrick Giuliano, Lead Quality Engineer, Abbott Mason Chen, Student, Stanford University OHS Charles Chen, Principal Data Scientist/Continuous Improvement Expert, Applied Materials   As consumers recede from social eating to rely on satellite delivery, optimal food preparation is increasingly vital. Using dumplings – a simple, nutritious yet tasty dish, multidisciplinary DSD DOE, robust design, and HACCP control planning were utilized to prepare the most efficient recipe. Although a controlled experiment was initially designed using DOE, some runs had to be substituted with different meat/vegetable types after a shortage of ingredients. To study the impact of the outliers, the original DSD was modified to try to account for the substitution. Before assessing model fit, each DSD was checked for orthogonality (color map on correlations), design uniformity (scatter plot matrix with nonparametric density), and power (design diagnostics). This ensured that the response surface model (RSM) results would be attributed to scientific aspects of dumpling physics instead of problems in the data structure. Next, the optimal RSM model was selected using stepwise regression, and model robustness was probed using t-standardized residuals and global/local sensitivity. Finally, advanced modeling capabilities in JMP Pro were leveraged (multi-layered neural network, bootstrap forest, boosted tree), and, with the reapplication of HACCP framework, the optimal parameter fitting was then proposed for use in commercial manufacturing based on the data structure and physics.     Auto-generated transcript...   Speaker Transcript Mason Chen So hi I'm Mason, and today I'll be presenting the dumpling cooking project in which we studied the relationship between data structure, dumpling physics and RSM results.   The motivation behind this project is that COVID-19 has increased demand of remote cooking and artificial intelligence and robotics in the food industry will play an increasingly important role as an application of technology continues to spread.   But most foods are made without precise control of cooking parameters as we rely on the chef's expertise to create consistent dishes.   But quality control and efficiency will be much more important if the robot is cooking a meal, so we decided to use steamed dumplings as a simple, nutritious but tasty dish to prepare the optimal recipe, according to the dumpling rising time.   So I stated previously, we performed a dumpling experiment, but due to an ingredient shortage, we had to substitute a different meat type in the experimental process.   So when we evaluated the DOE design, it was not orthogonal as there were some major confounding ???.   And when we try to run the RSM result with this data, there was also a potential outlier, particularly Run #6, which we wanted to study further.   So our first objective of this project is to address the shortage from a corrective(?) standpoint   and see if we can save the DSD structure. We'll do this by first changing the meat type, which was originally a two level categorical variable consisting of pork or shrimp   to a continuous variable so that we can use varying percentages of meat type for those three runs which we use with different mixtures of shrimp and pork.   And then we'll assess the data structure for this new DSD design using a variety of evaluation tools.   
Well, since the DSD structure was not problematic based on those tools, we decided to run an RSM model to see if its results are in line with our scientific research regarding convection and conduction.   Had the DSD structure been problematic, we may have received some false interactions that cannot be explained by science   and interactions that we would expect from science may be hidden due to the absence of orthogonality.   And our second objective after finding a good DSD model to account for the ingredient shortage   is more preventative approach, where we want to study the impact of the potential outlier and a ??? DSD structured by revising the settings.   The impact of outlier can help us better understand the importance of measurement control and a problem...while studying the DSD structure   will let us know how confounded structure will affect our results before even running our experiment in the future.   So we'll do the second objective by modifying and comparing the factor settings and response from Run #6, which is a potential outlier, and Run #9, which is our center point, and studying whether the model results are due to confirming data structure or if it can be explained by physics.   So the original design table is on the right and the ones that we substituted are Run #3, 4, and 9, as highlighted in orange.   Run #9, again, this is the center point so substitution may affect the model orthogonality, such as the average prediction variance we're looking to layer on. The original run uses categorical meat type of pork and shrimp, so we kept the setting   shrimp, even for these three runs where we had to substitute some of the shrimp for pork.   The colored map on left hand side show some confounding groups, so blue indicates no confounding, zero correlation between those two terms,   and dark red indicates severe confounding and high correlation, which is why the diagonal line will always be 1, since each variable is...each term is 100% correlated with itself. So there are some resolution lines ??? that affect confounding.   and   the...for meat type categorical, especially about a 0.3 correlation, and we have some severe Resolution IV, which is interaction interaction compounding risk, as some of them, such as the red blocks are greater than 0.5 correlation. So we should not   run an RSM model from here, since our data itself cannot be trusted to to the severe confounding, so before we proceed to run a model, we need to improve the RSM...the data structure. But how can we address this problem without recollecting the data?   So, since the meat type categorical had confounding problems and we should account with substituting meat type, we decided to change the meat type from categorical to continuous.   We changed the variable essentially to shrimp percent so the previous meat type that was all shrimp was changed to 100% and the ones that were all pork changed to 0%.   For three substituted meats, we estimated the true percentage values based on what we could recall about the order history for total meat we bought at the supermarket that day, so 70, 40 and 65%.   We then ran the color map on correlations again and looked at the power analysis to assess whether or not the data structure is more orthogonal.   The power analysis is all greater than 0.9 for main effects, which means that there is a greater than 90% chance we can detect a significant effect of these variables.   
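Outside of JMP, the same kind of "color map on correlations" check can be sketched in a few lines of Python: build the model matrix (main effects plus two-factor interactions) and plot the absolute pairwise correlations, where dark off-diagonal cells flag confounding. This is a generic sketch with placeholder factor names and random placeholder settings, not the authors' actual DSD.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from itertools import combinations

# Hypothetical run table (one row per run); factor names and levels are placeholders.
runs = pd.DataFrame(np.random.default_rng(1).choice([-1, 0, 1], size=(13, 4)),
                    columns=["water_temp", "dumpling_weight", "batch_size", "shrimp_pct"])

# Model matrix: main effects plus all two-factor interactions (2FI).
model = runs.copy()
for a, b in combinations(runs.columns, 2):
    model[f"{a}*{b}"] = runs[a] * runs[b]

# Absolute correlations between model terms, shown as a heatmap.
corr = model.corr().abs()
plt.imshow(corr, cmap="RdBu_r", vmin=0, vmax=1)
plt.xticks(range(len(corr)), corr.columns, rotation=90)
plt.yticks(range(len(corr)), corr.columns)
plt.colorbar(label="|correlation|")
plt.title("Color map on correlations (sketch)")
plt.tight_layout()
plt.show()
```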
Additionally, the color map on indicates slightly reduced Resolution 1 confounding as it is shaded bluer, especially for meat type continuous.   And Resolution 4 confounding doesn't change much, so we will choose to stick to this model, since the apparent orthogonality is not bad, and later on when you run the RSM, we will return to the color map to check if any interactions may be due to a confounding problem.   So next we look at the prediction variance profile, which indicates that prediction variance at different levels of factors.   Now the actual prediction profile depends on both the response error and a quantity that depends on the design and factor setting.   But for this prediction variance profile, the Y axis is a relative variance of prediction, which is the actual variance divided by the response variance   so that the response variance cancels out and the prediction variance profile only depends on the DSD structure and not the response.   So the top graph is the average variance at the center point and the bottom is the maximized variance, so we need to look at both because average variance indicates more information about the entire design,   and the maximum variance tells you information about the worst case point. So the average variance is Run #9   at Run #9's settings, which we had to substitute some of the meat. And at Run #9, the variance is 0.06, which is not too bad and the variance is symmetric about the center point, which means that the substitution does not severely impact them model orthogonality.   The prediction variance will always be greatest at corners, since you don't want to make predictions of the corners, because less points are around it and the optimal run is usually around the center   with the least prediction variance. For the maximum variance, you can see that the curve is a bit asymmetric for dumpling weight and meat type.   This one is Run #18 but there isn't anything special about it, so shouldn't be too big of a deal. So since there hasn't been anything major standing out for this DSD structure, we will go ahead and proceed to run RSM model for the continuous meat type.   So we ran RSM model using stepwise progression and mixed selection, based on the p value. And we chose a stepwise progression instead of the ordinary least squares regression, because the least squares   need at least the same number of runs as terms and we don't have enough runs to estimate all the terms.   So three important mean effects are water temperature, dumpling weight and batch size. And we have interaction term of water temperature times dumpling weight.   The ANOVA is significant, which indicates that we can reject the null hypothesis and conclude we have a model. We don't have an overfitting problem   as the R squared and R scquared adjusted differ by less than 10%. For lack of fit window, the sum of the lack of fit error and the pure error is the total error sum of squares from the ANOVA table.   And pure error is independent of model, such as things like experimental error, so if the total residual error is large compared to the pure error.   It means we might have to fit a nonlinear models, since the linear model has high lack of fit. When all the error is only due to pure error, then there is zero lack of fit   and the R squared value is equal to the maximum R square. 
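The relative prediction variance described above is the standard quantity x0'(X'X)^(-1)x0, which depends only on the design and the model terms, not on the response. Below is a minimal numeric sketch with a generic toy design, not the actual dumpling DSD.

```python
import numpy as np

def relative_prediction_variance(X, x0):
    """Relative variance of the predicted mean at point x0 for model matrix X.
    This is Var(y_hat(x0)) / sigma^2 = x0' (X'X)^{-1} x0."""
    XtX_inv = np.linalg.inv(X.T @ X)
    return float(x0 @ XtX_inv @ x0)

# Toy example: intercept + two factors on a 2^2 design plus one center point.
X = np.array([[1, -1, -1],
              [1, -1,  1],
              [1,  1, -1],
              [1,  1,  1],
              [1,  0,  0]], dtype=float)

center = np.array([1.0, 0.0, 0.0])   # prediction at the center point
corner = np.array([1.0, 1.0, 1.0])   # prediction at a corner
print(relative_prediction_variance(X, center))  # smallest near the center
print(relative_prediction_variance(X, corner))  # larger at the corners
```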
The p-value for the lack of fit is greater than 0.05, so we don't really have evidence that we need to switch to a nonlinear model: the lack-of-fit error is not large relative to the pure error.   Now, the R-squared is not excellent, because of the possible outlier in Run #6 seen in the studentized residuals, which is touching the limit bound. Let's see why that run might be an outlier.   If you look at the prediction profiler at the settings of Run #6, we see that there is a large triangle, which indicates greater local sensitivity.   This means that a small change in the input causes a large change in rising time, that is, a greater instantaneous slope at the factor settings of Run #6.
Saloni Patel, Student, Stanford OHS Mason Chen, Student, Stanford OHS   This project investigates the validation of a prediction model and the actual result of the 2020 United States presidential election. The prediction model consists of the predicted election result, which is derived from the z-scores of the number of infected cases, deaths and unemployment increase rates for each of 15 “swing states” along with the 2012-2016 election result average. In order to identify the most important swing states, a Swing State Index was derived using the 2012, 2016, and 2020 election outcomes.  The predicted election result is then subtracted in response to the media’s report about how Donald Trump is expected to lose 3-5 percent of his votes from the 2016 election. The model is used to compare the level of accuracy between the predicted 2020 election result and its subtracted values against the 2020 actual election result. The paired t-test and regression test are used to test the significance between the 2020 actual result and the 2020 predicted result as well as the 2016 actual result and the 2012 actual result to see how the 2020 predicted result compares with the 2016 election result and the 2012 election result in predicting the 2020 actual result. A one proportion hypothesis test is also used to compare the accuracy of the 2020 predicted result with the 2020 actual result.  The next part of this project studies factors that influenced the voting behavior of the 15 key swing states in the 2020 United States presidential election by linking statistical clustering methods with notable political events. In addition to key decisions made in the Trump administration, factors unique to this presidential election such as the global COVID-19 pandemic and the Black Lives Matter movement were investigated. Hierarchical clustering was used to group the 15 swing states based on the Swing State Index, and the relationships between each cluster were attributed with events that may have factored into the cluster behavior. The most representative and significant swing states were identified to be Arizona, Georgia, Wisconsin, and Pennsylvania (based on the clustering history) as well as Michigan and Minnesota (based on the Swing State Index). After analyzing specific events that affected these six states’ voting behavior, the Black Lives Matter movement and concerns over health care were the most significant factors in President Trump’s defeat. Next, the state of Georgia was further studied to better understand the influence of COVID-19 and the economy on the state’s voting behavior. By adjusting the ratio of the COVID-19 values (infected cases and deaths) and economic value (unemployment rate), it was found that the economy was of greater importance than COVID-19 to Georgian voters. The study of similar events by connecting political science (e.g. government decision-making) and clustering methods can be applied to future elections to better predict the outcome of important swing states and, thus, the overall election results.  All calculations and analysis are done on the JMP 15 platform.     Auto-generated transcript...   Speaker Transcript Saloni Patel Okay, so hello, my name is Saloni and today I'll be presenting our project, the United States presidential election prediction model and swing states study behaviors study. There are two parts in this project, the first involves creating and evaluating a model meant to predict the 2020 US presidential election. 
The second part of the project will study swing state behavior in the 2020 US presidential election and identify key events that affected the voting patterns in the election using hierarchical clustering methods. All the analysis was done on the JMP 15 platform. To clarify our project does not focus on all 50 US states and instead we will only study the top 15 swing states. The swing states are states that can reasonably be won by either the Democratic or Republican presidential candidate, as opposed to safe states that consistently lean towards the one party. Additionally, the US voting system depends on the Electoral College system that gives a set number of votes to each state based on population numbers. There is a total of 538 electoral votes so a presidential candidate must get 270 electoral votes to win the presidential election. Since most of the states are known to vote for either a Democratic or Republican candidate without hopes of being swayed out of the normal voting pattern, the Electoral College system and the presidential election result depends on the bulk of the swing states that can potentially be won by any of the candidates. A win by even a small margin results in that candidate acquiring all the votes the state has to offer, so swing states are especially impactful in determining the next president. So, to begin we conducted this project in hopes of better understanding the historic 2020 US election that occured in the middle of a global pandemic and socially as well as economically unstable times. The first part of our project's objectives is to identify key swing states, create a prediction model based on the influence of COVID 19 and the economy in those identified swing states, and lastly validate the prediction model with the actual election results once those came out. So the first step in our prediction model is identifying top 15 swing states from the past three elections using this swing state index. We use this formula to determine whether the states are swing states or not. It is also important to note that this swing index does not take into account which side each state votes for, but rather on the election results itself. In other words a state could have voted for the same side all three years yet by very different margins and still be counted as a swing state. We will further study this index in the next part of the project, but right now, all we use this index is to...for is to identify the top 15 swing states. Once the swing states are identified, we derive the first value we will need to calculate the predicted 2020 election result. This is the 2016-2012 composite win margin. To calculate this value we took the 2016 result and the 2012 election result. In the formula, we gave the 2016 results twice the weight because it was more recent than the 2012 election and we gave another twice the wait for the 2016 election. Because President Trump was present in the 2016 election running as president, while Joe Biden was present in the 2012 election as a candidate for vice president. In total, the 2016 results will have four times the weight as in 2012 in the 2016-2012 composite win margin. Next we identify factors that are unique to the 2020 and factors that voters may vote according to. We found that the global COVID-19 pandemic and the following hit the economy took were important factors unique to 2020 so we collected the infected cases and death cases due to COVID-19, as well as the unemployment increase in each state. 
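The composite win margin described above is only given verbally (the 2016 result gets twice the weight for recency and twice again because both 2020 candidates were on the 2016 or 2012 tickets, so four times the 2012 weight in total). One plausible written form, assuming the weights are normalized to sum to one, which the talk does not state explicitly, would be

$$\text{composite win margin} = \frac{4\,r_{2016} + r_{2012}}{5},$$

where r_year is the state's win margin in that election.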
Next we applied the Z standardized transformation to avoid any sampling mean and variance biases. Using those Z scores, as Z-infected, Z-deaths and Z-unemployment, we derived the Z-COVID index. This index will represent the impact the global pandemic and the following economic hit each state experienced. Lastly, we calculated the 2020 predicted result, using the 2016-2012 composite win margin and the Z-COVID index. Once the 2020 election passed, we recorded the 2020 actual election results and proceeded to validate our prediction model and whether our choice of factors did a good job helping predict the 2020 election result. Additionally, since the media before the election had predicted that Trump will lose 3 to 5% of the votes from 2016, we decided to subtract certain percentages from the predicted results. In the table below, that predicted result is the zero percent category and the reductions can be seen as well. To analyze the results we compared the predicted results with the 2020 actual election results, using the regression and paired t-test. To compare how the 2020 predicted results compared with previous election results at predicting the 2020 actual result, we also include the 2012 election and 2016 election results in our evaluation. Lastly, we also conducted a 1-proportion hypothesis test to test the 2020 predicted results accuracy. To begin we conducted a regression test with the election results presented just from each state. The 2012 result compared to the 2020 actual election result did not yield a significant result. However, the regression tests between 2016 actual and 2020 actual displays a significant result and the highest R squared of 0.81. The results of the regression is also close to 1, at about 1.17, suggesting a strong regression relationship. The regression between the 2020 predicted and the 2020 actual also displays a significant result but a lower R square of 0.3. From these results, the regression between the 2016 actual and 2020 actual results had the highest R Square and slope closest to one, despite having declared a different winner. It is reasonable to find that the 2020 election results would be correlated with the 2016 election results since Trump lost those swing states narrowly wone in the 2016 election by small margin, so just Michigan, Pennsylvania, and Wisconsin. Next, the paired t-tests that will also compare the election results percentages of each state. we use the paired t-test because the same states or pairs are being assessed against each other. The paired t-test only found a significant difference in the means of the 2016 actual election results and the 2020 actual results, which makes sense since these results had a high regression test significance. This would suggest that the means of the 2012 election and the 2020 predicted results are similar to the 2020, actual meaning that the election results are similar. In the 2012 election, the Democrats had won the election and the predicted results had predicted that Democrats would win in 2020, while in 2016 the Republicans had won the election. This can explain why 2016 is significantly different from the 2020 actual election results, while the... while the 2012 actual and 2020 predicted are not. Lastly, we use the 1-proportion hypothesis test to test how the 2020 predicted results matched with the 2020 actual results. Unlike the regression and paired t-test, the 1-proportion test compares the states and which side they voted for. 
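A minimal sketch of this standardization step in Python, assuming (as the later discussion of the two-by-one ratio suggests) that the index simply sums the three z-scores, which automatically weights the two COVID-19 terms twice as heavily as the single economic term. The state rows and values are placeholders, not the actual data.

```python
import pandas as pd
from scipy.stats import zscore

# Hypothetical per-state table; the real analysis used all 15 swing states.
states = pd.DataFrame({
    "state": ["GA", "AZ", "WI"],
    "infected": [9.5e5, 7.8e5, 6.1e5],
    "deaths": [1.7e4, 1.4e4, 9.0e3],
    "unemployment_increase": [2.1, 3.4, 2.8],
})

# Z-standardize each factor across states to remove mean and variance bias.
for col in ["infected", "deaths", "unemployment_increase"]:
    states[f"z_{col}"] = zscore(states[col])

# Index combining pandemic and economic impact; the two COVID terms give a 2:1 weighting.
states["z_covid_index"] = (states["z_infected"] + states["z_deaths"]
                           + states["z_unemployment_increase"])
print(states[["state", "z_covid_index"]])
```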
The regression and paired t-test only compared the election results without any indication on which side the states voted for. Therefore, this test is more powerful and validating the prediction model, since it compares the predicted side each state would vote for and which side the states actually voted for. We assign the states that voted for the predicted side with a pass and those that did not vote for the predicted side with a fail. In total 12 out of 15 states received a pass, as they were predicted accurately, while the other three received a fail. We set the success value at pass and the scale is a sample proportion of 0.8. Since we want the sample proportion to be greater than 0.9 or 90% accurate, we set the hypothesized proportion to 0.9. Since the 0.8 proportion failed to exceed the 0.9 at the 95% confidence level, the prediction model failed to be 90% accurate, failed to reject the null hypothesis at the 95% confidence level. According to the proportion of our sample this model is 80% accurate. To summarize the regression test showed significance between the 2016 actual results and the 2020 actual election result., as well as a weaker significance between 2020 predicted and 2020 actual election results. We theorize that this may have been because this election, President Trump lost those swing states narrowly won in the 2016 election. The paired t-test showed significant difference between the 2016 actual and 2020 actual, and we theorized that this may have been because those two elections declared different winners. President Trump won the 2016 election yet lost the 2020 election. Additionally the 2012 and 2020 predicted results are not significantly different from the actual 2020 result...election result, which may have been because they both declared the same political party as the winners. As...lastly, the 1-percent hypothesis test failed to reject the null hypothesis, and so our prediction model is not 90% accurate at the 95% confidence level. Arizona, Wisconsin, and Minnesota, which could suggest that there were other major factors besides the impacts of COVID-19 and unemployment rates that influenced the 2020 election result. This is where we transition to the next part of our project in which we will group states based on their swing state index and identify them with key events that took place in 2020 that could have influenced the swing states' voting behaviors. So the questions this part of the project will address from the last is which events and factors influence the swing states to vote the way that they did. How much more or less did voters care about COVID-19 than the economy and other side investigations? Can we use statistical tools to link political events with voting patterns? The goals for this project is to study the previously identified swing states voting patterns by linking statistical clustering methods to political events. We will also adjust the Z-COVID index, or as we will now call it Z-Ratio, with new ratios to better understand the importance of COVID 19 and the economy in voting behavior. Previously the Z-COVID index had two by one ratio, where the values of COVID-19 infected cases and deaths were given twice the weight compared to the unemployment increase value, since there were two values for COVID-19 and only one meant for the economy. We realized that each State was impacted differently by the pandemic, so we thought it would be appropriate to analyze the effects of switching this two by one ratio to other ratios. 
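The one-proportion test described here (12 of 15 states predicted correctly, tested against a hypothesized 90% accuracy) can be reproduced with an exact binomial test; this is a generic SciPy sketch, not the JMP output.

```python
from scipy.stats import binomtest

# 12 "pass" states out of 15; H0: true accuracy = 0.9, H1: accuracy > 0.9.
result = binomtest(k=12, n=15, p=0.9, alternative="greater")
print(12 / 15)        # 0.8 observed accuracy
print(result.pvalue)  # large p-value -> fail to reject H0 at the 95% confidence level
```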
First, we will go back to the swing state index, which helps identify the swing of each State using the election result percentages from the past three elections. A negative election result indicates that the swing voted for Democrats, while positive indicates a Republican vote. The larger the magnitude and the more negative the swing index, the more that states voting patterns have swung. If the state changes direction then the signs of the two differences will not be equal, causing the swing state index to be negative and display more of a swing behavior. From this table, we can see that Michigan and Minnesota have negative values of the largest magnitude, which means they have been swinging the most in the past three elections. Overall, the swing state index is quite useful in understanding basic voting patterns for the swing states. However, the swing state index cannot identify key events that caused the voting patterns. We used hierarchical clustering to study states with similar voting patterns and list potential factors that affected their voting behavior. Hierarchical clustering grouped the 15 swing states into four different clusters, as seen on the right. We used this method, because of its bottom up approach, where every state is its own cluster before they emerged one at a time and moved up the hierarchy. On the right, Iowa and Ohio can be seen in red indicating that they're in the same cluster. As mentioned previously, the hierarchical clustering divided the swing states into four clusters, the first cluster consists of Iowa and Ohio. Both of these states had voted blue or Democratic in the 2012 election, yet red or Republican in the 2016 and 2020 election. The second cluster has Georgia, Arizona, North Carolina, and Florida. All these states except North Carolina became bluer or redder, or in other words are starting to favor one side heavily. But third cluster consists of Wisconsin, Pennsylvania, Michigan, Nevada, New Hampshire and Minnesota. All of these states, besides Nevada, have a negative swing index, meaning they're the most inconsistent swing states. The last cluster has Colorado, Virginia and New Mexico, which are all relatively blue states or states that have consistently voted Democrat and in the 2020 elections, voted blue by a larger margin thatn previously. Now that we have all the clusters and idea of their characteristics, we looked at the clustering join history, which identifies the top pairs of states or which two states are the most similar in their clusters. From the join history, the first two pairs are Wisconsin with Pennsylvania from the third cluster and Georgia with Arizona from the second cluster. Both pairs are part of the clusters that had states that switched from red to blue in the 2020 election. After further research, we found that Wisconsin and Pennsylvania appeared to have concerns for the economy, dissatisfaction with President Trump's healthcare related policies, such as his efforts to weaken the Affordable Care Act formed under the Obama administration, as well as concerns for the environment, all of which ultimately made majority of the voters vote Democratic. However, in Georgia and Arizona, we see that major shifts in demographics, such as more registered Latino voters in Arizona and the Black Lives Matter movement that exposed serious racial injustice, ultimately caused majority of the voters to cast a Democratic ballot. 
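The grouping step can be sketched outside JMP with SciPy's agglomerative (bottom-up) hierarchical clustering. The state list and feature values below are placeholders; the real analysis clustered all 15 swing states on the swing state index built from the 2012, 2016 and 2020 results.

```python
import pandas as pd
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

# Placeholder swing-state features (not the real index values).
swing = pd.DataFrame(
    {"swing_index": [-0.8, -0.7, 0.2, 0.3, -0.1, 0.5]},
    index=["Michigan", "Minnesota", "Iowa", "Ohio", "Georgia", "Colorado"],
)

# Bottom-up clustering: every state starts as its own cluster and pairs merge upward.
Z = linkage(swing.values, method="ward")
swing["cluster"] = fcluster(Z, t=4, criterion="maxclust")  # cut the tree into 4 clusters
print(swing.sort_values("cluster"))

# The dendrogram shows the join history (which states merge first).
dendrogram(Z, labels=swing.index.to_list())
plt.show()
```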
Through hierarchical clustering we were able to separate states into different groups based on their voting behaviors and create connections on which key events caused the observed voting behavior. Although we found key events that influenced each state's voting behavior, hierarchical clustering did not tell us the weight each event played in the individual swing state's voting behavior. COVID-19 and the economic recession that followed. COVID-19 and the economy. However, we had to assume that each state would have the same Z-ratio, which gave COVID-19 twice the weight, resulting in a two by one ratio for each state. However, from the hierarchical clustering, we found that each state has a unique situation and their voters cared for different issues. To adjust the Z-ratio and...the Z-ratio, we created a value called the Ratio Variable. The Ratio Variable will determine the ratio of the importance of COVID-19 or Z-COVID index, which represents the infected cases and deaths in each case versus the economy or the unemployment which represents the annual unemployment increase rate in each state. Once the Z ratio is adjusted, with a few different ratio variables, such as 0.1, which creates a ratio of one by 10 giving the economy 10 times the importance, it is implemented into the full formula used to calculate 2020 predicted results. These adjusted 2020 predicted results are compared against the 2020 actual election results to determine which ratio best explains the state's situation and how much of importance COVID-19 and the economy had in influencing the voting behavior. We decided to study Georgia's voting behavior closely since it appeared to stand out compared to the other swing states. For one, Georgia was the first state to reopen business in April, while the rest of the states did not. Additionally, Georgia was a key state in the 2020 election, which President Trump had an eye on even after the election results were announced in attempts to overturn them. Georgia voted blue by a small margin and had an election results of -0.3%. The adjusted 2020 predicted results for Georgia was potted and a marker for the Georgia...for Georgia's actual election result was placed on the graph on the right. From the graph we see that the adjusted 2020 predicted results with the ratio variable of 0.75 had the value of -0.2%, which is the closest to Georgias's actual election result, -0.3%. The 0.75 ratio variable means that the ratio is three by four, indicating that the economy was more important issue to a majority of the voters in Georgia. This makes sense because, as mentioned previously, Georgia was the first state to reopen business in April, indicating a strong concern for the well being of its businesses and economy. In this project we explored different key events and their importance in influencing the voting behaviors of 15 identified swing states using statistical methods. First hierarchical clustering was utilized to group swing states based on their voting behavior in the past three elections. From this we found that in the second cluster consisting of Arizona, Georgia and others, were mostly affected by issues regarding civil rights while states such as Pennsylvania and Wisconsin in the third cluster had voted for Joe Biden due to concerns for the economy, healthcare, and environment. Overall, the worsening COVID-19 situation, racial movements such as Black Lives Matter movement, COVID-19 and the economy, had on each state. 
Georgia was explored in more detail, and it was found that a three-to-four ratio matched the actual election result best, suggesting that the economy was a more important issue to voters than COVID-19; this makes sense because Georgia was the first state to reopen businesses in April. Thank you for listening to my presentation.  
Anne-Catherine Portmann, USP Development Scientist, Lonza Sven Steinbusch, Senior Project & Team Leader, Microbial Development Service USP, Lonza   Often, the analysis of big data is considered to be essential in the fourth big industrial revolution – the “Data-Based Industrial Revolution” or “Industry 4.0.” However, handling the challenge of unstructured data or a less than in-depth investigation of data prevent using the full potential of the existing knowledge. In this presentation we offer a structured data handling approach based on the “tidy data principle,” which allowed us to efficiently study the data from more than 80 production batches. The results of different statistical analyses (e.g. predictor screening, machine learning or PLS) were used in combination with existing process knowledge to improve the overall product yield. With the newly created knowledge, we were able to identify certain process steps that have a significant impact on the product yield. Additionally, several models demonstrated that the overall product yield can be improved up to 26 percent by the adaptation of different process parameters.     Auto-generated transcript...   Speaker Transcript Anne-Catherine Portmann Hello, today I will present you the power behind data. This presentation is based on the idea that a principal this presentation will allow us to efficiently study the data of more than 80 production batches. We were able to improve the product yield of more than 26% based on the process knowledge and the statistical analysis. The statistical analysis allows also to identify the key process step which have an impact on the product yield. So I will first introduce Lonza Pharma Biotech and then we will go to the historical data analysis. Lonza Pharma Biotech was found firstly in 1897 and shortly thereafter, it was transformed to chemical manufacturer. Today we are one of the world's leading supplier to pharmaceutical, healthcare and life sciences industry. Here at Visp, we are one of the biggest site from Lonza and we are most significant for R&D, development and manufacturing. We are, we have a new part of the company, the Ibex solutions, where we are able to complete biopharmaceutical cycles from preclinical to commercial stage, from drug substances to drug product, all of this in one location. You probably heard about this lately in the Moderna vaccine against the COVID-19, but it's not the only product that we are producing here in this. We are also producing small molecules, mammalian and microbial biopharmaceuticals, high potent APIs, peptides and bioconjugates including antibody-drug conjugates. Now that you know a little bit more about Lonza, I will go to the historical data analysis. So, first of all, I will present to you this process on which the 80 batches are run. So, first the upstream part. So the upstream part have first the fermentation part, where the product is generated by the micro organism. So the product make a microorganism. The... product to produce...the product, the microorganisms is produced (???) from the DNA during fermentation. Then we have the cell lysis where we disturb the cell membrane and allowed to release the product and all what is in the cell. and have access to this product. Then become the separation. In the separation part, we remove the cell fragments, such as the cell membrane or the DNA. Then we come to the downstream part, which is based on three different chromatography and allow the purification of the product. 
So the product is here in yellow in the below part of the slide. And we can see that during each of the chromatography part, we are able to profile a little bit more of the product. At the end, we perform a sterile filtration of the product. So the goal was to increase the overall project yield, and to do that, we first collect the data of the 80 batches and order them in a way that we can analyze them. Then we perform yield analysis. And then we discuss the result with the process analysis. With the SMEs, so the subject matter experts. Then we have seen and...we went to the data analysis for the upstream part and we perform this for analysis on the left of the slide. Then we go for the downstream part and focus on the Chromatography 1. At the end, we make a conclusion from all what we see in the...in the analysis and what the subject matter expert orders. And at the end, we recalculate the yield. Let's see what...how we organize our data. So we based the data on the tidy data principles. That is a big part of the...before the analysis, which takes time, but it's really important to have clean data and making an efficient analysis afterwards. So first we have about, that is, the observational unit one, for example, we can say to the fermentation. And then on each row of the...of the file we include one batch each time. For each column, we take a parameter. For example, for fermentation, the pH, the temperature, all the titer (that means the amount of the product at the end). And then, for each values, here corresponds the correct value from the column and the batch. And with this one, we can go to JMP and perform the analysis. So let's see how we calculate the yield. So, first we calculate the yield for each step, beginning at 100% for fermentation and see how it decreases along the process. So what we observe is that we have a big variation at the fermentation step. And then we have a decrease in the in the product amount at the separation step, as well as the chromatography 1 step. And so we go with this data to the subject matter experts and they told us that the complex media variability impacts the final titer of fermentation, so we have to explore this spot. Then, for the separation, the strategy that was choose could have a different impact on the mass ratios. And for the chromatography 1, the pooling strategy have most probably an impact. So, then we will see what the data said. So we look at the upstream part and perform different analysis. So the first analysis was the multivariate analysis of each of these USP process stages. So we focus on the fermentation, cell lysis, and separation. And see all the parameters, how they could correlate with the product at the end. So here, what we see the fermentation, the amount added to Reactor 1 had a medium correlation with a good significance probability. For separation, the final mass ratio. The mass ratio at the intermediate separation have both a major impact on a significant probability. You see that other parameters, such as the initial pH from Reactor 2 is very close to the medium correlation threshold and have a significant probability. And we will see if this parameter in the next analysis. We have also selected here only the parameter, which is scientifically meaningful for the other analysis also. Then we went to the partial least squares. For the partial least squares, we see that we have for fermentation a positive correlation for all these parameters. 
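A minimal pandas illustration of the "tidy data" layout just described: one table per observational unit (process step), one row per batch, one column per parameter. Batch IDs, parameter names and values are invented for the example.

```python
import pandas as pd

# Tidy table for the fermentation step: one row per batch, one column per parameter.
fermentation = pd.DataFrame({
    "batch":       ["B001", "B002", "B003"],
    "pH":          [6.9, 7.1, 7.0],
    "temperature": [30.5, 31.0, 30.8],   # degC
    "titer":       [12.4, 13.1, 11.8],   # product amount at the end of fermentation
})

# A separate tidy table per observational unit (e.g. separation) can then be
# joined on the batch key before the combined table is analyzed in JMP.
separation = pd.DataFrame({
    "batch":      ["B001", "B002", "B003"],
    "mass_ratio": [0.82, 0.79, 0.85],
})

merged = fermentation.merge(separation, on="batch")
print(merged)
```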
So again we see the amount of Reator 2, the initial pH of Reactor 2 and the initial amount of Reactor 2. As well as a new parameter, that is the hold time. And we see that the amount of Reactor 2 have a positive correlation with this analysis, but the negative with MVA. And this could be explained, because of the 80 batches whereby just...which were running production, but they were not designed to answer a question of positive or negative correlation on the product...on the final product. So that could be done in the future, in another analysis with a proper design. With ??? to still say that we have an impact on the final product. For the other parameters, at the other steps of the upstream part, we also see that the prediction matches the multivariate results. And we have also a possibility to improve the titer. Here we see with the prediction profiler that we can also optimize in the future and the project yield. Then we test the product...the predictor screening. And here we ran 10 times the predictor screening and the five parameters which will always found in the top 10 were selected. And here what we see. It's the initial pH of Reactor 2, the mass ratio at the end of separation, the mass ratio at intermediate separation, the initial amount of Reactor 2, the amount of Reactor 2, and the amount of Reactor 1. So again, we have the same parameter that appears to have an impact. So, then, we went to the machine learning. This machine learning analysis, XGBoost, is a decision tree machine learning algorithm. And to avoid having in this result parameter that's not part of the... of the top of our parameter, we include a fake parameter, which give us a kind of threshold in the parameter importance. And all the parameters which appear above this threshold were considered to have an impact. The other are considered to be random and below this random parameter and have no impact or not significant impact on the on the final product. And here we can see that the negative correlation will appear for Reactor 1. The pH of Reactor 2 and initial amount of Reactor 2, we have a positive correlation. And for the mass ratio, we have a negative correlation. Again, as I explained before, the difference between negative and positive correlation was not the goal and not designed for this experiment, so we know it's an impact, but we don't know if it's positive or negative yet. Then we will go to the downstream part, specifically on the Chromatography 1. And here we use the neural predictive modeling. So in the normal predictive modeling we use the a different fraction of the chromatography. So on the graph on the right, we see that Fraction 8 have...is the main fraction, so where we found most of the product, the highest purity. And then by decreasing from Fractions 7-1, we have product, but also more impurity. and Until now, we were taking into account of the Fraction 4 in our analysis and we would like to see if we can include also Fractions 3, 2 and 1. And what we saw is that the by increasing the number of fraction, we increase the yield but decrease very few the purity. So the graph on the left, we see that if we go to the Fraction 2, we decrease the purity to less than 1% but the yield was increased to 5%. And then, when we include the Fraction 1, there we have a bigger decrease of the purity but it's a little bit than 1%. More than one percent of purity decrease and the yield was in the other side increase of about 10%. So then, with this result we try to summarize everything together. 
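The "fake parameter as an importance threshold" idea used with the tree-based screening can be sketched generically: add a pure-noise column, fit a tree ensemble, and treat any real predictor whose importance falls below the noise column as not meaningfully important. The sketch below uses scikit-learn's gradient boosting as a stand-in for XGBoost, with invented column names and simulated data, not the actual batch records.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
n = 80  # roughly the number of production batches

# Placeholder process parameters (names and values are illustrative only).
X = pd.DataFrame({
    "reactor1_amount": rng.normal(size=n),
    "reactor2_initial_pH": rng.normal(size=n),
    "separation_mass_ratio": rng.normal(size=n),
    "hold_time": rng.normal(size=n),
})
y = (2.0 * X["reactor2_initial_pH"] - 1.5 * X["separation_mass_ratio"]
     + rng.normal(scale=0.5, size=n))

# Add a fake, purely random predictor to act as an importance threshold.
X["fake_random"] = rng.normal(size=n)

model = GradientBoostingRegressor(random_state=0).fit(X, y)
importance = pd.Series(model.feature_importances_, index=X.columns).sort_values(ascending=False)
threshold = importance["fake_random"]
print(importance)
print("Considered impactful:", list(importance[importance > threshold].index))
```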
For fermentation, we have the final volumes of the tanks and reactors, which were identified by most of the methods. The initial pH of the fermenter was also identified by the different analysis methods, and the variability of the complex compound was identified by the process experts. To be able to see the effect of the complex compound variability, we will need further investigation in the lab. For separation, we have the mass ratio, which was identified by some of the data analysis methods but also by the process experts. The strategy is very interesting: the process experts decided to look at it and run some tests to try to improve the yield in production. For Chromatography 1, the pooling strategy was identified by the process experts and by the neural network analysis. Here the method can easily be implemented in the lab and also in production, and the yield really increases a lot with this method. Then we recalculated, with the prediction profiler, how much we could increase the yield of the different steps. For fermentation we were able to increase the yield by up to 16%. For separation we can increase it by up to 5%, and for Chromatography 1 also by up to 5%; on the other slide we wrote up to 10%, so here we take the worst-case scenario and say up to 5%. In total we get an increase of 26% at the end, which is a good way to improve our process and to focus exactly on the parts where we can have a big impact, based only on the data and without doing a lot of experiments in the lab. It is also much cheaper to do these analyses with JMP than to run a lot of experiments in the lab, so we gain a lot in the end. Thank you very much to all of you for listening to me today, and also a big thank you to my colleagues Ludovic, Helge, Lea, Nichola, and Sven for the ??? of this presentation.
Paolo Nencioni, Technician, A.Menarini Diletta Biagi, PhD Student, Università di Firenze   In the pharmaceutical world, tabletability, compactibility and compressibility profiles are commonly used to characterize raw materials and powder formulations under compression. By evaluating these profiles, it is possible to explain the mechanical behavior of the tested materials during the tableting process (tableting performance). The tabletability profile describes the relationship between pressure and tablet strength. Compactibility and compressibility give additional information to describe the overall tableting behavior, taking into account other parameters that influence the process, such as porosity. Using an instrumented single-punch press, it is also possible to conduct compression studies with a small amount of material. This type of approach is the basis for developing a robust formulation, as required in a Quality by Design framework. In this early stage of the study we use JMP for modelling and visualizing tableting performance as a function of compaction pressure. This is a fundamental step for defining an experimental domain for further trials. Thanks to the powerful interactive visualization capabilities of JMP, it is possible to move freely inside the domain and predict tableting behavior and properties, shaping an acceptable design space.     Auto-generated transcript...   Speaker Transcript Diletta Biagi Hello everyone, welcome to our presentation. I'm Diletta Biagi. I'd like to start by introducing what inspired us in this project. I will give you some very basic and simple elements about what tablets are and how they are manufactured, and then I'll talk about why we make a compaction study. Then Paolo will go deeper, showing you all the practical steps of our work: how we worked with the data, and also some real case studies. When talking about the administration of drugs, the oral route is the most used, and tablets in particular are by far the most popular solid dosage form. You can have all types, colors, and shapes of tablets; you can have coated, controlled-release tablets, but also standard compressed tablets. So how can you obtain tablets? You can get tablets from powder, but also from granulate. The powder is put into the confined space of the die, and then a compression force is applied, causing a reduction of the volume. But what actually happens when this reduction of volume occurs? At first, when the pressure is still low, the particles rearrange and pack more closely, reducing their porosity. As the pressure increases, the particle dimensions change as well, either because the particles deform or because they fracture into smaller particles. This behavior depends on the starting powder. The starting powder has its own characteristics affecting the tableting process, and in the same way the resulting tablets have characteristics that depend on the applied force, on the geometry of the die and punches, and also on the powders that we use. So we have a lot of characteristics affecting the process and also resulting from the process, and all of them can be summarized in four formulas that we use to describe the compression process. Let's look at these formulas: the compaction pressure links the applied force to the cross-sectional area of the punch.
It is useful for comparing the loading applied to tablets when the tablets have different sizes, because you cannot directly compare the force when two tablets have different sizes. The tensile strength links the breaking force to the area of the longitudinal section of the tablet, and it is useful for comparing the mechanical strength of tablets of different sizes because, again, you cannot directly compare the breaking force when two tablets have different sizes. The true density represents the density of only the solid portion of the powder, so no air between the particles and no voids inside the particles contribute to the calculation of the true density. The solid fraction is the ratio of the tablet's apparent density to the true density and, as you can see, it is related to the porosity. Compaction pressure, tensile strength, and solid fraction can be plotted one against another in three different ways; let's briefly see what these three plots mean. The compressibility is the ability of a material to reduce its volume when a pressure is applied; here the reduction of volume is expressed as an increase in solid fraction. The compactibility is the ability of a powder to give tablets of a specified strength when a reduction of volume occurs, and again the reduction of volume is expressed in terms of solid fraction, as an increase in solid fraction. The tabletability is the ability of a powder to give tablets of sufficient strength when a pressure is applied; in fact, it is a plot of tensile strength versus compaction pressure. With these three plots we are able to better understand what happens to a powder when it is compressed, and we are also able to understand and explain the characteristics of the resulting tablets, especially if we use the three plots together. Why is it important to understand what happens during the compression process? Because it is absolutely necessary if we want to develop a robust tablet formulation, and it is also very useful for the scale-up of a laboratory formulation. OK, now I will tell you about the first practical steps of our work. We started by collecting some data: we selected some pure excipients, in particular microcrystalline cellulose, lactose, and calcium phosphate, and we also selected different particle size grades of these excipients. Here I report just one type of cellulose as an example. We compressed the cellulose with a single punch press using a flat-faced punch. All the tablets were made manually, one at a time, and for each single tablet we recorded the compaction force, the weight, the thickness, and the crushing strength. The compaction force was changed and increased every time, so we produced tablets using compaction forces from 2 kilonewtons up to 40 kilonewtons, which correspond to compaction pressures of 20 up to 400 megapascals. I would like to underline that these recorded values were the only data that we actually measured, because all the other data that we use in this project (you can see the tensile strength here, but we use others as well) are derived from these measurements using the formulas I showed you before, and also using some other models that Paolo is going to show you in a while. Paolo Nencioni Okay, as Diletta said, to compute the solid fraction we needed to know the true density of the material. The true density is commonly measured by ???, but it can also be derived from compaction data.
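For reference, the quantities just described can be written out. This is a hedged reconstruction in our own symbols, not a transcription of the presenters' slide; the diametral-compression form of the tensile strength assumes flat-faced cylindrical tablets like the ones used here.

```latex
% Symbols (ours): F = compaction force, A = punch cross-sectional area,
% F_b = breaking (crushing) force, D = tablet diameter, t = tablet thickness,
% rho_app = apparent density of the tablet, rho_true = true density of the solid
% (no inter-particle air, no intra-particle voids).
\begin{align*}
  P  &= \frac{F}{A}                     && \text{compaction pressure}\\[2pt]
  \sigma_t &= \frac{2 F_b}{\pi D t}     && \text{tensile strength (flat-faced cylindrical tablet)}\\[2pt]
  \mathrm{SF} &= \frac{\rho_{\mathrm{app}}}{\rho_{\mathrm{true}}},
  \qquad \varepsilon = 1 - \mathrm{SF}  && \text{solid fraction and porosity}
\end{align*}
```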
A method developed by Sun uses a nonlinear regression of compaction pressure against tablet density, based on the modified Heckel equation shown here on the slide. To model this equation we used the Nonlinear platform in the Specialized Modeling menu of JMP. The built-in model library contains a lot of models, but it is also possible to create your own customized equation, as I show here: you only have to add your own ??? defined in NonlinLib.JSL. Running the nonlinear regression, JMP solves the equation, with the ??? computed, and returns the parameter estimates that best fit the data; the parameter that we call D here is the true density, which we will use in all our subsequent elaboration of the data. Once we have the true density, we can start to plot the data and the related relationships. The compressibility property comes first: it links the reduction in volume of the material to the applied pressure. This relationship can be described by the Kawakita equation; here too we have to model the equation with a nonlinear regression, and again we have to add a customized equation defined in NonlinLib.JSL. Paolo Nencioni Running the nonlinear regression gives us a formula, and we can save this formula in our data table, so we get the compressibility plot. The saved Kawakita equation explains the volume reduction as a function of the applied pressure; this is the first plot that Diletta showed us before. The compactibility is another very important property; almost every paper that deals with it uses the Ryshkewitch equation to describe the relationship between solid fraction and tensile strength. Here it is not necessary to use nonlinear modeling, because from our Fit Y by X report we only have to select the Fit Special command. In this way it is possible to apply a logarithmic transformation to the Y data, the tensile strength, and then we only have to save the formula in the data table. We get the relation that links solid fraction to tensile strength: this is the compactibility plot, where the Ryshkewitch equation explains the tensile strength as a function of the powder densification, the solid fraction. The last plot that Diletta showed us is the tabletability. It describes the effectiveness of the applied pressure in increasing the tensile strength. Normally a greater compaction pressure results in stronger tablets, but this relationship is not always true, because after increasing the compaction pressure a great deal, the tensile strength stops increasing. Paolo Nencioni The relationship here is not a direct function, and even though the topic has been investigated deeply, a full and versatile theoretical framework for powder tabletability is missing. Here we tried to apply a composition of the two previous equations; whether the resulting equation is able to fit the data depends mainly on the material characteristics. However, we use this function composition in the next slides. The three graphs can be displayed together, using a dashboard for example, with a local data filter that also gives us the possibility to highlight the desired range of compaction pressure. The data can also be shown in a three-dimensional scatterplot, to understand the relation between compaction pressure, solid fraction, and tensile strength. Each face of the cube is one of the three plots that we have seen before: the solid fraction of the compact is a direct result of the applied compaction pressure, and the tensile strength of the compact is a direct result of the solid fraction.
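The curve fits described above can be illustrated outside JMP with a short Python sketch. This is not the presenters' NonlinLib.JSL workflow; the data are synthetic, and the equations are the standard Kawakita and Ryshkewitch forms, with the Ryshkewitch fit done as a log-linear regression in the spirit of the Fit Special transformation.

```python
import numpy as np
from scipy.optimize import curve_fit

# Kawakita equation: C = a*b*P / (1 + b*P), where C is the degree of volume
# reduction, P the compaction pressure, and a, b are material constants.
def kawakita(P, a, b):
    return a * b * P / (1.0 + b * P)

# Synthetic stand-in data: pressures (MPa), volume reduction C,
# porosity eps = 1 - SF, and tensile strength (MPa).
P   = np.array([20, 50, 100, 150, 200, 300, 400], dtype=float)
C   = np.array([0.28, 0.45, 0.55, 0.60, 0.62, 0.65, 0.66])
eps = np.array([0.40, 0.28, 0.18, 0.13, 0.10, 0.07, 0.05])
ts  = np.array([0.6, 1.4, 2.6, 3.4, 4.0, 4.8, 5.3])

# Compressibility: nonlinear fit of the Kawakita equation.
(a, b), _ = curve_fit(kawakita, P, C, p0=[0.7, 0.05])
print(f"Kawakita fit: a = {a:.3f}, b = {b:.4f}")

# Compactibility: Ryshkewitch equation sigma = sigma0 * exp(-k * eps),
# fitted as a straight line on log(tensile strength).
slope, intercept = np.polyfit(eps, np.log(ts), 1)
sigma0, k = np.exp(intercept), -slope
print(f"Ryshkewitch fit: sigma0 = {sigma0:.2f} MPa, k = {k:.2f}")
```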
Now let's look at some case studies, two case studies. The first one is a very simple application, comparing the profiles of two excipients and their mixtures, using a flat-faced punch with an area of one square centimeter. Cellulose and lactose have different behavior under compression: the first, cellulose, is commonly known as a material that consolidates by deformation, while the second is commonly known to compact by fragmentation. Here we can see that cellulose gives ??? and reaches higher tensile strength values than lactose. We can also see that in the tabletability plot the last part of the cellulose data does not fit the equation very well. Lactose does not reach the same tensile strength values as cellulose, but in the tabletability plot you can see that its data fit the equation of the function composition that we use very well. We also prepared some mixtures of the excipients at different ratios. As expected, the profiles of the two cellulose-lactose mixtures look like the profile of the excipient present in the larger amount, so we can change the behavior of the mixture simply by increasing or decreasing the ratio between the two ingredients. This can be very useful during formulation activity, when we have to deal with an active ingredient with poor compaction properties. The second case study is on a real tablet formulation. It is a compaction study using both the flat-faced punch we have seen before and the real production punch, which makes a very small tablet, around 5 mm ???. First, we did the profiles using a single punch press. Here we can compare the plots for the two different punches. It is possible to see that tablets made with the smaller convex punch are not able to reach the same tensile strength, at a given solid fraction, as those obtained with the bigger flat-faced punch. To get more reliable results, we continued the study using the punch that is really used for the industrial batches, the smaller one. Here we see the profiles made with two different pieces of equipment, a single punch press and a rotary press. We can see that there is no remarkable difference between the two: tensile strength and solid fraction are roughly the same for both. Finally, we produced tablets at two different speeds using the rotary equipment. Here we have to introduce a new term, the 'dwell time', which is the time that the powder is under load, at the maximum pressure of the cycle. A lower speed, which means a longer dwell time, results in tablets with the highest solid fraction, and we can see this clearly in the first plot, the compressibility, which links compaction pressure and solid fraction. The compactibility, on the right side of the slide, is perhaps the most valuable of the three properties because it relates tensile strength and solid fraction, and we can see here that, apart from minor differences, the two curves are essentially the same. This means that the compactibility is not significantly affected by compaction speed, and for this reason the compactibility profile becomes a useful tool during the scale-up from laboratory to industrial equipment, because the ??? is very important information about the compaction of the powder. Thank you for having followed our talk, and goodbye.
Markus Schafheutle, Consultant, Business Consulting Laura Castro-Schilo, JMP Senior Research Statistician Developer, SAS Christopher Gotwalt, JMP Director of Statistical R&D, SAS   We describe a case study for modeling manufacturing data from a chemical process. The goal of the research was to identify optimal settings for the controllable factors in the manufacturing process, such that the quality of the product was kept high while minimizing costs. We used structural equation modeling (SEM) to fit multivariate time series models that captured the complexity of the multivariate associations between the numerous process variables. Using the model-implied covariance matrix from SEM, we then created a Prediction Profiler that enabled estimation of optimal settings for controllable factors. Results were validated by domain experts and by comparing predictions against those of a thermodynamic model. After successful validation, the SEM and Profiler results were tested in the chemical plant with positive outcomes; the optimized predicted settings pointed in the correct direction for optimizing quality and cost. We conclude by outlining the challenges in modeling these data with methodology that is more often used in the social and behavioral sciences than in engineering.     Auto-generated transcript...   Speaker Transcript Hello, I'm Chris Gotwalt, and today I'm going to be presenting with Markus Schafheutle and Laura Castro-Schilo on an industrial application of structural equation models, or SEM. This talk showcases one of the things I enjoy most about my work with JMP. In JMP statistical development, we have a bird's-eye view of what is happening in data analysis across many fields, which gives us the opportunity to cross-fertilize best practices across disciplines. In JMP Pro 15, we added a new structural equation modeling platform. SEM is the dominant data-analytic framework in many of the social sciences because it flexibly models complex relationships in multivariate settings. One of its key features is that variables may be used as both regressors and responses at the same time, as part of the same model. Furthermore, it occurred to me that these complicated models are represented with diagrams that, at least on the surface, look like diagrams representing manufacturing processes. I wasn't the only one to make this connection. Markus, who was working with a chemical company, thought the same thing. He was working on a problem involving a two-column twin-distillation manufacturing process, where the company wanted to minimize energy costs, which were largely going to steam production, while still making product that stayed within specification. He reached out to his JMP sales engineer, Martin Demel, who then connected Markus to Laura and me. We had a series of meetings where he showed and described the data, the problem, and the goals of the company. We were able to model the data remarkably well. Our model was validated by sharing the results, as communicated with the JMP profiler, with the company's internal experts, then with a first-principles simulator, and then with new physical data from the plant. This was a clear success as a data science project. However, I would like to add a caveat. The success required the joint effort of Laura, who is a top-tier expert in structural equation modeling. Prior to joining JMP, she was faculty in quantitative psychology at the University of North Carolina, Chapel Hill, one of the top departments in the US.
She is also the inventor of the SEM platform in JMP Pro. This exercise was challenging even for her: she had to write a JSL program that itself wrote a JSL program that specified the model, for example. The model we fit was perhaps the largest and most sophisticated SEM model of all time. I want to call out the truly outstanding and groundbreaking work that Laura has done, both with the SEM platform generally and in this case study in particular. Now I'm going to hand it over to Markus, who is going to give the background to the problem the customer wanted to solve; then Laura is going to talk about SEM and her model for this problem. I'll give a brief discussion of how we set up the profiler, and then Markus will wrap up and talk about the takeaways from the project and the actions the customer took based on our results. Thank you, Chris, for the introduction. Before I start with the problem, I want to make you familiar with the principles of distillation. Distillation is a process that separates a mixture, most of the time a mixture of liquids, into its individual components. How does this work? Here you see a schematic view of a laboratory distillation setup. There is a flask containing the crude mixture that has to be separated. You heat it up, in this case with an oil bath, and stir it, and then it starts boiling. The lowest-boiling material starts first, and the vapor rises, passes the thermometer, which reads the boiling temperature, and then goes on into the condenser, where it condenses, and the condensate drips into the other small flask. So, as I said, it is built for separating a mixture of liquids with different boiling points, and those are separated by boiling point. For example, everybody knows that if you want to make schnapps from a mash, you put the mash into this flask, heat it up, and distill the alcohol over to get the schnapps. While it looks very simple here in the lab, in industry it is a bit more complicated, because the equipment is not only bigger but also more complex, mainly because of the engineering part of the story. In this study the distillation was not done batchwise, as you have just seen, but in a continuous manner. This means the crude mixture is pumped in somewhere in the middle of the column; the low-boiling material goes up as a vapor to the top of the column, where it leaves the column, and the higher-boiling material flows downward and leaves the column at the bottom. To make it a bit more complicated, in our case the bottom stream is then pumped into a second column and distilled again, to separate another material from the residual stream. So we actually separate the original crude mixture into three parts: Distillate 1, Distillate 2, and what is left, the bottom stream of the second still. And to make it even more complex, we use the heat of the distillate of this second column to heat the first column, in order to save energy. In a schematic view it looks like this: here you have Still 1 and there is Still 2. Here we have the raw material mix, which is actually an assembly of distillates from the manufacturing, and what we want to do is separate the valuable material from the rest. So we pump this crude mix into the first still.
As I said, it enters somewhere in the middle, and it separates into the low-boiling fraction, which is the first valuable material we want to have, while the rest leaves the first still at the bottom. It is stored in an intermediate tank, and from there it is pumped into the second still, again somewhere in the middle, where it separates into the second valuable material and a waste stream, which is then redeposited. To heat everything, we start on this side, because this is the higher-boiling material, so we need higher temperatures, which also means more energy. So we pump in the steam here to heat this still, and the material leaving at the top has the boiling temperature of this material, so we use its residual heat to heat the first still. If there is a little gap between what comes over from here and what we need for the distillation in the first still, we can add a little extra steam to keep everything running. This is a very high-level view of things. If you want to go a bit more into the details, here is essentially the same picture, but now with all the different tags that we have for the readouts, the quality control, the temperature control, and so on. What do you see here? We start with Feed 1 into the first still. It separates into the top stream, where we control the density, which is a quality characteristic, and the bottom stream, which goes into the intermediate tank. From there it is fed into the second still and again separated into a bottom stream and a top stream, and again we test the density for quality control. Here we also add the steam to heat everything up, and the top flow then goes via the heat exchanger into the first still and heats it up. And here we have the possibility to add some extra steam to keep everything in balance. What we see is that on a local basis there is a lot of correlation, which is shown here by the color coding of the arrows. For example, the feed here, together with the feed density, which is a measure of the composition of this feed, more or less defines the top stream and its quality, and the bottom stream. So you can make some local predictions; likewise, the material going in here defines the stuff over here. But if you want a total description of the entire equipment, it gets tricky. You can do local least-squares correlations here, you can do them there, you can do them separately for the steam, and also here. But as you see, we have a mass stream that starts here, goes through the first still and through the second still to here, and we have an energy stream that starts more or less here and goes through here, via the heat exchanger, and also down here. So it is a kind of circuit, and all these things are correlated, more or less, within this circuit, and this gave us the difficulty that we actually did not know what the Xs and the Ys were. That was the reason we started to think about other possibilities to model this. The target of this study was to find the optimal flow and steam settings for all these varying incoming factors, in such a way that everything stays in spec: the distillate quality should stay in spec, but also the internal operational specs and the spec for the final waste stream.
And the most interesting part, at least money-wise: we want to minimize the consumption of the steam. So what we actually needed was, first of all, a good model that describes this system, and that was the point where Laura came into the game and developed the structural equation model. We also needed a kind of profiler that enables us to figure out the best settings, the optimal settings, for all the incoming variation we may have, in order to stay within all these specs. That was the point where Chris came in, building a profiler on top of Laura's model, which we can use for all the predictions we need. Now I want to hand over to Laura to describe the model she built from these data. Laura, please. Thank you, Markus. I'm Laura Castro-Schilo, and I'm going to tell you about the steps we followed to model the distillation process using the structural equation models platform. When Markus first came and talked to us about his project, there were three specific features that made me realize that SEM would be a good tool for him. The first is that there was a very specific theory of how the processes affect each other, and we saw that in the diagram he showed. An important feature of that diagram is that all variables had dual roles; in other words, arrows point at the nodes of the diagram, but those nodes also point at other variables, so there was not a clear distinction between what was an input and what was an output. Rather, variables had both of those roles. Lastly, it was important to realize that we were dealing with processes that were measured repeatedly; in other words, we had time series data. All of these features made me realize that SEM would be a good tool for Markus. Now, if you're not familiar with SEM, you might wonder why. SEM is a very general framework that affords a lot of flexibility for dealing with these types of problems. I've listed on this slide a number of features that make SEM a good tool, but since we're not going to be able to go through all of them, I also included a link where you can learn more about SEM if you're interested. I'm going to focus on two of the points I have here. The first is that SEM allows us to test theories of multivariate relations among variables, which was exactly what Markus wanted to do. Also, there are very useful tools in SEM called path diagrams. These diagrams are very intuitive, and they represent the statistical models that we're fitting. So let's talk about that point a little more. Here is an example of a path diagram that we could draw in the SEM platform to represent a simple linear regression, and the diagram is drawn with very specific features. For example, we use rectangles to represent the variables that we have measured; here, X and Y. We also have a one-headed arrow to represent a regression effect. And notice that double-headed arrows that start and end on the same variable represent variances; if they were to start and end on different variables, those double-headed arrows would represent a covariance. In this case, we just have the variance of X and the residual variance of Y, which is the part that is not explained by the prediction from X. So this is the path diagram representation of a simple linear regression. But of course we can also look at the equations that are represented by that diagram, and notice that for Y, the equation is simply that of a linear regression.
I've omitted the means and intercepts here just for simplicity. It's important to note that all of the parameters in the equations are represented in the path diagram, so these diagrams really do convey the precise statistical model that we're fitting. Now, in SEM the diagrams, or models, that we specify imply a very specific covariance structure. This is the covariance structure that we would expect given the simple linear regression model: we have the variance of X; we have the variance of Y, which is a function of both the variance of X and the residual variance of Y; and we also have an expression for the covariance of X and Y. Generally speaking, the way that model fit is assessed in SEM is by comparing the model-implied covariance structure to the observed sample covariance of the data, and if these two are fairly close to each other, we say that the model fits well. A number of different models can be fit with SEM, and today our focus is going to be specifically on time series models. When we talk about time series, we mean a collection of data where there is dependence on previous data points, and these data are usually collected at equally spaced time intervals. The way that time series analysis deals with the dependencies in the data is by regressing on the past. One type of these models is called an autoregressive process, or AR(p): where Y represents a process measured at time t, the autoregressive model consists of regressing that process on previous observations of that process, up to time t minus p. If we're talking specifically about an autoregressive process of order one, then we have the process Y at time t regressed on its immediately adjacent past, Y at time t minus 1. The way we would implement this in SEM is simply by specifying, as we saw before, the regression of Y at time t on Y at time t minus 1. Notice that the regression equation is very similar to what we saw in the previous slide, so it's no surprise that the path diagram looks the same. We can extend this AR(1) model to one that includes two lags, in other words an autoregressive model of order two. Here the process at time t is regressed on both t minus 1 and t minus 2. If we look at the path diagram that represents that model, we have an explicit representation of the process at the current time, but also at lag one and lag two. A very specific aspect of this diagram is that the paths for adjacent time points are set equal to each other, and this is an important part of the specification that allows us to specify the model correctly. So notice that we use beta 1 to represent the lag-1 effects, and we also have to set equality constraints on the residual variances. Lastly, we also have the effect of Y at time t minus 2 predicting Y at time t, which is the lag-2 effect. All of these are univariate time series models, and you can fit them using the structural equation modeling platform in JMP, or you could also use the Time Series platform. However, the problem we were dealing with in Markus' data required more complexity: it required us to look at multivariate time series models, and one type of these models is called a vector autoregressive model. What I'd like to show you is one of these models of order two.
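As a hedged reconstruction of the equations behind this spoken description (our notation, not the slide's): the simple regression with its model-implied covariance structure, followed by the AR(1), AR(2), and bivariate VAR(2) forms discussed next.

```latex
% Simple regression and its model-implied covariance
% (phi_x = variance of X, psi = residual variance of Y):
\begin{align*}
  y_i &= \beta x_i + \zeta_i, &
  \Sigma(\theta) &=
  \begin{pmatrix}
    \phi_x & \beta\,\phi_x \\
    \beta\,\phi_x & \beta^2\phi_x + \psi
  \end{pmatrix}.
\end{align*}

% Autoregressive models of order 1 and 2, and the bivariate VAR(2)
% (A_1, A_2 are 2x2 matrices of lag-1 and lag-2 effects):
\begin{align*}
  \text{AR(1):}\quad & y_t = \beta_1\, y_{t-1} + \varepsilon_t \\
  \text{AR(2):}\quad & y_t = \beta_1\, y_{t-1} + \beta_2\, y_{t-2} + \varepsilon_t \\
  \text{VAR(2):}\quad & \mathbf{z}_t = A_1 \mathbf{z}_{t-1} + A_2 \mathbf{z}_{t-2} + \boldsymbol{\varepsilon}_t,
  \qquad \mathbf{z}_t = (x_t, y_t)^{\top},\;
  \boldsymbol{\varepsilon}_t \sim N(\mathbf{0}, \Psi).
\end{align*}
```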
So we have a process for X and another for Y, and the same autoregressive effects that we saw before are included here. Notice we have our equality constraints, which are really important for proper specification, but we also have the cross-lagged effects, which tell us how the processes influence each other. Notice gamma 1 and gamma 1, and also gamma 3 and gamma 3, indicating that we have to put equality constraints on those lag-1 effects across processes. We also have to incorporate the covariances across the processes, so we have their covariance at time t minus 2, but we also have the residual covariances at t minus 1 and t, and notice these again need equality constraints for proper specification. In this video I show how we would fit a bivariate time series model, just like the one I showed you, using JMP Pro. We start by manipulating our data so that they're ready for SEM. First, we standardize the two processes, because they are on very different scales. Then we create lagged variables to represent explicitly the time points prior to time t, so we have t minus 1 and t minus 2. We launch the SEM platform and enter the Xs and then the Ys, so that it's easier to specify our models. Now I've sped up the video so that you can quickly see how the model is specified. Here we add the cross-lagged effects for lag 1, and then, directly using the interactivity of the diagram, we add the lag-2 effects. What remains is to specify all the equality constraints that are required for these models, within and across processes. We name our model, and lastly we run it. As you could see, even a bivariate time series model that only incorporates two processes requires a number of equality constraints and nuances in the specification that make it relatively challenging. However, in the case of the distillation process data, we had a lot more than two processes: we were actually dealing with 26 processes, and in total we had about 45,000 measurements, taken at 10-minute intervals. Our first approach was to explore univariate time series models using the Time Series platform in JMP, and when we did this we realized that for most processes an AR(1) or AR(2) model fit best, which made me realize that, at the very least, we needed to fit multivariate models in SEM that incorporated two lags. We also had to follow a number of preprocessing steps to get the data ready for SEM. On the one hand, we had a lot of missing data, and even though SEM can handle missing data just fine, with models as complex as these it became computationally very intensive. So we decided to select a subset of data where we had complete data for all of the processes, and that left us with about 13,000 observations. Also, as we saw in the video, we had large scale differences across the processes, so we had to standardize all of them, and lastly we created lag variables to make sure we could specify the models in SEM. Now, for model specification, the equality constraints in particular are a very big challenge, because it would take a lot of time to specify them manually, and it would of course be tedious and error-prone. Our approach for dealing with this was to generate a JSL script that would then generate another JSL script for launching the SEM platform.
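The data preparation described for the video (standardize each process, then add explicit lag-1 and lag-2 columns) can be sketched in pandas. This is only an illustration with invented column names and values, not the JSL the authors generated.

```python
import pandas as pd

# Hypothetical table with two process measurements taken at 10-minute intervals.
df = pd.DataFrame({
    "time":   pd.date_range("2021-01-01", periods=6, freq="10min"),
    "x_flow": [10.2, 10.5, 10.1, 9.8, 10.0, 10.4],
    "y_temp": [81.0, 81.5, 82.1, 81.8, 81.2, 81.6],
})

processes = ["x_flow", "y_temp"]

# Standardize each process (they are on very different scales).
df[processes] = (df[processes] - df[processes].mean()) / df[processes].std()

# Create explicit lag-1 and lag-2 columns so a VAR(2)-style model can be
# specified on a single wide table; rows with incomplete lags are dropped.
for col in processes:
    df[f"{col}_lag1"] = df[col].shift(1)
    df[f"{col}_lag2"] = df[col].shift(2)

df = df.dropna().reset_index(drop=True)
print(df.round(2))
```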
What you see here is the final model that we fit in the platform. Thankfully, after estimating this model, we were able to obtain the covariance structure implied by the model, and that was the piece of information I could pass over to Chris Gotwalt, who then used that matrix to create a profiler that Markus could use for his purposes. So Chris, why don't you tell us how you created that profiler? Thank you, Laura. Now I'm going to show the highlights of how I was able to take the model results and turn them into a profiler that the company could easily work with. Laura ran her model on the standardized data and sent me a table containing the SEM model intercepts, and she also included the original means and standard deviations that were used to standardize the data. On the right we have the SEM model-implied covariance matrix, which includes the covariances between the present values and the lagged values from the immediate past. This information describes how all the variables relate to one another. In this form, though, the model is not ready to be used for prediction. To see how certain variables change as a function of others, we have to use this information to derive the conditional distribution of the response variables given the variables that we want to use as inputs. So essentially we need the conditional mean of the responses given the inputs, and to do that we need to implement the formula shown right here (written out below). To do that, we used the SWEEP operator in JSL. The SWEEP operator is a mathematical tool that was created by SAS CEO and co-founder Jim Goodnight and published in The American Statistician in 1979. The SWEEP operator is probably the single most important contribution to computational statistics in the last 100 years. Most JMP users don't know that the SWEEP operator is used by every single JMP statistical platform in many ways: we use it for matrix inversion and for the sums-of-squares calculations, and it can also be exploited as a simple and elegant way to compute conditional distributions if you know how to use it properly. I created a table with columns for all the variables. The two rows in the table are the minimums and maximums of the original data, which lets the profiler know how to set up the ranges. I added formula columns for the response variables, using the swept version of the covariance matrix from Laura's model, and put those formulas into the data table on the far right. Here's what one of the formulas looks like: I pulled in the results from the analysis as matrices. Laura's model included the estimated covariance between the current Ys and the two preceding values, because it was a large multivariate autoregressive model of order 2. Predicting the present by controlling the two previous values of the input variables was going to be very cumbersome to operationalize, so I made a simplifying assumption that these two values were the same, which collapsed the model into a form that was easier to use. To do this, I simply used the same column label when addressing the lag-1 and lag-2 entries for a term. With that machinery in place, I created a profiler for the response columns of interest. I set up desirability functions that reflected the company's goals: they wanted to match a target on A2TopTemp, maximize A2BotTemp, and so on, ultimately wanting to minimize the sum of the steam that came out of the two columns.
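The formula referred to above is, in standard notation, the conditioning identity for a multivariate normal distribution; sweeping the input block of the joint covariance matrix is one way to compute these quantities. This is a reconstruction in our own symbols, not the slide itself.

```latex
% Partition the model-implied mean and covariance into input (X) and
% response (Y) blocks; the conditional distribution of Y given X = x is:
\begin{align*}
  \mu_{Y\mid X}(x) &= \mu_Y + \Sigma_{YX}\,\Sigma_{XX}^{-1}\,(x - \mu_X), \\
  \Sigma_{Y\mid X}  &= \Sigma_{YY} - \Sigma_{YX}\,\Sigma_{XX}^{-1}\,\Sigma_{XY}.
\end{align*}
% Sweeping the X block of the joint covariance matrix yields Sigma_XX^{-1},
% the regression coefficients Sigma_XX^{-1} Sigma_XY, and the Schur
% complement Sigma_{Y|X} in a single pass.
```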
You can lock certain variables in the profiler by control-clicking on a panel. The locked variables have their values drawn with a solid red line, and once we've done that we can enter values for them; when we run the optimizer, that is, maximize desirability, the locked variables are held fixed. This way we find settings of the variables we can control that keep the product being made to specification while minimizing energy costs. It's fair to say that it would be difficult for someone else to repeat Laura's modeling approach on a new problem, and it would be difficult for another person to set up a profiler like I did here. If enough people see this presentation and want us to add a platform that makes this kind of analysis easier in the future, you should let us know by reaching out to Technical Support via support@JMP.com. Now I'm going to hand it back over to Markus, who will talk about what the customer did with the model and our conclusions. Thank you, Chris. With the prediction profiler that Chris just presented, we built, let's say, a predictive landscape, which helps us understand what the best settings should be in order to achieve all the necessary quality specs. The three factors which vary and which we can influence only to a limited extent are the feed for Still 1, the feed for Still 2, and the quality, or composition, of the feed into Still 1. What also turned out in the model to play an important role in this scenario is the cooling water temperature. All the other variables are of smaller importance, so we neglected them in this first approach. Here you see the landscape. It's a kind of variability chart, so to say: we have the feed density for Still 1, the feed into Still 1, and the feed into Still 2, in more or less all possible combinations, and here you see the settings that are predicted to be best in order to stay within the specifications. Some of these also have specifications of their own that we have to stay inside; for example, the steam flow for Still 2, the reflux there, the boil-up, and the same things for Still 1. On the right side you see the predicted outcomes, the quality specs, so to say: the temperature of the top flow in Still 1, the density of the distillate, the density of the distillate of Still 2, and so on. What you see, if you look at the desirability in the bottom row, is that there are big areas where we cannot really achieve a good performance of our system, and if you look into the details you see that here we are off spec, here we are off spec, here at some points we are off spec, and so on. But what you also see is that this in-spec/off-spec behavior is governed not only by these three components at the bottom, but also by the river temperature, and at the moment the lowest river temperature, 1 degree, is highlighted. With that temperature we stay in spec most of the time, and there are only rare combinations of these three factors where we don't. But if we increase the river temperature, for example to 24 degrees, then the areas where we are off spec become much more predominant, and it is very hard to stay within the specifications. So what we learned from the model is that we have problems staying within our specifications when the river temperature is above 7 degrees C.
Then the question was: why is that? The engineers suspected that it was because of the cooling capacity of the cooler. But before we went into the real trial, we compared our SEM model against a thermodynamic model based on ChemCAD, and it turned out that both models point in the same direction, so there were no real discrepancies between the two. This put us in an optimistic mood, so we did some real trials with the best settings and, let's say, confirmed the predictions from the models. It turned out, as I already said, that what the engineers suspected is true: the cooling capacity of the cooler is not sufficient. So when the river temperature is higher, the heat transfer is too small and the equipment doesn't really run any more. The next step now is to use the data from this study to justify an investment in a new cooler with a better heat exchange capacity. So, thanks to Laura and Chris, we could set up the case for this investment. If you have questions, please feel free to ask now. Thank you.