Level: Intermediate  A survey intended to improve the current situation needs questions about both outcomes and causes. If the causal relationship between the two is then captured by regression analysis, the items that should be acted upon can be selected. Items that were not asked cannot be recovered after the survey, whereas items that turn out to be unnecessary can simply be ignored afterward. For this reason, when a comprehensive set of question items is prepared while taking care not to overburden respondents, the number of items becomes large and high correlations appear among them. To address this problem, Takahashi and Kawasaki have proposed "multi-group principal component regression analysis." Its essence is as follows: under a rational grouping constructed so that correlations are high within groups and low between groups, principal components are computed for each group; the important principal components are selected by principal component regression with these components as explanatory variables; and the items with large absolute factor loadings on the selected components are chosen as the main items on which countermeasures should be taken. These main items sometimes cluster together, which can be handled with factor analysis. Causal analysis using principal components is front-side causal analysis, causal analysis using factors is back-side causal analysis, and the combination of the two is two-sided causal analysis. This presentation introduces the theory behind this approach and concrete methods for carrying it out in JMP.  The author has spent half a century on the research and practice of QM (quality management), SQM (statistical quality management), and design methodology, including joint research with many companies and management guidance. Since the 1990s he has proposed Hyper Design, a new design paradigm; studied its underlying mathematics, HOPE theory (Hyper Optimization for Prospective Engineering); and jointly developed its supporting software, the HOPE Add-in for JMP, with SAS. Together these form a trinity that realizes a new design method: Hyper Design as the way of thinking, HOPE theory as the statistical mathematics, and the HOPE Add-in for JMP as the supporting tool.  Because design has a high barrier to entry, it is often misunderstood as a special activity for special people. To break down this barrier and enable many people to master design, the author has developed new educational methods alongside his theoretical research: hands-on education using physical teaching materials (paper helicopters, paper gliders, coin shooting, and others) and virtual materials (a ball-flight simulator and others). Because this educational program, which runs from basic statistics through Hyper Design, uses JMP for visualized analysis and design in many situations, it is both easy to understand and enjoyable. It has been delivered for more than 30 years at universities in Japan and abroad (Keio University, Yale University, Tokyo University of Science, University of Tsukuba, and others) and at many companies, and its effectiveness has been confirmed.  * A paper is available for those who would like a more detailed understanding. If interested, please contact the presenter, Takenori Takahashi, directly.
Level: Advanced  Because Hyper Design is a general-purpose and comprehensive design method, it allows a wide range of practical applications. For robust design in particular, it enables flexible designs that go far beyond the framework of conventional robust design. Until now, most robust design has treated noise factors as qualitative factors. In many cases, however, the noise factors are essentially quantitative and are treated as qualitative only for the convenience of the design theory or because of constraints in measurement methods (often a lack of ingenuity). Because Hyper Design can treat quantitative noise factors as quantitative variables, it greatly advances conventional tactical robust design toward strategic, policy-level robust design.  In robust design with quantitative noise factors, the key is the advanced use of composite functions (functions of functions). Using multiple, multi-layered composite functions systematically makes design through far-reaching optimization possible. Handling composite functions with complex nested structures requires sophisticated software, and JMP makes this easy. Under its new paradigm, Hyper Design realizes designs that go beyond the conventional, tactical way of thinking. In this presentation, the theory of robust design with quantitative noise factors and its applications are discussed concretely using data from real cases taken up in papers and books.  The author has spent half a century on the research and practice of QM (quality management), SQM (statistical quality management), and design methodology, including joint research with many companies and management guidance. Since the 1990s he has proposed Hyper Design, a new design paradigm; studied its underlying mathematics, HOPE theory (Hyper Optimization for Prospective Engineering); and jointly developed its supporting software, the HOPE Add-in for JMP, with SAS. Together these form a trinity that realizes a new design method: Hyper Design as the way of thinking, HOPE theory as the statistical mathematics, and the HOPE Add-in for JMP as the supporting tool.  Because design has a high barrier to entry, it is often misunderstood as a special activity for special people. To break down this barrier and enable many people to master design, the author has developed new educational methods alongside his theoretical research: hands-on education using physical teaching materials (paper helicopters, paper gliders, coin shooting, and others) and virtual materials (a ball-flight simulator and others). Because this educational program, which runs from basic statistics through Hyper Design, uses JMP for visualized analysis and design in many situations, it is both easy to understand and enjoyable. It has been delivered for more than 30 years at universities in Japan and abroad (Keio University, Yale University, Tokyo University of Science, University of Tsukuba, and others) and at many companies, and its effectiveness has been confirmed.  * A paper is available for those who would like a more detailed understanding. If interested, please contact the presenter, Takenori Takahashi, directly.
Level: Intermediate  With work-style reform being widely discussed these days, making experiments and analyses more efficient is increasingly important. A good way to demonstrate the power of JMP for streamlining experimentation is to show that an existing experiment can be replaced by a definitive screening design (DSD) or a custom design with a greatly reduced number of runs, picking the response data for those runs out of the existing experimental data. When experimental data are found split across multiple tables, combine them into a single table, run a multivariate analysis, and visualize it with the profiler so that people notice the pitfalls of the OFAT (One Factor at a Time) approach. For the practice of analyzing replicated experiments using only the means, show that alternatives exist, such as stacking the data and performing multi-objective or robust optimization using both means and variances.  When design of experiments is used in development, the presence of interactions often cannot be predicted in advance, and interactions are by no means rare. A DSD has no confounding between main effects and two-factor interactions (2FIs), nor among 2FIs, and requires only about twice as many runs as factors; this is a major advantage. I will report on what I have learned from actually using DSDs, including the breakdown that occurs when the number of main effects plus interaction terms approaches the number of factors, how augmented designs resolve it, and the points that seem most important in practice from DSD-related papers obtained from the JMP Community and ASQ.  After serving at Yamatake-Honeywell (now Azbil) as General Manager of FA Development, Director of the R&D Division, and Director of the Quality Assurance Promotion Division, and as Advisor at Azbil Kimmon, the author founded Tohrin Consulting. His specialties include yield and quality improvement through production data analysis; statistical problem solving in general, including field-failure prediction, robust design, design optimization, and design of experiments; and on-site guidance in design review, root cause analysis (RCA), prevention of human error, and process improvement. His books include "The Essence of Net Business" (JUSE Press, 2001, co-authored; Telecom Social Science Award), "Practical Success Strategies for Venture Companies" (Chuokeizai-sha, 2011, co-authored), and "An Easy-to-Understand Book on Problem Solving" (Nikkan Kogyo Shimbun, 2014, sole author). A representative paper is "A proposal of workshops to promote the reporting and acceptance of near-misses and feelings that something is off on production lines" (Japanese Society for Quality Control, 2016; 2016 Quality Technology Award). Recent presentations include "A proposal of a mechanism to visualize the organizational factors that induce work errors and promote improvement" (Discovery-Japan 2018) and "Solving quality problems with JMP: defect analysis and reliability prediction in manufacturing" (Discovery-Japan 2019).
Level: Intermediate  From the standpoint of preventing lifestyle-related diseases, Japan's specific health checkup program targets people aged 40 to 74 and aims to reduce the number of people who meet the criteria for metabolic syndrome. A summary of checkup results for all examinees can serve as a benchmark for comparing an individual against the population as a whole, and should therefore be useful for the personal health management of middle-aged and older people.  The NDB Open Data published by the Ministry of Health, Labour and Welfare provides, for each fiscal year, the mean values and class-interval distributions of the checkup items (waist circumference, blood glucose, blood pressure, and so on). For several of the items used to judge metabolic syndrome, the number of people outside the reference range and the number examined can be obtained for each attribute, such as sex and age group. Graphing the proportion outside the reference range for each item by attribute yields interesting results; for some items, for example, no clear trend by age group appears.  In this presentation, along with these graphs, I show the results of fitting a generalized linear model to the proportion outside the reference range for each checkup item, with fiscal year, prefecture, sex, and age group as factors. This model makes it possible to predict the out-of-range proportion for a given combination of attributes (sex, age group, and prefecture of residence), giving a deeper understanding of the specific health checkup population as a whole.  A technical engineer in the JMP Japan division, currently engaged mainly in pre-sales of JMP products for pharmaceutical and food companies. Although he introduces JMP to customers, he strongly identifies as a JMP user himself. In recent years he has been posting JMP analyses of topics in the news to his blog and to JMP Public, a site for sharing analysis reports. https://public.jmp.com/users/259
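For readers who want to try a comparable model outside JMP, below is a minimal Python sketch of a binomial generalized linear model with fiscal year, prefecture, sex, and age group as factors. The file name and column names are assumptions for illustration, not the presenter's actual data.

```python
import pandas as pd
import statsmodels.api as sm

# Hypothetical tidy table built from NDB Open Data: one row per
# (year, prefecture, sex, age_group) combination for a single checkup item,
# with counts of examinees outside and inside the reference range.
df = pd.read_csv("ndb_waist_summary.csv")
df["n_in"] = df["n_examined"] - df["n_out"]

# Binomial GLM: the response is (successes, failures) = (out of range, in range).
y = df[["n_out", "n_in"]]
X = pd.get_dummies(df[["year", "prefecture", "sex", "age_group"]].astype(str),
                   drop_first=True).astype(float)
X = sm.add_constant(X)

result = sm.GLM(y, X, family=sm.families.Binomial()).fit()
print(result.summary())

# Predicted out-of-range proportion for each attribute combination.
df["pred_prop"] = result.predict(X)
```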
Caleb King, JMP Research Statistician Tester, SAS   In this talk, we illustrate how you can use the wide array of graphical features in JMP, including new capabilities in JMP 16, to help tell the story of your data. Using the FBI database on reported hate crimes from 1991-2019, we’ll demonstrate how key tools in JMP’s graphical toolbox, such as graphlets and interactive feature modification, can lead viewers to new insights. But the fun doesn’t stop at graphs. We’ll also show how you can let your audience “choose their own adventure” by creating table scripts to subset your data into smaller data sets, each with their own graphs ready to provide new perspectives on the overall narrative. And don’t worry. Not all is as dark as it seems...     Auto-generated transcript...   Speaker Transcript Caleb King Hello, my name is Caleb King, I am a developer at the at the JMP software. I'm specifically in the design of experiments group, but today I'm going to be   a little off topic and talk about how you can use the graphics and some of the other tools in JMP to help with sort of the visual storytelling of your data.   Now the data set I've chosen is a...to illustrate that is the hate crime data collected by the FBI so maybe a bit of a controversial data set but also pretty relevant to what what's been happening.   And my goal is to make this more illustrative so I'll be walking through a couple graphics that I've created previously   for this data set. And I won't necessarily be showing you how I made the graphs but the purpose is to kind of illustrate for you how you can use JMP   for, like I said, visual storytelling so use the interactivity to help lead the people looking at the graphs and interacting with it   to maybe ask more questions, maybe you'll be answering some of their questions, as you go along. But kind of encourage that data exploration, which is what we're all about here at JMP.   So with that said let's get right in.   I'll first kind of give you a little bit of overview of the data set itself, so I'll kind of just scroll along here. So there's a lot of information about   where the incidents took place.   As we keep going, and the date, well, when that incident occurred. You have a little bit of information about the offenders   and what offense, type of offense was committed. Again some basic information, what type of bias was presented in the incident. Some information about the the victims.   And overall discrimination category, and then some additional information I provided about latitude and longitude, if it's available, as well as some population that I'll be using other graphics. Now just for the sake of, you know, to be clear,   the FBI, that's the United States Federal Bureau of Investigation, defines a hate crime is any   criminal offense that takes place that's motivated by biases against a particular group. So that bias could be racial, against religion, gender, gender identity and so forth.   So, as long as a crime is motivated by a particular bias, it's considered a hate crime, and this data consists of all the incidents that have been   collected by the FBI, going back all the way to the year 1991 and as recent as 2019. I don't have data from 2020, as interesting as that certainly would be,   but that's because the FBI likes to take its time and making sure the data is thoroughly cleaned and prepared before they actually create their reports. 
So   you can rest assured that this data is pretty accurate, given how much time and effort they put into making sure it is.   Alright, so with that let's kind of get started and do some basic visual summary of the data. So I'll start by running this little table script up here. And all this does is basically give us   a count over over the days, so each day, how many incidents occurred, according to a particular bias. From this I'm going to create a basic plot, in this case it's simple sort of line plot   here, showing the number of incidents that happen each day over the entire range. So you get the entire...you can see the whole range of the data 1991 to 2019   and how many incidents occurred. Now this in and of itself would probably a good static image, because you get kind of get a sense of   where the the number of incidents falls. In fact here I'm going to change the axis settings here a little bit. Let's see, we got increments in 50s, let's do it by 20s. There we go.   So there's a little bit of interactivity for you... interaction. We changed the scales to kind of refine it and get a better sense of   how many incidents, on average, there are. I ran a bit of a distribution and, on average, around 20 incidents per day   that we see here. Now of course, you're probably wondering why I have not yet addressed the two spikes that we see in the data.   So yes, there are clearly two really tall spikes. And so, if this were   any other type of software, you might say, okay, I'd look like to look into that. So you go back to the data you try and, you   know, isolate where those dates are and maybe try and present some plots or do some analysis to show what's going on there. Well, this is JMP and we have something that can   help with that, and it's something that we introduced in JMP 15 called graphlets and it works like this. I'm just going to hover and boom.   A little graphlet has appeared to help further summarize what's going on at that point. So in this case there's a lot of information.   We'll notice first the date, May 1, 1992. So if you're familiar with American history, you might know what's happening here, but if not,   you can get a little bit of an additional clue by clicking on the graph. So now you'll see that I'm showing you   the incidents by the particular bias of the incident. So we see here that most of the incidents were against white individuals and then next group is Black or African-American and it continues on down.   I kind of give away the answer here, in that the incidents that occurred around this time where the Rodney King riots in California. Rodney King,   an African-American individual who was unfortunately slain by a police officer and that led to a lot of backlash and rioting   around this time. So that's what we're seeing captured in this particular data point, and if you didn't know that, you would at least have this information to try to start and go looking there...looking online to figure out what happened.   We can do the same thing here with a very large spike. And again, I'll use the hover graphlet, so hover over it. I'll pause to let you look. So we look at the date, September 12, 2001. That's in it of itself a very big clue   as to what happened. But if we look here at the breakdown, we can see that most of the incidents were against individuals of Muslim faith, of   Arab ethnicity or some other type of similar ethnicity or ancestry. 
In this case, we can clearly see   that, after the unfortunate events of September 11, the terrorist attacks that occurred then, there was on the following day, a lot of backlash against   members who were of similar ethnicity, similar faith and so forth, so we had an unfortunate backlash happening at that time.   So already with just this one plot and some of that interactivity, we've been able to glean a lot of information, a lot of high level information in areas where you might want to look further.   But we can keep going. Now something new in JMP 16 is, because we have date here on the X axis, we can actually bin the dates into a larger category, so in this case let's bin it by month.   And we see that the plot disappears. So here's what I'm going to do. I'm going to rerun it and let's see.   There we go.   You never know what will happen. In this case, so this is what's supposed to happen; don't worry.   So we've binned it by month and we noticed an interesting pattern here. There seems to be some sort of seasonal trend occurring, and let's use the hover graphlets to kind of help us identify what might be happening. So I'm going to hover over the lower points. So if I do that,   we see okay, January, December, okay. Interesting. Let's hover over another one, December.   And yet another one, December. Ah, there might actually be some actual seasonal trend in this case going on. We seem to hit low points around the the winter months.   And in fact, if I go back to my data table, I've actually seen that before. It was something I kind of discovered while exploring that technique and I've already created a plot to kind of address that.   So this was something I created based off of that, kind of, look at, you know, what's the variation in the number of incidents over all the years within this month.   And here we can see them the mean trends, but we also see a lot of variation, especially here in September because of that huge spike there.   So maybe we need something a little more appropriate. So I'll open the control panel and hey, let's pick the the median. That's more robust and maybe look at the interquartile range, so that way we have a little bit more robust   metrics to play with.   And so, again, we see that seasonal trends, so it seems that there's definitely a large dip within the winter months   as opposed to peaking kind of in the spring and summer months. Now   this might be something someone might want to look further into and research why is that happening.   You might have your own explanations. My personal explanation is that I believe the Christmas spirit is so powerful that overcomes whatever whatever hate or bias individuals might have in December.   Again that's just my personal preference, you probably have your own. But again, with just a single plot, I was able to discover this trend and   make another plot to kind of explore that further. So again with just this one plot, I've encouraged more research. And we can keep going.   So let's see, let's bin it by year, and if we do that, we can clearly see this kind of overall trend.   So we see a kind of peak in the late '90s around early 2000s before dropping, you know, almost linear fashion,   until it hits its midpoint about in the mid 2010s before starting to rise again. So keep that in mind, you might see similar trends in other plots we show. But again, let's take a step back and just realize that in this one plot we've seen different   aspects of the data. 
We even   answered some questions, but we've also maybe brought up a few more. What's with that seasonal trend?   And if you didn't know what those events were that I told you, you know, what were those particular events? So that's the beauty of the interactivity of JMP graphics is it allows the user to engage and explore and encourages it all within just one particular medium.   All right.   Let's keep going. So I mentioned, this is sort of visual storytelling, so you can think of that sort of as a prologue, as sort of the the overall view. What's...what's   what's the overall situation? Now let's look at kind of the actors,   that is, who's committing these types of offenses? Who are they committing them against? What information can we find out about that? So here I've   created, again, a plot to kind of help answer that. Now this might be a good start. Here I've created a heat map   that then emphasizes the the counts by, in this case, the offender's race versus their particular bias. So we see that   a lot of what's happening, in this case I've sorted the columns so we can see there's a there's a lot going on. Most of its here in this upper   left corner and not too much going on down here, which I guess is good news. There's a lot of biases where there's not a lot happening, most of it's happening here in this corner.   Now, this might be a good plot, but again there's a lot of open space here. So maybe we can play around with things to try and emphasize what's going on. So one way I can do that is I'll come here to the X axis and I'm going to size by the count.   Now you'll see here, I had something hidden behind the scenes. I'd actually put a label, a percentage label on top of these.   There was just so much going on before that you couldn't even see it, but now we can actually see some of that information. So kind of a nice way to summarize it as opposed to counts.   But even with just the visualization, we can clearly see the largest amount of bias is against Black or African American citizens and then Jewish and on down until there's hardly any down here. So just by looking at the X axis, that gives you a lot of information about what's going on.   We can do the same with the Y, so again, size by the count.   And again, there's a lot of information contained just within the size and how I've adjusted the the axes.   And this case we include...we've really emphasized that corner, so we can clearly see who the top players are. In this case, most of it is   offenders are of white or unknown race against African-Americans, the next one being against Jewish, and then   anti white and then it just keeps dropping down. So we get a nice little summary here of what's going on. Now, you may have noticed that as I'm hovering around, we see our little circle. That's my graphlet indicator, so again I've got a tool here.   We've we've interacted a little bit and again, this could be a great static image, but let's use the power of JMP, especially those graphslets,   to interact and see what further information we can figure out. So in this case, I'll hover over here.   And right here, a graph, in this case, a packed bar chart, courtesy of our graph guru Xan Gregg. In this case, not only can you see, you know,   what people are committing the offenses and against whom, your next question might have been, you know, what types offenses are being committed? Well, with a graphlet, I've answered that for you.   
We can see here the largest...the overwhelming type of offense is intimidation, followed by simple and aggravated assault, and then the rest of these, that's the beauty of the packed bar chart.   We can see all the other types offenses that are committed. If you stack them all on top of each other, they don't even compare. They don't even break the top three.   So that tells you a lot about the types of...these types of offenses, how dominant they are.   Now, another question you might have is, okay, we've seen the actors, we've seen the actions they're taking,   but there's a time aspect to this. Obviously this is happening over time, so has this been a consistent thing? Has there been a change in the trends? Well,   have no fear. Graphlets again to the rescue. In this case, I can actually show you those trends. So here we can see how has the   types of...the number of intimidation incidents changed over time? And again, we see that the pattern seems to follow what that overall trend was.   A peak in the like, late 90s, and then the steep trend...almost linear drop until about the mid 2010s, before kind of upticking again more recently.   And again we can maybe see that trend and others. I won't click to zoom in, but you can just see from the plot here, those trends in simple assault here and aggravated assault as well, a little bit there.   And you can keep exploring. So let's look at the unknown against African-Americans and see what difference there might be there. In this case, we can clearly see   that there are two types of offenses that really dominate, in this case, destruction or damage to property (which, if you think about it, might make sense; if you see your property's been damaged, there's a good chance you may not know who did it)   and intimidation are the dominant ones. And again, you can...the nice thing about this is the hover labels kind of persist, so you can again look and see what trends are happening there.   So in this case, we see with damage, there's actually two peaks, kind of peaked here in the late '90s early 2000s, before dropping again. And with intimidation, we see a similar trend as we did before.   Again within just one graphic, there's a lot of information contained and that you, as the user, can interact with to try and emphasize certain key areas, and then you, as the user, just visualize...just looking at this and interacting with it, can play around and glean a lot of information.   All right.   And let's keep going. Now you'll notice that amongst the reporting agencies, so, most of them are city/county level   police departments and so forth, but there's also some universities in here. So there might be someone out there who might be interested in seeing, you know, what's happening at the universities.   And so, with that, I've created this nice little table script to answer that. Now this time,   I've been just running the table scripts and I mentioned, I won't go too much behind the scenes, this is more illustrative.   Here I'm going to let you take a peek, because I want to not only show you the power of the graphics but also the power of the table script. Now if you're familiar with JMP,   you might know, okay, the table script's nice because I can save my analyses, I can save my reports, I can even use it to save graphics like I...like I did in the last one, so you may not have noticed that you can also save   scripts to help run additional tables and summary tables and so forth. 
So let me show you what all is happening behind here, in fact, when I ran the script, I actually created two data tables.   You only saw the one, so in this case I first created the data table that selected all the universities and then from that data table it created a summary and then I close the old one.   And then I also added to that some of the graphics. So I won't go into too much detail here about how I set this up, because I want to save that for after the next one. I'll give you a hint. It's based off of a new feature in JMP 16 that will really amaze you.   All right, let's go back to...excuse me...university incidents.   And here again I've saved the table script. This one that will show us a graphic.   So here we can see again is that packed bar chart, and here I'm kind of showing you which universities had the most incidents. Now again, this in and of itself might be a pretty good standard graphic.   You can see that, you know, which university seem to have the most incidents happening and again it's kind of nice to see that there's no real dominating one. You can still pack the other universities   on top of them, and nobody is dominating one or the other. So that in and of itself is kind of good news, but again there's a time aspect to this. So   have these been maybe... has the University of Michigan Ann Arbor, have they had trouble the entire time? Have they...would they have always remained on top? Did they just happen to have a bad year? Again, graphlets to the rescue.   In this case,   you'll see an interesting plot here. You might say, you know, what what is this thing? This looks like it belongs in an Art Deco museum. What...   what kind of plot is this? Well, it's actually one we've seen before. I'm just using something new that came out in JMP 16, so I'm going to give you a behind the scenes look.   And in this case, we can see, this is actually a heat map. All I've done   is I do a trick that I often like to do, which is to emphasize things two different ways, so not only emphasizing the counts by color, which is what you would typically do in a heat map, the whites are the missing entries, I can also now in JMP 16 emphasize by size.   And so I think this again gets back to where we size those axes before. It emphasizes...helps emphasize certain areas. So here we can see now maybe there's a little bit of an issue with incidents against African-Americans,   that has been pretty consistent, with an especially bad year in apparently 2017, as opposed to all of the other incidents that have been occurring.   Now there's no extra hover labels here.   All I'll do is summarize the data, but that's okay.   This in and of itself gives you a lot of information, so this is a new thing that came out in JMP 16 that can again help with that emphasis.   And again, we can keep going. We can look at other universities, so here, this might be an example of a university where they seem to have a pretty bad period of time,   the University of Maryland in College Park, but then there was an area where   things were really good, and so you might be interested in knowing, well, what happened to make this such a great period?   Is there something the university instituted, what they did that seemed to cause the count, the number of incidents to drop significantly? That might be something worth looking into.   
And you can keep going and looking again to see whether it's a systemic issue, whether like, in this case, it seemed there's just a really bad year that dominated, overall they were just doing okay.   They were doing pretty good. Again, this might be another one. They had a really bad time early on, but recently they've been doing pretty good,   and so forth. So again, kind of highlighting that interactivity yet again,   and in this case, with some of the newer features in JMP 16. Now, before we transition to the last one, I have a confession. I'm a bit of a map nerd, so I really like maps and any type of data analysis that, you know, relates to maps.   I don't know why. I just really like it and so I'm really excited to show you this next one, because now we look at the geography of the incidents.   But I'm also excited because this really, I believe, highlights the power of both the table scripts and the JMP graphics, especially the hover graphs.   So hopefully that got you excited as well, so let's run it. Now this one's going to take a little while because there's actually a lot going on with this table script. It's creating a new table. It's also doing a lot of functions in that table   and computing a lot of things. So here we've got not just, you know, pulling in information but also there's a lot of these columns here near the end that have been calculated behind the scenes.   Now I have to take a brief moment to talk about a particular metric I'm going to be using. So a while back, I wrote a blog entry called the Crime Rate Conundrum on on the JMP Community (community.jmp.com),   so shameless plug there, but in that I talked about how, you know, typically when you're reporting incidents, especially crime incidents,   usually we kind of know that you don't just want to report the raw counts, because   there might be a certain area where it has a high number of counts, a high number of incidents, is that just because that's...there's a problem at problem there?   Or is it because there's just a lot of people there? And so we, of course, would expect a lot of incidents because there's just a lot of people. So of course people   report incidents rates. Now that's fine because everybody's now on a level playing field but one side effect of that is it tends to elevate   places that have small populations. Essentially, you have, if you have small denominator, you will tend to have a larger ratio just because of that.   And so that's sort of an unfortunate side effect, and so there, I talk about an interesting case where we have a place with a really small population that gets really inflated.   And how some people deal with that. One way I tried to address that was through this use of a weighted incident rate, essentially, the idea is   I take your incident rate, but then I weight you by, sort of, what proportion...   excuse me...basically a weight by how many people you have there. In this case, I have a particular weight, I basically rank the populations,   so that the the largest place would have rank of of the smallest. However, in this case there's 50 states, so the state with the largest population would have a   rank of 50 and the smallest state a rank of one. 
If you take that and divide that by you know the maximum rank, that's essentially your weight so it's it's a way to kind of put   a weight corresponding to your total population and the idea here is that, if your incident rate is such that it overcomes this weight penalty, if you will,   then that means that you might be someone worth looking into. So it tries to counteract   that inflation, just due to a small population. If you are still relatively small, but your incident rate is high enough that you overcome your weight essentially,   we might want to look into you. So hopefully that wasn't too much information, but that's the metric that I'll be primarily using so I'll run the script   and   here we go. So first I've got a straightforward line plot that kind of shows the weighted incident rates over time for all the states.   Now I'll use a new feature here. We can see here that New Jersey seems to dominate. Again interactivity, we can actually click to highlight it.   There's some new things that we do, especially in JMP 16. I'm going to right click here and I'm going to add some labels. So let's do the maximum value and let's do the last value   just for comparison.   So here we can see this...the peak here was about 11.4 incidents per 1,000 (that's a weighted incident rate) here in sort of the early '90s.   And then we see a decreasing trend, again it seems to drop about the same that all the the overall incident rate did before starting to peak again here in   2019. So again with just some brief numbers again this, in and of itself, would be an interesting plot to look at, but as you could see, my little graphlet indicator is going, so there's more.   Here's where the the map part comes in. So I'm going to hover over a particular point.   In this case, not only   can you see sort of the overall rate, I can actually break it down for you, in this case by county. So here I've colored the   counties by the total number of incidents within that year. And again, there's that time aspect, so this shows you a snapshot for one particular year, in this case 2008.   But maybe you're interested in the overall trend, so one one way you could do that is, hey, these are graphlets. I could go back, hover over another spot, pull up that graph, click on it to zoom,   repeat as needed. You could do that or you could use this new trick I found actually while preparing this presentation.   Let's unhide...notice over here to the side, we have a local data filter. That's really the key behind these graphlets.   I'm going to come here to the year and I'm going to change its modeling type to nominal, rather than continuous, because now, I can do something like this. I can actually go through   and select individual years or, now this is JMP, we can do better.   Let me go here and I'm going to do an animation. I'm going to make it a little fast here. I'm going to click play, and now I can just sit back relax and, you know, watch as JMP does things for me.   So here we can see it cycle through and getting a sense of what's happening. I'll let it cycle through a bit. We see...already starting to see some interesting things happening here.   Let's let it cycle through, get the full picture, you know. We want the complete picture, not that I'm showing off or anything.   Alright, so we've cycled through and we noticed something. So let's let's go down here to about say 2004, 2005. So somewhere around here, we noticed this one county here, in particular, seems to be highlighted.   
And in fact, you saw my little graphlet indicator. So again, I can hover over it, and here   yet another map. Now you can see why I'm so excited.   Again, in this case, I can actually show you at the county level, so the individual county level...   Excuse me, let me...let's move that over a bit. There we are. Some minor adjustments and again, you can see my trick of emphasizing things two ways by both size and color.   We can kind of see dispersion within the ???, this is individual locations and because there's that time aspect again,   we know...now we know better, we don't have to go back and click and get multiple graphs, we can again use the local data filter tricks. So I can go back. I'll do   the year, and so in this case, we can again click through. Here I'm just going to use the arrow keys on my keyboard to kind of cycle through.   And just kind of get a sense of how things are varying over time. In this case, you see a particular area, you've probably already seen it, starting about 2006, 2007ish frame.   There's this one area...this here.   Keansburg, which seems to be highlighted and you'll notice yet another graphlet. How far do you want to go?   Graphception, if you will.   We can   keep going down further and further in. In this case, I get...I break it out by what the bias was, and again I could do that trick if I wanted to, to go through and cycle through by year.   So, again so much power in these graphs. With this one graphlet, I was able to explore geographical variation at county level and even further below, and so it might be   allowing you to kind of explore different aspects of the data, allowing you to generate more questions. What was happening in Keansburg around this time to make it pop like this?   That's something you might want to know.   So that's all I have for you today, hopefully I've whet your appetite and was able to clearly illustrate for you how powerful the the JMP visualization is in exploring the data.   If you want to know more, there's going to be a plenary talk on data viz. I definitely encourage you to explore that and it kind of helps address different ways of visualization and how JMP can help out with that.   But I did promise you, at one point, to give you a peek as to how I was able to create these pretty amazing table scripts and I'll do that right now.   It's called the enhanced log now in JMP 16. This is one of the coolest new features in JMP 16. Enhanced log actually follows along as you interact and it keeps track of it. And so whenever I closed, in this case, closed a data table, opened a data table, ran a data table,   if I added a new column, if I created a new graph, it gets recorded here in the log.   This is something that John Sall will be talking about in his plenary talk. It's, again, one of the most new amazing features here.   And this is the key to how I was able to create these tables scripts. I can honestly say that if this hadn't been present, I probably wouldn't have been able to create these pretty cool table scripts, because it'd be a lot of work to do.   So again, this is a really cool feature that's available in JMP 16. So I hope I was able to convince you that JMP is a great tool for exploring data, for creating awesome visualizations, interactive visualizations. And that's all I have. Thank you for coming.
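The weighted incident rate described in this talk is straightforward to compute outside JMP as well. Below is a minimal Python sketch of that metric under the ranking scheme the speaker outlines; the column names and the two example rows are illustrative assumptions, not values from the FBI data.

```python
import pandas as pd

def weighted_incident_rate(df):
    # Incidents per 1,000 residents.
    rate = df["incidents"] / df["population"] * 1000
    # Rank populations so the largest state gets the highest rank (50 of 50)
    # and the smallest state gets rank 1; the weight is rank / max rank.
    rank = df["population"].rank(method="min")
    weight = rank / rank.max()
    return rate * weight

states = pd.DataFrame({
    "state": ["New Jersey", "Wyoming"],
    "incidents": [1000, 30],
    "population": [9_000_000, 580_000],
})
states["weighted_rate"] = weighted_incident_rate(states)
print(states)
```

The weight penalizes small populations, so a sparsely populated state only stands out if its raw rate is high enough to overcome that penalty, which is exactly the behavior the speaker describes.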
Ross Metusalem, JMP Systems Engineer, SAS   Text mining techniques enable extraction of quantitative information from unstructured text data. JMP Pro 16 has expanded the information it can extract from text thanks to two additions to Text Explorer’s capabilities: Sentiment Analysis and Term Selection. Sentiment Analysis quantifies the degree of positive or negative emotion expressed in a text by mapping that text’s words to a customizable dictionary of emotion-carrying terms and associated scores (e.g., “wonderful”: +90; “terrible”: -90). Resulting sentiment scores can be used for assessing people’s subjective feelings at scale, exploring subjective feelings in relation to objective concepts and enriching further analyses. Term Selection automatically finds terms most strongly predictive of a response variable by applying regularized regression capabilities from the Generalized Regression platform in JMP Pro, which is called from inside Text Explorer. Term Selection is useful for easily identifying relationships between an important outcome measure and the occurrence of specific terms in associated unstructured texts. This presentation will provide an overview of both Sentiment Analysis and Term Selection techniques, demonstrate their application to real-world data and share some best practices for using each effectively.     Auto-generated transcript...   Speaker Transcript Ross Metusalem Hello everyone, and thanks for taking some time to learn how JMP is expanding its text mining toolkit in JMP Pro 16 with   sentiment analysis and term selection.   I'm Ross Metusalem, a JMP systems engineer in the southeastern US, and I'm going to give you a sneak preview of these two new analyses, explain a little bit about how they work, and provide some best practices for using them.   So both of these analysis techniques are coming in Text Explorer, which for those who aren't familiar, this is JMP's   tool for analyzing what we call free or unstructured texts, so that is natural language texts.   And it's a what we call a text mining tool, so that is a tool for deriving quantitative information from free text so that we can   use other types of statistical or analytical tools to derive insights from the free text or maybe even use that text as inputs to other analyses.   So let's take a look at these two new text mining techniques that are coming to Text Explorer in JMP Pro 16, and we'll start with sentiment analysis.   Sentiment analysis at its core answers the question how emotionally positive or negative is a text.   And we're going to perform a sentiment analysis on the Beige Book, which is a recurring report from the US Federal Reserve Bank.   Now apologies for using a United States example at JMP Discovery Europe, but the the Beige Book does provide a nice demonstration of what sentiment analysis is all about.   So this is a monthly publication, it contains national level report, as well as 12 district level reports, that summarize economic conditions in those districts, and all of these are based on qualitative information, things like interviews and surveys.   And US policymakers can use the qualitative information in the Beige Book, along with quantitative information, you know, in traditional economic indicators to drive economic policy.   So you might think, well, we're talking about sentiment or emotion right now. I don't know that I expect economic reports to contain much emotion, but the Beige Book reports and much language does actually contain words that can carry or convey emotion.   
So let's take a look at an excerpt from one report. Here's a screenshot straight out of the new sentiment analysis platform. You'll notice some words highlighted, and these are what we'll call sentiment terms, that is, terms that we would argue have some emotional content to them. For example, at the top, "computer sales, on the other hand, have been severely depressed," where "severely depressed" is highlighted in purple, indicating that we consider that to convey negative emotion, which it seems to; if somebody describes computer sales as "severely depressed," it sounds like they mean for us to interpret that as certainly a bad thing. If we look down, we see in orange a couple of positive sentiment terms highlighted, like "improve" or "favorable." So we can look for words that we believe have positive or negative emotional content and highlight them accordingly: purple for negative, orange for positive. Some sentiment analysis keeps things at that level, so just a binary distinction, a positive text or a negative text. There are additional ways of performing sentiment analysis and, in particular, many ways try to quantify the degree of positivity or negativity, not just whether something is positive or negative. So consider this other example, and I'll point our attention right to the bottom here, where we can see a report of "poor sales." And I'm going to compare that with where we said "computer sales are severely depressed." Both of these are negative statements, but I think we would all agree that "severely depressed" sounds a lot more negative than just "poor" does. So we want to figure out not only whether the sentiment expressed is positive or negative, but how positive or negative it is, and that's what sentiment analysis in Text Explorer does. So how does it do it? Well, it uses a technique called lexical sentiment analysis that's based on some sentiment terms and associated scores. What we're seeing right now is an excerpt from what we'd call a sentiment dictionary that contains the terms and their scores. For example, the term "fantastic" has a score of positive 90 and the term "grim" at the bottom has a score of -75. So what we do is specify which terms we believe carry emotional content and the degree to which they're positive or negative on an arbitrary scale, here -100 to 100. And then we can find these terms in all of our documents and use them to score how positive or negative those documents are overall. If you think back to our example "severely depressed," that word "severely" takes the word "depressed" and, as we call it, intensifies it. It is an intensifier, or a multiplier, of the expressed sentiment, so we also have a dictionary of intensifiers and what they do to the sentiment expressed by a sentiment term. For example, we say "incredibly" multiplies the sentiment by a factor of 1.4, whereas "little" multiplies the sentiment by a factor of 0.3, so it actually attenuates the sentiment expressed a little. Now, finally, there's one other type of word we want to take into account, and that is negators, things like "no" and "not," and we treat these basically as polarity reversals. So "not fantastic" would be taking the score for "fantastic" and multiplying it by -1. And so this is a common way of doing sentiment analysis, again called lexical sentiment analysis. 
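To make the arithmetic concrete, here is a minimal sketch of this kind of lexical scoring. The dictionary values for "depressed" and "severely" and the simple one-word-lookback rule are assumptions for illustration; this is not JMP's implementation, only the general idea of terms, intensifiers, and negators.

```python
# Minimal sketch of lexical sentiment scoring: terms, intensifiers, negators.
# The dictionaries and the one-word-lookback rule are illustrative assumptions.
SENTIMENT = {"fantastic": 90, "favorable": 60, "improve": 40,
             "depressed": -60, "poor": -50, "grim": -75}
INTENSIFIERS = {"incredibly": 1.4, "severely": 1.4, "little": 0.3}
NEGATORS = {"no", "not"}

def doc_score(text):
    words = text.lower().split()
    scores = []
    for i, w in enumerate(words):
        if w in SENTIMENT:
            s = SENTIMENT[w]
            prev = words[i - 1] if i > 0 else ""
            if prev in INTENSIFIERS:      # e.g. "severely depressed" -> -60 * 1.4
                s *= INTENSIFIERS[prev]
            elif prev in NEGATORS:        # e.g. "not fantastic" flips the sign
                s *= -1
            scores.append(s)
    # Document score = average over all sentiment terms found (None if none).
    return sum(scores) / len(scores) if scores else None

print(doc_score("computer sales have been severely depressed"))        # -84.0
print(doc_score("conditions continue to improve and look favorable"))  # 50.0
```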
So what we do is we take sentiment terms that we find, we multiply them by any associated intensifier   or negators and then for each document, when we have all the sentiment scores for the individual terms that appear, we just average across all them to get a sentiment score for that document.   And JMP returns these scores to us in a number of useful ways. So this is a screenshot out of the sentiment analysis tool and we're going to be, you know, using this in just a moment.   But you can see, among other things, it gives us a distribution of sentiment scores across all of our documents. It gives us a list of all the sentiment terms and how frequently they occur.   And we even have the ability, as we'll see, to export the sentiment scores to use them in graphs or analyses.   And so I've actually made a couple graphs here to just try to see as an initial first pass, does the sentiment in the Beige Book reports actually align with economic events in ways that we think it should? You know, do we really have some validity to this   sentiment as some kind of economic indicator? And the answer looks like, yeah, probably.   Here I have a plot that I've made in Graph Builder; it's sentiment over time, so all the little gray dots are the individual reports   and the blue smoother kind of shows the trend of sentiment over time with that black line at zero showing neutral sentiment, at least according to our scoring scheme.   The red areas are   times of economic recession as officially listed by the Federal Reserve.   So you might notice sentiment seems to bottom out or there are troughs around recessions, but another thing you might notice   is that actually preceding each recession, we see a drop in sentiment either months or, in some cases, looks like even a couple years, in advance. And we don't see these big drops in sentiment   in situations where there wasn't a recession to follow. So maybe there's some validity to Beige Book sentiment as a leading indicator of a recession.   If we look at it geographically, we see things that make sense too. This is just one example from the analysis. We're looking at sentiment in the 12   Federal Reserve districts across time from 1999 to 2000 to 2001. This was the time of what's commonly called the Dotcom Bust, so this is when   there was a big bubble of tech companies and online businesses that were popping up and, eventually, many of them went under and there were some pretty severe economic consequences.   '99 to 2000 sentiment is growing, in fact sentiment is growing pretty strongly, it would appear,   in the San Francisco Federal Reserve district, which is where many of these companies are headquartered. And then in 2001 after the bust, the biggest drop we see all the way to negative sentiment in red here, again occurring in that San Francisco district.   So, just a quick graphical check on these Beige Book sentiment scores suggests that there's some real validity to them in terms of their ability to track with, maybe predict, some economic events, though of course, the latter, we need to look into more carefully.   But this is just one example of the potential use cases of sentiment analysis and there are a lot more.   One of the biggest application areas where it's being used right now is in consumer research, where people might, let's say, analyze   some consumer survey responses to identify what drives positive or negative opinions or reactions.   
But sentiment analysis can also be used in, say, product improvement where analyzing product reviews or warranty claims might help us find product features or issues that elicit strong emotional responses in our customers.   Looking at, say, customer support, we could analyze call center or chats...chat transcripts to   find some support practices that result in happy or unhappy customers. Maybe even public policy, we analyze open commentary to gauge the public's reaction to proposed or existing policies.   These are just a few domains where sentiment analysis can be applied. It's really applicable anywhere you have text that convey some emotion and that emotion might be informative.   So that's all I want to say up front. Now I'd like to spend a little bit of time walking you through how it works in JMP, so let's move on over to JMP.   Here we have the Beige Book data, so down at the bottom, you can see we have a little over 5,000 reports here, and we have the date of each report from   May 1972 October 2020, which of the 12 districts it's from, and then the text of the report. And you can see that these reports,   they're not just quick statements of you know, the economy is doing well or poorly, they they can get into some level of detail.   Now, before we dive into these data, I do just want to say thank you to somebody for the idea to analyze the Beige Book and for actually pulling down the data and getting it into JMP, in a   format ready to analyze. And that thanks goes to Xan Gregg who, if you don't know, Xan is a director on the JMP development team and the creator of Graph Builder, so thanks, Xan,for your help.   Alright, so let's let's quantify the degree of positive and negative emotion in all these reports. We'll use Text Explorer under the analyze menu.   Here we have our launch window. I'll take our text data, put it in the text columns role. A couple things to highlight before we get going.   Text Explorer supports multiple languages, but right now, sentiment analysis is available only in English, and one other thing I want to draw attention to is stemming right here.   So for those who do text analysis you're probably familiar with what stemming is, but for those who aren't familiar, stemming is a process whereby we kind of collapse multiple...   well, to keep it nontechnical...multiple versions of the same word together. Take "strong," "stronger," and "strongest." So these are three   versions of the same word "strong" and in some text mining techniques, you'd want to actually combine all those into one term and just say, oh, they all mean "strong" because that's kind of conceptually the same thing.   I'm going to leave stemming off here actually, and it's because...take "strongest," that describes as strong as something can get   versus "stronger," which says that you know it's strong, but there are still, you know, room for it to be even stronger.   So "strongest" should probably get a higher sentiment score than "stronger" should, and if I were to stem, I would lose the distinction between those words. Because I don't want to lose that distinction, I want to give them different sentiment scores. I'm going to keep stemming off here.   So I'll click OK.   And JMP now is going to tokenize our text, that is break it into all of its individual words and then count up how often each one occurs.   And here we have a list of all the terms and how frequent they are. So "sales" occurs over 46,000 times and we also have our phrase list over here. 
So the phrases are   sequences of two or more words that occur a lot, and sometimes we actually want to count those as terms in our   analysis. And for sentiment analysis, you would want to go through your phrase list and, let's say, maybe add "real estate," which is two words, but it really refers to, you know, property.   And I could add that. Now normally in text analysis, we'd also add what are called stop words,   that's words that don't really carry meaning in the context of our analysis and we'd want to exclude. Take "district." This happen...or the Beige Book uses   the word "district" frequently, just saying, you know, "this report from the Richmond district," something like that, it's not really meaningful.   But I'm actually not going to bother with stop words right here and that's because, if you remember,   back from our slides, we said that all of our sentiment scores are based on a dictionary, where we choose sentiment words and what score they should get.   And if we just leave "district" out, it inherently gets a score of 0 and doesn't affect our sentiment score, so I don't really need to declare it as a stop word.   So once we're ready, we would invoke text or, excuse me, sentiment analysis under the red triangle here.   So what JMP is doing right now, because we haven't declared any sentiment terms or otherwise, it's using a built-in sentiment dictionary to score all the documents. Here we get our scores out.   Now before navigating these results, we probably should customize our sentiment dictionary, so the sentiment bearing words and their scores. And that's because in different domains,   maybe with different people generating the text, certain words are going to bear different levels of sentiment or bear sentiment in one case and not another. So we want to   really pretty carefully and thoroughly create a sentiment dictionary that we feel accurately captured sentiment as it's conveyed by the language we're analyzing right now.   So JMP, like I said, uses a built-in dictionary and it's pretty sparse. So you can see it right here, it has some emoticons,   like these little smileys and whatnot, but then we have some pretty clear sentiment bearing language, like "abysmal" at -90.   Now it's it's probably not the case that somebody's going to use the word "abysmal" and not mean it in a negative sense, so we feel pretty good about that. But, you know, it's not terribly long list and we may want to add some terms to it.   So let's see how we do that, and one thing I can recommend is just looking at your data. You know, read some of the documents that you have and try to find some words that you think might be indicative of sentiment.   We actually have down here a tool that lets us peruse our documents and directly add sentiment terms from them. So here, I have a document list. You can see Document 1 is highlighted and then Document 1 is displayed below. I could select different documents to view them.   Now, if we look at Document 1, right off the bat, you might notice a couple potential sentiment terms like "pessimism" and "optimism."   Now you can see these aren't highlighted. These actually aren't included in the standard sentiment dictionary.   And a lot of nouns you'll find actually aren't, and that's because nouns like "pessimism" or "optimism" can be described in ways that suggests their presence or their absence, basically. So I could say, you know, "pessimism is declining" or   "there's a distinct lack of pessimism," "pessimism is no longer reported."   
And, in those cases, we wouldn't say "pessimism" is a negative thing. It's...so you want to be careful and think about words in context and how they're actually used before adding them to a sentiment dictionary.   For example, I could go back up to our term list here. I'm just going to show the filter,   look for "pessimism" and then show the text to have a look at how it's used. So we can see in this first example, "the mood of our directors varies from pessimism to optimism."   And the next one   "private conversations a deep mood of pessimism." If you actually read through, this is the typical use, so actually in the Beige Book, they don't seem to use the word pessimism in the ways that I might fear,   "optimism is increasing."   So I actually feel okay about adding "pessimism" here, so let's add it to our sentiment dictionary.   So if I just hover right over it,   you can see we bring up some options of what to do with it. So here I can, let's say, give it a score of -60.   And so now that will be added to our dictionary with that corresponding score, and it's triggering updated sentiment scoring in JMP. So that is, it's now looking for the word "pessimism" and adjusting all the document scores where it finds it.   So let's go back up now to take a look at our sentiment terms in more detail. If I scroll on down, you will find "pessimism"   right here with the score of -60 that I just gave it. Now I might want to actually...if you notice, "pessimistic" is, by default, has a score of -50, so maybe I just type -50 in here to make that consistent.   And I could but I'm not going to, just so that we don't trigger another update.   You'll also notice, right here, this list of possible sentiment terms. So these are terms that JMP has identified as maybe bearing sentiment, and you might want to take a look at them and consider them for inclusion in your dictionary.   For example, the word "strong" here, if you look at some of the document texts to the right, you might say, okay, this is clearly a positive thing. And if you've looked at a lot of these examples, it really stands out that   the word "strong" and correspondingly "weak" are words that these economists use a whole lot to talk about things that are good or bad about the current economy.   So I could add them, or add "strong" here by clicking on, let's say, positive 60 in the buttons up there. Again, I won't right now, just for the sake of expediting our look at sentiment analysis.   So we could go through, just like our texts down below, we could go through our sentiment term list here to choose some good candidates.   Under the red triangle, we also can just manage the sentiment terms more directly, so that is just in the full   term management lists that we might be used to for a Text Explorer user, so like the phrase management and the stop word management.   You can see we've added one new term local to this particular analysis, in addition to all of our built-in terms. Of course, we could declare exceptions too, if we want to just maybe not actually include some of those.   And importantly, you can import and export your sentiment dictionary as well. Another way to declare sentiment terms is to consult with subject matter experts. You know,   economists would probably have a whole lot to say about the types of words they would expect to see that would clearly convey   positive or negative emotion in these reports. 
And if we could talk to them, we would want to know what they have to say about that, and we might even be able to derive a dictionary with them in, say, a tab-separated file and then just import it here. And then, of course, once we make a dictionary we feel good about, we should export it and save it so that it's easy to apply again in the future. So that's a little bit about how you would actually curate your sentiment dictionary. It would also be important to curate your intensifier terms and your negation terms, and for the negators you don't see scores here, because these are just polarity reversals. Just to show you a little bit more about what that actually looks like, let's take a look at sentiment here, so we can see instances in which JMP has found the word "not" before some of these sentiment terms and actually applied the appropriate score. So at the top there, "not acceptable" gets a score of -25. I show you that just to draw your attention to the fact that these negators and intensifiers are being applied automatically by JMP. But anyway, let's move on from talking about how to set the analysis up to actually looking at the results. So I'm going to bring up a version of the analysis that's already done, that is, I've already curated the sentiment dictionary, and we can actually start to interpret the results that we get out. So we have our high-level summary up here: we have more positive than negative documents, and as we discussed before, we can see how many of each. In fact, at the bottom of that table on the left, you see that we have one document that has no sentiment expressed in it whatsoever. We also have the list of sentiment terms, with "strong" occurring 14,000 times and "weak" occurring approximately 4,500 times, and looking at these either by their counts or their scores, looking at the most negative and positive, even looking at them in context, can be pretty informative in and of itself. Especially in, say, a domain like consumer research, if you want to know when people are feeling positively or expressing positivity about our brand or some products that we have, what type of language are they using? Maybe that would inform marketing efforts, let's say. This list of sentiment terms can be highly informative in that regard. Now, these reports cover a number of different areas of the economy: manufacturing, tourism, investments. And sometimes we want to zero in on one of those subdomains in particular, what we might call a feature. And if I go to this features tab in sentiment analysis, I'll click search. JMP finds some words that commonly occur with sentiment terms and asks if you want to maybe dive into the sentiment with respect to that feature. So take, for example, "sales." We can see "sales were reported weak," "weakening in sales," "sales are reported to be holding up well" and so forth. So if I just score this selected feature, now what JMP will do is provide sentiment scores only with respect to mentions of "sales" inside these larger reports, and this is going to help us refine our analysis or focus it on a really well-defined subdomain. And that's particularly important. It could be the case that the domain of the language that we're analyzing isn't so well restricted. Take, for example, product reviews. You're interested in how positive or negative people feel about the product, but they might also talk about, say, the shipping, and you don't necessarily care if they weren't too happy with the shipping, mainly because it's beyond your control. 
You wouldn't want to just include a bunch of reviews that also comment on that shipping. And so it's important to consider the domain of analysis and restrict it appropriately and the feature finder here is one way of doing that.   So you can see now that I've scored on "sales," we have a very different distribution of positive and negative documents. We have more documents that don't have a sentiment score because they   simply don't talk about sales or don't use emotional language to discuss it, and we have a different list of sentiment terms now capturing sales in particular.   Let me remove this.   One thing I realized I forgot to mention, I mentioned it briefly, is how these overall document scores that we've been looking at are calculated, and I said that they're the average of all the sentiment terms of...   that occurred in a particular document. So let's look at Document 1. I'd just like to show you that   if you're ever wondering where does this score come from, let's say, -20, you can just run your mouse right over and it'll show you a list of all the sentiment terms that appeared.   And you can see, here we have 16 of them, including all at the bottom, "real bright spot," which was a +78 and then, if you divide...add all those scores up. divide by 78...   or divide by 16, excuse me, then you get an average sentiment of -20. And this is one of two ways to provide overall scores. Another one is a min max scoring, so differences between minimum and maximum   sentiments expressed in the text.   Now we can get a lot of information from looking at just this report, you know, obviously sentiment scores, the most common terms.   But oftentimes we want to take the sentiments and use them in some other way, like   look at them graphically, like we did back in the slides. So when it comes time for that part of your analysis, just head on up to the red triangle here   and save the document scores. And these will save those scores back to your data table so that you can enter them into further analyses or graph them, whatever it is you want to do.   So that's a sneak preview of sentiment analysis coming to Text Explorer in JMP Pro 16. The take-home idea is that sentiment analysis uses a sentiment dictionary that   you set up to provide scores corresponding to the positive and negative emotional content of each document, and then from there, you can use that information in any way that's informative to you.   So we'll leave sentiment analysis behind now and I'm going to move on back to our slides to talk about the other technique coming to Text Explorer soon.   And that is term selection, where term selection answers a different question, and that is, which terms are most strongly associated with some important variable or variable that I care about?   We're going to stick with the Beige Book.   We're going to ask which words are statistically associated with recessions. So in the graph here, we have over time, the percent change   GDP (gross domestic product) quarter by quarter, where blue shows economic growth, red shows economic contraction. And we might wonder, well, what   terms, as they appear in the Beige Book, might be statistically associated with these periods of economic downturn? For example, a few of them right here.   You know, why would we want to associate specific terms in the Beige Book with periods of economic downturn?   Well, it could potentially be informative in and of itself to obtain a list of terms. 
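To make that scoring rule concrete, here is a minimal sketch (in Python, outside JMP; the term scores are hypothetical) of the two overall document summaries just described: the mean of all sentiment term scores found in a document, and a min/max style summary based on the most negative and most positive sentiments expressed:

```python
# Hypothetical sentiment term scores found in one document
term_scores = [-50, -25, 10, 78]

mean_score = sum(term_scores) / len(term_scores)   # average sentiment for the document
extremes = (min(term_scores), max(term_scores))    # most negative and most positive sentiment

print(round(mean_score, 2))   # 3.25
print(extremes)               # (-50, 78)
```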
You know, I might find some potentially, you know, subtle   drivers of or effects of recessions that I might not be aware of or easily capture in quantitative data.   I might also find some words for further analysis. I might...I might find some   potential sentiment terms, some terms that are being used when the economy is doing particularly poorly that I missed my first time around when I was doing my sentiment analysis.   Or maybe I could find some words that are so strongly associated with recessions that I think I might be able to use them in a predictive model to try to figure out when recessions might happen in the future.   So there are a few different reasons why we might want to know which words are most strongly associated with recessions.   So, how does this work in JMP? Well, we we basically build a regression model where the outcome variable is that variable we care about, recessions, and the inputs are all the different words.   The data as entered into the model takes the form of a document term matrix, where each row corresponds to   one document or one Beige Book report, and then the columns capture the words that occur in that report. Here we have the column "weak" highlighted and it says "binary," which means that   it's going to contain 1s and 0s; a 1 indicates that that report contained to the word "weak" and 0 indicates that that report didn't contain the word "weak." And this is one of several ways we could kind of score the documents, but we'll we'll stick with this binary for now.   So we take this information and we enter it into a regression model. So here's the what the launch would look like.   We have our recession as our Y variable and that's just kind of a yes or no variable, and then we have all of these binary term predictors entered in as our model effects.   And then we're going to use JMP Pro's generalized regression tool   in order to build the model, and that's because generalized regression or GenReg, as we call it, includes automatic variable selection techniques. So if you're familiar with   regularized regression, we're talking like the Lasso or the elastic net. And if you don't know what that means, that's totally fine. The idea is that it will automatically   find which terms are most strongly associated with the outcome variable "recession," and then ones that it doesn't think are associated with it, it will zero those out.   And this allows us to look at relationships between "recession" and perhaps you know hundreds, thousands of possible words that that would be associated with them.   So what do we get when we run the analysis?   We get a model out. So what we have here is the equation for that model. Don't worry about it too much. the idea is that we say   the log odds of recession, so just it's a function of the probability that we're in a recession and when the Beige Book is issued is a function of   all the different words that might occur in the Beige Book report.   And you can see, we have, you know, the effect of the occurrence of the word "pandemic" with a coefficient of 1.94.   That just means that the log odds of "recession" go up by 1.94 if the Beige Book report mentions the word "pandemic." Then we see minus 1.02 times "gain." Well, that means if the Beige Book report mentions the word "gain," then the probability of recession...   or the log odds of recession drops by 1.02.   
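The modeling step described above can be sketched outside JMP as well. JMP Pro does this with its Generalized Regression platform; in the hedged example below, scikit-learn's L1-penalized logistic regression stands in for it, the documents and recession labels are invented, and a binary CountVectorizer plays the role of the binary document term matrix:

```python
# A hedged sketch of term selection; not JMP's implementation.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

docs = [
    "sales weakened and foreclosures rose during the pandemic",
    "manufacturing reported gains and strong sales",
    "activity was postponed and cancellations increased",
    "tourism strengthened with solid gains in investment",
]
recession = [1, 0, 1, 0]   # 1 = recession, 0 = no recession (hypothetical labels)

vec = CountVectorizer(binary=True)   # binary weighting: did the term occur at all?
X = vec.fit_transform(docs)          # the document-term matrix

# The L1 (lasso) penalty zeroes out terms with no detectable association,
# which is the "selection" part of term selection; C is loosened for this tiny example.
model = LogisticRegression(penalty="l1", solver="liblinear", C=10.0)
model.fit(X, recession)

for term, coef in zip(vec.get_feature_names_out(), model.coef_[0]):
    if coef != 0:
        print(f"{term}: {coef:+.2f}")   # positive coefficient -> higher log odds of recession
```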
So we get out of that are a list of terms that are strongly associated with an increase in the probability of recession, things like "pandemic," "postponement," "cancellation," "foreclosures."   And we also get a list of terms that are associated with a decreasing probability of recession, so like "gain," "strengthen," "competition."   We also see "manufacturing" right there, but it's got a relatively small coefficient, about -.2.   And you'll actually notice here, and if we if we look at a graphical representation of all the terms that are selected in this analysis, you don't see too many specific domains like "manufacturing,"   "tourism," "investments" and all that. That's because those things are always talked about, whether we're in a recession or not, so what we're really looking for words that are used,   you know, when we're in a recession more often than you would expect by chance. So we have...for example, those are   "pandemic" being the most predictive. Makes a lot of sense. We're not talking about pandemics at all until pretty recently and then we've also experienced the recession recently, so we've picked up on that pretty clearly.   Then we have a few others in this, kind of, second tier, so that's "postponed" "cancel," "foreclosed," "deteriorate," "pessimistic."   And it's kind of interesting, this "postponement" and "cancellation" being associated with recessions. It makes sense, you really want to talk about postponing, say, economic activity   when a recession is happening, or at least that's perhaps a reliable trend. It's...so that that's insight, in and of itself. In fact, I   mean, I couldn't tell you how the Federal Reserve tracks postponing or canceling of economic activities, but the the fact that those terms, get flagged in this analysis suggests that's something probably worth tracking.   Alright, so that's term selection. We actually get this nice list of terms associated with recessions out and we can see the relative strength of association. Now let's actually see that briefly here, how it's done in JMP.   So I'm gonna head back on over to JMP and what we're going to do is pull up a slightly different data table. It's still Beige Book data, though, now we have just the national reports.   And we have this accompanying variable whether or not the US was in a recession at the time. And of course there's some auto correlation in these data. I mean, if we're in a recession last month, it's more likely we're going to be in a recession this month than if we weren't.   And you know that typically could be an issue for regression based analyses, but this is purely exploratory. We're not too too concerned with it.   So I'm going to just pull Text Explorer up from a table scripts just because we've kind of seen how it's launched before.   Note though that I've done some phrasing already, as we did before. I've also added some stop words, you can't see here, but words that I don't want them necessarily returned by this analysis.   And I've turned on stemming, which is what those little dots in the term list mean. For example, this for "increase" now is actually collapsing across "increases," "increasing," "increasingly."   And that's because now I, kind of, consider those all the same concept, and I just want to know if, you know, that concept is associated with recessions.   So to invoke term selection, we'll just go to the red triangle, and I'll select it here.   
We get a control window popping up first, or I should say section, where we select which variable we care about, that's recessions. Select the target level, so I want this to be in terms of the probability of recession, as opposed to the probability of no recession.   I can specify some model settings. If you're familiar with GenReg, you'll see that you can choose whether you want early stopping, whether you want one of two different   penalizations to perform variable selection, what statistic you want to perform validation. And if that stuff is new to you, don't worry about it too much. The default settings are good way to go at first.   We have our weighting, if you remember, we had the 1s and 0s in that column for "weak," just saying whether the word occurred in a document or not, but you can select what you want. So   for example, frequency is not, did "weak" occur or not, it's how many times did it occur. And this kind of affects the way you would interpret your results. We're going to stick with binary for now.   But I'm going to say, I want to check out the 1,000 most frequent terms, instead of the 400 by default, which you can see, that's a lot more than 436, and normally you can't fit a model with 1,000 Xs but only 436   observations, but thanks to the   automatic variable selection in generalized regression, this isn't a problem. So once again it selects which of these thousand terms are most strongly related, hence the name term selection.   So I'm gonna run this. You can see what has happened is JMP has called out to the generalized regression platform and returned these results, where up here, we see some information about the model as it was fit. For example, we have 37 terms that were returned.   Let me just move that over. Because over here on the right is where we find some really valuable information. This is the list of terms most strongly associated with recession.   Now I'll sort these by the coefficient to find those most strongly associated with the probability of recession, so once again that's "pandemic" "postponement" "cancellations" and, as you might expect, at this point if I click on one of these, it'll update these previews   or these text snippets down below, so we can actually see this word in context.   So this list of terms in itself could be incredibly valuable. You, you might learn some things about specific terms or concepts that are important that you might not have known before. You can also use these terms in predictive models.   Now a few other things to note.   You can see down here, we have once again a table by each individual document, but instead of sentiment scores, we now have basically scores from the model. We have for each one   the...   what we call the positive contribution, so this is the contribution of positive coefficients predicting the probability of recession. Here we have the ones on the negative end.   And then we even have the probability of recession from the model, 71.8% here and then what we actually observed.   And we're not building a predictive model here necessarily, that is, I'm not going to use this model to try to predict recessions. I mean, I have all kinds of economic indicator variables I would use   too, but this is a good way to basically sanity check your model. Does it look like it's getting some of the its predictions right?   Because if it's not, then you might not trust the terms that it returns. You also have plenty of other information to try to make that judgment. 
I mean, we have some fits statistics, like the area under the curve up here.   Or we can even go into the generalized regression platform, with all of its normal options for assessing your model there further as well.   I'm not going to get into the details there, but all of that is available to you so that you can assess your model, tweak your model how you like, to make sure you have a list of terms that you feel good about.   Now you see right here, we have this, under the summary, this list of models and that's because you might actually want to run multiple models. So if I go back to the model...oh, excuse me...if we go back to our   settings up here, I could actually run a slightly different model. Maybe, for example, I know that the BIC is a little more conservative than the AICc and I want to return fewer terms, maybe did an analysis that returned 900 terms and you're a little overwhelmed.   So I'll click run and build another model using that instead.   And now we have that model added to our list here, and I can switch between these models to see the results for each one. In this case, we've returned only 14 terms, instead of 37 and I would go down to assess them below.   So two big outputs you would get from this, of course, is this term list. If you want to save that out, these are just important terms to you and you want to keep track of them, just right click and make this into a data table. Now I have a JMP table   with the terms, their coefficients and other information.   And   if what you want to do is actually kind of write this back to your original data table, maybe, so that you can use the information in some other kind of analysis or predictive model,   just head up to term selection and say that you want to save the term scores as a document term matrix, which if I bring our data table back here, you can see I've now written to their   columns for each of these terms that have been selected. In this case filled in with their coefficient values, and now I can use this   quantitative information however I like.   That's just a bit then about term selection. Again, the big idea here is I have found a list of terms that are related to a variable I care about and I even have, through their coefficients, information about how strong that relationship might be.   So let's just wrap up then. We've covered two techniques. We just saw term selection, finding those important words. Before that we reviewed sentiment analysis, all about   quantifying the degree of positive or negative emotion in a text. These are two new text mining techniques coming to JMP Pro 16's Text Explorer. We're really excited for you to get your hands on them and look forward to your feedback.
Bill Worley, JMP Senior Global Enablement Engineer, SAS   Pre-processing spectroscopic data is an important first step in preparing your data for analysis in any data analysis tool. We will review and demonstrate Standard Normal Variate (SNV) and Savitzky-Golay (SG) smoothing using JMP. These pre-processing tools are fairly standard throughout industry and are a first step in building predictive models from raw spectroscopic data. Special thanks to Ian Cox for sharing his SG add-in.     Auto-generated transcript...   Speaker Transcript Bill Worley Hello everyone, this is Bill Worley, JMP systems engineer. I've been with JMP almost seven years. Today we'll be talking about pre-processing spectroscopic data for analysis. I'm calling this a prequel because I've already given a talk on analyzing spectral data within JMP, but the pre-processing is an important first step and we will show you why. Let's get into it. So let's talk about the abstract a little bit here. Pre-processing spectroscopic data is really an important first step in preparing your data, any data, for analysis, really. We're going to review and demonstrate the standard normal variate and the Savitzky-Golay smoothing tools in JMP. These pre-processing tools are fairly standard throughout the chemical world, the chemical industry, and really anybody who's doing spectroscopy and analyzing spectroscopic data would know about the standard normal variate, the Savitzky-Golay, and how they're used. So again, it's an important first step in helping you build predictive models from your raw spectroscopic data. So, with a little background: spectral data can be very messy. Pre-processing is used to help smooth, filter, and baseline correct prior to building these predictive models, or any predictive models you might want to build. The Savitzky-Golay filtering and standard normal variate are tools that are used in JMP. The Savitzky-Golay was an add-in that was developed by Ian Cox, so thanks to Ian for that. Standard normal variate requires some sort of formula and we'll show you how to build that into JMP. And then the last line: the Graph Builder is also an invaluable tool for visualizing your spectroscopic data and can be used as an alternative platform for smoothing your data. But again, we'll get into that as we get further into the talk. Why do we do this? Why would you want to do any kind of pre-processing? Well, if you look at that first, top group of spectra right there, that is the raw data from 100 spectra and you can't really tell any difference between, you know, a mixture of 100% starch or 100% gluten. You can see some differences, you just can't tell what's what. Well, what we do from there is the Savitzky-Golay smoothing and filtering, and then you take that first derivative and that gives you a much cleaner set of spectra, nicely grouped, but still not completely where you'd like to be. And then, finally, after you do that standard normal variate smoothing, the spectra are much cleaner, much easier to see where the peaks are. And you can see where the differences are in the different groupings of spectra. So again, there are 20 spectra for each one of these red lines to gray lines to blue lines, and it's definitely much cleaner. You can see where the peaks are for the different mixtures.
Okay, so with Savitzky-Golay filtering, it's a digital filter supplied to your spectral data for the purpose of smoothing. The nice thing about it is it doesn't alter the overall spectra itself, the where the peaks are and everything like that. So it's a nice tool there.   It uses convolution for smoothing and that's based on a linear least squares formula.   That first and second derivatives are also...you're also able to tame those from the smooth signal.   And filtering of any unimportant data on either end of the spectra can be done as well. If you get into situations where, you know, you have something at the beginning, it's just unimportant and it's only going to mess you up as you go forward and same thing, on the other end.   Graph Builder allows you to do some even internal smoothing or internal filtering that you cannot do with the Savitzky-Golay.   add-in. So you can use the local data filter with the Graph Builder and further define areas of interest.   So,   let's go on. Savitzky-Golay   analysis is that we've got the...we've done...if we do that analysis, this is a grouping of spectra. So we've got the smooth,   the first derivative, now stretched out a little bit. And we put the zero line in there. And with the first derivative, you can   tell where the peaks are, based on the zero points are or crossing of the zero points, right, so that gives you an idea of where the peaks are for any given   group of spectra, set of spectra. And for the second derivative, now that's a little bit different. It helps flatten the baseline and then, if you have peaks that are overlapping, it helps you   further develop or better see where those are overlapping and you go from a positive to a negative, which is an indication of overlapping peaks and where they might be. Okay, and that's how you use that tool.   The standard normal variate itself is again the standardization of the data with respect to the individual spectra. It can be used alone or as an added smoothing tool and we're using it as an added tool today.   It's used...   it's used for baseline shift correction and correcting for global intensity variations. The nice thing is, it removes baseline variation without altering the shape of the spectra, right.   And the data, the thing about this is, and you'll see this when we do it, is that the data must be stacked, okay.   All right, this is just one of the formulas that   might be used in setting up the standard normal variate formula, and I'll show you how to do that. There's a couple of different ways, but it gets a little bit involved and I'll show you how to do that. The important part here is that   we have the standard formula here for the column standardized, but we have to make sure that we use it a by variable and that's the spectra itself. So   there's a little bit of a trick to adding that.   Right, so if it did the standard normal variate on the raw data only, this is what it would look like.   We can see the groupings, we can see, you know, kind of how things break out, but if we did that standard normal variate after we do that first derivative, it's much cleaner. You get a much better idea of where the spectra or where they overlap, where they, you know,   where you might have to do a little bit...where that second derivative would come into play, where you have overlapping peaks maybe here, right,   and maybe back over here. So   that's where the this really helps build the better vision of what you're seeing. 
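As a rough sketch of the two pre-processing steps being described (done here in Python with NumPy and SciPy on simulated spectra, not in JMP), Savitzky-Golay is a moving-window least-squares polynomial smoother that can also return derivatives, and SNV is simply a per-spectrum standardization to mean 0 and standard deviation 1:

```python
# A sketch of Savitzky-Golay smoothing/derivatives and SNV on simulated spectra.
import numpy as np
from scipy.signal import savgol_filter

rng = np.random.default_rng(0)
wavelengths = np.linspace(1100, 2500, 200)
spectra = np.sin(wavelengths / 150.0) + 0.05 * rng.standard_normal((20, 200))  # 20 noisy spectra

# Savitzky-Golay: local least-squares polynomial fit in a moving window.
# deriv=1 returns the smoothed first derivative; peak positions are not shifted.
first_deriv = savgol_filter(spectra, window_length=15, polyorder=2, deriv=1, axis=1)

# SNV: standardize each spectrum (each row) to mean 0 and standard deviation 1,
# which removes baseline offsets and global intensity differences.
snv = (first_deriv - first_deriv.mean(axis=1, keepdims=True)) / first_deriv.std(axis=1, keepdims=True)
print(snv.shape)  # (20, 200)
```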
And that zero line, again: as you cross that zero line, it's an indication of peaks, of where the peaks are, using the first derivative. Alright, so let's get into a demo. Alright, get out of there. So let's talk about visualizing the data first. Remember, whenever you get a new set of data, the best thing to do is pull it in and visualize it. You know, plot that data, get a better idea of what it looks like and what you're up against. And let's go down here and pull up our gluten raw data. Right, so we've imported the data. Now we need to do some visualization on it, so let's go to Graph Builder. Right, and the first thing we're going to do, move this up a little bit, is go to raw data. Right, now we're gonna pull that in. And this is the point data, so this is what we saw before we did any smoothing, before we did anything else. That was that line data and, if I change that to a parallel plot, right, that gives me the line data for the spectra. Again, we can't see a lot of what's going on there, but it gives us a better idea of where things are falling out. We can do some things down here if we need to, something called parallel merged, and you'll see this with other data. But that didn't change anything. It didn't pull anything together like you'll see with some of the other data, so I'm going to close that. Right, and one of the other tools that I like to use when I'm visualizing the data is the model driven multivariate control chart, so let's do that first. So we go to Analyze, Quality and Process, and Model Driven Multivariate Control Chart. Move this up. Right, I don't have any historical data, so I say Okay. And what this allows us to do is look at the data and say, okay, we can see groupings of spectra, right. And for the most part, and I just want to point this out, we're able to explain most of the variation that we're seeing, at least up to 85% of the variation, with two principal components, okay. And that's using this Hotelling's T squared information we've got here. And this also tells us where things might be out of control for a given spectrum, right. Why is it different than the rest of the group? So if I hover over one of these points, I can pull that out, right. If I hover over one of the out of control points, again, pull that out, and now we can see, wow, those really are different, you know. There's something different about this spectrum that we have to, you know, maybe better understand. In this case, we already have a better understanding, because we know that they're mixtures, but we want to see that. And let's pull this one out too. Alright, so that's one more again. So between those three spectra, we've got quite a variety of, you know, the mixture. So we've got three of the groups and we could put the other two in there as well, just to see. But we'll leave it at that for now. Okay, so that's the model driven multivariate control chart. The other tool that I like to use is under multivariate methods and it's called multidimensional scaling. And the reason I like to use this, especially for systems like this, is that we've got our raw data again, right, and we need to... let me set this up first and I'll tell you.
We've got two dimensions here, right, and that's just based on what we saw before, but the reason I like to use is this allows us to do some grouping or visualize some grouping of the data itself. Alright,   let's do that. This takes a second to come up.   And we can see that we've got five distinct groups of spectra, right, so each one of those has to do with a group of our mixture of starch and gluten. If I highlight this grouping right here, that's the first group, which is all, you know, 100% starch.   The middle group is a 50/50 blend and this group down here is 100% gluten. Right, so what that does is it also highlights the data back in the data table, just like shown there, so that   interactivity is also nice. What this also shows us is that we're doing a fairly good job of already...already of being able to break out what the different groups are. If we look at the Shepard Diagram,   we have R square of 1 and   straight line, alright. So again, a good way to visualize it, and there's more you can do with this, but we'll leave that for another time.   Okay, so far,   we've brought the data in, we visualized it, right, and now we need to do some pre processing.   And what we're going to do, I've got these, more or less, here's placeholders so we're going to do some Savitzky-Golay analysis and we're going to do some standard normal varite, at least on the first derivative, okay.   Alright, so let me clean this up a little bit.   Right, close that up.   And again we're going to take that raw data and we're going to use that Savitzky-Golay add-in to do the analysis. So that's under add-ins for me.   And this is a free add-in, so you can get it and we'll put it out there for you. So this is again, we're going to take the raw data,   right, and if you want to learn more about what Savitzky-Golay is all about, we have a Wikipedia link here so you can use that. But we're going to use this to help smooth the data at every wavelength, right, and then take that first and second derivative. And I say, okay.   And there we go, and we've got our smooth data here.   And one thing, you know, you can adjust this as needed, but what we want to do is kind of take this,   widen out that window a little bit.   Alrigh.   Do that one.   And you can play around with the polynomials here, the order of polynomial fit.   In this case, it really doesn't make much difference to go to, what, an eighth order polynomial. A second one...the second order does just fine. Alright, and then we can take that data from our first derivative.   Say save smooth data, right, and we can save that back to the data table. And I've already done that, so I don't have to show you that process, but just know that the data is then taken back to the data table from there, okay.   Let's close that out.   Alright, and we're good to go now, but now we need to do that last step, right. So let's look at...before we go there, let's go back and look at   our Savitzky-Golay data. So this is the first derivative data. I would again want to look at this under the   Graph Builder, right. And see that's...just doesn't look very promising there, but let's do parallel plot,   makes it a little better. And clean it up and then let's do this combined scales parallel merge, and this brings the data into a much more   visual...a better visual than we had before, right. We can play the...play with that and again we could add our own zero line here to get a better understanding of where the peaks cross.   But that's that first derivative data. 
It's...you can see the groups, but they're not, you know, they're not completely separated like you'd really like to have them to get a better idea, okay.   So we're set there. That data is back in the data table. I'm going to close that out, and then remember I said, you have to stack the data for the...to do the standard normal variate, so we need to do that. Let's subset out some data here first.   Alright, so   subset all of these out.   Tables. Subset.   All rows. Selected columns, and say okay. Alright, so now we're good to go there.   Right.   Well, at least, we are...we've got the data separated out, right. Now we have to do...stack the data, right. And again, this is to be able to put the filter to all the data that we need for the formula.   So under Tables. Stack.   And these are our columns that we're stacking.   Right, and stack by row.   I'm going to select the columns that we want to keep.   On top of that. And after,   I'm sorry...I'm just clicking around just to get this setup, but you'll get the idea. I'll leave this up for a second for folks to see what it takes to get there. So I've saved this.   You know, see the untitled 12 now. That's our stacked data, right, and now we can perform that standard normal variate. A quick way to do that is to make a new column,   right click.   And I have to build this formula from here, from the formula editor. I'm going to use this statistical tool called   column standardize, right. And if I hover over this, over here, this gives you an idea of what what you're doing with that. What does standardized means, right. So we're going to...   a mean of 0 and a standard deviation of 1, so that's what we're trying to get with the data and with respect to all the given spectra. So we need to take the data   here and then, remember I said we had to add that bivariable, which is, in this case, going to be the spectra. I'm going to say okay.   Alright, and that gives us a data set. Now there's a simpler way to do this, or what I what I think is a simpler way, but it can also be...   it's a little bit more time-consuming sometimes. But if I right-click on the spectra right here, and again, remember i'm doing this because we need a bivariable. So I'm going to go to new column info and make that group by.   And I'm going to take my data column and I'm going to right-click on that and I'm going to do a new formula column.   We're going to slide over to distributional and remember, standardized. We're...so we want that   mean of 0; standard deviation of 1. Select that and then that builds out that new column. The nice thing about this is that the data matches, so I can breathe a sigh of relief that I got it right.   Right, so i've got that. Now, this would require from here, I'm going to click the that column,   that we take this data, split it and then put it back into the regular data table or the initial data table. One thing I want to show you with this is that   once you have the data stacked, you can do some other visualizations, right. So if I go to the...back to the Graph Builder right, I want to take this data   into the   Y, and then i'm gonna pull the label over, and this is a cool way to see where all the variation is. So these are all the,   you know, box plots for where the...you're seeing the variation in the data, so that's just a really nice way of being able to visualize what's going on.   Another powerful way of using the Graph Builder, and the data has to be stacked in this case to do that. 
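For readers who want the same "stack, then standardize within each spectrum" step outside JMP, here is a minimal pandas sketch; the column names and values are made up, and it parallels the Col Standardize formula with the spectrum as the by-variable described above:

```python
# A minimal pandas sketch of SNV on stacked data: standardize within each spectrum.
import pandas as pd

stacked = pd.DataFrame({
    "spectrum": ["s1"] * 4 + ["s2"] * 4,
    "absorbance": [0.10, 0.20, 0.40, 0.30, 1.10, 1.30, 1.70, 1.50],
})

# Within each spectrum, subtract that spectrum's mean and divide by its standard deviation
stacked["snv"] = stacked.groupby("spectrum")["absorbance"].transform(
    lambda x: (x - x.mean()) / x.std()
)
print(stacked)
```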
Now, what you can also do is   if you pull in the point data, let's take off that, put the point data in there and then we could do some smoothing. Now that's an individual smooth so that's an average...for every...for all the data. What we would like to do is do that   Savitzky-Golay smoothing, so I need to pull in   the overlay, right. So that's all the data. Now it's going to be hard to see, but I'm going to do this, hopefully you can see this, if I select under smoother, we're going to switch to Savitzky-Golay.   Right, and I think you're...hopefully, you saw that that that now fits that line, that smoothing line right there.   That's another way of smoothing the data, right, using the Savitzky-Golay. And then you can change this to a quadratic, you can do some local trimming here, and then ultimately, you could save that back to the data table.   I'm going to leave it at that for now.   Right and remember, we had to split this data, save it back to the data table. I'm going to forego the splitting step and then just go back to the data...the original data table.   I don't need that.   Right and then when I   look at the data again under Graph Builder,   I may take this standard normal variate group,   drop that in. Again, doesn't look very promising but we'll do the parallel plot and then right-click.   Combine those scales.   And that's basically our finished smoothing, alright. So we've got it all set up. Again we could add a zero line here,   if needed. Let's go ahead and add that.   Zero and we'll make that some sort of green,   there's that, add it.   Say Okay, and then again, you would use this to help you figure out where your peaks are.   Based on the peak maximum as I crossed that zero point, okay.   So pretty much, that's it. Let's go back to our   PowerPoint.   So in conclusion, pre processing...we needed to pre process the data...pre process the data to get it cleaner, to get it, you know, much easier to work with.   Where also with JMP...   JMP is making this...   we're working to make this a much easier process. Pre processing in JMP 16 is going to be supported by add-ins and built-in capability.   We're looking forward to JMP 17 where pre processing, what's going to be offered in the Functional Data Explorer with much more sophisticated spectral and baseline smoothing options, along with some peak detection and selection.   For analyzing spectral data, please, there...there was a talk that I did back in 2020 for the Discovery   Munich. That was a virtual talk, but that will give you a good idea of how to use JMP to analyze your spectral data.   And with that, I'll say thank you and, hopefully, you all found...find this useful. Please let me know.
Christopher Gotwalt, JMP Director of Statistical R&D, SAS   There are often constraints among the factors in experiments that are important not to violate, but are difficult to describe in mathematical form. These constraints can be important for many reasons. If you are baking bread, there are combinations of time and temperature that you know will lead to inedible chunks of carbon. Another situation is when there are factor combinations that are physically impossible, like attaining high pressure at low temperature. In this presentation, we illustrate a simple workflow of creating a simulated dataset of candidate factor values. From there, we use the interactive tools in JMP's data visualization platforms in combination with AutoRecalc to identify a physically realizable set of potential factor combinations that is supplied to the new Candidate Set Design capability in JMP 16. This then identifies the optimal subset of these filtered factor settings to run in the experiment. We also illustrate the Candidate Set Designer's use on historical process data, achieving designs that maximize information content while respecting the internal correlation structure of the process variables. Our approach is simple and easy to teach. It makes setting up experiments with constraints much more accessible to practitioners with any amount of DOE experience.       Auto-generated transcript...   Transcript Hello, Chris Gotwalt here. Today, we're going to be constructing the history of graphic paradoxes and... oh wait, wrong topic. Actually we're going to be talking about candidate set designs, tailoring DOE constraints to the problem. So industrial experimentation for product and process improvement has a long history with many threads that I admit I only know a tiny sliver of. The idea of using observation for product and process innovation is as old as humanity itself. It received renewed focus during the Renaissance and Scientific Revolution. During the subsequent Industrial Revolution, science and industry began to operate more and more in lockstep. In the early 20th century, Edison's lab was an industrial innovation on a factory scale, but it was done, to my knowledge, outside of modern experimental traditions. Not long after R.A. Fisher introduced concepts like blocking and randomization, his associate and then son-in-law, George Box, developed what is now probably the dominant paradigm in design of experiments, with the most popular book being Statistics for Experimenters by Box, Hunter and Hunter. The methods described in Box, Hunter and Hunter are what I call the taxonomical approach to design. So suppose you have a product or process you want to improve. You think through the things you can change, the knobs you can turn, like temperature, pressure, time, ingredients you can use or processing methods that you can use. These things become your factors. Then you think about whether they are continuous or nominal, and if they are nominal, how many levels they take, or the range over which you're willing to vary them. If a factor is continuous, then you figure out the name of the design that most easily matches up to the problem and resources, that fits your budget. That design will have a name like a Box-Behnken design, a fractional factorial, a central composite design, or possibly something like a Taguchi array.
There will be restrictions on the numbers of runs, the level...the numbers of levels of categorical factors, and so on, so there will be some shoehorning the problem at hand into the design that you can find. For example, factors in the BHH approach, Box Hunter and Hunter approach, often need to be whittled down to two or three unique values or levels. Despite its limitations, the taxonomical approach has been fantastically successful. Over time, of course, some people have asked if we could still do better. And by better we mean to ask ourselves, how do we design our study to obtain the highest quality information pertinent to the goals of the improvement project? This line of questioning lead ultimately to optimal design. Optimal design is an academic research area. It was started in parallel with the Box school in the '50s and '60s, but for various reasons remained out of the mainstream of industrial experimentations, until the custom designer and JMP. The philosophy of the custom designer is that you describe the problem to the software. It then returns you the best design for your budgeted number of runs. You start out by declaring your responses along with their goals, like minimize, maximize, or match target, and then you describe the kinds of factors you have, continuous, categorical mixture, etc. Categorical factors can have any number of levels. You give it a model that you want to fit to the resulting data. The model assumes at least squares analysis and consists of main effects and interactions in polynomial terms. The custom designer make some default assumptions about the nature of your goal, such as whether you're interested in screening or prediction, which is reflected in the optimality criterion that is used. The defaults can be overridden with a red triangle menu option if you are wanting to do something different from what the software intends. The workflow in most applications is to set up the model. Then you choose your budget, click make design. Once that happens, JMP uses a mixed, continuous and categorical optimization algorithm, solving for the number of factors times the number of rows terms. Then you get your design data table with everything you need except the response data. This is a great workflow as the factors are able to be varied independent from one another. What if you can't? What if there are constraints? What if the value of some factors determine the possible ranges of other factors? Well then you can do....then you can define some factor constraints or use it disallowed combinations filter. Unfortunately, while these are powerful tools for constraining experimental regions, it can still be very difficult to characterize constraints using these. Brad Jones' DOE team, Ryan Lekivetz, Joseph Morgan and Caleb King have added an extraordinarily useful new feature that makes handling constraints vastly easier in JMP 16. These are called candidate or covariate runs. What you can do is, off on your own, create a table of all possible combinations of factor settings that you want the custom designer to consider. Then load them up here and those will be the only combinations of factor settings that the designer will... will look at. The original table, which I call a candidate table, is like a menu factor settings for the custom designer. This gives JMP users an incredible level of control over their designs. 
What I'm going to do today is go over several examples to show how you can use this to make the custom designer fulfill its potential as a tool that tailors the design to the problem at hand. Before I do that, I'm going to get off topic for a moment and point out that in the JMP Pro version of the custom designer, there's now a capability that allows you to declare limits of detection at design time. If you want a non missing values for the limits here the custom designer will add a column property that informs the generalized regression platform of the detection limits and it will then automatically get the analysis correct. This leads to dramatically higher power to detect effects and much lower bias in predictions, but that's a topic for another talk. Here are a bunch of applications that I can think of for the candidate set designer. The simplest is when ranges of a continuous factor depend on the level of one or more categorical factors. Another example is when we can't control the range of factors completely independently, but the constraints are hard to write down. There are two methods we can use for this. One is using historical process data as a candidate set, and then the other one is what I call filter designs where you create...design a giant initial data set using random numbers or a space filling design and then use row selections in scatter plots to pick off the points that don't satisfy the constraints. There's also the ability to really highly customize mixture problems, especially situations where you've got multilayer mixturing. This isn't something that I'm going to be able to talk about today, but in the future this is something that you should be looking to be able to do with this candidate set designer. You can also do nonlinear constraints with the filtering method, the same ways you can do other kinds of constraints. It's it's very simple and I'll have a quick example at the very end illustrating this. So let's consider our first example. Suppose you want to match a target response in an investigation of two factors. One is equipped...an equipment supplier, of which there are two levels and the other one is the temperature of the device. The two different suppliers have different ranges of operating temperatures. Supplier A's is more narrow of the two, going from 150 to 170 degrees Celsius. But it's controllable to a finer level of resolution of about 5 degrees. Supplier B has a wider operating range going from 140 to 180 degrees Celsius, but is only controllable to 10 degrees Celsius. Suppose we want to do a 12 run design to find the optimal combination of these two factors. We enumerate all possible combinations of the two factors in 10 runs in the table here, just creating this manually ourselves. So here's the five possible values of machine type A's temperature settings. And then down here are the five possible values of Type B's temperature settings. We want the best design in 12 runs, which exceeds the number of rows in the candidate table. This isn't a problem in theory, but I recommend creating a copy of the candidate set just in case so that the number of runs that your candidate table has exceeds the number that you're looking for in the design. Then we go to the custom designer. Push select covariate factors button. Select the columns that we want loaded as candidate design factors. Now the candidate design is loaded and shown. Let's add the interaction effect, as well as the quadratic effect of temperature. 
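Because the candidate table in this first example is small, it can be written down directly. Here is a hedged sketch in Python (pandas; the column names are mine) of enumerating the 10 feasible supplier-by-temperature combinations that serve as the "menu" loaded into the custom designer as covariate/candidate runs:

```python
# A sketch of writing the candidate table down directly.
import pandas as pd

candidates = pd.DataFrame(
    [("A", t) for t in range(150, 171, 5)] +    # Supplier A: 150-170 C, controllable in 5 C steps
    [("B", t) for t in range(140, 181, 10)],    # Supplier B: 140-180 C, controllable in 10 C steps
    columns=["supplier", "temperature"],
)
print(candidates)   # 10 candidate rows: the "menu" of allowed factor combinations
```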
Now we're at the final step before creating the design. I want to explain the two options you see in the design generation outline node. The first one will force in all the rows that are selected in the original table or in the listing of the candidates in the custom designer. So if you have checkpoints that are unlikely to be favored by the optimality criterion and want to force them into the design, you can use this option. It's a little like taking those same rows and creating an augmented design based on just them, except that you are controlling the possible combinations of the factors in the additional rows. The second option, which I'm checking here on purpose, allows the candidate rows to be chosen more than once. This will give you optimally chosen replications and is probably a good idea if you're about to run a physical experiment. If, on the other hand, you are using an optimal subset of rows to try in a fancy new machine learning algorithm like SVEM (a topic of one of my other talks at the March Discovery conference), then you would not want to check this option. Basically, if you don't have all of your response values already, I would check this box, and if you already have the response values, then don't. Reset the sample size to 12 and click make design. The candidate design in all its glory will appear just like any other design made by the custom designer. As we see in the middle JMP window, JMP also selects the rows in the original table chosen by the candidate design algorithm. Note that 10, not 12, rows were selected. On the right we see the new design table; the rightmost column in the table indicates the row of origin for that run. Notice that original rows 11 and 15 were chosen twice and are replicates. Here is a histogram view of the design. You can see that different values of temperature were chosen by the candidate set algorithm for different machine types. Overall, this design is nicely balanced, but we don't have 3 levels of temperature in machine type A. Fortunately, we can select the rows we want forced into the design to ensure that we have 3 levels of temperature for both machine types. Just select the rows you want forced into the design in the covariate table. Check the include all selected covariate rows in the design option. And then if you go through all of that, you will see that now both levels of machine have at least three levels of temperature in the design. So the first design we created is on the left and the new design, forcing there to be 3 levels of machine type A's temperature settings, is over here to the right. My second example is based on a real data set from a metallurgical manufacturing process. The company wants to control the amount of shrinkage during the sintering step. They have a lot of historical data and have applied machine learning models to predict shrinkage and so have some idea what the key factors are. However, to actually optimize the process, you should really do a designed experiment. As Laura Castro-Schilo once told me, causality is a property not of the data, but of the data-generating mechanism, and as George Box says on the inside cover of Statistics for Experimenters, to find out what happens when you change something, it is necessary to change it. Now, although we can't use the historical data to prove causality, there is essential information about what combinations of factors are possible that we can use in the design.
We first have to separate the columns in the table that represent controllable factors from the ones that are more passive sensor measurements or quantities that cannot be controlled directly. A glance at the scatter plot of the potential continuous factors indicates that there are implicit constraints that could be difficult to characterize as linear constraints or disallowed combinations. However, these represent a sample of the possible combinations that can be used with the candidate designer quite easily. To do this, we bring up the custom designer. Set up the response. I like to load up some covariate factors. Select the columns that we can control as DOE factors and click OK. Now we've got them loaded. Let's set up a quadratic response surface model as our base model. Then select all of the model terms except the intercept. Then do a control plus right click and convert all those terms into if-possible effects. This, in combination with the response surface model chosen, means that we will be creating a Bayesian I-optimal candidate set design. Check the box that allows for optimally chosen replicates. Enter the sample size. It then creates the design for us. If we look at the distribution of the factors, we see that it has tried hard to pursue greater balance. On the left, we have a scatterplot matrix of the continuous factors from the original data and on the right is the hundred-row design. We can see that in the sintering temperature, we have some potential outliers at 1220. One would want to make sure that those are real values. In general, you're going to need to make sure that the input candidate set is clear of outliers and of missing values before using it as a candidate set design. In my talk with Ron Kenett in the March 2021 Discovery conference, I briefly demo how you can use the outlier and missing value screening platforms to remove the outliers and replace the missing values so that you could use them at a subsequent stage like this. Now suppose we have a problem similar to the first example, where there are two machine types, but now we have temperature and pressure as factors, and we know that temperature and pressure cannot vary independently and that the nature of that dependence changes between machine types. We can create an initial space filling design and use the data filter to remove the infeasible combinations of factor settings separately for each machine type. Then we can use the candidate set designer to find the most efficient design for this situation. So I've been through this now, and I've created my space filling design. It's got 1,000 runs and I can bring up the global data filter on it and use it to shave off different combinations of temperature and pressure so that we can have separate constraints by machine type. So I use the Lasso tool to cut off a corner in machine B. And I go back and I cut off another corner in machine B, so machine B is the machine that has, kind of, a wider operating region in temperature and pressure. Then we switch over to machine A. And we're just going to use the Lasso tool to shave off the points that are outside its operating region. And we see that its operating region is a lot narrower than machine B's. And here's our combined design. From there we can load that back up into the custom designer. Put an RSM model there, then set our number of runs to 32, allowing covariate rows to be repeated. And it'll crank through.
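The filtering workflow just demonstrated can be mimicked outside JMP to see the idea: generate a large random or space-filling candidate set, then keep only the rows that satisfy each machine type's constraints. The sketch below is Python with invented constraint boundaries, standing in for the interactive Lasso-tool filtering shown in JMP:

```python
# A sketch of the filter-design idea; constraint boundaries are made up for illustration.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 1000
cand = pd.DataFrame({
    "machine": rng.choice(["A", "B"], size=n),
    "temperature": rng.uniform(140, 180, size=n),
    "pressure": rng.uniform(1.0, 5.0, size=n),
})

# Machine A: narrow operating window; machine B: wider window with one corner removed.
feasible_a = (cand["machine"] == "A") & cand["temperature"].between(150, 170) & cand["pressure"].between(2.0, 4.0)
feasible_b = (cand["machine"] == "B") & ~((cand["temperature"] > 175) & (cand["pressure"] > 4.5))

filtered = cand[feasible_a | feasible_b]   # this filtered table becomes the candidate set
print(len(filtered), "feasible candidate rows")
```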
Once it's done that, it selects all the points that were chosen by the candidate set designer. And here we can see the points that were chosen. They've been highlighted, and the original set of candidate points that were not selected are gray. We can bring up the new design in Fit Y by X and we can see a scatterplot where the machine A design points are in red. They're in the interior of the space, and then the Type B runs are in blue. It had the wider operating region and that's how we see these points out here, further out, for it. So we have quickly achieved a design with linear constraints that change with a categorical factor, without going through the annoying process of deriving the linear combination coefficients. We've simply used basic JMP 101 visualization and filtering tools. This idea generalizes to other nonlinear constraints and other complex situations fairly easily. So now we're going to use filtering and Multivariate to set up a very unique new type of design that I assure you, you have never seen before. Go to the Lasso tool. We're going to cut out a very unusual constraint. And we're going to invert selection. We're going to delete those rows. Then we can speed this up a little bit. We can go through and do the same thing for other combinations of X1 and the other variables, carving out a very unusually shaped candidate set. We can load this up into the custom designer. Same thing as before. Bring our columns in as covariates, set up a design with all high-order interactions made if possible, with a hundred runs. And now we see our design for this very unusual constrained region that is optimal given these constraints. So I'll leave you with this image. I'm very excited to hear what you were able to do with the new candidate set designer. Hats off to the DOE team for adding this surprisingly useful and flexible new feature. Thank you.
Vince Faller, Chief Software Engineer, Predictum  Wayne Levin, President, Predictum   This session will be of interest to users who work with JMP Scripting Language (JSL). Software engineers at Predictum use a continuous integration/continuous delivery (CI/CD) pipeline to manage their workflow in developing analytical applications that use JSL. The CI/CD pipeline extends the use of Hamcrest to perform hundreds of automated tests concurrently on multiple levels, which factor in different types of operating systems, software versions and other interoperability requirements. In this presentation, Vince will demonstrate the key components of Predictum’s DevOps environment and how they extend Hamcrest’s automated testing capabilities for continuous improvement in developing robust, reliable and sustainable applications that use JSL: Visual Studio Code with JSL extension – a single code editor to edit and run JSL commands and scripts in addition to other programming languages. GitLab – a management hub for code repositories, project management, and automation for testing and deployment. Continuous integration/continuous delivery (CI/CD) pipeline – a workflow for managing hundreds of automated tests using Hamcrest that are conducted on multiple operating systems, software versions and other interoperability requirements. Predictum System Framework (PSF) 2.0 – our library of functions used by all client projects, including custom platforms, integration with GitLab and CI/CD pipeline, helper functions, and JSL workarounds.     Auto-generated transcript...   Speaker Transcript Wayne Levin Welcome to our session here on extending Hamcrest automated testing of JSL applications for continuous improvement. What we're going to show you here, our promise to you, is we're going to show you how you too can build a productive cost-effective high quality assurance, highly reliable and supportable JMP-based mission-critical integrated analytical systems. Yeah that's a lot to say but that's that's what we're doing in this in this environment. We're quite pleased with it. We're really honored to be able to share it with you. So here's the agenda we'll just follow here. A little introduction, my self, I'll do that in a moment, and just a little bit about Predictum, because you may not know too much about us, our background, background of our JSL development, infrastructure, a little bit of history involved with that. And then the results of the changes that we've been putting in place that we're here to share with you. Then we're going to do a demonstration and talk about what's next, what we have planned for going forward, and then we'll open it up, finally, for any questions that that you may have. So I'm Wayne Levin, so that's me over here on the right. I'm the president of Predictum and I'm joined with Vince Faller. Vince is our chief software engineer who's been leading this very important initiative. So just a little bit about us, right there. We're a JMP partner. We launched in 1992, so 29 years old. We do training in statistical methods and so on, using JMP, consulting in those areas and we spend an awful lot of time building and deploying integrated analytical applications and systems, hence why this effort was very important to us. We first delivered JMP application with JMP 4.0 in the year 2000, yeah, indeed over 20 years ago, and we've been building larger systems. Of course, since back then, it was too small little tools, but we started, I think, around JMP 8 or 9 building larger systems. 
So we've got quite a bit of history on this, over 10 years easily. So just a little bit of background...until about the second half of 2019, our development environment was really disparate; it was piecemeal. Project management was there, but again, everything was kind of broken up. We had different applications for version control and for managing time, you know, our developer time, and so on, and just project management generally. Developers were easily spending, and we'll talk about this, about half their time just doing routine mechanical things, like encrypting and packaging JMP add-ins. You know, maintaining configuration packages and separating the repositories, or what we generally call repos, for encrypted and unencrypted scripts. There was a lot we had to think about that wasn't really development work. It was really work that developer talent was wasted on. We also had, like I said, we've been doing it a long time, even in 2019 we had easily over 10 years of legacy framework going all the way back to JMP 5, and it was getting bloated and slow. And we know JMP has come a long way over the years. I mean, in JMP 9 we got namespaces, and JMP 14 introduced classes, and that's when Hamcrest began. And it was Hamcrest that really allowed us to go forward with this major initiative. So we began this major initiative back in August of 2019. That's when we acquired our first GitLab licenses, and that's when the development of our new development architecture, there you go, started to take shape, and it's been improving ever since. Every month, basically, we've been adding and building on our capabilities to become more and more productive as we go forward. And that's continuing, so we actually consider this, if you will, a Lean type of effort. It really does follow Lean principles, and it's accelerated our development. We have automated testing, thanks to this system, and Vince is going to show us that. And we have this little model here, test early and test often. And that's what we do. It supports reusing code, and we've redeveloped our Predictum System Framework; it's now 2.0. We've learned a lot from our earlier effort. Pretty much all of it's gone, and it's been replaced and expanded. And Vince will tell us more about that. Easily we have over a 50% increase in productivity, and I'm just going to say the developers are much happier. They're less frustrated. They're more focused on their work, I mean the real work that developers should be doing, not the tedious sort of stuff. There's still room for improvement, I'm going to say, so we're not done, and Vince will tell us more about that. We have development standards now, so we have style guides for functions, and all of our development is functionally based, you might say. Each function requires at least one Hamcrest test, and there are code reviews that the developers share with one another to ensure that we're following our standards. And it raises questions about how to enhance those standards, make them better. We also have these sort of fun sessions where developers are encouraged to break code, right, so they're called, like, break code challenges, or what have you. So it's become part of our modus operandi and it all fits right in with this development environment. It leads to, for example, further Hamcrest tests being added. 
We have one small, fairly small project that we did just over a year ago, and we're going into a new phase of it. It's got well over 100 Hamcrest tests built into it, and they get run over and over again through the development process. Some other benefits: it allows us to assign and track our resource allocation, like which developers are doing what. Everyone knows what everyone else is doing. With continuous integration and continuous deployment, code collisions are detected early, so if we have, and we do, multiple people working on some projects, and somebody's changing a function over here that's going to collide with something someone else is doing, we're going to find out much sooner. It also allows us to improve supportability across multiple staff. We can't have code dependent on a particular developer; we have to have code that any developer or support staff can support going forward. So that was an important objective of ours as well. And it does advance the whole quality assurance area generally, including supporting, you know, FDA requirements concerning QA, things like validation, the IQ/OQ/PQ. We're automating or semi-automating those tasks as well through this infrastructure. We use it internally and externally; you may know we have some products out there, (???)Kobe sash lab but new ones spam well Kobe send spam(???), that are also talked about elsewhere at the JMP Discovery Europe conference in 2021. You might want to go check them out. They're fairly large code bases, and they're all developed this way; in other words, we eat our own dog food, if you know that expression. But we also use it with all of our client development, and this is something that's important to our clients, because we're building applications that they're going to be dependent on. And so we need to have the infrastructure that allows us to be dependable; that's a big part of this. I mentioned the Predictum System Framework. You can see some snippets of it here. It's right within the Scripting Index, and you know, we see the arguments and the examples and all that. We built all that in, and over 95% of them have Hamcrest tests associated with them. Of course, our goal is to make sure that all of them do, and we're getting there. This framework is actually part of our infrastructure here; that's one of the important elements of it. Another is Hamcrest, the ability to do the unit testing. There's a slide at the end which will give you a link into the Community where you can learn more about Hamcrest. This is a development that was brought to us by JMP back in JMP 14, as I mentioned a few minutes ago. GitLab is a big part of this; that gives us the project management, the repositories, the CI/CD pipeline, etc. And also there's a Visual Studio Code extension for JSL that we created. You see five stars there because it was given five stars on the Visual Studio... I'm not sure what we call that. Vince, maybe you can tell us: the store, what have you. It's been downloaded hundreds of times, and we've been updating it regularly. So that's something you can go and look for as well. I think we have a link for that in the resource slide at the end. So what I'm going to do now is pass this over to Vince Faller. 
Vince is, again, our chief software engineer. Vince led this initiative, starting in August 2019, as I said. It was a lot of hard work, and the hard work continues. We're all, in the company, very grateful for Vince and his leadership here. So with that said, Vince, why don't you take it from here? I'm gonna... I'm... Vince Faller Sharing. So Wayne said Hamcrest a bunch of times. For people that don't know what Hamcrest is, it is an add-in created by JMP; Justin Chilton and Evan McCorkle were leading it. It's just a unit testing library that lets you run tests and get the results in an automated way. It really got the ball rolling on us being able to even do this, hence why it's called extending. I'm going to be showing some stuff with my screen. I work pretty much exclusively in the VSCode extension that we built. This is VSCode. We do this because it has a lot of built-in functionality, or extendable functionality, that we don't have to write, like Git integration and GitLab integration. Here you can see this is a JSL script and it reads it just fine. If you want to get it, if you're familiar with VSCode, it's just a lightweight text editor; you just type in JMP and you'll see it. It's the only one. But we'll go to what we're doing. So, for any code change we make, there is a pipeline run. We'll just kind of show what it does. So if I change the README file to "this is a demo for Discovery 2021," I'm just going to commit that. If you don't know Git, committing is just saying I want a snapshot of exactly where we are at the moment, and then you push it to the repo and it's saved on the server. Happy day. Commit message: more readme info. And I can just do git push, because VSCode is awesome. Pipeline demo. So now I've pushed it, and there is going to be a pipeline running. I can just go down here and click this and it will give me my merge request. So now the pipeline has started running, and I can check its status. What it's doing right now is going through and checking that it has the required Hamcrest files. We have some requirements that we enforce so that we can make sure that we're doing our jobs well. And then it's done. I'm going to press encrypt. Now encrypt is going to take the whole package and encrypt it. If we go over here, this is just a VM somewhere. It should start running in a second. So it's just going through all the code, writing all the encryption passwords, going through, clicking all that stuff. If you've ever tried to encrypt multiple scripts at the same time, you'll probably know that that's a pain, so we automated it so that we don't have to do this, because, as Wayne said, it was taking a lot of our time. Like, if we have 100 scripts and have to go through and encrypt every single one of them every time we want to do any release, it was awful. And we have to have our code encrypted because... yeah, sorry, opinion, all right, I can stop sharing that. Ah. So that's gonna run; it should finish pretty soon. Then it will go through and stage it, and the staging basically takes all of the sources of information we want, as in our documentation, as in anything else we've written, and renders them into the form that we want in the add-in, because, much like the rest of GitHub and GitLab, most of our documentation is written in Markdown and then we render it into whatever we need. I don't need to show the rest of this, but yeah. So it's passing. It's going to go. We'll go back to VSCode. So. 
If we were to change... so this is just a single function. If I go in here, like, if I were to run this... JSL, run current selection. So, you can see that it came back; all that it's trying to do is open Big Class, run a fit line, and get the equation. It's returning the equation. And you can actually see it ran over here as well. But... so this could use some more documentation. And we're like, oh, we don't actually want this data table open. But let's just run this real quick. And say, no, this isn't a good return; it returns the equation in all caps, apparently. So if I stage that. Better documentation. Push. Again, back to here. So, again it's pushing. This is another pipeline. It's just running a bunch of PowerShell scripts in order, depending on however we set it up. But you'll notice this pipeline has more stages. In an effort to help this scale, we only test the JSL minimally at first, and then, as it passes, we allow it to test further. And we only test if there are JSL files that have changed. But we can go through this. It will run and it will tell us where it is in the testing, just in case the testing freezes. You know, if you have a modal dialog box that just won't close, obviously JMP isn't going to keep doing anything after that. But you can see, it did a bunch of stuff, yeah, awesome. I'm done. Exciting. Refresh that. Get a little green checkmark. And we could go, okay, run everything now. It would go through, test everything, then encrypt it, then test the encrypted version, basically the actual thing that we're going to make the add-in of, and then stage it again, package it for us, and create the actual add-in that we would give to a customer. I'm not going to do that right now because it takes a minute. But let's say we go in here and we're like, oh, well, I really want to close this data table. I don't know why I commented it out in the first place. I don't think it should be open, because I'm not using it anymore; we don't want that. We'll say okay: close the dt. Again, push. Now, this could all be done manually on my computer with Hamcrest. But you know, sometimes a developer will push stuff and not run all of their Hamcrest tests for everything on their computer, and the entire purpose of this is to catch that. It forces us to do our jobs a little better. And yeah. Keep clicking this button. I could just open that, but it's fine. So now you'll see it's running the pipeline again. Go to the pipeline. And I'm just going to keep saying this for repetition: we're just going through, testing, then encrypting, then testing again, because sometimes encryption introduces its own world of problems, if anybody's ever done encrypting. Run, run, run, run, run. And then, oh, we got a throw. Would you look at that? I'm not trying to be deadpan, but you know. So if we were to mark this as ready and say, yeah, we're done, we'd see, oh, well, that test didn't pass. Now we can download the reason the test didn't pass in the artifacts. And this will open a JUnit file that I'm just going to pull out here. It will also render it in GitLab, which might be easier, but for now we'll just do this. Eventually. Minimize everything. Now come on. So, we can see that something happened with R squared and it failed, inside of boo. So we can come here and say, why is there something in boo that is causing this to fail? We see, oh, somebody called our equation function and then just assumed that the data table was there. 
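For readers who haven't used Hamcrest for JSL, a test of the kind the pipeline runs looks roughly like the sketch below. The function and expected value are made up for illustration; the ut-prefixed calls follow the Hamcrest add-in's style as I recall it, so check the add-in's documentation on the JMP Community for the exact signatures.

// Hypothetical function in the spirit of the demo: it opens Big Class itself
// and, per the fix discussed above, closes the table before returning.
foo = Function( {},
	{Default Local},
	dt = Open( "$SAMPLE_DATA/Big Class.jmp", Invisible );
	n = N Rows( dt );
	Close( dt, No Save );   // don't leave the table open for other code to stumble over
	n;
);

// Hamcrest-style assertion: Big Class ships with 40 rows, so this should pass
ut assert that( Expr( foo() ), ut equal to( 40 ) );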
So because something I changed broke somebody else's code, as if that would ever happen. So we're having that problem. Where did you go? Here we go. So that's the main purpose of everything we're doing here: to be able to catch the fact that I changed something and I broke somebody else's stuff. So I could go through, look at what boo does, and say, oh well, maybe I should just open Big Class myself. Yeah, cool. Well, if I save that, I should probably make it better. Open Big Class myself. I'll stage that. Open Big Class. Git push. And again, just to show the pipeline. Now this should take not too long. So we're going to go in here. Automatically, we only test on one JMP version at first. Then it waits for the developer to say, yeah, I'm done and everything looks good, continue. We do that for resource reasons, because these are running on VMs that are just chugging along all the time, and we have multiple developers who are all using these systems. We're also... you can see, this one is actually a Docker system; we're containerizing these. Well, we're in the process of containerizing these. We have them working, but we don't have all the versions yet. But at least for this project we run 14.3, 15, and 15.1, and that should work. Let's just revert things, because we know that works. Probably should have done a classic... but it's fine. So yeah, we're going to test. I feel like I keep saying this over and over: we're going to test everything. We'll actually let this one run to show you kind of the end result of what we get. It should only take a little bit. And so we'll test this, make sure it's going, and you can see the logs. We're getting decent information out of what is happening and where it is; like, it'll tell you the runner that is running. I'm only running on Windows right now. Again, this is a demo and all that, but we should be able to run more. While that's running, I'll just talk about VSCode some more. In VSCode, there are also snippets and things, so if you want to make a function, it will create all of the function information for you. We use Natural Docs (again, that was stolen from the Hamcrest team) as our development documentation. So it'll just put everything in Natural Docs form. So again, the idea is helping us do our jobs and forcing us to do our jobs a little better, with a little more gusto. Wayne Levin For the documentation? Vince Faller So that's for the documentation, yeah. Wayne Levin As we're developing, we're documenting at the same time. Vince Faller Yep, absolutely. You know, it also has for loops, while loops, For Each Row, stuff like that. Is this done yet? It's probably done, yep. So we get our green checkmark. Now it's going to run on all of the systems. If we go back to here, you'll just see it open JMP, run some tests, probably open Big Class, then close itself all down. Wayne Levin So we're doing this largely because many of our clients have different versions of JMP deployed and they want a particular add-in, but they're running it on, you know, just different versions out there in the field. We also test against the early adopter versions of JMP, which is a service to JMP because we report bugs. But it's also helpful for the clients, because then they know that they can upgrade to the new version of JMP; they know that the applications that we built for them have been tested. 
And that's just routine for us. Good. Vince Faller You're done. You're done. You're done. Change to... I can talk about... And this is just going to run; we can movie-magic this if you want to, Meg, just to make it run faster. Basically, I just want to get to staging, but it takes a second. Is there anything else you have to say, Wayne, about it? Cool. I'll put that... Something I can say: when we're staging, we also have our documentation in MkDocs. So it'll actually run the MkDocs build, render it, put the help into the help files, and basically be able to create a release for us, so that we don't have to deal with it, because creating releases is just a lot of effort. Encrypting. It's almost done. Probably should just have had one preloaded. Live demos, what are you gonna do. Run. Oh, one thing I definitely want to do: the last thing that the pipeline actually does is check that we actually logged our time, because, you know, if we don't record our time spent, we don't get paid, so it forces us to do it. Great. Vince Faller So the job would have failed without that. I can just show some jobs. Trying. That's the Docker one; we don't want that. So you can see that gave us our successes: no failures, no unexpected throws. That's all stuff from Hamcrest. Come on. One more. Okay, we got to staging. One thing that it does is create the repositories fresh every time, so it tries to keep things in a sort of stateless way. Okay, we can download the artifacts now. And now we should have this pipeline demo. I really wish it would have just gone there. What... why is Internet Explorer up? So now you'll see the pipeline demo is a JMP add-in. If we unzip it (if you didn't know, a JMP add-in is just a zip file) and look at it now, you can see it has all of our scripts in it; it has our foo, it has our bar. If we open those, you can see it's an encrypted file. So this is basically what we would be able to give to the customer without so much mechanical work. Wayne mentioned less frustrated developers, and personally, I think that's an understatement, because doing this over and over was very frustrating before we got this in place, and this has helped a bunch. Wayne Levin Now, about the encryption: when you're delivering an add-in for use by users within a company, for security reasons and so on, you typically don't want anyone to be able to go in and deal with the code, that sort of thing. So we may deliver the code unencrypted so that the client has their own code unencrypted, but for delivery to the end user, you typically want everything encrypted, just so it can't be tampered with. Just one of those sort of things. Vince Faller Yep, and that is the end of my demo. Wayne, if you want to take it back for the wrap-up. Wayne Levin Yeah, terrific. Sure, thanks very much for that, Vince. So there are a lot of moving parts in this whole system. It's, you know, basically making sure that we've got code being developed by multiple users that is not colliding. We're building in the documentation at the same time. And actually, the documentation gets deployed with the application, and we don't have to weave that in; we set the infrastructure up so that it's automatically taken care of. We can update that along with the code, comprehensively and simultaneously, if you will. 
The Hamcrest tests that are going on: for each one of those functions that are written, there are expected results, if you will. So they get compared, and we saw, briefly, there was, I guess, some problem with that equation there. An R square or whatever came back with a different value, so it broke, in other words, to say hey, something's not right here; I was expecting this output from the function for a use case. That's one of the things that we get from clients: we build up a pool of use cases that get turned into Hamcrest tests and away we go. There are some other slides here that are available to you if you go and download the slides. So I'll leave those for you, and here's a little picture of the pipeline that we're employing and a little bit about code review activity for developers too, if you want to go back and forth with it. Vince, do you want to add anything here about how code review and approval takes place? Vince Faller Yeah, so inside of the merge request it will have the JSL code and the diffs of the code. And again, a big thank you to the people who did Hamcrest as well, because they also started a lexer for GitHub and GitLab to be able to read JSL, so actually this is inside of GitLab, and it can also read the JSL. It doesn't execute it, but it has nice formatting; it's not all just white text, it's beautiful there. We just go in, like in this screenshot: you click a line, you put in the comment that you want, and it becomes a reviewable task. So we try to do as much inside of GitLab as we can for transparency reasons, and once everything is closed out, you can say yep, my merge request is ready to go; let's put it into the master branch, main branch. Wayne Levin Awesome. So it's really helping: we're really defining coding standards, if you will, and I don't like the word enforcement, but that's what it amounts to. And it reduces variation. It makes it easier for multiple developers, if you will, to understand what others have done. And as we bring new developers on board, they come to understand the standard and they know what to look for, they know what to do. So it makes onboarding a lot easier, and again, everything's attached to everything here, so supportability and so on. This is the slide I mentioned earlier, just for some resources. We're using GitLab, but I suppose the same principles apply to any Git setup generally, like GitHub or what have you. Here's the Community link for Hamcrest. There was a talk in Tucson, that was in 2019, in the old days when we used to travel and get together. That was a lot of fun. And here's the marketplace link for the Visual Studio Code extension. So as Vince said, yeah, we make a lot of use of that editor, as opposed to using the built-in JMP editor, just because it's all integrated; it's just all part of one big application development environment. And with that, on behalf of Vince and myself, I want to thank you for your interest in this, and again, we really want to thank the JMP team; Justin Chilton and company, I'll call out to you. If not for Hamcrest, we would not be on this path. That was the missing piece, or the enabling piece, that really allowed us to take JSL development to, basically, the kinds of standards you expect in code development generally in industry. 
So we're really grateful for it, and I know that it's propagated out with each application we've deployed. And at this point, Vince and I are happy to take any questions. You can send them to info@predictum.com and they'll get forwarded to us and we'll get back to you. But at this point, we'll open it up to Q&A.  
Jeremy Ash, JMP Analytics Software Tester, SAS   The Model Driven Multivariate Control Chart (MDMCC) platform enables users to build control charts based on PCA or PLS models. These can be used for fault detection and diagnosis of high dimensional data sets. We demonstrate MDMCC monitoring of a PLS model using the simulation of a real-world industrial chemical process: the Tennessee Eastman Process. During the simulation, quality and process variables are measured as a chemical reactor produces liquid products from gaseous reactants. We demonstrate how MDMCC can perform online monitoring by connecting JMP to an external database. Measuring product quality variables often involves a time delay before measurements are available, which can delay fault detection substantially. When MDMCC monitors a PLS model, the variation of product quality variables is monitored as a function of process variables. Since process variables are often more readily available, this can aid in the early detection of faults. We also demonstrate fault diagnosis in an offline setting. This often involves switching between multivariate control charts, univariate control charts and diagnostic plots. MDMCC provides a user-friendly interface to move between these plots.     Auto-generated transcript...   Speaker Transcript   Hello, I'm Jeremy Ash. I'm a statistician in JMP R&D. My job primarily consists of testing the multivariate statistics platforms in JMP, but I also help research and evaluate methodology. Today I'm going to be analyzing the Tennessee Eastman process using some statistical process control methods in JMP. I'm going to be paying particular attention to the Model Driven Multivariate Control Chart platform, which is a new addition to JMP 15. These data provide an opportunity to showcase a number of the platform's features. And just as a quick disclaimer, this is similar to my Discovery Americas talk. We realized that Europe hadn't seen a Model Driven Multivariate Control Chart talk due to all the craziness around COVID, so I decided to focus on the basics, but there is some new material at the end of the talk. I'll briefly cover a few additional example analyses that I put on the Community page for the talk. First, I'll assume some knowledge of statistical process control in this talk. The main thing it would be helpful to know about is control charts. If you're not familiar with these, these are charts used to monitor complex industrial systems to determine when they deviate from normal operating conditions. I'm not gonna have much time to go into the methodology of Model Driven Multivariate Control Chart, so I'll refer you to these other great talks that are freely available on the JMP Community if you want more details. I should also say that Jianfeng Ding was the primary developer of the Model Driven Multivariate Control Chart platform, in collaboration with Chris Gotwalt, and that Tonya Mauldin and I were testers. The focus of this talk will be using multivariate control charts to monitor a typical real-world process; another novel aspect will be using control charts for online process monitoring. This means we'll be monitoring data continuously as it's added to a database and detecting faults in real time. So I'm going to start off with the obligatory slide on the advantages of multivariate control charts. So why not use univariate control charts? 
There are a number of excellent options in JMP. Univariate control charts are excellent tools for analyzing a few variables at a time. However, quality control data are often high dimensional, and the number of control charts you need to look at can quickly become overwhelming. Multivariate control charts can summarize a high dimensional process in just a couple of control charts, so that's a key advantage. But that's not to say that univariate control charts aren't useful in this setting. You'll see throughout the talk that fault diagnosis often involves switching between multivariate and univariate charts. Multivariate control charts give you a sense of the overall health of the process, while univariate charts allow you to monitor specific aspects of the process, so the information is complementary. One of the goals of Model Driven Multivariate Control Chart is to provide some useful tools for switching between these two types of charts. One disadvantage of univariate charts is that observations can appear to be in control when they're actually out of control in the multivariate sense, and these plots show what I mean by this. The univariate control charts for oil and density show the two observations in red as in control. However, oil and density are highly correlated, and both observations are out of control in the multivariate sense, especially observation 51, which clearly violates the correlation structure of the two variables. So multivariate control charts can pick up on these types of outliers, while univariate control charts can't. Model Driven Multivariate Control Chart uses projection methods to construct the charts. I'm going to start by explaining PCA because it's easy to build up from there. PCA reduces the dimensionality of the process by projecting the data onto a low dimensional surface. This is shown in the picture on the right. We have P process variables and N observations, and the loading vectors in the P matrix give the coefficients for linear combinations of our X variables that result in score variables with dimension A, where A is much less than P. And this is shown in the equations on the left here: X can be predicted as a function of the scores and loadings, where E is the prediction error. The scores are selected to minimize the prediction error; another way to think about this is that you're maximizing the amount of variance explained in the X matrix. PLS is a more suitable projection method when you have a set of process variables and a set of quality variables. You really want to ensure that the quality variables are kept in control, but these variables are often expensive or time consuming to collect. The plant could be making product with out-of-control quality for a long time before a fault is detected. So PLS models allow you to monitor your quality variables as a function of your process variables, and you can see that the PLS model finds the score variables that maximize the amount of variation explained in the quality variables. The process variables are often cheaper or more readily available, so PLS can enable you to detect faults in quality early and make your process monitoring cheaper. From here on out I'm going to focus on PLS models because that's more appropriate for the example.   
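For reference, the decomposition just described, and the two monitoring statistics built from it that come up next, are usually written as follows. This is the standard notation from the multivariate SPC literature, restated here for convenience; JMP's exact computations (for example, DModX scaling and control-limit estimation) may differ in detail.

X = T P^{\top} + E, \qquad Y = T Q^{\top} + F

where X is the N \times P matrix of process variables, T (N \times A) holds the scores, P and Q are the X- and Y-loadings, and E, F are residuals. For observation i,

T_i^2 = \sum_{a=1}^{A} \frac{t_{ia}^2}{s_{t_a}^2}, \qquad \mathrm{SPE}_i = \sum_{j=1}^{P} \left( x_{ij} - \hat{x}_{ij} \right)^2,

so T squared measures variation within the model plane and SPE measures the distance from it. A commonly used T squared limit for current (test) observations is

T^2_{\mathrm{lim}} = \frac{A\,(N+1)(N-1)}{N\,(N-A)}\; F_{1-\alpha}(A,\, N-A).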
So a PLS model partitions your data into two components. The first component is the model component. This gives the predicted values of your process variables. Another way to think about it is that your data have been projected into the model plane defined by your score variables, and T squared monitors the variation of your data within this model plane. The second component is the error component. This is the distance between your original data and the predicted data, and squared prediction error (SPE) charts monitor this variation. An alternative metric we provide is the distance to the model X plane, or DModX. This is just a normalized alternative to SPE that some people prefer. The last concept that's important to understand for the demo is the distinction between historical and current data. Historical data are typically collected when the process was known to be in control. These data are used to build the PLS model and define the normal process variation so that a control limit can be obtained. Current data are assigned scores based on the model but are independent of the model. Another way to think about this is that we have training and test sets. The T squared control limit is lower for the training data because we expect less variability for the observations used to train the model, whereas there's greater variability in T squared when the model generalizes to a test set. Fortunately, the theory for the variance of T squared has been worked out, so we can get these control limits based on some distributional assumptions. In the demo we'll be monitoring the Tennessee Eastman process. I'm going to present a short introduction to these data. This is a simulation of a chemical process developed by Downs and Vogel, two chemists at Eastman Chemical. It was originally written in Fortran, but there are wrappers for Matlab and Python now. I just wanted to note that while this data set was generated in the '90s, it's still one of the primary data sets used to benchmark multivariate control methods in the literature. It covers the main tasks of multivariate control well, and there is an impressive amount of realism in the simulation. And the simulation is based on an industrial process that's still relevant today. The data were manipulated to protect proprietary information. The simulated process is the production of two liquid products from gaseous reactants within a chemical plant. And F here is a byproduct that will need to be siphoned off from the desired product. And that's about all I'll say about that. So the process diagram looks complicated, but it really isn't that bad, so I'll walk you through it. Gaseous reactants A, D, and E flow into the reactor here. The reaction occurs and the product leaves as a gas. It's then cooled and condensed into liquid in the condenser. Then a vapor-liquid separator recycles any remaining vapor and sends it back to the reactor through a compressor, and the byproduct and inert chemical B are purged in the purge stream to prevent any accumulation. The liquid product is pumped through a stripper, where the remaining reactants are stripped off and sent back to the reactor. And then finally, the purified liquid product exits the process.   
The first set of variables being monitored are the manipulated variables. These look like bow ties in the diagram; I think they're actually meant to be valves, and the manipulated variables mostly control the flow rate through different streams of the process. These variables can be set to any values within limits and have some Gaussian noise. The manipulated variables can be sampled at any rate, but we use the default 3-minute sampling interval. Some examples of the manipulated variables are the valves that control the flow of reactants into the reactor. Another example is a valve that controls the flow of steam into the stripper, and another is a valve that controls the flow of coolant into the reactor. The next set of variables are measurement variables. These are shown as circles in the diagram. They were also sampled at three-minute intervals. The difference between manipulated variables and measurement variables is that the measurement variables can't be manipulated in the simulation. Our quality variables will be the percent composition of the two liquid products, and you can see the analyzer measuring the products here. These variables are sampled with a considerable time delay, so we're looking at the purge stream instead of the exit stream, because these data are available earlier. And we'll use a PLS model to monitor process variables as a proxy for these variables, because the process variables have less delay and a faster sampling rate. So that should be enough background on the data. In total there are 33 process variables and two quality variables. The process of collecting the variables is simulated with a set of differential equations. This is just a simulation, but as you can see, a considerable amount of care went into modeling this after a real-world process. Here is an overview of the demo I'm about to show you. We will collect data on our process and store these data in a database. I wanted to have an example that was easy to share, so I'll be using a SQLite database, but the workflow is relevant to most types of databases, since most support ODBC connections. Once JMP forms an ODBC connection with the database, JMP can periodically check for new observations and add them to a data table. If we have a Model Driven Multivariate Control Chart report open with automatic recalc turned on, we have a mechanism for updating the control charts as new data come in. The whole process of adding data to the database would likely be going on on a separate computer from the computer that's doing the monitoring, so I have two sessions of JMP open to emulate this. Both sessions have their own journal in the materials on the Community; the session adding new simulated data to the database will be called the Streaming Session, and the session updating the reports as new data come in will be called the Monitoring Session. One thing I really liked about the Downs and Vogel paper was that they didn't provide a single metric to evaluate the control of the process. I have a quote from the paper here: "We felt that the tradeoffs among the possible control strategies and techniques involved much more than a mathematical expression." So here are some of the goals they listed in their paper, which are relevant to our problem. 
They wanted to maintain the process variables at desired values. They wanted to minimize variability of product quality during disturbances, and they wanted to recover quickly and smoothly from disturbances. So we'll see how well our process achieves these goals with our monitoring methods. To start off, in the Monitoring Session journal, I'll show you our first data set. The data table contains all of the variables I introduced earlier. The first variables are the measurement variables, the second are the composition variables, and the third are the manipulated variables. The script up here will fit a PLS model. It excludes the last 100 rows as a test set. Just as a reminder, the model is predicting the two product composition variables as a function of the process variables. If you have JMP Pro, there have been some speed improvements to PLS in JMP 16. PLS now has a fast SVD option; you can switch to the classic one in the red triangle menu. There have also been a number of performance improvements under the hood, mostly relevant for data sets with a large number of observations, but that's common in the multivariate process monitoring setting. But PLS is not the focus of the talk, so I've already fit the model and output score columns, and you can see them here. One reason that Model Driven Multivariate Control Chart was designed the way it is, is that... imagine you're a statistician and you want to share your model with an engineer so they can construct control charts. All you need to do is provide the data table with these formula columns. You don't need to share all the gory details of how you fit your model. Next, I'll provide the score columns to Model Driven Multivariate Control Chart and drag them to the right here. So on the left here you can see two types of control charts: the T squared and SPE. There are 860 observations that were used to estimate the model, and these are labeled as historical. And then the hundred that were left out as a test set are your current data. And you can see in the limit summaries the number of points that are out of control and the significance level. If you want to change the significance level, you can do it up here in the red triangle menu. Because the reactor's in normal operating conditions, we expect no observations to be out of control, but we have a few false positives here because we haven't made any adjustments for multiple comparisons. It's uncommon to do this, as far as I can tell, in multivariate control charts; I suppose you have higher power to detect out-of-control signals without a correction. In control chart lingo, this means your out-of-control average run length is kept low. On the right here we also have contribution plots: on the Y axis are the observations; on the X axis, the variables. A contribution is expressed as a proportion. And then at the bottom here, we have score plots. Right now I'm plotting the first score dimension versus the second score dimension, but you can look at any combination of score dimensions using these dropdown menus or the arrow button. OK, so I think we're oriented to the report. I'm going to now switch over to the scripts I've used to stream data into the database that the report is monitoring.   
In order to do anything for this example, you'll need to have a SQLite ODBC driver installed on your computer. This is much easier to do on a Windows computer, which is what you're often using when actually connecting to a database. The process on the Mac is more involved, but I put some instructions on the Community page. I don't have time to talk about this, but I created the SQLite database I'll be using in JMP, and I plan to put some instructions on how to do this on the Community web page. Hopefully that example is helpful to you if you're trying to do this with data of your own. Next I'm going to show you the files that I put in the SQLite database. Here I have the historical data. This was used to construct the PLS model. There are 960 observations that are in control. Then I have the monitoring data, which at first just contains the historical data, but I'll gradually add new data to this. This is the data that the multivariate control chart will be monitoring. And then I've already simulated new data and added it to the data table here. These are another 960-odd measurements where a fault is introduced at some time point. I wanted to have something that was easy to share, so I'm not going to run my simulation script and add to the database that way. We're just going to take observations from this new data table and move them over to the monitoring data table using some JSL and SQL statements. This is just an example emulating the process of new data coming into a database. You might not actually do this with JMP, but this was an opportunity to show how you can do it with JSL. Clean up here. And next I'll show you the streaming script. This is a simple script, so I'm going to walk you through it real quick. This first set of commands will open the new data table that's in the SQLite database; it opens the table in the background so I don't have to deal with the window. Then I'm going to take pieces from this data table and add them to the monitoring data table. I call the pieces bites, and the bite size is 20. Then this next command will connect to the database. This will allow me to send the database SQL statements. And this next bit of code iteratively sends SQL statements that insert new data into the monitoring data. I'm going to initialize K and show you the first iteration of this. This is a simple INSERT INTO statement that inserts the first 20 observations into the data table. This print statement is commented out so that the code runs faster, and then I also have a wait statement to slow things down slightly so that we can see the progression in the control chart; this would just go too fast if I didn't slow it down. So next I'm going to move over to the Monitoring Session to show you the scripts that will update the report as new data come in. This first script is a simple script that will check the database every 0.2 seconds for new observations and add them to the JMP table. Since the report has automatic recalc turned on, the report will update whenever new data are added. And I should add that, realistically, you probably wouldn't use a script that just iterates like this. 
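The streaming loop just described can be sketched in a few lines of JSL. The DSN, table names, and counts below are placeholders rather than the script from the talk; only the pattern (connect once, send INSERT INTO statements in bites, and pause between them) follows the description above.

// Connect so we can send SQL statements to the SQLite database (placeholder DSN)
dbc = Create Database Connection( "DSN=TEP_SQLite;" );

bite = 20;                                  // rows moved per iteration ("bite size")
For( k = 0, k < 48, k++,
	// Move the next bite of rows from the new-data table into the monitoring table
	Execute SQL( dbc,
		"INSERT INTO monitoring_data " ||
		"SELECT * FROM new_data ORDER BY rowid " ||
		"LIMIT " || Char( bite ) || " OFFSET " || Char( k * bite ) || ";"
	);
	Wait( 0.5 );                            // slow down so the chart updates are visible
);
Close Database Connection( dbc );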
Realistically, you'd probably use Task Scheduler on Windows or Automator on Mac to better schedule runs of the script. There's also another script that will push the report to JMP Public whenever the report is updated, and I was really excited that this is possible with JMP 15. It enables any computer with a web browser to view updates to the control chart. You can even view the report on your smartphone, so this makes it really easy to share results across organizations. And you can also use JMP Live if you want the reports to be on a restricted server. I'm not going to have time to go into this in this demo, but you can check out my Discovery Americas talk. Then finally, down here, there is a script that recreates the historical data in the data table if you want to run the example multiple times. Alright, so next... make sure that we have the historical data... I'm going to run the streaming script and see how the report updates. So the data is in control at first and then a fault is introduced, but there's a plantwide control system implemented in the simulation, and you can see how the control system eventually brings the process to a new equilibrium. Wait for it to finish here. So if we zoom in, it seems like the process first went out of control around this time point, so I'm going to color it and label it so that it will show up in other plots. And then in the SPE plot, it looks like this observation is also out of control, but only slightly. And then if we zoom in on the time point in the contribution plots, you can see that there are many variables contributing to the out-of-control signal at first. But then, once the process reaches a new equilibrium, there are only two large contributors. So I'm going to remove the heat maps now to clean up a bit. You can hover over the point at which the process first went out of control and get a peek at the top ten contributing variables. This is great for giving you a quick overview of which variables are contributing most to the out-of-control signal. And then if I click on the plot, this will be appended to the fault diagnosis section. As you can see, there are several variables with large contributions, and I've just sorted on the contribution. For variables with red bars, the observation is out of control in the univariate control charts. You can see this by hovering over one of the bars; these graphlets are IR charts for an individual variable with a three-sigma control limit. You can see in the stripper pressure variable that the observation is out of control, but eventually the process is brought back under control. And this is the case for the other top contributors. I'll also show you the univariate control chart for one of the variables that stays in control. So there were many variables out of control in the process at the beginning, but the process eventually reaches a new equilibrium. To see the variables that contribute most to the shift in the process, we can use the mean contribution proportion plots. These plots show the average contribution that the variables have to T squared for the group I've selected. Here, if I sort on these.   
The only two variables with large contributions measure the rate of flow of reactant A in stream one, which is the flow of this reactant into the reactor. Both of these variables are measuring essentially the same thing, except one is a measurement variable and the other is a manipulated variable. You can see that there is a large step change in the flow rate, which is what I programmed in the simulation. So these contribution plots allow you to quickly identify the root cause. In my previous talk I showed many other ways to visualize and diagnose faults using tools in the score plot. This includes plotting the loadings on the score plots and doing some group comparisons. You can check out my Discovery Americas talk on the JMP Community for that. Instead, I'm going to spend the rest of this time introducing a few new examples, which I put on the Community page for this talk. There are 20 programmable faults in the Tennessee Eastman process, and they can be introduced in any combination. I provided two other representative faults here. Fault 1, which I showed previously, was easy to detect because the out-of-control signal is so large and so many variables are involved. The focus of the previous demo was to show how to use the tools to identify faults out of a large number of variables, and not necessarily to benchmark the methods. Fault 4, on the other hand, is a more subtle fault, and I'll show it here. The fault that's programmed is a sudden increase in the temperature in the reactor, and this is compensated for by the control system by increasing the flow rate of coolant. You can see that variable picked up here, and you can see the shift in the contribution plots. And then you can also see that most other variables aren't affected by the fault. You can see a spike in the temperature here that is quickly brought back under control. Because most other variables aren't affected, this is hard to detect for some multivariate control methods, and it can be more difficult to diagnose. The last fault I'll show you is Fault 11. Like Fault 4, it also involves the flow of coolant into the reactor, except now the fault introduces large oscillations in the flow rate, which we can see in the univariate control chart. And this results in a fluctuation of reactor temperature. The other variables aren't really affected again, so this can be harder to detect for some methods. Some multivariate control methods can pick up on Fault 4 but not Fault 11, or vice versa, but our method was able to pick up on both. And then finally, all the examples I created using the Tennessee Eastman process had faults that were apparent in both T squared and SPE plots. To show some newer features in Model Driven Multivariate Control Chart, I wanted to show an example of a fault that appears in the SPE chart but not T squared. To find a good example of this, I revisited a data set which Jianfeng Ding presented in her earlier talk, and I provided a link to her talk in this journal. On her Community page, she provides several useful examples that are also worth checking out. This is a data set from Cordia McGregor's (?) classic paper on multivariate control charts. 
The data are process variables measured in a reactor producing polyethylene, and you can find more background in Jianfeng's talk. In this example, we have a process that went out of control. Let me show you this. It's out of control earlier in the SPE chart than in the T squared chart. And if we look at the mean contribution plots for SPE, you can see that there is one variable with a large contribution, and it also shows a large shift in the univariate control chart; but there are also other variables with large contributions that are still in control in the univariate control charts. And it's difficult to determine from the bar charts alone why these variables had large contributions. Large SPE values happen when new data don't follow the correlation structure of the historical data, which is often the case when new data are collected, and this means that the PLS model you trained is no longer applicable. From the bar charts, it's hard to know which pair of variables have their correlation structure broken. So, new in 15.2, you can launch scatterplot matrices. And it's clear in the scatterplot matrix that the violation of correlations with Z2 is what's driving these large contributions. OK, I'm gonna switch back to the PowerPoint. And real quick, I'll summarize the key features of Model Driven Multivariate Control Chart that were shown in the demo. The platform is capable of performing both online fault detection and offline fault diagnosis. There are many methods provided in the platform for drilling down to the root cause of faults. I'm showing you here some plots from a popular book, Fault Detection and Diagnosis in Industrial Systems. Throughout the book, the authors demonstrate how one needs to use multivariate and univariate control charts side by side to get a sense of what's going on in a process. And one particularly useful feature in Model Driven Multivariate Control Chart is how interactive and user-friendly it is to switch between these two types of charts. And that's my talk. Here is my email if you have any further questions. And thanks to everyone that tuned in to watch this.
Mia Stephens, JMP Principal Product Manager, SAS   Predictive modeling is all about finding the model, or combination of models, that most accurately predicts the outcome of interest. But, not all problems (and data) are created equal. For any given scenario, there are several possible predictive models you can fit, and no one type of model works best for all problems. In some cases a regression model might be the top performer; in others it might be a tree-based model or a neural network. In the search for the best-performing model, you might fit all of the available models, one at a time, using cross-validation. Then, you might save the individual models to the data table, or to the Formula Depot, and then use Model Comparison to compare the performance of the models on the validation set to select the best one. Now, with the new Model Screening platform in JMP Pro 16, this workflow has been streamlined. In this talk, you’ll learn how to use Model Screening to simultaneously fit, validate, compare, explore, select and then deploy the best-performing predictive model.     Auto-generated transcript...   Speaker Transcript Mia Stephens ...model screening. If you do any work with predictive modeling, you'll find that Model Screening helps you to streamline your predictive modeling workflow. So in this talk I'm going to provide an overview of predictive modeling and talk about the different types of predictive models we can use in JMP. We'll talk about the predictive modeling workflow within the broader analytics workflow, and we'll see how Model Screening can help us to streamline this workflow. I'll talk about some metrics for comparing competing models using validation data, and we'll see a couple of examples in JMP Pro. First let's talk a little bit about predictive modeling: what is predictive modeling? You've probably been exposed to regression analysis, and regression is an example of explanatory modeling. In regression we're typically interested in building a model for a response, or Y, as a function of one or more Xs, and we might have different modeling goals. We might be interested in identifying important variables. So what are the key Xs, or key input variables, for example, in a problem-solving setting, that we might focus on to address the issue? We might be interested in understanding how the response changes, on average, as a function of the input variables; for example, a one-unit change in X is associated with a five-unit change in Y. So this is classical explanatory modeling, and if you've taken statistics in school, this is probably how you learned about regression. Now, to contrast, predictive modeling has a slightly different goal. In predictive modeling our goal is to accurately predict or classify future outcomes. So if our response is continuous, we want to be able to predict the next observation, the next outcome, as precisely or accurately as possible. And if our response is categorical, then we're interested in classification. Again, we're interested in using current data to predict what's going to happen at the individual level in the future. And we might fit and compare many different models, and in predictive modeling we might also use some more advanced models; we might use some machine learning techniques like neural networks. Some of these models might not be as easy to interpret, and many of them have a lot of different tuning parameters that we can set. And as a result, with predictive modeling we can have a problem with overfitting.   
What overfitting means is that we fit a model that's more complex than it needs to be.   So with predictive modeling we generally use validation and there are several different forms of validation we can use.   We use validation for model comparison and selection, and fundamentally, it protects against overfitting but also underfitting.   And underfitting is when we fit a model that's not as complex as it needs to be. It's not really capturing the structure in our data.   Now, in the appendix at the end of the slides, I've pulled some some slides that illustrate why validation is important, but for the focus of this talk I'm simply going to use validation when I fit predictive models.   There are many different types of models we can fit in JMP Pro, and this is not by any means an exhaustive list.   We can fit several different types of regression models. If we have a continuous response, we can fit a linear regression model.   If our response is categorical, a logistic regression model, but we can also fit generalized linear models and penalize regression methods, and these are all from the fit model platform.   There are many options under the predictive modeling platform, so neural nets, neural nets with boosting, and different numbers of layers and nodes,   classification and regression trees, and more advanced tree based methods, and several other techniques.   And there are also a couple of predictive modeling options from the multivariate methods platform, so discriminate analysis   and partial least squares are two differen...two additional types of models we can use for predictive modeling. And by the way, partial least squares is also available from fit model.   And why do we have so many models?   In predictive modeling, you'll often find that that no one model   or modeling type always works the best. In certain situations, a neural network might be best and neural networks are generally are pretty...pretty good performers.   But you might find in other cases that a simpler model actually performs best, so the type of model that fits your data best and predicts most accurately is based largely on your response, but also on the structure of your data.   So you might fit several different types of models and compare these models before you find the model that fits most accurately or predicts most accurately.   So, within the broader analytic workflow, where you start off with some sort of a problem that you're trying to solve, and you compile data, you prepare the data, and explore the data,   predictive modeling is down in analyze and build models. And the typical predictive modeling workflow might look something like this, where you fit a model with validation.   Then you save that formula to a data table or publish it to the formula depot.   And then you fit another model and you repeat this, so you may fit several different models   and then compare these models. And in JMP Pro, you can use the model comparison platform to do this, and you compare the performance of the models on the validation data, and then you choose the best model or the best combination models and then you deploy the model.   And what's different with model screening is that all of the model fitting and comparison, this selection is done within one platform, the model screening platform.   So we're going to use an example that you might be familiar with, and there is a blog on model screening using these data that's posted in the Community, and these are the diabetes data.   
So the scenario is that researchers want to predict the rate of disease progression, one year after baseline.   So there are several baseline measurements and then there are two different types of responses.   The response Y is the quantitative measure, a continuous measure, so this is the rate of disease progression and then there's a second response, Y Binary, which is high or low.   So Y Binary can represent a high high rate of progression or a low rate of progression.   And the goal of predictive modeling here is to predict patients who are most likely to have a high rate of disease progression, so that the the corrective actions can be taken   to prevent this. So we're going to...we're going to take a look at fitting models in JMP Pro for both of these responses, and we'll see how how to fit the same models using model screening.   And before I go to JMP, I just want to talk a little bit about how we compare predictive models.   We compare predictive models on the validation set or test set. So so basically what we're doing is we fit a model to a subset of our data called our training data.   And then we fit that model to data that were held out (typically we call these validation data) to see how well the model performs.   And if we have continuous responses, we can use measures of error, so root means square error (RMSE) or RASE, which is route average squared error, so this is the measure of prediction error.   AAE, MAD, MAE, these are these are measures of average error and there's different R square measures we might use.   To get categorical responses, we're most often interested in measures of error or misclassification.   We might also be interested in looking at an ROC curve, look at AUC (area under the curve) or sensitivity, or specificity, false positives, false negative rate.   the F1-Score and MCC, which is Matthews correlation coefficient.   So let me switch over to to JMP.   Tuck that this away and I'll open up the diabetes data.   And let me make this big so you can see it.   So these are the data. There are...there's information on 442 patients, and again, we've got a response Y, which is continuous and this is the the amount of disease progression after one year. And we'll start off looking at Y, but then we also have the second variable, Y Binary.   We've got baseline information.   And there's a column validation. So again validation, when we fit models, we're going to fit the models only using the training data and we're going to use the validation data to tell us when to stop growing the model and to give us measures of how good the model actually fits.   Now to build this column, there is a utility under predictive modeling called model validation column.   And this gives us a lot of options for building a validation column, so we can partition your data into training validation and a test set.   And, in most cases, if we're using this sort of technique of partitioning our data into subsets,   having a test set is recommended. Having a test set allows you to have an assessment of model performance on data that wasn't used for building the models or for stopping model growth, so so I'd recommend that, even though we don't have a test set in this case.   So let's say that I want to...I want to find a model that most accurately predicts the response. So as you saw there are a lot of different models choose from.   I'll start with fit model.   And this is usually a good starting point.   
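Before moving into the model fitting, two of the metrics mentioned above that may be less familiar can be written out explicitly (standard definitions, added here for clarity rather than quoted from the talk):

\mathrm{RASE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}, \qquad \mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}

where the RASE is computed over the n rows of the validation (or test) set, and TP, TN, FP and FN are the counts of true positives, true negatives, false positives and false negatives at the chosen classification cutoff. The F1 score mentioned alongside MCC is the harmonic mean of precision and recall, F_1 = 2TP / (2TP + FP + FN).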
So I'm going to build a regression model with Y as our response, the Xs as model effects, and I'll only put in main effects at this point. And I'm going to add a validation column.   Now from fit model, the default here is going to be standard least squares, but there are a lot of different types of models we can fit.   I'm simply going to run least squares model.   A couple of things to point out here. Notice the marker here, V.   Remember that we fit our model to the training data, but we also want to see how well the model performs on the validation data, so all of these markers with a V are observations in the validation set.   Because we have validation, there is a crossvalidation section here, so we can look at R square on the training set and also on the validation set and then RASE.   And oftentimes what you'll see is that the validation statistics will be somewhat worse than the training statistics, and the farther off, they are, the better indication that you model is overfit or underfit.   I want to point out one other thing here that's really beyond the scope of this talk, and that's this prediction profiler.   And the prediction profiler   is actually one of the reasons I first started using JMP. It's a really powerful way of understanding your model.   And so I can change the values of any X and see what happens to the predicted response, and this is the average, so with   these models, we're predicting the average. But notice how these bands fan out for total cholesterol and LDL, HDL, right. And this is because we don't have any data out in those regions.   So that the new feature, I want to point out really quickly and again this is beyond the scope of this talk, is this thing called extrapolation control. And if I turn on a warning and drag   total cholesterol, notice that it's telling me there's a possible extrapolation. This is telling me I'm trying to predict out in a region where I really don't have any data, and if I turn   this extrapolation control on, notice that it truncates this bar, this line, so it's basically saying you can't make predictions out in that region. So it's something you might want to check out if you're fitting models. It's a really powerful tool.   So so let's say that I've done all the diagnostic work. I've reduced this model, and I want to be able to save my results.   Well, there are a couple of ways to do this. I can go to the red column, save columns, save the prediction formula.   So this saves the linear model I've just built out to the data table. So you can see it here in the background. And then, if I add new values to this data table, it'll automatically predict the response.   But I might want to save the model out in another form, so to do this, I might publish   this formula out to the formula depot. And the formula depot is actually independent of my data table, but what this allows me to do is I can copy the script and paste it into another data table with new data to score new data.   Or I might want to generate code in a different language to allow me to deploy this within some sort of a production system.   I'm going to go ahead and close this. This is just one model. Now is it possible, if I fit a more complicated or sophisticated model, that it might get better performance?   So I might fit another model. So, for example, I might...I'm just gonna hit recall. I might change the personality from standard least squares to generalize regression.   And this allows me to specified different response distributions. 
And I'll just stick with normal and click run. So this will allow me to fit different penalized methods and also use different variable selection techniques. And if you haven't checked out generalized regression, it's a super powerful and super flexible modern modeling platform. I'm just going to click go. And let's say that I fit this model and I want to be able to compare this model to the model I've already saved. So I might save the prediction formula out to the data table. So now I have another column in the background in the data table. Or I might again want to publish this to the formula depot, so now I've got two different models here. And I can keep going. So this is just one model from generalized regression. I can also fit several different types of predictive models from the predictive modeling menu, so for example, neural networks or partition or bootstrap forest or boosted trees. Now, typically what I would have to do is fit all these models, save the results out either to the data table or to the formula depot, and if I save them to the data table, I can use the model comparison platform to compare the different competing models. And I might have many models; here I only have two. And I don't actually even have to specify what the models are, I only need to specify validation. And I actually kind of like to put validation down here in the by field. So this gives me my fit statistics for both the training set and the validation set, and I'm only going to look at the statistics for the validation set. So I would use this to help me pick the best performing model. And what I'm interested in is a higher value of R square, a lower value of RASE (the root average squared error), and a lower average absolute error. And between these two models, it looks like the least squares regression model is the best. Now, if I were to fit all the possible models, this can be quite time-consuming. So instead of doing this, what's new in JMP Pro 16 is a platform called model screening. And when I launch model screening, it has a dialogue at the top, just like we've seen, so I'll go ahead and populate this. And I'll add validation, but over on the side, what you see is that I can select several different models and fit these different models all at one time. So decision tree, bootstrap forest, boosted tree, K nearest neighbors, right, I can fit all of these models. And it won't run models that don't make sense, so although logistic regression is one of my options, it won't run a logistic regression model with a continuous response. Notice that I've also got this option down here, XGBoost. And the reason that appears is that there is an add-in that uses open source libraries, and if you install this add-in (it's available on our JMP user community, it's called XGBoost and it only works in JMP Pro), it'll automatically appear in the model screening dialogue. So I'm just going to click OK, and when I click OK, what it's going to do is go out and launch each of these platforms. And then it's going to pull all the results into one window. So I clicked okay. I don't have a lot of data here, so it's very fast. And under the details, these are all of the individual models that were fit. And if I open up any one of these dialogues, I can see the results, and I have access to additional options that will be available from that menu.
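To make the fit-many-models-then-compare workflow concrete for readers who want to experiment outside of JMP, here is a minimal sketch in Python with scikit-learn. It is a conceptual parallel only, not the Model Screening implementation, and the candidate models and settings are illustrative assumptions.

import numpy as np
from sklearn.linear_model import LinearRegression, Lasso
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

def screen_models(X_train, y_train, X_valid, y_valid):
    """Fit several candidate model families and compare them on held-out validation rows."""
    candidates = {
        "least_squares": LinearRegression(),
        "lasso": Lasso(alpha=0.1),
        "bootstrap_forest": RandomForestRegressor(n_estimators=200, random_state=1),
        "boosted_tree": GradientBoostingRegressor(random_state=1),
    }
    rase = {}
    for name, model in candidates.items():
        model.fit(X_train, y_train)
        # RASE (root average squared error) on the validation rows; lower is better.
        rase[name] = np.sqrt(mean_squared_error(y_valid, model.predict(X_valid)))
    best = min(rase, key=rase.get)
    return best, rase

The idea is the same as in the demo: every candidate is trained only on the training rows, every candidate is scored on the same validation rows, and the model with the lowest validation RASE (or the dominant model across several metrics) is the one kept for deployment.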
So I'm going to tuck away the details. And by default, although it has computed statistics for the training data, it shows me the results for the validation data, so I can see R square and I can see RASE. And by default it's sorting in order of RASE, where lowest is best. But I've got a few little shortcut options down here, so if I want to find the best models (and it could be that R square is best for some models but RASE is better for others), I'm going to click select dominant. In this case, it selected neural boosted, so across all of these models, the best model is neural boosted. And if I want to take a closer look at this model, I can either look at the details up here under neural or I can simply say run selected. Now I didn't talk about this, but in the dialogue window there's an option to set a random seed. And if I set that random seed, then the results that launch here will be identical to what I see here. So this is a neural model with three nodes using the TanH function, but it's also using boosting. So in designing this platform, the developers did a lot of simulations to determine the best starting point and the best default settings for the different models. So neural boosted is the best. And if I want to be able to deploy this model, now what I can do is save the script, or I could run it, or I can save it out to the formula depot. So this is with a continuous response, and there are some other options under the red triangle. What if I have a categorical response? For a categorical response, I can use the same platform. So again, I'll go to model screening. I'll click recall, but instead of using Y, I'll use Y Binary. And I'm not going to change any of the default methods. I will put in a random seed, so for example 12345, I'm just grabbing a random number. And what this will do is give me repeatability. So if I save any model out to the data table or to the formula depot, the statistics will be the same and the model fit will be the same. A few other options here. We might want to turn off some things like the reports. We might want to use a different cross validation method, so this platform includes K fold validation, but it also offers nested K fold cross validation. And we can repeat this. So, really nice. Sometimes partitioning our data into training, validation and test isn't the best, and K fold can actually be a little bit better. And there are some additional options at the bottom. So we might want to add two way interactions. We might want to add quadratic effects. Under additional methods, this will fit additional generalized regression methods. So I'm just going to go ahead and click OK. OK. It runs all the models and again, this is a small data set. It's very quick. Right, the look and feel are the same, but now the statistics are different. So I've got the misclassification rate. I've got an area under the curve. I've also got some R square measures and then root average squared error. I'm going to click select dominant, and again, the dominant method is neural boosted. Now, what if I want to be able to explore some of these different models? So the misclassification rate here is a fair amount lower than it is for stepwise. The AUC is kind of similar, it's lowest overall but maybe not that much better. And let me grab a few of these. So if I click on a few of these, maybe I'll select these four, these five.
I can look at ROC curves, if I'm interested in looking at ROC curves to compare the models.   And there's some nice controls here to allow me to to turn off models and focus on on certain models.   And a new feature that I'm really excited about is this thing called a decision threshold.   And what the decision threshold allows us to look at,   and it starts by looking at the training data, is it's giving us a snapshot of our data.   And misclassification rate is based on a cut off of .5 for classification. So for each of the models, it's showing me the points that were actually high and low. And if we're focusing in on the high, the green dots were correctly classified as high.   And the ones in the red area were misclassified, so it's showing us correct classifications and incorrect classifications, and then it's reporting all the statistics over here on the side   And then down below we see several different metrics plus a lot of graphs for allowing us to look at false classifications and also true classifications.   I'm going to close this and look at the   validation data.   So why is this useful? Well, you might have a particular scenario where you're interested in maximizing sensitivity   while maintaining a certain specificity. And sensitivity...and there are some definitions over here. Sensitivity is the true positive rate;   specificity is the true negative rate. This is a...this is a scenario where we want to look at disease progression, so we want to make sure we are...we are maintaining a high sensitivity rate   while also making sure that our specificity is high, alright. So what we can do with this is there's a slider here, and we can grab this slider,   and we can see how the classifications change as we change the cutoff for classification.   So I think this is a really powerful tool when you're looking at competing models, because you might have some models that have the overall with a cut off of .5, they might have the best misclassification rate.   But you might also have scenarios where, if you change the cut off of classification, different models might perform differently. So, for example, if I'm in a certain region here, I might find that the stepwise model is actually better.   Now to further illustrate this, I want to open up a different example.   And this example is called credit card marketing.   And if I go back to my slides just to introduce this scenario.   This is a scenario where   we've got a lot of data based on market research on the acceptance of credit card offers.   The response is, was the offer accepted. And this is a scenario where only 5.5% of the offers in the study were actually accepted.   Now there are factors that we're interested in, so there are different types of rewards that are offered,   and there are different mailer types. So this is actually a designed experiment. We're going to, kind of, really, kind of, ignore that that part...that aspect of this study.   And there's also financial information, so we're going to stick on...stick to one goal in this example, and that's the goal of identifying customers who are most likely to accept the offer.   And if we can identify customers that are most likely, in this scenario we might send offers only to the subset that is more likely to accept the offer and ignore the rest. So that's the scenario here.   And I'm going to open these data.   I've got 10,000 observations and my response is is offer accepted.   
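For reference, the two quantities being traded off with the decision threshold slider described above have these standard definitions (written out here for clarity):

\mathrm{Sensitivity} = \frac{TP}{TP + FN}, \qquad \mathrm{Specificity} = \frac{TN}{TN + FP}

and an observation is classified as a positive (a "high" or a "yes") whenever its predicted probability \hat{p} is at least the cutoff c, with c = 0.5 as the default. Moving the slider changes c, which moves counts between the four cells of the confusion matrix and therefore changes both rates at once.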
And I've already saved the script to the data table, so I've got air miles, mailer type, income level, credit rating, and a few other pieces of financial information. I ran the saved script, and it's going through running all the models. And neural, in this case, will take the most time because it's running a boosted neural. It will take a few more seconds. It's running support vector machines. Support vector machines will time out and actually won't run if I have more than 10,000 observations. I'm going to give it another second. I'm using standard validation for this, where I've got a validation column. And in this case, I've got a column of zeros and ones, and JMP will recognize the zeros for the training data and the ones for the validation data. Okay. There we go. Okay, so it ran, and if you're dealing with a large data table, there is a report you can run to look at elapsed times. And for this scenario, support vector machines actually took the longest time, and this is why, at times, they won't run if we have more than 10,000 observations. So let's look at these. So our best model, if I select dominant, is a neural boosted and a decision tree, but I want to point something out here. Notice the misclassification rate. The misclassification rate is identical for all of the models, except support vector machines. And why is this the case? Well, if I run a distribution of offer accepted (let me make this a little bit bigger so we can see it) and just focus in on the validation data, notice that 0.063, or 6.3%, of our observations were Yes, which is exactly the misclassification rate we just saw. And why is it doing this? I'm going to again ask for decision threshold. And focusing on this graph here, which has a lot of uses, in this case what it shows us is that our cutoff for classification is .5, but none of our fitted probabilities were even close to that, right. So as a result, the model either classified the no's correctly as no's or classified the yeses as no's. It never classified anything as a yes, because none of the probabilities were greater than .5. So if I stretch this guy out, right, I can see the difference in these two models. So the top probability was around .25 for the neural boosted, and for decision tree it was about .15. And notice that the decision tree is basically doing a series of binary splits, so I've basically got a few discrete predicted values, whereas neural boosted is showing me a nice random pattern in the points. So let me change this to something like .12. Right, and with the cutoff at .12, in fact, if I slide this around, notice that the lower I get, I actually start getting (I'm going to turn on the metrics here) some true positives. And I start getting some false positives. So as I drag this, you can see it in the bar, but the bar is kind of small, right. For neural boosted, I'm starting to see some true positives and some false positives. And as soon as I get past this first set of points, I start seeing them for decision tree too. So, using a cutoff of .5 doesn't make sense for these data, and again I might try to find a cutoff that gives me the most sensitivity while maintaining a decent level of specificity. In this case, I'm going to point out these two other statistics. F1 is the F1 score, and this is really a measure of how well we're able to capture the true positives.
MCC is the Matthews correlation coefficient, and this is a good measure of how well it classifies within each of the four possibilities. So I can have a false positive, a false negative, a true positive, or a true negative, corresponding to the four boxes here. MCC is a correlation coefficient that falls between minus one and plus one, and it measures how well I'm predicting in each one of those four boxes. So I might want to explore a cutoff that gives the maximum F1 value or the maximum MCC value. And let's say that I drag this way down. Notice that the sensitivity is growing quickly and specificity is starting to drop, so maybe at around .5, right, I reach a point where I'm starting to drop off too far in specificity. I might find a cutoff there, and at the bottom there's this option to set a profit matrix. If I set this as my profit matrix, basically what's going to happen is it will allow me to score new data using this cutoff. So if I set this here and hit okay, right, any future predictions that I make, if I save this out to the data table or to the formula depot, will use that cutoff. And this is a scenario where I might actually have some financial information that I could build into the profit matrix. So, for example, instead of using the slider to pick the cutoff, maybe I have some knowledge of the dollar value associated with my classifications. Maybe if the actual response is a no, but I think they're going to be a yes and I send them an offer, this costs me $5, right, so I have a negative value there. And maybe I have some idea of the potential profit, so maybe the potential profit over this time period is $100, and maybe I've got some lost opportunity. Maybe I say, you know, it's -100 if the person would actually have responded but I didn't send them the offer. So maybe this is lost opportunity, and sometimes we leave this blank. Now if I use this instead, I have some additional information that shows up, so it recognizes that I have a profit matrix. And now if I look at the metrics, I can make decisions on my best model based on this profit. So I'm bringing this additional information into the decision making, and sometimes we have a profit matrix and we can use that directly and sometimes we don't. And this is one of those cases where I can see that the neural boosted model is going to give me the best overall profits.
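As a sketch of how a profit matrix turns the choice of cutoff into an expected-value calculation, here is a small Python illustration. The dollar values are the illustrative ones mentioned above (-$5 for a wasted offer, +$100 for an accepted offer, -$100 for a missed acceptance), and the variable names are placeholders, not part of JMP.

import numpy as np

def total_profit(p_hat, y_true, cutoff,
                 cost_wasted_offer=-5.0, profit_accepted=100.0, cost_missed=-100.0):
    """Total profit of sending an offer to everyone whose predicted probability >= cutoff."""
    send = p_hat >= cutoff
    profit = np.where(send & (y_true == 1), profit_accepted, 0.0)      # offer sent and accepted
    profit += np.where(send & (y_true == 0), cost_wasted_offer, 0.0)   # offer sent, not accepted
    profit += np.where(~send & (y_true == 1), cost_missed, 0.0)        # would have accepted, no offer sent
    return profit.sum()

# Sweep candidate cutoffs on the validation data (p_valid and y_valid are hypothetical arrays)
# and keep the most profitable one for scoring new data:
# best_cutoff = max(np.linspace(0.01, 0.25, 25), key=lambda c: total_profit(p_valid, y_valid, c))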
This decision threshold, if you're dealing with categorical data, allows you to explore cutoffs for classification and also integrates the ability to include a profit matrix. And for any selected model, we can deploy the model out to the formula depot or save it to the data table. So a really powerful new tool. For more information on the classification metrics: before this, the F1 score and the Matthews correlation coefficient were relatively new to me, and to make sense of sensitivity and specificity, the Wikipedia article has some really nice examples and a really nice discussion. There are also some really nice resources for predictive modeling and for model screening. In the JMP user Community, there's a new path, Learn JMP, that has access to videos, including the Mastering JMP series videos. There was a really nice talk last year at JMP Discovery in Tucson by Ruth Hummel and Mary Loveless on which model to use when, and it does a nice job of talking about different modeling goals and when you might want to use each of the models. If you're brand new to predictive modeling, Module 7 of our free online course STIPS (Statistical Thinking for Industrial Problem Solving) is an introduction to predictive modeling, so I'd recommend this. There is a model screening blog that uses the diabetes data that I'll point out, and I also want to point out that there's a second edition of the book Building Better Models with JMP Pro coming out within the next couple of months. They don't have a new cover yet, but they're going to include model screening in that book. So that's all I have. Please feel free to post comments or ask questions, and I hope you enjoy the rest of the conference. Thank you.
Stuart Little, Lead Research Scientist, Croda   This presentation will show how some of the tools available in JMP have been successfully used to visualize and model historic data within an energy technology application. The outputs from the resulting model were then used to inform the generation of a DOE-led synthesis plan. The result of this plan was a series of new materials that have all performed in line with the expectations of the model. Through this approach, a functional model of product performance has been successfully developed. This model, alongside the visualization capabilities of JMP, has allowed for the business to begin to embrace a more structured approach to experimentation.     Auto-generated transcript...   Speaker Transcript Stuart Little Hi everyone, and welcome to this talk around how JMP is being used at Croda to help drive new product development. So what we're going to cover today, and firstly we're going to cover some context to Croda, how we are using JMP and where we are on that journey, and a summary of the problem we're trying to solve. Once you've been covered the problem, we then are going to move to JMP, look at how the tools and platforms in JMP have allowed easy data exploration and easy development of a structure performance model. And then finally we'll wrap this up by discussing the outcomes of this work and how by doing this kind of research, we've been able to increase buy in to the use of data and DOE techniques in and...in the research side of the business. So firstly, who we are as Croda. It's a question that does come up quite a lot, because we're a business-to-business entity. But as a business, Croda are the name behind a lot of high-performance ingredients and technologies, and behind a lot of the biggest and most successful brands across the world, across a range of markets. We create, make, and sell speciality chemicals. From the beginning of Croda, these have been predominantly sustainable materials. So we started by making lanolin, which is from sheep's wool, and we continually build on that sustainability. Last year we made a public pledge to be climate, land, and people positive by 2030 and have signed up to the UN Sustainable Development Goals as part of our push to achieve this and become the most sustainable supplier of innovative ingredients across the world. So, in terms of the markets we serve, we have a kind of very big personal care business where we deal with skincare and sun care and sort of hair care, color cosmetics, and those kind of traditional personal care products. Life sciences business, our products...our products and expertise help customers optimize their formulations and their active ingredient use. I mean, most recently, we in an agreement with Pfizer to provide materials that are going into their COVID-19 vaccine. Our industrial chemicals business, that's...that part of the business is responsible for supplying technically differentiated, predominantly sustainable materials to a huge range of markets. A lot of markets aren't quite...don't quite fit into anything else on this slide. And then finally we've got our performance technologies business. This covers a lot, again, a lot of similar areas, providing high performance answers across across all of these. And then today, in particular, we're talking about our energy technologies business, and specifically, kind of, battery technology in high performance electronics. 
So where we are with JMP at Croda is that we've been using JMP for about two years and we've had a lot of interest internally, but it's been harder to build confidence that these techniques have real value to research. And so to prove this, we've gone away and created a number of case studies that have been pretty successful on the whole. We've demonstrated the potential and some of the pitfalls within that. And all of that has then led to a slightly bigger set of projects, one of which is the one we're going to talk to you about today: how do we improve the efficiency of electrical cooling systems? The primary driver for this project is transport electrification, so that's battery vehicles. How do you maintain the battery properly? How do you make sure the motors are working at their optimum level? And how do you do that without electrocuting anyone? So currently there's a set of cooling methods for these things, and our customers are certainly looking at how they can be improved, because the better the control of your battery cooling, for instance, the better battery capacity you have and the more consistent the range will be. And because this is critical and there are lots of different applications that are broadly similar, the really useful thing for us would be to build an understanding of these fluids by having some sort of data-led model, and that's where JMP came in. So how can we do that? Well, the first thing we looked at was: what are the current cooling methods? For batteries, in the previous generation they're predominantly air cooled or cold-plate cooled. The electronics in the car have the opposite problem to the battery: they tend to get too hot, so we have heatsinks to try and take that energy away. And in electric motors, we're trying to minimize the resistance in there, so they tend to be jacketed with fluid. In all three of these cases, the incoming alternative method of cooling relies on a fluid, so that's direct immersion for batteries and electronics, and then for the electric motors it tends to be more of a flow. So what does that fluid look like? Obviously, we're dealing with high voltages, so we have to have something that's not electrically conductive. It also needs to have a really high thermal conductivity, so that it can pull heat out of the electronics. And because these fluids need to be moved around the system, the viscosity has to be low. So we have practical physical constraints that have been introduced by the application itself. If you look at it in a bit more depth, the ability of the fluids to transfer heat is based predominantly on this equation. And what this tells us is that there is a part that we can control through the fluid, which is the heat transfer coefficient, and then there is a part that is controlled by the engineering solution in the application: what's the area for cooling, and what are the temperatures of the surfaces that you're trying to cool? But in all cases, to get efficient heat transfer, we have to have a high heat transfer coefficient, and as that's the thing we can affect, that's where we looked. That heat transfer coefficient is defined, in a simplistic way, by this equation; there are other terms in there, but predominantly it's a function of the density, the thermal conductivity, the heat capacity, and the viscosity of the system.
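The two relationships being described can be written down explicitly. They are reconstructed here from standard convective heat transfer theory rather than copied from the slides, so the exact form used in the project may differ:

Q = h\,A\,(T_\mathrm{surface} - T_\mathrm{fluid})

so the heat removed Q depends on the heat transfer coefficient h, which the fluid designer can influence, and on the cooling area A and the surface and fluid temperatures, which belong to the engineering solution. For forced convection of a fluid in a channel, a Dittus-Boelter-style correlation gives the dependence of h on the fluid properties:

h \;\propto\; \frac{k^{0.6}\,\rho^{0.8}\,C_p^{0.4}\,v^{0.8}}{\mu^{0.4}\,D^{0.2}}

with thermal conductivity k, density \rho, heat capacity C_p, flow velocity v, viscosity \mu, and channel diameter D. Raising k, \rho and C_p while lowering \mu all raise h, which is exactly the set of targets listed next.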
So, if we look specifically at the applications we're interested in, if we want to optimize our dielectric fluid, we need to increase the density, increase the thermal conductivity, and increase the heat capacity, but alongside that, we need to reduce the viscosity of the fluid. And these match up pretty well with the engineering challenges that we have, which is helpful. So from that, we knew what the target was, and what we really wanted to do was understand the relationship between structure and product performance as a dielectric fluid. So initially we proceeded in a fairly traditional way and started conducting a large-scale study measuring the physical properties of a lot of esters and a lot of other materials. And then, when we saw that and looked at it, we thought, well actually, this data exists, so why don't we use these data sets to try and build some models and see whether we can really understand that physical property to structure to performance relationship. So that's where we're just going to pop into JMP, so just bear with me one second. Okay. So the first thing that we did was collate that mix of historic data and data that was being obtained through targeted testing by the applications teams. And once we'd got that into one place, we examined it in JMP to understand, at a really simple level, whether there is a relationship between the physical properties we're measuring. So, looking at that data set, the first port of call for me, as ever, is the distribution platform in JMP. And it's a really easy way just to see if something that you want has any kind of vague pattern anywhere else. In this case, if you say we want everything that's got a high thermal conductivity, what we see is that those samples are pretty stretched out across the other properties we've measured. So it doesn't really say, oh, there's a brilliant relationship, what you need is this, which is kind of what we expected. But it's nice to have a check. Similarly, if we then plot everything as scatterplots, what we see is a lot of noise. I mean, these lines of fit are just there for reference to show there isn't really any fit. In no way am I claiming any correlation on these. And while that was disappointing, the fact that there isn't an obvious answer was expected. Where it got interesting to us is, we said, well, we were expecting that there wouldn't be a clear relationship between any of these factors, because if there was, it would have been obvious to the experienced scientists doing the work, and we would have known that. So, then, we said, well, what we do know is that these properties all have a relation to the structure of the molecules. What happens if we calculate some physical parameters for these things and combine that with a number of structural identifiers and ways of looking at these molecules? What happens if we take that and add that data to the test data? Can we then build some kind of model that starts being able to estimate structure and performance? So that's exactly what we did. In this case, what we see is that, again, if we use the multivariate platform, just as a quick look to see if there's any correlation among some of these factors, there is clear differentiation in some cases, between up and down, and maybe little hints of correlation, but nothing clear that says this is the one thing that you need.
Again, this is what we expected. So then, what we did was use the regression platforms in JMP to try and understand whether we could build a model, and what that relationship looks like. To do this, we randomly selected a number of rows with the row selection tool in JMP. Generally we pulled out five samples at random, which weren't going to participate in the model, and then iteratively built up these models and refined them that way, so we always had a validation set from the initial data, just to check that what we were doing had any chance of success. So then, if we just look at the 80 degree models, the first model that we came to was this one. Clearly, as we can see, there are a number of factors included in this model that make no sense from a statistical point of view, because they're just overfitting and they are just non-significant. However, these are fairly important in terms of describing the molecules that are in there, so as a chemist, we created this model. So this is a model that allows molecules, if you like, to be designed for this application, even though we know it's overfitted. And we know that it's not really a valid model, because these terms are just driving the R squared up and up and up. We also built the model without those terms. This is a far better model in terms of estimating the performance of these things; the R squared is a touch lower, but all the terms that are in there have a significant impact on the performance. The downside of this model is it doesn't really help us design any new chemistry. But, in both cases, when we look at the predicted values against the actual measured values, we see a reasonable correlation between them. Certainly when we expect things to be high, they are. So that gave us some confidence that this model might actually perform for us. Then, in terms of how good this might be, we simply looked at the percentage difference between the predicted value and the measured value. And what we see is they are almost universally within 10%, predominantly within 5%, for either model. Again, across a range of different types of material, this gave us confidence that what we were seeing might be a real effect. All of which is very nice, but is this just an effect of the data we've measured? So what we did was use the profiler platform in JMP, produce a shareable model that we could send around the project team, and essentially set up a competition and say, look, whoever can find the highest thermal conductivity in this model from a molecule that could actually be made wins. From that we had a list of about 14 materials back that looked promising. We had to cut a few out because it was impossible to source the raw materials, so we ended up with about nine new materials that were synthesized and tested. Now these materials were almost exclusively made up of components the model hadn't really seen before. In some cases, part of the molecule would be the same, but they were quite distinct from the original materials. So once we'd made them and once they had been tested, we put them back into the model just to see what its predictive power was like. So if we have a look at that data, I think, given the differences of these materials, I was fully expecting this to break the model. However,
if we look at the predictions again, what we see is that the highlighted blue ones are the new materials that were made. We deliberately picked a couple that were lower, just out of curiosity, just to check. And all the ones that we picked that we thought would be high were high. So in the overfitted model, which had value from a structure-design point of view, what we see is one outlier. In the model that was statistically reasonable, we actually see a much better fit overall. And that was edifying: we can't yet design a single molecule and say, here you go, off you pop, here's the one thing you need to make, but we can certainly direct synthetic chemists to the right types of materials to really drive projects forward. So then, if we just look again at these residuals, what we see for the statistically good model with no overfitting is that everything was within 10% for all the new materials, which, for what we were trying to achieve, was good enough. There are a few in the overfitted model that were a little bit over 10%, but, again, this is kind of what I would expect to see. And it was nice that they were all in the right range, because it shows that this approach was having value, but it was also reassuring to find that they weren't all exactly right, because I think, had we produced nine materials and they'd all been within 1%, I'm not sure that people would have believed that either. So the fact is, we were getting a similar level of difference from the predictions for the materials we started with and for the new materials that we made. So we started having some real confidence in this model. And then, if we just go back to the slides a second. So what we can say then is that the structure performance relationship of these materials has been captured in JMP using the regression platforms. We've used the visualization tools in JMP to see that there are real benefits to doing this, and the model itself is being used to direct the synthesis of new materials in this project. It's being used to screen likely materials to test from things we already make. And there's an acceptable correlation in the results between the model and the new molecules we're making, all of which has given real confidence to this approach, and it's really allowed us to push this project further and split it out into specific target materials. So, in terms of new molecules, we've directed the synthesis of molecules with higher thermal conductivity. As you can see in this plot, all the new molecules are medium to high on that range of thermal conductivity, which is what we wanted to achieve from them. We demonstrated that we could target an improvement using data, and then verify that in the lab and make it. Where this project then becomes harder still is that we're now trying to build similar models for all the other factors that influence the performance of these dielectric fluids, and then we will be trying to balance those models against each other to find the best outcomes.
So all of that further development is ongoing, but that momentum has come purely from the ease of use of JMP and the platforms in it, to take a data set and, with a bit of domain knowledge, really push that forward and say, yep, here's a model that will help direct the synthesis for this project and subsequent projects in this area for Croda. So then, just in conclusion, data that we've obtained from testing has been used to successfully model the performance of these materials. It's not absolutely perfect, but it's good enough for what we want. The model demonstrates that there is a structure performance relationship for esters (sorry, not sure why my taskbar is jumping around). The model has been used to predict materials of high thermal conductivity. Those predictions were then verified, initially just by exclusion and then latterly by making new materials, really showing that this model holds for that type of chemistry. It's also demonstrated the possibility of tailoring the properties of, in this case, dielectrics, but also of other materials if you build similar models, so that you can start being able to create specific materials for specific applications. And I think most importantly for me, the real success of this work has built internal momentum to demonstrate that JMP is not a nice-to-have, it's a real platform to develop research, to very quickly look at data sets and say, is there something there? And with that, I'd just like to say thank you for watching. Obviously I can't answer any questions on a recording, but if you want to get in touch, feel free to comment in the Community. Yeah, thank you very much.
Philip Ramsey, Professor, North Haven Group and University of New Hampshire Wayne Levin, President and CEO, Predictum Trent Lemkus, Ph.D. Candidate, University of New Hampshire Christopher Gotwalt, JMP Director of Statistical R&D, SAS   DOE methods have evolved over the years, as have the needs and expectations of experimenters. Historically, the focus emphasized separating effects to reduce bias in effect estimates and maximizing hypothesis testing power, which are largely a reflection of the methodological and computational tools of their time. Often DOE in industry is done to predict product or process behavior under possible changes. We introduce Self-Validating Ensemble Models (SVEM), an inherently predictive algorithmic approach to the analysis of DOEs, generalizing the fractional bootstrap to make machine learning and bagging possible for small datasets common in DOE. Compared to classical DOE methods, SVEM offers predictive accuracy that is up to orders of magnitude higher. We have remarkable and profound results demonstrating SVEM capably fits models with more active parameters than rows, potentially forever changing the role of classical statistical concepts like degrees of freedom. We use case studies to explain and demonstrate SVEM and describe our research simulations comparing its performance to classical DOE methods. We also demonstrate the new Candidate Set Designer in JMP, which makes it easy to evaluate the performance of SVEM on your own data. With SVEM one obtains accurate models with smaller datasets than ever before, accelerating time-to-market and reducing the cost of innovation.     Auto-generated transcript...   Speaker Transcript Wayne Levin ...self-validating ensemble models, which we view as a paradigm shift in design of experiments. Just gonna...there we go. Our agenda here is a quick introduction about who's talking, and then the what, why and how of SVEM. So we'll have a quick overview of DOE and machine learning, describe blending them into SVEM, analyze some real world SVEM experiments, review some current research, demonstrate JMP's new candidate set designer, new in JMP 16, and then we'll end with the usual next steps and Q & A and so forth. SVEM is a remarkable new method to extract more insights with fewer experimental cycles and build more accurate predictive models from small sets of data, including DOEs. Using SVEM, we anticipate you're going to have less cost, you'll be faster to market, and you'll get a lot faster problem solving. We're going to explore all of that as we go forward in the session. I'm Wayne Levin and I'm the president of Predictum, and joining me is Chris Gotwalt, who's the chief data scientist at JMP, and also joining us is Phil Ramsey, who's a senior data scientist here at Predictum and associate professor at the University of New Hampshire. So let's just take stock here. JMP's contributions to data science are huge, as we all know, and in the DOE space they go back over 20 years now, to JMP 4 with the coordinate exchange algorithm. And then about eight or nine years ago, we saw definitive screening designs coming out. And we'd like to think that SVEM, as we call it for short, self-validating ensemble modeling (try and say that fast three times), is another contribution, blending machine learning and design of experiments. It overcomes the limitations of small amounts of data, right.
With small amounts of data, you normally can't do machine learning; you have to have large amounts of data to make that happen. Now we've been trying SVEM in the field, we've had a number of companies approach us on this, and we've made a trial version of SVEM available. We at Predictum are the first to create a product in this area and bring it to market, and I just want to share with you some of the things we've learned so far. One is that SVEM works exceptionally well. I mean, when you have more parameters than runs, it works very well with higher order models. It does give more accurate predictions, and it also helps recover from broken DOEs. So, for example, if you have definitive screening designs that for whatever reason can't be analyzed with fit definitive screening. We've had some historical DOEs sent to us that lack power; they just didn't find anything, but with SVEM, actually, it did; it was able to. And we've had cases where there were some missing or questionable runs. Something else we've learned is that it's not going to help you with fractional factorials, things like that. We're interested in predictive models here, and fractional factorials really don't give predictive models; they're not designed for that. So it's not going to give you something that the data is not structured to deliver, at least not to that extent. The other thing we've learned is that we've not yet tested the full potential of SVEM. To do that, we really need to design experiments with SVEM in mind to begin with, and that means looking at more factors. We don't want to just look at three or four factors. How about 10, 15 or even more? We understand from the research that Chris and Phil will be talking about that Bayesian I-optimal designs are the best approach for use with SVEM, and also that mixture designs are particularly useful for SVEM, and I'll have a little something more to say about that later. And as far as SVEM goes, if you'd like to try it out, we'll have some information at the end, after Chris and Phil, to talk about that. And so with that said, I'm going to throw it over to you, Chris, who will go into the what and the why and the how of SVEM. So off to you, Chris. Chris Gotwalt Alright, great, well, thank you, Wayne. Oh wait, am I sharing my screen? Wayne Levin Not yet. There we go. Chris Gotwalt Okay, all right, well, thank you, Wayne. So now I'm going to introduce the idea of self-validating ensemble models, or SVEM. I'm going to keep it fairly basic. I'll assume that you have some experience of design and analysis of experiments. Most importantly, you should understand that the statistical modeling process boils down to fitting functions to data that predict a response variable as a function of input variables. At a high level, SVEM is a bridge between machine learning algorithms used in big data applications and design of experiments, which is generally seen as a small data analysis problem. Machine learning and DOE each have a long history, but until now, they were relatively separate traditions. Our research indicates that with SVEM you can get more accurate models using fewer observations.
In particular, we believe that SVEM has tremendous promise in the pharma and biotech industries, in high technology like semiconductor manufacturing, and in the development of consumer products. In my work here at JMP, I've had something of a bird's eye view of how data is handled in a lot of industries and areas of application. I've developed algorithms for constructing optimal designed experiments as well as related analysis procedures like mixed models and generalized linear models. On the other hand, I'm also the developer of JMP Pro's neural network algorithms, as well as some of the other machine learning techniques that are in the product. Over the last 20 years I've seen machine learning become more prominent for modeling large observational data sets, and I've seen many new algorithms developed. At the same time, the analysis of designed experiments has also changed, but more incrementally over the last 20 years. The overall statistical methodology of industrial experimentation would be recognizable to anyone that read Box, Hunter and Hunter in the late 1970s. Machine learning and design of experiments are generally applied in industry to solve problems that are predictive in nature. We use these tools to answer questions such as: is the sensor indicating a faulty batch, or how do we adjust the process to maximize yield while minimizing costs? Although they are both predictive in nature, machine learning and design of experiments have generally been applied to rather different types of data. What I'm going to do now is give a quick overview of both and then describe how we blended them into SVEM. I'll analyze some data from a real experiment, to show how it works and why we think it's doing very well overall. I will go over results of simulations done by Trent Lemkus, a PhD student at the University of New Hampshire, which have given us confidence that the approach is very promising. Then I'll go over an example that shows how you can use the new candidate set designer in JMP 16 to find optimal subsets of existing data sets so you can try out SVEM for yourself. I'll highlight some of our observations, mention some future questions that we hope to answer, and then I'll hand it over to Phil, who will go through a more in-depth case study and a demo of the SVEM add-in. Consider the simple data on the screen. It's data from a metallurgical process, and we want to predict the amount of shrinkage at the end of the processing steps. We have three variables on the left that we want to use to predict shrinkage. Modeling amounts to finding a function of pressure, time, and temperature that predicts the response (shrinkage) here. Hopefully this model will generalize to new data in the future. In a machine learning context, the standard way to do this is to partition the data into two disjoint sets of rows. One is called the training set, and it's used for fitting models; to make sure that we don't overfit the data, which leads to models that are inaccurate on new observations, we have another subset of the data, called the validation set, that is used to test out various models that have been fit to the training set. There's a trade off here, where functions that are too simple will fit poorly on the training and validation sets because they don't explain enough variation.
Whereas models that are too complex can have an almost perfect fit on the training set but will be very inaccurate on the validation set. We use measures like validation R squared to find the goldilocks model that is neither too simple nor too complicated. This goldilocks model will be the one whose R squared is the highest on the validation set. This hold-out style of model selection is a very good way to proceed when you have a quote unquote reasonably large number of rows, often in the hundreds to millions of rows or more. The statement "reasonably large" is intentionally ambiguous and depends on the task at hand. That said, the 12 rows you see here is really too small and is used just for illustration. In DOE, we're usually in situations where there are serious constraints on time and/or resources. A core idea of designed experiments, particularly the optimal designed experiments that are a key JMP capability, is obtaining the highest quality information packed into the smallest number of rows possible. Many brilliant people over many years have created statistical tools like F tests, half-normal plots, and model selection criteria like the AIC and BIC that help us make decisions about models for DOE data. These are all guides that help us identify which model we should base our scientific, engineering and manufacturing decisions on. One thing that isn't done often is applying machine learning model selection to designed experiments. Now why would that be? One important reason is that information is packed so tightly into designs that removing even a few observations can cause the main effects and interactions to collapse, so that we're no longer able to separate them into uniquely estimable effects. If you try to do it anyway, you'll likely see the software report warnings like lost degrees of freedom or singularity details added to the model report. Now we're going to conduct a little thought experiment. What if we tried to cheat a little bit? What if we copied all the rows from the original table, the Xs and Ys, labeled one copy "training" and labeled the other copy "validation"? Now we have enough data, right? Wrong. We're not fooling anybody with this crazy scheme. It's crazy because if you tried to model using this approach and looked at any index of goodness of fit, you'd see that making models more complicated leads to a better fit on the training set, but because the training and validation sets are the same in this case, the approach will always lead to overfitting. So let's back up and recast machine learning model selection as a tale of two weighted sums of squares calculated on the same data. Instead of thinking about model selection in terms of two data partitions, we can think of it as two columns of weights. There's a training weight that is equal to one for rows in the training set and zero otherwise, and a validation column that is equal to zero for the training rows and equal to one for the validation rows. So from machine learning model selection, we can think of each row as having its own pair of weight values, where both weights are binary zeros and ones. Having them set up in this way is what gives us an independent assessment of model fit.
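One compact way to write the two weighted sums of squares described here (the notation is illustrative, not from the talk): for rows $i = 1, \dots, n$ with response $y_i$, model $f(x_i;\beta)$, training weights $w_i^{\mathrm{train}}$ and validation weights $w_i^{\mathrm{valid}}$,

$$\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{n} w_i^{\mathrm{train}} \bigl( y_i - f(x_i;\beta) \bigr)^2, \qquad \mathrm{SSE}_{\mathrm{valid}} = \sum_{i=1}^{n} w_i^{\mathrm{valid}} \bigl( y_i - f(x_i;\hat{\beta}) \bigr)^2,$$

where classical hold-out validation is the binary case $w_i^{\mathrm{train}} \in \{0,1\}$ with $w_i^{\mathrm{valid}} = 1 - w_i^{\mathrm{train}}$.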
Models are fit by finding parameter values that minimize the sum of squared errors weighted using the training weights, and the generalization performance is measured using the sum of squared errors weighted using the validation column weights. If we plot these weight pairs in a scatterplot, we see, of course, that the training and validation weights are perfectly anti-correlated. Now, what if, instead of forcing ourselves to use perfectly anti-correlated pairs of binary values, we relax the requirement that the values can take on only zero and one and allow the weights to take on any strictly positive value? To do that, we have to give something up. In particular, to my knowledge, for this to work mathematically we can no longer have perfectly anti-correlated weight pairs, but we can do this in a way that still gives us a substantial anti-correlation. There are many ways this could be done, but one way is to create weights using the same scheme as the fractionally weighted bootstrap, where we use an exponential distribution with a mean of one to create the weights. And there's a little trick using the properties of the uniform distribution that we can use to create exponentially distributed, but highly anti-correlated, weight pairs. When we use this as a two-column weighted validation scheme, we call it autovalidation. Phil and I have given several Discovery papers on this very idea. Using this approach is like having a training column of weights with a mean and variance of one, and a validation column with the same properties. Under this autovalidation scheme, if a row contributes more than average weight to the training sum of squares, that row will contribute less to the validation sum of squares, and vice versa. I want to point out that there's an add-in that Mike Anderson created that sets data sets up in this kind of format, which JMP Pro's modeling platforms can consume. Now recently Phil and I have taken on Trent as our PhD student. Over the spring and fall, Trent did a lot of simulations trying out many different approaches to fitting models to DOE data, including an entire zoo of autovalidation-based approaches. I'll show some of his results here in a little bit. Suffice to say, the approach that worked consistently the best, in terms of minimum average error on an independent test set, was to apply autovalidation in combination with a variable selection procedure such as forward selection, but instead of doing this once, we repeat the process dozens or hundreds of times, saving the prediction formula each time and ultimately averaging the models across all these autovalidated iterations. We call this general procedure self-validated ensemble models, or SVEM. So, to make things a little more concrete, I'm going to give a quick example that illustrates this. We begin by initializing the autovalidation weights. We apply a model selection procedure, such as the Lasso or forward selection, then save the prediction formula back to the table. Here's the first prediction formula. In this iteration, just two linear main effects have been selected. Then we reinitialize the weights, fit the model again using the new set of weights, and save that prediction formula. Here's the second autovalidated prediction formula.
Note that this time a different set of main effects and interactions was chosen, and the regression coefficients are different this time. We repeat this process a bunch of times. And at the end, the SVEM model is simply the average across all the prediction formulas. Here's a succinct diagram that illustrates the SVEM algorithm. The idea of combining the bootstrap procedure with model averaging happens over and over again: after we reinitialize the weights, we save the prediction formula from that iteration. Here the illustration is showing the parameters, but it's the same thing with the formulas. Just save them all out, and then at the end of the day, the final model is simply the average across all the iterations. Once we're done, we can use that SVEM model in the graph profiler with expand intermediate formulas turned on, so that we can visualize the resulting model and do optimization of our process and so forth. Now I'm going to go over the results from Trent's dissertation research. The base designs were Box-Behnken designs and DSDs in four and eight factors. Each simulation consisted of 1,000 simulation replications, and each simulation had its own set of true parameter values for our quadratic response surface model. The active parameters were doubly exponentially distributed, and we explored different percentages of active effects from 50% to 100%. For each of the 1,000 simulation reps, the SVEM prediction was evaluated relative to the true model on an independent test set that consisted of a 10,000-run space filling design over the factor space. We looked at a large number of different classical single-shot modeling approaches, as well as a number of variations on autovalidation, but it's easier just to look at the most interesting of the results since there was too much to compare. The story was very consistent across all the situations that we investigated. This is one of those simulations where the base design was a definitive screening design with eight factors and 21 runs. In this case 50% of the effects were non-zero. This means that the model had just about as many non-zero parameters as there were runs. On the left are the box plots of the log root mean squared error for the Lasso, Dantzig selector, forward selection and pruned forward selection, all tuned using the AICc, the best performing of those methods. The best performing of the methods from the Fit Definitive Screening platform is next, followed by SVEM applied to several different modeling procedures over on the right. There is a dramatic reduction in independent test set root mean squared error when we compare the classical test set RMSEs relative to the SVEM predictions. And note that this is on a log scale, so this is a pretty dramatic difference in predictive performance. Here's an even more interesting result. This was the plot that really blew me away. This is the same simulation setup, except here all of the effects are non-zero. So the true models are supersaturated, not just in the design matrix but in the parameters themselves. In the past we have had to assume that the number of active effects was a small subset of all the effects, and certainly smaller than the number of rows. Here we are getting reasonably accurate models despite the number of active parameters being about two times the number of rows, which is truly remarkable.
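To make the general procedure just outlined concrete outside of JMP, here is a minimal Python sketch of the SVEM idea under some simplifying assumptions of our own: the anti-correlated exponential weight pairs are generated from a shared uniform draw (one natural version of the uniform-distribution trick mentioned above), the selection step is a plain weighted forward selection rather than JMP Pro's Generalized Regression, and the final model is the average of the per-iteration coefficient vectors. It is an illustrative sketch, not the JMP Pro or Predictum implementation, which saves and averages prediction formulas instead.

import numpy as np

def autovalidation_weights(n, rng):
    # One shared uniform draw per row gives exponential(mean 1) weights that
    # are strongly anti-correlated between the training and validation copies.
    u = rng.uniform(size=n)
    return -np.log(u), -np.log(1.0 - u)

def weighted_sse(X, y, w, coef):
    resid = y - X @ coef
    return np.sum(w * resid ** 2)

def weighted_forward_selection(X, y, w_train, w_valid, max_terms):
    # Greedy forward selection: fit by training-weighted least squares and
    # keep adding the term that most improves the validation-weighted SSE.
    n, p = X.shape
    selected = [0]                          # column 0 is assumed to be the intercept
    best_coef, best_valid = np.zeros(p), np.inf
    for _ in range(max_terms):
        candidate, cand_coef, cand_valid = None, None, best_valid
        for j in range(p):
            if j in selected:
                continue
            cols = selected + [j]
            sw = np.sqrt(w_train)
            beta, *_ = np.linalg.lstsq(sw[:, None] * X[:, cols], sw * y, rcond=None)
            coef = np.zeros(p)
            coef[cols] = beta
            v = weighted_sse(X, y, w_valid, coef)
            if v < cand_valid:
                candidate, cand_coef, cand_valid = j, coef, v
        if candidate is None:               # no remaining term improves the validation SSE
            break
        selected.append(candidate)
        best_coef, best_valid = cand_coef, cand_valid
    return best_coef

def svem(X, y, n_iter=200, seed=1):
    # X must already contain the model terms (intercept, mains, interactions, ...).
    rng = np.random.default_rng(seed)
    fits = []
    for _ in range(n_iter):
        w_tr, w_va = autovalidation_weights(len(y), rng)
        fits.append(weighted_forward_selection(X, y, w_tr, w_va, X.shape[1]))
    return np.mean(fits, axis=0)            # the SVEM model: average of all the fits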
Now I'm going to go through a quick illustration of SVEM to show how you can apply it to your own historical DOEs. I'm going to use a real example of a five-factor DOE from a fermentation process, where the response of interest was product yield as measured by pDNA titer. The great advantage with SVEM is that we can get more accurate models with fewer runs. This means we can take existing data sets and use the new candidate set designer in JMP 16 to identify the subset of rows that forms the quote unquote predictive heart of the original design. I'll take the original data, which had 36 rows, load the five columns in as covariates, and tell the custom designer to give me the 16-row subset of the original design that gives the best predictive performance at that size. Then I'll hide and exclude all the rows that are not in that subset design, so that they can later be used as an independent test set. I'll run the SVEM model and compare its performance to what you get if you apply a standard procedure like forward selection plus AICc fit to the same 16 runs. I do want to say that the new candidate set designer in JMP 16 is also a remarkable contribution, and I just want to call out how incredible Brad's team has been in the creation of this new tool, which is going to be profoundly useful in many ways, including with SVEM. So to do this, we take our data set and go to the custom designer. We select covariate rows, which is a new outline node in JMP 16. Then we select the five input factors. Click continue. We press the RSM button, which automatically converts the optimality criterion to an I-optimal, or predictive-type, design. We select all the non-intercept effects, right-click in the model effects list, and set their estimability to If Possible. This is what allows us to have models with more parameters in them than there are runs in our base design, as the design criterion will now be a Bayesian I-optimal design. Now we can set the number of runs equal to 16. Click make design. Once we've done that, the optimizer works its magic, and all the rows that are in the Bayesian I-optimal subset design are now selected in the original table. We can go to the table and then use an indicator or transform column to record the current row selection. I went ahead and renamed the column with a meaningful name for later. We can select the points not in the subset design by selecting all the rows where the new indicator column is equal to zero, and then we can hide and exclude those rows so that they will not be included in our model fitting process. Phil will be demoing the SVEM add-in, so I'll skip the modeling steps and go right to a model comparison of forward selection plus AICc and the SVEM model on the rows not included in the Bayesian I-optimal subset design. You can see that comparison in the red outlined part of the report here. We see that SVEM has an R squared on the independent test set of .5, and the classical procedure has an R squared of .22, so we're getting basically twice the amount of variation explained with SVEM than with the classical procedure. We can also compare the profile traces of the two models when we apply SVEM to all 46 observations. It's clear that we're getting basically the same model when we apply SVEM to all 46 runs as we're getting with just 16 runs under SVEM.
But the forward selection based model is missing curvature terms, not to mention that a lot of interactions are missing as well. This is a fairly simple procedure that you can use to test SVEM out on your own historical data. So overall, SVEM raises a lot of questions. Many of them are centered around the role of degrees of freedom in designed experiments, as we are now able to fit models where there are more parameters than rows, meaning that there would effectively be negative degrees of freedom, if there were such a thing. I think this will cause us to reconsider the role of p-value based thinking in industrial experimentation. We're going to have to work to establish new methods for uncertainty analysis, like confidence intervals on predictions. Phil and I are doing some work on trying to understand what the best family of base models to work from is. This could be quadratic RSMs, and we're also looking at the partial cubic models that were proposed a long time ago but that we now believe are worth reconsidering. What kinds of designs should we use? What sample sizes are appropriate as a function of the number of factors in the base model that we're using? What is the role of screening experiments? And one big unknown is what the role of blocking and split-plot experimentation is in this framework. So now I'm going to hand it over to Phil. He's going to do a more in-depth case study and also demo Predictum's SVEM add-in. Take it away, Phil. Philip Ramsey Okay, thank you, Chris. What I'm going to do is discuss a case study, so let me put this in slideshow mode and I'll do some illustration, as Chris said, of the Predictum add-in that actually automates a lot of the SVEM analysis. What we're looking at is an analytical method used in the biotech industry. This one is for characterizing the glycoprofiling of therapeutic proteins, basically proteins such as antibodies. Many of you who work in that industry know glycoproteins are a very rich source of therapeutics, and you also know that if you work under cGMP, you have to demonstrate the reliability of the measurement systems. And actually, for glycoprofiling, fast, easy-to-use analytical methods are not really fully developed. So the idea of this experiment was to come up with a fairly quick and easy method that people could use that would give them an accurate assessment of the different (I am not a chemist here) sugars that have been attached to the base protein post-transcription. To give you an idea of chromatography (I'm going to assume you have some familiarity with it), basically it's a method where you have some solution, you run it through a column, and then, as the different chemical species go through the column, they tend to separate and come out of the column at different times. They basically form what is called a chromatogram, where the peak is a function of concentration, and the time at which the peak occurs is important for identifying what the species actually is. So in this case, we're going to look at a number of sugars. I'm simply going to call them glycoforms. What is going on here is that the scientist who did this work developed a calibration solution. And we know exactly what's in the calibration solution. We know exactly what peaks should elute, and roughly where.
And then we can compare an actual human antibody sample to the calibration sample. Some of these glycoforms are charged, and it's difficult to get them through the column; they tend to stick to it, so we're using what is sometimes called a gradient elution procedure, and I won't get into the details. What we're doing here is using something called high performance anion exchange chromatography. I'm not an expert on it, but the scientist I've worked with has done a very good job of developing this calibration. The reason we need a calibration solution is that historically the sugars that elute from a human antibody sample are not entirely predictable as to where they're going to come out of a column, so we need something that we can use for calibration. The person who did the work, and I'm going to mention her in a moment, designed two separate experiments. One is a 16-run, three-level design, and then later she came back and did a bigger 28-run, three-level design. So one could be used as a training set and the other as a validation set, but to demonstrate the covariate selector and Bayesian I-optimal strategy that Chris talked about, both designs were combined into a single 44-run design. We then took that design into custom design with the covariate selector and created various Bayesian I-optimal designs. We could have done more combinations, but we only have so much time to talk about this. So there are three designs. One has 10 runs (remember, there are five factors and we're doing 10 runs), then a 13-run design and a 16-run design, and again, we could do far more. So what we want to do is see how these designs perform after we use SVEM to fit models. And by the way, as Chris mentioned, in each one of these designs, the runs that are not selected from the 44 are then used as a validation or test set to see how well the model performed. Okay, so these are the factors: the initial amount of sodium acetate, the initial amount of sodium hydroxide, and then three separate gradient elution steps that take place over the process time, which runs to roughly 40 minutes. So these are the settings that are being used and manipulated in the experiment. There are actually, in this experiment, 44 different responses one could look at. Given the time we have available, I'm going to look at two. One is the retention time for glycoform 3, and this is the key; this is what anchors the calibration chromatogram. We'd like it to come out at about eight and a half minutes because that aligns nicely with human antibody samples. It's actually fairly easy to model. The second response is the retention time of glycoform 10. It's a charged glycoform; it elutes very late, tends to be bunched up with other charged glycoforms, and is harder to distinguish. Okay, so those are the two responses I'll look at. And for those who aren't familiar with chromatography, here are a couple of what are called chromatograms. These are essentially pictures of what the responses look like, and in this case, if you take a look at the picture, we're going to look at the retention time, that is, at what time did Peak 3 come out, and at what time did Peak 10 come out and show up in the chromatogram? And, by the way, notice between the two chromatograms I'm showing how different the profiles are.
So in this experiment, when we manipulated those experimental factors (and by the way, we have 44 of these chromatograms; I'm showing you two of them), you see a lot of differences in the shapes, the retention times, and the resolution of these peaks. So whatever we're doing, we are clearly manipulating the chromatograms. And for those of you who are curious, yes, we are thinking about treating these in the future as functional data and about how functional data analysis might be used. That is for the future, but it's definitely something that we're very interested in; chromatograms really are functional data in the end. Here, though, I'm just going to extract two features and try to model them. Okay. I also want to give credit to Dr Eliza Yeung of Cytovance Biologics, a good friend and excellent scientist. She did all the experimental work, and she was the one who came up with the idea of constructing what she calls the glucose ladder, the calibration solution. And Eliza, besides being a nice person, also loves JMP, so we like Eliza in many ways. Excellent scientist. Okay, so before I get into SVEM, I just want to mention something: Chris mentioned the full quadratic model that's been used since, what, 1950 as the basis of optimization, basically all the main effects, two-factor interactions and squared terms. In point of fact, for a lot of complex physical systems, the kinetics are actually too complex for that model. Cornell and Montgomery, in a really nice paper in 1998, pointed this out and suggested that in many cases what they call the interaction model (we call it the partial cubic model) may work better. And in my experience, using SVEM and applying this model to a lot of case studies, they are right. It does give better predictive models, but there's a problem. These partial cubic models can be huge, and they are a big challenge to traditional modeling strategies, where supersaturated models, as Chris mentioned, are a big problem. How many potential terms are there? Take K squared, multiply by two, and add one for the intercept. So for five factors, the full partial cubic model would have 51 terms. By the way, I'm going to use these models, but I'm only going to use the 40 terms that are usually most important. And then we're going to use self-validating ensemble modeling to fit the models and, as Chris said, SVEM has no problem with supersaturated models. By the way, in the machine learning literature, supersaturated models are fit all the time, and using the right machine learning techniques, they can actually show that you get better prediction performance. This is largely unknown in traditional statistics. So we're going to use the SVEM add-in, and as I mentioned, if you'd like to learn about it, you can contact Wayne Levin, who's already spoken, at Predictum, and I'm sure Wayne would love to talk to you. So let's go over to JMP briefly and I'll show you how the add-in actually works. Let me bring over JMP. I've installed the add-in, so it has its own tab. I click on Predictum. By the way, this is one of the Bayesian I-optimal designs we created, and this one has 16 runs. Select self-validating ensemble modeling. You see what looks like your typical Fit Model launch dialog. I'm going to select my five factors, and I'm just going to do a response surface model at this point for illustration, and I want to model retention time for glycoform 3.
Notice in the background that this set up the autovalidation table, so there's a weighting function and an autovalidation portion. This is all in the background; you don't need to see it. The add-in created it and then basically hides it. You can look at it if you want, but it's hidden because it's not terribly important to see. So now we open up GenReg; this is the GenReg control panel. And for illustration (by the way, SVEM is really agnostic in terms of the method you use with it for model building; in fact, you could use methods that aren't even in GenReg, but that's our primary focus today) we're going to do forward selection. And because we only have so much time, I'm only going to do 10 reps. Click go and we'll create the model. So here is the output for only 10 reps, and you get an actual by predicted plot. But what's really nice about this is that it actually creates the ensemble model average for you. So I click on self-validating ensemble model, save prediction formula. I'm going to go ahead and close this display, come over, and there's my model. I can now take this model and apply it to the validation data. We've got the 16 runs that were selected, so there are another 28 available that can be used as a validation set to see how we actually did. At this point I'm going to go back to the presentation. And, by the way, without the SVEM add-in it can be a little bit difficult, especially if you aren't particularly proficient in JMP scripting, to actually create these models. So that add-in, even if it may not look like it, did an awful lot of work in the background. Okay, so how did we do? Let me put this back on slideshow. As I said, there are three designs: a 16-run design, a 13-run design, and then I picked a 10-run design; we really pushed the envelope. And there are two responses. As I said, glycoform 10 elutes late, tends to be noisy, and can be hard to resolve. So for the 16-run design, I actually fit a 40-term partial cubic model. Keep that in mind: 16 runs, and I fit a 40-term model. The key to this is the root average square error on the validation set. So I fit my 40-term model, and the root average square error on the validation set was low and the validation R squared was actually close to .98, so it fit very, very well. Notice that even for the much noisier glycoform 10, we still got a pretty good model. I'll show you some actual by predicted plots. We did this for the 13-run design and, once again, we got very good results. And then finally for the 10-run design. By the way, for 10 runs I just fit the standard full quadratic model; I felt like I was really pushing my luck, but I could have gone ahead and tried the partial cubic. And notice, once again, even with 10 runs, modeling glycoform 3 retention time, I got an R squared of .94. To me, that is really impressive, and again, my model had twice as many predictors as there were observations, and even with the difficult-to-model glycoform 10, we did quite well. So, here are some actual by predicted plots, and again I picked glycoform 10 because it's really hard to model. I didn't want to game the system too much; I know from experience it's hard to model. So there is an actual by predicted plot, and there's another actual by predicted plot for the 10-run design. It still does pretty well when you consider what we're trying to do here. And then finally, how did we do overall?
So this is a little table I put together. I took the 16-run design as the baseline, and then I'm comparing what happened when we fit models to the 10-run and 13-run designs for the retention time of G10. You'll notice that we got a rather large error for 13 and a much smaller one for 10. I don't know why; I chose to just show what the results were. It's a bit of a mystery, and I have not gone back to explore it. But for retention time 3, notice I go from 16 runs down to 10. Yes, I got some increase in the root average square error, but if you look at the actual by predicted plot, it still does a good job of predicting on the validation set. The point of this, and I work a lot with biotech companies, is the efficiencies you could potentially achieve in your experimental designs. I know for many of you these experiments are costly and, even more important, they take a lot of time. This really can shorten your experimental time, reduce your experimental budget, and get you even more information. So just a quick conclusion to this. This SVEM procedure, as Chris showed, is something we've investigated in simulations and case studies. SVEM is great for supersaturated models. I won't get into it, but from machine learning practice, supersaturated models are known to perform very well; they are used all the time in deep learning, as an example. Basically, as Chris says, SVEM is combining machine learning with design of experiments. And even more important, once you start going down the pathway of SVEM, that, in turn, informs how you think about experimenting. So basically, using SVEM, we can start thinking about highly efficient experiments that can really speed up the pace of innovation and reduce the time and cost of experimentation. With the lead times to develop products and processes getting shorter and shorter, this is not a trivial point, and I know many of you in biotech are also faced with serious budget constraints (by the way, most people are). And then finally, one promising area is Bayesian I-optimal designs. Again, I'll mention that Brad Jones and his group have done a great job introducing these in JMP; it's the only software package I know of that actually does Bayesian I-optimal designs. And we think SVEM is going to open up a window to a lot of possibilities. So basically, that is my story, and I'm going to turn it over to Wayne Levin. Wayne Levin Thanks very much for that, Phil and Chris. I hope everybody can see that an awful lot of work, a lot of research, has gone into this, including some work in the trenches, you might say. I've been in this game for over 30 years now, and I don't mind telling you I'm very excited about what I'm seeing here. I think this can be really transformative; it can really change how industrial experiments are conducted. I was previously very excited by supersaturated designs alone, and that was facilitated with the custom designer; that's on the design side of things. Now, when you bring in SVEM on the analysis side, those are two great tastes that go great together, you might say.
So anyway, just to conclude, I'm going to slide in here that Phil, myself and Marie Goddard are putting together an on-demand course (it should be available next quarter, in April) on mixture designed experiments, and we're going to focus a good amount of time on SVEM in that course. So if that's of interest to you, please let me know so we can keep you informed, or follow us on LinkedIn. And for the SVEM add-in, again, if I may, I'm delighted we were first to market with this. We've been working hard on it for a number of months. We do a new release about every month, so it is evolving as we get feedback from the field, and I want to thank Phil very much for leading this. He's been partnering with one of our developers (???) and it's really been a terrific effort, a lot of hard work, to get to this point, and we want to make that available to you. As Phil mentioned earlier, there are really two ways for you to try this. One is just to try the add-in itself. It works with JMP 14 and 15, and with 16 coming out, I think we're going to have it working for 16 as well. It does require JMP Pro, so if you don't have JMP Pro, there's a link here where you can apply for a JMP Pro trial license. You can get this link, of course, by downloading the slides for this presentation. So that's one way you can work with it and try it yourself. Another way is to just contact us, and we can work with you on a proof-of-concept basis. We could do some comparative analyses and so on, and of course review that analysis with you. We'll put the necessary paperwork in place; we don't mind trying it out. We've done that for a good number of companies, and that may be the easiest way for you, whatever works for you. Okay, for now, what I'd like to do is open it up for any questions or comments at this time. I'll say on behalf of Chris and Phil, thanks for joining us. So okay, questions and comments, please.
Phil Kay, JMP Senior Systems Engineer, SAS Peter Hersh, JMP Technical Enablement Engineer, SAS   Scientists and engineers design experiments to understand their processes and systems: to find the critical factors and to optimize and control with useful models. Statistical design of experiments ensures that you can do more with your experimental effort and solve problems faster. But practical implementation of these methods generally requires acceptance of certain assumptions (including effect sparsity, effect heredity, and active or inactive effects) that don’t always sit comfortably with knowledge and experience of the domain. Can machine learning approaches overcome some of these challenges and provide richer understanding from experiments without extra cost? Could this be a revolution in the design and analysis of experiments? This talk will explore the potential benefits of using the autovalidation technique developed by Chris Gotwalt and Phil Ramsey to analyze industrial experiments. Case study examples will be presented to compare traditional tools with autovalidation techniques in screening for important effects and in building models for prediction and optimization.   We were not able to answer all questions in the Q&A, so here goes...   I see the weights are not between 0 and 1; how are they generated? They are "fractionally weighted". This is the key trick that enables the method. In Gen Reg in JMP Pro we use this in the Freq role. You can use this add-in for JMP Pro 15 to generate the duplicate table with weighting. We can recommend a talk by Chris Gotwalt and Phil Ramsey to explain the deep statistical details of the weighting.   When you duplicate rows, are they duplicated with the same output result? What is the purpose of this duplication? Yes, they are duplicated with the same response. It would be a bad idea just to do that and progress with the analysis. The key innovation here is that the rows are fractionally weighted, such that they have a low weighting in one portion of the validation scheme and a higher weighting in the other. For example, a run with a low weighting in "training" will have a duplicate pair that has a high weighting in "validation". Some duplicate pairs will have more equal weighting in training and validation.   We fit a model using these weightings to determine how much of each run goes into training and how much goes into validation. Then we redraw the weights and fit another model. We repeat this process hundreds of times. We then use the results across all these models, either to screen the important effects from their proportion nonzero, or to build a useful model by averaging all the individual models. Is it possible to run the add-in in the normal, non-Pro, JMP 15 version? Quite possibly. You can run the add-in, but you will need JMP Pro to utilize the validation column in the analysis; this is critical. How does it work with categorical variables? We presented examples with only continuous variables. A nice feature is that it would work for any model and with any variable types. Is this a recognized method now? In a word, no. Not yet. Our motivation is to make people aware of this method and to stimulate those who have an interest to explore it and provide critical feedback. As we made clear in the presentation, we do not recommend that you use this as your only analysis of an experiment. If you like the idea, try it and see how it compares with other methods.
Having said that, the ideas of model averaging and holdback validation are recognised in larger data situations. It seems that this should be beneficial in the world of smaller data that is designed experiments. Do you do the duplicate step manually? No, this is done as part of the autovalidation setup by the add-in, or by the platform in JMP Pro 16. You could easily create the duplicate rows yourself. The harder part is setting up the fractional weighting, without which the analysis will not work.   Bayesian methods also use a lot of simulation (MCMC); would this also be applicable, for example in integrating prior and posterior distributions? Quite possibly. The general idea of fractionally weighted validation with bootstrap model averaging might well be applicable in lots of other areas.   Auto-generated transcript...   Speaker Transcript Phil Kay Okay, so we are going to talk about rethinking the design and analysis of experiments. I'm Phil Kay. I'm the learning manager in the global technical enablement team, and I'm joined by Pete Hersh. Peter Hersh Yeah, I'm a member of the global technical enablement team as well, and excited to talk about DOE and some of the new techniques we're exploring today. Phil Kay So the first thing to say is we're big fans of DOE. It's awesome. It's had huge value in science and engineering for a very long time. Having said that, there are some assumptions that we have to be okay with in order to use DOE a lot of the time, and they don't always feel hugely comfortable as a scientist or an engineer. Things like effect sparsity: the idea that not everything you're looking at turns out to be important, and actually only a few of the things that you're experimenting on turn out to be important. Effect heredity is another assumption. That means that we only expect to see complex behaviors, or higher order effects, for factors that are active in their simpler forms. And then there's just this idea of active and inactive effects: commonly, the sequential process of design of experiments is to screen out the active effects from the inactive effects, and that sometimes feels like too much of a binary decision. It seems a bit crazy, I think, a lot of the time, this idea that some of the things we are experimenting on are completely inactive, when really, I think, we know that everything is going to have some effect. It might be less important, but it's still going to have some effect. And these are largely assumptions that we use in order to get around some of the challenges of designing experiments when you can't afford to do huge numbers of runs. Pete, I don't know if you want to comment on that from your perspective as well. Peter Hersh Yeah, I completely agree. I think that treating something like temperature as inactive is hard; it's hard to imagine that temperature has no effect on an experiment. Phil Kay Yeah, it's kind of absurd, isn't it? So, yeah, if that's in your experiment, then it's always going to be active in some way, but maybe not as important as other things. So I'm just going to skip right to the results of what we've looked at. We've been looking at this auto-validation technique that Chris Gotwalt and Phil Ramsey essentially invented, and using that in the analysis of designed experiments, and it's really provided results that are just crazy. We just didn't think they were possible.
So, first of all, I looked at a system with 13 active effects, analyzing a 13-run definitive screening design, and from that I was able to identify that all 13 effects were active, which is what we call a saturated situation. Commonly, we've talked about definitive screening designs as being effective in identifying the active effects when we have effect sparsity, when there's only a small number of effects that are actually important or active. But in this case I managed to identify all of the 13 active effects. And not only that, I was actually able to build a model with all those 13 active effects from this 13-run definitive screening design. So again, that's kind of incredible; we don't expect to be able to have a model with as many active effects as we have rows of data that we're building it from. And Pete, you looked at some other things and got some other really interesting results. Peter Hersh Yeah, absolutely, and Phil's results are very, very impressive. I think the next step that we tried was making a supersaturated design, which means more active effects than runs, and we tried this with very small DOEs. So a six-run DOE with seven active effects, which, if we used standard DOE techniques, there would be no way to analyze properly. And we looked at comparing that to eight- and 10-run DOEs and how much that bought us. So we got fairly useful models even from a six-run DOE, which was better than I expected. Phil Kay Yeah, it's better than you've got any right to expect really, isn't it? So we've got these really impressive results: the ability to identify a huge number of active effects from a small definitive screening design and actually build the model with all those active effects. And in Pete's case, he was able to build a model with seven active effects from really small designed experiments. So, how did we do this? How does the auto-validation methodology work? Well, it's taking ideas from machine learning, and one of the really useful tools from machine learning is validation. Holdout validation is a really nice way of ensuring that you build the most useful model, a model that's robust. We hold out a part of the data, we use that to test the different models that we build, and basically, the model that makes the best prediction of the data that we've held out is the model that we go with, and that's really tried and tested. It's actually pretty much the gold standard for building the most useful models, but with DOE that's a bigger challenge, isn't it, Pete? It doesn't really obviously lend itself to that. Peter Hersh Yeah, the whole idea behind DOE is exploring the design space as efficiently as possible. And if we start holding out runs, or holding runs out of the analysis, then we're going to miss part of that design space, and we really can't do that with a lot of these DOE techniques like definitive screening designs. Phil Kay Right, right, so it'd be nice if there was some trick where we could get the benefits of this holdout validation and not suffer from holding out critical data. So that brings us to this auto-validation idea, and, Pete, do you want to describe a bit about how this works? Peter Hersh Absolutely, so this was a real clever idea developed by Chris and Phil Ramsey. They essentially take the original data from a DOE and resample it, so you repeat the results.
So if you look at the table here at the bottom of the slide, the runs in gray are the same as the runs in white; they're just repeated. And the way they get away with this is by making this weighting column that is paired together. So basically, if one copy has a high weight, the repeated run of it has a low weight, and so on and so forth. This enables us to use the data with this validation and the weighting, and we'll go into a little bit more detail about how that's done. Phil Kay Yeah, you'll kind of see how it happens when we go through the demos. So we've basically got two case studies of simulated examples that we use to illustrate this methodology. This first case study I'm going to talk through, and I emphasize it's a simulated example. In some ways, it's kind of an unrealistic example, but I think it does a really nice job of demonstrating the power of this methodology. We've got six factors, and to make it seem a bit more real, we've chosen some real factors from a case study where they were trying to grow a biological organism and optimize the nutrients that they feed into the system to optimize the growth rate. So we've got these six different nutrients, and those are our factors; we can add in different amounts of those. I designed a 13-run definitive screening design to explore those factors with this growth rate response. And the response data was completely simulated by me, and it was simulated such that there were 13 strongly active effects. So I simulated it so that all six of the main effects are active, and then, for each of those factors, the quadratic effects are active as well, so we've got six quadratic effects. Plus we've got an intercept that we always need to estimate, so there are 13 effects in total that are active, that are important in understanding the system. The effects were simulated with a strong signal to noise, but that's still a real challenge for standard methodology to model, and we'll come to that in the demo. So, really, the question is, can we identify all those important effects and, if we can, then can we build a model with all those important effects as well? Because, as I said, that would be really quite remarkable versus what we can do with standard methodology. And then case study #2, Pete? Peter Hersh Yeah, absolutely. Very, very similar to Phil's case study; the same idea, where we're feeding different nutrients at different levels to an organism and checking its growth rate. In this case I simplified what Phil had done and broke it down to just three nutrient factors. And this is building a different type of design, an I-optimal supersaturated design, where we're looking at a full response surface in a supersaturated manner, and we looked at six-, eight- and 10-run designs. So, same idea: the effects had a very, very high signal to noise ratio, so we really wanted to be able to pick out those effects if they were active. And just like Phil's, I kept the main effects and the quadratics active, as well as the intercept, and we're trying to pick those out. And same idea: how many runs would we need to see these active effects, and how accurate a model can we make from these very small designs? Phil Kay Yeah, because, like I said, you've really got no right to expect to be able to build a good model from such a small design. Peter Hersh Yeah, exactly. Okay. Phil Kay So I'll go into a demo now of case study #1.
And I'm presenting this through a JMP project; that's a really nice way to present your results, and I'd recommend trying it out. And that's our design: this is our 13-run definitive screening design, where we vary these nutrient factors, and we have the simulated growth rate response. As I said, that's been simulated such that the main effects and the quadratics of all of these factors are strongly active, plus we've got to estimate this intercept. Now, with a definitive screening design, I generally recommend you use Fit Definitive Screening as one of the analyses that you can do on the results. It works really well when the effect sparsity principle holds. So as long as only a few of the effects are active and the rest of them are unimportant, then it will find those few important effects and separate them from the unimportant ones. But in this case I wasn't expecting it to work well, and it doesn't work well. It does not identify that all six factors are active; in fact, it only identifies one of the factors as being active here. So that's not a big surprise; this is too difficult, too challenging a situation for this type of analysis. If somehow we knew that all of these effects are active and we tried to fit a model with all six main effects, all six quadratics and the intercept, then that's a saturated model. We've got as many parameters to estimate as we have rows of data, so we can just about fit that model, but we don't get any statistics. And in any case, aside from the fact that I've simulated this data, in a real-life situation we wouldn't know which ones are active, so we wouldn't even know which model to fit. Now, using the auto-validation method, I was able to very convincingly identify the active effects, and I'll talk through how we did this. And this is just a visualization of my results here; you don't necessarily need to visualize it in this way, this is for presentation purposes. I was able to identify that, first of all, the intercept was active. I've got all my six main effects and my quadratic effects, and then my two-factor interactions, which I simulated to have zero effect. You can see they are well down versus the other ones. And there's actually a null factor here that we use, a dummy factor. So anything less than the null factor we can declare as being unimportant, or inactive, if you like. The metric we're looking at here is something called proportion nonzero, and I'll explain what that means as we go through this. That's the metric we're using here to identify the strength, the importance, of an effect. So a bit about how I went through this. I took my original 13-run definitive screening design and then I set it up for auto-validation, so we've now got 26 rows; we've duplicated the data. There's an add-in for doing this; one of our colleagues, Mike Anderson, created an add-in that you can use to do this in JMP 15. In JMP 16 they're actually adding the capability in the predictive modeling tools, in the validation column platform. And what that does is give us this duplicate set of our data, and then we get this weighting, and as Pete said, each row is in the training set and in the validation set. If it has a low weighting in the training set, it'll have a high weighting in the validation set.
And if it has a high weighting in the training set, it'll have a low weighting in the validation set. These weights have basically been randomly assigned, and what we actually do is reassign them and iterate over this hundreds of times, fitting the models each time and then looking at the aggregated results over many simulation runs. So what you would do is fit the model, and I'm using GenReg here in JMP Pro. You'll need JMP Pro anyway, because you need to be able to specify this validation role, so the train/validation column goes into validation. The weighting goes into frequency, and then we set up everything else as we normally would with our response. Then I've got a model, which is the response surface model here with all these effects in, and I click run. It will fit a model, and we can use forward selection or the Lasso. Here, I've used the Lasso; it's not hugely important in this case. And what's actually happened is that we've identified only the intercept as being important in this case, so we've only actually got the intercept in the model. But if we change the weighting, if we go back to our data table and resimulate these weightings, we will likely get a different result from the model. Weighting different rows of data, different runs in the experiment, changes the model that's fit. So we're going to do that hundreds of times over, and what I'm going to do is use the simulate function in JMP Pro. What we do is switch out the weighting column and switch in a recalculated version of the weighting column. And you can do that a few hundred times; I actually did it 250 times in this case. I'm not going to let that run now, because that would take a minute or two. Once you've done that, what you'll get is a table that looks like this. So now I've got the parameter estimate for every one of those 250 models for each of these effects. In my first model in this run, this was the parameter estimate for the citric acid main effect. In the next model, when we resampled the weighting, the citric acid main effect did not enter the model, so it was zero in that case. And you can run distributions on all of these parameter estimates. One of the things you can do is customize the summary statistics to look at the proportion nonzero. So you can see the intercept here and the estimates that we've had of the intercept. You can see that with citric acid, a lot of the time it's been estimated as being zero, so in those models the citric acid main effect was not in the model, and a lot of the time it's been estimated at around about 3, which is what I'd simulated it to be. So what we look at is the proportion of times that it is nonzero, and we can make a combined data table out of those. I've already done that, and done a little bit of additional augmentation here; I've just added a column for whether it's a main effect or not, and that was how I created this visualization here. So what you're looking at is the proportion of times each of those effects is nonzero, the proportion of times that each of the effects is in our model over all those 250 simulation runs we've done, where we've resimulated the fractional weighting. And that's what we use to identify the active effects, and it's done a remarkable job.
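For readers who want to see exactly what the "proportion nonzero" summary is computing, here is a small, purely illustrative Python sketch (the array and effect behavior below are hypothetical; in JMP Pro this number comes from Simulate plus the customized summary statistics just described):

import numpy as np

def proportion_nonzero(estimates, tol=1e-12):
    # estimates: one row per refit with resimulated autovalidation weights,
    # one column per effect; returns the fraction of refits in which each
    # effect entered the model (i.e., had a nonzero coefficient).
    return np.mean(np.abs(np.asarray(estimates)) > tol, axis=0)

# Hypothetical example: 250 refits of three effects, where the first effect
# is almost always selected and the last one almost never is.
rng = np.random.default_rng(0)
kept = rng.uniform(size=(250, 3)) < [0.95, 0.5, 0.05]
estimates = rng.normal(loc=3.0, size=(250, 3)) * kept
print(proportion_nonzero(estimates))    # roughly [0.95, 0.5, 0.05]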
It's been able to do what our standard methods would not be able to do. It's identified 13 active effects from a 13-run definitive screening design. Now, what would you want to do next? We maybe want to actually fit that model with all those effects, and I've been able to do that. I'm comparing the model that I've fit here versus the true simulated response, and you can see how closely they match up. So I've been able to build a model with all these main effects, all these quadratics and the intercept. I've got a 13-parameter model here that I've been able to fit to this 13-run definitive screening design, which again is just remarkable. And I'm not going to talk through exactly how I got to that part. I'll hand over to Pete now; he's going to talk a bit more about this idea of self-validated ensemble models. Peter Hersh Absolutely. Thank you, Phil. Let's see, I'm going to share my screen here and we'll just take a look at this project. You can see here, in the same flow as Phil, we're looking at a project, and I have started with that six-run supersaturated DOE. Here you can see I have three factors, what my actual underlying model growth rate is, and then what the simulated growth rate was. And then, like Phil mentioned, I create this auto-validation column, which can be done with an add-in in JMP 15 that Mike Anderson developed. Or in JMP 16, it's built right into the software, and you can access that under Analyze > Predictive Modeling > Make Validation Column. So, just like Phil showed, he showed an excellent example of how we can find which factors are active, so a factor screening. That is oftentimes our main goal with DOE, but if we want to take it a step further and build a model out of that, we'd go through the same process, right? So we build our DOE, we get auto-validation added to that DOE, we build our model, just like Phil showed, using generalized regression and one of the variable selection techniques. Phil Ramsey and Chris Gotwalt have looked at many of these different techniques and they all seem to work fairly well. So whether you're using the Lasso or even a two-stage forward selection, they all seem to have similar results and work fairly well. So once you set this up and launch it, you get a model, like Phil had shown, and some of these models will have
David Wong-Pascua, Senior Scientist, CatSci Phil Kay, JMP Global Learning Manager, SAS Ryan Lekivetz, JMP Principal Research Statistician Developer, SAS Ben Francis, JMP Systems Engineer, SAS   Imagine a DOE problem with a huge number of possible treatments: 5 categorical factors with 3, 4, 4, 8 and 12 levels, respectively. The full factorial experiment is 4,608 runs. You have a high-throughput system, so you can carry out 48 runs at the same time. Perhaps the first experiment is obvious: use the JMP Custom Design platform to generate a 48-run design to estimate main effects. Fitting a model to the response (yield), you find that all factors are significant, as expected. So what should you do next? What is an efficient and effective approach to finding the optimum and having confidence in that result?   This is not a typical screening-then-optimisation sequential DOE situation, because there are no unimportant factors that can be dropped after the initial screening design. Also, 2nd-order (and higher) interactions are likely to be important, but estimating those models requires hundreds of runs.   In this paper you will find out how we approached this problem using sequential approaches to seek optimum regions in an overwhelmingly big possibility space, while also balancing that with maximizing learning. We will compare this with alternatives, including a Custom Design approach to characterising the whole system. You will see that the solutions we found were only possible because of the unique interactive and visual nature of JMP and the new Covariate/Candidate Run capability in version 16.   You can find the example data that we used and the reference in this JMP Public post.   Auto-generated transcript...   Speaker Transcript David.Wong-Pascua Okay, so: Sequential and Steady Wins the Race. Hello everyone, I'm David and I'm a senior scientist at CatSci. So what is CatSci? CatSci is an innovation partner for end-to-end chemical process research and development. We collaborate with organizations like JMP UK to ensure that all of our decisions are driven by data. For us it's about getting it right the first time and getting it right every time; formulating bespoke, perfect-for-purpose solutions for every project; and our absolute focus is delivering to the expectations of our clients. We work in a number of areas, including early- and late-stage development. We possess process expertise and a long history of catalysis experience, which has always been a core feature of our identity, and we have excellent capabilities in analytical chemistry and material science. At CatSci, one of the challenges that we face is the selection of categorical variables upstream of process optimization. This difficulty is most prominent in our catalyst screening work. In the images on the slide, we can see some of the apparatus and instruments that we rely upon to perform catalyst screening. On the left, we have premade catalysis kits that contain 24 predosed combinations of metal precursor and ligand. In the center is a hydrogenator that houses a 48-well plate, which can be seen on the right. This allows 48 individual experiments to be conducted simultaneously. These instruments are designed for high throughput and small scale, and it isn't uncommon to run 10 or more of these plates for our clients, which totals almost 500 experiments. So why do we need to run a relatively large number of experiments, and what are the benefits of improving efficiency?
So, first of all, the chemical space that needs to be explored is extremely large.   For any chemical reaction there are many factors that can have an effect, such as temperature, concentration and stoichiometry.   However, what makes exploration particularly challenging for a catalytic reaction is that there are many important categorical factors; these include metal precursor, ligand, reagent, solvent, and additive.   Another characteristic of catalytic reactions is that there are frequently high-order interactions. What this means is that, for a catalyst system (a particular combination of metal precursor and ligand),   there need to be extensive solvent and additive screens to gauge performance.   It isn't correct to rule out a certain combination of metal precursor and ligand just because it performs poorly in a certain solvent; it might be the case that you don't have the right solvent or the right additive for it.   Lastly, one of the easiest ways to explain the benefit of a more efficient screening method is to look at the price of materials.   Many of our catalysis projects involve precious metals, such as platinum, palladium, ruthenium, iridium, and rhodium. Some of these metals are among the rarest elements in the earth's crust, and that's reflected in their price.   From just a quick Google search of their prices, you can see here that rhodium is more than an order of magnitude more expensive than gold.   Another large expense of catalysis screening is the ligands. While cheap ligands do exist, many of the top-performing ligands for asymmetric synthesis are proprietary, such as Josiphos from Solvias,   and one of the most commonly available and cheapest, J001, is still twice the price of gold.   Reducing the number of experiments can therefore decrease the cost of consumables, but the biggest impact comes when we find the optimal solution for a client with a multiton manufacturing process.   In these cases, a tenfold reduction in catalyst loading or the discovery of an alternative, cheaper catalyst (for example, exchanging a rhodium catalyst for a ruthenium one) can save millions of dollars per year.   And now I'm going to hand over to Ben, who will discuss how the JMP UK team have tackled this problem. Ben Francis So thanks very much, David.   It's clear that   CatSci and many other companies like CatSci face big challenges with large formulation spaces,   which can have very high-order interactions. The science is very complex, and how we experiment to unravel that complexity is important.   Unfortunately, the option to just test every single combination isn't a sensible or viable one, due to the costs that David outlined.   Ultimately, we need some form of approach involving data science to understand the relationships numerically.   And in order to apply the chosen formulation in a manufacturing scenario, we need to know it will work each time, as David said.   So we searched the literature, because at JMP we're very interested in applications of DOE, and we felt this was a particular kind of DOE situation.   As we aptly named it, it's a big DOE. We searched for a case study to apply a big DOE approach to, and we found a paper by Perera et al.,   published in 2018, in which they looked at a platform for reaction screening. What was interesting in this scenario is that they had five categorical factors.   
And this may not sound like too much. They had to look at Reactant 1, Reactant 2, ligand, reagent and solvent, but the complexity presented in this paper was in the number of levels of each factor.   Four, three, twelve, eight and four levels, respectively, which combined come to 4,608 distinct combinations.   Now, in the Perera et al. paper they tested every single one, which obviously is not a viable approach for every company in the world, as David outlined before.   You can see from the heat map that they were able to identify some high-yielding regions,   some less high-yielding ones, and some very low ones as well. So there was a lot of testing and a very high experimental cost, and ultimately not much process understanding.   This ended up in a situation where they got to a combination which they were then able to report in the paper. It was right the first time, but was it right every time from that point onwards? If they put that combination into a multiton manufacturing scenario, would it continue to meet the required specification?   That framed the questions we asked of our own approach: does the big DOE solution reduce resource requirements?   Does it get it right first time, and does it get it right every time? The way we measured these questions   was as follows. For reducing resource requirements, we only allowed ourselves to design an experiment with 480 runs in total, mimicking ten 48-well plates; this is around 10% of the runs performed in the paper.   Getting it right first time was measured by looking at how many of the 80 high-yield combinations identified in the paper were either tested or predicted   by the model we generated (high yield here means a value of more than 95%). With 10% of the runs we would expect, by chance alone, to hit about 10% of those 80 combinations, so we set a bar of at least eight high-yield combinations coming out of our DOEs.   Finally, and crucially, getting it right every time was measured by looking at the general diagnostics of the model, the R squared values. Ultimately, can we look at this and see process insight? Do the relationships make sense? Do the combinations and interactions make sense?   To give ourselves a par for the course, we used the full data set of 4,608 runs and found that its R squared was .89 with a boosted tree method.   So now I'm going to hand over to Phil, who will outline the first big DOE approach that we took. Phil Kay Thanks, Ben; thanks, David, for setting the scene so nicely. So we looked at three different approaches   to tackling this, to see if we could take a reduced number of the runs that were actually tested in that paper and still gain the same useful information, as Ben outlined.   I'm going to talk through two of those three approaches, and these two are the extremes: a conservative approach and a greedy approach. Then Ryan will talk through the more hybrid, balanced approach that we thought was probably going to work best ultimately.   The conservative approach is a standard DOE approach: we design a 480-run custom design with 10 blocks, representing 10 plates in the catalyst screening equipment that David showed us.   
And the headline results were, from that 480-run experiment, we were able to correctly predict that 13 of the highest yielding combinations would be high yielding ones.   And we were able to build a model from that 480 run data set that, when we tested it against data that hadn't been used in that experiment,   the R squared was about .73, so pretty good actually. We get a fairly good model, fairly good understanding of the system as a whole from that smaller experiment on...out of the big experiment that was published.   Some really interesting insights we got from this were that we needed to use machine learning to get useful models from this data, and that was a bit of a surprise to us. I'll talk through that in a bit more detail.   And we got some good understanding from this data that we collected using this conservative approach. It gave us a good understanding of the entire factor space, why certain combinations work well.   It enabled us to identify alternative regions, so if we find a high yielding combination, but it's not practical because it's too expensive,   the catalyst are too expensive, or maybe they're dangerous or toxic and not amenable to an industrial process, then we we've got lots of alternatives because we've got an understanding of the full space.   So, how did we do that?   Custom design in JMP. We   designed a 480 run, I optimal design, in this case, in 10 blocks. So again, the 10 blocks representing 10 48 wellplates that David showed us the equipment for.   Of note here, we specify a model in custom design that is for the main effects.   for the second order interactions and the third order interactions. And with the third order interactions and all those levels of these categorical factors,   there's just a huge number of parameters that need to be estimated, so we couldn't actually design an experiment to estimate all of those explicitly. We need to put the third order interactions in as, if possible, so this is a bayesian I optimal design in 480 runs.   You can see from this visual, we're just looking through the different rounds of the experiments, of the different...10 different plates.   And you can see how, with an addition of the next plate in the experiment, we're adding to our exploration of the factor space as a whole.   We're not focusing on any specific regions; we're just filling in the spaces in our knowledge about that whole factor space.   So that's 480 runs that we used in our experiment and the remaining 4,128 rows that we have from that published data set. We can use this test data to see how well our models compare on a heldback data set, this test data.   Now we expected from all our sort of DOE background and our experience that   when we design an experiment for a certain model then, standard regression techniques are going to provide us with the most useful models.   So we were using JMP Pro here and we use generalized regression to fit a full third order model.   Now we actually use a beta distribution here, which bounce predictions between zero and one, so we convert percent yield to the scale of zero to one. It's just a little trick that when you're modeling yield data like this, you might might want to use that when you've got JMP Pro.   And we use forward selection, with AICc as the validation method, so that's going to select the important effects.   
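As an aside for readers who want to see the mechanics of that selection step outside of JMP, here is a minimal Python sketch of forward selection using corrected AIC (AICc) as the stopping rule. It only illustrates the general idea with an ordinary least-squares fit; it is not the Generalized Regression implementation, and the effect columns passed in are assumptions of the sketch.

import numpy as np

def aicc(y, X):
    """Gaussian AICc for an ordinary least-squares fit of y on X (X includes the intercept column)."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(np.sum((y - X @ beta) ** 2))
    p = k + 1                                           # coefficients plus the error variance
    aic = n * np.log(rss / n) + 2 * p
    return aic + 2 * p * (p + 1) / max(n - p - 1, 1)    # small-sample correction (guarded denominator)

def forward_select(y, candidates):
    """candidates: dict mapping an effect name to its column vector. Greedy forward selection on AICc."""
    n = len(y)
    selected, X = [], np.ones((n, 1))                   # start from the intercept-only model
    best = aicc(y, X)
    improved = True
    while improved:
        improved = False
        for name, col in candidates.items():
            if name in selected:
                continue
            trial = np.column_stack([X, col])
            score = aicc(y, trial)
            if score < best:                            # keep the best single addition in this pass
                best_name, best_X, best = name, trial, score
                improved = True
        if improved:
            selected.append(best_name)
            X = best_X
    return selected

In JMP Pro the same choices are simply options in the Generalized Regression launch; the sketch just makes the selection loop explicit.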
And we repeat this after each round, so with the first round of 48 runs, then the second round of 96, then the third with 144, all the way up to the 10th round where we've got the full 480 runs.   Here you can see the the filter set on gen reg there just to look at the first three rounds there.   So for that first three rounds, this is the model that we get using gen reg, forward selection, AICc validation.   And it's not a great model, I think, just intuitively from our understanding of what we'd expect from the chemistry. We're only seeing two of the factors involved in the model so it's only selected two main effects as being significant.   So we can save that model to the data table, and then we can see how well that model predicts the heldback data, our test set,   the remaining 4,128 rows that were not in our experiment. And we can do that for each round. And we did that for each round, so we've got 10 models that we've built using gen reg.   Now, as an alternative, we thought it'd be worth looking at some machine learning methods. And here, we're just looking at the first three rounds again.   And as a really quick way of looking at lots of different modeling techniques, we using model screening in JMP 16 Pro.   We just fill in the yield and the factors in the Y and the X's there. We've got holdback validation column set up to hold back 25% of the data randomly for validation.   Click OK, and it runs through all these different modeling techniques, so we get a report now that tells us   how well these...all these different modeling techniques, including some gen reg models, neural nets and different tree methods.   And we've selected the two best there. And in fact across all the rounds of the experiment, we found that neural...boosted neural net models and bootstrap forest were the best performing models.   So again, we can save the prediction formulas back and use those on the test data set that wasn't involved in the experiment to compare them and see how well our different models are doing against that that test data set.   And we use model comparison and JMP Pro here, so we compare all our gen reg models, our neural net models, and our bootstrap forest models.   And now you're looking at a visual of the R squared against that test data set. How well those models fit against the heldback test data set.   And you can see the gen reg, the standard regression models using forward selection in red there, starting off   in the first round, that's actually the best model, but not really improving, and then suddenly going very bad around five, before becoming really good again in the next round. So very unstable and towards the end getting worse, probably overfitting.   The machine learning methods, on the other hand, are,   after round three, they're they're performing the best, performing better than our standard regression techniques and they're improving all the way. And in fact,   you can see that neurals start off better in green there, and towards the end, the bootstrap forest is outperforming the the boosted neural net models.   So the take home method here is even with this, what you might consider a relatively simple situation, you really need the machine learning methods in JMP Pro to get the best models, the most reliable models for these kind of systems or processes.   Now we're just looking at plots of the actual yield results for all those 4,608 runs that were published versus the predictions of our best model at the end of round 10, after the full 480 runs there. 
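As an aside, the step of holding back every run that was not in the experiment and scoring each candidate model against it can be reproduced in a few lines. This scikit-learn sketch is only an illustration: it assumes the factor columns are already numerically encoded, and the model families and settings are placeholders rather than the Model Screening platform's choices.

import numpy as np
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import r2_score

def compare_on_test(X_train, y_train, X_test, y_test):
    """Fit a few model families on the experimental runs and score each one
    against the held-back runs that were not part of the experiment."""
    models = {
        "random forest": RandomForestRegressor(n_estimators=300, random_state=1),
        "boosted trees": GradientBoostingRegressor(random_state=1),
        "neural net": MLPRegressor(hidden_layer_sizes=(20,), max_iter=5000, random_state=1),
    }
    return {name: r2_score(y_test, m.fit(X_train, y_train).predict(X_test))
            for name, m in models.items()}

The dictionary of test R squared values plays the same role as the model comparison report described above.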
So that was a bootstrap forest on the right.   And it's not bad, you know you can see, the important behaviors it's picking up. It's identifying the low yielding regions and it's identifying the highest yielding regions. So there's,   particularly up in the top right, they're using an iodine(?) reactant and the boronic acid and boronic ester   reactants there.   So, then, we compare this with the other extreme approach. So the first round of 48 runs was exactly the same; we start with the same 48 runs. Then we build a model and we use that model to find out which, out of all the remaining runs,   out of the full remaining 4,560 runs there are, which of those does the model predict would have the highest yield, and then we keep doing that in all the subsequent rounds up to 10.   And in this case we correctly predict more of the high yield and combinations, but when we get the final model after 10 rounds and compare that on the data   That we haven't used in the experiment, the R squared is terrible. It's .33. So really the model we get is not making good predictions about the system as a whole.   So we're getting good understanding of one high yielding region and it's finding combinations, finding settings of the factors that work well, but it's not really enabling us to understand the system as a whole. We're not...   we're not reducing our variability in our understanding of the system as a whole.   So just to show you how that system, that approach works, this is our full 4,608 runs, all the combinations that are possible, And, as I mentioned in Round 1, that's the same as as Round 1 using the conservative approach so that's an I optimal   set of points there, and you can see they're evenly distributed across the factor space.   We fit a model to the data from those 48 runs.   And we use that model to predict the yield of all of the other possible combinations and then we select out of those the 48 highest yielding predictions,   which is those ones. So you can see, they are focused on the higher yielding region of the...we're not...we're no longer exploring the full factor space.   And if we go from Round 1, 2, 3 and up to Round 10, you can see, each time we're adding...we're adding runs to the experiment that are just in that high yielding zone up in the top left. So we're only really focusing on one region of the factor space.   And if we compare the actual versus predicted here, you can see that it's doing a pretty good job in that region in the top left there.   But elsewhere it's making fairly awful predictions, as we might expect; that's probably not that surprising. We haven't experimented across the whole system. We really have just focused on that that high yielding region.   And if you compare it against our conservative approach on the left hand side (so you've already seen the plot on the left hand side there) if you look at how the model   improved through Rounds one to 10, and then each round we were using the model screening platform again to find the best model.   And it turned out that at each stage actually, the boosted neural net was the best model. Well, you can see again it's kind of unstable like the gen reg from the conservative approach.   It's, at times, it's okay, but it's never...it's never competitive with our best models from the conservative experiment.   So those are the two extremes, and I'm going to pass over to Ryan, who's going to talk about a more balanced or hybrid approach to this problem. Ryan Lekivetz So thanks for that Phil. 
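The greedy strategy just described is easy to state as pseudocode. The Python sketch below is a conceptual outline, not the workflow that was actually run in JMP; it assumes a helper fit_model(X, y) that returns a fitted model with a predict method, and a y_all vector that stands in for actually running the selected wells.

import numpy as np

def greedy_rounds(X_all, y_all, first_round_idx, fit_model, n_rounds=10, batch=48):
    """Greedy sequential DOE: after each round, refit on everything run so far and
    then 'run' the 48 untested candidates with the highest predicted yield."""
    tested = list(first_round_idx)                       # round 1 is the shared I-optimal start
    for _ in range(n_rounds - 1):
        model = fit_model(X_all[tested], y_all[tested])  # only rows already run are used for fitting
        untested = np.setdiff1d(np.arange(len(X_all)), tested)
        preds = model.predict(X_all[untested])
        best = untested[np.argsort(preds)[-batch:]]      # the 48 highest predicted yields
        tested.extend(best.tolist())
    return tested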
And   just actually in the middle of a presentation, gotta say thank you to Phil, Ben, and David for allowing me to take part in this. As a developer it's actually kind of fun to play around a bit. So I'm a developer on the DOE platforms in JMP.   So as Phil kind of mentioned here, the approach that I'm looking at here is a hybrid approach.   Alright, so the idea here is we're going to be looking at this idea of optimization versus exploration.   So, if you think of the greedy approach that Phil talked about, that's really that optimization. I'm going strictly for the best, that's all I'm going to focus on. Whereas that conservative approach, all that that cared about was exploration.   And so the idea here is that, can we get somewhere in the middle to   perhaps optimize, while still exploring the space a little bit?   So I'll say what what I had done here...so I actually started with kind of three of the well plates set ahead of time, right. So I did those three blocks of size 48 and I just did main effects with, If Possible 2FIs.   The reason for that is that sometimes, I didn't really trust the models with just 48 runs. I said, well, let's give it a little bit of burn in. Let's try the 144, or three well plates, so then maybe I can trust the model a little bit more.   Because in some sense, even if you think of that greedy approach, if your model was bad, you're going to be going to the terrible regions. It's not going to help you an awful lot.   And again so...so as before, I'm still going to be doing 10 total well plates or 10 total blocks in the end.   And I'll talk about this a little bit in the future slides, but so I was actually using XGBoost throughout. So the way I was predicting was using XGBoost and there's a reason i'll get to   for doing that.   But, so the idea here was to essentially take what the predictive value was. So instead of going strictly after the best 48, I said, I'm going to set a threshold.   Right, so after the first three, I started at 50. So I said, give me the predictive values that are greater than 50 and as I keep going on further well plates, I'm going to increase five each round.   I mean that was pretty arbitrary; part of that was was knowing that we only had 10 rounds and total, and that's also because I could kind of see based on the prediction how big was that set.   Right, so depending on the data, if it looked like that set was still huge, that the predicted was about 50, you could modify it that.   And so the idea here is really just to to slowly start exploring what should hopefully be the higher yield regions.   And so, were we right? Well, so in this, we actually got 27 out of the 80 highest yielding combinations, those above 95, which wasn't that far off of of the greedy approach.   And if you think of the R squared, again, with all that holdout data, to be that the final test R squared there was .69, so which really wasn't even that far off from the from the conservative approach. So just the...   I mean, you may be...you may want to be cautious, right now, just looking at those two numbers, but based on that, this hybrid really actually does seem to be kind of the best of both worlds. I mean, you're trading off on it, but, but you are getting something good.   So just some insights on this approach. Really, I would say, in the end, that we did get a better understanding of those factor spaces associated with high yield.   
I think the greedy, if you can think back to where those points were, they were really concentrated in that upper left so it didn't really do much more exploration of that space.   And you can imagine if...   if something in there suddenly quadruples in price overnight, now you don't know anything else about the rest of the region.   Right, and so with this hybrid approach, we get some more of that how and why, some of the combinations, and it gives us better alternatives   again that, maybe if you have other things to factor in, you're willing to sacrifice a tiny bit of the yield because there's better alternatives, based on other measurements.   And so, really, the idea here...yep, so so let me just get into how this was actually done in the custom design, right. So the approach I took was   using covariates in the custom designer, which I think it's kind of underappreciated at times.   But, so the idea is I just took that full data set, so the whole 4,608 runs,   and for each round...so you can imagine, at first, I took those...I took the 144 run design, and then I said, well, I'm going to take a subset of that full data set. If i've already chosen it in a previous design, well, I need to make sure that that's included.   And then also, give me the predicted yields above that certain value, so again, for the fourth design, that was started at 50 and 55, 60, etc.   So in the data table   now, what I do is I select the design points that were used in previous experiments. So I make sure that those are highlighted. And then custom design, when you go under add factor, there's an option there called covariate. And again, I think that it's...   A lot of people have never really seen covariates before and, I mentioned here, because I think I could probably spend an entire talk just talking about covariates,   but there's going to actually be a couple of blog entries (if they're not up by the time you see this video, probably within a few days by the time you're watching this)   where I'm going to talk a little bit more about how to really use covariates. But the idea is, you may have also heard something called a candidate set before, so the idea is...   here's this data set that was the subset of the previously chosen and everything like that, and it's telling the custom designer you can only pick these design points.   You can't do the standard way of, we're going to keep trying to switch the coordinates and everything like that. It's telling the JMP custom designer, you have to use these runs in your design and, in particular,   one of the options using covariates is that you can force in previously chosen design points, so in this way it's actually like doing design augmentation   using a candidates after using covariates.   And, of course, because in this...in this case, I can't really throw out those design points that I've previously experimented on. Really, all we're wanting to do is to pick the next 48.   And I should just mention as well here, though, if you think of the the custom design...   So in each of these, what I was doing was...so the two factor interactions and then three factor interactions if possible.   But there was...that was really more of a mechanism to explore the design space, right. The model that I was fitting   was using XGBoost after the fact, so I mean that design wasn't optimized for an XGBoost model or anything like that.   
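Here is a rough numpy sketch of the two ideas Ryan combines: restrict the candidate set to rows whose predicted yield clears the round's threshold (starting at 50 and rising by 5), force in the previously run points, and then choose the next 48 rows so the augmented design stays well conditioned. The greedy log-determinant exchange below is only a stand-in for the custom designer's optimal-design algorithm, and the model-matrix encoding F_all is an assumption of the sketch.

import numpy as np

def augment_from_candidates(F_all, predicted, prior_idx, round_no, batch=48):
    """Pick the next `batch` runs from a thresholded candidate subset of the full space.
    F_all: model-matrix encoding of all 4,608 combinations (one row per combination).
    predicted: the current model's predicted yield for every row.
    prior_idx: rows already run, which are forced into the design.
    round_no: 0 for the first augmented plate, 1 for the next, and so on."""
    threshold = 50 + 5 * round_no
    candidates = np.setdiff1d(np.where(predicted > threshold)[0], prior_idx)

    chosen = list(prior_idx)
    ridge = 1e-6 * np.eye(F_all.shape[1])                # keeps F'F invertible in early rounds
    for _ in range(batch):
        M = F_all[chosen].T @ F_all[chosen] + ridge
        best_gain, best_row = -np.inf, None
        for r in candidates:                             # add the candidate that most increases log det(F'F)
            f = F_all[r]
            gain = np.log(1.0 + f @ np.linalg.solve(M, f))
            if gain > best_gain:
                best_gain, best_row = gain, r
        if best_row is None:                             # candidate set exhausted
            break
        chosen.append(best_row)
        candidates = candidates[candidates != best_row]
    return chosen[len(prior_idx):]                       # just the newly selected rows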
This was more as a surrogate to say, explore that design space where the predicted yield look good and, especially, because this was dealing with the categorical factors, it's going to make sure that it tries to get a nice balance. One other thing to mention,   if you wanted to...again, because it's not using that underlying model, the XGBoost or anything like that, you could use the response. So in this case, they predicted yield as another covariate, and what that would do is to try and get a better balance of both high and low yield   responses when it's trying to make the design.   So we didn't explore that here, but I think that would be a valid thing to do in the future.   So, to the predicted yield. So, as I mentioned in...so I think when Phil was talking about some of the other machine learning techniques he had tried   For my ???, I actually focused mainly on XGBoost.   And if you look in the on the left side, one of the reasons I had done that was to try   using K folds cross validation, but where I use the design number, which would essentially be the well plate as each additional fold. So instead of randomly selecting the different K folds,   it would say, I'm going to look at the well plate and I'm going to then try to hold that for out for each one. So if you imagine, for the first one, when I only had three well plates,   it was going to take each of those well plates, hold it out, and then see how well it could predict, based on the remaining two. And so by the time I would get to the tenth...the tenth well plate, it would actually be kind of like a tenfold.   So part of my reasoning for that was to think about trying to make that model robust to that next well plate that you haven't seen.   And, in reality, this would actually also perhaps protect you if something happened with a particular well plate,   then, maybe get a little bit of protection from that as well.   And another thing I had done with XGBoost is when you launch the XGBoost platform, like through the add-in, you'll see there's a little checkbox up at the top for what we call a tuning design.   And so, with that tuning design,   it's actually fitting a space filling design behind the covers and running all of that for the different hyper parameters.   So what I had done for for each time I was doing this,   I would pick a tuning design for 50 runs, but just kind of explore that hyper parameter space. And so the model I would choose that each round   was the one that had the best validation R squared. And so again, the validation R squared in this case is actually using only the design points that I have, right. So this was doing that K fold with the...   with those different well plates. So in this case, the validation R squared is not...I'm not using that data that hasn't been seen yet, because again, in reality, I don't have that data yet.   But this was just saying using that that validation technique, where I was using the well plates. So again, that's kind of the reason that I was using XGBoost for most of this, I was just kind of trying something different that   using that K folds technique, and as well, I really just like using the space filling as well, when it comes to that XGBoost.   And so let's take a look here. So so we see here, these are the points that get picked throughout the consecutive rounds. Now,   if you think of what was happening in the greedy approach, I mean, it was really focusing up in that top left.   
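The plate-wise validation Ryan describes maps directly onto group-based cross validation. The sketch below uses the open-source xgboost package with scikit-learn's GroupKFold so that every fold holds out one whole well plate, and it replaces the add-in's space-filling tuning design with a simple random search; the column layout and the hyperparameter ranges are assumptions, not what was actually run.

import numpy as np
from sklearn.model_selection import GroupKFold
from sklearn.metrics import r2_score
from xgboost import XGBRegressor

def plate_cv_r2(X, y, plate, params):
    """Cross-validate XGBoost where each fold holds out one well plate, so the
    score reflects how well we predict a plate that has not been seen."""
    cv = GroupKFold(n_splits=len(np.unique(plate)))
    scores = []
    for train_idx, test_idx in cv.split(X, y, groups=plate):
        model = XGBRegressor(**params).fit(X[train_idx], y[train_idx])
        scores.append(r2_score(y[test_idx], model.predict(X[test_idx])))
    return float(np.mean(scores))

def pick_hyperparameters(X, y, plate, n_trials=50, seed=0):
    """Stand-in for the tuning design: sample hyperparameter settings at random
    and keep the one with the best plate-wise validation R squared."""
    rng = np.random.default_rng(seed)
    best_score, best_params = -np.inf, None
    for _ in range(n_trials):
        params = dict(
            n_estimators=int(rng.integers(100, 600)),
            max_depth=int(rng.integers(2, 8)),
            learning_rate=float(rng.uniform(0.02, 0.3)),
            subsample=float(rng.uniform(0.6, 1.0)),
        )
        score = plate_cv_r2(X, y, plate, params)
        if score > best_score:
            best_score, best_params = score, params
    return best_params, best_score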
Right, but now, if you see with this hybrid approach,   as it starts to move forward into the rounds, you see it fill in those points a little bit more, but we also get a better... a better sampling of points, especially on that left side.   Right, so if you think of from the perspective, now you have a lot better alternatives for those high yield regions.   I think so...if we take a look now at the the actual versus predicted,   I actually think this was kind of telling as well, I think, because we did better exploration.   In those high yield regions that maybe   didn't get quite as high, I think, if you look to the lower left, I think we've done a much better job in the hybrid model of picking up some of those which may be viable alternatives.   One of the nice things with this method as well that was that I can focus on more than just the yield.   Right, so if you think in...   I could have a secondary response, whether it be some kind of toxicity, it could be cost, anything like that that,   I can focus on more than just the yield when I'm looking at those predicted values. And so, if you think of something like the greedy approach,   if you start adding in secondar...secondary criteria that you want to consider, it's a lot harder to start doing that balancing. Whereas with this subset approach, it's really not difficult at all to add in a secondary response in doing that.   I think I'll hand it over now to Ben for conclusions. Ben Francis So,   thank you very much to David, Phil and Ryan for giving a background of why we're approaching this, and then ultimately, giving us two, three very good solutions and ways to   view how we tackle this problem. And in all of these approaches, it's clear that JMP is providing value to scientists. So we've got straightforward, upfront, we've got tenfold reduction in resources for experimentation. We only used 10% of the runs. And   ultimately, what we were enabling in each situation was the ability to meet specifications and provide a high yielding solution from the experiments.   But in addition to that, we also showed how we could have different levels of deep process understanding, depending on the goal and strategy of the company employing this approach.   Now, you may notice that, we three from JMP, we're very interested in how CatSci approach this problem. We weren't necessarily taking   the problem away from CatSci and solving it ourselves; we're providing the tools in order to tackle this in terms of DOE. And this was a fantastic collaboration between David as a customer,   myself and phil on the side of technical sales, and Ryan within product development, and it's key here that we're all looking at this from a different perspective   to enable that kind of solution, which, as I said before, we can look at kind of deep process understanding in different ways, depending on the objectives of what needs to be achieved.   We found out some things along the way, which is really exciting to us at JMP and will lead to product improvements, which we then hand back to yourselves   as customers. We learned that machine learning approaches are applicable to this big DOE situation and, as you know, with JMP they are straightforward to apply. And we also learned caution in terms of the validation approaches, and we can look into that further.   So, ultimately, we presented here a good volume of work in terms of big DOE challenges, and we're sure there are many companies out there, similar to CatSci, taking on this sort of problem.   
So we invite you to explore this work further. We have two links here: one to the Community, where we'll be posting resources and the video of this presentation,   and one to a JMP Public post, which lets you get hands-on with the data that we used. So from everyone in this presentation, thank you for watching.
Els Pattyn, Non-Clinical Efficacy and Safety Statistician, Ablynx, a Sanofi company   Immunogenicity assays to detect anti-drug antibodies (ADA) in subject samples provide data on drug safety and efficacy and are required for approval by health authorities such as EMA and FDA. Such assays are usually qualitative or semi-quantitative and thus require cut-points as threshold values for distinguishing positive and negative samples. Establishing appropriate cut-points is crucial to ensuring acceptable sensitivity of the assay to detect ADA and as such requires particularly complex statistical calculations. Since one study can have multiple immunogenicity read-outs, these cut-point calculations become very laborious. We developed a JMP script for cut-point determination and assessment, which follows a pre-determined decision tree developed based on industry guidance, white papers and scientific best-practice. The script is designed to be appropriate for validation and use in GxP studies. End-users can select decision trees applicable for their needs, such as type of assay and study. The application accepts Excel files to upload data and makes outcome-dependent decisions – for example, adapting the effects included in the mixed-effects model based on the study-design, selecting the most appropriate normalization/transformation, calculating analyst-specific cut-points in case of significant analyst-specific differences, and adapting down-stream analysis in cases where no second-tier confirmatory data is available. In summary, this JMP script allows immunogenicity cut-points to be calculated quickly and efficiently in a standardized way, including automated reporting that is suitable for regulatory submissions.     Auto-generated transcript...   Speaker Transcript Els Pattyn Hello, my name is Els Pattyn and I'm working as a non clinical efficacy and safety statistician of Sanofi. So I'm happy to present you in JMP an end user tool we have developed for immunogenicity assessments. First short introduction of the company I work for. So originally I was employed by Ablynx. Ablynx is a biopharmaceutical company which is based in Ghent, Belgium. And we are engaged in the discovery and development of nanobodies. Nanobodies are camelid heavy chain entities, as you can see here, and you can discriminates. (OK, I will first my pointer...set my pointer. I don't think you see anything.) Okay, here we are. So nanobodies are entities from camelid heavy chain antibodies and they are smaller than the conventional antibodies. Conventional antibodies have to have two heavy chains, two large chains and camelid antibody you only have the heavy chain. So they are small, they can also be used modular you can have multiple nanobodies and have multiple specificities and they're also easier to manufacture. We were founded in 2001 and then 17 years later, we had our first compounds in the clinic. It was Cablivi. It was and it is targeted for a rare blood disease aTTP. And 2018 was another important year as then we got acquired by Sanofi, which is a large biopharmaceutical company with headquarters in France. So then at that time, we were more than 450 employees, but with the acquisition of Sanofi and we are part of a fairly large company. Sanofi itself has more than ??? employees with around 15,500 employees committed to R & D. And you see we are spread across the three...across three continents. And also therapeutic platform of Sanofi is very wide. 
It's not only the nanobodies; it's really a lot, and it's still growing. Then, about the tool we have developed: it's an immunogenicity tool. About the context of immunogenicity: immunogenicity is the ability of a substance to provoke an immune response. Often it's wanted. It's when your immune system has an appropriate response to a pathogen, but it can also be a wanted response towards a therapeutic agent, as in the case of a vaccine. It can also be an unwanted response to a therapeutic agent, and then the antibodies produced are called anti-drug antibodies, or ADAs. With these ADAs you can have allergic reactions or even anaphylactic shock, and loss of efficacy: you can have antibodies with neutralizing capacity, so that you lose your activity. So it will be no surprise that this has special attention from the regulatory authorities; several guidelines have been issued by the EMA and the FDA regarding ADAs and how to analyze them. A complete ADA determination involves several steps; it's a multi-tiered approach. You have a screening assay, where you determine your sensitivity; then you have a confirmatory assay, where you look at whether the response is specific or not. You also have characterization assays, which are done to determine whether your antibodies have neutralizing capacity, and sometimes you want to determine which isotypes are targeted. So what we have to do is determine cut points, so that we can discriminate the positives from the negatives, and those have to be determined on a so-called blank or naive population. The blank population, of course, should be representative; it should be free of outliers, should not have pre-existing antibodies, and should incorporate all sources of variability. You have biological variability, so the different subjects that you test, the subject IDs, but you also have technical variability, which comes from your analysts, your runs, and plate-to-plate variability. When you want to assess this in a proper way, you prefer to do it with a mixed-effects model fitted by REML, so that you can have these elements as random effects in your model. So there was, within Sanofi, a need for an end-user tool for cut-point setting. The aims were to have a harmonized and standardized approach across all sites; to have a user-friendly end-user tool with state-of-the-art analysis, so it should be a statistical package, because we want to make use of the mixed-effects model. It should also operate in a regulated environment, because we also have lots of GxP studies, and it should include uniform and automatic reporting. So that's where we started, and we explored whether we could do it with JMP, thanks to its scripting language, JSL. As you all know, the script can easily be retrieved when you have a graph: you can just click Save Script, and then you have the script. So that's an easy start. What is also an asset is that it is programmable towards outcome-dependent decisions. So we started by retrieving the scripts from every graph and every analysis we wanted to do, and then it became a really huge analysis. We have one parental code, as you can see here, and we have different child codes that are called by the parental code. 
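To make the statistical core of the talk concrete, here is a rough Python sketch of one common way a screening cut point can be derived from a blank (drug-naive) population: fit a random-intercept model by REML, pool the between-subject and residual variances, and take an upper percentile. This is a generic illustration only, not the validated JSL tool or its decision tree; the column names, the log transformation, the 5% false-positive target, and the statsmodels attribute layout are all assumptions of the sketch.

import numpy as np
import statsmodels.formula.api as smf
from scipy.stats import norm

def screening_cut_point(df, alpha=0.05):
    """df columns (assumed): 'signal' measured on blank/naive samples and 'subject' id.
    Returns a parametric screening cut point targeting an `alpha` false-positive rate."""
    df = df.copy()
    df["log_signal"] = np.log(df["signal"])
    fit = smf.mixedlm("log_signal ~ 1", df, groups=df["subject"]).fit(reml=True)
    mean = fit.params["Intercept"]
    total_var = fit.cov_re.iloc[0, 0] + fit.scale        # between-subject plus residual variance
    cut_log = mean + norm.ppf(1 - alpha) * np.sqrt(total_var)
    return float(np.exp(cut_log))                        # back to the original signal scale

A real tool would add analyst, run and plate as further variance components and would apply the outlier and normalization logic described in the talk before this step.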
And, as I said, it should be programmable towards outcome-dependent decisions, so the code gets quite a bit more complex, because we have to do a lot of statistical analysis and assess whether you have statistical significance or not, and there are different normalizations we want to assess. So those are some extra aspects of the code and what it looks like. Before giving a demonstration: there are, in fact, two steps to perform the analysis. First, there is an Excel template where the raw data should be populated, and then the JSL has to be launched; only the parental code has to be launched. I will go over these two steps. So we have one Excel template. I'll open it so that you can have a look. Here you see an overview, where we can populate it, and here is the data for upload in the Excel file. As I have said, it can be an ADA assay or an NAb assay, and depending on whether it's an NAb or an ADA assay, the header names are different. So here you can say it's an ADA assay, either with or without confirmatory data. The code is adapted so that it can accept either summarized data or raw, replicated data. Say it's replicated data; then you can enter the number of replicates, for instance three replicates. And in order to determine if it's a ??? point, we also relate it to the negative control of a plate, and there should be multiple negative controls on one plate, for instance two, which I call Start and End. When you populate this, we make a worksheet, and then here you have your Excel template that you have to populate; it's ready for upload and has the correct naming for upload. Once you have done that, you can go to JMP. I already have JMP open. What you have to do then is click on the parental JSL code, and automatically an interaction menu appears. Then you have to fill in the fields; I'll just enter "test for JMP" here. There are five types of assays, so let's say we now have an ADA assay, and here you can upload the file. I have here some dummy input data to upload, so we select it. And here there are some numbers you can change. The default numbers are filled out, but if you want different percentage thresholds as acceptance criteria, you can change them. And if your data is reported to a precision other than three decimal places, you can change that; I think here there are no decimal places in the data, so I change it to zero. Then you have a second introduction window here. It's an ADA assay, so this window is typical for the ADA assay; for an NAb assay you get another window. Here you have to select whether it's clinical or nonclinical, because the cut points, I think, are different there. I have confirmatory data here and I want to analyze in a clinical setting, so I click it here. Now you have different outlier removal approaches that you can select. There is also another level of flexibility that you can enter. Just to make it user friendly, we have one BTD approach; BTD is an internal guideline, and when you want this approach, no other window opens. When you want extra options, you can click here, and you can have extra output or calculate, for instance, run-specific differences or analyst-specific differences. 
So when you click here, you have extra options, but I want to stick to the general upload approach. And then you can also have an upfront exclusion. We want to do it here and, once you have entered that, then the analysis starts. So it's a whole chain of analysis. It's an outlier removal approach and it's an iterative way. It checks so that when you have removed outliers, first analytical, than biological or if there are other outliers. So here, you see a series of analysis that's been performed. So it takes a while, because this data has screening data and also also confirmatory data. So now the calculations are going on. And then, at the end, there is a report. That's what you see here, that's still assembling the report now. And it's now outputting the data sets in PDF formats. Generating the outputs so now, and now, it's finished. So in the result parts, all results are automatically output here. That's what you see here. You have the data sets, you have a report and a journal, and also a PDF file. And then data sets, the different data sets that were outputs. And here you see that you have full report with the analysis settings, the methodology. Close it. And you have your descriptives. And of course what is nice for JMP is that you have your interactivity when you want to follow this subject, for instance. I have to double click on 411. You can follow it throughout all all the analysis so it's all linked. So, then you have here, then for my information of the screening cut point factors. It assesses different transformations, different normalizations, then it evaluates the most appropriate blank population. You have an output of all outliers, whether they were analytical or biological. And identities of the outliers. Yeah you have...so you see, you have all analysis that has to be performed, all analysis relevant for the settings. Same also for confirmatory cut points. And here is ADA scoring, whether the score is negative, positive, reactive. So by this, the whole analysis is done and you have the final conclusions. Under the data tables outputs. And the same, what we see here, the same, we also have, of course, in a PDF format. That can be then submitted for... submitted for...it's validated. We are able to validate it so everything we saw there, you see again here in a PDF format. So so, to conclude, what was the effect, what was the aim to do. We were able to generate codes to do an analysis very quickly and efficiently. Normally it took days for the analysis and for the reporting, now we can we have our... our analysis and report in 10 minutes, so it's an automated in a standardized way. It's an automatic reporting and it's also it's performed in a validated environment that's suitable for submission. So yeah I want to talk...to thank my colleagues, my colleagues of the non clinical efficacy and safety teams. I highlighted people that were involved in the development of the end user tool. And also yeah it was, of course, a collaboration between different teams, so I want to thank all people that were involved in that.  
Sunday, March 7, 2021
Bradley Jones, JMP Distinguished Research Fellow, SAS   JMP has been at the forefront of innovation in screening experiment design and analysis. Developments in the last decade include Definitive Screening Designs, A-optimal designs for minimizing the average variance of the coefficient estimates, and Group-orthogonal Supersaturated Designs. This tutorial will give examples of each of these approaches to screening many factors and provide rules of thumb for choosing which to apply for any specific problem type.     Auto-generated transcript...   Speaker Transcript Brad Jones Hello, thanks for joining me. My name is Bradley Jones and I work in the development department of JMP, and today we're going to talk about 21st century screening designs.   JMP has been an innovator in screening experiments over, well, I would say the last decade or so. I'm going to talk about three   different kinds of screening designs and tell you what they are, how to use them, and when you should use one in preference to another.   The three screening designs I'm going to talk about are A-optimal screening designs, definitive screening designs (or DSDs), and group orthogonal supersaturated designs (or GO SSDs).   Let me first introduce A-optimal screening designs. An A-optimal design minimizes the average variance of the parameter estimates for the model.   The way to remember that: the A in A-optimal stands for average.   By contrast, the D-optimal design minimizes the volume of a confidence ellipsoid around the parameter estimates by maximizing the determinant of the information matrix. That's a lot of words,   and it's hard to see why all of that's true from the determinant, but you can remember that the D in D-optimal stands for determinant.   So why am I saying that we should do A-optimal screening designs? D-optimal designs have been the default in the custom designer for 20 years. Why should we change?   Well, my first reason is that the A-optimality criterion is easier to understand than the D criterion.   Everybody knows what an average is,   and the average variance of the parameter estimates is directly related to your capability of detecting a particular active effect.   The D criterion, talking about this global confidence ellipsoid, doesn't tell you about your   capability to estimate any one parameter with precision.   So I think the A-optimality criterion is easier to understand.   But more than that, when the A-optimal design is different from the D-optimal design, the A-optimal design often has better statistical properties. It is true that a lot of the time the A- and D-optimal screening designs are going to be the same.   But there are times when they're different, and when they are different, boy, I really like the A-optimal design better.   So I want to motivate you with an example. For the first example, suppose I have four continuous factors,   and I want to fit a model with all the main effects of these factors and all their two-factor interactions. So there are four main effects and six two-factor interactions, but I can only afford 12 runs.   I'm going to do a JMP demonstration. In that demonstration, I'm going to make a D-optimal design for this case, an A-optimal design for this case, and then compare those two designs.   
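For reference, the two criteria can be written compactly (these are the standard textbook definitions, not anything specific to JMP). With model matrix X, p parameters, and least-squares covariance Var(beta-hat) = sigma^2 (X'X)^{-1}:

\[
\text{A-optimal: } \min_{X} \; \frac{1}{p}\,\operatorname{tr}\!\left[(X^{\top}X)^{-1}\right],
\qquad
\text{D-optimal: } \max_{X} \; \det\!\left(X^{\top}X\right).
\]

The volume of the joint confidence ellipsoid is proportional to det[(X'X)^{-1}]^{1/2}, which is why maximizing det(X'X) shrinks that volume, while the trace criterion directly targets the average of the individual coefficient variances.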
So I'm going to open up my JMP journal here and go to the first case, and that's the four factor with two factor interactions and 12 runs.   And this is a script that makes it where I don't have to actually create the D-optimal design. Just believe that it is the D-optimal. Here here's the D-optimal design. You notice that all of the numbers in the design are either plus or minus one.   And here's the A-optimal design and there's a surprise here. In this first column, you see that X1 has four zeros in it,   whereas X2, 3, and 4 have all plus and minus ones. So that's a bit of a surprise. It's a good thing that, that this is...all these factors are continuous, but let's now compare these two designs.   So to start the comparison, let's look at the relative efficiencies of the estimates of the parameters.   This is the efficiency of the A-optimal design to the D-optimal design. So when the...when the numbers are bigger than one, that means the A-optimal design is more efficient than the D-optimal design.   And there's one parameter that's less efficient for the A-optimal design, but all the rest of the   parameters are at least as well estimated as for the D-optimal design, and many of them are up to 15% better estimated by the A-optimal design.   I'd like to show you one other thing, and that is this.   This is the correlation cell plot of these two designs and you can see that the the D-optimal design has all kinds of correlations on...   off the diagonal here, up to 33.33; that's a third.   And, but a lot of other ones, whereas the A-optimal design only has three   pairwise correlations that are not zero. Everything else in this...in this set of pairwise correlations is actually zero, and the main effects are all orthogonal to each other, and the first main effect is orthogonal to all of the two factor interactions.   Many of the two factor interactions are also orthogonal to each other. All of this orthogonality means that it's going to be easier to make   model selection decisions for the A-optimal design than for the D optimal design. And then one one other thing that I think you might be interested in is the G efficiency of the A-optimal design is more than 87% better than the D-optimal design. What G efficiency is, is the   the maximum...it's a measure of the maximum variance of prediction within the space of the parameters. So the   maximum variance prediction for the D-optimal design is   nearly twice as big as the maximum variance and prediction for the   A-optimal design.   And   also, the the I efficiency of the A-optimal design is is more than 14% better than the D-optimal design. So for all these reasons, I think it's pretty clear cut that the A-optimal design is better than the D-optimal design in this case.   So let me clean up a bit.   Going back to the slides, this is just the picture that you just saw.   But I wanted to show it to you in JMP instead. So let's move on to another another reason why I like A-optimal designs.   A-optimal designs allow for putting differential weight on groups of parameters. So what does that mean? In a D-optimal design, you could,   in theory, weight the different parameters more...some parameters more than others, but   in so doing, you don't change the design that gets created.   The...whatever weighting you use is still going to give you the same...the same design, so that's so that means that weighting in D-optimal design is not useful. 
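One way to see why weighting changes an A-optimal design but not a D-optimal one (this little piece of algebra is mine, not from the talk) is to write both weighted criteria with a diagonal weight matrix W:

\[
\text{weighted A: } \min_{X}\,\operatorname{tr}\!\left[W\,(X^{\top}X)^{-1}\right]
= \min_{X}\,\sum_{i} w_{i}\,\operatorname{Var}(\hat{\beta}_{i})/\sigma^{2},
\qquad
\text{weighted D: } \min_{X}\,\det\!\left[W\,(X^{\top}X)^{-1}\right]
= \det(W)\,\min_{X}\,\det\!\left[(X^{\top}X)^{-1}\right].
\]

The constant factor det(W) does not depend on the design, so the weighted-D optimum is the same design as the unweighted one, whereas the weighted-A criterion genuinely shifts effort toward the heavily weighted coefficients.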
In fact,   for most of the variance optimality criteria, including I-optimal designs, weighting doesn't help you put more emphasis   in the design on on the parameters. But in A-optimal designs when you weight the parameters, you can achieve a different design and that might be useful in some   cases. So here's my weighted A-optimal design example. Suppose I have five continuous factors now, and I want to fit a model with all the main effects and all the two factor interaction. So I have   five main effects and ten two factor interactions and I can only afford 20 runs. So I care more about being able to estimate the main effects precisely, because I figured that   the main effects are going to be bigger than the two factor interactions. I want...I want to get really good estimates of them.   So what I'm going to do is make...again I'm going to make a D-optimal design for this case. I'm going to make a weighted A-optimal design for this case. And then I'm going to compare the designs. So here here's a picture   of the demo. So here's the D-optimal design and again, we see that it's all plus and minus ones.   Here's the A-optimal design, which...in which I have weighted the main effects so that they're 10 times more important than the two factor interactions. And now I want to compare these two designs.   Now, in this case,   I'm now looking at the D efficiency of the D-optimal design relatively...relative to the A-optimal design, and so the A-optimal design is better when   the numbers are less than one. And we see that, for all of the main effects, the A-optimal design is doing a better job of estimating them   than the D-optimal design. Of course, there's a price that you have to pay for that, because in weighting...downweighting that two factor interactions, you get slightly worse precision for estimating those.   So this is a trade off, but you said...you said you were more...or I said that I was more interested in main effects in two factor interactions and so that that's what I got.   But here's here this compared...comparison of correlations is another reason why you might like the A-optimal design preferred over the D-optimal design.   In the in the D-optimal design, there are a lot more pairwise correlations. Notice that in the the A-optimal design, all of the main effects are completely orthogonal to all the two factor interactions.   You can see this whole region of the correlation cell plot has...is white, which means zero correlation.   The main effects are not orthogonal to each other, except that X1 and X2 are both are all orthogonal to X3 and 4 or 5, that that is to say their pairwise correlations are zero.   But there's some small correlations   between   X1 and X2 and between X3, 4 and 5,   correlations of .2.   But there's there's a lot of orthogonality in this plot, way more than there is in   in the   the D-optimal design.   So if you really wanted to, if you were sure that you wanted to be able to estimate the main effects better than you estimate the two factor interactions, again the A-optimal design would be preferred to the D-optimal design.   I don't want to close everything, because that would close my journal too.   Okay, going back to the slides.   So when would I want to use an A-optimal screening design?   Well,   I'm going to tell you about DSDs and GO SSDs   after this, but whenever they're not appropriate, then you would use an A-optimal screening design and that may often be the case in real world situations.   
When you have many categorical factors and and some of the categorical factors may have more than two levels, then you can't use either a DSD or a GO SSD.   If certain factor level combinations are infeasible, for instance, if some corner of the design space it doesn't make sense to run,   then you would have to use the A-optimal design, because that that that is supported by the custom designer in JMP and and that can handle infeasible combinations,   either through disallowed combinations or inequality constraints. Or if there's a non standard model of that you want to fit, for instance, you might   want to fit   a three way interaction   of some factors, and then the DSD or the GO SSD are not appropriate there. And then, finally, when you want to put more weight on some effects than others   that your only choice is to use an A-optimal design.   So, so that concludes the section on A-optimality and I'm going to proceed now to definitive screening designs. These these designs first were published in the literature in 2011 so they're they're now 10 years old.   And this is what they look like. Here's the six factor definitive screen designs, but definative screening designs exist for any number of factors so it's...   they're very flexible.   What the...if we look first at the first and second run, we notice that   every place in Run Number 1 that is plus one, Run Number 2 is minus one, and every place that Row Number 1 is minus one, Row Number 2 is plus one.   So that Run 1 and 2 are like mirror images of each other, and if we look at all these pairs of runs, they're all like that. Every... every odd   run is a mirror image of every even numbered run. And finally we end with with a row of zeroes.   Another thing we might notice is that there are a couple of zeros for each   factor in the design.   So this, this is a useful thing, as it turns out, because it allows us to estimate quadratic effects of all the factors.   This overall center run allows you to estimate all the quadratic effects of all the factors, as well as the intercept term.   So, what are the positive consequences of running a definitive screening design? Well,   if there is an active two factor interaction, it will not bias the estimate of any main effect, so the two factor interactions and the quadratic effects, for that matter, are all uncorrelated with the main effects.   Also, any single active two factor interaction can be identified by this design, as well as any single...single quadratic effect. And it and...the final, very useful   property of this design is that if only three factors are active, you can fit a full quadratic model in those three factors, no matter which three factors there are. So and it and if...   that would also apply, of course, to two factors or one factor being active.   So you might be able to avoid having to do a response surface experiment, as a follow up to the screening experiment.   Now, in interest of full disclosure, there is a trade off that definitive screening designs have to make when comparing them with a D-optimal screening design, and that is that the parameter estimates for the main effects are about 10% longer than parameter estimates for the   the D-optimal screening design. So   so that's a small price to pay, though, for the ability. It also estimates all the...all the quadratic effects and have and have protection against two factor interactions.   So let's look at a small demo of definitive screening design in action.   
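For reference, the mirror-image structure described a moment ago has a compact form, summarized here from the published DSD literature rather than from the talk itself. A DSD for m factors can be built from an m-by-m conference matrix C, which has zeros on the diagonal, plus-or-minus ones elsewhere, and C'C = (m-1)I; the design stacks C, its fold-over -C, and a row of zeros:

\[
D \;=\; \begin{pmatrix} C \\ -C \\ \mathbf{0}^{\top} \end{pmatrix},
\qquad
\operatorname{diag}(C)=\mathbf{0}, \quad C^{\top}C=(m-1)I,
\]

so the design has 2m + 1 runs, and the fold-over pairing is what keeps the main effects orthogonal to every two-factor interaction and quadratic effect.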
So I recently wrote a paper with Chris Nachtsheim on how to analyze definitive screening designs, and when we submitted the original paper, the referees asked us to provide a real example. It's kind of difficult to provide a real example for a new kind of design, because nobody will have ever run one before; they haven't heard of it. But we have a way of instrumenting the custom designer so that we know how long a design is going to take to produce. So we had a set of factors that can make a design take longer to produce; one of them, for instance, is the number of random starts that you use. Here are the times that the custom designer took for all of these examples. Then, under DOE Definitive Screening, there's an automatic analysis tool called Fit Definitive Screening, and if you create a definitive screening design using JMP, this script is always available, so we can see what the analysis is. The analysis goes in two stages: first the main effects are fit, then the second-order effects are fit, and then they're combined. It turns out that this Factor E is a type I error, but all the rest of these effects are actually active, and we know that because we also ran a full factorial design on all of these factors, so we know what the true model is in this case. Factor E didn't show up in the analysis of the full factorial design, but all these other effects, including the squared effect of Factor A and these three two-factor interactions, were real effects that were identified by the definitive screening design. So that example got inserted in the paper, which was eventually published by Technometrics. So when would I use a DSD? DSDs work best when most of the factors are continuous. We do have a paper showing how to add up to three two-level categorical factors, but as you continue adding more two-level factors, the DSD doesn't perform well. So most of the factors need to be continuous. Also, if you expect there to be curvature in your factor effects, for instance if you think there might be an active quadratic effect, then you would want to use a DSD in preference to a two-level screening design. And you can also handle a few two-factor interactions when using a DSD. So those are the times when you would use a DSD. That's all I have to say about them right now, but I want to move along to group orthogonal supersaturated designs, or GO SSDs. So this is a picture of a correlation cell plot for a group orthogonal supersaturated design, and the cool thing about them is that the factors come in groups. Here, one group is W1 through W4. Those factors are correlated with each other, but they're uncorrelated with any other group of factors, so the factors are in groups that are mutually uncorrelated between groups, but a little correlated within groups. And of course, since the design is supersaturated, there has to be some correlation; we're just putting the correlation in very convenient places. So here's the paper on how we construct these group orthogonal supersaturated designs, with a cast of many co-authors, including Ryan Lekivetz, who is another developer in the JMP DOE group. Here are pictures of my co-authors. Here's Ryan. Chris is a long-term colleague of mine, probably 30 years.
Dibyen Majumdar is an associate dean at the University of Illinois in Chicago, and Jon Stallrich is a professor at NC State. So I'm going to talk a little bit about the motivation for doing supersaturated designs, tell you how to construct GO SSDs, how to analyze GO SSDs, and then compare our automatic analysis approach to other analytical approaches. A supersaturated design is a design that has more factors than runs. So, for example, you might have 20 factors that you're interested in, but runs are expensive and you can only afford a budget for 12 runs. So the question is: is this a good idea? One of my colleagues at SAS, when I went and asked him what he thought about supersaturated designs, said, "I think supersaturated designs are evil." I felt my ears pin back a bit, but I went ahead and implemented supersaturated designs anyway. I understand, though, why Randy felt the way he did. The problem with supersaturated designs is that the design matrix is singular, so you can't even do multiple regression, which is the standard tool in Fit Model. Also, there is factor aliasing, because the factors are generally correlated; in fact, in most supersaturated designs, all the factors are correlated with all the other factors, and so there's this feeling that you shouldn't expect to get something for nothing. In the early literature, they were introduced by Satterthwaite in 1959, and then Booth and Cox in 1962 introduced a criterion for choosing a supersaturated design in a kind of optimal way. This criterion basically minimizes the sum of the squared correlations in the design. John Tukey was the first to use the term supersaturated, we think, in his discussion of the Satterthwaite paper; the paper was published with discussion. And a lot of the discussants were very nasty. But in Tukey's discussion, he says, "Of course, Satterthwaite's method, which is called random balance, can only take us up to saturation (one of George Box's well-chosen terms)." A saturated design is a design where all of the runs are taken up by parameter estimates. But Tukey says, "I think it's perfectly natural and wise to do some supersaturated experiments." The other discussants largely panned the idea, though, and nothing happened in this area for 30 years. Then in 1993, two papers got published almost simultaneously, one in Biometrika by Jeff Wu and the other in Technometrics by Dennis Lin. Jeff holds the Coca-Cola Chair at Georgia Tech, and Dennis Lin is the chair of the department at Purdue now. So when would you use a supersaturated design? Well, you certainly want to use it when runs are expensive, because then you can have fewer runs than factors and therefore a much less expensive design than you would need if you were using a standard method, which always requires at least as many runs as there are factors. If you've done brainstorming of experiments, you know that when everybody gets their stickies out and writes down a factor they think is important, you end up with maybe dozens of stickies on the wall, each representing a factor that at least one person thought was important.
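For reference, the Booth and Cox criterion mentioned here is usually written as the average of the squared off-diagonal elements of the information matrix over all pairs of factor columns; this is the standard textbook form, written from memory rather than taken from the slide:

$$ E(s^{2}) \;=\; \frac{\sum_{i<j} s_{ij}^{2}}{\binom{k}{2}}, \qquad s_{ij} = \mathbf{x}_i^{\top}\mathbf{x}_j, $$

where \(\mathbf{x}_i\) is the \(i\)-th factor column and \(k\) is the number of factors; an E(s²)-optimal supersaturated design minimizes this quantity.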
And then what happens subsequently is that only a few of those factors are chosen to be experimented with, and the others are kind of ignored. That seems unprincipled to me: eliminating a bunch of factors in the absence of data seems like a bad idea. Another thing you might want to do is use them in computer experiments, because if you've got a very complicated computer experiment with dozens or even hundreds of tuning parameters, you can do a supersaturated design. That is especially good if the computer experiment takes a long time to run; you wouldn't want to sit and wait for weeks in order to run a very large conventional experiment. So how do we construct these GO SSDs? This is a little math, but we start with a Hadamard matrix (H), and we have another Hadamard matrix (T) from which we've dropped one or more rows, and then we take the Kronecker product (this interesting symbol here is the Kronecker product) of those two matrices, and that creates a GO SSD. When you form the X transpose X matrix of that, what you get is something that looks like this: square blocks down the diagonal, and all the other elements of the X transpose X, or information, matrix are zero. That's what creates the group orthogonality. Here's a small example. This H is an orthogonal Hadamard matrix, and this T is just H with the last row removed. The Kronecker product of H and T just replaces every element 1 in H by the matrix T, and every minus one in H by the negative of T. Since H is four by four and T is three by four, we're putting a three-by-four matrix in place of each single element of H, so what we come up with is a design with 16 columns and 12 rows. So it expands that example. Now, this is what happens when you do what I just talked about, and so we have 12 runs and 16 columns. The first column is the column of ones, which you would use to estimate the intercept, so we don't consider that a factor. And here are the correlations for this design. The first three factors have higher correlations among themselves, and they're also unbalanced: those columns don't have the same number of pluses and minuses. All the other factors have the same correlation structure, and they all have balanced columns. One other interesting thing is that their columns are fold-overs, so the main effects of each set of factors are uncorrelated with the two-factor interactions of those factors. Now, what my co-authors and I have recommended is that we not use these first three columns as factors; instead, because they're orthogonal to all the other factors, you can use these columns to estimate the variance in a way that is unbiased by the main effect of any of the factors. That way we get a relatively unbiased estimate of the variance, which is very helpful in doing modeling.
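The small example described here (a 4 x 4 Hadamard matrix H, and T equal to H with its last row removed) is easy to reproduce. The sketch below is my own illustration of that construction and of the block structure of the resulting information matrix; the particular Hadamard matrix is an assumption, since any 4 x 4 Hadamard matrix would do.

```python
import numpy as np

# 4x4 Hadamard matrix H, and T = H with the last row dropped (3x4)
H = np.array([[ 1,  1,  1,  1],
              [ 1, -1,  1, -1],
              [ 1,  1, -1, -1],
              [ 1, -1, -1,  1]])
T = H[:-1, :]

# Kronecker product: every +1 in H becomes a copy of T, every -1 becomes -T.
D = np.kron(H, T)
print(D.shape)                 # (12, 16): 12 runs, 16 columns

# Column 0 is the all-ones intercept column; the rest are candidate factors.
print(np.all(D[:, 0] == 1))    # True

# The information matrix D'D is block diagonal: 4x4 blocks of correlated
# columns on the diagonal (the "groups"), zeros everywhere else.
info = D.T @ D
for g in range(4):
    off_block = np.delete(info[4 * g:4 * g + 4, :], np.s_[4 * g:4 * g + 4], axis=1)
    print(g, np.all(off_block == 0))   # True: groups are mutually orthogonal
```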
So the design properties of this particular GO SSD are that we have three independent groups of four factors, and each factor group has a rank of three, which means I can estimate the effects of three of its factors. The fake-factor columns in this case have a rank of two, because you're using one of those columns to estimate the intercept, so you can estimate sigma squared with two degrees of freedom. That estimate is unbiased, assuming that second-order effects are negligible. So now, out of each of the three groups of four factors, I can estimate three effects, and I can test each group of factors using a sigma squared estimate that has two degrees of freedom. I've already pointed out that these factor groups are fold-overs, and that provides substantial protection from two-factor interactions. This table appears in the Technometrics paper. I'm happy to send you a copy of that paper if you just write me an email: Bradley.Jones@jmp.com, just my name at JMP.com. So there are lots of choices that you can use for creating these designs; it's reasonably flexible. In the paper, we talked about how to create the designs, but we also talked about how to analyze them. The first group of factors we want to use to estimate the variance, instead of using them as actual factors. We use the orthogonal group structure to identify active groups, and then within the active groups, we do either stepwise or all-possible regressions to identify the active factors within a group. As we go, we can pool the sum of squares for inactive groups into the estimate of sigma squared, which gives us more power for finding active factors in the active groups. And if you can guess the direction of the signs of your effects (often you may not know how big the effects are going to be, but you know which direction they're going, so you can guess whether an effect is going to be positive or negative), then for an effect that you think is negative, you can relabel the pluses to minuses and the minuses to pluses, so that the new effect will be positive. If you do that for all of the negative effects, so that all of the effects in each group are positive, that maximizes your power for detecting an active group. So we recommend that. We did a simulation study to compare our analysis procedure for GO SSDs to standard modern regression analyses, like the Lasso and the Dantzig selector, so we have three different analytical approaches: the Dantzig Selector, the Lasso, and our two-stage approach. We had three different GO SSDs, and we varied the number of active groups, starting from one active group; as you get more active groups, of course, you make it harder for the design to find everything. The power is high for all of the methods, except when all the groups are active. In this design, there are only three groups, and when you make all of them active, the Dantzig Selector and the Lasso have very poor power. But the two-stage analysis that we're proposing has essentially a power of one for estimating everything that's estimable.
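To make the analysis idea more concrete, here is a rough sketch of the flavor of the two-stage approach (estimate sigma squared from the fake-factor columns, test each group, then select factors within active groups). It is my own simplified illustration, using the design from the previous sketch and a placeholder response, and it is not the authors' exact published procedure.

```python
import numpy as np

# Rebuild the 12-run x 16-column GO SSD from the earlier sketch.
H = np.array([[1, 1, 1, 1], [1, -1, 1, -1], [1, 1, -1, -1], [1, -1, -1, 1]])
D = np.kron(H, H[:-1, :])

rng = np.random.default_rng(1)
y = rng.normal(size=12)                      # placeholder response vector

def proj_ss(cols, yy):
    """Sum of squares of the projection of yy onto the span of cols, plus its rank."""
    beta, *_ = np.linalg.lstsq(cols, yy, rcond=None)
    fit = cols @ beta
    return float(fit @ fit), np.linalg.matrix_rank(cols)

yc = y - y.mean()

# Stage 0: estimate sigma^2 from the fake-factor columns (columns 1-3; column 0
# is the intercept).  Because they are orthogonal to every other group, this
# estimate is not biased by active main effects elsewhere.
ss_fake, df_fake = proj_ss(D[:, 1:4] - D[:, 1:4].mean(axis=0), yc)
sigma2 = ss_fake / df_fake                   # df_fake == 2 here

# Stage 1: screen the remaining groups with an F-like statistic (rank 3 each).
for g in range(1, 4):
    cols = D[:, 4 * g:4 * g + 4]
    ss_g, rank_g = proj_ss(cols - cols.mean(axis=0), yc)
    print(f"group {g}: F = {(ss_g / rank_g) / sigma2:.2f}")   # large F suggests an active group

# Stage 2 (not shown): within each active group, run stepwise or all-possible-
# subsets regression to pick the individual active factors, pooling the sum of
# squares from inactive groups into the sigma^2 estimate as you go.
```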
In these two designs, there are seven active groups, and when all of them are active the Dantzig Selector and the Lasso don't perform nearly as well as the two-stage method that we are recommending. What our two-stage method is doing is using the structure of the design to inform the analytical approach. The Dantzig Selector and the Lasso don't care what the structure of the data is; in fact, they're used for arbitrary observational analyses as well as for analyzing designed experiments. So you would expect that a generic procedure might not do as well as a procedure that's constructed for the express purpose of analyzing this kind of design.
Jose Goncalves, Upstream Scientist, Process R&D Department, Oxford Biomedica Rebecca Clarke, Upstream Senior Scientist, Process R&D Department, Oxford Biomedica George Pamenter, Downstream Process R&D Scientist, Oxford Biomedica Thomas Chippendale, Upstream Senior Scientist, Process R&D Department, Oxford Biomedica   A significant growth in the generation of cell culture bioprocess data has been observed at Oxford Biomedica in recent years. This increase in data collection is not only driven by intensive process development and characterization studies, but also due to incorporation of high throughput production systems. Throughout a cell culture process, many online and offline process parameters are recorded. However, due to the complexity of biological systems, one of the main challenges is the identification of genuine factors that can influence process performance or product yield. Holdback validation is often used to avoid overfitting and prevent the inclusion of non-genuine terms in a model, but the use of a single validation column may not be effective in the prediction of the best model. This presentation will demonstrate how refitting models using multiple validation columns can allow the identification of the simplest and most frequently occurring model. Data from 67 bioreactor production runs containing 20 candidate predictors were used for analysis. The simulate function in JMP® Pro was used to randomly generate 100 validation columns.     Auto-generated transcript...   Speaker Transcript Rebecca Clarke Okay, hi everyone. My name is Rebecca and I'm joined today by Jose and George, and today we're going to talk to you about how we used the JMP Pro validation methods to help us stop wasting time investigating false effects. So firstly, I'm just going to introduce us as a company and what we do. We work for Oxford Biomedica. We're pioneers in the gene and cell therapy business, and we have a leading position in lentiviral vector research, development and bioprocessing. We have over 20 years of experience in the field, and we were the first company to treat humans with a lentiviral vector-based therapy. Some of the partners that we work with are listed here in the bottom right, so we work with Novartis, Sanofi, Boehringer Ingelheim, and Orchard Therapeutics, and some of the conditions that we treat include cancer, central nervous system disorders, and cystic fibrosis, just to name a few. Last year, Oxford Biomedica also joined the fight against COVID-19 when we signed up to the UK vaccine consortium. The consortium was run by the Jenner Institute of Oxford University, and the goal was to rapidly develop a COVID-19 vaccine candidate. By joining this consortium, Oxford Biomedica enabled the scaled-up manufacturing of the vaccine by giving the consortium access to our state-of-the-art manufacturing suites. Following on from that, we then signed a clinical and commercial supply agreement with AstraZeneca. With this agreement, we agreed to supply the AstraZeneca COVID vaccine should it prove to be successful in the clinical trials. And as I'm sure you're all aware, the vaccine was indeed successful, and Oxford Biomedica continues to manufacture the COVID-19 vaccine to this day. I also just want to briefly introduce the processes that we use here at OXB. A bioprocess is a process that uses a living cell or its components to obtain a desired product; a well-known example of such a process would be the fermentation of alcoholic beverages.
And during this process, the yeast is the living component, and the yeast plus all the other ingredients needed to brew beer are put into a bioreactor, where the yeast initiates a chemical reaction called fermentation. During fermentation, glucose is converted into ethanol. Once this fermentation reaction is complete, the contents of the bioreactor are harvested and purified via filtration, and we're left with the final product: beer. The process that we utilize here at Oxford Biomedica is similar to fermentation. The living component of our process is cell culture, so into the bioreactors we put the cell cultures and also the other components needed to make the lentiviral vectors. At the end of this production process, the contents of the bioreactor are passed on for downstream processing, where the product is purified, and then it gets bottled and shipped out for use in patients. This slide is just showing the life cycle of drug development. On the left-hand side, at the very bottom, is the research stage. It's at this stage that we identify target genes or proteins that are important in disease mechanisms. We also develop drugs against these targets and show whether the drug is successful in its ability to target the disease. And then, at the very end of the life cycle, we have commercial manufacturing, where the drug is made and bottled for use in patients. Our group, the Process Research and Development group, sits in between these two stages. It's our job to design, optimize, and scale up the production process and move drug product from the research bench through to manufacturing. Some of the processes that we use include small-scale shake flasks, and we work right the way up to 50-liter bioreactors. So, as you can imagine, we generate a lot of data during the course of our work, and the bioreactors have many input parameters that we need to monitor and optimize. The sheer volume of data that we generate makes it quite difficult to discern which of the parameters we should focus on during our optimization. And because of the sheer number of parameters that we have to work with, it's not commercially viable for us to optimize each one individually, so at this point we look to statistical modeling. We use the models to try and narrow down the list of potential parameters that we should optimize. And while these models are useful, we do generate a lot of random noise during our work, and this is down to the living nature of bioprocessing: no two cultures are the same. This random noise obviously gets incorporated into our statistical models, and this can lead to us wasting time looking at non-genuine effects during our optimization experiments. So it was with this in mind that we looked to use JMP Pro, specifically the validation features, which would hopefully help us to further narrow down the list of parameters that we'll look at, remove the non-genuine effects from the models, and also avoid overfitting our models. So now I'm going to hand over to Jose, who's going to talk you through the holdout validation method. Jose Goncalves Right, my name is Jose and I'm going to speak about the holdout validation method. This is a modeling tool only available in JMP Pro, and we were interested to explore how useful it could be at OXB for modeling of bioprocess data. So as Rebecca was saying, we use these stirred-tank
reactors in our process, and basically we grow cells in these bioreactors and produce these viral vectors, which are our main product. That is what we want to maximize and what we want to model, so this is our response variable, really. Bioreactors require many control inputs and produce a lot of data during one production process, so having a tool to help determine which inputs and outputs are worthy of investigation would be of great value for us. It would allow us to save time, free up resources, and allow for a more targeted approach to process investigations. Before jumping into model building, we first had to collect as much data as we could from historical batches. This data gives us indications of how our production process is running, and any abnormalities in the data might indicate a detrimental effect on our vector titers. Here on the left side, you can see that we collected data on 20 variables; unfortunately, we couldn't show the actual names of the variables and had to anonymize the data. So we list them here from X1 to X20, and since ours is quite a dynamic process, we can expect some degree of collinearity between the predictors. In summary, we have a data set that contains 20 variables, or 20 candidate predictors for our model, and we have collected the data from 67 historical batches. Since this was part of a JMP Pro trial, we thought it would be a good thing to begin by using standard JMP to build a model and then use it as a comparison to another model built using the holdout validation method. Here, to build the model without the validation method, we used the standard modeling platform in JMP, and we selected a response surface model because we wanted to allow all main effects, interaction terms, and squared terms to enter the model. In the personality, we chose to do a stepwise regression. On the right-hand side, we have the stepwise regression control panel; we set the stopping rule to be the p-value threshold and left the remaining parameters at their defaults. After running the stepwise regression, this was the final model that was generated. Here you can see that we have eight terms in the model, and seven of them are actually significant. I would like to bring your attention to these first three factors, because they are highly significant in our model, and they will be very important in a couple of slides when I show the results of the approach using the holdout validation method. The overall model here is significant and can explain about 44% of the variation in our data set. So this was our initial approach. We were happy with this first model that was generated, but we were interested to know more about validation methods, because investigating all of these predictors would take a considerable amount of time and laboratory resources. So that was the approach we took next. Here is an overview of how the holdout validation method works. Essentially, our data set will be randomly split into two groups. On the left-hand side, 70% of the data will be assigned to a training set, which is the portion of the data that will be used exclusively to build the model.
And here on the right hand side we have 30% of the data that we will be assigning to a validation set, and this is the proportion of data that will be used to stop our model building process when this R squared of the validation set, which is its maximum. So here at the bottom, we can see, this is a screenshot of the stepwise regression platform, which is essentially the same as we used before, but there was as a stopping rule, which shows this to be, the maximum validation R squared instead of our p value threshold. So and here, this graph at the bottom, it shows we have the training set and we have the validation set here, which is labeled excluded. And so, as you can see, the training set, as we start entering terms into the model, because we have this rule here, this forward direction rule, we start with zero terms into the model, and then we go in a stepwise fashion to add terms into the model. As we start adding those terms, the R squared increases quite fast, up to when we have five terms into the model and afterwards the there is not much of an increase in the improvement in the R squared when we start...when we keep adding more and more terms. And so one of the important things when we use this method is also what happens into the validation set, because this is related to our stopping rule. And so what happens is essentially the same as we start adding terms into the model, the R squared increases quite fast, up to when we reach the maximum R squared and then after that point, as we keep adding more and more terms the R squared decreases. And so, this is exactly what what this validation set does is to...what...we want to use this validation set to stop the model building process at this stage, so we can include all these these terms here that explain most of the variation in our data set and avoid the inclusion of all these terms in the model that might create model overfitting or or they might not actually be genuine at all. So after running the stepwise regression, so this was the final model that was generated and, as you can see, we have we had a reduction from eight or seven significant terms into only three. And, as you recall, these terms here they were highly significant when we didn't use the the the the hold out validation method and so here they don't appear to be significant at all. And actually this model is only able to explain about 4% of the of the variation in our data set. And so, this was quite...we we thought this was quite a dramatic decrease in the number of terms in the model and we start thinking, okay, maybe we were a bit unlucky with the validation set that was generated for for for building this particular model and we thought we... One of the great things in JMP Pro is that we can use this generalized regression platform, which is essentially...the output is essentially it looks the same as in a stepwise regression, but in this generalized regression platform, we are able to use this simulation function. And what this basically does is to rerun the model over and over again, as many times as we want. And then use it for each model that is generated, it uses a different validation set. so this was the approach we took next. And because we couldn't list all of the all of the models that we generated, essentially we just...we tried to generate 100 models, or we did 100 simulations of models. So we couldn't list them all in here. 
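As an illustration of the idea of refitting with many different random validation splits and counting how often each candidate term survives, here is a hedged sketch in Python. JMP Pro does this through the generalized regression platform and its simulate function; the sketch below only mimics the spirit of that workflow, with invented data, a plain forward selection, and a 70/30 split.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 67, 20                                # 67 batches, 20 candidate predictors
X = rng.normal(size=(n, p))
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + rng.normal(scale=3.0, size=n)   # weak signal

def validation_r2(terms, tr, va):
    """Fit OLS on the training rows, return R^2 on the validation rows."""
    A = np.column_stack([np.ones(len(tr))] + [X[tr, t] for t in terms])
    beta, *_ = np.linalg.lstsq(A, y[tr], rcond=None)
    Av = np.column_stack([np.ones(len(va))] + [X[va, t] for t in terms])
    resid = y[va] - Av @ beta
    return 1.0 - resid @ resid / np.sum((y[va] - y[va].mean()) ** 2)

counts = np.zeros(p)
for _ in range(100):                         # 100 random validation columns
    idx = rng.permutation(n)
    tr, va = idx[:int(0.7 * n)], idx[int(0.7 * n):]
    terms, best = [], -np.inf
    while len(terms) < p:                    # forward selection, stopped at the
        scores = {j: validation_r2(terms + [j], tr, va)   # maximum validation R^2
                  for j in range(p) if j not in terms}
        j, s = max(scores.items(), key=lambda kv: kv[1])
        if s <= best:
            break
        terms.append(j)
        best = s
    counts[terms] += 1

print(counts / 100)   # fraction of refits in which each predictor entered the model
```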
And the way we we wanted to to summarize the results is to...we tried to count the number of times that a particular factor was picked up to be included in the model after those 100 simulations. And so, this is exactly what this graph is showing here. So on the right...on the left hand side we have just...these are just the parameter estimates. This is good just to observe the scatter of the data. And here, as you can see, we have the X13, X8 and it's this interaction terms that were actually significant when we didn't use this method. And here if we take as an example, X13, it was only...it was only picked up 46 times to be included in the model or 46% of the time to include in the model, and so, which is quite a low number, as we were expecting since these factors here, they were highly significant when we didn't use this method. So the conclusion that we take on using the the hold out validation method is that it's definitely a good tool to have for scientists and especially when we when we build models and we have these variables that come out as significant and they don't quite fit in our scientific understanding. So another one...another conclusion that we take is that specifically for this study is that we can safely assume that the process attributes that we studied do not significantly impact our vector titers. So I'm going to hand over to George now. He's going to speak about the autovalidation. George Pamenter Okay hi everyone, my name is George. I work in the purification group here Oxford Biomedica. I'm going to be talking a little bit about this autovalidation method that was put to us from JMP and its specific use in design of experiments, like data structure. And I'm going to talk a little bit about how we've used that methodology to actually remove a variable from an optimization model of viral vector purification. Okay, so just to start with a little bit of background on a viral vector purification for those that don't know. So essentially we receive this cell culture material from the bioreactors. And that material is quite crude material, so includes our therapeutic of interest are very viral vectors, that also includes thousands of other species, contaminant species. And they can be anything from unwanted proteins to bacterial DNA or sometimes even ineffective viruses. And so the aim of purification really is to separate our pure viral vector product from all these different contaminants. But these can be incredibly complicated processes, and you know, involving chromatography or enzymatic treatments and a number of different factors that affect their performance. And so we need a way of efficiently selecting the best conditions to reach the best purity. And the way we do that a lot of the time is by use of design of experiments. And design of experiments that are common throughout the bioprocess development timeline, we essentially use them to screen, to optimize, and to fully characterize our bioprocesses. Essentially, as we move through the drug development timeline, as we move towards commercial manufacturing, the size of our experiments goes up. The number of experiments we can do actually goes down, and it becomes harder and harder for us to change things the closer we get to a commercial product. And so we need to be able to make smart decisions early on in the development timeline and the way we do this is by developing effective models. 
So if we have an effective model early on, it affords us the confidence that we've chosen the best conditions as we go through our scale-up procedure. I'm just going to detail how we've used this autovalidation technology on a specific example. We conducted a response surface design of experiments on a viral vector purification step; again, we're not able to share the actual data with you, so we've had to normalize it here. This was a three-factor design space, and our output, or response variable of interest, is this impurity level. The impurity level could be the amount of contaminating DNA or protein, something of that description. The goal of this, really, was to optimize these models for the lowest impurity level possible. And this is how it initially looked in our JMP setup: there are three input variables and this response, impurity. We initially built some models using the stepwise regression platform and also the Lasso regression, which is a JMP Pro feature. And actually, at the beginning, we were quite pleased with what we were seeing: we were explaining quite a lot of the data, we seemed to be explaining it quite well, and we started seeing the variables we would expect to see crop up. But there was one variable that made us less sure, and that was this X1 X3 interaction, which you can see I've highlighted here. If you draw your attention to the top right of the screen, you can see that this X1 profile completely changes depending on the value of X3: at low values of X3 there is essentially no real effect, to be honest, and at high values of X3 we've got this kind of positive correlation. That really confused us, and it was a little bit concerning, because this X1 X3 interaction really didn't fit with what we would understand scientifically. So it made us a little bit confused, and it also made us wonder whether our models were really explaining the data we'd seen properly. It also had significant real-world implications for process development, in the sense that this X3 variable was actually known to change quite a bit. The analogy I would use here is that if we were developing the dosing of two drugs, say X1 and X2, and we discovered that the dosing actually depends on how old the patient is, say X3, that's something that you really have to dig into and characterize. So exploring this interaction would have required significant extra experimentation, in cost and in time as well. We wanted to run some validation methodology on that, so we consulted our colleagues at JMP, and they essentially said to us that, unfortunately, it wasn't really possible to use holdout validation on a design of experiments data structure. The reason for that is, if you can imagine assigning 70% of the runs to a training set, you would be building the model on only part of the design space and completely missing the rest of it. So it's not really applicable to use holdout validation on DOE data sets. So our colleagues at JMP came to us with this autovalidation. And the way it was explained to me was essentially that you resample your entire data set: you use your whole data set for training and your whole data set for validation.
And that might appear at first glance kind of like cheating (it certainly did to me), but I'm told that the way in which this is achieved fairly is by the application of these three extra columns, and really by this paired fractionally weighted bootstrap weight. Essentially, I think the idea here is that if you use a piece of data heavily in the training of your model, you then weight it so that you don't use that same piece of data as much in the validation, and that's how you get around this idea of double-counting your data. So we used this methodology and ran it on our models. Here you can see the setup in JMP; we again use the generalized regression platform, which is a JMP Pro tool. It's pretty much the same as what Jose was explaining, but instead of simulating different training and validation assignments, we simulate this partial weighting. We simulate that a number of times to generate different models, and we actually went through this 500 times. The output, as you can see, is like Jose's before: on the right here is a histogram of how many times a particular variable was included in our model, and on the left are the parameter estimates from the various models. I've drawn this red line here and called it the null factor line. This null factor variable operates sort of like a fake variable. We know it to be a nonsense variable because, if we look at the null factor on the left here, you can see its parameter estimates are quite widely distributed around zero, so it doesn't really know where to go. We draw this line at the first time the null factor appears, and anything below it we would consider quite likely to be a non-genuine effect; this is the range of the non-genuine terms. And I've highlighted in blue, as I'm sure you can see, this X1 X3 interaction, which was the one of concern. You can see here that it came up fewer times than a lot of the null factor effects, so this was the first indication we had that this might not be a genuine effect. We didn't want to stop there, so we ran multiple iterations of this with different autovalidation setups, and I've shown three examples here. You can see that in every example, highlighted in blue, this X1 X3 interaction was consistently coming in below this red null factor line. Based on that, we were pretty confident that what we were seeing probably wasn't a genuine effect, and so we were able to eliminate it from the model. Now, just a comparison of the final outputs we got. On the left here is the initial model we built, which included this X1 X3 interaction, and on the right is the autovalidated model where we removed it. I suppose the first thing we noticed was the reduction in R squared. At first it appears that you're probably explaining less of the data, but actually now we're a lot more confident that what we're explaining is genuine. And you can see here the difference in the minimization optimization conditions that it output. This actually had a number of benefits for us in terms of processing. Remember, these are all real-life variables; they have real things behind them. So this X1 variable, that's actually a very expensive thing for our processes.
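The paired fractionally weighted bootstrap idea described above can be sketched roughly as follows. This is my own illustration of the general concept (anti-correlated training and validation weights drawn for every row, then many weighted refits whose selected terms are tallied), with hypothetical data and an assumed exponential weight distribution; it is not the exact scheme implemented in JMP Pro.

```python
import numpy as np

rng = np.random.default_rng(7)
n, p = 30, 6                                 # small, made-up data set
X = rng.uniform(-1, 1, size=(n, p))
y = 1.0 + 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.5, size=n)

def wls(A, yy, w):
    """Weighted least squares; returns the coefficient vector."""
    sw = np.sqrt(w)
    beta, *_ = np.linalg.lstsq(A * sw[:, None], yy * sw, rcond=None)
    return beta

def wsse(A, yy, w, beta):
    """Weighted sum of squared residuals."""
    r = yy - A @ beta
    return float(np.sum(w * r * r))

counts = np.zeros(p)
n_refits = 500
for _ in range(n_refits):                    # 500 autovalidation refits
    u = rng.uniform(size=n)
    w_train = -np.log(u)                     # every row is used for training ...
    w_valid = -np.log1p(-u)                  # ... and, with anti-correlated weight, for validation
    terms, best = [], np.inf
    while len(terms) < p:                    # forward selection on the training weights,
        cand = {}                            # stopped by the validation-weighted error
        for j in (k for k in range(p) if k not in terms):
            A = np.column_stack([np.ones(n)] + [X[:, t] for t in terms + [j]])
            cand[j] = wsse(A, y, w_valid, wls(A, y, w_train))
        j, sse = min(cand.items(), key=lambda kv: kv[1])
        if sse >= best:
            break
        terms.append(j)
        best = sse
    counts[terms] += 1

print(counts / n_refits)   # how often each factor entered the model across the refits
```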
So we were very...it was good to see that we could reduce that that variable. Again, also as this X2 variable goes down, that's actually also, for scientific reasons, quite a benefit to our process, so we're able to reduce that as well, which was of benefit to us. And these are kind of happy chance coincidences, I guess, but the one thing that we were really pleased with really was that actually, you can see the removing this X1 X3 interaction, we were seeing the actually the profiles of X1 and X2 were constant and that actually really fit with the scientific understanding that we expected, so we were quite happy to observe this. And obviously we've gone away now, and we've tested both of those models with follow up experiments just to see which one was performing the best. And on all occasions, this actually...this autovalidated model has agreed with the observations we've seen in our final final experiments. I'll just hand it back now, Rebecca. Rebecca Clarke I'm just going to do a quick summary of what we were talking about. So we know that non true effects have real world implications for both the development timelines and experimental costs for bioprocessing. And here we looked at two validation methods to help us with our bioprocesses, to highlight a potential non true effects that we can eliminate from our optimization experiments. So the hold out validation tool we used successfully to remove a number of parameters from our optimization list, and then we also use the autovalidation during our DOE type experiments. And, and in these cases the models that were generated are also then later confirmed by experimental work. So overall, we were very pleased with how we could use JMP Pro, particularly, the validation methods within our bioprocesses to help save us a lot of time and resources and not focus on non true effects. And just to finalize, we'd like to acknowledge Robert Anderson and Anna Roper at JMP who helped us throughout this presentation, and that is it for us, and thank you for listening.  
Yassir EL HOUSNI, R&D Engineer/Data Scientist, Saint Gobain Research Provence Mickael Boinet, Smart Manufacturing Group Leader, Saint-Gobain Research Provence   Working on data projects across different departments such as R&D, production, quality and maintenance requires taking a step-by-step approach to the pursuit of progress. For this reason, a protocol based on Knowledge Discovery in Databases (KDD) methodology and Six Sigma philosophy was implemented. A real case study of this protocol using JMP as a supporting tool is presented in this paper. The following steps were used: data preprocessing, data transformation and data mining. The goal of this case study was to evolve the technical yield of a specific product through statistical analysis. Due to the complexity of the process (multi-physics phenomena: chemical, electrical, thermal and time), this approach has been combined with physical interpretations. In this paper, the data aggregation (coming from more than 100 sensors) will be explained. In order to explain the yield, decision tree learning was used as the predictive modelling approach. Decision tree learning is a method commonly used in data mining. The goal is to create a model that predicts the value of a target variable based on several input variables. In our case, a model based on three input variables was used to predict the yield.     Auto-generated transcript...   Speaker Transcript YASSIR EL HOUSNI Hello. I am Yassir El Housni, R&D engineer and data scientist in the smart manufacturing team of Saint-Gobain Research Provence, in Cavaillon, France. We are working for the ceramic materials business units. In this presentation, we have two parts. In the first, we will present the data project life cycle that we propose for manufacturing data projects, and in the second, we will present two use cases from Saint-Gobain ceramic materials industries. Working on data projects across different departments, such as R&D, production, quality and maintenance, requires taking a step-by-step approach to pursue progress. For this reason, we defined a protocol based on the Knowledge Discovery in Databases (KDD) methodology and the Six Sigma philosophy, also known as DMAIC: define, measure, analyze, improve, and control. We define in this infinity loop seven steps to follow in order to manage a data analysis project correctly. In all of them, we make sure to have a good understanding of the process, because we believe it's a key to successful data projects in an industrial world. For example, to detect variation in the process we use a SIPOC or flow chart map, and to detect the causes of variation we use our problem-solving toolbox, which contains tools such as the Ishikawa diagram. The infinity loop also presents a route to achieve continuous improvement. Next, we will detail the approach step by step. Let's start with defining the project. It's necessary to define three elements clearly before starting a data project, and we propose here some questions which we found very useful for defining these three elements. First of all, the business need definition. Frequently, the target of a data project in manufacturing is to optimize a process, maximize a yield, improve the quality of a specific product, or reduce energy consumption. As part of defining the business opportunity, we should know how the result will be used: does the target need just visualization, or analytics? And after that, the expected impact should be a quantified gain for the business. Secondly, data availability and usability.
Here we launch a diagnostic analysis of data quality. This step is important to determine the feasibility of the data project. And then the team setup: a person from the data team, a person from the business unit team, and a person from the plant, a process engineer with a Six Sigma green or black belt. Let's move to the second step: data preparation. With transformation, integration, and cleansing, it's an important step which consumes a lot of time in data projects. For example, we have here different sources of data and we need to centralize them in one table. Mainly we use X for inputs and Y for outputs. In this step we use different tools in JMP, such as missing value processing, the exclusion of constant variables, and of course the JMP data table tools, which ensure the right SQL requests to transform the tables correctly. The third step is about exploring data with dynamic visualization, and with JMP we have a large choice of visualizations: for example, plotting the distribution of a variable and estimating the distribution law it follows, detecting outliers with box plots, nonlinear regression between two variables, and contour or density mapping to determine where each population is concentrated. We have a large choice of ways to plot this. We use these tools regularly in our work and have found them very useful. The fourth step is the development of the model, and it depends on the kind of analysis that we need: is the target to explain, or to predict? The first is about links between variables and serves to explain patterns in data. The second is about a formula and serves to predict patterns in data. Generally we cut our data sets into three blocks, 70% for training, 20% for testing, and 10% for validation, and sometimes, if we have a small amount of data, we use 70% for training and 30% for validation. If the model is good, we request a new set of data in order to drive decision making. We have two approaches, supervised and unsupervised learning. Today at Saint-Gobain we use the standard version of JMP, and we have access to the supervised learning tools, such as linear regression, decision trees and neural networks. We work a lot today with decision trees because they give us relevant results, which help us resolve challenges in the ceramic materials industry. The fifth step is about finding the optimal solution. Sometimes it's just one solution, but in other cases it's a combination of several models. And to ensure that the solution makes sense, we add some constraints, for example the min, max, and step of variation of each variable. The JMP profiler gives us great possibilities to optimize solutions quickly. From here, the next step passes to the plant, with the support of our process engineer with a Six Sigma green or black belt. The sixth step is about implementing the best solution in the plant, governed by only one representative model. For example, we implement control charts for output Y1 and analyze the different variations. In the seventh step, we monitor the model effectiveness and visualize the global gain from working on our data project; for example, in the pie charts we see the real impact on the global yield. And last but not least, the preparation of step one for a new data project, to ensure continuous improvement and the continuity of the infinity loop. That was all about the data project life cycle, and now we will present two real case studies of the protocol using JMP.
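As a generic illustration of the modeling step described above (a decision tree fit on a 70/30 training/validation split to explain a yield from sensor variables), the following sketch uses scikit-learn with simulated data; the variable names, data, and tree settings are invented for illustration, and this is not the Saint-Gobain model itself.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score

rng = np.random.default_rng(42)
n, p = 300, 10                                  # pretend sensor data
X = rng.normal(size=(n, p))
yield_pct = 80 + 5 * (X[:, 0] > 0) - 4 * (X[:, 2] > 0.5) + rng.normal(0, 2, n)

# 70% training / 30% validation, as described in the talk
X_tr, X_va, y_tr, y_va = train_test_split(X, yield_pct, test_size=0.3, random_state=1)

tree = DecisionTreeRegressor(max_depth=3, min_samples_leaf=20)  # keep the tree small and explainable
tree.fit(X_tr, y_tr)

print("validation R^2:", round(r2_score(y_va, tree.predict(X_va)), 2))
print("variable importances:", np.round(tree.feature_importances_, 2))
```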
In the two examples we studied the same process technology, an electric arc furnace, but the two have different products and different targets. In this technology, we have complex multi-physical phenomena: electrical, thermal, chemical, time, and other physical effects. Here we have, for example, more than 100 process variables that come from different kinds of sensors. The business need was to explain the global yield of a specific product, JO7. In it we detect many kinds of defects, Defect #1, #2, up to #N. To prepare the data sets correctly, we used Pareto charts, outlier processing with the Mahalanobis distance, recoding of attributes to correct typing errors, and missing value processing. In step three, explore, we present here just an example of the correlations we study between inputs in order to reduce the number of variables before building models. As our result, we found a decision tree with just three variables; the goal here was to explain why the yield is not at its maximum. So we have a decision tree with just three variables out of more than 100. For the model we used 70% of the data for training, because we don't have a large amount of data, and 30% for validation, and we got good results with a high R square; as you see, it's more than 70%. So the message that we passed to the plant is that, just with the specific settings of X1, X2, and X3, we can explain the global yield, and if we need to maximize Y, the percentage of this yield, we need a specific setting just for X1 and X3, and the global yield should improve rapidly; that's the cluster of points here. And for each project, we also give the plant the physical understanding of each parameter. For the second example, with the same technology but another product, we have just 80 process variables, and the target was to increase the number of pieces with no Defect D1. The need is about explaining the quality of a specific product, so we used the same methodology for our steps. For example, here we studied the same kind of correlations between inputs to reduce the number of variables that we put in the model. As a result, we again used a decision tree, but here we found 12 variables that explain this global yield with good results: as you see, the R square was 84%, the RMSE was 3%, and the sample size was 287. For that, we used the cross-validation method because we have a very small table. The first parameter was very important; as you see, it contributes 50%. It was difficult to explain a model with 12 variables to the plant, so when we plot just the first variable, we see visually that we can define a threshold with the variable X1 alone, and with it the global yield should improve rapidly. Thank you.
Arhan Surapaneni, Student, Stanford OHS Siddhant Karmali, Student, Stanford OHS Saloni Patel, Student, Stanford OHS Mason Chen, Student, Stanford OHS   Our projects include topics ranging from high level analysis of gambling utilizing hypothesis testing tools, probabilistic calculations and Monte Carlo simulation (with Java vs. Python programming) to strategic leadership development through quantification of troop strength in the Empire: Four Kingdoms video game. These projects carefully consider decision-making scenarios and the behaviors that drive them, which are fundamental to domains of cognitive psychology and consciousness. The tools and strategies used in these projects can facilitate the creation of user-interfaces that incorporate statistics and psychology for more informative user decision-making – for example, in minimizing players’ risk of compulsive gambling disorder. The projects are about the game of poker and use eigenvector plots, probability and neural network-esque Monte Carlo simulations to model gambling disorders through a game consisting of AKQJ cards. Offering a subtle analytical approach to gambling, the economic drawbacks are explained through multi-step realistic statistical modeling methods.     Auto-generated transcript...   Speaker Transcript Siddhant Karmali Hi everyone, this is Siddhant Karmali, Mason Chen and Arhan Surapaneni, and we're working on optimizing the AKQJ game for real poker situations. COVID-19 has affected mental health and can worsen existing mental health problems. The stressors involved in the pandemic, namely fear of disease or losing loved ones, may impact people's decision-making ability and can lead them to addictive behaviors. Addiction to gambling is one such behavior that has increased due to an increase in the site traffic of online gambling sites. This project analyzes how different situations in the game of poker affect how people make irrational decisions, including situations that may lead to problem gambling. We developed a simplified model of poker that only uses the ace and the face cards, so A, K, Q, and J. This increases the probabilities of certain winning hands, and we called it AKQJ, for ace, king, queen, jack. The variables in this model are the card value, the number of players, the number of betting rounds, and whether cards are open or hidden. The objective of this model is to simplify the complicated probability calculations for the winning outcomes in a full game of poker. And we extend this objective with the idea that, since poker in real life has more than one betting round, we can show that this model is effective even in variations of poker with different numbers of betting rounds. So this is the outline of the project. First, we researched emotional betting and compulsive gambling: what are the risk factors for compulsive gambling, what do compulsive gamblers think like, and why do they gamble? We found that people will gamble for thrills, just as people who have addictions to drugs use the drug for a high or thrill. So we infer that gambling as an addiction must hit the same chords in the brain that are involved in the reward system. And then we went to our technology, which was using hidden and open cards in realistic cases.
So hidden and open cards are...so open cards are the cards that a player keeps face up and the hidden cards are face down, and only the player knows its identity. The and then we made two separate algorithms. There was a comprehensive algorithm and a worst case algorithm. But comprehensive algorithm is more complicated since all the cards are hidden and it's hard to do calculations, and the worst case algorithm had some open cards so players...or our modeled players could infer whether you take the bet or not. And so this was our engineering part. We used JMP to model players play styles. And we also used Java and Python programming, as Arhan will show, to generate...to randomly generate card situations, and we calculated the probabilities and conducted correlation and regression tests in JMP. So hidden...hidden and open cards. Open cards are, as I mentioned, open cards are the play...cards that a player keeps face up so other players can see it. And hidden cards are facedown and only the player knows its identity. The comprehensive algorithm, which ...which is what...usually what happens in a real game of poker where players have to try and calculate the probability of them winning against another person or them winning against their opponent, based on their current hand. And in a comprehensive algorithm it's hard to do, since all the cards are hidden and you don't know which which card which player has. And the open cards make AKQJ game easier calculation wise. And the number of hidden cards increases with the number of betting rounds. so the first case we did was with one round and six players, which had six hidden cards. Then we have one betting round and five players, which had seven hidden cards and so on. So earlier, we...or in the model, there were six players given labels A through F. We assign them probability characteristics, which are the percentages of confidence they have to make a bet. A's is 0%, B's is 15%, C's is 30%, D's is 45%, E's is 60% and F is 75%. And F's 75% probability means that unless they are...unless they are 75% sure...at least 75% sure that they will win against that person...their opponent, then they will not take the bet, so it means they're very, very conservative with their betting. a general poker case, which is the comprehensive algorithm, and the worst case algorithm. The general method is calculated or, for example, if we're trying to calculate the probability of A winning the poker match, in terms of the general method, we would have to use the probability of A is the probability of A versus A winning versus B times the probability of A winning versus C, all the way to probability of A winning versus F. This takes a very long to calculate and it's cumbersome in a real poker match, since the betting round time can be 45 seconds to a minute and not many people can do this kind of calculation in a minute. So the worst case...so that's where we developed the worst case method. The worst case...we calculated the worst case outcome by seeing which player can make the best hand with the cards they can see, out of four shared cards which are which are open to all players and one hidden card and one open card per player. We use these two algorithms in three different cases. The first case is with one betting round and six players. We have to determine in which cases each player will fold or stay and how many chips they will win or lose. 
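To show the kind of probability calculation the worst-case method is meant to shortcut, here is a hedged Monte Carlo sketch for the simplified AKQJ deck (16 cards: A, K, Q, J in four suits). It estimates, by simulation, how often a player's hidden card beats a single opponent's hidden card; the rules are deliberately simplified and are my own stand-in, not the students' exact algorithm or their Java/Python code.

```python
import random

RANKS = ["A", "K", "Q", "J"]            # AKQJ deck: 4 ranks x 4 suits = 16 cards
VALUE = {"A": 4, "K": 3, "Q": 2, "J": 1}
DECK = [r for r in RANKS for _ in range(4)]

def estimate_win_prob(my_card, n_trials=100_000, seed=0):
    """P(my hidden card beats one opponent's hidden card); ties count as half."""
    rng = random.Random(seed)
    wins = 0.0
    for _ in range(n_trials):
        remaining = DECK.copy()
        remaining.remove(my_card)        # my card is no longer in the deck
        opp = rng.choice(remaining)      # opponent's hidden card
        if VALUE[my_card] > VALUE[opp]:
            wins += 1
        elif VALUE[my_card] == VALUE[opp]:
            wins += 0.5
    return wins / n_trials

# A player with a 75% confidence threshold, like player F, would only bet when
# the estimated probability exceeds 0.75.
for card in RANKS:
    print(card, round(estimate_win_prob(card), 3))
```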
For example, A stays even if they lose chips, because A is the player we modeled as having problem gambling, so according to our condition they had to stay. B wins against E but not against anyone else; C does not win in this case; D wins against B and C and ties with A and E; E does not win; and F wins against all the other players. Because B did not win enough, and because C and E did not win at all, they all fold in the next betting round and lose their chips. Of the players who stay, considering this is a one-betting-round poker match, the one most likely to win is F, since they have the highest worst case winning probability. Going back to the previous slide, in the six-player case player F's overall probability was very close to 80%, so there was a strong correlation between the AKQJ worst case method and the general method. The next case is one betting round and five players. In this case the confidence values change: A's is 0%, B's is 12%, C's is 25%, D's is 38%, and F's is 65%. Player E was removed, since they lost the most chips and had to fold in the previous round. In this round with fewer players there are more hidden cards; the number of hidden cards increases with the number of betting rounds and players. With more hidden cards, the calculation time may take longer, and this may make players more nervous and unwilling to do those calculations, since they could lose money. In this case player F did not win, as shown by their worst case winning probability being less than their confidence percentage, so they are forced to fold. This could be because conservative players may not do well in the later stages of the game: they are too stingy with their money and do not make the right bets even when the stakes are higher. The next case is with four players, and we did this test to confirm whether player F wins or not; if F does not win, we can say with confidence that more conservative players do not do well in the later stages of the game. Note that F has to keep decreasing their winning probability. We also tested whether the worst case algorithm matches for five players. In the general method, B has an 11% chance of winning, D has a 46% chance of winning, and F has a 48% chance of winning. This is very close to the worst case values, so we get a strong correlation with an R squared of 0.998. In this worst case, F wins 50% of the time in the five-player match, and we also tested four players, confirming that F wins 50% of the time in the four-player case. The third case is three betting rounds with six players. In this case the confidence values are the same, and E, who was added back, is at 56%. In the first round F wins; however, as players start folding, like B, C and D do, F has to change their confidence level to match the winning probabilities for a round. F's level changes to 60% after the first round and 54% after the second round.
These are models, but for a real player this change would be involuntary, indicating that there is nervousness in a conservative play style, which contributes to such players losing in later rounds. Players A and F represent the extreme playing styles, which may be indicative of problem gambling. Here is a quick summary of the betting round calculations. In a game with two betting rounds, F only wins two times out of 20: F's possible hand is not good enough to match up against the opponents' possible hands. This happens in both two betting rounds and three betting rounds. It is due to the nervousness, and to the fact that F's required probability was too high; they could not match their confidence level. So perhaps the optimal strategy for doing well in a poker game is to be neither too aggressive nor too conservative with your betting. Be like player D, who had a 38% probability, meaning they would have to be at least 38% sure that they would win against their opponent; based on this, around that spot is a good place to be for poker. We also did the three-player test to confirm that player F has to fold and player D wins in that round. So we can say that player D has arguably the best strategy in this AKQJ poker model with more than one betting round and fewer than six players. We also did the two-betting-round case to show that F does not win there either, and another case, three betting rounds and four players, to test whether the outcome of F losing held throughout the betting rounds. Now, why is this important? We have shown that players can perform simple calculations, like the worst case, to control their urge to bet even on a losing hand. People with a gambling addiction may be very aggressive with their betting, and even though they know they are losing their money, they will still bet in the off chance, in the slight hope, that they will make a big win. This is an emotional style of betting and it falls right into the trap of the gambler's fallacy, which is thinking that whether you are on a lucky or an unlucky streak, you will get lucky the next time. These new cases of fewer players and more betting rounds introduced the idea of nervousness in players, even the most rational ones. F, in the one-betting-round, six-player calculation, looked like a player with experience who was able to wait and had a good strategy. We can apply this to a real poker game because, in humans, the sympathetic nervous system releases stress hormones like adrenaline into the body, eliciting a fight-or-flight response. When you are in a high-stakes situation like a poker game, you know at a superficial level what is at stake, namely your money and your assets, so you will bet to try and increase them.
That's just inherent human nature. During this poker game, more bandwidth is given to the amygdala, an area of the brain that controls emotion, rather than to the prefrontal cortex, which is rational and involved in executive functioning. So the more hidden cards there are in a round, the more nervous players may be about their bets and the worse the bets they may make, even if they are very experienced in poker. This nervousness may be correlated with the blind-betting nature of compulsive gambling disorder, based on the concepts of risk calculation and gambling for thrill. Our conclusions were that overly conservative players like player F may not do as well in realistic poker situations; the ones that do best are on the conservative side but are more willing to bet than a very, very stingy player. Our main takeaway is that gambling disorder may be mitigated if players can understand basic statistical calculations and use them in their games. The future research, which in fact we have already done, is to get more reliable data using Python, and that is what Arhan's presentation is going to show. Thank you for watching. Arhan Surapaneni Millions of people visit the capital of casinos, Las Vegas, every year to party, enjoy luxuries, but most importantly, gamble. Gambling, though offering a false reward of success in winnings, still hides the dark pitfall of financial and social struggles. Our generation is modernizing, with technological advancements rapidly becoming the norm and computer programming languages becoming the subjects needed to thrive in a modern world. Utilizing modern computer programming tools and ???, we are able to take a deeper look into this psychological problem and analyze tools to help solve it. This was done with authors Siddhant Karmali, Mason Chen and Saloni Patel, with the esteemed help of our advisors, Dr. Charles ??? and ???. As previously mentioned, millions of people attend casinos per year, so this affects a large population. With a problem that affects such a large sample size, one of the overarching questions is how we express solutions to the problem of gambling, and how we do this efficiently. Through Siddhant's presentation, we have learned much about original methods that help explain gambling, while also presenting the economic misconceptions about gambling and showing that being calm and cautious provides better results than relying on the luck of the game. All of this was done through Java. This method, however, takes 30 hours for just 92 runs, which is inefficient and practically unusable. Using Python reduces this time to seconds, while also allowing for higher levels of complexity that prove to be beneficial to the overall methods. To recap, the method requires a six-player model, each of whom receives two cards from the 16-card AKQJ pack. One of the cards is hidden, and four cards are placed on the table. Each player has a confidence level determining how often they fold or continue playing. Continuing to play loses three points and folding loses only one point. This is done in an effort to model players ranging from blind gamblers to cautious strategists. Conducting a large-scale experiment by hand is very tedious and time consuming if you want to run enough tests for the data produced to be usable.
An efficient alternative is using computer programming, more specifically the language Python in a Monte Carlo simulation, defined as a broad class of computational algorithms that rely on repeated random sampling to obtain numerical results. The Java program shown in the diagram is very inefficient: it only allows for two random samples, and individually sourcing out each specific sequence is difficult. When we are trying to derive larger sets of data, it is important to change this. Analyzing general differences between the languages, Java is statically typed and compiled, while Python is dynamically typed and interpreted. This single difference makes Java faster at runtime and easier to debug, but Python is easier to use and easier to read. Python's elegant syntax also makes it a very good option for scripting and rapid application development in many areas. This is applied in the following method. One thing to look at is the new randomizer: compared with the full 52-card set, the AKQJ 16-card model allows for more accuracy in statistical ??? rendering, especially when using a Monte Carlo simulation. One beneficial aspect that we can add with Python is characters. As Siddhant covered, we can now use wagers: instead of manually applying different confidence levels to our two-card ??? randomizer, we can use Monte Carlo to make different choices based on a certain percent threshold, with the ability to add or remove wagers, which we will see later on. First, it is important to talk about the deck. Here we see a 16-card array with our shuffling function. This randomizes the cards, similar to the initial function seen in the Java program, and draws two cards from the 16-card total for the different players. With this in mind, there are extensive applications to both the original comprehensive method and our worst case method, which will be developed later. ??? the same deck allows us to add changes or move things around to affect how we compute the worst case method, which is helpful for our end goal; we do not have the same flexibility in our old Java method. Python-specific changes for the worst case method include specific elif/if statements, so that the player with the worst card is marked as the loss, and the formula covered by Siddhant is changed accordingly. This is important because it allows more efficiency in data collection and makes the randomizer's outputs more accurate. You can also add specific names to the separate cards, which is another helpful application in itself. This simplified Monte Carlo simulation allows for more complexity, as it lets us add our new wagers based on the scenario with multiple betting rounds that Siddhant has described. We can change characters by changing the current wager, which affects whether the player stays in or leaves the betting round, this being a key difference from the Java method. The concept in this program has to do with setting a variable that will eventually return its value to the funds, and setting the wager to the initial wager argument. Setting the current wager to zero, we set up a while loop that runs continuously until the current wager is equal to the count. Then, for each set where we get a successful outcome, we increase the value by our wager.
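To make the deck handling concrete, here is a hedged Monte Carlo sketch in the spirit of the Python rewrite described above, not the authors' program: it repeatedly shuffles and deals the 16-card AKQJ deck and estimates how often one player uniquely holds the highest card. The real program compares full poker hands (pairs, full houses and so on); ranking only by the highest card is a deliberate simplification for illustration.

```python
# Repeatedly deal the 16-card AKQJ deck and estimate how often a given player
# alone holds the single highest card. This is a simplified stand-in for the
# full hand comparison used in the actual study.

import random

RANK_VALUE = {"J": 11, "Q": 12, "K": 13, "A": 14}

def deal(rng, num_players=6):
    """Deal two cards per player and four table cards from the 16-card deck."""
    deck = [(rank, suit) for rank in "AKQJ" for suit in "SHDC"]
    rng.shuffle(deck)
    hands = [[deck.pop(), deck.pop()] for _ in range(num_players)]
    table = [deck.pop() for _ in range(4)]
    return hands, table

def high_card(hand):
    """Numeric value of the best card in a two-card hand."""
    return max(RANK_VALUE[rank] for rank, _ in hand)

def estimate_win_prob(player=0, num_players=6, trials=50_000, seed=0):
    """Monte Carlo estimate that `player` alone holds the highest card."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        hands, _ = deal(rng, num_players)
        best = max(high_card(h) for h in hands)
        holders = [i for i, h in enumerate(hands) if high_card(h) == best]
        wins += holders == [player]     # counts only if there is no tie
    return wins / trials

print(estimate_win_prob())
```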
If we add a command telling the character to slow down at a certain wager, then we have a simple way of having a threshold for betting. We can also edit the same form to accept specific sequences, like a full house, only allowing a wager when the sequence is present. These thresholds create the above-mentioned risk management level. We can plot the probabilities in ???, where we append each updated current wager to an array of X values and each updated value to an array of Y values, and then use plt to plot the X and Y values. The key component of the better function is the condition in the if statement that corresponds to a successful outcome. This can be adapted to any outcome needed, including the general scenario and the worst case scenario. When we apply our comparisons with two characters and with multiple characters, we can add statements that make sure the data is compared properly. Using Python, you can make sure that hands like two pairs, or simply higher values, rule out players with a lower value, forcing them out of the game. This allows for more efficiency with the new Python program than with the original Java program: something that originally took 30 hours can now be done in a matter of seconds. After this program is applied, we are able to run a correlation test of the new results against the original Java results. If we look at the red lines for both the general and worst case methods, we see that they are extremely close to one, indicating strong correlations; this is also paired with a high R squared value. We also ran the one-proportion hypothesis test, which tells us that for both methods we fail to reject the null hypothesis. This value, although high, is not as close to one as we would expect for something that should be almost identical, given that this is computer programming. There are two main reasons for this. The first is sample size: 92 seems like a lot, but it is not as strong a trend as one would expect. To fix this we can increase the sample size to 1,000 or even 10,000. One more reason could be the application of Python: as previously mentioned, Python is an interpreted language rather than a static one, which could change the results slightly, though they mainly stay the same. Here we see the program finally applied, with its results presented. With this program we see which cards each player draws, which card is shown and which is hidden, what cards are on the table, how many chips players gain and lose, and finally who wins and how they win. In the diagram below, we see Player 1 wins with a full house. Other benefits include the ability to run multiple characters in one computer and one function, describing the different sequences while applying different numbers of players, showing the probability of different outcomes, and even adding the multiple betting rounds of a regular game of poker, which Siddhant has covered. This is vital to further developments, because it allows people around the world to use this program and method to develop the ideas further, as a learning tool that could be utilized as therapy for gambling addicts. One more development exploits the neural network (AI) aspect, making the program more detailed by adding features like bluffing, enough to make it more like the actual game seen in casinos.
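A minimal sketch, assuming a simplified version of the betting loop and plotting step just described (not the authors' code): a character repeats a fixed wager until a wager count or threshold is reached, wins each round with a placeholder probability, and the running funds are collected into X and Y arrays for plotting with matplotlib. All names and parameter values are illustrative.

```python
# Simplified Monte Carlo betting loop with a threshold "slow down" rule and
# arrays collected for plotting. matplotlib is assumed to be installed.

import random
import matplotlib.pyplot as plt

def better(funds, wager, count, win_prob, threshold=None, seed=None):
    """Return wager indices (x) and running funds (y) for one simulated run.

    threshold, if given, is a simple risk-management cap: betting stops once
    that many wagers have been placed, even if count has not been reached.
    """
    rng = random.Random(seed)
    x_values, y_values = [], []
    current_wager = 0
    while current_wager < count:
        if threshold is not None and current_wager >= threshold:
            break                        # the "slow down / stop" rule
        if rng.random() < win_prob:      # a successful outcome for this round
            funds += wager
        else:
            funds -= wager
        current_wager += 1
        x_values.append(current_wager)
        y_values.append(funds)
    return x_values, y_values

x, y = better(funds=100, wager=3, count=200, win_prob=0.45, threshold=50, seed=1)
plt.plot(x, y)
plt.xlabel("wager number")
plt.ylabel("funds")
plt.show()
```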
Finally, we were able to see that by using Python we get not only higher accuracy but much higher efficiency, turning a program that originally took hours into one that runs in just a few seconds, and ultimately supporting our original hypothesis that being cautious gives more money and more wins. So next time you go to Vegas, or even play a game with friends, remember this project, and remember that being more careful and taking a bit more time with decisions will help you in the long run. Thank you.
Chi-hong Ho, Student, STEAMS Training Group Mason Chen, Student, Stanford OHS   In late February 2020, the COVID-19 pandemic began gaining momentum, and the stock market subsequently crashed. Many factors may have contributed to the fall of stock prices, but the authors believed the pandemic may have been the main cause. The authors' objectives for this project were to learn about stock investments, earn money in the stock market, and find a model to help determine the timing and amount of trading. All of the data was Z-standardized to help eliminate bias and for ease of comparison. Specifically, the authors used three Z-standardization values: Z (within stock), Z (NASDAQ Ratio) and Z (Group NASDAQ Ratio), comparing the current stock price with the previous stock price, the NASDAQ stock price and the group NASDAQ price average respectively. After that, the authors combined the three Z-values into a stock index, which greatly decreases data bias and is better for reducing investment risk. Determining the outliers identifies the timing of investment. The Quantile Range Outlier method is easily affected by skewness, because it assumes an approximately normal distribution; Robust Fit Outliers is a better tool, because it can eliminate the skew factor. The authors established a model to help people invest the right amount of money.     Auto-generated transcript...   Chi-hong Ho Hello everyone, I'm Chi-hong Ho, a junior at Henry M. Gunn High School. My partner is Mason Chen; he is a sophomore at Stanford Online High School. Our project is multivariate statistical modeling of stock investment during the COVID-19 outbreak. In late February, as the coronavirus pandemic spread around the world, the stock markets started crashing. There are several factors behind this year's stock market crash, such as the COVID-19 pandemic, the OPEC/Russia/USA oil price war, the 2.2 trillion dollar bailout package from the US government, and companies laying off their employees. There is more than a 30% unemployment rate due to the pandemic; compare that with the 1929 Great Depression unemployment rate, which was about 25%. Also, in November the US holds its presidential election. And the manufacturing supply chain shut down because the coronavirus pandemic had spread in China earlier. The COVID-19 pandemic influenced the stock market in the way the past crashes of the 1929 Great Depression and 1987 Black Monday did. Because the COVID-19 pandemic had a huge impact on the world, causing many deaths, the stock market should be strongly influenced by the pandemic. In US history, the crashes of the Great Depression of 1929 and Black Monday of 1987 both continued for a long period; COVID-19's did not. This year the stock market decreased by 25% from the peak between March and April. Before March, COVID-19 in the US had not spread as fast as in other countries. After the spread became global, many countries were locked down, and national and global lockdowns affect the stock market. Look at the graph in the left corner: that is the situation that happened in Korea. Asian countries experienced the COVID pandemic before America, so we use the Asian countries' situation to predict what will happen in the US.
Based on this graph, the colored marker is the COVID-19 inflection point for Phase II; it may impact the stock curve significantly, because the case growth speed seems to decrease a little in this short period. On the left side is the plot of cases versus date for the US: cases began to be added in early February and grew strongly through late March and early April. The right graph shows the stock market drop by date, which suggests the two curves are related to each other. When cases grew sharply, the stock market started to crash, and at the lowest point it was more than 35% down. Comparing China, South Korea and the US, we know the Asian countries went through the COVID pandemic before the US, so we can project what will happen in the US over the next few months. Based on that table and the data below, the duration of Phase III is really short, and by then it is safe for us to go back to work; the duration of Phase IV is double that of Phase III, and in that period we would feel even safer going back to work. After looking at the Asian countries' pandemic phases, we can predict the Phase III and Phase IV durations for the US: we estimate that in the best case the US end of Phase III should be around April 15 and the end of Phase IV around May 25, while in the worst case the end of Phase III is around April 30 and the end of Phase IV around June 10. Our project defines a stock investment strategy. Our objectives are to learn and experience stock investment, to earn money in the stock market, and to build a model for judging when to trade or exchange stock. First, we own eight high-technology stocks that were purchased in 2008 and 2009, with an average gain of about 400% as of March. Some of them are among the top 20 of the Standard and Poor's 500 stocks, with an average gain of more than 800%. We want to find a time to sell those stocks and get the money back. Because of the COVID-19 situation and the stock market crash, the stock prices were not as high as in March, so we wanted to sell quickly. After selling the high-tech stocks, we look at 23 COVID-impacted stocks, stocks that lost ground because of the pandemic; we expect those stocks to surge after a few months. Our chosen COVID-impacted stocks should have a minimum market cap of 3 billion. When we get down to trading, we can also make exchanges: we sell one tumbling stock and buy one rising stock for balance, and we need to make sure our stocks will surge in the coming months. We separate the stock transaction decision chart into three levels. The first level is to decide what we will buy, what we will sell and what we will exchange. In level two, we pick the stocks from the selling group, the buying group, and the exchange pair from the two groups. The third level is the tools we will use: the Z index and the outlier detection tools. The function of standardization is to give us an idea of how far the real data points are from the mean. Why do we need it? Because we need to convert the actual data to an index that is easier to compare, and standardization also eliminates the bias of the raw data. On the left side there are blue boxes, purple boxes and red boxes. The real data we collect is the input, in the blue box; the new index after the Z standardization is the output, in the red box; and the Z standardization itself is the tool we use, in the purple box.
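As a rough illustration (not the authors' model) of the three Z-standardization values named in the abstract, Z (within stock), Z (NASDAQ Ratio) and Z (Group NASDAQ Ratio), which are described in detail next, the sketch below computes them with pandas and combines them with a simple average; the column names and the averaging rule are assumptions for illustration.

```python
# Compute three Z indices for one stock and combine them into a single index.
# "price" is the stock's daily closing price, "nasdaq" the NASDAQ composite
# on the same day, and group_mean the average price of the stock group.

import pandas as pd

def zscore(series: pd.Series) -> pd.Series:
    """Standardize a series to zero mean and unit standard deviation."""
    return (series - series.mean()) / series.std()

def stock_indices(df: pd.DataFrame, group_mean: pd.Series) -> pd.DataFrame:
    """Return the three Z indices and a simple combined stock index."""
    out = pd.DataFrame(index=df.index)
    out["z_within_stock"] = zscore(df["price"])                 # vs. its own history
    out["z_nasdaq_ratio"] = zscore(df["price"] / df["nasdaq"])  # vs. the NASDAQ
    out["z_group_ratio"] = zscore(df["price"] / group_mean)     # vs. the group mean
    out["stock_index"] = out[["z_within_stock", "z_nasdaq_ratio",
                              "z_group_ratio"]].mean(axis=1)    # illustrative combination
    return out
```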
In the blue box, the inputs represent NASDAQ stocks, which are popular and which many people invest in; the high-tech stocks have grown a large amount over five years. The range of the Z standardization runs from -3 standard deviations up to +3 standard deviations. After Z standardization, we get Z within stock, the Z NASDAQ ratio, and the Z group NASDAQ ratio. Z within stock compares the stock price with the previous stock prices over the past five years; the Z NASDAQ ratio compares the stock price with the NASDAQ stock price; and the Z group NASDAQ ratio compares the stock price with the group NASDAQ mean. We use the Z standardization to help us assess the risk. In the end we combine all three Z scores into a new stock index, which can help us lower the risk of transactions. Here is the data table after we standardized the raw data; we can follow the stock price index change. US stocks have been trending downward since a peak around mid-February; some stocks are more robust and certain ones are impacted by COVID-19. We established this modeling algorithm on March 7-8 and the database on March 14-15. The red index values shown in this figure represent good times to sell those stocks, when we can earn more money than at the non-red index values. There are also some index values marked in blue; those are times when we can consider buying the 23 COVID-impacted stocks, because we can lower the cost and gain more in the future. The reason we use an outlier algorithm is that the outliers help us determine the timing of trading stocks. One way to determine the outliers is to use quantile range outliers. First we find the interquartile range, which is Quartile 3 minus Quartile 1. The rule is that outliers are values below Q1 - x*IQR or above Q3 + x*IQR, with x equal to 1.5 for regular or 3 for extreme outliers. Why do we choose the extreme outliers? Because the regular outliers cannot show the longer timing that we want. Extreme values are found using a multiplier of the interquartile range, the distance between the two specified quantiles, so extreme outliers give a wider detection level that we can use in the investment, which helps us reduce our risk. Technically, though, the quantile range outlier algorithm is meant for the normal distribution situation; the stock market is not really normally distributed, so the outliers will be influenced by the skew factor. Thus we need a more powerful tool that is not influenced by the skew factor, because for stock performance we care more about the tails than about the center of the distribution. The next tool we use is robust fit outliers, which we use to ignore the skew factor. Outliers and distribution skewness are very much related: if you have many so-called outliers in one tail of a distribution, then you will have skewness in that tail. In quantile range outlier detection the assumption is a normal distribution, so skewness in the distribution will introduce an inaccuracy into the outlier detection methodology. If the distribution is significantly skewed, as it probably is in stock market data, robust fit outliers are a better method to find the outliers accurately, because they tend to ignore the skew factor. Robust fit outliers estimate a robust center and spread; outliers are defined as those values that are more than K times the robust spread from the robust center.
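The two outlier rules just described can be sketched in a few lines of Python; this is an illustration of the formulas from the talk, not JMP's implementation. The quantile range rule is exactly the Q1 - x*IQR / Q3 + x*IQR rule above; for the robust rule, JMP estimates the center and spread with robust fitting, and the median and scaled MAD are substituted here as a simple stand-in.

```python
# Quantile-range and robust outlier rules on a 1-D array of index values.

import numpy as np

def quantile_range_outliers(values, x=1.5):
    """IQR rule: True where a value lies outside [Q1 - x*IQR, Q3 + x*IQR]."""
    v = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(v, [25, 75])
    iqr = q3 - q1
    return (v < q1 - x * iqr) | (v > q3 + x * iqr)

def robust_outliers(values, k=2.7):
    """Robust rule: True where |value - center| exceeds k robust spreads."""
    v = np.asarray(values, dtype=float)
    center = np.median(v)
    spread = 1.4826 * np.median(np.abs(v - center))   # MAD scaled to sigma
    return np.abs(v - center) > k * spread

index = [0.2, 0.1, -0.3, 0.0, 0.4, 2.9, 0.1]
print(quantile_range_outliers(index, x=3.0))   # extreme IQR rule
print(robust_outliers(index, k=2.7))           # regular robust rule
```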
The robust fit outliers option provides several choices for the robust estimate and the multiplier K, as well as tools to manage the outliers found. We use K = 2.7 for regular and 4.7 for extreme outliers. After applying the regular robust fit outliers, we can find the outliers in the selling index data. Look at the right graph: there are many shaded red cells in the F5 and F8 columns, indicating that we can consider selling those stocks to maximize our profit, because the stock price is above average. Each column shows one stock's index changing day by day going down the column. The reason we use the extreme outliers for the buying index is that the buying index is dropping, which makes it really difficult to detect its outliers; the selling index, by contrast, is rising, which makes it easier to determine the outliers. On this page there are some colored blocks in the data table, like B6, B13, B15 and B19, which indicate that we can consider buying those stocks. Lots of people make money by investing in stocks, and most people can choose the right stock to invest in for a reasonable ROI, but investors are challenged to find the right amount of money to invest, and other human psychological factors will bias certain investments. We can determine the amount of stock we buy, sell or exchange based on this model, which can minimize personal investment bias and reduce the overall financial risk. The model provides two ways to judge the amount of investment. The first is the color block analysis: the blocks in dark green are good to sell, and the blocks in orange or red are good to buy or to exchange. At the bottom are the transaction levels we define, where L10 is the smallest investment amount and L1 is the greatest. If we sell stocks, we do not sell too much, using the L5 amount; for an exchange we also choose L5; and for buying we can consider buying more, so we use the L2 amount. Based on this model we can manage the investment and reduce the financial risk. Phase III of the decision chart is the exchanging part. We again use Z standardization to convert the data points into an index, but this time it is the exchange index. We set up an exchange threshold: the Z exchange index should be greater than 15. This is an average index we calculate, which can tell investors the timing. On the left side there is a line chart showing the change of each exchange pair. Based on this line chart, the trend of S5-B1 is about 15.8 and S5-B14 is about 15.16, which means we can consider doing the exchange between S5-B1 or S5-B14. After Z standardization, we get the selling index and the exchange index: the selling index compares the stock price with the stock prices over the past five years, the Z NASDAQ ratio compares the stock index with the average stock price, and the exchange index compares the stock selling index with the stock buying index. We use Z standardization to help us manage risk. We consider about 184 choices, and we need to make sure our investments have the right timing and that we pick the right pair for the exchange. We also use the quantile range outlier algorithm to help us determine the timing; a small value of Q provides a more radical set of outliers than a large value. Look at the table on the right side.
We use the quantile range outlier method and get the top three outliers, whose exchange index values are greater than 19. This is the second time we consider the exchange index. We found the top index values, which are the signal for the best timing to do the exchange: the S5-B14 pair at 19.27, the S5-B13 pair at 19.12, and S5-B12 at 19.07. On the left side we have the timing prediction model, presented in a color-coded, color-box style: blocks in dark green are good for selling, and dark orange or red is good for buying or exchanging. The best times are shown in bold in the graph. April 6 is the best day to do the exchange among the exchange data computed since February 2015. We consider the exchange pair twice, which doubly insures that we can make more money in the stock market and greatly reduces the investment risk. On April 7 the exchange index changed little compared with April 6; on that day the S5-B1 pair had a 19.18 exchange index. The right graph shows the exchanged stock information. On April 8 we sold KLAC stock at $154.32 while the market price was $148.85, so we gained about 3.5% on the sale. On the same day we bought Delta stock at $22.42 while the market price was $22.92, so we gained about 2.2% on the purchase. We sold and bought the same amount of stock for balance; the quantity of stock sold and bought was equal, at 65 units. After one day, the exchange pair had helped us gain 5.7%. All stock buyers focus on their stock trends. My partner Mason and I monitored the NASDAQ daily range outliers from late February 2020 to mid-March 2020. We separated the daily trading window into time slots of 30 minutes each, and we wanted to find the best time for trading. There were 24 peak and valley points detected, and the upper threshold was 2.7%. In the figure in the right corner, we can see the stock price at the open, close, high and low times; we also calculate the price range and rank when we do the stock price peak and valley detection. ??? considered the discrete sample size. Among the 24 peak/valley points we detected, 17 out of 24, about 70%, happened in the first or last hour. We set up a one-proportion test of whether it pays to wake up early and to trade around the open and close rather than in the middle of the session; the null hypothesis assumes a uniform probability distribution across the time slots. In the table in the left corner we can see a p-value of 0.34, which is greater than 0.05, so with four slots among the 13 available slots we cannot reject the null hypothesis. The table and figure in the right corner show the distribution of the peak times and the valley times. In our research we provide a new model to pick the right stocks and the ??? the amount of buying and selling, as well as the exchange index. Timing is a really important factor in investment. This model of stock investment was accurate most of the time during the COVID-19 pandemic; our research group invested in the stock market and gained 2.5% after we finished the project. We may use it to make predictions in the future if the pandemic does not end. Based on our research, early-bird or last-minute stock trading is favored and can earn more money. Thank you.
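A minimal sketch, using SciPy, of the kind of one-proportion test described above: 17 of the 24 detected peak/valley points fell in the first or last trading hour, and the uniform null assumes that 4 of the 13 half-hour slots (p0 = 4/13) would capture them by chance. This is an illustration of the test, not a reproduction of the authors' JMP output or p-value.

```python
# One-proportion (exact binomial) test for the share of peak/valley points
# that fall in the first or last trading hour.

from scipy.stats import binomtest

observed_hits = 17          # peaks/valleys in the first or last hour
n_points = 24               # total peak/valley points detected
p_null = 4 / 13             # 4 of 13 half-hour slots under a uniform null

result = binomtest(observed_hits, n_points, p_null, alternative="two-sided")
print(f"observed proportion: {observed_hits / n_points:.2f}")
print(f"p-value: {result.pvalue:.4f}")
```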
Simon Stelzig, Head of Data-Driven Development, Lohmann   In the course of adhesive formulation and tape development the developer is supported by the data scientist in handling the analysis of the gathered data resulting from design of experiments, the analysis of historical data or preliminary trial and error runs. Hence, a data specialist talks with a non-specialist about the results, facing the need to present complex data analysis in a non-complex way. Especially in the case of the analysis of spectral data, the data is often reduced to a few characteristic numbers for analysis. However, the developer still thinks in spectra rather than numbers, making the visualization and interpretation of the results difficult. The Functional Data Explorer is a great tool to perform such analysis and visualize the results in a spectral-like form. As an example, the dependence of the rheological behavior of a structural adhesive tape, namely the reaction start, on the formulation ingredients is analyzed using the Functional Data Explorer. Another example is the analysis of the dependency of the molecular weight distribution on the process parameters during the synthesis of a pressure-sensitive adhesive. In both cases, the resulting predicted spectra greatly help the developer to gain insight and develop a mechanistic understanding.     Auto-generated transcript...   Simon Stelzig Okay, hello everybody. My name is Simon Stelzig. I'm from Lohmann, the Bonding Engineers, from the research and development department, and I would like to tell you today about the Functional Data Explorer and how you can use it to visualize results for non-data scientists. Before I start, a very short introduction of the company Lohmann. You might not have heard of it, but you might come across its products on a daily basis. Some facts and figures: Lohmann had a turnover of about 600 million euros in 2019, and the part I work in is the Lohmann Tape Group; we are a producer of adhesive tapes, mainly double-sided adhesive tapes, and we are a pure B2B company, so you can purchase our products only for industrial applications. We are quite old: this year marks about 170 years since the company Lohmann was founded. We are basically divided into two parts: the tape products, where I work, focused on adhesive tapes for industrial applications, and our hygiene brand. We are globally active, with 25 sites around the world and roughly 1,800 employees. Maybe most interestingly, 90% of our products are customized, so we have a quite complex product portfolio with a lot of products focused on a very small number of applications for a very small group of customers. We are a top producer of adhesive tapes, mainly double-sided adhesive tapes, with a lot of applications, and these are the market segments we are in. One is industrial applications, mainly for windows and doors or indoor and outdoor applications in the building and construction area. You might also come across Lohmann products in home appliances and electronic devices, for example adhesive tapes that fix the back cover on smartphones, or in the medical segment, mainly for diagnostic applications and wound care.
In the field of transportation, for automotive, we supply tapes for the interior or exterior adhesion of emblems on cars. We are also in the graphical industry, for flexographic printing applications, where we deliver adhesive tapes to fix the printing plate on the sleeves. One subdivision of the Lohmann Tape Group is the hygiene brand, where we deliver closure systems for diapers. So there is a very wide range of applications, and, as I mentioned before, about 90% of our products are customized. Another fact about Lohmann is that our value chain is very deep. If you think about an adhesive tape, you need an adhesive layer which sticks onto the object you want to adhere to; you need a carrier, optionally, which carries your adhesive layer; and, since we are a producer of double-sided adhesive tapes, you have an adhesive layer again on the other side. Lohmann's value chain starts at the very beginning: we can make our own polymer, the base polymer for an adhesive. Then we formulate that polymer into an adhesive formulation, which we coat onto a carrier in the next step, giving an adhesive tape in the form of quite large ??? rolls. And at the very end we can deliver the product in the form the customer wants it: if you need die cuts, maybe emblems for cars, where the adhesive tape needs a specific shape, we deliver that specific shape as the customer requires it. R&D focuses mainly on the first three parts: the polymerization, the formulation and the coating. Since, as I mentioned, a lot of our products are customized, in order to get as quickly as possible to the goal of fulfilling our customer requirements, we apply design of experiments throughout the course of getting to the final product. Doing a lot of designed experiments, and a lot of experiments overall, also gives a lot of data. And being a chemical company, a lot of the data in the experimental results comes as spectra. Most of our experiments do not give you a single value or number; they give you a spectrum. In most cases, I am, let's say, the data scientist delivering the service of planning those experiments, analyzing them and doing the whole data analytics for my colleagues, who in that case are the non-data scientists. They are the developers of the final product, and they use me as a service to do all the data analytics. In most cases you extract from the spectral data key parameters that describe it. In the case of polymerization, that might be the molecular weight distribution, that is, the size of the polymer we get out of the reaction; or it might be the start of a reaction, detected by some means of measurement or something else. So for the data analysis, you extract those key parameters and do the analysis, most of the time a classical DOE analysis, on those key parameters.
As a data scientist, I do this analysis on those key parameters as a service to my colleagues, the developers. They also give me the customer requirements that we want to meet, that is, the optimum that we want the formulation or product to reach. Having done the analysis and the optimization, at some point I go to my colleagues and present the results, and since I did all the analysis on the key parameters, I also present the results in terms of those key parameters. The problem is that my development colleagues and project teams are experts in their own area of expertise, but not experts in data science. So if I do the analysis on the key parameters and also talk in the language of key parameters, they, as the experts in their field, are still thinking in spectra, because that is what they get when the experiments are done: they see spectra where I see numbers. This always leaves the problem that if I present the results and the analysis in key parameters, they still think in spectra. So we are always talking different languages: I talk in the language of numbers, they talk in the language of spectra, which means either they have to translate those numbers into their area of expertise, the spectra, or I have to translate the key parameters into their language. Luckily, the Functional Data Explorer is a kind of universal translator that resolves this problem. What I want to show you today, with three examples, is how the Functional Data Explorer greatly helps in overcoming this language barrier between the data scientist talking numbers and the developer thinking and talking in spectra. The first example is a very simple, almost trivial one, on a measurement that describes the printing quality in a flexographic printing experiment. The second one concerns a measurement that defines the start of a chemical reaction. The third is a measurement that gives you the molecular weight of a polymer during a polymerization reaction. I will stop using PowerPoint now and go over to JMP. The first example, as I told you, is the analysis of the printing quality. You do the measurement, and the ideal, perfect spectrum would be a straight line going through the origin with a slope of 1; that would be the ideal if you reached perfection in this measurement of printing quality. Normally, in a real-life environment, we get something that is not really a straight line but is shifted towards higher values; it is bent upwards. What we want to do with that information is to screen which parameters bend the curve upwards and make it non-ideal. The key parameter we define is basically the difference between the straight line and the actual curve you get; it is simply the sum of the differences at each point.
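A minimal sketch, not Lohmann's script, of the key parameter just described for the printing-quality curves: the ideal curve is the line y = x through the origin, and the key parameter is the summed deviation of the measured curve from that line. The arrays x and y below are illustrative measurement data.

```python
# Key parameter: total deviation of a measured curve from the ideal line y = x.

import numpy as np

def deviation_from_ideal(x, y) -> float:
    """Sum of absolute deviations of the measured curve from the line y = x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.sum(np.abs(y - x)))

x = np.linspace(0.0, 1.0, 11)
y_ideal = x                       # a perfectly straight response
y_bent = x + 0.15 * x**2          # a response bent upwards

print(deviation_from_ideal(x, y_ideal))   # 0.0
print(deviation_from_ideal(x, y_bent))    # > 0, larger the more the curve bends
```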
So if you then do the analysis on the key parameter, it is a very simple analysis. I use a decision tree to screen which parameter most influences this bending away from the ideal spectrum. We have about 22 influencing factors, and there you see one that influences this bending away from the ideal situation. It is called X22; what exactly it is does not really matter. If it is larger than a certain value, you get, let's say, a number of 250; if it moves below that value, you get 196. The problem is, if you show this to a developer, he will ask: is 196 good, is 250 still good, is it very bad, or is it maybe suitable? With the pure number you do not really see that, because he is working in a different area with a different language. Using the Functional Data Explorer can greatly help to answer that question, because if you move over to the Functional Data Explorer and do the same analysis on the data, you get something that looks like this. It shows the developer the graph he is used to, the graph he normally gets from the measurements and experiments he does. Now the same parameter influences the shape of the curve: at the setting that gave 250 in the key parameter analysis, you see the difference, it is not really a straight line. But if it moves down to the lower number, the 196 shown before, it does not look too bad; it is not a perfect straight line, but it is not far from being one. Maybe at that point, if you show this kind of analysis to the non-data scientist, the developer, that is already good enough for him; he can say, okay, it is not a perfect straight line, but it is good for the job. Basically, it fits the customer's need, and that is something he does not see from the pure number you show him. So it is a very trivial, very simple example, but I hope it gives you the essence that using the Functional Data Explorer enables you to talk in the language of the experts, the non-data scientists, in this case the developer of the material. Talking in his area and his language helps him gain the insight; he does not have to translate the number back into a spectrum, he sees it directly and may get the answer he needs. Again, a very trivial, very simple example, but I hope it shows you the essence of what the Functional Data Explorer helps you to achieve. Moving on to the second example: that is the start of a chemical reaction which you want to detect. We do that with the rheological measurement, where you want to see a change in modulus when you heat up a certain chemical formulation.
Normally a typical spectrum looks like this: at a certain point the modulus changes and moves steeply upwards, and that is the start of the chemical reaction. So the key parameter which you extract from that spectrum is the point where the modulus starts to go up sharply. It may be a little more complicated to calculate that point from the original raw data, but nevertheless you can do it, detect the point where the reaction starts, and then do the analysis on the key parameter. What you get from this analysis on the key parameter is a model which allows you to predict the start of the chemical reaction based on, in this example, five parameters. The key parameter, the start of the chemical reaction, depends on those parameters, and you get a range from roughly 70 degrees Celsius, where the chemical reaction starts earliest, up to about 100 degrees Celsius. So again you reduce the total spectrum to one number, giving the developer a clue where the chemical reaction starts. Since you need the raw data anyway to extract the key parameters, you already have the raw data available, so it is not really much more effort to do the analysis within the Functional Data Explorer rather than a standard regression method on the key parameters. Using the same data, the Functional Data Explorer takes just a couple of seconds to calculate the outcome. There it is. Now you can model the whole spectrum. In that particular case I did not show the developers the analysis on the key parameters first; I showed them the results from the Functional Data Explorer. When I started to move the influencing factors around and show them this, they immediately noticed that the point of the reaction changes. I did not have to tell them that the point started to move, because, being the experts on the spectral data, they immediately recognized that the start of the chemical reaction shifts, as you can see here. The other thing you see, because you now model the full spectrum and not only the key parameter, is that the way the spectrum looks changes completely as you move the factors. For example, they saw that the peak started to move and the height of the peak changes; that was also recognized immediately after they saw this analysis. The difference with key parameters is that, having extracted one key parameter, the start of the chemical reaction, you could use other key parameters as well, maybe the position of the maximum peak or the height of the maximum peak, reduce the spectral data into these different key parameters, analyze them one after another, show the results, and calculate an optimum.
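A hedged sketch, not Lohmann's procedure, of extracting the key parameter just described: the reaction start, taken here as the first temperature at which the modulus curve begins to rise sharply. "Sharply" is implemented as the point where the numerical derivative first exceeds a chosen threshold, which is an assumption made only for illustration.

```python
# Detect the reaction start as the first temperature where the slope of the
# modulus curve exceeds a threshold. Curve and threshold are illustrative.

import numpy as np

def reaction_start(temperature, modulus, slope_threshold):
    """First temperature where d(modulus)/d(temperature) exceeds the threshold."""
    t = np.asarray(temperature, dtype=float)
    g = np.asarray(modulus, dtype=float)
    slope = np.gradient(g, t)
    above = np.nonzero(slope > slope_threshold)[0]
    return t[above[0]] if above.size else None

# Illustrative curve: flat modulus that starts climbing around 85 degrees C
t = np.linspace(40.0, 120.0, 81)
g = 1.0 + np.where(t > 85.0, 0.5 * (t - 85.0) ** 2, 0.0)

print(reaction_start(t, g, slope_threshold=5.0))
```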
But the good thing is, if you use the Functional Data Explorer, you get all of this information in one analysis. And, as I mentioned before, you are no longer talking in numbers; you talk in their language of spectral data, and you do not have to translate the key parameters into their world of spectra anymore. So you get much more information out of the Functional Data Explorer, and maybe they gain a deeper insight into the underlying processes. Now I come to the very last example: the polymerization process I was talking about, where the aim is to determine the molecular weight distribution which results from the polymerization reaction. Again, it is typically tricky to reduce a whole distribution to a few key parameters. How does a typical spectrum look? A typical spectrum shows the molar mass of the polymer, that is, the number of repeating units you have in a polymer chain; that is what you get from the chemical reaction and from the measurement. The well-known key parameters are the number-average molar mass, the mass-average molar mass, the Z-average molar mass and the polydispersity, which is the mass-average molar mass divided by the number-average molar mass. That parameter, the PDI or polydispersity, gives you an indication of the broadness, that is, the width, of your molecular weight distribution. Having these four key parameters, calculated from the raw spectral data which we already have available, you can do the analysis again using standard techniques, nothing fancy. You then get a model with, in this case, four influencing parameters and three key parameters; I did not include the polydispersity. What you see is that the number-average molar mass does not really change a lot if you change these four factors, so their influence on the number-average molar mass is not too big. It is quite big and quite significant on the mass-average molar mass, as you can see here: you can move it around quite a bit, and the difference between the lowest and the highest setting is about 150,000 grams per mole in total. This number changes quite significantly, which also means that the polydispersity should change quite a lot. The problem is that you do not see from these numbers what the distribution looks like; they do not give you the shape, and in our case the shape is very important because it affects the processability of the polymer quite dramatically. Having only three numbers, plus maybe the polydispersity, does not tell you whether the distribution is monomodal, bimodal, trimodal or whatever. The polydispersity may give you an indication of the broadness of the distribution, but it does not tell you whether you have one peak, two peaks or three peaks.
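For reference, the key parameters just named can be computed directly from a distribution; the sketch below uses the standard textbook definitions of the number-average (Mn), mass-average (Mw) and Z-average (Mz) molar masses and the polydispersity PDI = Mw / Mn. The input arrays (molar masses and their relative abundances) are illustrative, not Lohmann data.

```python
# Molar mass averages and polydispersity from a molecular weight distribution.

import numpy as np

def molar_mass_averages(m, n):
    """Return Mn, Mw, Mz and PDI for molar masses m with number fractions n."""
    m = np.asarray(m, dtype=float)
    n = np.asarray(n, dtype=float)
    mn = np.sum(n * m) / np.sum(n)                 # number average
    mw = np.sum(n * m**2) / np.sum(n * m)          # mass (weight) average
    mz = np.sum(n * m**3) / np.sum(n * m**2)       # Z average
    return mn, mw, mz, mw / mn

m = np.array([2e4, 5e4, 1e5, 2e5])     # molar masses in g/mol (illustrative)
n = np.array([0.1, 0.4, 0.4, 0.1])     # relative number of chains at each mass

mn, mw, mz, pdi = molar_mass_averages(m, n)
print(f"Mn = {mn:.0f}, Mw = {mw:.0f}, Mz = {mz:.0f}, PDI = {pdi:.2f}")
```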
But that makes quite a large difference in the final material characteristics of the polymer we get, and it changes the performance of the adhesive tape in development. Again, since you already have the raw data available, you can use the Functional Data Explorer to get this insight and extract much more information from all of these data than from the key parameters alone. Once again it takes a couple of seconds, and then you are done. And once again there is nothing fancy about the analysis; operating the Functional Data Explorer is nothing out of the ordinary. You put in your output spectra as Y, you put in your sample ID, then you put in your factors from your DOE, press the start button, and that is basically it; you can then fit it with the model parameters. It is a very simple, very straightforward procedure, and again you get much more insight into the underlying behavior than by reducing all those data to some key parameters. In this case, I showed you that one key parameter does not really change a lot while the other one does, but that does not give you the form, so it gives you no indication or feeling for the distribution. Using this tool gives you that insight. For example, in this case that peak is more like a shoulder, not really a peak, but if you change the influencing factors you come to a point where it really becomes a second peak. That tells you something about the underlying polymerization mechanism which you do not get from the key parameters, from just looking at those three numbers. Here it might give you an indication, or it might trigger something in the developer, because this is his area of expertise: something that helps him understand the underlying mechanism and then plan the next experiments even better than before, compared to just having seen the numbers. So I am at the end of my examples, and I hope I could show you that the Functional Data Explorer really enables the visualization of results to developers in their area of expertise. Keep in mind that, in our case, most developers are non-data scientists; they are not so interested in how you get to the results, they are more interested in the results themselves. The Functional Data Explorer allows you to talk in their language, in their area of expertise, which in our case, being a chemical company, means spectral data, and you do not have to throw away valuable data in order to reduce spectral data into some key parameters and do the analysis on key parameters. Instead of having to explain how the key parameters relate to the form and shape of the spectra, you can just use the data and show them directly the influence of the factors you are studying on the form and shape of the spectra and how the spectra change.
In some cases you might get the same insight, but in a lot of cases the analysis using the FDE gives you much more, a deeper insight into the underlying mechanism. And you do not have to throw away a lot of valuable data in the process of reducing the spectral data to some key parameters.   With that, I would like to thank my colleagues from the R&D department for doing all the experimental work, so that I can stand here in front of you and present these nice results. That concludes my presentation. Thank you very much for your attention. If you have any questions, please feel free to ask. Thank you very much.
Marcello Fidaleo, Professor, Università della Tuscia   The availability of functional data in batch unit operations is becoming more and more common as PAT tools and high-throughput analytical methods are developed. The Quality by Design approach to process development requires the development of a design space, that is the ‘multidimensional combination of raw material attributes and critical process parameters that assure product quality.’ For batch processes, such a design space should be dynamical in nature. In this work, functional data analysis applied to functional designed experiments was used to build the dynamical design space of the refining process of a cocoa and hazelnut paste used in ice cream manufacturing.     Auto-generated transcript...   Speaker Transcript Marcello FIDALEO Hi, my name is Marcello Fidaleo, and my presentation is about the functional data analysis and design of experiments applied to food milling processes. The aim of this research was to study the milling step of hazelnut and cocoa bean base used the in the manufacturing of ice creams. These process is called refining and is typically carried out in stirred ball mills in the batch mode, like this one report to the here on the right. The aim of this process is to reduce the particle size of the solid ingredients, such as cocoa and sugar, so that the final product is not gritty. To this aim we designed our central composite design like one report here on the right. We did three factors, N, D and S, where N the shaft the rotation speed, D is the ball diameter, and S is the overall mass of balls. As the responses, we consider that the size of largest solid particles, so that is the fineness, and the milling energy. Both responses were measured as a function of time, so this was actually a functional designed experiment, because the responses were not ???, but were functions, in this particular case, functions of time. We used JMP Pro to design the experiment and to analyze the results. For the analysis of the results, we followed two approaches. The first approach is the classical analysis of designed experiments. So we use the response surface models to regressive the responses as a function of the process parameters at two different time instance. In the second approach, which is a functional data analysis in design of experiments, we were able to include also the effect of time in the final model, that is, we considered the nature of the functional responses so...that's the functional nature of the responses. So in this presentation, I will talk about the second approach. Functional data analysis in design of experiments requires some intermediate steps to build the final model and also to obtain the design space of the system. Basically, we start with the functional data analysis to smooth the functional data. Then we apply functional principal component analysis to the smoothing functions. And so, at this point, by retaining a just a few components of this system we developed a model for the functional responses. So let's see the results... the results of this case studies. These are the results of the smoothing procedure. We applied the beta splines and we considered the ??? fineness and the energy. We can see that by using data transformation and also data filtering simple fitting functions, we're able to fit well the experimental data. In fact, we used one knot and a cubic spline and the one knot and a linear spline for fineness and for energy. 
Then we applied functional principal component analysis to the smoothing functions. Here on the left we can see that, for fineness, the first two components explained 96% of the variance, while for energy just the first component explained 98% of the variance. So in the final models we used two components for fineness and one component for energy. Here on the right I report the scores calculated for the 16 trials of the experiment. We found from our experience that the scores are useful for understanding the behavior of the system. For example, in this case the first component acted as a grinding-intensity axis, with the high grinding intensity runs on the left and the low grinding intensity runs on the right. At this point we regressed the scores as a function of the process parameters using a response surface model, including the linear, quadratic and interaction effects. These models were also very useful for understanding the effect of the process parameters on the functional responses. Finally, we were able to build the final models, as I said. Here I report fineness and energy as a function of time for a few runs, and we can see that the agreement between the experimental data and the predicted values is really good. As I said before, we used two components for fineness and one component for energy. So, finally, we were able to build the design space and to understand the effect of the process parameters, that is N, D and S, on the functional responses. On the profiler we can see that, under these conditions of N, D and S, we can predict the profile of fineness as a function of time and the profile of energy as a function of time. At the bottom I report two contour plots of the system. The one on the left was obtained with a rotation speed of 55 rpm and a ball diameter of 6.5 millimeters, while the one on the right was obtained with a mass of spheres of 29 kilograms and an operating time of 80 minutes. For example, on the left we can see that, for a given mass of spheres, when we increase the operating time the energy of course increases, while the fineness decreases. The white area in both plots is the design space: it is the area in which fineness is between 20 and 30 microns, so it is the area where the final product is not gritty and not over-milled. From these results we can conclude that functional data analysis applied to functional designed experiments appears to be a straightforward, robust, and easy-to-use approach to build the dynamical design space of a batch process.
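As a compact summary of the workflow just described (the notation is mine, not the speaker's): each smoothed response curve is decomposed into a mean function plus a few principal-component functions, and the scores are then regressed on the process parameters,

$$y_i(t) \approx \mu(t) + \sum_{k=1}^{K} s_{ik}\,\phi_k(t), \qquad s_{ik} = f_k(N_i, D_i, S_i) + \varepsilon_{ik},$$

with K = 2 for fineness and K = 1 for energy in this study, and each f_k a response surface model with linear, quadratic and interaction terms.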
Chi-Feng Ho, Student, STEAMs Training center Mason Chen, Student, Stanford OHS   Age is a big factor for National Basketball Association (NBA) professional players, because an injury could mean a premature end to a player's career. 36 effective candidates have been selected for study. "Effective" does not mean how strong they are; these players all have had long careers. Our goal is to find methods and information that can help coaches, doctors and even superstars understand the effects of aging. Another goal is to discover why these 36 players could play longer than other players. We use the trends of former players to predict the future performance of active NBA players. We selected three categories from the NBA scoreboard to understand how age affects players' total game time and total points per season. The total minutes (Mins) are used to determine a player's health, because only healthy (uninjured) players can have a longer career. The total points (Tpt) and points per minute (PPM) show how a player's performance decreases with age. We then used JMP tools and statistical methods to build models of the relationship between players' performance and age. Finally, we predicted the future performance of current players and the time at which a player should retire.     Auto-generated transcript...   Speaker Transcript Chi-feng Ho NBA aging influence and benchmarking analysis. Mason Chen is the co-author of the project. The main reason we decided to do this project is that our NBA idol Kobe Bryant passed away in a helicopter accident last year. Many people started to pay attention again to players' retirement age; lots of media and fans felt that Kobe could have played more rather than leaving the NBA league. We wanted to know whether such historical players could have kept going for about two to three more years. A team might consider a player's maximum contract based on the player's value, his current and future performance, given his age and previous injury history. There is no difference between a superstar and a bench player: one day they are absolutely going to confront performance decline due to aging factors. If a team's superstar has already begun this kind of decline, should the team still be centered on him? Of course not. The NBA is a business alliance: managers only want to build a strong team to win an NBA championship, because losing brings them no benefit. In our project we want to help coaches, trainers and doctors find out how important the aging factor is for NBA players. From our perspective, we first consider whether the timing was good enough for Kobe to leave the league. Secondly, we consider players who retired too late and some who retired too early, based on our data. Lastly, we work out how important age is for an NBA basketball player and give some clues. For our objectives, we first build statistical models to show how age is a big factor for NBA players, and then we predict active-duty players' performance using former players' career trajectories. In the data selection part, we only consider top players who have completed at least 1,200 games and 15 seasons, and every season counted is after the 1979-1980 season. We exclude the shortened-season records and then remove any season in which a player played fewer than 20 games. We selected three categories: total minutes, total points scored, and points per minute, and standardized each player's year-to-year statistics against his career average. Finally, 36 players qualified.
The Career index making and using the standardization logarithm. Although those players are qualified and counted in our data set, we cannot just use their scores, minutes on stage to to contrast. After doing the standardization, the group ratio of three categories came out and then we derived the peak career index, which is equal to Z TM, plus Z PP plus Z PM. The, Z TM is equal to the total minutes of one players, minus the group average. And then we divided by the standard deviation of total minutes. A method for three categories is the same. You may ask me why we do the Z standardization? Because Z statistics have two benefits. First of all, it will remove any standard deviation bias and it will make sure also to equal weight for players' career average. This is the pattern for 36 players position distribution by bar graph showing the 36 historical data set of power forward position players are qualified most, which contains amount of 11, and a small forward position players had the least, which only qualified three of them. On the right, the top, on top graph shows how do we calculate the combo curve and the bottom graph shows the top three players (Kobe Bryant to LeBron James to Michael Jordan) curve versus the combo. Why do we choose to use three players? Because those players ??? maintain well and they are the best players in history. By using the average of points per minute, total points to the minutes in each age categories, we combine them as the combo curve to use to compare to each of the data. It's easy to point out a peak average on the top graph will be 27 years old. And each categories point out the highest value. Kobe Bryant will be a good example to explain that. On the bottom graph LeBron James shows is an outlier, whose peak age seems to be 21 years old and during the 27 years old, his combo curve dropped to a new low. The peak season position dependency is to find out the golden age of difference position players. We use ANOVA test to find out the five position dependency was age, same or not. The dependent variable is to eight factors. We want to know whether there are any difference. Then we do one-way ANOVA. Is on the figure on the left there is difference between each position of the basketball players; shooting guard and small forward have an early age than other players as well. The different kind of ultimate strategy most focused on small forward and shooting guard. If the rookie has been chosen to train in those two positions they will have lots of shot and time. Although the run-and-gun strategy is famous this year, so small forward and shooting guard was still enter their golden age earlier as well. On the bottom the ANOVA table points out that P value is less than 0.5, which means the probability can reject the null hypothesis to further conclude at least one of the positions has different peak age. On the right graph, the MVP age is about two years after players peak age. At this time, players experienced their golden age range. We usually use MVP to contrast a player's performance. As they become older, their body functions will decrease and the injury might influence their mark in MVP selections. We usee a paired T test along the curve on the age study. We would like to see the connection between players' peak age and their MVP age. We want to check whether the difference between two measure values of the statistic is not zero. Our results shows one players who enter his peak age and after two years on average to receive his MVP title. 
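Written out, the standardization described at the start of this segment takes the usual z-score form (notation mine), with Z_PP (total points) and Z_PM (points per minute) computed analogously:

$$Z_{TM} = \frac{TM_{\text{player}} - \overline{TM}_{\text{group}}}{s_{TM}}, \qquad \text{peak career index} = Z_{TM} + Z_{PP} + Z_{PM}.$$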
The paired test found a significant difference between the peak age and the average MVP age. The prediction model, however, cannot reject the hypothesis, suggesting that the prediction is accurate. We set a judgment rule: if a player's performance falls below 60 percent of his career average, we might consider whether that person needs to decide to retire; if someone in his last year still performs at around 80 percent, we might think the player could still play two to three more years. Kobe Bryant left the league when he was 37, and we saw that his performance in his last year was maintained at about 80% of his ability. The left graph shows a set of players who should have retired earlier but did not; in particular, Juwan Howard's performance after age 34 dropped to about 20 to 40% of his career ability. So we could point out that Kobe could have played more and Juwan should have retired earlier. Our purpose in clustering is to use the clustering results to help coaches and teams see what their players' future performance will look like. We used JMP to partition the 36 players into seven clusters based on six different categories, so that the players collected in each cluster are similar to one another. Let's look at the graph. This graph compares Gary Payton and Derek Fisher. The R-squared value of each against his own cluster is quite high, which means there is a strong correlation; based on the R-squared values, the graph clearly shows that they are quite similar. Gary Payton retired from the league in 2007 and Derek Fisher in 2014, a seven-year difference. We might therefore use Gary Payton's career trajectory to predict the future tendency for Derek Fisher. Age is one of the most important factors for NBA players; as players age, they face retirement. This combo-curve model uses multivariate statistics, clustering and correlation to demonstrate how players' performance changes with age and with their performance trajectory, and the modeling methods could also be applied to other professional sports. Thank you.
Patrick Giuliano, Lead Quality Engineer, Abbott Mason Chen, Student, Stanford University OHS Charles Chen, Principal Data Scientist/Continuous Improvement Expert, Applied Materials   As consumers recede from social eating to rely on satellite delivery, optimal food preparation is increasingly vital. Using dumplings – a simple, nutritious yet tasty dish, multidisciplinary DSD DOE, robust design, and HACCP control planning were utilized to prepare the most efficient recipe. Although a controlled experiment was initially designed using DOE, some runs had to be substituted with different meat/vegetable types after a shortage of ingredients. To study the impact of the outliers, the original DSD was modified to try to account for the substitution. Before assessing model fit, each DSD was checked for orthogonality (color map on correlations), design uniformity (scatter plot matrix with nonparametric density), and power (design diagnostics). This ensured that the response surface model (RSM) results would be attributed to scientific aspects of dumpling physics instead of problems in the data structure. Next, the optimal RSM model was selected using stepwise regression, and model robustness was probed using t-standardized residuals and global/local sensitivity. Finally, advanced modeling capabilities in JMP Pro were leveraged (multi-layered neural network, bootstrap forest, boosted tree), and, with the reapplication of HACCP framework, the optimal parameter fitting was then proposed for use in commercial manufacturing based on the data structure and physics.     Auto-generated transcript...   Speaker Transcript Mason Chen So hi I'm Mason, and today I'll be presenting the dumpling cooking project in which we studied the relationship between data structure, dumpling physics and RSM results.   The motivation behind this project is that COVID-19 has increased demand of remote cooking and artificial intelligence and robotics in the food industry will play an increasingly important role as an application of technology continues to spread.   But most foods are made without precise control of cooking parameters as we rely on the chef's expertise to create consistent dishes.   But quality control and efficiency will be much more important if the robot is cooking a meal, so we decided to use steamed dumplings as a simple, nutritious but tasty dish to prepare the optimal recipe, according to the dumpling rising time.   So I stated previously, we performed a dumpling experiment, but due to an ingredient shortage, we had to substitute a different meat type in the experimental process.   So when we evaluated the DOE design, it was not orthogonal as there were some major confounding ???.   And when we try to run the RSM result with this data, there was also a potential outlier, particularly Run #6, which we wanted to study further.   So our first objective of this project is to address the shortage from a corrective(?) standpoint   and see if we can save the DSD structure. We'll do this by first changing the meat type, which was originally a two level categorical variable consisting of pork or shrimp   to a continuous variable so that we can use varying percentages of meat type for those three runs which we use with different mixtures of shrimp and pork.   And then we'll assess the data structure for this new DSD design using a variety of evaluation tools.   
Well, since the DSD structure was not problematic based on those tools, we decided to run an RSM model to see if its results are in line with our scientific research regarding convection and conduction.   Had the DSD structure been problematic, we may have received some false interactions that cannot be explained by science   and interactions that we would expect from science may be hidden due to the absence of orthogonality.   And our second objective after finding a good DSD model to account for the ingredient shortage   is more preventative approach, where we want to study the impact of the potential outlier and a ??? DSD structured by revising the settings.   The impact of outlier can help us better understand the importance of measurement control and a problem...while studying the DSD structure   will let us know how confounded structure will affect our results before even running our experiment in the future.   So we'll do the second objective by modifying and comparing the factor settings and response from Run #6, which is a potential outlier, and Run #9, which is our center point, and studying whether the model results are due to confirming data structure or if it can be explained by physics.   So the original design table is on the right and the ones that we substituted are Run #3, 4, and 9, as highlighted in orange.   Run #9, again, this is the center point so substitution may affect the model orthogonality, such as the average prediction variance we're looking to layer on. The original run uses categorical meat type of pork and shrimp, so we kept the setting   shrimp, even for these three runs where we had to substitute some of the shrimp for pork.   The colored map on left hand side show some confounding groups, so blue indicates no confounding, zero correlation between those two terms,   and dark red indicates severe confounding and high correlation, which is why the diagonal line will always be 1, since each variable is...each term is 100% correlated with itself. So there are some resolution lines ??? that affect confounding.   and   the...for meat type categorical, especially about a 0.3 correlation, and we have some severe Resolution IV, which is interaction interaction compounding risk, as some of them, such as the red blocks are greater than 0.5 correlation. So we should not   run an RSM model from here, since our data itself cannot be trusted to to the severe confounding, so before we proceed to run a model, we need to improve the RSM...the data structure. But how can we address this problem without recollecting the data?   So, since the meat type categorical had confounding problems and we should account with substituting meat type, we decided to change the meat type from categorical to continuous.   We changed the variable essentially to shrimp percent so the previous meat type that was all shrimp was changed to 100% and the ones that were all pork changed to 0%.   For three substituted meats, we estimated the true percentage values based on what we could recall about the order history for total meat we bought at the supermarket that day, so 70, 40 and 65%.   We then ran the color map on correlations again and looked at the power analysis to assess whether or not the data structure is more orthogonal.   The power analysis is all greater than 0.9 for main effects, which means that there is a greater than 90% chance we can detect a significant effect of these variables.   
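Outside of JMP, the same kind of "color map on correlations" check can be sketched in a few lines of Python: build the model matrix (main effects plus two-factor interactions) and plot the absolute pairwise correlations, where dark off-diagonal cells flag confounding. This is a generic sketch with placeholder factor names and random placeholder settings, not the authors' actual DSD.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from itertools import combinations

# Hypothetical run table (one row per run); factor names and levels are placeholders.
runs = pd.DataFrame(np.random.default_rng(1).choice([-1, 0, 1], size=(13, 4)),
                    columns=["water_temp", "dumpling_weight", "batch_size", "shrimp_pct"])

# Model matrix: main effects plus all two-factor interactions (2FI).
model = runs.copy()
for a, b in combinations(runs.columns, 2):
    model[f"{a}*{b}"] = runs[a] * runs[b]

# Absolute correlations between model terms, shown as a heatmap.
corr = model.corr().abs()
plt.imshow(corr, cmap="RdBu_r", vmin=0, vmax=1)
plt.xticks(range(len(corr)), corr.columns, rotation=90)
plt.yticks(range(len(corr)), corr.columns)
plt.colorbar(label="|correlation|")
plt.title("Color map on correlations (sketch)")
plt.tight_layout()
plt.show()
```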
Additionally, the color map on indicates slightly reduced Resolution 1 confounding as it is shaded bluer, especially for meat type continuous.   And Resolution 4 confounding doesn't change much, so we will choose to stick to this model, since the apparent orthogonality is not bad, and later on when you run the RSM, we will return to the color map to check if any interactions may be due to a confounding problem.   So next we look at the prediction variance profile, which indicates that prediction variance at different levels of factors.   Now the actual prediction profile depends on both the response error and a quantity that depends on the design and factor setting.   But for this prediction variance profile, the Y axis is a relative variance of prediction, which is the actual variance divided by the response variance   so that the response variance cancels out and the prediction variance profile only depends on the DSD structure and not the response.   So the top graph is the average variance at the center point and the bottom is the maximized variance, so we need to look at both because average variance indicates more information about the entire design,   and the maximum variance tells you information about the worst case point. So the average variance is Run #9   at Run #9's settings, which we had to substitute some of the meat. And at Run #9, the variance is 0.06, which is not too bad and the variance is symmetric about the center point, which means that the substitution does not severely impact them model orthogonality.   The prediction variance will always be greatest at corners, since you don't want to make predictions of the corners, because less points are around it and the optimal run is usually around the center   with the least prediction variance. For the maximum variance, you can see that the curve is a bit asymmetric for dumpling weight and meat type.   This one is Run #18 but there isn't anything special about it, so shouldn't be too big of a deal. So since there hasn't been anything major standing out for this DSD structure, we will go ahead and proceed to run RSM model for the continuous meat type.   So we ran RSM model using stepwise progression and mixed selection, based on the p value. And we chose a stepwise progression instead of the ordinary least squares regression, because the least squares   need at least the same number of runs as terms and we don't have enough runs to estimate all the terms.   So three important mean effects are water temperature, dumpling weight and batch size. And we have interaction term of water temperature times dumpling weight.   The ANOVA is significant, which indicates that we can reject the null hypothesis and conclude we have a model. We don't have an overfitting problem   as the R squared and R scquared adjusted differ by less than 10%. For lack of fit window, the sum of the lack of fit error and the pure error is the total error sum of squares from the ANOVA table.   And pure error is independent of model, such as things like experimental error, so if the total residual error is large compared to the pure error.   It means we might have to fit a nonlinear models, since the linear model has high lack of fit. When all the error is only due to pure error, then there is zero lack of fit   and the R squared value is equal to the maximum R square. 
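The relative prediction variance described above is the standard quantity x0'(X'X)^(-1)x0, which depends only on the design and the model terms, not on the response. Below is a minimal numeric sketch with a generic toy design, not the actual dumpling DSD.

```python
import numpy as np

def relative_prediction_variance(X, x0):
    """Relative variance of the predicted mean at point x0 for model matrix X.
    This is Var(y_hat(x0)) / sigma^2 = x0' (X'X)^{-1} x0."""
    XtX_inv = np.linalg.inv(X.T @ X)
    return float(x0 @ XtX_inv @ x0)

# Toy example: intercept + two factors on a 2^2 design plus one center point.
X = np.array([[1, -1, -1],
              [1, -1,  1],
              [1,  1, -1],
              [1,  1,  1],
              [1,  0,  0]], dtype=float)

center = np.array([1.0, 0.0, 0.0])   # prediction at the center point
corner = np.array([1.0, 1.0, 1.0])   # prediction at a corner
print(relative_prediction_variance(X, center))  # smallest near the center
print(relative_prediction_variance(X, corner))  # larger at the corners
```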
The p-value for the lack of fit is greater than 0.05, so we don't really have evidence that we need to switch to a nonlinear model: the lack-of-fit error is not large relative to the pure error.   Now, the R-squared is not excellent, because of the possible outlier in Run #6 seen in the studentized residuals, which is touching the limit bound. Let's see why that run might be an outlier.   If you look at the prediction profiler at the settings of Run #6, we see that there is a large triangle, which indicates greater local sensitivity.   This means that a small change in the input causes a large change in rising time, that is, a greater instantaneous slope at the factor settings of Run #6.
Saloni Patel, Student, Stanford OHS Mason Chen, Student, Stanford OHS   This project investigates the validation of a prediction model and the actual result of the 2020 United States presidential election. The prediction model consists of the predicted election result, which is derived from the z-scores of the number of infected cases, deaths and unemployment increase rates for each of 15 “swing states” along with the 2012-2016 election result average. In order to identify the most important swing states, a Swing State Index was derived using the 2012, 2016, and 2020 election outcomes.  The predicted election result is then subtracted in response to the media’s report about how Donald Trump is expected to lose 3-5 percent of his votes from the 2016 election. The model is used to compare the level of accuracy between the predicted 2020 election result and its subtracted values against the 2020 actual election result. The paired t-test and regression test are used to test the significance between the 2020 actual result and the 2020 predicted result as well as the 2016 actual result and the 2012 actual result to see how the 2020 predicted result compares with the 2016 election result and the 2012 election result in predicting the 2020 actual result. A one proportion hypothesis test is also used to compare the accuracy of the 2020 predicted result with the 2020 actual result.  The next part of this project studies factors that influenced the voting behavior of the 15 key swing states in the 2020 United States presidential election by linking statistical clustering methods with notable political events. In addition to key decisions made in the Trump administration, factors unique to this presidential election such as the global COVID-19 pandemic and the Black Lives Matter movement were investigated. Hierarchical clustering was used to group the 15 swing states based on the Swing State Index, and the relationships between each cluster were attributed with events that may have factored into the cluster behavior. The most representative and significant swing states were identified to be Arizona, Georgia, Wisconsin, and Pennsylvania (based on the clustering history) as well as Michigan and Minnesota (based on the Swing State Index). After analyzing specific events that affected these six states’ voting behavior, the Black Lives Matter movement and concerns over health care were the most significant factors in President Trump’s defeat. Next, the state of Georgia was further studied to better understand the influence of COVID-19 and the economy on the state’s voting behavior. By adjusting the ratio of the COVID-19 values (infected cases and deaths) and economic value (unemployment rate), it was found that the economy was of greater importance than COVID-19 to Georgian voters. The study of similar events by connecting political science (e.g. government decision-making) and clustering methods can be applied to future elections to better predict the outcome of important swing states and, thus, the overall election results.  All calculations and analysis are done on the JMP 15 platform.     Auto-generated transcript...   Speaker Transcript Saloni Patel Okay, so hello, my name is Saloni and today I'll be presenting our project, the United States presidential election prediction model and swing states study behaviors study. There are two parts in this project, the first involves creating and evaluating a model meant to predict the 2020 US presidential election. 
The second part of the project will study swing state behavior in the 2020 US presidential election and identify key events that affected the voting patterns in the election using hierarchical clustering methods. All the analysis was done on the JMP 15 platform. To clarify our project does not focus on all 50 US states and instead we will only study the top 15 swing states. The swing states are states that can reasonably be won by either the Democratic or Republican presidential candidate, as opposed to safe states that consistently lean towards the one party. Additionally, the US voting system depends on the Electoral College system that gives a set number of votes to each state based on population numbers. There is a total of 538 electoral votes so a presidential candidate must get 270 electoral votes to win the presidential election. Since most of the states are known to vote for either a Democratic or Republican candidate without hopes of being swayed out of the normal voting pattern, the Electoral College system and the presidential election result depends on the bulk of the swing states that can potentially be won by any of the candidates. A win by even a small margin results in that candidate acquiring all the votes the state has to offer, so swing states are especially impactful in determining the next president. So, to begin we conducted this project in hopes of better understanding the historic 2020 US election that occured in the middle of a global pandemic and socially as well as economically unstable times. The first part of our project's objectives is to identify key swing states, create a prediction model based on the influence of COVID 19 and the economy in those identified swing states, and lastly validate the prediction model with the actual election results once those came out. So the first step in our prediction model is identifying top 15 swing states from the past three elections using this swing state index. We use this formula to determine whether the states are swing states or not. It is also important to note that this swing index does not take into account which side each state votes for, but rather on the election results itself. In other words a state could have voted for the same side all three years yet by very different margins and still be counted as a swing state. We will further study this index in the next part of the project, but right now, all we use this index is to...for is to identify the top 15 swing states. Once the swing states are identified, we derive the first value we will need to calculate the predicted 2020 election result. This is the 2016-2012 composite win margin. To calculate this value we took the 2016 result and the 2012 election result. In the formula, we gave the 2016 results twice the weight because it was more recent than the 2012 election and we gave another twice the wait for the 2016 election. Because President Trump was present in the 2016 election running as president, while Joe Biden was present in the 2012 election as a candidate for vice president. In total, the 2016 results will have four times the weight as in 2012 in the 2016-2012 composite win margin. Next we identify factors that are unique to the 2020 and factors that voters may vote according to. We found that the global COVID-19 pandemic and the following hit the economy took were important factors unique to 2020 so we collected the infected cases and death cases due to COVID-19, as well as the unemployment increase in each state. 
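The composite win margin described above is only given verbally (the 2016 result gets twice the weight for recency and twice again because both 2020 candidates were on the 2016 or 2012 tickets, so four times the 2012 weight in total). One plausible written form, assuming the weights are normalized to sum to one, which the talk does not state explicitly, would be

$$\text{composite win margin} = \frac{4\,r_{2016} + r_{2012}}{5},$$

where r_year is the state's win margin in that election.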
Next we applied the Z standardized transformation to avoid any sampling mean and variance biases. Using those Z scores, as Z-infected, Z-deaths and Z-unemployment, we derived the Z-COVID index. This index will represent the impact the global pandemic and the following economic hit each state experienced. Lastly, we calculated the 2020 predicted result, using the 2016-2012 composite win margin and the Z-COVID index. Once the 2020 election passed, we recorded the 2020 actual election results and proceeded to validate our prediction model and whether our choice of factors did a good job helping predict the 2020 election result. Additionally, since the media before the election had predicted that Trump will lose 3 to 5% of the votes from 2016, we decided to subtract certain percentages from the predicted results. In the table below, that predicted result is the zero percent category and the reductions can be seen as well. To analyze the results we compared the predicted results with the 2020 actual election results, using the regression and paired t-test. To compare how the 2020 predicted results compared with previous election results at predicting the 2020 actual result, we also include the 2012 election and 2016 election results in our evaluation. Lastly, we also conducted a 1-proportion hypothesis test to test the 2020 predicted results accuracy. To begin we conducted a regression test with the election results presented just from each state. The 2012 result compared to the 2020 actual election result did not yield a significant result. However, the regression tests between 2016 actual and 2020 actual displays a significant result and the highest R squared of 0.81. The results of the regression is also close to 1, at about 1.17, suggesting a strong regression relationship. The regression between the 2020 predicted and the 2020 actual also displays a significant result but a lower R square of 0.3. From these results, the regression between the 2016 actual and 2020 actual results had the highest R Square and slope closest to one, despite having declared a different winner. It is reasonable to find that the 2020 election results would be correlated with the 2016 election results since Trump lost those swing states narrowly wone in the 2016 election by small margin, so just Michigan, Pennsylvania, and Wisconsin. Next, the paired t-tests that will also compare the election results percentages of each state. we use the paired t-test because the same states or pairs are being assessed against each other. The paired t-test only found a significant difference in the means of the 2016 actual election results and the 2020 actual results, which makes sense since these results had a high regression test significance. This would suggest that the means of the 2012 election and the 2020 predicted results are similar to the 2020, actual meaning that the election results are similar. In the 2012 election, the Democrats had won the election and the predicted results had predicted that Democrats would win in 2020, while in 2016 the Republicans had won the election. This can explain why 2016 is significantly different from the 2020 actual election results, while the... while the 2012 actual and 2020 predicted are not. Lastly, we use the 1-proportion hypothesis test to test how the 2020 predicted results matched with the 2020 actual results. Unlike the regression and paired t-test, the 1-proportion test compares the states and which side they voted for. 
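A minimal sketch of this standardization step in Python, assuming (as the later discussion of the two-by-one ratio suggests) that the index simply sums the three z-scores, which automatically weights the two COVID-19 terms twice as heavily as the single economic term. The state rows and values are placeholders, not the actual data.

```python
import pandas as pd
from scipy.stats import zscore

# Hypothetical per-state table; the real analysis used all 15 swing states.
states = pd.DataFrame({
    "state": ["GA", "AZ", "WI"],
    "infected": [9.5e5, 7.8e5, 6.1e5],
    "deaths": [1.7e4, 1.4e4, 9.0e3],
    "unemployment_increase": [2.1, 3.4, 2.8],
})

# Z-standardize each factor across states to remove mean and variance bias.
for col in ["infected", "deaths", "unemployment_increase"]:
    states[f"z_{col}"] = zscore(states[col])

# Index combining pandemic and economic impact; the two COVID terms give a 2:1 weighting.
states["z_covid_index"] = (states["z_infected"] + states["z_deaths"]
                           + states["z_unemployment_increase"])
print(states[["state", "z_covid_index"]])
```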
The regression and paired t-test only compared the election results without any indication on which side the states voted for. Therefore, this test is more powerful and validating the prediction model, since it compares the predicted side each state would vote for and which side the states actually voted for. We assign the states that voted for the predicted side with a pass and those that did not vote for the predicted side with a fail. In total 12 out of 15 states received a pass, as they were predicted accurately, while the other three received a fail. We set the success value at pass and the scale is a sample proportion of 0.8. Since we want the sample proportion to be greater than 0.9 or 90% accurate, we set the hypothesized proportion to 0.9. Since the 0.8 proportion failed to exceed the 0.9 at the 95% confidence level, the prediction model failed to be 90% accurate, failed to reject the null hypothesis at the 95% confidence level. According to the proportion of our sample this model is 80% accurate. To summarize the regression test showed significance between the 2016 actual results and the 2020 actual election result., as well as a weaker significance between 2020 predicted and 2020 actual election results. We theorize that this may have been because this election, President Trump lost those swing states narrowly won in the 2016 election. The paired t-test showed significant difference between the 2016 actual and 2020 actual, and we theorized that this may have been because those two elections declared different winners. President Trump won the 2016 election yet lost the 2020 election. Additionally the 2012 and 2020 predicted results are not significantly different from the actual 2020 result...election result, which may have been because they both declared the same political party as the winners. As...lastly, the 1-percent hypothesis test failed to reject the null hypothesis, and so our prediction model is not 90% accurate at the 95% confidence level. Arizona, Wisconsin, and Minnesota, which could suggest that there were other major factors besides the impacts of COVID-19 and unemployment rates that influenced the 2020 election result. This is where we transition to the next part of our project in which we will group states based on their swing state index and identify them with key events that took place in 2020 that could have influenced the swing states' voting behaviors. So the questions this part of the project will address from the last is which events and factors influence the swing states to vote the way that they did. How much more or less did voters care about COVID-19 than the economy and other side investigations? Can we use statistical tools to link political events with voting patterns? The goals for this project is to study the previously identified swing states voting patterns by linking statistical clustering methods to political events. We will also adjust the Z-COVID index, or as we will now call it Z-Ratio, with new ratios to better understand the importance of COVID 19 and the economy in voting behavior. Previously the Z-COVID index had two by one ratio, where the values of COVID-19 infected cases and deaths were given twice the weight compared to the unemployment increase value, since there were two values for COVID-19 and only one meant for the economy. We realized that each State was impacted differently by the pandemic, so we thought it would be appropriate to analyze the effects of switching this two by one ratio to other ratios. 
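The one-proportion test described here (12 of 15 states predicted correctly, tested against a hypothesized 90% accuracy) can be reproduced with an exact binomial test; this is a generic SciPy sketch, not the JMP output.

```python
from scipy.stats import binomtest

# 12 "pass" states out of 15; H0: true accuracy = 0.9, H1: accuracy > 0.9.
result = binomtest(k=12, n=15, p=0.9, alternative="greater")
print(12 / 15)        # 0.8 observed accuracy
print(result.pvalue)  # large p-value -> fail to reject H0 at the 95% confidence level
```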
First, we will go back to the swing state index, which helps identify the swing of each State using the election result percentages from the past three elections. A negative election result indicates that the swing voted for Democrats, while positive indicates a Republican vote. The larger the magnitude and the more negative the swing index, the more that states voting patterns have swung. If the state changes direction then the signs of the two differences will not be equal, causing the swing state index to be negative and display more of a swing behavior. From this table, we can see that Michigan and Minnesota have negative values of the largest magnitude, which means they have been swinging the most in the past three elections. Overall, the swing state index is quite useful in understanding basic voting patterns for the swing states. However, the swing state index cannot identify key events that caused the voting patterns. We used hierarchical clustering to study states with similar voting patterns and list potential factors that affected their voting behavior. Hierarchical clustering grouped the 15 swing states into four different clusters, as seen on the right. We used this method, because of its bottom up approach, where every state is its own cluster before they emerged one at a time and moved up the hierarchy. On the right, Iowa and Ohio can be seen in red indicating that they're in the same cluster. As mentioned previously, the hierarchical clustering divided the swing states into four clusters, the first cluster consists of Iowa and Ohio. Both of these states had voted blue or Democratic in the 2012 election, yet red or Republican in the 2016 and 2020 election. The second cluster has Georgia, Arizona, North Carolina, and Florida. All these states except North Carolina became bluer or redder, or in other words are starting to favor one side heavily. But third cluster consists of Wisconsin, Pennsylvania, Michigan, Nevada, New Hampshire and Minnesota. All of these states, besides Nevada, have a negative swing index, meaning they're the most inconsistent swing states. The last cluster has Colorado, Virginia and New Mexico, which are all relatively blue states or states that have consistently voted Democrat and in the 2020 elections, voted blue by a larger margin thatn previously. Now that we have all the clusters and idea of their characteristics, we looked at the clustering join history, which identifies the top pairs of states or which two states are the most similar in their clusters. From the join history, the first two pairs are Wisconsin with Pennsylvania from the third cluster and Georgia with Arizona from the second cluster. Both pairs are part of the clusters that had states that switched from red to blue in the 2020 election. After further research, we found that Wisconsin and Pennsylvania appeared to have concerns for the economy, dissatisfaction with President Trump's healthcare related policies, such as his efforts to weaken the Affordable Care Act formed under the Obama administration, as well as concerns for the environment, all of which ultimately made majority of the voters vote Democratic. However, in Georgia and Arizona, we see that major shifts in demographics, such as more registered Latino voters in Arizona and the Black Lives Matter movement that exposed serious racial injustice, ultimately caused majority of the voters to cast a Democratic ballot. 
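The grouping step can be sketched outside JMP with SciPy's agglomerative (bottom-up) hierarchical clustering. The state list and feature values below are placeholders; the real analysis clustered all 15 swing states on the swing state index built from the 2012, 2016 and 2020 results.

```python
import pandas as pd
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

# Placeholder swing-state features (not the real index values).
swing = pd.DataFrame(
    {"swing_index": [-0.8, -0.7, 0.2, 0.3, -0.1, 0.5]},
    index=["Michigan", "Minnesota", "Iowa", "Ohio", "Georgia", "Colorado"],
)

# Bottom-up clustering: every state starts as its own cluster and pairs merge upward.
Z = linkage(swing.values, method="ward")
swing["cluster"] = fcluster(Z, t=4, criterion="maxclust")  # cut the tree into 4 clusters
print(swing.sort_values("cluster"))

# The dendrogram shows the join history (which states merge first).
dendrogram(Z, labels=swing.index.to_list())
plt.show()
```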
Through hierarchical clustering we were able to separate states into different groups based on their voting behaviors and create connections on which key events caused the observed voting behavior. Although we found key events that influenced each state's voting behavior, hierarchical clustering did not tell us the weight each event played in the individual swing state's voting behavior. COVID-19 and the economic recession that followed. COVID-19 and the economy. However, we had to assume that each state would have the same Z-ratio, which gave COVID-19 twice the weight, resulting in a two by one ratio for each state. However, from the hierarchical clustering, we found that each state has a unique situation and their voters cared for different issues. To adjust the Z-ratio and...the Z-ratio, we created a value called the Ratio Variable. The Ratio Variable will determine the ratio of the importance of COVID-19 or Z-COVID index, which represents the infected cases and deaths in each case versus the economy or the unemployment which represents the annual unemployment increase rate in each state. Once the Z ratio is adjusted, with a few different ratio variables, such as 0.1, which creates a ratio of one by 10 giving the economy 10 times the importance, it is implemented into the full formula used to calculate 2020 predicted results. These adjusted 2020 predicted results are compared against the 2020 actual election results to determine which ratio best explains the state's situation and how much of importance COVID-19 and the economy had in influencing the voting behavior. We decided to study Georgia's voting behavior closely since it appeared to stand out compared to the other swing states. For one, Georgia was the first state to reopen business in April, while the rest of the states did not. Additionally, Georgia was a key state in the 2020 election, which President Trump had an eye on even after the election results were announced in attempts to overturn them. Georgia voted blue by a small margin and had an election results of -0.3%. The adjusted 2020 predicted results for Georgia was potted and a marker for the Georgia...for Georgia's actual election result was placed on the graph on the right. From the graph we see that the adjusted 2020 predicted results with the ratio variable of 0.75 had the value of -0.2%, which is the closest to Georgias's actual election result, -0.3%. The 0.75 ratio variable means that the ratio is three by four, indicating that the economy was more important issue to a majority of the voters in Georgia. This makes sense because, as mentioned previously, Georgia was the first state to reopen business in April, indicating a strong concern for the well being of its businesses and economy. In this project we explored different key events and their importance in influencing the voting behaviors of 15 identified swing states using statistical methods. First hierarchical clustering was utilized to group swing states based on their voting behavior in the past three elections. From this we found that in the second cluster consisting of Arizona, Georgia and others, were mostly affected by issues regarding civil rights while states such as Pennsylvania and Wisconsin in the third cluster had voted for Joe Biden due to concerns for the economy, healthcare, and environment. Overall, the worsening COVID-19 situation, racial movements such as Black Lives Matter movement, COVID-19 and the economy, had on each state. 
Georgia was explored in more detail, and it was found that a three-to-four ratio matched the actual election result best, suggesting that the economy was a more important issue to voters than COVID-19; this makes sense because Georgia was the first state to reopen businesses in April. Thank you for listening to my presentation.  
Anne-Catherine Portmann, USP Development Scientist, Lonza Sven Steinbusch, Senior Project & Team Leader, Microbial Development Service USP, Lonza   Often, the analysis of big data is considered to be essential in the fourth big industrial revolution – the “Data-Based Industrial Revolution” or “Industry 4.0.” However, handling the challenge of unstructured data or a less than in-depth investigation of data prevent using the full potential of the existing knowledge. In this presentation we offer a structured data handling approach based on the “tidy data principle,” which allowed us to efficiently study the data from more than 80 production batches. The results of different statistical analyses (e.g. predictor screening, machine learning or PLS) were used in combination with existing process knowledge to improve the overall product yield. With the newly created knowledge, we were able to identify certain process steps that have a significant impact on the product yield. Additionally, several models demonstrated that the overall product yield can be improved up to 26 percent by the adaptation of different process parameters.     Auto-generated transcript...   Speaker Transcript Anne-Catherine Portmann Hello, today I will present you the power behind data. This presentation is based on the idea that a principal this presentation will allow us to efficiently study the data of more than 80 production batches. We were able to improve the product yield of more than 26% based on the process knowledge and the statistical analysis. The statistical analysis allows also to identify the key process step which have an impact on the product yield. So I will first introduce Lonza Pharma Biotech and then we will go to the historical data analysis. Lonza Pharma Biotech was found firstly in 1897 and shortly thereafter, it was transformed to chemical manufacturer. Today we are one of the world's leading supplier to pharmaceutical, healthcare and life sciences industry. Here at Visp, we are one of the biggest site from Lonza and we are most significant for R&D, development and manufacturing. We are, we have a new part of the company, the Ibex solutions, where we are able to complete biopharmaceutical cycles from preclinical to commercial stage, from drug substances to drug product, all of this in one location. You probably heard about this lately in the Moderna vaccine against the COVID-19, but it's not the only product that we are producing here in this. We are also producing small molecules, mammalian and microbial biopharmaceuticals, high potent APIs, peptides and bioconjugates including antibody-drug conjugates. Now that you know a little bit more about Lonza, I will go to the historical data analysis. So, first of all, I will present to you this process on which the 80 batches are run. So, first the upstream part. So the upstream part have first the fermentation part, where the product is generated by the micro organism. So the product make a microorganism. The... product to produce...the product, the microorganisms is produced (???) from the DNA during fermentation. Then we have the cell lysis where we disturb the cell membrane and allowed to release the product and all what is in the cell. and have access to this product. Then become the separation. In the separation part, we remove the cell fragments, such as the cell membrane or the DNA. Then we come to the downstream part, which is based on three different chromatography and allow the purification of the product. 
So the product is here in yellow in the below part of the slide. And we can see that during each of the chromatography part, we are able to profile a little bit more of the product. At the end, we perform a sterile filtration of the product. So the goal was to increase the overall project yield, and to do that, we first collect the data of the 80 batches and order them in a way that we can analyze them. Then we perform yield analysis. And then we discuss the result with the process analysis. With the SMEs, so the subject matter experts. Then we have seen and...we went to the data analysis for the upstream part and we perform this for analysis on the left of the slide. Then we go for the downstream part and focus on the Chromatography 1. At the end, we make a conclusion from all what we see in the...in the analysis and what the subject matter expert orders. And at the end, we recalculate the yield. Let's see what...how we organize our data. So we based the data on the tidy data principles. That is a big part of the...before the analysis, which takes time, but it's really important to have clean data and making an efficient analysis afterwards. So first we have about, that is, the observational unit one, for example, we can say to the fermentation. And then on each row of the...of the file we include one batch each time. For each column, we take a parameter. For example, for fermentation, the pH, the temperature, all the titer (that means the amount of the product at the end). And then, for each values, here corresponds the correct value from the column and the batch. And with this one, we can go to JMP and perform the analysis. So let's see how we calculate the yield. So, first we calculate the yield for each step, beginning at 100% for fermentation and see how it decreases along the process. So what we observe is that we have a big variation at the fermentation step. And then we have a decrease in the in the product amount at the separation step, as well as the chromatography 1 step. And so we go with this data to the subject matter experts and they told us that the complex media variability impacts the final titer of fermentation, so we have to explore this spot. Then, for the separation, the strategy that was choose could have a different impact on the mass ratios. And for the chromatography 1, the pooling strategy have most probably an impact. So, then we will see what the data said. So we look at the upstream part and perform different analysis. So the first analysis was the multivariate analysis of each of these USP process stages. So we focus on the fermentation, cell lysis, and separation. And see all the parameters, how they could correlate with the product at the end. So here, what we see the fermentation, the amount added to Reactor 1 had a medium correlation with a good significance probability. For separation, the final mass ratio. The mass ratio at the intermediate separation have both a major impact on a significant probability. You see that other parameters, such as the initial pH from Reactor 2 is very close to the medium correlation threshold and have a significant probability. And we will see if this parameter in the next analysis. We have also selected here only the parameter, which is scientifically meaningful for the other analysis also. Then we went to the partial least squares. For the partial least squares, we see that we have for fermentation a positive correlation for all these parameters. 
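A minimal pandas illustration of the "tidy data" layout just described: one table per observational unit (process step), one row per batch, one column per parameter. Batch IDs, parameter names and values are invented for the example.

```python
import pandas as pd

# Tidy table for the fermentation step: one row per batch, one column per parameter.
fermentation = pd.DataFrame({
    "batch":       ["B001", "B002", "B003"],
    "pH":          [6.9, 7.1, 7.0],
    "temperature": [30.5, 31.0, 30.8],   # degC
    "titer":       [12.4, 13.1, 11.8],   # product amount at the end of fermentation
})

# A separate tidy table per observational unit (e.g. separation) can then be
# joined on the batch key before the combined table is analyzed in JMP.
separation = pd.DataFrame({
    "batch":      ["B001", "B002", "B003"],
    "mass_ratio": [0.82, 0.79, 0.85],
})

merged = fermentation.merge(separation, on="batch")
print(merged)
```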
So again we see the amount of Reator 2, the initial pH of Reactor 2 and the initial amount of Reactor 2. As well as a new parameter, that is the hold time. And we see that the amount of Reactor 2 have a positive correlation with this analysis, but the negative with MVA. And this could be explained, because of the 80 batches whereby just...which were running production, but they were not designed to answer a question of positive or negative correlation on the product...on the final product. So that could be done in the future, in another analysis with a proper design. With ??? to still say that we have an impact on the final product. For the other parameters, at the other steps of the upstream part, we also see that the prediction matches the multivariate results. And we have also a possibility to improve the titer. Here we see with the prediction profiler that we can also optimize in the future and the project yield. Then we test the product...the predictor screening. And here we ran 10 times the predictor screening and the five parameters which will always found in the top 10 were selected. And here what we see. It's the initial pH of Reactor 2, the mass ratio at the end of separation, the mass ratio at intermediate separation, the initial amount of Reactor 2, the amount of Reactor 2, and the amount of Reactor 1. So again, we have the same parameter that appears to have an impact. So, then, we went to the machine learning. This machine learning analysis, XGBoost, is a decision tree machine learning algorithm. And to avoid having in this result parameter that's not part of the... of the top of our parameter, we include a fake parameter, which give us a kind of threshold in the parameter importance. And all the parameters which appear above this threshold were considered to have an impact. The other are considered to be random and below this random parameter and have no impact or not significant impact on the on the final product. And here we can see that the negative correlation will appear for Reactor 1. The pH of Reactor 2 and initial amount of Reactor 2, we have a positive correlation. And for the mass ratio, we have a negative correlation. Again, as I explained before, the difference between negative and positive correlation was not the goal and not designed for this experiment, so we know it's an impact, but we don't know if it's positive or negative yet. Then we will go to the downstream part, specifically on the Chromatography 1. And here we use the neural predictive modeling. So in the normal predictive modeling we use the a different fraction of the chromatography. So on the graph on the right, we see that Fraction 8 have...is the main fraction, so where we found most of the product, the highest purity. And then by decreasing from Fractions 7-1, we have product, but also more impurity. and Until now, we were taking into account of the Fraction 4 in our analysis and we would like to see if we can include also Fractions 3, 2 and 1. And what we saw is that the by increasing the number of fraction, we increase the yield but decrease very few the purity. So the graph on the left, we see that if we go to the Fraction 2, we decrease the purity to less than 1% but the yield was increased to 5%. And then, when we include the Fraction 1, there we have a bigger decrease of the purity but it's a little bit than 1%. More than one percent of purity decrease and the yield was in the other side increase of about 10%. So then, with this result we try to summarize everything together. 
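The "fake parameter as an importance threshold" idea used with the tree-based screening can be sketched generically: add a pure-noise column, fit a tree ensemble, and treat any real predictor whose importance falls below the noise column as not meaningfully important. The sketch below uses scikit-learn's gradient boosting as a stand-in for XGBoost, with invented column names and simulated data, not the actual batch records.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
n = 80  # roughly the number of production batches

# Placeholder process parameters (names and values are illustrative only).
X = pd.DataFrame({
    "reactor1_amount": rng.normal(size=n),
    "reactor2_initial_pH": rng.normal(size=n),
    "separation_mass_ratio": rng.normal(size=n),
    "hold_time": rng.normal(size=n),
})
y = (2.0 * X["reactor2_initial_pH"] - 1.5 * X["separation_mass_ratio"]
     + rng.normal(scale=0.5, size=n))

# Add a fake, purely random predictor to act as an importance threshold.
X["fake_random"] = rng.normal(size=n)

model = GradientBoostingRegressor(random_state=0).fit(X, y)
importance = pd.Series(model.feature_importances_, index=X.columns).sort_values(ascending=False)
threshold = importance["fake_random"]
print(importance)
print("Considered impactful:", list(importance[importance > threshold].index))
```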
For fermentation, we have the final volumes of the tanks and reactors, which were identified by most of the methods. The initial pH of the fermenter was also identified by the different analysis methods, and the variability of the complex compound was identified by the process experts. To be able to see the effect of the complex compound variability, we will need further investigation in the lab. For separation, we have the mass ratio, which was identified by some of the data analysis methods but also by the process experts. The strategy is very interesting: the process experts decided to look at it and run some tests to try to improve the yield in production. For Chromatography 1, the pooling strategy was identified by the process experts and by the neural network analysis. Here the method can easily be implemented in the lab and also in production, and the yield really increases a lot with this method. Then we recalculated, with the prediction profiler, how much we could increase the yield of the different steps. For fermentation we were able to increase the yield by up to 16%. For separation we can increase it by up to 5%, and for Chromatography 1 also by up to 5%; on the other slide we wrote up to 10%, so here we take the worst-case scenario and say up to 5%. In total we get an increase of 26% at the end, which is a good way to improve our process and to focus exactly on the parts where we can have a big impact, based only on the data and without doing a lot of experiments in the lab. It is also much cheaper to do these analyses with JMP than to run a lot of experiments in the lab, so we gain a lot in the end. Thank you very much to all of you for listening to me today, and also a big thank you to my colleagues Ludovic, Helge, Lea, Nichola, and Sven for the ??? of this presentation.
Paolo Nencioni, Technician, A.Menarini Diletta Biagi, PhD Student, Università di Firenze   In the pharmaceutical world, tabletability, compactibility and compressibility profiles are commonly used to characterize raw materials and powder formulations under compression. By evaluating these profiles, it is possible to explain the mechanical behavior of the tested materials during the tableting process (tableting performance). The tabletability profile describes the relationship between pressure and tablet strength. Compactibility and compressibility give additional information to describe the overall tableting behavior, taking into account other parameters that influence the process, such as porosity. Using an instrumented single-punch press, it is also possible to conduct compression studies with a small amount of material. This type of approach is the basis for developing a robust formulation, as required in a Quality by Design framework. In this early stage of the study we use JMP for modelling and visualizing tableting performance as a function of compaction pressure. This is a fundamental step for defining an experimental domain for further trials. Thanks to the powerful interactive visualization capabilities of JMP, it is possible to move freely inside the domain and predict tableting behavior and properties, shaping an acceptable design space.     Auto-generated transcript...   Speaker Transcript Diletta Biagi Hello everyone, welcome to our presentation. I'm Diletta Biagi. I'd like to start by introducing what inspired us in this project. I will give you some very basic and simple elements about what tablets are and how they are manufactured, and then I'll talk about why we make a compaction study. Then Paolo will go deeper, showing you all the practical steps of our work: how we worked with the data, and also some real case studies. When talking about the administration of drugs, the oral route is the most used, and tablets in particular are by far the most popular solid dosage form. You can have all types, colors, and shapes of tablets; you can have coated, controlled-release tablets, but also standard compressed tablets. So how can you obtain tablets? You can get tablets from powder, but also from granulate. The powder is put into the confined space of the die, and then a compression force is applied, causing a reduction of the volume. But what actually happens when this reduction of volume occurs? At first, when the pressure is still low, the particles rearrange and pack more closely, reducing their porosity. As the pressure increases, the particle dimensions change as well, either because the particles deform or because they fracture into smaller particles. This behavior depends on the starting powder. The starting powder has its own characteristics affecting the tableting process, and in the same way the resulting tablets have characteristics that depend on the applied force, on the geometry of the die and punches, and also on the powders that we use. So we have a lot of characteristics affecting the process and also resulting from the process, and all of them can be summarized in four formulas that we use to describe the compression process. Let's look at these formulas: the compaction pressure links the applied force to the cross-sectional area of the punch.
It is useful for comparing the loading applied to tablets when the tablets have different sizes, because you cannot directly compare the force when two tablets have different sizes. The tensile strength links the breaking force to the area of the longitudinal section of the tablet, and it is useful for comparing the mechanical strength of tablets of different sizes because, again, you cannot directly compare the breaking force when two tablets have different sizes. The true density represents the density of only the solid portion of the powder, so no air between the particles and no voids inside the particles contribute to the calculation of the true density. The solid fraction is the ratio of the tablet's apparent density to the true density and, as you can see, it is related to the porosity. Compaction pressure, tensile strength, and solid fraction can be plotted one against another in three different ways; let's briefly see what these three plots mean. The compressibility is the ability of a material to reduce its volume when a pressure is applied; here the reduction of volume is expressed as an increase in solid fraction. The compactibility is the ability of a powder to give tablets of a specified strength when a reduction of volume occurs, and again the reduction of volume is expressed in terms of solid fraction, as an increase in solid fraction. The tabletability is the ability of a powder to give tablets of sufficient strength when a pressure is applied; in fact, it is a plot of tensile strength versus compaction pressure. With these three plots we are able to better understand what happens to a powder when it is compressed, and we are also able to understand and explain the characteristics of the resulting tablets, especially if we use the three plots together. Why is it important to understand what happens during the compression process? Because it is absolutely necessary if we want to develop a robust tablet formulation, and it is also very useful for the scale-up of a laboratory formulation. OK, now I will tell you about the first practical steps of our work. We started by collecting some data: we selected some pure excipients, in particular microcrystalline cellulose, lactose, and calcium phosphate, and we also selected different particle size grades of these excipients. Here I report just one type of cellulose as an example. We compressed the cellulose with a single punch press using a flat-faced punch. All the tablets were made manually, one at a time, and for each single tablet we recorded the compaction force, the weight, the thickness, and the crushing strength. The compaction force was changed and increased every time, so we produced tablets using compaction forces from 2 kilonewtons up to 40 kilonewtons, which correspond to compaction pressures of 20 up to 400 megapascals. I would like to underline that these recorded values were the only data that we actually measured, because all the other data that we use in this project (you can see the tensile strength here, but we use others as well) are derived from these measurements using the formulas I showed you before, and also using some other models that Paolo is going to show you in a while. Paolo Nencioni Okay, as Diletta said, to compute the solid fraction we needed to know the true density of the material. The true density is commonly measured by ???, but it can also be derived from compaction data.
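For reference, the quantities just described can be written out. This is a hedged reconstruction in our own symbols, not a transcription of the presenters' slide; the diametral-compression form of the tensile strength assumes flat-faced cylindrical tablets like the ones used here.

```latex
% Symbols (ours): F = compaction force, A = punch cross-sectional area,
% F_b = breaking (crushing) force, D = tablet diameter, t = tablet thickness,
% rho_app = apparent density of the tablet, rho_true = true density of the solid
% (no inter-particle air, no intra-particle voids).
\begin{align*}
  P  &= \frac{F}{A}                     && \text{compaction pressure}\\[2pt]
  \sigma_t &= \frac{2 F_b}{\pi D t}     && \text{tensile strength (flat-faced cylindrical tablet)}\\[2pt]
  \mathrm{SF} &= \frac{\rho_{\mathrm{app}}}{\rho_{\mathrm{true}}},
  \qquad \varepsilon = 1 - \mathrm{SF}  && \text{solid fraction and porosity}
\end{align*}
```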
A method developed by Sun uses a nonlinear regression of compaction pressure against tablet density, based on the modified Heckel equation shown here on the slide. To model this equation we used the Nonlinear platform in the Specialized Modeling menu of JMP. The built-in model library contains a lot of models, but it is also possible to create your own customized equation, as I show here: you only have to add your own ??? defined in NonlinLib.JSL. Running the nonlinear regression, JMP solves the equation, with the ??? computed, and returns the parameter estimates that best fit the data; the parameter that we call D here is the true density, which we will use in all our subsequent elaboration of the data. Once we have the true density, we can start to plot the data and the related relationships. The compressibility property comes first: it links the reduction in volume of the material to the applied pressure. This relationship can be described by the Kawakita equation; here too we have to model the equation with a nonlinear regression, and again we have to add a customized equation defined in NonlinLib.JSL. Paolo Nencioni Running the nonlinear regression gives us a formula, and we can save this formula in our data table, so we get the compressibility plot. The saved Kawakita equation explains the volume reduction as a function of the applied pressure; this is the first plot that Diletta showed us before. The compactibility is another very important property; almost every paper that deals with it uses the Ryshkewitch equation to describe the relationship between solid fraction and tensile strength. Here it is not necessary to use nonlinear modeling, because from our Fit Y by X report we only have to select the Fit Special command. In this way it is possible to apply a logarithmic transformation to the Y data, the tensile strength, and then we only have to save the formula in the data table. We get the relation that links solid fraction to tensile strength: this is the compactibility plot, where the Ryshkewitch equation explains the tensile strength as a function of the powder densification, the solid fraction. The last plot that Diletta showed us is the tabletability. It describes the effectiveness of the applied pressure in increasing the tensile strength. Normally a greater compaction pressure results in stronger tablets, but this relationship is not always true, because after increasing the compaction pressure a great deal, the tensile strength stops increasing. Paolo Nencioni The relationship here is not a direct function, and even though the topic has been investigated deeply, a full and versatile theoretical framework for powder tabletability is missing. Here we tried to apply a composition of the two previous equations; whether the resulting equation is able to fit the data depends mainly on the material characteristics. However, we use this function composition in the next slides. The three graphs can be displayed together, using a dashboard for example, with a local data filter that also gives us the possibility to highlight the desired range of compaction pressure. The data can also be shown in a three-dimensional scatterplot, to understand the relation between compaction pressure, solid fraction, and tensile strength. Each face of the cube is one of the three plots that we have seen before: the solid fraction of the compact is a direct result of the applied compaction pressure, and the tensile strength of the compact is a direct result of the solid fraction.
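The curve fits described above can be illustrated outside JMP with a short Python sketch. This is not the presenters' NonlinLib.JSL workflow; the data are synthetic, and the equations are the standard Kawakita and Ryshkewitch forms, with the Ryshkewitch fit done as a log-linear regression in the spirit of the Fit Special transformation.

```python
import numpy as np
from scipy.optimize import curve_fit

# Kawakita equation: C = a*b*P / (1 + b*P), where C is the degree of volume
# reduction, P the compaction pressure, and a, b are material constants.
def kawakita(P, a, b):
    return a * b * P / (1.0 + b * P)

# Synthetic stand-in data: pressures (MPa), volume reduction C,
# porosity eps = 1 - SF, and tensile strength (MPa).
P   = np.array([20, 50, 100, 150, 200, 300, 400], dtype=float)
C   = np.array([0.28, 0.45, 0.55, 0.60, 0.62, 0.65, 0.66])
eps = np.array([0.40, 0.28, 0.18, 0.13, 0.10, 0.07, 0.05])
ts  = np.array([0.6, 1.4, 2.6, 3.4, 4.0, 4.8, 5.3])

# Compressibility: nonlinear fit of the Kawakita equation.
(a, b), _ = curve_fit(kawakita, P, C, p0=[0.7, 0.05])
print(f"Kawakita fit: a = {a:.3f}, b = {b:.4f}")

# Compactibility: Ryshkewitch equation sigma = sigma0 * exp(-k * eps),
# fitted as a straight line on log(tensile strength).
slope, intercept = np.polyfit(eps, np.log(ts), 1)
sigma0, k = np.exp(intercept), -slope
print(f"Ryshkewitch fit: sigma0 = {sigma0:.2f} MPa, k = {k:.2f}")
```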
Now let's look at some case studies, two case studies. The first one is a very simple application, comparing the profiles of two excipients and their mixtures, using a flat-faced punch with an area of one square centimeter. Cellulose and lactose have different behavior under compression: the first, cellulose, is commonly known as a material that consolidates by deformation, while the second is commonly known to compact by fragmentation. Here we can see that cellulose gives ??? and reaches higher tensile strength values than lactose. We can also see that in the tabletability plot the last part of the cellulose data does not fit the equation very well. Lactose does not reach the same tensile strength values as cellulose, but in the tabletability plot you can see that its data fit the equation of the function composition that we use very well. We also prepared some mixtures of the excipients at different ratios. As expected, the profiles of the two cellulose-lactose mixtures look like the profile of the excipient present in the larger amount, so we can change the behavior of the mixture simply by increasing or decreasing the ratio between the two ingredients. This can be very useful during formulation activity, when we have to deal with an active ingredient with poor compaction properties. The second case study is on a real tablet formulation. It is a compaction study using both the flat-faced punch we have seen before and the real production punch, which makes a very small tablet, around 5 mm ???. First, we did the profiles using a single punch press. Here we can compare the plots for the two different punches. It is possible to see that tablets made with the smaller convex punch are not able to reach the same tensile strength, at a given solid fraction, as those obtained with the bigger flat-faced punch. To get more reliable results, we continued the study using the punch that is really used for the industrial batches, the smaller one. Here we see the profiles made with two different pieces of equipment, a single punch press and a rotary press. We can see that there is no remarkable difference between the two: tensile strength and solid fraction are roughly the same for both. Finally, we produced tablets at two different speeds using the rotary equipment. Here we have to introduce a new term, the 'dwell time', which is the time that the powder is under load, at the maximum pressure of the cycle. A lower speed, which means a longer dwell time, results in tablets with the highest solid fraction, and we can see this clearly in the first plot, the compressibility, which links compaction pressure and solid fraction. The compactibility, on the right side of the slide, is perhaps the most valuable of the three properties because it relates tensile strength and solid fraction, and we can see here that, apart from minor differences, the two curves are essentially the same. This means that the compactibility is not significantly affected by compaction speed, and for this reason the compactibility profile becomes a useful tool during the scale-up from laboratory to industrial equipment, because the ??? is very important information about the compaction of the powder. Thank you for having followed our talk, and goodbye.
Markus Schafheutle, Consultant, Business Consulting Laura Castro-Schilo, JMP Senior Research Statistician Developer, SAS Christopher Gotwalt, JMP Director of Statistical R&D, SAS   We describe a case study for modeling manufacturing data from a chemical process. The goal of the research was to identify optimal settings for the controllable factors in the manufacturing process, such that the quality of the product was kept high while minimizing costs. We used structural equation modeling (SEM) to fit multivariate time series models that captured the complexity of the multivariate associations between the numerous process variables. Using the model-implied covariance matrix from SEM, we then created a Prediction Profiler that enabled estimation of optimal settings for controllable factors. Results were validated by domain experts and by comparing predictions against those of a thermodynamic model. After successful validation, the SEM and Profiler results were tested in the chemical plant with positive outcomes; the optimized predicted settings pointed in the correct direction for optimizing quality and cost. We conclude by outlining the challenges in modeling these data with methodology that is more often used in the social and behavioral sciences than in engineering.     Auto-generated transcript...   Speaker Transcript Hello, I'm Chris Gotwalt, and today I'm going to be presenting with Markus Schafheutle and Laura Castro-Schilo on an industrial application of structural equation models, or SEM. This talk showcases one of the things I enjoy most about my work with JMP. In JMP statistical development, we have a bird's-eye view of what is happening in data analysis across many fields, which gives us the opportunity to cross-fertilize best practices across disciplines. In JMP Pro 15, we added a new structural equation modeling platform. SEM is the dominant data-analytic framework in many of the social sciences because it flexibly models complex relationships in multivariate settings. One of its key features is that variables may be used as both regressors and responses at the same time, as part of the same model. Furthermore, it occurred to me that these complicated models are represented with diagrams that, at least on the surface, look like diagrams representing manufacturing processes. I wasn't the only one to make this connection. Markus, who was working with a chemical company, thought the same thing. He was working on a problem involving a two-column twin-distillation manufacturing process, where the company wanted to minimize energy costs, which were largely going to steam production, while still making product that stayed within specification. He reached out to his JMP sales engineer, Martin Demel, who then connected Markus to Laura and me. We had a series of meetings where he showed and described the data, the problem, and the goals of the company. We were able to model the data remarkably well. Our model was validated by sharing the results, as communicated with the JMP profiler, with the company's internal experts, then with a first-principles simulator, and then with new physical data from the plant. This was a clear success as a data science project. However, I would like to add a caveat. The success required the joint effort of Laura, who is a top-tier expert in structural equation modeling. Prior to joining JMP, she was faculty in quantitative psychology at the University of North Carolina, Chapel Hill, one of the top departments in the US.
She is also the inventor of the SEM platform in JMP Pro. This exercise was challenging even for her: she had to write a JSL program that itself wrote a JSL program that specified the model, for example. The model we fit was perhaps the largest and most sophisticated SEM model of all time. I want to call out the truly outstanding and groundbreaking work that Laura has done, both with the SEM platform generally and in this case study in particular. Now I'm going to hand it over to Markus, who is going to give the background to the problem the customer wanted to solve; then Laura is going to talk about SEM and her model for this problem. I'll give a brief discussion of how we set up the profiler, and then Markus will wrap up and talk about the takeaways from the project and the actions the customer took based on our results. Thank you, Chris, for the introduction. Before I start with the problem, I want to make you familiar with the principles of distillation. Distillation is a process that separates a mixture, most of the time a mixture of liquids, into its individual components. How does this work? Here you see a schematic view of a laboratory distillation setup. There is a flask containing the crude mixture that has to be separated. You heat it up, in this case with an oil bath, and stir it, and then it starts boiling. The lowest-boiling material starts first, and the vapor rises, passes the thermometer, which reads the boiling temperature, and then goes on into the condenser, where it condenses, and the condensate drips into the other small flask. So, as I said, it is built for separating a mixture of liquids with different boiling points, and those are separated by boiling point. For example, everybody knows that if you want to make schnapps from a mash, you put the mash into this flask, heat it up, and distill the alcohol over to get the schnapps. While it looks very simple here in the lab, in industry it is a bit more complicated, because the equipment is not only bigger but also more complex, mainly because of the engineering part of the story. In this study the distillation was not done batchwise, as you have just seen, but in a continuous manner. This means the crude mixture is pumped in somewhere in the middle of the column; the low-boiling material goes up as a vapor to the top of the column, where it leaves the column, and the higher-boiling material flows downward and leaves the column at the bottom. To make it a bit more complicated, in our case the bottom stream is then pumped into a second column and distilled again, to separate another material from the residual stream. So we actually separate the original crude mixture into three parts: Distillate 1, Distillate 2, and what is left, the bottom stream of the second still. And to make it even more complex, we use the heat of the distillate of this second column to heat the first column, in order to save energy. In a schematic view it looks like this: here you have Still 1 and there is Still 2. Here we have the raw material mix, which is actually an assembly of distillates from the manufacturing, and what we want to do is separate the valuable material from the rest. So we pump this crude mix into the first still.
As I said, it enters somewhere in the middle, and it separates into the low-boiling fraction, which is the first valuable material we want to have, while the rest leaves the first still at the bottom. It is stored in an intermediate tank, and from there it is pumped into the second still, again somewhere in the middle, where it separates into the second valuable material and a waste stream, which is then redeposited. To heat everything, we start on this side, because this is the higher-boiling material, so we need higher temperatures, which also means more energy. So we pump in the steam here to heat this still, and the material leaving at the top has the boiling temperature of this material, so we use its residual heat to heat the first still. If there is a little gap between what comes over from here and what we need for the distillation in the first still, we can add a little extra steam to keep everything running. This is a very high-level view of things. If you want to go a bit more into the details, here is essentially the same picture, but now with all the different tags that we have for the readouts, the quality control, the temperature control, and so on. What do you see here? We start with Feed 1 into the first still. It separates into the top stream, where we control the density, which is a quality characteristic, and the bottom stream, which goes into the intermediate tank. From there it is fed into the second still and again separated into a bottom stream and a top stream, and again we test the density for quality control. Here we also add the steam to heat everything up, and the top flow then goes via the heat exchanger into the first still and heats it up. And here we have the possibility to add some extra steam to keep everything in balance. What we see is that on a local basis there is a lot of correlation, which is shown here by the color coding of the arrows. For example, the feed here, together with the feed density, which is a measure of the composition of this feed, more or less defines the top stream and its quality, and the bottom stream. So you can make some local predictions; likewise, the material going in here defines the stuff over here. But if you want a total description of the entire equipment, it gets tricky. You can do local least-squares correlations here, you can do them there, you can do them separately for the steam, and also here. But as you see, we have a mass stream that starts here, goes through the first still and through the second still to here, and we have an energy stream that starts more or less here and goes through here, via the heat exchanger, and also down here. So it is a kind of circuit, and all these things are correlated, more or less, within this circuit, and this gave us the difficulty that we actually did not know what the Xs and the Ys were. That was the reason we started to think about other possibilities to model this. The target of this study was to find the optimal flow and steam settings for all these varying incoming factors, in such a way that everything stays in spec: the distillate quality should stay in spec, but also the internal operational specs and the spec for the final waste stream.
And the most interesting part, at least money-wise: we want to minimize the consumption of the steam. So what we actually needed was, first of all, a good model that describes this system, and that was the point where Laura came into the game and developed the structural equation model. We also needed a kind of profiler that enables us to figure out the best settings, the optimal settings, for all the incoming variation we may have, in order to stay within all these specs. That was the point where Chris came in, building a profiler on top of Laura's model, which we can use for all the predictions we need. Now I want to hand over to Laura to describe the model she built from these data. Laura, please. Thank you, Markus. I'm Laura Castro-Schilo, and I'm going to tell you about the steps we followed to model the distillation process using the structural equation models platform. When Markus first came and talked to us about his project, there were three specific features that made me realize that SEM would be a good tool for him. The first is that there was a very specific theory of how the processes affect each other, and we saw that in the diagram he showed. An important feature of that diagram is that all variables had dual roles; in other words, arrows point at the nodes of the diagram, but those nodes also point at other variables, so there was not a clear distinction between what was an input and what was an output. Rather, variables had both of those roles. Lastly, it was important to realize that we were dealing with processes that were measured repeatedly; in other words, we had time series data. All of these features made me realize that SEM would be a good tool for Markus. Now, if you're not familiar with SEM, you might wonder why. SEM is a very general framework that affords a lot of flexibility for dealing with these types of problems. I've listed on this slide a number of features that make SEM a good tool, but since we're not going to be able to go through all of them, I also included a link where you can learn more about SEM if you're interested. I'm going to focus on two of the points I have here. The first is that SEM allows us to test theories of multivariate relations among variables, which was exactly what Markus wanted to do. Also, there are very useful tools in SEM called path diagrams. These diagrams are very intuitive, and they represent the statistical models that we're fitting. So let's talk about that point a little more. Here is an example of a path diagram that we could draw in the SEM platform to represent a simple linear regression, and the diagram is drawn with very specific features. For example, we use rectangles to represent the variables that we have measured; here, X and Y. We also have a one-headed arrow to represent a regression effect. And notice that double-headed arrows that start and end on the same variable represent variances; if they were to start and end on different variables, those double-headed arrows would represent a covariance. In this case, we just have the variance of X and the residual variance of Y, which is the part that is not explained by the prediction from X. So this is the path diagram representation of a simple linear regression. But of course we can also look at the equations that are represented by that diagram, and notice that for Y, the equation is simply that of a linear regression.
I've omitted the means and intercepts here just for simplicity. It's important to note that all of the parameters in the equations are represented in the path diagram, so these diagrams really do convey the precise statistical model that we're fitting. Now, in SEM the diagrams, or models, that we specify imply a very specific covariance structure. This is the covariance structure that we would expect given the simple linear regression model: we have the variance of X; we have the variance of Y, which is a function of both the variance of X and the residual variance of Y; and we also have an expression for the covariance of X and Y. Generally speaking, the way that model fit is assessed in SEM is by comparing the model-implied covariance structure to the observed sample covariance of the data, and if these two are fairly close to each other, we say that the model fits well. A number of different models can be fit with SEM, and today our focus is going to be specifically on time series models. When we talk about time series, we mean a collection of data where there is dependence on previous data points, and these data are usually collected at equally spaced time intervals. The way that time series analysis deals with the dependencies in the data is by regressing on the past. One type of these models is called an autoregressive process, or AR(p): where Y represents a process measured at time t, the autoregressive model consists of regressing that process on previous observations of that process, up to time t minus p. If we're talking specifically about an autoregressive process of order one, then we have the process Y at time t regressed on its immediately adjacent past, Y at time t minus 1. The way we would implement this in SEM is simply by specifying, as we saw before, the regression of Y at time t on Y at time t minus 1. Notice that the regression equation is very similar to what we saw in the previous slide, so it's no surprise that the path diagram looks the same. We can extend this AR(1) model to one that includes two lags, in other words an autoregressive model of order two. Here the process at time t is regressed on both t minus 1 and t minus 2. If we look at the path diagram that represents that model, we have an explicit representation of the process at the current time, but also at lag one and lag two. A very specific aspect of this diagram is that the paths for adjacent time points are set equal to each other, and this is an important part of the specification that allows us to specify the model correctly. So notice that we use beta 1 to represent the lag-1 effects, and we also have to set equality constraints on the residual variances. Lastly, we also have the effect of Y at time t minus 2 predicting Y at time t, which is the lag-2 effect. All of these are univariate time series models, and you can fit them using the structural equation modeling platform in JMP, or you could also use the Time Series platform. However, the problem we were dealing with in Markus' data required more complexity: it required us to look at multivariate time series models, and one type of these models is called a vector autoregressive model. What I'd like to show you is one of these models of order two.
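As a hedged reconstruction of the equations behind this spoken description (our notation, not the slide's): the simple regression with its model-implied covariance structure, followed by the AR(1), AR(2), and bivariate VAR(2) forms discussed next.

```latex
% Simple regression and its model-implied covariance
% (phi_x = variance of X, psi = residual variance of Y):
\begin{align*}
  y_i &= \beta x_i + \zeta_i, &
  \Sigma(\theta) &=
  \begin{pmatrix}
    \phi_x & \beta\,\phi_x \\
    \beta\,\phi_x & \beta^2\phi_x + \psi
  \end{pmatrix}.
\end{align*}

% Autoregressive models of order 1 and 2, and the bivariate VAR(2)
% (A_1, A_2 are 2x2 matrices of lag-1 and lag-2 effects):
\begin{align*}
  \text{AR(1):}\quad & y_t = \beta_1\, y_{t-1} + \varepsilon_t \\
  \text{AR(2):}\quad & y_t = \beta_1\, y_{t-1} + \beta_2\, y_{t-2} + \varepsilon_t \\
  \text{VAR(2):}\quad & \mathbf{z}_t = A_1 \mathbf{z}_{t-1} + A_2 \mathbf{z}_{t-2} + \boldsymbol{\varepsilon}_t,
  \qquad \mathbf{z}_t = (x_t, y_t)^{\top},\;
  \boldsymbol{\varepsilon}_t \sim N(\mathbf{0}, \Psi).
\end{align*}
```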
So we have a process for X and another for Y, and the same autoregressive effects that we saw before are included here. Notice we have our equality constraints, which are really important for proper specification, but we also have the cross-lagged effects, which tell us how the processes influence each other. Notice gamma 1 and gamma 1, and also gamma 3 and gamma 3, indicating that we have to put equality constraints on those lag-1 effects across processes. We also have to incorporate the covariances across the processes, so we have their covariance at time t minus 2, but we also have the residual covariances at t minus 1 and t, and notice these again need equality constraints for proper specification. In this video I show how we would fit a bivariate time series model, just like the one I showed you, using JMP Pro. We start by manipulating our data so that they're ready for SEM. First, we standardize the two processes, because they are on very different scales. Then we create lagged variables to represent explicitly the time points prior to time t, so we have t minus 1 and t minus 2. We launch the SEM platform and enter the Xs and then the Ys, so that it's easier to specify our models. Now I've sped up the video so that you can quickly see how the model is specified. Here we add the cross-lagged effects for lag 1, and then, directly using the interactivity of the diagram, we add the lag-2 effects. What remains is to specify all the equality constraints that are required for these models, within and across processes. We name our model, and lastly we run it. As you could see, even a bivariate time series model that only incorporates two processes requires a number of equality constraints and nuances in the specification that make it relatively challenging. However, in the case of the distillation process data, we had a lot more than two processes: we were actually dealing with 26 processes, and in total we had about 45,000 measurements, taken at 10-minute intervals. Our first approach was to explore univariate time series models using the Time Series platform in JMP, and when we did this we realized that for most processes an AR(1) or AR(2) model fit best, which made me realize that, at the very least, we needed to fit multivariate models in SEM that incorporated two lags. We also had to follow a number of preprocessing steps to get the data ready for SEM. On the one hand, we had a lot of missing data, and even though SEM can handle missing data just fine, with models as complex as these it became computationally very intensive. So we decided to select a subset of data where we had complete data for all of the processes, and that left us with about 13,000 observations. Also, as we saw in the video, we had large scale differences across the processes, so we had to standardize all of them, and lastly we created lag variables to make sure we could specify the models in SEM. Now, for model specification, the equality constraints in particular are a very big challenge, because it would take a lot of time to specify them manually, and it would of course be tedious and error-prone. Our approach for dealing with this was to generate a JSL script that would then generate another JSL script for launching the SEM platform.
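The data preparation described for the video (standardize each process, then add explicit lag-1 and lag-2 columns) can be sketched in pandas. This is only an illustration with invented column names and values, not the JSL the authors generated.

```python
import pandas as pd

# Hypothetical table with two process measurements taken at 10-minute intervals.
df = pd.DataFrame({
    "time":   pd.date_range("2021-01-01", periods=6, freq="10min"),
    "x_flow": [10.2, 10.5, 10.1, 9.8, 10.0, 10.4],
    "y_temp": [81.0, 81.5, 82.1, 81.8, 81.2, 81.6],
})

processes = ["x_flow", "y_temp"]

# Standardize each process (they are on very different scales).
df[processes] = (df[processes] - df[processes].mean()) / df[processes].std()

# Create explicit lag-1 and lag-2 columns so a VAR(2)-style model can be
# specified on a single wide table; rows with incomplete lags are dropped.
for col in processes:
    df[f"{col}_lag1"] = df[col].shift(1)
    df[f"{col}_lag2"] = df[col].shift(2)

df = df.dropna().reset_index(drop=True)
print(df.round(2))
```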
What you see here is the final model that we fit in the platform. Thankfully, after estimating this model, we were able to obtain the covariance structure implied by the model, and that was the piece of information I could pass over to Chris Gotwalt, who then used that matrix to create a profiler that Markus could use for his purposes. So Chris, why don't you tell us how you created that profiler? Thank you, Laura. Now I'm going to show the highlights of how I was able to take the model results and turn them into a profiler that the company could easily work with. Laura ran her model on the standardized data and sent me a table containing the SEM model intercepts, and she also included the original means and standard deviations that were used to standardize the data. On the right we have the SEM model-implied covariance matrix, which includes the covariances between the present values and the lagged values from the immediate past. This information describes how all the variables relate to one another. In this form, though, the model is not ready to be used for prediction. To see how certain variables change as a function of others, we have to use this information to derive the conditional distribution of the response variables given the variables that we want to use as inputs. So essentially we need the conditional mean of the responses given the inputs, and to do that we need to implement the formula shown right here (written out below). To do that, we used the SWEEP operator in JSL. The SWEEP operator is a mathematical tool that was created by SAS CEO and co-founder Jim Goodnight and published in The American Statistician in 1979. The SWEEP operator is probably the single most important contribution to computational statistics in the last 100 years. Most JMP users don't know that the SWEEP operator is used by every single JMP statistical platform in many ways: we use it for matrix inversion and for the sums-of-squares calculations, and it can also be exploited as a simple and elegant way to compute conditional distributions if you know how to use it properly. I created a table with columns for all the variables. The two rows in the table are the minimums and maximums of the original data, which lets the profiler know how to set up the ranges. I added formula columns for the response variables, using the swept version of the covariance matrix from Laura's model, and put those formulas into the data table on the far right. Here's what one of the formulas looks like: I pulled in the results from the analysis as matrices. Laura's model included the estimated covariance between the current Ys and the two preceding values, because it was a large multivariate autoregressive model of order 2. Predicting the present by controlling the two previous values of the input variables was going to be very cumbersome to operationalize, so I made a simplifying assumption that these two values were the same, which collapsed the model into a form that was easier to use. To do this, I simply used the same column label when addressing the lag-1 and lag-2 entries for a term. With that machinery in place, I created a profiler for the response columns of interest. I set up desirability functions that reflected the company's goals: they wanted to match a target on A2TopTemp, maximize A2BotTemp, and so on, ultimately wanting to minimize the sum of the steam that came out of the two columns.
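The formula referred to above is, in standard notation, the conditioning identity for a multivariate normal distribution; sweeping the input block of the joint covariance matrix is one way to compute these quantities. This is a reconstruction in our own symbols, not the slide itself.

```latex
% Partition the model-implied mean and covariance into input (X) and
% response (Y) blocks; the conditional distribution of Y given X = x is:
\begin{align*}
  \mu_{Y\mid X}(x) &= \mu_Y + \Sigma_{YX}\,\Sigma_{XX}^{-1}\,(x - \mu_X), \\
  \Sigma_{Y\mid X}  &= \Sigma_{YY} - \Sigma_{YX}\,\Sigma_{XX}^{-1}\,\Sigma_{XY}.
\end{align*}
% Sweeping the X block of the joint covariance matrix yields Sigma_XX^{-1},
% the regression coefficients Sigma_XX^{-1} Sigma_XY, and the Schur
% complement Sigma_{Y|X} in a single pass.
```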
You can lock certain variables in the profiler by control-clicking on a panel. The locked variables have their values drawn with a solid red line, and once we've done that we can enter values for them; when we run the optimizer, that is, maximize desirability, the locked variables are held fixed. This way we find settings of the variables we can control that keep the product being made to specification while minimizing energy costs. It's fair to say that it would be difficult for someone else to repeat Laura's modeling approach on a new problem, and it would be difficult for another person to set up a profiler like I did here. If enough people see this presentation and want us to add a platform that makes this kind of analysis easier in the future, you should let us know by reaching out to Technical Support via support@JMP.com. Now I'm going to hand it back over to Markus, who will talk about what the customer did with the model and our conclusions. Thank you, Chris. With the prediction profiler that Chris just presented, we built, let's say, a predictive landscape, which helps us understand what the best settings should be in order to achieve all the necessary quality specs. The three factors which vary and which we can influence only to a limited extent are the feed for Still 1, the feed for Still 2, and the quality, or composition, of the feed into Still 1. What also turned out in the model to play an important role in this scenario is the cooling water temperature. All the other variables are of smaller importance, so we neglected them in this first approach. Here you see the landscape. It's a kind of variability chart, so to say: we have the feed density for Still 1, the feed into Still 1, and the feed into Still 2, in more or less all possible combinations, and here you see the settings that are predicted to be best in order to stay within the specifications. Some of these also have specifications of their own that we have to stay inside; for example, the steam flow for Still 2, the reflux there, the boil-up, and the same things for Still 1. On the right side you see the predicted outcomes, the quality specs, so to say: the temperature of the top flow in Still 1, the density of the distillate, the density of the distillate of Still 2, and so on. What you see, if you look at the desirability in the bottom row, is that there are big areas where we cannot really achieve a good performance of our system, and if you look into the details you see that here we are off spec, here we are off spec, here at some points we are off spec, and so on. But what you also see is that this in-spec/off-spec behavior is governed not only by these three components at the bottom, but also by the river temperature, and at the moment the lowest river temperature, 1 degree, is highlighted. With that temperature we stay in spec most of the time, and there are only rare combinations of these three factors where we don't. But if we increase the river temperature, for example to 24 degrees, then the areas where we are off spec become much more predominant, and it is very hard to stay within the specifications. So what we learned from the model is that we have problems staying within our specifications when the river temperature is above 7 degrees C.
Then the question was: why is that? The engineers suspected that it was because of the cooling capacity of the cooler. But before we went into the real trial, we compared our SEM model against a thermodynamic model based on ChemCAD, and it turned out that both models point in the same direction, so there were no real discrepancies between the two. This put us in an optimistic mood, so we did some real trials with the best settings and, let's say, confirmed the predictions from the models. It turned out, as I already said, that what the engineers suspected is true: the cooling capacity of the cooler is not sufficient. So when the river temperature is higher, the heat transfer is too small and the equipment doesn't really run any more. The next step now is to use the data from this study to justify an investment in a new cooler with a better heat exchange capacity. So, thanks to Laura and Chris, we could set up the case for this investment. If you have questions, please feel free to ask now. Thank you.