Level: Intermediate. Reporting, tracking, and analyzing adverse events that occur in study subjects are critical to the safety evaluation of clinical trials. Many pharmaceutical companies, and the regulatory agencies to which new drug applications are submitted, use JMP Clinical to support this evaluation of adverse events. Biometric analysis programming teams may prepare static tables, listings, and figures for medical monitors and reviewers. This creates an inefficiency: the physicians who understand the medical impact of a given event cannot interact directly with the adverse event summaries. Yet even producing simple counts and frequency distributions of adverse events is not always straightforward. This presentation focuses on the key adverse event outputs of JMP Clinical: counts, frequencies, incidence, and time to event. JMP Clinical's reporting capabilities go well beyond the conventional, making fully dynamic adverse event analysis easy even when the underlying calculations are complex and rely heavily on JMP formulas, data filters, custom-scripted column switchers, and virtually joined tables.   Kelci Miclaus is a manager in JMP Life Sciences R&D and develops the statistical features of the JMP Genomics and JMP Clinical software. She joined SAS in 2006 and holds a Ph.D. in statistics from North Carolina State University.
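As a rough illustration of the interactivity described above, the JSL sketch below attaches a column switcher and a local data filter to a simple frequency report. The table name (adae.jmp) and the column names (:AEBODSYS, :AEDECOD, :TRTA) are assumptions borrowed from CDISC-style adverse event data, not the JMP Clinical data model itself.

// Hypothetical adverse event table and columns; adjust names to your data.
dt = Open( "adae.jmp" );
dist = dt << Distribution( Column( :AEBODSYS ) );               // counts per body system
dist << Column Switcher( :AEBODSYS, {:AEBODSYS, :AEDECOD} );    // switch between classification levels
dist << Local Data Filter( Add Filter( Columns( :TRTA ) ) );    // restrict to a treatment arm interactively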
Level: Beginner. Presentation title: JMP as a Tool for Spatial Data and Morphometric Analysis: An Attempt at Grading Carcinoma in Situ of the Upper Aerodigestive Tract and Its Precursor Lesions Using Single-Linkage Cluster Analysis.   Most malignant tumors of the upper aerodigestive tract mucosa, such as the oral cavity, pharynx, and larynx, are squamous cell carcinomas arising in the stratified squamous epithelium that covers the mucosal surface. Lesions regarded as precursors or early stages of these carcinomas are known clinically to appear as white or red patches on the mucosa, and on microscopic examination of tissue taken from patients they are termed epithelial dysplasia and carcinoma in situ, respectively. Dysplasia is further divided into three grades, mild, moderate, and severe, according to the degree of cellular atypia and the proportion of the epithelial layer the atypical cells occupy. This grading is performed intuitively by visual inspection by pathologists and is thought to be reasonably reproducible, but objective studies are few. In this study we quantified the arrangement of cells (nuclei) within the epithelial layer and examined how it differs among non-neoplastic (normal) epithelium, dysplasia, and carcinoma in situ. Using digital image analysis, we extracted the centroid coordinates of cell nuclei from photomicrographs, generated minimum spanning trees (MST) connecting the centroids with single-linkage hierarchical cluster analysis in JMP ver. 15, and compared histograms of the branch lengths; differences were found among the groups.   千場 良司: Former lecturer at Tohoku University (加齢医学研究所病態臓器構築研究分野). M.D., Ph.D. Former overseas research fellow of the Ministry of Education (medicine) at Aarhus University, Denmark. In the field of human pathology he has studied the pathogenesis of disease using quantitative morphology based on geometric probability and integral geometry, digital image analysis, and multivariate statistical analysis. He has published research papers on liver cirrhosis, on early carcinomas and their precursor lesions arising in the alveolar epithelium, pancreatic duct epithelium, and endometrium, and on hepatic metastasis of cancer. (https://pubmed.ncbi.nlm.nih.gov/7804428/ , https://pubmed.ncbi.nlm.nih.gov/7804429/, https://pubmed.ncbi.nlm.nih.gov/8402446/ , https://pubmed.ncbi.nlm.nih.gov/8135625/, https://pubmed.ncbi.nlm.nih.gov/7840839/ , https://pubmed.ncbi.nlm.nih.gov/10560494/) From the standpoint of carcinogenesis and histological diagnosis, he is interested in mathematical methods applicable to their analysis, in particular numerical classification methods such as cluster analysis and discriminant analysis. As statistical platforms he moved from Fortran statistical subroutines on mainframes through SPSS and SYSTAT on PCs, and has been a JMP user since version 8, attracted by its excellent data table functionality and flexible analysis environment.     千場 叡: Graduated from the Department of Complex and Intelligent Systems, School of Systems Information Science, Future University Hakodate. As a student he was interested in complex-systems phenomena in physicochemical reactions and carried out experiments and research on the mechanism of capsule formation by thermal polymers of amino acids in an alcohol liquid phase. He is currently also interested in digital image analysis, data science, and the recognition and classification of forms and images using neural networks.
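The single-linkage clustering step described above can be sketched in a few lines of JSL. The table name and coordinate column names (:X, :Y) are hypothetical, and the Method option string is an assumption based on the Hierarchical Cluster platform's linkage choices.

// Nucleus centroid coordinates extracted by image analysis (hypothetical table and columns)
dt = Open( "centroids.jmp" );
hc = dt << Hierarchical Cluster(
    Y( :X, :Y ),
    Method( "Single" )    // single linkage (nearest neighbor); the joining distances
                          // correspond to the edge lengths of the minimum spanning tree
);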
Level: Beginner. 早崎 将光, Senior Researcher, Environment Evaluation Group, Energy and Environment Research Division, Japan Automobile Research Institute; 伊藤 晃佳, Group Leader and Senior Researcher, Environment Evaluation Group, Energy and Environment Research Division, Japan Automobile Research Institute.   Our main research themes are road traffic and the atmospheric environment, and the atmospheric environment and its effects on human health; road traffic volume is one of the key pieces of information. Cross-sectional traffic volume, one indicator of road traffic, is traffic-count information from vehicle detectors and similar sensors, published as five-minute data for each measurement point; about 2,400 cross-sectional traffic measurement points are currently published within Tokyo. Cross-sectional traffic volume is important as an indicator that captures road traffic over a relatively wide area in spatial terms. The state of emergency declared in response to the spread of COVID-19 changed social and economic activity substantially and is thought to have affected road traffic as well. In this study we analyzed changes in road traffic volume in Tokyo before and after the state-of-emergency period, using cross-sectional traffic volume as the indicator, and also examined changes in air quality over the same period. JMP was the main analysis tool. Using JMP's data table editing features, such as table join and concatenate, together with graphs linked to the data, we were able to carry out the analysis efficiently. This report introduces our use of JMP.   堺 温哉: Completed the doctoral program of the United Graduate School of Agricultural Sciences, Ehime University (Ph.D. in Agriculture). After positions as a JSPS Research Fellow (PD), at Hamamatsu University School of Medicine (educational assistant), Yokohama City University School of Medicine (assistant professor), and Shinshu University School of Medicine (specially appointed assistant professor), he joined the Japan Automobile Research Institute in September 2012 (senior researcher) and has held his current position since April 2020. His main current research theme is air-pollution epidemiology focused on Traffic Related Air Pollution (TRAP).   早崎 将光: Withdrew from the doctoral program of the Graduate School of Geoscience, University of Tsukuba, after completing the required coursework (2000), and received a Ph.D. (Science) from the Graduate School of Life and Environmental Sciences, University of Tsukuba (2006). After positions (postdoc, project researcher, and others) at the National Institute for Environmental Studies, the Center for Environmental Remote Sensing at Chiba University, the University of Toyama, Kyushu University, and the Atmosphere and Ocean Research Institute of the University of Tokyo, he has held his current position since 2017. His main research theme is identifying the causes of high-concentration air pollution events.   伊藤 晃佳: Completed the doctoral program in Environmental Resources Engineering, Graduate School of Engineering, Hokkaido University, in March 2002, Ph.D. (Engineering). Joined the Japan Automobile Research Institute in April 2002 and has held his current position since 2010. His recent work includes evaluation of source contributions to the atmospheric environment, analysis of air monitoring data (continuous monitoring stations and the like), and analysis using atmospheric simulation (CMAQ and the like).   No handout materials are available.
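The table operations mentioned above (concatenate and join) look roughly like the following in JSL; the table and column names are hypothetical stand-ins for the five-minute traffic-count files and a measurement-point attribute table.

// Combine monthly 5-minute count files, then attach measurement point attributes
apr = Open( "counts_2020_04.jmp" );     // hypothetical monthly files
may = Open( "counts_2020_05.jmp" );
all = apr << Concatenate( may );        // stack the rows of both months into one table
stations = Open( "stations.jmp" );      // measurement point attributes
joined = all << Join(
    With( stations ),
    By Matching Columns( :PointID = :PointID )   // hypothetical key column
);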
Level: Intermediate. Before analyzing data, an analysis data set must be created by extracting the required data, restructuring the correspondences between variables, and performing variable transformations, categorization, and recategorization. JMP provides database operations such as row and column extraction and table joins, along with the calculation functions needed for variable transformations, so an analysis data set can be created easily with these features. However, the analyst must specify the extraction conditions and transformation operations, and as those instructions become more complex, the likelihood grows that the result is not what was intended. For example, when setting ranges for data extraction or categorizing with If statements, the more complicated the "and"/"or" rules become, the greater the chance that the desired analysis data set has not been obtained. It is therefore necessary to check mechanically whether the analysis data set matches the analyst's intent. JMP's statistical methods can be used to verify the quality of an analysis data set. Finding the maximum and minimum of a variable is the simplest approach, but Distribution and Fit Y by X are also powerful; in Fit Y by X, an R-square of 1 provides the evidence. This presentation reports a case study in which the quality of an analysis data set was verified for large-scale data.   Lecturer, Department of Industrial Administration, Faculty of Science and Technology, Tokyo University of Science; senior researcher, Department of Chemical System Engineering, The University of Tokyo. His research specialty is statistical quality control. He mainly studies the statistical methods needed for quality control, and also applies statistical quality control to fire-protection architecture, fire phenomena, and medical and nursing care; using JMP, he models large-scale data to extract the information hidden behind it and feeds the results back to each specific field of study.
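A minimal JSL sketch of the kind of mechanical check described above. The column names are assumptions: :BMI_source is the value carried in the raw data and :BMI_derived is the value recomputed in the analysis data set; if the derivation is correct, the fitted line should report an R-square of 1.

dt = Current Data Table();
dt << Distribution( Column( :BMI_derived ) );                       // check range, missing values, outliers
dt << Bivariate( Y( :BMI_derived ), X( :BMI_source ), Fit Line );   // RSquare = 1 is the evidence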
Level: Beginner. In recent years, business intelligence and people analytics, in which the data that accumulates daily in corporate management is analyzed and visualized to support strategy formulation and decision making, have attracted attention. This presentation reports a case in which survey data about employees and organizations was analyzed with JMP and the results were used in consulting proposals to support management decisions. In management consulting, quantitative and qualitative organizational surveys to understand the actual state of an organization are indispensable. To achieve the sustainable organizational growth that comes from the growth of each individual employee, it is desirable to be able to visualize the state of the organization from these survey data and use them for prediction and decision making. In this case study, we apply JMP's multivariate analysis capabilities and a methodology built on visualization tools called analysis model diagrams and structural model diagrams to data obtained at Company A. As a result, it becomes possible to make clear, easy-to-understand proposals to Company A's management. The presentation walks through the entire flow from data acquisition to proposal, covering analysis methods that go one step beyond ordinary descriptive statistics.   After working at a marketing company, a provider of EAP (Employee Assistance Program) services, and a venture company, he became independent as an organizational and HR consultant. While engaged in organizational and human resource development work, he entered a doctoral program as a working graduate student and conducted research on analysis and design based on questionnaire surveys and questionnaire experiments. Since completing the program he has continued to combine corporate practice with research, mainly on social science themes, and is currently a specially appointed lecturer at the College of Business Management, J. F. Oberlin University, a director of the NPO GEWEL, and the representative of FREELY LLC, developing theory and applying it in practice. http://researchmap.jp/sho-kawasaki/   高橋 武則: For nearly 50 years he has conducted research on QM (quality management), SQM (statistical quality management), and design theory. Since the start of the 21st century he has proposed the design paradigm Hyper Design, developed its underlying mathematics, HOPE theory, and has been jointly developing its supporting software, the HOPE Add-in, with SAS. He realizes a new design method through the trinity of Hyper Design as the way of thinking, HOPE theory as the statistical mathematics, and HOPE-Add-in for JMP as the supporting tool. As a social science extension of this theory he has proposed multi-group principal component regression analysis.   橘 雅恵: Since opening her own firm as a certified social insurance and labor consultant, she has focused on building HR systems, supporting more than 80 companies. She believes that building the best system for each company requires organizational culture diagnosis and compensation analysis based on employee interviews and employee surveys. She founded Japan Consulting Firm, a group of specialists supporting management as a whole, aiming for a team that can identify causal relationships based on data rather than experience and intuition alone, extract well-targeted management issues, and propose improvements in business performance and organizational development.
Level: Intermediate. JMP is probably the most powerful and most systematically organized software for reliability data analysis. Using JMP's Reliability and Survival platforms, this presentation systematically introduces methods for analyzing lifetime data, covering univariate distributions, bivariate relationships, prediction, and modeling within the time allowed. For modeling in particular, the plan is to introduce reliability activities and the data analysis process through hypothetical examples that do not stray far from real cases, including the renewal theorem and methods used in reliability testing.   廣野 元久: Joined Ricoh Company, Ltd. in 1984. Since then he has worked on quality management and reliability within the company and on promoting statistics education. After serving as head of the QM Promotion Office in the Quality Division and head of the SF Business Center, he is in his current position. Part-time lecturer at the Faculty of Engineering, Tokyo University of Science (1997-1998), and the Faculty of Policy Management, Keio University (2000-2004). His main specialties are statistical quality control and reliability engineering. His books include 「グラフィカルモデリングの実際」, 「JMPによる多変量データの活用術」, 「アンスコム的な数値例で学ぶ統計的計算方法23講」, 「JMPによる技術者のための多変量解析」, and 「目からウロコの多変量解析」.   遠藤 幸一: Joined Toshiba Corporation in 1987. After product and process development of power ICs (power supply ICs, 500 V driver ICs for motors, and others), he now works on failure analysis technology development. Ph.D. (Information Science), Osaka University.   No handout materials are available.
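As a minimal sketch of the univariate piece of such an analysis, the JSL below launches the Life Distribution platform on hypothetical columns :Hours (time to failure or suspension) and :Censor (1 = censored); the option names follow the form typically saved by the platform and should be treated as assumptions.

dt = Current Data Table();
ld = dt << Life Distribution(
    Y( :Hours ),
    Censor( :Censor ),
    Censor Code( 1 )      // value that marks right-censored observations
);
// Compare candidate lifetime distributions (Weibull, lognormal, ...) in the report,
// then use the profilers for prediction.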
Level: Intermediate. With JMP (Pro), analysis is far easier to enjoy than with R or Python. It may not be fully bespoke analysis, but it handles semi-custom work more than adequately. With JMP the following are all easy: (1) analysis with just a mouse, no commands to type; (2) graphs and statistics always presented together; (3) the analysis process can be saved as a script; (4) reports can be produced that follow the flow of the analysis; (5) because statistical thinking underlies the product, it is ideal for systematic understanding and learning. This presentation uses numerical examples to discuss prediction and classification with JMP, covering methods such as kernel smoothing, SVM, and neural network discrimination, and deepens understanding by contrasting them with conventional statistical multivariate analysis.   Joined Ricoh Company, Ltd. in 1984. Since then he has worked on quality management and reliability within the company and on promoting statistics education. After serving as head of the QM Promotion Office in the Quality Division and head of the SF Business Center, he is in his current position. Part-time lecturer at the Faculty of Engineering, Tokyo University of Science (1997-1998), and the Faculty of Policy Management, Keio University (2000-2004). His main specialties are statistical quality control and reliability engineering. His books include 「グラフィカルモデリングの実際」, 「JMPによる多変量データの活用術」, 「アンスコム的な数値例で学ぶ統計的計算方法23講」, 「JMPによる技術者のための多変量解析」, and 「目からウロコの多変量解析」.   No handout materials are available.
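One of the methods mentioned, neural network discrimination, can be launched with a few lines of JSL. The sketch below uses the Big Class sample table and a single hidden layer; it is only meant to show how little scripting is needed, not the settings used in the talk.

dt = Open( "$SAMPLE_DATA/Big Class.jmp" );
nn = dt << Neural(
    Y( :sex ),                 // classification target
    X( :height, :weight ),
    Fit( NTanH( 3 ) )          // one hidden layer with three tanh nodes
);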
Level: Intermediate. A survey designed to improve the current situation needs questions about both outcomes and causes. If the causal relationship between the two is captured with regression analysis, the items on which action should be taken can be selected. Items that were not asked cannot be recovered after the survey, whereas items that turn out to be unnecessary can simply be ignored afterward. For this reason, when a comprehensive set of questions is prepared while taking care not to overburden respondents, the number of items becomes large and high correlations appear among them. To address this problem, Takahashi and Kawasaki have proposed multi-group principal component regression analysis. Its essence is as follows: form rational groups of items so that correlations are high within groups and low between groups; obtain principal components within each group; use them as explanatory variables in a principal component regression and select the important components; and then select, as the key items on which to act, those with large absolute factor loadings on the selected components. Sometimes these key items cluster densely, which can be handled with factor analysis. Causal analysis using principal components is called front-side causal analysis, analysis using factors is called back-side causal analysis, and their combination is called two-sided causal analysis. This presentation introduces the theory behind these methods and concrete procedures using JMP; a sketch of the mechanical steps follows below.   The author has spent half a century on research and practice in QM (quality management), SQM (statistical quality management), and design theory (including joint research with many companies and management guidance). Since the 1990s he has proposed the new design paradigm Hyper Design, studied its underlying mathematics, HOPE theory (Hyper Optimization for Prospective Engineering), and jointly developed its supporting software, HOPE-Add-in for JMP, with SAS. The new design method is thus realized through the trinity of Hyper Design as the way of thinking, HOPE theory as the statistical mathematics, and HOPE-Add-in for JMP as the supporting tool.  Because design has a high barrier to entry, it is often misunderstood as a special activity for special people. To break through this and enable many people to master design, the author has developed new educational methods along with the theoretical research: hands-on education using physical teaching materials (paper helicopters, paper gliders, coin shooting, and more) and virtual materials (a ball-flight simulator, among others). Because this educational program (from basic statistics to Hyper Design) uses JMP for visualized analysis and design in many situations, it is both easy to understand and enjoyable. The program has been run for more than 30 years at universities in Japan and abroad (Keio University, Yale University, Tokyo University of Science, University of Tsukuba, and others) and at many companies, confirming its effectiveness.   For those who would like a more detailed understanding, a paper is available; please contact the presenter, 高橋武則, directly.
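The mechanical steps of the method (principal components within one group of items, then a regression on the saved components) can be sketched in JSL as below. The grouping of items is the substantive judgment and is not shown; the column names (:q1 to :q3 for one item group, :Outcome for the result variable) and the names of the saved component columns are assumptions, not the authors' actual script.

dt = Current Data Table();
pca = dt << Principal Components( Y( :q1, :q2, :q3 ) );   // components within one item group
pca << Save Principal Components( 2 );                    // adds component columns to the table
// Repeat for the other item groups, then regress the outcome on the saved components:
fm = dt << Fit Model(
    Y( :Outcome ),
    Effects( :Prin1, :Prin2 ),                            // names of the saved columns (assumed)
    Personality( "Standard Least Squares" ),
    Run
);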
Level: Intermediate. With work-style reform in the air, making experiments and analysis more efficient has become increasingly important. To demonstrate the power of JMP for efficient experimentation, a good approach is to recast existing experimental data as a definitive screening design (DSD) or a custom design and show how drastically the number of runs can be reduced, picking the response values out of the existing data. When experimental data are found split across several tables, combine them into one table, run a multivariate analysis, and visualize it with the profiler so that people notice the pitfalls of the OFAT (One Factor at a Time) approach. For the habit of analyzing replicated experiments through their averages, show that there are alternatives: stacking the data (see the sketch below), multi-objective optimization using mean and variance, and robust optimization. When design of experiments is used in development, the presence of interactions often cannot be predicted in advance, and interactions are by no means rare. A DSD has no confounding between main effects and two-factor interactions (2FI) or among 2FIs, and requires only about twice as many runs as factors, which is a major advantage. I will report on what I have learned from actually using DSDs: the breakdown that occurs when the number of main effects plus interaction terms approaches the number of factors, how augmented designs resolve it, and the practically important points from DSD-related papers obtained from the JMP Community and ASQ.   After serving at Yamatake-Honeywell (now Azbil) as FA Development Department Manager, Executive Officer and Head of the R&D Division, Executive Officer and Head of the Quality Assurance Promotion Division, and advisor at Azbil Kimmon, he founded 東林コンサルティング. His areas of expertise include yield and quality improvement through analysis of production data; statistical problem solving in general, including field-failure prediction, robust design, design optimization, and design of experiments; and shop-floor guidance on design review, root cause analysis (RCA), prevention of human error, and process improvement. His books include 『ネットビジネスの本質』 (日科技連出版, 2001, co-authored; Telecom Social Science Award), 『実践ベンチャー企業の成功戦略』 (中央経済社, 2011, co-authored), and 『よくわかる「問題解決」の本』 (日刊工業新聞社, 2014, sole author). A principal paper is 「生産ラインのヒヤリハットや違和感に関する気づきの発信・受け止めを促進するワークショップの提案」, Japanese Society for Quality Control, 2016 (2016 Quality Technology Award). Major talks include 「作業ミスを誘発する組織要因を可視化し改善を促進する仕組みの提案」 (Discovery-Japan 2018) and 「JMPによる品質問題の解決~製造業の不良解析と信頼性予測~」 (Discovery-Japan 2019).
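The stacking step mentioned above can be done with a single JSL message; the replicate columns :Y1 to :Y3 are hypothetical and stand for a one-row-per-run table with repeated measurements in separate columns.

dt = Current Data Table();
stacked = dt << Stack(
    Columns( :Y1, :Y2, :Y3 ),
    Source Label Column( "Replicate" ),
    Stacked Data Column( "Y" )
);
// Analyzing Y in the stacked table keeps the run-to-run spread visible instead of
// collapsing each run to its average.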
Level: Intermediate. Specific health checkups in Japan target people aged 40 to 74 with the aim of reducing the number of people who meet the criteria for metabolic syndrome, from the standpoint of preventing lifestyle-related diseases. A summary of checkup results across all examinees can serve as a benchmark for comparing an individual with the whole, and should therefore be useful for the personal health management of middle-aged and older adults. The NDB Open Data provided by the Ministry of Health, Labour and Welfare includes, as specific health checkup information, annual means and class-interval distributions for test items (waist circumference, blood glucose, blood pressure, and so on), so for several of the items used to judge metabolic syndrome, the number of examinees and the number outside the reference range can be obtained by attributes such as sex and age group. Graphing the proportion outside the reference range by attribute yields interesting results; for some test items, for example, no clear trend across age groups appears. In this presentation, along with these graphs, I show the results of fitting a generalized linear model to the proportion outside the reference range for each test item, with fiscal year, prefecture, sex, and age group as factors. This modeling makes it possible to predict the out-of-range proportion for examinees with a given combination of attributes (sex, age group, prefecture of residence) and to understand the full population of checkup examinees more deeply.   A technical engineer in the JMP Japan division. He currently does presales work for JMP products, mainly for pharmaceutical and food companies. Although his role is to introduce JMP to customers, he strongly regards himself as a JMP user as well. In recent years he has posted JMP analyses of topics in the news to his blog and to JMP Public, a site for sharing analysis reports. https://public.jmp.com/users/259
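A hedged sketch of the kind of model fit described above, assuming the NDB-derived table has one row per year, prefecture, sex, and age group with :N_out (examinees outside the reference range) and :N_tested (examinees tested). A binomial generalized linear model with a logit link is one natural choice; all names and option keywords here are assumptions rather than the presenter's actual script.

dt = Current Data Table();
glm = dt << Fit Model(
    Y( :N_out, :N_tested ),                 // events / trials form for a binomial response
    Effects( :Year, :Prefecture, :Sex, :AgeGroup ),
    Personality( "Generalized Linear Model" ),
    GLM Distribution( Binomial ),
    Link Function( Logit ),
    Run
);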
Marcus Soerensen, Head of Quality & Six Sigma, Envases   This case study will show how the quality of a process was improved by using statistics as a common language between various departments in the company. What started out as a "confused" project with a very limited set of data turned out to be a successful project as the team began structured and data-oriented progress using JMP. This case illustrates how various departments can work together using data and statistics as the foundation for process improvements.     Auto-generated transcript...   Speaker Transcript Marcus Welcome to this presentation here, using SAS JMP. The topic for this presentation is how to use statistics as a common language and how to improve process quality by doing so. I'm working as a quality manager for a company called Envases. Envases is a huge company, a global company making cans for the food industry. You can see some of the products here in the in the Left corner. Some of the products may be familiar to you. It's cans you can buy in the supermarkets, we have meat, have fish, we have milk powder, we have also juice and so on in the cans. We have manufacturing facilities in Mexico and in Europe and the topic that I want to discuss here today is related to our production facilities in Denmark. Basically what I'm going to present here is a problem that has been solved using SAS JMP as one of the tools. And I'm going to spend some time in the JMP to to show some of the functionalities and to give an idea of how SAS JMP can can be used to work with these kind of problems, which I'm quite sure is very familiar to many of the people, seeing this presentation. Well, going back to last year, we had some issues in our production lines and the people in the production came to me. I'm the head of quality and I'm dealing with these kind of problems that came to my desk saying, Marcus, we have some some tin dust in our production lines. Tin is the material that we're using for cans, so tin dust as such it's not it's not around things to see in our production, but people were complaining that they could see too much tin dust in our production lines. And, as a result of that, they need to clean the lines all the time. That was, like the beginning of the problem. We didn't have any data to to show this. We didn't have any paper to go through. We didn't have any Excel sheets. We didn't have any JMP files to to look into. We just had some opinions from the people saying we have too much tin dust on our production line, and we need to clean all the time. And to make it even up a bit more confusing, but people were saying we have seen this many times before, and some people were saying that we have never seen this before. I started to speak with the people on the on the line, saying, is this new to you? Has this happened this week or was it last week? Or what about last year? And I got a lot of different opinions and a lot of inputs, and also people were not telling me the same thing, so I was a bit confused about how could we get started with this project. But anyway, we needed to get started. So the first thing I did was to set the team, saying, people here in the organization will need to set a team. Who should be involved in team? And we pinpointed some some people in a in the production, some technical people, some people from the operation, some people for the maintenance department and so on, relevant people for the project. 
And the structure that we have used here to solve the problem is inspired from the Six Sigma DMAIC, maybe some of you know it already. DMAIC is about defining the problem, measuring the problem, analyzing the reason for the problem, improve and then control, if you actually succeeded with the with the solution that you came up with. One of the idea with this, DMAIC, and one of the idea, also for me, using the structure here is to define and measure the problem, because sometimes you may be thinking you have a problem, but by starting to measure the problem you may realize that what you think was a problem is actually not a problem at all. So there's no need to to continue solving the problem that is actually not existing, and that was what I told people saying, Okay, first of all, let's measure the problem. Do we even have a problem, or is it just a stomach feeling that we are having in the production? That was the first step. So we went to the to the production line ???, going to the line to see, can we have an idea of the process, maybe we can even see where the problem occurs. The process is is is rather simple. I've tried to simplify it here in this very simple drawing, just to give you an idea. We have tin sheets coming in to a conveyor and then we have a piston making the lid for the cans, so the process is rather simple. As a result of this piston and of making these lids, we can see that we can collect some dust on the conveyor, and this is where the operator needs to clean once in a while. And we could see by doing the ??? that after 25,000 lids, we could collect about or more than two kilograms of tin dust. And that was when we needed to clean the line. So that was like our baseline, say when we had produced 25,000 lids equal to more than two kilograms, we need to clean the line, and that was not acceptable because cleaning is taking up production time and having reduced production time we cannot produce the lids that we want to produce. That means that we maybe will be late for the delivery, or maybe we are not able to deliver at all. So we had, all of us, an interest of reducing this cleaning and we could see that we can reduce your cleaning if we can reduce the tin dust produced at the line. So the problem was pretty simple when you when we have collected the the data in kilograms here. Remember that you can see the back here on the picture, this is the dust that we could collect using a vacuum cleaner after 25,000 lids. In the beginning, when people came to me we didn't even have a baggage with the dust, it was just it was just by watching, we could express the problem that we were having. Now at least we could see that this is the problem that we're having. We have too much dust and this is too much dust, because we need to clean it. So we have to reduce the kilograms after 25,000 lids produced, that was the success criteria of the project. By collecting the right people, we took a like a workshop, put all the people in a room saying we have a problem here. After 25,000 lids we produce more than two kilograms of dust. What is the reason for that? And then we did a brainstorming, saying look at the process, looking at what is coming in the process, what is coming out the process. Can any of these variables explain the recent for the tin dust that we see? So step one was to define the variables. And you can see, I have marked here in yellow what the team expected to be some of the root causes for the tin dust that we saw. 
We started with the input that since it's coming in, we know that the tin sheet, the thickness of the tin sheet can can can vary, so we can have some thicker tin sheet coming in and some not so thin thin sheet coming in. And we also, we could see that the coating of the piston could be a reason for the tin coat. We could see, compared to other lines that had some coat, didn't have the problem in the same scale. We could also see that the measurements of the piston, we have four different measurements on the piston that could explain the reason for the tin dust. We have never tried this out, so this was just on the paper, so this could be the reason but let's try to find out. Last year or two years ago, when we have a problem like this, the approach would normally be that we were trying different things out, so we could try and make the thickness could see if this could change anything, the coat. But here we would like to combine all the variables in one experiment simply to to speed up the process. So we set up a design of experiments, a DOE. Over time we have to change this a bit, so it could reflect the reality that we have and also the allocated production times that we could use for the experiment. Setting up the experiments and defining the variables was not a difficult task. It took maybe a couple of hours. Setting up the DOE was not difficult, we did that in SAS JMP. But executing the DOE was the tricky part because it took a long time; it took about a week. So we need to plan to take out the machine and then we did the trial for about five days. And simply what we did, we produced 25,000 lids using one kind of setting and then 25,000 lids using another kind of setting and so on. And then, after the week, we analyzed the results, and then we concluded based on these results. Let me try to show you what we did in SAS JMP. You can see here this, just by looking at the numbers, we could see that this is a huge progress since our starting point. We started by just having people say we have too much dust and we need to clean all the time. Now at least we have some number a number...numbers on the on the tin. You can see here, we have the tin dust. This is the produced tin dust after 25,000 lids, and we also have different settings of the thickness of the material coming into the line. And we have the four different measurements of the piston here. And we have a statement, has it been coated or has not been coated. So we have different kind of pistons that we were trying out. This is rather easy for people to understand. They could they could see how much tin dust do we have if the thickness of the material coming in, is 5.74, if the measurement is 1.47 and so on. So this is a huge step from coming from just watching to actually have some real numbers behind the working set that we were having in the beginning. So just collecting the number here was a huge progress from our starting point, but the idea was to use the number to see can we explain the reason for the tin dust based on the thickness of the material coming in, the four different measurements of the piston, and if the piston has been coated or not. We're using some of the tools in SAS JMP and one of the tools that we're using a lot here is the fit model. The fit model explains if there will be any relationships between your responses and your variables. And up here we have the response. This is our tin dust. 
Here we have the model, so we would like to see if the thickness of the material affects the tin dust, the measurement of the piston, and if the coating of the piston would have any impact. Running the model here and saying, try to combine the different variables and tell me what will have the highest impact on our tin dust. And this is basically the results that we got out of it. We could see here that we're using the p value to guide us on whether this makes sense for us. The coat has a low p value, meaning, well, it seems like we have a significant relationship between the coat and our tin dust. We also believe that the thickness of the material will have a relationship with our tin dust, and we believe that the Measurement 3 will have some kind of relationship with the tin dust. This means also that the Measurement 1, 2 and 4 don't seem to be significant when we talk about the tin dust. Remember that we were starting from just working without any kind of number, so now we were talking about P values and how this can help us. And this was actually quite easy for us to interpret, and people did understand, okay significant means that this maybe is not a coincidence. And by using the right people, we could verify this makes sense for them as well, so it seems like the coating can have an impact on the tin dust. And the technic... technical staff were saying, yeah, it makes sense that the coating will impact on the tin dust, because we have seen this on other lines. And the thickness could be verified to make sense and the Measurement 3 could make sense, so we started to believe that these are some good guidelines for us, but we need to see yeah, we can see that the coat seems to be significant, but is it with or without coating that is relevant for us? So we expanded this... the fit model here and you can see here in the profiler how the different variables will impact the tin dust. You can see the tin dust here to the left. And you can see the thickness of the sheet coming in the process here, the Measurement 3 and then the coat. And then we could try to simulate if we have the coating, if we have Measurement 3 on, what would be the expected kilograms coming out of the process? Here it's saying we can expect 0.0057. We also have a confidence interval here, but we can expect that this will be the amount of tin dust coming out of the process. Then look and see what if we have a piston that is not coated. You can see that it will change significantly, and it will be higher. And we know from our start that around 0.2 will be like the game changer; if we have more than 0.2 kilograms of tin dust, we need to clean, so we want to be lower than 0.2. And what we did to go even further here, because we know that the thickness of the material was very difficult for us to control, this is specified and there will be some variation within the thickness, which would be very difficult for us to change. So we needed to have a very robust process, saying we need to keep the thickness flexible. But what we can control is the Measurement 3 and the coating. So we expanded this profiler to the simulator so that we could simulate what if the thickness will change with some specified standard deviation. So we're saying we know the thickness can change. We know that we have a mean around 5.595 and we have a standard deviation of this material equal to 0.058. We could change this later on. We want to fix the measurement and we want to fix the coating, having the coating on the piston. 
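The fit just described can be expressed in a few lines of JSL. The column names below (:Tin Dust, :Thickness, :Measurement 3, :Coat) simply mirror the names used in the talk and are assumptions, not the actual project table.

dt = Current Data Table();
fit = dt << Fit Model(
    Y( :Name( "Tin Dust" ) ),
    Effects( :Thickness, :Name( "Measurement 3" ), :Coat ),
    Personality( "Standard Least Squares" ),
    Emphasis( "Effect Screening" ),   // this emphasis shows the prediction profiler by default
    Run
);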
We also know that our target is not higher than 0.2 so we could add a target in here. And then we could simulate. The Measurement 3 will be fixed, the coat will be fixed, but the thickness will change over time. Then we can simulate if we want 5,000 sheets in the process, what could we expect to see in the tin dust? And then we could simulate. You can see if we have a tin sheet coming in having a thickness of about 5.6, the Measurement 3 at 1 and with a coat, we could expect a very good result. And you can see, we have the red line here at 2... sorry, 0.2, and we can also see hfere that the rate of defect, meaning that rate of measurements higher than 0.1 would be 0, so this is good for us. You can also see that if we then change it, the Measurement 3, not at 1, but to 2.5. And with the simulation again. We will be at a slightly higher tin dust amount. If we, on top of that, sorry, change it to a piston with no coating and run 5,000 sheet plate, we could expect a very poor response. And we could see that the setup that we had in our production line before we started these changes were pistons without any coat. So this was very new to us, and it was very exciting for us to see that we can actually see what could control the the tin dust and we can even control it ourselves. So, based on the simulations that we have here, we decided to have a Measurement 3 on 2 and then a piston with a coating, meaning that we will have a very robust process that gets...that can handle the variation that we have in our thickness of the sheets. Then we could also try to to to simulate thing, what if this deviation will not be 0.058 but will be 1.5. You see, this will be too much, so this is some of the agreements that we have with our suppliers that they need to keep the standard deviation at a specific level because then we will have a robust process. So this was like a an eye opener for all of us since this is very, very good picture for us, and it hasn't cost us a lot. The only thing it has cost us was the experiment that we set for five days. So this was good news for us. And so here on number five, we concluded saying, Okay, now we know what should be improved, we need to coat the tools or the piston and we need to adjust the measurement to 2.0. It was coming from about 3.0. So this was ??? an improvement, and this is a very simple picture from the profiler, showing the relationship between our tin coat and the different variables that we expected. And we did a control saying let's try to change our processes and then produce 25,000 more lids and then you can see the comparison. This is was our like our baseline. You can see, the small dust here in the baggage before we change the process, and you can see the 25,000 of the tin dust after producing 25,000 lids after the change and we didn't see any tin dust here. So this was like...it was very good proof of concept for us. It was one of the first projects that we did using SAS JMP and it was a very good proof of concept and people did really rely on this way of doing problem solving. What we learn here is the use of data, it shows benefit for us, because then we have something in common, something that we can relate to everybody, instead of having different opinions that is very difficult for us to quantify, so the use of this was very helpful for all of us. 
And using the right people, the technical people, people with knowledge from the line, also people who can use a statistical software like JMP, understand how to set up an experiment, understand how to do a fit model, regression models and so on. And then the use of SAS JMP is truly powerful. We could not have done this in, like, Excel because it doesn't have the tools. And then, using a structured process like DMAIC was very powerful for us, so this was a very good learning for us, and this is something that we have implemented in many projects afterwards with very good response.  
Christopher Gotwalt, JMP Director of Statistical R&D, SAS   There are often constraints among the factors in experiments that are important not to violate, but are difficult to describe in mathematical form. These constraints can be important for many reasons. If you are baking bread, there are combinations of time and temperature that you know will lead to inedible chunks of carbon. Another situation is when there are factor combinations that are physically impossible, like attaining high pressure at low temperature. In this presentation, we illustrate a simple workflow of creating a simulated dataset of candidate factor values. From there, we use the interactive tools in JMP's data visualization platforms in combination with AutoRecalc to identify a physically realizable set of potential factor combinations that is supplied to the new Candidate Set Design capability in JMP 16. This then identifies the optimal subset of these filtered factor settings to run in the experiment. We also illustrate the Candidate Set Designer's use on historical process data, achieving designs that maximize information content while respecting the internal correlation structure of the process variables. Our approach is simple and easy to teach. It makes setting up experiments with constraints much more accessible to practitioners with any amount of DOE experience.       Auto-generated transcript...   Transcript Hello Chris Gotwalt here. Today, we're going to be constructing the history of graphic paradoxes and oh wait, wrong topic. Actually we're going to be talking about candidate set designs, tailoring DOE constraints to the problem. So industrial experimentation for product and process improvement has a long history with many threads that I admit I only know a tiny sliver of. The idea of using observation for product and process innovation is as old as humanity itself. It received renewed focus during the Renaissance and Scientific Revolution. During the subsequent Industrial Revolution, science and industry began to operate more and more in lockstep. In the early 20th century, Edison's lab was an industrial innovation on a factory scale, but it was done, to my knowledge, outside of modern experimental traditions. Not long after R.A. Fisher introduced concepts like blocking and randomization, his associate and then son in law, George Box, developed what is now probably the dominant paradigm in design of experiments, with the most popular book being Statistics for Experimenters by Box, Hunter and Hunter. The methods described in Box, Hunter and Hunter are what I call the taxonomical approach to design. So suppose you have a product or process you want to improve. You think through the things you can change. The knobs you can turn, like temperature, pressure, time, ingredients you can use or processing methods that you can use. These things become your factors. Then you think about whether they are continuous or nominal, and if they are nominal, how many levels they take or the range you're willing to vary them. If a factor is continuous, then you figure out the name of the design that most easily matches up to the problem and resources that you...that fits your budget. That design will have... will have a name like a Box Behnken design, a fractional factorial, or a central composite design, or possibly something like a Taguchi array. 
There will be restrictions on the numbers of runs, the level...the numbers of levels of categorical factors, and so on, so there will be some shoehorning the problem at hand into the design that you can find. For example, factors in the BHH approach, Box Hunter and Hunter approach, often need to be whittled down to two or three unique values or levels. Despite its limitations, the taxonomical approach has been fantastically successful. Over time, of course, some people have asked if we could still do better. And by better we mean to ask ourselves, how do we design our study to obtain the highest quality information pertinent to the goals of the improvement project? This line of questioning lead ultimately to optimal design. Optimal design is an academic research area. It was started in parallel with the Box school in the '50s and '60s, but for various reasons remained out of the mainstream of industrial experimentations, until the custom designer and JMP. The philosophy of the custom designer is that you describe the problem to the software. It then returns you the best design for your budgeted number of runs. You start out by declaring your responses along with their goals, like minimize, maximize, or match target, and then you describe the kinds of factors you have, continuous, categorical mixture, etc. Categorical factors can have any number of levels. You give it a model that you want to fit to the resulting data. The model assumes at least squares analysis and consists of main effects and interactions in polynomial terms. The custom designer make some default assumptions about the nature of your goal, such as whether you're interested in screening or prediction, which is reflected in the optimality criterion that is used. The defaults can be overridden with a red triangle menu option if you are wanting to do something different from what the software intends. The workflow in most applications is to set up the model. Then you choose your budget, click make design. Once that happens, JMP uses a mixed, continuous and categorical optimization algorithm, solving for the number of factors times the number of rows terms. Then you get your design data table with everything you need except the response data. This is a great workflow as the factors are able to be varied independent from one another. What if you can't? What if there are constraints? What if the value of some factors determine the possible ranges of other factors? Well then you can do....then you can define some factor constraints or use it disallowed combinations filter. Unfortunately, while these are powerful tools for constraining experimental regions, it can still be very difficult to characterize constraints using these. Brad Jones' DOE team, Ryan Lekivetz, Joseph Morgan and Caleb King have added an extraordinarily useful new feature that makes handling constraints vastly easier in JMP 16. These are called candidate or covariate runs. What you can do is, off on your own, create a table of all possible combinations of factor settings that you want the custom designer to consider. Then load them up here and those will be the only combinations of factor settings that the designer will... will look at. The original table, which I call a candidate table, is like a menu factor settings for the custom designer. This gives JMP users an incredible level of control over their designs. 
What I'm going to do today is go over several examples to show how you can use this to make the custom designer fulfill its potential as a tool that tailors the design to the problem at hand. Before I do that, I'm going to get off topic for a moment and point out that in the JMP Pro version of the custom designer, there's now a capability that allows you to declare limits of detection at design time. If you want a non missing values for the limits here the custom designer will add a column property that informs the generalized regression platform of the detection limits and it will then automatically get the analysis correct. This leads to dramatically higher power to detect effects and much lower bias in predictions, but that's a topic for another talk. Here are a bunch of applications that I can think of for the candidate set designer. The simplest is when ranges of a continuous factor depend on the level of one or more categorical factors. Another example is when we can't control the range of factors completely independently, but the constraints are hard to write down. There are two methods we can use for this. One is using historical process data as a candidate set, and then the other one is what I call filter designs where you create...design a giant initial data set using random numbers or a space filling design and then use row selections in scatter plots to pick off the points that don't satisfy the constraints. There's also the ability to really highly customize mixture problems, especially situations where you've got multilayer mixturing. This isn't something that I'm going to be able to talk about today, but in the future this is something that you should be looking to be able to do with this candidate set designer. You can also do nonlinear constraints with the filtering method, the same ways you can do other kinds of constraints. It's it's very simple and I'll have a quick example at the very end illustrating this. So let's consider our first example. Suppose you want to match a target response in an investigation of two factors. One is equipped...an equipment supplier, of which there are two levels and the other one is the temperature of the device. The two different suppliers have different ranges of operating temperatures. Supplier A's is more narrow of the two, going from 150 to 170 degrees Celsius. But it's controllable to a finer level of resolution of about 5 degrees. Supplier B has a wider operating range going from 140 to 180 degrees Celsius, but is only controllable to 10 degrees Celsius. Suppose we want to do a 12 run design to find the optimal combination of these two factors. We enumerate all possible combinations of the two factors in 10 runs in the table here, just creating this manually ourselves. So here's the five possible values of machine type A's temperature settings. And then down here are the five possible values of Type B's temperature settings. We want the best design in 12 runs, which exceeds the number of rows in the candidate table. This isn't a problem in theory, but I recommend creating a copy of the candidate set just in case so that the number of runs that your candidate table has exceeds the number that you're looking for in the design. Then we go to the custom designer. Push select covariate factors button. Select the columns that we want loaded as candidate design factors. Now the candidate design is loaded and shown. Let's add the interaction effect, as well as the quadratic effect of temperature. 
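The 10-row candidate table just described can also be built with a short script rather than by hand. This is only a sketch of the enumeration step (machine A: 150 to 170 in steps of 5, machine B: 140 to 180 in steps of 10), not part of the original demo.

dt = New Table( "Temperature candidates",
    New Column( "Machine", Character ),
    New Column( "Temperature", Numeric )
);
For( t = 150, t <= 170, t += 5,
    dt << Add Rows( 1 );
    dt:Machine[N Rows( dt )] = "A";
    dt:Temperature[N Rows( dt )] = t;
);
For( t = 140, t <= 180, t += 10,
    dt << Add Rows( 1 );
    dt:Machine[N Rows( dt )] = "B";
    dt:Temperature[N Rows( dt )] = t;
);
// Load Machine and Temperature as covariate/candidate factors in the custom designer.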
Now we're at the final step before creating the design. I want to explain the two options you see in the design generation outline node. The first one, which will force in all the rows that are selected in the original table or in the listing of the candidates in the custom designer. So if you have checkpoints that are unlikely to be favored by the optimality criterion and want to force them into into the design, you can use this option. It's a little like taking those same rows and creating an augmented design based on just them, except that you are controlling the possible combinations of the factors in the additional rows. The second option, which I'm checking here on purpose, allows the candidate rows to be chosen more than once. This will give you optimally chosen replications and is probably a good idea if you're about to run a physical experiment. If, on the other hand, you are using an optimal subset of rows to find to try in a fancy new machine learning algorithm like SVEM, a topic of one of my other talks at the March Discovery Conference. You would not want to check this option if that was the case. Basically, if you don't have all of your response values already, I would check this box and if you already have the response values, then don't. Reset the sample size to 12 and click make design. The candidate design in all its glory will appear just like any other design made by the custom designer. As we see in the middle JMP window, JMP also selects the rows in the original table chosen by the candidate design algorithm. Note that 10 not 12 rows were selected. On the right we see the new design table, the rightmost column in the table indicates the row of origin for that run. Notice that original rows 11 and 15 were chosen twice and are replicates. Here is a histogram view of the design. You can see that the different values of temperature were chosen by the candidate set algorithm for different machine types. Overall, this design is nicely balanced, but we don't have 3 levels of temperature in machine type A. Fortunately, we can select the rows we want forced into the design to ensure that we have 3 levels of temperature for both machine types. Just select the row you want forced into the design in the covariate table. Check include all selected covariant rows into the design option. And then if you go through all of that, you will see that now both levels of machine have at least three levels of temperature in the design. So the first design we created is on the left and the new design forcing there to be 3 levels of machine type A's temperature settings is over here to the right. My second example is based on a real data set from a metallurgical manufacturing process. The company wants to control the amount of shrinkage during the sintering step. They have a lot of historical data and have applied machine learning models to predict shrinkage and so have some idea what the key factors are. However, to actually optimize the process, you should really do a designed experiment. As Laura Castro-Schilo once pointed... As Laura Castro-Schilo once told me, causality is a property not of the data, but if the data generating mechanism, and as George Box says on the inside cover of Statistics for Experimenters, to find out what happens when you change something, it is necessary to change it. Now, although we can't use the historical data to prove causality, there is essential information about what combinations of factors are possible that we can use in the design. 
We first have to separate the columns in the table that represent controllable factors from the ones that are more passive sensor measurements or drive quantities that cannot be controlled directly. A glance at the scatter plot of the potential continuous factors indicates that there are implicit constraints that could be difficult to characterize as linear constraints or disallowed combinations. However, these represent a sample of the possible combinations that can be used with the candidate designer quite easily. To do this, we bring up the custom designer. Set up the response. I like to load up some covariate factors. Select the columns that we can control as factor...DOE factors and click OK. Now we've got them loaded. Let's set up a quadratic response surface model as our base model. Then select all of the model terms except the intercept. Then do a control plus right click and convert all those terms into if possible effects. This, in combination with response surface model chosen, means that we will be creating a Bayesian I-optimal candidate set design. Check the box that allows for optimally chosen replicates. Enter the sample size. It then creates the design for us. If we look at the distribution of the factors, we see that it is tried hard to pursue greater balance. On the left, we have a scatterplot matrix of the continuous factors from the original data and on the right is the hundred row design. We can see that in the sintering temperature, we have some potential outliers at 1220. One would want to make sure that those are real values. In general, you're going to need to make sure that the input candidate set it's clear of outliars and of missing values before using it as a candidate set design. In my talk with Ron Kennet this...in the March 2021 Discovery conference, I briefly demo how you can use the outlier and missing value screening platforms to remove the outliers and replace the missing values so that you could use them at a subsequent stage like this. Now suppose we have a problem similar to the first example, where there are two machine types, but now we have temperature and pressure as factors, and we know that temperature and pressure cannot vary independently and that the nature of that dependence changes between machine types. We can create an initial space filling design and use the data filter to remove the infeasible combinations of factors setting separately for each machine type. Then we can use the candidate set designer to find the most efficient design for this situation. So now I've been through this, so now I've created my space filling design. It's got 1,000 runs and I can bring up the global data filter on it and use it to shave off different combinations of temperature and pressure so that we can have separate constraints by machine type. So I use the Lasso tool to cut off a corner in machine B. And I go back and I cut off another corner in machine B so machine B is the machine that has kind of a wider operating region in temperature and pressure. Then we switch over to machine A. And we're just going to use the Lasso tool to shave off the points that are outside its operating region. And we see that its operating region is a lot narrower than Machine A's. And here's our combined design. From there we can load that back up into the custom designer. Put an RSM model there, then set our number of runs to 32, allowing coviariate rows to be repeated. And it'll crank through. 
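The interactive row filtering shown here can also be expressed directly in JSL. The constraint below is a made-up example with hypothetical column names (:Machine, :Temperature, :Pressure), intended only to show the select-then-delete pattern on a space-filling candidate table.

dt = Current Data Table();    // the space-filling candidate table
// Drop candidate points that are infeasible for machine A (the narrower operating window)
dt << Select Where( :Machine == "A" & (:Temperature > 165 | :Pressure > 0.8) );
dt << Delete Rows( dt << Get Selected Rows );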
Once it's done that, it selects all the points that were chosen by the candidate set designer. And here we can see the points that were chosen. They've been highlighted and the original set of candidate points that were not selected are are are gray. We can bring up the new design in Fit Y by X and we can see a scatterplot where we see that the the the machine A design points are in red. They're in the interior of the space, and then the Type B runs are in blue. It had the wider operating region and that's how we see these points out here, further out for it. So we have quickly achieved a design with linear constraints that change with a categorical factor without going the annoying process of deriving the linear combination coefficients. We've simply used basic JMP 101 visualization and filtering tools. This idea generalizes to other nonlinear constraints and other complex situations fairly easily. So now we're going to use filtering and multivariate to set up a very unique new type of design that I assure you you have never seen before. Go to the Lasso tool. We're going to cut out a very unusual constraint. And we're going to invert selection. We're going to delete those rows. Then we can speed this up a little bit. We can go through and do the same thing for other combinations of X1 and the other variables. Carving out a very unusual shaped candidate set. We can load this up into the custom designer. Same thing as before. Bring our columns in as covariates, set up a design with all... all high order interactions made if possible, with a hundred runs. And now we see our design for this very unusual constrained region that is optimal given these constraints. So I'll leave you with this image. I'm very excited to hear what you were able to do with the new candidate set designer. Hats off to the DOE team for adding this surprisingly useful and flexible new feature. Thank you.  
Vince Faller, Chief Software Engineer, Predictum  Wayne Levin, President, Predictum   This session will be of interest to users who work with JMP Scripting Language (JSL). Software engineers at Predictum use a continuous integration/continuous delivery (CI/CD) pipeline to manage their workflow in developing analytical applications that use JSL. The CI/CD pipeline extends the use of Hamcrest to perform hundreds of automated tests concurrently on multiple levels, which factor in different types of operating systems, software versions and other interoperability requirements. In this presentation, Vince will demonstrate the key components of Predictum’s DevOps environment and how they extend Hamcrest’s automated testing capabilities for continuous improvement in developing robust, reliable and sustainable applications that use JSL: Visual Studio Code with JSL extension – a single code editor to edit and run JSL commands and scripts in addition to other programming languages. GitLab – a management hub for code repositories, project management, and automation for testing and deployment. Continuous integration/continuous delivery (CI/CD) pipeline – a workflow for managing hundreds of automated tests using Hamcrest that are conducted on multiple operating systems, software versions and other interoperability requirements. Predictum System Framework (PSF) 2.0 – our library of functions used by all client projects, including custom platforms, integration with GitLab and CI/CD pipeline, helper functions, and JSL workarounds.     Auto-generated transcript...   Speaker Transcript Wayne Levin Welcome to our session here on extending Hamcrest automated testing of JSL applications for continuous improvement. What we're going to show you here, our promise to you, is we're going to show you how you too can build a productive cost-effective high quality assurance, highly reliable and supportable JMP-based mission-critical integrated analytical systems. Yeah that's a lot to say but that's that's what we're doing in this in this environment. We're quite pleased with it. We're really honored to be able to share it with you. So here's the agenda we'll just follow here. A little introduction, my self, I'll do that in a moment, and just a little bit about Predictum, because you may not know too much about us, our background, background of our JSL development, infrastructure, a little bit of history involved with that. And then the results of the changes that we've been putting in place that we're here to share with you. Then we're going to do a demonstration and talk about what's next, what we have planned for going forward, and then we'll open it up, finally, for any questions that that you may have. So I'm Wayne Levin, so that's me over here on the right. I'm the president of Predictum and I'm joined with Vince Faller. Vince is our chief software engineer who's been leading this very important initiative. So just a little bit about us, right there. We're a JMP partner. We launched in 1992, so 29 years old. We do training in statistical methods and so on, using JMP, consulting in those areas and we spend an awful lot of time building and deploying integrated analytical applications and systems, hence why this effort was very important to us. We first delivered JMP application with JMP 4.0 in the year 2000, yeah, indeed over 20 years ago, and we've been building larger systems. Of course, since back then, it was too small little tools, but we started, I think, around JMP 8 or 9 building larger systems. 
So we've got quite a bit of history on this, over 10 years easily. So just a little bit of background...until about the second half of 2019, our development environment was really disparate, it was piecemeal. Project management was there, but again, everything was kind of broken up. We had different applications for version control and for managing time, you know, our developer time, and so on, and just project management generally. Developers were easily spending, and we'll talk about this, about half their time just doing routine mechanical things, like encrypting and packaging JMP add-ins. You know, maintaining configuration packages and, you know, and separating the repositories or what we generally call repo's, you know, for encrypted and unencrypted script. It was...there was a lot we hade to think about that wasn't really development work. It was really work that developer talent is...was wasted on. We also had, like I said, we've been doing it a long time, even at 2019, we had easily 10 years, so over 10 years of legacy framework going all the way back even to JMP 5, you know, with, you know, it was getting bloated and slow. And we know JMP has come a long way over the years. I mean in JMP 9, we got namespaces and JMP 14 introduced classes and that's when Hamcrest began. And it was Hamcrest that really allowed us to go this this...with this major initiative. So we began this major initiative back in August of 2019. And that's when we are acquired our first Gitlab licenses and that's the development of our new...the development of our new development architecture, there you go, started to take shape and it's been improving ever since. Every month, basically, we've been adding and building on our capabilities to become more and more productive, as we go forward. And and that's continuing, so we actually consider this, if you will, a Lean type of effort. It really does follow Lean principles and it's accelerated our development. We have automated testing, thanks to this system, and Vince is going to show us that. And we have this little model here, test early and test often And that's what we do. It supports reusing code and we've redeveloped our Predictum system framework. It's now 2.0. We've learned a lot from our earlier effort. All that's gone, pretty much all of its gone, and it's been replaced and expanded. And Vince will tell us more about that. Easily, easily we have over 50% increase in productivity, and I'm just going to say the developers are much happier. They're less frustrated. They're more focused on their work, I mean the real work that developers should be doing, not the tedious sort of stuff. There's still room for improvement, I'm going to say, so we're not done and Vince will tell us more about that. We have development standards now, so we have style guides for functions and all of our development is functionally based, you might say. Each function requires at least one Hamcrest test, and there are code reviews that the developers, they're sharing with one another to ensure that we're following our standards. And it raises questions about how to enhance those standards, make them better. We also have these, sort of, fun sessions, where developers are encouraged to break code, right, so they're called like, these break code challenges, or what have you. So it's become part of our modus operandi and it all fits right in with this development environment. It leads to, for example, further tests, further Hamcrest tests to be added. 
We have one small, fairly small project that we did just over a year ago. We're going into a new phase of it. It's got well over... well over 100 Hamcrest tests are built into it and they get run over and over and over again through the development process. So some other benefits is it allows us to assign and track our resource allocation, like what developers are doing what. Everyone knows what everyone else is doing, continuous integration, continuous deployment, something like that), there's...code collisions are detected early so if we have... and we do, we have multiple people working on some projects, so, you know, somebody's changing a function over here and it's going to collide with something that someone else is doing. We're going to find out much sooner. It also allows us to improve supportability across multiple staff. We can't have code dependent on a particular developer; we have to have code that any developer or support staff can support ging forward. So that's was an important objective of ours as well. And it does advance the whole quality assurance area just generally, including supporting, you know, FDA requirements, concerning QA, you know, things like validation, the IQ OQ PQ. So it's...we're automating or semi automating those tasks as well through this infrastructure. We do use it internally and externally, so you may know, we have some products out there, (???)Kobe sash lab but new ones spam well Kobe send spam(???) are talked about also elsewhere in the JMP Discovery European Conference in 2021. You might want to go check them out, but they're fairly large code bases and they're all developed, in other words, we eat our own dog food, if you know that expression, but we also use it with all of our client development, so this is something that's important to our clients, so because we're building applications that they're going to be dependent on. And so we, we need to...we need to have the infrastructure that allows us to be dependable, and anyway, that's a big part of this. I mentioned the Predictum system framework. You can see some snippets of it here. It's right within the scripting index, and you know, we see the arguments and the examples and all that. We built all that in and 95%, over 95% of them have Hamcrest tests associated with them. Of course, our goal is to make sure that all of them do and we're we're getting there. We're getting there. Have...these framework...this framework is actually part of our infrastructure here. That's one of the important elements of it. Another is just that...Hamcrest... the ability to do the unit testing. And I'm going to have...there's a slide at the...at the end, which will give you a link into the Community where you can learn more about Hamcrest. This is a development that was brought to us by by JMP, back in JMP 14, as I mentioned a few minutes ago. Gitlab is a big part of this; that gives us the project management repository, the CI/CD pipeline, etc. And also there's a visual...visual studio code extension for JSL that we created and we'd...you see five stars there because it was given five stars on the on the visual studio. I'm not sure what we call that. Vince, maybe you can tell us, the store, what have you. It's been downloaded hundreds of times and we've been updating it regularly. So that's something you can go and look for as well. I think we have a link for that as well in the resource slide at the end. So what I'm going to do now is I'm going to pass this over to Vince Faller. 
Vince is, again, our chief software engineer. Vince led this initiative, starting in August 2019, as I said. It was a lot of hard work and the hard work continues. We're all, in the company, very grateful for Vince and his leadership here. So with that said, Vince, why don't you take it from here? Vince Faller Sharing. So Wayne said Hamcrest a bunch of times. For people that don't know what Hamcrest is, it is an add-in created by JMP. Justin Chilton and Evan McCorkle were leading it. It's just a unit testing library that lets you run tests and get the results in an automated way. It really started the ball rolling of us being able to even do this, hence why it's called "extending." I'm going to be showing some stuff with my screen. I work pretty much exclusively in the VSCode extension that we built. This is VSCode. We do this because it has a lot of built-in or extendable functionality that we don't have to write, like Git integration and GitLab integration. Here you can see this is a JSL script and it reads it just fine. If you want to get it, if you're familiar with VSCode, it's just a lightweight text editor; you just type in JMP and you'll see it. It's the only one. But we'll go to what we're doing. So, for any code change we make, there is a pipeline run. We'll just kind of show what it does. So if I change the README file to "this is a demo for Discovery 2021," I'm just going to commit that. If you don't know Git, committing is just saying I want a snapshot of exactly where we are at the moment, and then you push it to the repo and it's saved on the server. Happy day. Commit message: more readme info. And I can just do git push, because VSCode is awesome. Pipeline demo. So now I've pushed it. There is going to be a pipeline running. I can just go down here and click this and it will give me my merge request. So now the pipeline has started running. I can check the status of the pipeline. What it's doing right now is going through and checking that it has the required Hamcrest files. We have some requirements that we enforce so that we can make sure that we're doing our jobs well. And then it's done. I'm going to press encrypt. Now encrypt is going to take the whole package and encrypt it. If we go over here, this is just a VM somewhere. It should start running in a second. So it's just going through all the code, writing all the encrypted passwords, going through, clicking all that stuff. If you've ever tried to encrypt multiple scripts at the same time, you'll probably know that that's a pain, so we automated it so that we don't have to do this because, as Wayne said, it was taking a lot of our time. Like, if we have 100 scripts to go through and encrypt every single one of them every time we want to do any release, it was awful. Because we have to have our code encrypted. All right, I can stop sharing that. So that's going to run. It should finish pretty soon. Then it will go through and stage it, and the staging basically takes all of the sources of information we want, as in our documentation, as in anything else we've written, and renders them into the form that we want in the add-in, because much like the rest of GitHub and GitLab, most of our documentation is written in markdown and then we render it into whatever we need. I don't need to show the rest of this, but yeah. So it's passing. It's going to go. We'll go back to VSCode. So.
If we were to change... so this is just a single function. If I go in here, like, if I were to run this... JSL, run current selection. So. You can see that it came back; all that it's trying to do is open Big Class, run a fit line, and get the equation. It's returning the equation. And you can actually see it ran over here as well. But this could use some more documentation. And we're like, oh, we don't actually want this data table open. But let's just run this real quick. And say, no, this isn't a good return; it returns the equation in all caps, apparently. So if I stage that. Better documentation. Push. Again back to here. So, again it's pushing. This is another pipeline. It's just running a bunch of PowerShell scripts in order, depending on however we set it up. But you'll notice this pipeline has more stages. In an effort to help us scale this, we only test the JSL minimally at first, and then, as it passes, we allow it to test further. And we only test it if there are JSL files that have changed. But we can go through this. It will run and it will tell us where it is in the testing, just in case the testing freezes. You know, if you have a modal dialog box that just won't close, obviously JMP isn't going to keep doing anything after that. But you can see, it did a bunch of stuff, yeah, awesome. I'm done. Exciting. Refresh that. Get a little green checkmark. And we could go, okay, run everything now. It would go through, test everything, then encrypt it, then test the encrypted version, basically the actual thing that we're going to make the add-in of, and then stage it again, package it for us, create the actual add-in that we would give to a customer. I'm not going to do that right now because it takes a minute. But let's say we go in here and we're, like, oh, well, I really want to close this data table. I don't know why I commented it out in the first place. I don't think it should be open, because I'm not using it anymore; we don't want that. We'll say okay. Close the dt. Again push. Now, this could all be done manually on my computer with Hamcrest. But you know, sometimes a developer will push stuff and not run all of their Hamcrest tests for everything on their computer, and the entire purpose of this is to catch that. It forced us to do our jobs a little better. And yeah. Keep clicking this button. I could just open that, but it's fine. So now you'll see it's running the pipeline again. Go to the pipeline. And I'm just going to keep saying this for repetition. We're just going through, testing, and encrypting, then testing, because sometimes encryption enters its own world of problems, if anybody's ever done encrypting. Run, run, run, run, run. And then, oh, we got a throw. Would you look at that? I'm not trying to be deadpan, but you know. So if we were to mark this as ready and say, yeah, we're done, we'd see, oh, well, that test didn't pass. Now we could download why the test didn't pass in the artifacts. And this will open a JUnit file that I'm just going to pull out here. It will also render it in GitLab, which might be easier, but for now we'll just do this. Eventually. Minimize everything. Now come on. So, we can see that something happened with R squared and it failed, inside of boo. So we can come here and say, why is there something in boo that is causing this to fail? We see, oh, somebody called our equation function and then they just assumed that the data table was there.
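As an illustration of the kind of function being demonstrated (open Big Class, fit a line, return the equation), here is a hedged JSL sketch. It is a stand-in, not Predictum's actual code: to keep it self-contained it computes the slope and intercept directly from the columns instead of reading them out of the Bivariate report, and it closes the table so nothing is left open (the bug discussed next). The check at the bottom assumes the JSL-Hamcrest add-in.

get big class equation = Function( {},
	{Default Local},
	dt = Open( "$SAMPLE_DATA/Big Class.jmp" );
	x = dt:height << Get As Matrix;
	y = dt:weight << Get As Matrix;
	Close( dt, No Save );                    // do not leave the table open behind us
	// simple least squares fit of weight on height
	xbar = Sum( x ) / N Rows( x );
	ybar = Sum( y ) / N Rows( y );
	slope = Sum( (x - xbar) :* (y - ybar) ) / Sum( (x - xbar) :* (x - xbar) );
	intercept = ybar - slope * xbar;
	"weight = " || Char( Round( intercept, 4 ) ) || " + " || Char( Round( slope, 4 ) ) || " * height";
);

// Hamcrest-style check (JSL-Hamcrest add-in assumed) that the function cleans up after itself
n tables before = N Table();
eqn = get big class equation();
ut assert that( Expr( N Table() ), ut equal to( n tables before ) );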
So because something I changed broke somebody else's code, as if that would ever happen. So we're having that problem. Where did you go? Here we go. So that's the main purpose of everything we're doing here: to be able to catch the fact that I changed something and I broke somebody else's stuff. So I could go through, look at what boo does, and say, oh well, maybe I should just open Big Class myself. Yeah, cool. Well, if I save that, I should probably make it better. Open Big Class myself. I'll stage that. Open Big Class. Git push. And again, just show the old pipeline. Now this should take not too long. So we're going to go in here. We only test on one JMP version at first, but you can see, automatically, we only test on one. Then it waits for the developer to say, yeah, I'm done and everything looks good, continue. We do that for resource reasons, because these are running on VMs that are automatically just chugging all the time, and we have multiple developers who are all using these systems. We're also... you can see, this one is actually a Docker system; we're containerizing these. Well, we're in the process of containerizing these. We have them working, but we don't have all the versions yet. But we run 14.3, at least for this project; we run 14.3, 15, 15.1, and that should work. Let's just revert things, because that, you know, works. Probably should have done a classic...but it's fine. So yeah. We're going to test. I feel like I keep saying this over and over. We're going to test everything. We'll actually let this one run to show you kind of the end result of what we get. It should only take a little bit. And so we'll test this, make sure it's going, and you can see the logs. We're getting decent information out of what is happening and where it is; like it'll tell you the runner that is running. I'm only running on Windows right now. Again, this is a demo and all that, but we should be able to run more. While that's running, I'll just talk about VSCode some more. In VSCode, there are also snippets and things, so if you want to make a function, it will create all of the function information for you. We use Natural Docs, again, that was stolen from the Hamcrest team, as our development documentation. So it'll just put everything in a Natural Docs form. So again, the idea is helping us do our jobs and forcing us to do our jobs a little better, with a little more gusto. Wayne Levin For the documentation? Vince Faller So that's for the documentation, yeah. Wayne Levin As we're developing, we're documenting at the same time. Vince Faller Yep. Absolutely. You know, it also has snippets for for loops, while loops, For Each Row, stuff like that. Is this done yet? It's probably done, yep. So we get our green checkmark. Now it's going to run on all of the systems. If we can go back to here, you'll just see it. Open JMP. It'll run some tests, probably will open Big Class, then close itself all down. Wayne Levin So we're doing this largely because many of our clients have different versions of JMP deployed, and they want a particular add-in, but they have, you know, different versions out there in the field. We also test against the early adopter versions of JMP, which is a service to JMP because we report bugs. But also for the clients, it's helpful because then they know that they can upgrade to the new version of JMP. They know that the applications that we built for them have been tested.
And that's just routine for us. Good. Vince Faller You're done. You're done. You're done. Change to... I can talk about... And this is just going to run; we can movie magic this if you want to, Meg, just to make it run faster. Basically, I just want to get to staging but it takes a second. Is there anything else you have to say, Wayne, about it? Cool. I'll put that... Something I can say: when we're staging, we also have our documentation in MkDocs. So it'll actually run the MkDocs version, render it, put the help into the help files, and basically be able to create a release for us, so that we don't have to deal with it. Because creating releases is just a lot of effort. Encrypting. It's almost done. Probably should just have had one preloaded. Live demos, what are you gonna do. Run. Oh, one thing I definitely want to do. So, the last thing that the pipeline actually does is check that we actually recorded our time, because, you know, if we don't actually record our time spent, we don't get paid, so it forces us to do it. Great, great time. Vince Faller So the job would have failed without that. I can just show some jobs. Trying. That's the Docker one. We don't want that. So you can see that gave us our successes. No failures. No unexpected throws. That's all stuff from Hamcrest. Come on. One more. Okay, got to staging. One thing that it does is it creates the repositories. It creates them fresh every time, so it tries to keep it in a sort of stateless way. Okay, we can download the artifacts now. And now we should have this pipeline demo. I really wish it would have just went there. What. Why is Internet Explorer up? So now you'll see pipeline demo is a JMP add-in. If we unzip it... if you didn't know, a JMP add-in is just a zip file. If we look at that now, you can see that it has all of our scripts in it; it has our foo, it has our bar. If we open those, you can see it's an encrypted file. So this is basically what we would be able to give to the customer, without so much mechanical work. Wayne mentioned less frustrated developers, and personally, I think that's an understatement, because doing this over and over was very frustrating before we got this in place, and this has helped a bunch. Wayne Levin Now, about the encryption: when you're delivering an add-in for use by users within a company, you typically don't want, for security reasons and so on, anyone to be able to go in and deal with the code. So we may deliver the code unencrypted to the client, so the client has their own code unencrypted, but for delivery to the end user, you typically want everything encrypted, just so it can't be tampered with. Just one of those sort of things. Vince Faller Yep, and that is the end of my demo. Wayne, if you want to take it back for the wrap-up. Wayne Levin Yeah, terrific. Sure, thanks very much for that, Vince. So there are a lot of moving parts in this whole system, so it's, you know, basically making sure that we've got code being developed by multiple developers that is not colliding. We're building in the documentation at the same time. And actually, the documentation gets deployed with the application and we don't have to weave that in. We set the infrastructure up so that it's automatically taken care of. We can update that along with the code comprehensively, simultaneously, if you will.
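To make the "documenting as we develop" point concrete, here is a sketch of the kind of Natural Docs-style header the VSCode snippets can generate for a JSL function. Natural Docs is the real documentation tool mentioned above; the specific fields and the function shown are illustrative assumptions, not Predictum's actual template.

/*
	Function: Get Big Class Equation
		Opens Big Class, fits weight versus height, and returns the
		linear fit equation as a character string.

	Parameters:
		none

	Returns:
		A string of the form "weight = <intercept> + <slope> * height".
*/

Because the header sits in the source file, the staging step can render it straight into the deployed help, which is the point Wayne makes above.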
The Hamcrest tests that are going on: for each one of those functions that are written, there are expected results, if you will. So they get compared, and we saw, briefly, there was some problem with that equation there. An R square or whatever came back with a different value, so it broke, in other words, to say hey, something's not right here; I was expecting this output from the function for a use case. So that's one of the things that we get from clients: we build up a pool of use cases that get turned into Hamcrest tests and away we go. There are some other slides here that are available to you when you go and download the slides. So I'll leave that available for you, and here's a little picture of the pipeline that we're employing, and a little bit about code review activity for developers too, if you want to go back and forth with it. Vince, do you want to add anything here about how code review and approval takes place? Vince Faller Yeah, so inside of the merge request it will have the JSL code in the diffs of the code. And again, a big thank you to the people who did Hamcrest, as well, because they also started a lexer for GitHub and GitLab to be able to read JSL, so actually this is inside of GitLab, and it can also read the JSL. It doesn't execute it, but it has nice formatting. It's not all just white text; it's beautiful there. We just go in, like in this screenshot, you click a line, you put in the comment that you want, and it becomes a reviewable task. So we try to do as much inside of GitLab as we can for transparency reasons, and once everything is closed out, you can say yep, my merge request is ready to go. Let's put it into the master branch, main branch. Wayne Levin Awesome. So it's really helping; we're really defining coding standards, if you will, and I don't like the word enforcement, but that's what it amounts to. And it reduces variation. It makes it easier for multiple developers, if you will, to understand what others have done. And as we bring new developers on board, they come to understand the standard and they know what to look for, they know what to do. So it makes onboarding a lot easier, and again, everything's attached to everything here, so you know, supportability and so on. This is the slide I mentioned earlier, just for some resources. So we're using GitLab; I suppose the same principles apply to any Git host generally, like GitHub or what have you. Here's the Community link for Hamcrest. There was a talk in Tucson, that was in 2019, in the old days when we used to travel and get together. That was a lot of fun. And here's the marketplace link for Visual Studio Code. So as Vince said, yeah, we make a lot of use of that editor, as opposed to using the built-in JMP editor, just because it's all integrated. It's just all part of one big application development environment. And with that, on behalf of Vince and myself, I want to thank you for your interest in this, and again, we really want to thank the JMP team, Justin Chilton and company, I'll call out to you. If not for Hamcrest, we would not be on this. That was the missing piece, or the enabling piece, that really allowed us to take JSL development to, basically, the kinds of standards you expect in code development generally in industry.
So we're really grateful for it, and I know that that is propagated out with each application we've deployed. And at this point, Vince and I are happy to take any questions. Send them to info@predictum.com and they'll get forwarded to us and we'll get back to you. But at this point, we'll open it up to Q&A.
Mia Stephens, JMP Principal Product Manager, SAS   Predictive modeling is all about finding the model, or combination of models, that most accurately predicts the outcome of interest. But, not all problems (and data) are created equal. For any given scenario, there are several possible predictive models you can fit, and no one type of model works best for all problems. In some cases a regression model might be the top performer; in others it might be a tree-based model or a neural network. In the search for the best-performing model, you might fit all of the available models, one at a time, using cross-validation. Then, you might save the individual models to the data table, or to the Formula Depot, and then use Model Comparison to compare the performance of the models on the validation set to select the best one. Now, with the new Model Screening platform in JMP Pro 16, this workflow has been streamlined. In this talk, you'll learn how to use Model Screening to simultaneously fit, validate, compare, explore, select and then deploy the best-performing predictive model.   Auto-generated transcript...   Speaker Transcript Mia Stephens model screening. If you do any work with predictive modeling, you'll find that model screening helps you to streamline your predictive modeling workflow. So in this talk I'm going to provide an overview of predictive modeling and talk about the different types of predictive models we can use in JMP. We'll talk about the predictive modeling workflow within the broader analytics workflow, and we'll see how model screening can help us to streamline this workflow. I'll talk about some metrics for comparing competing models using validation data, and we'll see a couple of examples in JMP Pro. First let's talk a little bit about predictive modeling and what predictive modeling is. You've probably been exposed to regression analysis, and regression is an example of explanatory modeling. In regression we're typically interested in building a model for a response, or Y, as a function of one or more Xs, and we might have different modeling goals. We might be interested in identifying important variables. So what are the key Xs, or key input variables, for example, in a problem solving setting that we might focus on to address the issue? We might be interested in understanding how the response changes, on average, as a function of the input variables. For example, a one unit change in X is associated with a five unit change in Y. So this is classical explanatory modeling, and if you've taken statistics in school, this is probably how you learned about regression. Now, to contrast, predictive modeling has a slightly different goal. In predictive modeling our goal is to accurately predict or classify future outcomes. So if our response is continuous, we want to be able to predict the next observation, the next outcome, as precisely or accurately as possible. And if our response is categorical, then we're interested in classification. And again we're interested in using current data to predict what's going to happen at the individual level in the future. And we might fit and compare many different models, and in predictive modeling we might also use some more advanced models. We might use some machine learning techniques like neural networks. And some of these models might not be as easy to interpret, and many of them have a lot of different tuning parameters that we can set. And as a result, with predictive modeling we can have a problem with overfitting.
What overfitting means is that we fit a model that's more complex than it needs to be. So with predictive modeling we generally use validation, and there are several different forms of validation we can use. We use validation for model comparison and selection, and fundamentally, it protects against overfitting but also underfitting. Underfitting is when we fit a model that's not as complex as it needs to be; it's not really capturing the structure in our data. Now, in the appendix at the end of the slides, I've pulled some slides that illustrate why validation is important, but for the focus of this talk I'm simply going to use validation when I fit predictive models. There are many different types of models we can fit in JMP Pro, and this is not by any means an exhaustive list. We can fit several different types of regression models. If we have a continuous response, we can fit a linear regression model; if our response is categorical, a logistic regression model. But we can also fit generalized linear models and penalized regression methods, and these are all from the fit model platform. There are many options under the predictive modeling platform, so neural nets, neural nets with boosting and different numbers of layers and nodes, classification and regression trees and more advanced tree-based methods, and several other techniques. And there are also a couple of predictive modeling options from the multivariate methods platform, so discriminant analysis and partial least squares are two additional types of models we can use for predictive modeling. And by the way, partial least squares is also available from fit model. And why do we have so many models? In predictive modeling, you'll often find that no one model or modeling type always works the best. In certain situations, a neural network might be best, and neural networks are generally pretty good performers. But you might find in other cases that a simpler model actually performs best, so the type of model that fits your data best and predicts most accurately is based largely on your response, but also on the structure of your data. So you might fit several different types of models and compare these models before you find the model that fits or predicts most accurately. So, within the broader analytic workflow, where you start off with some sort of a problem that you're trying to solve, and you compile data, prepare the data, and explore the data, predictive modeling is down in analyze and build models. And the typical predictive modeling workflow might look something like this, where you fit a model with validation. Then you save that formula to a data table or publish it to the formula depot. And then you fit another model and you repeat this, so you may fit several different models and then compare these models. And in JMP Pro, you can use the model comparison platform to compare the performance of the models on the validation data, and then you choose the best model, or the best combination of models, and then you deploy the model. And what's different with model screening is that all of the model fitting, comparison, and selection is done within one platform, the model screening platform. So we're going to use an example that you might be familiar with, and there is a blog on model screening using these data that's posted in the Community, and these are the diabetes data.
So the scenario is that researchers want to predict the rate of disease progression one year after baseline. So there are several baseline measurements, and then there are two different types of responses. The response Y is the quantitative measure, a continuous measure, so this is the rate of disease progression, and then there's a second response, Y Binary, which is high or low. So Y Binary can represent a high rate of progression or a low rate of progression. And the goal of predictive modeling here is to predict patients who are most likely to have a high rate of disease progression, so that corrective actions can be taken to prevent this. So we're going to take a look at fitting models in JMP Pro for both of these responses, and we'll see how to fit the same models using model screening. And before I go to JMP, I just want to talk a little bit about how we compare predictive models. We compare predictive models on the validation set or test set. So basically what we're doing is we fit a model to a subset of our data called our training data. And then we apply that model to data that were held out (typically we call these validation data) to see how well the model performs. And if we have continuous responses, we can use measures of error, so root mean square error (RMSE) or RASE, which is root average squared error, so this is the measure of prediction error. AAE, MAD, MAE, these are measures of average error, and there are different R square measures we might use. For categorical responses, we're most often interested in measures of error or misclassification. We might also be interested in looking at an ROC curve, looking at AUC (area under the curve), or sensitivity, or specificity, the false positive rate, the false negative rate, the F1 score and MCC, which is the Matthews correlation coefficient. So let me switch over to JMP. Tuck this away and I'll open up the diabetes data. And let me make this big so you can see it. So these are the data. There's information on 442 patients, and again, we've got a response Y, which is continuous, and this is the amount of disease progression after one year. And we'll start off looking at Y, but then we also have the second variable, Y Binary. We've got baseline information. And there's a column Validation. So again, when we fit models, we're going to fit the models only using the training data, and we're going to use the validation data to tell us when to stop growing the model and to give us measures of how good the model actually fits. Now, to build this column, there is a utility under predictive modeling called Make Validation Column. And this gives us a lot of options for building a validation column, so we can partition our data into training, validation, and a test set. And, in most cases, if we're using this sort of technique of partitioning our data into subsets, having a test set is recommended. Having a test set allows you to have an assessment of model performance on data that wasn't used for building the models or for stopping model growth, so I'd recommend that, even though we don't have a test set in this case. So let's say that I want to find a model that most accurately predicts the response. So as you saw, there are a lot of different models to choose from. I'll start with fit model. And this is usually a good starting point.
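Before the Fit Model step that follows, here is a minimal JSL sketch of the idea behind a validation column. It is not the Make Validation Column utility itself; it simply assigns rows at random by formula, assuming the Diabetes sample table that ships with JMP and a 70/30 training/validation split (as noted later in the talk, JMP treats the lower value as training).

// Open the sample table and add a simple random validation column
// (0 = training, 1 = validation); named "My Validation" to avoid clashing with any existing column
dt = Open( "$SAMPLE_DATA/Diabetes.jmp" );
dt << New Column( "My Validation",
	Numeric, "Nominal",
	Formula( If( Random Uniform() < 0.7, 0, 1 ) )
);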
So I'm going to build a regression model with Y as our response, the Xs as model effects, and I'll only put in main effects at this point. And I'm going to add a validation column. Now from fit model, the default here is going to be standard least squares, but there are a lot of different types of models we can fit. I'm simply going to run the least squares model. A couple of things to point out here. Notice the marker here, V. Remember that we fit our model to the training data, but we also want to see how well the model performs on the validation data, so all of these markers with a V are observations in the validation set. Because we have validation, there is a crossvalidation section here, so we can look at R square on the training set and also on the validation set, and then RASE. And oftentimes what you'll see is that the validation statistics will be somewhat worse than the training statistics, and the farther off they are, the stronger the indication that your model is overfit or underfit. I want to point out one other thing here that's really beyond the scope of this talk, and that's this prediction profiler. The prediction profiler is actually one of the reasons I first started using JMP. It's a really powerful way of understanding your model. And so I can change the values of any X and see what happens to the predicted response, and this is the average, so with these models, we're predicting the average. But notice how these bands fan out for total cholesterol and LDL, HDL, right. And this is because we don't have any data out in those regions. So the new feature I want to point out really quickly, and again this is beyond the scope of this talk, is this thing called extrapolation control. And if I turn on a warning and drag total cholesterol, notice that it's telling me there's a possible extrapolation. This is telling me I'm trying to predict out in a region where I really don't have any data, and if I turn this extrapolation control on, notice that it truncates this bar, this line, so it's basically saying you can't make predictions out in that region. So it's something you might want to check out if you're fitting models. It's a really powerful tool. So let's say that I've done all the diagnostic work. I've reduced this model, and I want to be able to save my results. Well, there are a couple of ways to do this. I can go to the red triangle, Save Columns, and save the prediction formula. So this saves the linear model I've just built out to the data table. So you can see it here in the background. And then, if I add new values to this data table, it'll automatically predict the response. But I might want to save the model out in another form, so to do this, I might publish this formula out to the formula depot. And the formula depot is actually independent of my data table, but what this allows me to do is I can copy the script and paste it into another data table with new data to score new data. Or I might want to generate code in a different language to allow me to deploy this within some sort of a production system. I'm going to go ahead and close this. This is just one model. Now is it possible, if I fit a more complicated or sophisticated model, that it might get better performance? So I might fit another model. So, for example, I'm just gonna hit recall, and I might change the personality from standard least squares to generalized regression. And this allows me to specify different response distributions.
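As an aside before the generalized regression run continues: the least squares portion of this workflow can also be scripted. A hedged sketch, assuming the Diabetes table opened above, a few of its baseline column names, and its Validation column; the Prediction Formula message is meant as the scripted equivalent of the red-triangle Save Columns item, and message names can vary slightly by JMP version, so check the Scripting Index.

// Least squares fit of Y with a validation column, then save the prediction formula
fm = dt << Fit Model(
	Y( :Y ),
	Effects( :Age, :BMI, :BP, :Total Cholesterol, :LDL, :HDL ),
	Validation( :Validation ),
	Personality( "Standard Least Squares" ),
	Run
);
fm << Prediction Formula;   // intended equivalent of red triangle > Save Columns > Prediction Formula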
And I'll just stick with normal and click run. So this will allow me to fit different penalized methods and also use different variable selection techniques. And if you haven't checked out generalized regression, it's a super powerful and super flexible modern modeling platform. I'm just going to click go. And let's say that I fit this model and I want to be able to compare this model to the model I've already saved. So I might save the prediction formula out to the data table. So now I have another column in the background in the data table. Or I might again want to publish this to the formula depot, so now I've got two different models there. And I can keep going. So this is just one model from generalized regression. I can also fit several different types of predictive models from the predictive modeling menu, so for example, neural networks or partition or bootstrap forest or boosted trees. Now, typically what I would have to do is fit all these models and save the results out either to the data table or to the formula depot, and if I save them to the data table, I can use this model comparison platform to compare the different competing models. And I might have many models; here I only have two. And I don't actually even have to specify what the models are, I only need to specify validation. And I actually kind of like to put validation down here in the by field. So this gives me my fit statistics for both the training set and the validation set, and I'm only going to look at the statistics for the validation set. So I would use this to help me pick the best performing model. And what I'm interested in is a higher value of R square, a lower value of RASE (the root average squared error), and a lower average absolute error. And between these two models, it looks like this fit least squares regression model is the best. Now, if I were to fit all the possible models, this can be quite time-consuming. So instead of doing this, what's new in JMP Pro 16 is a platform called model screening. And when I launch model screening, it has a dialog at the top, just like we've seen, so I'll go ahead and populate this. And I'll add validation, but over on the side, what you see is that I can select several different models and fit these different models all at one time. So decision tree, bootstrap forest, boosted tree, K nearest neighbors, right, I can fit all of these models. And it won't run models that don't make sense; I have logistic regression as one of my options, but it won't run a logistic regression model with a continuous response. Notice that I've also got this option down here, XGBoost. And the reason that appears is there is an add-in, and it actually uses open source libraries, and if you install this add-in (it's available on the JMP User Community, it's called XGBoost and it only works in JMP Pro), it'll automatically appear in the model screening dialog. So I'm just going to click OK, and when I click OK, what it's going to do is go out and launch each of these platforms. And then it's going to pull all the results into one window. So I clicked okay. I don't have a lot of data here, so it's very fast. And under the details, these are all of the individual models that were fit. And if I open up any one of these, I can see the results, and I have access to additional options that will be available from that menu.
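For scripting, the same launch can be reproduced with the Model Screening platform (JMP Pro 16 and later). A hedged sketch: the Y, X, and Validation roles mirror the dialog, but the random seed option name and any per-method toggles are assumptions here and are best confirmed against a saved platform script from your JMP version.

// Launch Model Screening on the continuous diabetes response
ms = dt << Model Screening(
	Y( :Y ),
	X( :Age, :BMI, :BP, :Total Cholesterol, :LDL, :HDL ),
	Validation( :Validation ),
	Set Random Seed( 12345 )   // assumed option name; a seed gives repeatability, as in the demo
);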
So I'm going to tuck away the details. And by default, what it's done is it's reported statistics for the training data, but it shows me the results for the validation data, so I can see R square and I can see RASE. And by default it's sorting in order of RASE, where lowest is best. But I've got a few little shortcut options down here, so if I want to find the best models, and it could be that R square is best for some models but RASE is better for others, I'm going to click Select Dominant. In this case, it selected neural boosted, so across all of these models, the best model is neural boosted. And if I want to take a closer look at this model, I can either look at the details up here under Neural, or I can simply say Run Selected. Now I didn't talk about this, but in the dialog window there's an option to set a random seed. And if I set that random seed, then the results that launch here will be identical to what I see here. So this is a neural model with three nodes using the TanH function, but it's also using boosting. In designing this platform, the developers did a lot of simulations to determine the best starting point and the best default settings for the different models. So neural boosted is the best. And if I want to be able to deploy this model, now what I can do is save the script, or I could run it, or I can save it out to the formula depot. So this is with a continuous response, and there are some other options under the red triangle. What if I have a categorical response? For a categorical response, I can use the same platform. So again, I'll go to model screening. I'll click Recall, but instead of using Y, I'll use Y Binary. And I'm not going to change any of the default methods. I will put in a random seed, for example, 12345, I'm just grabbing a random number. And what this will do is give me repeatability. So if I save any model out to the data table or to the formula depot, the statistics will be the same and the model fit will be the same. A few other options here. We might want to turn off some things like the reports. We might want to use a different cross-validation method, so this platform includes K fold validation, but it also offers nested K fold cross-validation. And we can repeat this. So really nice. Sometimes partitioning our data into training, validation, and test isn't the best, and K fold can actually be a little bit better. And there are some additional options at the bottom. So we might want to add two-way interactions. We might want to add quadratic effects. Under additional methods, this will fit additional generalized regression methods. So I'm just going to go ahead and click OK. OK. It runs all the models and again, this is a small data set. It's very quick. Right, the look and feel are the same, but now the statistics are different. So I've got the misclassification rate. I've got an area under the curve. I've also got some R square measures and then root average squared error. I'm going to click Select Dominant, and again, the dominant method is neural boosted. Now, what if I want to be able to explore some of these different models? So the misclassification rate here is a fair amount lower than it is for stepwise. The AUC is kind of similar, it's lowest overall but maybe not that much better. And let me grab a few of these. So if I click on a few of these, maybe I'll select these four, these five.
I can look at ROC curves, if I'm interested in looking at ROC curves to compare the models. And there are some nice controls here to allow me to turn off models and focus on certain models. And a new feature that I'm really excited about is this thing called a decision threshold. What the decision threshold allows us to look at, and it starts by looking at the training data, is a snapshot of our data. The misclassification rate is based on a cutoff of .5 for classification. So for each of the models, it's showing me the points that were actually high and low. And if we're focusing in on the high, the green dots were correctly classified as high, and the ones in the red area were misclassified, so it's showing us correct classifications and incorrect classifications, and then it's reporting all the statistics over here on the side. And then down below we see several different metrics plus a lot of graphs allowing us to look at false classifications and also true classifications. I'm going to close this and look at the validation data. So why is this useful? Well, you might have a particular scenario where you're interested in maximizing sensitivity while maintaining a certain specificity. And there are some definitions over here: sensitivity is the true positive rate; specificity is the true negative rate. This is a scenario where we want to look at disease progression, so we want to make sure we are maintaining a high sensitivity rate while also making sure that our specificity is high, all right. So what we can do with this is there's a slider here, and we can grab this slider and see how the classifications change as we change the cutoff for classification. So I think this is a really powerful tool when you're looking at competing models, because with a cutoff of .5 some models might have the best overall misclassification rate, but you might also have scenarios where, if you change the cutoff for classification, different models perform differently. So, for example, if I'm in a certain region here, I might find that the stepwise model is actually better. Now to further illustrate this, I want to open up a different example. And this example is called credit card marketing. And if I go back to my slides just to introduce this scenario: this is a scenario where we've got a lot of data based on market research on the acceptance of credit card offers. The response is, was the offer accepted. And this is a scenario where only 5.5% of the offers in the study were actually accepted. Now there are factors that we're interested in, so there are different types of rewards that are offered, and there are different mailer types. So this is actually a designed experiment. We're going to, kind of, ignore that aspect of this study. And there's also financial information, so we're going to stick to one goal in this example, and that's the goal of identifying customers who are most likely to accept the offer. And if we can identify the customers that are most likely, in this scenario we might send offers only to the subset that is more likely to accept the offer and ignore the rest. So that's the scenario here. And I'm going to open these data. I've got 10,000 observations and my response is Offer Accepted.
And I've already saved the script to the data table, so I've got air miles, mailer type, income level, credit rating, and a few other pieces of financial information. I ran a saved script, and it's going through running all the models. Neural, in this case, will take the most time because it's running a boosted neural. It will take a few more seconds. It's running support vector machines. Support vector machines will time out and actually won't run if I have more than 10,000 observations. I'm going to give it another second. I'm using standard validation for this, where I've got a validation column. And in this case, I've got a column of zeros and ones, and JMP will recognize the zeros for the training data and the ones for the validation data. Okay, there we go. Okay, so it ran, and if you're dealing with a large data table, there is a report you can run to look at elapsed times. And for this scenario, support vector machines actually took the longest time, and this is why, at times, it won't run if we have more than 10,000 observations. So let's look at these. Our best model, if I select dominant, is a neural boosted and a decision tree, but I want to point something out here. Notice the misclassification rate. The misclassification rate is identical for all of the models, except support vector machines. And why is this the case? Well, if I run a distribution of Offer Accepted (let me make this a little bit bigger so we can see it) and just focus in on the validation data, notice that only 0.063 of our observations were Yes. This is exactly what our model predicted. And why is it doing this? I'm going to again ask for the decision threshold. And focusing on this graph here, and this graph has a lot of uses, in this case what it shows us is that our cutoff for classification is .5, but none of our fitted probabilities were even close to that, right. So as a result, the model either classified the no's correctly as no's or classified the yeses as no's. It never classified anything as a yes, because none of the probabilities were greater than .5. So if I stretch this guy out, right, I can see the difference in these two models. The top probability was around .25 for the neural boosted, and for the decision tree it was about .15. And notice that the decision tree is basically doing a series of binary splits, so I've got several predicted values, whereas for neural boosted it's showing me a nice random pattern in the points. So let me change this to something like .12. Right, and it cut off at .12; in fact, if I slide this around, notice that the lower I get, I actually start getting (I'm going to turn on the metrics here) some true positives. And I start getting some false positives. So as I drag this, you can see it in the bar, but the bar is kind of small, right. Neural boosted, I'm starting to see some true positives and some false positives. And now you start seeing them. As soon as I get past this first set of points, I start seeing it for the decision tree. So, using a cutoff of .5 doesn't make sense for these data, and again I might try to find a cutoff that gives me the most sensitivity while maintaining a decent level of specificity. In this case, I'm going to point out these two other statistics. F1 is the F1 score, and this is really a measure of how well we're able to capture the true positives.
MCC is the Matthews correlation coefficient, and this is a good measure of how well it classifies within each of the four possibilities. So I can have a false positive, a false negative, a true positive, a true negative. And I didn't actually say what corresponds to the boxes here, but I've got four different options. MCC is a correlation coefficient that falls between minus one and plus one, and it measures how well I'm predicting in each one of those four boxes. So I might want to explore a cutoff that gives the maximum F1 value or the maximum MCC value. And let's say that I drag this way down. Notice that the sensitivity is growing quickly and specificity is starting to drop, so maybe at around .5, right, I reach a point where I'm starting to drop off too far in specificity. I might find a cutoff, and at the bottom there's this option to set a profit matrix. If I set this as my profit matrix, basically what's going to happen is it will allow me to score new data using this cutoff. So if I set this here and hit okay, right, any future predictions that I make, if I save this out to the data table or to the formula depot, will use that cutoff. And this is a scenario where I might actually have some financial information that I could build into the profit matrix. So, for example, instead of using the slider to pick the cutoff, maybe I have some knowledge of the dollar value associated with my classifications. And maybe if the actual response is a no, but I think they're going to be a yes and I send them an offer, maybe this costs me $5, right, so I have a negative value there. And maybe I have some idea of the potential profit, so maybe the potential profit over this time period is $100, and maybe I've got some lost opportunity. Maybe I say, you know, it's -100 if the person would actually have responded but I didn't send them the offer. So maybe this is lost opportunity, and sometimes we leave this blank. Now if I use this instead, I have some additional information that shows up, so it recognizes that I have a profit matrix. And now if I look at the metrics, I can make decisions on my best model based on this profit. So I'm bringing this additional information into the decision making, and sometimes we have a profit matrix and we can use that directly and sometimes we don't. And this is one of those cases where I can see that the neural boosted model is going to give me the best overall profits. So this is a sneak peek at model screening, and let me go back to my slides. And what have we seen here? Well, we talked about predictive modeling and how predictive modeling has a different goal than explanatory modeling. Our goal here is to accurately predict or classify future outcomes, so we want to score future observations. And we're typically going to fit and compare many models using validation and pick the model, or the combination of models, that does the best job, that predicts most accurately. Model screening really streamlines this workflow, so you can fit many different models at the same time with one platform. And I really only went with the defaults, so I can fit much more sophisticated models than I actually fit there. It makes it really easy to select the dominant models and explore the model details. We can fit new models from the details.
This decision threshold, if you're dealing with categorical data, allows you to explore cutoffs for classification, and it also integrates the ability to include a profit matrix. And for any selected model, we can deploy the model out to the formula depot or save it to the data table. So it's a really powerful new tool. For more information: the classification metrics, I know before I saw the F1 score and the Matthews correlation coefficient, those statistics were relatively new to me. To make sense of sensitivity and specificity, this Wikipedia post has some really nice examples and a really nice discussion. There are also some really nice resources for predictive modeling and also model screening. In the JMP User Community, there's a new path, Learn JMP, that has access to videos, Mastering JMP series videos. There was a really nice talk last year at JMP Discovery in Tucson by Ruth Hummel and Mary Loveless on Which Model When, and it does a nice job of talking about different modeling goals and when you might want to use each of the models. If you're brand new to predictive modeling, in our free online course, STIPS, which is Statistical Thinking for Industrial Problem Solving, Module 7 is an introduction to predictive modeling, so I'd recommend this. There is a model screening blog that uses the diabetes data that I'll point out, and I also want to point out that there's a second edition of the book Building Better Models with JMP Pro coming out within the next couple of months. They don't have a new cover yet, but they include model screening in that book. So that's all I have. Please feel free to post comments or ask questions, and I hope you enjoy the rest of the conference. Thank you.
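As a quick reference to go with those resources, the classification metrics mentioned in this talk (sensitivity, specificity, the F1 score, and MCC) have standard textbook definitions in terms of the counts of true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN) at a given classification cutoff:

sensitivity (true positive rate) = TP / (TP + FN)
specificity (true negative rate) = TN / (TN + FP)
F1 = 2 TP / (2 TP + FP + FN)
MCC = (TP * TN - FP * FN) / sqrt( (TP + FP)(TP + FN)(TN + FP)(TN + FN) )

MCC falls between -1 and +1, as described above. F1 does not involve the true negatives at all, which is one reason it is useful when the Yes class is rare, as in the credit card example.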
Yassir EL HOUSNI, R&D Engineer/Data Scientist, Saint-Gobain Research Provence Mickael Boinet, Smart Manufacturing Group Leader, Saint-Gobain Research Provence   Working on data projects across different departments such as R&D, production, quality and maintenance requires taking a step-by-step approach to the pursuit of progress. For this reason, a protocol based on Knowledge Discovery in Databases (KDD) methodology and Six Sigma philosophy was implemented. A real case study of this protocol using JMP as a supporting tool is presented in this paper. The following steps were used: data preprocessing, data transformation and data mining. The goal of this case study was to improve the technical yield of a specific product through statistical analysis. Due to the complexity of the process (multi-physics phenomena: chemical, electrical, thermal and time), this approach has been combined with physical interpretations. In this paper, the data aggregation (coming from more than 100 sensors) will be explained. In order to explain the yield, decision tree learning was used as the predictive modelling approach. Decision tree learning is a method commonly used in data mining. The goal is to create a model that predicts the value of a target variable based on several input variables. In our case, a model based on three input variables was used to predict the yield.   Auto-generated transcript...   Speaker Transcript YASSIR EL HOUSNI Hello. I am Yassir El Housni, R&D engineer and data scientist in the smart manufacturing team of Saint-Gobain Research Provence in Cavaillon, France. We are working for the ceramic materials business units. In this post, we have two parts. In the first we will present the data project life cycle that we propose for manufacturing data projects. And in the second, we will present two use cases from Saint-Gobain ceramic materials industries. Working on data projects across different departments, such as R&D, production, quality and maintenance, requires taking a step-by-step approach to pursue progress. For this reason, we implemented a protocol based on the Knowledge Discovery in Databases methodology and the Six Sigma philosophy, also known as DMAIC: define, measure, analyze, improve, and control. We define in this infinity loop seven steps to pursue in order to correctly manage a data analysis project. In all of them, we ensure a good understanding of the process, because we believe it's a key to successful data projects in an industrial world. For example, to detect the variation in the process we use the SIPOC or flow chart map, and to detect the causes of variation we use our problem-solving toolbox, which contains a ??? or Ishikawa diagram. The infinity loop also presents another route to achieve continuous improvement. In the next slides we will detail the approach, step by step. Let's start with defining the project. It's necessary to clearly define three elements before starting a data project. We propose here some questions which we found very useful to clearly define the elements of the trade(?). First of all, business need definition. Frequently, the target of a data project in manufacturing is to optimize a process, maximize a yield, improve the quality of a specific product or reduce the consumption of energy. Under the definition of the business opportunity, we should know how it will be used: is the target need just visualization, or analytics? And after that, the impact should bring a quantified gain to the business. Secondly, data availability and usability.
In it we launch a diagnostic analysis of data quality. It is very important in this step to determine the feasibility of the data project. And then the team setup: a person from the data team, a person from the business unit team, and a person from the plant, a process engineer with a Six Sigma Green or Black Belt. Let's move to the second step, data preparation, with transformation, integration, and cleansing. It's an important step which consumes a lot of time in data projects. For example, we have here different sources of data and we need to centralize them in one table. Mainly we use X for inputs and Y for outputs. In this step we use different tools in JMP such as missing value processing, the ??? of constant variables, and of course the JMP data table tools, which ensure the right SQL-style requests to correctly transform tables. The third step is about exploring data with dynamic visualization, and with JMP we have a large choice for visualization. For example, plot the distribution of a variable and estimate the law that it follows; detect the outliers with box plot diagrams; nonlinear regression between two variables; contour or density mapping to determine the principal placement of the concentration of each population; and we have a large choice of ways to plot it. ???, we use them usually in our work and we found them very useful. The fourth step is the development of the model, and it depends on the kind of analysis that we need: is the target to explain or to predict? The first is about links between variables and serves to explain patterns in data. The second is about a formula and serves to predict patterns in data. Generally we cut our data sets into three blocks: 70% for training, 20% for testing, and 10% for validation. And sometimes, if we have a small amount of data, we use 70% for training and 30% for validation. If the model is good, we request a new set of data in order to drive decision making. We have two approaches, supervised and unsupervised learning. Today at Saint-Gobain we use the standard version of JMP, and we have access to the supervised learning tools such as linear regression, decision trees and neural networks. We work a lot today with decision trees because they give us relevant results, which help us to resolve challenges in the ceramic materials industry. The fifth step is about finding the optimal solution. Sometimes it's just one solution, but in other cases it's a combination of several models. And to ensure good sense, we add some constraints, for example the min, max and step of variation of each variable. The JMP Profiler gives us large possibilities to quickly optimize solutions. From here, the next step is passed to the plant, with the support of our process engineer with the Six Sigma Green or Black Belt. The sixth step is about implementing the best solution in the plant, governed by only one representative model. For example, we implement the control charts of output Y1 and analyze the different variations. In the seventh step, we monitor the model effectiveness and we visualize the global gain of working on our data project. For example, in the pie charts, we see the impact on the global yield. And last but not least, the preparation of step one for a new data project, to ensure continuous improvement and the continuity of the infinity loop. That was all about the data project life cycle, and now we will present two real case studies of the protocol using JMP.
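Before the case studies, here is a minimal JSL sketch of the "centralize different sources into one table" step described in data preparation. The table names, file names, and the Batch ID key are hypothetical; the Join message itself is the standard Tables > Join operation.

// Hypothetical sensor and quality tables joined on a shared batch identifier
dt sensors = Open( "sensor_data.jmp" );
dt quality = Open( "quality_data.jmp" );
dt all = dt sensors << Join(
	With( dt quality ),
	By Matching Columns( :Batch ID = :Batch ID ),
	Drop Multiples( 0, 0 ),
	Include Nonmatches( 0, 0 ),
	Output Table( "Xs and Ys" )
);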
In the two examples we studied the same process technology, the electric arc furnace, but with different products and different targets. In this technology we have complex multi-physical phenomena: electrical, thermal, chemical, time and other physical effects. Here we have, for example, more than 100 process variables that come from different kinds of sensors. The business need was to explain the global yield of a specific product, JO7. In it we detect many kinds of defects, from Defect #1 and #2 up to #N. To prepare the data sets correctly, we used Pareto charts, outlier processing with the Mahalanobis distance, recoding of attributes to correct typing errors, and missing value processing. In step three, explore, we present here just an example of the correlation that we studied between inputs, to reduce the number of variables before working with the models. As a result, we found a decision tree with just three variables; the goal here was to explain why the yield was not at its maximum. So we have a decision tree with just three variables out of 100. For the model we used 70% of the data for training, because we do not have a large data set, and 30% for validation, and we got good results with a high R square: as you see, it is more than 70%. So the message that we passed to the plant is that with a specific setting of X1, X2, and X3 alone we can explain the global yield, and if we need to maximize Y, the percentage of this yield, we need a specific setting just for X1 and X3, and the global yield should improve rapidly. That is the point, at the cluster of points here. And for each project, we also give the plant the physical understanding of each parameter. For the second example, with the same technology but another product, we have just 80 process variables and the target was to improve the number of pieces with no defect, D1. The need is about explaining the quality of a specific product, so we used the same methodology for the steps. For example, here we studied the same kind of correlation between inputs to reduce the number of variables that we put in the model. And, as a result, we also used a decision tree, but here we found 12 variables that explain this global yield with good results: as you see, the R square was 84%, the RMSE was 3% and the sample size was 287. Here we used the cross-validation method because we have a very small data table. The first parameter was very important; as you see, it contributes 50%. And it was difficult to explain that with the 12 variables to the plant. So when we plot just the first variable, we see visually that we can define a threshold with the variable X1 alone, and with it the global yield should improve rapidly. Thank you.
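As a footnote to the two case studies above, the correlation screening between inputs can be sketched as follows in Python; the actual analysis was done in JMP, and the 0.9 threshold and the way ties are resolved are assumptions of this sketch, not values from the talk.

# Hypothetical sketch: drop one variable from every highly correlated input pair
# before modeling, so ~100 sensor variables shrink to a smaller candidate set.
import pandas as pd

def drop_correlated(X: pd.DataFrame, threshold: float = 0.9) -> pd.DataFrame:
    corr = X.corr().abs()
    cols = corr.columns
    to_drop = set()
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            if corr.iloc[i, j] > threshold and cols[i] not in to_drop and cols[j] not in to_drop:
                to_drop.add(cols[j])          # keep the first member of each correlated pair
    return X.drop(columns=sorted(to_drop))

# Example usage with a hypothetical table of sensor inputs:
# X_reduced = drop_correlated(pd.read_csv("sensors.csv"), threshold=0.9)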
Arhan Surapaneni, Student, Stanford OHS Siddhant Karmali, Student, Stanford OHS Saloni Patel, Student, Stanford OHS Mason Chen, Student, Stanford OHS   Our projects include topics ranging from high-level analysis of gambling utilizing hypothesis testing tools, probabilistic calculations and Monte Carlo simulation (with Java vs. Python programming) to strategic leadership development through quantification of troop strength in the Empire: Four Kingdoms video game. These projects carefully consider decision-making scenarios and the behaviors that drive them, which are fundamental to the domains of cognitive psychology and consciousness. The tools and strategies used in these projects can facilitate the creation of user interfaces that incorporate statistics and psychology for more informative user decision-making, for example in minimizing players' risk of compulsive gambling disorder. The projects are about the game of poker and use eigenvector plots, probability and neural network-esque Monte Carlo simulations to model gambling disorders through a game consisting of AKQJ cards. Offering a subtle analytical approach to gambling, the economic drawbacks are explained through multi-step, realistic statistical modeling methods.     Auto-generated transcript...   Speaker Transcript Siddhant Karmali Hi everyone, this is Siddhant Karmali, Mason Chen and Arhan Surapaneni, and we're working on optimizing the AKQJ game for real poker situations. COVID-19 has affected mental health and can worsen existing mental health problems. The stressors involved in the pandemic, namely fear of disease or losing loved ones, may impact people's decision-making ability and can lead them to addictive behaviors. Addiction to gambling is one such behavior that has increased due to an increase in the site traffic of online gambling sites. This project analyzes how different situations in the game of poker affect how people make irrational decisions, including situations that may lead to problem gambling. We developed a simplified model of poker that only uses the ace and the face cards, so A, K, Q, and J, which increases the probabilities of certain winning hands; we called it AKQJ for ace, king, queen, jack. The variables in this model are the card value, the number of players, the number of betting rounds, and whether cards are open or hidden. The objective of this model is to simplify the complicated probability calculations for the winning outcomes in a full game of poker. And we will extend this objective to the idea that, since poker in real life has more than one betting round, we can show that this model is effective even in different variations of poker with different numbers of betting rounds. So this is the outline of the project. First we researched emotional betting and compulsive gambling: what are the risk factors for compulsive gambling, how do compulsive gamblers think, and why do they gamble? We found that people gamble for thrills, just like people who have addictions to drugs use the drug for a high or thrill. So we infer that gambling as an addiction must hit the same chords in the brain that are involved in the reward system. And then we went to our technology, which was using hidden and open cards in real cases.
So hidden and open cards are...so open cards are the cards that a player keeps face up and the hidden cards are face down, and only the player knows its identity. The and then we made two separate algorithms. There was a comprehensive algorithm and a worst case algorithm. But comprehensive algorithm is more complicated since all the cards are hidden and it's hard to do calculations, and the worst case algorithm had some open cards so players...or our modeled players could infer whether you take the bet or not. And so this was our engineering part. We used JMP to model players play styles. And we also used Java and Python programming, as Arhan will show, to generate...to randomly generate card situations, and we calculated the probabilities and conducted correlation and regression tests in JMP. So hidden...hidden and open cards. Open cards are, as I mentioned, open cards are the play...cards that a player keeps face up so other players can see it. And hidden cards are facedown and only the player knows its identity. The comprehensive algorithm, which ...which is what...usually what happens in a real game of poker where players have to try and calculate the probability of them winning against another person or them winning against their opponent, based on their current hand. And in a comprehensive algorithm it's hard to do, since all the cards are hidden and you don't know which which card which player has. And the open cards make AKQJ game easier calculation wise. And the number of hidden cards increases with the number of betting rounds. so the first case we did was with one round and six players, which had six hidden cards. Then we have one betting round and five players, which had seven hidden cards and so on. So earlier, we...or in the model, there were six players given labels A through F. We assign them probability characteristics, which are the percentages of confidence they have to make a bet. A's is 0%, B's is 15%, C's is 30%, D's is 45%, E's is 60% and F is 75%. And F's 75% probability means that unless they are...unless they are 75% sure...at least 75% sure that they will win against that person...their opponent, then they will not take the bet, so it means they're very, very conservative with their betting. a general poker case, which is the comprehensive algorithm, and the worst case algorithm. The general method is calculated or, for example, if we're trying to calculate the probability of A winning the poker match, in terms of the general method, we would have to use the probability of A is the probability of A versus A winning versus B times the probability of A winning versus C, all the way to probability of A winning versus F. This takes a very long to calculate and it's cumbersome in a real poker match, since the betting round time can be 45 seconds to a minute and not many people can do this kind of calculation in a minute. So the worst case...so that's where we developed the worst case method. The worst case...we calculated the worst case outcome by seeing which player can make the best hand with the cards they can see, out of four shared cards which are which are open to all players and one hidden card and one open card per player. We use these two algorithms in three different cases. The first case is with one betting round and six players. We have to determine in which cases each player will fold or stay and how many chips they will win or lose. 
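Before walking through the cases, here is a toy Python sketch of the two calculations just described: the general (comprehensive) method multiplies a player's pairwise win probabilities over all opponents, while a worst-case style shortcut looks only at the least favorable matchup. The pairwise values below are made-up placeholders, and this is a simplification, not the authors' actual card-based algorithm.

# Toy sketch under assumed pairwise win probabilities for player A.
from math import prod

pairwise = {"B": 0.55, "C": 0.60, "D": 0.48, "E": 0.52, "F": 0.45}   # P(A beats X), placeholders

p_general = prod(pairwise.values())      # general method: A must beat every opponent
p_worst_case = min(pairwise.values())    # worst-case style: only the strongest opponent matters

confidence_F = 0.75                      # e.g. F only bets when at least 75% sure of winning
print(round(p_general, 3), p_worst_case, p_worst_case >= confidence_F)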
For example, A stays even if they lose chips, because that was one of our modeled players which we knew had a gambling problem, so according to our condition they had to stay. B wins against E but not against anyone else. C does not win in this case. D wins against B and C and ties with A and E. E doesn't win, and F wins against all the other players. Note that because B didn't win enough and because C and E did not win at all, they all fold in the next betting round and lose their chips. Of the ones that stay, considering this is a one-betting-round poker match, the one that is most likely to win is F, since they have the highest worst-case winning probability. And if you go to the previous slide, you can see that for the six-player case, player F's overall probability was very close to 80%, so there is a strong correlation between the AKQJ worst-case method and the general method. The next case is with one betting round and five players. In this case, the confidence values change: A's is 0%, B's is 12%, C's is 25%, D's is 38%, and F's is 65%. Player E was removed, since they lost the most chips and had to fold in the previous round. In this round with fewer players, we see that there are more hidden cards. The number of hidden cards increases with the number of betting rounds and players. With the number of hidden cards increasing, the calculation time may take longer, and this may make players more nervous and unwilling to do those calculations, since they could lose money. In this case, player F didn't win, as shown by how their worst-case winning probability is less than their confidence percentage, so they are forced to fold. This could be because conservative players may not do well in the later stages of the game, because they are too stingy with their money; they do not make the right bets even when the stakes are higher. The next case is with four players, and we did this test to confirm whether player F wins or not; if F does not win, we can say with confidence that more conservative players do not do well in the later stages of the game. Note that F has to keep decreasing their winning probability. We also tested whether the worst-case algorithm matches for five players. In the general method, B has an 11% chance of winning, D has a 46% chance of winning, and F has a 48% chance of winning. This is very close to the worst-case values, and so we get a strong correlation with an R squared of 0.998. This worst case makes F win 50% of the time in the five-player match. We also tested this for four players, in which we confirm that F will win 50% of the time in the four-player case. And then the third case is with three betting rounds and six players. In this case the values are the same, and E's, which we added back, is 56%. In the first round F wins; however, as players start folding, like how B, C and D fold, F has to change their confidence level to match the winning probabilities for a round. F's level changes to 60% after the first round and 54% after the second round. These are modeled players, so this change is built into the model.
For a real player this change would be involuntary, indicating that there is nervousness in a conservative play style, which contributes to such players losing in later rounds. Players A and F represent the extreme playing styles, which may be indicative of problem gambling. And this is a quick summary of the betting round calculations. In a game with two betting rounds, we see that F only wins two times out of 20; F's possible hand is not good enough to match up against the opponent's possible hand. This happens in both two betting rounds and three betting rounds. This is due to the nervousness and to the fact that F's probability threshold was way too high; they could not match their confidence level. So perhaps the optimal strategy for doing well in a poker game is to be not too aggressive with your betting and also not too conservative with your betting. Be like player D, who had a 38% probability, so they would have to be at least 38% sure that they would win against their opponent. Based on this, around that spot is a good place to be for poker. We also did the three-player test to confirm that player F has to fold and player D wins in this round. So we can say that player D has arguably the best strategy in this poker model, the AKQJ poker model, with more than one betting round and fewer than six players. We also did the two betting rounds, to show that F doesn't win either. And this is another case which we did to test whether the outcome of F losing held throughout the betting rounds; we did this with three betting rounds and four players. Now, why is this important? We showed previously that players can perform simple calculations, like the worst case, to control their urge to bet even on a losing hand. People with a gambling addiction may be very aggressive with their betting, and even though they know they're losing their money, they will still bet on the off chance, in the slight hope, that they will make a big win. This is an emotional style of betting and it falls right into the trap of the gambler's fallacy, which is thinking that whether you're on a lucky or an unlucky streak, you will get lucky the next time. These new cases of fewer players and more betting rounds in a poker match introduced the idea of nervousness in players, even the most rational ones. F, in the one-betting-round, six-player calculation, was a player who you could see had experience: he or she was able to wait and had a good strategy. We can apply this to a real poker game, because in humans the sympathetic nervous system releases stress hormones, like adrenaline, into the body, eliciting a fight-or-flight response. When you're in a high-stakes situation like a poker game, you know at a superficial level what's at stake, your money and your assets, so you will bet on that to try and increase it.
That's just inherent human nature. More bandwidth during this poker game is given to the amygdala, which is an area of the brain that controls emotion, rather than the rational and involved in executive functioning prefrontal cortex. So the more hidden cards in a round, players may be more nervous about their bets and make worse bets, even if they... even if they're very experienced in poker. And this nervousness may be correlated with the blindly betting nature of compulsive gambling disorder, based on the concepts of risk calculation and gambling for thrill. And our conclusions were that F, or overly conservative players like player F, may not do as well in realistic poker situations. So the ones that do the best are are on the conservative side but are more willing to bet than your regular very, very stingy player. And get our main takeaway is that gambling disorder may be mitigated if players can understand basic statistical calculations and use them in their games. And the future research, which is actually not in the future, it's going to happen or we've done that research already is, we are going to get more reliable data using Python and that's what Arhan's presentation is going to show. Thank you for watching. Arhan Surapaneni Millions of people visit the capital of casinos, Las Vegas, every year to party, enjoy luxuries, but most importantly, gamble. Gambling, though forming a false reward of success in winnings still hides the dark pitfall of financial and social struggles. Our generation is modernizing with technological advancements rapidly becoming the norm and computer programming languages being the needed subjects to thrive in a modern world. Utilizing modern computer programming tools and ???, we are able to take a deeper look into their this psychological problem and analyze tools to help solve them. This is done with authors Siddhant Karmali, Mason Chen and Saloni Patel, with the esteemed help with advisors, Dr Charles ??? and ???. As previously mentioned millions of people attend casinos per year affecting a large population. With a problem that affects such a large sample size, one of the overarching questions concerns how do we express solutions to the problem of gambling and how do we do this efficiently. Through Siddhant's presentation, we have learned much about original methods that help explain gambling, while also presenting the economic misnomers about gambling, proving that being calm and cautious provides the best results, rather than relying on luck of the game for positive results. All of which was done through Java. This method, however, takes 30 hours just for 92 runs. This is not effective and unusable. Using Python however reduces this time to seconds, while also allowing for higher levels of complexity that provides to be beneficial to the overall methods. To recap, the method required a six player model, each of whom receives two cards in the 16 pack of AKQJ. One of the cards is hidden, with four cards placed on the table. Each player has confidence levels determining how often they fold or continue playing. Continuing playing loses three points and folding loses only one point. This is done in an effort to model players that range from blindly gambling to cautious strategists. Usually when conducting a large scale experiment is very tedious and time consuming to run tests at an adequate amount so that the data that is produced is usable. 
An efficient alternative is using computer programming, more specifically the language Python in a Monte Carlo simulation, defined as a broad class of computational algorithms that rely on repeated random sampling to obtain numerical results. The Java program applied here in the diagram is not very effective: it only allows for two random samples, and the process of individually sourcing out each specific sequence is difficult. When we are trying to derive larger sets of data, it is important to change this. Analyzing general differences between the languages, it is important to note that Java is statically typed and compiled, while Python is dynamically typed and interpreted. This difference makes Java faster at runtime and easier to debug, but Python is easier to use and easier to read. Python is also an interpreted language with an elegant syntax, making it a very good option for scripting and rapid application development in many areas. This is applied in the following method. One thing we need to look at is the new randomizer, which is applicable with the full 52-card set; the AKQJ 16-card model allows for more accuracy in statistical ??? rendering, especially when using a Monte Carlo simulation. One beneficial aspect that we can add with Python is characters. As Siddhant covered, we can now use wagers. Instead of manually applying different confidence levels on our two-card ??? randomizer, we can use Monte Carlo to make different choices based on a certain percent threshold, with the ability to add or remove wagers, which we will see later on. First, it is important to talk about the deck. Here we see a 16-card array with our shuffling function. This allows the cards to be randomized, similar to the initial function seen in the Java program. This draws two cards from the 16-card total for the different players. With this in mind, there are extensive applications to both the original comprehensive method and our worst-case method, which will be covered and developed later. ??? the same deck allows us to add changes or move things to affect how we compute the worst-case method, which is helpful for our end goal. We didn't have the same flexibility in our old Java method. Python-specific changes for the worst-case method include specific elif/if statements, so that the player with the worst card is marked as a loss, and the formula covered by Siddhant has changed. This is important because it allows more efficiency in data collection and makes the randomizer's outputs more accurate. You can add specific names to these separate cards, which is also another helpful application in itself. This simplified Monte Carlo simulation allows for more complexity, as it lets us add our new wagers based on the scenario, which involves the multiple betting rounds that Siddhant has described. We can change characters by changing the current wager, which affects whether the player stays or leaves the betting round, this being a key difference from the Java method. A key concept in this program is setting a variable that will eventually return a value to the funds, and setting the wager to the initial wager argument. Setting the current wager to zero, we set up a while loop to run the condition continuously until the current wager is equal to the count. Then, for each set where we get a successful outcome, we increase the value by our wager.
If we were to add a command telling us or the character to slow down at a certain wager, then we have a simple way of having threshold to betting. We can also edit the same form to accept specific sequences, like the full house, only allowing a wager when the sequence is present. These thresholds create our upper mentioned risk management level. We can plot the probabilities in ???, where we append to each updated current wager to an array of X values and each updated value to an array of Y values, proceeding to plot both PLT(?) to plot X values and y values. The key component of the better function is the condition if...in the if statement that corresponds to a successful outcome. This can be adapted to any outcome needed, including general scenario and worst case scenario. When we apply our comparisons with the two character and multiple character, we can add important statements that make sure the data is compared properly. Using Python you can make sure drawings like two pairs or instead of higher values ruling out players with a lower value, forcing them out of the game. This allows for more efficiency with the new Python program rather than using the original Java program. Something that would take 30 hours originally can now be done in a matter of seconds. After this program is applied, we are able to run a correlation test with the new results against the original Java results. If we look at the red lines for both the general and worst case methods, we see that they're extremely close to one, indicating strong correlations. This is also paired with the higher R squared value. We also run the one proportion hypothesis test, telling us for both methods we failed to reject the null hypothesis. This value, although high, isn't close to...isn't as close to one. For something that we would expect to be almost identical, because this is computer programming. There are two main reasons for this, the first is sample size. 92 seems like a lot, but it isn't strong...isn't as strong a trend as one would expect. To fix this we are able to increase the sample size to 1,000 or even 10,000. One more reason could be the application of Python. As previously mentioned Python is more comprehensive language, rather than a static language. This could change the effect, but mainly stay the same. Here we see the program finally applied with their results presented. With this program we see what each play...which cards each player draws with the cards... which card is shown and hidden, what cards are on the table, how many chips players gain and lose and, finally, who wins, how they win. In the diagram below, we see Player 1 wins with the full house. the ability to run multiple characters into one computer and one function, describing the different sequences while applying different numbers of players, also showing probability of different outcomes, even adding the multiple betting runs in a regular game of poker, which Siddanth has covered. This is vital to further developments, because it allows for people around the world to use this program and method to develop on ideas as a learning tool able to be utilized to act as therapy to gambling addicts. One more development exploits the neural network (AI) aspect, making this program more detailed by adding features like bluffing, added enough for it to make the program more like actual game seen in casinos. 
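To make the Python workflow described above concrete, here is a compact, simplified sketch, not the authors' actual program: it shuffles the 16-card AKQJ deck, deals two cards to each of six players, and runs the kind of wager loop just described. The win probability and the gain/loss per bet are hypothetical placeholders.

# Simplified sketch of the AKQJ deal plus a wager loop, under assumed parameters.
import random

RANKS = ["A", "K", "Q", "J"]
SUITS = ["S", "H", "D", "C"]

def new_deck():
    deck = [r + s for r in RANKS for s in SUITS]     # the 16-card AKQJ deck
    random.shuffle(deck)
    return deck

def deal(deck, n_players=6):
    return [[deck.pop(), deck.pop()] for _ in range(n_players)]   # two cards per player

def bet_until(count, wager=1, win_prob=0.45):
    # Repeat bets until `count` wagers are placed; gain `wager` on a win, lose it otherwise.
    funds, current, history = 0, 0, []
    while current < count:
        if random.random() < win_prob:               # hypothetical success condition
            funds += wager
        else:
            funds -= wager
        current += 1
        history.append(funds)
    return history

deck = new_deck()
hands = deal(deck)                                   # six players, two cards each
table = [deck.pop() for _ in range(4)]               # four shared open cards
print(hands, table, bet_until(count=20)[-1])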
Finally, we were able to see that by using Python we get not only higher accuracy but much higher efficiency, turning a program that originally took hours into one that runs in just a few seconds, and ultimately supporting our original hypothesis that being cautious yields more money and more wins. So next time you go to Vegas, or even partake in a game with friends, remember this project and remember that being more careful and taking a bit more time with decisions will help you in the long run. Thank you.
Chi-hong Ho, Student, STEAMS Training Group Mason Chen, Student, Stanford OHS   In late February 2020, the COVID-19 pandemic began gaining momentum, and the stock market subsequently crashed. Many factors may have contributed to the fall of stock prices, but the authors believed the pandemic may have been the main cause. The authors' objectives for this project were to learn about stock investments, earn money in the stock market, and find a model to help determine the timing and amount of the trading. All of the data was Z-standardized to help eliminate bias and for ease of comparison. Specifically, the authors used three Z-standardization values: Z (within stock), Z (NASDAQ Ratio) and Z (Group NASDAQ Ratio). The authors compared the current stock price with the previous stock price, the NASDAQ stock price and the group NASDAQ price average, respectively. After that, the authors combined the three Z-values into a stock index, which greatly decreases the data bias and helps reduce investment risk. Outliers are used to determine the timing of investment. The Quantile Range Outliers method is easily affected by skewness, because it assumes an approximately normal distribution. Robust Fit Outliers is a better tool, because it can reduce the influence of skewness. The authors established a model to help people invest the right amount of money.     Auto-generated transcript...   Speaker Transcript Chi-hong Ho Okay, hello everyone, I'm Chi-hong Ho, a junior at Henry M. Gunn High School. My partner is Mason Chen; he is a sophomore at Stanford Online High School. Our project is about multivariate statistical modeling of stock investment during the COVID-19 outbreak. In late February, due to the coronavirus pandemic spreading around the world, the stock markets started crashing. There are several factors behind this year's stock market crash, such as the COVID-19 pandemic, the OPEC/Russia/USA oil price war, the 2.2 trillion dollar bailout package from the US government, and companies laying off their employees; there is a more than 30% unemployment rate due to the pandemic. Compare that with the 1929 Great Depression unemployment rate, which was about 25%. Also, in November the US held the presidential election. And the manufacturing supply chain shut down, because the coronavirus pandemic had spread in China earlier. The COVID-19 pandemic influenced the stock market like the past crashes that happened in the 1929 Great Depression and on Black Monday in 1987. Because the COVID-19 pandemic had a huge impact on the world, causing many deaths, the stock market was strongly influenced by the pandemic. In US history, the crashes of the Great Depression of 1929 and Black Monday in 1987 both continued for a long period; the COVID-19 crash didn't. This year, the stock market decreased by 25% from the peak, from March to April. Before March, the COVID-19 pandemic in the US had not spread as fast as in other countries. After the COVID spread became global, many countries were locked down, and a national or global lockdown situation affects the stock market. Look at the graph in the left corner; that is the situation that happened in Korea. Asian countries experienced the COVID pandemic before America, so we use the Asian countries' situation to predict what will happen in the US.
Based on this graph, the colored point is the COVID-19 inflection point for Phase II; it may impact the stock curve significantly, because the case growth speed decreases a little in this short period. On the left side is the correlation map of cases versus date for the US. Cases started being added in early February and grew sharply in late March and early April. The right graph is the stock market decline by date, which shows that the two maps are related to each other: when cases grew quickly, the stock market started to crash, and the lowest point was more than 35% down. Comparing China, South Korea and the US, we know that the Asian countries experienced the COVID pandemic before the US, so we could anticipate what would happen in the next few months in the US. Based on that specific table and the data below, I found that the duration of Phase III is really short; by then it is becoming safe for us to go back to work, and compared with the duration of Phase IV, which is double the time of Phase III, by that time we feel even safer going back to work. After we looked at the Asian countries' pandemic phases, we could predict the Phase III and Phase IV durations for the US. We estimated that the US end of Phase III should be around April 15 and the end of Phase IV around May 25 in the best case; in the worst case, the end of Phase III is around April 30 and the end of Phase IV around June 10. Our recent project is to define a stock investment strategy; our objectives are learning and experiencing stock investment, earning money in the stock market, and building a model for judging the times to trade or exchange stocks. Firstly, we own eight high-technology stocks which were purchased in 2008 and 2009, with an average gain of about 400% as of March. Some of the stocks are in the top 20 of the Standard and Poor's 500, with an average gain of more than 800%. We wanted to find a time to sell those stocks and get the money back. Because of the COVID-19 situation and the stock market crash, stock prices were not as high as in March, so we wanted to sell quickly. After selling the high-tech stocks, we looked at 23 COVID-impacted stocks, which are stocks that lost ground because of the pandemic; we expect those stocks to surge after a few months. Our choice of COVID-impacted stocks should have a minimum of a 3 billion dollar market cap. When it comes down to trading stocks, we can also make exchanges: we sell one tumbling stock and buy one rising stock for balance. We need to make sure our stocks will surge in the coming months. We separate the stock transaction decision chart into three levels. The first level is to decide what we will buy, what we will sell and what to exchange. In level two, we pick the stocks from the selling group, the buying group, and the exchange pair from the two groups. The third level is the tools we will use: the Z index and the outlier detection tools. The function of standardization is to give us an idea of how far the real data points are from the mean. Why do we need to use it? Because we need to convert the actual data to an index that is easier for us to compare, and standardization also eliminates the bias of the raw data. On the left side, there are blue boxes, purple boxes and red boxes. The real data is the input we collect, which is in the blue box. The new index after Z standardization is in the red box; that is our output. Z standardization is the tool we use, which is in the purple box.
In the blue box are the NASDAQ stocks, which are popular and which lots of people invest in; high-tech stocks have grown a large amount during the past five years. The range of the Z standardization is from -3 standard deviations up to +3 standard deviations. After the Z standardization, we get Z within stock, Z NASDAQ ratio, and Z group NASDAQ ratio. The Z within stock compares the stock price with the previous stock prices over the past five years, the Z NASDAQ ratio compares the stock price with the NASDAQ stock price, and the Z group NASDAQ ratio compares the stock price with the group NASDAQ mean. We use the Z standardization to help us look at the risk. In the end, we combine all three Z scores into a new stock index, which can help us lower the risk of transactions. Here is the data table after we standardized the raw data; we can see the stock price index change. US stocks have been in a downward trend since a peak around mid-February. Some stocks are more robust and certain ones are impacted by COVID-19. We established this modeling algorithm on March 7-8 and the database on March 14-15. The red indices shown in this figure represent good times to sell those stocks, when we can earn more money than at the non-red indices. Also, some indices are marked in blue; those are the times when we can consider buying the 23 COVID-impacted stocks, because we can lower the cost and gain more in the future. The reason why we use an outlier algorithm is that the outliers help us determine the timing of trading stocks. One way to determine the outliers is to use quantile range outliers. First, we find the interquartile range, which is Quartile 3 minus Quartile 1 (IQR = Q3 - Q1). A value is flagged as an outlier if it is below Q1 - x*IQR or above Q3 + x*IQR, with x equal to 1.5 for regular outliers or 3 for extreme outliers. Why do we choose the extreme outliers? Because regular outliers cannot show the longer timing that we wanted. Extreme values are found using a multiplier of the interquartile range, the distance between the two specified quantiles, so extreme outliers give a wider detection level that we can use in investment, which helps us reduce our risk. But technically the quantile range outliers algorithm assumes a roughly normal distribution. The stock market is not normally distributed, so the outliers will be influenced by the skew factor. Thus we need a more powerful tool that is not influenced by skewness, because for stock performance we care more about the tails than the center of the distribution. The next tool we use is robust fit outliers; we use robust fit outliers to reduce the influence of the skew factor. Outliers and distribution skewness are very much related: if you have many so-called outliers in one tail of the distribution, then you will have skewness in that tail. In quantile range outlier detection, the assumption is a normal distribution, so skewness in the distribution will introduce an inaccuracy in the outlier detection methodology. If the distribution is significantly skewed, as it probably is in stock market data, robust fit outliers are a better method to find the outliers accurately, because they tend to ignore the skew factor. The robust fit outliers method estimates a robust center and spread; outliers are defined as those values that are K times the robust spread away from the robust center.
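Before continuing with the robust fit options, here is a rough Python illustration of the Z-standardization and the quantile-range rule just described. The file and column names (including "KLAC" and "NASDAQ") are hypothetical placeholders, the three Z-values are approximations of the authors' definitions, and summing them into one index is an assumption of this sketch.

# Rough sketch, assuming a daily price table with one column per stock plus "NASDAQ".
import pandas as pd

prices = pd.read_csv("stocks.csv", index_col="date")
nasdaq = prices["NASDAQ"]

def zscore(s: pd.Series) -> pd.Series:
    return (s - s.mean()) / s.std()

stock = prices["KLAC"]
z_within = zscore(stock)                               # vs. the stock's own history
z_nasdaq = zscore(stock / nasdaq)                      # vs. the NASDAQ index
group_mean = prices.drop(columns=["NASDAQ"]).mean(axis=1)
z_group = zscore(stock / group_mean)                   # vs. the group mean
stock_index = z_within + z_nasdaq + z_group            # combined index (assumed sum)

# Quantile-range rule: flag values below Q1 - x*IQR or above Q3 + x*IQR
def quantile_range_outliers(s: pd.Series, x: float = 3.0) -> pd.Series:
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return s[(s < q1 - x * iqr) | (s > q3 + x * iqr)]

print(quantile_range_outliers(stock_index, x=3.0))     # x=1.5 regular, x=3 extreme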
The robust fit outliers provide several options for coupling the robust estimate and multiply K, as well as provide tools to manage the outlier found. We use K=2.7 for regular and 4.7 for extreme outliers. After we use the regular robust fit outliers, we can find out the outlier in the selling index data. Look at the right graph. There are so many shaded red cells in F5 and F8 columns, indicating that we can consider to sell those stocks to maximize our profit, because the stock price is selling above average. Each column is showing some stock index change by day going down to the column. The reason why we use the extreme outlier for buying index is that the buying index is dropping. That means it is really difficult to detect the outliers of the buy index. Not like the selling. Selling index is rising, which is easier to for us to determine the outlier. On this page, there are some color blocks in the data table, like B6, B13, B15 and B19, which indicates that we can consider to buy the stocks. Lots of people make money by investing in stocks and most people may choose the right stock to invest in for reasonable ROI. But investors are challenge to find the right amount of money to invest. Also other human psychological factors will favors our certain investment. We can determine the amount of stock we buy, sell or exchange based on this model, which can minimize a personal investment bias and reduce the overall financial risk. The model provides two ways to judge the amount of investment. The first one is the color block analysis. Now in that analysis, the blocks with dark green are good to sell and the blocks with orange or red is good to buy or it's good to exchange. And then in the bottom, there are the transaction levels we define. The L10 is the least investment amount and L1 is the greatest investment amount. If we want to sell the stocks, we will sell not too high, is in the L5 amount. And do the exchange, we also choose the L5 model. And if we do the buying, we can just to consider to buy more, so we just buy L2 amount. So based on this model, we can just manage our ...based on this model, in the investment you will reduce your financial risk. Then the function....okay...this is...in the Phase III is in the exchanging part. Also we are using Z standardization for...to convert the data point into this index, but this is the exchange index. We set up the exchange threshold Z exchange index should be greater than 15. This is an average index we calculate, which can tell investor the time. On the left side, there's the line chart, which shows the change of each exchange pair. Based on this line chart, we can see the trend of S5-B1 is about 15.8 and S5-B14 is about 15.16. So that means we can consider to doing that exchange between the S5-B1 or S5-B14. After Z standardized, we can get the stock...sell stocks index and the exchange index. The selling index is to compare the stock price with past five years stock price. The Z NASDAQ ratio is compared to stock index with the average stock price. So the exchange index is compared to stock selling index, which was the stock buying index. We use Z standardization to help us look after risk. We consider about 184 choices and we need to make sure our investments will be in the right timing and pick up the right pair to do the exchange. We're also using the quantile range outlier algorithm to help us determine the timing. The small value of Q provides a more radical set of outliers than the large value. Look at right side table. 
We use the quantile range outlier method and get the top three outliers whose exchange index value is greater than 19. This is the second time we consider the exchange index. We found the top indices, which are the signal and the best timing for us to do the exchange: the S5-B14 pair at 19.27, the S5-B13 pair at 19.12, and S5-B12 at 19.07. On the left side, we have the timing prediction model. This model is presented with a color-coded, color-box style: the blocks in dark green are good times to sell, and dark orange or red are good times to buy or exchange. The best time is bolded and shown in the graph: April 6, which is the best day to do the exchange since we began collecting exchange data in February 2015. We consider the exchange pair twice, which doubles the insurance that we can make more money in the stock market and greatly reduces the investment risk. On April 7, the exchange index changed little compared with April 6; on that day, the S5-B1 pair had a 19.18 exchange index. The right graph shows the exchange stock information. On April 8, we sold the KLAC stock at $154.32; the market price was $148.85, so we saved about 3.5% on the sale. Also on the same day, we bought the Delta stock at $22.42; the market price was $22.92, so we gained about 2.2% on the exchange. We sold and bought the same amount of stock for balance; the amount of stock sold and bought was equal at 65 shares. After one day, the exchange pair had helped us gain 5.7%. All stock buyers focus on their stock trends. My partner Mason and I monitored the NASDAQ stock daily range outliers from late February 2020 to mid-March 2020. We separated the daily trade window into certain time slots, 30 minutes each, and we wanted to find the best time for trading. There were 24 peak and valley points detected, and the upper threshold was 2.7%. In the figure in the right corner, we can see the stock price at the open, close, high, and low times; we also counted the price range and rank when we did the stock price peak and valley detection. ??? considered the discrete number of the sample size. Among the 24 peak/valley points we detected, the data shows that 17 out of 24 points, about 70%-71%, happened in the first or last hour. We set up a one-proportion test where the null hypothesis, assuming a uniform distribution of probability across the time slots, corresponds to waking up early and having a stock lunch session to do trading. Look at the table in the left corner: the null proportion value is 0.34, which is greater than 0.05, so we cannot reject the null hypothesis for those four slots among the 13 slots available. The table or figure in the right corner shows the distribution of the peak times and valley times. In our research, we provide a new model to pick the right stocks, the ??? the amount of buying and selling, and also the exchange index. Timing is a really important factor in investment. This model of stock investment was accurate most of the time during the COVID-19 pandemic. Our research group invested in the stock market and gained 2.5% after we finished the project. We may use it to predict the future if the pandemic doesn't end. Based on our research, early-bird or last-minute stock trading is favored and can earn more money. Thank you.
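For reference, a one-proportion check like the peak/valley timing test described above can be set up as follows in Python. The counts (17 of 24 points in the first or last hour) and the 4-of-13-slots expectation come from the description above, but this generic sketch is not meant to reproduce the exact JMP output quoted in the talk.

# Generic one-proportion (binomial) test sketch; inputs follow the talk's description.
from scipy.stats import binomtest

observed_hits = 17        # peak/valley points falling in the first or last trading hour
n_points = 24             # total peak/valley points detected
p0 = 4 / 13               # uniform expectation: 4 of the 13 half-hour slots

result = binomtest(observed_hits, n_points, p=p0, alternative="greater")
print(observed_hits / n_points, result.pvalue)   # sample proportion and p-value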
Saloni Patel, Student, Stanford OHS Mason Chen, Student, Stanford OHS   This project investigates the validation of a prediction model and the actual result of the 2020 United States presidential election. The prediction model consists of the predicted election result, which is derived from the z-scores of the number of infected cases, deaths and unemployment increase rates for each of 15 “swing states” along with the 2012-2016 election result average. In order to identify the most important swing states, a Swing State Index was derived using the 2012, 2016, and 2020 election outcomes.  The predicted election result is then subtracted in response to the media’s report about how Donald Trump is expected to lose 3-5 percent of his votes from the 2016 election. The model is used to compare the level of accuracy between the predicted 2020 election result and its subtracted values against the 2020 actual election result. The paired t-test and regression test are used to test the significance between the 2020 actual result and the 2020 predicted result as well as the 2016 actual result and the 2012 actual result to see how the 2020 predicted result compares with the 2016 election result and the 2012 election result in predicting the 2020 actual result. A one proportion hypothesis test is also used to compare the accuracy of the 2020 predicted result with the 2020 actual result.  The next part of this project studies factors that influenced the voting behavior of the 15 key swing states in the 2020 United States presidential election by linking statistical clustering methods with notable political events. In addition to key decisions made in the Trump administration, factors unique to this presidential election such as the global COVID-19 pandemic and the Black Lives Matter movement were investigated. Hierarchical clustering was used to group the 15 swing states based on the Swing State Index, and the relationships between each cluster were attributed with events that may have factored into the cluster behavior. The most representative and significant swing states were identified to be Arizona, Georgia, Wisconsin, and Pennsylvania (based on the clustering history) as well as Michigan and Minnesota (based on the Swing State Index). After analyzing specific events that affected these six states’ voting behavior, the Black Lives Matter movement and concerns over health care were the most significant factors in President Trump’s defeat. Next, the state of Georgia was further studied to better understand the influence of COVID-19 and the economy on the state’s voting behavior. By adjusting the ratio of the COVID-19 values (infected cases and deaths) and economic value (unemployment rate), it was found that the economy was of greater importance than COVID-19 to Georgian voters. The study of similar events by connecting political science (e.g. government decision-making) and clustering methods can be applied to future elections to better predict the outcome of important swing states and, thus, the overall election results.  All calculations and analysis are done on the JMP 15 platform.     Auto-generated transcript...   Speaker Transcript Saloni Patel Okay, so hello, my name is Saloni and today I'll be presenting our project, the United States presidential election prediction model and swing states study behaviors study. There are two parts in this project, the first involves creating and evaluating a model meant to predict the 2020 US presidential election. 
The second part of the project will study swing state behavior in the 2020 US presidential election and identify key events that affected the voting patterns in the election using hierarchical clustering methods. All the analysis was done on the JMP 15 platform. To clarify our project does not focus on all 50 US states and instead we will only study the top 15 swing states. The swing states are states that can reasonably be won by either the Democratic or Republican presidential candidate, as opposed to safe states that consistently lean towards the one party. Additionally, the US voting system depends on the Electoral College system that gives a set number of votes to each state based on population numbers. There is a total of 538 electoral votes so a presidential candidate must get 270 electoral votes to win the presidential election. Since most of the states are known to vote for either a Democratic or Republican candidate without hopes of being swayed out of the normal voting pattern, the Electoral College system and the presidential election result depends on the bulk of the swing states that can potentially be won by any of the candidates. A win by even a small margin results in that candidate acquiring all the votes the state has to offer, so swing states are especially impactful in determining the next president. So, to begin we conducted this project in hopes of better understanding the historic 2020 US election that occured in the middle of a global pandemic and socially as well as economically unstable times. The first part of our project's objectives is to identify key swing states, create a prediction model based on the influence of COVID 19 and the economy in those identified swing states, and lastly validate the prediction model with the actual election results once those came out. So the first step in our prediction model is identifying top 15 swing states from the past three elections using this swing state index. We use this formula to determine whether the states are swing states or not. It is also important to note that this swing index does not take into account which side each state votes for, but rather on the election results itself. In other words a state could have voted for the same side all three years yet by very different margins and still be counted as a swing state. We will further study this index in the next part of the project, but right now, all we use this index is to...for is to identify the top 15 swing states. Once the swing states are identified, we derive the first value we will need to calculate the predicted 2020 election result. This is the 2016-2012 composite win margin. To calculate this value we took the 2016 result and the 2012 election result. In the formula, we gave the 2016 results twice the weight because it was more recent than the 2012 election and we gave another twice the wait for the 2016 election. Because President Trump was present in the 2016 election running as president, while Joe Biden was present in the 2012 election as a candidate for vice president. In total, the 2016 results will have four times the weight as in 2012 in the 2016-2012 composite win margin. Next we identify factors that are unique to the 2020 and factors that voters may vote according to. We found that the global COVID-19 pandemic and the following hit the economy took were important factors unique to 2020 so we collected the infected cases and death cases due to COVID-19, as well as the unemployment increase in each state. 
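Before moving on to the standardization of those factors, here is a small sketch of the 2016-2012 composite win margin weighting described above, read as a 4:1 weighted average (the 2016 result carries four times the weight of the 2012 result). The example margins are round placeholder numbers, not the actual election data, and the authors' exact formula may differ.

# Sketch of the composite win margin under an assumed 4:1 weighted-average reading.
margins = {                          # placeholder margins in percentage points (+ = Republican)
    "State A": {"2012": 8.0, "2016": 5.0},
    "State B": {"2012": -5.0, "2016": 1.0},
}

def composite_win_margin(m_2012: float, m_2016: float) -> float:
    # 2016 weighted twice for recency and twice again for candidate overlap, i.e. 4:1 overall
    return (4 * m_2016 + 1 * m_2012) / 5

for state, m in margins.items():
    print(state, round(composite_win_margin(m["2012"], m["2016"]), 2))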
Next we applied the Z standardized transformation to avoid any sampling mean and variance biases. Using those Z scores, as Z-infected, Z-deaths and Z-unemployment, we derived the Z-COVID index. This index will represent the impact the global pandemic and the following economic hit each state experienced. Lastly, we calculated the 2020 predicted result, using the 2016-2012 composite win margin and the Z-COVID index. Once the 2020 election passed, we recorded the 2020 actual election results and proceeded to validate our prediction model and whether our choice of factors did a good job helping predict the 2020 election result. Additionally, since the media before the election had predicted that Trump will lose 3 to 5% of the votes from 2016, we decided to subtract certain percentages from the predicted results. In the table below, that predicted result is the zero percent category and the reductions can be seen as well. To analyze the results we compared the predicted results with the 2020 actual election results, using the regression and paired t-test. To compare how the 2020 predicted results compared with previous election results at predicting the 2020 actual result, we also include the 2012 election and 2016 election results in our evaluation. Lastly, we also conducted a 1-proportion hypothesis test to test the 2020 predicted results accuracy. To begin we conducted a regression test with the election results presented just from each state. The 2012 result compared to the 2020 actual election result did not yield a significant result. However, the regression tests between 2016 actual and 2020 actual displays a significant result and the highest R squared of 0.81. The results of the regression is also close to 1, at about 1.17, suggesting a strong regression relationship. The regression between the 2020 predicted and the 2020 actual also displays a significant result but a lower R square of 0.3. From these results, the regression between the 2016 actual and 2020 actual results had the highest R Square and slope closest to one, despite having declared a different winner. It is reasonable to find that the 2020 election results would be correlated with the 2016 election results since Trump lost those swing states narrowly wone in the 2016 election by small margin, so just Michigan, Pennsylvania, and Wisconsin. Next, the paired t-tests that will also compare the election results percentages of each state. we use the paired t-test because the same states or pairs are being assessed against each other. The paired t-test only found a significant difference in the means of the 2016 actual election results and the 2020 actual results, which makes sense since these results had a high regression test significance. This would suggest that the means of the 2012 election and the 2020 predicted results are similar to the 2020, actual meaning that the election results are similar. In the 2012 election, the Democrats had won the election and the predicted results had predicted that Democrats would win in 2020, while in 2016 the Republicans had won the election. This can explain why 2016 is significantly different from the 2020 actual election results, while the... while the 2012 actual and 2020 predicted are not. Lastly, we use the 1-proportion hypothesis test to test how the 2020 predicted results matched with the 2020 actual results. Unlike the regression and paired t-test, the 1-proportion test compares the states and which side they voted for. 
The regression and paired t-test only compared the election results without any indication on which side the states voted for. Therefore, this test is more powerful and validating the prediction model, since it compares the predicted side each state would vote for and which side the states actually voted for. We assign the states that voted for the predicted side with a pass and those that did not vote for the predicted side with a fail. In total 12 out of 15 states received a pass, as they were predicted accurately, while the other three received a fail. We set the success value at pass and the scale is a sample proportion of 0.8. Since we want the sample proportion to be greater than 0.9 or 90% accurate, we set the hypothesized proportion to 0.9. Since the 0.8 proportion failed to exceed the 0.9 at the 95% confidence level, the prediction model failed to be 90% accurate, failed to reject the null hypothesis at the 95% confidence level. According to the proportion of our sample this model is 80% accurate. To summarize the regression test showed significance between the 2016 actual results and the 2020 actual election result., as well as a weaker significance between 2020 predicted and 2020 actual election results. We theorize that this may have been because this election, President Trump lost those swing states narrowly won in the 2016 election. The paired t-test showed significant difference between the 2016 actual and 2020 actual, and we theorized that this may have been because those two elections declared different winners. President Trump won the 2016 election yet lost the 2020 election. Additionally the 2012 and 2020 predicted results are not significantly different from the actual 2020 result...election result, which may have been because they both declared the same political party as the winners. As...lastly, the 1-percent hypothesis test failed to reject the null hypothesis, and so our prediction model is not 90% accurate at the 95% confidence level. Arizona, Wisconsin, and Minnesota, which could suggest that there were other major factors besides the impacts of COVID-19 and unemployment rates that influenced the 2020 election result. This is where we transition to the next part of our project in which we will group states based on their swing state index and identify them with key events that took place in 2020 that could have influenced the swing states' voting behaviors. So the questions this part of the project will address from the last is which events and factors influence the swing states to vote the way that they did. How much more or less did voters care about COVID-19 than the economy and other side investigations? Can we use statistical tools to link political events with voting patterns? The goals for this project is to study the previously identified swing states voting patterns by linking statistical clustering methods to political events. We will also adjust the Z-COVID index, or as we will now call it Z-Ratio, with new ratios to better understand the importance of COVID 19 and the economy in voting behavior. Previously the Z-COVID index had two by one ratio, where the values of COVID-19 infected cases and deaths were given twice the weight compared to the unemployment increase value, since there were two values for COVID-19 and only one meant for the economy. We realized that each State was impacted differently by the pandemic, so we thought it would be appropriate to analyze the effects of switching this two by one ratio to other ratios. 
First, we go back to the swing state index, which helps identify the swing of each state using the election result percentages from the past three elections. A negative election result indicates that the state voted Democratic, while a positive one indicates a Republican vote. The larger the magnitude and the more negative the swing index, the more that state's voting pattern has swung. If the state changes direction, then the signs of the two differences will not be equal, causing the swing state index to be negative and display more of a swing behavior. From this table, we can see that Michigan and Minnesota have the negative values of largest magnitude, which means they have been swinging the most over the past three elections. Overall, the swing state index is quite useful for understanding basic voting patterns in the swing states. However, the swing state index cannot identify the key events that caused the voting patterns. We used hierarchical clustering to study states with similar voting patterns and list potential factors that affected their voting behavior. Hierarchical clustering grouped the 15 swing states into four different clusters, as seen on the right. We used this method because of its bottom-up approach, where every state starts as its own cluster before the clusters are merged one at a time and moved up the hierarchy. On the right, Iowa and Ohio can be seen in red, indicating that they are in the same cluster. As mentioned previously, the hierarchical clustering divided the swing states into four clusters. The first cluster consists of Iowa and Ohio. Both of these states voted blue, or Democratic, in the 2012 election, yet red, or Republican, in the 2016 and 2020 elections. The second cluster has Georgia, Arizona, North Carolina, and Florida. All these states except North Carolina became bluer or redder, or in other words are starting to favor one side heavily. The third cluster consists of Wisconsin, Pennsylvania, Michigan, Nevada, New Hampshire, and Minnesota. All of these states, besides Nevada, have a negative swing index, meaning they are the most inconsistent swing states. The last cluster has Colorado, Virginia, and New Mexico, which are all relatively blue states, or states that have consistently voted Democratic, and in the 2020 election voted blue by a larger margin than previously. Now that we have all the clusters and an idea of their characteristics, we looked at the clustering join history, which identifies the top pairs of states, or which two states are the most similar within their clusters. From the join history, the first two pairs are Wisconsin with Pennsylvania from the third cluster and Georgia with Arizona from the second cluster. Both pairs are part of clusters containing states that switched from red to blue in the 2020 election. After further research, we found that in Wisconsin and Pennsylvania there appeared to be concerns about the economy, dissatisfaction with President Trump's healthcare-related policies, such as his efforts to weaken the Affordable Care Act formed under the Obama administration, as well as concerns about the environment, all of which ultimately led the majority of voters to vote Democratic. In Georgia and Arizona, however, major shifts in demographics, such as more registered Latino voters in Arizona, and the Black Lives Matter movement, which exposed serious racial injustice, ultimately caused the majority of voters to cast a Democratic ballot.
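(A minimal JSL sketch of the clustering step described above. The margin column names are hypothetical placeholders, and the linkage method shown is an assumption, since the talk does not name one.)

dt = Current Data Table();
dt << Hierarchical Cluster(
	Y( :Name( "2012 Margin" ), :Name( "2016 Margin" ), :Name( "2020 Margin" ) ),
	Label( :State ),
	Method( "Ward" ),          // assumed linkage; other linkage methods are available
	Number of Clusters( 4 )    // the four clusters discussed in the talk
);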
Through hierarchical clustering we were able to separate states into different groups based on their voting behavior and make connections to the key events that caused the observed voting behavior. Although we found key events that influenced each state's voting behavior, hierarchical clustering did not tell us the weight each event carried in an individual swing state's voting behavior, in particular the relative weight of COVID-19 and the economic recession that followed. Previously, we had to assume that each state would have the same Z-ratio, which gave COVID-19 twice the weight, resulting in a two-to-one ratio for every state. However, from the hierarchical clustering we found that each state has a unique situation and its voters cared about different issues. To adjust the Z-ratio, we created a value called the Ratio Variable. The Ratio Variable determines the ratio of the importance of COVID-19, or the Z-COVID index, which represents the infected cases and deaths in each state, versus the economy, or the Z-unemployment value, which represents the annual unemployment increase rate in each state. Once the Z-ratio is adjusted with a few different ratio variables, such as 0.1, which creates a one-to-ten ratio giving the economy ten times the importance, it is implemented into the full formula used to calculate the 2020 predicted results. These adjusted 2020 predicted results are compared against the 2020 actual election results to determine which ratio best explains the state's situation and how much importance COVID-19 and the economy had in influencing voting behavior. We decided to study Georgia's voting behavior closely, since it appeared to stand out compared to the other swing states. For one, Georgia was the first state to reopen businesses in April, while the rest of the states did not. Additionally, Georgia was a key state in the 2020 election, which President Trump kept an eye on even after the election results were announced, in attempts to overturn them. Georgia voted blue by a small margin and had an election result of -0.3%. The adjusted 2020 predicted results for Georgia were plotted, and a marker for Georgia's actual election result was placed on the graph on the right. From the graph we see that the adjusted 2020 predicted result with a ratio variable of 0.75 had a value of -0.2%, which is the closest to Georgia's actual election result of -0.3%. The 0.75 ratio variable means that the ratio is three to four, indicating that the economy was the more important issue to the majority of voters in Georgia. This makes sense because, as mentioned previously, Georgia was the first state to reopen businesses in April, indicating a strong concern for the well-being of its businesses and economy. In this project we explored different key events and their importance in influencing the voting behavior of the 15 identified swing states using statistical methods. First, hierarchical clustering was used to group the swing states based on their voting behavior in the past three elections. From this we found that the second cluster, consisting of Arizona, Georgia, and others, was mostly affected by issues regarding civil rights, while states such as Pennsylvania and Wisconsin in the third cluster voted for Joe Biden due to concerns about the economy, healthcare, and the environment. Overall, we examined the influence that the worsening COVID-19 situation, racial movements such as the Black Lives Matter movement, and the economy had on each state.
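(A minimal JSL sketch of the ratio-adjusted index. The exact aggregation formula is an assumption made for illustration, with r : 1 taken as the COVID-to-economy weight so that r = 0.75 corresponds to the three-to-four ratio discussed for Georgia; the Z-score column names are the same hypothetical placeholders used earlier.)

r = 0.75;    // Ratio Variable: weight of COVID-19 relative to the economy
dt = Current Data Table();
dt << New Column( "Z Ratio Index", Numeric, Continuous,
	Set Each Value( (r * (:Z Infected + :Z Deaths) / 2 + :Z Unemployment) / (r + 1) )
);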
Georgia was explored in more detail, and it was found that a three-to-four ratio matched best with the actual election result, suggesting that the economy was a more important issue to voters than COVID-19. This makes sense because Georgia was the first state to reopen businesses in April. Thank you for listening to my presentation.
Anne-Catherine Portmann, USP Development Scientist, Lonza Sven Steinbusch, Senior Project & Team Leader, Microbial Development Service USP, Lonza   Often, the analysis of big data is considered essential in the fourth big industrial revolution – the “Data-Based Industrial Revolution” or “Industry 4.0.” However, the challenge of unstructured data, or a less than in-depth investigation of the data, prevents the full potential of existing knowledge from being used. In this presentation we offer a structured data handling approach based on the “tidy data principle,” which allowed us to efficiently study the data from more than 80 production batches. The results of different statistical analyses (e.g. predictor screening, machine learning or PLS) were used in combination with existing process knowledge to improve the overall product yield. With the newly created knowledge, we were able to identify certain process steps that have a significant impact on the product yield. Additionally, several models demonstrated that the overall product yield can be improved by up to 26 percent through the adaptation of different process parameters.     Auto-generated transcript...   Speaker Transcript Anne-Catherine Portmann Hello, today I will present to you the power behind data. This presentation is based on the tidy data principle, which allowed us to efficiently study the data of more than 80 production batches. We were able to improve the product yield by up to 26% based on process knowledge and statistical analysis. The statistical analysis also allowed us to identify the key process steps that have an impact on the product yield. So I will first introduce Lonza Pharma Biotech and then we will go to the historical data analysis. Lonza Pharma Biotech was founded in 1897 and shortly thereafter was transformed into a chemical manufacturer. Today we are one of the world's leading suppliers to the pharmaceutical, healthcare and life sciences industries. Here in Visp, we are one of the biggest Lonza sites and the most significant for R&D, development and manufacturing. We also have a new part of the company, the Ibex solutions, where we are able to cover complete biopharmaceutical cycles from preclinical to commercial stage, from drug substance to drug product, all of this in one location. You have probably heard about this lately with the Moderna vaccine against COVID-19, but it is not the only product that we are producing here in Visp. We are also producing small molecules, mammalian and microbial biopharmaceuticals, highly potent APIs, peptides and bioconjugates, including antibody-drug conjugates. Now that you know a little bit more about Lonza, I will go to the historical data analysis. First of all, I will present the process on which the 80 batches were run. First, the upstream part. The upstream part starts with the fermentation, where the product is generated by the microorganisms: the microorganisms produce the product from the DNA during fermentation. Then we have the cell lysis, where we disrupt the cell membrane to release the product and everything that is in the cell, so that we have access to the product. Then comes the separation. In the separation part, we remove the cell fragments, such as the cell membrane or the DNA. Then we come to the downstream part, which is based on three different chromatography steps and allows the purification of the product.
So the product is shown in yellow in the lower part of the slide, and we can see that during each of the chromatography steps we are able to purify the product a little bit more. At the end, we perform a sterile filtration of the product. The goal was to increase the overall product yield, and to do that we first collected the data of the 80 batches and organized them in a way that we could analyze. Then we performed a yield analysis and discussed the results with the process experts, the SMEs (subject matter experts). Then we went to the data analysis for the upstream part and performed the four analyses listed on the left of the slide. Then we did the same for the downstream part, focusing on Chromatography 1. At the end, we drew conclusions from everything we saw in the analyses and from what the subject matter experts told us, and finally we recalculated the yield. Let's see how we organized our data. We based the data handling on the tidy data principles. That is a big part of the work before the analysis, and it takes time, but it is really important to have clean data in order to make an efficient analysis afterwards. First, each table corresponds to one observational unit, for example the fermentation. Then, each row of the file contains one batch. Each column holds one parameter, for example, for fermentation, the pH, the temperature, and the titer (that is, the amount of product at the end). Each cell then contains the value corresponding to that column and that batch. With this structure we can go to JMP and perform the analysis. So let's see how we calculated the yield. We calculated the yield for each step, beginning at 100% for fermentation, and looked at how it decreases along the process. What we observed is that we have a big variation at the fermentation step. Then we have a decrease in the product amount at the separation step, as well as at the Chromatography 1 step. We went with these data to the subject matter experts, and they told us that the complex media variability impacts the final titer of fermentation, so we had to explore this spot. Then, for the separation, the strategy that was chosen could have a different impact on the mass ratios. And for Chromatography 1, the pooling strategy most probably has an impact. So then we will see what the data said. We looked at the upstream part and performed different analyses. The first analysis was the multivariate analysis of each of these USP process stages. We focused on the fermentation, cell lysis, and separation, and looked at how all the parameters could correlate with the final product. Here, for fermentation, what we see is that the amount added to Reactor 1 had a medium correlation with a good significance probability. For separation, the final mass ratio and the mass ratio at the intermediate separation both have a major impact with a significant probability. You can see that other parameters, such as the initial pH of Reactor 2, are very close to the medium correlation threshold and have a significant probability, and we will see this parameter again in the next analyses. We also kept only the parameters that are scientifically meaningful for the other analyses. Then we went to the partial least squares. For the partial least squares, we see that for fermentation we have a positive correlation for all these parameters.
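(A minimal JSL sketch of the kind of multivariate correlation screen described above; the column names are hypothetical placeholders for the fermentation and separation parameters.)

dt = Current Data Table();
dt << Multivariate(
	Y( :Name( "Amount Reactor 1" ), :Name( "Amount Reactor 2" ),
	   :Name( "Initial pH Reactor 2" ), :Name( "Final Titer" ) ),
	Pairwise Correlations( 1 )   // correlations together with their significance probabilities
);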
So again we see the amount of Reactor 2, the initial pH of Reactor 2, and the initial amount of Reactor 2, as well as a new parameter, the hold time. We also see that the amount of Reactor 2 has a positive correlation in this analysis but a negative one in the MVA. This can be explained by the fact that the 80 batches were production runs; they were not designed to answer the question of a positive or negative correlation with the final product. That could be done in the future in another analysis with a proper design. We can still say that these parameters have an impact on the final product. For the other parameters, at the other steps of the upstream part, we also see that the prediction matches the multivariate results, and we have a possibility to improve the titer. Here we see with the prediction profiler that we can also optimize the product yield in the future. Then we tried the predictor screening. We ran the predictor screening 10 times, and the parameters that were always found in the top 10 were selected. What we see are the initial pH of Reactor 2, the mass ratio at the end of separation, the mass ratio at the intermediate separation, the initial amount of Reactor 2, the amount of Reactor 2, and the amount of Reactor 1. So again, the same parameters appear to have an impact. Then we went to machine learning. This machine learning analysis, XGBoost, is a decision-tree machine learning algorithm. To avoid having parameters in the result that are not really at the top of our parameter list, we included a fake parameter, which gives us a kind of threshold in the parameter importance. All the parameters that appear above this threshold were considered to have an impact. The others are considered to be random, falling below this random parameter, and to have no impact, or no significant impact, on the final product. Here we can see that a negative correlation appears for Reactor 1. For the pH of Reactor 2 and the initial amount of Reactor 2, we have a positive correlation, and for the mass ratio, we have a negative correlation. Again, as I explained before, distinguishing between negative and positive correlation was not the goal and the experiment was not designed for it, so we know there is an impact, but we do not yet know whether it is positive or negative. Then we go to the downstream part, specifically Chromatography 1, and here we used neural predictive modeling. In the neural predictive modeling we used the different fractions of the chromatography. On the graph on the right, we see that Fraction 8 is the main fraction, where we find most of the product and the highest purity. Then, going down from Fraction 7 to Fraction 1, we still have product, but also more impurity. Until now, we were taking fractions down to Fraction 4 into account in our analysis, and we would like to see if we can also include Fractions 3, 2 and 1. What we saw is that by increasing the number of fractions, we increase the yield while decreasing the purity only very little. On the graph on the left, we see that if we go to Fraction 2, we decrease the purity by less than 1% while the yield increases by about 5%. When we then include Fraction 1, we have a bigger decrease in purity, a little more than 1%, while the yield on the other side increases by about 10%. With these results, we then tried to summarize everything together.
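(A minimal JSL sketch of the predictor screening step; the column names are hypothetical placeholders modeled on the parameters named in the talk. In the talk the screening was rerun 10 times and only the parameters that always ranked in the top 10 were kept.)

dt = Current Data Table();
dt << Predictor Screening(
	Y( :Name( "Final Titer" ) ),
	X( :Name( "Amount Reactor 1" ), :Name( "Amount Reactor 2" ),
	   :Name( "Initial pH Reactor 2" ), :Name( "Initial Amount Reactor 2" ),
	   :Name( "Mass Ratio Final" ), :Name( "Mass Ratio Intermediate" ) )
);
// Rerunning this launch (for example in a For loop) and comparing the rankings
// across runs reproduces the repeated-screening idea described in the talk.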
So for fermentation, we have the final volume of the tanks and reactors, which was identified by most of the methods. The initial pH of the fermenter was also identified by the different analysis methods, and the complex compound variability by the process experts. To be able to see the effect of the complex compound variability, we will need further investigation in the lab. Then, for separation, we have the mass ratio, which was identified by some of the analysis methods but also by the process experts. The strategy is very interesting: the process experts decided to look at it and try some tests to be able to improve the yield in production. For Chromatography 1, the pooling strategy was identified by the process experts and the neural network analysis. Here the method can be easily implemented in the lab and also in production, and the yield really increases a lot with this method. Then we recalculated, with the prediction profiler, how much we can increase the yield of the different steps. For fermentation we were able to increase the yield by up to 16%. For the separation, we are able to increase it by up to 5%, and for Chromatography 1 by up to 5% as well. On the other slide we wrote up to 10%, so we took the worst-case scenario and say up to 5%. At the end, we have a total increase of up to 26%, so this is a good way to improve our process and to focus exactly on the parts where we can have a big impact. And it is just based on the data, without doing a lot of experiments in the lab; it is also cheaper to do these analyses with JMP than to run a lot of experiments in the lab. So we have a lot of gains at the end. Thank you very much to all of you for listening to me today, and also a big thank you to my colleagues Ludovic, Helge, Lea, Nichola and Sven for their help with this presentation.
Georg Raming, Senior Manager, Siltronic AG   In the course of daily work, users often need to analyze the same or similar data from distributed sources. Because these users are rarely involved in defining the IT infrastructure, it is often the case that the needed data is located across a variety of different platforms (databases, fileservers, etc.). Users are then forced to spend a lot of time querying and combining the data to get it in a form appropriate for analysis. JMP offers several possibilities to query data from different sources and to connect them afterwards. In this presentation, some examples of workflows are shown that can be used to efficiently get data into table(s) and effectively meet the requirements of analytical users. Methods used to accomplish these tasks include Query Builder, SQL, JSL, JMP add-ins and Virtual Join.     Auto-generated transcript...   Speaker Transcript Georg Raming (Siltronic AG) Hello everybody. Today I want to talk about strategies and examples for data acquisition from distributed and complex sources. My name is Georg Raming; my job is process development of ??? grown single crystals at Siltronic. I have some experience with statistical evaluation of process and product data and with statistical education. I am also responsible for the JMP software at Siltronic and for training activities for a few hundred users. Siltronic is one of the world's leading manufacturers of highly specialized hyperpure silicon wafers. Some technical hints: I am working with data tables instead of a database in this presentation, but the concepts shown here are originally used for getting data from relational databases via an ODBC connection, and the JMP Query Builder works in a similar way for JMP data tables and for database tables. What is the target of today's presentation? It is to establish some ways of getting data from a database into a JMP table in an easy and efficient way. Let's first talk about the building blocks. One may be a JMP data table. If you have generated a data table from a database query, it may look like this. I have taken here the famous sample table from JMP, Big Class, and I have queried the data via the JMP Query Builder. I deleted all the scripts that come with the table. For a data table queried with the JMP Query Builder, you will find several scripts and a table variable. Inside that variable, you can see the SQL, that is, the definition of how the data table is drawn from the source table or database. And you have these scripts: a source script that simply gets another copy of the original data table, like this. Here you can see the original scripts are also in there, but there are again these scripts from the JMP Query Builder. There is also a modify query script that lets you edit this query on the JMP data table and also run it. And there is a query for update from database. So if you change the data in your data table, or there is new data in the data table, you can simply update the data by pressing this button. Like this. Here the scripts have again come from the sample table. Okay, so the next building block is the JMP Query Builder. Once you have one or several tables at the database that you want to query, like here, where I took Big Class and Big Class Families from the JMP sample table directory, you use the JMP Query Builder on the tables, like here: Tables, JMP Query Builder.
JMP sees these open tables, and there is also one in the primary field, and I can add a secondary table like this; JMP then automatically creates a join between both tables, that is, how they are joined. And you can edit this join, like this. You see the details here. Then you can go on building the query, like this. You have both tables available here. Here you can also edit the join again. And maybe you want to add all columns, like this. And maybe you want to put a filter, like this. And run the query. Then you again get this result table, with the scripts from the sample tables that we do not need here; I delete them. And these are the scripts and the variable written by the JMP Query Builder, like I used before. And again, here you can edit this query. The script for the query is saved in this result table, just as we defined it before. So this is how the JMP Query Builder works. And as mentioned before, it works the same way on database tables as it works here on the JMP sample tables. For the next building block, I also put some scripts here, and you can always switch from the visual Query Builder to a custom SQL query. This works here, in another red triangle menu: Convert to Custom SQL. What you have defined here visually, you can then define here as a text SQL query. You can run it. And when looking at this query, or pressing the Modify Query button, you will see the text query. Note that this conversion works only in one direction, from the visual query to the custom SQL query. So let me just tidy up some tables. Okay. The next building block is getting data from a database, and for that you need an ODBC connection to the data sources. Usually the ODBC data sources and drivers are set up by your IT administration. Under Windows, you can find these connections in the ODBC Data Source Administrator. Please keep in mind that the bitness of the database sources should match JMP: in this case I have a 64-bit JMP application and therefore I need 64-bit drivers. The user data sources you can set up yourself, while system DSNs are set up by the administrator, and drivers for the different types of databases are also set up by the administrator. In case of troubleshooting, you may use this tab for tracing where the query went and what the problem may be. A tip from my side would be to check whether the connection works properly from these data sources: if you press Configure in the Data Source Administrator, you can, for a certain source, test the connection after you have provided your credentials. If everything is okay, the system tells you that the connection succeeded, so if the query then does not work, it is not up to Windows. Other details you can find in the JMP manual, in the documentation, for example Using JMP, chapter three; you can find it here in the Help menu, in the JMP documentation library, and there it is well documented how to use all these tools. Okay, so in scripting, in JSL, the scripting language, there are three ways to script these SQL database queries. The first one is New SQL Query, as we have already used before. It is located here in the JMP menu under Tables, JMP Query Builder for JMP tables, or, for a database, you can find it under Database, Query Builder. New SQL Query is the most powerful command of all: it generates a new query object that you can save as a JMP query file.
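(A minimal JSL sketch of a New SQL Query call with a custom SQL statement; the DSN, table and column names are placeholders, not taken from the talk.)

query = New SQL Query(
	Connection( "ODBC:DSN=MyDatabase;" ),
	QueryName( "BigClassQuery" ),
	CustomSQL( "SELECT name, age, sex FROM BigClass WHERE age >= 13" )
);
dt = query << Run Foreground();   // returns the result as a JMP data table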
It can generate a data table directly and documents the origin of the data very well, as we have seen before with the source, modify, and update scripts. Another command in JSL to query data from a database is the Open Database command. You can find documentation of all these commands in the Scripting Index; like here, if I type Open Database, you can find how it works, as for all commands. And a third way to connect to a database and send some queries is a database connection. It needs three steps: the first one is to create a database connection, the second is to send your queries, one or several, to the database, and to finalize, you need to close the database connection. And finally, if you wrote some nice scripts, you should make them accessible to yourself, or maybe to colleagues, and there are at least two ways. One is to put them in an add-in so that you can also provide them to others easily, and the other possibility is to put them in a custom menu, like I did here. You can find it in the menu, then, here like this. There are two add-ins installed on my system, and here is my personal custom menu. So, we have finished the building blocks; let's go to the examples. The first example is to use a table script to save the table layout and the query. It may look like this: if you have a data table that comes from a database and you have put some scripts in it, for a nice graph also, you may want to use it every day. And there is a nice possibility here to say Copy Table Script Without Data. This script you can then use. I would like to make a new script, sorry, and paste it into the script window. And here comes that new table, by simply pressing the Run Script button. It comes empty, without data. You can save this script, for example, in your menu, and to get the data, which may be a huge amount of data, you can simply press the Update from Database button and the table gets filled with all the data from the database or somewhere else. So this is a nice possibility to simply use the script to get the data from the database. The second example is to use a query from inside a table, like this: you have a data table and you want to query some additional data, depending on the content of your current table. I will show it. I delete some rows, and with these three names I want to go to a different table to fetch some data. It is done by this script, and here you can see that from this table I got the names and made a query to fetch data from another table, Big Class Families. How does it work? You can find it in this script; it is quite short. Here, the names are taken from the first table, as defined like here. The names are substituted into the query, and here the query is sent to the table Big Class Families. Okay, I need to close this table too. The next one is a custom query script with a graphical user interface. If you have, for example, a large data table on the database and often need small amounts of it, you can of course pull all the data, but maybe it is more flexible and efficient to get well-filtered data. It works like this: there are two tables, one table for filtering the data and one table with all the columns of the big data table. Here we can filter graphically, let's say female and age, so I took only two columns to filter, like this, 12 and 13 years old. And I may want to have all columns or remove just one column.
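(Before the examples continue, here is a minimal hedged sketch of the other two JSL routes just mentioned; the DSN, credentials and table names are placeholders, not taken from the talk.)

// Open Database: connection string, SQL statement, name of the output table
dt = Open Database( "DSN=MyDatabase;UID=user;PWD=secret;",
	"SELECT * FROM BigClass", "Big Class from DB" );
// Explicit connection: create it, send one or more queries, then close it
dbc = Create Database Connection( "DSN=MyDatabase;UID=user;PWD=secret;" );
dt1 = Execute SQL( dbc, "SELECT * FROM BigClass", "Big Class from DB" );
dt2 = Execute SQL( dbc, "SELECT * FROM BigClassFamilies", "Families from DB" );
Close Database Connection( dbc );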
I query it, and here I get the proper result with these restrictions. And as you can see here in the source script, or in the SQL, the restrictions defined in the graphical user interface are applied. And how does the scripting work? It is a little more complex, of course. There is the filter query and the columns query. Here is a script, a function, that converts the data filter conditions into an SQL condition. This is the GUI part, the graphical user interface. And here, finally, the graphical user interface results are evaluated and put into this custom script. The next example is a two-step query. Let's assume you first have to query some batch IDs and then take these IDs into another query, maybe to another data table, to get additional data. You can do it like this. Here I took the Big Class data table as the filter, so I filtered only the male pupils who are 13 years old or younger, and took these names to query Big Class Families, all rows, as you can see here. Both tables are connected via a virtual join, as you can see here. This table refers via the name to the other table, the filter table, and I can use the columns of the filter table here in this table as if they were in this table. So this is a nice way to have several tables and use the content of one table in another. And the last example is how to run two or several queries in the background in parallel. I have no example here; it is discussed in the JMP Community, and if you are interested, you can have a look there. So now I am finished with my examples, and the journal, presentation material and scripts are available online. Thanks to the Community for the wonderful discussions, and thanks to the JMP developers for building and maintaining this great piece of software. And finally, thank you for your attention.
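(A hedged JSL sketch of the virtual join setup described in the two-step query example; the table names, column name and file path are placeholders, and the scripting follows the Link ID and Link Reference column properties used for virtual joins.)

dtFilter = Data Table( "Filter Table" );
dtMain = Data Table( "Big Class Families" );
// The key column in the filter table becomes the Link ID ...
dtFilter:name << Set Property( "Link ID", 1 );
// ... and the matching column in the main table references the filter table,
// making the filter table's columns available in the main table.
dtMain:name << Set Property( "Link Reference", Reference Table( "Filter Table.jmp" ) );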
Ron Kenett, Chairman, KPA Group and Samuel Neaman Institute, Technion Christopher Gotwalt, JMP Director, Statistical R&D, SAS   Data analysis – from designed experiments, generalized regression to machine learning – is being deployed at an accelerating rate. At the same time, concerns with reproducibility of findings and flawed p-value interpretation indicate that well-intentioned statistical analyses can lead to mistaken conclusions and bad decisions. For a data analysis project to properly fulfill its goals, one must assess the scope and strength of the conclusions derived from the data and tools available. This focus on statistical strategy requires a framework that isolates the components of the project: the goals, data collection procedure, data properties, the analysis provided, etc. The InfoQ Framework provides structured procedures for making this assessment. Moreover, it is easy to operationalize InfoQ in JMP. In this presentation we provide an overview of InfoQ along with use case studies, drawing from consumer research and pharmaceutical manufacturing, to illustrate how JMP can be used to make an InfoQ assessment, highlighting situations of both high and low InfoQ. We also give tips showing how JMP can be used to increase information quality by enhanced study design, without necessarily acquiring more data. This talk is aimed at statisticians, machine learning experts and data scientists whose job it is to turn numbers into information.     Auto-generated transcript...   Speaker Transcript   Hello. My name is Ron Kenett. This is a joint talk with Chris Gotwalt, and we basically have two messages that should come out of the talk. One is that we should really be concerned about information and information quality. People tend to talk about data and data quality, but data is really not the issue. We are statisticians, we are data scientists; we turn numbers, data, into information, so our goal should be to make sure that we generate high quality information. The other message is that JMP can help you achieve that, and this actually turns out to happen in surprising ways. So by combining the expertise of Chris and an introduction to information quality, we hope that these two messages will come across clearly. If I had to summarize what it is that we want to talk about, after all, it is all about information quality. I gave a talk at the Summit in Prague four years ago, and that talk was generic; it talked about my journey from quality by design to information quality. In this talk we focus on how this can be done with JMP. This is a more detailed and technical talk than the general talk I gave in Prague. You can watch that talk; there is a link listed here, and you can find it on the JMP Community. So we are going to talk about information quality, which is the potential of a data set, a specific data set, to achieve a particular goal, a specific goal, with a given empirical method. In that definition we have three components that are listed. One is a certain data set; here is the data. The second one is the goal, the goal of the analysis, what it is that we want to achieve. And the third one is how we will do that, that is, with what methods we are going to generate information. That potential is going to be assessed with a utility function.
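(For reference, the definition just stated is usually written compactly in the published InfoQ work as a utility applied to the analysis of the data conditioned on the goal:

\mathrm{InfoQ}(f, X, g) = U( f(X \mid g) )

where g is the analysis goal, X the available data set, f the empirical analysis method, and U the utility measure.)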
And I will   begin with an introduction to   information quality, and then   Chris will will take over,   discuss the case study and and   show you how to conduct an   information quality assessment.   Eventually this should   answer the question how JMP   supports InfoQ, that   would be the the bullet points   that you can...the take away points   from the talk. The setup for   this is that we we encourage   what I called a lifecycle view   of statistics. In other words,   not just data analysis.   We should know...we should be part   of the problem elicitation   phase. Also, the goal   formulation phase, that deserves   a discussion. We should   obviously be involved in the   data collection scheme if it's   through experiments or through   surveys or through observational   data. Then we should also take   time for formulation of the   findings and not just pull out   printed reports on on   regression coefficients   estimates and and their   significance, but we should   also discuss what are the   findings? Operationalization of   findings meaning, OK, what can we   do with these findings? What are   the implications of the   findings? This should should...   needs to be communicated to the   right people in the right way,   and eventually we should do an   impact assessment to figure out,   OK, we did all this; what has   been the added value of our   work? I talked about the life   cycle of your statistics a few years   ago. This is the prerequisite,   the perspective to what   I'm going to talk about. So as I   mentioned, the information   quality is the potential of a   particular data set to achieve a   particular goal using given   empirical analysis methods. This   is identified through four components the goal, the data,   the analysis method, and the   utility measure. So in a in a   mathematical expression, the   utility of applying f to x,   condition on the goal, is how we   identify InfoQ, information   quality. This was published in   the Royal Statistical Society   Series A in 2013 with eight   discussants, so it was amply   discussed. Some people thought   this was fantastic and some   people had a lot of critique on   that idea, so this is a wider   scope consideration of what   statistics is about. We also   wrote in 2006, we meeting myself   and Galit Shmueli, a book called   Information Quality. And in   the context of information   quality we did what is called   deconstruction. David Hand has   a paper called Deconstruction   of Statistics. This is the   deconstruction of information   quality into eight dimensions. I   will cover these eight dimensions.   That's my part in the talk and   then Chris will show how this   is implemented in a specific   case study.   Another aspect that relates to   this is another book I have.   This is recent, a year ago   titled The Real Work of Data   Science and we talk about what   is the role of the data   scientists in organizations and   in that context, we emphasized   the need for the data scientist   to be involved in the generation   of information as an...information   quality as meeting the goals of   the organization. So let me   cover the eight dimensions. That's   that's that's my intro. The   first one is data resolution. We   have a goal. OK, we we would   like to know the level of flu   because in the country or in the   area where we live, because that   will impact our decision on   whether to go to the park where   we could meet people or going to   a...to a jazz concert. And that   concert is tomorrow.   
If we look up the CDC data on   the level of flu, that data is   updated weekly, so we could get   the red line in the graph you   have in front of you, so we   could get data of a few days   ago, maybe good, maybe not good   enough for our gold. Google Flu,   which is based on searches   related to flu, is updated   momentarily, so it's updated   online, it will probably give   us better information. So for   our goal, the blue line, the   Google trend...the Google Flu   Trends indicator, is probably   more appropriate. The second   dimension is data structure.   To meet our goal, we're going to   look at data. We should...we   should identify the data sources   and the structure in these data   sources. So some data could be   text, some could be video, some   could be, for example, the   network of recommendations. This   is an Amazon picture on how if   you look at the book, you're   going to get some   other books recommended. And if   you go to these other books,   you're going to have more data   recommended. So the data   structure can come in all sorts   of shapes and forms and this can   be text. This can be functional   data. This can be images. We are   not confined now to what we used   to call data, which is what you   find in an Excel spreadsheet.   The data could be corrupted,   could have missing values, could   have unusual patterns which   which would be   something to look into. Some   patterns, where things are   repeated. Maybe some of the data   is is just copy and paste and we   would like to be warned about   such options. The third   dimension is data integration.   When we consider the data from   these different sources, we're   going to integrate them so we   can do some analysis linkage   through an ID. For example, we   would do that, but in doing that,   we might create some issues, for   example, in in disclosing data   that normally should be   anonymized. Data   integration, yeah, that will   allow us to do fantastic things,   but if the data is perceived to   have some privacy exposure   issues, then maybe the quality   of the information from the   analysis that we're going to do   is going is going to be   affected. So data integration   should be looked into very, very   carefully. This is what people   likely used to call ETL   extract, transform and load. We   now have much better methods for   doing that. The join option, for   example, in JMP will offer   options for for doing that.   Temporal relevance. OK, that's   pretty clear. We have data. It is   stamped somehow. If we're going   to do the analysis later, later   after the data collection, and   if the deployement that we   consider is even later, then the   data might not be temporally   relevant. In a common   situation, if we want to compare   what is going on now, we would   like to be able to make this   comparison to recent data or   data before the pandemic   started, but not 10 years   earlier. The official statistics   on health records used to be two   or three years behind in terms   of timing, which made it very   difficult the use of official   statistics in assessing   what is going on with the   pandemic. Chronology of data and   goal is related to the decision   that we make as a result of our   goal. So if, for example, our   goal is to forecast air quality,   we're going to use some   predictive models on the Air   Quality Index reported on a   daily basis. This gives us a one   to six scale from hazardous to   good. 
There are some values   which are representing levels of   health concern. Zero-50 is good;   300-500 is hazardous and the   chronology of data and goal   means that we should be able to   make a forecast on a daily   basis. So the methods we use   should be updated on a daily   basis. If, on the other hand,   our goal is to figure out how is   this AQI index computed, then we   are not really bound by the the   the timeliness of the analysis.   You know, we could take our   time. There's no urgency in   getting the analysis done on a   daily basis. Generalizability,   the sixth dimension, is about   taking our findings and   considering where this could   apply in more general terms,   other populations, all   situations. This can be done   intuitively. Marketing managers who   have seen a study on the on the   market, let's call it A, might   already understand what are the   implications to Market B   without data. People who are   physicists will be able to   make predictions based on   mechanics on first principles   without without data.   So some of the generalizability   is done with data. This is the   basis of statistical   generalization, where we go from   the sample to the population.   Statistical inferences is about   generalization. We generalize   from the sample to the   population. And some can be   domain based, in other words,   using expert knowledge, domain   expertise, not necessarily with   data. We have to recognize that   generalizability is not just   done with statistics.   The seventh dimension is   construct operationalization,   which is really about what it is   that we measure. We want to   assess behavior, emotions, what   it is that we can measure, that   will give us data that reflects   behavior or emotions.   The example I give here   typically is pain.   We know what is pain. How do   we measure that? If you go to a   hospital and you ask the nurse,   how do you assess pain, they   will tell you, we have a scale,   1 to 10. It's very   qualitative, not very   scientific, I would say. If we   want to measure the level of   alcohol in drivers on the...on   the road, it will be difficult to   measure. So we might measure   speed as a surrogate measure.   Another part of   operationalization is the other   end of the story. In other   words, the first part, the   construct is what we measure,   which reflects our goal. The the   end...the end result here is that   we have findings and we want to   do something with them. We want   to operationalize our finding.   This is what the action   operationalization is about.   It's about what you do with the   findings and then being   presented here on a podium. We   used to ask three questions.   These are very important   questions to ask. Once you have   done some analysis, you have   someone in front of you who   says, oh, thank you very much,   you're done...you, the statistician   or the data scientist. So this   this takes you one extra step,   getting you to ask your customer these simple questions What do   you want to accomplish? How will   you do that and how will you   know if you have accomplished   that? We we can help answer, or   at least support, some of these   questions we've answered.   The eighth dimension is   communication. I'm giving you an   example from a very famous old   map from the 19th century, which   is showing the march of the   Napoleon Army from France to   Moscow to Russia. 
You see that the width of the path indicates the size of the army, and then in black you see what happened to them on their way back. So basically this was a disastrous march. We can relate this old map to existing maps, and there is a JMP add-in, which you can find on the JMP Community, to show you with dynamic bubble plots what this looks like. So I have covered very quickly the eight information quality dimensions. My last slide puts what I have talked about into a historical perspective, to really give some proportion to what I am saying. I think we are really in the era of information quality. We used to be concerned with product quality in the 17th and 18th centuries. We then moved to process quality and service quality; this is a short memo proposing a control chart, from 1924, I think. Then we moved to management quality. This is the Juran trilogy of design, improvement and control. The Six Sigma process (define, measure, analyze, improve, control) is the improvement process of Juran, and Juran was the grandfather of Six Sigma in that sense. Then in the '80s, Taguchi came in. He talks about robust design: how can we handle variability in inputs by proper design decisions? And now we are in the age of information quality. We have sensors, we have flexible systems, we are depending on AI and machine learning and data mining, and we are gathering big numbers, which we call big data. Information quality should therefore be a prime interest. I am going to try to convince you, with the help of Chris, that we are here, and that JMP can help us achieve that in a really unusual way. What you will see at the end of the case study that Chris will show is also how to do an information quality assessment on a specific study and basically generate an information quality score. So, going top down, I can tell you this study, this work, this analysis is maybe 80%, or maybe 30%, or maybe 95%. Through the example you will see how to do that. There is a JMP add-in to provide this assessment; it is actually quite easy, there is nothing really sophisticated about it. So I am done, and Chris, after you. Thanks, Ron. So now I am going to go through the analysis of a data set in a way that explicitly calls out the various aspects of information quality and show how JMP can be used to assess and improve InfoQ. First off, I am going to go through the InfoQ components. The first InfoQ component is the goal: in this case the problem statement was that a chemical company wanted a formulation that maximized product yield while minimizing a nuisance impurity that resulted from the reaction. That was the high-level goal. In statistical terms, we wanted to find a model that accurately predicted a response on a data set so that we could find a combination of ingredients and processing steps that would lead to a better product. The data consist of 100 experimental formulations with one primary ingredient, X1, and 10 additives. There is also a processing factor and 13 responses. The data are completely fabricated but were simulated to illustrate the same strengths and weaknesses as the original data. The day each formulation was made was also recorded.
We will be looking at this data closely, so I won't elaborate here beyond pointing out that they were collected in an ad hoc way, changing one or two additives at a time, rather than as a designed or randomized experiment. There are a lot of ways to analyze this data, the most typical being least squares modeling with forward selection on selected responses. That was my original intention for this talk, but when I showed the data to Ron, he immediately recognized the response columns as time series from analytical chemistry. Even though the data were simulated, he could see the structure; he could see things in the data that I didn't see. I found this to be strikingly impressive. It's beyond the scope of this talk, but there is an even better approach based on ensemble modeling using fractionally weighted bootstrapping. Phil Ramsey, Wayne Levin and I have another talk about this methodology at the European Discovery Conference this year. The approach is promising because it can fit models to data with more active interactions than there are runs. The fourth and final component of information quality is utility, which is how well we are able to achieve our goals, or how we measure how well we have achieved them. There is a domain aspect, which in this case is that we want a formulation that leads to maximized yield and minimized waste in post-processing of the material. The statistical analysis utility refers to the model that we fit; what we're going for there is least squares accuracy of our model, in terms of how well we're able to predict what would result from candidate combinations of mixture factors. Now I'm going to go through a set of questions that make up a detailed InfoQ assessment, organized into the eight dimensions of information quality. I want to point out that not all questions will be equally relevant to different data science and statistical projects, and that this is not intended to be rigid dogma but rather a set of things that are a good idea to ask oneself. These questions represent a kind of data analytic wisdom that looks more broadly than just the application of a particular statistical technology. A copy of a spreadsheet with these questions, along with pointers to the JMP features that are the most useful for answering a particular one, will be uploaded to the JMP Community along with this talk for you to use. As I proceed through the questions, I'll be demoing an analysis of the data in JMP. Question 1 is: is the data scale used aligned with the stated goal? The Xs that we have consist of a single categorical variable, processing, and the 11 continuous inputs. These are measured as percentages and are also recorded to half a percent. We don't have the total amounts of the ingredients, only the percentages. The totals are information that was either lost or never recorded. There are other potentially important pieces of information that are missing here. The time between formulating the batches and taking the measurements is gone, and there could have been other covariate-level information that is missing here that would have described the conditions under which the reaction occurred.
Without more information than I   have, I cannot say how important   this kind of covariate information   would have been. We do have   information on the day of the   batch, so that could be used as   a surrogate possibly. Overall we   have what are, hopefully, the most   important inputs, as well as   measurements of the responses we   wish to optimize. We could have   had more information, but this   looks promising enough to try   and analysis with. The second   question related to data   resolution is how reliable and   precise are the measuring devices   and data sources. And the fact   is, we don't have a lot of   specific information here. The   statistician internal the   company would have had more   information. In this case we   have no choice but to trust that   the chemists formulated and   recorded the mixtures well. The   third question relative to data resolution is is the data   analysis suitable for the data   aggregation level? And the   answer here is yes, assuming   that their measurement system is   accurate and that the data are   clean enough. What we're going   to end up doing actually is   we're going to use the   Functional Data Explorer to   extract functional principal   components, which are a data   derived kind of data   aggregation. And then we're   going to be modeling those   functional principal components   using the input variables. So   now we move on to the data   structure dimension and the   first question we ask is, is the   data used aligned with the   stated goal? And I think the   answer is a clear yes here. We're   trying to maximize   yield. We've got measurements for   that, and the inputs are   recorded as Xs. The second data   structure question is where   things really start to get   interesting for me. So this is   are the integrity details   (outliers, missing values, data   corruption) issues described and   handled appropriately? So from   here we can use JMP to be able   to understand where the outliers   are, figure out strategies for   what to do about missing values,   observe their patterns and so   on. So this is this is where   things are going to get a little   bit more interesting. The first   thing we're going to do is we're   going to determine if there are   any outliers in the data that we   need to be concerned about. So   to do that, we're going to go   into the explore outliers   platform off of the screening   menu. We're going to load up the   response variables, and because   this is a multivariate setting,   we're going to use a new feature   in JMP Pro 16 called Robust   PCA Outliers. So we see where   the large residuals are in those   kind of Pareto type plots.   There's a snapshot showing where   there's some potentially   unusually large observations. I   don't really think this looks   too unusual or worrisome to me.   We can save the large outliers   to a data table and then look at   them in the distribution   platform and what we see kind of   looks like a normal distribution   with the middle taken out. So I   think this is data that are   coming from sort of the same   population and there's nothing   really to worry about here,   outliers-wise. So once we've   taken care of the outlier   situation we go in and explore   missing values. So what we're   going to do first is we're going   to load up the Ys as...into the   platform, and then we're going   to use the missing value   snapshot to see what patterns   they are amongst our missing   values. 
It looks like the missing values tend to occur in horizontal clusters, and the same missing values also occur across rows; you can see that with the black splotches here. Then we'll apply an automated data imputation, which saves formula columns that impute missing values in new columns using a regression-type algorithm that was developed by a PhD student of mine named Milo Page at NC State. So we can play around a little bit and get a sense of how the ADI algorithm is working. It has created these formula columns that peel off elements of the ADI impute column, which is a vector formula column, and the scoring impute function calculates the expected value of the missing cells given the non-missing cells whenever it finds a missing value; otherwise it just carries through the non-missing value. So you can see 207 in Y6 there: it's initially 207, but then I change it to missing and it's now being imputed to be 234. I'll do this a couple of times so you can see how it's working. Here I'll put in a big value for Y7, and that has now been replaced. And if we go down and add a row, then all values are missing initially and the column means are used for the imputations. If I were to add values for some of those missing cells, it would start computing the conditional expectation of the still missing cells using the information that's in the non-missing ones. Our next question on data structure is: are the analysis methods suitable for the data structure? We've got 11 mixture inputs and a processing variable that's categorical. Those are going to be inputs into a least squares type model. We have 13 continuous responses, and we can model them individually using least squares, or we can model functional principal components. Now there are problems. The input variables have not been randomized at all. It's very clear that they would muck around with one or more of the compounds and then move on to another one, so the order in which the input variables were varied was kind of haphazard. It's a clear lack of randomization, and that's going to negatively impact the generalizability and strength of our conclusions. Data integration is the third InfoQ dimension. These data are manually entered lab notes consisting mostly of mixture percentages and equipment readouts. We can only assume that the data were entered correctly and that the Xs are aligned properly with the responses. If that isn't the case, then the model will have serious bias problems and problems with generalizability. Integration is more of an issue with observational data science problems and machine learning exercises than with lab experiments like this. Although it doesn't apply here, I'll point out that privacy and confidentiality concerns can be identified by modeling the sensitive part of the data using the to-be-published components of the data. If the resulting model is predictive, then one should be concerned that privacy requirements are not being met. Temporal relevance refers to the operational time sequencing of data collection, analysis and deployment, and whether gaps between those stages lead to a decrease in the usefulness of the information in the study.
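(A minimal JSL sketch of the two screening launches described above; the response column names Y1 to Y13 are placeholders, and the Robust PCA and Automated Data Imputation steps are invoked from the platforms' own options in JMP Pro, as shown in the demo.)

dt = Current Data Table();
// Multivariate outlier exploration on the responses
dt << Explore Outliers( Y( :Y1, :Y2, :Y3 ) );
// Missing value snapshot and imputation options for the same responses
dt << Explore Missing Values( Y( :Y1, :Y2, :Y3 ) );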
In this case, we can only hope that the material supplies are reasonably consistent and that the test equipment is reasonably accurate, which is an unverifiable assumption at this point. The time resolution we have for the data collection is at the day level, which means there isn't much way to verify whether there is time variation within each day. Chronology of data and goal is about the availability of relevant variables, both in terms of whether a variable is present at all in the data and whether the right information will be available when the model is deployed. For predictive models, this relates to models being fit to data similar to what will be present when the model is evaluated on new data, and in that sense our data set is certainly fine. For establishing causality, however, we aren't in nearly as good a shape, because the lack of randomization implies that time effects and factor effects may be confounded, leading to bias in our estimates. Endogeneity, or reverse causation, could clearly be an issue here, as variables like temperature and reaction time could be impacting the responses but have been left unrecorded. Overall, there is a lot we don't know about this dimension in an information quality sense. The rest of the InfoQ assessment is going to depend on the type of analysis we do. So at this point I'm going to conduct an analysis of this data using the Functional Data Explorer platform in JMP Pro, which allows me to model across all the response columns simultaneously in a way that's based on functional principal components; these contain the maximum amount of information across all those columns, represented in the most efficient format possible. I'm going to work on the imputed versions of the columns that I calculated earlier in the presentation. And I'm going to be working to find combinations of the mixture factors that achieve, as closely as possible in a least squares sense, an ideal curve created by the practitioner, one that maximizes the amount of potential product in a batch while minimizing the amount of the impurities they realistically thought a batch could contain. I begin the analysis by going to the Analyze menu and bringing up the Functional Data Explorer. This setup has rows as functions. I load up my imputed response columns, and then I put in my formulation components and my processing column as supplementary variables. We've got an ID function, which is batch ID. Once in, I can see the functions, both overlaid all together and individually. Then I load up the target function, which is the ideal; that will change the analysis that results once I start going into the modeling steps. These are pretty simple functions, so I'm just going to model them with B-splines. Then I go into my functional DOE analysis. This fits the model that connects the inputs to the functional principal components and then connects all the way through the eigenfunctions, so that we can recover the overall functions as they change while we vary the mixture factors.
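As a compact way to see what the functional DOE step is doing, the response curves can be written in a standard functional-principal-components form. The notation below is generic textbook notation, not taken from the JMP documentation.

% Each batch's response curve expanded in eigenfunctions (FPCs):
y_i(t) \;\approx\; \mu(t) + \sum_{k=1}^{K} s_{ik}\, \psi_k(t), \qquad K = 4 \text{ in this case study.}
% Functional DOE step: each FPC score is modeled as a function of the
% mixture inputs x and the processing variable z (here via pruned forward selection):
s_{ik} \;=\; f_k(x_i, z_i) + \varepsilon_{ik},
% so the predicted curve at a new setting (x, z) is recovered as
\hat{y}(t \mid x, z) \;=\; \mu(t) + \sum_{k=1}^{K} f_k(x, z)\, \psi_k(t).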
The functional principal component analysis has indicated that there are four dimensions of variation in these response functions. To understand what they mean, let's explore with the FPC profiler. Watch this pane right here as I adjust FPC 1: we can see that this FPC is associated with peak height. FPC 2 looks like a kind of peak narrowness; it's almost like a resolution principal component. The third one is related to a knee on the left of the dominant peak. And FPC 4 looks like it's primarily related to the impurity. So that's the underlying meaning of these four functional principal components. We've characterized our goal as maximizing the product and minimizing the impurity, and we've communicated that to the analysis through the ideal or golden curve that we supplied at the beginning of this FDE exercise. To get as close as possible to that ideal curve, we turn on desirability functions, and then we can maximize desirability. We find that the optimal combination of inputs is about 4.5% of Ingredient 4, 2% of Ingredient 6, 2.2% of Ingredient 8, and 1.24% of Ingredient 9, using processing method two. Let's review how we've gotten here. We first imputed the missing response values. Then we found B-spline models that fit those functions well in the FDE platform. A functional principal components analysis determined that there were four eigenfunctions characterizing the variation in this data, and these four eigenfunctions were determined, via the FPC profiler, to each have a reasonable subject matter meaning. The functional DOE analysis consisted of applying pruned forward selection to each of the individual FPC scores, using the DOE factors as input variables, and we see here that this has found the combinations of interactions and main effects that were most predictive for each of the functional principal component scores individually. The Functional DOE Profiler has elegantly captured all aspects in one representation that allows us to find the formulation and processing step that is predicted to have desirable properties, as measured by high yield and low impurity. So now we can do an InfoQ assessment of the generalizability of the data and the analysis. In this case, we're more interested in scientific generalizability, as the experimenter is a deeply knowledgeable chemist working with this compound, so we're going to rely more on their subject matter expertise than on statistical principles and tools like hypothesis tests. The goal is primarily predictive, but the generalizability is somewhat problematic because the experiment wasn't designed. Our ability to estimate interactions is weakened for techniques like forward selection and impossible via a least squares analysis of the full model. Because the study wasn't randomized, there could be unrecorded time-order effects. We don't have potentially important covariate information like temperature and reaction time, which creates another big question mark regarding generalizability. Repeatability and reproducibility of the study are also unknown here, as we have no information about the variability due to the measurement system.
Fortunately, we do have tools like JMP's Evaluate Design to understand the existing design, as well as the Augment Design tool, which can greatly enhance the generalization performance of the analysis. Augmenting can improve information about main effects and interactions, and a second round of experimentation could be randomized to further enhance generalizability. So now I'm going to go through a couple of simple steps to show how to improve the generalization performance of our study using the design tools in JMP. Before I do that, I want to point out that I had to convert the data to proportions rather than percents; otherwise the design tools did not agree with the data very well. We go into the Evaluate Design platform and load up our Ys and our Xs. I requested the ability to handle second order interactions, and I got an alert saying that it can't do that, because we're not able to estimate all the interactions given the one-factor-at-a-time data that we have. So I backed up. We go to the Augment Design platform, load everything up, and choose augment. We'll choose an I-optimal design because we're really concerned with prediction performance here. I set the number of runs to 148; the custom designer requested 141 as a minimum, but I went to 148 just to make sure we have the ability to estimate all of our interactions reasonably well. After that, it takes about 20 seconds to construct the design. Now that we have the design, I'm going to show the two most important diagnostic tools in the augment designer for evaluating a design. On the left, we have the fraction of design space plot. This shows that 50% of the volume of the design space has a prediction variance less than 1, where 1 is equivalent to the residual error. So we're able to get better-than-measurement-error quality predictions over the majority of the design space. On the right we have the color map on correlations. This shows that we're able to estimate everything pretty well. Because of the mixture constraint, we're getting some strong correlations between interactions and main effects, but overall the effects are fairly clean: the interactions are pretty well separated from one another, and the main effects are pretty well separated from one another as well. After looking at the design diagnostics, we can make the table. Here I have shown the first 13 of the augmented runs, and we see that we have more randomization; we don't have streaks where the same factor settings are used over and over again. That's evidence of better randomization, and overall the design is going to be able to estimate the main effects and interactions much better, having received higher quality information in this second stage of experimentation. So the input variables, the Xs, are accurate representations of the mixture proportions, which clearly captures what we're interested in. The responses are close surrogates for the amount of product and the amount of impurity in the batches, so we're in pretty good shape on question 7.1 there. The justifications are clear. After the study, we can of course go prepare a batch using the formulation recommended by the FDOE profiler, try it out, and see if we're getting the kind of performance we were looking for.
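Returning for a moment to the fraction of design space plot mentioned above, the "prediction variance less than 1" statement is in terms of the relative prediction variance. The standard textbook definition is sketched below; it is not copied from the JMP documentation.

% Relative (scaled) prediction variance at a point x in the design region,
% for model expansion f(x) and design matrix X:
\frac{\operatorname{Var}\!\left[ \hat{y}(x) \right]}{\sigma^2}
  \;=\; f(x)^{\top} \left( X^{\top} X \right)^{-1} f(x).
% Values below 1 mean the prediction variance at x is smaller than the
% residual (measurement) error variance sigma^2.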
It's very clear that that would be the way to assess how well we've achieved our study goals. Now for the last InfoQ dimension, communication. By describing the ideal curve as a target function, the Functional DOE Profiler makes the goal and the results of the analysis crystal clear, and it can be expressed at a level that is easily interpreted by the chemists and managers of the R&D facility. And because we have done our detailed information quality assessment, we've been up front about the strengths and weaknesses of the study design and data collection. If the results do not generalize, we certainly know where to look for the problems. Once you become familiar with the concepts, there is a nice add-in written by Ian Cox that you can use to do a quick quantitative InfoQ assessment. The add-in has sliders for the upper and lower bounds of each InfoQ dimension. These dimensions are combined using a desirability function approach into an overall interval for the InfoQ, over on the left. Here is an assessment for the data and analysis I covered in this presentation. The add-in is also a useful thinking tool that will make you consider each of the InfoQ dimensions, and it's a practical way to communicate InfoQ assessments to your clients or to your management, as it provides a high level view of information quality without using a lot of technical concepts and jargon; it gives information quality an easy-to-use interface. The add-in is also useful as the basis for an InfoQ comparison. My original hope for this presentation was to be a little more ambitious. I had hoped to cover the analysis I just went through as well as another, simpler one, where I skip imputing the responses and fit a simple multivariate linear regression model to the response columns. Today, I'm only able to offer a final assessment of that approach. As you can see, several of the InfoQ dimensions suffer substantially without the more sophisticated analysis. It is very clear that the simple analysis leads to a much lower InfoQ score; the upper limit of the simple analysis isn't that much higher than the lower limit of the more sophisticated one. With experience, you will gain intuition about what a good InfoQ score is for data science projects in your industry and pick up better habits, as you will no longer be blind to the information bottlenecks in your data collection, analysis and model deployment. This was my first formal information quality assessment. Speaking for myself, the information quality framework has given words and structure to a lot of things I already knew instinctively. It has already changed how I approach new data analysis projects. I encourage you to go through this process yourself on your own data, even if that data and analysis is already very familiar to you. I guarantee that you will be a wiser and more efficient data scientist because of it. Thank you.
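As a closing technical note on the add-in's "desirability function approach": in the InfoQ literature the overall score is commonly formed as a geometric mean of the individual dimension scores. Whether the add-in uses exactly this form is an assumption on my part.

% One common way to combine d_1, ..., d_8 (scores in [0,1] for the eight
% InfoQ dimensions) into an overall InfoQ score:
\mathrm{InfoQ} \;=\; \left( \prod_{i=1}^{8} d_i \right)^{1/8}.
% Applying this to the lower-bound sliders and to the upper-bound sliders
% separately would yield an overall interval like the one the add-in shows.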
André Caron Zanezi, Six Sigma Black Belt, WEG Electric Equipment Danilo da Silva Toniato, Quality Engineer, WEG Electric Equipment   Quality assurance and customer needs are rigorous requirements that frequently come down to reliability. Improving products in terms of reliability challenges engineers in multiple ways, including understanding cause-and-effect relationships and developing tests that reproduce customer conditions and properly generate reliable data without exceeding the product launch deadline. Combining engineering expertise, historical data and lab resources, a design of experiments (DOE) was performed to quantify product lifetime based on process, product and critical application variables. Performing several analyses using JMP tools, from the DOE platform to the Reliability and Survival modules, the team was able to describe product lifetime as a function of its critical factors. As a result, an accelerated life test was established that is able to simulate years of product usage in just a few weeks, providing solid evidence of some specific failure modes. By standardizing its methods and procedures, the test became a crucial requirement to verify and validate new technologies implemented in WEG motors, optimizing the development process and reducing time to market.   This poster provides information about how we used JMP to analyze data and develop an accelerated life test. The project followed a step-by-step approach. Project charter: to understand the primary and secondary objectives, a multidisciplinary team was formed to share information and knowledge about customer historical data, lab resources, motor reliability, cause-and-effect relationships, environmental application conditions and reliability data analysis. Historical data analysis: knowing and quantifying the risks of analyzing historical data, the team fitted life distributions to understand the Cycles to Failure (CTF) scale and shape parameters. The shape parameter, in particular, indicates the failure mode to be reproduced in the lab tests, according to the bathtub curve. DOE planning and analysis: in order to reproduce failures, the understanding of motor reliability was supported by cause-and-effect relationships provided by a Fault Tree Analysis (FTA). The FTA was a source of critical variables that were combined into a designed experiment (DOE) to quantify how to accelerate product cycles to failure. Conclusions: as a result, the DOE provided a surface profiler indicating the best conditions to accelerate product lifetime. The accelerated test also provided a shape parameter that, when compared with the historical data, shows an overlap, meaning that the same historical field failures were reproduced under controlled conditions. Implementation: with an accelerated test, development and innovation processes become faster while providing important information about product reliability.     Auto-generated transcript...   Speaker Transcript André Zanezi So, hello, everyone. My name is Andre Zanezi. I am a Six Sigma Black Belt at WEG, and I'm here today at Discovery Summit to talk about the development of an accelerated lifetime test to demonstrate and quantify washing machine motor reliability.   We know that every company, when developing new technologies and new solutions, often faces challenges in improving the reliability of its products.   We faced the same challenges.
The project goal was to analyze and quantify our historical reliability data and to develop a procedure, an accelerated life test in our internal labs, that reproduces the failure modes we see in the field. At the first step, we took historical data from our motors and, using the Reliability and Survival modules in JMP, fit life distributions to them. By fitting life distributions such as the Weibull distribution, we can understand our motors' reliability and lifetime. In JMP we can also fit different distributions for different failure modes, and we did that for the four main failure modes, compared them, analyzed them, and came to understand our motors' lifetime and reliability.   Doing that, we were able to quantify the scale and shape parameters: basically, how many cycles were necessary to produce a failure and, according to the bathtub curve, which kind of failure modes we were facing. We also cross-checked against our internal validation KPIs, basically taking the survival plots and comparing them with the internal KPIs, to check whether the probabilities and failure ranges were consistent, that is, whether our data were reliable. Understanding all these failure modes, we could develop an internal accelerated test. To do that, we had to understand the physics and the environmental conditions our motors work in. We did that through a fault tree analysis, basically deploying and understanding the cause-and-effect relationships. With that, we could identify the most critical variables in those relationships, and again use JMP to design an experiment to quantify the effect of those variables on our response, cycles to failure. Basically, we were trying to reproduce field failures in our labs. We ran several tests and, as a result of our experiments, we could fit models to our data using Fit Model and understand how the environmental and motor variables relate to cycles to failure. Through the survival plots and the surface plot, we understood that relationship and set a specific operating point to accelerate our motors' lifetime. Then, running a batch of samples at that point, we fit lifetime distributions to the results of our internal accelerated life test.   We were seeing failures, but at the end of these accelerated tests we had to ensure that we were causing the same failures as we had seen in the historical data.   So we came back to the lifetime distributions in the Reliability and Survival module and again fit Weibull distributions, now for the internal results of our accelerated lifetime tests.
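As a rough sketch of the life-distribution step described above, the Life Distribution platform can be launched from JSL roughly as follows. The column names are hypothetical, and the Weibull (and other) fits are selected from the platform report, so take this as an illustration rather than the team's actual script.

// Minimal sketch with hypothetical column names (Cycles to Failure, Censor, Failure Mode)
Names Default To Here( 1 );
dt = Current Data Table();

// Launch the Life Distribution platform on the historical field data,
// one analysis per failure mode; the Weibull fit from the report gives
// the scale and shape parameter estimates discussed above.
Life Distribution(
	Y( :Cycles to Failure ),
	Censor( :Censor ),
	By( :Failure Mode )
);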
We noted that the shape parameter, which according to the bathtub curve indicates the failure mode, could be compared with our historical data. Crossing both sets of information, we see an overlap between the shape parameter of the internal test and the shape parameter of the historical data, which means that we are reproducing the same failure modes in our accelerated life test.   It also means that we can now develop products faster, because every time we have a new technology or a new design, we can put it through this accelerated life test and quantify whether we are improving our motors' reliability, and we can do that faster than before.   We also did some technical cross-checks to prove that we are reproducing the same failures, in order to implement this test in the development process. That is how we used JMP to provide a lot of information and build it into our internal test. It was made possible by really good teams. Please feel free to make contact and send an email if you have any questions. And that's the end.
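For reference, the shape-parameter comparisons described in this talk use the standard Weibull parameterization below; this is textbook reliability notation, not taken from the poster itself.

% Weibull CDF with scale alpha and shape beta:
F(t) \;=\; 1 - \exp\!\left[ -\left( \tfrac{t}{\alpha} \right)^{\beta} \right], \qquad t > 0.
% Bathtub-curve reading of the shape parameter:
%   beta < 1 : decreasing hazard (early-life / infant-mortality failures)
%   beta = 1 : constant hazard (random failures)
%   beta > 1 : increasing hazard (wear-out failures)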
Jordan Hiller, JMP Senior Systems Engineer, SAS Mia Stephens, JMP Principal Product Manager, SAS   For most data analysis tasks, a lot of time is spent up front importing data and preparing it for analysis. Because we often work with datasets that are regularly updated, automating our work using scripted, repeatable workflows can be a real time saver. There are three general sections in an automation script: data import, data curation, and analysis/reporting. While the tasks in the first and third sections are relatively straightforward – point and click to achieve the desired result, and capture the resulting script – data curation can be more challenging for those just starting out with scripting. In this talk we review common data preparation activities, discuss the JSL code necessary to automate the process, and demonstrate how you can use the new JMP 16 action recording and enhanced log to create a data curation script.     Auto-generated transcript...   Speaker Transcript Mia Stephens Welcome to JMP Discovery Summit. I am Mia Stephens, and I am a JMP product manager. I'm joined by Jordan Hiller, who is a systems engineer, and today we're going to talk about automating the data curation workflow. This is the abstract, just for reference; I'm not going to talk through it. We're going to break this talk into two parts. I'm going to kick it off by talking about the analytic workflow and what we mean by data curation, and we're going to see how to identify potential data quality issues in your data. Then I'm going to turn it over to Jordan, who will talk about the need for reproducibility, share a cheat sheet for data curation, and show how to curate your data in JMP using the action recorder and the enhanced log. So let's talk about the analytic workflow. It all starts with having some business problem that we're trying to solve. Of course we need to compile data, and you can compile data from a number of different sources and bring the data into JMP. At the end, we need to be able to share results and communicate our findings with others. Sometimes this is a one-off project, but oftentimes we have an analysis that we're going to repeat. So a core question addressed by this talk is: can you easily reproduce your results? Can others reproduce your results? Or, if you have new or updated data, can you easily repeat your analysis, and particularly the data curation steps, on the new data? That is what we're addressing in this talk. But what exactly is data curation, and why do we need to be concerned about it? Data curation is all about ensuring that our data are useful in driving analytic discoveries; fundamentally, we need to be able to solve the problems that we're trying to address. It's largely about data organization, data structure, and also data cleanup. If you think about issues that we might encounter with data, they tend to fall into four general categories: incorrect formatting, incomplete data, missing data, and dirty or messy data. To help us talk about these issues, we're going to borrow some content from STIPS. If you're not familiar with STIPS, it is our free course, Statistical Thinking for Industrial Problem Solving. The course is based on seven independent modules, and the second module is exploratory data analysis.
Because of the iterative and interactive nature of exploratory data analysis and data curation, the last lesson in this module is called Data Preparation for Analysis, and we're borrowing heavily from that lesson throughout this talk. Let's break down each one of these issues. What do we mean by incorrect formatting? This is when your data are in the wrong form or the wrong format for analysis. It can apply to the data table as a whole: for example, you might have data stored in separate columns when you actually need it in one column, or your data might be in separate data tables that you need to concatenate, update, or join together. It can also relate to individual variables. For example, a column might have the wrong modeling type, or you might have columns of dates that are not formatted as dates, so an analysis won't recognize them as date data. Formatting can also be cosmetic. For example, in a large data table you might have many columns, column names that are not readily recognizable and that you want to change, or a lot of columns that you want to group together to make the table more manageable. Your response column might be at the very end of the data table and you might want to move it up. Cosmetic issues won't necessarily get in the way of your analysis, but addressing them can make your analysis a little easier. Incomplete data is when you have a lack of data. This can be a lack of data on important variables: for example, you might not have captured data on variables that are fundamental to solving the problem. It can also be a lack of data on a combination of variables: for example, you might not have enough information to estimate an interaction. Or you might have a target variable that is unbalanced. Say you're studying defects and only 5% of your observations are defects; that small subset might not be enough to understand the potential causes of defects. You also might simply not have a big enough sample size to get good estimates. Missing data is when you're missing values for variables, and this can take several forms. If the missingness is not at random, it can cause a serious problem, because you can end up with biased estimates. If data are missing completely at random, it might not be a problem when you're only missing a few observations, but missing a lot of data can be problematic. Dirty or messy data is when you have issues with observations or with variables. You might have incorrect values, where values are simply wrong. You might have inconsistency, for example typographical errors or people entering things differently. The values might be inaccurate; for example, you might have issues with your measurement system. There can be errors or typos, or the data might be obsolete. Obsolete data is when you have data on, for example, a facility or machine that is no longer in service.
The data might be outdated: you might have data going back over a two or three year period, but the process might have changed somewhere in that timeframe, which means those historical data are no longer relevant to the current process. Your data might be censored or truncated. You can have redundant columns, which contain essentially the same information, or duplicated observations. So dirty or messy data can take on a lot of different forms. How do you identify potential issues? A good starting point is to explore your data; in fact, identifying issues leads you into data exploration and then analysis, and as you start exploring your data, you start to identify things that might cause problems in your analysis. A nice first step is to scan the data table for obvious issues. We're going to use an example throughout the rest of this talk called Components. This is an example from the STIPS course in which a company producing small components has an issue with yield. The data were collected on 369 batches, 15 characteristics have been captured, and we want to use these data to help us understand potential root causes of low yield. If we start looking at the data table itself, there are some clues to the kinds of data quality issues we might have. A really nice starting point, added in JMP 15, is header graphs. I love header graphs. For a continuous variable they show a histogram, so you can see the centering, shape, and spread of the distribution as well as the range of the values. For categorical data, they show a bar chart with values for the most populous bars. Let's take a look at some of these. I'll start with batch number. Batch number is showing a histogram, and it's actually uniform in shape. Batch number is probably an identifier, so right off the bat I can see that these data are probably coded incorrectly. In another column I can see that the distribution is highly skewed and that the lowest value is -6, which makes me question the feasibility of a negative scrap count. Process is another one: I've got basically two bars, a histogram with only two values. As I'm looking at these header graphs, I can also look at the columns panel, and it's pretty easy to see that, for example, batch number, part number, and process are all coded as continuous. When you import data into JMP, if JMP sees numbers, it automatically codes those columns as numeric continuous, so these are things we might want to change. We can also look at the data itself. For example, when I look at humidity (and context is really important when you're looking at your data), humidity is something we would think of as continuous data, but I've got a couple of fields here with N/A. If you have text in a numeric column, the column is going to be coded as nominal when you pull the data into JMP, so this is something we're going to need to fix. I can also look through the other columns. For example, for Supplier I see that I'm missing some values; when you pull data into JMP, empty cells in a categorical column show up as missing values.
I can also see some entries where the data were not entered consistently, so I'm getting some serious clues about potential problems with my data. For temperature, notice all the dots: temperature is a continuous variable, and where I see dots, values are missing. Temperature is something that's really important for my analysis, so this might be problematic. A natural extension of this is to explore the data one variable at a time. One of my favorite tools when I first start looking at data is the Columns Viewer, which gives numeric summaries for the variables we've selected. If we're missing values, there will be an N Missing column, and here I can see that I'm missing 265 of the 369 values for temperature, which is a serious gap if we think temperature is going to be important in the analysis. I can also see whether I've got some strange values. When I look at the Mins and Maxes for number scrapped and the scrap rate, I've got negative values, and if that isn't feasible, then I've got an issue with the data or the data collection system. It's also pretty easy to see miscoding of variables. For example, facility and batch number, which should probably be coded as nominal, are reporting a mean and a standard deviation. A good way to think about this is: if it's not physically possible to have an average batch number or part number, then those should be changed to nominal instead of continuous. Distributions are the next place I go when I'm first getting familiar with my data. For continuous data, distributions let you understand the shape, centering, and spread of the data, and also whether you've got unusual values. For categorical data, you can also see how many levels you have. For example, customer number: if customer number is a potentially important variable, I've got a lot of levels here, and when preparing the data I might want to combine these into four or five buckets, with an "other" category for customers where I don't really have a lot of data. For humidity, we see the problem with having N/A in the column: we see a bar chart instead of a histogram. We can also easily see what we were looking at in the data for supplier: Cox Inc and Cox, Anderson spelled three different ways, Hersh spelled three different ways. For speed, notice that we've got a mounded distribution that goes from around 60 to 140, but at the very bottom there is a value or two pretty close to zero. It might have been a data entry error, but it's definitely something we'd want to investigate. An extension of this is to look at your data two variables at a time, for example using Graph Builder or scatterplots. When you look at variables two at a time, you can see patterns, and you can more easily see unusual patterns that cross more than one variable. For example, if I look at scrap rate and number scrapped, I see that I've got some bands, and it might be that something in your data table can explain that pattern. In this case, the banding is attributed to different batch sizes: this purple band is where I have a batch size of 5,000, and there is a lot more opportunity for scrap with a larger batch size than with a smaller one.
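As a rough idea of what the one-variable-at-a-time exploration looks like in JSL, a Distribution launch for a few of the Components columns mentioned above might look like the following. The exact column names are assumed, and the script JMP records for you may differ.

// Minimal sketch: univariate screening of a few Components columns
Names Default To Here( 1 );
dt = Current Data Table();

Distribution(
	Continuous Distribution( Column( :Temperature ) ),   // shows the many missing values
	Continuous Distribution( Column( :Speed ) ),          // shows the near-zero value(s)
	Nominal Distribution( Column( :Supplier ) )           // shows the inconsistent spellings
);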
So that might make some sense, but I also see something that doesn't make sense: these two values down here in the negative range. It's pretty easy to see these when looking at the data in two dimensions. I can add additional dimensionality to my graphs by using column switchers and data filters. This is also leading me into a potential analysis: I might be interested in understanding which Xs are related to scrap rate, and at the same time look for data quality issues. For scrap rate, it looks like there's a positive relationship between pressure and scrap rate. For the next variable, there doesn't appear to be much of a relationship. Scrap rate versus temperature is pretty flat, so there's not much going on there. But notice speed: there's a negative relationship, but across the top I see those two values, the one I saw in the histogram and a second one that seems to stand out. It could be that the value around 60 is simply an outlier, but it could be a valid point; I would probably question whether the point down near 0 is valid. So we've looked at the data table, we've looked at the data one variable at a time, and we've looked at the data two variables at a time, and all of this fits right in with data exploration and leads us into the analysis. There are more advanced tools that we might use (for example, Explore Outliers and Explore Missing Values) that are beyond the scope of this talk. And when you start analyzing your data, you'll likely identify additional issues. For example, if a categorical variable has a lot of categories and you try to fit an interaction in a regression model, JMP will warn you that it can't be done. So this whole process is iterative, and you'll identify potential issues throughout. A key point is to make note of the issues you encounter as you're looking at your data. Some of these can be corrected as you go along: you can hide and exclude values, and you can reshape and re-clean your data. But you might decide that you need to collect new data, or you might want to conduct a DOE so that you have more confidence in the data itself. If you know that you're going to repeat this analysis, or that somebody else will want to repeat it, then you'll want to capture the steps you're taking so that you have reproducibility: someone else can reproduce your results, or you can repeat your analysis later. So this is where I'm going to turn it over to Jordan, and Jordan's going to talk about reproducible data curation. Jordan Hiller Okay, thank you, Mia. Hello, I am Jordan Hiller. I am a systems engineer for JMP. Let's drill in a little and talk some more about reproducibility for your data curation. Mia introduced this idea very nicely, but let's add a few more details. The idea here is that we want to be able to easily re-perform all of the curation steps we use to prepare our data for analysis, and there are three main benefits to doing this. The first is efficiency: if your data changes and you need to replay these curation steps on new data in the future, it's much more efficient to run a one-click script than to go through all of your point-and-click activities again. Accuracy is the second benefit.
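For reference, a two-variables-at-a-time view like the scrap rate plots described above can be scripted roughly as follows. The column names are assumed from the Components table, and the column switcher and local data filter were added by point and click in the demo, so only the basic scatterplot is sketched here.

// Minimal sketch: scrap rate vs. one input, as a Graph Builder scatterplot
Names Default To Here( 1 );
dt = Current Data Table();

gb = dt << Graph Builder(
	Variables( X( :Speed ), Y( :Scrap Rate ) ),
	Elements( Points( X, Y ) )
);
// A Column Switcher (to step through Pressure, Temperature, Speed, ...) and a
// Local Data Filter can be added from the report's red triangle menu.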
Point and click can be prone to error, and by making it a script, you ensure accurate reproduction. And lastly there is documentation, which is maybe underappreciated. A script documents the steps that you took; it's a trail of breadcrumbs that you can revisit later when, inevitably, you have to come back to this project and remember what you did to prepare the data. Having that script is a big help. So today we're going to go through a case study. I'm going to show you how to generate one of these reproducible data curation scripts using only point and click, and the enabling technology is something new in JMP 16: the enhanced log and the action recording found in it. Here's what we're going to do: we perform our data curation activities as usual, by point and click. As we do this, the script that we need, the JSL code (JSL stands for JMP Scripting Language), is captured for us automatically in the new enhanced log. When we're done with our point-and-click curation, all we need to do is grab that code and save it out. We might want to make a few tweaks to make it a little stronger, but that part is optional. Okay, this is a cheat sheet that you can use. It covers some of the most common data cleaning activities and how to do them in JMP 16 in a way that leaves the JSL code in the enhanced log: operating on rows, operating on columns, ways to modify the data table, and so on, all by point and click. It's not an exhaustive list of everything you might need to do for data cleaning, and it's not an exhaustive list of everything that's captured in the enhanced log either, but it is the most important stuff, right at your fingertips. All right, let's go into our case study using that Components file that Mia introduced and make our data curation script in JMP 16 using the enhanced log. Here we are in JMP 16. I will note that this is the last version of the early adopter program for JMP 16, so this is pre-release; however, I'm sure it will be very similar to the actual release version. So, to the log. This looks different if you're used to the log from previous versions of JMP: it's divided into two panels, a message panel at the top and a code panel at the bottom. We're going to spend some time here, but first a quick preview. If I do some quick activities, like importing a file and deleting a column, you can see that those two steps (the data import and deleting the column) are listed up here in the message panel, and the JSL code we need for a reproducible data curation script is down here in the bottom panel. That's really very exciting; the ability to grab this code whenever you need it, just by pointing and clicking, is a tremendous benefit in JMP 16. In JMP 16, this new enhanced log view is on by default. If you want to go back to the old simple text log, you can do that in the JMP preferences; there's a new section for the log, where you can switch back to the old text view if you prefer.
The default when you install JMP 16 is the enhanced log, and we will talk about some of its other features further on in our case study. All right, I'm going to clear out the log from the red triangle menu and start the case study by importing that Components data Mia was sharing with you. We're going to start from the csv file, and I'll perform the simplest kind of import, just by dragging it in. Oh, I had a version of it open already; let me close the old version and clear the log one more time. Okay, a simple import by dragging it into the JMP window. Now we have that file, Components A, with 369 batches, and we can proceed with our data cleaning activities. I'll turn on the header graphs. The first thing we can see is that the facility column has just one value in it, FabTech, so there's no variation and nothing interesting here. I'm just going to delete it with a right click, delete the column. And again, that is captured as we go in the enhanced log. What else? Let's imagine that this scrap rate column near the end of the table is really important to us and I'd like to see it earlier in the table. I'm going to move it to the fourth position by grabbing it in the columns panel and dragging it to right after customer number. There we go. Mia mentioned that the humidity column is incorrectly represented on import, chiefly due to those N/A text values that are causing it to come in as a character variable. Let's fix that. We go into the column info with a right click, change the data type from character to numeric and the modeling type from nominal to continuous, then click OK. Let's click over to the log: you can see we have four steps captured now, and we'll keep going. All right, what's next? We have several variables that need to be changed from continuous to nominal: batch number, part number, and process. With the three of those selected, I right click and change from continuous to nominal, and those have all been corrected; again, we can see those three steps recorded in the log. Something else a little bit cosmetic: this column, Pressure. My engineers like to see that column name as PSI, so we'll change it just by selecting the column name and typing PSI, then tabbing out. That's going to be captured in the log as well. Next, supplier. Mia showed us that there are some inconsistent spellings and probably too many values in here, so we need to correct the character values. When you have incorrect or inconsistent character values in a column, think of the Recode tool; it is a really efficient way to address this. With a right click on supplier, we go to Recode and group these appropriately. I'll start with some red triangle options: convert all of the values to title case, and trim the white space so inconsistent spacing is corrected. That has already fixed a couple of problems. Let's correct everything else manually. I'm going to group together the Andersons, group together the Coxes, and group the Hershes. Trutna and Worley are already correct as single categories. The last correction I'll make is for the entries that are listed as blank or missing; I'll give them an explicit missing label here.
All right, and when we click Recode, we've made those fixes into a new column called Supplier 2 that has just six corrected, collapsed categories. Good. Okay, let's do a calculation. We're going to calculate yield using batch size and the number scrapped. And yes, I realize this is a little redundant; we already have scrap rate, and yield is just one minus scrap rate, but for the sake of argument we'll perform the calculation. I want the yield column inserted right after number scrapped, so I highlight number scrapped, go to the Columns menu, and choose New Column. We'll call it Yield, insert the new column after the selected column (after number scrapped), and give it a formula to calculate the yield. We need the number of good units, which is batch size minus number scrapped, and we divide that whole thing by the batch size. Number of good units divided by batch size: that's our yield calculation. Click OK, and there's our new yield column. We can see that it equals one minus scrap rate, which is good, and let's ignore for now the fact that we have some yields greater than 100%. Okay, we're nearly done; just a few more changes. I've noticed that we have two processes, and for now they're just labeled process 1 and process 2. That's not very descriptive or helpful, so let's give them more descriptive labels: process 1 we'll call the production process, and process 2 we'll call experimental. We'll do this with value labels rather than recoding. I go into column info and assign value labels: one in this column represents production (add that), and two represents experimental (add that). Click OK. Good. It shows one and two in the header graphs, but production and experimental in the data table. All right, one final step before we save off our script. Let's say, for the sake of argument, that I want to proceed with analysis only where vacuum is off. I'm going to subset the data and make a new data table that has only the rows where vacuum is off. I'll do that by right-clicking one of the cells that has vacuum off and selecting matching cells. That selects the 313 rows where vacuum is off. Now we go to Tables, Subset, and create a new data table, which we will name vac_off. Click OK, and that's our new data table with 313 rows, only the data where vacuum is off. So that's the end of the curation. We have done all of our data curation, and now let's go back, revisit the log, and learn a little more about what we have. All of those steps, plus a few more that I didn't intend to do, have been captured here in the log; every line up here is one of the steps we performed. There's also some extraneous stuff: at one point I cleared out the row selection, and I don't really need to make that part of my script, so let's remove it. I just right click on it and clear that item. Good. So: messages up here, JSL code down here. I'd like to call your attention to the origin and the result; this is pretty nifty. Whenever we do an action by point and click, there's something we do that action on and something that results. That's the origin and the result.
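The JSL that action recording captures for steps like these looks roughly like the sketch below. It is written from the steps described above rather than copied from the actual recorded log, and the exact column names and message options are assumptions, so expect small differences from what JMP generates for you. (The Supplier recode was done interactively with the Recode tool and is omitted here.)

// Sketch of the curation steps described above, assuming Components A is the current table
Names Default To Here( 1 );
dt = Current Data Table();

dt << Delete Columns( "facility" );                        // no variation, so drop it

Column( dt, "humidity" ) << Set Data Type( "Numeric" )     // N/A text forced it to character
                         << Set Modeling Type( "Continuous" );

For Each( {name}, {"batch number", "part number", "process"},
	Column( dt, name ) << Set Modeling Type( "Nominal" )    // identifiers, not measurements
);

Column( dt, "Pressure" ) << Set Name( "PSI" );             // cosmetic rename

dt << New Column( "Yield", Numeric, Continuous,            // yield = good units / batch size
	Formula( (:Batch Size - :Number Scrapped) / :Batch Size )
);

Column( dt, "process" ) << Value Labels( {1 = "Production", 2 = "Experimental"} );

// Subset to the rows where vacuum is off
dt << Select Where( :vacuum == "Off" );
vac_off = dt << Subset( Selected Rows( 1 ), Output Table Name( "vac_off" ) );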
So, for instance, take the step where we changed the column info for humidity. The origin, the thing we did it to, was the Components A table, and we see that listed here as the data table. When I hover over it, it says Bring Components A to Front, and clicking on it brings us to the Components A table. Very nice. The result is what we did to the humidity column: we changed it to data type numeric and modeling type continuous. You can see that down here, and if I click here, JMP goes to and selects the humidity column for us. You'll also notice the results are shown in green everywhere except this one last result, which is blue. That's to help us keep track of activities on different data tables. We did all of these activities on the Components A data table, and in our last activity we performed a subset on Components A whose result was a new data table called vac_off, so vac_off is in blue. We can use those colors to help keep track of things. The last feature I want to show you in the log is helpful if you have a really long series of steps and need to find just one: this filter box lets you search. Say I want to find the subset; there it is, the subset data table, and I can get directly to the code I need. Okay, so this is everything that we need. Our data curation steps were captured, and all we need to do to make a reproducible data curation script is go to the red triangle and save the script to a new script window. The import step, the delete column step, moving the scrap rate column: all of those steps are here in the script. We have syntax coloring to help us read it, and we have all of these helpful comments that tell us exactly what we were doing in each step. So this is everything we need, and I'm going to save it. I'll save it to my desktop; let's call it import and curate components. That is our reproducible data curation script. If I go back to the JMP home window and close everything except our new script, here's what we do when we need to replay those data curation steps: I just open the script file and run it by clicking the run script button. It opens the data, does all the cleaning, does the subsetting, and creates that new data table with 313 rows. Let's imagine now that we need to replay this on new data. I have another version of the Components file called Components B. It has 50 more rows, so instead of 369 rows it has 419; imagine we've run the process for another 50 batches and have more data. I want to run this script on Components B, but you'll notice that throughout the script it's calling Components A multiple times. So we'll just have to search and replace Components A with Components B. Here we go: Edit, Search, find Components A, replace with Components B, replace all. Fifteen occurrences have been replaced; you can see it here and here. Now we simply rerun the script, and there it is on the new version of the data. You can see it has more rows; in fact, the vacuum-off subset that was 313 rows before is 358 now. All right, so that is a reproducible data curation script that can run against new data. Okay, here is that cheat sheet once again.
This cheat sheet will be in the materials saved with the talk, so you can get to it and see how to point and click your way through data curation while leaving yourself a nice replayable, reproducible data curation script. The script we made didn't require us to do any coding at all, but I'm going to give you four tips that you can use to enhance such scripts just a little bit. The first tip is to insert one line at the beginning of your scripts; it's a good thing to do for all your scripting. Just insert the line Names Default To Here. This prevents your script from interacting with other scripts; the problem it prevents is called a namespace collision, and you don't really have to understand what it does, just do it. It's good programming practice. The second tip is to make sure there are semicolons between JSL expressions. The enhanced log does this for you automatically; it places the required semicolon between every step. However, if you do any modification yourself, you'll want to make sure those semicolons are placed properly. Just a word to the wise. The third tip is to add comments. Comments are a way to leave notes in the program without messing it up; they are something the JSL interpreter will not execute. Action recording in the enhanced log has already left some notes for you, but you can modify them and add to them if you like. Here are the main points about comments. The typical format is two slashes, and everything that follows the slashes is a comment. You can do that at the beginning of a line or at the end of a line: the interpreter will run this X = 9 JSL expression, but then ignore everything after the slashes. If you have a longer comment, you can use the format with /* at the beginning and */ at the end, which encloses a comment. Comments are useful for leaving notes for yourself, but they're also useful for debugging your JSL script. If you want to remove a line of code and make it not run, you can just preface it with the two slashes, and for a larger chunk of code you can use the /* */ format. The last tip I'll leave you with is to generalize data table references. Do you remember how we had to search and replace to make that script run on the new file, Components B? We had to change 15 instances in the script. Wouldn't it be nice to change it once, instead of 15 times? You can make your scripts more robust by generalizing the data table references: instead of using the table names, we use a JSL variable to hold those table references. Here's what I'm talking about; I'll show you an example. On the left is some code that was generated by action recording in the enhanced log. We're opening the Big Class data table, operating on the age column (changing it to a continuous modeling type), and then creating a new calculated column. Notice that the data table reference appears in each step: on open, when we perform the change on the age column, and again over here. What you need to do to make this more robust and generalized is very simple: you need to make three changes. In this first change over here, we are assigning a name. I chose BC; you can choose whatever you want. You'll see DT a lot.
So BC is the name we'll use to refer to the Big Class data table in the rest of the script. When we want to change the age column, we refer to it through BC: the BC data table and the age column in that data table. Down here, we're sending a message to the Big Class data table, and that's what the double arrow syntax means; we generalize that as well, so that we use the new name to send the message to the data table. Now if we need to run this script on a data table named something other than Big Class, here's the only change we need to make: we change just one place in the script. We don't have to do search and replace. Okay, so after those four tips, if you're ready to take your curation script to the next level, here are some next steps. You could add a file picker; it doesn't take much coding to change the script so that when somebody runs it, they can navigate to the file they want it to run on instead of having to edit the script manually. That's one nice idea. If you want to distribute the script to other users in your organization, you can wrap it up in a JMP add-in, so users can run the script just by choosing it from a menu inside JMP. Really handy. And lastly, if you need to run this curation script on a schedule, in order to update a master JMP data table that you keep on the network somewhere, you can use the Task Scheduler in Windows or Automator in macOS to do that. So in summary, Mia talked about how to do your data curation by exploring the data and iterating to identify problems. If you automate those steps, you will gain the benefits of reproducibility: efficiency, accuracy, and documenting your work. To do this in JMP 16, you just point and click as usual, and your data curation steps are captured by the action recording that occurs in the enhanced log. Finally, you can export and modify the JSL code from the enhanced log to create your reproducible data curation script. That concludes our talk. Thanks very much for your time and attention.
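Pulling those tips together, a minimal sketch of the generalized pattern described above might look like the following. The ratio column is a hypothetical stand-in for whatever calculated column the recorded script actually created.

Names Default To Here( 1 );   // tip 1: avoid namespace collisions with other scripts

// Tip 4: hold the data table reference in a variable so only one line
// needs to change when the script is pointed at a different table.
bc = Open( "$SAMPLE_DATA/Big Class.jmp" );

// Refer to columns through the variable rather than the table name
Column( bc, "age" ) << Set Modeling Type( "Continuous" );

// Send table messages with the double-arrow syntax to the same variable
bc << New Column( "height to weight ratio", Numeric, Continuous,
	Formula( :height / :weight )    // hypothetical calculation, for illustration only
);

/* Tip 3: block comments like this one are handy for longer notes
   or for temporarily disabling a chunk of code. */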
Aaron Andersen, JMP Principal Software Developer, SAS   This is an introduction to JMP Projects, with a focus on new project features introduced in JMP 16. JMP Projects provide a single-window, tabbed interface to JMP data, platforms and results. Projects can be used to quickly save and reopen a set of related files and reports, or to organize windows belonging to separate analyses. Starting in JMP 16, data tables, scripts and other files can be embedded in the project file, creating a self-contained project that can easily be shared, archived and distributed.     Auto-generated transcript...   Speaker Transcript Aaron Andersen, JMP So, welcome to Organize and Share Your Work with Self-Contained Projects in JMP 16. My name is Aaron Andersen and I am the software developer at JMP responsible for the projects feature. Normally I would say I'm coming to you live from SAS World Headquarters, but this time, neither of those things is true. First, because they asked us to record these presentations in advance; they didn't tell me why, but I suspect they have doubts about my ability to be coherent and instructive at three o'clock in the morning, and they might be right. And second, because none of us are actually at work yet; we're still all working from home. So I'm coming to you this morning, prerecorded, from the corner of my bedroom closet in Apex, North Carolina. We do the best we can. Nevertheless, I hope to teach you a few things about projects that can help you use JMP better and more efficiently, so let's jump right in. So what is a project? It's a way of viewing a set of JMP content in a single organized window, and it's a way of saving a set of JMP content into a single organized file. The best way to understand that is to see it in action. Let's get started and build an example project together. To create a new JMP project, launch JMP 16 and go to file, new, project, or Control-Shift-P on Windows and something similar on Mac. What I get is a new, empty project window. Now we just need some data. The data I want to use today is called dinosauriformesUSA.csv. Dinosauriformes are a clade of reptiles that includes dinosaurs, their close relatives and all their descendants. I got this data from a group called the Paleobiology Database. They have information on, I think, every archaeological site — or paleontology site, rather — in the history of the world, with information on who, where, how, what, and so on. It's all Creative Commons licensed and you can download any subset of it you want in any format you want, so it's fun data to play with. And this is something I made: just the dinosauriformes in the United States of America. What I want to do is import this data into my project. So first I'm going to drag it over into the project contents. I'll explain what that does in a bit. For now, I just add that to the project; I can then right-click and use JMP's import wizard to import the data. My data...here's some header information and then there's the actual data. My data starts on line 19. I know that because I rehearsed this presentation, not because I can count that incredibly fast. Next, the column names are right, and import. And there we have 9,531 rows of dinosauriforme data, and lots and lots of useful columns, most of which don't mean much to me. I'm not a paleontologist, but I did read the information that came with the table, and I know what some of these are. So let's try to get a feel for what's in this data.
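For the record, the import step just described can also be captured as JSL. A minimal sketch is below, assuming the file sits in the current directory; line 19 as the start of the data comes from the demo, the header-row line number is a guess, and the full set of wizard choices is most reliably captured by saving the table's Source script after importing interactively.

    Names Default To Here( 1 );
    // Hypothetical path; the wizard choices end up in Import Settings when
    // you save the table's Source script.
    dt = Open(
        "dinosauriformesUSA.csv",
        Import Settings(
            Column Names Start( 18 ),   // assumed header row; not stated in the demo
            Data Starts( 19 )           // from the demo
        )
    );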
I think I'll first look at just the count by state, so I can kind of see where in the country all of these fossils were found. To do that I'm going to use JMP's distribution platform. I'm going to run distribution by state. The filter's helpful here...state...okay. And that opens in a tab in the project. Now I have a data table, and I have this distribution report, and all of them live inside the project window. I can use this workspace tool pane to toggle between them, or I can click on the tabs up top. So we're starting to see the value of projects here. I can order this data by count. Now I can start to see where this data is. At the bottom here, we have the states in the United States where the most paleontology digs have been located: Wyoming, Montana, and New Mexico. Those all sound like dinosaury places — the archetypal hot, dry, barren plains from, you know, the opening scene in Jurassic Park. I am surprised by how few there are in the Dakotas, though. What really surprises me is Florida. I don't think of Florida as being a particularly dinosaury place. It's very wet and it's very low lying; I think most of the state is like 10 meters above sea level or something. So that surprises me. I'd like to know more. To get a little better picture of where things are, I can create a map of where each of the finds has been and maybe what kind of fossils were found there. To do that I'm going to launch Graph Builder. It shows up in a separate tab. And I am going to graph latitude by longitude. Again, these filters, new in JMP 16, are super useful. Graph latitude by longitude, just showing the points. I want to put on a background map, so I can see the state boundaries. JMP comes with several of these pre-included, including this one. There we go. Maybe zoom in on this a little bit with the magnifier tool. And then I'd like to color by class, to get a broad generalization of what sorts of things were found where. Now, done. Maybe zoom in a little bit more. Focus. And I can already see the answer to the Florida question I had earlier, because all of the dinosaurs in Florida are aves, or birds. Whereas your saurischia and ornithischia (I still struggle to pronounce those words after all this time) are found more in the center of the country, where we think of the more typical dinosaur find places. So that's already pretty useful. But to really appreciate JMP's data discovery abilities, we have to remember that everything is linked. So it'd be helpful if we could see the data, the distribution and the map all at the same time and interact with them together. Projects make this really easy. I just take the distribution, grab it, drag it over to the side...there we are. Close that. Size this slightly. Then I'm going to take my map, I'm going to drag it to the top, just like that, and now I can see the map, the data table, and the distribution all together at once. And if I want to know about a certain state's dinosaurs, for example, I can click on it in here and follow it in there. Let's try that with, say, Utah. Now Utah is a state with a lot of dinosaur finds. It also contains something called Dinosaur National Monument, which I have fond memories of going to as a kid, where they have this neat little visitor center that's actually built into the side of the rockface where the fossils were found. So you can go in and see the dinosaur fossils in their natural habitat, as it were.
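For readers who prefer script to point and click, a rough JSL sketch of the two reports built here might look like the following. The column names (state, lat, lng, class) are assumptions based on the demo, not necessarily the actual names in the file, and the background map and zooming are easiest to add interactively and then capture with Save Script.

    Names Default To Here( 1 );
    dt = Current Data Table();   // the imported dinosaur table

    // Count of finds by state
    dt << Distribution( Nominal Distribution( Column( :state ) ) );

    // Map-style scatter of finds, colored by class
    dt << Graph Builder(
        Variables( X( :lng ), Y( :lat ), Color( :class ) ),
        Elements( Points( X, Y ) )
    );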
So I can look at this and I can start to see the kinds of fossils that are found, and they're certainly clustered in the southern areas of the state where it's hot and dry. Not so much up in the north, where all the ski slopes are. But I also see something else that troubles me a little bit. This point here is coded as being in Utah, but it doesn't really look like it's in Utah at all. It's clear out in Arizona, so something's wrong. Maybe the data is coded wrong at PBDB, maybe I misunderstood exactly what that state column means, or maybe something else. I can click on the row here and then, down in the data table, it will be selected. If I go to rows, next selected, it scrolls straight to it. I can look at this a little bit. It's sort of...there's the same guy found in both of the others...that doesn't tell me anything. Latitude and longitude. County, though, is the same as up here, so it may be that the state and county are right, but the actual latitude and longitude are wrong. I'm not sure. That warrants some further investigation, but for now, I'm just going to hide and exclude those rows from my analysis since they seem to be problematic, and I can come back later and investigate what's wrong with them. That's looking pretty good. To save the project: file, save project, and it's going to ask you for a file name. I'm going to say dinosaurs, but I can't really spell that, so I'll just change it to dinos.jmpprj, save, and that's it. You're done. Saving a project in JMP 16 is a one-click operation. The data table, the reports, and the layout of them on the screen have all been saved into this project file, which I just saved. And now I can leave this for as long as I want, and five seconds or five months later, I can come back to it, reload this, and everything comes back right where I left off. One of the very powerful features of projects is the ability to save your work whenever you want and come back to it whenever you want. So, how does that work? Well, the project goes out and saves everything in the appropriate place, depending on the type of thing that it is. These reports, it saves as JSL and puts that in the project file. The data tables, it depends on whether the table has been saved yet and, if so, where. Right now, this dinosauriformesUSA data table isn't saved anywhere. We imported it, and then we just left it as a new unsaved table. When we save a project containing a new unsaved file, JMP saves it automatically to a secret location in the project file, and then, when we reopen the project, it restores it back to the new, unsaved state. It returns to the same state it was in when we saved the project. However, since this is a very important data table in this project, we probably do want to actually save it at some point, if for no other reason than that it would allow us to restore to a saved state if we make a change to the data that we later regret. To save a data table in the project, you select the data table, and you can use the menu here, or you can use file, save. And now we get a dialog that's new in JMP 16, because JMP is going to ask us where we want to save the table to. This default location, project contents, is a reference to the files and folders that are saved inside the project file itself. Anything we save here will be embedded in the project file when the project is saved. Which is what I'd like to do with the dinosaur table, so I just use the default location, the file name is fine, hit save, and it shows up here in the project contents.
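As a quick aside before continuing with saving: the hide-and-exclude step above can also be scripted. A minimal sketch follows, with a purely illustrative condition standing in for whatever actually identifies the suspect row.

    Names Default To Here( 1 );
    dt = Current Data Table();   // the dinosaur table from the demo
    For Each Row(
        If( :state == "Utah" & :lat < 37,   // hypothetical condition for the stray point
            Hidden( Row State() ) = 1;
            Excluded( Row State() ) = 1;
        )
    );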
If I save the project again, everything is written to disk in that way. This project contents pane shows me an always-available view of everything that I have saved inside the project. I can create folders in here, like original data, and I might move my csv file into that folder. There it is. Now it's gone. So I can use this area just like I would use a folder on my disk somewhere, but everything I put here becomes part of the project. Save the project, close it out. Now I have this file with that saved file in it. Of course there's just this one data table at the moment, but for larger projects, I'll often want to have more than one table, and that's perfectly fine in JMP. Maybe I want to look at dinosaurs in North Carolina, where I live. I could look into whether any of those are still able to be visited. We could take my kids there. Find North Carolina. There we are, the North Carolina dinosaurs. I could go down to the data table and do table, subset by selected rows, and call this dinosauriformesNC. Okay. There it is. Maybe I tab this with my original data table. And there I have it. I've got two data tables, one that has been saved to the project contents and one that hasn't been saved at all. I have a map in Graph Builder and a distribution. Save all of that together. Close it out. This is where the sharing part comes in, and the title of this presentation said organize and share. And the reason we're talking about sharing so much is that sharing projects is super easy now that we have self-contained projects. This one file contains everything that I need, and therefore everything anyone else needs, to open and view this project. So if I, say, wanted to share it with a coworker, I could just attach it to an email. And that's it. That's everything that John needs to open my dinosaur project, in this single attached file. I can also use these self-contained projects to upload to the JMP User Community and share my work with the larger JMP community. Or if I encounter a bug, I could send this project to JMP tech support and ask them to look into what went wrong and why. Another thing that I like to do is make backup copies of the project, especially if I'm working with complicated scripts and I want to be able to return to a known state. I can take dinos and make dinos version one or whatever, and now this has got all the data tables inside of it, so if I open this one and I make a change to the data table, I still have my backup copy back there that I can pull back if I need to. I can even write an automated script that takes this every day, timestamps it, and sticks it in an archive somewhere. The other thing projects can do, of course, is open things that aren't saved in the project. You don't have to make a standalone project if you don't want to. I can open a file on disk; maybe I want to include this visitor center picture in the project. I can just open that. There it is, part of my project, opened in a tab. If I save the project now, the image is just going to be referenced. I can see that it has a path, and the project is just going to reference it by path. So this file here now requires this file here in order to work properly. It's no longer self-contained. But there are some advantages to this. Maybe I want to include a file that is really, really big, and I don't want to have a copy of it in the project. Or maybe the file is created by some external system and automatically updated every night at midnight, and I always want to have the latest version.
Or if I want multiple projects to share the same data table or same script, such that if any one of them makes a change to it, the others all get the updated version. I can have them all point to the same file somewhere on my computer or a network drive somewhere. It's really up to you how you work with projects. You can find the pattern and, kind of, the way that makes the most sense for how you use JMP and the thing that solve the problems that you run into. I think we're very close to out of time, but I will show off just a few more things quickly that you can look into on the JMP website or other resources for more information. One is the project bookmarks. It's another useful tool pane over here. And this works just like bookmarks would in a web browser or other system in that I can make links to files that maybe aren't always open. I can bookmark a file on disk or I can bookmark a file from the project contents and that that will just be there as a quick link way of opening that file in the future if I want to have access to it, but not always have it open. Also, a project log pane, which I can use to see log messages generated by the project contents of...that is, by the all the things open in the project. This is especially useful for scripters and activities that generate lots of log messages that I care about. And lastly, in JMP 16 we have a setup JMP preferences to allow you to customize how you use projects and how projects appear to you. These first two could allow you to automatically create projects when JMP opens or automatically create a project around a file that you might open that is in a project. Here you can decide what you...what you want the default project to look like, what tool panes you want to appear. And this one, I'm quite proud of. The project template allows you to create a project, save it to disk, and tell JMP that every time I do file new project, I want a copy of that project to show up as my new blank project. So if you have certain files that you or your organization uses all the time and you'd like every project you start to have that file open or maybe just that file bookmarked so you can get access to it quickly, you can create a project template, point JMP to it, and then every new project that you start will contain those files or those bookmarks or maybe just the layout of the windows and panes that you prefer. That is about all I've got. This is normally where I'd pause for questions, but since this is pre recorded, I'll just issue this invitation instead. If you have any questions, comments, feature suggestions or bug reports on projects, please feel free to reach out to me directly. Again, my name is Aaron Andersen. During or after the Conference or, and especially in the case of bug reports, talk to my friendly colleagues at JMP tech support. Lastly, I will note that the project we just built here, dinosauriformes.jmpprj, is included in JMP 16 as a sample project. So if you'd like to continue looking into this, figure out what's wrong with those Utah but not Utah rows, or if you'll visit the United States in the near future and you'd like to see if there are any dinosaurs that have been found in the area where you will be, you can find this project in the JMP 16 sample projects folder. Thank you for coming. I hope you have a good time at the rest of the conference. Best wishes and good night.  
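The daily, timestamped backup idea mentioned a little earlier in the talk is straightforward to sketch in JSL; the paths below are hypothetical, and scheduling the script would be done outside JMP (for example with the operating system's task scheduler).

    Names Default To Here( 1 );
    // Hypothetical locations for the project file and the archive folder
    src = "C:/Projects/dinos.jmpprj";
    archive = "C:/Archive/";
    stamp = Format( Today(), "yyyy-mm-dd" );                      // e.g. "2021-03-15"
    Copy File( src, archive || "dinos_" || stamp || ".jmpprj" );  // timestamped copy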
Scott Wise, Principal Analytical Training Consultant, Learning Data Science Team, SAS   A picture is said to be worth a thousand words, and the visuals that can be created in JMP Graph Builder can be considered fine works of art in their ability to convey compelling information to the viewer. This journal presentation features how to build popular and captivating advanced graph views using JMP Graph Builder. Based on the popular Pictures from the Gallery journals, the Gallery 6 presentation highlights new views available in the latest versions of JMP. We will feature several popular industry graph formats that you may not have known could be easily built within JMP. Views such as death spiral run charts, marker expressions, hexagonal heat maps, cartograms and more will be included that can help breathe life into your graphs and provide a compelling platform to help manage your results. A JMP Journal with data, scripts and instructions is attached so you can recreate these views for yourself.     Auto-generated transcript...   Speaker Transcript Scott Wise More Advanced Graph Building Views. My name is Scott Wise, and in this edition we're going to see a collection of surprisingly beautiful and very useful visualizations that can be built in Graph Builder. So to kick us off, I'm joined by my new business associate, the Gillman, better known to those in the movie business as the Creature from the Black Lagoon. So, Gillman, I found out that you're using a lot of our firm's money to buy beachfront property. Is that right? Okay, are you worried at all about global warming that might be causing the sea levels to rise and flood? Are you a little worried about that? I am too, so let me show you. I'm gonna share my screen real quickly and let me show you what I found out is going on in JMP. So I've got some data that the University of Colorado has been posting on the average rise in global sea levels. And if I just take a look at a quick graph, you can see that there's a big trend upward when it comes to, you know, proof, right, that the sea levels are increasing year after year. All right, all right. Well, here's the problem I see. I looked at the projections, and I took one of your latest properties, and it was in a pretty high-risk coastal flooding area. And I just kind of looked at how many feet the floods might be — you know, projected it out if things do go bad. And I even came up with an example of our two-level beach house. Does that look right? Okay, and this is what the water level will look like over the years. So if we take a look at it maybe about four years out, we're starting to lose our beach, you know. We take a look at it almost, you know, nine years out, and it's almost up to our steps. We get out about 19 years, and it's getting into the first floor. I mean, if we're around there for 80 years, we don't even have a first floor anymore. So do you have any contingency plans if this happens? You do. Oh, thank you, you've got a prospectus as well? Oh. Oh, my goodness, thank you. So, your Creature from the Black Lagoon Bed and Breakfast, huh? I noticed that Gillmen are welcome. And let me show this to the folks who are watching over Zoom here: we have ocean access and a free fish buffet. Well, I'm glad you've got that covered and I guess you're ready for it. I don't think most of us who are human are ready for it, though, but thank you for helping me out and I'll see you for sushi later.
All right. Thank you so much, thank you, Mr Gillman. All right, so I'm going to go back and show my screen, and let me show you how I built this chart. So this chart here is just simply a projection: I've got year over here, starting with, you know, this year and going out into the future, and then I've put the possible flooding sea rise level in this column. So to make this thing work, all we're going to do is go into Graph Builder. I'm going to put the sea level rise on the Y axis, and instead of points I'm going to ask for an area chart. The area style I want is overlaid, and the summary statistic will be the range. Okay, now there's a 10 foot possible range. I want to make this look like my average house, so I'm going to go and make it about a 20 foot house, like 10 foot ceilings — maybe I'm so lucky, right? And now it's just real simple to go ahead and size down this chart the way I want it. And I'm going to go find a picture that I think is representative, so here's a picture of this beach house. I just grab it and drag it. Now that I've grabbed it and dragged it, I can go in and just kind of manipulate this view and stretch it so it fits my space. Yeah, I'll say done here. And under the hotspot, now I can bring in the year and make it interactive. So I'm going to go under this red triangle — I like to call them hotspots — go to local data filter, and add in the year. You notice, as I added the year, it's continuous. That's not going to be as good for a range selection, maybe, so I'm going to go under that red triangle, go to the modeling type, and change it to nominal/ordinal. I get kind of this list view, and now I can just go click on 2021, and holding my control key down, or my command key if you're using a Mac like I am — that would be after the first year of the range. Here would be where we are again, then about nine years out, and then here would be the maximum if we go out almost 80 years. So I saw this used to great effect on the climate.gov website, where in the US they're trying to convince people that we definitely have some areas that are at risk of coastal flooding and subject to this rising sea level. And so they took pictures of some of my favorite places on the North Carolina coast and the Georgia coast and the Texas coast, and it was very impactful, even in a 2-D world here, to see that high-level floodmark go up in some of my favorite places. So I'm going to do a blog on this one; if you just go to the Community and look up local sea rise, you can go and see me apply this again and get the data, as well, to play with. All right, that was a lot of fun, and thank you, thank you for welcoming my guest. Let's talk about Pictures from the Gallery. So if you're a fan of this presentation I've been doing — it's in its sixth year — every year we come up with six really advanced views. And they're not advanced in terms of you have to be an advanced JMP user; they're advanced in terms of these are probably views you didn't know Graph Builder could give you in JMP. And they're really easy to do. You just have to know how to set them up, and the little tricks that make them shine. And so we're going to show six new ones. We have hexagonal heat maps. This is a new one coming up in JMP 16 and I'm really excited about it. We have map cartograms. We have time series.
This came out in JMP 15   by adding a time series to your Graph Builder. We have row ordinal spirals.   We have HDR box plots. This one came out in 15 and graphlets, which came out in 16. So I'm excited to show these to you. So if you...if you've seen my presentation before, you know I always give you a gift.   This journal I'm showing you, that's going to be posted in the link, it's going to be posted in the Community.   You will have it and anytime we talk about one of these graphs, you'll be able to see it, you'll be able to not only get some hints but you'll be able to follow the steps, and I even put in the data for you.   So hopefully this will be a big help for you, alright. So everything's got kind of a pandemic theme because we're all being affected by the pandemic here, and we're doing this talk virtually, not face to face.   our honey and the bees that make the honey. And   there's been a lot of good data on what's going on with the bee population in the United States and it's very threatened here.   I know SAS actually has their own beehives and we've taken a little bit of data on them and had great talks about what they look at, in terms of data.   But one of the things they worry about are colony collapse. And a colony collapse is when all the male bees, all the worker bees take off and they...   they go and take off and   leave the hive. Leave, kind of, leave the queen bee and the little bees behind, you know, the young bees behind. And it basically is is pretty devastating and a lot of research is going on what is causing those. But what we have here   is   a heat map, but it has a special shape, it has the hexagonal shape, and I'm going to see if this can help us see the effect of colony loss over some recent years, all right. So you're going to have your tips, you're going to have your steps. Let me open it up and show you how this works.   So, right here I've got my US State Bee Colony   information. I have time periods, so this is kind of a season, you know, and I look over the winter and see how many bees get lost.   So that's this percentage right here. So here in 2008 in Alabama, of all hives that got reported, they lost about 39.7% of their bees.   And we know how many beekeepers and how many colonies were done that year or active that year and being measured.   So, to pick this up I'll just stick the Graph Builder. Here in the Graph Builder, I'll take the total winter/single state only loss. I'm going to put this on the Y. I am going to take,   as well,   the time period.   Or before I take the time period, I think I'll put the colonies on the X and I'll put the beekeepers on the color.   And I'm going to take off this smoother, I've just got points. Right now, I'm going to go and I'm going to select the heat map.   Now I've got this heat map selected.   I am going to go and change the view. Now, before I change the view, it's looking like I got a lot of things that   are kind of between zero and...what's this, like 50,000 colonies? So to handle that one, I'm going to right click on the colonies on the axis setting and I'm going to ask for the log.   And up here, this one's doing percentage. That's pretty good. I'm going to go here to the axis settings and I know they consider anything over 30% to be a problem, so I'm going to add a little   reference line right there. So I got that line going across. Now I can go under the heat map area. I can go under this bin shape and now in 16, I can turn on this hexagonal shape.   Now this looks pretty good. 
Got too much stuff that is in the blues here in this beekeeper gradient, so I'll right click. I'll go to gradient. I will go under the color theme. I've got this...   it's called black body, but it really goes from kind of white and yellow all the way up...that kind of looks to a dark red to a black kind of color. I like that one. I'm going to say okay. I'm going to maybe put eight labels of them.   And I'm going to make this one log as well, so the colors are a little more dispersed. That's looking better. Now can we do something about the time period?   Yes, we can, so I'm going to put the time period up here on the X axis, that gives me different panel views. I'm going to right click right there under time period, levels in view. I'm going to put two at a time and I'm just going to scoot over 'til the most recent two years of the data.   Okay, this looks good. Now some things that are lighter color are starting to fade and how you handle that, you will right click   within the data, and I'm gonna have to do it twice since I've already got this thing paneled, but if I did it before, I wouldn't have to do it twice.   And I'm just going to right click, go to customize, and where I see the heat map shapes, I'm going to add an outline. And you can see what it did. And I'm just going to do the same thing over here as well.   And heat map cells, add it to the line width here. There we go and then you can work and get the right size bin. And now I can see something like here 2018, 2019 in Pennsylvania,   they were between 43 and 50% of colony loss, so they might have been having some of this colony collapse, but now in 2019, 2020 things have gotten a lot better,   33 to 40% and it looks much better, you know, the previous season. We'll see how this season turns out. Researchers think that colony collapse could have something to do with, like, pesticides that might be affecting   the neural behavior of bees, even beekeeping methods and so it's something we can deal with. But in other parts of the world, the bee colonies are actually growing at a rapid pace like China, so I think this is definitely something   we can get a handle on but measuring it is a good place to start. Alright, so that is a pretty good start to our data. Again, you have all this available to you. I will move on. I am going to go to   the cartogram, alright. So this is looking at the number of McDonald's outlets because,   you know, during the pandemic, picking up food is one of the things...one of the few things we can do. I can't eat in restaurants very safely,   so I appreciate a good quality fast food restaurant and they are...some things at McDonald's are very, very good. They've got good coffee too.   So they're around the world, and I know they're consistent everywhere I've traveled. So if I'm telling my friends out in Europe, maybe where where the McDonalds are   in recommending that for dinner, for lunch, here is a map. Now it's doing something a little different here, you see the outline of the states...   of the countries, I should say, but you'll also see that in some of the countries, the shaded area still...is still respecting the boundaries   of the country, but it's smaller and that's giving you some sort of idea of the people per outlet. So the fuller it is, the more people per outlets are being served.   The smaller it is, the more spread out the people are, some of the outlets the McDonald restaurants are serving less people.   You can get kind of a color by number of outlets. 
And so this...the reason we like this is it kind of helps you with unbalanced   sizes in your map shapes, you know, because some countries might have a lot of territory,   be big and wide but they don't have too many people, and others might be small, but they're very densely packed. They might have a lot of McDonald outlets compared to a bigger place with less people, so this can help you.   I'm going to use backgrounds and lines is a good tip to do. Again, you have the instructions, but here we go. Pretty easy to do. I'm going to pick up from the data   what's going on. I'm going to put the country name on the shape.   I'm going to put the people per outlet on the size, and I'm going to put the number of outlets on the color.   Okay, now it doesn't look like anything is going on here, but um it will get better when we zoom in, because we've got every McDonald's outlet in the country.   So it's not really showing up. So under my red triangle, let's go to my local data filter. I'm going to add the only EU countries and those that have only have 300 or more. So when I click on this, I'm going to select just the EU countries. Now you see something starting to happen.   And just 300 plus outlets and that restricted myself down to just a couple of European countries, but now they pop out there, and now I can see what's going on in Spain   in France   and Germany and Poland and Italy. Okay and.   other things you can do. As you saw before, I definitely can drag in   the golden arches here.   Just from a from a picture file and, of course, you can as well,   you know, go and orient that picture. It's a nice little background. What if I wanted to add the country map? I would right click. I would go under graph. I would go under background map.   And maybe let's do a detailed earth, and there we go. There's the detailed earth we have. And again, you can play with the gradients and do other things as well. This one I'm probably going to right click as well. To to customize in here under the   shapes here, I'm probably going to make this more prominent.   Give them more of a black color and now I can see the outline a little better. So again a really cool thing you can do with maps to actually help you interpret it.   All right, number three is time series.   So here in the pandemic, well, I'm sure we're all playing more games with our families and the people we live with, so I took some classic board games.   And I went out to Google trends, which actually will let you have...which will actually, it's a data mining tool and it will go out and   show you the frequency of words that are talked about out there in social media over time, which is pretty cool. So I took since October and every day I look to see if we were starting to increase us talking about   games, right, these games. Are we talking about playing them? Are we talking about learning them? Are we looking stuff up online?   And you can see some things are happening on this chart. So let me show you how to do this. And the neat thing on this one is in 15, they allow us not only to   do a trend, you know, and put a line through it, if you fit a line, but they let us to do a time series forecast, which is really cool. And it's not going to replace   all the options you have under our main time series and time series forecast platforms in the modeling sections of JMP, but it will give you a good one, a good basic one you can use just in Graph Builder.   And let me show you how this one works.   
Alright, so let's go ahead and pick up that data I got from Google Trends. I can see how many times these games were mentioned — however they scaled them, they're scaled on an index. So what's going on? Oh, by the way, I do have a column here that's an expression column, where I dragged in a couple of pictures I thought might help me when I hover over some of these labels. So I'm not going to ruin it by showing you what those pictures are; I'll show them to you live, alright. So here I'm going to take the day and put it on the X axis. I'm going to take all the games and put them on the Y axis. So I've got points. I've got a smoother. I don't want a smoother, so I'll take it off. I'm not going to go to line; I'm going to go to fitted line, holding my little shift key down, and you can see, you know, chess has something going on better than the others. Those lines are straight; this one looks like it's trending. But down here under fit, in addition to polynomial, in 15 you can now go to time series. Now the forecast model is showing you the information up here. Unless you're really good at time series, it's probably not going to mean much to you. I'm going to turn that off. But seasonal period — remember, I was doing this by day, and I knew that we talk about games, and we play games more on the weekends — so I'm going to assume, like, a seven day seasonal period, and there'll be a trend. That's why you see, kind of, the up and down; it basically looks over seven days to actually figure out where things are going. And how many periods do you want to forecast? Let's forecast out 14 or...since we're doing this, I'm going to forecast out 21 days, how about that? OK, and now that we've done this, you can see what's going on. And I'll say done, and I can see that games like dominoes and backgammon are just not that interesting, right? Pretty flat. Mahjong and poker look interesting. Now, everything went up slightly over Christmas, because we played more games over Christmas, but let's take a look at what's going on with poker. Right before Christmas — oh, I've got a little picture of the chips and the symbol they use for the World Series of Poker. What happened was, given the length of the pandemic, these big poker tournaments that play Texas hold'em decided, in a lot of cases, that people weren't going to come back to conference centers, so why don't we hold most of it online? So like 95, 98% of all the gameplay is online, and then only that final table goes and meets in a, you know, socially distanced, controlled way, and does it live. So they've all gone to this format, and it's increased in popularity, so I think that's going to continue. So virtual online poker is hot, but what about this? I mean, Christmas jumped things up for chess, but it's been going gangbusters for chess overall. What's going on with chess? And if I look at what's going on here — oh, The Queen's Gambit. So if you're like me and you've watched a lot of shows on Netflix or other types of places, you can watch these fictional historical dramas; this was a really nice one about chess, about a young girl who fought her own demons, was really good at chess, and not only became a master but started to beat some of the best players in the world. And the minute that happened, everybody wanted to learn chess. So if there was a stock on chess, we should all have invested in it.
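For reference, the bare bones of this kind of Graph Builder chart in JSL look roughly like the following. The column names are assumptions based on the demo (the demo puts all the games on the Y axis at once; only one is shown here for brevity), and the time series fit options — the seasonal period and the number of forecast periods — are easiest to set interactively and then capture with Save Script, so they are only indicated in a comment.

    Names Default To Here( 1 );
    dt = Current Data Table();   // the Google Trends table from the demo
    dt << Graph Builder(
        Variables( X( :Day ), Y( :Chess ) ),
        // Switch the fit from polynomial to the time series fit in the Fit
        // options (seasonal period 7, forecast 21 periods in the demo), then
        // save the script from the finished graph to capture the exact JSL.
        Elements( Points( X, Y ), Line Of Fit( X, Y ) )
    );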
All right. So we're down to the bottom three here, checking how we're doing on time. Row order spirals. Now, Julian Parris is one of my good friends, and he hosted and did a great job with the JMP On Air series that you might have seen on the Community or watched live a little earlier in the year. And he used to bring up some challenge graphs that his friends would give him, and one of his friends gave him a chart that spiraled around and looked like a tornado. And he was able to replicate it in Graph Builder, but he made the comment, I don't see where this will be helpful to anybody. And I got thinking, where do we care about spirals? Figure skating — there's something called the death spiral in figure skating, which is where one of the skating partners is in the middle of the radius there and he's holding you, and you're skating on the outside doing a 360, making a wide circle, and then he pulls you in and you go faster and you get closer and closer and closer, and it's really eye-popping, right? Well, pilots do this as well, and they call theirs a graveyard spiral, or sometimes a suicide spiral. And what happens is, you're flying your plane, you're in clouds, it's at night, you don't see the horizon, you can't see the ground. Maybe you don't have instruments — that would be the worst — but what happens is you feel like you're dropping in altitude. And you feel like something's wrong, because you feel like you're level, but what it is, is you've actually gotten yourself into a spiral and you're actually banked. And if you stay with it, you'll actually go faster and you'll actually tighten that spiral, and you will eventually, you know, crash into the ground unless you can pull out. So they teach you how to get out of a graveyard spiral, and I wrote these lessons down, and people were like, can you put notes — labels, you know — on JMP charts? Oh sure, in Graph Builder, it's great. So I colored this first spiral, this phase one. Don't panic; the secret here is trust your instruments, not what you feel. And your instruments will tell you that you need to level your wings. So in phase two, you have time to level your wings if you stay calm, if you reduce the power. You pull up a little on the nose, then you take control. In phase three, once your power's normal, you return to normal airspeed, and you can see this plane got out of it and it's going to a good area. Okay, so here's how we make this graph work. We go into this graveyard spiral data, okay. I've got X, Y, and I've got the phase. So here in Graph Builder, I'm going to take my X and put it on the X; I'm going to take the Y and put it on the Y. I kind of see the shape just with the points, but the minute I try to add a line, it gets all out of whack. Well, there's a big row order box right here. I click that row order, and it brings it back. I'm going to hold my shift key down and add back in the points, and now I'm just going to put phase on color, and there I am. That easy. So, pretty cool chart. Alright, so next, our fifth graph. Our fifth graph is the HDR contour plot. HDR stands for high density region, and it's a type of box plot that allows you to make some comparisons. You're going to set it up the standard way, but you have to know where to get it — it's under contour, which is kind of a different place to have it. So I'm going to open this up.
Now, in the pandemic, you're probably eating a bunch of stuff you probably shouldn't be eating, and candy is probably one of them, right? So Valentine's Day is coming up here pretty soon — and has probably passed by the time you see this recording — but definitely chocolate's good and, in moderation, can be good for you. I've got these major brands and I have their nutritional information, so let's see if charting can help me a little bit. So I'm going to go to Graph Builder. I'm going to put the brand on the bottom, and under nutrition here, I've got several things. Let's just start with one and get the view right. Let's just start with my calories. So I put the calories on the Y axis. Now, in addition to points, I'm also going to add in the contour. I click on the contour element here. It says violin type. I don't want violin, I want the HDR. And now what you're seeing is the shaded area around the mode — that's the density mode, and this is the most dense area. It's where more of the points are; if more of the points are at the ends, it will be at the ends. Okay, and you can see where those points are in relation. So that's a pretty cool chart. Now, can I make this a little better? Oh sure, I can color by brand. We can add in other things. We can add in the carbs. You know, you might add in the total fats. Say done. Now at this point, you might want to worry about these marker colors and make the points blue so they don't get totally washed out by the bars. And now, you know, you can click on M&M and see where they all are. So this can help you make a better decision, and you can also make some comparisons here over how the brands are doing with their products in relation to making healthier chocolate. Alright, so the last chart we are going to take a look at is the graphlet. And this graph is a little washed out, but I think you can see why it's good. So, you know, I always knew how to use filters to change up what I'm seeing in a graph. I even knew how to create a dashboard and, instead of using a filter, make the settings of one graph cast into other graphs — you know, control other graphs like it was a filter. So there are a lot of ways to do that, but graphlets are a wonderful way of using the hover labels to actually bring up other graphs — generally, drill-down graphs. And I'll be able to drill down more and more. So the things in this red bar — this happens to be the Asia region, and I'll show you this data live — when I click on the hover, it will give me the option to bring up this graph, which is a tree map, and when I click into this square, I can now bring up Tabulate. And this table here is giving me all the raw information underneath. The tips here are to try to order your columns from a high level, drilling down to a low level, and to start with the end in mind. Okay, so I'm going to bring up this international beer nutrition data, because what goes better with chocolate than a good international beer? So, no U.S. beers, but I do have Canada and Jamaica in here, as well as several countries in Asia and Europe. Alright, so the first graph — and I'm going to start with the last one I care about — is Tabulate, and I hope you've seen Tabulate before.
It's under analyze, tabulate, it's where we kind of do our   like, you know, pivot table created type of tables summaries, and you can just literally go in and dump in region and dump in country next to region.   And maybe the brewer next to that one, and then you can take...you can take a thing, like the out...   output, like the alcohol percentage, and you can get whatever statistic you want. So I've got this type of graph already open. Save that   as a script, so it's there. I'm going to keep that one open. Let's go to what the second graph was, the one in the middle, the middle level of detail. This one was a tree map, where I have country nested in region,   and I have it sized by calories, by average calories, I should say, and colored by average carbs. So to create this kind of chart, it's going to be simple to go out to Graph Builder.   It's going to be simple, just to go, let me take my region, first, and let me take my country, next to my region. Let me go ahead and ask for the tree map. So things are looking pretty good so far. I'm going to size it by...I think I sized it by carbs.   And I think I colored it by calories.   Now it jumps out at you and there's just a lot you can do here. You can even control the layout. I like the square five layout.   But that's how you create this chart, right. Now we'll keep that one open there in the back. And the last thing we're going to pull up is this point and line fit chart and it's kind of a cool chart.   I really just want to show the mean of, like, here's the mean alcohol level of Asia and the rest of the Americas and Europe. And to make this chart but with these kind of shading around my mean prediction there, to make that happen, I'm going to go on the Graph Builder.   There I pulled it up. Alcohol percentage. Maybe I'll just put the region down here, instead of points, when the summary statistic is mean. Now when I do that and I hold my shift key down and I turn in a fit line, it gives me, kind of, this confidence interval around my my my mean.   And I'm going to color by that region as well, or overlay, I should say, by the region. That's how I generated this view. Very simple. Okay, now I have   all these charts open. How did I make the graphlet work? Because when I'm hovering over this one, it's not giving me anything but just information.   How do we make that happen? We'll start with the most detailed chart, the one at the bottom, right, the one you're going to drill down and end with.   This is the one I want. I right click. Save script to the data table, which is a good place to put it, or save it to a script window, which this is just puts those clicks.   In the JMP scripting language that I made to make this chart. I'll tell you what you do. Just go save the script to the clipboard.   You can close this now, go to your second chart, go under,   like, where it says United Kingdom, any of these squares. Just right click. Go to hover label. There's a lot of features on hover label, you got this wide open editor, you can do some charts on the fly, but I've already got the JSL saved in my clipboard so just say paste graphlet.   And did it work? Well there's United Kingdom, I click on this. It brings up the tabulate, in fact, it gives me a nice little filter so I can change things around,   which is really cool. So I can say, my second home is in the Philippines, I can see what's going on, what brands are being represented by the Philippines. That's kind of cool. 
All right, same thing: right-click, save script to clipboard. What this one has done, if I show you in the script window, is it has all this tree map Graph Builder stuff I built, but it still has what I did with the Tabulate as the graphlet. OK, so now that it's in the clipboard, I should be able — if I did this right, and cross your fingers here — I'm going to close this one, go into any of these shaded areas, right-click, hover label on the first graph, the high-level graph. And now I'm going to go to paste graphlet. If I did it right... I go to Europe, there's a tree map. I click on the tree map, okay. What's going on in Ireland? I click. Let's click on the graph, and now it brings up all the good international and export beers there in Ireland, and sometimes it's better just to have a Guinness, because the carbs aren't too bad, the calories aren't so bad, and the alcohol level's not too bad. All my friends in Ireland — a good shout-out to them, because we drank many a Beamish together. So there you go. That's how this works, and when you save this one — I'll save this script to my data table and call it my Finished Graphlet Graph — I close out of it and pull back this data. There it is, Finished Graphlet Graph. Click on it, there I go, and it's every bit as interactive as before. And that's the wonderful world of graphlets, so please use this technology if you can. All right, so you've been a fantastic audience here. I'm at time and would love to show you more, but definitely let me know if you have questions. You will, again, get this journal in your link, and also, looking at Community.jmp.com, you can find the links to the older galleries — we did about six for each year, six unique views, so now we're up to about 36. So here we have several blogs that feature graphs, so definitely check your Community blogs. Also, there are some great presentations on YouTube and in the Community on how to actually get the most out of Graph Builder and build dashboards, and there are some great tutorials as well. This is all in your journal, and you'll be able to go click the links and learn more. All right, I thank you so much for joining us for Pictures from the Gallery, and I look forward to the next time we can meet, especially face to face, so take care.
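As with the other views, the treemap from the middle drill-down level has a compact JSL form, sketched below. The column names (Region, Country, Carbs, Calories) are assumptions based on the demo, and the graphlet wiring itself — save script to clipboard, then paste graphlet into a hover label — is genuinely easier to do interactively, as shown above, than to hand-code.

    Names Default To Here( 1 );
    dt = Current Data Table();   // the international beer nutrition table
    dt << Graph Builder(
        Variables(
            X( :Region ),
            X( :Country, Position( 1 ) ),   // nest country inside region on the X axis
            Size( :Carbs ),
            Color( :Calories )
        ),
        Elements( Treemap( X ) )
    );
    // To keep the finished, interactive graph with the data, right-click the
    // report and choose Save Script to Data Table, as in the demo.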
Aurora Tiffany-Davis, JMP Senior Software Developer, SAS Josh Markwordt, JMP Senior Software Developer, SAS Annie Dudley Zangi, JMP Senior Research Statistician Developer, SAS   In this session, we will introduce an exciting feature new in JMP Live 16. You and your colleagues can now get notifications (onscreen and via email) about out-of-control processes.   We will demonstrate control chart warnings from the perspective of a JMP Desktop user, a JMP Live content publisher, and a regular JMP Live user. We will point out which aspects of control chart warnings were available before version 16, and which aspects are new.     Join us to learn: which JMP control chart platforms support warnings; how to control which posts produce notifications; how to control who gets notifications; how to pause notifications while you get a process back under control; and how to review (at a high level) the changes over time for a particular post.     Auto-generated transcript...   Speaker Transcript   Thank you for joining us today to learn about a new feature in JMP Live 16: Control Chart Warnings. You may be thinking, Control Chart Builder has been available in JMP Desktop since version 10 and Control Chart Warnings have been available in JMP Desktop since version 10, so what's actually new here? What's new is that in JMP Live version 16 we now have a way to grab your attention if there's a control chart post that has warnings associated with it — in other words, if there's a process that might have a problem. We do this through the use of onscreen indicators as well as active notifications that can go out to users on screen and by email. I'd like to now introduce just a few of the people who helped to develop this feature. I am Aurora Tiffany-Davis, senior software developer on the JMP Live team, and during today's demonstrations I'll be showing you JMP Live from the perspective of a regular user. We also have with us today Josh Markwordt. Josh is a senior software developer on the JMP HTML5 team, and during today's demonstrations he's going to show you the perspective of a JMP Live content publisher. Finally, we have Annie Dudley Zangi. She is a senior research developer on the JMP statistics team, and she's going to be demonstrating the control chart features within JMP Desktop itself. Annie, would you like to get us started? Thanks, Aurora. Yes, so I'm going to be demoing this by showing you how this works using a simulated data set based on a wine grape trial that happened in California. So what we have here is 31 lots, several cultivars, and yield, Brix sugar, and pH. So let's start with Control Chart Builder. First, I'm going to pull in the yield (that's in kilograms). And then I'll pull in the location as a subgroup variable. I don't care so much about the limits. What I am concerned about instead is whether or not we have any particular lots that are going out of control there — going above the limits or below the limits. And I care about how each of the cultivars is doing, so I'll pull that into the phase zone, which basically subsets all the different cultivars for us. So we can see that we have differences and kind of unique things going on with the different grapes. Next I'm going to turn on the warnings.
And to...the easiest way to do that   is to scroll down under the   control panel and select   warnings and then tests.   We're going to turn on Test 1,   one point beyond the limits, and   then Test 5 as well.   OK, I see no tests have failed.   That's pretty good.   And now you might recall we   were looking at two other   response variables, so I'm   going to turn on the Column   Switcher so we can look at all   three of them. We can just flip   through them using the Column   Switcher.   So we started with yield. We'll   take a look at sugar. Alright,   we can see the Aglianico has   very low sugar content, whereas   the other four have a higher   sugar content. And we can see the   different pH levels for each of   the five grape varieties. OK,   well we've got these 31 lots in.   I think we're ready to publish   it. Josh, would you like to show   us how to send that up?   Thanks Annie, so I have the same   report up that Annie just showed   you and I'm ready to publish to   JMP Live. The first thing I   would need to do as a new   publisher would be to set up a   connection to JMP Live.   If I go to file, publish,   and manage connections.   You can see that   I have a couple of connections   already created, but I'm going   to add a new one.   First you need to give   the connection a name just to help keep   track of multiple connections.   The next thing you need is   the URL of the server you're   trying to connect to,   including the port number.   Finally, at the bottom of the   dialog you can supply an API key   which says for scripting access   only. You only need this if you   are going to be   interacting with the server   using JSL, which we're going to   do later in this demonstration,   so I'm going to get my API key   from JMP Live.   I'm logged in.   I go to my avatar in the upper   right-hand corner and select   settings to see my user   settings. At the top there is   some information about my   account, including the API   key. I click generate new   API key and copy this by   clicking the copy button.   to my clipboard, then I can   return to JMP and simply paste   it in here and click next.   Authenticate   to   JMP Live. And you will be told that   your connection was created   successfully and you can save it   now. It is now present in my   list of connections and ready to   use for publishing. You only   have to do that the very first   time you set up the connection.   The next time you publish, you   can just use it.   So now I can go to file, publish   and publish to JMP Live.   And select my connection from   the dropdown at the top.   Create a new post is selected by   default. So I click next.   And this dialogue looks very   similar to what it did in 15.2,   except now there's an   additional checkbox here that   says enable warnings.   This is present for every   warnings-capable report.   If I hover over it, it says,   "Selecting enable warnings will   notify interested parties when   this post has Control Chart   warnings." I'll get back to who   the interested parties are in   a moment, but first I wanted   to explain what warnings-   capable reports are. In JMP   16 only the Control Chart   Builder is warnings-capable   and able to tell JMP   Live about warnings that are   present within it. There are   plans to expand to other   platforms in the future.   A Control Chart Builder can be   combined with other reports in a   dashboard or tabs report.   And it can be combined with   the Column Switcher as we're   showing in this example.   
Some more complex scenarios could cause an otherwise warnings-capable report to not be able to share warnings, and this Enable Warnings checkbox would be gone. For example, the Column Switcher only works with a single Control Chart Builder. If you try to combine it with multiple control charts in a dashboard, that would no longer be warnings-capable in JMP 16. So back to who the interested parties are: I, as the publisher of the report, am an interested party, as well as the members of any group I publish to, if that group is warnings-enabled. So I am going to publish this report to the Wine Trials group and leave Enable Warnings checked so that JMP will tell JMP Live about any warnings that are present. The report will come up in JMP Live. And the contents of the report look much like they did in 15 and 15.2. The points have tooltips, you're able to brush and select, and the Column Switcher is active, allowing you to explore the results in multiple columns. Now I'm going to hand it over to Aurora so she can show you some of the new features in JMP Live.   Yeah, thank you, Josh. So Josh just published a post to JMP Live that is warnings-enabled, but that doesn't actually have any warnings going on right now, so I can show you what that looks like from the perspective of a regular JMP Live user. And what it looks like is not a whole lot. There really isn't anything to draw my attention to Josh's post. There isn't any special icon showing up on his post. I don't have any new notifications. If I open the post itself, and I open the post details and scroll down, I will see that there's a new section that did not exist prior to JMP Live version 16, and that's the warnings section. This section is here because, by checking that Enable Warnings checkbox in JMP Desktop at publish time, the publisher is saying, I think that other JMP Live users are going to care whether or not there are warnings on my post. And so we have a warnings section here. But right now it just tells us a very reassuring message: there are no warnings present. If we scroll down further, we can see the system comment that JMP Live left in the comments stream at publish time, and again, this just tells us a nice reassuring message: this post has zero active control chart warnings. I'll pass it back to Annie now so that she can walk us through the next step of the grape trial.   Thanks, Aurora. So as I said before, we're getting new data in. We had 31 locations before. Now we have 32. The original study was adding some actual restricted-irrigation lots so that they could find out how the five different grapes responded to drier conditions. So if we take a look at the control chart with these restricted values, we can see that the yield is lower in this new lot that was just added. And in fact, with the Tempranillo grape it is below the lower limit. We can take a look at the sugar to see how that responded, and we can see that the sugar actually went up for our new restricted-irrigation dry spot. The pH wasn't anything abnormal. So I think we need to update this. Josh, do you want to show us how?   Yes, so new in JMP Live 16 is the ability to update just the data of a report.
This is useful because you don't need to rerun the JSL or recreate the report in JMP and republish. You simply want to update the existing report with new data. This can be done directly from the JMP Live UI by selecting Details and scrolling down to the data section, where you can view the data table that is associated with the report, and clicking Manage to update it. Click on Update Data, select Update next to the table you want to update, and click Submit. You're returned to the report. You will see that it is regenerating, and the updated content shows the warnings that Annie mentioned. Now I'm going to hand it over to Aurora to demonstrate some of the other ways that JMP Live lets you know you have warnings.   Thank you, Josh. OK, so Josh has taken a post that was warnings-enabled, and now he's updated the data on it so there actually are warnings now, so I can show you what that looks like from the perspective of a regular JMP Live user. We can see now that his post looks a bit different than it did before. It has a new red icon on it that draws the eye, and when we hover on that icon it says there are control chart warnings in this post. What that's telling me, in a little bit more detail, is: first, I know that the publisher of this post cares about control chart warnings, because the publisher has chosen to turn on those tests within JMP Desktop. Second, I know the publisher thinks that other JMP Live users might care about control chart warnings on this post, because that publisher has chosen to enable that JMP Live feature. And third, of course, I know that there actually are control chart warnings on the post. I'll see this icon on any post that meets these criteria. I also see it on a folder if that folder has a post inside of it that fulfills all these same criteria. If I click on this icon, I am taken to the warnings section of the post details, just like I showed you last time, only now there's more interesting stuff in this section. Now it tells me that there are control chart warnings and which columns those warnings are present on (yield and Brix sugar), and it tells me some details about the warnings. But if I want more details, I can scroll down just a bit and click Open Log. That tells me a lot. It tells me, for every column, how many warnings there are; what that translates to in terms of warning rate; which tests the publisher actually decided to turn on in JMP Desktop; and also specifically which data points failed tests and which tests they failed. I can also copy this to my clipboard.   If I scroll down further to the comments stream, I can see a new system comment. It says the post was regenerated because the post content was updated, and when the post content was updated, there were control chart warnings on the following columns. So you can see here that this comments stream can serve as kind of a high-level history of what's been going on with the post. Right now I'll leave Josh a quick comment saying it looks like reduced irrigation had a big impact. Now, the icon that I saw on the card would be seen by any JMP Live user, and any JMP Live user, if they open the post details, would see these system comments and they would see this warnings section. But not just any JMP Live user would get a new notification actively pushed to them. I do have that notification.
I can see it up here in my notifications tray. And I also have one sitting in my email inbox right now, and it's very detailed. The email contains all of the information that was present when we saw Open Log just a moment ago. Now, why did I get this notification? I got it because I'm a member of the group that the post was published to. And furthermore, the administrator of that group has turned on this JMP Live warnings feature. They've enabled warnings for the group itself, and by doing that, the group admin was telling JMP Live: I think the members of my group are really going to care about control chart warnings, so much so that you should actively push notifications out to them if we get any new control chart warnings on the posts in this group. In other words, my group admin agrees with the publisher. They both want to draw my attention to these potential problems. Now I'll turn it back to Annie so she can walk us through the next part of the grape trial.   Thanks, Aurora. OK, so we last looked at the adding of the restricted-irrigation lot, and now we have a couple of new lots come in. Nothing special about those. Let's take a look at the graph. What do we see here? Well, we see the restricted irrigation, but nothing special with those. Let's see if anything happened with the sugar. No, we see the two new points at the end after the restricted irrigation, but nothing special there and not a whole lot new. But we do still need to update the graph and update it on the web. So Josh, do you want to show us how we can update it this time?   So I have already demonstrated how you could update the data through the JMP Live UI, but you can also do this through JSL. First, I'm going to declare a couple of variables, including the report ID. The report ID can just be found at the end of the URL after the last slash; it's a series of letters and numbers that identifies the report to replace. There are ways to retrieve the report ID through JSL, which I will show in a moment, but for now we're just going to save that. We're also going to open the updated data set that Annie just showed you so that we can provide it to JMP Live. So if I run these, it opens the data table. The next thing we need to do is to create a connection to JMP Live. This will use the named connection that I created at the beginning of the demo, the Discovery Demo server, here. I use the New JMP Live command, which will create a JMP Live connection object. I provide an existing connection, and it can prompt if needed, but I've already authenticated. So if I run this, I get a new connection. As I mentioned at the beginning, you can use this connection to search for reports, as well as get a particular report object by ID. I'm going to use the variable that I pasted in to get the report we've been working on. From that result object you can get a scriptable report, a live report that you can examine for a number of pieces of information. Here I grabbed a live report and got the ID. I got the title, description, and the URL. And you can see in the log that the ID I retrieved matches the one that I pasted in. I also got the title. The description is blank because we didn't provide one when we originally published.
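A condensed JSL sketch of the steps Josh just described, creating the connection and retrieving a scriptable report by ID. The connection name and report ID are placeholders, and the message names follow the talk's description, so check the JMP Live scripting documentation for the exact forms in your version:

// The ID is the string copied from the end of the post's URL (placeholder shown here).
reportID = "abc123XYZ";
// Use a named, managed connection; it can prompt for credentials if the stored token has expired.
liveConnection = New JMP Live( Connection( "Discovery Demo Server" ) );
// Get the report we have been working on and turn the result into a scriptable live report.
result = liveConnection << Get Report( ID( reportID ) );
liveReport = result << As Scriptable;
// These mirror the values printed to the log in the demo.
Show( liveReport << Get ID );
Show( liveReport << Get Title );
Show( liveReport << Get URL );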
And I also got the URL, the full URL, that I could use either to open the report through the script or for some other purpose, such as creating a larger report that links to it. In preparation for the next step, I'm just going to get the current date and time, which I'm going to use to decorate the title a bit, to prove that we've updated it through JSL. But the key command here is the Update Data command, which lets us update just the data of the report, just like I did through the JMP Live UI. It takes the ID, which here I'm going to retrieve from the live report object. And then it takes the Data command, to which you provide the new data table that you are uploading, as well as the name of the current data table that you want to replace. That update result object can also be queried to retrieve a number of pieces of information, like whether it was successful, the status you got back, and any error messages, which could be useful in a more automated setting to provide details as to why publishing the new data failed. So I'm going to run this. And it said that it was successful. And if I bring up my report, you get this popup that says an updated version of this report is available. Now I can choose to dismiss it and continue looking at the current context I have, but I'm going to say to reload. And we see the new data points here without having to refresh the entire page. We go back to JMP. The last thing I want to do is show that other pieces of the report can also be manipulated through JSL. Here I'm simply going to give it a new title. I don't like the one that was provided by default, so I had declared this variable with a new title, and I'm going to append to that the date and time, to help distinguish when this update was done. I'm going to use the Set Title command to send that to the live report, and then close my data table to wrap up. Run these and bring up the report. In a moment you'll see the title refresh both here and in the details. Here it is with the date and time. Now I'm going to hand it back over to Aurora so she can show you more of what happened in JMP Live with this update.   Thank you, Josh. So I can see Josh has updated the title on his post. And he has updated a post that was warnings-enabled and had warnings. He's updated it with new data, and the new data, just like the previous data, has control chart warnings. So I can show you what this kind of persistent-warning situation looks like to a regular JMP Live user. I can see here that the icon that draws the eye and says there are control chart warnings in the post is still present in a persistent-warning scenario. If I open the post, and I open the post details, I can see the warnings section. Only now it tells me I have warnings on three columns: yield, Brix sugar, and pH. If I scroll down to the comments stream, I can see that same notification about the warnings here in the comments stream. And I also, I'd like to point out, have a new active notification that is pushed out to me. I have a new one here, and that's telling me that the new data, just like the old data, does have warnings associated with it. Now I'll turn it back to Annie and she can take us through the next step of the grape trial.   Thanks, Aurora. So last we talked, we were looking at lots 33 and 34 being added.
Now we've got one new lot come in. That's lot #35. Let's see how it looks. Oh my goodness, the yield is way out of control. This is just unbelievable; this is just remarkable. How does the sugar look? Well, the sugar looks about normal, like we would expect. The pH is also about where we would expect. This is something that's clearly going to involve some investigation, but we still need to report this. Josh, would you like to update the web?   So, we've demonstrated that you can update just the data of the report, which is useful when you want to keep the report contents the same and just update the data. But there's also the ability to replace the report, which existed before, and it's still useful if you want to update the contents of the report itself. I realize that, in addition to updating the data, I don't really want to have this moving range chart at the bottom. It doesn't really make sense in this context, so I'm going to right-click and say Remove Dispersion Chart and get rid of that. So now the report is ready to be replaced. I go to File, Publish, Publish to JMP Live, and it looks like it did before, except instead of selecting Create a New Post, I'm going to decide to replace an existing post and click Next. New in JMP 16, we've updated this search window. My report is right at the top of the list, but you also have the ability to search by keyword and restrict the number of reports if you've published a lot, or this was a while ago and you have difficulty finding it. I'm going to pick the report I want to replace and click Next. On this screen I get a summary of the existing picture and title. I'm going to update the title, just to draw attention to the fact that I replaced it, and give it a description. This time I know something might be wrong with the yield. So while the report does have warnings, this time I'm going to decide to uncheck the Enable Warnings checkbox. Information about the warnings will still be sent to JMP Live and be available at a later time, but I don't want everyone to get notified about the warnings just yet. Click Publish. And again, I'm told that my report has been updated and I can reload it. And the new information for the title and description appears in the details. I'll hand it back to Aurora so she can show you what else has happened in JMP Live.   Thank you, Josh. So just to summarize again, Josh has taken a post that has control chart warnings in it, but this time when he republished it, he decided not to enable the JMP Live warnings feature. I'm going to show you what that looks like to a regular JMP Live user, because the content publisher has control over whether their control chart warnings are exposed on JMP Live in a way that's going to draw the attention of other users. And Josh decided that that attention really wouldn't be productive right now. So what does it look like to me? It really doesn't look like a whole lot. There is no icon on the card to draw my eye to it. I don't have a new notification.
If I open the post and I open the post details, and I scroll down, that warnings section that I've shown you several times before isn't even present, because Josh has said, I don't think other JMP Live users really need to know about the state of the warnings on this post right now. Furthermore, if I scroll down to the comments stream, I can go back all the way to the beginning and see that when it was published, it did not have control chart warnings; then it was updated and it did; it was updated again and it still did. The most recent comment that I see says Josh Markwordt has republished the post, and it doesn't tell me anything one way or the other about control chart warnings. And again, that's because the publisher has control over whether these things are exposed to other JMP Live users.   While I'm here, I'll leave a quick comment, because I see in the description that Josh wants us to look at the yield. And it looks very, very off to me, so I'm going to say, could this be a data entry error? Oops, that was my scroll mistake. I'll submit that and then I'll turn it back over to Annie so that they can do some troubleshooting on this process.   Thanks, Aurora. So we went back and we talked with the data entry people, and it turns out they were entering in pounds instead of kilograms. As you notice right here, we're in kilograms. So we updated the data, did a little division on it, and now the yield looks more like what we would expect. The sugar and the pH have been unaffected. Josh, would you like to show how to republish?   Yes, so we've shown several ways to update the data. I'm going to go back to the first way, updating it through the JMP Live UI. I'll click on Details, scroll down to the data section again, and click Manage, then Update Data. And when I click Update, I'm going to select the fixed data that Annie just presented and submit. Go back to the report, see it regenerate. And like we noted, the yield is back to looking normal. I'm going to leave a comment for Aurora to let her know that we fixed the units. Then I'll hand it back to her to show you what has changed in JMP Live. Aurora... Thank you, Josh. So I can see his post here. I can open it, and right away, looking at the report itself, I can see that things look a lot better on the yield. So I'm curious about what that was. I'm going to scroll down here, and actually I can get here because I notice I have a new notification. What's that about? I click on it and I see that Josh has replied to my comment; that will take me directly to the post also. And if I scroll down to those comments and I look at that reply, I can see: OK, the units were in pounds instead of kilograms. It's been fixed now. Fantastic. So it looks like the grape trial is back on track and we're making good progress. I'd like to take a step back now and talk about the different kinds of JMP Live users that there are and how they interact with control chart warnings. We've talked a lot during these demonstrations about the power that Josh had as the content publisher. The content publisher has control over which tests are turned on or not in JMP Desktop. And the publisher also has control over whether or not to enable this JMP Live feature on the post.
But before, when I got a notification about control chart warnings, I mentioned that I got it because the post is published to a particular group. So I'd like to show you a little bit more about those groups. If I go to the groups page, I can see the Wine Trials group that this post has been published to, and I can see that it is warnings-enabled. If I hover over that, it says control chart warning notifications will be sent to members of this group. Let's open that group up. You can see here as well that it's enabled, and because I actually happen to be the administrator of this group, I can change that. If you come over here to the overflow menu, which is these three dots, and click that, I have the option to disable warnings and stop sending these notifications out to my group members. I can also change it back. If I change it from disabled to enabled, then I get a prompt, and it says send notifications now. JMP Live is telling me: OK, you've got a group; it's got some posts in it; because you didn't previously care about control chart warnings in this group, there could be posts in this group already that have warnings, and none of your members know about it. So now that you do care about control chart warnings in this group, would you like me to go ahead and send out notifications to all of the members of the group about any control chart warnings that already exist on the posts in here? I'll say no for now, because we already know about this particular problem.   But what if I'm not a content publisher and I'm not a group administrator? I'm just a regular JMP Live user, and I'm getting notifications about other people's processes. As with any other kind of notification, I can opt out. And I would do that by going up here and clicking on my notification bell icon, then clicking on the settings icon. And if I scroll down, I'll see that there is a new type of notification called control chart warnings. I can toggle this on or off to say whether or not I want these notifications at all. And if I do, I can let JMP Live know with what frequency I want to receive emails about these notifications. I think that Josh also has some closing thoughts for us, so I'll turn it over to him. Josh?   Thanks, Aurora. So we demonstrated the new control chart warnings in JMP Live 16 and how they let you notify interested parties about tests that generate warnings in Control Chart Builder. We've shown some new features in the JMP Live UI that draw attention to the warnings and give you details about what occurred, and settings to control the notifications and warnings from the perspective of both the publisher and group admins. We've also shown that there are several ways to update reports and get data into JMP Live 16. You can publish a report from the JMP desktop. You can update just the data, which is a new feature in JMP Live 16, through both the JMP Live UI and JSL. And you can also still republish a report from the JMP desktop to change its contents. I only briefly touched on the JSL capabilities in JMP Live 16, so if you're interested in more details or in how to take this process and automate it, please see Brian Corcoran's talk on the JMP Community, "The Morning Update: Creating an Automated Daily Report to Viewers Using Internet-Based Data."
It takes a control chart warnings example and shows how you might make this a daily process that publishes automatically. Please see our talk on the JMP Community and leave us feedback. Finally, we wanted to say thank you. We are just a few members of the several teams that have worked on this feature. On the JMP desktop side, in Statistics, Annie Dudley Zangi and Tonya Mauldin worked on Control Chart Builder. The JMP Live team, led by Eric Hill, contributed to both this feature and many of the other features that we got to indirectly show while giving this demo. The JMP Interactive HTML team, led by John Powell, created the content of the control chart reports in JMP Live. Our UX and design work is done by Stephanie Mencia, and our project manager is Daniel Valente. Thank you.   Thank you, everyone. Thank you.
Brian Corcoran, JMP Director of Research and Development, SAS   JMP Live is a powerful new collaboration tool. But it is only as useful as the quality of the content that you provide to it. This talk discusses the development of a JMP JSL script to acquire data through the internet via a REST API. It will then show how to publish an initial report to JMP Live and automatically update the data within that same report on a daily basis. In this fashion you can provide automated reporting to viewers who just want to see the latest data when they start work in the morning.     Auto-generated transcript...   Speaker Transcript Brian Corcoran Welcome to the morning update. This is my talk for JMP Discovery Europe 2021. My name is Brian Corcoran and I am a JMP development manager. So what are we hoping to do today? I would like to show you how to create a report in JMP based on an internet-based data provider using a REST protocol. Once we do that, I'm going to introduce you to how we can publish this report to JMP Live using the updated JMP scripting engine that we've put into JMP 16. Finally, I'm going to show you how you can automate this task, so that every day, when you come into work, reports have already been updated for you and you can just view them with your morning coffee or tea. Okay, so first let's talk about internet data providers. Most of them are based on something called a REST protocol. It's a stateless call; essentially it looks like a URL with some parameters tagged on to the end, and an increasing number of organizations are using it to expose their public data to end users. Some examples are the World Bank, the US Census, and Google. So JMP has a facility to help you with this called HTTP Request. It will allow you to access these services. Typically they use something called a GET or POST verb, and HTTP Request will allow you to use those. For this particular report I'm going to use the Johns Hopkins COVID REST API, with some data from the pandemic. Johns Hopkins is the university in the United States that aggregates this data from all over the world and then provides this free public API to access it. Now there is also a premium version of this, and that gives you better access and more granularity with the data, but we're going to try to get by with the free version for now, and hopefully you can take some of the scripts that I give you and try them out yourself. So what does the REST API look like? Well, let's take a look. Here's an example from Johns Hopkins. It starts out with this base URL, which in this case is api.covid19api.com. And here, you can kind of look at the URL and say, hey, we're asking for the total confirmed cases for a country. Where you see this bracketed country, you have to actually insert the name. In my case, I'm going to use Germany (they use the English names), but there are probably 100 countries where you could try this out. The Johns Hopkins API requires you to supply a starting and ending date after this base URL, where you see the question mark. Those are essentially parameters to the API call you're making, and they will allow us to return values within the date range that we specify. And it's kind of a long format: the year, month, and day, a T for time, and then the 24-hour time with a Z appended to it. So fortunately JMP has facilities to help you with that.
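As a rough JSL sketch of that timestamp (the format string shown is an assumption; JMP ships several ISO-style date-time formats, so check the Format documentation for the exact name):

// Build a "2020-09-01T00:00:00Z" style string from the current JMP date-time value.
now = Today();
stamp = Format( now, "yyyy-mm-ddThh:mm:ss" ) || "Z";
Show( stamp );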
You can use a Format call, and Today() will give you the exact time for right now, along with the date. We specify the format string that we want to use, and then we can just append a Z to the end to get the format we need for Johns Hopkins. So, like I mentioned, REST calls typically use either a GET or POST verb. Johns Hopkins uses a GET; I'm going to jump out of PowerPoint for a minute to show that. If you go to that website that I had in the slide, or also in the paper, you'll see that it provides the APIs by type and shows you essentially how you pass the information and what you expect to get back. Alright, so you can kind of go through here, see what each one requires, and you can see there are premium categories, where you have to pay so much per month to access that. Okay. So what do you get back? Well, you get back JSON, which is just a bunch of strings in name-value pairs (for instance cases, colon, and then a numeric string) that's going to represent the number of cases for this particular observation. And JMP has nice facilities to access JSON, and we'll show you that in a minute. Here is our actual HTTP Request call. We're just going to pass in our URL, the method we want, which is GET, and then this Secure(0). Why do we do that? Well, Johns Hopkins is a public API and it does not want to use secure socket layer, or SSL, so we need to turn that off or the call will fail. And then we just make our call with the Send command and our JSON will be returned in the data. So let's drill down a little bit into a script. I'm going to get out of PowerPoint and we're going to bring up JMP. I'm using JMP Pro 16, but this will work with regular JMP as well. And I'll mention that this script is included with the conference materials, along with the paper that we'll be looking at. Okay, so what am I doing here to set up? First of all, I'm saying I'm using the Documents folder for where I'm going to store my data, and I'm going to store it in a table named covid19_de.jmp. Later on I'm going to generate a report and I'm going to make sure it always has this name. It's very important, in this case, that it's a standardized name, and I'll show you why later. Finally, I'm just combining my Documents path and my file name into a full path to use. All right, we're not going to worry about this date formatting function here, and we're going to go into the meat of how we acquire our data. Right, so we're going to use a pattern here, and that is, we're going to assume that we've never run the script before. If we do not find a file where we have previously accumulated data, then we will create a data table and fill it in with values. However, if we already find a data table, then we will just update the table with the latest day's worth of data, just one value. And that way we don't have to worry about whether we've run this before or not. We can just run this script kind of blindly; you can give it to somebody else and it'll work for them. Alright, so here we're going to say if our file exists, our data table in the Documents folder, just go ahead and open it and, by the way, go ahead and set this flag to say we have data already. Otherwise I'm going to create a data table with a date column, cases, and daily change. Right. Now the next part is we're going to format our strings for the call to Johns Hopkins. Remember we needed to have a from and to range.
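Pulling the pieces described so far into one condensed sketch (the endpoint path, file name, and column formats are placeholders patterned after the talk, and fromDate and toDate are the Z-terminated strings whose construction is described next; this is not the full conference script):

// Open the accumulated table if it exists; otherwise create it.
tablePath = "$DOCUMENTS/covid19_de.jmp";
If( File Exists( tablePath ),
	dt = Open( tablePath );
	haveData = 1,
	dt = New Table( "covid19_de",
		New Column( "Date", Numeric, "Continuous", Format( "m/d/y", 12 ) ),
		New Column( "Cases", Numeric, "Continuous" ),
		New Column( "Daily Change", Numeric, "Continuous" )
	);
	haveData = 0
);
// Build the REST call and send it (Secure(0) because this public API does not use SSL).
url = "https://api.covid19api.com/total/country/germany/status/confirmed"
	|| "?from=" || fromDate || "&to=" || toDate;
request = New HTTP Request( URL( url ), Method( "GET" ), Secure( 0 ) );
rawJSON = request << Send;
// Parse JSON turns the returned text into a list of name/value records.
parsed = Parse JSON( rawJSON );
nObs = N Items( parsed );
// For example, parsed[nObs]["Cases"] would be the latest cumulative case count.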
Alright, so here's the string we already looked at. This is for the today value. In the case where I only need the current day's value, I'm just updating my data: I'm going to go from yesterday to today, essentially. I'm going to create yesterday by saying today minus two days, and I'm going to go to today. I do two days because, depending on the time and when the data gets updated at Johns Hopkins, sometimes you get one value and sometimes you get two. If I get two, I will just take the most recent value, but I want to make sure I get something. Alright, the next thing is, if I've never gotten data, I want to have a start date. Now I'm arbitrarily going to start on September 1 of 2020; you could put whatever you wanted. Here you see I'm actually using August 31, and that's because Johns Hopkins does not actually give us a value for the change between days; they will only give you the total cumulative cases for pandemic data. So in order to calculate the change, I have to subtract yesterday's value from today's value. Well, if I want to start on September 1, then I need the August 31 data in order to compute the change for September 1, so that's why we do that. If you pay for the premium API at Johns Hopkins, you can get the change value. Alright, so here's our URL that we discussed earlier. This is important here: if we have data, our URL is just going to start from yesterday; but if we don't (and we're using this if statement), then we're going to start from September 1, our start date. And then we're just going to go to today. I show this URL just for debugging purposes, but then this is where we actually do our request and Send call. I put in a little wait to make sure that it has a chance to run. And here is where we get our JSON data back, and this is where JMP has a really handy facility. Parse JSON will take this big block of strings and break it into an array of name-value pairs. You can then call N Items on that array to find out how many pieces of data you've gotten, and you can reference that data as an array with array subscripts. Okay. So now let's navigate down a little bit. Here we're going to fill in our data table. If we already have data, then we're just going to add one row to the table at the end, and we're going to fill in that data value, along with our change, which we compute from today's value versus yesterday's value. Then we just save it off. Now, if we've never created it before, then we're going to add the number of rows we have, minus one because there's header information, and then we're going to cycle through this and calculate all our daily changes and date values and put the case data in the table. And I think I'm going to demonstrate, hopefully, running this from scratch right now; we're at something like 163 days. And then we will save out that table to the Documents folder. Okay, now that we have the data, we can think about publishing to JMP Live. But this is probably a good chance for me to describe how you do JSL programming for JMP Live and how we've changed it in JMP Live 16 and JMP 16. Let me bring up the paper associated with this talk; you'll have this in the Community as well. Alright, so in JMP 16, we rewrote the scripting to be, hopefully, more powerful but easier to use, and the scripting revolves around the idea of having a connection, or managed connection information, stored away.
What happens is that...let's bring up JMP again. I'll show you what that means. You go to File, Publish, Manage Connections. And we can add one. Here, you would specify a connection name of your choice, and then the URL where your JMP Live site is (JMP Live is essentially a REST service itself). If your administrator requires a secret API key to enable scripting, you would need to supply it here. When you do this and you hit the Next button, you're going to be prompted, most likely, depending on your authentication mechanism, for credentials. When you enter those, it will essentially give you an access token, which means that it stores away on disk for you, not your credentials, but just this access token that allows you to access this site and script to it. That way, without having to provide any of this information in the script, you can just reference the connection name that you supply. And I'll show you one; for instance, this is JMP Live Daily, which is what I'm going to use. Here's my URL endpoint and my API key. I can just reference JMP Live Daily, and it will know how to connect within my script. Okay. So, to create the connection, then, I just say New JMP Live and the name of my connection. Now here I'm saying, let's prompt if we need to. What does that mean? Well, if for some reason your credentials expire or your access token is old, then it will prompt you to enter your credentials once the script starts. If you don't supply this and your credentials have expired, then the script will just fail. Okay. So how do we actually publish a report to JMP Live? Here's an example of just a simple bivariate that you might run out of Big Class. So, to make that a published report, you're going to just say, create a new web report and assign it to a variable. Then I'm going to take this bivariate reference and say, add that report to the web report, and I'm going to optionally provide a title and a description. And then I'm just going to call Publish, and the publish will return a result. We might publish up to JMP Live and it would look something like that. The result will tell us if we succeeded, and since publication is actually like an HTTP call, we can look at the status, if we so desire, or an error message. Okay. All right. The other interesting thing is that if the result of our operation is something like adding a report or a folder to JMP Live, the result we get back contains information that allows us to further manipulate that item. We can call As Scriptable to use that information to create an object within scripting, like a report, that we can then access. For instance, after I've done this, I can use the report and set my report title. It will go up to JMP Live and change the report title. Okay. The whole idea within JMP Live scripting now is around manipulating reports and folders, searching folders, searching for reports, things like that. The report understands that it has an identifier. And this identifier, if you were to look at it, is a long alphanumeric string that really would be awkward to enter into a script or remember. But it's also required for you to uniquely identify that report. And the reason you might want to uniquely identify it: let's suppose you want to delete it.
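A hedged JSL sketch of that publish flow, using the Big Class bivariate from the example. The connection name is a placeholder, and the exact placement of the title, description, and other options follows the talk's description, so the real script may differ slightly:

// Connect with a named, managed connection (credentials come from the stored access token).
liveConnection = New JMP Live( Connection( "JMP Live Daily" ) );
// Build a simple report to publish.
dt = Open( "$SAMPLE_DATA/Big Class.jmp" );
biv = dt << Bivariate( Y( :weight ), X( :height ) );
// Wrap it in a web report, add a title and description, and publish it.
webReport = New Web Report();
webReport << Add Report( biv, Title( "Bivariate of weight by height" ), Description( "Simple example" ) );
result = liveConnection << Publish( webReport );
// The result can be queried for success, HTTP status, and error messages, and it can be
// turned into a scriptable report object for further manipulation.
liveReport = result << As Scriptable;
liveReport << Set Title( "Bivariate of weight by height (renamed)" );
reportID = liveReport << Get ID;   // the long alphanumeric identifier discussed next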
You tell JMP Live that you want to delete the report, and you can use Get ID on that report object to supply that unique ID so JMP Live knows which one to remove. I'm showing you these particular items because they'll be important in our ultimate script that we hope to produce. The other area where it can be really important to know that ID is in searching. Now here, I can find reports. For instance, let's suppose I want to find all the bivariate reports that start with Biv; I can just ask JMP Live to find reports and return a list of results. Then I can turn that list into a list of reports. There's a function called Get Number of Items on that report list that allows me to cycle through each one by subscript. And then, if I so desire, I could delete all of them if I wanted to. So the search capability is new and, we hope, fairly powerful for you to use to do large operations on a JMP Live site. Right, so there's one operation that we need to address too, before we're really ready to show our script off a little bit further, and that is Update Data. In JMP Live 16, we've added the capability to update just the data for a report without having to publish the entire report back up to JMP Live. Let's suppose you get a report just the way you want it, and maybe it's a little customized and you like the appearance and you don't want to mess with it. But you do want to update the data and have it recalculated. Well, now you can pass just the data table for that report up to JMP Live, and also reduce the transmission time, and you do that by calling Update Data, providing the report ID, and then just the data table with the updated data. The report will recalculate on JMP Live, rather than having to do it on your desktop, and anybody who happens to be viewing that report will also see the update. Okay, so now we kind of have all the tools that we need to actually do our script, so let's go take a look. All right. So, here's where I create my JMP Live connection. And now I'm going to create a control chart. Now, a control chart really is not the ideal analysis platform for this data, so why am I using it? Well, two reasons. One, it is nice to see day-to-day changes, and two, it allows me to plug, or advertise, the fact that we have another new feature in JMP Live 16, and that is to show control chart warnings. If you publish control charts and there are observations that are out of bounds that would generate a warning on the desktop, then when you publish it to JMP Live or update the data, JMP Live can also generate warnings to send to anybody who's subscribed to that report and wants to get an email or a notification within the website that something is out of bounds. This can be really useful for things like process control. My colleagues are doing a talk on control chart warnings, and I encourage you to also check that out if you have a chance. How did I generate this? Well, I just went to JMP with some older data, and there's a facility within JMP, if you're doing an analysis, where you can just say save script to script window. I just took that information and plugged it into this script, so that's pretty handy. Okay. So let's get into the meat of how we're going to publish a report, and I promise you, we will run this in a little while. Okay, so once again I'm going to have a pattern here. I'm going to look for the report to see if it is already up on JMP Live.
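That publish-or-update pattern looks roughly like this in JSL. The search filter, the list accessors, and the placement of Public(1) are assumptions pieced together from the talk, so treat it as a sketch and consult the JMP Live scripting documentation for the exact message forms:

// liveConnection, dt (the updated table), and ccb (the Control Chart Builder reference)
// come from the earlier steps; reportName is the standardized report title.
reportName = "covid19_de";
found = liveConnection << Find Reports( Search( reportName ) );   // the talk also filters to "published by me"
reportList = found << As Scriptable;                              // list of report objects
If( (reportList << Get Number of Items) == 0,
	// Never published before: wrap the Control Chart Builder output and publish it publicly.
	webReport = New Web Report();
	webReport << Add Report( ccb );
	liveConnection << Publish( webReport, Public( 1 ) ),
	// Already published: push just the new data and let JMP Live regenerate the report.
	liveReport = reportList[1];
	liveConnection << Update Data(
		ID( liveReport << Get ID ),
		Data( dt, "covid19_de" )   // new table plus the name of the table it replaces
	)
);
// Quit();   // left commented for interactive runs; uncomment for the scheduled, unattended run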
If it is not, I will publish it, but if it is already up there, then I will just take the updated data table that we created earlier and provide that to update the data on the server side and have it recalculate the report if it sees fit. All right, so how do I do that? First of all, I'm going to search for our report (and remember, we use a standardized report name that I had specified earlier, so we're always looking for the same one). And I'm also going to say only published by me, just in case somebody else had a report of the same name; I wouldn't want to get that. All right, I'll turn it into a list that I can look at. And if the number of reports is zero, that means I didn't find a previously published report. So I'm going to create a new web report, add my Control Chart Builder output to it, and publish it. And I want to make sure that it's available for everybody by saying Public(1). If I did find one, I'm going to take that report, referencing the first item returned, and I'm just going to update the data here, using the updated table that we generated previously. The rest of this is just debugging information that I show in the log, just to see if everything went alright, but it's not really necessary. Finally, at the end here, I have a Quit statement. When we actually go to automate this later, this is important, because we want JMP to shut down and close all the windows. Otherwise, the next time we go to run it, it might take a look, see JMP's already running, and think that things are hung from a previous operation. However, for interactive operation, I'm going to comment this out right now. Okay. So I think we're ready to go here. We can give this a try and we'll hope for the best. Sometimes Johns Hopkins gets very busy and will actually reject the request to get the data, which would be unfortunate, but let's try this out. And just to show you, in the Documents folder at this point, I do not have a JMP table with the name that I'm specifying, and if I go to the JMP Live site that I hope to publish to, we don't see any output from Control Chart Builder there. Alright, so let's give this a try. Right, there's our control chart. I'm going to refresh JMP Live. Okay, there's our Control Chart Builder output. This one did have warnings to it. If we look at this within JMP Live, we can hover over points and see what the most recent data is. This is for February 10; I'm on the 11th, so we have up-to-date data. This is some data that is considered out of control, based on the moving average from back in January and late December. Right, so far, so good. If I open up my Documents folder and refresh that, we can see that our JMP table has been created. All right. And we see now this has been published a minute ago. Alright, so let's go ahead and shut this down. And we're going to...actually, I didn't want to do that, hold on a second. We're going to cheat a little here. Let's go ahead and delete the last value that we got. Then we're going to save that and close it down. So we're going to simulate the fact that we have not run it today yet, and then we're going to run this again. Okay, it just fetched the last value. And we go up to our website. And we see it just regenerated a few seconds ago again. So in this case, we just updated the data and we just got the last value. If I were to bring up my mail, I happen to be subscribed for warnings, and hopefully we might see a little update here too.
We're getting notifications that there was a publication of this Control Chart Builder and there were warnings, and if I want, I can go and see where those failed and what points are out of bounds. Okay. Alright, so I think we're in good shape for trying the automated task. So I'm going to go ahead and delete this post. Right. Let's shut this down. Shut this down. I am going to put our Quit back in there, because now we're going to need that for when we run in an automated fashion. And I will close this. Go to the Documents folder, and I'm going to delete our data and pretend that we're running this from scratch. Right. And let's make sure JMP is shut down. Okay. Now, if you've seen some of my previous Discovery talks, you may have seen me use the Task Scheduler before. It's a popular topic with me. You just type in Task Scheduler here on Windows; I hope you saw that. On the Mac, you would have to use Automator or a cron job; I would suggest Automator. All right, but the Task Scheduler allows you to run just about anything on a regular basis. So let's go ahead and create a new task. We'll just create a task here and we'll name it COVID Data for Germany. I want to run with highest privileges. I'm going to run only when the user is logged in, because I don't want to enter credential data, but I would suggest selecting run whether the user is logged on or not if you're doing this for a production purpose, because if your machine gets rebooted due to a Windows update or some other reason, you want it to still run, and this option will allow you to do that. It will require you to enter your credentials when you finally save out this task. Alright. So, for triggers, what is that? That's when I want it to run, so let's go ahead and do that. Let's say we want to run it daily, starting tomorrow. And maybe I just want to run it at six o'clock a.m., before I get in in the morning, whatever "get in" means anymore. Before I roll out of bed and go to work. All right. So I'm going to stop this task if it runs longer than 30 minutes, because that probably means it's hung. And otherwise I think we're good to go there. So what action do we want to perform? Well, we want to run JMP, so you have to navigate to where jmp.exe is installed, which is in Program Files, SAS, and either JMP or JMP Pro 16. Go ahead and select that. And then our argument is our JSL script, which, unfortunately, you have to enter manually here, which I'll do. But just make sure that you're careful with that. Okay. Now, under settings I'm going to make sure that we allow the task to be run on demand, because that'll allow us to try it right now and make sure it works. And if there's already one running, make sure to stop it; that probably means it's hung. And stop the task if it runs longer than an hour, again just in case it hangs. Alright, so there's our task to run every day. So we can debug it, essentially, by trying it out right now, since we allowed it to be run on demand. Let's go ahead, right-click the mouse button and say Run. Hopefully we'll see the taskbar; JMP will briefly come up, run, and then go away. Looking down here, hopefully, things are happening. Okay, and then it's gone. Let's take a look at our website; I'll refresh that. And there is our report, generated a few seconds ago. If we look at our folder, we can see that our JMP table's been generated, and hopefully tomorrow morning at 6 a.m. our task will run and get us a fresh batch of data and an updated report.
And when we come in with our coffee or tea, we can take a look at that and make our decisions for the day. So that concludes my talk. I hope one of the three aspects we've discussed today (internet-based data acquisition, JMP Live scripting, or automated task generation) has helped you with your job. Thank you for attending, and I hope you enjoy the rest of the conference. Bye.