JMP Discovery Summit Series: Abstracts
Showing events with the label Mass Customization.
A STEAMS Methodology for Sports Science and Injury_Lily Sun (2020-JA-PO-P6)
Friday, December 18, 2020
Level: Intermediate. This presentation reports a study of sports injury risk and prevention, focusing on figure skating. First, JMP's cluster analysis and principal component analysis helped us understand the various clustering methods and choose an appropriate algorithm for analyzing the injury data. We also used JMP's data visualization tools to generate many correlation and causation patterns for understanding injury mechanisms. Building on these results, we used the JMP output to develop an appropriate injury prevention program for figure skaters.
Lily Sun is a second-year student at Stanford Online High School who has competed three times at the national level in figure skating and won a bronze medal. She hopes to use statistical analysis in JMP to give other skaters and athletes information about preventive training techniques and injuries. She is currently active in a wide range of roles: digital marketing intern at GenShe, CMO and lead instructor for the Empathetic Leaders Movement, head of media for She Helps Her, and editor-in-chief of Women in Politics.
Labels (9):
Advanced Statistical Modeling
Basic Data Analysis and Modeling
Consumer and Market Research
Data Exploration and Visualization
Mass Customization
Predictive Modeling and Machine Learning
Quality and Process Engineering
Reliability Analysis
Sharing and Communicating Results
Fault Detection and Diagnosis in the Tennessee Eastman Process Using Multivariate Control Charts_Jeremy Ash (2020-JA-45MP-24)
Friday, December 18, 2020
Level: Intermediate. The Model Driven Multivariate Control Chart (MDMVCC) platform builds control charts from principal component analysis (PCA) or PLS models. These charts can be used for fault detection and diagnosis in high-dimensional data. Here we demonstrate monitoring with a PLS-based MDMVCC using the Tennessee Eastman process, a simulated industrial chemical process in which quality and process variables are measured as a chemical reactor turns gaseous reactants into a liquid product. First, we perform offline fault diagnosis. This normally means moving back and forth between multivariate charts, univariate charts, and process diagnostic reports, which the MDMVCC platform makes very easy. Next, we connect JMP to an external database and demonstrate online monitoring with the MDMVCC platform. Because the quality measurements only become available after a time delay, fault detection based on them is usually delayed as well. In a PLS-based MDMVCC, variation in the quality variables is monitored as a function of the process variables, which are generally available much sooner, so faults can be detected early.
Jeremy Ash earned a degree in bioinformatics from North Carolina State University and now works at JMP as an analytics software tester. His dissertation covered computational methods in cheminformatics, chemometrics, and bioinformatics. He also holds an M.S. in statistics from North Carolina State University and a B.S. in biology from the University of Texas at Austin.
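The PCA side of the monitoring idea in this abstract can be illustrated outside JMP. The following is a minimal sketch, not the MDMVCC platform itself: it fits a PCA model to simulated in-control data with scikit-learn and computes the two statistics such charts typically track, Hotelling's T² in the model plane and the squared prediction error (SPE) off the plane. The data and the 99% empirical limits are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 10))        # simulated in-control process data (rows = time points)
X_new = rng.normal(size=(50, 10)) + 0.8     # later observations with a deliberate mean shift

pca = PCA(n_components=3).fit(X_train)      # PCA model of normal operation

def t2_and_spe(X, model):
    """Hotelling's T^2 of the scores and squared prediction error of the reconstruction."""
    scores = model.transform(X)
    t2 = np.sum(scores**2 / model.explained_variance_, axis=1)
    residual = X - model.inverse_transform(scores)
    spe = np.sum(residual**2, axis=1)
    return t2, spe

t2_ref, spe_ref = t2_and_spe(X_train, pca)
t2_new, spe_new = t2_and_spe(X_new, pca)

# Simple empirical limits from the in-control data (formal limits use F and chi-square approximations)
t2_limit, spe_limit = np.quantile(t2_ref, 0.99), np.quantile(spe_ref, 0.99)
print("points flagged:", int(np.sum((t2_new > t2_limit) | (spe_new > spe_limit))))
```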
Labels (13):
Advanced Statistical Modeling
Automation and Scripting
Basic Data Analysis and Modeling
Consumer and Market Research
Content Organization
Data Access
Data Exploration and Visualization
Design of Experiments
Mass Customization
Predictive Modeling and Machine Learning
Quality and Process Engineering
Reliability Analysis
Sharing and Communicating Results
Summarizing Adverse Events in Clinical Trials_Kelci Miclaus (2020-JA-30MP-21)
Friday, December 18, 2020
Level: Intermediate. Reporting, tracking, and analyzing the adverse events that occur in study subjects is central to safety assessment in clinical trials. Many pharmaceutical companies, and the regulatory agencies to which they submit new drug applications, use JMP Clinical to support this adverse event review. Biometric analysis programming teams often produce static tables, listings, and figures for medical monitors and reviewers. This is inefficient: the physicians who understand the medical implications of a given event cannot interact directly with the adverse event summaries. Yet even producing simple counts and frequency distributions of adverse events is not always straightforward. This presentation focuses on the core JMP Clinical reports for adverse event counts, frequencies, incidence, and time to event. The reporting capabilities of JMP Clinical go well beyond the ordinary: fully dynamic adverse event analyses are easy to produce even when the underlying calculations are complex and rely heavily on JMP formulas, data filters, custom-scripted column switchers, and virtually joined tables.
Kelci Miclaus is the manager of JMP Life Sciences R&D, developing the statistical features of the JMP Genomics and JMP Clinical software. She joined SAS in 2006 and holds a Ph.D. in statistics from North Carolina State University.
Labels (10):
Advanced Statistical Modeling
Automation and Scripting
Basic Data Analysis and Modeling
Consumer and Market Research
Content Organization
Data Exploration and Visualization
Mass Customization
Predictive Modeling and Machine Learning
Quality and Process Engineering
Sharing and Communicating Results
JMP as a Tool for Spatial Data and Morphometric Analysis_千場 良司 (2020-JA-PO-P5)
Friday, December 18, 2020
Level: Beginner. Full title: JMP as a tool for spatial data and morphometric analysis: an attempt to grade carcinoma in situ of the upper aerodigestive tract and its precursor lesions using single-linkage cluster analysis.
Most malignant tumors of the upper aerodigestive tract mucosa (oral cavity, pharynx, larynx) are squamous cell carcinomas arising in the stratified squamous epithelium that covers the mucosal surface. Lesions regarded as precursors or early stages of these carcinomas are known clinically to appear as white or red patches of the mucosa, and on microscopic examination of tissue taken from patients they are named epithelial dysplasia and carcinoma in situ. Dysplasia is further divided into mild, moderate, and severe grades according to the degree of cytologic abnormality and the proportion of the epithelial layer involved. This grading is done intuitively by visual inspection by pathologists; it is thought to be reasonably reproducible, but objective studies are few. In this work we quantified the arrangement of cell nuclei within the epithelial layer and examined how it differs among non-neoplastic (normal) epithelium, dysplasia, and carcinoma in situ. Using digital image analysis, the centroid coordinates of cell nuclei were extracted from photomicrographs; single-linkage hierarchical cluster analysis in JMP 15 was used to generate minimum spanning trees (MST) connecting the centroids, and comparison of the histograms of branch lengths revealed differences among the groups.
千場 良司 is a former lecturer at Tohoku University (Institute of Development, Aging and Cancer), M.D., Ph.D., and a former overseas research fellow of the Ministry of Education (medicine) at Aarhus University, Denmark. In human pathology he has studied the pathogenesis of disease using quantitative morphology based on geometric probability and integral geometry, digital image analysis, and multivariate statistical analysis, with research papers on liver cirrhosis, on early carcinomas arising in the alveolar epithelium, pancreatic duct epithelium, and endometrium and their precursor lesions, and on hepatic metastasis of cancer (https://pubmed.ncbi.nlm.nih.gov/7804428/, https://pubmed.ncbi.nlm.nih.gov/7804429/, https://pubmed.ncbi.nlm.nih.gov/8402446/, https://pubmed.ncbi.nlm.nih.gov/8135625/, https://pubmed.ncbi.nlm.nih.gov/7840839/, https://pubmed.ncbi.nlm.nih.gov/10560494/). He is interested in mathematical methods applicable to the analysis of carcinogenesis and histologic diagnosis, particularly numerical classification methods such as cluster analysis and discriminant analysis. His statistical platforms have ranged from Fortran statistical subroutines on mainframes to SPSS and SYSTAT on PCs; he has been a JMP user since version 8, drawn by its excellent data table capabilities and flexible analysis environment.
千場 叡 graduated from the Department of Complex and Intelligent Systems, School of Systems Information Science, Future University Hakodate. As a student he was interested in complex-systems phenomena in physicochemical reactions and carried out experiments and research on the mechanism of capsule formation by thermal polymers of amino acids in alcohol liquid phases. He is currently also interested in digital image analysis, data science, and the recognition and classification of forms and images using neural networks.
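The MST-and-branch-length step described in this abstract can be sketched with standard Python tools. This is an illustrative outline only, with made-up centroid coordinates standing in for the measured nuclear centroids; it is not the authors' JMP workflow.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import minimum_spanning_tree

rng = np.random.default_rng(1)
centroids = rng.uniform(0, 100, size=(80, 2))   # hypothetical nuclear centroid (x, y) coordinates

# Pairwise Euclidean distances between centroids
dist = squareform(pdist(centroids))

# Minimum spanning tree; single-linkage hierarchical clustering merges along these same edges
mst = minimum_spanning_tree(dist)
branch_lengths = mst.data                        # lengths of the n - 1 MST branches

# Branch-length summaries that could be compared across normal, dysplasia, and carcinoma in situ groups
print("branches:", branch_lengths.size, " mean length:", round(branch_lengths.mean(), 2))
counts, bin_edges = np.histogram(branch_lengths, bins=15)
```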
Labels (10):
Advanced Statistical Modeling
Basic Data Analysis and Modeling
Consumer and Market Research
Content Organization
Data Access
Data Exploration and Visualization
Mass Customization
Predictive Modeling and Machine Learning
Quality and Process Engineering
Reliability Analysis
Creating Social-Problem-Solving Innovation in the SDGs Era: Developing Doraemon's Secret Gadgets as an Example_新井 崇弘 (2020-JA-PO-P4)
Friday, December 18, 2020
Level: Beginner. More than 1,000 of Doraemon's secret gadgets exist; they represent an ideal of science and technology used mainly to solve the problems of specific individuals. Today, however, solving society-wide problems is more urgent than solving individual ones. In this study we examined solutions for the challenges of the SDGs era through Doraemon's secret gadgets, aiming to create social-problem-solving innovation. The data analyzed were survey responses about well-known gadgets such as the Take-copter and the Anywhere Door. We ran principal component analysis and preference regression, then performed conjoint analysis with an L8 orthogonal array on the extracted attributes and levels. The result was a concept for an ideal Doraemon gadget incorporating the seven elements people want in a new gadget: improved intellectual ability, small size, improved motor function, no time travel, no spatial travel, a time cost, and simple styling. In a survey about the new gadget we designed from the analysis, 92% of respondents said they would want to use it or own it, and 100% answered "yes" when asked whether it could improve the quality of life of people with visual, hearing, or physical disabilities.
新井 崇弘 graduated from Chiba University and worked in management analysis at Chiba University Hospital before completing the master's program (Master of Science in Health Care Management) at the Keio University Graduate School of Health Management. He currently conducts research in the health care field through data analysis with JMP.
山口みなみ graduated from the nursing major of the Department of Health Sciences, Tokyo Medical and Dental University, in 2013, worked as a neonatal nurse from 2013 to 2019, and has been in the nursing major of the Keio University Graduate School of Health Management since 2019.
洪 東方 graduated from the Faculty of Life Sciences (pathology major) at UNSW in 2017, completed a Master of Public Health (MPH) at the University of Sydney in 2018, and has been in the health care management major of the Keio University Graduate School of Health Management since 2019.
Labels (9):
Advanced Statistical Modeling
Basic Data Analysis and Modeling
Consumer and Market Research
Design of Experiments
Mass Customization
Predictive Modeling and Machine Learning
Quality and Process Engineering
Reliability Analysis
Sharing and Communicating Results
Introducing Design of Experiments: Modern Approach, 1st Edition, by Bradley Jones and Douglas C. Montgomery, Wiley 2019_細島 章 (2020-JA-PO-P3)
Friday, December 18, 2020
Level: Intermediate. As the phrase "modern approach" in the title suggests, this book differs from conventional design of experiments texts in assuming the use of software (JMP). Author B. Jones is SAS's "Doctor DOE," and D. C. Montgomery is a professor at Arizona State University. In traditional DOE books the analysis output centers on the ANOVA table; this book also shows JMP output such as the profiler, actual-by-predicted plots, the effect summary, and residual plots, so readers can understand the results intuitively and from multiple angles. The book covers the standard material, but its highlight is the chapter on screening designs (Chapter 8). It makes quite specific, and valuable, recommendations about which designs practitioners should use. Even the difficult concept of resolution becomes easier to grasp with the help of JMP's color map on correlations, alias matrix, and design generation rules, and the book makes clear why a definitive screening design (DSD) is a reasonable choice for experiments built mainly on continuous factors. Other useful content is plentiful: how the handling of randomization, replication, and blocking changes the analysis results; how to deal with missing values, a common headache in practice; and the trade-off between main-effect orthogonality and alias optimality, along with its historical background. Split-plot designs (SPD) are also explained clearly.
After serving at Yamatake-Honeywell (now Azbil) as general manager of FA development, executive officer and head of the R&D division, executive officer and head of the quality assurance promotion division, and adviser to Azbil Kimmon, 細島 章 founded the consulting firm 東林コンサルティング. His specialties cover statistical problem solving in general, including yield and quality improvement through production data analysis, field-failure prediction, robust design, design optimization, and design of experiments, as well as on-site guidance in design review, root cause analysis (RCA), prevention of human error, and process improvement. His books include 『ネットビジネスの本質』 (JUSE Press, 2001, co-author; winner of a Telecom Social Science Award), 『実践ベンチャー企業の成功戦略』 (Chuokeizai-sha, 2011, co-author), and 『よくわかる「問題解決」の本』 (Nikkan Kogyo Shimbun, 2014, sole author). His main paper, a proposal for workshops that encourage the reporting and acceptance of near-misses and feelings of unease on production lines (Japanese Society for Quality Control, 2016), received the 2016 Quality Technology Award. Recent talks include "A proposal for a framework that visualizes the organizational factors inducing work errors and promotes improvement" (Discovery-Japan 2018) and "Solving quality problems with JMP: defect analysis and reliability prediction in manufacturing" (Discovery-Japan 2019).
Labels (12):
Advanced Statistical Modeling
Automation and Scripting
Basic Data Analysis and Modeling
Consumer and Market Research
Data Blending and Cleanup
Data Exploration and Visualization
Design of Experiments
Mass Customization
Predictive Modeling and Machine Learning
Quality and Process Engineering
Reliability Analysis
Sharing and Communicating Results
Establishing a Sensory Evaluation System for Bread and Visualizing Its Characteristics Through Texture Mapping_岩政 菜津紀 (2020-JA-PO-P2)
Friday, December 18, 2020
Level: Beginner. Co-presenters: 樋口 侑夏, R&D Department, Unitec Foods Co., Ltd.; 浅野 桃子, R&D Department, Unitec Foods Co., Ltd.
In recent years Japanese consumers have been eating less rice, and demand for bread as an alternative staple has been growing. Our company develops dough conditioners aimed at improving the texture of bread, but sharing a bread texture with others in words and arriving at the same understanding is very difficult, because even the same texture term is perceived differently by different people. We therefore reasoned that if sensory evaluation could map texture characteristics in two dimensions, everyone could share the same understanding visually. In this study, using white bread, which is easy to evaluate by sensory methods, and melon bread as an applied case, we selected and defined texture terms through statistical analysis and sensory evaluation and established a sensory evaluation system. This made it possible to evaluate on a common scale characteristics of bread that people had previously described in different ways. We also created a texture map of the top five products by market share and were able to visualize their physical characteristics.
Presenter profile: The presenter works in R&D at Unitec Foods Co., Ltd., which sources hydrocolloids such as pectin, natural food ingredients, and functional ingredients from overseas suppliers and provides Japanese food manufacturers with know-how on using them. She applies statistical analysis in JMP in work on making quality control more efficient and improving the precision of product development.
Labels (11):
Advanced Statistical Modeling
Automation and Scripting
Basic Data Analysis and Modeling
Consumer and Market Research
Data Exploration and Visualization
Design of Experiments
Mass Customization
Predictive Modeling and Machine Learning
Quality and Process Engineering
Reliability Analysis
Sharing and Communicating Results
Selecting Sensory Evaluation Terms for Gels and Visualizing the Characteristics of Gelling Agents Through Texture Mapping_樋口 侑夏 (2020-JA-PO-P1)
Friday, December 18, 2020
Level: Beginner. Texture is one of the factors most important to how good a food tastes. Gelling agents give gel-type foods such as jellies and puddings their texture, and adjusting the types of polysaccharides and their blend ratios makes a wide variety of texture designs possible. Sensory evaluation is essential for developing and proposing gelling agents that match manufacturers' needs and market trends. However, texture is perceived differently by different people, so if evaluation terms are chosen subjectively, suitable terms may be missed and the results become strongly biased toward the person in charge; the evaluation criteria also end up depending on the individual, so perceptions of a gelling agent's characteristics may differ. In this study we used multiple correspondence analysis to select evaluation terms objectively and to standardize the evaluation criteria. We then applied multivariate analyses (principal component analysis and cluster analysis) to the sensory evaluation results to position the textural characteristics of the gelling agents relative to one another, and shared them visually through a texture map.
The presenter works in R&D at Unitec Foods Co., Ltd., which sources hydrocolloids such as pectin, natural food ingredients, and functional ingredients from overseas suppliers and provides Japanese food manufacturers with know-how on using them, and is engaged in product development and the creation of core technologies through statistical data analysis.
Labels (10):
Advanced Statistical Modeling
Automation and Scripting
Basic Data Analysis and Modeling
Consumer and Market Research
Design of Experiments
Mass Customization
Predictive Modeling and Machine Learning
Quality and Process Engineering
Reliability Analysis
Sharing and Communicating Results
Changes in Road Traffic Volume and Air Quality Under the COVID-19 State of Emergency_堺 温哉 (2020-JA-25MP-15)
Friday, December 18, 2020
Level: Beginner. Co-authors: 早崎 将光, Senior Researcher, Environment Evaluation Group, Energy and Environment Research Division, Japan Automobile Research Institute; 伊藤 晃佳, Group Leader and Senior Researcher, Environment Evaluation Group, Energy and Environment Research Division, Japan Automobile Research Institute.
Our main research themes are road traffic and the atmospheric environment, and the atmospheric environment and its effects on human health, so road traffic volume is a key piece of information for us. Cross-sectional traffic volume, one indicator of road traffic, is measured by vehicle detectors and similar devices, and data are published for each measurement point at five-minute intervals; counts are currently published for about 2,400 points within Tokyo. Cross-sectional traffic volume is important as an areal indicator of traffic over a relatively wide region. The state of emergency declared in response to the spread of the novel coronavirus (COVID-19) changed social and economic activity substantially, and road traffic is thought to have been affected as well. We analyzed the change in traffic volume in Tokyo before and after the state-of-emergency period, using cross-sectional traffic volume as the indicator, and also examined changes in air quality over the same period. JMP was our main analysis tool: its data table editing functions, such as joining and concatenating tables, and its graphs, which stay linked to the data, let us carry out the analysis efficiently. This report introduces how we used JMP.
堺 温哉 completed the doctoral program of the United Graduate School of Agricultural Sciences, Ehime University (Ph.D. in agriculture), then held positions as a JSPS postdoctoral research fellow, at Hamamatsu University School of Medicine (educational affairs assistant), at Yokohama City University School of Medicine (assistant professor), and at Shinshu University School of Medicine (specially appointed assistant professor) before joining the Japan Automobile Research Institute as a senior researcher in September 2012, holding the current position since April 2020. Main current research theme: environmental epidemiology of traffic-related air pollution (TRAP).
早崎 将光 left the doctoral program in geoscience at the University of Tsukuba in 2000 after completing the required coursework and received a Ph.D. in science from the university's Graduate School of Life and Environmental Sciences in 2006, then worked as a postdoc and project researcher at the National Institute for Environmental Studies, the Center for Environmental Remote Sensing at Chiba University, the University of Toyama, Kyushu University, and the Atmosphere and Ocean Research Institute of the University of Tokyo before taking the current position in 2017. Main research themes include identifying the causes of high-concentration air pollution events.
伊藤 晃佳 completed the doctoral program in environmental and resource engineering at the Graduate School of Engineering, Hokkaido University, in March 2002 (Dr. Eng.), joined the Japan Automobile Research Institute in April 2002, and has held the current position since 2010. Recent work includes evaluating source contributions to the atmospheric environment, analyses using ambient air monitoring data (continuous monitoring stations and the like), and analyses using atmospheric simulations such as CMAQ.
No handout is available for this session.
Labels (12):
Advanced Statistical Modeling
Automation and Scripting
Basic Data Analysis and Modeling
Consumer and Market Research
Content Organization
Data Blending and Cleanup
Data Exploration and Visualization
Design of Experiments
Mass Customization
Predictive Modeling and Machine Learning
Quality and Process Engineering
Reliability Analysis
Building Debian 10 and MySQL into a NAS Enclosure and Using It from JMP_田久 浩志 (2020-JA-45MP-14)
Friday, December 18, 2020
Level: Beginner. A graduate student with little SQL experience needed to work with big data, so we attempted the setup described in the title. To also raise graduate students' IT skills, we obtained only the enclosure of a Buffalo LinkStation 410 NAS and installed Debian Linux 10.5 on it. We then set up MariaDB (a MySQL-compatible database) on the NAS and migrated roughly 50 million records of emergency medical services data from an Access database on a Windows PC into MariaDB. The analyses are now run by connecting to MariaDB on the NAS from Query Builder in JMP Pro 15.1.0 on the PC. Because the NAS is small (640 g), this approach even makes it possible to ship a data server environment by courier to graduate students working remotely. This report covers the technical know-how and the operational points to watch when assembling a MySQL server from a bare NAS enclosure and linking it to JMP.
田久 浩志 completed the master's program in electrical engineering at the Keio University Graduate School of Engineering in March 1980 (M.Eng.) and received a doctorate in medicine from Toho University School of Medicine in June 1993. He is currently a professor in the Graduate School of Emergency Medical System, Kokushikan University. Awards: SAS Users' Group International Japan Achievement Award (1999), SAS users' conference Japan poster award (2011), and European Resuscitation Council Best of the Best Abstract (2010, 2018).
Labels (11):
Advanced Statistical Modeling
Automation and Scripting
Basic Data Analysis and Modeling
Consumer and Market Research
Content Organization
Data Access
Design of Experiments
Mass Customization
Predictive Modeling and Machine Learning
Quality and Process Engineering
Reliability Analysis
Preventing the Collapse of Health Care Caused by Staff Turnover_濃沼 政美 (2020-JA-25MP-13)
Friday, December 18, 2020
Level: Beginner. Full title: Preventing the collapse of health care caused by staff turnover! An analysis of why health care workers resign and how their satisfaction changes before and after changing jobs.
"Collapse of health care" refers to a situation in which a stable, continuous system of care can no longer be maintained because of, for example, staff resignations or deteriorating hospital finances. To help prevent collapse driven by resignations, we have previously analyzed the latent factors that lead people to resign. This time we surveyed physicians, nurses, and hospital pharmacists with experience of resigning about the overt reasons they left and about the change in job satisfaction brought about by changing jobs (the satisfaction change rate), and compared professions. The survey was administered to health care workers registered as panelists with a web survey company. To compare reasons for resigning across professions and age groups, we drew a two-dimensional map from a correspondence analysis: male hospital pharmacists (under 37) and male physicians (under 37) were positioned in the same direction with respect to views on salary and career advancement, while female hospital pharmacists (under 37) and female physicians (under 37) were positioned in the same direction with respect to marriage and child rearing. We then analyzed the change in job satisfaction after a job change with a partition model: among hospital pharmacists, those without children, with relatively high income, under 37, and unmarried tended to see their job satisfaction fall after changing jobs. Making visible the factors associated with a drop in satisfaction after a job change might help persuade at least some people to stay.
Profile: 1993, Department of Pharmacy, Nippon Medical School Hospital; 2004, School of Pharmacy, Nihon University; 2013, professor, Department of Pharmacy and Graduate School of Pharmaceutical Sciences, Teikyo Heisei University; 2014, specially appointed professor (concurrent), Clinical Trial Center, Shinshu University Hospital. Societies and organizations: delegate, Pharmaceutical Society of Japan; delegate, Japanese Society of Pharmaceutical Health Care and Sciences; councilor and chair of the public relations committee, Japanese Society for Clinical Pathway; vice-chair of the committee for training clinical research pharmacists, Tokyo Metropolitan Hospital Pharmacists Association; special committee member, Kanagawa Prefectural Hospital Pharmacists Association. Other appointments: member of the ethics review board and certified clinical research review board, Tokyo Metropolitan Geriatric Hospital and Institute of Gerontology; IRB member, Nitobe Memorial Nakano General Hospital. Qualifications: board-certified supervising pharmacist of the Japanese Society of Pharmaceutical Health Care and Sciences, acupuncturist, health information manager. Hobbies: forest-road cycling, camp cooking, swimming, bird photography.
Labels (9):
Advanced Statistical Modeling
Basic Data Analysis and Modeling
Consumer and Market Research
Data Exploration and Visualization
Design of Experiments
Mass Customization
Predictive Modeling and Machine Learning
Quality and Process Engineering
Reliability Analysis
Statistical Methods as a Means of Assuring the Quality of Data Manipulation_安井 清一 (2020-JA-25MP-12)
Friday, December 18, 2020
Level: Intermediate. Before analyzing data, an analysis dataset has to be created: extracting the necessary records, rearranging how variables correspond to one another, and transforming, categorizing, or recategorizing variables. JMP provides row and column extraction, database-style operations such as joins, and the computational functions needed for variable transformation, so an analysis dataset can be created easily. However, the analyst still has to specify the extraction settings and transformation commands, and the more complex those commands become, the more likely it is that the result is not what was intended. For example, when setting extraction ranges or categorizing with If statements, the more complicated the "and"/"or" rules get, the greater the chance that the desired analysis dataset has not actually been obtained. It is therefore necessary to check mechanically whether the analysis dataset matches the analyst's intent.
JMP's statistical methods can be used to confirm the quality of an analysis dataset. Checking the maximum and minimum of a variable is the simplest method, but Distribution and Fit Y by X are also powerful; in Fit Y by X, an R² of 1 is the evidence that a derived variable is correct. This presentation reports a case in which the quality of an analysis dataset built from large-scale data was checked in this way.
安井 清一 is a lecturer in the Department of Industrial Administration, Faculty of Science and Technology, Tokyo University of Science, and a senior researcher in the Department of Chemical System Engineering at the University of Tokyo. His research specialty is statistical quality control. He mainly studies the statistical analysis methods needed for quality control, and also applies statistical quality control to fire-resistant construction, fire phenomena, and health and nursing care, using JMP to model large-scale data, extract the information hidden behind it, and feed the results back to each field.
Labels (13):
Advanced Statistical Modeling
Automation and Scripting
Basic Data Analysis and Modeling
Consumer and Market Research
Content Organization
Data Access
Data Blending and Cleanup
Data Exploration and Visualization
Mass Customization
Predictive Modeling and Machine Learning
Quality and Process Engineering
Reliability Analysis
Sharing and Communicating Results
Analyzing Organizational Survey Data and Turning the Results into Proposals_川崎 昌 (2020-JA-25MP-30)
Friday, December 18, 2020
Level: Beginner. In recent years, business intelligence and people analytics, which analyze and visualize the data that accumulate day by day in running a company and put them to work in strategy and decision making, have been attracting attention. This presentation reports a case in which survey data about employees and the organization were analyzed with JMP and the results were used in consulting proposals for management decision making.
In management consulting, quantitative and qualitative organizational surveys are indispensable for understanding the actual state of an organization. To achieve the sustained organizational growth that comes from the growth of each employee, it is desirable to be able to visualize the state of the organization from these survey data and use them for prediction and decision making. In this case we apply JMP's multivariate analysis capabilities, together with a methodology based on visualization tools called analysis model diagrams and structural model diagrams, to data obtained at Company A. The result is that clear, accessible proposals can be made to Company A's management. The presentation introduces the whole flow from data collection to proposal for an analytical approach that goes one step beyond ordinary descriptive statistics.
川崎 昌 worked at a marketing company, at a provider of EAP (Employee Assistance Program) services, and at a venture company before becoming an independent consultant in organization and human resources. While engaged in organizational and human resource development work, 川崎 entered a doctoral program as a working graduate student and studied analysis and design based on questionnaire surveys and questionnaire experiments, and has continued since completing the program to combine corporate practice and research, mainly on social science themes. Current roles: specially appointed lecturer in the College of Business Management at J. F. Oberlin University, director of the NPO GEWEL, and representative of FREELY LLC, developing theory and applying the developed theory in practice. http://researchmap.jp/sho-kawasaki/
高橋 武則 has studied quality management (QM), statistical quality management (SQM), and design theory for nearly 50 years. Since the start of the 21st century he has proposed the design paradigm Hyper Design, developed its underlying mathematics, the HOPE theory, and jointly developed its supporting software, the HOPE Add-in, with SAS. The new design method is realized by combining three elements: the Hyper Design way of thinking, the HOPE statistical theory, and the HOPE Add-in for JMP as the supporting tool. As a social science extension of this theory he has proposed multi-group principal component regression.
橘 雅恵 has focused on building personnel systems since opening a certified social insurance and labor consultant office, supporting more than 80 companies. She believes that building the system best suited to each company requires organizational climate diagnosis and pay analysis based on employee interviews and employee surveys. She helped establish Japan Consulting Firm, a group of specialists supporting management as a whole, and aims for a team that can identify causal relationships from data rather than experience and intuition alone, pinpoint management issues with high precision, and propose ways to improve business results and develop the organization.
Labels (13):
Advanced Statistical Modeling
Automation and Scripting
Basic Data Analysis and Modeling
Consumer and Market Research
Content Organization
Data Access
Data Exploration and Visualization
Design of Experiments
Mass Customization
Predictive Modeling and Machine Learning
Quality and Process Engineering
Reliability Analysis
Sharing and Communicating Results
Comprehensive, Practical Design of Experiments Education Using Virtual Teaching Materials_小川 昭 (2020-JA-25MP-11)
Friday, December 18, 2020
Level: Beginner. Experiments in industrial practice require large budgets, long lead times, and a great deal of labor, so they must reliably produce results. This calls for a comprehensive set of abilities that can only be understood and exercised through experience:
1. Know how important it is to reduce error variation, which requires knowing how error variation affects the results.
2. Take up the handful of factors that may really matter, which requires making use of screening experiments.
3. Make sure the model (the fitted function) is not missing any terms, which requires a modeling experiment free of lack of fit (LOF).
4. Recognize that the essence of design is optimization by mathematical programming, which requires capable, easy-to-use software.
5. Even when the statistics are handled correctly, variation often pushes the solution off target, so regression adjustment is needed afterwards.
Acquiring all of this in a short time, safely, and convincingly requires hands-on education with virtual teaching materials. This study proposes a comprehensive, practical design of experiments curriculum built around a ball-flight simulator, covering experimental planning, running the experiment, data analysis, design (optimization), and regression adjustment in concrete terms.
The presenter studied applied physics at university and, after joining a major control equipment manufacturer, handled production from fundamentals to application: research and development of the semiconductor devices used in the company's products, development of elemental production technologies, construction of manufacturing lines, customer quality assurance, and ISO certification. Later, in graduate school, the presenter studied optimization from a management perspective based on statistical quality control (SQC) and earned a doctorate in business administration, and now pays closer attention to data management as information and communication technology advances: just as measurement technology merged with the internet to become IoT, the engineers who promoted SQC now work as data scientists. A JMP user in business since 2003, the presenter quickly recognized its potential, has presented on design of experiments and optimization at JMPer's Meetings, and has presented at Discovery Summit Japan since 2016.
A more detailed paper is available; those interested should contact the presenter, 小川, directly.
Labels (10):
Advanced Statistical Modeling
Basic Data Analysis and Modeling
Consumer and Market Research
Data Blending and Cleanup
Design of Experiments
Mass Customization
Predictive Modeling and Machine Learning
Quality and Process Engineering
Reliability Analysis
Sharing and Communicating Results
Developing a JMP Add-In for Williams and Shirley-Williams Multiple Comparisons_福島 慎二 (2020-JA-25MP-09)
Friday, December 18, 2020
Level: Beginner. In nonclinical studies in the pharmaceutical industry, a drug's effect often has to be demonstrated by hypothesis testing in studies with several dose levels, and multiple comparison procedures are essential for strict control of the type I error rate. Dunnett's multiple comparison procedure is widely used to compare a control group with several dose groups, but because it compares the control with every dose simultaneously, its power drops in studies with many groups. The Williams multiple comparison procedure applies when the drug effect is dose dependent (monotone); it tests sequentially from the highest dose group downward in a closed procedure, so it loses almost no power relative to a two-group t test, which makes it an attractive method. Unfortunately, it is not currently available in JMP. We are therefore developing an add-in, written in JSL, that runs the sequential closed tests from the highest dose according to Williams' method and reports whether each difference is significant by comparing the computed statistics with the published statistical tables for the Williams test. We also plan to include the nonparametric version, the Shirley-Williams multiple comparison procedure. The presentation introduces the add-in's features, including a demonstration of the prototype.
福島 慎二 is a systems engineer at Takumi Information Technology who has handled nonclinical business for pharmaceutical customers since last year, developing statistical analysis software for nonclinical departments and serving as a statistics seminar instructor and consultant. Before that, he spent many years in the pharmacology department of a pharmaceutical company running in vitro and in vivo pharmacology studies and was involved in marketing approval applications. After transferring to a group company's research administration department in 2007 and taking charge of nonclinical statistical analysis, he was introduced to JMP by 芳賀敏郎, 高橋行雄, and others, and has used it ever since. Recent work centers on development with the JMP Scripting Language (JSL), with plans for a series of nonclinical add-ins that extend JMP's functionality.
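For readers unfamiliar with the procedure, the core computation can be sketched as follows. This is an illustrative Python outline under textbook assumptions (equal variances, a monotone dose response), not the presenters' JSL add-in: for each dose level, taken from the highest downward, it forms Williams' amalgamated estimate of that dose's mean and the corresponding t-bar statistic against the control. Deciding significance still requires the published Williams critical values, which are not reproduced here.

```python
import numpy as np

def williams_mu_hat(dose_means, dose_ns, k):
    """Williams' amalgamated estimate for dose level k (1-based, control excluded):
    the maximum over u <= k of the n-weighted average of group means u..k."""
    best = -np.inf
    for u in range(k):
        w = np.array(dose_ns[u:k], float)
        m = np.array(dose_means[u:k], float)
        best = max(best, (w * m).sum() / w.sum())
    return best

def williams_statistics(groups):
    """groups[0] is the control; groups[1:] are increasing dose levels.
    Returns (dose level, t-bar statistic, error df), highest dose first."""
    means = [np.mean(g) for g in groups]
    ns = [len(g) for g in groups]
    ss = sum(((np.asarray(g) - m) ** 2).sum() for g, m in zip(groups, means))
    df = sum(ns) - len(groups)              # pooled within-group variance
    s2 = ss / df
    stats = []
    for k in range(len(groups) - 1, 0, -1):  # closed procedure: test the highest dose first
        mu_k = williams_mu_hat(means[1:], ns[1:], k)
        t_bar = (mu_k - means[0]) / np.sqrt(s2 * (1 / ns[k] + 1 / ns[0]))
        stats.append((k, t_bar, df))         # compare t_bar with Williams' tabulated critical value
    return stats

# Hypothetical example: a control group plus three increasing dose groups
rng = np.random.default_rng(0)
data = [rng.normal(10 + d, 2, size=8) for d in (0, 0.5, 1.0, 2.0)]
for k, t_bar, df in williams_statistics(data):
    print(f"dose level {k}: t-bar = {t_bar:.3f} (df = {df})")
```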
Labels (10):
Advanced Statistical Modeling
Automation and Scripting
Basic Data Analysis and Modeling
Consumer and Market Research
Data Exploration and Visualization
Mass Customization
Predictive Modeling and Machine Learning
Quality and Process Engineering
Reliability Analysis
Sharing and Communicating Results
With JMP, Lifetime Data Analysis and Product Reliability Evaluation Are Nothing to Fear_廣野 元久_遠藤 幸一 (2020-JA-45MT-07)
Friday, December 18, 2020
Level: Intermediate. JMP probably has the most powerful capabilities and the most systematic applications of any software for reliability data analysis. Using JMP's reliability and survival analysis platforms, this talk gives a systematic tour, within the time available, of how to analyze lifetime data: univariate distributions, bivariate relationships, prediction, and modeling. For modeling in particular, we plan to introduce reliability activities and the data analysis process through hypothetical examples, kept close to real cases, involving the renewal theorem and methods used in reliability testing.
廣野 元久 joined Ricoh in 1984 and has since worked on the company's quality management and reliability management and on promoting statistics, serving as head of the QM Promotion Office in the Quality Division and head of the SF Business Center before taking the current position. He was a lecturer in the Faculty of Engineering at the Tokyo University of Science (1997-1998) and a part-time lecturer in the Faculty of Policy Management at Keio University (2000-2004). His main specialties are statistical quality control and reliability engineering. His books include 「グラフィカルモデリングの実際」, 「JMPによる多変量データの活用術」, 「アンスコム的な数値例で学ぶ統計的計算方法23講」, 「JMPによる技術者のための多変量解析」, and 「目からウロコの多変量解析」.
遠藤 幸一 joined Toshiba in 1987, worked on product and process development for power ICs (power supply ICs, 500 V driver ICs for motors, and so on), and now works on the development of failure analysis technology. Ph.D. in information science, Osaka University.
No handout is available for this session.
Labels (11):
Advanced Statistical Modeling
Automation and Scripting
Basic Data Analysis and Modeling
Consumer and Market Research
Content Organization
Data Exploration and Visualization
Mass Customization
Predictive Modeling and Machine Learning
Quality and Process Engineering
Reliability Analysis
Sharing and Communicating Results
Reliability Data Analysis in Practice with JMP_廣野 元久_遠藤 幸一 (2020-JA-45MT-06)
Friday, December 18, 2020
Level: Intermediate. This session presents case studies of reliability data analysis, centered on Weibull analysis using Weibull probability paper as practiced in reliability engineering. Reliability data are easily misinterpreted when they are not handled properly. Continuing from last year, this talk uses demonstrations to show the key points of analyzing reliability data correctly with JMP, compared against conventional approaches.
廣野 元久 joined Ricoh in 1984 and has since worked on the company's quality management and reliability management and on promoting statistics, serving as head of the QM Promotion Office in the Quality Division and head of the SF Business Center before taking the current position. He was a lecturer in the Faculty of Engineering at the Tokyo University of Science (1997-1998) and a part-time lecturer in the Faculty of Policy Management at Keio University (2000-2004). His main specialties are statistical quality control and reliability engineering. His books include 「グラフィカルモデリングの実際」, 「JMPによる多変量データの活用術」, 「アンスコム的な数値例で学ぶ統計的計算方法23講」, 「JMPによる技術者のための多変量解析」, and 「目からウロコの多変量解析」.
遠藤 幸一 joined Toshiba in 1987, worked on product and process development for power ICs (power supply ICs, 500 V driver ICs for motors, and so on), and now works on the development of failure analysis technology. Ph.D. in information science, Osaka University.
No handout is available for this session.
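As a point of comparison outside JMP, a basic Weibull fit to complete (uncensored) failure times can be sketched in a few lines of Python. The data below are simulated and the example is illustrative only; real reliability data usually also contain censored units, whose correct handling is one of the pitfalls such talks address.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
# Hypothetical complete failure times (hours), drawn from a Weibull with shape 1.8 and scale 500
failures = stats.weibull_min.rvs(1.8, scale=500, size=60, random_state=rng)

# Maximum likelihood fit; floc=0 fixes the location at zero (two-parameter Weibull)
shape, loc, scale = stats.weibull_min.fit(failures, floc=0)
print(f"shape (beta) = {shape:.2f}, scale (eta) = {scale:.1f}")

# B10 life: the time by which 10% of units are expected to have failed
b10 = stats.weibull_min.ppf(0.10, shape, loc=0, scale=scale)
print(f"estimated B10 life = {b10:.1f} hours")
```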
Labels (8):
Advanced Statistical Modeling
Basic Data Analysis and Modeling
Consumer and Market Research
Design of Experiments
Mass Customization
Predictive Modeling and Machine Learning
Quality and Process Engineering
Reliability Analysis
An Introduction to Statistical Machine Learning with JMP_廣野 元久 (2020-JA-45MT-05)
Friday, December 18, 2020
Level: Intermediate. With JMP (Pro) you can enjoy analysis far more casually than with R or Python. It may not be fully bespoke analysis, but it copes perfectly well with semi-custom work. With JMP the following are all easy: (1) analysis with the mouse alone, with no commands to type; (2) graphs and statistics delivered together as a set; (3) the analysis process saved as a script; (4) reports output along the flow of the analysis; and (5) systematic understanding and learning, because statistical thinking underlies everything. Using numerical examples, this talk covers the kinds of prediction and classification that can be done in JMP, with methods such as kernel smoothing, SVM, and neural network discrimination, and deepens understanding by contrasting them with conventional statistical multivariate analysis.
廣野 元久 joined Ricoh in 1984 and has since worked on the company's quality management and reliability management and on promoting statistics, serving as head of the QM Promotion Office in the Quality Division and head of the SF Business Center before taking the current position. He was a lecturer in the Faculty of Engineering at the Tokyo University of Science (1997-1998) and a part-time lecturer in the Faculty of Policy Management at Keio University (2000-2004). His main specialties are statistical quality control and reliability engineering. His books include 「グラフィカルモデリングの実際」, 「JMPによる多変量データの活用術」, 「アンスコム的な数値例で学ぶ統計的計算方法23講」, 「JMPによる技術者のための多変量解析」, and 「目からウロコの多変量解析」.
No handout is available for this session.
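To give a flavor of the kind of classification the talk covers, here is a minimal scikit-learn sketch of one of the listed methods, an RBF-kernel support vector machine, on a classic public data set. It is an outside-JMP illustration, not material from the presentation, and the data set and settings are arbitrary choices.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Classic illustrative data set; the talk itself uses its own numerical examples in JMP
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# RBF-kernel SVM with standardized inputs, comparable in spirit to a kernel-based classifier in JMP Pro
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
```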
Labels (12):
Advanced Statistical Modeling
Automation and Scripting
Basic Data Analysis and Modeling
Consumer and Market Research
Content Organization
Data Access
Data Exploration and Visualization
Design of Experiments
Mass Customization
Predictive Modeling and Machine Learning
Quality and Process Engineering
Reliability Analysis
Designing In-Plane Distributions with the Functional Data Explorer_三井 正 (2020-JA-25MP-04)
Friday, December 18, 2020
Level: Advanced. At Discovery Summit Japan 2017 I presented a method for designing in-plane distributions under the title "Creating surface shapes freely with JMP," and named extension to higher-order polynomial models as future work. This presentation introduces a more advanced method for designing in-plane distributions using the Functional Data Explorer implemented in JMP Pro, and by revisiting part of the 2017 presentation it also gives an overview of in-plane distribution design.
The presenter works as a statistical problem-solving consultant and JMP trainer, guiding technical data analysis in general, starting with design of experiments. Since going freelance in 2020 the presenter has expanded into a broader range of technical fields, widening activities to statistical education and case-based consultation, and is the author of 『JMPではじめる統計的問題解決入門』 and 『JMPではじめるデータサイエンス』 (both from Ohmsha). Background: worked in Toshiba's semiconductor research and development division on measurement technology development, mainly image processing; during a posting of nearly eight years in the United States came to feel keenly that statistical processing is essential for making good use of measurement data, and after moving to the innovation promotion division drew on that experience to help spread data science within the company.
No handout is available for this session.
Labels (10):
Advanced Statistical Modeling
Basic Data Analysis and Modeling
Consumer and Market Research
Data Blending and Cleanup
Data Exploration and Visualization
Design of Experiments
Mass Customization
Predictive Modeling and Machine Learning
Quality and Process Engineering
Reliability Analysis
Making Experiments and Analysis More Efficient with JMP_細島 章 (2020-JA-45MP-01)
Friday, December 18, 2020
Level: Intermediate. With workstyle reform on everyone's lips, making experiments and analysis more efficient has become ever more important. A good way to show the power of JMP for efficient experimentation is to recast existing experimental data as a definitive screening design (DSD) or a custom design and show how drastically the number of runs can be reduced, picking the response values out of the existing data. When experimental data are found split across several tables, combine them into a single table, run a multivariate analysis, and visualize it with the profiler so that people notice the pitfalls of the one-factor-at-a-time (OFAT) approach. To the habit of analyzing replicated experiments only through their averages, show that there are alternatives: stacking the data, multi-objective optimization on the mean and variance, and robust optimization.
When design of experiments is used in development, the presence of interactions often cannot be predicted in advance, and interactions are by no means rare. A DSD has no confounding between main effects and two-factor interactions (2FI) and none among 2FIs, and the number of runs stays small, at roughly twice the number of factors; this is a major advantage. The talk reports what was actually learned from using DSDs: the breakdown that occurs when the number of main effects plus interaction terms approaches the number of factors, how augmented designs resolve it, and the practically important points found in DSD papers obtained from the JMP Community and ASQ.
After serving at Yamatake-Honeywell (now Azbil) as general manager of FA development, executive officer and head of the R&D division, executive officer and head of the quality assurance promotion division, and adviser to Azbil Kimmon, 細島 章 founded the consulting firm 東林コンサルティング. His specialties cover statistical problem solving in general, including yield and quality improvement through production data analysis, field-failure prediction, robust design, design optimization, and design of experiments, as well as on-site guidance in design review, root cause analysis (RCA), prevention of human error, and process improvement. His books include 『ネットビジネスの本質』 (JUSE Press, 2001, co-author; winner of a Telecom Social Science Award), 『実践ベンチャー企業の成功戦略』 (Chuokeizai-sha, 2011, co-author), and 『よくわかる「問題解決」の本』 (Nikkan Kogyo Shimbun, 2014, sole author). His main paper, a proposal for workshops that encourage the reporting and acceptance of near-misses and feelings of unease on production lines (Japanese Society for Quality Control, 2016), received the 2016 Quality Technology Award. Recent talks include "A proposal for a framework that visualizes the organizational factors inducing work errors and promotes improvement" (Discovery-Japan 2018) and "Solving quality problems with JMP: defect analysis and reliability prediction in manufacturing" (Discovery-Japan 2019).
Labels (13):
Advanced Statistical Modeling
Automation and Scripting
Basic Data Analysis and Modeling
Consumer and Market Research
Content Organization
Data Blending and Cleanup
Data Exploration and Visualization
Design of Experiments
Mass Customization
Predictive Modeling and Machine Learning
Quality and Process Engineering
Reliability Analysis
Sharing and Communicating Results
Understanding the Health of the Screened Population from Specific Health Checkup Results, Using NDB Open Data_増川 直裕 (2020-JA-25MP-16)
Friday, December 18, 2020
Level: Intermediate. Japan's specific health checkups aim, from the standpoint of preventing lifestyle-related diseases, to reduce the number of people aged 40 to 74 who meet the criteria for metabolic syndrome. A summary of checkup results over everyone screened can serve as a benchmark for comparing an individual against the whole, and should therefore be useful for the personal health management of middle-aged and older people.
The NDB open data published by the Ministry of Health, Labour and Welfare provide, for each fiscal year, the means and binned distributions of checkup items (waist circumference, blood glucose, blood pressure, and so on), and for several of the items used to judge metabolic syndrome the number of people outside the reference range and the number examined can be obtained for each attribute, such as sex and age group. Plotting the out-of-range proportion for each item by attribute produces some intriguing results; for some items, for example, no age-related trend appears.
This presentation shows these plots together with the results of fitting a generalized linear model to the out-of-range proportion for each checkup item, with fiscal year, prefecture, sex, and age group as factors. The model makes it possible to predict the out-of-range proportion for a given combination of attributes (sex, age group, prefecture of residence) and so to understand the screened population in greater depth.
増川 直裕 is a technical engineer in the JMP Japan division, currently doing mainly pre-sales work on JMP products for pharmaceutical and food companies. Although he introduces JMP to customers, he is very much a JMP user himself, and in recent years he has posted analyses of topics in the news to his blog and to JMP Public, the sharing site for analysis reports: https://public.jmp.com/users/259
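The modeling step described above can be sketched outside JMP with a binomial generalized linear model. Everything below is illustrative: the data frame is a made-up stand-in for the published NDB counts (people examined and people outside the reference range per fiscal year, prefecture, sex, and age group), and statsmodels is used in place of JMP's Fit Model platform.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Hypothetical stand-in for the NDB open data: counts per attribute combination
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "year": rng.choice(["2016", "2017", "2018"], 300),
    "prefecture": rng.choice(["Tokyo", "Osaka", "Aichi", "Hokkaido"], 300),
    "sex": rng.choice(["M", "F"], 300),
    "age_group": rng.choice(["40-44", "50-54", "60-64", "70-74"], 300),
    "n_examined": rng.integers(200, 2000, 300),
})
df["n_out_of_range"] = rng.binomial(df["n_examined"], 0.25)

# Binomial GLM: response is (out-of-range, within-range) counts; factors enter as dummy variables
endog = np.column_stack([df["n_out_of_range"], df["n_examined"] - df["n_out_of_range"]])
exog = sm.add_constant(pd.get_dummies(df[["year", "prefecture", "sex", "age_group"]],
                                      drop_first=True, dtype=float))
fit = sm.GLM(endog, exog, family=sm.families.Binomial()).fit()
print(fit.summary())

# Predicted out-of-range proportion for each attribute combination in the data
df["predicted_proportion"] = fit.predict(exog)
```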
Labels (13):
Advanced Statistical Modeling
Automation and Scripting
Basic Data Analysis and Modeling
Consumer and Market Research
Content Organization
Data Access
Data Exploration and Visualization
Design of Experiments
Mass Customization
Predictive Modeling and Machine Learning
Quality and Process Engineering
Reliability Analysis
Sharing and Communicating Results
How to Use the Formula Editor_岡田 雅一 (2020-JA-25MT-17)
Friday, December 18, 2020
Level: Beginner. Recent versions of JMP can create a new column from a simple calculation straight from the menus, using only the mouse. This session goes back to basics and explains how to use the Formula Editor to build a column formula from scratch.
No handout is available for this session.
岡田 雅一 transferred to the JMP Japan division in 2001 and worked in technical support, and since 2010 has been the sales engineer responsible for the electrical, electronics, and semiconductor industries.
Labels (7):
Advanced Statistical Modeling
Automation and Scripting
Basic Data Analysis and Modeling
Mass Customization
Predictive Modeling and Machine Learning
Quality and Process Engineering
Reliability Analysis
An Introduction to CDISC and JMP Clinical_宮田 英明 (2020-JA-30MT-20)
Friday, December 18, 2020
Level: Beginner. Since April 1, 2020, submission of electronic data has been required for drug marketing approval applications in Japan. This talk first introduces the CDISC data standards used in those submissions, and then introduces JMP Clinical, whose strength is the analysis of CDISC-compliant data, with a demonstration.
宮田 英明 is a systems engineer on the academic sales team of the JMP Japan division at SAS Institute Japan. In a previous role at a CRO and a pharmaceutical company he worked on statistical analysis for clinical trials. M.S. in mathematics.
Labels (7):
Advanced Statistical Modeling
Basic Data Analysis and Modeling
Consumer and Market Research
Mass Customization
Predictive Modeling and Machine Learning
Quality and Process Engineering
Sharing and Communicating Results
A Dark Tale: Visual Storytelling With JMP® (2021-EU-30MP-792)
Sunday, March 7, 2021
Caleb King, JMP Research Statistician Tester, SAS In this talk, we illustrate how you can use the wide array of graphical features in JMP, including new capabilities in JMP 16, to help tell the story of your data. Using the FBI database on reported hate crimes from 1991-2019, we’ll demonstrate how key tools in JMP’s graphical toolbox, such as graphlets and interactive feature modification, can lead viewers to new insights. But the fun doesn’t stop at graphs. We’ll also show how you can let your audience “choose their own adventure” by creating table scripts to subset your data into smaller data sets, each with their own graphs ready to provide new perspectives on the overall narrative. And don’t worry. Not all is as dark as it seems... Auto-generated transcript... Speaker Transcript Caleb King Hello, my name is Caleb King, I am a developer at the at the JMP software. I'm specifically in the design of experiments group, but today I'm going to be a little off topic and talk about how you can use the graphics and some of the other tools in JMP to help with sort of the visual storytelling of your data. Now the data set I've chosen is a...to illustrate that is the hate crime data collected by the FBI so maybe a bit of a controversial data set but also pretty relevant to what what's been happening. And my goal is to make this more illustrative so I'll be walking through a couple graphics that I've created previously for this data set. And I won't necessarily be showing you how I made the graphs but the purpose is to kind of illustrate for you how you can use JMP for, like I said, visual storytelling so use the interactivity to help lead the people looking at the graphs and interacting with it to maybe ask more questions, maybe you'll be answering some of their questions, as you go along. But kind of encourage that data exploration, which is what we're all about here at JMP. So with that said let's get right in. I'll first kind of give you a little bit of overview of the data set itself, so I'll kind of just scroll along here. So there's a lot of information about where the incidents took place. As we keep going, and the date, well, when that incident occurred. You have a little bit of information about the offenders and what offense, type of offense was committed. Again some basic information, what type of bias was presented in the incident. Some information about the the victims. And overall discrimination category, and then some additional information I provided about latitude and longitude, if it's available, as well as some population that I'll be using other graphics. Now just for the sake of, you know, to be clear, the FBI, that's the United States Federal Bureau of Investigation, defines a hate crime is any criminal offense that takes place that's motivated by biases against a particular group. So that bias could be racial, against religion, gender, gender identity and so forth. So, as long as a crime is motivated by a particular bias, it's considered a hate crime, and this data consists of all the incidents that have been collected by the FBI, going back all the way to the year 1991 and as recent as 2019. I don't have data from 2020, as interesting as that certainly would be, but that's because the FBI likes to take its time and making sure the data is thoroughly cleaned and prepared before they actually create their reports. So you can rest assured that this data is pretty accurate, given how much time and effort they put into making sure it is. 
Alright, so with that let's kind of get started and do some basic visual summary of the data. So I'll start by running this little table script up here. And all this does is basically give us a count over over the days, so each day, how many incidents occurred, according to a particular bias. From this I'm going to create a basic plot, in this case it's simple sort of line plot here, showing the number of incidents that happen each day over the entire range. So you get the entire...you can see the whole range of the data 1991 to 2019 and how many incidents occurred. Now this in and of itself would probably a good static image, because you get kind of get a sense of where the the number of incidents falls. In fact here I'm going to change the axis settings here a little bit. Let's see, we got increments in 50s, let's do it by 20s. There we go. So there's a little bit of interactivity for you... interaction. We changed the scales to kind of refine it and get a better sense of how many incidents, on average, there are. I ran a bit of a distribution and, on average, around 20 incidents per day that we see here. Now of course, you're probably wondering why I have not yet addressed the two spikes that we see in the data. So yes, there are clearly two really tall spikes. And so, if this were any other type of software, you might say, okay, I'd look like to look into that. So you go back to the data you try and, you know, isolate where those dates are and maybe try and present some plots or do some analysis to show what's going on there. Well, this is JMP and we have something that can help with that, and it's something that we introduced in JMP 15 called graphlets and it works like this. I'm just going to hover and boom. A little graphlet has appeared to help further summarize what's going on at that point. So in this case there's a lot of information. We'll notice first the date, May 1, 1992. So if you're familiar with American history, you might know what's happening here, but if not, you can get a little bit of an additional clue by clicking on the graph. So now you'll see that I'm showing you the incidents by the particular bias of the incident. So we see here that most of the incidents were against white individuals and then next group is Black or African-American and it continues on down. I kind of give away the answer here, in that the incidents that occurred around this time where the Rodney King riots in California. Rodney King, an African-American individual who was unfortunately slain by a police officer and that led to a lot of backlash and rioting around this time. So that's what we're seeing captured in this particular data point, and if you didn't know that, you would at least have this information to try to start and go looking there...looking online to figure out what happened. We can do the same thing here with a very large spike. And again, I'll use the hover graphlet, so hover over it. I'll pause to let you look. So we look at the date, September 12, 2001. That's in it of itself a very big clue as to what happened. But if we look here at the breakdown, we can see that most of the incidents were against individuals of Muslim faith, of Arab ethnicity or some other type of similar ethnicity or ancestry. 
In this case, we can clearly see that, after the unfortunate events of September 11, the terrorist attacks that occurred then, there was on the following day, a lot of backlash against members who were of similar ethnicity, similar faith and so forth, so we had an unfortunate backlash happening at that time. So already with just this one plot and some of that interactivity, we've been able to glean a lot of information, a lot of high level information in areas where you might want to look further. But we can keep going. Now something new in JMP 16 is, because we have date here on the X axis, we can actually bin the dates into a larger category, so in this case let's bin it by month. And we see that the plot disappears. So here's what I'm going to do. I'm going to rerun it and let's see. There we go. You never know what will happen. In this case, so this is what's supposed to happen; don't worry. So we've binned it by month and we noticed an interesting pattern here. There seems to be some sort of seasonal trend occurring, and let's use the hover graphlets to kind of help us identify what might be happening. So I'm going to hover over the lower points. So if I do that, we see okay, January, December, okay. Interesting. Let's hover over another one, December. And yet another one, December. Ah, there might actually be some actual seasonal trend in this case going on. We seem to hit low points around the the winter months. And in fact, if I go back to my data table, I've actually seen that before. It was something I kind of discovered while exploring that technique and I've already created a plot to kind of address that. So this was something I created based off of that, kind of, look at, you know, what's the variation in the number of incidents over all the years within this month. And here we can see them the mean trends, but we also see a lot of variation, especially here in September because of that huge spike there. So maybe we need something a little more appropriate. So I'll open the control panel and hey, let's pick the the median. That's more robust and maybe look at the interquartile range, so that way we have a little bit more robust metrics to play with. And so, again, we see that seasonal trends, so it seems that there's definitely a large dip within the winter months as opposed to peaking kind of in the spring and summer months. Now this might be something someone might want to look further into and research why is that happening. You might have your own explanations. My personal explanation is that I believe the Christmas spirit is so powerful that overcomes whatever whatever hate or bias individuals might have in December. Again that's just my personal preference, you probably have your own. But again, with just a single plot, I was able to discover this trend and make another plot to kind of explore that further. So again with just this one plot, I've encouraged more research. And we can keep going. So let's see, let's bin it by year, and if we do that, we can clearly see this kind of overall trend. So we see a kind of peak in the late '90s around early 2000s before dropping, you know, almost linear fashion, until it hits its midpoint about in the mid 2010s before starting to rise again. So keep that in mind, you might see similar trends in other plots we show. But again, let's take a step back and just realize that in this one plot we've seen different aspects of the data. We even answered some questions, but we've also maybe brought up a few more. 
What's with that seasonal trend? And if you didn't know what those events were that I told you, you know, what were those particular events? So that's the beauty of the interactivity of JMP graphics is it allows the user to engage and explore and encourages it all within just one particular medium. All right. Let's keep going. So I mentioned, this is sort of visual storytelling, so you can think of that sort of as a prologue, as sort of the the overall view. What's...what's what's the overall situation? Now let's look at kind of the actors, that is, who's committing these types of offenses? Who are they committing them against? What information can we find out about that? So here I've created, again, a plot to kind of help answer that. Now this might be a good start. Here I've created a heat map that then emphasizes the the counts by, in this case, the offender's race versus their particular bias. So we see that a lot of what's happening, in this case I've sorted the columns so we can see there's a there's a lot going on. Most of its here in this upper left corner and not too much going on down here, which I guess is good news. There's a lot of biases where there's not a lot happening, most of it's happening here in this corner. Now, this might be a good plot, but again there's a lot of open space here. So maybe we can play around with things to try and emphasize what's going on. So one way I can do that is I'll come here to the X axis and I'm going to size by the count. Now you'll see here, I had something hidden behind the scenes. I'd actually put a label, a percentage label on top of these. There was just so much going on before that you couldn't even see it, but now we can actually see some of that information. So kind of a nice way to summarize it as opposed to counts. But even with just the visualization, we can clearly see the largest amount of bias is against Black or African American citizens and then Jewish and on down until there's hardly any down here. So just by looking at the X axis, that gives you a lot of information about what's going on. We can do the same with the Y, so again, size by the count. And again, there's a lot of information contained just within the size and how I've adjusted the the axes. And this case we include...we've really emphasized that corner, so we can clearly see who the top players are. In this case, most of it is offenders are of white or unknown race against African-Americans, the next one being against Jewish, and then anti white and then it just keeps dropping down. So we get a nice little summary here of what's going on. Now, you may have noticed that as I'm hovering around, we see our little circle. That's my graphlet indicator, so again I've got a tool here. We've we've interacted a little bit and again, this could be a great static image, but let's use the power of JMP, especially those graphslets, to interact and see what further information we can figure out. So in this case, I'll hover over here. And right here, a graph, in this case, a packed bar chart, courtesy of our graph guru Xan Gregg. In this case, not only can you see, you know, what people are committing the offenses and against whom, your next question might have been, you know, what types offenses are being committed? Well, with a graphlet, I've answered that for you. We can see here the largest...the overwhelming type of offense is intimidation, followed by simple and aggravated assault, and then the rest of these, that's the beauty of the packed bar chart. 
We can see all the other types offenses that are committed. If you stack them all on top of each other, they don't even compare. They don't even break the top three. So that tells you a lot about the types of...these types of offenses, how dominant they are. Now, another question you might have is, okay, we've seen the actors, we've seen the actions they're taking, but there's a time aspect to this. Obviously this is happening over time, so has this been a consistent thing? Has there been a change in the trends? Well, have no fear. Graphlets again to the rescue. In this case, I can actually show you those trends. So here we can see how has the types of...the number of intimidation incidents changed over time? And again, we see that the pattern seems to follow what that overall trend was. A peak in the like, late 90s, and then the steep trend...almost linear drop until about the mid 2010s, before kind of upticking again more recently. And again we can maybe see that trend and others. I won't click to zoom in, but you can just see from the plot here, those trends in simple assault here and aggravated assault as well, a little bit there. And you can keep exploring. So let's look at the unknown against African-Americans and see what difference there might be there. In this case, we can clearly see that there are two types of offenses that really dominate, in this case, destruction or damage to property (which, if you think about it, might make sense; if you see your property's been damaged, there's a good chance you may not know who did it) and intimidation are the dominant ones. And again, you can...the nice thing about this is the hover labels kind of persist, so you can again look and see what trends are happening there. So in this case, we see with damage, there's actually two peaks, kind of peaked here in the late '90s early 2000s, before dropping again. And with intimidation, we see a similar trend as we did before. Again within just one graphic, there's a lot of information contained and that you, as the user, can interact with to try and emphasize certain key areas, and then you, as the user, just visualize...just looking at this and interacting with it, can play around and glean a lot of information. All right. And let's keep going. Now you'll notice that amongst the reporting agencies, so, most of them are city/county level police departments and so forth, but there's also some universities in here. So there might be someone out there who might be interested in seeing, you know, what's happening at the universities. And so, with that, I've created this nice little table script to answer that. Now this time, I've been just running the table scripts and I mentioned, I won't go too much behind the scenes, this is more illustrative. Here I'm going to let you take a peek, because I want to not only show you the power of the graphics but also the power of the table script. Now if you're familiar with JMP, you might know, okay, the table script's nice because I can save my analyses, I can save my reports, I can even use it to save graphics like I...like I did in the last one, so you may not have noticed that you can also save scripts to help run additional tables and summary tables and so forth. So let me show you what all is happening behind here, in fact, when I ran the script, I actually created two data tables. 
You only saw the one, so in this case I first created the data table that selected all the universities and then from that data table it created a summary and then I close the old one. And then I also added to that some of the graphics. So I won't go into too much detail here about how I set this up, because I want to save that for after the next one. I'll give you a hint. It's based off of a new feature in JMP 16 that will really amaze you. All right, let's go back to...excuse me...university incidents. And here again I've saved the table script. This one that will show us a graphic. So here we can see again is that packed bar chart, and here I'm kind of showing you which universities had the most incidents. Now again, this in and of itself might be a pretty good standard graphic. You can see that, you know, which university seem to have the most incidents happening and again it's kind of nice to see that there's no real dominating one. You can still pack the other universities on top of them, and nobody is dominating one or the other. So that in and of itself is kind of good news, but again there's a time aspect to this. So have these been maybe... has the University of Michigan Ann Arbor, have they had trouble the entire time? Have they...would they have always remained on top? Did they just happen to have a bad year? Again, graphlets to the rescue. In this case, you'll see an interesting plot here. You might say, you know, what what is this thing? This looks like it belongs in an Art Deco museum. What... what kind of plot is this? Well, it's actually one we've seen before. I'm just using something new that came out in JMP 16, so I'm going to give you a behind the scenes look. And in this case, we can see, this is actually a heat map. All I've done is I do a trick that I often like to do, which is to emphasize things two different ways, so not only emphasizing the counts by color, which is what you would typically do in a heat map, the whites are the missing entries, I can also now in JMP 16 emphasize by size. And so I think this again gets back to where we size those axes before. It emphasizes...helps emphasize certain areas. So here we can see now maybe there's a little bit of an issue with incidents against African-Americans, that has been pretty consistent, with an especially bad year in apparently 2017, as opposed to all of the other incidents that have been occurring. Now there's no extra hover labels here. All I'll do is summarize the data, but that's okay. This in and of itself gives you a lot of information, so this is a new thing that came out in JMP 16 that can again help with that emphasis. And again, we can keep going. We can look at other universities, so here, this might be an example of a university where they seem to have a pretty bad period of time, the University of Maryland in College Park, but then there was an area where things were really good, and so you might be interested in knowing, well, what happened to make this such a great period? Is there something the university instituted, what they did that seemed to cause the count, the number of incidents to drop significantly? That might be something worth looking into. And you can keep going and looking again to see whether it's a systemic issue, whether like, in this case, it seemed there's just a really bad year that dominated, overall they were just doing okay. They were doing pretty good. Again, this might be another one. 
They had a really bad time early on, but recently they've been doing pretty good, and so forth. So again, kind of highlighting that interactivity yet again, and in this case, with some of the newer features in JMP 16. Now, before we transition to the last one, I have a confession. I'm a bit of a map nerd, so I really like maps and any type of data analysis that, you know, relates to maps. I don't know why. I just really like it and so I'm really excited to show you this next one, because now we look at the geography of the incidents. But I'm also excited because this really, I believe, highlights the power of both the table scripts and the JMP graphics, especially the hover graphs. So hopefully that got you excited as well, so let's run it. Now this one's going to take a little while because there's actually a lot going on with this table script. It's creating a new table. It's also doing a lot of functions in that table and computing a lot of things. So here we've got not just, you know, pulling in information but also there's a lot of these columns here near the end that have been calculated behind the scenes. Now I have to take a brief moment to talk about a particular metric I'm going to be using. So a while back, I wrote a blog entry called the Crime Rate Conundrum on on the JMP Community (community.jmp.com), so shameless plug there, but in that I talked about how, you know, typically when you're reporting incidents, especially crime incidents, usually we kind of know that you don't just want to report the raw counts, because there might be a certain area where it has a high number of counts, a high number of incidents, is that just because that's...there's a problem at problem there? Or is it because there's just a lot of people there? And so we, of course, would expect a lot of incidents because there's just a lot of people. So of course people report incidents rates. Now that's fine because everybody's now on a level playing field but one side effect of that is it tends to elevate places that have small populations. Essentially, you have, if you have small denominator, you will tend to have a larger ratio just because of that. And so that's sort of an unfortunate side effect, and so there, I talk about an interesting case where we have a place with a really small population that gets really inflated. And how some people deal with that. One way I tried to address that was through this use of a weighted incident rate, essentially, the idea is I take your incident rate, but then I weight you by, sort of, what proportion... excuse me...basically a weight by how many people you have there. In this case, I have a particular weight, I basically rank the populations, so that the the largest place would have rank of of the smallest. However, in this case there's 50 states, so the state with the largest population would have a rank of 50 and the smallest state a rank of one. If you take that and divide that by you know the maximum rank, that's essentially your weight so it's it's a way to kind of put a weight corresponding to your total population and the idea here is that, if your incident rate is such that it overcomes this weight penalty, if you will, then that means that you might be someone worth looking into. So it tries to counteract that inflation, just due to a small population. If you are still relatively small, but your incident rate is high enough that you overcome your weight essentially, we might want to look into you. 
So hopefully that wasn't too much information, but that's the metric that I'll be primarily using so I'll run the script and here we go. So first I've got a straightforward line plot that kind of shows the weighted incident rates over time for all the states. Now I'll use a new feature here. We can see here that New Jersey seems to dominate. Again interactivity, we can actually click to highlight it. There's some new things that we do, especially in JMP 16. I'm going to right click here and I'm going to add some labels. So let's do the maximum value and let's do the last value just for comparison. So here we can see this...the peak here was about 11.4 incidents per 1,000 (that's a weighted incident rate) here in sort of the early '90s. And then we see a decreasing trend, again it seems to drop about the same that all the the overall incident rate did before starting to peak again here in 2019. So again with just some brief numbers again this, in and of itself, would be an interesting plot to look at, but as you could see, my little graphlet indicator is going, so there's more. Here's where the the map part comes in. So I'm going to hover over a particular point. In this case, not only can you see sort of the overall rate, I can actually break it down for you, in this case by county. So here I've colored the counties by the total number of incidents within that year. And again, there's that time aspect, so this shows you a snapshot for one particular year, in this case 2008. But maybe you're interested in the overall trend, so one one way you could do that is, hey, these are graphlets. I could go back, hover over another spot, pull up that graph, click on it to zoom, repeat as needed. You could do that or you could use this new trick I found actually while preparing this presentation. Let's unhide...notice over here to the side, we have a local data filter. That's really the key behind these graphlets. I'm going to come here to the year and I'm going to change its modeling type to nominal, rather than continuous, because now, I can do something like this. I can actually go through and select individual years or, now this is JMP, we can do better. Let me go here and I'm going to do an animation. I'm going to make it a little fast here. I'm going to click play, and now I can just sit back relax and, you know, watch as JMP does things for me. So here we can see it cycle through and getting a sense of what's happening. I'll let it cycle through a bit. We see...already starting to see some interesting things happening here. Let's let it cycle through, get the full picture, you know. We want the complete picture, not that I'm showing off or anything. Alright, so we've cycled through and we noticed something. So let's let's go down here to about say 2004, 2005. So somewhere around here, we noticed this one county here, in particular, seems to be highlighted. And in fact, you saw my little graphlet indicator. So again, I can hover over it, and here yet another map. Now you can see why I'm so excited. Again, in this case, I can actually show you at the county level, so the individual county level... Excuse me, let me...let's move that over a bit. There we are. Some minor adjustments and again, you can see my trick of emphasizing things two ways by both size and color. 
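The weighted incident rate described in the transcript above can be written down compactly. The numbers and column names below are hypothetical and only restate the speaker's description (rank the states by population, divide the rank by the maximum rank to get a weight, and multiply the per-1,000 incident rate by that weight); this is not code from the presentation.

```python
import pandas as pd

# Hypothetical state-level counts for one year
df = pd.DataFrame({
    "state": ["A", "B", "C", "D"],
    "population": [500_000, 2_000_000, 8_000_000, 39_000_000],
    "incidents": [40, 90, 300, 1_100],
})

# Incidents per 1,000 residents
df["rate_per_1000"] = df["incidents"] / df["population"] * 1000

# Weight = population rank / maximum rank (largest state gets weight 1, smallest the lowest weight)
df["weight"] = df["population"].rank(method="min") / len(df)
df["weighted_rate"] = df["rate_per_1000"] * df["weight"]
print(df[["state", "rate_per_1000", "weight", "weighted_rate"]])
```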
We can kind of see dispersion within the ???, this is individual locations and because there's that time aspect again, we know...now we know better, we don't have to go back and click and get multiple graphs, we can again use the local data filter tricks. So I can go back. I'll do the year, and so in this case, we can again click through. Here I'm just going to use the arrow keys on my keyboard to kind of cycle through. And just kind of get a sense of how things are varying over time. In this case, you see a particular area, you've probably already seen it, starting about 2006, 2007ish frame. There's this one area...this here. Keansburg, which seems to be highlighted and you'll notice yet another graphlet. How far do you want to go? Graphception, if you will. We can keep going down further and further in. In this case, I get...I break it out by what the bias was, and again I could do that trick if I wanted to, to go through and cycle through by year. So, again so much power in these graphs. With this one graphlet, I was able to explore geographical variation at county level and even further below, and so it might be allowing you to kind of explore different aspects of the data, allowing you to generate more questions. What was happening in Keansburg around this time to make it pop like this? That's something you might want to know. So that's all I have for you today, hopefully I've whet your appetite and was able to clearly illustrate for you how powerful the the JMP visualization is in exploring the data. If you want to know more, there's going to be a plenary talk on data viz. I definitely encourage you to explore that and it kind of helps address different ways of visualization and how JMP can help out with that. But I did promise you, at one point, to give you a peek as to how I was able to create these pretty amazing table scripts and I'll do that right now. It's called the enhanced log now in JMP 16. This is one of the coolest new features in JMP 16. Enhanced log actually follows along as you interact and it keeps track of it. And so whenever I closed, in this case, closed a data table, opened a data table, ran a data table, if I added a new column, if I created a new graph, it gets recorded here in the log. This is something that John Sall will be talking about in his plenary talk. It's, again, one of the most new amazing features here. And this is the key to how I was able to create these tables scripts. I can honestly say that if this hadn't been present, I probably wouldn't have been able to create these pretty cool table scripts, because it'd be a lot of work to do. So again, this is a really cool feature that's available in JMP 16. So I hope I was able to convince you that JMP is a great tool for exploring data, for creating awesome visualizations, interactive visualizations. And that's all I have. Thank you for coming.
Labels
(9)
Labels:
Labels:
Automation and Scripting
Basic Data Analysis and Modeling
Data Access
Data Blending and Cleanup
Data Exploration and Visualization
Design of Experiments
Mass Customization
Quality and Process Engineering
Reliability Analysis
0 attendees
0
0
0 attendees
0
0
Expanding Our Text Mining Toolkit with Sentiment Analysis and Term Selection in JMP® Pro 16 (2021-EU-45MP-790)
Sunday, March 7, 2021
Ross Metusalem, JMP Systems Engineer, SAS Text mining techniques enable extraction of quantitative information from unstructured text data. JMP Pro 16 has expanded the information it can extract from text thanks to two additions to Text Explorer’s capabilities: Sentiment Analysis and Term Selection. Sentiment Analysis quantifies the degree of positive or negative emotion expressed in a text by mapping that text’s words to a customizable dictionary of emotion-carrying terms and associated scores (e.g., “wonderful”: +90; “terrible”: -90). Resulting sentiment scores can be used for assessing people’s subjective feelings at scale, exploring subjective feelings in relation to objective concepts and enriching further analyses. Term Selection automatically finds terms most strongly predictive of a response variable by applying regularized regression capabilities from the Generalized Regression platform in JMP Pro, which is called from inside Text Explorer. Term Selection is useful for easily identifying relationships between an important outcome measure and the occurrence of specific terms in associated unstructured texts. This presentation will provide an overview of both Sentiment Analysis and Term Selection techniques, demonstrate their application to real-world data and share some best practices for using each effectively. Auto-generated transcript... Speaker Transcript Ross Metusalem Hello everyone, and thanks for taking some time to learn how JMP is expanding its text mining toolkit in JMP Pro 16 with sentiment analysis and term selection. I'm Ross Metusalem, a JMP systems engineer in the southeastern US, and I'm going to give you a sneak preview of these two new analyses, explain a little bit about how they work, and provide some best practices for using them. So both of these analysis techniques are coming in Text Explorer, which for those who aren't familiar, this is JMP's tool for analyzing what we call free or unstructured texts, so that is natural language texts. And it's a what we call a text mining tool, so that is a tool for deriving quantitative information from free text so that we can use other types of statistical or analytical tools to derive insights from the free text or maybe even use that text as inputs to other analyses. So let's take a look at these two new text mining techniques that are coming to Text Explorer in JMP Pro 16, and we'll start with sentiment analysis. Sentiment analysis at its core answers the question how emotionally positive or negative is a text. And we're going to perform a sentiment analysis on the Beige Book, which is a recurring report from the US Federal Reserve Bank. Now apologies for using a United States example at JMP Discovery Europe, but the the Beige Book does provide a nice demonstration of what sentiment analysis is all about. So this is a monthly publication, it contains national level report, as well as 12 district level reports, that summarize economic conditions in those districts, and all of these are based on qualitative information, things like interviews and surveys. And US policymakers can use the qualitative information in the Beige Book, along with quantitative information, you know, in traditional economic indicators to drive economic policy. So you might think, well, we're talking about sentiment or emotion right now. I don't know that I expect economic reports to contain much emotion, but the Beige Book reports and much language does actually contain words that can carry or convey emotion. 
So let's take a look at an excerpt from one report. Here's a screenshot straight out of the new sentiment analysis platform. You'll notice some words highlighted, and these are what we'll call sentiment terms, that is, terms that we would argue have some emotional content to them. For example at the top, "computer sales, on the other hand, have been severely depressed," where "severely depressed" is highlighted in purple, indicating that we consider it to convey negative emotion, which it seems to; if somebody describes computer sales as "severely depressed," it sounds like they mean for us to interpret that as certainly a bad thing. If we look down, we see in orange a couple of positive sentiment terms highlighted, like "improve" or "favorable." So we can look for words that we believe carry positive or negative emotional content (purple for negative, orange for positive), and some sentiment analysis keeps things at that level: just a binary distinction, a positive text or a negative text. There are additional ways of performing sentiment analysis, and in particular, many ways try to quantify the degree of positivity or negativity, not just whether something is positive or negative. So consider this other example, and I'll point our attention right to the bottom here, where we can see a report of "poor sales." And I'm going to compare that with where we said "computer sales are severely depressed." Both of these are negative statements, but I think we would all agree that "severely depressed" sounds a lot more negative than just "poor" does. So we want to figure out not only whether an expressed sentiment is positive or negative, but how positive or negative it is, and that's what sentiment analysis in Text Explorer does. So how does it do it? Well, it uses a technique called lexical sentiment analysis that's based on sentiment terms and associated scores. What we're seeing right now is an excerpt from what we'd call a sentiment dictionary that contains the terms and their scores. For example, the term "fantastic" has a score of positive 90 and the term "grim" at the bottom has a score of -75. So we specify which terms we believe carry emotional content and the degree to which they're positive or negative on an arbitrary scale, here -100 to 100. And then we can find these terms in all of our documents and use them to score how positive or negative those documents are overall. If you think back to our example "severely depressed," the word "severely" takes the word "depressed" and, as we call it, intensifies it. It is an intensifier, or a multiplier, of the expressed sentiment, so we also have a dictionary of intensifiers and what they do to the sentiment expressed by a sentiment term. For example, we say "incredibly" multiplies the sentiment by a factor of 1.4, whereas "little" multiplies the sentiment by a factor of .3, so it actually attenuates the sentiment expressed a little. Now, finally, there's one other type of word we want to take into account, and that is negators, things like "no" and "not," and we treat these basically as polarity reversals. So "not fantastic" would take the score for "fantastic" and multiply it by -1. And so this is a common way of doing sentiment analysis, again called lexical sentiment analysis.
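To make that arithmetic concrete, here is a minimal Python sketch of the lexical scoring idea just described: a dictionary of term scores, intensifiers as multipliers, negators as polarity reversals, and a document score taken as the average of the matched term scores. The dictionaries are illustrative stand-ins (only the "fantastic," "grim," "incredibly," and "little" values echo numbers quoted in the talk); this is not JMP's built-in dictionary or its exact matching logic.

import re

# Illustrative dictionaries, not JMP's built-in sentiment dictionary.
SENTIMENT = {"fantastic": 90, "favorable": 60, "improve": 40,
             "poor": -40, "depressed": -70, "grim": -75}
INTENSIFIERS = {"incredibly": 1.4, "severely": 1.3, "little": 0.3}  # "severely" is an assumed value
NEGATORS = {"no", "not"}

def score_document(text):
    """Average the scores of matched sentiment terms, applying any
    intensifier or negator that immediately precedes a term."""
    tokens = re.findall(r"[a-z']+", text.lower())
    scores = []
    for i, tok in enumerate(tokens):
        if tok in SENTIMENT:
            s = SENTIMENT[tok]
            prev = tokens[i - 1] if i > 0 else ""
            if prev in INTENSIFIERS:
                s *= INTENSIFIERS[prev]   # multiply the expressed sentiment
            if prev in NEGATORS:
                s *= -1                   # polarity reversal
            scores.append(s)
    return sum(scores) / len(scores) if scores else None

print(score_document("Computer sales were severely depressed, but orders continue to improve."))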
So what we do is we take sentiment terms that we find, we multiply them by any associated intensifier or negators and then for each document, when we have all the sentiment scores for the individual terms that appear, we just average across all them to get a sentiment score for that document. And JMP returns these scores to us in a number of useful ways. So this is a screenshot out of the sentiment analysis tool and we're going to be, you know, using this in just a moment. But you can see, among other things, it gives us a distribution of sentiment scores across all of our documents. It gives us a list of all the sentiment terms and how frequently they occur. And we even have the ability, as we'll see, to export the sentiment scores to use them in graphs or analyses. And so I've actually made a couple graphs here to just try to see as an initial first pass, does the sentiment in the Beige Book reports actually align with economic events in ways that we think it should? You know, do we really have some validity to this sentiment as some kind of economic indicator? And the answer looks like, yeah, probably. Here I have a plot that I've made in Graph Builder; it's sentiment over time, so all the little gray dots are the individual reports and the blue smoother kind of shows the trend of sentiment over time with that black line at zero showing neutral sentiment, at least according to our scoring scheme. The red areas are times of economic recession as officially listed by the Federal Reserve. So you might notice sentiment seems to bottom out or there are troughs around recessions, but another thing you might notice is that actually preceding each recession, we see a drop in sentiment either months or, in some cases, looks like even a couple years, in advance. And we don't see these big drops in sentiment in situations where there wasn't a recession to follow. So maybe there's some validity to Beige Book sentiment as a leading indicator of a recession. If we look at it geographically, we see things that make sense too. This is just one example from the analysis. We're looking at sentiment in the 12 Federal Reserve districts across time from 1999 to 2000 to 2001. This was the time of what's commonly called the Dotcom Bust, so this is when there was a big bubble of tech companies and online businesses that were popping up and, eventually, many of them went under and there were some pretty severe economic consequences. '99 to 2000 sentiment is growing, in fact sentiment is growing pretty strongly, it would appear, in the San Francisco Federal Reserve district, which is where many of these companies are headquartered. And then in 2001 after the bust, the biggest drop we see all the way to negative sentiment in red here, again occurring in that San Francisco district. So, just a quick graphical check on these Beige Book sentiment scores suggests that there's some real validity to them in terms of their ability to track with, maybe predict, some economic events, though of course, the latter, we need to look into more carefully. But this is just one example of the potential use cases of sentiment analysis and there are a lot more. One of the biggest application areas where it's being used right now is in consumer research, where people might, let's say, analyze some consumer survey responses to identify what drives positive or negative opinions or reactions. 
But sentiment analysis can also be used in, say, product improvement, where analyzing product reviews or warranty claims might help us find product features or issues that elicit strong emotional responses in our customers. Looking at, say, customer support, we could analyze call center or chat transcripts to find support practices that result in happy or unhappy customers. Maybe even public policy, where we analyze open commentary to gauge the public's reaction to proposed or existing policies. These are just a few domains where sentiment analysis can be applied. It's really applicable anywhere you have texts that convey some emotion and that emotion might be informative. So that's all I want to say up front. Now I'd like to spend a little bit of time walking you through how it works in JMP, so let's move on over to JMP. Here we have the Beige Book data. Down at the bottom, you can see we have a little over 5,000 reports, and we have the date of each report, from May 1972 to October 2020, which of the 12 districts it's from, and then the text of the report. And you can see that these reports are not just quick statements of, you know, the economy is doing well or poorly; they can get into some level of detail. Now, before we dive into these data, I do want to say thank you to somebody for the idea to analyze the Beige Book and for actually pulling down the data and getting it into JMP in a format ready to analyze. That thanks goes to Xan Gregg who, if you don't know, is a director on the JMP development team and the creator of Graph Builder, so thanks, Xan, for your help. Alright, so let's quantify the degree of positive and negative emotion in all these reports. We'll use Text Explorer under the Analyze menu. Here we have our launch window. I'll take our text data and put it in the Text Columns role. A couple of things to highlight before we get going. Text Explorer supports multiple languages, but right now, sentiment analysis is available only in English. One other thing I want to draw attention to is stemming, right here. For those who do text analysis, you're probably familiar with what stemming is, but for those who aren't, stemming is a process whereby we kind of collapse multiple... well, to keep it nontechnical... multiple versions of the same word together. Take "strong," "stronger," and "strongest." These are three versions of the same word "strong," and in some text mining techniques you'd want to combine all those into one term and just say, oh, they all mean "strong," because that's conceptually the same thing. I'm going to leave stemming off here, actually, and it's because... take "strongest," which describes something as strong as it can get, versus "stronger," which says that it's strong, but there is still room for it to be even stronger. So "strongest" should probably get a higher sentiment score than "stronger" should, and if I were to stem, I would lose the distinction between those words. Because I don't want to lose that distinction, I want to give them different sentiment scores, so I'm going to keep stemming off here. So I'll click OK. JMP now is going to tokenize our text, that is, break it into all of its individual words and then count up how often each one occurs. And here we have a list of all the terms and how frequent they are. "Sales" occurs over 46,000 times, and we also have our phrase list over here.
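For anyone who wants to see what that tokenize-and-count step amounts to outside of JMP, here is a rough Python sketch: split each report into lowercase word tokens and tally term frequencies. It ignores the phrase detection, stop word, and stemming options Text Explorer offers, and the two mini-reports are made up.

import re
from collections import Counter

reports = [
    "Retail sales improved while manufacturing activity weakened.",
    "Manufacturing contacts reported weak sales and slower activity.",
]

# Tokenize each report into lowercase word tokens and tally term frequencies.
term_counts = Counter(
    token
    for text in reports
    for token in re.findall(r"[a-z']+", text.lower())
)

print(term_counts.most_common(5))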
So the phrases are sequences of two or more words that occur a lot, and sometimes we actually want to count those as terms in our analysis. For sentiment analysis, you would want to go through your phrase list and, let's say, maybe add "real estate," which is two words but really refers to property. And I could add that. Now, normally in text analysis we'd also add what are called stop words, that is, words that don't really carry meaning in the context of our analysis and that we'd want to exclude. Take "district." The Beige Book uses the word "district" frequently, just saying, you know, "this report from the Richmond district," something like that; it's not really meaningful. But I'm actually not going to bother with stop words here, and that's because, if you remember back from our slides, we said that all of our sentiment scores are based on a dictionary, where we choose the sentiment words and what score they should get. And if we just leave "district" out, it inherently gets a score of 0 and doesn't affect our sentiment score, so I don't really need to declare it as a stop word. So once we're ready, we would invoke text... or, excuse me, sentiment analysis under the red triangle here. What JMP is doing right now, because we haven't declared any sentiment terms of our own, is using a built-in sentiment dictionary to score all the documents. Here we get our scores out. Now, before navigating these results, we probably should customize our sentiment dictionary, that is, the sentiment-bearing words and their scores. That's because in different domains, maybe with different people generating the text, certain words are going to bear different levels of sentiment, or bear sentiment in one case and not another. So we want to carefully and thoroughly create a sentiment dictionary that we feel accurately captures sentiment as it's conveyed by the language we're analyzing right now. JMP, like I said, uses a built-in dictionary, and it's pretty sparse. You can see it right here: it has some emoticons, like these little smileys and whatnot, but then we have some pretty clear sentiment-bearing language, like "abysmal" at -90. Now, it's probably not the case that somebody's going to use the word "abysmal" and not mean it in a negative sense, so we feel pretty good about that. But it's not a terribly long list, and we may want to add some terms to it. So let's see how we do that, and one thing I can recommend is just looking at your data. Read some of the documents that you have and try to find some words that you think might be indicative of sentiment. We actually have down here a tool that lets us peruse our documents and directly add sentiment terms from them. So here I have a document list. You can see Document 1 is highlighted and then Document 1 is displayed below. I could select different documents to view them. Now, if we look at Document 1, right off the bat you might notice a couple of potential sentiment terms like "pessimism" and "optimism." You can see these aren't highlighted; they actually aren't included in the standard sentiment dictionary. And a lot of nouns, you'll find, actually aren't, and that's because nouns like "pessimism" or "optimism" can be described in ways that suggest their presence or their absence, basically. So I could say, you know, "pessimism is declining" or "there's a distinct lack of pessimism" or "pessimism is no longer reported."
And, in those cases, we wouldn't say "pessimism" is a negative thing. So you want to be careful and think about words in context and how they're actually used before adding them to a sentiment dictionary. For example, I could go back up to our term list here. I'm just going to show the filter, look for "pessimism," and then show the text to have a look at how it's used. We can see in this first example, "the mood of our directors varies from pessimism to optimism." And in the next one, "private conversations a deep mood of pessimism." If you actually read through, this is the typical use, so in the Beige Book they don't actually seem to use the word "pessimism" in the ways I was worried about, such as describing its absence. So I feel okay about adding "pessimism" here; let's add it to our sentiment dictionary. If I just hover right over it, you can see we bring up some options of what to do with it. Here I can, let's say, give it a score of -60. And so now that will be added to our dictionary with that corresponding score, and it triggers updated sentiment scoring in JMP. That is, it's now looking for the word "pessimism" and adjusting all the document scores where it finds it. So let's go back up now to take a look at our sentiment terms in more detail. If I scroll on down, you will find "pessimism" right here with the score of -60 that I just gave it. Now, if you notice, "pessimistic" by default has a score of -50, so maybe I'd just type -50 in here to make that consistent. I could, but I'm not going to, just so that we don't trigger another update. You'll also notice, right here, this list of possible sentiment terms. These are terms that JMP has identified as maybe bearing sentiment, and you might want to take a look at them and consider them for inclusion in your dictionary. For example, the word "strong" here: if you look at some of the document texts to the right, you might say, okay, this is clearly a positive thing. And if you've looked at a lot of these examples, it really stands out that the word "strong," and correspondingly "weak," are words that these economists use a whole lot to talk about things that are good or bad about the current economy. So I could add them, or add "strong" here, by clicking on, let's say, positive 60 in the buttons up there. Again, I won't right now, just for the sake of expediting our look at sentiment analysis. So, just like with our texts down below, we could go through our sentiment term list here to choose some good candidates. Under the red triangle, we can also manage the sentiment terms more directly, in the full term management lists that Text Explorer users may be used to, like phrase management and stop word management. You can see we've added one new term local to this particular analysis, in addition to all of our built-in terms. Of course, we could declare exceptions too, if we want to not actually include some of those. And importantly, you can import and export your sentiment dictionary as well. Another way to declare sentiment terms is to consult with subject matter experts. Economists would probably have a whole lot to say about the types of words they would expect to see that would clearly convey positive or negative emotion in these reports.
And if we could talk to them, we would want to know what they have to say about that, and we might even be able to derive a dictionary with them in, say, a tab-separated file and then just import it here. And then, of course, once we make a dictionary we feel good about, we should export it and save it so that it's easy to apply again in the future. So that's a little bit about how you would actually curate your sentiment dictionary. It would also be important to curate your intensifier terms and your negation terms; for the negators you don't see scores here, because they are just polarity reversals. Just to show you a little bit more about what that actually looks like, let's take a look at sentiment here, so we can see instances in which JMP has found the word "not" before some of these sentiment terms and applied the appropriate score. At the top there, "not acceptable" gets a score of -25. I show you that just to draw your attention to the fact that these negators and intensifiers are being applied automatically by JMP. But anyway, let's move on from talking about how to set the analysis up to actually looking at the results. I'm going to bring up a version of the analysis that's already done, that is, where I've already curated the sentiment dictionary, and we can start to interpret the results that we get out. We have our high-level summary up here, and we have more positive than negative documents. As we discussed before, we can see how many of each. In fact, at the bottom of that table on the left, you see that we have one document that has no sentiment expressed in it whatsoever. Then we have our list of sentiment terms, with "strong" occurring about 14,000 times and "weak" occurring approximately 4,500 times, and looking at these either by their counts or their scores, looking at the most negative and positive, even looking at them in context, can be pretty informative in and of itself. Especially in, say, a domain like consumer research: if you want to know what type of language people use when they're expressing positivity about your brand or your products, maybe that would inform marketing efforts. This list of sentiment terms can be highly informative in that regard. Now, these reports touch on a number of subdomains, such as manufacturing, tourism, and investments, and sometimes we want to zero in on one of those subdomains in particular, what we might call a feature. If I go to this Features tab in sentiment analysis and click Search, JMP finds some words that commonly occur with sentiment terms and asks if you want to dive into the sentiment with respect to that feature. Take, for example, "sales." We can see "sales were reported weak," "weakening in sales," "sales are reported to be holding up well," and so forth. So if I just score this selected feature, what JMP will do now is provide sentiment scores only with respect to mentions of "sales" inside these larger reports, and this is going to help us refine our analysis, or focus it on a really well-defined subdomain. And that's particularly important: it could be the case that the domain of the language we're analyzing isn't so well restricted. Take, for example, product reviews. You're interested in how positive or negative people feel about the product, but they might also talk about, say, the shipping, and you don't necessarily care if they weren't too happy with the shipping, mainly because it's beyond your control.
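Picking up the tab-separated dictionary idea mentioned a moment ago: a hand-curated or expert-derived dictionary can be as simple as a two-column term/score file. The sketch below writes such a file in Python; the file name and the terms are hypothetical, and the exact format JMP expects on import should be checked against the Text Explorer documentation.

import csv

# Hypothetical dictionary drafted with subject-matter experts.
terms = {"expansion": 60, "layoffs": -70, "shortage": -40, "rebound": 55}

# Write a two-column, tab-separated term/score file.
with open("beige_book_sentiment.tsv", "w", newline="") as f:
    writer = csv.writer(f, delimiter="\t")
    for term, score in terms.items():
        writer.writerow([term, score])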
You wouldn't want to just include a bunch of reviews that also comment on that shipping. And so it's important to consider the domain of analysis and restrict it appropriately and the feature finder here is one way of doing that. So you can see now that I've scored on "sales," we have a very different distribution of positive and negative documents. We have more documents that don't have a sentiment score because they simply don't talk about sales or don't use emotional language to discuss it, and we have a different list of sentiment terms now capturing sales in particular. Let me remove this. One thing I realized I forgot to mention, I mentioned it briefly, is how these overall document scores that we've been looking at are calculated, and I said that they're the average of all the sentiment terms of... that occurred in a particular document. So let's look at Document 1. I'd just like to show you that if you're ever wondering where does this score come from, let's say, -20, you can just run your mouse right over and it'll show you a list of all the sentiment terms that appeared. And you can see, here we have 16 of them, including all at the bottom, "real bright spot," which was a +78 and then, if you divide...add all those scores up. divide by 78... or divide by 16, excuse me, then you get an average sentiment of -20. And this is one of two ways to provide overall scores. Another one is a min max scoring, so differences between minimum and maximum sentiments expressed in the text. Now we can get a lot of information from looking at just this report, you know, obviously sentiment scores, the most common terms. But oftentimes we want to take the sentiments and use them in some other way, like look at them graphically, like we did back in the slides. So when it comes time for that part of your analysis, just head on up to the red triangle here and save the document scores. And these will save those scores back to your data table so that you can enter them into further analyses or graph them, whatever it is you want to do. So that's a sneak preview of sentiment analysis coming to Text Explorer in JMP Pro 16. The take-home idea is that sentiment analysis uses a sentiment dictionary that you set up to provide scores corresponding to the positive and negative emotional content of each document, and then from there, you can use that information in any way that's informative to you. So we'll leave sentiment analysis behind now and I'm going to move on back to our slides to talk about the other technique coming to Text Explorer soon. And that is term selection, where term selection answers a different question, and that is, which terms are most strongly associated with some important variable or variable that I care about? We're going to stick with the Beige Book. We're going to ask which words are statistically associated with recessions. So in the graph here, we have over time, the percent change GDP (gross domestic product) quarter by quarter, where blue shows economic growth, red shows economic contraction. And we might wonder, well, what terms, as they appear in the Beige Book, might be statistically associated with these periods of economic downturn? For example, a few of them right here. You know, why would we want to associate specific terms in the Beige Book with periods of economic downturn? Well, it could potentially be informative in and of itself to obtain a list of terms. 
You know, I might find some potentially subtle drivers of, or effects of, recessions that I might not be aware of or easily capture in quantitative data. I might also find some words for further analysis. I might find some potential sentiment terms, terms that are being used when the economy is doing particularly poorly that I missed my first time around when I was doing my sentiment analysis. Or maybe I could find some words so strongly associated with recessions that I think I might be able to use them in a predictive model to try to figure out when recessions might happen in the future. So there are a few different reasons why we might want to know which words are most strongly associated with recessions. So, how does this work in JMP? Well, we basically build a regression model where the outcome variable is the variable we care about, recessions, and the inputs are all the different words. The data, as entered into the model, take the form of a document term matrix, where each row corresponds to one document, or one Beige Book report, and the columns capture the words that occur in that report. Here we have the column "weak" highlighted and it says "binary," which means that it's going to contain 1s and 0s; a 1 indicates that the report contained the word "weak" and a 0 indicates that it didn't. This is one of several ways we could score the documents, but we'll stick with binary for now. So we take this information and we enter it into a regression model. Here's what the launch would look like. We have recession as our Y variable, which is just a yes-or-no variable, and then we have all of these binary term predictors entered in as our model effects. And then we're going to use JMP Pro's Generalized Regression tool to build the model, and that's because Generalized Regression, or GenReg as we call it, includes automatic variable selection techniques. If you're familiar with regularized regression, we're talking the Lasso or the Elastic Net. And if you don't know what that means, that's totally fine. The idea is that it will automatically find which terms are most strongly associated with the outcome variable "recession," and the ones that it doesn't think are associated with it, it will zero out. This allows us to look at relationships between "recession" and perhaps hundreds or thousands of possible words that could be associated with it. So what do we get when we run the analysis? We get a model out. What we have here is the equation for that model. Don't worry about it too much; the idea is that the log odds of recession, which is just a function of the probability that we're in a recession when the Beige Book is issued, is modeled as a function of all the different words that might occur in the Beige Book report. And you can see we have the effect of the occurrence of the word "pandemic" with a coefficient of 1.94. That just means that the log odds of recession go up by 1.94 if the Beige Book report mentions the word "pandemic." Then we see minus 1.02 times "gain." That means if the Beige Book report mentions the word "gain," then the log odds of recession drop by 1.02. So what we get out of that is a list of terms that are strongly associated with an increase in the probability of recession, things like "pandemic," "postponement," "cancellation," "foreclosures."
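As a rough outside-of-JMP analogue of the setup just described (a binary document-term matrix feeding a lasso-penalized logistic regression so that unrelated terms get zero coefficients), here is a scikit-learn sketch. It is not the Generalized Regression platform, and the tiny corpus and recession labels are invented purely to show the mechanics.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

docs = [
    "pandemic postponement cancellations foreclosures deteriorate",
    "gains strengthen competition manufacturing investments",
    "postponed orders and foreclosures as conditions deteriorate",
    "strong gains in manufacturing and tourism investments",
]
recession = [1, 0, 1, 0]  # made-up labels: 1 = recession, 0 = no recession

# Binary document-term matrix: 1 if the term occurs in the report, else 0.
vec = CountVectorizer(binary=True)
X = vec.fit_transform(docs)

# L1 (lasso) penalty zeroes out terms with no apparent association;
# larger C means a weaker penalty. Coefficients on this toy corpus are meaningless.
model = LogisticRegression(penalty="l1", solver="liblinear", C=10.0)
model.fit(X, recession)

# A positive coefficient raises the log odds of recession when the term appears.
for term, coef in zip(vec.get_feature_names_out(), model.coef_[0]):
    if coef != 0:
        print(f"{term:15s} {coef:+.2f}")

In the same spirit, the coefficient of 1.94 reported above for "pandemic" corresponds to multiplying the odds of recession by about e^1.94, roughly 7, whenever the word appears.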
And we also get a list of terms that are associated with a decreasing probability of recession, like "gain," "strengthen," "competition." We also see "manufacturing" right there, but it's got a relatively small coefficient, about -.2. And you'll notice, if we look at a graphical representation of all the terms selected in this analysis, that you don't see too many specific domains like "manufacturing," "tourism," "investments" and all that. That's because those things are always talked about, whether we're in a recession or not, so what we're really looking for is words that are used when we're in a recession more often than you would expect by chance. For example, "pandemic" is the most predictive, which makes a lot of sense. We weren't talking about pandemics at all until pretty recently, and we've also experienced a recession recently, so the model has picked up on that pretty clearly. Then we have a few others in a kind of second tier, so that's "postponed," "cancel," "foreclosed," "deteriorate," "pessimistic." And it's kind of interesting, this "postponement" and "cancellation" being associated with recessions. It makes sense that you'd talk about postponing, say, economic activity when a recession is happening, or at least that's perhaps a reliable trend, so that's insight in and of itself. In fact, I couldn't tell you how the Federal Reserve tracks postponing or canceling of economic activities, but the fact that those terms get flagged in this analysis suggests that's something probably worth tracking. Alright, so that's term selection. We get this nice list of terms associated with recessions out, and we can see the relative strength of association. Now let's see briefly how it's done in JMP. I'm going to head back over to JMP, and what we're going to do is pull up a slightly different data table. It's still Beige Book data, though now we have just the national reports. And we have this accompanying variable for whether or not the US was in a recession at the time. Of course there's some autocorrelation in these data; if we were in a recession last month, it's more likely we're going to be in a recession this month than if we weren't. That typically could be an issue for regression-based analyses, but this is purely exploratory, so we're not too concerned with it. I'm going to pull Text Explorer up from a table script, just because we've already seen how it's launched. Note, though, that I've done some phrasing already, as we did before. I've also added some stop words, which you can't see here; these are words that I don't necessarily want returned by this analysis. And I've turned on stemming, which is what those little dots in the term list mean. For example, this one for "increase" is now collapsing across "increases," "increasing," "increasingly." That's because I now consider those all the same concept, and I just want to know whether that concept is associated with recessions. So to invoke term selection, we'll just go to the red triangle, and I'll select it here. We get a control window, or I should say section, popping up first, where we select which variable we care about; that's recessions. We select the target level, since I want this to be in terms of the probability of recession, as opposed to the probability of no recession. And I can specify some model settings.
If you're familiar with GenReg, you'll see that you can choose whether you want early stopping, which of two different penalizations to use to perform variable selection, and what statistic you want to use for validation. And if that stuff is new to you, don't worry about it too much; the default settings are a good way to go at first. We also have our weighting. If you remember, we had the 1s and 0s in that column for "weak," just saying whether the word occurred in a document or not, but you can select what you want. For example, frequency is not whether "weak" occurred, it's how many times it occurred. And this affects the way you would interpret your results. We're going to stick with binary for now. But I am going to say that I want to check out the 1,000 most frequent terms instead of the default 400, which, as you can see, is a lot more terms than our 436 observations. Normally you can't fit a model with 1,000 Xs but only 436 observations, but thanks to the automatic variable selection in Generalized Regression, this isn't a problem. Once again, it selects which of these thousand terms are most strongly related, hence the name term selection. So I'm going to run this. You can see what has happened: JMP has called out to the Generalized Regression platform and returned these results, where up here we see some information about the model as it was fit. For example, we have 37 terms that were returned. Let me just move that over, because over here on the right is where we find some really valuable information. This is the list of terms most strongly associated with recession. Now I'll sort these by the coefficient to find those most strongly associated with the probability of recession, so once again that's "pandemic," "postponement," "cancellations," and, as you might expect at this point, if I click on one of these, it'll update these previews, these text snippets down below, so we can actually see the word in context. This list of terms in itself could be incredibly valuable. You might learn some things about specific terms or concepts that are important that you might not have known before. You can also use these terms in predictive models. Now, a few other things to note. You can see down here we once again have a table by each individual document, but instead of sentiment scores, we now have scores from the model. We have for each one what we call the positive contribution, so this is the contribution of positive coefficients predicting the probability of recession. Here we have the ones on the negative end. And then we even have the probability of recession from the model, 71.8% here, and then what we actually observed. We're not necessarily building a predictive model here, that is, I'm not going to use this model to try to predict recessions; I have all kinds of economic indicator variables I would use too. But this is a good way to basically sanity check your model: does it look like it's getting some of its predictions right? Because if it's not, then you might not trust the terms that it returns. You also have plenty of other information to try to make that judgment. We have some fit statistics, like the area under the curve up here. Or we can even go into the Generalized Regression platform, with all of its normal options for assessing your model further there as well.
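To connect those positive and negative contributions to the predicted probability column just mentioned, the arithmetic is simply the logistic transform of the summed log-odds contributions. The numbers below are placeholders, not the model's actual intercept or contributions:

import math

def recession_probability(intercept, positive_contrib, negative_contrib):
    """Convert summed model contributions (on the log-odds scale) to a probability."""
    log_odds = intercept + positive_contrib + negative_contrib
    return 1.0 / (1.0 + math.exp(-log_odds))

# Placeholder values, only to show the shape of the calculation.
print(recession_probability(intercept=-2.0, positive_contrib=3.5, negative_contrib=-0.6))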
I'm not going to get into the details there, but all of that is available to you so that you can assess your model, tweak your model how you like, to make sure you have a list of terms that you feel good about. Now you see right here, we have this, under the summary, this list of models and that's because you might actually want to run multiple models. So if I go back to the model...oh, excuse me...if we go back to our settings up here, I could actually run a slightly different model. Maybe, for example, I know that the BIC is a little more conservative than the AICc and I want to return fewer terms, maybe did an analysis that returned 900 terms and you're a little overwhelmed. So I'll click run and build another model using that instead. And now we have that model added to our list here, and I can switch between these models to see the results for each one. In this case, we've returned only 14 terms, instead of 37 and I would go down to assess them below. So two big outputs you would get from this, of course, is this term list. If you want to save that out, these are just important terms to you and you want to keep track of them, just right click and make this into a data table. Now I have a JMP table with the terms, their coefficients and other information. And if what you want to do is actually kind of write this back to your original data table, maybe, so that you can use the information in some other kind of analysis or predictive model, just head up to term selection and say that you want to save the term scores as a document term matrix, which if I bring our data table back here, you can see I've now written to their columns for each of these terms that have been selected. In this case filled in with their coefficient values, and now I can use this quantitative information however I like. That's just a bit then about term selection. Again, the big idea here is I have found a list of terms that are related to a variable I care about and I even have, through their coefficients, information about how strong that relationship might be. So let's just wrap up then. We've covered two techniques. We just saw term selection, finding those important words. Before that we reviewed sentiment analysis, all about quantifying the degree of positive or negative emotion in a text. These are two new text mining techniques coming to JMP Pro 16's Text Explorer. We're really excited for you to get your hands on them and look forward to your feedback.
Labels
(11)
Labels:
Labels:
Advanced Statistical Modeling
Basic Data Analysis and Modeling
Consumer and Market Research
Data Blending and Cleanup
Data Exploration and Visualization
Design of Experiments
Mass Customization
Predictive Modeling and Machine Learning
Quality and Process Engineering
Reliability Analysis
Sharing and Communicating Results
0 attendees
0
0
0 attendees
0
0
Applications of Bayesian Methods Using JMP® (2021-EU-45MP-786)
Sunday, March 7, 2021
William Q. Meeker, Professor of Statistics and Distinguished Professor of Liberal Arts and Sciences, Iowa State University Peng Liu, JMP Principal Research Statistician Developer, SAS The development of theory and application of Monte Carlo Markov Chain methods, vast improvements in computational capabilities and emerging software alternatives have made it possible for the wide use of Bayesian methods in many areas of application. Motivated by applications in reliability, JMP now has powerful and easy-to-use capabilities for using Bayesian methods to fit different single distributions (e.g., normal, lognormal, Weibull, etc.) and linear regression. Bayesian methods, however, require the specification of a prior distribution. In many applications (e.g., reliability) useful prior information is typically available for only one parameter (e.g., imprecise knowledge about the activation energy in a temperature-accelerated life test or about the Weibull shape parameter in analysis of fatigue failure data). Then it becomes necessary to specify noninformative prior distributions for the other parameter(s). In this talk, we present several applications showing how to use JMP Bayesian capabilities to integrate engineering or scientific information about one parameter and to use a principled way to specify noninformative or weakly informative prior distributions for the other parameters. Auto-generated transcript... Speaker Transcript Our talk today shows how to use JMP to do Bayesian estimation. Here's an overview of my talk. I'm going to start with a brief introduction to Bayesian statistical methods. Then I'm going to go through four different examples that happen to come from reliability, but the methods we're presenting are really much more general and can be applied in other areas of application. Then I'm going to turn it over to Peng and he's going to show you how easy it is to actually do these things in JMP. Technically, reliability is a probability. The probability of a system, vehicle, machine or whatever it is that is of interest, will perform its intended function under encountered operating conditions for a specified period of time. I highlight encountered here to emphasize that reliability depends importantly on the environment in which a product is being used. Condra defined reliability as quality over time. And many engineers think of reliability is being failure avoidance, that is, to design and manufacture a product that will not fail. Reliability is a highly quantitative engineering discipline, but often requires sophisticated statistical and probabilistic ideas. Over the past 30 years, there's been a virtual revolution where Bayes methods are now commonly used and in many different areas of application. This revolution started by the rediscovery of Markov chain Monte Carlo methods and was accelerated by spectacular improvements in computing power that we have today, as well as the development of relatively easy to use software to implement Bayes methods. In the 1990s we had BUGS. Today Stan and other similar packages have largely replaced BUGS, but the other thing that's happening is we're beginning to see more Bayesian methods implemented in commercial software. So for example, SAS has PROC MCMC. And now JMP has some very powerful tools that were developed for reliability, but as I said, they can be applied in other areas as well, and there's strong motivation for the use of Bayesian methods. 
For one thing, it provides a means for combining prior information with limited data to be able to make useful inferences. Also, there are many situations, particularly with random effects complications like censor data, where maximum likelihood is difficult to implement, but where Bayes methods are relatively easy to implement. There's one little downside in the use of Bayes methods. You have to think a bit harder about certain things, particularly about parameterization and how to specify the prior distributions. My first example is about an aircraft engine bearing cage. These are field failure data where there was 1,703 aircraft engines that contained this bearing cage. The oldest ones had 2,220 hours of operation. The design life specification for this bearing cage was that no more than 10% of the units would fail by 8,000 hours of operation. However, 6 units had failed and this raised the question do we have a serious problem here? Do we need to redesign this bearing cage to meet that reliability condition? This is an event plot of the data. The event plot illustrates the structure of the data, and in particular, we can see the six failures here. In addition to that, we have right censored observations, indicated here by the arrows. So these are units that are still in service and they have not failed yet, and the right arrow indicates that all we know is if we wait long enough out to the right, the units will eventually fail. Here's a maximum likelihood analysis of those data, so the probability plot here suggests that the Weibull distribution provides a good description of these data. However, when we use the distribution profiler to estimate fraction failing at 8,000 hours, we can see that the confidence interval is enormous, ranging between about 3% all the way up to 100%. That's not very useful. So likelihood methods work like this. We specify the model and the data. That defines the likelihood and then we use the likelihood to make inferences. Bayesian methods are similar, except we also have prior information specified. Bayes theorem combines the likelihood with the prior information, providing a posterior distribution, and then we use the posterior distribution to make inferences. Here's the Bayes analysis of the bearing cage. The priors are specified here for the B10 or time at which 10% would fail. We have a very wide interval here. The range, effectively 1,000 hours up to 50,000 hours. Everybody would agree that B10 is somewhere in that range. For the Weibull shape parameter, however, we're going to use an informative prior distribution based upon the engineers' knowledge of the failure mechanism and their vast previous experience with that mechanism. They can say with little doubt that the Weibull shape parameter should be between 1.5 and 3, and here's where we specify that information. So instead of specifying information for the traditional Weibull parameters, we've reparameterized, where now the B10 is one of the parameters, and here's the specified range. And then we have the informative prior for the Weibull shape parameter specified here. And then JMP will generate samples from the joint posterior, leading to these parameter estimates and confidence intervals shown here. Here's a graphical depiction of the Bayes analysis. The black points here are a sample from the prior distribution, so again very wide for the .1 quantile and somewhat constrained for the Weibull shape parameter beta. 
On the right here, we have the joint posterior, which in effect is where the likelihood contours and the prior sample overlap. And then those draws from the joint posterior are used to compute estimates and Bayesian credible intervals. So here's the same profiler that we saw previously, where the confidence interval was not useful. After bringing in the information about the Weibull shape parameter, we can see that the confidence interval now ranges between 12% and 83%, clearly illustrating that we have missed the target of 10%. So what have we learned here? With a small number of failures, there's not much information about reliability. But engineers often have information that can be used, and by using that prior information, we can get improved precision and more useful inferences. And Bayesian methods provide a formal way of combining that prior information with our limited data. Here's another example. The rocket motor is one of five critical components in a missile. In this particular application, there were approximately 20,000 missiles in the inventory. Over time, 1,940 of these missiles had been fired and they all worked, except in three cases where there was catastrophic failure. These were older missiles, and so there was some concern that there might be a wearout failure mechanism that would put into jeopardy the roughly 20,000 missiles remaining in inventory. The failures were thought to be due to thermal cycling, but there was no information about the number of thermal cycles. We only have the age of the missile when it was fired. That's a useful surrogate, but the effect of using a surrogate like that is that you have more variability in your data. Now, in this case there were no directly observed failure times. When a rocket is called upon to operate and it operates successfully, all we know is that it had not failed at the age at which it was asked to operate. And for the units that failed catastrophically, we don't know the time at which they failed either; they had failed at some point before they were called upon to operate, so all we know is that the failure was before the age at which the missile was fired. So, as I said, there was concern that there is a wearout failure mechanism kicking in here that would put into jeopardy the amount of remaining life for the units in the stockpile. Here's the table of the data. Here we have the units that operated successfully, and so these are right-censored observations, but these observations here are the ones that failed and, as I said, at relatively higher ages. This is the event plot of the data, and again we can see the right-censored observations here with the arrow pointing to the right, and we can see the left-censored observations with the arrow pointing to the left, indicating the region of uncertainty. But even with those data we can still fit a Weibull distribution. And here's the probability plot showing the maximum likelihood estimate and confidence bands. Here's more information from the maximum likelihood analysis. And here we have the estimate of fraction failing at 20 years, which was the design life of these rocket motors, and again the interval is huge, ranging between 3% and 100%. Again, not very useful. But the engineers, knowing what the failure mechanism was, again had information about the Weibull shape parameter.
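For readers who want to see how the different kinds of censoring enter the likelihood that gets combined with the prior, here is a generic Python sketch. Exact failures contribute the density, right-censored units the survival probability, left-censored units (like the catastrophic rocket motor failures described above) the CDF, and interval-censored units (like the inspection data in the final example) a CDF difference. The Weibull parameterization and the toy ages are illustrative, not the actual data.

import math

def weibull_cdf(t, eta, beta):
    return 1.0 - math.exp(-((t / eta) ** beta))

def censored_loglik(eta, beta, exact=(), right=(), left=(), interval=()):
    """Log-likelihood assembled from the usual censored-data contributions."""
    ll = 0.0
    for t in exact:          # observed failure times: density contribution
        ll += math.log(beta / eta) + (beta - 1) * math.log(t / eta) - (t / eta) ** beta
    for t in right:          # still unfailed at age t: survival contribution
        ll += -((t / eta) ** beta)
    for t in left:           # known only to have failed before age t: CDF contribution
        ll += math.log(weibull_cdf(t, eta, beta))
    for lo, hi in interval:  # failed between two inspections: CDF difference
        ll += math.log(weibull_cdf(hi, eta, beta) - weibull_cdf(lo, eta, beta))
    return ll

# Rocket-motor-style toy ages (years): successful firings are right-censored,
# catastrophic failures are left-censored at the firing age.
print(censored_loglik(eta=30.0, beta=3.0,
                      right=[8.0, 12.0, 15.5, 17.0],
                      left=[18.5, 20.0, 21.0]))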
The maximum likelihood estimate was extremely large, and the engineers were pretty sure that was wrong, especially with the extra variability in the data, which would tend to drive the Weibull shape parameter to a lower value. As I showed you on the previous slide, the confidence interval for fraction failing at 20 years was huge. So once again, we're going to specify a prior distribution and then use that in a Bayes analysis. Again, the prior for B10, the time at which 10% will fail, is chosen to be extremely wide. We don't really want to assume anything there, and everybody would agree that that quantity is somewhere between five years and 400 years. But for the Weibull shape parameter, we're going to assume that it's between one and five. We know it's greater than one because it's a wearout failure mechanism, and the engineers were sure that it wasn't anything like the value of 8 that we had seen in the maximum likelihood estimate. And indeed, five is also a very large Weibull shape parameter. Once again, JMP is called upon to generate draws from the posterior distribution. And here are plots similar to the ones that we saw in the bearing cage example. The black points here, again, are a sample from the prior distribution, again very wide in terms of the B10 but somewhat constrained for beta, since beta has an informative prior distribution. And again, the contour plots represent the information in the limited data. Our posterior, once again, is where the prior and the likelihood overlap, and we can see it here. So once again we have a comparison between the maximum likelihood interval, which is extremely wide, and the interval that we get for the same quantity using the Bayes inference, which incorporated the prior information on the Weibull shape parameter. Now the interval ranges between .002 and about .1, that is, about 10% failing at the upper end, so that might be acceptable. Some of the things that we learned here: even though there were no actual failure times, we can still get reliability information from the data, but with very few failures there isn't much information there. We can use the engineers' knowledge about the Weibull shape parameter to supplement the data to get useful inferences, and JMP makes this really easy to do. My last two examples are about accelerated testing. Accelerated testing is a widely used technique to get information about the reliability of components quickly when designing a product. The basic idea is to test units at high levels of variables like temperature or voltage to make things fail quickly and then to use a model to extrapolate back down to the use conditions. Extrapolation is always dangerous and we have to keep that in mind. That's the reason we would like to have our model be physically motivated. So here's an example of an accelerated life test on a laser. Units were tested at 40, 60 and 80 degrees C, but the use condition was 10 degrees C. That's the nominal temperature at the bottom of the Atlantic Ocean, where these lasers were going to be used in a new telecommunications system. The test lasted 5,000 hours, a little bit more than six months. The engineers wanted to estimate the fraction failing at about 30,000 hours. That's about 3.5 years and, again, at 10 degrees C. Here are the results of the analysis. In order to appropriately test and build the model, JMP uses three different analyses. The first one fits separate distributions to each level of temperature.
The next model does the same thing, except that it constrains the shape parameter Sigma to be the same at every level of temperature. This is analogous to the constant-Sigma assumption that we typically make in regression analysis. And then finally, we fit the regression model, which in effect is a simple linear regression connecting lifetime to temperature. To supplement this visualization of the three models, JMP does likelihood ratio tests to check whether there's evidence that Sigma might depend on temperature and then whether there's evidence of lack of fit in the regression model. And from the large P values here, we can see that there's no evidence against this model. Another way to plot the results of fitting this model is to plot lifetime versus temperature on special scales: a logarithmic scale for hours of operation and what's known as an Arrhenius scale for temperature, corresponding to the Arrhenius model, which describes how temperature affects reaction rates, and thereby lifetime. And these are the results of the maximum likelihood estimation for our model. The JMP distribution profiler gives us an estimate of the fraction failing at 30,000 hours, and we can see it ranges between .002 and about .12, or 12% failing. The engineers in applications like this, however, often have information about what's known as the effective activation energy of the failure mechanism, and that corresponds to the slope of the regression line in the Arrhenius model. So we did a Bayes analysis, and in that analysis we made an assumption about the effective activation energy, and that's going to provide more precision for us. What we have here is a matrix scatterplot of the joint posterior distribution after having specified prior distributions for the parameters: weakly informative for the .1 quantile at 40 degrees C (again, everybody would agree that that number is somewhere between 100 and 32,000); also weakly informative for the lognormal shape parameter (again, everybody would agree that that number is somewhere between .05 and 20). But for the slope of the regression line, we have an informative prior ranging between .6 and .8, based upon previous experience with the failure mechanism. And that leads to this comparison, where now, on the right-hand side here, the interval for fraction failing at 30,000 hours is much narrower than it was with the maximum likelihood estimate. In particular, the upper bound now is only about 4%, compared with 12% for the maximum likelihood estimate. So, lessons learned. Accelerated tests provide reliability information quickly, and engineers often have information about the effective activation energy. That can be used either to improve precision or to reduce cost by not needing to test so many units. And once again, Bayesian methods provide an appropriate way to combine the engineers' knowledge with the limited data. My final example concerns an accelerated life test of an integrated circuit device. Units were tested at high temperature and the resulting data were interval censored; that's because failures were discovered only during inspections that were conducted periodically. In this test, however, there were failures only at the two highest levels of temperature. The goal of the test was to estimate the .01 quantile at 100 degrees C. This is a table of the data, where we can see the failures at 250 and 300 degrees C, and no failures, all right-censored observations, at the three lower levels of temperature.
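Since both accelerated-test examples lean on the Arrhenius relationship, here is a short sketch of the transformation it implies: log lifetime is modeled as a linear function of roughly 11605 divided by the absolute temperature, so the slope can be read as an effective activation energy in eV. The intercept and slope below are placeholders, not the laser or IC device estimates.

import math

INV_BOLTZMANN_EV = 11605.0   # approximately 1 / Boltzmann's constant, in kelvin per eV

def arrhenius_x(temp_c):
    """Arrhenius regressor commonly used in temperature-accelerated life test models."""
    return INV_BOLTZMANN_EV / (temp_c + 273.15)

# Location of log lifetime modeled as linear in the Arrhenius regressor:
#   mu(T) = b0 + b1 * arrhenius_x(T), with b1 the effective activation energy (eV).
b0, b1 = -14.7, 0.7          # placeholder intercept and activation energy

for temp_c in (10, 40, 60, 80):
    mu = b0 + b1 * arrhenius_x(temp_c)
    print(f"{temp_c:3d} C: median lifetime ~ {math.exp(mu):,.0f} hours")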
Now, when we did the maximum likelihood estimation in this case, we saw strong evidence that the Weibull shape parameter depended on temperature; the p-value is about .03. That turns out to be evidence against the Arrhenius model, because the Arrhenius model should only scale time, and if increasing temperature changes the shape parameter, you're doing more than scaling time. So that's a problem, and it suggested that at 300 degrees C there was a different failure mechanism. And indeed, when the engineers followed up and determined the cause of failure of the units at 250 and 300, they saw that there was a different mechanism at 300. What that meant is that we had to throw those data away. So what do we do then? Now we only have failures at 250 degrees C, and JMP doesn't do very well with that. Surprisingly, it actually runs and gives answers, but the confidence intervals are enormously wide, as one would expect. But the engineers knew what the failure mechanism was, they had previous experience with it, and so they could bring that information about the slope into the analysis using Bayesian methods. So again, here's the joint posterior, and the width of the posterior distribution for beta 1 is effectively what we assumed when we put in a prior distribution for that parameter. Here's the specification of the prior distributions, where we used weakly informative priors for the quantile and for sigma, but an informative prior distribution for the slope beta 1. And I can get an estimate of the time at which 1% fail. The lower endpoint of the confidence interval for the time at which 1% will fail is more than 140,000 hours. That's about 20 years, much longer than the technological life of the products in which this integrated circuit will be used. So what did we learn here? Well, in some applications we have interval censoring, because failures are discovered only when there's an inspection. We need appropriate statistical methods for handling such data, and JMP has those methods. If you use excessive levels of an accelerating variable like temperature, you can generate new failure modes that make the information misleading, so we had to throw those units away. But even with failures at only one level of temperature, if we have prior information about the effective activation energy, we can combine that information with the limited data to make useful inferences. Finally, some concluding remarks. Improvements in computing hardware and software have greatly advanced our ability to analyze reliability and other data. Now we can also use Bayesian methods, providing another set of tools for combining information with limited data, and JMP has powerful tools for doing this. Although these Bayesian capabilities were developed for the reliability part of JMP, they can certainly be used in other areas of application. And here are some references, including the second edition of Statistical Methods for Reliability Data, which should be out in June of 2021. OK, so now I'm going to turn it over to Peng, and he's going to show you how easy it is to do these analyses. Thank you, Professor. Before I start my demonstration, I would like to show this slide about the Bayesian analysis workflow in Life Distribution and Fit Life by X. First, you need to fit a parametric model using maximum likelihood; I assume you already know how to do this in these two platforms.
Then you need to find the model specification graphical user interface for Bayesian estimation within the report from the previous step. For example, this is a screenshot of a Weibull model in Life Distribution: you go to the red triangle menu and choose Bayesian Estimates to reveal the interface for the Bayesian analysis. In Fit Life by X, please see the screenshot of a lognormal result; the interface for Bayesian inference is in the last step. After finding that interface, you need to supply the information about the priors. You decide the prior distributions for the individual parameters, supply the hyperparameters, and give additional information such as the probability for the quantile. In addition, you provide the number of posterior samples, and also a random seed in case you want to replicate your results in the future. Then you can click Fit Model, and this generates a report for the model. You can fit multiple models if you want to study the sensitivity of the Bayesian results to different prior distributions. The results for a Bayesian model include the following: first, the method of sampling; then a copy of the priors; then the posterior estimates of the parameters; then some scatterplots, for either the prior or the posterior samples; and at the end, two profilers, one for the distribution and one for the quantile. Using these results, you can make further inferences such as failure prediction. Now let's look at the demonstration. We will demonstrate with the last example the professor mentioned in his presentation, the IC device. We have two columns for the time to event, HoursL and HoursU, to represent the interval censoring, a count for each observation, and the temperature for each observation. We exclude the last four observations because they are associated with a different failure mode and we want to keep them out of the analysis. Now we start to specify the model. We put the hours columns into Y, Count into Freq, and Degrees C into X. We use Arrhenius Celsius as our relationship and lognormal as our distribution, then click OK. The result is the maximum likelihood inference for the lognormal model. We go to Bayesian Estimates and start to specify our priors, as the professor did in his presentation. We choose a lognormal prior for the quantile at 250 degrees C; it's the B10 life, so the probability is 0.1, and the two ends of the lognormal distribution are 100 and 10,000. Now we specify the slope: the distribution is lognormal, and the two ends of the distribution are .65 and .85, because an informative prior requires the range to be narrow. Then we specify the prior distribution for sigma, which is a lognormal with a wide range, .05 to 5. We decide to draw 5,000 posterior samples and assign an arbitrary random seed, and then we click Fit Model. Here is the report generated for this specification. The method is simple rejection, and here's a copy of our prior specification. The posterior estimates summarize our posterior samples. You can export the posterior samples either by clicking this Export Monte Carlo Samples button or by choosing Export Monte Carlo Samples from the menu. The posterior samples are illustrated in these scatterplots; we have two scatterplots here.
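A minimal JSL sketch of the launch just described, assuming a table with columns HoursL, HoursU, Count, and DegreesC like the IC device example. The file path and column names are placeholders, and the Relationship and Distribution option names are assumptions based on the general pattern shown in the Scripting Index, so verify them against your JMP version before use.

// Open the interval-censored ALT data (path is hypothetical).
dt = Open( "IC Device ALT.jmp" );

// Launch Fit Life by X: two Y columns give the interval-censored times,
// Count is a frequency, and DegreesC is the accelerating variable.
flbx = dt << Fit Life by X(
	Y( :HoursL, :HoursU ),               // lower and upper ends of each inspection interval
	X( :DegreesC ),                      // accelerating variable
	Freq( :Count ),                      // number of units per row
	Relationship( "Arrhenius Celsius" ), // assumed option name
	Distribution( "Lognormal" )          // assumed option name
);
// Bayesian Estimates is then reached from the platform's red triangle menu.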
The first uses the same parameterization as the prior specification, that is, the quantile, the slope, and sigma. The second scatterplot uses the traditional parameterization, which includes the intercept of the regression, the slope of the regression, and sigma. Finally, to make inferences, we can look at the profilers. Here let's look at the second profiler, the quantile profiler, so we can reproduce the result the professor showed in one of the previous slides. We enter 0.01 for the probability, so this is 1%, we enter 100 degrees C for Degrees C, and we adjust the axes. Now we see a profiler similar to the one in the previous slide, and we can read off the Y axis to get the result we want, which is the time by which 1% of the devices will fail at 100 degrees C. So this concludes that part of my demonstration, and let me move on. This slide explains JMP's implementation of the sampling algorithm. We saw that simple rejection showed up in the previous example, and that is the first stage of our implementation. The simple rejection algorithm is a tried-and-true method for drawing samples, but it can be impractical if the rejection rate is high. If the rejection rate is high, our implementation switches to the second stage, which is a random walk Metropolis-Hastings algorithm. The second algorithm is efficient, but in some situations it can fail undetectably if the likelihood is irregular, for example when the likelihood is rather flat. We designed the implementation this way because we target situations with very few failures or even no failures. In that situation the likelihood is relatively flat, but the rejection rate for simple rejection is not that bad, and that method will suffice. As we have more and more failures, the likelihood becomes more regular, with a peak in the middle. In that situation the simple rejection method becomes impractical because of the high rejection rate, but the random walk algorithm becomes more and more likely to succeed without failure. So that's our implementation and the explanation of why we built it that way. This slide explains how we specify a truncated normal prior in these two platforms, because the truncated normal is not a built-in prior distribution there. First, look at what a truncated normal is. Here we give an example of a truncated normal with its two ends at 5 and 400, illustrated by this L and this R. A truncated normal is nothing but a normal distribution with all the negative values discarded; the curve shown here is the equivalent normal distribution for this particular truncated normal. If we want to specify this truncated normal, we need to define its equivalent normal distribution, the one with the same mu and sigma parameters as the truncated normal, in terms of two end values. So we provide a script to do this. The calculation in the script is the following: we find the mu and sigma of the truncated normal, which gives us the equivalent normal distribution, and from the mu and sigma of that normal distribution we can find its two end values. To specify the truncated normal with its two end values, we specify the equivalent normal distribution with these two end values. This is how we specify a truncated normal in these two platforms using an equivalent normal distribution. And here's the content of the script.
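The talk does not spell out the conversion formulas, but under one common convention, where the two end values L and R are read as symmetric tail quantiles (at levels alpha and 1-alpha) of the untruncated normal, the equivalent normal parameters would be:

\mu = \frac{L + R}{2},
\qquad
\sigma = \frac{R - L}{2\, z_{1-\alpha}},
\qquad
z_{1-\alpha} = \Phi^{-1}(1-\alpha).

The script shipped with the talk implements whatever convention the platforms expect; treat the formulas above only as a way to see why two end values are enough to pin down mu and sigma.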
All you need to do is call a function from that script (its name refers to converting a truncated normal to a normal). You provide the two ends of the truncated normal distribution, and it gives back the two ends of the equivalent normal distribution; you can then use those two numbers to specify the prior distribution. So this concludes my demonstration. In it I showed how to start a Bayesian analysis in Life Distribution and Fit Life by X, how to enter prior information, and what the Bayesian results contain. I also explained our implementation of the sampling and why we built it that way. And at the end, I explained how we specify a truncated normal prior using an equivalent normal prior in these two platforms. Thank you.
Labels (10)
Advanced Statistical Modeling
Basic Data Analysis and Modeling
Consumer and Market Research
Data Blending and Cleanup
Data Exploration and Visualization
Mass Customization
Predictive Modeling and Machine Learning
Quality and Process Engineering
Reliability Analysis
Sharing and Communicating Results
Candidate Set Designs: Tailoring DOE Constraints to the Problem (2021-EU-30MP-784)
Sunday, March 7, 2021
Christopher Gotwalt, JMP Director of Statistical R&D, SAS There are often constraints among the factors in experiments that are important not to violate, but are difficult to describe in mathematical form. These constraints can be important for many reasons. If you are baking bread, there are combinations of time and temperature that you know will lead to inedible chunks of carbon. Another situation is when there are factor combinations that are physically impossible, like attaining high pressure at low temperature. In this presentation, we illustrate a simple workflow of creating a simulated dataset of candidate factor values. From there, we use the interactive tools in JMP's data visualization platforms in combination with AutoRecalc to identify a physically realizable set of potential factor combinations that is supplied to the new Candidate Set Design capability in JMP 16. This then identifies the optimal subset of these filtered factor settings to run in the experiment. We also illustrate the Candidate Set Designer's use on historical process data, achieving designs that maximize information content while respecting the internal correlation structure of the process variables. Our approach is simple and easy to teach. It makes setting up experiments with constraints much more accessible to practitioners with any amount of DOE experience. Auto-generated transcript... Transcript Hello, Chris Gotwalt here. Today we're going to be constructing the history of graphic paradoxes and... oh wait, wrong topic. Actually, we're going to be talking about candidate set designs: tailoring DOE constraints to the problem. Industrial experimentation for product and process improvement has a long history with many threads, of which I admit I only know a tiny sliver. The idea of using observation for product and process innovation is as old as humanity itself. It received renewed focus during the Renaissance and the Scientific Revolution. During the subsequent Industrial Revolution, science and industry began to operate more and more in lockstep. In the early 20th century, Edison's lab was industrial innovation on a factory scale, but it was done, to my knowledge, outside of modern experimental traditions. Not long after R.A. Fisher introduced concepts like blocking and randomization, his associate and later son-in-law, George Box, developed what is now probably the dominant paradigm in design of experiments, with the most popular book being Statistics for Experimenters by Box, Hunter and Hunter. The methods described in Box, Hunter and Hunter are what I call the taxonomical approach to design. Suppose you have a product or process you want to improve. You think through the things you can change, the knobs you can turn, like temperature, pressure, time, the ingredients you can use, or the processing methods you can use. These become your factors. Then you think about whether they are continuous or nominal, and if they are nominal, how many levels they take, or the range you're willing to vary them over. If a factor is continuous, then you figure out the name of the design that most easily matches up to the problem and resources, the design that fits your budget. That design will have a name like a Box-Behnken design, a fractional factorial, a central composite design, or possibly something like a Taguchi array.
There will be restrictions on the number of runs, the number of levels of categorical factors, and so on, so there will be some shoehorning of the problem at hand into the design that you can find. For example, factors in the Box, Hunter and Hunter approach often need to be whittled down to two or three unique values or levels. Despite its limitations, the taxonomical approach has been fantastically successful. Over time, of course, some people have asked if we could still do better. And by better, we mean to ask ourselves: how do we design our study to obtain the highest quality information pertinent to the goals of the improvement project? This line of questioning led ultimately to optimal design. Optimal design is an academic research area. It was started in parallel with the Box school in the '50s and '60s, but for various reasons remained out of the mainstream of industrial experimentation until the custom designer in JMP. The philosophy of the custom designer is that you describe the problem to the software, and it then returns the best design for your budgeted number of runs. You start out by declaring your responses along with their goals, like minimize, maximize, or match target, and then you describe the kinds of factors you have: continuous, categorical, mixture, etc. Categorical factors can have any number of levels. You give it a model that you want to fit to the resulting data. The model assumes a least squares analysis and consists of main effects, interactions, and polynomial terms. The custom designer makes some default assumptions about the nature of your goal, such as whether you're interested in screening or prediction, which is reflected in the optimality criterion that is used. The defaults can be overridden with a red triangle menu option if you want to do something different from what the software intends. The workflow in most applications is to set up the model, choose your budget, and click Make Design. Once that happens, JMP uses a mixed continuous and categorical optimization algorithm, solving for the number of factors times the number of rows terms. Then you get your design data table with everything you need except the response data. This is a great workflow when the factors can be varied independently from one another. But what if you can't? What if there are constraints? What if the value of some factors determines the possible ranges of other factors? Well, then you can define some factor constraints or use the disallowed combinations filter. Unfortunately, while these are powerful tools for constraining experimental regions, it can still be very difficult to characterize constraints using them. Brad Jones' DOE team (Ryan Lekivetz, Joseph Morgan and Caleb King) has added an extraordinarily useful new feature that makes handling constraints vastly easier in JMP 16. These are called candidate or covariate runs. What you can do is, off on your own, create a table of all possible combinations of factor settings that you want the custom designer to consider. Then you load them up here, and those will be the only combinations of factor settings that the designer will look at. The original table, which I call a candidate table, is like a menu of factor settings for the custom designer. This gives JMP users an incredible level of control over their designs.
What I'm going to do today is go over several examples to show how you can use this to make the custom designer fulfill its potential as a tool that tailors the design to the problem at hand. Before I do that, I'm going to get off topic for a moment and point out that in the JMP Pro version of the custom designer, there's now a capability that allows you to declare limits of detection at design time. If you enter non-missing values for the limits here, the custom designer will add a column property that informs the generalized regression platform of the detection limits, and it will then automatically get the analysis correct. This leads to dramatically higher power to detect effects and much lower bias in predictions, but that's a topic for another talk. Here are a number of applications that I can think of for the candidate set designer. The simplest is when the range of a continuous factor depends on the level of one or more categorical factors. Another is when we can't control the ranges of factors completely independently, but the constraints are hard to write down. There are two methods we can use for this. One is using historical process data as a candidate set, and the other is what I call filter designs, where you create a giant initial dataset using random numbers or a space-filling design and then use row selections in scatterplots to pick off the points that don't satisfy the constraints. There's also the ability to highly customize mixture problems, especially situations where you have multilayer mixturing. That isn't something I'll be able to talk about today, but in the future it is something you should be able to do with the candidate set designer. You can also handle nonlinear constraints with the filtering method, the same way you handle other kinds of constraints; it's very simple, and I'll have a quick example at the very end illustrating this. So let's consider our first example. Suppose you want to match a target response in an investigation of two factors. One is an equipment supplier, of which there are two levels, and the other is the temperature of the device. The two suppliers have different ranges of operating temperatures. Supplier A's is the narrower of the two, going from 150 to 170 degrees Celsius, but it's controllable to a finer level of resolution, about 5 degrees. Supplier B has a wider operating range, going from 140 to 180 degrees Celsius, but is only controllable to 10 degrees Celsius. Suppose we want to do a 12-run design to find the optimal combination of these two factors. We enumerate all possible combinations of the two factors, 10 runs, in the table here, just creating it manually ourselves: here are the five possible values of machine type A's temperature settings, and down here are the five possible values of type B's temperature settings. We want the best design in 12 runs, which exceeds the number of rows in the candidate table. This isn't a problem in theory, but I recommend creating a copy of the candidate set just in case, so that the number of rows in your candidate table exceeds the number of runs you're looking for in the design. Then we go to the custom designer, push the Select Covariate Factors button, and select the columns that we want loaded as candidate design factors. Now the candidate design is loaded and shown. Let's add the interaction effect, as well as the quadratic effect of temperature.
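As an illustration only (the speaker builds the table by hand in the demo), here is a small JSL sketch that enumerates the ten candidate rows for this first example. The column names Supplier and Temperature are my own choices, not taken from the talk's data table.

// Build a 10-row candidate table: Supplier A runs 150-170 C in 5-degree steps,
// Supplier B runs 140-180 C in 10-degree steps.
dt = New Table( "Candidate Set",
	New Column( "Supplier", Character, "Nominal" ),
	New Column( "Temperature", Numeric, "Continuous" )
);
For( t = 150, t <= 170, t += 5,
	dt << Add Rows( 1 );
	dt:Supplier[N Rows( dt )] = "A";
	dt:Temperature[N Rows( dt )] = t;
);
For( t = 140, t <= 180, t += 10,
	dt << Add Rows( 1 );
	dt:Supplier[N Rows( dt )] = "B";
	dt:Temperature[N Rows( dt )] = t;
);
// This table is then loaded into the custom designer via Select Covariate Factors.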
Now we're at the final step before creating the design. I want to explain the two options you see in the Design Generation outline node. The first one forces in all the rows that are selected in the original table, or in the listing of the candidates in the custom designer. So if you have checkpoints that are unlikely to be favored by the optimality criterion and you want to force them into the design, you can use this option. It's a little like taking those same rows and creating an augmented design based on just them, except that you are controlling the possible combinations of the factors in the additional rows. The second option, which I'm checking here on purpose, allows the candidate rows to be chosen more than once. This gives you optimally chosen replications and is probably a good idea if you're about to run a physical experiment. If, on the other hand, you are using an optimal subset of rows to try in a fancy new machine learning algorithm like SVEM, the topic of one of my other talks at the March Discovery conference, you would not want to check this option. Basically, if you don't have all of your response values already, I would check this box, and if you already have the response values, don't. Reset the sample size to 12 and click Make Design. The candidate design, in all its glory, will appear just like any other design made by the custom designer. As we see in the middle JMP window, JMP also selects the rows in the original table chosen by the candidate design algorithm. Note that 10, not 12, rows were selected. On the right we see the new design table; the rightmost column in the table indicates the row of origin for each run. Notice that original rows 11 and 15 were chosen twice and are replicates. Here is a histogram view of the design. You can see that different values of temperature were chosen by the candidate set algorithm for the different machine types. Overall, this design is nicely balanced, but we don't have three levels of temperature for machine type A. Fortunately, we can select the rows we want forced into the design to ensure that we have three levels of temperature for both machine types. Just select the rows you want forced into the design in the covariate table and check the option to include all selected covariate rows in the design. If you go through all of that, you will see that now both levels of machine have at least three levels of temperature in the design. So the first design we created is on the left, and the new design, forcing there to be three levels of machine type A's temperature settings, is over here on the right. My second example is based on a real dataset from a metallurgical manufacturing process. The company wants to control the amount of shrinkage during the sintering step. They have a lot of historical data and have applied machine learning models to predict shrinkage, so they have some idea what the key factors are. However, to actually optimize the process, you should really do a designed experiment. As Laura Castro-Schilo once told me, causality is a property not of the data, but of the data-generating mechanism, and as George Box says on the inside cover of Statistics for Experimenters, to find out what happens when you change something, it is necessary to change it. Now, although we can't use the historical data to prove causality, it contains essential information about what combinations of factors are possible, and we can use that in the design.
We first have to separate the columns in the table that represent controllable factors from the ones that are more passive sensor measurements or derived quantities that cannot be controlled directly. A glance at the scatterplot of the potential continuous factors indicates that there are implicit constraints that would be difficult to characterize as linear constraints or disallowed combinations. However, these rows represent a sample of the possible combinations, and that can be used with the candidate designer quite easily. To do this, we bring up the custom designer, set up the response, and load up some covariate factors: select the columns that we can control as DOE factors and click OK. Now we've got them loaded. Let's set up a quadratic response surface model as our base model, then select all of the model terms except the intercept, do a control + right-click, and convert all those terms into "if possible" effects. This, in combination with the response surface model, means that we will be creating a Bayesian I-optimal candidate set design. Check the box that allows for optimally chosen replicates, enter the sample size, and it then creates the design for us. If we look at the distribution of the factors, we see that it has tried hard to pursue greater balance. On the left we have a scatterplot matrix of the continuous factors from the original data, and on the right is the hundred-row design. We can see that in the sintering temperature we have some potential outliers at 1220. One would want to make sure that those are real values. In general, you're going to need to make sure that the input candidate set is clear of outliers and missing values before using it for a candidate set design. In my talk with Ron Kenett at the March 2021 Discovery conference, I briefly demo how you can use the outlier and missing value screening platforms to remove the outliers and replace the missing values so that you can use the data at a subsequent stage like this. Now suppose we have a problem similar to the first example, where there are two machine types, but now we have temperature and pressure as factors, and we know that temperature and pressure cannot vary independently and that the nature of that dependence changes between machine types. We can create an initial space-filling design and use the data filter to remove the infeasible combinations of factor settings separately for each machine type. Then we can use the candidate set designer to find the most efficient design for this situation. So now I've created my space-filling design. It's got 1,000 runs, and I can bring up the global data filter on it and use it to shave off different combinations of temperature and pressure so that we can have separate constraints by machine type. I use the lasso tool to cut off a corner for machine B, and I go back and cut off another corner for machine B, so machine B is the machine that has the wider operating region in temperature and pressure. Then we switch over to machine A, and we just use the lasso tool to shave off the points that are outside its operating region. We see that its operating region is a lot narrower than machine B's. And here's our combined candidate set. From there we can load that back up into the custom designer, put an RSM model there, set our number of runs to 32, allow covariate rows to be repeated, and it'll crank through.
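Here is a rough JSL sketch of the filtering idea described above, assuming a space-filling candidate table with columns Machine, Temperature, and Pressure (names are mine, not from the talk). The constraint expressions and numeric limits are placeholders for whatever region is infeasible in your own problem; the demo itself does this interactively with the data filter and lasso tool rather than in script.

// Start from a large space-filling candidate table (assumed already open).
dt = Current Data Table();

// Select candidate rows that violate machine-specific constraints.
// The limits below are placeholders, not values from the talk.
dt << Select Where(
	(:Machine == "A" & (:Temperature > 170 | :Pressure > 90)) |
	(:Machine == "B" & :Temperature + 0.5 * :Pressure > 250)
);

// Remove the infeasible rows; what remains becomes the candidate set
// that gets loaded into the custom designer as covariate factors.
sel = dt << Get Selected Rows;
dt << Delete Rows( sel );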
Once it's done that, it selects all the points that were chosen by the candidate set designer. Here we can see the points that were chosen: they've been highlighted, and the original candidate points that were not selected are gray. We can bring up the new design in Fit Y by X, and we see a scatterplot where the machine A design points are in red. They're in the interior of the space, and the type B runs are in blue; machine B had the wider operating region, and that's why we see those points further out. So we have quickly achieved a design with linear constraints that change with a categorical factor, without going through the annoying process of deriving the linear combination coefficients. We've simply used basic JMP 101 visualization and filtering tools. This idea generalizes to other nonlinear constraints and other complex situations fairly easily. So now we're going to use filtering and the multivariate platform to set up a very unusual new type of design that I assure you you have never seen before. Go to the lasso tool. We're going to cut out a very unusual constraint, invert the selection, and delete those rows. Then, and we can speed this up a little bit, we go through and do the same thing for other combinations of X1 and the other variables, carving out a very unusually shaped candidate set. We can load this up into the custom designer, same as before: bring our columns in as covariates and set up a design with all high-order interactions made "if possible," with a hundred runs. And now we see our design for this very unusual constrained region, a design that is optimal given these constraints. So I'll leave you with this image. I'm very excited to hear what you are able to do with the new candidate set designer. Hats off to the DOE team for adding this surprisingly useful and flexible new feature. Thank you.
Labels (11)
Advanced Statistical Modeling
Automation and Scripting
Basic Data Analysis and Modeling
Consumer and Market Research
Content Organization
Data Blending and Cleanup
Data Exploration and Visualization
Design of Experiments
Mass Customization
Quality and Process Engineering
Sharing and Communicating Results
Extending Hamcrest Automated Testing of JSL Applications for Continuous Improvement (2021-EU-45MP-783)
Sunday, March 7, 2021
Vince Faller, Chief Software Engineer, Predictum Wayne Levin, President, Predictum This session will be of interest to users who work with JMP Scripting Language (JSL). Software engineers at Predictum use a continuous integration/continuous delivery (CI/CD) pipeline to manage their workflow in developing analytical applications that use JSL. The CI/CD pipeline extends the use of Hamcrest to perform hundreds of automated tests concurrently on multiple levels, which factor in different types of operating systems, software versions and other interoperability requirements. In this presentation, Vince will demonstrate the key components of Predictum's DevOps environment and how they extend Hamcrest's automated testing capabilities for continuous improvement in developing robust, reliable and sustainable applications that use JSL: Visual Studio Code with JSL extension – a single code editor to edit and run JSL commands and scripts in addition to other programming languages. GitLab – a management hub for code repositories, project management, and automation for testing and deployment. Continuous integration/continuous delivery (CI/CD) pipeline – a workflow for managing hundreds of automated tests using Hamcrest that are conducted on multiple operating systems, software versions and other interoperability requirements. Predictum System Framework (PSF) 2.0 – our library of functions used by all client projects, including custom platforms, integration with GitLab and the CI/CD pipeline, helper functions, and JSL workarounds. Auto-generated transcript... Speaker Transcript Wayne Levin Welcome to our session on extending Hamcrest automated testing of JSL applications for continuous improvement. What we're going to show you here, our promise to you, is how you too can build productive, cost-effective, high-quality-assurance, highly reliable and supportable JMP-based mission-critical integrated analytical systems. That's a lot to say, but that's what we're doing in this environment. We're quite pleased with it, and we're really honored to be able to share it with you. So here's the agenda we'll follow. A little introduction of myself, which I'll do in a moment, and a little bit about Predictum, because you may not know too much about us: our background, the background of our JSL development infrastructure, and a little bit of history involved with that. Then the results of the changes that we've been putting in place, which we're here to share with you. Then we're going to do a demonstration and talk about what's next, what we have planned going forward, and finally we'll open it up for any questions that you may have. So I'm Wayne Levin; that's me over here on the right. I'm the president of Predictum, and I'm joined by Vince Faller. Vince is our chief software engineer, who's been leading this very important initiative. So just a little bit about us. We're a JMP partner. We launched in 1992, so we're 29 years old. We do training in statistical methods using JMP, consulting in those areas, and we spend an awful lot of time building and deploying integrated analytical applications and systems, hence why this effort was very important to us. We first delivered a JMP application with JMP 4.0 in the year 2000, indeed over 20 years ago, and we've been building larger systems since. Back then it was mostly small tools, but we started, I think, around JMP 8 or 9, building larger systems.
So we've got quite a bit of history on this, over 10 years easily. Just a little bit of background: until about the second half of 2019, our development environment was really disparate and piecemeal. Project management was there, but everything was kind of broken up. We had different applications for version control, for managing developer time, and for project management generally. Developers were easily spending, and we'll talk about this, about half their time just doing routine mechanical things, like encrypting and packaging JMP add-ins, maintaining configuration packages, and separating the repositories, or what we generally call repos, for encrypted and unencrypted script. There was a lot we had to think about that wasn't really development work; it was work that developer talent was wasted on. We also had, like I said, over 10 years of legacy framework going all the way back to JMP 5, and it was getting bloated and slow. And we know JMP has come a long way over the years: in JMP 9 we got namespaces, JMP 14 introduced classes, and that's when Hamcrest began. It was Hamcrest that really allowed us to go forward with this major initiative. So we began this major initiative back in August of 2019. That's when we acquired our first GitLab licenses, and that's when our new development architecture started to take shape, and it's been improving ever since. Every month, basically, we've been adding and building on our capabilities to become more and more productive, and that's continuing. We actually consider this a Lean type of effort; it really does follow Lean principles, and it's accelerated our development. We have automated testing thanks to this system, and Vince is going to show us that. And we have this little motto here, test early and test often, and that's what we do. It supports reusing code, and we've redeveloped our Predictum System Framework; it's now 2.0. We learned a lot from our earlier effort: pretty much all of it is gone, replaced and expanded, and Vince will tell us more about that. Easily we have over a 50% increase in productivity, and I'm just going to say the developers are much happier. They're less frustrated and more focused on their work, I mean the real work that developers should be doing, not the tedious stuff. There's still room for improvement, so we're not done, and Vince will tell us more about that. We have development standards now: we have style guides for functions, and all of our development is functionally based, you might say. Each function requires at least one Hamcrest test, and there are code reviews that the developers share with one another to ensure that we're following our standards, which raises questions about how to enhance those standards and make them better. We also have these fun sessions where developers are encouraged to break code, these break-code challenges, or what have you. So it's become part of our modus operandi, and it all fits right in with this development environment. It leads, for example, to further Hamcrest tests being added.
We have one fairly small project that we did just over a year ago. We're going into a new phase of it, and it has well over 100 Hamcrest tests built into it, and they get run over and over again through the development process. Some other benefits: it allows us to assign and track our resource allocation, like which developers are doing what, and everyone knows what everyone else is doing. With continuous integration and continuous deployment, code collisions are detected early. So if we have, and we do have, multiple people working on some projects, and somebody's changing a function over here that's going to collide with something someone else is doing, we're going to find out much sooner. It also allows us to improve supportability across multiple staff. We can't have code dependent on a particular developer; we have to have code that any developer or support staff can support going forward. That was an important objective of ours as well. And it advances the whole quality assurance area generally, including supporting FDA requirements concerning QA, things like validation and IQ, OQ, PQ. We're automating or semi-automating those tasks as well through this infrastructure. We use it internally and externally. You may know we have some products out there [product names unclear in the recording] that are also talked about elsewhere at the JMP Discovery European Conference in 2021; you might want to go check them out. They're fairly large code bases, and they're all developed this way; in other words, we eat our own dog food, if you know that expression. But we also use it for all of our client development, because we're building applications that our clients are going to depend on, and so we need the infrastructure that allows us to be dependable; that's a big part of this. I mentioned the Predictum System Framework. You can see some snippets of it here, right within the scripting index, with the arguments, the examples, and all that. We built all of that in, and over 95% of the functions have Hamcrest tests associated with them. Of course, our goal is for all of them to have tests, and we're getting there. This framework is one of the important elements of our infrastructure. Another is Hamcrest itself, the ability to do unit testing; there's a slide at the end with a link into the Community where you can learn more about Hamcrest. This is a development that was brought to us by JMP back in JMP 14, as I mentioned a few minutes ago. GitLab is a big part of this; it gives us project management, the repository, the CI/CD pipeline, and so on. And there's also a Visual Studio Code extension for JSL that we created; you see five stars there because it was given five stars on the Visual Studio marketplace. I'm not sure what to call that; Vince, maybe you can tell us, the store, what have you. It's been downloaded hundreds of times, and we've been updating it regularly. So that's something you can go and look for as well; I think we have a link for that in the resource slide at the end. So what I'm going to do now is pass this over to Vince Faller.
Vince is, again, our chief software engineer. Vince led this initiative, starting in August 2019, as I said. It was a lot of hard work, and the hard work continues. We're all, in the company, very grateful for Vince and his leadership here. So with that said, Vince, why don't you take it from here? Vince Faller Sharing. So Wayne said Hamcrest a bunch of times. For people who don't know what Hamcrest is: it's an add-in created by JMP, with Justin Chilton and Evan McCorkle leading it. It's a unit testing library that lets you write tests, run them, and get the results in an automated way. It really started the ball rolling on us being able to do this at all, hence why the talk is called "extending" Hamcrest. I'm going to be showing some stuff on my screen. I work pretty much exclusively in the VSCode extension that we built. This is VSCode. We do this because it has a lot of built-in or extendable functionality that we don't have to write, like Git integration and GitLab integration. Here you can see this is a JSL script, and it reads it just fine. If you want to get it, and you're familiar with VSCode (it's just a lightweight text editor), you just type in JMP in the extension marketplace and you'll see it; it's the only one. But let's move on to what we're doing. For any code change we make, there is a pipeline run. We'll just show what it does. So I change the README file to "this is a demo for Discovery 2021," and I'm just going to commit that. If you don't know Git, committing is just saying I want a snapshot of exactly where we are at the moment, and then you push it to the repo and it's saved on the server. Happy day. Commit message: more readme info. And I can just do git push, because VSCode is awesome. Pipeline demo. So now I've pushed it, there's going to be a pipeline running, and I can just go down here, click this, and it will give me my merge request. So now the pipeline has started running, and I can check its status. What it's doing right now is going through and checking that it has the required Hamcrest files; we have some requirements that we enforce so that we can make sure we're doing our jobs well. And then it's done. I'm going to press encrypt. Encrypt is going to take the whole package and encrypt it. If we go over here, this is just a VM somewhere; it should start running in a second. So it's just going through all the code, writing all the encryption passwords, going through, clicking all that stuff. If you've ever tried to encrypt multiple scripts at the same time, you'll probably know that's a pain, so we automated it so that we don't have to do it by hand, because, as Wayne said, it was taking a lot of our time. If we have 100 scripts to go through and encrypt every single one of them every time we want to do a release, it's awful. And we have to have our code encrypted. All right, I can stop sharing that. So that's going to run; it should finish pretty soon. Then it will go through and stage it, and the staging basically takes all of the sources of information we want, our documentation and anything else we've written, and renders them into the form that we want in the add-in, because, much like the rest of GitHub and GitLab, most of our documentation is written in Markdown and then we render it into whatever we need. I don't need to show the rest of this, but yeah, it's passing. We'll go back to VSCode.
If we were to change... so this is just a single function. If I go in here and run this: JSL, run current selection. You can see that it came back; all it's trying to do is open Big Class, run a fit line, and get the equation. It's returning the equation, and you can actually see it ran over here as well. But this could use some more documentation. And we're like, oh, we don't actually want this data table open. But let's just run this real quick and say, no, this isn't a good return; it returns the equation in all caps, apparently. So I stage that: better documentation, push. Again, back to here. So again it's pushing; this is another pipeline. It's just running a bunch of PowerShell scripts in order, depending on how we set it up. But you'll notice this pipeline has more stages. In an effort to help this scale, we only test the JSL minimally at first, and then, as it passes, we allow it to test further. And we only test it if there are JSL files that have changed. But we can go through this. It will run, and it will tell us where it is in the testing, just in case the testing freezes. You know, if you have a modal dialog box that just won't close, obviously JMP isn't going to keep doing anything after that. But you can see it did a bunch of stuff. Yeah, awesome, I'm done. Exciting. Refresh that, get a little green checkmark, and we could go, okay, run everything now. It would go through, test everything, then encrypt it, then test the encrypted version, basically the actual thing that we're going to make the add-in from, and then stage it again, package it, and create the actual add-in that we would give to a customer. I'm not going to do that right now because it takes a minute. But let's say we go in here and we're like, oh well, I really want to close this data table. I don't know why I commented that out in the first place. I don't think it should be open, because I'm not using it anymore; we don't want that. We'll say okay: close the dt. Again, push. Now, this could all be done manually on my computer with Hamcrest. But sometimes a developer will push stuff and not run all of their Hamcrest tests for everything on their computer, and that's the entire purpose of this: to catch it. It forced us to do our jobs a little better. And yeah, keep clicking this button. I could just open that, but it's fine. So now you'll see it's running the pipeline again. Go to the pipeline. And I'm just going to keep saying this for repetition: we're going through, testing, then encrypting, then testing again, because sometimes encryption introduces its own world of problems, if anybody's ever done encrypting. Run, run, run, run, run. And then, oh, we got a throw. Would you look at that? I'm not trying to be deadpan, but you know. So if we were to mark this as ready and say, yeah, we're done, we'd see, oh well, that test didn't pass. Now we can download the reason the test didn't pass in the artifacts. This will open a JUnit file that I'm just going to pull out here. It will also render it in GitLab, which might be easier, but for now we'll just do this. Eventually. Minimize everything. Come on. So we can see that something happened with R squared and it failed, inside of boo. So we can come here and say, why is there something in boo that is causing this to fail? We see, oh, somebody called our equation function and then they just assumed that the data table was there.
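To make the testing step concrete, here is a small sketch of what a Hamcrest-style unit test for a function like the one in the demo might look like. The foo body and the expected values are invented for illustration, and the ut assert that / ut close to names follow the Hamcrest for JSL add-in's general conventions; check the add-in's documentation for the exact matcher names in your version.

// Hypothetical function under test: returns the mean weight in Big Class.
foo = Function( {},
	{Default Local},
	dt = Open( "$SAMPLE_DATA/Big Class.jmp" );
	m = Mean( dt:weight << Get Values );  // mean of the weight column
	Close( dt, No Save );
	m;
);

// Hamcrest-style assertion; expected value and tolerance are illustrative only.
ut assert that( Expr( foo() ), ut close to( 105, 5 ) );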
So something I changed broke somebody else's code, as if that would ever happen. So we're having that problem. Where did you go? Here we go. So that's the main purpose of everything we're doing here: to catch the fact that I changed something and broke somebody else's stuff. So I can go through, look at what boo does, and say, oh well, maybe I should just open Big Class myself. Yeah, cool. Well, if I save that, I should probably make it better: open Big Class myself. I'll stage that, open Big Class, git push. And again, just to show the old pipeline. Now this shouldn't take too long. So we're going to go in here. At first we only test on one JMP version; you can see, automatically, we only test on one, and then it waits for the developer to say, yeah, I'm done and everything looks good, continue. We do that for resource reasons, because these are running on VMs that are just chugging all the time, and we have multiple developers who are all using these systems. You can also see this one is actually a Docker system; we're containerizing these. Well, we're in the process of containerizing these: we have them working, but we don't have all the versions yet. But we run 14.3, at least for this project; we run 14.3, 15, and 15.1, and that should work. Let's just revert things, because that, you know, works. Probably should have done a classic... but it's fine. So yeah, we're going to test. I feel like I keep saying this over and over: we're going to test everything. We'll actually let this one run to show you the end result of what we get. It should only take a little bit. So we'll test this, make sure it's going, and you can see the logs. We're getting decent information about what is happening and where it is; it'll tell you the runner that's running. I'm only running on Windows right now. Again, this is a demo and all that, but we should be able to run more. While that's running, I'll just talk about VSCode some more. In VSCode there are also snippets and things, so if you want to make a function, it will create all of the function scaffolding for you. We use Natural Docs, again stolen from the Hamcrest team, for our development documentation, so it'll put everything in Natural Docs form. So again, the idea is helping us do our jobs and forcing us to do our jobs a little better, with a little more gusto. Wayne Levin For the documentation? Vince Faller So that's for the documentation, yeah. Wayne Levin As we're developing, we're documenting at the same time. Vince Faller Yep, absolutely. You know, it also has for loops, while loops, For Each Row, stuff like that. Is this done yet? It's probably done, yep. So we get our green checkmark. Now it's going to run on all of the systems. If we go back to here, you'll just see it open JMP, run some tests, probably open Big Class, then close itself all down. Wayne Levin So we're doing this largely because many of our clients have different versions of JMP deployed; they want a particular add-in, but they're running different versions out there in the field. We also test against the early adopter versions of JMP, which is a service to JMP because we report bugs. But it's also helpful for the clients, because then they know that they can upgrade to the new version of JMP, knowing that the applications we built for them have been tested.
And that's just routine for us. Good. Vince Faller You're done. You're done. You're done. And this is just going to run; we can movie-magic this if you want to, Meg, just to make it run faster. Basically, I just want to get to staging, but it takes a second. Is there anything else you have to say about it, Wayne? Cool. Something I can say: when we're staging, we also have our documentation in MkDocs, so it'll actually run the MkDocs build, render it, put the help into the help files, and basically create a release for us so that we don't have to deal with it, because creating releases is just a lot of effort. Encrypting. It's almost done. Probably should have just had one preloaded. Live demos, what are you going to do. Run. Oh, one thing I definitely want to mention: the last thing that the pipeline actually does is check that we actually recorded our time spent, because if we don't record our time spent, we don't get paid, so it forces us to do it. Vince Faller So the job would have failed without that. I can just show some jobs. That's the Docker one; we don't want that. You can see that gave us our successes: no failures, no unexpected throws. That's all output from Hamcrest. Come on. One more. Okay, we got to staging. One thing that it does is create the repositories, and it creates them fresh every time, so it tries to keep things in a sort of stateless way. Okay, we can download the artifacts now. And now we should have this pipeline demo. I really wish it would have just gone there. What... why is Internet Explorer up? So now you'll see pipeline demo is a JMP add-in. If we unzip it... if you didn't know, a JMP add-in is just a zip file. If we look at that now, you can see that it has all of our scripts in it: it has our foo, it has our bar. If we open those, you can see it's an encrypted file. So this is basically what we would be able to give to the customer, without so much mechanical work. Wayne mentioned that it means less frustrated developers, and personally I think that's an understatement, because doing this over and over was very frustrating before we got this in place, and this has helped a bunch. Wayne Levin Now, about the encryption: when you're delivering an add-in for use within a company, you typically don't want, for security reasons and so on, anyone to be able to go in and tamper with the code. So we may deliver the code unencrypted to the client, so the client has their own code unencrypted, but for delivery to the end user you typically want everything encrypted, just so it can't be tampered with. Just one of those things. Vince Faller Yep, and that is the end of my demo. Wayne, if you want to take it back for the wrap-up. Wayne Levin Yeah, terrific. Thanks very much for that, Vince. So there are a lot of moving parts in this whole system. It's basically making sure that we've got code being developed by multiple developers without them colliding with one another. We're building in the documentation at the same time, and the documentation actually gets deployed with the application; we don't have to weave that in separately. We set the infrastructure up so that it's automatically taken care of, and we can update the documentation along with the code comprehensively, simultaneously, if you will.
The Hamcrest tests that are going on: each one of those functions that is written has expected results, if you will. They get compared, and we saw briefly that there was some problem with that equation there. An R squared or whatever came back with a different value, so it broke; in other words, it said, hey, something's not right here, I was expecting this output from the function for a given use case. That's one of the things we get from clients: we build up a pool of use cases that get turned into Hamcrest tests, and away we go. There are some other slides here that are available to you when you download the slides, so I'll leave those for you, along with a little picture of the pipeline that we're employing and a little bit about code review activity for developers too, if you want to go back and forth with it. Vince, do you want to add anything here about how code review and approval take place? Vince Faller Yeah, so inside of the merge request it will have the JSL code and the diffs of the code. And again, a big thank-you to the people who did Hamcrest, because they also started a lexer for GitHub and GitLab to be able to read JSL, so this is actually inside of GitLab, and it can also read the JSL. It doesn't execute it, but it has nice formatting; it's not all just white text, it's beautiful there. We just go in, like in this screenshot, click a line, put in a comment, and it becomes a reviewable task. So we try to do as much inside of GitLab as we can, for transparency reasons, and once everything is closed out, you can say, yep, my merge request is ready to go; let's put it into the master branch, the main branch. Wayne Levin Awesome. So it's really helping: we're really defining coding standards, if you will, and I don't like the word enforcement, but that's what it amounts to. It reduces variation. It makes it easier for multiple developers to understand what others have done. And as we bring new developers on board, they come to understand the standard; they know what to look for and what to do. So it makes onboarding a lot easier, and again, everything's attached to everything here, so supportability and so on. This is the slide I mentioned earlier, just for some resources. We're using GitLab; I suppose the same principles apply to Git generally, so GitHub or what have you. Here's the Community link for Hamcrest. There was a talk in Tucson in 2019, in the old days when we used to travel and get together; that was a lot of fun. And here's the marketplace link for the Visual Studio Code extension. As Vince said, we make a lot of use of that editor, as opposed to the built-in JMP editor, just because it's all integrated; it's all part of one big application development environment. And with that, on behalf of Vince and myself, I want to thank you for your interest in this, and we want to again thank the JMP team, Justin Chilton and company; I'll call out to you. If not for Hamcrest, we would not be on this path. That was the missing piece, or the enabling piece, that really allowed us to take JSL development to, basically, the kinds of standards you expect in code development generally in industry.
So we're really grateful for it, and, you know, that is propagated out with each application we've deployed. Vince and I are happy to take any questions; you can also send them to info@predictum.com and they'll get forwarded to us and we'll get back to you. But at this point, we'll open it up to Q&A.
Labels (10):
Advanced Statistical Modeling
Automation and Scripting
Basic Data Analysis and Modeling
Consumer and Market Research
Content Organization
Design of Experiments
Mass Customization
Quality and Process Engineering
Reliability Analysis
Sharing and Communicating Results
Fault Detection and Diagnosis of the Tennessee Eastman Process Using Multivariate Control Charts (2021-EU-45MP-782)
Sunday, March 7, 2021
Jeremy Ash, JMP Analytics Software Tester, SAS The Model Driven Multivariate Control Chart (MDMCC) platform enables users to build control charts based on PCA or PLS models. These can be used for fault detection and diagnosis of high dimensional data sets. We demonstrate MDMCC monitoring of a PLS model using the simulation of a real-world industrial chemical process: the Tennessee Eastman Process. During the simulation, quality and process variables are measured as a chemical reactor produces liquid products from gaseous reactants. We demonstrate how MDMCC can perform online monitoring by connecting JMP to an external database. Measuring product quality variables often involves a time delay before measurements are available which can delay fault detection substantially. When MDMCC monitors a PLS model, the variation of product quality variables is monitored as a function of process variables. Since process variables are often more readily available, this can aid in the early detection of faults. We also demonstrate fault diagnosis in an offline setting. This often involves switching between multivariate control charts, univariate control charts and diagnostic plots. MDMCC provides a user-friendly interface to move between these plots. Auto-generated transcript... Speaker Transcript Hello, I'm Jeremy Ash. I'm a statistician in JMP R&D. My job primarily consists of testing the multivariate statistics platforms in JMP, but I also help research and evaluate methodology. Today I'm going to be analyzing the Tennessee Eastman process using some statistical process control methods in JMP. I'm going to be paying particular attention to the model driven multivariate control chart platform, which is a new addition to JMP 15. These data provide an opportunity to showcase a number of the platform's features. And just as a quick disclaimer, this is similar to my Discovery Americas talk. We realized that Europe hadn't seen a model driven multivariate control chart talk due to all the craziness around COVID, so I decided to focus on the basics. But there is some new material at the end of the talk: I'll briefly cover a few additional example analyses that I put on the Community page for the talk. First, I'll assume some knowledge of statistical process control in this talk. The main thing it would be helpful to know about is control charts. If you're not familiar with these, they are charts used to monitor complex industrial systems to determine when they deviate from normal operating conditions. I'm not gonna have much time to go into the methodology of model driven multivariate control chart, so I'll refer you to these other great talks that are freely available on the JMP Community if you want more details. I should also say that Jianfeng Ding was the primary developer of the model driven multivariate control chart in collaboration with Chris Gotwalt, and that Tonya Mauldin and I were testers. The focus of this talk will be using multivariate control charts to monitor a realistic industrial process; another novel aspect will be using control charts for online process monitoring. This means we'll be monitoring data continuously as it's added to a database and detecting faults in real time. So I'm going to start off with the obligatory slide on the advantages of multivariate control charts. So why not use univariate control charts? There are a number of excellent options in JMP. Univariate control charts are excellent tools for analyzing a few variables at a time.
However, quality control data are often high dimensional, and the number of control charts you need to look at can quickly become overwhelming. Multivariate control charts can summarize a high dimensional process in just a couple of control charts, so that's a key advantage. But that's not to say that univariate control charts aren't useful in this setting. You'll see throughout the talk that fault diagnosis often involves switching between multivariate and univariate charts. Multivariate control charts give you a sense of the overall health of the process, while univariate charts allow you to monitor specific aspects of the process, so the information is complementary. One of the goals of model driven multivariate control chart is to provide some useful tools for switching between these two types of charts. One disadvantage of univariate charts is that observations can appear to be in control when they're actually out of control in the multivariate sense, and these plots show what I mean by this. The univariate control charts for oil and density show the two observations in red as in control. However, oil and density are highly correlated, and both observations are out of control in the multivariate sense, especially observation 51, which clearly violates the correlation structure of the two variables. So multivariate control charts can pick up on these types of outliers, while univariate control charts can't. Model driven multivariate control chart uses projection methods to construct the charts. I'm going to start by explaining PCA because it's easy to build up from there. PCA reduces the dimensionality of the process by projecting data onto a low dimensional surface. This is shown in the picture on the right. We have P process variables and N observations, and the loading vectors in the P matrix give the coefficients for linear combinations of our X variables that result in score variables with dimension A, where A is much less than P. This is shown in the equations on the left here: X can be predicted as a function of the scores and loadings, where E is the prediction error. The scores are selected to minimize the prediction error; another way to think about this is that you're maximizing the amount of variance explained in the X matrix. PLS is a more suitable projection method when you have a set of process variables and a set of quality variables. You really want to ensure that the quality variables are kept in control, but these variables are often expensive or time consuming to collect. The plant could be making product with out-of-control quality for a long time before a fault is detected. So PLS models allow you to monitor your quality variables as a function of your process variables, and you can see that the PLS models find the score variables that maximize the amount of variation explained in the quality variables. These process variables are often cheaper or more readily available, so PLS can enable you to detect faults in quality early and make your process monitoring cheaper. From here on out I'm going to focus on PLS models because that's more appropriate for the example. So the PLS model partitions your data into two components. The first component is the model component. This gives the predicted values of your process variables. Another way to think about it is that your data has been projected into the model plane defined by your score variables, and T squared monitors the variation of your data within this model plane.
And the second component is the error component. This is the distance between your original data and the predicted data, and squared prediction error (SPE) charts monitor this variation. An alternative metric we provide is the distance to model X plane, or DModX; this is just a normalized alternative to SPE that some people prefer. The last concept that's important to understand for the demo is the distinction between historical and current data. Historical data are typically collected when the process was known to be in control. These data are used to build the PLS model and define the normal process variation so that a control limit can be obtained. Current data are assigned scores based on the model but are independent of the model. Another way to think about this is that we have training and test sets. The T squared control limit is lower for the training data because we expect less variability for the observations used to train the model, whereas there's greater variability in T squared when the model generalizes to a test set. Fortunately, the theory for the variance of T squared has been worked out, so we can get these control limits based on some distributional assumptions. In the demo we'll be monitoring the Tennessee Eastman process, so I'm going to present a short introduction to these data. This is a simulation of a chemical process developed by Downs and Vogel, two chemists at Eastman Chemical. It was originally written in Fortran, but there are wrappers for Matlab and Python now. I just wanted to note that while this data set was generated in the '90s, it's still one of the primary data sets used to benchmark multivariate control methods in the literature. It covers the main tasks of multivariate control well, there is an impressive amount of realism in the simulation, and the simulation is based on an industrial process that's still relevant today. The data were manipulated to protect proprietary information. The simulated process is the production of two liquid products from gaseous reactants within a chemical plant, and F here is a byproduct that will need to be siphoned off from the desired product. That's about all I'll say about that. The process diagram looks complicated, but it really isn't that bad, so I'll walk you through it. Gaseous reactants A, D, and E flow into the reactor here. The reaction occurs and the product leaves as a gas. It's then cooled and condensed into liquid in the condenser. Then a vapor liquid separator recycles any remaining vapor and sends it back to the reactor through a compressor, and the byproduct and inert chemical B are purged in the purge stream to prevent any accumulation. The liquid product is pumped through a stripper, where the remaining reactants are stripped off and sent back to the reactor. And then finally, the purified liquid product exits the process. The first set of variables being monitored are the manipulated variables. These look like bow ties in the diagram; I think they're actually meant to be valves, and the manipulated variables mostly control the flow rate through different streams of the process. These variables can be set to any values within limits and have some Gaussian noise. The manipulated variables can be sampled at any rate, but we use the default 3-minute sampling interval here. Some examples of the manipulated variables are the valves that control the flow of reactants into the reactor.
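For reference, the decomposition and the two monitoring statistics described above can be written compactly in standard notation (the exact scaling conventions the platform uses may differ slightly):

\[ \mathbf{X} = \mathbf{T}\,\mathbf{P}^{\top} + \mathbf{E}, \qquad T^2_i = \sum_{a=1}^{A} \frac{t_{ia}^2}{s_a^2}, \qquad \mathrm{SPE}_i = \mathbf{e}_i^{\top}\mathbf{e}_i = \sum_{j=1}^{p} \left( x_{ij} - \hat{x}_{ij} \right)^2 \]

where \(\mathbf{T}\) is the \(N \times A\) matrix of scores, \(\mathbf{P}\) is the \(p \times A\) matrix of loadings, \(\mathbf{E}\) is the matrix of residuals, \(s_a^2\) is the variance of score \(a\) estimated from the historical data, and \(\mathbf{e}_i\) is the residual row for observation \(i\). T squared tracks variation within the model plane, and SPE tracks the distance from it.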
Another example is a valve that controls the flow of steam into the stripper, and another is a valve that controls the flow of coolant into the reactor. The next set of variables are measurement variables. These are shown as circles in the diagram and were also sampled at three minute intervals. The difference between manipulated variables and measurement variables is that the measurement variables can't be manipulated in the simulation. Our quality variables will be the percent composition of the two liquid products, and you can see the analyzer measuring the products here. These variables are sampled with a considerable time delay, so we're looking at the purge stream instead of the exit stream, because these data are available earlier. And we'll use a PLS model to monitor the process variables as a proxy for these quality variables, because the process variables have less delay and a faster sampling rate. So that should be enough background on the data. In total there are 33 process variables and two quality variables. The process of collecting the variables is simulated with a set of differential equations. This is just a simulation, but as you can see, a considerable amount of care went into modeling this after a real world process. Here is an overview of the demo I'm about to show you. We will collect data on our process and store these data in a database. I wanted to have an example that was easy to share, so I'll be using a SQLite database, but the workflow is relevant to most types of databases, since most support ODBC connections. Once JMP forms an ODBC connection with the database, JMP can periodically check for new observations and add them to a data table. If we have a model driven multivariate control chart report open with automatic recalc turned on, we have a mechanism for updating the control charts as new data come in. The whole process of adding data to a database would likely be going on on a separate computer from the computer that's doing the monitoring, so I have two sessions of JMP open to emulate this. Both sessions have their own journal in the materials on the Community; the session adding new simulated data to the database will be called the Streaming Session, and the session updating the reports as new data come in will be called the Monitoring Session. One thing I really liked about the Downs and Vogel paper was that they didn't provide a single metric to evaluate the control of the process. I have a quote from the paper here: "We felt that the tradeoffs among the possible control strategies and techniques involved much more than a mathematical expression." So here are some of the goals they listed in their paper which are relevant to our problem. They wanted to maintain the process variables at desired values, they wanted to minimize variability of product quality during disturbances, and they wanted to recover quickly and smoothly from disturbances. So we'll see how well our process achieves these goals with our monitoring methods. To start off in the Monitoring Session journal, I'll show you our first data set. The data table contains all of the variables I introduced earlier: the first variables are the measurement variables, the second are the composition variables, and the third are the manipulated variables. The script up here will fit a PLS model. It excludes the last 100 rows as a test set. Just as a reminder, the model is predicting the 2 product composition variables as a function of the process variables.
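As a rough JSL sketch of the polling mechanism just described, under the assumption of an ODBC DSN called "TEP" pointing at the SQLite file and a table called MonitoringData (both names are placeholders; the actual scripts are in the journals on the Community page):

// One pass of the check: if the database table has more rows than the local JMP
// table, pull down the new rows and append them; automatic recalc then updates
// the open Model Driven Multivariate Control Chart report.
dtMon = Data Table( "Monitoring Data" );
dbc = Create Database Connection( "DSN=TEP;" );
dtCount = Execute SQL( dbc, "SELECT COUNT(*) AS n FROM MonitoringData", "row count" );
nDb = dtCount:n[1];
Close( dtCount, NoSave );
If( nDb > N Rows( dtMon ),
	dtNewRows = Execute SQL( dbc,
		"SELECT * FROM MonitoringData LIMIT -1 OFFSET " || Char( N Rows( dtMon ) ),
		"new rows"
	);
	dtMon << Concatenate( dtNewRows, Append to first table );
	Close( dtNewRows, NoSave );
);
Close Database Connection( dbc );
// In practice this check would be run periodically, for example via a scheduled task,
// rather than an endless JSL loop.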
If you have JMP Pro, there have been some speed improvements to PLS in JMP 16. PLS now has a fast SVD option; you can switch to the classic method in the red triangle menu. There have also been a number of performance improvements under the hood, mostly relevant for datasets with a large number of observations, but that's common in the multivariate process monitoring setting. PLS is not the focus of the talk, though, so I've already fit the model and output score columns, and you can see them here. One reason that model driven multivariate control chart was designed the way it is: imagine you're a statistician and you want to share your model with an engineer so they can construct control charts. All you need to do is provide the data table with these formula columns; you don't need to share all the gory details of how you fit your model. Next, I'll provide the score columns to model driven multivariate control chart and drag them to the right here. On the left here you can see two types of control charts: T squared and SPE. There are 860 observations that were used to estimate the model, and these are labeled as historical; the hundred that were left out as a test set are your current data. You can see in the limit summaries the number of points that are out of control and the significance level. If you want to change the significance level, you can do it up here in the red triangle menu. Because the reactor is in normal operating conditions, we expect no observations to be out of control, but we have a few false positives here because we haven't made any adjustments for multiple comparisons. It's uncommon to do this, as far as I can tell, in multivariate control charts; I suppose you have higher power to detect out of control signals without a correction. In control chart lingo, this means your out-of-control average run length is kept low. On the right here we also have contribution plots: on the Y axis are the observations, on the X axis the variables, and a contribution is expressed as a proportion. Then at the bottom here, we have score plots. Right now I'm plotting the first score dimension versus the second score dimension, but you can look at any combination of score dimensions using these dropdown menus or the arrow buttons. OK, so I think we're oriented to the report. I'm going to now switch over to the scripts I've used to stream data into the database that the report is monitoring. In order to do this example yourself, you'll need to have a SQLite ODBC driver installed on your computer. This is much easier to do on a Windows computer, which is what you're often using when actually connecting to a database; the process on the Mac is more involved, but I put some instructions on the Community page. I don't have time to talk about this, but I created the SQLite database I'll be using in JMP, and I plan to put some instructions on how to do this on the Community page. Hopefully that example is helpful to you if you're trying to do this with data of your own. Next I'm going to show you the files that I put in the SQLite database. Here I have the historical data. This was used to construct the PLS model; there are 960 observations that are in control. Then I have the monitoring data, which at first just contains the historical data, but I'll gradually add new data to it. This is the data that the multivariate control chart will be monitoring. And then I've simulated new data already and added it to the data table here.
These are another 960-odd measurements where a fault is introduced at some time point. I wanted to have something that was easy to share, so I'm not going to run my simulation script and add to the database that way. We're just going to take observations from this new data table and move them over to the monitoring data table using some JSL and SQL statements. This is just an example emulating the process of new data coming into a database. In practice you might not actually do this with JMP, but this was an opportunity to show how you can do it with JSL. Let me clean up here. Next I'll show you the streaming script. This is a simple script, so I'm going to walk you through it real quick. This first set of commands will open the new data table, which is in the SQLite database; it opens the table in the background so I don't have to deal with the window. Then I'm going to take pieces from this data table and add them to the monitoring data table. I call the pieces bites, and the bite size is 20. This next command will connect to the database, which will allow me to send the database SQL statements. And then this next bit of code iteratively sends SQL statements that insert new data into the monitoring data. I'm going to initialize K and show you the first iteration of this. This is a simple INSERT INTO statement that inserts the first 20 observations into the data table. The print statement is commented out so that the code runs faster, and I also have a wait statement to slow things down slightly so that we can see the progression in the control chart; it would just go too fast if I didn't slow it down. So next I'm going to move over to the Monitoring Session to show you the scripts that will update the report as new data come in. The first script is a simple one that will check the database every 0.2 seconds for new observations and add them to the JMP table. Since the report has automatic recalc turned on, the report will update whenever new data are added. I should add that, realistically, you probably wouldn't use a script that just iterates like this; you would probably use Task Scheduler on Windows or Automator on Mac to better schedule runs of the script. Then there's also another script that will push the report to JMP Public whenever the report is updated, and I was really excited that this is possible with JMP 15. It enables any computer with a web browser to view updates to the control chart; you can even view the report on your smartphone, so this makes it really easy to share results across organizations. You can also use JMP Live if you want the reports to be on a restricted server. I'm not going to have time to go into this in this demo, but you can check out my Discovery Americas talk. Then finally, down here there is a script that recreates the historical data in the data table if you want to run the example multiple times. Alright, so next, after making sure that we have the historical data, I'm going to run the streaming script and see how the report updates. So the data is in control at first and then a fault is introduced, but there's a plantwide control system implemented in the simulation, and you can see how the control system eventually brings the process to a new equilibrium. We'll wait for it to finish here. So if we zoom in, it seems like the process first went out of control around this time point, so I'm going to color it and label it so that it will show up in other plots.
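A stripped-down version of that streaming loop might look like the following. It moves the rows with an INSERT ... SELECT using LIMIT/OFFSET rather than building VALUES lists in JSL, and the DSN and table names are again placeholders rather than the ones in the talk's journal.

// Move the simulated observations into the monitored table in "bites" of 20,
// pausing between bites so the control chart updates are visible.
biteSize = 20;
nBites = 48;                                   // roughly 960 simulated observations / 20 per bite
dbc = Create Database Connection( "DSN=TEP;" );
For( k = 0, k < nBites, k++,
	// Append the next bite of simulated rows to the table the control chart is watching.
	sql = "INSERT INTO MonitoringData SELECT * FROM NewData LIMIT " || Char( biteSize )
		|| " OFFSET " || Char( k * biteSize ) || ";";
	Execute SQL( dbc, sql );
	Wait( 0.5 );                               // slow the stream down slightly
);
Close Database Connection( dbc );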
And then in the SPE plot, it looks like this observation is also out of control, but only slightly. If we zoom in on the time point in the contribution plots, you can see that there are many variables contributing to the out of control signal at first, but once the process reaches a new equilibrium, there are only two large contributors. So I'm going to remove the heat maps now to clean up a bit. You can hover over the point at which the process first went out of control and get a peek at the top ten contributing variables. This is great for giving you a quick overview of which variables are contributing most to the out of control signal. And then if I click on the plot, this will be appended to the fault diagnosis section. As you can see, there are several variables with large contributions, and I've just sorted on the contribution. For variables with red bars, the observation is out of control in the univariate control charts. You can see this by hovering over one of the bars; these graphlets are IR charts for an individual variable with three sigma control limits. You can see in the stripper pressure variable that the observation is out of control, but eventually the process is brought back under control, and this is the case for the other top contributors. I'll also show you the univariate control chart for one of the variables that stays in control. So there are many variables out of control in the process at the beginning, but the process eventually reaches a new equilibrium. To see the variables that contribute most to the shift in the process, we can use the mean contribution proportion plots. These plots show the average contribution that the variables have to T squared for the group I've selected. If I sort on these, the only two variables with large contributions measure the flow rate of reactant A in stream one, which is the flow of this reactant into the reactor. Both of these variables are measuring essentially the same thing, except that one is a measurement variable and the other is a manipulated variable. You can see that there is a large step change in the flow rate, which is what I programmed in the simulation, so these contribution plots allow you to quickly identify the root cause. In my previous talk I showed many other ways to visualize and diagnose faults using tools in the score plot, including plotting the loadings on the score plots and doing some group comparisons; you can check out my Discovery Americas talk on the JMP Community for that. Instead, I'm going to spend the rest of this time introducing a few new examples, which I put on the Community page for this talk. There are 20 programmable faults in the Tennessee Eastman process, and they can be introduced in any combination. I provided two other representative faults here. Fault 1, which I showed previously, was easy to detect because the out of control signal is so large and so many variables are involved. The focus of the previous demo was to show how to use the tools to identify faults out of a large number of variables, not necessarily to benchmark the methods. Fault 4, on the other hand, is a more subtle fault, and I'll show it to you here. The fault that's programmed is a sudden increase in the temperature in the reactor, and this is compensated for by the control system by increasing the flow rate of coolant. You can see that variable picked up here, and you can see the shift in the contribution plots.
And then you can also see that most other variables aren't affected by the fault. You can see a spike in the temperature here that is quickly brought back under control. Because most other variables aren't affected, this is hard to detect for some multivariate control methods, and it can be more difficult to diagnose. The last fault I'll show you is Fault 11. Like Fault 4, it also involves the flow of coolant into the reactor, except now the fault introduces large oscillations in the flow rate, which we can see in the univariate control chart, and this results in a fluctuation of reactor temperature. The other variables aren't really affected again, so this can be harder to detect for some methods. Some multivariate control methods can pick up on Fault 4 but not Fault 11, or vice versa, but our method was able to pick up on both. And then finally, all the examples I created using the Tennessee Eastman process had faults that were apparent in both the T squared and SPE plots. To show some newer features in model driven multivariate control chart, I wanted to show an example of a fault that appears in the SPE chart but not T squared. To find a good example of this, I revisited a data set which Jianfeng Ding presented in her earlier talk, and I provided a link to her talk in this journal; on her Community page, she provides several useful examples that are also worth checking out. This is a data set from Kourti and MacGregor's classic paper on multivariate control charts. The data are process variables measured in a reactor producing polyethylene, and you can find more background in Jianfeng's talk. In this example, we have a process that went out of control; let me show you this. It goes out of control earlier in the SPE chart than in the T squared chart. If we look at the mean contribution plots for SPE, you can see that there is one variable with a large contribution, and it also shows a large shift in the univariate control chart; but there are also other variables with large contributions that are still in control in the univariate control charts, and it's difficult to determine from the bar charts alone why these variables had large contributions. Large SPE values happen when new data don't follow the correlation structure of the historical data, which can often be the case when new data are collected, and this means that the PLS model you trained is no longer applicable. From the bar charts, it's hard to know which pair of variables have their correlation structure broken. So, new in 15.2, you can launch scatterplot matrices, and it's clear in the scatterplot matrix that the violation of correlations with Z2 is what's driving these large contributions. OK, I'm gonna switch back to the PowerPoint. Real quick, I'll summarize the key features of model driven multivariate control chart that were shown in the demo. The platform is capable of performing both online fault detection and offline fault diagnosis. There are many methods provided in the platform for drilling down to the root cause of faults. I'm showing you here some plots from a popular book, Fault Detection and Diagnosis in Industrial Systems. Throughout the book, the authors demonstrate how one needs to use multivariate and univariate control charts side by side to get a sense of what's going on in a process, and one particularly useful feature in model driven multivariate control chart is how interactive and user friendly it is to switch between these two types of charts. And that's my talk.
Here is my email if you have any further questions. And thanks to everyone that tuned in to watch this.
Labels (10):
Advanced Statistical Modeling
Basic Data Analysis and Modeling
Consumer and Market Research
Data Access
Data Exploration and Visualization
Design of Experiments
Mass Customization
Predictive Modeling and Machine Learning
Quality and Process Engineering
Reliability Analysis
Driving Product Development Through Modelling Historic Data in JMP® (2021-EU-30MP-780)
Sunday, March 7, 2021
Stuart Little, Lead Research Scientist, Croda This presentation will show how some of the tools available in JMP have been successfully used to visualize and model historic data within an energy technology application. The outputs from the resulting model were then used to inform the generation of a DOE-led synthesis plan. The result of this plan was a series of new materials that have all performed in line with the expectations of the model. Through this approach, a functional model of product performance has been successfully developed. This model, alongside the visualization capabilities of JMP, has allowed the business to begin to embrace a more structured approach to experimentation. Auto-generated transcript... Speaker Transcript Stuart Little Hi everyone, and welcome to this talk about how JMP is being used at Croda to help drive new product development. So what we're going to cover today: firstly, some context on Croda, how we are using JMP and where we are on that journey, and a summary of the problem we're trying to solve. Once we've covered the problem, we're going to move to JMP and look at how the tools and platforms in JMP have allowed easy data exploration and easy development of a structure-performance model. And then finally we'll wrap this up by discussing the outcomes of this work, and how, by doing this kind of research, we've been able to increase buy-in to the use of data and DOE techniques in the research side of the business. So firstly, who we are as Croda. It's a question that does come up quite a lot, because we're a business-to-business entity. But as a business, Croda are the name behind a lot of high-performance ingredients and technologies, and behind a lot of the biggest and most successful brands across the world, across a range of markets. We create, make, and sell speciality chemicals. From the beginning of Croda, these have been predominantly sustainable materials: we started by making lanolin, which is from sheep's wool, and we continually build on that sustainability. Last year we made a public pledge to be climate, land, and people positive by 2030 and have signed up to the UN Sustainable Development Goals as part of our push to achieve this and become the most sustainable supplier of innovative ingredients across the world. In terms of the markets we serve, we have a very big personal care business, where we deal with skincare, sun care, hair care, color cosmetics, and those kinds of traditional personal care products. In our life sciences business, our products and expertise help customers optimize their formulations and their active ingredient use; most recently, we entered an agreement with Pfizer to provide materials that are going into their COVID-19 vaccine. Our industrial chemicals business is responsible for supplying technically differentiated, predominantly sustainable materials to a huge range of markets, a lot of which don't quite fit into anything else on this slide. And then finally we've got our performance technologies business. This covers a lot of similar areas, providing high performance answers across all of these. Today, in particular, we're talking about our energy technologies business, and specifically battery technology and high performance electronics.
So where we are with Croda and JMP: we've been using JMP for about two years and we've had a lot of interest internally, but it's been harder to build confidence that these techniques have real value to research. So to prove this, we've gone away and created a number of case studies that have been pretty successful on the whole; we've demonstrated the potential and some of the pitfalls within that. All of that has then led to a slightly bigger set of projects, one of which is the one we're going to talk to you about today: how do we improve the efficiency of electrical cooling systems? The primary driver for this project is transport electrification, so that's battery vehicles. How do you maintain the battery properly? How do you make sure the motors are working at their optimum level? And how do you do that without electrocuting anyone? Currently there's a set of cooling methods for these things, and our customers are certainly looking at how they can be improved, because the better the control of your battery cooling, for instance, the better battery capacity you have and the more consistent the range will be. And because this is critical, and there are lots of different applications that are broadly similar, the really useful thing for us would be to build an understanding of these fluids by having some sort of data-led model, and that's where JMP came in. So how can we do that? Well, the first thing we looked at was: what are the current cooling methods? For batteries, in the previous generation they're predominantly air cooled or cold plate cooled. The electronics in the car have the opposite problem to the battery, in that things tend to get too hot, so there we have heatsinks to try and take that energy away. And in electric motors, where we're trying to minimize the resistance, they tend to be jacketed with fluid. In all three of these cases, the incoming alternative method of cooling relies on fluid: that's direct immersion for batteries and electronics, and then for the electric motors it tends to be more of a flow. So what does that fluid look like? Obviously, we're dealing with high voltages, so we have to have something that's not electrically conductive. It also needs to have a really high conductivity of heat, so that it can pull heat out of the electronics. And because these fluids need to be moved around the system, the viscosity has to be low. So we have practical physical constraints that have been introduced by the application itself. If you look at it in a bit more depth, the ability of the fluids to transfer heat is based predominantly on this equation. What this tells us is there is a part that we can control by the fluid, which is the heat transfer coefficient, and then there is a part that is controlled by the engineering solution in the application: what's the area for cooling, and what are the temperatures of the surfaces that you're trying to cool? But in all cases, to get efficient heat transfer, we have to have a high heat transfer coefficient, and as that's the thing we can affect, that's where we looked. That heat transfer coefficient is defined, in a simplistic way, by this equation; there are other terms in there, but predominantly it's a function of density, thermal conductivity and heat capacity, with the viscosity of the system also having an effect.
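The two equations referred to above aren't reproduced on this page, but in their standard simplified form they amount to:

\[ Q = h \, A \, \Delta T, \qquad h = f(\rho,\; k,\; c_p,\; \mu) \]

where \(Q\) is the heat removed, \(h\) is the heat transfer coefficient set by the fluid, and \(A\) and \(\Delta T\) are the cooling area and surface-to-fluid temperature difference set by the engineering design; \(h\) increases with density \(\rho\), thermal conductivity \(k\), and heat capacity \(c_p\), and decreases with viscosity \(\mu\). The exact correlation used (which exponents sit on each property) isn't shown in the transcript, so only the direction of each effect is stated here.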
So, if we look specifically at the applications we're interested in: if we want to optimize our dielectric fluid, we need to increase the density, increase the thermal conductivity, and increase the heat capacity, but alongside that we need to reduce the viscosity of the fluid. And these match up pretty well with the engineering challenges that we have, which is helpful. So from that, we knew what the target was, and we really wanted to understand the relationship between structure and product performance as a dielectric fluid. Initially we proceeded in a fairly traditional way and started conducting a large-scale study measuring the physical properties of a lot of esters and a lot of other materials. And then, when we looked at that, we thought, well actually, this data exists, so why don't we use these data sets to try and build some models, and see whether we can really understand that physical property to structure to performance relationship. So that's where we're just going to pop into JMP, so bear with me one second. Okay. The first thing that we did was collate that mix of historic data and data that was being obtained through targeted testing by the applications teams. Once we'd got that into one place, we examined it in JMP to understand, at a really simple level, whether there is a relationship between the physical properties we're measuring. So, if we look at that data set, the first port of call for me, as ever, is the distribution platform in JMP. It's a really easy way just to see if something that you want has any kind of vague pattern anywhere else. In this case, if you say we want everything that's got a high thermal conductivity, what we see is that those materials are pretty stretched out across the other properties we've measured. So it doesn't really say, oh, there's a brilliant relationship, what you need is this, which is kind of what we expected, but it's nice to have a check. Similarly, if we then plot everything as scatterplots, what we see is a lot of noise. These lines of fit are just there for reference to show there isn't really any fit; in no way am I claiming any correlation on these. And while that was a little disappointing, the fact that there isn't an obvious answer was expected. Where it got interesting to us is that we said, well, we were expecting that there isn't a clear relationship between any of these factors, because if there was, it would have been obvious to the experienced scientists doing the work, and we would have known that. So then we said, well, what we do know is that these properties all have a relation to their structure. What happens if we calculate some physical parameters for these things and combine that with a number of structural identifiers and ways of looking at these molecules? What happens if we take that and add that data to the test data: can we then build some kind of model that starts being able to estimate structure and performance? So that's exactly what we did. In this case, if we use the multivariate platform just as a quick look to see if there's any correlation in some of these factors, we see clear differentiation in some cases between up and down, and maybe little hints of correlation, but nothing clear that says, this is the one thing that you need.
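As a rough JSL sketch of that first-pass exploration (the column names here are placeholders for the measured properties, not Croda's actual column names):

dt = Current Data Table();                     // the collated historic data
// Univariate view of each measured property.
Distribution(
	Column( :Thermal Conductivity ),
	Column( :Density ),
	Column( :Heat Capacity ),
	Column( :Viscosity )
);
// Pairwise correlations and scatterplot matrix across the same properties.
Multivariate( Y( :Thermal Conductivity, :Density, :Heat Capacity, :Viscosity ) );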
Again, this is what we expected. So then, what we did was use the regression platforms in JMP to try and understand whether we could build a model, and what that relationship looks like. To do this, we randomly selected a number of rows with the row selection tool in JMP, generally pulling out five samples at random which weren't going to participate in the model, and then iteratively built up these models and refined them that way, so we always had a validation set from the initial data, just to check that what we were doing had any chance of success. So then, if we just look at the 80 degree models, the first model that we came to was this one. Clearly, as we can see, there are a number of factors included in this model that make no sense from a statistical point of view, because they're just overfitting and they are just non-significant. However, these are fairly important in terms of describing the molecules that are in there, so as chemists we created this model. This is a model that allows molecules, if you like, to be designed for this application, even though we know it's overfitted, and we know that it's not really a valid model because these terms are just driving the R squared up and up and up. We also built the model without those terms. This is a far better model in terms of estimating the performance of these things: the R squared is a touch lower, but all the terms that are in there have a significant impact on the performance. The downside of this model is it doesn't really help us design any new chemistry. But in both cases, when we look at the predicted values against the actual measured values, we see a reasonable correlation between them; certainly when we expect things to be high, they are. So that gave us some confidence that this model might actually perform for us. Then, in terms of how good this might be, we simply looked at the percentage difference between the predicted value and the measured value. What we see is that they are almost universally within 10%, and predominantly within 5%, for either model. Again, across a range of different types of material, this gave us confidence that what we were seeing might be a real effect. All of which is very nice, but is this just an effect of the data we've measured? So what we did was use the profiler platform in JMP to produce a shareable model that we could send around the project team, and essentially set up a competition and said, look, whoever can find the highest thermal conductivity in this model from a molecule that could actually be made, wins. From that we had a list of about 14 materials back that looked promising. We had to cut a few out because it was impossible to source the raw materials, so we ended up with about nine new materials that were synthesized and tested. Now, these materials were almost exclusively made up of materials the model hadn't really seen before; in some cases, part of the molecule would be the same, but they were quite distinct from the original materials. So once we'd made them and they had been tested, we put them back into the model, just to see what the predictive power of this model was like. So if we have a look at that data: given the differences of these materials, I was fully expecting this to break the model. However,
if we look at the predictions again, what we see is that the highlighted blue ones are the new materials that were made. We deliberately picked a couple that were lower, just out of curiosity, just to check, and all the ones that we picked that we thought would be high were high. So in the overfitted model, which had value from a structure design point of view, what we see is one outlier. In the model that was statistically reasonable, we actually see a much better fit overall. And that was edifying: we can't design a single molecule and say, here you go, off you pop, here's the one thing you need to make, but we can certainly direct synthetic chemists to the right sorts of materials to really drive projects forward. So then, if we just look again at these residuals, for the statistically good model with no overfitting, everything was within 10% for all the new materials, which, for what we were trying to achieve, was good enough. There are a few in the overfitted model that were a little bit over 10%, but, again, this is kind of what I would expect to see. It was nice that they were all in the right range, because it shows that this approach was having value, but it was also somewhat reassuring to find that they weren't all exactly right, because I think, had we produced nine materials and they'd all been within 1%, I'm not sure that people would have believed that either. So the fact is, we were getting a similar level of difference from the predictions for the materials we started with and for the new materials that we made, so we started having some real confidence in this model. And then, if we just go back to the slides for a second: what we can say is that the structure-performance relationship of these materials has been captured in JMP using the regression platforms. We've used the visualization tools in JMP to be able to see that there are real benefits to doing this, and the model itself is being used to direct the synthesis of new materials in this project. It's being used to screen likely materials to test from things we already make. There's an acceptable correlation in the results between the model and the new molecules we're making, all of which has given real confidence to this approach, and it's really allowed us to push this project further and split it out into specific target materials. So, in terms of new molecules, we've directed the synthesis of molecules with higher thermal conductivity. As you can see in this plot, all the new molecules are medium to high on that range of thermal conductivity, which is kind of what we wanted to achieve. We demonstrated that we could target an improvement using data, and then verify that in the lab and make it. Where this project becomes harder still is that we're now trying to build similar models for all the other factors that influence the performance of these dielectric fluids, and then we will be trying to balance those models against each other to find the best outcomes.
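Looking back at the modelling step, a rough JSL sketch of the five-row holdout approach described earlier might look like this. The response and effects are placeholder names (the real models also included calculated structural descriptors), so treat it as an illustration of the workflow rather than the actual Croda script.

dt = Current Data Table();
// Pick five distinct rows at random to hold out as a small validation set.
holdout = {};
While( N Items( holdout ) < 5,
	r = Random Integer( 1, N Rows( dt ) );
	If( !Contains( holdout, r ), Insert Into( holdout, r ) );
);
// Excluded rows do not participate in the fit but can be compared to predictions afterwards.
For( i = 1, i <= N Items( holdout ), i++,
	Excluded( Row State( holdout[i] ) ) = 1
);
Fit Model(
	Y( :Thermal Conductivity ),                                    // placeholder response
	Effects( :Density, :Viscosity, :Descriptor 1, :Descriptor 2 ), // placeholder descriptors
	Personality( "Standard Least Squares" ),
	Run
);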
So all of that further development is ongoing, but that momentum has come purely from the ease of use of JMP and the platforms in it: taking a data set and, with a bit of domain knowledge, really pushing it forward to say, yep, here's a model that will help direct the synthesis for this project and subsequent projects in this area for Croda. So then, just in conclusion: data that we've obtained from testing have been used to successfully model the performance of these materials. It's not absolutely perfect, but it's good enough for what we want. The model demonstrates that there is a structure-performance relationship for esters (sorry, not sure why my taskbar is jumping around). The model has been used to predict materials of high thermal conductivity. Those predictions were then verified, initially just by exclusion and then latterly by making new materials, really showing that this model holds for that type of chemistry. It's also demonstrated the possibility of tailoring the properties of, in this case, dielectrics, but also of other materials if you build similar models, so that you can start being able to create specific materials for specific applications. And I think, most importantly for me, the real success of this work has been building internal momentum and demonstrating that JMP is not a nice-to-have; it's a real platform to develop research, to very quickly look at data sets and say, is there something there? And with that, I'd just like to say thank you for watching. Obviously I can't answer any questions on a recording, but if you want to get in touch, feel free to comment in the Community. Thank you very much.
Labels (7):
Advanced Statistical Modeling
Basic Data Analysis and Modeling
Design of Experiments
Mass Customization
Predictive Modeling and Machine Learning
Quality and Process Engineering
Sharing and Communicating Results