Fault Detection and Diagnosis of the Tennessee Eastman Process Using Multivariate Control Charts (2021-EU-45MP-782)

Level: Intermediate


Jeremy Ash, JMP Analytics Software Tester, SAS


The Model Driven Multivariate Control Chart (MDMCC) platform enables users to build control charts based on PCA or PLS models. These can be used for fault detection and diagnosis of high-dimensional data sets. We demonstrate MDMCC monitoring of a PLS model using the simulation of a real-world industrial chemical process: the Tennessee Eastman Process. During the simulation, quality and process variables are measured as a chemical reactor produces liquid products from gaseous reactants. We demonstrate how MDMCC can perform online monitoring by connecting JMP to an external database. Measuring product quality variables often involves a time delay before measurements are available, which can delay fault detection substantially. When MDMCC monitors a PLS model, the variation of product quality variables is monitored as a function of process variables. Since process variables are often more readily available, this can aid in the early detection of faults. We also demonstrate fault diagnosis in an offline setting. This often involves switching between multivariate control charts, univariate control charts, and diagnostic plots. MDMCC provides a user-friendly interface to move between these plots.



Auto-generated transcript...




  Hello, I'm Jeremy Ash. I'm a
  statistician in JMP R&D. My job
  primarily consists of testing
  the multivariate statistics
  platforms in JMP, but I also
  help research and evaluate
  methodology. Today I'm going to
  be analyzing the Tennessee
  Eastman process using some
  statistical process control
  methods in JMP. I'm going to
  be paying particular attention
  to the model driven multivariate
  control chart platform, which is
  a new addition to JMP 15.
  These data provide an
  opportunity to showcase a
  number of the platform's
  features. And just as a quick
  disclaimer, this is similar to
  my Discovery Americas talk. We
  realized that Europe hadn't seen a
  model driven multivariate
  control chart talk due to all the
  craziness around COVID, so I
  decided to focus on the basics.
  But there is some new material
  at the end of the talk. I'll
  briefly cover a few additional
  example analyses that I put on
  the Community page for the talk.
  First, I'll assume some knowledge
  of statistical process control
  in this talk. The main thing it
  would be helpful to know about
  is control charts. If you're
  not familiar with these, these
  are charts used to monitor
  complex industrial systems to
  determine when they deviate
  from normal operating conditions.
  I'm not gonna have much time to
  go into the methodology of model
  driven multivariate control
  chart, so I'll refer to these other
  great talks that are freely
  available on the JMP Community
  if you want more details. I
  should also say that Jianfeng
  Ding was the primary
  developer of the model driven
  multivariate control
  chart in collaboration with
  Chris Gotwalt and that Tonya
  Mauldin and I were testers. The
  focus of this talk will be using
  multivariate control charts to
  monitor a real-world
  chemical process; another novel
  aspect will be using control
  charts for online process
  monitoring. This means we'll be
  monitoring data continuously as
  it's added to a database and
  detecting faults in real time.
  So I'm going to start off with
  the obligatory slide on the
  advantages of multivariate
  control charts. So why not use
  univariate control charts? There
  are a number of excellent
  options in JMP. Univariate
  control charts are excellent
  tools for analyzing a few
  variables at a time. However,
  quality control data are often
  high dimensional and the number
  of control charts you need to
  look at can quickly become
  overwhelming. Multivariate
  control charts can summarize a
  high dimensional process in
  just a couple of control charts,
  so that's a key advantage.
  But that's not to say that
  univariate control charts aren't
  useful in this setting. You'll
  see throughout the talk that
  fault diagnosis often involves
  switching between multivariate
  and univariate charts.
  Multivariate control charts give
  you a sense of the overall
  health of the process, while
  univariate charts allow you to
  monitor specific aspects of the
  process. So the information is
  complementary. One of the goals
  of model driven multivariate
  control chart is to provide some
  useful tools for switching
  between these two types of
  charts. One disadvantage of
  univariate charts is that
  observations can appear to be in
  control when they're actually
  out of control in the multivariate
  sense and these plots show what I
  mean by this. The univariate
  control charts for oil and
  density show the two
  observations in red as in
  control. However, oil and density
  are highly correlated, and both
  observations are out of control
  in the multivariate sense,
  especially observation 51, which
  clearly violates the correlation
  structure of the two variables,
  so multivariate control charts
  can pick up on these types of
  outliers, while univariate
  control charts can't.
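A small Python sketch of the same idea, using made-up numbers rather than the oil/density data on the slide. It computes a per-variable z-score (the univariate view) and Hotelling's T squared (the multivariate view) for two strongly correlated variables. The data and names are purely illustrative.

```python
# Made-up, strongly correlated (x, y) pairs standing in for in-control historical data.
HISTORICAL = [
    (1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2),
    (5.0, 9.8), (6.0, 12.1), (7.0, 14.0), (8.0, 15.9),
]


def _moments(data):
    """Return means, variances, and covariance of two-column data."""
    n = len(data)
    xs = [d[0] for d in data]
    ys = [d[1] for d in data]
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    return mx, my, sxx, syy, sxy


def z_scores(point, data):
    """Univariate view: standardized distance of each coordinate from its own mean."""
    mx, my, sxx, syy, _ = _moments(data)
    return (point[0] - mx) / sxx ** 0.5, (point[1] - my) / syy ** 0.5


def hotelling_t2(point, data):
    """Multivariate view: T^2 = d^T S^{-1} d with S the 2x2 sample covariance."""
    mx, my, sxx, syy, sxy = _moments(data)
    det = sxx * syy - sxy ** 2
    dx, dy = point[0] - mx, point[1] - my
    # S^{-1} = [[syy, -sxy], [-sxy, sxx]] / det
    return (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det


on_trend = (6.0, 12.0)   # follows the y ~ 2x relationship
off_trend = (6.0, 6.0)   # each value is ordinary, but the pair breaks the correlation

for name, pt in [("on_trend", on_trend), ("off_trend", off_trend)]:
    zx, zy = z_scores(pt, HISTORICAL)
    print(f"{name}: z = ({zx:.2f}, {zy:.2f}), T^2 = {hotelling_t2(pt, HISTORICAL):.1f}")
```

Both points have |z| well below 3, so neither stands out on a univariate chart. The off-trend point's T squared is, however, thousands of times larger than the on-trend point's.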
  Model driven multivariate
  control chart uses projection
  methods to construct the charts.
  I'm going to start by explaining PCA
  because it's easy to build up
  from there. PCA reduces the
  dimensionality of the process by
  projecting data onto a low
  dimensional surface.
  This is shown in the picture
  on the right. We have P
  process variables and N
  observations, and
  the loading vectors in the P
  matrix give the coefficients for
  linear combinations of our X
  variables that result in
  score variables with
  dimension A, where the dimension
  A is much less than P. And
  this is shown in the equations on
  the left here. X can be
  predicted as a function of the
  scores and loadings, where E is
  the prediction error.
  These scores are selected to
  minimize the prediction error,
  and another way to think about
  this is that you're maximizing
  the amount of variance explained
  in the X matrix.
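The PCA relationships described here can be written out as follows. This assumes X has been centered (and typically scaled); bold P is the loading matrix and plain P is the number of process variables, following the talk's notation.

```latex
\begin{aligned}
\mathbf{X} &= \mathbf{T}\mathbf{P}^{\top} + \mathbf{E},
  \qquad \mathbf{T} = \mathbf{X}\mathbf{P},
  \qquad \mathbf{P}^{\top}\mathbf{P} = \mathbf{I}_A, \\
\mathbf{X} &\in \mathbb{R}^{N \times P},\quad
\mathbf{T} \in \mathbb{R}^{N \times A},\quad
\mathbf{P} \in \mathbb{R}^{P \times A},\quad A \ll P, \\
\lVert \mathbf{E} \rVert_F^2
  &= \lVert \mathbf{X} - \mathbf{X}\mathbf{P}\mathbf{P}^{\top} \rVert_F^2
   = \lVert \mathbf{X} \rVert_F^2 - \lVert \mathbf{X}\mathbf{P} \rVert_F^2 .
\end{aligned}
```

The last line shows why the two descriptions agree. With X fixed, making the prediction error small is the same as making the variation captured by the scores, the squared norm of XP = T, as large as possible.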
  PLS is a more suitable
  projection method when you have
  a set of process variables and a
  set of quality variables. You
  really want to ensure that the
  quality variables are kept in
  control, but these variables
  are often expensive or time
  consuming to collect. The plant
  could be making product with
  out-of-control quality for a long
  time before a fault is detected.
  PLS models allow you to
  monitor your quality variables
  as a function of your process
  variables, and you can see that
  the PLS model finds the score
  variables that are most strongly
  related to (have maximum covariance
  with) the quality variables.
  These process variables are
  often cheaper or more readily
  available, so PLS can enable you
  to detect quality faults
  early and make your process
  monitoring cheaper. From here
  on out I'm going to focus on PLS
  models because they're more
  appropriate for the example.
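A minimal sketch of how a PLS score is tied to a quality variable. It assumes a single response and a single component (PLS1), not the two-response, multi-component model in the talk, and it does not reproduce JMP's fitting algorithm. The data are made up.

```python
def center_columns(X):
    """Subtract each column's mean from a list-of-rows matrix."""
    n, k = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(k)]
    return [[row[j] - means[j] for j in range(k)] for row in X]


def center(v):
    """Subtract the mean from a vector."""
    m = sum(v) / len(v)
    return [x - m for x in v]


def pls1_first_component(X, y):
    """First PLS component for centered X (n x k) and one centered response y."""
    n, k = len(X), len(X[0])
    # w is proportional to X^T y: among unit vectors, this direction gives scores
    # t = X w with the largest covariance with y (by Cauchy-Schwarz).
    w = [sum(X[i][j] * y[i] for i in range(n)) for j in range(k)]
    norm = sum(v * v for v in w) ** 0.5
    w = [v / norm for v in w]
    t = [sum(X[i][j] * w[j] for j in range(k)) for i in range(n)]       # scores
    tt = sum(v * v for v in t)
    p = [sum(X[i][j] * t[i] for i in range(n)) / tt for j in range(k)]  # X loadings
    q = sum(y[i] * t[i] for i in range(n)) / tt                         # y loading
    return w, t, p, q


# Made-up data: three "process" columns and one "quality" response.
X_raw = [[1.0, 2.0, 0.5], [2.0, 1.5, 0.7], [3.0, 3.5, 0.2], [4.0, 3.0, 0.9], [5.0, 5.5, 0.4]]
y_raw = [1.1, 1.9, 3.2, 3.9, 5.1]

X = center_columns(X_raw)
y = center(y_raw)
w, t, p, q = pls1_first_component(X, y)
y_hat = [q * ti for ti in t]  # quality predicted from the process-variable score
print(w, y_hat)
```

Because the score t depends only on process variables, it can be computed as soon as those are measured, before the delayed quality measurement arrives.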
  So a PLS model partitions your
  data into two components. The
  first component is the model
  component. This gives the
  predicted values of your process
  variables. Another way to think
  about it is that your data has
  been projected into the model
  plane defined by your score
  variables and T squared monitors
  the variation of your data
  within this model plane.
  And the second component is the
  error component. This is the
  distance between your original
  data and the predicted data and
  squared prediction error (SPE)
  charts monitor this variation.
  Another alternative metric we
  provide is the distance to model
  X plane or DModX. This is just
  a normalized alternative to SPE
  that some people prefer.
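A PCA-style sketch of the two components for one observation, assuming orthonormal loadings and known historical score variances. In PLS the projection uses the weight and loading matrices rather than a single orthonormal matrix, so this only illustrates the geometry. DModX is left out because its exact normalization varies between software packages. The model below is hypothetical.

```python
import math


def t2_and_spe(x, loadings, score_vars):
    """T^2 and SPE for one centered observation x.

    loadings:   list of A loading vectors (each of length P), assumed orthonormal
    score_vars: historical variance of each score column
    """
    # Scores: project x onto each loading vector (t = P^T x).
    t = [sum(pj * xj for pj, xj in zip(p, x)) for p in loadings]
    # T^2: squared scores, each scaled by that score's historical variance.
    t2 = sum(ta * ta / s for ta, s in zip(t, score_vars))
    # Reconstruction inside the model plane: x_hat = P t.
    x_hat = [sum(ta * p[j] for ta, p in zip(t, loadings)) for j in range(len(x))]
    # SPE: squared distance from x to the model plane.
    spe = sum((xj - hj) ** 2 for xj, hj in zip(x, x_hat))
    return t2, spe


# Hypothetical one-component model in three variables: the plane is the line x1 = x2.
LOADINGS = [[1 / math.sqrt(2), 1 / math.sqrt(2), 0.0]]
SCORE_VARS = [2.0]  # assumed historical variance of the single score

along_model = [1.0, 1.0, 0.0]  # moves within the model plane
off_model = [1.0, -1.0, 0.0]   # breaks the x1 = x2 relationship

print(t2_and_spe(along_model, LOADINGS, SCORE_VARS))  # T^2 about 1, SPE about 0
print(t2_and_spe(off_model, LOADINGS, SCORE_VARS))    # T^2 about 0, SPE about 2
```

Variation along the model plane shows up only in T squared; variation that leaves the plane shows up only in SPE.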
  The last concept that's
  important to understand for the
  demo is the distinction between
  historical and current data.
  Historical data are typically
  collected when the process was
  known to be in control. These
  data are used to build the PLS
  model and define the normal
  process variation so that a
  control limit can be obtained.
  And current data are assigned
  scores based on the model but
  are independent of the model.
  Another way to think about this
  is that we have training and
  test sets. The T squared control
  limit is lower for the training
  data because we expect less
  variability for the
  observations used to train the
  model, whereas there's greater
  variability in T squared when
  the model generalizes to a test
  set. Fortunately, the theory
  for the distribution of T squared has
  been worked out, so we can get
  these control limits based on
  some distributional assumptions.
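For reference, these are the standard textbook T squared limits for individual observations when the mean and covariance are estimated from n in-control rows. Here A is the number of model components, α is the significance level, and B and F denote quantiles of the beta and F distributions. This assumes, without confirmation, that the platform computes its limits in this form.

```latex
\begin{aligned}
\text{historical data:}\quad
  \mathrm{UCL} &= \frac{(n-1)^2}{n}\,
  B_{1-\alpha}\!\left(\frac{A}{2},\ \frac{n-A-1}{2}\right), \\
\text{current data:}\quad
  \mathrm{UCL} &= \frac{A\,(n+1)(n-1)}{n\,(n-A)}\,
  F_{1-\alpha}\!\left(A,\ n-A\right).
\end{aligned}
```

For moderate n the historical limit comes out lower, matching the point above. As n grows, both limits approach the chi-square quantile with A degrees of freedom.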
  In the demo we'll be monitoring
  the Tennessee Eastman process.
  I'm going to present a short
  introduction to these data. This
  is a simulation of a chemical
  process developed by Downs and
  Vogel, two chemists at Eastman
  Chemical. It was originally
  written in Fortran, but there
  are wrappers for Matlab and
  Python now. I just wanted to note
  that while this data set was
  generated in the '90s, it's still
  one of the primary data sets
  used to benchmark multivariate
  control methods in the
  literature. It covers the
  main tasks of multivariate
  control well and there is
  an impressive amount of
  realism in the simulation.
  And the simulation is based on
  an industrial process that's
  still relevant today.
  So the data were manipulated
  to protect proprietary
  information. The simulated
  process is the production of
  two liquid products from
  gaseous reactants within a
  chemical plant. And F here is
  a byproduct
  that will need to be siphoned
  off from the desired product.
  That's about all I'll say about that.
  So the process diagram looks
  complicated, but it really isn't
  that bad, so I'll walk you
  through it. Gaseous
  reactants A, D, and E flow into
  the reactor here.
  The reaction occurs and the
  product leaves as a gas. It's
  then cooled and condensed into
  liquid in the condenser.
  Then a vapor liquid separator
  recycles any remaining vapor and
  sends it back to the reactor
  through a compressor, and the
  byproduct and inert chemical B
  are purged in the purge stream,
  and that's to prevent any
  accumulation. The liquid product
  is pumped through a stripper,
  where the remaining reactants
  are stripped off
  and sent back to the reactor.
  And then finally, the
  purified liquid product
  exits the process.
  The first set of variables being
  monitored are the manipulated
  variables. These look like bow
  ties in the diagram. I think
  they're actually meant to be
  valves, and the manipulated
  variables mostly control the
  flow rate through different
  streams of the process.
  And these variables can be set
  to any values within limits and
  have some Gaussian noise.
  The manipulated variables can
  be sampled at any rate,
  but we use the default
  3-minute sampling interval.
  Some examples of the manipulated
  variables are the valves that
  control the flow of reactants
  into the reactor.
  Another example is a valve
  that controls the flow of
  steam into the stripper.
  And another is a valve that
  controls the flow of coolant
  into the reactor.
  The next set of variables are
  measurement variables. These are
  shown as circles in the diagram.
  They were also sampled at three
  minute intervals. The
  difference between manipulated
  variables and measurement
  variables is that the
  measurement variables can't be
  manipulated in the simulation.
  Our quality variables will be
  the percent composition of
  two liquid products and you
  can see the analyzer
  measuring the products here.
  These variables are sampled with
  a considerable time delay, so
  we're looking at the purge
  stream instead of the exit
  stream, because these data are
  available earlier. And we'll use
  a PLS model to monitor process
  variables as a proxy for these
  variables because the process
  variables have less delay and
  a faster sampling rate.
  So that should be enough
  background on the data. In
  total there are 33 process
  variables and two quality
  variables. The process of
  collecting the variables is
  simulated with a set of
  differential equations. And this
  is just a simulation, but as you
  can see a considerable amount of
  care went into modeling this
  after a real world process. Here
  is an overview of the demo I'm
  about to show you. We will collect
  data on our process and store
  these data in a database.
  I wanted to have an example that
  was easy to share, so I'll be
  using a SQLite database, but
  the workflow is relevant to most
  types of databases since most
  support ODBC connections.
  Once JMP forms an ODBC
  connection with the database,
  JMP can periodically check for
  new observations and add them to
  a data table.
  If we have a model driven
  multivariate control chart
  report open with automatic
  recalc turned on, we have a
  mechanism for updating the
  control charts as new data come
  in. The whole process of
  adding data to a database would
  likely be running on a separate
  computer from the computer
  that's doing the monitoring. So
  I have two sessions of JMP open
  to emulate this. Both sessions
  have their own journal
  in the materials on the
  Community, and the session
  adding new simulated data to
  the database will be called
  the Streaming Session and
  the session updating the reports
  as new data come in will be
  called the Monitoring Session.
  One thing I really liked about
  the Downs and Vogel paper was
  that they didn't provide a
  single metric to evaluate the
  control of the process. I have
  a quote from the paper here:
  "We felt that the tradeoffs
  among the possible control
  strategies and techniques
  involved much more than a
  mathematical expression."
  So here are some of the goals
  they listed in their paper,
  which are relevant to our
  problem. They wanted to maintain
  the process variables at
  desired values. They wanted to
  minimize variability of product
  quality during disturbances, and
  they wanted to recover quickly
  and smoothly from disturbances.
  So we'll see how well our
  process achieves these goals
  with our monitoring methods.
  So to start off in the
  Monitoring Session journal, I'll
  show you our first data set.
  The data table contains all of
  the variables I introduced
  earlier. The first variables are
  the measurement variables, the
  second are the composition
  variables, and the third are the
  manipulated variables.
  The script up here will fit
  a PLS model. It excludes the
  last 100 rows as a test set.
  Just as a reminder,
  the model is predicting 2
  product composition
  variables as a function of
  the process variables. If
  you have JMP Pro, there
  have been some speed
  improvements to PLS
  in JMP 16.
  PLS now has a
  fast SVD option.
  You can switch back to the
  classic algorithm in the red
  triangle menu. There have
  also been a number of
  performance improvements
  under the hood, which are
  mostly relevant for data sets
  with a large number of
  observations, but that's
  common in the multivariate
  process monitoring setting.
  But PLS is not the focus of the
  talk, so I've already fit the
  model and output score columns
  and you can see them here.
  One reason that model driven
  multivariate control chart was
  designed the way it is: imagine
  you're a statistician
  and you want to share your model
  with an engineer so they can
  construct control charts. All
  you need to do is provide the
  data table with these formula
  columns. You don't need to share
  all the gory details of how you
  fit your model.
  Next, I'll provide the score
  columns to model driven
  multivariate control chart.
  Drag them to the right here.
  So on the left here you can see
  two types of control charts:
  T squared and SPE.
  There are 860 observations
  that were used to estimate the
  model and these are labeled as
  historical. And then the hundred
  that were left out as a test set
  are your current data.
  And you can see in the limit
  summaries, the number of points
  that are out of control and the
  significance level. If you
  want to change the significance
  level, you can do it up here in
  the red triangle menu.
  Because the reactor's in normal
  operating conditions, we expect
  no observations to be out of
  control, but we have a few false
  positives here because we
  haven't made any adjustments for
  multiple comparisons. It's
  uncommon to do this, as far as I
  can tell, in multivariate
  control charts. I suppose you
  have higher power to detect out
  of control signals without a
  correction. In control chart
  lingo, this means your
  out-of-control average run length
  is kept low.
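The arithmetic behind this trade-off, as a sketch. It assumes independent points and a significance level of 0.05 chosen for illustration, not necessarily the platform's default.

```python
def expected_false_alarms(alpha, n_points):
    """Expected number of in-control points flagged, assuming independent points."""
    return alpha * n_points


def prob_any_false_alarm(alpha, n_points):
    """Chance that at least one of n independent in-control points is flagged."""
    return 1 - (1 - alpha) ** n_points


def average_run_length(p_signal):
    """ARL = 1 / P(signal per point), the mean of a geometric run length."""
    return 1 / p_signal


alpha = 0.05    # assumed significance level, for illustration only
n_current = 100

print(expected_false_alarms(alpha, n_current))    # about 5 points
print(prob_any_false_alarm(alpha, n_current))     # about 0.994
print(average_run_length(alpha))                  # in-control ARL: about 20 points
print(average_run_length(alpha / n_current))      # Bonferroni-style alpha/100: about 2000
# The out-of-control ARL is 1 / power; leaving alpha uncorrected raises power
# and so shortens the time to detect a real shift.
```

A correction lengthens the in-control run length (fewer false alarms), but it also lowers power, so real faults take longer to flag.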
  So on the right here we
  also have contribution
  plots. On the Y axis are
  the observations; on the X
  axis, the variables. A
  contribution is expressed
  as a proportion.
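One common way to define a contribution proportion, shown here for SPE only: each variable's squared residual as a share of the observation's total SPE. Contributions to T squared have several competing definitions, and this sketch does not claim to match the platform's. The observation and its reconstruction are made up.

```python
def spe_contribution_proportions(x, x_hat, names):
    """Per-variable share of SPE for one observation: e_j^2 / sum_k e_k^2."""
    sq = [(xj - hj) ** 2 for xj, hj in zip(x, x_hat)]
    total = sum(sq)
    if total == 0:
        return {name: 0.0 for name in names}
    return {name: s / total for name, s in zip(names, sq)}


# Made-up observation and its model reconstruction.
names = ["v1", "v2", "v3", "v4"]
x = [0.5, 2.0, -0.3, 0.1]
x_hat = [0.4, 0.5, -0.2, 0.1]

props = spe_contribution_proportions(x, x_hat, names)
for name, share in sorted(props.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {share:.3f}")
```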
  And then at the bottom here,
  we have score plots. Right
  now I'm plotting the first
  score dimension versus the
  second score dimension, but
  you can look at any
  combination of score
  dimensions using these
  dropdown menus or the arrows.
  OK, so I think we're oriented
  to the report. I'm going to
  now switch over to the
  scripts I've used to stream
  data into the database that
  the report is monitoring.
  In order to run this
  example, you'll need to have a
  SQLite ODBC driver installed
  on your computer. This is much easier
  to do on a Windows computer,
  which is what you're often using
  when actually connecting to a
  database. The process on the Mac
  is more involved, but I put some
  instructions on the Community
  page. And then I don't have time
  to talk about this, but I
  created the SQLite database
  I'll be using in JMP and I
  plan to put some instructions
  on how to do this on the
  Community Web page. And hopefully
  that example is helpful to you
  if you're trying to do this with
  data on your own.
  Next I'm going to show
  you the files that I put
  in the SQLite database.
  Here I have the historical data.
  This was used to construct
  the PLS model. There are 960
  observations that are in
  control. Then I have the
  monitoring data, which at first
  just contains the historical
  data, but I'll gradually add new
  data to this. This is the data
  that the multivariate control
  chart will be monitoring.
  And then I've simulated new
  data already and added it to the
  data table here. These are
  another 960 or so measurements
  where a fault is introduced at
  some time point. I wanted to
  have something that was easy to
  share, so I'm not going to run
  my simulation script and add to
  the database that way. We're
  just going to take observations
  from this new data table and
  move them over to the monitoring
  data table using some JSL and
  SQL statements. This is just an
  example emulating the process
  of new data coming into a
  database. You might not
  actually do this with JMP, but
  this was an opportunity to show
  how you can do it with JSL.
  Clean up here.
  And next I'll show you this
  streaming script. This is a
  simple script, so I'm going to
  walk you through it real quick.
  This first set of
  commands will open the
  new data table, which
  is in the SQLite database.
  It opens the table in the
  background so I don't have to
  deal with the window.
  Then I'm going to take pieces
  from this data table and add
  them to the monitoring data
  table. I call the pieces
  bites and the bite size is 20.
  And then this next command will
  connect to the database. This
  will allow me to send the
  database SQL statements.
  And then this next bit
  of code is
  iteratively sending SQL
  statements that insert new
  data into the monitoring data.
  And I'm going to
  initialize K and show you the
  first iteration of this.
  This is a simple SQL
  INSERT INTO statement that
  inserts the first 20
  observations into the data
  table. This print statement is
  commented out so that the code
  runs faster, and then I also
  have a wait statement to slow
  things down slightly so that
  we can see the progression
  in the control chart.
  And this would just go too fast
  if I didn't slow it down.
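The streaming script in the talk is written in JSL. Below is a rough Python analogue that uses the standard-library sqlite3 module instead of an ODBC connection, against an in-memory database. The table names new_data and monitoring and their columns are hypothetical stand-ins. In the talk the monitoring table starts out holding the historical data; here it starts empty to keep the sketch short.

```python
import sqlite3
import time

BITE_SIZE = 20  # rows copied per INSERT, matching the bite size in the talk


def make_demo_db(n_new_rows=95):
    """In-memory stand-in for the SQLite file; table and column names are hypothetical."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE new_data (t INTEGER, x1 REAL, x2 REAL)")
    conn.execute("CREATE TABLE monitoring (t INTEGER, x1 REAL, x2 REAL)")
    conn.executemany(
        "INSERT INTO new_data VALUES (?, ?, ?)",
        [(i, 0.1 * i, 0.2 * i) for i in range(n_new_rows)],
    )
    conn.commit()
    return conn


def stream_new_data(conn, bite_size=BITE_SIZE, pause=0.0):
    """Copy new_data into monitoring one bite at a time, pausing between bites."""
    total = conn.execute("SELECT COUNT(*) FROM new_data").fetchone()[0]
    for start in range(0, total, bite_size):
        # rowid is 1-based and gap-free here because new_data is only appended to.
        conn.execute(
            "INSERT INTO monitoring SELECT * FROM new_data "
            "WHERE rowid BETWEEN ? AND ? ORDER BY rowid",
            (start + 1, start + bite_size),
        )
        conn.commit()
        time.sleep(pause)  # a short wait so each update is visible in the chart


if __name__ == "__main__":
    db = make_demo_db()
    stream_new_data(db, pause=0.1)
    print(db.execute("SELECT COUNT(*) FROM monitoring").fetchone()[0])
```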
  So next I'm going to move
  over to the Monitoring Session
  to show you the scripts
  that will update the report
  as new data come in.
  This first script is a simple
  script that will check the
  database every 0.2 seconds for
  new observations and add them
  to the JMP table. Since the
  report has automatic recalc
  turned on, the report will update
  whenever new data are added. And
  I should add that
  you probably wouldn't use a
  script that just iterates like
  this. You'd probably use Task
  Scheduler in Windows or
  Automator on Mac to better
  schedule runs of the script.
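A Python analogue of the polling loop, again using sqlite3 and the hypothetical monitoring table rather than JMP and ODBC. It remembers the highest rowid it has already seen, which assumes rows are only ever appended. In practice a scheduler would call fetch_new_rows instead of a loop like this.

```python
import sqlite3
import time


def fetch_new_rows(conn, last_rowid):
    """Return rows added to monitoring since last_rowid, plus the new high-water mark."""
    rows = conn.execute(
        "SELECT rowid, * FROM monitoring WHERE rowid > ? ORDER BY rowid",
        (last_rowid,),
    ).fetchall()
    return rows, (rows[-1][0] if rows else last_rowid)


def poll(conn, on_new_rows, interval=0.2, max_polls=10):
    """Check for new rows every `interval` seconds and hand each batch to on_new_rows."""
    last_rowid = 0
    for _ in range(max_polls):
        rows, last_rowid = fetch_new_rows(conn, last_rowid)
        if rows:
            on_new_rows(rows)  # e.g., append to the table behind the control chart
        time.sleep(interval)
    return last_rowid


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE monitoring (t INTEGER, x1 REAL)")
    conn.executemany("INSERT INTO monitoring VALUES (?, ?)", [(i, 0.5 * i) for i in range(3)])
    poll(conn, lambda rows: print(f"{len(rows)} new rows"), interval=0.2, max_polls=2)
```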
  And then there's also another
  script that will
  push the report to JMP Public
  whenever the report is updated,
  and I was really excited that
  this is possible with JMP 15.
  It enables any computer with a
  web browser to view updates to
  the control chart. You
  can even view the report on
  your smartphone, so this makes
  it really easy to share
  results across organizations.
  And you can also use JMP Live
  if you wanted the reports to
  be on a restricted server.
  I'm not going to have time
  to go into this in this
  demo, but you can check out
  my Discovery Americas talk.
  Then finally down here, there is
  a script that recreates the
  historical data in the data
  table if you want to run the
  example multiple times.
  Alright, so next, after making sure
  that we have the historical data,
  I'm going to run the
  streaming script and see
  how the report updates.
  So the data is in control at
  first and then a fault is
  introduced, but there's a
  plantwide control system
  that's implemented in the
  simulation, and you can see
  how the control system
  eventually brings the process
  to a new equilibrium.
  Wait for it to finish here.
  So if we zoom in, it
  seems like the process first
  went out of control around this
  time point, so I'm going to
  color it and
  label it so that it will
  show up in other plots.
  And then in the SPE plot,
  it looks like this
  observation is also out of
  control but only slightly.
  And then if we zoom in on
  the time point in the
  contribution plots, you can
  see that there are many
  variables contributing to
  the out of control signal at
  first. But then once the
  process reaches a new
  equilibrium, there are only
  two large contributors.
  So I'm going to remove the heat
  maps now to clean up a bit.
  You can hover over
  the point at which the process
  first went out of control and
  get a peek at the top ten
  contributing variables. This
  is great for giving you a
  quick overview of which variables
  are contributing most to the
  out of control signal.
  And then if I click on the plot,
  this will be appended to the
  fault diagnosis section.
  And as you can see, there are
  several variables with large
  contributions, and I've sorted
  on the contribution.
  And for variables with
  red bars, the observation is
  out of control in the univariate
  control charts. You can see
  this by hovering over one of
  the bars; these graphlets
  are IR charts for an
  individual variable with a
  three sigma control limit.
  You can see in the stripper
  pressure variable that the
  observation is out of
  control, but eventually the
  process is brought back under
  control. And this is the case
  for the other top
  contributors. I'll also show
  you the univariate control
  chart for one of the variables
  where the observation is in control.
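For reference, this is the standard individuals-chart recipe: the center line is the mean, sigma is estimated from the average moving range of span 2 divided by d2 = 1.128, and the limits sit at three sigma. The graphlets may compute their limits the same way, but that is not confirmed here, and the readings below are made up.

```python
D2_N2 = 1.128  # d2 bias-correction constant for moving ranges of span 2


def ir_limits(values, k=3.0):
    """Individuals-chart limits: mean +/- k * (average moving range / d2)."""
    center = sum(values) / len(values)
    moving_ranges = [abs(b - a) for a, b in zip(values, values[1:])]
    mr_bar = sum(moving_ranges) / len(moving_ranges)
    sigma_hat = mr_bar / D2_N2
    return center - k * sigma_hat, center, center + k * sigma_hat


def out_of_control(values, limits):
    """Indices of values falling outside (LCL, UCL)."""
    lcl, _, ucl = limits
    return [i for i, v in enumerate(values) if v < lcl or v > ucl]


history = [10.0, 12.0, 11.0, 13.0, 12.0]  # made-up in-control readings
limits = ir_limits(history)
print(limits)
print(out_of_control([11.5, 16.2, 12.0], limits))  # index 1 falls above the UCL
```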
  So there are many variables out
  of control in the process at
  the beginning, but the
  process eventually reaches
  a new equilibrium.
  To see the variables that
  contribute most to the shift in
  the process, we can use the mean
  contribution proportion plots.
  These plots show the average
  contribution that the variables
  have to T squared for the group
  I've selected. Here I'll
  sort on these.
  The only two variables with
  large contributions measure the
  rate of flow of reactant A in
  stream one, which is the flow of
  this reactant into the reactor.
  Both of these variables are
  measuring essentially the
  same thing, except one is a
  measurement variable and the
  other is a manipulated variable.
  You can see that there is a
  large step change in the flow
  rate, which is what I programmed
  in the simulation. So these
  contribution plots allow you to
  quickly identify the root cause.
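A sketch of one plausible reading of a mean contribution proportion: average each variable's per-observation proportion over the selected rows, then sort. The variable names and proportions are hypothetical, and the platform's exact computation is not confirmed here.

```python
def mean_contribution_proportions(rows, names):
    """Average each variable's contribution proportion over the selected rows.

    rows: list of per-observation contribution proportions (each row sums to 1)
    """
    n = len(rows)
    means = {name: sum(r[j] for r in rows) / n for j, name in enumerate(names)}
    return sorted(means.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical variables and made-up proportions for three selected observations.
names = ["a_feed_measured", "a_feed_valve", "reactor_temp", "stripper_pressure"]
selected = [
    [0.45, 0.40, 0.10, 0.05],
    [0.50, 0.42, 0.05, 0.03],
    [0.48, 0.44, 0.04, 0.04],
]
for name, share in mean_contribution_proportions(selected, names):
    print(f"{name}: {share:.3f}")
```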
  And then in my previous talk I
  showed many other ways to
  visualize and diagnose faults
  using tools in the score plot.
  This includes plotting the
  loadings on the score plots and
  doing some group comparisons.
  You can check out my Discovery
  Americas talk on the JMP
  Community for that. Instead, I'm
  going to spend the rest of this
  time introducing a few new
  examples, which I put on the
  Community page for this talk.
  There are 20 programmable faults
  in the Tennessee Eastman process
  and they can be introduced in any
  combination. I provided two other
  representative faults here. Fault
  1 that I showed previously was
  easy to detect because the out
  of control signal is so large
  and so many variables are
  involved. The focus of the
  previous demo was to show how to
  use the tools to identify
  faults out of a large number of
  variables, not necessarily to
  benchmark the methods.
  Fault 4, on the other hand,
  is a more subtle fault,
  and I'll show it to you here.
  The fault that's programmed
  is a sudden increase in the
  temperature of the reactor cooling water.
  And this is compensated for by
  the control system by increasing
  the flow rate of coolant.
  And you can see that
  variable picked up here and
  you can see the shift in
  contribution plots.
  And then you can also see
  that most other variables
  aren't affected
  by the fault. You can see that a
  spike in the temperature here
  is quickly brought back under
  control. Because most other
  variables aren't affected, this
  is hard to detect for some
  multivariate control methods.
  And it can be more
  difficult to diagnose.
  The last fault I'll show you
  is Fault 11.
  Like Fault 4, it also involves
  the flow of coolant into the
  reactor, except now the fault
  introduces large oscillations in
  the flow rate, which we can
  see in the univariate control
  chart. And this results in a
  fluctuation of reactor
  temperature. The other
  variables aren't really
  affected again, so this can be
  harder to detect for some
  methods. Some multivariate
  control methods can pick up on
  Fault 4, but not Fault 11 or
  vice versa. But our method was
  able to pick up on both.
  And then finally, all the
  examples I created using the
  Tennessee Eastman process had
  faults that were apparent in
  both T squared and SPE plots. To
  show some newer features in
  model driven multivariate
  control chart, I wanted to show
  an example of a fault that
  appears in the SPE chart but not
  T squared. And to find a good
  example of this, I revisited a
  data set which Jianfeng Ding
  presented in an earlier talk, and
  I provided a link to her talk
  in this journal.
  On her Community page,
  she provides several
  useful examples that are
  also worth checking out.
  This is a data set from Kourti and
  MacGregor's classic paper on
  multivariate control charts. The
  data are process variables
  measured in a reactor producing
  polyethylene, and you can find
  more background in Jianfeng's
  talk. In this example, we
  have a process that went out of
  control. Let me show you this.
  And it goes out of control
  earlier in the SPE chart than in
  the T squared chart.
  And if we look at the mean
  contribution plots for SPE,
  you can
  see that there is one variable
  with a large contribution, and it
  also shows a large shift in the
  univariate control chart, but
  there are also other variables
  with large contributions
  that are still in control in the
  univariate control charts.
  And it's difficult to determine from
  the bar charts alone why these
  variables had such large
  contributions. Large SPE values
  happen when new data don't
  follow the correlation structure
  of the historical data, which
  can happen as new data are
  collected, and this means that
  the PLS model you trained is
  no longer applicable.
  From the bar charts, it's hard
  to know which pairs of variables
  have had their correlation structure
  broken. So new in JMP 15.2, you
  can launch scatterplot matrices.
  And it's clear in the
  scatterplot matrix that the
  violation of correlations
  with Z2 is what's driving
  these large contributions.
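The scatterplot matrix is a visual check. A rough numeric counterpart is to compare each pair's correlation in the historical rows with the same pair's correlation in the new rows and rank pairs by the change. The variables and values below are made up.

```python
from itertools import combinations


def pearson(a, b):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    saa = sum((x - ma) ** 2 for x in a)
    sbb = sum((y - mb) ** 2 for y in b)
    return sab / (saa * sbb) ** 0.5


def correlation_shifts(historical, recent):
    """Rank variable pairs by how much their correlation changed between data sets."""
    out = []
    for u, v in combinations(sorted(historical), 2):
        r_old = pearson(historical[u], historical[v])
        r_new = pearson(recent[u], recent[v])
        out.append(((u, v), r_old, r_new, abs(r_new - r_old)))
    return sorted(out, key=lambda row: row[3], reverse=True)


# Made-up data: v1 and v2 move together historically but opposite in the recent rows.
historical = {"v1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
              "v2": [2.0, 4.0, 6.0, 8.0, 10.0, 12.0],
              "v3": [5.0, 3.0, 6.0, 2.0, 7.0, 4.0]}
recent = {"v1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
          "v2": [12.0, 10.0, 8.0, 6.0, 4.0, 2.0],
          "v3": [5.0, 3.0, 6.0, 2.0, 7.0, 4.0]}

for pair, r_old, r_new, change in correlation_shifts(historical, recent):
    print(pair, f"{r_old:+.2f} -> {r_new:+.2f} (change {change:.2f})")
```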
  OK, I'm gonna switch back
  to the PowerPoint.
  And real quick, I'll summarize
  the key features of model driven
  multivariate control chart that
  were shown in the demo. The
  platform is capable of
  performing both online fault
  detection and offline fault
  diagnosis. There are many
  methods provided in the platform
  for drilling down to the root
  cause of faults. I'm showing you
  here some plots from a popular
  book, Fault Detection and
  Diagnosis in Industrial Systems.
  Throughout the book, the authors
  demonstrate how one needs to
  use multivariate and univariate
  control charts side by side
  to get a sense of what's going
  on in a process.
  And one particularly useful
  feature in model driven multivariate
  control chart is how
  interactive and user friendly
  it is to switch between these
  two types of charts.
  And that's my talk. Here is
  my email if you have any
  further questions. And
  thanks to everyone that
  tuned in to watch this.