Maximizing Data Science Success with Information Quality (InfoQ) and JMP® (2021-EU-45MP-750)

https://www.jmp.com/en_us/whitepapers/book-chapters/infoq-support-with-jmp.html 

Level: Intermediate

 

Ron Kenett, Chairman, KPA Group and Samuel Neaman Institute, Technion
Christopher Gotwalt, JMP Director, Statistical R&D, SAS

 

Data analysis – from designed experiments and generalized regression to machine learning – is being deployed at an accelerating rate. At the same time, concerns with reproducibility of findings and flawed p-value interpretation indicate that well-intentioned statistical analyses can lead to mistaken conclusions and bad decisions. For a data analysis project to properly fulfill its goals, one must assess the scope and strength of the conclusions derived from the data and tools available. This focus on statistical strategy requires a framework that isolates the components of the project: the goals, data collection procedure, data properties, the analysis provided, etc. The InfoQ Framework provides structured procedures for making this assessment. Moreover, it is easy to operationalize InfoQ in JMP. In this presentation we provide an overview of InfoQ along with case studies, drawing from consumer research and pharmaceutical manufacturing, to illustrate how JMP can be used to make an InfoQ assessment, highlighting situations of both high and low InfoQ. We also give tips showing how JMP can be used to increase information quality by enhanced study design, without necessarily acquiring more data. This talk is aimed at statisticians, machine learning experts and data scientists whose job it is to turn numbers into information.

 

 

Auto-generated transcript...

 



  Hello.
  My name is Ron Kenett. This
  is a joint talk with Chris
  Gotwalt and we basically
  have two messages
  that should come out of the
  talk. One is that we should
  really be concerned about
  information and information
  quality. People tend to talk
  about data and data quality, but
  data is really not the issue.
  We are statisticians. We are
  data scientists. We turn numbers, that is data, into information, so our goal should be to make sure
  that we generate high quality
  information. The other message
  is that JMP can help you
  achieve that, and it turns out to do so in surprising ways. So by combining
  the expertise of Chris and an
  introduction to information
  quality, we hope that these two
  messages will come across
  clearly. So if I had to
  summarize what it is that we
  want to talk about, after all,
  it is all about information
  quality. I gave a talk at the Summit in Prague four years ago, and that talk was generic; it described my journey from quality by design to information quality. In this talk we focus
  on how this can be done with
  JMP. This is a more detailed
  and technical talk than the
  general talk I gave in Prague.
  You can watch that talk.
  There's a link listed here. You can
  find it on the JMP community.
  So we're going to talk about
  information quality, which is
  the potential of a specific data set to achieve a specific goal with a given empirical analysis method.
  So in that definition we have
  three components that are
  listed. One is a certain data
  set. Here is the data. The
  second one is the goal,
  the goal of the analysis, what
  it is that we want to achieve.
  And the third one is how we will
  do that, which is, with what
  methods we're going to generate
  information, and that potential
  is going to be assessed with the
  utility function. And I will
  begin with an introduction to
  information quality, and then
  Chris will take over, discuss the case study, and show you how to conduct an information quality assessment. Eventually this should answer the question of how JMP supports InfoQ; those are the takeaway points from the talk. The setup for this is that we encourage what I called a lifecycle view
  of statistics. In other words,
  not just data analysis.
  We should be part of the problem elicitation
  phase. Also, the goal
  formulation phase, that deserves
  a discussion. We should
  obviously be involved in the
  data collection scheme if it's
  through experiments or through
  surveys or through observational
  data. Then we should also take
  time for formulation of the
  findings and not just print out reports on regression coefficient estimates and their significance, but we should also discuss: what are the
  findings? Operationalization of
  findings meaning, OK, what can we
  do with these findings? What are
  the implications of the
  findings? This needs to be communicated to the
  right people in the right way,
  and eventually we should do an
  impact assessment to figure out,
  OK, we did all this; what has
  been the added value of our
  work? I talked about this lifecycle view of statistics a few years ago. It is the prerequisite, the perspective, for what I'm going to talk about. So as I
  mentioned, the information
  quality is the potential of a
  particular data set to achieve a
  particular goal using given
  empirical analysis methods. This
  is identified through four components: the goal, the data, the analysis method, and the utility measure. In mathematical terms, the utility of applying the analysis f to the data X, conditioned on the goal g, is how we define InfoQ, information quality.
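  In the notation used in the published InfoQ work, with data X, empirical analysis method f, goal g, and utility measure U, that definition can be written as

\[ \mathrm{InfoQ}(f, X, g) = U\big(f(X \mid g)\big) \]

  that is, the utility of applying the analysis f to the data X, conditioned on the goal g.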
  This was published in the Journal of the Royal Statistical Society, Series A, in 2013 with eight discussants, so it was amply discussed. Some people thought
  this was fantastic and some
  people had a lot of critique on
  that idea, so this is a wider
  scope consideration of what
  statistics is about. We also
  wrote in 2016, we meaning myself and Galit Shmueli, a book called
  Information Quality. And in
  the context of information
  quality we did what is called
  deconstruction. David Hand has
  a paper called Deconstruction
  of Statistics. This is the
  deconstruction of information
  quality into eight dimensions. I
  will cover these eight dimensions.
  That's my part in the talk and
  then Chris will show how this
  is implemented in a specific
  case study.
  Another aspect that relates to
  this is another book I have.
  This is recent, from a year ago, titled The Real Work of Data Science, and there we talk about the role of the data scientist in organizations. In that context, we emphasized the need for the data scientist to be involved in the generation of information, with information quality defined as meeting the goals of the organization. So let me cover the eight dimensions; that's my intro. The
  first one is data resolution. We
  have a goal. OK, we would like to know the level of flu in the country or in the area where we live, because that will impact our decision on whether to go to the park, where we could meet people, or to go to a jazz concert. And that
  concert is tomorrow.
  If we look up the CDC data on
  the level of flu, that data is
  updated weekly, so we could get
  the red line in the graph you
  have in front of you, so we
  could get data from a few days ago, maybe good, maybe not good enough for our goal. Google Flu, which is based on searches related to flu, is updated continuously online, so it will probably give us better information. So for
  our goal, the blue line, the
  Google trend...the Google Flu
  Trends indicator, is probably
  more appropriate. The second
  dimension is data structure.
  To meet our goal, we're going to
  look at data. We should...we
  should identify the data sources
  and the structure in these data
  sources. So some data could be
  text, some could be video, some
  could be, for example, the
  network of recommendations. This is an Amazon picture showing how, if you look at a book, you're going to get some other books recommended. And if you go to these other books, you're going to have more books recommended. So the data
  structure can come in all sorts
  of shapes and forms and this can
  be text. This can be functional
  data. This can be images. We are
  not confined now to what we used
  to call data, which is what you
  find in an Excel spreadsheet.
  The data could be corrupted,
  could have missing values, could have unusual patterns which would be something to look into, for example patterns where things are repeated. Maybe some of the data is just copy and paste, and we would like to be warned about such issues. The third
  dimension is data integration.
  When we consider the data from
  these different sources, we're
  going to integrate them so we
  can do some analysis; linkage through an ID, for example. We would do that, but in doing that, we might create some issues, for example in disclosing data that normally should be anonymized. Data integration, yes, that will
  allow us to do fantastic things,
  but if the data is perceived to
  have some privacy exposure
  issues, then maybe the quality
  of the information from the analysis that we're going to do is going to be affected. So data integration should be looked into very, very carefully. This is what people used to call ETL: extract, transform and load. We now have much better methods for doing that; the Join option in JMP, for example, offers options for doing that.
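  Outside JMP, the same ID-based linkage idea looks roughly like the sketch below; the tables, columns, and values are made up purely for illustration.

import pandas as pd

# Hypothetical example of ID-based integration, the kind of join that
# JMP's Tables > Join dialog performs interactively.
surveys = pd.DataFrame({
    "customer_id": [101, 102, 103],
    "satisfaction": [4, 5, 3],
})
purchases = pd.DataFrame({
    "customer_id": [101, 101, 103, 104],
    "amount": [25.0, 40.0, 15.0, 60.0],
})

# Left join keeps every survey row; purchases without a matching survey are dropped.
linked = surveys.merge(purchases, on="customer_id", how="left")
print(linked)

# The richer the linked table, the more useful the analysis, but also the
# easier re-identification becomes, which is the privacy concern mentioned above.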
  Temporal relevance. OK, that's
  pretty clear. We have data. It is
  stamped somehow. If we're going to do the analysis later, after the data collection, and if the deployment that we consider is even later, then the data might not be temporally relevant. In a common
  situation, if we want to compare
  what is going on now, we would
  like to be able to make this
  comparison to recent data or
  data before the pandemic
  started, but not 10 years
  earlier. The official statistics
  on health records used to be two
  or three years behind in terms of timing, which made it very difficult to use official statistics in assessing what is going on with the
  pandemic. Chronology of data and
  goal is related to the decision
  that we make as a result of our
  goal. So if, for example, our
  goal is to forecast air quality,
  we're going to use some
  predictive models on the Air
  Quality Index reported on a
  daily basis. This gives us a one
  to six scale from hazardous to
  good. There are value ranges representing levels of health concern: 0-50 is good; 301-500 is hazardous. And the
  chronology of data and goal
  means that we should be able to
  make a forecast on a daily
  basis. So the methods we use
  should be updated on a daily
  basis. If, on the other hand,
  our goal is to figure out how this AQI index is computed, then we are not really bound by the timeliness of the analysis.
  You know, we could take our
  time. There's no urgency in
  getting the analysis done on a
  daily basis. Generalizability,
  the sixth dimension, is about
  taking our findings and
  considering where this could
  apply in more general terms, to other populations, other situations. This can be done intuitively. Marketing managers who have seen a study on one market, let's call it A, might already understand what the implications are for market B, without data. People who are physicists will be able to make predictions based on mechanics, on first principles, without data.
  So some of the generalizability
  is done with data. This is the
  basis of statistical
  generalization, where we go from
  the sample to the population.
  Statistical inference is about
  generalization. We generalize
  from the sample to the
  population. And some can be
  domain based, in other words,
  using expert knowledge, domain
  expertise, not necessarily with
  data. We have to recognize that
  generalizability is not just
  done with statistics.
  The seventh dimension is
  construct operationalization,
  which is really about what it is
  that we measure. We want to
  assess behavior, emotions, what
  it is that we can measure, that
  will give us data that reflects
  behavior or emotions.
  The example I give here
  typically is pain.
  We know what pain is. How do we measure it? If you go to a
  hospital and you ask the nurse,
  how do you assess pain, they
  will tell you, we have a scale,
  1 to 10. It's very
  qualitative, not very
  scientific, I would say. If we
  want to measure the level of alcohol in drivers on the road, it will be difficult to measure. So we might measure speed as a surrogate measure.
  Another part of
  operationalization is the other
  end of the story. In other words, the first part, the construct, is what we measure, which reflects our goal. The end result here is that
  we have findings and we want to
  do something with them. We want
  to operationalize our finding.
  This is what the action
  operationalization is about.
  It's about what you do with the findings. When presenting on a podium, we used to ask three questions. These are very important questions to ask. Once you have done some analysis, you have someone in front of you who says, oh, thank you very much, you're done, you, the statistician or the data scientist. So this takes you one extra step, getting you to ask your customer these simple questions: What do you want to accomplish? How will you do that? And how will you know if you have accomplished it? We can help answer, or at least support answering, some of these questions.
  The eighth dimension is
  communication. I'm giving you an
  example from a very famous old
  map from the 19th century, which
  is showing the march of the
  Napoleon Army from France to Moscow in Russia. The width of the path indicates the size of the army, and then in black you see what happened to them on their way back. So basically this was a disastrous march. We can relate this old map to existing maps, and there is a JMP add-in, which you can find on the JMP Community, to show you with dynamic bubble plots what this looks like. So I've covered
  very quickly the eight information
  quality dimensions. My last slide puts what I've talked about in a historical perspective, to really give some proportion to what I'm saying.
  I think we are really in the era
  of information quality. We used
  to be concerned with product
  quality in the 18th century, the
  17th century. We then moved to
  process quality and service
  quality. This is a short memo
  on proposing a control chart,
  1924, I think.
  Then we move to management quality. This is the Juran trilogy of design, improvement and control. The Six Sigma define, measure, analyze, improve, control (DMAIC) process is the improvement process of Juran, and Juran was the grandfather of Six Sigma in that sense.
  Then in the '80s, Taguchi came
  in. He talks about robust
  design. How can we handle
  variability in inputs by proper
  design decisions? And now we are
  in the Age of information
  quality. We have sensors. We
  have flexible systems. We are
  depending on AI and machine
  learning and data mining, and we are gathering big, big numbers, which we call big data. The interest in information quality should be a prime interest. I'm going to try to convince you, with the help of Chris, that we are here, and that JMP can help us achieve that in a really unusual way.
  What you will see at the end of the case study that Chris will show is also how to do an information quality assessment on a specific study and basically generate an information quality score. So if we go top down, I can tell you this study, this work, this analysis is maybe 80%, or maybe 30%, or maybe 95%.
  And through the example you will
  see how to do that. There is a
  JMP add-in to provide this
  assessment. It's actually quite easy; there's nothing really sophisticated about it. So I'm done and Chris, after you. Thanks, Ron. So
  now I'm going to go through the
  analysis of a data set in a way
  that explicitly calls out the
  various aspects of information
  quality and show how JMP can be used to assess and improve InfoQ. So first off, I'm
  going to go through the InfoQ
  components. The first InfoQ
  component is the goal, so in
  this case the problem statement
  was that a chemical company
  wanted a formulation that
  maximized product yield while
  minimizing a nuisance impurity
  that resulted from the reaction.
  So that was the high level goal.
  In statistical terms, we wanted
  to find a model that accurately
  predicted a response on a data
  set so that we could find a
  combination of ingredients and
  processing steps that would lead
  to a better product.
  The data are set up as 100 experimental formulations with one primary ingredient, X1, and 10 additives. There's also a processing factor and 13 responses. The data are completely fabricated but were simulated to illustrate the same strengths and weaknesses as the original data. The day each formulation was made was also recorded. We will be looking at
  this data closely, so I won't elaborate beyond pointing out that they were collected in an
  ad hoc way, changing one or two
  additives at a time rather than
  as a designed or randomized
  experiment. There's a lot of
  ways to analyze this data, the
  most typical being least
  squares modeling with forward
  selection on selected responses.
  That was my original intention
  for this talk, but when I showed
  the data to Ron, he immediately
  recognized the response columns
  as time series from analytical
  chemistry. Even though the data
  were simulated, he could see the
  structure. He could see things
  in the data that I didn't see and hadn't read into it. I found
  this to be strikingly
  impressive. It's beyond the
  scope of this talk, but there is
  an even better approach based on
  ensemble modeling using
  fractionally weighted
  bootstrapping. Phil Ramsey,
  Wayne Levin and I have another
  talk about this methodology at
  the European Discovery
  Conference this year. The
  approach is promising because it
  can fit models to data with
  more active interactions than
  there are runs. The fourth and final component of information quality is utility, which is how well we are able to achieve our goals, or how we measure how well we've achieved our goals. There's a
  domain aspect, which in this case is that we want to have a formulation that leads to maximized yield and minimized waste in post-processing of the material. The statistical
  analysis utility refers to the
  model that we fit. What we're going for there is least squares accuracy of our model, in terms of how well we're able to predict what would result from candidate combinations of the mixture factors. Now I'm going
  to go through a set of questions
  that make up a detailed InfoQ
  assessment as organized into the
  eight dimensions of information
  quality. I want to point out
  that not all questions will be
  equally relevant to different
  data science and statistical
  projects, and that this is not
  intended to be rigid dogma but
  rather a set of things that are
  a good idea to ask oneself.
  These questions represent a kind
  of data analytic wisdom that
  looks more broadly than just the
  application of a particular
  statistical technology. A copy
  of a spreadsheet with these
  questions along with pointers to
  JMP features that are the most
  useful for answering a
  particular one will be uploaded
  to the JMP Community along
  with this talk for you to use. As
  I proceed through the questions,
  I'll be demoing an analysis of
  the data in JMP. So Question 1 is: is the data scale used aligned with the stated goal? The Xs that we have consist of a single categorical variable, processing, and 11 continuous inputs. These are measured
  as percentages and are also
  recorded to half a percent. We
  don't have the total amounts of
  the ingredients, only the
  percentages. The totals are
  information that was either lost
  or never recorded. There are
  other potentially important
  pieces of information that are
  missing here. The time between
  formulating the batches and
  taking the measurements is gone
  and there could have been other
  covariate level information that
  is missing here that would have
  described the conditions under
  which the reaction occurred.
  Without more information than I
  have, I cannot say how important
  this kind of covariate information
  would have been. We do have
  information on the day of the
  batch, so that could be used as
  a surrogate possibly. Overall we
  have what are, hopefully, the most
  important inputs, as well as
  measurements of the responses we
  wish to optimize. We could have
  had more information, but this
  looks promising enough to try an analysis with. The second
  question related to data resolution is: how reliable and precise are the measuring devices and data sources? And the fact
  is, we don't have a lot of
  specific information here. A statistician internal to the company would have had more information. In this case we
  have no choice but to trust that
  the chemists formulated and
  recorded the mixtures well. The
  third question relative to data resolution is: is the data analysis suitable for the data aggregation level? And the
  answer here is yes, assuming
  that their measurement system is
  accurate and that the data are
  clean enough. What we're going
  to end up doing actually is
  we're going to use the
  Functional Data Explorer to
  extract functional principal
  components, which are a data
  derived kind of data
  aggregation. And then we're
  going to be modeling those
  functional principal components
  using the input variables. So
  now we move on to the data
  structure dimension and the
  first question we ask is, is the
  data used aligned with the
  stated goal? And I think the
  answer is a clear yes here. We're
  trying to maximize
  yield. We've got measurements for
  that, and the inputs are
  recorded as Xs. The second data
  structure question is where
  things really start to get
  interesting for me. So this is: are the integrity details
  (outliers, missing values, data
  corruption) issues described and
  handled appropriately? So from
  here we can use JMP to be able
  to understand where the outliers
  are, figure out strategies for
  what to do about missing values,
  observe their patterns and so
  on. So this is where things are going to get a little bit more interesting. The first
  thing we're going to do is we're
  going to determine if there are
  any outliers in the data that we
  need to be concerned about. So
  to do that, we're going to go
  into the explore outliers
  platform off of the screening
  menu. We're going to load up the
  response variables, and because
  this is a multivariate setting,
  we're going to use a new feature
  in JMP Pro 16 called Robust
  PCA Outliers. So we see where
  the large residuals are in those
  kind of Pareto-type plots. There's a snapshot showing where there are some potentially unusually large observations. I
  don't really think this looks
  too unusual or worrisome to me.
  We can save the large outliers
  to a data table and then look at
  them in the distribution
  platform and what we see kind of
  looks like a normal distribution
  with the middle taken out. So I think these data are coming from the same population, and there's nothing really to worry about here, outliers-wise.
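  For readers outside JMP Pro: Robust PCA Outliers is its own algorithm, but the same kind of multivariate outlier screen can be sketched with a robust Mahalanobis distance. The data below are made up, and this is only a stand-in for the platform, not a reimplementation of it.

import numpy as np
from scipy.stats import chi2
from sklearn.covariance import MinCovDet

rng = np.random.default_rng(1)
Y = rng.normal(size=(100, 13))           # stand-in for the 13 response columns
Y[5] += 8                                # plant one gross multivariate outlier

# Robust location/scatter (minimum covariance determinant), then robust
# Mahalanobis distances; unusually large distances flag potential outliers.
mcd = MinCovDet(random_state=0).fit(Y)
d2 = mcd.mahalanobis(Y)
cutoff = chi2.ppf(0.999, df=Y.shape[1])
print(np.where(d2 > cutoff)[0])          # row indices worth inspecting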
  So once we've taken care of the outlier situation, we go in and explore
  missing values. So what we're
  going to do first is we're going to load up the Ys into the platform, and then we're going
  to use the missing value
  snapshot to see what patterns
  they are amongst our missing
  values. It looks like the
  missing values tend to occur in
  horizontal clusters, and there's
  also the same missing values
  across rows. So you can see that
  with the black splotches here.
  And then we'll go apply an
  automated data imputation,
  which goes ahead and saves
  formula columns that impute
  missing values in the new
  columns using a regression type
  algorithm that was developed by
  a PhD student of mine named Milo
  Page at NC State. So we can play
  around a little bit and get a sense of how the ADI algorithm is working. So it's created these formula columns that are peeling off elements of the ADI impute column, which is a vector formula column. The scoring impute function calculates the expected value of a missing cell given the non-missing cells whenever it's got a missing value, and it just carries a non-missing value through. So you can see 207 in Y6 there. It's initially
  207 but then I change it to
  missing and it's now being
  imputed to be 234.
  So I'll do this a couple of times so you can kind of see how it's working. So here I'll put in a big value for Y7, and that's now been replaced. And if we go down and we add a row, then all values are missing initially and the column means are used for the imputations. If I were to go
  ahead and add values for some of
  those missing cells, it would
  start doing the conditional
  expectation of the still-missing cells using the information that's in the non-missing ones.
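  The ADI algorithm itself is JMP Pro's own, but the "expected value of a missing cell given the non-missing cells" idea demonstrated above can be sketched with a generic regression-round imputer. The data here are made up, and this is only an illustration of the behavior, not of JMP's implementation.

import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(7)
Y = rng.normal(size=(100, 13))
Y[3, 5] = np.nan                       # knock out a single cell, as in the demo
Y[10, 0:4] = np.nan                    # and a horizontal cluster of cells

# Each column is regressed on the others in rounds; missing cells are filled
# with their predicted, conditional-mean-style values.
imputer = IterativeImputer(max_iter=10, random_state=0)
Y_imputed = imputer.fit_transform(Y)
print(Y_imputed[3, 5], Y_imputed[10, 0])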
  So our next question on data structure is: are the analysis methods suitable for the data structure?
  So we've got 11 mixture inputs
  and a processing variable that's
  categorical. Those are going
  to be inputs into a least
  squares type model. We have
  13 continuous responses and
  we can model them using...
  individually using least
  squares. Or we can model
  functional principal
  components. Now, there are problems. The Xs, the input variables, have not been randomized at all. It's very clear that they would muck around with one or more of the compounds and then move on to another one. So the order in which the input variables were varied was kind of haphazard. It's a clear lack of randomization, and that's going to negatively impact the generalizability and strength of our conclusions.
  Data integration is the third
  InfoQ dimension. These data
  are manually entered lab notes
  consisting mostly of mixture
  percentages and equipment
  readouts. We can only assume
  that the data were entered
  correctly and that the Xs are
  aligned properly with responses.
  If that isn't the case, then the
  model will have serious bias
  problems and have
  problems with generalizability.
  Integration is more of an issue with observational data science problems and machine learning exercises than lab experiments like this. Although it doesn't apply here, I'll point out that privacy and confidentiality concerns can be identified by modeling the sensitive part of the data using the to-be-published component of the data. If the resulting model is predictive, then one should be concerned that privacy requirements are not being met. Temporal
  relevance refers to the
  operational time sequencing of
  data collection, analysis and
  deployment, and whether gaps between those stages lead to a
  decrease in the usefulness of
  the information in the study.
  In this case, we can only hope that the material supplies
  are reasonably consistent and
  that the test equipment is
  reasonably accurate, which is an
  unverifiable assumption at this
  point. The time resolution we have on the data collection is at the day level, which means that there isn't any real way that
  we can verify if there is time
  variation within each day.
  Chronology of data and goal is
  about the availability of
  relevant variables both in terms
  of whether the variable is
  present at all in the data or
  whether the right information
  will be available when the model
  is deployed. For predictive
  models, this relates to models
  being fit to data similar to
  what will be present at the time
  the model will be evaluated on
  new data. In this way, our data
  set is certainly fine. For
  establishing causality, however,
  we aren't in nearly as good a
  shape because the lack of
  randomization implies that time
  effects and factor effects may
  be confounded, leading to bias
  in our estimates. Endogeneity,
  or reverse causation, could
  clearly be an issue here, as
  variables like temperature and
  reaction time could clearly be
  impacting the responses, but have
  been left unrecorded. Overall,
  there is a lot we don't know
  about this dimension in an
  information quality sense.
  The rest of the InfoQ
  assessment is going to be
  dependent upon the type of
  analysis that we do. So at this
  point I'm going to go ahead and
  conduct an analysis of this data
  using the Functional Data
  Explorer platform in JMP Pro
  that allows me to model across
  all the columns simultaneously
  in a way that's based on
  functional principal components,
  which contain the maximum amount
  of information across all those
  columns as represented in the
  most efficient format possible.
  I'm going to be working on the
  imputed versions of the columns
  that I calculated earlier in the
  presentation. And I'm going to
  point out that I'm going to be working to find combinations of the mixture factors that match, as closely as possible in a least squares sense, an ideal curve that was created by the
  practitioner that maximizes the
  amount of potential product that
  could be in a batch while
  minimizing the amount of the
  impurities that they
  realistically thought a batch
  could contain. So I begin the analysis by going to the Analyze menu and bringing up the Functional Data Explorer, with rows as functions. I'm going to load up my imputed responses, and then I'm going to put in my formulation components and my processing column as supplementary variables. We've got an ID function; that's batch ID. Once I get in, I can see the functions, both overlaid all together and individually.
  Then I can load up the target
  function, which is the ideal.
  And that will change the
  analysis that results once I
  start going into the modeling
  steps. So these are pretty
  simple functions, so I'm just
  going to model them with
  B splines.
  And then I'm going to go into my
  functional DOE analysis.
  This is going to fit the model
  that connects the inputs into
  the functional principal
  components and then connect all
  the way through the
  eigenfunctions to make it so
  we're able to recover the
  overall functions as they change as we vary the mixture factors. The
  functional principal component
  analysis has indicated that
  there are four dimensions of
  variation in these response
  functions. To understand what
  they mean, let's go ahead and
  explore with the FPC profiler.
  So watch this pane right here as
  I adjust FPC 1 and we can see
  that this FPC is associated with
  peak height. FPC 2 looks like it's kind of a peak narrowness; it's almost like a resolution principal component. The third one is related to kind of a knee on the left of the dominant peak. And FPC 4 looks like it's primarily related to the impurity. So that's the underlying meaning of these four functional principal components.
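  Conceptually, what FDE is doing can be thought of as two steps: summarize each response curve by a few principal component scores, then model those scores from the inputs and rebuild predicted curves. The sketch below illustrates that idea on made-up data, with plain PCA standing in for the B-spline-based functional PCA and ordinary least squares standing in for JMP's pruned forward selection.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n_runs, n_points, n_inputs = 100, 13, 12
X = rng.uniform(size=(n_runs, n_inputs))        # stand-in mixture/process inputs
Y = rng.normal(size=(n_runs, n_points))         # each row = one sampled response curve

# Step 1: functional PCA (here plain PCA on the sampled curves) gives
# eigenfunctions and, for each run, a handful of FPC scores.
fpca = PCA(n_components=4).fit(Y)
scores = fpca.transform(Y)                      # n_runs x 4 FPC scores

# Step 2: the "functional DOE" idea -- model each FPC score from the inputs,
# then rebuild a predicted curve for any candidate formulation x0.
models = [LinearRegression().fit(X, scores[:, k]) for k in range(4)]
x0 = X[:1]
pred_scores = np.array([m.predict(x0)[0] for m in models])
pred_curve = fpca.mean_ + pred_scores @ fpca.components_
print(pred_curve.shape)                         # (13,) predicted response curve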
  So we've characterized our goal
  as maximizing the product and
  minimizing the impurity, and
  we've communicated that into the
  analysis through this ideal or
  golden curve that we supplied at
  the beginning of the FDE
  exercise we're doing. To get as
  close as possible to that ideal
  curve, we turn on desirability
  functions. And then we can go
  out and maximize desirability.
  And we find that the optimal combination of inputs is about 4.5% of Ingredient 4, 2% of Ingredient 6, 2.2% of Ingredient 8, and 1.24% of Ingredient 9, using processing method two. Let's review how
  we've gotten here. We first imputed the missing response columns. Then we found B-spline
  models that fit those functions
  well in the FDE platform. A
  functional principal components
  analysis determined that there
  were four eigenfunctions
  characterizing the variation in
  this data. These four
  eigenfunctions were determined
  via the FPC profiler to each
  have a reasonable subject
  matter meaning. The functional
  DOE analysis consisted of
  applying pruned forward
  selection to each of the
  individual FPC scores using the
  DOE factors as input variables.
  And we see here that these have
  found combinations of
  interactions and main effects
  that were most predictive for
  each of the functional principal
  component scores individually.
  The Functional DOE Profiler
  has elegantly captured all
  aspects into one representation
  that allows us to find the formulation and processing step that is predicted to have desirable
  properties as measured by high
  yield and low impurity.
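  Under the hood, "maximize desirability against the ideal curve" amounts to a constrained search over the mixture region for inputs whose predicted curve is as close as possible, in a least squares sense, to the target. The sketch below shows that kind of search on a made-up linear curve model; the coefficient matrix, target curve, and sum-to-one constraint are all stand-ins for illustration.

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
n_points, n_inputs = 13, 11
B = rng.normal(size=(n_inputs, n_points))        # stand-in coefficients: curve = x @ B
target = np.linspace(0, 1, n_points)             # stand-in "ideal" curve

def sse_to_target(x):
    # Squared distance between the predicted curve for mixture x and the ideal
    # curve, the quantity the desirability maximization is effectively driving down.
    return np.sum((x @ B - target) ** 2)

# Mixture constraint: proportions are non-negative and sum to 1.
cons = [{"type": "eq", "fun": lambda x: np.sum(x) - 1.0}]
bounds = [(0.0, 1.0)] * n_inputs
x_start = np.full(n_inputs, 1.0 / n_inputs)

result = minimize(sse_to_target, x_start, method="SLSQP", bounds=bounds, constraints=cons)
print(np.round(result.x, 3), result.fun)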
  So now we can do an InfoQ
  assessment of the
  generalizability of the data and the analysis. So in this case,
  we're more interested in
  scientific generalizability, as
  the experimenter is a deeply
  knowledgeable chemist working
  with this compound. So we're
  going to be relying more on
  their subject matter expertise
  than on statistical principles
  and tools like hypothesis tests
  and so forth. The goal is
  primarily predictive, but the
  generalizability is kind of
  problematic because the
  experiment wasn't designed. Our
  ability to estimate interactions
  is weakened for techniques like
  forward selection and impossible
  via least squares analysis of
  the full model. Because the
  study wasn't randomized, there
  could be unrecorded time and order effects. We don't have
  potentially important covariate
  information like temperature and
  reaction time. This creates
  another big question mark
  regarding generalizability.
  Repeatability and
  reproducibility of the study is
  also an unknown here as we have
  no information about the
  variability due to the
  measurement system. Fortunately,
  we do have tools like JMP's Evaluate Design to understand the existing design, as well as Augment Design, which can greatly enhance the generalization performance of the analysis.
  Augment can improve information
  about main effects and
  interactions, and a second round
  of experimentation could be
  randomized to also enhance
  generalizability. So now I'm
  going to go through a couple of
  simple steps to show how to
  improve the generalization
  performance of our study using
  design tools in JMP. Before I
  do that, I want to point out
  that I had to take the data and convert it so that it was in proportions rather than percentages. Otherwise the design
  tools were not really agreeing
  with the data very well. So we
  go into the evaluate designer
  and then we load up our Ys and
  our Xs. I requested the ability to
  handle second order interactions.
  Then I got this alert saying, hey, I can't do that, because we're not able to estimate all the interactions given the one-factor-at-a-time data that we have. So I backed up. We go to the augment designer, load everything up, and set augment. We'll choose an I-optimal design because we're really concerned with prediction performance here.
  And I
  set the number of runs to 148.
  The custom designer requested
  141 as a minimum, but I went to
  148 just to kind of make sure that we've got the ability to estimate all of our interactions
  pretty well. After that, it
  takes about 20 seconds to
  construct the design. So now
  that we have the design, I'm
  going to show the two most
  important diagnostic tools in
  the augment designer for
  evaluating a design. On the left, we have the fraction of design space plot. This is showing that 50% of the volume of the design space has a relative prediction variance that is less than 1, where 1 would be equivalent to the residual error variance. So we're able to get better-than-measurement-error quality predictions over the majority of the design space.
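  The fraction of design space plot is built from the relative prediction variance f(x)' (F'F)^{-1} f(x), evaluated at many points sampled over the design region and then sorted; a value of 1 corresponds to the residual error variance. The sketch below shows the computation for a small, made-up (non-mixture) design with main effects and two-factor interactions.

import numpy as np

rng = np.random.default_rng(2)
n_runs, n_factors = 148, 3                       # toy example, not the real 12-factor design
D = rng.uniform(-1, 1, size=(n_runs, n_factors)) # stand-in design

def model_matrix(X):
    # intercept + main effects + two-factor interactions
    cols = [np.ones(len(X))] + [X[:, j] for j in range(X.shape[1])]
    for i in range(X.shape[1]):
        for j in range(i + 1, X.shape[1]):
            cols.append(X[:, i] * X[:, j])
    return np.column_stack(cols)

F = model_matrix(D)
XtX_inv = np.linalg.inv(F.T @ F)

# Relative prediction variance at random points in the region; a value of 1
# equals the residual error variance sigma^2.
pts = rng.uniform(-1, 1, size=(5000, n_factors))
Fp = model_matrix(pts)
rel_var = np.einsum("ij,jk,ik->i", Fp, XtX_inv, Fp)
print(np.quantile(rel_var, 0.5))                 # the FDS plot is the sorted curve of rel_var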
  On the right we have the color map on correlations. This is showing that we're able to estimate everything pretty well. Because of the mixture constraint, we're getting some strong correlations between interactions and main effects.
  Overall, the effects are fairly
  clean. And the interactions are
  pretty well separated from one
  another, and the main effects
  are pretty well separated from
  one another as well. After
  looking at the design
  diagnostics, we can make the
  table. Here, I have shown the first 13 of the augmented runs, and we see that we have more randomization. We don't have streaks where the same main effect is changed over and over again; that's evidence of better randomization. Overall, the design is going to be able to estimate the main effects and interactions much better, having received better, higher-quality information in this second stage of experimentation. So the input
  variables, the Xs, are accurate representations of the mixture proportions, so they are clearly the quantities of interest. The responses are close surrogates for the amount of product and the amount of impurity in the batches, so we're pretty good on question 7.1 there. The justifications are clear. After the study, we
  can of course go prepare a
  batch that is the formulation
  that was recommended by the FDOE
  profiler. Try it out and see if
  we're getting the kind of
  performance that we were looking
  for. It's very clear that that
  would be the way that we can
  assess how well we've achieved
  our study goals. So now, on to the last InfoQ dimension: communication. By describing the
  ideal curve as a target
  function, the Functional DOE
  Profiler makes the goal and the
  results of the analysis crystal
  clear. And this can be expressed
  at a level that is easily
  interpreted by the chemists and
  managers of the R&D facility.
  And as we have done our detailed
  information quality assessment,
  we've been upfront about the
  strengths and weaknesses of the
  study design and data
  collection. If the results do
  not generalize, we certainly
  know where to look for where the
  problems were. Once you become
  familiar with the concepts,
  there is a nice add-in written
  by Ian Cox that you can use to
  do a quick quantitative InfoQ
  assessment. The add-in has
  sliders for the upper and lower
  bounds of each InfoQ dimension.
  These dimensions are combined
  using a desirability function
  approach into an overall interval for InfoQ, shown over on the left.
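  The exact scoring inside the add-in isn't spelled out in the talk, but a desirability-style aggregation typically combines the per-dimension ratings through a geometric mean, so a single weak dimension pulls the overall score down hard. A minimal sketch with made-up ratings on a 0 to 1 scale:

import numpy as np

dims = ["Data resolution", "Data structure", "Data integration",
        "Temporal relevance", "Chronology of data and goal",
        "Generalizability", "Operationalization", "Communication"]

# Hypothetical lower/upper ratings for each of the eight dimensions.
lower = np.array([0.6, 0.7, 0.5, 0.6, 0.4, 0.3, 0.6, 0.7])
upper = np.array([0.8, 0.9, 0.7, 0.8, 0.6, 0.5, 0.8, 0.9])

# Desirability-style aggregation: the geometric mean of the dimension scores.
infoq_low = float(np.prod(lower) ** (1 / len(lower)))
infoq_high = float(np.prod(upper) ** (1 / len(upper)))
print(f"Overall InfoQ roughly {infoq_low:.0%} to {infoq_high:.0%}")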
  Here is an assessment for the
  data and analysis I covered in
  this presentation. The add-in is
  also a useful thinking tool that
  will make you consider each of
  the InfoQ dimensions. It's also a
  practical way to communicate
  InfoQ assessments to your
  clients or to your management, as
  it provides a high level view of
  information quality without
  using a lot of technical
  concepts and jargon. The add-in
  is also useful as the basis for
  an InfoQ comparison. My
  original hope for this
  presentation was to be a little
  bit more ambitious. I had hoped
  to cover the analysis I had
  just gone through, as well as
  another simpler one, one where I skip imputing the responses and fit a simple multivariate linear regression model of the response
  columns. Today, I'm only able to
  offer a final assessment of that
  approach. As you can see,
  several of the InfoQ
  dimensions suffer substantially
  without the more sophisticated
  analysis. It is very clear that
  the simple analysis leads to a much lower InfoQ score.
  The upper limit of the simple
  analysis isn't that much higher
  than the lower limit of the more
  sophisticated one. With
  experience, you will gain
  intuition about what a good InfoQ
  score is for data science
  projects in your industry and
  pick up better habits as you
  will no longer be blind to the
  information bottlenecks in your
  data collection, analysis and model deployment: information quality with an easy-to-use interface. This was my first
  formal information quality
  assessment. Speaking for myself,
  the information quality
  framework has given words and
  structure to a lot of things I
  already knew instinctively. It has already changed how I
  approach new data analysis
  projects. I encourage you to go
  through this process yourself on
  your own data, even if that data
  and analysis is already very
  familiar to you. I guarantee
  that you will be a wiser and
  more efficient data scientist
  because of it. Thank you.