Ron Kenett, Chairman, KPA Group and Samuel Neaman Institute, Technion
Christopher Gotwalt, JMP Director, Statistical R&D, SAS

Data analysis – from designed experiments, generalized regression to machine learning – is being deployed at an accelerating rate. At the same time, concerns with reproducibility of findings and flawed p-value interpretation indicate that well-intentioned statistical analyses can lead to mistaken conclusions and bad decisions. For a data analysis project to properly fulfill its goals, one must assess the scope and strength of the conclusions derived from the data and tools available. This focus on statistical strategy requires a framework that isolates the components of the project: the goals, data collection procedure, data properties, the analysis provided, etc. The InfoQ Framework provides structured procedures for making this assessment. Moreover, it is easy to operationalize InfoQ in JMP. In this presentation we provide an overview of InfoQ along with use case studies, drawing from consumer research and pharmaceutical manufacturing, to illustrate how JMP can be used to make an InfoQ assessment, highlighting situations of both high and low InfoQ. We also give tips showing how JMP can be used to increase information quality by enhanced study design, without necessarily acquiring more data. This talk is aimed at statisticians, machine learning experts and data scientists whose job it is to turn numbers into information.

Auto-generated transcript...


Speaker

Transcript

Hello.
My name is Ron Kenett. This
is a joint talk with Chris
Gotwalt and we basically
have two messages
that should come out of the
talk. One is that we should
really be concerned about
information and information
quality. People tend to talk
about data and data quality, but
data is really not the issue.
We are statisticians. We are
data scientists. We turn numbers,
data into information so that
our goal should be to make sure
that we generate high quality
information. The other message
is that JMP can help you
achieve that, and this actually turns out to happen in surprising ways. So by combining
the expertise of Chris and an
introduction to information
quality, we hope that these two
messages will come across
clearly. So if I had to
summarize what it is that we
want to talk about, after all,
it is all about information
quality. I gave a talk at the Summit in Prague four years ago, and that talk was generic. It talked about my journey from quality by design to information quality. In this talk we focus
on how this can be done with
JMP. This is a more detailed
and technical talk than the
general talk I gave in Prague.
You can watch that talk.
There's a link listed here. You can
find it on the JMP community.
So we're going to talk about
information quality, which is
the potential of the data set, a
specific data set, to achieve a
particular goal, a specific goal,
with the given empirical method.
So in that definition we have
three components that are
listed. One is a certain data
set. Here is the data. The
second one is the goal,
the goal of the analysis, what
it is that we want to achieve.
And the third one is how we will
do that, which is, with what
methods we're going to generate
information, and that potential
is going to be assessed with the
utility function. And I will
begin with an introduction to
information quality, and then
Chris will take over, discuss the case study, and show
show you how to conduct an
information quality assessment.
Eventually this should
answer the question how JMP
supports InfoQ; those would be the takeaway points from the talk. The setup for
this is that we encourage
what I called a lifecycle view
of statistics. In other words,
not just data analysis.
We should be part
of the problem elicitation
phase. Also, the goal
formulation phase, that deserves
a discussion. We should
obviously be involved in the
data collection scheme if it's
through experiments or through
surveys or through observational
data. Then we should also take
time for formulation of the
findings and not just pull out
printed reports on regression coefficient estimates and their
significance, but we should
also discuss what are the
findings? Operationalization of findings means, OK, what can we
do with these findings? What are
the implications of the
findings? This
needs to be communicated to the
right people in the right way,
and eventually we should do an
impact assessment to figure out,
OK, we did all this; what has
been the added value of our
work? I talked about this life cycle view of statistics a few years ago. This is the prerequisite,
the perspective to what
I'm going to talk about. So as I
mentioned, the information
quality is the potential of a
particular data set to achieve a
particular goal using given
empirical analysis methods. This
is identified through four
components the goal, the data,
the analysis method, and the
utility measure. So, as a mathematical expression, the utility of applying f to X, conditioned on the goal, is how we identify InfoQ, information quality.
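For reference, that definition can be written compactly, as in the Kenett and Shmueli formulation:

    \mathrm{InfoQ}(f, X, g) = U\big( f(X \mid g) \big)

where g is the analysis goal, X is the available data set, f is the empirical analysis method, and U is the utility measure.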
This was published in the Royal Statistical Society
Series A in 2013 with eight
discussants, so it was amply
discussed. Some people thought
this was fantastic and some
people had a lot of critique on
that idea, so this is a wider
scope consideration of what
statistics is about. We also
wrote, in 2016, we meaning myself and Galit Shmueli, a book called Information Quality. And in
the context of information
quality we did what is called
deconstruction. David Hand has
a paper called Deconstruction
of Statistics. This is the
deconstruction of information
quality into eight dimensions. I
will cover these eight dimensions.
That's my part in the talk and
then Chris will show how this
is implemented in a specific
case study.
Another aspect that relates to
this is another book I have.
This is recent, a year ago
titled The Real Work of Data
Science and we talk about what
is the role of the data
scientists in organizations and
in that context, we emphasized
the need for the data scientist
to be involved in the generation of information, with information quality understood as meeting the goals of the organization. So let me
cover the eight dimensions. That's my intro. The
first one is data resolution. We
have a goal. OK, we would like to know the level of flu in the country or in the area where we live, because that will impact our decision on whether to go to the park, where we could meet people, or to a jazz concert. And that
concert is tomorrow.
If we look up the CDC data on
the level of flu, that data is
updated weekly, so we could get
the red line in the graph you
have in front of you, so we
could get data of a few days
ago, maybe good, maybe not good
enough for our goal. Google Flu,
which is based on searches
related to flu, is updated moment to moment, so it's updated online; it will probably give us better information. So for
our goal, the blue line, the
Google Flu Trends indicator, is probably
more appropriate. The second
dimension is data structure.
To meet our goal, we're going to
look at data. We should...we
should identify the data sources
and the structure in these data
sources. So some data could be
text, some could be video, some
could be, for example, the
network of recommendations. This
is an Amazon picture of how, if you look at a book, you're going to get some other books recommended. And if you go to these other books, you're going to have more books recommended. So the data
structure can come in all sorts
of shapes and forms and this can
be text. This can be functional
data. This can be images. We are
not confined now to what we used
to call data, which is what you
find in an Excel spreadsheet.
The data could be corrupted, could have missing values, or could have unusual patterns, which would be something to look into. Some patterns, where things are repeated, may mean that some of the data is just copy and paste, and we would like to be warned about such things.
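As a minimal illustration in Python with pandas (the file name here is hypothetical), this is the kind of warning one might generate for missing values and copy-and-paste rows:

    import pandas as pd

    df = pd.read_csv("measurements.csv")  # hypothetical data table

    # Count missing values per column.
    print(df.isna().sum())

    # Flag rows that exactly repeat an earlier row, a possible sign
    # of copy-and-paste data entry that we would want to be warned about.
    duplicates = df[df.duplicated(keep="first")]
    print(f"{len(duplicates)} exact duplicate rows found")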
The third dimension is data integration.
When we consider the data from
these different sources, we're
going to integrate them so we can do some analysis, with linkage through an ID, for example. We would do that, but in doing that, we might create some issues, for example in disclosing data
that normally should be
anonymized. Data
integration, yeah, that will
allow us to do fantastic things,
but if the data is perceived to
have some privacy exposure
issues, then maybe the quality
of the information from the
analysis that we're going to do
is going to be
affected. So data integration
should be looked into very, very
carefully. This is what people used to call ETL: extract, transform, and load. We now have much better methods for doing that. The Join option in JMP, for example, will offer options for doing that.
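As a minimal sketch of this kind of ID-based linkage outside JMP (Python with pandas; the file and column names are hypothetical), including a simple step to avoid carrying a directly identifying key into the merged table:

    import hashlib
    import pandas as pd

    # Two hypothetical sources that share a customer ID.
    orders = pd.read_csv("orders.csv")    # customer_id, order_value, ...
    surveys = pd.read_csv("surveys.csv")  # customer_id, satisfaction, ...

    # Integrate the sources through the shared ID.
    merged = orders.merge(surveys, on="customer_id", how="inner")

    # Replace the raw ID with a salted hash so the merged table can be
    # shared without disclosing the original identifier.
    SALT = "project-specific-secret"
    merged["customer_id"] = merged["customer_id"].astype(str).map(
        lambda s: hashlib.sha256((SALT + s).encode()).hexdigest()[:12]
    )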
Temporal relevance. OK, that's
pretty clear. We have data. It is
stamped somehow. If we're going
to do the analysis later, after the data collection, and if the deployment that we
consider is even later, then the
data might not be temporally
relevant. In a common
situation, if we want to compare
what is going on now, we would
like to be able to make this
comparison to recent data or
data before the pandemic
started, but not 10 years
earlier. The official statistics
on health records used to be two
or three years behind in terms
of timing, which made it very difficult to use official statistics in assessing
what is going on with the
pandemic. Chronology of data and
goal is related to the decision
that we make as a result of our
goal. So if, for example, our
goal is to forecast air quality,
we're going to use some
predictive models on the Air
Quality Index reported on a
daily basis. This gives us a one-to-six scale from hazardous to good. There are value ranges representing levels of health concern: 0-50 is good; 300-500 is hazardous. And the
chronology of data and goal
means that we should be able to
make a forecast on a daily
basis. So the methods we use
should be updated on a daily
basis. If, on the other hand,
our goal is to figure out how this AQI index is computed, then we are not really bound by the timeliness of the analysis.
You know, we could take our time. There's no urgency in getting the analysis done on a daily basis.
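To make that scale concrete, here is a minimal Python sketch mapping an AQI value to its level of health concern, assuming the standard US EPA category breakpoints:

    def aqi_category(aqi: float) -> str:
        # Standard US EPA breakpoints (upper bound of each category).
        breakpoints = [
            (50, "Good"),
            (100, "Moderate"),
            (150, "Unhealthy for Sensitive Groups"),
            (200, "Unhealthy"),
            (300, "Very Unhealthy"),
            (500, "Hazardous"),
        ]
        for upper, label in breakpoints:
            if aqi <= upper:
                return label
        return "Beyond the AQI scale"

    print(aqi_category(42))   # Good
    print(aqi_category(320))  # Hazardous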
Generalizability, the sixth dimension, is about
taking our findings and
considering where this could
apply in more general terms,
other populations, other
situations. This can be done
intuitively. Marketing managers who
have seen a study on one market, let's call it Market A, might already understand what the implications are for Market B
without data. People who are
physicists will be able to
make predictions based on
mechanics, on first principles, without data.
So some of the generalizability
is done with data. This is the
basis of statistical
generalization, where we go from
the sample to the population.
Statistical inference is about
generalization. We generalize
from the sample to the
population. And some can be
domain based, in other words,
using expert knowledge, domain
expertise, not necessarily with
data. We have to recognize that
generalizability is not just
done with statistics.
The seventh dimension is
construct operationalization,
which is really about what it is
that we measure. We want to
assess behavior, emotions, what
it is that we can measure, that
will give us data that reflects
behavior or emotions.
The example I typically give here is pain.
We know what pain is. How do
we measure that? If you go to a
hospital and you ask the nurse,
how do you assess pain, they
will tell you, we have a scale,
1 to 10. It's very
qualitative, not very
scientific, I would say. If we
want to measure the level of
alcohol in drivers on the road, it will be difficult to measure. So we might measure
speed as a surrogate measure.
Another part of
operationalization is the other
end of the story. In other
words, the first part, the
construct is what we measure,
which reflects our goal. The end result here is that
we have findings and we want to
do something with them. We want
to operationalize our finding.
This is what the action
operationalization is about.
It's about what you do with the findings. When presenting, we used to ask three questions.
These are very important
questions to ask. Once you have
done some analysis, you have
someone in front of you who
says, oh, thank you very much, you're done, to you, the statistician
or the data scientist. So this takes you one extra step, getting you to ask your customer these simple questions: What do you want to accomplish? How will you do that? And how will you know if you have accomplished that? We can help answer, or
at least support, some of these
questions.
The eighth dimension is
communication. I'm giving you an
example from a very famous old
map from the 19th century, which
is showing the march of the
Napoleon Army from France to
Moscow, in Russia. The width of the path indicates the size of the army, and then, in black, you see what happened to them on their way back. So basically this was a disastrous march. We
can relate this old map to
existing maps, and there is a
JMP add-in, which you can
find on the JMP Community, to show you with dynamic bubble plots what this looks like. So I've covered
very quickly the eight information
quality dimensions. My last
slide puts what I've talked about in a historical perspective; it really puts some proportion to what I'm saying. I think we are really in the era of information quality. We used to be concerned with product quality in the 17th and 18th centuries. We then moved to
process quality and service
quality. This is a short memo
on proposing a control chart,
1924, I think.
Then we moved to management quality. This is the Juran trilogy of design, improvement, and control. The Six Sigma DMAIC (define, measure, analyze, improve, control) process is the improvement process of Juran, and Juran was the grandfather of Six Sigma in that sense.
Then in the '80s, Taguchi came
in. He talks about robust
design. How can we handle
variability in inputs by proper
design decisions? And now we are
in the Age of information
quality. We have sensors. We
have flexible systems. We are
depending on AI and machine
learning and data mining, and we are gathering big, big numbers, which we call big data. Interest in information quality should be a prime interest. I'm going to try to convince you of that, with the help of Chris.
We are here and JMP can
help us achieve that in a really unusual way.
What you will see at the end of
the case study that Chris will
show is also how to do an information quality assessment on a specific study and basically generate an information quality score. So if we go top down, I
can tell you this study, this
work, this analysis is maybe 80% or
maybe 30% or maybe 95%.
And through the example you will
see how to do that. There is a
JMP add-in to provide this
assessment. It's actually quite easy. There's nothing really sophisticated about that. So I'm done, and Chris, after you. Thanks, Ron. So
now I'm going to go through the
analysis of a data set in a way
that explicitly calls out the
various aspects of information
quality and show how JMP can be
used to assess and improve InfoQ. So first off, I'm
going to go through the InfoQ
components. The first InfoQ
component is the goal, so in
this case the problem statement
was that a chemical company
wanted a formulation that
maximized product yield while
minimizing a nuisance impurity
that resulted from the reaction.
So that was the high level goal.
In statistical terms, we wanted
to find a model that accurately
predicted a response on a data
set so that we could find a
combination of ingredients and
processing steps that would lead
to a better product.
The data are set up as 100 experimental formulations with one primary ingredient, X1, and 10 additives. There's also a processing factor and 13 responses. The data are
completely fabricated but were
simulated to illustrate the same
strengths and weaknesses of the
original data. The day each formulation was made was also recorded. We will be looking at this data closely, so I won't elaborate beyond pointing out that they were collected in an ad hoc way, changing one or two
additives at a time rather than
as a designed or randomized
experiment. There's a lot of
ways to analyze this data, the
most typical being least
squares modeling with forward
selection on selected responses.
That was my original intention
for this talk, but when I showed
the data to Ron, he immediately
recognized the response columns
as time series from analytical
chemistry. Even though the data
were simulated, he could see the
structure. He could see things
in the data that I didn't see or read into it. I found
this to be strikingly
impressive. It's beyond the
scope of this talk, but there is
an even better approach based on
ensemble modeling using
fractionally weighted
bootstrapping. Phil Ramsey,
Wayne Levin and I have another
talk about this methodology at
the European Discovery
Conference this year. The
approach is promising because it
can fit models to data with
more active interactions than
there are runs. The fourth and final
component of information quality
is utility, which is how well we are able to achieve our goals, or how we measure how well we've achieved our goals. There's a
domain aspect which is in this
case we want to have a
formulation that leads to maximized yield and minimized waste in post-processing of the material. The statistical
analysis utility refers to the
model that we fit. What we're
going for there is least
squares accuracy of our model in
terms of how well we're able to predict what would result from candidate combinations of mixture factors. Now I'm going
to go through a set of questions
that make up a detailed InfoQ
assessment as organized into the
eight dimensions of information
quality. I want to point out
that not all questions will be
equally relevant to different
data science and statistical
projects, and that this is not
intended to be rigid dogma but
rather a set of things that are
a good idea to ask oneself.
These questions represent a kind
of data analytic wisdom that
looks more broadly than just the
application of a particular
statistical technology. A copy
of a spreadsheet with these
questions along with pointers to
JMP features that are the most
useful for answering a
particular one will be uploaded
to the JMP Community along
with this talk for you to use. As
I proceed through the questions,
I'll be demoing an analysis of
the data in JMP. So Question 1 is, is the data scale used aligned with the stated goal? So the Xs that we have consist of a single categorical variable, processing, and 11 continuous inputs. These are measured
as percentages and are also
recorded to half a percent. We
don't have the total amounts of
the ingredients, only the
percentages. The totals are
information that was either lost
or never recorded. There are
other potentially important
pieces of information that are
missing here. The time between
formulating the batches and
taking the measurements is gone
and there could have been other
covariate level information that
is missing here that would have
described the conditions under
which the reaction occurred.
Without more information than I
have, I cannot say how important
this kind of covariate information
would have been. We do have
information on the day of the
batch, so that could be used as
a surrogate possibly. Overall we
have what are, hopefully, the most
important inputs, as well as
measurements of the responses we
wish to optimize. We could have
had more information, but this
looks promising enough to try an analysis with. The second question related to data resolution is, how reliable and precise are the measuring devices and data sources? And the fact
is, we don't have a lot of
specific information here. The
statistician internal to the company would have had more
information. In this case we
have no choice but to trust that
the chemists formulated and
recorded the mixtures well. The
third question relative to data resolution is, is the data analysis suitable for the data aggregation level? And the
answer here is yes, assuming
that their measurement system is
accurate and that the data are
clean enough. What we're going
to end up doing actually is
we're going to use the
Functional Data Explorer to
extract functional principal
components, which are a data
derived kind of data
aggregation. And then we're
going to be modeling those
functional principal components
using the input variables. So
now we move on to the data
structure dimension and the
first question we ask is, is the
data used aligned with the
stated goal? And I think the
answer is a clear yes here. We're
trying to maximize
yield. We've got measurements for
that, and the inputs are
recorded as Xs. The second data
structure question is where
things really start to get
interesting for me. So this is, are the integrity details
(outliers, missing values, data
corruption) issues described and
handled appropriately? So from
here we can use JMP to be able
to understand where the outliers
are, figure out strategies for
what to do about missing values,
observe their patterns and so
on. So this is where things are going to get a little
bit more interesting. The first
thing we're going to do is we're
going to determine if there are
any outliers in the data that we
need to be concerned about. So
to do that, we're going to go
into the explore outliers
platform off of the screening
menu. We're going to load up the
response variables, and because
this is a multivariate setting,
we're going to use a new feature
in JMP Pro 16 called Robust
PCA Outliers. So we see where
the large residuals are in those
kind of Pareto type plots.
There's a snapshot showing where there are some potentially unusually large observations. This doesn't look too unusual or worrisome to me. We can save the large outliers to a data table and then look at them in the distribution platform, and what we see kind of looks like a normal distribution with the middle taken out. So I think these data are coming from the same population, and there's nothing really to worry about here, outliers-wise.
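For readers working outside JMP Pro, here is a rough analogue of this step (a minimal Python sketch with scikit-learn; JMP's Robust PCA Outliers uses a robust decomposition, which this plain PCA does not reproduce). It flags rows whose PCA reconstruction error is unusually large:

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    def flag_large_residuals(Y, n_components=4, k=3.0):
        # Rows of Y whose PCA reconstruction error is more than k robust
        # standard deviations (via the MAD) above the median error.
        Z = StandardScaler().fit_transform(Y)
        pca = PCA(n_components=n_components).fit(Z)
        residual = Z - pca.inverse_transform(pca.transform(Z))
        err = np.sqrt((residual ** 2).sum(axis=1))
        mad = np.median(np.abs(err - np.median(err)))
        return np.where(err > np.median(err) + k * 1.4826 * mad)[0]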
So once we've taken care of the outlier situation, we go in and explore
missing values. So what we're
going to do first is we're going
to load up the Ys as...into the
platform, and then we're going
to use the missing value
snapshot to see what patterns
they are amongst our missing
values. It looks like the
missing values tend to occur in
horizontal clusters, and there's
also the same missing values
across rows. So you can see that
with the black splotches here.
And then we'll go apply an
automated data imputation,
which goes ahead and saves
formula columns that impute
missing values in the new
columns using a regression type
algorithm that was developed by
a PhD student of mine named Milo
Page at NC State. So we can play
around a little bit and get a
sense of how the ADI
algorithm is working. So it's
created these formula columns
that are peeling off elements of
the ADI impute column, which is
a vector formula column, and the
scoring impute function
is calculating the expected
value of the missing cells given
the non missing cells, whenever
it's got a missing value, and it's just carrying through a non-missing value otherwise. So you can see 207 in Y6 there. It's initially 207, but then I change it to missing and it's now being imputed to be 234.
So I'll do this a couple of times so you can kind of see how it's working. So here I'll put in a big value for Y7, and that's now been replaced. And if we go down
and we add a row,
then all values are missing initially and the column means are used for the imputations. If I were to go
ahead and add values for some of
those missing cells, it would
start doing the conditional
expectation of the still missing
cells using the information
that's in the non-missing ones.
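The conditional-expectation idea behind this kind of imputation can also be sketched directly. This is a minimal Python illustration of filling missing cells with their expected value given the observed cells under a multivariate normal model; it is the underlying formula, not the ADI algorithm itself:

    import numpy as np

    def conditional_mean_impute(x, mu, Sigma):
        # Fill the NaN entries of x with E[x_missing | x_observed] under a
        # multivariate normal with mean mu and covariance Sigma:
        #   E[x_m | x_o] = mu_m + S_mo S_oo^{-1} (x_o - mu_o)
        x = np.asarray(x, dtype=float).copy()
        m = np.isnan(x)
        if not m.any():
            return x
        o = ~m
        S_oo = Sigma[np.ix_(o, o)]
        S_mo = Sigma[np.ix_(m, o)]
        x[m] = mu[m] + S_mo @ np.linalg.solve(S_oo, x[o] - mu[o])
        return x

    # Small made-up example; in practice mu and Sigma would be estimated
    # from the non-missing data.
    mu = np.array([200.0, 210.0, 190.0])
    Sigma = np.array([[100.0, 60.0, 30.0],
                      [60.0, 120.0, 40.0],
                      [30.0, 40.0, 90.0]])
    print(conditional_mean_impute([207.0, np.nan, 195.0], mu, Sigma))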
So our next question on data structure is,
are the analysis methods
suitable for the data structure?
So we've got 11 mixture inputs
and a processing variable that's
categorical. Those are going
to be inputs into a least
squares type model. We have
13 continuous responses and
we can model them using...
individually using least
squares. Or we can model
functional principal
components. Now there are problems. The
input variables have not been
randomized at all. It's very
clear that they would muck
around with one or more of
the compounds and then move
on to another one. So the
order in which the
input variables were varied
was kind of haphazard. It's a
clear lack of randomization, and
that's going to negatively
impact the generalizability
and strength of our conclusions.
Data integration is the third
InfoQ dimension. These data
are manually entered lab notes
consisting mostly of mixture
percentages and equipment
readouts. We can only assume
that the data were entered
correctly and that the Xs are
aligned properly with responses.
If that isn't the case, then the
model will have serious bias
problems and have
problems with generalizability.
Integration is more of an issue
with observational data science problems and machine learning exercises than lab experiments like this. Although
it doesn't apply here, I'll
point out that privacy and
confidentiality concerns can be
identified by modeling the sensitive part of the data using the to-be-published components of the data. If the resulting model is predictive, then one should be concerned that privacy requirements are not being met. Temporal
relevance refers to the
operational time sequencing of
data collection, analysis and
deployment and whether gaps
between those stages lead to a
decrease in the usefulness of
the information in the study.
In this case, we can only hope that the material supplies
are reasonably consistent and
that the test equipment is
reasonably accurate, which is an
unverifiable assumption at this
point. The time resolution we have on the data collection is at the day level, which means that there isn't much way we can verify whether there is time
variation within each day.
Chronology of data and goal is
about the availability of
relevant variables both in terms
of whether the variable is
present at all in the data or
whether the right information
will be available when the model
is deployed. For predictive
models, this relates to models
being fit to data similar to
what will be present at the time
the model will be evaluated on
new data. In this way, our data
set is certainly fine. For
establishing causality, however,
we aren't in nearly as good a
shape because the lack of
randomization implies that time
effects and factor effects may
be confounded, leading to bias
in our estimates. Endogeneity,
or reverse causation, could
clearly be an issue here, as
variables like temperature and
reaction time could clearly be
impacting the responses, but have
been left unrecorded. Overall,
there is a lot we don't know
about this dimension in an
information quality sense.
The rest of the InfoQ
assessment is going to be
dependent upon the type of
analysis that we do. So at this
point I'm going to go ahead and
conduct an analysis of this data
using the Functional Data
Explorer platform in JMP Pro
that allows me to model across
all the columns simultaneously
in a way that's based on
functional principal components,
which contain the maximum amount
of information across all those
columns as represented in the
most efficient format possible.
I'm going to be working on the
imputed versions of the columns
that I calculated earlier in the
presentation. And I'm going to
point out that I'm going to be
working to find combinations of
the mixture factors that achieve, as closely as possible in a least squares sense, an ideal
curve that was created by the
practitioner that maximizes the
amount of potential product that
could be in a batch while
minimizing the amount of the
impurities that they
realistically thought a batch
could contain. So I begin the
analysis by going to the Analyze menu and bringing up the Functional Data Explorer. This has rows as functions. I'm going to load up my imputed responses, and then I'm going
to put in my formulation
components and my processing
column as a supplementary
variable. We've got an ID
function, that's batch ID. Once I get in, I can see the functions, both overlaid all together and individually.
Then I can load up the target
function, which is the ideal.
And that will change the
analysis that results once I
start going into the modeling
steps. So these are pretty
simple functions, so I'm just
going to model them with
B splines.
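For intuition about what this step is doing, here is a rough stand-in outside JMP (a minimal Python sketch, not the Functional Data Explorer's actual algorithm): represent each batch's response curve with a cubic B-spline evaluated on a common grid, then run a PCA on the resulting curves to get functional-principal-component-like scores.

    import numpy as np
    from scipy.interpolate import make_interp_spline
    from sklearn.decomposition import PCA

    def fpc_scores(Y, n_components=4, n_grid=50):
        # Y is an (n_batches x n_points) matrix of response curves.
        t = np.arange(Y.shape[1])
        grid = np.linspace(t.min(), t.max(), n_grid)
        curves = np.vstack([make_interp_spline(t, row, k=3)(grid) for row in Y])
        pca = PCA(n_components=n_components).fit(curves)
        return pca.transform(curves), pca  # scores plus the fitted components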
And then I'm going to go into my
functional DOE analysis.
This is going to fit the model
that connects the inputs into
the functional principal
components and then connect all
the way through the
eigenfunctions to make it so
we're able to recover the
overall functions as they
changed, as we are varying the
mixture factors. The
functional principal component
analysis has indicated that
there are four dimensions of
variation in these response
functions. To understand what
they mean, let's go ahead and
explore with the FPC profiler.
So watch this pane right here as
I adjust FPC 1 and we can see
that this FPC is associated with
peak height. FPC2, it looks
like it's kind of a peak
narrowness. It's almost like a
resolution principal
component. The third one is
related to kind of a knee on
the left of the dominant peak.
And FPC 4 looks like it's primarily related to the impurity, so that's what the
underlying meaning is of
these four functional
principal components.
So we've characterized our goal
as maximizing the product and
minimizing the impurity, and
we've communicated that into the
analysis through this ideal or
golden curve that we supplied at
the beginning of the FDE
exercise we're doing. To get as
close as possible to that ideal
curve, we turn on desirability
functions. And then we can go
out and maximize desirability.
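As a rough illustration of what maximizing desirability means here (a minimal Python sketch, not the profiler's algorithm): given some fitted function that predicts the response curve from a candidate mixture, search over mixtures that sum to one for the one whose predicted curve is closest to the target in a least squares sense.

    import numpy as np
    from scipy.optimize import minimize

    def closest_to_target(predict_curve, target, n_inputs):
        # predict_curve(x) returns the predicted response curve for mixture x;
        # it is a placeholder for whatever model was fit to the FPC scores.
        def loss(x):
            return float(np.sum((predict_curve(x) - target) ** 2))

        x0 = np.full(n_inputs, 1.0 / n_inputs)  # start at equal proportions
        constraints = [{"type": "eq", "fun": lambda x: x.sum() - 1.0}]
        res = minimize(loss, x0, method="SLSQP",
                       bounds=[(0.0, 1.0)] * n_inputs, constraints=constraints)
        return res.x, res.fun

The actual desirability machinery is more general than this, handling maximize and minimize goals for individual responses as well as matching a target curve.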
And we find that the optimal
combination of inputs is about
4.5% of Ingredient 4, 2% of Ingredient 6, 2.2% of Ingredient 8, and 1.24% of Ingredient 9, using processing method 2. Let's review how
we've gotten here. We first
imputed the missing response columns. Then we found B-spline
models that fit those functions
well in the FDE platform. A
functional principal components
analysis determined that there
were four eigenfunctions
characterizing the variation in
this data. These four
eigenfunctions were determined
via the FPC profiler to each
have a reasonable subject
matter meaning. The functional
DOE analysis consisted of
applying pruned forward
selection to each of the
individual FPC scores using the
DOE factors as input variables.
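To see the idea in code, here is a minimal sketch of plain forward selection for one FPC score (Python; the pruned forward selection used in JMP Pro's Generalized Regression is more sophisticated, so treat this only as an illustration of the concept):

    import numpy as np

    def forward_select(X, y, max_terms=5):
        # Greedy forward selection: at each step, add the column of X that
        # most reduces the residual sum of squares of a least squares fit.
        # X holds the candidate terms (factors and interactions); y is one
        # FPC score. A stopping rule such as AICc would normally be added.
        n, p = X.shape
        selected = []
        for _ in range(min(max_terms, p)):
            best_j, best_rss = None, np.inf
            for j in range(p):
                if j in selected:
                    continue
                A = np.column_stack([np.ones(n), X[:, selected + [j]]])
                beta, rss, *_ = np.linalg.lstsq(A, y, rcond=None)
                if rss.size == 0:
                    rss = np.array([np.sum((y - A @ beta) ** 2)])
                if rss[0] < best_rss:
                    best_j, best_rss = j, rss[0]
            selected.append(best_j)
        return selected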
And we see here that these have
found combinations of
interactions and main effects
that were most predictive for
each of the functional principal
component scores individually.
The Functional DOE Profiler
has elegantly captured all
aspects into one representation
that allows us to find the
formulation and processing step that
is predicted to have desirable
properties as measured by high
yield and low impurity.
So now we can do an InfoQ
assessment of the
generalizability of the data in
the analysis. So in this case,
we're more interested in
scientific generalizability, as
the experimenter is a deeply
knowledgeable chemist working
with this compound. So we're
going to be relying more on
their subject matter expertise
than on statistical principles
and tools like hypothesis tests
and so forth. The goal is
primarily predictive, but the
generalizability is kind of
problematic because the
experiment wasn't designed. Our
ability to estimate interactions
is weakened for techniques like
forward selection and impossible
via least squares analysis of
the full model. Because the
study wasn't randomized, there
could be unrecorded time and order effects. We don't have
potentially important covariate
information like temperature and
reaction time. This creates
another big question mark
regarding generalizability.
Repeatability and
reproducibility of the study is
also an unknown here as we have
no information about the
variability due to the
measurement system. Fortunately,
we do have tools like JMP's Evaluate Design, to understand the existing design, as well as Augment Design, which can greatly
enhance the generalization
performance of the analysis.
Augment can improve information
about main effects and
interactions, and a second round
of experimentation could be
randomized to also enhance
generalizability. So now I'm
going to go through a couple of
simple steps to show how to
improve the generalization
performance of our study using
design tools in JMP. Before I
do that, I want to point out
that I had to take the data and
convert it so that it was in proportions rather than percentages. Otherwise the design
tools were not really agreeing
with the data very well. So we
go into the Evaluate Design platform and then load up our Ys and our Xs. I requested the ability to handle second-order interactions. Then I got this alert
saying, hey, I can't do that
because we're not able to
estimate all the interactions
given the one factor at a time
data that we have. So I backed
up. We go to the Augment Design platform, load everything up, and select Augment. We'll choose an I-optimal design because we're really concerned with prediction performance here.
And I
set the number of runs to 148.
The custom designer requested
141 as a minimum, but I went to
148 just to kind of make sure that we've got the ability to estimate all of our interactions pretty well. After that, it
takes about 20 seconds to
construct the design. So now
that we have the design, I'm
going to show the two most
important diagnostic tools in
the augment designer for
evaluating a design. On the
left, we have the fraction of
design space plot. This is
showing that 50% of the volume
of the design has
a prediction variance that is
less than 1. So 1 would be equivalent to the residual error variance. So we're able to get better-than-measurement-error quality predictions over the majority of the design space.
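The quantity behind that plot can be written down directly: for a design with model matrix F, the relative prediction variance at a point x is f(x)' (F'F)^{-1} f(x), where f(x) is the model expansion of x, and the fraction of design space plot is just the sorted values of this quantity over many random points in the design region. A minimal Python sketch (an illustration of the definition, not JMP's implementation):

    import numpy as np

    def relative_prediction_variance(F, f_new):
        # F: (n_runs x n_terms) model matrix of the design.
        # f_new: (n_points x n_terms) model expansions of candidate points.
        # Returns f(x)' (F'F)^{-1} f(x) for each point, in units of sigma^2.
        XtX_inv = np.linalg.inv(F.T @ F)
        return np.einsum("ij,jk,ik->i", f_new, XtX_inv, f_new)

    def fraction_of_design_space(F, f_new):
        # Sorted relative prediction variances; plotting these against their
        # rank fraction gives the fraction of design space plot.
        return np.sort(relative_prediction_variance(F, f_new))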
On the right, we have the color map on correlations. This is showing
that we're able to estimate
everything pretty well. Because of the mixture
constraint, we're getting some
strong correlations between
interactions and main effects.
Overall, the effects are fairly
clean. And the interactions are
pretty well separated from one
another, and the main effects
are pretty well separated from
one another as well. After
looking at the design
diagnostics, we can make the
table. Here, I have shown the
first 13 of the augmented runs
and we see that we have more randomization. We don't have streaks where the same main effect is used over and over again.
That's evidence of better
randomization and overall the
design is going to be able to
much better estimate the main
effects and interactions having
received better, higher quality
information in this second stage
of experimentation. Moving to construct operationalization: the input variables, the Xs, are accurate representations of the mixture proportions, so that's a clear match to the quantities of interest. The responses are close surrogates for the amount of product and the amount of impurity in the batches. We're pretty good on question 7.1 there. The justifications are clear. After the study, we
can of course go prepare a
batch that is the formulation
that was recommended by the FDOE
profiler. Try it out and see if
we're getting the kind of
performance that we were looking
for. It's very clear that that
would be the way that we can
assess how well we've achieved
our study goals. So now to the last InfoQ dimension, communication. By describing the ideal curve as a target function, the Functional DOE Profiler makes the goal and the results of the analysis crystal clear. And this can be expressed
at a level that is easily
interpreted by the chemists and
managers of the R&D facility.
And as we have done our detailed
information quality assessment,
we've been upfront about the
strengths and weaknesses of the
study design and data
collection. If the results do
not generalize, we certainly
know where to look for where the
problems were. Once you become
familiar with the concepts,
there is a nice add-in written
by Ian Cox that you can use to
do a quick quantitative InfoQ
assessment. The add-in has
sliders for the upper and lower
bounds of each InfoQ dimension.
These dimensions are combined
using a desirability function
approach for an overall interval
for the InfoQ over on the left.
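For intuition about how such an overall score can be built, here is a minimal Python sketch of one common convention, rescaling a rating for each of the eight dimensions to a 0-1 desirability and combining them with a geometric mean; the add-in's exact formula may differ:

    import numpy as np

    def infoq_score(ratings, low=1.0, high=5.0):
        # ratings: the eight dimension ratings on a low-to-high scale.
        # Each rating is rescaled to a 0-1 desirability; the overall score
        # is their geometric mean, reported as a percentage.
        d = (np.asarray(ratings, dtype=float) - low) / (high - low)
        return 100.0 * float(np.prod(d) ** (1.0 / len(d)))

    # Hypothetical ratings, in the order: data resolution, data structure,
    # data integration, temporal relevance, chronology of data and goal,
    # generalizability, operationalization, communication.
    print(round(infoq_score([4, 4, 3, 3, 3, 2, 4, 5]), 1))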
Here is an assessment for the
data and analysis I covered in
this presentation. The add-in is
also a useful thinking tool that
will make you consider each of
the InfoQ dimensions. It's also a
practical way to communicate
InfoQ assessments to your
clients or to your management, as
it provides a high level view of
information quality without
using a lot of technical
concepts and jargon. The add-in
is also useful as the basis for
an InfoQ comparison. My
original hope for this
presentation was to be a little
bit more ambitious. I had hoped
to cover the analysis I had
just gone through, as well as
another, simpler one, where I skip imputing the responses and do a simple multivariate linear regression model of the response
columns. Today, I'm only able to
offer a final assessment of that
approach. As you can see,
several of the InfoQ
dimensions suffer substantially
without the more sophisticated
analysis. It is very clear that
the simple analysis leads to a much lower InfoQ score.
The upper limit of the simple
analysis isn't that much higher
than the lower limit of the more
sophisticated one. With
experience, you will gain
intuition about what a good InfoQ
score is for data science
projects in your industry and
pick up better habits as you
will no longer be blind to the
information bottlenecks in your
data collection, analysis and
model deployment. The add-in gives you information quality assessment with an easy-to-use interface. This was my first
formal information quality
assessment. Speaking for myself,
the information quality
framework has given words and
structure to a lot of things I
already knew instinctively. It has already changed how I
approach new data analysis
projects. I encourage you to go
through this process yourself on
your own data, even if that data
and analysis is already very
familiar to you. I guarantee
that you will be a wiser and
more efficient data scientist
because of it. Thank you.
Published on ‎05-20-2024 04:10 PM by | Updated on ‎07-07-2025 12:08 PM

Ron Kenett, Chairman, KPA Group and Samuel Neaman Instiute, Technion
Christopher Gotwalt, JMP Director, Statistical R&D, SAS

Data analysis – from designed experiments, generalized regression to machine learning – is being deployed at an accelerating rate. At the same time, concerns with reproducibility of findings and flawed p-value interpretation indicate that well-intentioned statistical analyses can lead to mistaken conclusions and bad decisions. For a data analysis project to properly fulfill its goals, one must assess the scope and strength of the conclusions derived from the data and tools available. This focus on statistical strategy requires a framework that isolates the components of the project: the goals, data collection procedure, data properties, the analysis provided, etc. The InfoQ Framework provides structured procedures for making this assessment. Moreover, it is easy to operationalize InfoQ in JMP. In this presentation we provide an overview of InfoQ along with use case studies, drawing from consumer research and pharmaceutical manufacturing, to illustrate how JMP can be used to make an InfoQ assessment, highlighting situations of both high and low InfoQ. We also give tips showing how JMP can be used to increase information quality by enhanced study design, without necessarily acquiring more data. This talk is aimed at statisticians, machine learning experts and data scientists whose job it is to turn numbers into information.

Auto-generated transcript...


Speaker

Transcript

Hello.
My name is Ron Kenett. This
is a joint talk with Chris
Gotwalt and we basically
have two messages
that should come out of the
talk. One is that we should
really be concerned about
information and information
quality. People tend to talk
about data and data quality, but
data is really not the the issue.
We are statisticians. We are
data scientists. We turn numbers,
data into information so that
our goal should be to make sure
that we generate high quality
information. The other message
is that JMP can help you
achieve that, and this is
actually turning out to be in
surprising ways. So by combining
the expertise of Chris and an
introduction to information
quality, we hope that these two
messages will come across
clearly. So if I had to
summarize what it is that we
want to talk about, after all,
it is all about information
quality. I gave a talk at
at the Summit in Prague four years
ago and that talk was generic.
It talked about moving the
journey from quality by design.
My journey doing information
quality. In this talk we focus
on how this can be done with
JMP. This is a more detailed
and technical talk than the
general talk I gave in Prague.
You can watch that talk.
There's a link listed here. You can
find it on the JMP community.
So we're going to talk about
information quality, which is
the potential of the data set, a
specific data set, to achieve a
particular goal, a specific goal,
with the given empirical method.
So in that definition we have
three components that are
listed. One is a certain data
set. Here is the data. The
second one is the goal,
the goal of the analysis, what
it is that we want to achieve.
And the third one is how we will
do that, which is, with what
methods we're going to generate
information, and that potential
is going to be assessed with the
utility function. And I will
begin with an introduction to
information quality, and then
Chris will will take over,
discuss the case study and and
show you how to conduct an
information quality assessment.
Eventually this should
answer the question how JMP
supports InfoQ, that
would be the the bullet points
that you can...the take away points
from the talk. The setup for
this is that we we encourage
what I called a lifecycle view
of statistics. In other words,
not just data analysis.
We should know...we should be part
of the problem elicitation
phase. Also, the goal
formulation phase, that deserves
a discussion. We should
obviously be involved in the
data collection scheme if it's
through experiments or through
surveys or through observational
data. Then we should also take
time for formulation of the
findings and not just pull out
printed reports on on
regression coefficients
estimates and and their
significance, but we should
also discuss what are the
findings? Operationalization of
findings meaning, OK, what can we
do with these findings? What are
the implications of the
findings? This should should...
needs to be communicated to the
right people in the right way,
and eventually we should do an
impact assessment to figure out,
OK, we did all this; what has
been the added value of our
work? I talked about the life
cycle of your statistics a few years
ago. This is the prerequisite,
the perspective to what
I'm going to talk about. So as I
mentioned, the information
quality is the potential of a
particular data set to achieve a
particular goal using given
empirical analysis methods. This
is identified through four
components the goal, the data,
the analysis method, and the
utility measure. So in a in a
mathematical expression, the
utility of applying f to x,
condition on the goal, is how we
identify InfoQ, information
quality. This was published in
the Royal Statistical Society
Series A in 2013 with eight
discussants, so it was amply
discussed. Some people thought
this was fantastic and some
people had a lot of critique on
that idea, so this is a wider
scope consideration of what
statistics is about. We also
wrote in 2006, we meeting myself
and Galit Shmueli, a book called
Information Quality. And in
the context of information
quality we did what is called
deconstruction. David Hand has
a paper called Deconstruction
of Statistics. This is the
deconstruction of information
quality into eight dimensions. I
will cover these eight dimensions.
That's my part in the talk and
then Chris will show how this
is implemented in a specific
case study.
Another aspect that relates to
this is another book I have.
This is recent, a year ago
titled The Real Work of Data
Science and we talk about what
is the role of the data
scientists in organizations and
in that context, we emphasized
the need for the data scientist
to be involved in the generation
of information as an...information
quality as meeting the goals of
the organization. So let me
cover the eight dimensions. That's
that's that's my intro. The
first one is data resolution. We
have a goal. OK, we we would
like to know the level of flu
because in the country or in the
area where we live, because that
will impact our decision on
whether to go to the park where
we could meet people or going to
a...to a jazz concert. And that
concert is tomorrow.
If we look up the CDC data on
the level of flu, that data is
updated weekly, so we could get
the red line in the graph you
have in front of you, so we
could get data of a few days
ago, maybe good, maybe not good
enough for our gold. Google Flu,
which is based on searches
related to flu, is updated
momentarily, so it's updated
online, it will probably give
us better information. So for
our goal, the blue line, the
Google trend...the Google Flu
Trends indicator, is probably
more appropriate. The second
dimension is data structure.
To meet our goal, we're going to
look at data. We should...we
should identify the data sources
and the structure in these data
sources. So some data could be
text, some could be video, some
could be, for example, the
network of recommendations. This
is an Amazon picture on how if
you look at the book, you're
going to get some
other books recommended. And if
you go to these other books,
you're going to have more data
recommended. So the data
structure can come in all sorts
of shapes and forms and this can
be text. This can be functional
data. This can be images. We are
not confined now to what we used
to call data, which is what you
find in an Excel spreadsheet.
The data could be corrupted,
could have missing values, could
have unusual patterns which
which would be
something to look into. Some
patterns, where things are
repeated. Maybe some of the data
is is just copy and paste and we
would like to be warned about
such options. The third
dimension is data integration.
When we consider the data from
these different sources, we're
going to integrate them so we
can do some analysis linkage
through an ID. For example, we
would do that, but in doing that,
we might create some issues, for
example, in in disclosing data
that normally should be
anonymized. Data
integration, yeah, that will
allow us to do fantastic things,
but if the data is perceived to
have some privacy exposure
issues, then maybe the quality
of the information from the
analysis that we're going to do
is going is going to be
affected. So data integration
should be looked into very, very
carefully. This is what people
likely used to call ETL
extract, transform and load. We
now have much better methods for
doing that. The join option, for
example, in JMP will offer
options for for doing that.
Temporal relevance. OK, that's
pretty clear. We have data. It is
stamped somehow. If we're going
to do the analysis later, later
after the data collection, and
if the deployement that we
consider is even later, then the
data might not be temporally
relevant. In a common
situation, if we want to compare
what is going on now, we would
like to be able to make this
comparison to recent data or
data before the pandemic
started, but not 10 years
earlier. The official statistics
on health records used to be two
or three years behind in terms
of timing, which made it very
difficult the use of official
statistics in assessing
what is going on with the
pandemic. Chronology of data and
goal is related to the decision
that we make as a result of our
goal. So if, for example, our
goal is to forecast air quality,
we're going to use some
predictive models on the Air
Quality Index reported on a
daily basis. This gives us a one
to six scale from hazardous to
good. There are some values
which are representing levels of
health concern. Zero-50 is good;
300-500 is hazardous and the
chronology of data and goal
means that we should be able to
make a forecast on a daily
basis. So the methods we use
should be updated on a daily
basis. If, on the other hand,
our goal is to figure out how is
this AQI index computed, then we
are not really bound by the the
the timeliness of the analysis.
You know, we could take our
time. There's no urgency in
getting the analysis done on a
daily basis. Generalizability,
the sixth dimension, is about
taking our findings and
considering where this could
apply in more general terms,
other populations, all
situations. This can be done
intuitively. Marketing managers who
have seen a study on the on the
market, let's call it A, might
already understand what are the
implications to Market B
without data. People who are
physicists will be able to
make predictions based on
mechanics on first principles
without without data.
So some of the generalizability
is done with data. This is the
basis of statistical
generalization, where we go from
the sample to the population.
Statistical inferences is about
generalization. We generalize
from the sample to the
population. And some can be
domain based, in other words,
using expert knowledge, domain
expertise, not necessarily with
data. We have to recognize that
generalizability is not just
done with statistics.
The seventh dimension is
construct operationalization,
which is really about what it is
that we measure. We want to
assess behavior, emotions, what
it is that we can measure, that
will give us data that reflects
behavior or emotions.
The example I give here
typically is pain.
We know what is pain. How do
we measure that? If you go to a
hospital and you ask the nurse,
how do you assess pain, they
will tell you, we have a scale,
1 to 10. It's very
qualitative, not very
scientific, I would say. If we
want to measure the level of
alcohol in drivers on the...on
the road, it will be difficult to
measure. So we might measure
speed as a surrogate measure.
Another part of
operationalization is the other
end of the story. In other
words, the first part, the
construct is what we measure,
which reflects our goal. The the
end...the end result here is that
we have findings and we want to
do something with them. We want
to operationalize our finding.
This is what the action
operationalization is about.
It's about what you do with the
findings and then being
presented here on a podium. We
used to ask three questions.
These are very important
questions to ask. Once you have
done some analysis, you have
someone in front of you who
says, oh, thank you very much,
you're done...you, the statistician
or the data scientist. So this
this takes you one extra step,
getting you to ask your customer
these simple questions What do
you want to accomplish? How will
you do that and how will you
know if you have accomplished
that? We we can help answer, or
at least support, some of these
questions we've answered.
The eighth dimension is
communication. I'm giving you an
example from a very famous old
map from the 19th century, which
is showing the march of the
Napoleon Army from France to
Moscow to Russia. You see the
numbers are the the width of
the path indicates the size
of the army, and then on on in
black you see what happened to
them on their on their way back.
So basically this was a
disastrous march. We we we we
can relate this old map to
existing maps, and there is a
JMP add-in, which you can
find on the JMP Community, to to
show you with bubble plots,
dynamic bubble plots what this
looks like. So I I've covered
very quickly the eight information
quality dimensions. My last
slide is that what I've talked
about from a historical
perspective, really put some
proportions to what I'm saying.
I think we are really in the era
of information quality. We used
to be concerned with product
quality in the 18th century, the
17th century. We then moved to
process quality and service
quality. This is a short memo
on proposing a control chart,
1924, I think.
Then we move to management
quality. This is the Juran
trilogy of design, improvement
and control. Six Sigma (define,
measure, analyze, improve)
control process is the
improvement process of Juran,
and Juran was the grand father
of Six Sigma in that sense.
Then in the '80s, Taguchi came
in. He talks about robust
design. How can we handle
variability in inputs by proper
design decisions? And now we are
in the Age of information
quality. We have sensors. We
have flexible systems. We are
depending on AI and machine
learning and data mining and we
are gathering big big big
numbers, but which we call big
data. The interest in information
quality should be a prime prime
interest. I'm going to try and
convince you of, with the help
of Chris, that.
We are here and JMP can
help us achieve that in in a
really unusual way.
What you will see at the end of
the case study that Chris will
show is also how to do it an
information quality assessment and
on a specific study, basically
generate an information quality
score. So if we go top down, I
can tell you this study, this
work, this analysis is maybe 80% or
maybe 30% or maybe 95%.
And through the example you will
see how to do that. There is a
JMP add-in to provide this
assessment. It's it's actually
quite quite easy. There's
nothing really sophisticated
about that. So I'm done and
Chris, after you. Thanks, Ron. So
now I'm going to go through the
analysis of a data set in a way
that explicitly calls out the
various aspects of information
quality and show how JMP can be
used to assess an improvement
for InfoQ. So first off, I'm
going to go through the InfoQ
components. The first InfoQ
component is the goal, so in
this case the problem statement
was that a chemical company
wanted a formulation that
maximized product yield while
minimizing a nuisance impurity
that resulted from the reaction.
So that was the high level goal.
In statistical terms, we wanted
to find a model that accurately
predicted a response on a data
set so that we could find a
combination of ingredients and
processing steps that would lead
to a better product.
The data are set up in 100
experimental formulations with
one primary ingredient, X1,and
10 additives. There's also a
processing factor in 13
responses. The data are
completely fabricated but were
simulated to illustrate the same
strengths and weaknesses of the
original data. The data
formulation was made was also
recorded. We will be looking at
this data closely, so I want to
elaborate beyond pointing out
that they were collected in an
ad hoc way, changing one or two
additives at a time rather than
as a designed or randomized
experiment. There's a lot of
ways to analyze this data, the
most typical being least
squares modeling with forward
selection on selected responses.
That was my original intention
for this talk, but when I showed
the data to Ron, he immediately
recognized the response columns
as time series from analytical
chemistry. Even though the data
were simulated, he could see the
structure. He could see things
in the data that I didn't see
and read into it wasn't. I found
this to be strikingly
impressive. It's beyond the
scope of this talk, but there is
an even better approach based on
ensemble modeling using
fractionally weighted
bootstrapping. Phil Ramsey,
Wayne Levin and I have another
talk about this methodology at
the European Discovery
Conference this year. The
approach is promising because it
can fit models to data with
more active interactions than
there are runs. The fourth and final
component of information quality
is utility, which is how we measure how well we have achieved our goals. There's a domain aspect: in this case, we want a formulation that leads to maximized yield and minimized waste in post-processing of the material. The statistical analysis utility refers to the model that we fit; what we're going for there is the least squares accuracy of our model, in terms of how well we're able to predict what would result from candidate combinations of the mixture factors. Now I'm going
to go through a set of questions
that make up a detailed InfoQ
assessment, organized into the
eight dimensions of information
quality. I want to point out
that not all questions will be
equally relevant to different
data science and statistical
projects, and that this is not
intended to be rigid dogma but
rather a set of things that are
a good idea to ask oneself.
These questions represent a kind
of data analytic wisdom that
looks more broadly than just the
application of a particular
statistical technology. A copy
of a spreadsheet with these
questions along with pointers to
JMP features that are the most
useful for answering a
particular one will be uploaded
to the JMP Community along
with this talk for you to use. As
I proceed through the questions,
I'll be demoing an analysis of
the data in JMP. So Question 1 is: is the data scale used aligned with the stated goal? The Xs that we have consist of a single categorical variable, processing, and 11 continuous inputs. These are measured as percentages and are recorded to the nearest half a percent. We
don't have the total amounts of
the ingredients, only the
percentages. The totals are
information that was either lost
or never recorded. There are
other potentially important
pieces of information that are
missing here. The time between
formulating the batches and
taking the measurements is gone
and there could have been other
covariate level information that
is missing here that would have
described the conditions under
which the reaction occurred.
Without more information than I
have, I cannot say how important
this kind of covariate information
would have been. We do have
information on the day of the
batch, so that could be used as
a surrogate possibly. Overall we
have what are, hopefully, the most
important inputs, as well as
measurements of the responses we
wish to optimize. We could have
had more information, but this looks promising enough to try an analysis with. The second question related to data resolution is: how reliable and precise are the measuring devices and data sources? The fact
is, we don't have a lot of
specific information here. A statistician internal to the company would have had more information. In this case, we have no choice but to trust that the chemists formulated and recorded the mixtures well. The third question related to data resolution is: is the data analysis suitable for the data
aggregation level? And the
answer here is yes, assuming
that their measurement system is
accurate and that the data are
clean enough. What we're going
to end up doing actually is
we're going to use the
Functional Data Explorer to
extract functional principal components, which are a data-derived kind of aggregation. And then we're
going to be modeling those
functional principal components
using the input variables. So
now we move on to the data
structure dimension and the
first question we ask is, is the
data used aligned with the
stated goal? And I think the
answer is a clear yes here. We're
trying to maximize
yield. We've got measurements for
that, and the inputs are
recorded as Xs. The second data
structure question is where
things really start to get
interesting for me. So this is: are the data integrity issues (outliers, missing values, data corruption) described and handled appropriately? From here we can use JMP to understand where the outliers are, figure out strategies for what to do about missing values, observe their patterns, and so on. So this is where things are going to get a little
bit more interesting. The first
thing we're going to do is determine if there are
any outliers in the data that we
need to be concerned about. So
to do that, we're going to go
into the explore outliers
platform off of the screening
menu. We're going to load up the
response variables, and because
this is a multivariate setting,
we're going to use a new feature
in JMP Pro 16 called Robust
PCA Outliers. So we see where
the large residuals are in those
kind of Pareto type plots.
There's a snapshot showing where
there's some potentially
unusually large observations. I don't think this looks too unusual or worrisome. We can save the large outliers to a data table and then look at them in the distribution platform, and what we see looks kind of like a normal distribution with the middle taken out. So I think these data are coming from the same population, and there's nothing really to worry about here, outliers-wise.
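To make the idea concrete outside of JMP, here is a minimal Python sketch of multivariate outlier screening based on the residuals from a low-rank PCA fit. It is not the Robust PCA Outliers algorithm in JMP Pro 16; the data, the number of components and the cutoff rule are all made up for illustration.

```python
# Minimal sketch: flag rows whose residuals from a low-rank PCA fit are unusually large.
# This is a simple stand-in, not JMP Pro's Robust PCA Outliers method.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
Y = rng.normal(size=(100, 13))          # stand-in for the 13 response columns
Y[5, 3] += 8.0                           # plant one artificial outlying cell

pca = PCA(n_components=4).fit(Y)
residuals = Y - pca.inverse_transform(pca.transform(Y))  # low-rank reconstruction error
row_score = np.sqrt((residuals ** 2).sum(axis=1))        # per-row residual norm

mad = np.median(np.abs(row_score - np.median(row_score)))
cutoff = np.median(row_score) + 3 * mad                  # a crude robust cutoff
print("rows flagged as potential outliers:", np.where(row_score > cutoff)[0])
```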
So once we've taken care of the outlier situation, we go in and explore missing values. What we're going to do first is load the Ys into the platform, and then use the missing value snapshot to see what patterns there are amongst our missing values. It looks like the
missing values tend to occur in horizontal clusters, and the same values are also missing across rows; you can see that with the black splotches here. Then we'll apply Automated Data Imputation (ADI), which saves formula columns that impute the missing values using a regression-type
algorithm that was developed by
a PhD student of mine named Milo
Page at NC State. So we can play around a little bit and get a sense of how the ADI algorithm is working. It has created these formula columns that peel off elements of the ADI impute column, which is a vector formula column, and the scoring impute function calculates the expected value of the missing cells given the non-missing cells whenever it encounters a missing value; otherwise it just carries through the non-missing value. So you can see 207 in Y6 there. It's initially 207, but then I change it to missing and it's now imputed to be 234.
I'll do this a couple of times so you can see how it's working. Here I'll put in a big value for Y7, and that's now been replaced. And if we go down and add a row, then all values are missing initially and the column means are used for the imputations. If I were to add values for some of those missing cells, it would start computing the conditional expectation of the still-missing cells using the information in the non-missing ones.
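Here is a minimal Python sketch of the idea behind that scoring step: impute each missing cell with its expected value given the non-missing cells, here under a simple multivariate normal model. The ADI algorithm itself is a more sophisticated regression-based method, and the numbers below are fabricated.

```python
# Minimal sketch of conditional-mean imputation: E[missing | observed] under N(mu, Sigma).
import numpy as np

def conditional_mean_impute(row, mu, Sigma):
    """Fill NaNs in `row` with their conditional mean given the observed entries."""
    miss = np.isnan(row)
    if not miss.any():
        return row
    obs = ~miss
    S_mo = Sigma[np.ix_(miss, obs)]
    S_oo = Sigma[np.ix_(obs, obs)]
    filled = row.copy()
    filled[miss] = mu[miss] + S_mo @ np.linalg.solve(S_oo, row[obs] - mu[obs])
    return filled

rng = np.random.default_rng(1)
Y = rng.multivariate_normal(mean=[10, 20, 30],
                            cov=[[4, 2, 1], [2, 5, 2], [1, 2, 6]], size=200)
mu, Sigma = Y.mean(axis=0), np.cov(Y, rowvar=False)   # parameters estimated from the data

row = Y[0].copy()
row[1] = np.nan                                       # pretend the second response is missing
print(conditional_mean_impute(row, mu, Sigma))
```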
So our next question on data structure is: are the analysis methods
suitable for the data structure?
So we've got 11 mixture inputs
and a processing variable that's
categorical. Those are going
to be inputs into a least
squares type model. We have
13 continuous responses and
we can model them individually using least squares, or we can model functional principal components. Now, there are problems. The input variables have not been randomized at all. It's very clear that they would muck around with one or more of the compounds and then move on to another one, so the order in which the input variables were varied was kind of haphazard. It's a clear lack of randomization, and that's going to negatively impact the generalizability and strength of our conclusions.
Data integration is the third
InfoQ dimension. These data
are manually entered lab notes
consisting mostly of mixture
percentages and equipment
readouts. We can only assume
that the data were entered
correctly and that the Xs are
aligned properly with responses.
If that isn't the case, then the model will have serious bias problems and problems with generalizability. Integration is more of an issue with observational data science problems and machine learning exercises than with lab
experiments like this. Although it doesn't apply here, I'll point out that privacy and confidentiality concerns can be identified by modeling the sensitive part of the data using the to-be-published part of the data. If the resulting model is predictive, then one should be concerned that privacy requirements are not being met.
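Here is a minimal Python sketch of that disclosure check: try to predict a sensitive column from the columns you intend to publish, and treat a high cross-validated R-squared as a warning sign. The data, model and threshold are all hypothetical.

```python
# Minimal sketch of a disclosure check: can the to-be-published columns predict the
# sensitive one? A high cross-validated R^2 suggests a privacy risk.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(4)
public = rng.normal(size=(500, 6))                      # columns planned for release
sensitive = public[:, 0] * 2 + public[:, 3] + rng.normal(scale=0.1, size=500)  # leaks badly

r2 = cross_val_score(RandomForestRegressor(n_estimators=200, random_state=0),
                     public, sensitive, cv=5, scoring="r2").mean()
print(f"cross-validated R^2 = {r2:.2f}",
      "-> potential privacy concern" if r2 > 0.5 else "-> probably safe to release")
```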
Temporal relevance refers to the
operational time sequencing of
data collection, analysis and
deployment, and whether gaps between those stages lead to a decrease in the usefulness of the information in the study. In this case, we can only hope that the material supplies are reasonably consistent and that the test equipment is reasonably accurate, which is an unverifiable assumption at this point. The time resolution we have for the data collection is at the day level, which means there isn't much of a way to verify whether there is time variation within each day.
Chronology of data and goal is
about the availability of
relevant variables both in terms
of whether the variable is
present at all in the data or
whether the right information
will be available when the model
is deployed. For predictive
models, this relates to models
being fit to data similar to
what will be present at the time
the model will be evaluated on
new data. In this regard, our data set is certainly fine. For
establishing causality, however,
we aren't in nearly as good a
shape because the lack of
randomization implies that time
effects and factor effects may
be confounded, leading to bias
in our estimates. Endogeneity, or reverse causation, could clearly be an issue here, as variables like temperature and reaction time could be impacting the responses but have been left unrecorded. Overall,
there is a lot we don't know
about this dimension in an
information quality sense.
The rest of the InfoQ
assessment is going to be
dependent upon the type of
analysis that we do. So at this
point I'm going to go ahead and
conduct an analysis of this data
using the Functional Data
Explorer platform in JMP Pro
that allows me to model across
all the columns simultaneously
in a way that's based on
functional principal components,
which contain the maximum amount
of information across all those
columns as represented in the
most efficient format possible.
I'm going to be working on the
imputed versions of the columns
that I calculated earlier in the
presentation. And I'm going to
point out that I'm going to be working to find combinations of the mixture factors that achieve, as closely as possible in a least squares sense, an ideal curve created by the practitioner that maximizes the
amount of potential product that
could be in a batch while
minimizing the amount of the
impurities that they
realistically thought a batch
could contain. So I begin the analysis by going to the Analyze menu and bringing up the Functional Data Explorer, with rows as functions. I'm going to load up my imputed responses, and then put in my formulation components and my processing column as supplementary variables. The ID function is the batch ID. Once I'm in, I can see the functions, both overlaid all together and individually.
Then I can load up the target
function, which is the ideal.
And that will change the
analysis that results once I
start going into the modeling
steps. These are pretty simple functions, so I'm just going to model them with B-splines.
And then I'm going to go into my
functional DOE analysis.
This is going to fit the model
that connects the inputs into
the functional principal
components and then connect all
the way through the eigenfunctions, so that we're able to recover how the overall functions change as we vary the mixture factors. The
functional principal component
analysis has indicated that
there are four dimensions of
variation in these response
functions. To understand what
they mean, let's go ahead and
explore with the FPC profiler.
So watch this pane right here as
I adjust FPC 1 and we can see
that this FPC is associated with
peak height. FPC 2 looks like it's related to peak narrowness; it's almost like a resolution principal component. The third one is related to a kind of knee on the left of the dominant peak. And FPC 4 looks like it's primarily related to the impurity. So that's the underlying meaning of these four functional principal components.
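For readers who want to see the mechanics outside of JMP, here is a minimal Python sketch of functional principal components on a set of curves: smooth each batch's readings with a spline and run PCA on the smoothed curves. It is far simpler than the Functional Data Explorer machinery, and the curves below are fabricated.

```python
# Minimal sketch of FPCA on a set of response curves: spline-smooth each batch's
# 13 readings onto a common grid, then use PCA so the eigenfunctions capture the
# dominant shapes of variation.
import numpy as np
from scipy.interpolate import UnivariateSpline
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
t = np.arange(13)                                   # measurement index for the 13 responses
grid = np.linspace(0, 12, 50)                       # finer grid for the smoothed curves

# Fabricated curves: a dominant peak whose height and width vary by batch, plus noise.
heights, widths = rng.uniform(5, 15, 100), rng.uniform(1.5, 3.0, 100)
Y = heights[:, None] * np.exp(-((t - 6) ** 2) / (2 * widths[:, None] ** 2))
Y += rng.normal(scale=0.3, size=Y.shape)

smoothed = np.array([UnivariateSpline(t, y, s=1.0)(grid) for y in Y])

fpca = PCA(n_components=4).fit(smoothed)
scores = fpca.transform(smoothed)                   # one set of FPC scores per batch
print("variance explained by the 4 FPCs:", np.round(fpca.explained_variance_ratio_, 3))
```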
So we've characterized our goal
as maximizing the product and
minimizing the impurity, and
we've communicated that into the
analysis through this ideal or
golden curve that we supplied at
the beginning of the FDE
exercise we're doing. To get as
close as possible to that ideal
curve, we turn on desirability
functions. And then we can go
out and maximize desirability.
And we find that the optimal combination of inputs is about 4.5% of Ingredient 4, 2% of Ingredient 6, 2.2% of Ingredient 8, and 1.24% of Ingredient 9, using processing method two. Let's review how
we've gotten here. We first imputed the missing response columns. Then we found B-spline
models that fit those functions
well in the FDE platform. A
functional principal components
analysis determined that there
were four eigenfunctions
characterizing the variation in
this data. These four
eigenfunctions were determined
via the FPC profiler to each
have a reasonable subject
matter meaning. The functional
DOE analysis consisted of
applying pruned forward
selection to each of the
individual FPC scores using the
DOE factors as input variables.
And we see here that this has found combinations of
interactions and main effects
that were most predictive for
each of the functional principal
component scores individually.
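Here is a minimal Python sketch of the "forward selection on each FPC score" step, using only main effects of hypothetical mixture inputs and a fixed number of terms as the stopping rule; JMP's pruned forward selection also considers interactions and uses an information criterion.

```python
# Minimal sketch of forward selection of inputs for one FPC score.
import numpy as np
from sklearn.linear_model import LinearRegression

def forward_select(X, y, max_terms=4):
    remaining, chosen = list(range(X.shape[1])), []
    for _ in range(max_terms):
        # Add the column that reduces the residual sum of squares the most.
        best = min(remaining,
                   key=lambda j: np.sum((y - LinearRegression()
                                         .fit(X[:, chosen + [j]], y)
                                         .predict(X[:, chosen + [j]])) ** 2))
        chosen.append(best)
        remaining.remove(best)
    return chosen

rng = np.random.default_rng(3)
X = rng.dirichlet(np.ones(11), size=100)            # 11 hypothetical mixture proportions
fpc1 = 30 * X[:, 3] - 12 * X[:, 7] + rng.normal(scale=0.5, size=100)  # fabricated FPC 1 score
print("inputs selected for FPC 1:", forward_select(X, fpc1))
```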
The Functional DOE Profiler
has elegantly captured all
aspects into one representation
that allows us to find the formulation and processing step that is predicted to have desirable
properties as measured by high
yield and low impurity.
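To illustrate the optimization step outside of JMP, here is a minimal Python sketch that treats the fitted functional model as a black box and searches for mixture proportions whose predicted curve is as close as possible, in a least squares sense, to a target curve. The predict_curve function and the target below are fabricated stand-ins, and JMP does this through desirability functions in the profiler rather than an explicit call to an optimizer.

```python
# Minimal sketch: find mixture proportions whose predicted curve is closest to a target.
import numpy as np
from scipy.optimize import minimize

grid = np.linspace(0, 12, 50)
target = 12 * np.exp(-((grid - 6) ** 2) / 4)        # tall, narrow "ideal" peak, no impurity bump

def predict_curve(x):
    # Fabricated response surface: peak height and width depend on two of the proportions.
    height = 5 + 40 * x[3] - 15 * x[7]
    width = 1.5 + 4 * x[7]
    return height * np.exp(-((grid - 6) ** 2) / (2 * width ** 2))

def sse_to_target(x):
    return np.sum((predict_curve(x) - target) ** 2)

x0 = np.full(11, 1 / 11)                             # start at an equal-parts blend
res = minimize(sse_to_target, x0, method="SLSQP",
               bounds=[(0, 1)] * 11,
               constraints=[{"type": "eq", "fun": lambda x: x.sum() - 1}])
print("suggested proportions:", np.round(res.x, 3))
```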
So now we can do an InfoQ assessment of the generalizability of the data and the analysis. In this case,
we're more interested in
scientific generalizability, as
the experimenter is a deeply
knowledgeable chemist working
with this compound. So we're
going to be relying more on
their subject matter expertise than on statistical principles
and tools like hypothesis tests
and so forth. The goal is
primarily predictive, but the
generalizability is kind of
problematic because the
experiment wasn't designed. Our
ability to estimate interactions
is weakened for techniques like
forward selection and impossible
via least squares analysis of
the full model. Because the
study wasn't randomized, there could be unrecorded time and order effects. We don't have
potentially important covariate
information like temperature and
reaction time. This creates
another big question mark
regarding generalizability.
Repeatability and
reproducibility of the study is
also an unknown here as we have
no information about the
variability due to the
measurement system. Fortunately,
we do have tools like JMP's Evaluate Design to understand the existing design, as well as Augment Design, which can greatly enhance the generalization performance of the analysis.
Augment can improve information
about main effects and
interactions, and a second round
of experimentation could be
randomized to also enhance
generalizability. So now I'm
going to go through a couple of
simple steps to show how to
improve the generalization
performance of our study using
design tools in JMP. Before I
do that, I want to point out
that I had to take the data and
convert it so that it was
proportions rather than percentages; otherwise the design tools were not really agreeing with the data very well. So we go into the Evaluate Design platform and load up our Ys and our Xs. I requested the ability to handle second-order interactions.
Then I got an alert saying, essentially, that it can't do that, because we're not able to estimate all of the interactions given the one-factor-at-a-time data that we have.
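Here is a minimal Python sketch of why that alert appears: with one-factor-at-a-time runs, the model matrix for a model with all two-factor interactions is rank deficient, so some interactions simply cannot be estimated. The OFAT design below is fabricated.

```python
# Minimal sketch: the second-order model matrix of a one-factor-at-a-time design is
# rank deficient, so not all interactions are estimable.
import numpy as np
from itertools import combinations

def second_order_matrix(X):
    cols = [np.ones(len(X))] + [X[:, j] for j in range(X.shape[1])]
    cols += [X[:, i] * X[:, j] for i, j in combinations(range(X.shape[1]), 2)]
    return np.column_stack(cols)

# OFAT runs: start at a baseline blend and move each of 5 factors separately.
baseline = np.full(5, 0.2)
steps = [-0.1, -0.05, 0.05, 0.1]
ofat = np.vstack([baseline] + [baseline + s * np.eye(5)[j]
                               for j in range(5) for s in steps])

M = second_order_matrix(ofat)
print(f"{M.shape[1]} model terms requested, but the model matrix rank is only "
      f"{np.linalg.matrix_rank(M)} from {len(ofat)} one-factor-at-a-time runs")
```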
So I backed up. We go to the Augment Design platform, load everything up, and set augment. We'll choose an I-
optimal design because we're
really concerned with prediction performance here. And I set the number of runs to 148. The custom designer requested 141 as a minimum, but I went to 148 just to make sure that we've got the ability to estimate all of our interactions
pretty well. After that, it
takes about 20 seconds to
construct the design. So now
that we have the design, I'm
going to show the two most
important diagnostic tools in
the augment designer for
evaluating a design. On the
left, we have the fraction of
design space plot. This is
showing that 50% of the volume of the design space has a relative prediction variance less than 1, where 1 would be equivalent to the residual error. So we're able to get better-than-measurement-error quality predictions over the majority of the design space. On the right we have the color map on correlations. This is showing that we're able to estimate everything pretty well. Because of the mixture constraint, we're getting some strong correlations between interactions and main effects.
Overall, the effects are fairly clean: the interactions are pretty well separated from one another, and the main effects are pretty well separated from one another as well.
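Here is a minimal Python sketch of the quantity that the fraction of design space plot summarizes: the relative prediction variance x'(X'X)^{-1}x of a fitted model over random points in the factor space. The design and model below are a generic main-effects example, not the augmented mixture design from the talk.

```python
# Minimal sketch of a fraction-of-design-space summary: quantiles of the relative
# prediction variance over random points in the factor space.
import numpy as np

rng = np.random.default_rng(5)

def model_matrix(points):
    # Intercept plus main effects; a second-order model would add interactions and squares.
    return np.column_stack([np.ones(len(points)), points])

design = rng.uniform(-1, 1, size=(20, 3))            # stand-in 20-run design in 3 factors
XtX_inv = np.linalg.inv(model_matrix(design).T @ model_matrix(design))

space = rng.uniform(-1, 1, size=(5000, 3))           # random points filling the design space
F = model_matrix(space)
rel_var = np.einsum("ij,jk,ik->i", F, XtX_inv, F)    # relative prediction variance per point

for frac in (0.5, 0.9):
    print(f"{int(frac * 100)}% of the design space has relative prediction variance "
          f"<= {np.quantile(rel_var, frac):.3f}")
```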
After looking at the design diagnostics, we can make the table. Here, I have shown the
first 13 of the augmented runs
and we see that we have more randomization; we don't have streaks where the same main effect is used over and over again. That's evidence of better randomization, and overall the design is going to be able to estimate the main effects and interactions much better, having received higher quality information in this second stage of experimentation. So the input variables, the Xs, are accurate representations of the mixture proportions, so that's a clear match to what we're interested in. The responses are close surrogates for the amount of product and the amount of impurity in the batches. We're pretty good on question 7.1 there. The justifications
are clear. After the study, we
can of course go prepare a batch with the formulation recommended by the FDOE Profiler, try it out, and see if we're getting the kind of performance that we were looking for. It's very clear that that
would be the way that we can
assess how well we've achieved
our study goals. So now we come to the last InfoQ dimension, communication. By describing the ideal curve as a target function, the Functional DOE Profiler makes the goal and the results of the analysis crystal clear, and this can be expressed
at a level that is easily
interpreted by the chemists and
managers of the R&D facility.
And as we have done our detailed
information quality assessment,
we've been upfront about the
strengths and weaknesses of the
study design and data
collection. If the results do not generalize, we certainly know where to look for the problems. Once you become
familiar with the concepts,
there is a nice add-in written
by Ian Cox that you can use to
do a quick quantitative InfoQ
assessment. The add-in has
sliders for the upper and lower
bounds of each InfoQ dimension.
These dimensions are combined using a desirability function approach into an overall interval for InfoQ, shown over on the left.
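As an illustration of how dimension ratings might be rolled up into a score interval, here is a minimal Python sketch that combines the eight InfoQ dimensions with a geometric mean of desirabilities. The exact formula used by the add-in may differ, and the ratings below are made up.

```python
# Minimal sketch: combine eight InfoQ dimension ratings into an overall score interval
# using a geometric mean of desirabilities.
import numpy as np

dimensions = ["Data resolution", "Data structure", "Data integration",
              "Temporal relevance", "Chronology of data and goal",
              "Generalizability", "Operationalization", "Communication"]

lower = np.array([0.6, 0.7, 0.5, 0.5, 0.4, 0.3, 0.7, 0.8])   # slider lower bounds (0 to 1)
upper = np.array([0.8, 0.9, 0.7, 0.7, 0.6, 0.5, 0.9, 1.0])   # slider upper bounds (0 to 1)

def infoq_score(d):
    return np.prod(d) ** (1 / len(d))                          # geometric mean of desirabilities

print(f"InfoQ score interval: {infoq_score(lower):.0%} to {infoq_score(upper):.0%}")
```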
Here is an assessment for the
data and analysis I covered in
this presentation. The add-in is
also a useful thinking tool that
will make you consider each of
the InfoQ dimensions. It's also a
practical way to communicate
InfoQ assessments to your
clients or to your management, as
it provides a high level view of
information quality without
using a lot of technical
concepts and jargon. The add-in
is also useful as the basis for
an InfoQ comparison. My
original hope for this
presentation was to be a little
bit more ambitious. I had hoped
to cover the analysis I had
just gone through, as well as
another, simpler one, where I skip imputing the responses and fit a simple multivariate linear regression model to the response columns. Today, I'm only able to
offer a final assessment of that
approach. As you can see,
several of the InfoQ
dimensions suffer substantially
without the more sophisticated
analysis. It is very clear that the simple analysis leads to a much lower InfoQ score.
The upper limit of the simple
analysis isn't that much higher
than the lower limit of the more
sophisticated one. With
experience, you will gain
intuition about what a good InfoQ
score is for data science
projects in your industry and
pick up better habits, as you will no longer be blind to the information bottlenecks in your data collection, analysis and model deployment. The add-in gives you information quality assessment with an easy-to-use interface. This was my first
formal information quality
assessment. Speaking for myself,
the information quality
framework has given words and
structure to a lot of things I
already knew instinctively. It has already changed how I approach new data analysis
projects. I encourage you to go
through this process yourself on
your own data, even if that data
and analysis is already very
familiar to you. I guarantee
that you will be a wiser and
more efficient data scientist
because of it. Thank you.

