Ron Kenett, Chairman, KPA Group and Samuel Neaman Institute, Technion
Christopher Gotwalt, JMP Director, Statistical R&D, SAS
Data analysis – from designed experiments, generalized regression to machine learning – is being deployed at an accelerating rate. At the same time, concerns with reproducibility of findings and flawed p-value interpretation indicate that well-intentioned statistical analyses can lead to mistaken conclusions and bad decisions. For a data analysis project to properly fulfill its goals, one must assess the scope and strength of the conclusions derived from the data and tools available. This focus on statistical strategy requires a framework that isolates the components of the project: the goals, data collection procedure, data properties, the analysis provided, etc. The InfoQ Framework provides structured procedures for making this assessment. Moreover, it is easy to operationalize InfoQ in JMP. In this presentation we provide an overview of InfoQ along with use case studies, drawing from consumer research and pharmaceutical manufacturing, to illustrate how JMP can be used to make an InfoQ assessment, highlighting situations of both high and low InfoQ. We also give tips showing how JMP can be used to increase information quality by enhanced study design, without necessarily acquiring more data. This talk is aimed at statisticians, machine learning experts and data scientists whose job it is to turn numbers into information.
Speaker |
Transcript |
Hello. | |
My name is Ron Kenett. This | |
is a joint talk with Chris | |
Gotwalt and we basically | |
have two messages | |
that should come out of the | |
talk. One is that we should | |
really be concerned about | |
information and information | |
quality. People tend to talk | |
about data and data quality, but | |
data is really not the issue. | |
We are statisticians. We are | |
data scientists. We turn numbers, | |
data into information so that | |
our goal should be to make sure | |
that we generate high quality | |
information. The other message | |
is that JMP can help you | |
achieve that, and this is | |
actually turning out to happen in surprising ways. So by combining | |
the expertise of Chris and an | |
introduction to information | |
quality, we hope that these two | |
messages will come across | |
clearly. So if I had to | |
summarize what it is that we | |
want to talk about, after all, | |
it is all about information | |
quality. I gave a talk at the Summit in Prague four years ago, and that talk was generic. It described my journey from quality by design to information quality. In this talk we focus | |
on how this can be done with | |
JMP. This is a more detailed | |
and technical talk than the | |
general talk I gave in Prague. | |
You can watch that talk. | |
There's a link listed here. You can | |
find it on the JMP community. | |
So we're going to talk about | |
information quality, which is | |
the potential of the data set, a | |
specific data set, to achieve a | |
particular goal, a specific goal, | |
with the given empirical method. | |
So in that definition we have | |
three components that are | |
listed. One is a certain data | |
set. Here is the data. The | |
second one is the goal, | |
the goal of the analysis, what | |
it is that we want to achieve. | |
And the third one is how we will | |
do that, which is, with what | |
methods we're going to generate | |
information, and that potential | |
is going to be assessed with the | |
utility function. And I will | |
begin with an introduction to | |
information quality, and then | |
Chris will take over, discuss the case study, and | |
show you how to conduct an | |
information quality assessment. | |
Eventually this should answer the question of how JMP supports InfoQ; those would be the takeaway points from the talk. The setup for this is that we encourage what I call a lifecycle view | |
of statistics. In other words, | |
not just data analysis. | |
We should be part | |
of the problem elicitation | |
phase. Also, the goal | |
formulation phase, that deserves | |
a discussion. We should | |
obviously be involved in the | |
data collection scheme if it's | |
through experiments or through | |
surveys or through observational | |
data. Then we should also take | |
time for formulation of the | |
findings and not just print out reports on regression coefficient estimates and their | |
significance, but we should | |
also discuss what are the | |
findings? Operationalization of | |
findings meaning, OK, what can we | |
do with these findings? What are | |
the implications of the | |
findings? This | |
needs to be communicated to the | |
right people in the right way, | |
and eventually we should do an | |
impact assessment to figure out, | |
OK, we did all this; what has | |
been the added value of our | |
work? I talked about this lifecycle view of statistics a few years ago. This is the prerequisite, the perspective for what | |
I'm going to talk about. So as I | |
mentioned, the information | |
quality is the potential of a | |
particular data set to achieve a | |
particular goal using given | |
empirical analysis methods. This | |
is identified through four | |
components: the goal, the data, | |
the analysis method, and the | |
utility measure. So in a mathematical expression, the utility of applying f to X, conditional on the goal, is how we identify InfoQ, information quality. | |
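Written out in the notation of the InfoQ literature, with g the analysis goal, X the data set, f the empirical analysis method, and U the utility measure, the definition is:

$$\mathrm{InfoQ}(g, X, f, U) \;=\; U\!\left(f(X \mid g)\right)$$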
This was published in the Royal Statistical Society | |
Series A in 2013 with eight | |
discussants, so it was amply | |
discussed. Some people thought | |
this was fantastic and some | |
people had a lot of critique on | |
that idea, so this is a wider | |
scope consideration of what | |
statistics is about. We also | |
wrote in 2016, we meaning myself and Galit Shmueli, a book called | |
Information Quality. And in | |
the context of information | |
quality we did what is called | |
deconstruction. David Hand has | |
a paper called Deconstructing Statistical Questions. This is the | |
deconstruction of information | |
quality into eight dimensions. I | |
will cover these eight dimensions. | |
That's my part in the talk and | |
then Chris will show how this | |
is implemented in a specific | |
case study. | |
Another aspect that relates to | |
this is another book I have. | |
This is recent, a year ago | |
titled The Real Work of Data | |
Science and we talk about what | |
is the role of the data | |
scientists in organizations and | |
in that context, we emphasized | |
the need for the data scientist | |
to be involved in the generation of information, with information quality defined as meeting the goals of the organization. So let me | |
cover the eight dimensions. That's my intro. The | |
first one is data resolution. We | |
have a goal. OK, we would like to know the level of flu in the country or in the area where we live, because that will impact our decision on whether to go to the park, where we could meet people, or to a jazz concert. And that concert is tomorrow. | |
If we look up the CDC data on | |
the level of flu, that data is | |
updated weekly, so we could get | |
the red line in the graph you | |
have in front of you, so we | |
could get data of a few days | |
ago, maybe good, maybe not good | |
enough for our goal. Google Flu, | |
which is based on searches | |
related to flu, is updated continuously, online, so it will probably give us better information. So for | |
our goal, the blue line, the Google Flu Trends indicator, is probably | |
more appropriate. The second | |
dimension is data structure. | |
To meet our goal, we're going to | |
look at data. We should...we | |
should identify the data sources | |
and the structure in these data | |
sources. So some data could be | |
text, some could be video, some | |
could be, for example, the | |
network of recommendations. This | |
is an Amazon picture on how if | |
you look at the book, you're | |
going to get some | |
other books recommended. And if | |
you go to these other books, | |
you're going to have more books recommended. So the data | |
structure can come in all sorts | |
of shapes and forms and this can | |
be text. This can be functional | |
data. This can be images. We are | |
not confined now to what we used | |
to call data, which is what you | |
find in an Excel spreadsheet. | |
The data could be corrupted, | |
could have missing values, could | |
have unusual patterns, which would be something to look into. Some patterns, where things are repeated, maybe some of the data is just copy and paste, and we would like to be warned about such issues. The third | |
dimension is data integration. | |
When we consider the data from | |
these different sources, we're | |
going to integrate them so we | |
can do some analysis linkage | |
through an ID. For example, we | |
would do that, but in doing that, | |
we might create some issues, for | |
example, in disclosing data | |
that normally should be | |
anonymized. Data | |
integration, yeah, that will | |
allow us to do fantastic things, | |
but if the data is perceived to | |
have some privacy exposure | |
issues, then maybe the quality | |
of the information from the | |
analysis that we're going to do is going to be affected. So data integration | |
should be looked into very, very | |
carefully. This is what people | |
used to call ETL: extract, transform, and load. We now have much better methods for doing that. The Join option in JMP, for example, will offer options for doing that. | |
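As a generic illustration of that kind of ID-based linkage (a minimal pandas sketch, not JMP's Join platform; the table and column names are hypothetical):

```python
import pandas as pd

# Two hypothetical sources keyed by the same subject ID.
surveys = pd.DataFrame({"subject_id": [1, 2, 3],
                        "satisfaction": [4, 5, 3]})
purchases = pd.DataFrame({"subject_id": [1, 2, 4],
                          "total_spend": [120.0, 75.5, 40.0]})

# Integrate the sources through the ID; subjects without a match get missing values.
merged = surveys.merge(purchases, on="subject_id", how="left")
print(merged)
```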
Temporal relevance. OK, that's | |
pretty clear. We have data. It is | |
stamped somehow. If we're going | |
to do the analysis later, after the data collection, and if the deployment that we | |
consider is even later, then the | |
data might not be temporally | |
relevant. In a common | |
situation, if we want to compare | |
what is going on now, we would | |
like to be able to make this | |
comparison to recent data or | |
data before the pandemic | |
started, but not 10 years | |
earlier. The official statistics | |
on health records used to be two | |
or three years behind in terms | |
of timing, which made it very difficult to use official statistics in assessing | |
what is going on with the | |
pandemic. Chronology of data and | |
goal is related to the decision | |
that we make as a result of our | |
goal. So if, for example, our | |
goal is to forecast air quality, | |
we're going to use some | |
predictive models on the Air | |
Quality Index reported on a | |
daily basis. This gives us a one-to-six scale from hazardous to good. There are values representing levels of health concern: 0-50 is good; 300-500 is hazardous. | |
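For reference, a minimal sketch of that scale, assuming the standard EPA AQI breakpoints (only the 0-50 and roughly 300-500 endpoints are cited above):

```python
# Assumed EPA AQI categories; only the endpoints are mentioned above.
AQI_LEVELS = [
    (0,   50,  "Good"),
    (51,  100, "Moderate"),
    (101, 150, "Unhealthy for Sensitive Groups"),
    (151, 200, "Unhealthy"),
    (201, 300, "Very Unhealthy"),
    (301, 500, "Hazardous"),
]

def aqi_category(aqi: float) -> str:
    """Map a daily AQI value to its level of health concern."""
    for low, high, label in AQI_LEVELS:
        if low <= aqi <= high:
            return label
    return "Out of range"

print(aqi_category(42))   # Good
print(aqi_category(320))  # Hazardous
```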
The chronology of data and goal | |
means that we should be able to | |
make a forecast on a daily | |
basis. So the methods we use | |
should be updated on a daily | |
basis. If, on the other hand, | |
our goal is to figure out how this AQI index is computed, then we are not really bound by the timeliness of the analysis. | |
You know, we could take our | |
time. There's no urgency in | |
getting the analysis done on a | |
daily basis. Generalizability, | |
the sixth dimension, is about | |
taking our findings and | |
considering where this could | |
apply in more general terms, | |
other populations, other situations. This can be done | |
intuitively. Marketing managers who | |
have seen a study on one market, let's call it A, might | |
already understand what are the | |
implications to Market B | |
without data. People who are | |
physicists will be able to | |
make predictions based on mechanics, on first principles, without data. | |
So some of the generalizability | |
is done with data. This is the | |
basis of statistical | |
generalization, where we go from | |
the sample to the population. | |
Statistical inference is about | |
generalization. We generalize | |
from the sample to the | |
population. And some can be | |
domain based, in other words, | |
using expert knowledge, domain | |
expertise, not necessarily with | |
data. We have to recognize that | |
generalizability is not just | |
done with statistics. | |
The seventh dimension is | |
construct operationalization, | |
which is really about what it is | |
that we measure. We want to | |
assess behavior, emotions, what | |
it is that we can measure, that | |
will give us data that reflects | |
behavior or emotions. | |
The example I give here | |
typically is pain. | |
We know what pain is. How do | |
we measure that? If you go to a | |
hospital and you ask the nurse, | |
how do you assess pain, they | |
will tell you, we have a scale, | |
1 to 10. It's very | |
qualitative, not very | |
scientific, I would say. If we | |
want to measure the level of | |
alcohol in drivers on the...on | |
the road, it will be difficult to | |
measure. So we might measure | |
speed as a surrogate measure. | |
Another part of | |
operationalization is the other | |
end of the story. In other | |
words, the first part, the | |
construct is what we measure, | |
which reflects our goal. The end result here is that | |
we have findings and we want to | |
do something with them. We want | |
to operationalize our findings. | |
This is what the action | |
operationalization is about. | |
It's about what you do with the findings and then presenting them. We used to ask three questions. | |
These are very important | |
questions to ask. Once you have | |
done some analysis, you have | |
someone in front of you who | |
says, oh, thank you very much, | |
you're done, you, the statistician or the data scientist. So this takes you one extra step, getting you to ask your customer these simple questions: What do | |
you want to accomplish? How will | |
you do that and how will you | |
know if you have accomplished | |
that? We can help answer, or at least support, some of these questions. | |
The eighth dimension is | |
communication. I'm giving you an | |
example from a very famous old | |
map from the 19th century, which | |
is showing the march of the | |
Napoleon army from France to Moscow, in Russia. You see the numbers; the width of the path indicates the size of the army, and then in black you see what happened to them on their way back. | |
So basically this was a | |
disastrous march. We can relate this old map to existing maps, and there is a JMP add-in, which you can find on the JMP Community, to | |
show you with bubble plots, | |
dynamic bubble plots what this | |
looks like. So I've covered very quickly the eight information quality dimensions. My last slide puts what I've talked about in a historical perspective, to give some proportion to what I'm saying. | |
I think we are really in the era | |
of information quality. We used | |
to be concerned with product | |
quality in the 18th century, the | |
17th century. We then moved to | |
process quality and service | |
quality. This is a short memo | |
on proposing a control chart, | |
1924, I think. | |
Then we move to management | |
quality. This is the Juran | |
trilogy of design, improvement | |
and control. The Six Sigma DMAIC (define, measure, analyze, improve, control) process is the improvement process of Juran, and Juran was the grandfather of Six Sigma in that sense. | |
Then in the '80s, Taguchi came | |
in. He talks about robust | |
design. How can we handle | |
variability in inputs by proper | |
design decisions? And now we are | |
in the Age of information | |
quality. We have sensors. We | |
have flexible systems. We are | |
depending on AI and machine | |
learning and data mining, and we are gathering big numbers, which we call big data. Interest in information quality should be a prime interest. I'm going to try and convince you, with the help of Chris, that we are here and that JMP can help us achieve that in a really unusual way. | |
What you will see at the end of | |
the case study that Chris will | |
show is also how to do an information quality assessment on a specific study and, basically, generate an information quality score. So if we go top down, I can tell you this study, this work, this analysis scores maybe 80% or maybe 30% or maybe 95%. | |
And through the example you will | |
see how to do that. There is a | |
JMP add-in to provide this | |
assessment. It's actually quite easy. There's | |
nothing really sophisticated | |
about that. So I'm done and | |
Chris, after you. Thanks, Ron. So | |
now I'm going to go through the | |
analysis of a data set in a way | |
that explicitly calls out the | |
various aspects of information | |
quality and show how JMP can be | |
used to assess and improve InfoQ. So first off, I'm | |
going to go through the InfoQ | |
components. The first InfoQ | |
component is the goal, so in | |
this case the problem statement | |
was that a chemical company | |
wanted a formulation that | |
maximized product yield while | |
minimizing a nuisance impurity | |
that resulted from the reaction. | |
So that was the high level goal. | |
In statistical terms, we wanted | |
to find a model that accurately | |
predicted a response on a data | |
set so that we could find a | |
combination of ingredients and | |
processing steps that would lead | |
to a better product. | |
The data are set up in 100 | |
experimental formulations with | |
one primary ingredient, X1, and 10 additives. There's also a processing factor and 13 responses. The data are | |
completely fabricated but were | |
simulated to illustrate the same | |
strengths and weaknesses of the | |
original data. The day each formulation was made was also recorded. We will be looking at this data closely, so I won't elaborate here beyond pointing out | |
that they were collected in an | |
ad hoc way, changing one or two | |
additives at a time rather than | |
as a designed or randomized | |
experiment. There's a lot of | |
ways to analyze this data, the | |
most typical being least | |
squares modeling with forward | |
selection on selected responses. | |
That was my original intention | |
for this talk, but when I showed | |
the data to Ron, he immediately | |
recognized the response columns | |
as time series from analytical | |
chemistry. Even though the data | |
were simulated, he could see the | |
structure. He could see things in the data that I didn't see and hadn't read into it. I found | |
this to be strikingly | |
impressive. It's beyond the | |
scope of this talk, but there is | |
an even better approach based on | |
ensemble modeling using | |
fractionally weighted | |
bootstrapping. Phil Ramsey, | |
Wayne Levin and I have another | |
talk about this methodology at | |
the European Discovery | |
Conference this year. The | |
approach is promising because it | |
can fit models to data with | |
more active interactions than | |
there are runs. | |
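The details are in that talk, but a rough sketch of the general idea looks like this (continuous bootstrap weights on the observations, one weighted fit per replicate, predictions averaged over the ensemble; this is only a generic illustration, not the exact algorithm presented there):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def fractional_weight_ensemble(X, y, n_boot=200, seed=1):
    """Fit an ensemble of weighted least squares models using fractional
    (continuous) bootstrap weights; averaging their predictions gives the ensemble."""
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(n_boot):
        # Exponential(1) weights give every run a positive, fractional weight,
        # so no observation is ever completely dropped from a replicate.
        w = rng.exponential(scale=1.0, size=len(y))
        models.append(LinearRegression().fit(X, y, sample_weight=w))
    return models

def ensemble_predict(models, X_new):
    return np.mean([m.predict(X_new) for m in models], axis=0)
```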
The fourth and final component of information quality is utility, which is how well we are able to achieve our goals, or how we measure how well we've achieved our goals. There's a | |
domain aspect which is in this | |
case we want to have a | |
formulation that maximizes yield and minimizes the waste in post-processing of the | |
material. The statistical | |
analysis utility refers to the | |
model that we fit. What we're | |
going for there is least | |
squares accuracy of our model in | |
terms of how well we're able to | |
predict what the...what would | |
result from candidate | |
combinations of formulation...of | |
mixture factors. Now I'm going | |
to go through a set of questions | |
that make up a detailed InfoQ | |
assessment as organized into the | |
eight dimensions of information | |
quality. I want to point out | |
that not all questions will be | |
equally relevant to different | |
data science and statistical | |
projects, and that this is not | |
intended to be rigid dogma but | |
rather a set of things that are | |
a good idea to ask oneself. | |
These questions represent a kind | |
of data analytic wisdom that | |
looks more broadly than just the | |
application of a particular | |
statistical technology. A copy | |
of a spreadsheet with these | |
questions along with pointers to | |
JMP features that are the most | |
useful for answering a | |
particular one will be uploaded | |
to the JMP Community along | |
with this talk for you to use. As | |
I proceed through the questions, | |
I'll be demoing an analysis of | |
the data in JMP. So Question 1 | |
is: is the data scale used | |
aligned with the stated goal? So | |
the Xs that we have consist of | |
a single categorical variable, processing, and 11 continuous | |
inputs. These are measured | |
as percentages and are also | |
recorded to half a percent. We | |
don't have the total amounts of | |
the ingredients, only the | |
percentages. The totals are | |
information that was either lost | |
or never recorded. There are | |
other potentially important | |
pieces of information that are | |
missing here. The time between | |
formulating the batches and | |
taking the measurements is gone | |
and there could have been other | |
covariate level information that | |
is missing here that would have | |
described the conditions under | |
which the reaction occurred. | |
Without more information than I | |
have, I cannot say how important | |
this kind of covariate information | |
would have been. We do have | |
information on the day of the | |
batch, so that could be used as | |
a surrogate possibly. Overall we | |
have what are, hopefully, the most | |
important inputs, as well as | |
measurements of the responses we | |
wish to optimize. We could have | |
had more information, but this | |
looks promising enough to try an analysis with. The second | |
question related to data | |
resolution is: how reliable and precise are the measuring devices and data sources? And the fact | |
is, we don't have a lot of | |
specific information here. The | |
statistician internal to the | |
company would have had more | |
information. In this case we | |
have no choice but to trust that | |
the chemists formulated and | |
recorded the mixtures well. The | |
third question relative to data | |
resolution is: is the data | |
analysis suitable for the data | |
aggregation level? And the | |
answer here is yes, assuming | |
that their measurement system is | |
accurate and that the data are | |
clean enough. What we're going | |
to end up doing actually is | |
we're going to use the | |
Functional Data Explorer to | |
extract functional principal | |
components, which are a data | |
derived kind of data | |
aggregation. And then we're | |
going to be modeling those | |
functional principal components | |
using the input variables. So | |
now we move on to the data | |
structure dimension and the | |
first question we ask is, is the | |
data used aligned with the | |
stated goal? And I think the | |
answer is a clear yes here. We're | |
trying to maximize | |
yield. We've got measurements for | |
that, and the inputs are | |
recorded as Xs. The second data | |
structure question is where | |
things really start to get | |
interesting for me. So this is: | |
are the integrity details | |
(outliers, missing values, data | |
corruption) issues described and | |
handled appropriately? So from | |
here we can use JMP to be able | |
to understand where the outliers | |
are, figure out strategies for | |
what to do about missing values, | |
observe their patterns and so | |
on. So this is where | |
things are going to get a little | |
bit more interesting. The first | |
thing we're going to do is we're | |
going to determine if there are | |
any outliers in the data that we | |
need to be concerned about. So | |
to do that, we're going to go | |
into the explore outliers | |
platform off of the screening | |
menu. We're going to load up the | |
response variables, and because | |
this is a multivariate setting, | |
we're going to use a new feature | |
in JMP Pro 16 called Robust | |
PCA Outliers. So we see where | |
the large residuals are in those | |
kind of Pareto type plots. | |
There's a snapshot showing where | |
there's some potentially | |
unusually large observations. I | |
don't really think this looks | |
too unusual or worrisome to me. | |
We can save the large outliers | |
to a data table and then look at | |
them in the distribution | |
platform and what we see kind of | |
looks like a normal distribution | |
with the middle taken out. So I | |
think this is data that are | |
coming from sort of the same | |
population and there's nothing | |
really to worry about here, | |
outliers-wise. | |
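JMP Pro's Robust PCA Outliers has its own algorithm; as a rough stand-in for the idea (flag cells whose residuals from a low-rank reconstruction are large; ordinary PCA is used here, so this is only a non-robust analogue on complete data):

```python
import numpy as np
from sklearn.decomposition import PCA

def pca_residuals(Y, n_components=4):
    """Reconstruct the response matrix from a few principal components and
    return the residuals; unusually large entries flag candidate outlier cells."""
    Y = np.asarray(Y, dtype=float)          # assumes no missing cells
    pca = PCA(n_components=n_components).fit(Y)
    recon = pca.inverse_transform(pca.transform(Y))
    return Y - recon

# Example: flag cells beyond 3 robust (MAD-based) standard deviations.
# resid = pca_residuals(Y)
# cutoff = 3 * 1.4826 * np.median(np.abs(resid - np.median(resid)))
# outlier_cells = np.abs(resid) > cutoff
```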
So once we've taken care of the outlier | |
situation we go in and explore | |
missing values. So what we're | |
going to do first is we're going | |
to load up the Ys into the platform, and then we're going to use the missing value snapshot to see what patterns there are amongst our missing values. It looks like the | |
missing values tend to occur in | |
horizontal clusters, and there's | |
also the same missing values | |
across rows. So you can see that | |
with the black splotches here. | |
And then we'll go apply an | |
automated data imputation, | |
which goes ahead and saves | |
formula columns that impute | |
missing values in the new | |
columns using a regression type | |
algorithm that was developed by | |
a PhD student of mine named Milo | |
Page at NC State. So we can play | |
around a little bit and get a | |
sense of how the ADI | |
algorithm is working. So it's | |
created these formula columns | |
that are peeling off elements of | |
the ADI impute column, which is | |
a vector formula column, and the | |
scoring impute function | |
is calculating the expected | |
value of the missing cells given | |
the non missing cells, whenever | |
it's got a missing value. And it's | |
just carrying through a non | |
missing value. So you can see 207 | |
in Y6 there. It's initially | |
207 but then I change it to | |
missing and it's now being | |
imputed to be 234. | |
So I'll do this a couple of times so | |
you can kind of see how it's | |
working. So here I'll put in a big | |
value for Y7, and that's now been replaced. And if we go down | |
and we add a row, | |
then all missing values are there | |
initially and the column means | |
are replaced for the | |
imputations. If I were to go | |
ahead and add values for some of | |
those missing cells, it would | |
start doing the conditional | |
expectation of the still missing | |
cells using the information that's in the non-missing ones. | |
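That conditional-expectation idea can be sketched outside of JMP as well. This is not the ADI algorithm itself, just the textbook multivariate normal version: estimate a mean and covariance, then replace each missing cell by its expected value given the observed cells in the same row.

```python
import numpy as np

def conditional_mean_impute(Y, mu, Sigma):
    """Impute missing cells (np.nan) with E[missing | observed] under a
    multivariate normal model with mean mu and covariance Sigma."""
    Y = np.array(Y, dtype=float)
    for i, row in enumerate(Y):
        miss = np.isnan(row)
        if miss.all():
            Y[i, :] = mu                      # nothing observed: fall back to column means
            continue
        if not miss.any():
            continue                          # complete row: nothing to do
        obs = ~miss
        S_oo = Sigma[np.ix_(obs, obs)]
        S_mo = Sigma[np.ix_(miss, obs)]
        Y[i, miss] = mu[miss] + S_mo @ np.linalg.solve(S_oo, row[obs] - mu[obs])
    return Y

# mu and Sigma would typically come from the complete rows or from an EM-style loop.
```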
So our next question on data structure is: | |
are the analysis methods | |
suitable for the data structure? | |
So we've got 11 mixture inputs | |
and a processing variable that's | |
categorical. Those are going | |
to be inputs into a least | |
squares type model. We have | |
13 continuous responses and | |
we can model them using... | |
individually using least | |
squares. Or we can model | |
functional principal | |
components. Now there are problems. The Xs, the input variables, have not been | |
randomized at all. It's very | |
clear that they would muck | |
around with one or more of | |
the compounds and then move | |
on to another one. So the | |
order in which the input variables were varied | |
was kind of haphazard. It's a | |
clear lack of randomization, and | |
that's going to negatively | |
impact our...the generalizability | |
and strength of our conclusions. | |
Data integration is the third | |
InfoQ dimension. These data | |
are manually entered lab notes | |
consisting mostly of mixture | |
percentages and equipment | |
readouts. We can only assume | |
that the data were entered | |
correctly and that the Xs are | |
aligned properly with responses. | |
If that isn't the case, then the | |
model will have serious bias | |
problems and have | |
problems with generalizability. | |
Integration is more of an issue | |
with observational data | |
science problems in machine | |
learning exercises, than lab | |
experiments like this. Although | |
it doesn't apply here, I'll | |
point out that privacy and | |
confidentiality concerns can be | |
identified by modeling the sensitive part of the data using the to-be-published components of the data. If the resulting model is predictive, then one should be concerned that privacy requirements are not being met. | |
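A minimal sketch of that check: fit a model predicting the sensitive column from the columns you intend to publish, and if it predicts well on held-out rows, disclosure is a concern. The column names here are hypothetical.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

def disclosure_risk(df: pd.DataFrame, sensitive: str, public_cols: list) -> float:
    """Cross-validated R-squared for predicting the sensitive column from the
    to-be-published columns; values near 1 signal a privacy problem."""
    model = RandomForestRegressor(n_estimators=200, random_state=0)
    scores = cross_val_score(model, df[public_cols], df[sensitive], cv=5, scoring="r2")
    return scores.mean()

# Hypothetical usage:
# if disclosure_risk(df, "income", ["age", "region", "purchases"]) > 0.8:
#     print("The published columns largely reveal the sensitive variable.")
```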
Temporal relevance refers to the | |
operational time sequencing of | |
data collection, analysis and | |
deployment and whether gaps | |
between those stages lead to a | |
decrease in the usefulness of | |
the information in the study. | |
In this case, we can only | |
hope that the material supplies | |
are reasonably consistent and | |
that the test equipment is | |
reasonably accurate, which is an | |
unverifiable assumption at this | |
point. The time resolution we have on the data collection is at the day level, which means | |
that there isn't much way that | |
we can verify if there is time | |
variation within each day. | |
Chronology of data and goal is | |
about the availability of | |
relevant variables both in terms | |
of whether the variable is | |
present at all in the data or | |
whether the right information | |
will be available when the model | |
is deployed. For predictive | |
models, this relates to models | |
being fit to data similar to | |
what will be present at the time | |
the model will be evaluated on | |
new data. In this way, our data | |
set is certainly fine. For | |
establishing causality, however, | |
we aren't in nearly as good a | |
shape because the lack of | |
randomization implies that time | |
effects and factor effects may | |
be confounded, leading to bias | |
in our estimates. Endogeneity, | |
or reverse causation, could | |
clearly be an issue here, as | |
variables like temperature and | |
reaction time could clearly be | |
impacting the responses, but have | |
been left unrecorded. Overall, | |
there is a lot we don't know | |
about this dimension in an | |
information quality sense. | |
The rest of the InfoQ | |
assessment is going to be | |
dependent upon the type of | |
analysis that we do. So at this | |
point I'm going to go ahead and | |
conduct an analysis of this data | |
using the Functional Data | |
Explorer platform in JMP Pro | |
that allows me to model across | |
all the columns simultaneously | |
in a way that's based on | |
functional principal components, | |
which contain the maximum amount | |
of information across all those | |
columns as represented in the | |
most efficient format possible. | |
I'm going to be working on the | |
imputed versions of the columns | |
that I calculated earlier in the | |
presentation. And I'm going to | |
point out that I'm going to be | |
working to find combinations of | |
the mixture factors that achieve | |
as closely as possible in a | |
least square sense, an ideal | |
curve that was created by the | |
practitioner that maximizes the | |
amount of potential product that | |
could be in a batch while | |
minimizing the amount of the | |
impurities that they | |
realistically thought a batch | |
could contain. So I begin the | |
analysis by going to the analyze | |
menu, bring up the Functional | |
Data Explorer. This is set up with rows as functions. I'm going to load up my imputed response columns, and then I'm going | |
to put in my formulation | |
components and my processing | |
column as a supplementary | |
variable. We've got an ID | |
function, that's batch ID. Once I get in, I can see the functions, both overlaid all together, and | |
Then I can load up the target | |
function, which is the ideal. | |
And that will change the | |
analysis that results once I | |
start going into the modeling | |
steps. So these are pretty | |
simple functions, so I'm just | |
going to model them with | |
B splines. | |
And then I'm going to go into my | |
functional DOE analysis. | |
This is going to fit the model | |
that connects the inputs into | |
the functional principal | |
components and then connect all | |
the way through the | |
eigenfunctions to make it so | |
we're able to recover the | |
overall functions as they | |
changed, as we are varying the | |
mixture factors. The | |
functional principal component | |
analysis has indicated that | |
there are four dimensions of | |
variation in these response | |
functions. To understand what | |
they mean, let's go ahead and | |
explore with the FPC profiler. | |
So watch this pane right here as | |
I adjust FPC 1 and we can see | |
that this FPC is associated with | |
peak height. FPC2, it looks | |
like it's kind of a peak | |
narrowness. It's almost like a | |
resolution principal | |
component. The third one is | |
related to kind of a knee on | |
the left of the dominant peak. | |
And FPC 4 looks like it's primarily related to the | |
impurity, so that's what the | |
underlying meaning is of | |
these four functional | |
principal components. | |
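Outside of JMP Pro, the same pipeline can be sketched in outline: smooth each response curve, extract principal component scores from the smoothed curves, then regress each score on the formulation factors. This is only a schematic of what the Functional Data Explorer automates (it uses pruned forward selection rather than the plain least squares stand-in below), and the array names are hypothetical.

```python
import numpy as np
from scipy.interpolate import UnivariateSpline
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

# t: common measurement grid; Y: (n_batches, n_points) imputed response curves;
# X: (n_batches, n_factors) mixture and processing factors.

def functional_doe_sketch(t, Y, X, n_fpcs=4):
    # 1. Smooth each curve with a cubic spline and evaluate it on the common grid.
    smooth = np.array([UnivariateSpline(t, y, k=3, s=len(t))(t) for y in Y])
    # 2. Functional principal components: PCA on the smoothed curves.
    pca = PCA(n_components=n_fpcs).fit(smooth)
    scores = pca.transform(smooth)            # one column of scores per FPC
    # 3. Model each FPC score as a function of the input factors.
    models = [LinearRegression().fit(X, scores[:, j]) for j in range(n_fpcs)]
    return pca, models

def predict_curve(pca, models, x_new):
    """Recombine predicted scores with the eigenfunctions to predict a whole curve."""
    s = np.array([[m.predict(np.atleast_2d(x_new))[0] for m in models]])
    return pca.inverse_transform(s)[0]
```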
So we've characterized our goal | |
as maximizing the product and | |
minimizing the impurity, and | |
we've communicated that into the | |
analysis through this ideal or | |
golden curve that we supplied at | |
the beginning of the FDE | |
exercise we're doing. To get as | |
close as possible to that ideal | |
curve, we turn on desirability | |
functions. And then we can go | |
out and maximize desirability. | |
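The desirability idea itself is simple to sketch: map each predicted quantity to a 0-to-1 desirability and combine them with a geometric mean, then maximize that over candidate formulations. This is a generic Derringer-Suich style construction, not JMP's implementation, and the specification limits below are placeholders.

```python
import numpy as np

def desirability_max(y, low, high):
    """Larger-is-better desirability: 0 below low, 1 above high, linear in between."""
    return np.clip((y - low) / (high - low), 0.0, 1.0)

def desirability_min(y, low, high):
    """Smaller-is-better desirability: 1 below low, 0 above high."""
    return 1.0 - desirability_max(y, low, high)

def overall_desirability(pred_yield, pred_impurity):
    # Placeholder limits standing in for the yield and impurity targets.
    d_yield = desirability_max(pred_yield, low=70.0, high=95.0)
    d_impurity = desirability_min(pred_impurity, low=0.5, high=5.0)
    return np.sqrt(d_yield * d_impurity)      # geometric mean of the two desirabilities

# A candidate formulation is scored by predicting yield and impurity from the
# fitted model and evaluating overall_desirability on those predictions.
```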
And we find that the optimal | |
combination of inputs is about 4.5% of Ingredient 4, 2% of Ingredient 6, 2.2% of Ingredient 8, and 1.24% of Ingredient 9, using processing method two. | |
Let's review how we've gotten here. We first imputed the missing response | |
columns. Then we found B-spline | |
models that fit those functions | |
well in the FDE platform. A | |
functional principal components | |
analysis determined that there | |
were four eigenfunctions | |
characterizing the variation in | |
this data. These four | |
eigenfunctions were determined | |
via the FPC profiler to each | |
have a reasonable subject | |
matter meaning. The functional | |
DOE analysis consisted of | |
applying pruned forward | |
selection to each of the | |
individual FPC scores using the | |
DOE factors as input variables. | |
And we see here that these have | |
found combinations of | |
interactions and main effects | |
that were most predictive for | |
each of the functional principal | |
component scores individually. | |
The Functional DOE Profiler | |
has elegantly captured all | |
aspects into one representation | |
that allows us to find the | |
formulation processing step that | |
is predicted to have desirable | |
properties as measured by high | |
yield and low impurity. | |
So now we can do an InfoQ | |
assessment of the | |
generalizability of the data in | |
the analysis. So in this case, | |
we're more interested in | |
scientific generalizability, as | |
the experimenter is a deeply | |
knowledgeable chemist working | |
with this compound. So we're | |
going to be relying more on | |
their subject matter expertise | |
than on statistical principles | |
and tools like hypothesis tests | |
and so forth. The goal is | |
primarily predictive, but the | |
generalizability is kind of | |
problematic because the | |
experiment wasn't designed. Our | |
ability to estimate interactions | |
is weakened for techniques like | |
forward selection and impossible | |
via least squares analysis of | |
the full model. Because the | |
study wasn't randomized, there | |
could be unrecorded time and order effects. We don't have | |
potentially important covariate | |
information like temperature and | |
reaction time. This creates | |
another big question mark | |
regarding generalizability. | |
Repeatability and | |
reproducibility of the study is | |
also an unknown here as we have | |
no information about the | |
variability due to the | |
measurement system. Fortunately, | |
we do have tools like JMP's Evaluate Design to understand the existing design, as well as Augment Design, which can greatly | |
enhance the generalization | |
performance of the analysis. | |
Augment can improve information | |
about main effects and | |
interactions, and a second round | |
of experimentation could be | |
randomized to also enhance | |
generalizability. So now I'm | |
going to go through a couple of | |
simple steps to show how to | |
improve the generalization | |
performance of our study using | |
design tools in JMP. Before I | |
do that, I want to point out | |
that I had to take the data and | |
convert it so that it was | |
proportions rather than in | |
percents. Otherwise the design | |
tools were not really agreeing | |
with the data very well. So we | |
go into the evaluate designer | |
and then we load up our Ys and | |
our Xs. I requested the ability to | |
handle second order interactions. | |
Then I got this alert | |
saying, hey, I can't do that | |
because we're not able to | |
estimate all the interactions | |
given the one factor at a time | |
data that we have. So I backed | |
up. We go to the augment | |
designer, load everything up, | |
set augment. We'll choose an I-optimal design because we're really concerned with prediction performance here. | |
And I | |
set the number of runs to 148. | |
The custom designer requested | |
141 as a minimum, but I went to | |
148 just to kind of make sure | |
that we've got the ability to | |
estimate all of our interactions | |
pretty well. After that, it | |
takes about 20 seconds to | |
construct the design. So now | |
that we have the design, I'm | |
going to show the two most | |
important diagnostic tools in | |
the augment designer for | |
evaluating a design. On the | |
left, we have the fraction of | |
design space plot. This is | |
showing that 50% of the volume | |
of the design has | |
a prediction variance that is | |
less than 1. So 1 would be | |
equivalent to the residual | |
error. So we're able to get | |
better than measurement error | |
quality predictions over the | |
majority of the design. | |
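The quantity behind that plot is the relative prediction variance x0'(X'X)^(-1)x0, scaled so that a value of 1 matches the residual error variance. A rough sketch of tabulating it over random points in the factor space follows; it ignores the mixture constraint, so it is only an illustration of the calculation, not of JMP's design evaluation.

```python
import numpy as np

def relative_prediction_variance(F, f0):
    """Relative prediction variance f0' (F'F)^-1 f0 for model matrix F and an
    expanded model row f0; a value of 1 corresponds to the residual error variance."""
    FtF_inv = np.linalg.inv(F.T @ F)
    return float(f0 @ FtF_inv @ f0)

def fds_median(F, expand, sampler, n=5000, seed=0):
    """Sample points in the design region, compute their relative prediction
    variance, and report the median (the 50% point of an FDS-style curve)."""
    rng = np.random.default_rng(seed)
    v = [relative_prediction_variance(F, expand(sampler(rng))) for _ in range(n)]
    return float(np.median(v))

# `expand` maps a factor setting to its model row (intercept, main effects,
# interactions); `sampler` draws a random point in the design region.
```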
On the right we have the color map on | |
correlations. This is showing | |
that we're able to estimate | |
everything pretty well. Because of the mixture | |
constraint, we're getting some | |
strong correlations between | |
interactions and main effects. | |
Overall, the effects are fairly | |
clean. And the interactions are | |
pretty well separated from one | |
another, and the main effects | |
are pretty well separated from | |
one another as well. After | |
looking at the design | |
diagnostics, we can make the | |
table. Here, I have shown the | |
first 13 of the augmented runs | |
and we see that we have more randomization. We don't have streaks where the same main effect is used over and over again. | |
That's evidence of better | |
randomization and overall the | |
design is going to be able to | |
much better estimate the main | |
effects and interactions having | |
received better, higher quality | |
information in this second stage | |
of experimentation. So the input | |
variables, the Xs, are accurate | |
representations of the mixture | |
proportions, so that's a clear, objectively measured quantity of interest. The responses are close surrogates | |
for the amount of the product | |
and amount of impurity that's in | |
the batches. We're pretty good on | |
7.1 there. The justifications | |
are clear. After the study, we | |
can of course go prepare a | |
batch that is the formulation | |
that was recommended by the FDOE | |
profiler. Try it out and see if | |
we're getting the kind of | |
performance that we were looking | |
for. It's very clear that that | |
would be the way that we can | |
assess how well we've achieved | |
our study goals. So now, on to the last InfoQ dimension, communication. By describing the | |
ideal curve as a target | |
function, the Functional DOE | |
Profiler makes the goal and the | |
results of the analysis crystal | |
clear. But this can be expressed | |
at a level that is easily | |
interpreted by the chemists and | |
managers of the R&D facility. | |
And as we have done our detailed | |
information quality assessment, | |
we've been upfront about the | |
strengths and weaknesses of the | |
study design and data | |
collection. If the results do | |
not generalize, we certainly | |
know where to look for where the | |
problems were. Once you become | |
familiar with the concepts, | |
there is a nice add-in written | |
by Ian Cox that you can use to | |
do a quick quantitative InfoQ | |
assessment. The add-in has | |
sliders for the upper and lower | |
bounds of each InfoQ dimension. | |
These dimensions are combined | |
using a desirability function | |
approach for an overall interval | |
for the InfoQ over on the left. | |
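The scoring behind the add-in can be sketched simply: rate each of the eight dimensions on a 0-to-1 scale (or an interval), then combine them with a desirability-style geometric mean. The ratings below are placeholders, not the ones from this study.

```python
import numpy as np

# Lower and upper ratings for the eight InfoQ dimensions, each on a 0-1 scale (placeholders).
ratings = {
    "Data resolution":             (0.6, 0.8),
    "Data structure":              (0.7, 0.9),
    "Data integration":            (0.5, 0.7),
    "Temporal relevance":          (0.6, 0.8),
    "Chronology of data and goal": (0.5, 0.8),
    "Generalizability":            (0.4, 0.6),
    "Operationalization":          (0.7, 0.9),
    "Communication":               (0.8, 0.9),
}

def infoq_score(values):
    """Geometric mean of the dimension ratings, reported as a percentage."""
    vals = np.array(list(values), dtype=float)
    return 100 * vals.prod() ** (1 / len(vals))

low = infoq_score(lo for lo, hi in ratings.values())
high = infoq_score(hi for lo, hi in ratings.values())
print(f"InfoQ score between {low:.0f}% and {high:.0f}%")
```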
Here is an assessment for the | |
data and analysis I covered in | |
this presentation. The add-in is | |
also a useful thinking tool that | |
will make you consider each of | |
the InfoQ dimensions. It's also a | |
practical way to communicate | |
InfoQ assessments to your | |
clients or to your management, as | |
it provides a high level view of | |
information quality without | |
using a lot of technical | |
concepts and jargon. The add-in | |
is also useful as the basis for | |
an InfoQ comparison. My | |
original hope for this | |
presentation was to be a little | |
bit more ambitious. I had hoped | |
to cover the analysis I had | |
just gone through, as well as | |
another simpler one, where I skip imputing the responses and do | |
a simple multivariate linear | |
regression model of the response | |
columns. Today, I'm only able to | |
offer a final assessment of that | |
approach. As you can see, | |
several of the InfoQ | |
dimensions suffer substantially | |
without the more sophisticated | |
analysis. It is very clear that | |
the simple analysis leads to | |
a much lower InfoQ score. | |
The upper limit of the simple | |
analysis isn't that much higher | |
than the lower limit of the more | |
sophisticated one. With | |
experience, you will gain | |
intuition about what a good InfoQ | |
score is for data science | |
projects in your industry and | |
pick up better habits as you | |
will no longer be blind to the | |
information bottlenecks in your | |
data collection, analysis and | |
model deployment. The add-in gives you information quality assessment with an easy-to-use interface. This was my first | |
formal information quality | |
assessment. Speaking for myself, | |
the information quality | |
framework has given words and | |
structure to a lot of things I | |
already knew instinctively. It has already changed how I | |
approach new data analysis | |
projects. I encourage you to go | |
through this process yourself on | |
your own data, even if that data | |
and analysis is already very | |
familiar to you. I guarantee | |
that you will be a wiser and | |
more efficient data scientist | |
because of it. Thank you. |