Ron Kenett: Hello. My name is Ron Kenett. This is a joint talk with Chris Gotwalt, and we basically have two messages that should come out of the talk. One is that we should really be concerned about information and information quality. People tend to talk about data and data quality, but data is really not the issue. We are statisticians, we are data scientists: we turn numbers, data, into information, so our goal should be to make sure that we generate high-quality information. The other message is that JMP can help you achieve that, and in ways that are turning out to be surprising. So by combining the expertise of Chris with an introduction to information quality, we hope that these two messages will come across clearly.
So if I had to summarize what it is that we want to talk about: after all, it is all about information quality. I gave a talk at the Summit in Prague four years ago, and that talk was generic; it described my journey from quality by design to information quality. In this talk we focus on how this can be done with JMP. This is a more detailed and technical talk than the general talk I gave in Prague. You can watch that talk; there's a link listed here, and you can find it on the JMP Community.
So we're going to talk about information quality, which is the potential of a specific data set to achieve a specific goal with a given empirical method. In that definition we have three components. One is a certain data set: here is the data. The second one is the goal of the analysis: what it is that we want to achieve. And the third one is how we will do that: with what methods we are going to generate information. That potential is going to be assessed with a utility function.
And I will begin with an introduction to information quality, and then Chris will take over, discuss the case study, and show you how to conduct an information quality assessment. Eventually this should answer the question of how JMP supports InfoQ; those would be the takeaway points from the talk.
The setup for this is that we encourage what I call a lifecycle view of statistics; in other words, not just data analysis. We should be part of the problem elicitation phase, and also the goal formulation phase, which deserves a discussion. We should obviously be involved in the data collection scheme, whether it's through experiments, surveys, or observational data. Then we should also take time for the formulation of the findings, and not just pull out printed reports on regression coefficient estimates and their significance; we should also discuss what the findings are. Operationalization of findings means: OK, what can we do with these findings? What are the implications of the findings? This needs to be communicated to the right people in the right way. And eventually we should do an impact assessment to figure out: OK, we did all this; what has been the added value of our work? I talked about this lifecycle view of statistics a few years ago. This is the prerequisite, the perspective, for what I'm going to talk about.
So as I mentioned, information quality is the potential of a particular data set to achieve a particular goal using given empirical analysis methods. It is identified through four components: the goal, the data, the analysis method, and the utility measure. So as a mathematical expression, the utility of applying f to X, conditioned on the goal, is how we define InfoQ, information quality.
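In the notation of the paper, that definition reads:

$$\mathrm{InfoQ}(f, X, g) = U\big(f(X) \mid g\big)$$

where g is the analysis goal, X the available data, f the empirical analysis method, and U the utility measure.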
This was published in the Royal Statistical Society Series A in 2013 with eight discussants, so it was amply discussed. Some people thought this was fantastic, and some people had a lot of critique of the idea; it is a wider-scope consideration of what statistics is about. We, meaning myself and Galit Shmueli, also wrote a book in 2016 called Information Quality.
And in the context of information quality we did what is called deconstruction; David Hand has a paper on the deconstruction of statistics. This is the deconstruction of information quality into eight dimensions. I will cover these eight dimensions, that's my part of the talk, and then Chris will show how this is implemented in a specific case study.
Another aspect that relates to this is another book of mine, a recent one from a year ago, titled The Real Work of Data Science. There we talk about the role of the data scientist in organizations, and in that context we emphasize the need for the data scientist to be involved in the generation of information, with information quality defined as meeting the goals of the organization.
So let me cover the eight dimensions; that's my intro. The first one is data resolution. We have a goal: OK, we would like to know the level of flu in the country or in the area where we live, because that will impact our decision on whether to go to the park, where we could meet people, or to a jazz concert. And that concert is tomorrow. If we look up the CDC data on the level of flu, that data is updated weekly, so we could get the red line in the graph you have in front of you; we could get data from a few days ago, maybe good, maybe not good enough for our goal. Google Flu Trends, which is based on searches related to flu, is updated momentarily, online, so it will probably give us better information. So for our goal, the blue line, the Google Flu Trends indicator, is probably more appropriate.
The second dimension is data structure. To meet our goal, we're going to look at data, and we should identify the data sources and the structure of those data sources. Some data could be text, some could be video, some could be, for example, a network of recommendations. This is an Amazon picture of how, if you look at a book, you're going to get some other books recommended, and if you go to those other books, you're going to get more recommendations. So the data structure can come in all sorts of shapes and forms: it can be text, it can be functional data, it can be images. We are no longer confined to what we used to call data, which is what you find in an Excel spreadsheet.
The data could be corrupted, could have missing values, or could have unusual patterns, which would be something to look into: patterns where things are repeated, where maybe some of the data is just copy and paste, and we would like to be warned about such issues.
The third dimension is data integration. When we consider the data from these different sources, we're going to integrate them so we can do some analysis, for example through linkage on an ID. But in doing that, we might create some issues, for example disclosing data that normally should be anonymized. Data integration will allow us to do fantastic things, but if the data is perceived to have privacy exposure issues, then maybe the quality of the information from the analysis we're going to do is going to be affected. So data integration should be looked into very, very carefully. This is what people used to call ETL: extract, transform and load. We now have much better methods for doing that; the Join option in JMP, for example, offers options for doing that.
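As a small outside-JMP illustration of the same linkage idea, here is an ID-based join in Python with pandas; the tables and column names are hypothetical stand-ins.

```python
import pandas as pd

# Two hypothetical sources keyed by the same subject ID.
lab = pd.DataFrame({"id": [1, 2, 3], "yield_pct": [91.2, 88.5, 93.1]})
log = pd.DataFrame({"id": [2, 3, 4], "operator": ["A", "B", "A"]})

# A left join keeps every lab record; unmatched rows come back with
# missing values, which is exactly the kind of integration artifact
# worth inspecting before any analysis.
merged = lab.merge(log, on="id", how="left")
print(merged)
```

Note that joining on an identifier is also where de-anonymization risk enters: the richer the merged record, the easier re-identification becomes.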
Temporal relevance: OK, that's pretty clear. We have data, and it is time-stamped somehow. If we're going to do the analysis later, after the data collection, and if the deployment that we consider is even later, then the data might not be temporally relevant. In a common situation, if we want to compare against what is going on now, we would like to be able to make this comparison to recent data, or to data from before the pandemic started, but not from 10 years earlier. Official statistics on health records used to be two or three years behind in terms of timing, which made it very difficult to use official statistics to assess what is going on with the pandemic.
Chronology of data and goal is related to the decision that we make as a result of our goal. If, for example, our goal is to forecast air quality, we're going to use some predictive models on the Air Quality Index, which is reported on a daily basis. This gives us a one-to-six scale from hazardous to good, with values representing levels of health concern: 0-50 is good, 300-500 is hazardous. The chronology of data and goal means that we should be able to make a forecast on a daily basis, so the methods we use should be updated on a daily basis. If, on the other hand, our goal is to figure out how this AQI index is computed, then we are not really bound by the timeliness of the analysis. You know, we could take our time; there's no urgency in getting the analysis done on a daily basis.
Generalizability, the sixth dimension, is about taking our findings and considering where they could apply in more general terms: other populations, other situations. This can be done intuitively. Marketing managers who have seen a study on one market, let's call it A, might already understand the implications for market B without data. Physicists will be able to make predictions based on mechanics, on first principles, without data. So some of the generalizability is done with data; this is the basis of statistical generalization, where we go from the sample to the population. Statistical inference is about generalization: we generalize from the sample to the population. And some generalization can be domain based, in other words, using expert knowledge, domain expertise, not necessarily with data. We have to recognize that generalizability is not done just with statistics.
The seventh dimension is construct operationalization, which is really about what it is that we measure. We want to assess behavior or emotions: what is it that we can measure that will give us data reflecting behavior or emotions? The example I typically give here is pain. We know what pain is; how do we measure it? If you go to a hospital and you ask the nurse how they assess pain, they will tell you they have a scale from 1 to 10. It's very qualitative, not very scientific, I would say. If we want to measure the level of alcohol in drivers on the road, that will be difficult to measure, so we might measure speed as a surrogate measure.
Another part of operationalization is the other end of the story. In other words, the first part, the construct, is what we measure, which reflects our goal. The end result is that we have findings and we want to do something with them; we want to operationalize our findings. This is what action operationalization is about: it's about what you do with the findings. When presenting here on a podium, we used to ask three questions, and these are very important questions to ask. Once you have done some analysis, you have someone in front of you who says, oh, thank you very much, you're done, you, the statistician or the data scientist. So this takes you one extra step, getting you to ask your customer these simple questions: What do you want to accomplish? How will you do that? And how will you know if you have accomplished it? We can help answer, or at least support, some of these questions.
The eighth dimension is communication. I'm giving you an example from a very famous old map from the 19th century, showing the march of Napoleon's army from France to Moscow in Russia. The width of the path indicates the size of the army, and in black you see what happened to them on their way back. So basically this was a disastrous march. We can relate this old map to existing maps, and there is a JMP add-in, which you can find on the JMP Community, that shows you with dynamic bubble plots what this looks like.
So I've covered the eight information quality dimensions very quickly. My last slide puts what I've talked about into historical perspective, to give some proportion to what I'm saying. I think we are really in the era of information quality. We used to be concerned with product quality in the 17th and 18th centuries. We then moved to process quality and service quality; this is the short memo proposing a control chart, from 1924, I think.
Then we moved to management quality. This is the Juran trilogy of design, improvement and control. The Six Sigma define, measure, analyze, improve, control (DMAIC) process is the improvement process of Juran, and Juran was the grandfather of Six Sigma in that sense.
Then in the '80s, Taguchi came in. He talked about robust design: how can we handle variability in inputs through proper design decisions? And now we are in the age of information quality. We have sensors, we have flexible systems, we depend on AI and machine learning and data mining, and we are gathering big numbers, which we call big data. Interest in information quality should be a prime interest. I'm going to try to convince you of that, with the help of Chris. We are here, and JMP can help us achieve that in a really unusual way.
What you will see at the end of the case study that Chris will show is also how to do an information quality assessment on a specific study and basically generate an information quality score. So if we go top down, I can tell you this study, this work, this analysis is maybe 80%, or maybe 30%, or maybe 95%. Through the example you will see how to do that. There is a JMP add-in to provide this assessment. It's actually quite easy; there's nothing really sophisticated about it. So I'm done, and Chris, after you.

Chris Gotwalt: Thanks, Ron.
So now I'm going to go through the analysis of a data set in a way that explicitly calls out the various aspects of information quality, and show how JMP can be used to assess and improve InfoQ. First off, I'm going to go through the InfoQ components. The first InfoQ component is the goal. In this case the problem statement was that a chemical company wanted a formulation that maximized product yield while minimizing a nuisance impurity that resulted from the reaction. So that was the high-level goal. In statistical terms, we wanted to find a model that accurately predicted a response on a data set, so that we could find a combination of ingredients and processing steps that would lead to a better product.
The data are set up as 100 experimental formulations with one primary ingredient, X1, and 10 additives. There's also a processing factor and 13 responses. The data are completely fabricated, but were simulated to illustrate the same strengths and weaknesses as the original data. The date each formulation was made was also recorded. We will be looking at this data closely, so I won't elaborate beyond pointing out that they were collected in an ad hoc way, changing one or two additives at a time, rather than as a designed or randomized experiment.
There are a lot of ways to analyze this data, the most typical being least squares modeling with forward selection on selected responses. That was my original intention for this talk, but when I showed the data to Ron, he immediately recognized the response columns as time series from analytical chemistry. Even though the data were simulated, he could see the structure; he could see things in the data that I didn't see and hadn't read into it. I found this strikingly impressive. It's beyond the scope of this talk, but there is an even better approach based on ensemble modeling using fractionally weighted bootstrapping. Phil Ramsey, Wayne Levin and I have another talk about this methodology at the European Discovery Conference this year. That approach is promising because it can fit models to data with more active interactions than there are runs.
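As a bare-bones sketch of the fractionally weighted bootstrap idea, not the ensemble-model selection method from that talk, here is the core move in Python on toy data: instead of resampling rows, draw continuous random weights each replicate and refit a weighted model.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(5)
X = rng.uniform(size=(100, 5))  # toy inputs
y = X @ np.array([1.0, -2.0, 0.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=100)

# Fractionally weighted bootstrap: continuous weights keep every row in
# every refit, which behaves better than row resampling on small designs.
coefs = []
for _ in range(200):
    w = rng.exponential(size=100)                  # fractional weights
    m = LinearRegression().fit(X, y, sample_weight=w)
    coefs.append(m.coef_)

print(np.std(coefs, axis=0))  # bootstrap spread for each coefficient
```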
The fourth and final component of information quality is utility, which is how well we are able to achieve our goals, or how we measure how well we've achieved them. There's a domain aspect, which in this case is that we want a formulation that leads to maximized yield and minimized waste in the post-processing of the material. The statistical analysis utility refers to the model that we fit. What we're going for there is least squares accuracy of our model, in terms of how well we're able to predict what would result from candidate combinations of mixture factors.
Now I'm going to go through a set of questions that make up a detailed InfoQ assessment, organized into the eight dimensions of information quality. I want to point out that not all questions will be equally relevant to different data science and statistical projects, and that this is not intended as rigid dogma, but rather as a set of things that it is a good idea to ask oneself. These questions represent a kind of data-analytic wisdom that looks more broadly than just the application of a particular statistical technology. A copy of a spreadsheet with these questions, along with pointers to the JMP features that are most useful for answering each one, will be uploaded to the JMP Community along with this talk for you to use. As I proceed through the questions, I'll be demoing an analysis of the data in JMP.
So Question 1 is: is the data scale used aligned with the stated goal? The Xs that we have consist of a single categorical variable, processing, and 11 continuous inputs. These are measured as percentages and are recorded to half a percent. We don't have the total amounts of the ingredients, only the percentages; the totals are information that was either lost or never recorded. There are other potentially important pieces of information missing here. The time between formulating the batches and taking the measurements is gone, and there could have been other covariate-level information, missing here, that would have described the conditions under which the reaction occurred. Without more information than I have, I cannot say how important this kind of covariate information would have been. We do have information on the day of the batch, so that could possibly be used as a surrogate. Overall, we have what are, hopefully, the most important inputs, as well as measurements of the responses we wish to optimize. We could have had more information, but this looks promising enough to try an analysis with.
The second question related to data resolution is: how reliable and precise are the measuring devices and data sources? And the fact is, we don't have a lot of specific information here. A statistician internal to the company would have had more information. In this case we have no choice but to trust that the chemists formulated and recorded the mixtures well.
The third question related to data resolution is: is the data analysis suitable for the data aggregation level? And the answer here is yes, assuming that the measurement system is accurate and that the data are clean enough. What we're actually going to end up doing is using the Functional Data Explorer to extract functional principal components, which are a data-derived kind of data aggregation, and then modeling those functional principal components using the input variables.
So now we move on to the data structure dimension, and the first question we ask is: is the data used aligned with the stated goal? I think the answer is a clear yes here. We're trying to maximize yield, we've got measurements for that, and the inputs are recorded as Xs.
The second data structure question is where things really start to get interesting for me, which is: are the integrity details (outliers, missing values, data corruption) described and handled appropriately? From here we can use JMP to understand where the outliers are, figure out strategies for what to do about missing values, observe their patterns, and so on. So this is where things are going to get a little more interesting. The first thing we're going to do is determine whether there are any outliers in the data that we need to be concerned about.
So to do that, we're going to go into the Explore Outliers platform off the Screening menu. We're going to load up the response variables, and because this is a multivariate setting, we're going to use a new feature in JMP Pro 16 called Robust PCA Outliers. We see where the large residuals are in those Pareto-type plots; there's a snapshot showing where there are some potentially unusually large observations. I don't really think this looks too unusual or worrisome to me. We can save the large outliers to a data table and then look at them in the Distribution platform, and what we see looks kind of like a normal distribution with the middle taken out. So I think this is data coming from essentially the same population, and there's nothing really to worry about here, outliers-wise.
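As a rough sketch of the underlying idea, not of JMP's Robust PCA Outliers algorithm itself, here is one way to flag rows with large residuals from a low-rank model in Python; the data are synthetic stand-ins.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
Y = rng.normal(size=(100, 13))           # stand-in for the 13 response columns
Y[5, 2] += 12.0                          # plant one gross outlier cell

# Fit a low-rank model and measure each cell's reconstruction error.
pca = PCA(n_components=4).fit(Y)
resid = Y - pca.inverse_transform(pca.transform(Y))
row_scores = np.abs(resid).max(axis=1)   # worst residual cell per row

print(np.argsort(row_scores)[-3:])       # rows with the largest residuals
```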
So once we've taken care of the outlier situation, we go in and explore missing values. What we're going to do first is load the Ys into the platform, and then use the missing value snapshot to see what patterns there are among our missing values. It looks like the missing values tend to occur in horizontal clusters, with the same values missing across rows; you can see that with the black splotches here.
And then we'll apply Automated Data Imputation, which goes ahead and saves formula columns that impute missing values in new columns using a regression-type algorithm developed by a PhD student of mine named Milo Page at NC State. We can play around a little and get a sense of how the ADI algorithm is working. It has created formula columns that peel off elements of the ADI impute column, which is a vector formula column. The scoring impute function calculates the expected value of the missing cells given the non-missing cells whenever it encounters a missing value, and otherwise just carries through the non-missing value. So you can see 207 in Y6 there: it's initially 207, but then I change it to missing, and it's now imputed to be 234.
So I'll do this a couple of times so you can see how it's working. Here I'll put in a big value for Y7, and that's now been replaced. And if we go down and add a row, then all the values are missing initially, and the column means are used for the imputations. If I were to go ahead and add values for some of those missing cells, it would start computing the conditional expectation of the still-missing cells using the information in the non-missing ones.
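For readers outside JMP, a conceptually similar regression-type imputation can be sketched with scikit-learn's IterativeImputer; this is a stand-in for the idea on synthetic data, not the ADI algorithm itself.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(1)
Y = rng.normal(loc=200, scale=25, size=(100, 13))  # stand-in responses
mask = rng.random(Y.shape) < 0.05                  # knock out ~5% of cells
Y_missing = np.where(mask, np.nan, Y)

# Each column is regressed on the others, and missing cells are filled
# with conditional expectations, iterating until imputations stabilize.
imputer = IterativeImputer(max_iter=10, random_state=0)
Y_imputed = imputer.fit_transform(Y_missing)

print(np.abs(Y_imputed[mask] - Y[mask]).mean())    # mean imputation error
```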
So our next question on data structure is: are the analysis methods suitable for the data structure? We've got 11 mixture inputs and a processing variable that's categorical; those are going to be inputs into a least squares type model. We have 13 continuous responses, and we can model them individually using least squares, or we can model functional principal components. Now, there are problems. The input variables have not been randomized at all. It's very clear that they would muck around with one or more of the compounds and then move on to another one, so the order in which the input variables were varied was kind of haphazard. It's a clear lack of randomization, and that's going to negatively impact the generalizability and strength of our conclusions.
Data integration is the third InfoQ dimension. These data are manually entered lab notes, consisting mostly of mixture percentages and equipment readouts. We can only assume that the data were entered correctly and that the Xs are aligned properly with the responses. If that isn't the case, then the model will have serious bias problems and problems with generalizability. Integration is more of an issue with observational data science problems and machine learning exercises than with lab experiments like this.
Although it doesn't apply here, I'll point out that privacy and confidentiality concerns can be identified by modeling the sensitive part of the data using the to-be-published component of the data. If the resulting model is predictive, then one should be concerned that privacy requirements are not being met.
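A minimal sketch of that disclosure check in Python, with entirely hypothetical data: if the columns slated for release predict the sensitive column well, publication is risky.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
public = rng.normal(size=(500, 6))  # columns intended for release
sensitive = 3.0 * public[:, 0] + rng.normal(scale=0.5, size=500)  # leaks via column 0

# Cross-validated R^2 of the sensitive column given the public ones:
# a value near 1 means the release still exposes the sensitive data.
r2 = cross_val_score(LinearRegression(), public, sensitive, scoring="r2", cv=5)
print(r2.mean())
```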
Temporal relevance refers to the operational time sequencing of data collection, analysis and deployment, and whether gaps between those stages lead to a decrease in the usefulness of the information in the study. In this case, we can only hope that the material supplies are reasonably consistent and that the test equipment is reasonably accurate, which is an unverifiable assumption at this point. The time resolution we have on the data collection is at the day level, which means there isn't much way to verify whether there is time variation within each day.
Chronology of data and goal is about the availability of relevant variables, both in terms of whether a variable is present at all in the data and whether the right information will be available when the model is deployed. For predictive models, this relates to models being fit to data similar to what will be present at the time the model is evaluated on new data. In this respect, our data set is certainly fine. For establishing causality, however, we aren't in nearly as good a shape, because the lack of randomization implies that time effects and factor effects may be confounded, leading to bias in our estimates. Endogeneity, or reverse causation, could clearly be an issue here, as variables like temperature and reaction time could be impacting the responses but have been left unrecorded. Overall, there is a lot we don't know about this dimension in an information quality sense.
The rest of the InfoQ assessment is going to depend on the type of analysis that we do. So at this point I'm going to go ahead and conduct an analysis of this data using the Functional Data Explorer platform in JMP Pro, which allows me to model across all the columns simultaneously in a way that's based on functional principal components, which contain the maximum amount of information across all those columns, represented in the most efficient format possible. I'm going to be working on the imputed versions of the columns that I calculated earlier in the presentation. And I'll point out that I'm going to be working to find combinations of the mixture factors that achieve, as closely as possible in a least squares sense, an ideal curve created by the practitioner, one that maximizes the amount of potential product that could be in a batch while minimizing the amount of impurity that they realistically thought a batch could contain.
So I begin the analysis by going to the Analyze menu and bringing up the Functional Data Explorer. This has rows as functions. I'm going to load up my imputed rows, and then I'm going to put in my formulation components and my processing column as a supplementary variable. We've got an ID function; that's batch ID. Once I'm in, I can see the functions, both overlaid all together and individually. Then I can load up the target function, which is the ideal, and that will change the analysis that results once I start going into the modeling steps. These are pretty simple functions, so I'm just going to model them with B-splines.
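As an outside-JMP sketch of the same pipeline idea, here is B-spline smoothing followed by principal components on the fitted curves in Python; this is a conceptual stand-in for FDE on synthetic curves, not its actual implementation.

```python
import numpy as np
from scipy.interpolate import make_lsq_spline
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
t = np.linspace(0, 1, 13)                          # 13 readings per batch
curves = np.sin(2 * np.pi * np.outer(rng.random(100), t)) \
         + rng.normal(scale=0.1, size=(100, 13))   # stand-in batch curves

# Smooth each batch with a cubic B-spline on a common knot set,
# then evaluate every fit on a shared fine grid.
knots = np.r_[[0.0] * 4, [0.33, 0.66], [1.0] * 4]
grid = np.linspace(0, 1, 50)
smoothed = np.array([make_lsq_spline(t, y, knots, k=3)(grid) for y in curves])

# Functional principal components = PCA on the smoothed curves;
# the scores (one row per batch) become responses for DOE-style models.
fpca = PCA(n_components=4).fit(smoothed)
scores = fpca.transform(smoothed)
print(fpca.explained_variance_ratio_)
```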
And then I'm going to go into my Functional DOE analysis. This is going to fit the model that connects the inputs to the functional principal components, and then connect all the way through the eigenfunctions so that we're able to recover the overall functions as they change while we vary the mixture factors. The functional principal component analysis has indicated that there are four dimensions of variation in these response functions. To understand what they mean, let's go ahead and explore with the FPC profiler. Watch this pane right here as I adjust FPC 1: we can see that this FPC is associated with peak height. FPC 2 looks like a kind of peak narrowness; it's almost like a resolution principal component. The third one is related to a kind of knee on the left of the dominant peak. And FPC 4 looks like it's primarily related to the impurity. So that's the underlying meaning of these four functional principal components.
So we've characterized our goal as maximizing the product and minimizing the impurity, and we've communicated that into the analysis through this ideal or golden curve that we supplied at the beginning of the FDE exercise. To get as close as possible to that ideal curve, we turn on desirability functions, and then we can go ahead and maximize desirability. And we find that the optimal combination of inputs is about 4.5% of Ingredient 4, 2% of Ingredient 6, 2.2% of Ingredient 8, and 1.24% of Ingredient 9, using processing method two.
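A minimal sketch of the desirability idea in Python, with hypothetical ranges and predictions: each response is mapped onto a 0-to-1 desirability scale, and the scales are combined with a geometric mean, which an optimizer would then maximize over the factor settings.

```python
import numpy as np

def d_larger_is_better(y, lo, hi):
    """Desirability rising from 0 at lo to 1 at hi."""
    return np.clip((y - lo) / (hi - lo), 0.0, 1.0)

def d_smaller_is_better(y, lo, hi):
    """Desirability falling from 1 at lo to 0 at hi."""
    return np.clip((hi - y) / (hi - lo), 0.0, 1.0)

# Hypothetical model predictions for one candidate formulation.
yield_pred, impurity_pred = 92.0, 1.4
d = np.array([
    d_larger_is_better(yield_pred, 70.0, 100.0),
    d_smaller_is_better(impurity_pred, 0.5, 5.0),
])
overall = d.prod() ** (1.0 / d.size)   # geometric mean of desirabilities
print(overall)
```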
Let's review how we've gotten here. We first imputed the missing response columns. Then we found B-spline models that fit those functions well in the FDE platform. A functional principal components analysis determined that there were four eigenfunctions characterizing the variation in this data, and via the FPC profiler we determined that each of these four eigenfunctions has a reasonable subject matter meaning. The Functional DOE analysis consisted of applying pruned forward selection to each of the individual FPC scores, using the DOE factors as input variables. And we see here that this has found the combinations of interactions and main effects that were most predictive for each of the functional principal component scores individually. The Functional DOE profiler has elegantly captured all these aspects in one representation that allows us to find the formulation and processing step that is predicted to have desirable properties, as measured by high yield and low impurity.
So now we can do an InfoQ assessment of the generalizability of the data and the analysis. In this case, we're more interested in scientific generalizability, as the experimenter is a deeply knowledgeable chemist working with this compound, so we're going to be relying more on their subject matter expertise than on statistical principles and tools like hypothesis tests and so forth. The goal is primarily predictive, but the generalizability is somewhat problematic because the experiment wasn't designed. Our ability to estimate interactions is weakened for techniques like forward selection, and impossible via least squares analysis of the full model. Because the study wasn't randomized, there could be unrecorded time and order effects. We don't have potentially important covariate information like temperature and reaction time. This creates another big question mark regarding generalizability.
Repeatability and reproducibility of the study are also unknown here, as we have no information about the variability due to the measurement system. Fortunately, we do have tools like JMP's Evaluate Design to understand the existing design, as well as Augment Design, which can greatly enhance the generalization performance of the analysis. Augmentation can improve information about main effects and interactions, and a second round of experimentation could be randomized to further enhance generalizability.
So now I'm going to go through a couple of simple steps to show how to improve the generalization performance of our study using the design tools in JMP. Before I do that, I want to point out that I had to take the data and convert it so that it was in proportions rather than in percents; otherwise the design tools were not really agreeing with the data very well. So we go into the Evaluate Design platform, and then we load up our Ys and our Xs. I requested the ability to handle second-order interactions, and then I got this alert saying, hey, I can't do that, because we're not able to estimate all the interactions given the one-factor-at-a-time data that we have. So I backed up. We go to the Augment Designer, load everything up, and set augment. We choose an I-optimal design, because we're really concerned with prediction performance here.
And I set the number of runs to 148. The custom designer requested 141 as a minimum, but I went to 148 just to make sure that we have the ability to estimate all of our interactions pretty well. After that, it takes about 20 seconds to construct the design.
So now that we have the design, I'm going to show the two most important diagnostic tools in the Augment Designer for evaluating a design. On the left, we have the fraction of design space plot. This is showing that 50% of the volume of the design space has a prediction variance less than 1, where 1 would be equivalent to the residual error. So we're able to get better-than-measurement-error quality predictions over the majority of the design space. On the right we have the color map on correlations. This is showing that we're able to estimate everything pretty well. Because of the mixture constraint, we're getting some strong correlations between interactions and main effects, but overall the effects are fairly clean: the interactions are pretty well separated from one another, and the main effects are pretty well separated from one another as well.
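As a rough illustration of what a fraction of design space plot computes, here is a Monte Carlo sketch in Python for a generic linear model; the design matrix is a random toy stand-in, not the augmented mixture design. The relative prediction variance at a point x is x'(X'X)^{-1}x, and the plot is the distribution of that quantity over the design region.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 148, 12
X = np.column_stack([np.ones(n), rng.uniform(-1, 1, size=(n, p - 1))])  # toy design
XtX_inv = np.linalg.inv(X.T @ X)

# Sample random points in the region and compute x' (X'X)^{-1} x at each;
# a value of 1 corresponds to the residual error variance.
pts = np.column_stack([np.ones(5000), rng.uniform(-1, 1, size=(5000, p - 1))])
pv = np.einsum("ij,jk,ik->i", pts, XtX_inv, pts)

print((pv < 1).mean())        # fraction of design space below variance 1
print(np.quantile(pv, 0.5))   # median relative prediction variance
```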
After looking at the design diagnostics, we can make the table. Here I have shown the first 13 of the augmented runs, and we see that we have more randomization: we don't have streaks where the same main effect is used over and over again. That's evidence of better randomization, and overall the design is going to be able to estimate the main effects and interactions much better, having received better, higher quality information in this second stage of experimentation.
So the input variables, the Xs, are accurate representations of the mixture proportions, so the construct of interest is clearly captured. The responses are close surrogates for the amount of product and the amount of impurity in the batches, so we're pretty good on question 7.1 there; the justifications are clear. After the study, we can of course go prepare a batch with the formulation that was recommended by the FDOE profiler, try it out, and see if we're getting the kind of performance we were looking for. It's very clear that that would be the way we can assess how well we've achieved our study goals.
So now, under the last InfoQ dimension, communication: by describing the ideal curve as a target function, the Functional DOE profiler makes the goal and the results of the analysis crystal clear, and this can be expressed at a level that is easily interpreted by the chemists and managers of the R&D facility. And as we have done our detailed information quality assessment, we've been upfront about the strengths and weaknesses of the study design and data collection. If the results do not generalize, we certainly know where to look for the problems.
Once you become familiar with the concepts, there is a nice add-in written by Ian Cox that you can use to do a quick quantitative InfoQ assessment. The add-in has sliders for the upper and lower bounds of each InfoQ dimension. These dimensions are combined using a desirability function approach into an overall interval for InfoQ, over on the left.
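A sketch of that kind of combination in Python, assuming, as the desirability approach suggests, a geometric mean over the eight dimension scores; the numbers here are hypothetical.

```python
import numpy as np

# Hypothetical lower and upper scores (0 to 1) for the eight dimensions:
# resolution, structure, integration, temporal relevance, chronology,
# generalizability, operationalization, communication.
lower = np.array([0.6, 0.7, 0.5, 0.6, 0.5, 0.4, 0.7, 0.8])
upper = np.array([0.8, 0.9, 0.7, 0.8, 0.7, 0.6, 0.9, 0.9])

# Geometric-mean combination yields an overall InfoQ interval.
infoq_low = lower.prod() ** (1.0 / lower.size)
infoq_high = upper.prod() ** (1.0 / upper.size)
print(f"InfoQ between {infoq_low:.0%} and {infoq_high:.0%}")
```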
Here is an assessment for the data and analysis I covered in this presentation. The add-in is also a useful thinking tool that will make you consider each of the InfoQ dimensions. It's also a practical way to communicate InfoQ assessments to your clients or to your management, as it provides a high-level view of information quality without using a lot of technical concepts and jargon. And the add-in is useful as the basis for an InfoQ comparison.
My original hope for this presentation was to be a little more ambitious. I had hoped to cover the analysis I just went through, as well as another, simpler one, in which I skip imputing the responses and do a simple multivariate linear regression model of the response columns. Today, I'm only able to offer a final assessment of that approach. As you can see, several of the InfoQ dimensions suffer substantially without the more sophisticated analysis. It is very clear that the simple analysis leads to a much lower InfoQ score; the upper limit of the simple analysis isn't that much higher than the lower limit of the more sophisticated one.
With experience, you will gain intuition about what a good InfoQ score is for data science projects in your industry, and you will pick up better habits, as you will no longer be blind to the information bottlenecks in your data collection, analysis and model deployment. The add-in presents information quality with an easy-to-use interface. This was my first formal information quality assessment, and speaking for myself, the information quality framework has given words and structure to a lot of things I already knew instinctively. It has already changed how I approach new data analysis projects. I encourage you to go through this process yourself on your own data, even if that data and analysis are already very familiar to you. I guarantee that you will be a wiser and more efficient data scientist because of it. Thank you.