Ron Kenett, Chairman, KPA Group and Samuel Neaman Institute, Technion
Christopher Gotwalt, JMP Director, Statistical R&D, SAS

Data analysis – from designed experiments, generalized regression to machine learning – is being deployed at an accelerating rate. At the same time, concerns with reproducibility of findings and flawed p-value interpretation indicate that well-intentioned statistical analyses can lead to mistaken conclusions and bad decisions. For a data analysis project to properly fulfill its goals, one must assess the scope and strength of the conclusions derived from the data and tools available. This focus on statistical strategy requires a framework that isolates the components of the project: the goals, data collection procedure, data properties, the analysis provided, etc. The InfoQ Framework provides structured procedures for making this assessment. Moreover, it is easy to operationalize InfoQ in JMP. In this presentation we provide an overview of InfoQ along with use case studies, drawing from consumer research and pharmaceutical manufacturing, to illustrate how JMP can be used to make an InfoQ assessment, highlighting situations of both high and low InfoQ. We also give tips showing how JMP can be used to increase information quality by enhanced study design, without necessarily acquiring more data. This talk is aimed at statisticians, machine learning experts and data scientists whose job it is to turn numbers into information.

Auto-generated transcript...


Speaker

Transcript

Hello.
My name is Ron Kenett. This
is a joint talk with Chris
Gotwalt and we basically
have two messages
that should come out of the
talk. One is that we should
really be concerned about
information and information
quality. People tend to talk
about data and data quality, but
data is really not the issue.
We are statisticians. We are
data scientists. We turn numbers,
data into information so that
our goal should be to make sure
that we generate high quality
information. The other message
is that JMP can help you
achieve that, and this actually turns out to happen in surprising ways. So by combining
the expertise of Chris and an
introduction to information
quality, we hope that these two
messages will come across
clearly. So if I had to
summarize what it is that we
want to talk about, after all,
it is all about information
quality. I gave a talk at the Summit in Prague four years ago, and that talk was generic. It talked about my journey from quality by design to information quality. In this talk we focus
on how this can be done with
JMP. This is a more detailed
and technical talk than the
general talk I gave in Prague.
You can watch that talk.
There's a link listed here. You can
find it on the JMP community.
So we're going to talk about
information quality, which is
the potential of the data set, a
specific data set, to achieve a
particular goal, a specific goal,
with the given empirical method.
So in that definition we have
three components that are
listed. One is a certain data
set. Here is the data. The
second one is the goal,
the goal of the analysis, what
it is that we want to achieve.
And the third one is how we will
do that, which is, with what
methods we're going to generate
information, and that potential
is going to be assessed with the
utility function. And I will
begin with an introduction to
information quality, and then
Chris will take over, discuss the case study, and show
show you how to conduct an
information quality assessment.
Eventually this should
answer the question how JMP
supports InfoQ; those would be the takeaway points from the talk. The setup for
this is that we encourage
what I called a lifecycle view
of statistics. In other words,
not just data analysis.
We should be part
of the problem elicitation
phase. Also, the goal
formulation phase, that deserves
a discussion. We should
obviously be involved in the
data collection scheme if it's
through experiments or through
surveys or through observational
data. Then we should also take
time for formulation of the
findings and not just pull out
printed reports on regression coefficient estimates and their
significance, but we should
also discuss what are the
findings? Operationalization of findings means, OK, what can we
do with these findings? What are
the implications of the
findings? This
needs to be communicated to the
right people in the right way,
and eventually we should do an
impact assessment to figure out,
OK, we did all this; what has
been the added value of our
work? I talked about this life cycle view of statistics a few years ago. This is the prerequisite,
the perspective to what
I'm going to talk about. So as I
mentioned, the information
quality is the potential of a
particular data set to achieve a
particular goal using given
empirical analysis methods. This
is identified through four
components the goal, the data,
the analysis method, and the
utility measure. So, as a mathematical expression, the utility of applying f to X, conditioned on the goal, is how we identify InfoQ, information quality.
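For reference, that definition can be written compactly, as in the Kenett and Shmueli formulation:

    \mathrm{InfoQ}(f, X, g) = U\big( f(X \mid g) \big)

where g is the analysis goal, X is the available data set, f is the empirical analysis method, and U is the utility measure.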
This was published in the Royal Statistical Society
Series A in 2013 with eight
discussants, so it was amply
discussed. Some people thought
this was fantastic and some
people had a lot of critique on
that idea, so this is a wider
scope consideration of what
statistics is about. We also
wrote, in 2016, we meaning myself and Galit Shmueli, a book called Information Quality. And in
the context of information
quality we did what is called
deconstruction. David Hand has
a paper called Deconstruction
of Statistics. This is the
deconstruction of information
quality into eight dimensions. I
will cover these eight dimensions.
That's my part in the talk and
then Chris will show how this
is implemented in a specific
case study.
Another aspect that relates to
this is another book I have.
This is recent, a year ago
titled The Real Work of Data
Science and we talk about what
is the role of the data
scientists in organizations and
in that context, we emphasized
the need for the data scientist
to be involved in the generation of information, with information quality understood as meeting the goals of the organization. So let me
cover the eight dimensions. That's my intro. The
first one is data resolution. We
have a goal. OK, we would like to know the level of flu in the country or in the area where we live, because that will impact our decision on whether to go to the park, where we could meet people, or to a jazz concert. And that
concert is tomorrow.
If we look up the CDC data on
the level of flu, that data is
updated weekly, so we could get
the red line in the graph you
have in front of you, so we
could get data of a few days
ago, maybe good, maybe not good
enough for our goal. Google Flu,
which is based on searches
related to flu, is updated moment to moment, so it's updated online; it will probably give us better information. So for
our goal, the blue line, the
Google Flu Trends indicator, is probably
more appropriate. The second
dimension is data structure.
To meet our goal, we're going to
look at data. We should...we
should identify the data sources
and the structure in these data
sources. So some data could be
text, some could be video, some
could be, for example, the
network of recommendations. This
is an Amazon picture of how, if you look at a book, you're going to get some other books recommended. And if you go to these other books, you're going to have more books recommended. So the data
structure can come in all sorts
of shapes and forms and this can
be text. This can be functional
data. This can be images. We are
not confined now to what we used
to call data, which is what you
find in an Excel spreadsheet.
The data could be corrupted, could have missing values, or could have unusual patterns, which would be something to look into. Some patterns, where things are repeated, may mean that some of the data is just copy and paste, and we would like to be warned about such things.
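As a minimal illustration in Python with pandas (the file name here is hypothetical), this is the kind of warning one might generate for missing values and copy-and-paste rows:

    import pandas as pd

    df = pd.read_csv("measurements.csv")  # hypothetical data table

    # Count missing values per column.
    print(df.isna().sum())

    # Flag rows that exactly repeat an earlier row, a possible sign
    # of copy-and-paste data entry that we would want to be warned about.
    duplicates = df[df.duplicated(keep="first")]
    print(f"{len(duplicates)} exact duplicate rows found")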
The third dimension is data integration.
When we consider the data from
these different sources, we're
going to integrate them so we can do some analysis, with linkage through an ID, for example. We would do that, but in doing that, we might create some issues, for example in disclosing data
that normally should be
anonymized. Data
integration, yeah, that will
allow us to do fantastic things,
but if the data is perceived to
have some privacy exposure
issues, then maybe the quality
of the information from the
analysis that we're going to do
is going to be
affected. So data integration
should be looked into very, very
carefully. This is what people used to call ETL: extract, transform, and load. We now have much better methods for doing that. The Join option in JMP, for example, will offer options for doing that.
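As a minimal sketch of this kind of ID-based linkage outside JMP (Python with pandas; the file and column names are hypothetical), including a simple step to avoid carrying a directly identifying key into the merged table:

    import hashlib
    import pandas as pd

    # Two hypothetical sources that share a customer ID.
    orders = pd.read_csv("orders.csv")    # customer_id, order_value, ...
    surveys = pd.read_csv("surveys.csv")  # customer_id, satisfaction, ...

    # Integrate the sources through the shared ID.
    merged = orders.merge(surveys, on="customer_id", how="inner")

    # Replace the raw ID with a salted hash so the merged table can be
    # shared without disclosing the original identifier.
    SALT = "project-specific-secret"
    merged["customer_id"] = merged["customer_id"].astype(str).map(
        lambda s: hashlib.sha256((SALT + s).encode()).hexdigest()[:12]
    )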
Temporal relevance. OK, that's
pretty clear. We have data. It is
stamped somehow. If we're going
to do the analysis later, after the data collection, and if the deployment that we
consider is even later, then the
data might not be temporally
relevant. In a common
situation, if we want to compare
what is going on now, we would
like to be able to make this
comparison to recent data or
data before the pandemic
started, but not 10 years
earlier. The official statistics
on health records used to be two
or three years behind in terms
of timing, which made it very difficult to use official statistics in assessing
what is going on with the
pandemic. Chronology of data and
goal is related to the decision
that we make as a result of our
goal. So if, for example, our
goal is to forecast air quality,
we're going to use some
predictive models on the Air
Quality Index reported on a
daily basis. This gives us a one-to-six scale from hazardous to good. There are value ranges representing levels of health concern: 0-50 is good; 300-500 is hazardous. And the
chronology of data and goal
means that we should be able to
make a forecast on a daily
basis. So the methods we use
should be updated on a daily
basis. If, on the other hand,
our goal is to figure out how this AQI index is computed, then we are not really bound by the timeliness of the analysis.
You know, we could take our time. There's no urgency in getting the analysis done on a daily basis.
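To make that scale concrete, here is a minimal Python sketch mapping an AQI value to its level of health concern, assuming the standard US EPA category breakpoints:

    def aqi_category(aqi: float) -> str:
        # Standard US EPA breakpoints (upper bound of each category).
        breakpoints = [
            (50, "Good"),
            (100, "Moderate"),
            (150, "Unhealthy for Sensitive Groups"),
            (200, "Unhealthy"),
            (300, "Very Unhealthy"),
            (500, "Hazardous"),
        ]
        for upper, label in breakpoints:
            if aqi <= upper:
                return label
        return "Beyond the AQI scale"

    print(aqi_category(42))   # Good
    print(aqi_category(320))  # Hazardous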
Generalizability, the sixth dimension, is about
taking our findings and
considering where this could
apply in more general terms,
other populations, other
situations. This can be done
intuitively. Marketing managers who
have seen a study on one market, let's call it Market A, might already understand what the implications are for Market B
without data. People who are
physicists will be able to
make predictions based on
mechanics, on first principles, without data.
So some of the generalizability
is done with data. This is the
basis of statistical
generalization, where we go from
the sample to the population.
Statistical inference is about
generalization. We generalize
from the sample to the
population. And some can be
domain based, in other words,
using expert knowledge, domain
expertise, not necessarily with
data. We have to recognize that
generalizability is not just
done with statistics.
The seventh dimension is
construct operationalization,
which is really about what it is
that we measure. We want to
assess behavior, emotions, what
it is that we can measure, that
will give us data that reflects
behavior or emotions.
The example I typically give here is pain.
We know what pain is. How do
we measure that? If you go to a
hospital and you ask the nurse,
how do you assess pain, they
will tell you, we have a scale,
1 to 10. It's very
qualitative, not very
scientific, I would say. If we
want to measure the level of
alcohol in drivers on the road, it will be difficult to measure. So we might measure
speed as a surrogate measure.
Another part of
operationalization is the other
end of the story. In other
words, the first part, the
construct is what we measure,
which reflects our goal. The end result here is that
we have findings and we want to
do something with them. We want
to operationalize our finding.
This is what the action
operationalization is about.
It's about what you do with the findings. When presenting, we used to ask three questions.
These are very important
questions to ask. Once you have
done some analysis, you have
someone in front of you who
says, oh, thank you very much, you're done, to you, the statistician
or the data scientist. So this takes you one extra step, getting you to ask your customer these simple questions: What do you want to accomplish? How will you do that? And how will you know if you have accomplished that? We can help answer, or
at least support, some of these
questions.
The eighth dimension is
communication. I'm giving you an
example from a very famous old
map from the 19th century, which
is showing the march of the
Napoleon Army from France to
Moscow, in Russia. The width of the path indicates the size of the army, and then, in black, you see what happened to them on their way back. So basically this was a disastrous march. We
can relate this old map to
existing maps, and there is a
JMP add-in, which you can
find on the JMP Community, to show you with dynamic bubble plots what this looks like. So I've covered
very quickly the eight information
quality dimensions. My last
slide puts what I've talked about in a historical perspective; it really puts some proportion to what I'm saying. I think we are really in the era of information quality. We used to be concerned with product quality in the 17th and 18th centuries. We then moved to
process quality and service
quality. This is a short memo
on proposing a control chart,
1924, I think.
Then we moved to management quality. This is the Juran trilogy of design, improvement, and control. The Six Sigma DMAIC (define, measure, analyze, improve, control) process is the improvement process of Juran, and Juran was the grandfather of Six Sigma in that sense.
Then in the '80s, Taguchi came
in. He talks about robust
design. How can we handle
variability in inputs by proper
design decisions? And now we are
in the Age of information
quality. We have sensors. We
have flexible systems. We are
depending on AI and machine
learning and data mining, and we are gathering big, big numbers, which we call big data. Interest in information quality should be a prime interest. I'm going to try to convince you of that, with the help of Chris.
We are here and JMP can
help us achieve that in a really unusual way.
What you will see at the end of
the case study that Chris will
show is also how to do an information quality assessment on a specific study and basically generate an information quality score. So if we go top down, I
can tell you this study, this
work, this analysis is maybe 80% or
maybe 30% or maybe 95%.
And through the example you will
see how to do that. There is a
JMP add-in to provide this
assessment. It's actually quite easy. There's nothing really sophisticated about that. So I'm done, and Chris, after you. Thanks, Ron. So
now I'm going to go through the
analysis of a data set in a way
that explicitly calls out the
various aspects of information
quality and show how JMP can be
used to assess and improve InfoQ. So first off, I'm
going to go through the InfoQ
components. The first InfoQ
component is the goal, so in
this case the problem statement
was that a chemical company
wanted a formulation that
maximized product yield while
minimizing a nuisance impurity
that resulted from the reaction.
So that was the high level goal.
In statistical terms, we wanted
to find a model that accurately
predicted a response on a data
set so that we could find a
combination of ingredients and
processing steps that would lead
to a better product.
The data are set up as 100 experimental formulations with one primary ingredient, X1, and 10 additives. There's also a processing factor and 13 responses. The data are
completely fabricated but were
simulated to illustrate the same
strengths and weaknesses of the
original data. The day each formulation was made was also recorded. We will be looking at this data closely, so I won't elaborate beyond pointing out that they were collected in an ad hoc way, changing one or two
additives at a time rather than
as a designed or randomized
experiment. There's a lot of
ways to analyze this data, the
most typical being least
squares modeling with forward
selection on selected responses.
That was my original intention
for this talk, but when I showed
the data to Ron, he immediately
recognized the response columns
as time series from analytical
chemistry. Even though the data
were simulated, he could see the
structure. He could see things
in the data that I didn't see or read into it. I found
this to be strikingly
impressive. It's beyond the
scope of this talk, but there is
an even better approach based on
ensemble modeling using
fractionally weighted
bootstrapping. Phil Ramsey,
Wayne Levin and I have another
talk about this methodology at
the European Discovery
Conference this year. The
approach is promising because it
can fit models to data with
more active interactions than
there are runs. The fourth and final
component of information quality
is utility, which is how well we are able to achieve our goals, or how we measure how well we've achieved our goals. There's a
domain aspect which is in this
case we want to have a
formulation that leads to maximized yield and minimized waste in post-processing of the material. The statistical
analysis utility refers to the
model that we fit. What we're
going for there is least
squares accuracy of our model in
terms of how well we're able to predict what would result from candidate combinations of mixture factors. Now I'm going
to go through a set of questions
that make up a detailed InfoQ
assessment as organized into the
eight dimensions of information
quality. I want to point out
that not all questions will be
equally relevant to different
data science and statistical
projects, and that this is not
intended to be rigid dogma but
rather a set of things that are
a good idea to ask oneself.
These questions represent a kind
of data analytic wisdom that
looks more broadly than just the
application of a particular
statistical technology. A copy
of a spreadsheet with these
questions along with pointers to
JMP features that are the most
useful for answering a
particular one will be uploaded
to the JMP Community along
with this talk for you to use. As
I proceed through the questions,
I'll be demoing an analysis of
the data in JMP. So Question 1 is, is the data scale used aligned with the stated goal? So the Xs that we have consist of a single categorical variable, processing, and 11 continuous inputs. These are measured
as percentages and are also
recorded to half a percent. We
don't have the total amounts of
the ingredients, only the
percentages. The totals are
information that was either lost
or never recorded. There are
other potentially important
pieces of information that are
missing here. The time between
formulating the batches and
taking the measurements is gone
and there could have been other
covariate level information that
is missing here that would have
described the conditions under
which the reaction occurred.
Without more information than I
have, I cannot say how important
this kind of covariate information
would have been. We do have
information on the day of the
batch, so that could be used as
a surrogate possibly. Overall we
have what are, hopefully, the most
important inputs, as well as
measurements of the responses we
wish to optimize. We could have
had more information, but this
looks promising enough to try an analysis with. The second question related to data resolution is, how reliable and precise are the measuring devices and data sources? And the fact
is, we don't have a lot of
specific information here. The
statistician internal to the company would have had more
information. In this case we
have no choice but to trust that
the chemists formulated and
recorded the mixtures well. The
third question relative to data resolution is, is the data analysis suitable for the data aggregation level? And the
answer here is yes, assuming
that their measurement system is
accurate and that the data are
clean enough. What we're going
to end up doing actually is
we're going to use the
Functional Data Explorer to
extract functional principal
components, which are a data
derived kind of data
aggregation. And then we're
going to be modeling those
functional principal components
using the input variables. So
now we move on to the data
structure dimension and the
first question we ask is, is the
data used aligned with the
stated goal? And I think the
answer is a clear yes here. We're
trying to maximize
yield. We've got measurements for
that, and the inputs are
recorded as Xs. The second data
structure question is where
things really start to get
interesting for me. So this is, are the integrity details
(outliers, missing values, data
corruption) issues described and
handled appropriately? So from
here we can use JMP to be able
to understand where the outliers
are, figure out strategies for
what to do about missing values,
observe their patterns and so
on. So this is where things are going to get a little
bit more interesting. The first
thing we're going to do is we're
going to determine if there are
any outliers in the data that we
need to be concerned about. So
to do that, we're going to go
into the explore outliers
platform off of the screening
menu. We're going to load up the
response variables, and because
this is a multivariate setting,
we're going to use a new feature
in JMP Pro 16 called Robust
PCA Outliers. So we see where
the large residuals are in those
kind of Pareto type plots.
There's a snapshot showing where there are some potentially unusually large observations. This doesn't look too unusual or worrisome to me. We can save the large outliers to a data table and then look at them in the distribution platform, and what we see kind of looks like a normal distribution with the middle taken out. So I think these data are coming from the same population, and there's nothing really to worry about here, outliers-wise.
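For readers working outside JMP Pro, here is a rough analogue of this step (a minimal Python sketch with scikit-learn; JMP's Robust PCA Outliers uses a robust decomposition, which this plain PCA does not reproduce). It flags rows whose PCA reconstruction error is unusually large:

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    def flag_large_residuals(Y, n_components=4, k=3.0):
        # Rows of Y whose PCA reconstruction error is more than k robust
        # standard deviations (via the MAD) above the median error.
        Z = StandardScaler().fit_transform(Y)
        pca = PCA(n_components=n_components).fit(Z)
        residual = Z - pca.inverse_transform(pca.transform(Z))
        err = np.sqrt((residual ** 2).sum(axis=1))
        mad = np.median(np.abs(err - np.median(err)))
        return np.where(err > np.median(err) + k * 1.4826 * mad)[0]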
So once we've taken care of the outlier situation, we go in and explore
missing values. So what we're
going to do first is we're going
to load up the Ys as...into the
platform, and then we're going
to use the missing value
snapshot to see what patterns
they are amongst our missing
values. It looks like the
missing values tend to occur in
horizontal clusters, and there's
also the same missing values
across rows. So you can see that
with the black splotches here.
And then we'll go apply an
automated data imputation,
which goes ahead and saves
formula columns that impute
missing values in the new
columns using a regression type
algorithm that was developed by
a PhD student of mine named Milo
Page at NC State. So we can play
around a little bit and get a
sense of how the ADI
algorithm is working. So it's
created these formula columns
that are peeling off elements of
the ADI impute column, which is
a vector formula column, and the
scoring impute function
is calculating the expected
value of the missing cells given
the non missing cells, whenever
it's got a missing value, and it's just carrying through a non-missing value otherwise. So you can see 207 in Y6 there. It's initially 207, but then I change it to missing and it's now being imputed to be 234.
So I'll do this a couple of times so you can kind of see how it's working. So here I'll put in a big value for Y7, and that's now been replaced. And if we go down
and we add a row,
then all values are missing initially and the column means are used for the imputations. If I were to go
ahead and add values for some of
those missing cells, it would
start doing the conditional
expectation of the still missing
cells using the information
that's in the non-missing ones.
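The conditional-expectation idea behind this kind of imputation can also be sketched directly. This is a minimal Python illustration of filling missing cells with their expected value given the observed cells under a multivariate normal model; it is the underlying formula, not the ADI algorithm itself:

    import numpy as np

    def conditional_mean_impute(x, mu, Sigma):
        # Fill the NaN entries of x with E[x_missing | x_observed] under a
        # multivariate normal with mean mu and covariance Sigma:
        #   E[x_m | x_o] = mu_m + S_mo S_oo^{-1} (x_o - mu_o)
        x = np.asarray(x, dtype=float).copy()
        m = np.isnan(x)
        if not m.any():
            return x
        o = ~m
        S_oo = Sigma[np.ix_(o, o)]
        S_mo = Sigma[np.ix_(m, o)]
        x[m] = mu[m] + S_mo @ np.linalg.solve(S_oo, x[o] - mu[o])
        return x

    # Small made-up example; in practice mu and Sigma would be estimated
    # from the non-missing data.
    mu = np.array([200.0, 210.0, 190.0])
    Sigma = np.array([[100.0, 60.0, 30.0],
                      [60.0, 120.0, 40.0],
                      [30.0, 40.0, 90.0]])
    print(conditional_mean_impute([207.0, np.nan, 195.0], mu, Sigma))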
So our next question on data structure is,
are the analysis methods
suitable for the data structure?
So we've got 11 mixture inputs
and a processing variable that's
categorical. Those are going
to be inputs into a least
squares type model. We have
13 continuous responses and
we can model them using...
individually using least
squares. Or we can model
functional principal
components. Now there are problems. The
input variables have not been
randomized at all. It's very
clear that they would muck
around with one or more of
the compounds and then move
on to another one. So the
order in which the
input variables were varied
was kind of haphazard. It's a
clear lack of randomization, and
that's going to negatively
impact the generalizability
and strength of our conclusions.
Data integration is the third
InfoQ dimension. These data
are manually entered lab notes
consisting mostly of mixture
percentages and equipment
readouts. We can only assume
that the data were entered
correctly and that the Xs are
aligned properly with responses.
If that isn't the case, then the
model will have serious bias
problems and have
problems with generalizability.
Integration is more of an issue
with observational data science problems and machine learning exercises than lab experiments like this. Although
it doesn't apply here, I'll
point out that privacy and
confidentiality concerns can be
identified by modeling the sensitive part of the data using the to-be-published components of the data. If the resulting model is predictive, then one should be concerned that privacy requirements are not being met. Temporal
relevance refers to the
operational time sequencing of
data collection, analysis and
deployment and whether gaps
between those stages lead to a
decrease in the usefulness of
the information in the study.
In this case, we can only hope that the material supplies
are reasonably consistent and
that the test equipment is
reasonably accurate, which is an
unverifiable assumption at this
point. The time resolution we have on the data collection is at the day level, which means that there isn't much way we can verify whether there is time
variation within each day.
Chronology of data and goal is
about the availability of
relevant variables both in terms
of whether the variable is
present at all in the data or
whether the right information
will be available when the model
is deployed. For predictive
models, this relates to models
being fit to data similar to
what will be present at the time
the model will be evaluated on
new data. In this way, our data
set is certainly fine. For
establishing causality, however,
we aren't in nearly as good a
shape because the lack of
randomization implies that time
effects and factor effects may
be confounded, leading to bias
in our estimates. Endogeneity,
or reverse causation, could
clearly be an issue here, as
variables like temperature and
reaction time could clearly be
impacting the responses, but have
been left unrecorded. Overall,
there is a lot we don't know
about this dimension in an
information quality sense.
The rest of the InfoQ
assessment is going to be
dependent upon the type of
analysis that we do. So at this
point I'm going to go ahead and
conduct an analysis of this data
using the Functional Data
Explorer platform in JMP Pro
that allows me to model across
all the columns simultaneously
in a way that's based on
functional principal components,
which contain the maximum amount
of information across all those
columns as represented in the
most efficient format possible.
I'm going to be working on the
imputed versions of the columns
that I calculated earlier in the
presentation. And I'm going to
point out that I'm going to be
working to find combinations of
the mixture factors that achieve, as closely as possible in a least squares sense, an ideal
curve that was created by the
practitioner that maximizes the
amount of potential product that
could be in a batch while
minimizing the amount of the
impurities that they
realistically thought a batch
could contain. So I begin the
analysis by going to the Analyze menu and bringing up the Functional Data Explorer. This has rows as functions. I'm going to load up my imputed responses, and then I'm going
to put in my formulation
components and my processing
column as a supplementary
variable. We've got an ID
function, that's batch ID. Once I get in, I can see the functions, both overlaid all together and individually.
Then I can load up the target
function, which is the ideal.
And that will change the
analysis that results once I
start going into the modeling
steps. So these are pretty
simple functions, so I'm just
going to model them with
B splines.
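For intuition about what this step is doing, here is a rough stand-in outside JMP (a minimal Python sketch, not the Functional Data Explorer's actual algorithm): represent each batch's response curve with a cubic B-spline evaluated on a common grid, then run a PCA on the resulting curves to get functional-principal-component-like scores.

    import numpy as np
    from scipy.interpolate import make_interp_spline
    from sklearn.decomposition import PCA

    def fpc_scores(Y, n_components=4, n_grid=50):
        # Y is an (n_batches x n_points) matrix of response curves.
        t = np.arange(Y.shape[1])
        grid = np.linspace(t.min(), t.max(), n_grid)
        curves = np.vstack([make_interp_spline(t, row, k=3)(grid) for row in Y])
        pca = PCA(n_components=n_components).fit(curves)
        return pca.transform(curves), pca  # scores plus the fitted components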
And then I'm going to go into my
functional DOE analysis.
This is going to fit the model
that connects the inputs into
the functional principal
components and then connect all
the way through the
eigenfunctions to make it so
we're able to recover the
overall functions as they
changed, as we are varying the
mixture factors. The
functional principal component
analysis has indicated that
there are four dimensions of
variation in these response
functions. To understand what
they mean, let's go ahead and
explore with the FPC profiler.
So watch this pane right here as
I adjust FPC 1 and we can see
that this FPC is associated with
peak height. FPC2, it looks
like it's kind of a peak
narrowness. It's almost like a
resolution principal
component. The third one is
related to kind of a knee on
the left of the dominant peak.
And FPC 4 looks like it's primarily related to the impurity, so that's what the
underlying meaning is of
these four functional
principal components.
So we've characterized our goal
as maximizing the product and
minimizing the impurity, and
we've communicated that into the
analysis through this ideal or
golden curve that we supplied at
the beginning of the FDE
exercise we're doing. To get as
close as possible to that ideal
curve, we turn on desirability
functions. And then we can go
out and maximize desirability.
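As a rough illustration of what maximizing desirability means here (a minimal Python sketch, not the profiler's algorithm): given some fitted function that predicts the response curve from a candidate mixture, search over mixtures that sum to one for the one whose predicted curve is closest to the target in a least squares sense.

    import numpy as np
    from scipy.optimize import minimize

    def closest_to_target(predict_curve, target, n_inputs):
        # predict_curve(x) returns the predicted response curve for mixture x;
        # it is a placeholder for whatever model was fit to the FPC scores.
        def loss(x):
            return float(np.sum((predict_curve(x) - target) ** 2))

        x0 = np.full(n_inputs, 1.0 / n_inputs)  # start at equal proportions
        constraints = [{"type": "eq", "fun": lambda x: x.sum() - 1.0}]
        res = minimize(loss, x0, method="SLSQP",
                       bounds=[(0.0, 1.0)] * n_inputs, constraints=constraints)
        return res.x, res.fun

The actual desirability machinery is more general than this, handling maximize and minimize goals for individual responses as well as matching a target curve.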
And we find that the optimal
combination of inputs is about
4.5% of Ingredient 4, 2% of Ingredient 6, 2.2% of Ingredient 8, and 1.24% of Ingredient 9, using processing method 2. Let's review how
we've gotten here. We first
imputed the missing response columns. Then we found B-spline
models that fit those functions
well in the FDE platform. A
functional principal components
analysis determined that there
were four eigenfunctions
characterizing the variation in
this data. These four
eigenfunctions were determined
via the FPC profiler to each
have a reasonable subject
matter meaning. The functional
DOE analysis consisted of
applying pruned forward
selection to each of the
individual FPC scores using the
DOE factors as input variables.
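To see the idea in code, here is a minimal sketch of plain forward selection for one FPC score (Python; the pruned forward selection used in JMP Pro's Generalized Regression is more sophisticated, so treat this only as an illustration of the concept):

    import numpy as np

    def forward_select(X, y, max_terms=5):
        # Greedy forward selection: at each step, add the column of X that
        # most reduces the residual sum of squares of a least squares fit.
        # X holds the candidate terms (factors and interactions); y is one
        # FPC score. A stopping rule such as AICc would normally be added.
        n, p = X.shape
        selected = []
        for _ in range(min(max_terms, p)):
            best_j, best_rss = None, np.inf
            for j in range(p):
                if j in selected:
                    continue
                A = np.column_stack([np.ones(n), X[:, selected + [j]]])
                beta, rss, *_ = np.linalg.lstsq(A, y, rcond=None)
                if rss.size == 0:
                    rss = np.array([np.sum((y - A @ beta) ** 2)])
                if rss[0] < best_rss:
                    best_j, best_rss = j, rss[0]
            selected.append(best_j)
        return selected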
And we see here that these have
found combinations of
interactions and main effects
that were most predictive for
each of the functional principal
component scores individually.
The Functional DOE Profiler
has elegantly captured all
aspects into one representation
that allows us to find the
formulation and processing step that
is predicted to have desirable
properties as measured by high
yield and low impurity.
So now we can do an InfoQ
assessment of the
generalizability of the data in
the analysis. So in this case,
we're more interested in
scientific generalizability, as
the experimenter is a deeply
knowledgeable chemist working
with this compound. So we're
going to be relying more on
their subject matter expertise
than on statistical principles
and tools like hypothesis tests
and so forth. The goal is
primarily predictive, but the
generalizability is kind of
problematic because the
experiment wasn't designed. Our
ability to estimate interactions
is weakened for techniques like
forward selection and impossible
via least squares analysis of
the full model. Because the
study wasn't randomized, there
could be unrecorded time and order effects. We don't have
potentially important covariate
information like temperature and
reaction time. This creates
another big question mark
regarding generalizability.
Repeatability and
reproducibility of the study is
also an unknown here as we have
no information about the
variability due to the
measurement system. Fortunately,
we do have tools like JMP's Evaluate Design, to understand the existing design, as well as Augment Design, which can greatly
enhance the generalization
performance of the analysis.
Augment can improve information
about main effects and
interactions, and a second round
of experimentation could be
randomized to also enhance
generalizability. So now I'm
going to go through a couple of
simple steps to show how to
improve the generalization
performance of our study using
design tools in JMP. Before I
do that, I want to point out
that I had to take the data and
convert it so that it was in proportions rather than percentages. Otherwise the design
tools were not really agreeing
with the data very well. So we
go into the Evaluate Design platform and then load up our Ys and our Xs. I requested the ability to handle second-order interactions. Then I got this alert
saying, hey, I can't do that
because we're not able to
estimate all the interactions
given the one factor at a time
data that we have. So I backed
up. We go to the Augment Design platform, load everything up, and select Augment. We'll choose an I-optimal design because we're really concerned with prediction performance here.
And I
set the number of runs to 148.
The custom designer requested
141 as a minimum, but I went to
148 just to kind of make sure that we've got the ability to estimate all of our interactions pretty well. After that, it
takes about 20 seconds to
construct the design. So now
that we have the design, I'm
going to show the two most
important diagnostic tools in
the augment designer for
evaluating a design. On the
left, we have the fraction of
design space plot. This is
showing that 50% of the volume
of the design has
a prediction variance that is
less than 1. So 1 would be equivalent to the residual error variance. So we're able to get better-than-measurement-error quality predictions over the majority of the design space.
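The quantity behind that plot can be written down directly: for a design with model matrix F, the relative prediction variance at a point x is f(x)' (F'F)^{-1} f(x), where f(x) is the model expansion of x, and the fraction of design space plot is just the sorted values of this quantity over many random points in the design region. A minimal Python sketch (an illustration of the definition, not JMP's implementation):

    import numpy as np

    def relative_prediction_variance(F, f_new):
        # F: (n_runs x n_terms) model matrix of the design.
        # f_new: (n_points x n_terms) model expansions of candidate points.
        # Returns f(x)' (F'F)^{-1} f(x) for each point, in units of sigma^2.
        XtX_inv = np.linalg.inv(F.T @ F)
        return np.einsum("ij,jk,ik->i", f_new, XtX_inv, f_new)

    def fraction_of_design_space(F, f_new):
        # Sorted relative prediction variances; plotting these against their
        # rank fraction gives the fraction of design space plot.
        return np.sort(relative_prediction_variance(F, f_new))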
On the right, we have the color map on correlations. This is showing
that we're able to estimate
everything pretty well. Because of the mixture
constraint, we're getting some
strong correlations between
interactions and main effects.
Overall, the effects are fairly
clean. And the interactions are
pretty well separated from one
another, and the main effects
are pretty well separated from
one another as well. After
looking at the design
diagnostics, we can make the
table. Here, I have shown the
first 13 of the augmented runs
and we see that we have more randomization. We don't have streaks where the same main effect is used over and over again.
That's evidence of better
randomization and overall the
design is going to be able to
much better estimate the main
effects and interactions having
received better, higher quality
information in this second stage
of experimentation. Moving to construct operationalization: the input variables, the Xs, are accurate representations of the mixture proportions, so that's a clear match to the quantities of interest. The responses are close surrogates for the amount of product and the amount of impurity in the batches. We're pretty good on question 7.1 there. The justifications are clear. After the study, we
can of course go prepare a
batch that is the formulation
that was recommended by the FDOE
profiler. Try it out and see if
we're getting the kind of
performance that we were looking
for. It's very clear that that
would be the way that we can
assess how well we've achieved
our study goals. So now to the last InfoQ dimension, communication. By describing the ideal curve as a target function, the Functional DOE Profiler makes the goal and the results of the analysis crystal clear. And this can be expressed
at a level that is easily
interpreted by the chemists and
managers of the R&D facility.
And as we have done our detailed
information quality assessment,
we've been upfront about the
strengths and weaknesses of the
study design and data
collection. If the results do
not generalize, we certainly
know where to look for where the
problems were. Once you become
familiar with the concepts,
there is a nice add-in written
by Ian Cox that you can use to
do a quick quantitative InfoQ
assessment. The add-in has
sliders for the upper and lower
bounds of each InfoQ dimension.
These dimensions are combined
using a desirability function
approach for an overall interval
for the InfoQ over on the left.
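For intuition about how such an overall score can be built, here is a minimal Python sketch of one common convention, rescaling a rating for each of the eight dimensions to a 0-1 desirability and combining them with a geometric mean; the add-in's exact formula may differ:

    import numpy as np

    def infoq_score(ratings, low=1.0, high=5.0):
        # ratings: the eight dimension ratings on a low-to-high scale.
        # Each rating is rescaled to a 0-1 desirability; the overall score
        # is their geometric mean, reported as a percentage.
        d = (np.asarray(ratings, dtype=float) - low) / (high - low)
        return 100.0 * float(np.prod(d) ** (1.0 / len(d)))

    # Hypothetical ratings, in the order: data resolution, data structure,
    # data integration, temporal relevance, chronology of data and goal,
    # generalizability, operationalization, communication.
    print(round(infoq_score([4, 4, 3, 3, 3, 2, 4, 5]), 1))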
Here is an assessment for the
data and analysis I covered in
this presentation. The add-in is
also a useful thinking tool that
will make you consider each of
the InfoQ dimensions. It's also a
practical way to communicate
InfoQ assessments to your
clients or to your management, as
it provides a high level view of
information quality without
using a lot of technical
concepts and jargon. The add-in
is also useful as the basis for
an InfoQ comparison. My
original hope for this
presentation was to be a little
bit more ambitious. I had hoped
to cover the analysis I had
just gone through, as well as
another, simpler one, where I skip imputing the responses and do a simple multivariate linear regression model of the response
columns. Today, I'm only able to
offer a final assessment of that
approach. As you can see,
several of the InfoQ
dimensions suffer substantially
without the more sophisticated
analysis. It is very clear that
the simple analysis leads to a much lower InfoQ score.
The upper limit of the simple
analysis isn't that much higher
than the lower limit of the more
sophisticated one. With
experience, you will gain
intuition about what a good InfoQ
score is for data science
projects in your industry and
pick up better habits as you
will no longer be blind to the
information bottlenecks in your
data collection, analysis and
model deployment. The add-in gives you information quality assessment with an easy-to-use interface. This was my first
formal information quality
assessment. Speaking for myself,
the information quality
framework has given words and
structure to a lot of things I
already knew instinctively. It has already changed how I
approach new data analysis
projects. I encourage you to go
through this process yourself on
your own data, even if that data
and analysis is already very
familiar to you. I guarantee
that you will be a wiser and
more efficient data scientist
because of it. Thank you.
Published on ‎05-20-2024 04:10 PM by | Updated on ‎07-07-2025 12:08 PM

Ron Kenett, Chairman, KPA Group and Samuel Neaman Instiute, Technion
Christopher Gotwalt, JMP Director, Statistical R&D, SAS

Data analysis – from designed experiments, generalized regression to machine learning – is being deployed at an accelerating rate. At the same time, concerns with reproducibility of findings and flawed p-value interpretation indicate that well-intentioned statistical analyses can lead to mistaken conclusions and bad decisions. For a data analysis project to properly fulfill its goals, one must assess the scope and strength of the conclusions derived from the data and tools available. This focus on statistical strategy requires a framework that isolates the components of the project: the goals, data collection procedure, data properties, the analysis provided, etc. The InfoQ Framework provides structured procedures for making this assessment. Moreover, it is easy to operationalize InfoQ in JMP. In this presentation we provide an overview of InfoQ along with use case studies, drawing from consumer research and pharmaceutical manufacturing, to illustrate how JMP can be used to make an InfoQ assessment, highlighting situations of both high and low InfoQ. We also give tips showing how JMP can be used to increase information quality by enhanced study design, without necessarily acquiring more data. This talk is aimed at statisticians, machine learning experts and data scientists whose job it is to turn numbers into information.

Auto-generated transcript...


Speaker

Transcript

Hello.
My name is Ron Kenett. This
is a joint talk with Chris
Gotwalt and we basically
have two messages
that should come out of the
talk. One is that we should
really be concerned about
information and information
quality. People tend to talk
about data and data quality, but
data is really not the the issue.
We are statisticians. We are
data scientists. We turn numbers,
data into information so that
our goal should be to make sure
that we generate high quality
information. The other message
is that JMP can help you
achieve that, and this is
actually turning out to be in
surprising ways. So by combining
the expertise of Chris and an
introduction to information
quality, we hope that these two
messages will come across
clearly. So if I had to
summarize what it is that we
want to talk about, after all,
it is all about information
quality. I gave a talk at
at the Summit in Prague four years
ago and that talk was generic.
It talked about moving the
journey from quality by design.
My journey doing information
quality. In this talk we focus
on how this can be done with
JMP. This is a more detailed
and technical talk than the
general talk I gave in Prague.
You can watch that talk.
There's a link listed here. You can
find it on the JMP community.
So we're going to talk about
information quality, which is
the potential of the data set, a
specific data set, to achieve a
particular goal, a specific goal,
with the given empirical method.
So in that definition we have
three components that are
listed. One is a certain data
set. Here is the data. The
second one is the goal,
the goal of the analysis, what
it is that we want to achieve.
And the third one is how we will
do that, which is, with what
methods we're going to generate
information, and that potential
is going to be assessed with the
utility function. And I will
begin with an introduction to
information quality, and then
Chris will will take over,
discuss the case study and and
show you how to conduct an
information quality assessment.
Eventually this should
answer the question how JMP
supports InfoQ, that
would be the the bullet points
that you can...the take away points
from the talk. The setup for
this is that we we encourage
what I called a lifecycle view
of statistics. In other words,
not just data analysis.
We should know...we should be part
of the problem elicitation
phase. Also, the goal
formulation phase, that deserves
a discussion. We should
obviously be involved in the
data collection scheme if it's
through experiments or through
surveys or through observational
data. Then we should also take
time for formulation of the
findings and not just pull out
printed reports on on
regression coefficients
estimates and and their
significance, but we should
also discuss what are the
findings? Operationalization of
findings meaning, OK, what can we
do with these findings? What are
the implications of the
findings? This should should...
needs to be communicated to the
right people in the right way,
and eventually we should do an
impact assessment to figure out,
OK, we did all this; what has
been the added value of our
work? I talked about the life
cycle of your statistics a few years
ago. This is the prerequisite,
the perspective to what
I'm going to talk about. So as I
mentioned, the information
quality is the potential of a
particular data set to achieve a
particular goal using given
empirical analysis methods. This
is identified through four
components the goal, the data,
the analysis method, and the
utility measure. So in a in a
mathematical expression, the
utility of applying f to x,
condition on the goal, is how we
identify InfoQ, information
quality. This was published in
the Royal Statistical Society
Series A in 2013 with eight
discussants, so it was amply
discussed. Some people thought
this was fantastic and some
people had a lot of critique on
that idea, so this is a wider
scope consideration of what
statistics is about. We also
wrote in 2006, we meeting myself
and Galit Shmueli, a book called
Information Quality. And in
the context of information
quality we did what is called
deconstruction. David Hand has
a paper called Deconstruction
of Statistics. This is the
deconstruction of information
quality into eight dimensions. I
will cover these eight dimensions.
That's my part in the talk and
then Chris will show how this
is implemented in a specific
case study.
Another aspect that relates to
this is another book I have.
This is recent, a year ago
titled The Real Work of Data
Science and we talk about what
is the role of the data
scientists in organizations and
in that context, we emphasized
the need for the data scientist
to be involved in the generation
of information as an...information
quality as meeting the goals of
the organization. So let me
cover the eight dimensions. That's
that's that's my intro. The
first one is data resolution. We
have a goal. OK, we we would
like to know the level of flu
because in the country or in the
area where we live, because that
will impact our decision on
whether to go to the park where
we could meet people or going to
a...to a jazz concert. And that
concert is tomorrow.
If we look up the CDC data on
the level of flu, that data is
updated weekly, so we could get
the red line in the graph you
have in front of you, so we
could get data of a few days
ago, maybe good, maybe not good
enough for our gold. Google Flu,
which is based on searches
related to flu, is updated
momentarily, so it's updated
online, it will probably give
us better information. So for
our goal, the blue line, the
Google trend...the Google Flu
Trends indicator, is probably
more appropriate. The second
dimension is data structure.
To meet our goal, we're going to
look at data. We should...we
should identify the data sources
and the structure in these data
sources. So some data could be
text, some could be video, some
could be, for example, the
network of recommendations. This
is an Amazon picture on how if
you look at the book, you're
going to get some
other books recommended. And if
you go to these other books,
you're going to have more data
recommended. So the data
structure can come in all sorts
of shapes and forms and this can
be text. This can be functional
data. This can be images. We are
not confined now to what we used
to call data, which is what you
find in an Excel spreadsheet.
The data could be corrupted,
could have missing values, could
have unusual patterns which
which would be
something to look into. Some
patterns, where things are
repeated. Maybe some of the data
is is just copy and paste and we
would like to be warned about
such options. The third
dimension is data integration.
When we consider the data from
these different sources, we're
going to integrate them so we
can do some analysis linkage
through an ID. For example, we
would do that, but in doing that,
we might create some issues, for
example, in in disclosing data
that normally should be
anonymized. Data
integration, yeah, that will
allow us to do fantastic things,
but if the data is perceived to
have some privacy exposure
issues, then maybe the quality
of the information from the
analysis that we're going to do
is going is going to be
affected. So data integration
should be looked into very, very
carefully. This is what people
likely used to call ETL
extract, transform and load. We
now have much better methods for
doing that. The join option, for
example, in JMP will offer
options for for doing that.
Temporal relevance. OK, that's
pretty clear. We have data. It is
stamped somehow. If we're going
to do the analysis later, later
after the data collection, and
if the deployement that we
consider is even later, then the
data might not be temporally
relevant. In a common
situation, if we want to compare
what is going on now, we would
like to be able to make this
comparison to recent data or
data before the pandemic
started, but not 10 years
earlier. The official statistics
on health records used to be two
or three years behind in terms
of timing, which made it very
difficult the use of official
statistics in assessing
what is going on with the
pandemic. Chronology of data and
goal is related to the decision
that we make as a result of our
goal. So if, for example, our
goal is to forecast air quality,
we're going to use some
predictive models on the Air
Quality Index reported on a
daily basis. This gives us a one
to six scale from hazardous to
good. There are some values
which are representing levels of
health concern. Zero-50 is good;
300-500 is hazardous and the
chronology of data and goal
means that we should be able to
make a forecast on a daily
basis. So the methods we use
should be updated on a daily
basis. If, on the other hand,
our goal is to figure out how is
this AQI index computed, then we
are not really bound by the the
the timeliness of the analysis.
You know, we could take our
time. There's no urgency in
getting the analysis done on a
daily basis. Generalizability,
the sixth dimension, is about
taking our findings and
considering where this could
apply in more general terms,
other populations, all
situations. This can be done
intuitively. Marketing managers who
have seen a study on the on the
market, let's call it A, might
already understand what are the
implications to Market B
without data. People who are
physicists will be able to
make predictions based on
mechanics on first principles
without without data.
So some of the generalizability
is done with data. This is the
basis of statistical
generalization, where we go from
the sample to the population.
Statistical inferences is about
generalization. We generalize
from the sample to the
population. And some can be
domain based, in other words,
using expert knowledge, domain
expertise, not necessarily with
data. We have to recognize that
generalizability is not just
done with statistics.
The seventh dimension is
construct operationalization,
which is really about what it is
that we measure. We want to
assess behavior, emotions, what
it is that we can measure, that
will give us data that reflects
behavior or emotions.
The example I give here
typically is pain.
We know what is pain. How do
we measure that? If you go to a
hospital and you ask the nurse,
how do you assess pain, they
will tell you, we have a scale,
1 to 10. It's very
qualitative, not very
scientific, I would say. If we
want to measure the level of
alcohol in drivers on the...on
the road, it will be difficult to
measure. So we might measure
speed as a surrogate measure.
Another part of
operationalization is the other
end of the story. In other
words, the first part, the
construct is what we measure,
which reflects our goal. The the
end...the end result here is that
we have findings and we want to
do something with them. We want
to operationalize our finding.
This is what the action
operationalization is about.
It's about what you do with the
findings and then being
presented here on a podium. We
used to ask three questions.
These are very important
questions to ask. Once you have
done some analysis, you have
someone in front of you who
says, oh, thank you very much,
you're done...you, the statistician
or the data scientist. So this
this takes you one extra step,
getting you to ask your customer
these simple questions What do
you want to accomplish? How will
you do that and how will you
know if you have accomplished
that? We we can help answer, or
at least support, some of these
questions we've answered.
The eighth dimension is
communication. I'm giving you an
example from a very famous old
map from the 19th century, which
is showing the march of the
Napoleon Army from France to
Moscow to Russia. You see the
numbers are the the width of
the path indicates the size
of the army, and then on on in
black you see what happened to
them on their on their way back.
So basically this was a
disastrous march. We we we we
can relate this old map to
existing maps, and there is a
JMP add-in, which you can
find on the JMP Community, to to
show you with bubble plots,
dynamic bubble plots what this
looks like. So I I've covered
very quickly the eight information
quality dimensions. My last
slide is that what I've talked
about from a historical
perspective, really put some
proportions to what I'm saying.
I think we are really in the era
of information quality. We used
to be concerned with product
quality in the 18th century, the
17th century. We then moved to
process quality and service
quality. This is a short memo
on proposing a control chart,
1924, I think.
Then we move to management
quality. This is the Juran
trilogy of design, improvement
and control. Six Sigma (define,
measure, analyze, improve)
control process is the
improvement process of Juran,
and Juran was the grand father
of Six Sigma in that sense.
Then in the '80s, Taguchi came
in. He talks about robust
design. How can we handle
variability in inputs by proper
design decisions? And now we are
in the Age of information
quality. We have sensors. We
have flexible systems. We are
depending on AI and machine
learning and data mining and we
are gathering big big big
numbers, but which we call big
data. The interest in information
quality should be a prime prime
interest. I'm going to try and
convince you of, with the help
of Chris, that.
We are here and JMP can
help us achieve that in in a
really unusual way.
What you will see at the end of
the case study that Chris will
show is also how to do it an
information quality assessment and
on a specific study, basically
generate an information quality
score. So if we go top down, I
can tell you this study, this
work, this analysis is maybe 80% or
maybe 30% or maybe 95%.
And through the example you will
see how to do that. There is a
JMP add-in to provide this
assessment. It's it's actually
quite quite easy. There's
nothing really sophisticated
about that. So I'm done and
Chris, after you. Thanks, Ron. So
now I'm going to go through the
analysis of a data set in a way
that explicitly calls out the
various aspects of information
quality and show how JMP can be
used to assess an improvement
for InfoQ. So first off, I'm
going to go through the InfoQ
components. The first InfoQ
component is the goal, so in
this case the problem statement
was that a chemical company
wanted a formulation that
maximized product yield while
minimizing a nuisance impurity
that resulted from the reaction.
So that was the high level goal.
In statistical terms, we wanted
to find a model that accurately
predicted a response on a data
set so that we could find a
combination of ingredients and
processing steps that would lead
to a better product.
The data are set up in 100
experimental formulations with
one primary ingredient, X1,and
10 additives. There's also a
processing factor in 13
responses. The data are
completely fabricated but were
simulated to illustrate the same
strengths and weaknesses of the
original data. The data
formulation was made was also
recorded. We will be looking at
this data closely, so I want to
elaborate beyond pointing out
that they were collected in an
ad hoc way, changing one or two
additives at a time rather than
as a designed or randomized
experiment. There's a lot of
ways to analyze this data, the
most typical being least
squares modeling with forward
selection on selected responses.
That was my original intention
for this talk, but when I showed
the data to Ron, he immediately
recognized the response columns
as time series from analytical
chemistry. Even though the data
were simulated, he could see the
structure. He could see things
in the data that I didn't see
and read into it wasn't. I found
this to be strikingly
impressive. It's beyond the
scope of this talk, but there is
an even better approach based on
ensemble modeling using
fractionally weighted
bootstrapping. Phil Ramsey,
Wayne Levin and I have another
talk about this methodology at
the European Discovery
Conference this year. The
approach is promising because it
can fit models to data with
more active interactions than
there are runs. The fourth and final
component of information quality
is utility, which is how we measure how well we have achieved our goals. There's a domain aspect: in this case, we want a formulation that leads to maximized yield and minimized waste in post-processing of the material. The statistical analysis utility refers to the model that we fit; what we're going for there is the least squares accuracy of our model, in terms of how well we're able to predict what would result from candidate combinations of the mixture factors. Now I'm going
to go through a set of questions
that make up a detailed InfoQ
assessment, organized into the
eight dimensions of information
quality. I want to point out
that not all questions will be
equally relevant to different
data science and statistical
projects, and that this is not
intended to be rigid dogma but
rather a set of things that are
a good idea to ask oneself.
These questions represent a kind
of data analytic wisdom that
looks more broadly than just the
application of a particular
statistical technology. A copy
of a spreadsheet with these
questions along with pointers to
JMP features that are the most
useful for answering a
particular one will be uploaded
to the JMP Community along
with this talk for you to use. As
I proceed through the questions,
I'll be demoing an analysis of
the data in JMP. So Question 1 is: is the data scale used aligned with the stated goal? The Xs that we have consist of a single categorical variable, processing, and 11 continuous inputs. These are measured as percentages and are recorded to the nearest half a percent. We
don't have the total amounts of
the ingredients, only the
percentages. The totals are
information that was either lost
or never recorded. There are
other potentially important
pieces of information that are
missing here. The time between
formulating the batches and
taking the measurements is gone
and there could have been other
covariate level information that
is missing here that would have
described the conditions under
which the reaction occurred.
Without more information than I
have, I cannot say how important
this kind of covariate information
would have been. We do have
information on the day of the
batch, so that could be used as
a surrogate possibly. Overall we
have what are, hopefully, the most
important inputs, as well as
measurements of the responses we
wish to optimize. We could have
had more information, but this looks promising enough to try an analysis with. The second question related to data resolution is: how reliable and precise are the measuring devices and data sources? The fact
is, we don't have a lot of
specific information here. A statistician internal to the company would have had more information. In this case, we have no choice but to trust that the chemists formulated and recorded the mixtures well. The third question related to data resolution is: is the data analysis suitable for the data
aggregation level? And the
answer here is yes, assuming
that their measurement system is
accurate and that the data are
clean enough. What we're going
to end up doing actually is
we're going to use the
Functional Data Explorer to
extract functional principal components, which are a data-derived kind of aggregation. And then we're
going to be modeling those
functional principal components
using the input variables. So
now we move on to the data
structure dimension and the
first question we ask is, is the
data used aligned with the
stated goal? And I think the
answer is a clear yes here. We're
trying to maximize
yield. We've got measurements for
that, and the inputs are
recorded as Xs. The second data
structure question is where
things really start to get
interesting for me. So this is: are the data integrity issues (outliers, missing values, data corruption) described and handled appropriately? From here we can use JMP to understand where the outliers are, figure out strategies for what to do about missing values, observe their patterns, and so on. So this is where things are going to get a little
bit more interesting. The first
thing we're going to do is determine if there are
any outliers in the data that we
need to be concerned about. So
to do that, we're going to go
into the explore outliers
platform off of the screening
menu. We're going to load up the
response variables, and because
this is a multivariate setting,
we're going to use a new feature
in JMP Pro 16 called Robust
PCA Outliers. So we see where
the large residuals are in those
kind of Pareto type plots.
There's a snapshot showing where
there's some potentially
unusually large observations. I don't think this looks too unusual or worrisome. We can save the large outliers to a data table and then look at them in the distribution platform, and what we see looks kind of like a normal distribution with the middle taken out. So I think these data are coming from the same population, and there's nothing really to worry about here, outliers-wise.
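To make the idea concrete outside of JMP, here is a minimal Python sketch of multivariate outlier screening based on the residuals from a low-rank PCA fit. It is not the Robust PCA Outliers algorithm in JMP Pro 16; the data, the number of components and the cutoff rule are all made up for illustration.

```python
# Minimal sketch: flag rows whose residuals from a low-rank PCA fit are unusually large.
# This is a simple stand-in, not JMP Pro's Robust PCA Outliers method.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
Y = rng.normal(size=(100, 13))          # stand-in for the 13 response columns
Y[5, 3] += 8.0                           # plant one artificial outlying cell

pca = PCA(n_components=4).fit(Y)
residuals = Y - pca.inverse_transform(pca.transform(Y))  # low-rank reconstruction error
row_score = np.sqrt((residuals ** 2).sum(axis=1))        # per-row residual norm

mad = np.median(np.abs(row_score - np.median(row_score)))
cutoff = np.median(row_score) + 3 * mad                  # a crude robust cutoff
print("rows flagged as potential outliers:", np.where(row_score > cutoff)[0])
```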
So once we've taken care of the outlier situation, we go in and explore missing values. What we're going to do first is load the Ys into the platform, and then use the missing value snapshot to see what patterns there are amongst our missing values. It looks like the
missing values tend to occur in horizontal clusters, and the same values are also missing across rows; you can see that with the black splotches here. Then we'll apply Automated Data Imputation (ADI), which saves formula columns that impute the missing values using a regression-type
algorithm that was developed by
a PhD student of mine named Milo
Page at NC State. So we can play around a little bit and get a sense of how the ADI algorithm is working. It has created these formula columns that peel off elements of the ADI impute column, which is a vector formula column, and the scoring impute function calculates the expected value of the missing cells given the non-missing cells whenever it encounters a missing value; otherwise it just carries through the non-missing value. So you can see 207 in Y6 there. It's initially 207, but then I change it to missing and it's now imputed to be 234.
I'll do this a couple of times so you can see how it's working. Here I'll put in a big value for Y7, and that's now been replaced. And if we go down and add a row, then all values are missing initially and the column means are used for the imputations. If I were to add values for some of those missing cells, it would start computing the conditional expectation of the still-missing cells using the information in the non-missing ones.
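Here is a minimal Python sketch of the idea behind that scoring step: impute each missing cell with its expected value given the non-missing cells, here under a simple multivariate normal model. The ADI algorithm itself is a more sophisticated regression-based method, and the numbers below are fabricated.

```python
# Minimal sketch of conditional-mean imputation: E[missing | observed] under N(mu, Sigma).
import numpy as np

def conditional_mean_impute(row, mu, Sigma):
    """Fill NaNs in `row` with their conditional mean given the observed entries."""
    miss = np.isnan(row)
    if not miss.any():
        return row
    obs = ~miss
    S_mo = Sigma[np.ix_(miss, obs)]
    S_oo = Sigma[np.ix_(obs, obs)]
    filled = row.copy()
    filled[miss] = mu[miss] + S_mo @ np.linalg.solve(S_oo, row[obs] - mu[obs])
    return filled

rng = np.random.default_rng(1)
Y = rng.multivariate_normal(mean=[10, 20, 30],
                            cov=[[4, 2, 1], [2, 5, 2], [1, 2, 6]], size=200)
mu, Sigma = Y.mean(axis=0), np.cov(Y, rowvar=False)   # parameters estimated from the data

row = Y[0].copy()
row[1] = np.nan                                       # pretend the second response is missing
print(conditional_mean_impute(row, mu, Sigma))
```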
So our next question on data structure is: are the analysis methods
suitable for the data structure?
So we've got 11 mixture inputs
and a processing variable that's
categorical. Those are going
to be inputs into a least
squares type model. We have
13 continuous responses and
we can model them individually using least squares, or we can model functional principal components. Now, there are problems. The input variables have not been randomized at all. It's very clear that they would muck around with one or more of the compounds and then move on to another one, so the order in which the input variables were varied was kind of haphazard. It's a clear lack of randomization, and that's going to negatively impact the generalizability and strength of our conclusions.
Data integration is the third
InfoQ dimension. These data
are manually entered lab notes
consisting mostly of mixture
percentages and equipment
readouts. We can only assume
that the data were entered
correctly and that the Xs are
aligned properly with responses.
If that isn't the case, then the model will have serious bias problems and problems with generalizability. Integration is more of an issue with observational data science problems and machine learning exercises than with lab
experiments like this. Although it doesn't apply here, I'll point out that privacy and confidentiality concerns can be identified by modeling the sensitive part of the data using the to-be-published part of the data. If the resulting model is predictive, then one should be concerned that privacy requirements are not being met.
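Here is a minimal Python sketch of that disclosure check: try to predict a sensitive column from the columns you intend to publish, and treat a high cross-validated R-squared as a warning sign. The data, model and threshold are all hypothetical.

```python
# Minimal sketch of a disclosure check: can the to-be-published columns predict the
# sensitive one? A high cross-validated R^2 suggests a privacy risk.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(4)
public = rng.normal(size=(500, 6))                      # columns planned for release
sensitive = public[:, 0] * 2 + public[:, 3] + rng.normal(scale=0.1, size=500)  # leaks badly

r2 = cross_val_score(RandomForestRegressor(n_estimators=200, random_state=0),
                     public, sensitive, cv=5, scoring="r2").mean()
print(f"cross-validated R^2 = {r2:.2f}",
      "-> potential privacy concern" if r2 > 0.5 else "-> probably safe to release")
```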
Temporal relevance refers to the
operational time sequencing of
data collection, analysis and
deployment, and whether gaps between those stages lead to a decrease in the usefulness of the information in the study. In this case, we can only hope that the material supplies are reasonably consistent and that the test equipment is reasonably accurate, which is an unverifiable assumption at this point. The time resolution we have for the data collection is at the day level, which means there isn't much of a way to verify whether there is time variation within each day.
Chronology of data and goal is
about the availability of
relevant variables both in terms
of whether the variable is
present at all in the data or
whether the right information
will be available when the model
is deployed. For predictive
models, this relates to models
being fit to data similar to
what will be present at the time
the model will be evaluated on
new data. In this regard, our data set is certainly fine. For
establishing causality, however,
we aren't in nearly as good a
shape because the lack of
randomization implies that time
effects and factor effects may
be confounded, leading to bias
in our estimates. Endogeneity, or reverse causation, could clearly be an issue here, as variables like temperature and reaction time could be impacting the responses but have been left unrecorded. Overall,
there is a lot we don't know
about this dimension in an
information quality sense.
The rest of the InfoQ
assessment is going to be
dependent upon the type of
analysis that we do. So at this
point I'm going to go ahead and
conduct an analysis of this data
using the Functional Data
Explorer platform in JMP Pro
that allows me to model across
all the columns simultaneously
in a way that's based on
functional principal components,
which contain the maximum amount
of information across all those
columns as represented in the
most efficient format possible.
I'm going to be working on the
imputed versions of the columns
that I calculated earlier in the
presentation. And I'm going to
point out that I'm going to be working to find combinations of the mixture factors that achieve, as closely as possible in a least squares sense, an ideal curve created by the practitioner that maximizes the
amount of potential product that
could be in a batch while
minimizing the amount of the
impurities that they
realistically thought a batch
could contain. So I begin the analysis by going to the Analyze menu and bringing up the Functional Data Explorer, with rows as functions. I'm going to load up my imputed responses, and then put in my formulation components and my processing column as supplementary variables. The ID function is the batch ID. Once I'm in, I can see the functions, both overlaid all together and individually.
Then I can load up the target
function, which is the ideal.
And that will change the
analysis that results once I
start going into the modeling
steps. These are pretty simple functions, so I'm just going to model them with B-splines.
And then I'm going to go into my
functional DOE analysis.
This is going to fit the model
that connects the inputs into
the functional principal
components and then connect all
the way through the eigenfunctions, so that we're able to recover how the overall functions change as we vary the mixture factors. The
functional principal component
analysis has indicated that
there are four dimensions of
variation in these response
functions. To understand what
they mean, let's go ahead and
explore with the FPC profiler.
So watch this pane right here as
I adjust FPC 1 and we can see
that this FPC is associated with
peak height. FPC 2 looks like it's related to peak narrowness; it's almost like a resolution principal component. The third one is related to a kind of knee on the left of the dominant peak. And FPC 4 looks like it's primarily related to the impurity. So that's the underlying meaning of these four functional principal components.
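For readers who want to see the mechanics outside of JMP, here is a minimal Python sketch of functional principal components on a set of curves: smooth each batch's readings with a spline and run PCA on the smoothed curves. It is far simpler than the Functional Data Explorer machinery, and the curves below are fabricated.

```python
# Minimal sketch of FPCA on a set of response curves: spline-smooth each batch's
# 13 readings onto a common grid, then use PCA so the eigenfunctions capture the
# dominant shapes of variation.
import numpy as np
from scipy.interpolate import UnivariateSpline
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
t = np.arange(13)                                   # measurement index for the 13 responses
grid = np.linspace(0, 12, 50)                       # finer grid for the smoothed curves

# Fabricated curves: a dominant peak whose height and width vary by batch, plus noise.
heights, widths = rng.uniform(5, 15, 100), rng.uniform(1.5, 3.0, 100)
Y = heights[:, None] * np.exp(-((t - 6) ** 2) / (2 * widths[:, None] ** 2))
Y += rng.normal(scale=0.3, size=Y.shape)

smoothed = np.array([UnivariateSpline(t, y, s=1.0)(grid) for y in Y])

fpca = PCA(n_components=4).fit(smoothed)
scores = fpca.transform(smoothed)                   # one set of FPC scores per batch
print("variance explained by the 4 FPCs:", np.round(fpca.explained_variance_ratio_, 3))
```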
So we've characterized our goal
as maximizing the product and
minimizing the impurity, and
we've communicated that into the
analysis through this ideal or
golden curve that we supplied at
the beginning of the FDE
exercise we're doing. To get as
close as possible to that ideal
curve, we turn on desirability
functions. And then we can go
out and maximize desirability.
And we find that the optimal combination of inputs is about 4.5% of Ingredient 4, 2% of Ingredient 6, 2.2% of Ingredient 8, and 1.24% of Ingredient 9, using processing method two. Let's review how
we've gotten here. We first imputed the missing response columns. Then we found B-spline
models that fit those functions
well in the FDE platform. A
functional principal components
analysis determined that there
were four eigenfunctions
characterizing the variation in
this data. These four
eigenfunctions were determined
via the FPC profiler to each
have a reasonable subject
matter meaning. The functional
DOE analysis consisted of
applying pruned forward
selection to each of the
individual FPC scores using the
DOE factors as input variables.
And we see here that this has found combinations of
interactions and main effects
that were most predictive for
each of the functional principal
component scores individually.
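Here is a minimal Python sketch of the "forward selection on each FPC score" step, using only main effects of hypothetical mixture inputs and a fixed number of terms as the stopping rule; JMP's pruned forward selection also considers interactions and uses an information criterion.

```python
# Minimal sketch of forward selection of inputs for one FPC score.
import numpy as np
from sklearn.linear_model import LinearRegression

def forward_select(X, y, max_terms=4):
    remaining, chosen = list(range(X.shape[1])), []
    for _ in range(max_terms):
        # Add the column that reduces the residual sum of squares the most.
        best = min(remaining,
                   key=lambda j: np.sum((y - LinearRegression()
                                         .fit(X[:, chosen + [j]], y)
                                         .predict(X[:, chosen + [j]])) ** 2))
        chosen.append(best)
        remaining.remove(best)
    return chosen

rng = np.random.default_rng(3)
X = rng.dirichlet(np.ones(11), size=100)            # 11 hypothetical mixture proportions
fpc1 = 30 * X[:, 3] - 12 * X[:, 7] + rng.normal(scale=0.5, size=100)  # fabricated FPC 1 score
print("inputs selected for FPC 1:", forward_select(X, fpc1))
```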
The Functional DOE Profiler
has elegantly captured all
aspects into one representation
that allows us to find the formulation and processing step that is predicted to have desirable
properties as measured by high
yield and low impurity.
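To illustrate the optimization step outside of JMP, here is a minimal Python sketch that treats the fitted functional model as a black box and searches for mixture proportions whose predicted curve is as close as possible, in a least squares sense, to a target curve. The predict_curve function and the target below are fabricated stand-ins, and JMP does this through desirability functions in the profiler rather than an explicit call to an optimizer.

```python
# Minimal sketch: find mixture proportions whose predicted curve is closest to a target.
import numpy as np
from scipy.optimize import minimize

grid = np.linspace(0, 12, 50)
target = 12 * np.exp(-((grid - 6) ** 2) / 4)        # tall, narrow "ideal" peak, no impurity bump

def predict_curve(x):
    # Fabricated response surface: peak height and width depend on two of the proportions.
    height = 5 + 40 * x[3] - 15 * x[7]
    width = 1.5 + 4 * x[7]
    return height * np.exp(-((grid - 6) ** 2) / (2 * width ** 2))

def sse_to_target(x):
    return np.sum((predict_curve(x) - target) ** 2)

x0 = np.full(11, 1 / 11)                             # start at an equal-parts blend
res = minimize(sse_to_target, x0, method="SLSQP",
               bounds=[(0, 1)] * 11,
               constraints=[{"type": "eq", "fun": lambda x: x.sum() - 1}])
print("suggested proportions:", np.round(res.x, 3))
```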
So now we can do an InfoQ assessment of the generalizability of the data and the analysis. In this case,
we're more interested in
scientific generalizability, as
the experimenter is a deeply
knowledgeable chemist working
with this compound. So we're
going to be relying more on
their subject matter expertise than on statistical principles
and tools like hypothesis tests
and so forth. The goal is
primarily predictive, but the
generalizability is kind of
problematic because the
experiment wasn't designed. Our
ability to estimate interactions
is weakened for techniques like
forward selection and impossible
via least squares analysis of
the full model. Because the
study wasn't randomized, there could be unrecorded time and order effects. We don't have
potentially important covariate
information like temperature and
reaction time. This creates
another big question mark
regarding generalizability.
Repeatability and
reproducibility of the study is
also an unknown here as we have
no information about the
variability due to the
measurement system. Fortunately,
we do have tools like JMP's Evaluate Design to understand the existing design, as well as Augment Design, which can greatly enhance the generalization performance of the analysis.
Augment can improve information
about main effects and
interactions, and a second round
of experimentation could be
randomized to also enhance
generalizability. So now I'm
going to go through a couple of
simple steps to show how to
improve the generalization
performance of our study using
design tools in JMP. Before I
do that, I want to point out
that I had to take the data and
convert it so that it was
proportions rather than percentages; otherwise the design tools were not really agreeing with the data very well. So we go into the Evaluate Design platform and load up our Ys and our Xs. I requested the ability to handle second-order interactions.
Then I got an alert saying, essentially, that it can't do that, because we're not able to estimate all of the interactions given the one-factor-at-a-time data that we have.
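Here is a minimal Python sketch of why that alert appears: with one-factor-at-a-time runs, the model matrix for a model with all two-factor interactions is rank deficient, so some interactions simply cannot be estimated. The OFAT design below is fabricated.

```python
# Minimal sketch: the second-order model matrix of a one-factor-at-a-time design is
# rank deficient, so not all interactions are estimable.
import numpy as np
from itertools import combinations

def second_order_matrix(X):
    cols = [np.ones(len(X))] + [X[:, j] for j in range(X.shape[1])]
    cols += [X[:, i] * X[:, j] for i, j in combinations(range(X.shape[1]), 2)]
    return np.column_stack(cols)

# OFAT runs: start at a baseline blend and move each of 5 factors separately.
baseline = np.full(5, 0.2)
steps = [-0.1, -0.05, 0.05, 0.1]
ofat = np.vstack([baseline] + [baseline + s * np.eye(5)[j]
                               for j in range(5) for s in steps])

M = second_order_matrix(ofat)
print(f"{M.shape[1]} model terms requested, but the model matrix rank is only "
      f"{np.linalg.matrix_rank(M)} from {len(ofat)} one-factor-at-a-time runs")
```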
So I backed up. We go to the Augment Design platform, load everything up, and set augment. We'll choose an I-
optimal design because we're
really concerned with prediction performance here. And I set the number of runs to 148. The custom designer requested 141 as a minimum, but I went to 148 just to make sure that we've got the ability to estimate all of our interactions
pretty well. After that, it
takes about 20 seconds to
construct the design. So now
that we have the design, I'm
going to show the two most
important diagnostic tools in
the augment designer for
evaluating a design. On the
left, we have the fraction of
design space plot. This is
showing that 50% of the volume of the design space has a relative prediction variance less than 1, where 1 would be equivalent to the residual error. So we're able to get better-than-measurement-error quality predictions over the majority of the design space. On the right we have the color map on correlations. This is showing that we're able to estimate everything pretty well. Because of the mixture constraint, we're getting some strong correlations between interactions and main effects.
Overall, the effects are fairly clean: the interactions are pretty well separated from one another, and the main effects are pretty well separated from one another as well.
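Here is a minimal Python sketch of the quantity that the fraction of design space plot summarizes: the relative prediction variance x'(X'X)^{-1}x of a fitted model over random points in the factor space. The design and model below are a generic main-effects example, not the augmented mixture design from the talk.

```python
# Minimal sketch of a fraction-of-design-space summary: quantiles of the relative
# prediction variance over random points in the factor space.
import numpy as np

rng = np.random.default_rng(5)

def model_matrix(points):
    # Intercept plus main effects; a second-order model would add interactions and squares.
    return np.column_stack([np.ones(len(points)), points])

design = rng.uniform(-1, 1, size=(20, 3))            # stand-in 20-run design in 3 factors
XtX_inv = np.linalg.inv(model_matrix(design).T @ model_matrix(design))

space = rng.uniform(-1, 1, size=(5000, 3))           # random points filling the design space
F = model_matrix(space)
rel_var = np.einsum("ij,jk,ik->i", F, XtX_inv, F)    # relative prediction variance per point

for frac in (0.5, 0.9):
    print(f"{int(frac * 100)}% of the design space has relative prediction variance "
          f"<= {np.quantile(rel_var, frac):.3f}")
```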
After looking at the design diagnostics, we can make the table. Here, I have shown the
first 13 of the augmented runs
and we see that we have more randomization; we don't have streaks where the same main effect is used over and over again. That's evidence of better randomization, and overall the design is going to be able to estimate the main effects and interactions much better, having received higher quality information in this second stage of experimentation. So the input variables, the Xs, are accurate representations of the mixture proportions, so that's a clear match to what we're interested in. The responses are close surrogates for the amount of product and the amount of impurity in the batches. We're pretty good on question 7.1 there. The justifications
are clear. After the study, we
can of course go prepare a batch with the formulation recommended by the FDOE Profiler, try it out, and see if we're getting the kind of performance that we were looking for. It's very clear that that
would be the way that we can
assess how well we've achieved
our study goals. So now we come to the last InfoQ dimension, communication. By describing the ideal curve as a target function, the Functional DOE Profiler makes the goal and the results of the analysis crystal clear, and this can be expressed
at a level that is easily
interpreted by the chemists and
managers of the R&D facility.
And as we have done our detailed
information quality assessment,
we've been upfront about the
strengths and weaknesses of the
study design and data
collection. If the results do not generalize, we certainly know where to look for the problems. Once you become
familiar with the concepts,
there is a nice add-in written
by Ian Cox that you can use to
do a quick quantitative InfoQ
assessment. The add-in has
sliders for the upper and lower
bounds of each InfoQ dimension.
These dimensions are combined using a desirability function approach into an overall interval for InfoQ, shown over on the left.
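As an illustration of how dimension ratings might be rolled up into a score interval, here is a minimal Python sketch that combines the eight InfoQ dimensions with a geometric mean of desirabilities. The exact formula used by the add-in may differ, and the ratings below are made up.

```python
# Minimal sketch: combine eight InfoQ dimension ratings into an overall score interval
# using a geometric mean of desirabilities.
import numpy as np

dimensions = ["Data resolution", "Data structure", "Data integration",
              "Temporal relevance", "Chronology of data and goal",
              "Generalizability", "Operationalization", "Communication"]

lower = np.array([0.6, 0.7, 0.5, 0.5, 0.4, 0.3, 0.7, 0.8])   # slider lower bounds (0 to 1)
upper = np.array([0.8, 0.9, 0.7, 0.7, 0.6, 0.5, 0.9, 1.0])   # slider upper bounds (0 to 1)

def infoq_score(d):
    return np.prod(d) ** (1 / len(d))                          # geometric mean of desirabilities

print(f"InfoQ score interval: {infoq_score(lower):.0%} to {infoq_score(upper):.0%}")
```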
Here is an assessment for the
data and analysis I covered in
this presentation. The add-in is
also a useful thinking tool that
will make you consider each of
the InfoQ dimensions. It's also a
practical way to communicate
InfoQ assessments to your
clients or to your management, as
it provides a high level view of
information quality without
using a lot of technical
concepts and jargon. The add-in
is also useful as the basis for
an InfoQ comparison. My
original hope for this
presentation was to be a little
bit more ambitious. I had hoped
to cover the analysis I had
just gone through, as well as
another, simpler one, where I skip imputing the responses and fit a simple multivariate linear regression model to the response columns. Today, I'm only able to
offer a final assessment of that
approach. As you can see,
several of the InfoQ
dimensions suffer substantially
without the more sophisticated
analysis. It is very clear that the simple analysis leads to a much lower InfoQ score.
The upper limit of the simple
analysis isn't that much higher
than the lower limit of the more
sophisticated one. With
experience, you will gain
intuition about what a good InfoQ
score is for data science
projects in your industry and
pick up better habits, as you will no longer be blind to the information bottlenecks in your data collection, analysis and model deployment. The add-in gives you information quality assessment with an easy-to-use interface. This was my first
formal information quality
assessment. Speaking for myself,
the information quality
framework has given words and
structure to a lot of things I
already knew instinctively. It has already changed how I approach new data analysis
projects. I encourage you to go
through this process yourself on
your own data, even if that data
and analysis is already very
familiar to you. I guarantee
that you will be a wiser and
more efficient data scientist
because of it. Thank you.

