Hi, I'm Ron Kenett.
This is a joint talk with Chris Gotwalt.
The talk is on Functional Data Analysis and Nonlinear Regression Models.
And in order to examine the options
and what we get out of this type of analysis,
we will take an information quality perspective.
In a sense, this is a follow up to a talk we gave last year
at the same Discovery Summit.
So I will start with simple examples
to introduce FDA and Nonlinear Regression.
And then Chris will cover a substantially more complex example of optimization,
which includes mixture experiments designed to match a reference profile.
So the story starts with data on tablets that are dissolved
and measurements are done at different time points,
every five minutes at five, ten, 15, and 20 minutes,
then ten minutes later at 30 minutes,
and then 15 minutes later at 45 minutes.
We have 12 tablets that are our product
and 12 tablets that are the reference.
Our goal is to have a product that matches the reference.
And in this type of data, we have a profile,
and we consider two options,
FDA and NLR.
In Chris' example, we'll talk about something called the F2,
which is a third option for analyzing this type of data.
So here's what it looks like with the graph builder.
On the left, we have the reference profiles.
On the right, we have the test tablets.
We can see this is an example from my book on modern industrial statistics,
the book with Shelley Zacks, which is now in its third edition.
So on the left you can see there is a tablet that seems
a bit different.
It's labeled T5R.
And if we run a functional data analysis of this data,
T5R does look different.
We see that the growth part is different.
It has slow but consistent growth.
It does not have the shape that we see in the other dissolution curves.
This was done with a quadratic B-spline with one knot,
and the quadratic was, in this case,
fitting the data better than the cubic.
This is a bit of an unusual situation.
So because of the shapes, the quadratic B-spline was a better fit.
If we look at T1R,
the first tablet, which has yet another shape,
it shoots up and then it stays.
So basically, the tablet has dissolved.
Obviously, beyond a high level of dissolution,
there's not much left to dissolve.
So T1R and T5R both seem different.
T5R stands out more than T1R.
So, yes, T5R does stand out in the cluster analysis
on the functional principal components.
Here the scatter plot of the first two functional principal components
confirms what we observe visually.
And T1R, which is next to T2R,
is in a different cluster.
We can proceed with a nonlinear regression approach.
Here we are fitting a three-parameter Gompertz model,
with parameters for the asymptote, the growth rate, and the inflection point.
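As a reminder of the form being fit, one common three-parameter Gompertz parameterization, which I assume matches the one used here, is

$$f(t) = a \, \exp\!\left(-e^{-b\,(t - c)}\right)$$

where $a$ is the asymptote, $b$ the growth rate, and $c$ the inflection point.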
This is the model, and when we fit the profiles,
we again see that T5R stands out.
So we have the same qualitative impression that we had with FDA.
Now we have these three parameters listed
and we can run a profiler on the model because we now have a model.
This is where T1R stands.
So by running the profiler on the different tablets,
we can also see how similar or different they look.
This is the table that maps out the parameter estimates, including the asymptotes.
T1R has a growth rate of 0.21.
T5R, the tablet that stood out,
has a growth rate of 0.075,
a very slow growth rate,
consistent but slow.
Its inflection point is 11.5, way to the right.
So we can see the difference through these parameter values.
We can also pick out two tablets whose growth rates stand out:
T2R at 1.77,
and T8R, with almost no growth.
We'll get back to T2R and T8R in a minute.
If we take the principal components of this three-parameter space,
we treat the parameters as if they were the measurements
and run a multivariate control chart.
We can see T1R, the first one,
and we can see T5R, the blue one, the fifth one.
These we already saw.
They are within the control limits of the T-square
multivariate statistical distance control chart.
And T2R and T8R, which I highlighted before, now stand out,
and we can see qualitatively why.
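As a rough illustration of the statistic behind that chart (a minimal sketch, not the JMP implementation; the params array and its contents are assumptions), a T-square value can be computed from the fitted parameters like this:

```python
import numpy as np

# Minimal sketch: Hotelling T^2 from the fitted Gompertz parameters
# (asymptote, growth rate, inflection point), one row per tablet.
# 'params' is an assumed (n_tablets x 3) array of fitted values.
def hotelling_t2(params):
    mean = params.mean(axis=0)             # parameter means across tablets
    cov = np.cov(params, rowvar=False)     # 3x3 covariance of the parameters
    cov_inv = np.linalg.inv(cov)
    diffs = params - mean
    # squared statistical (Mahalanobis) distance of each tablet from the mean
    return np.einsum('ij,jk,ik->i', diffs, cov_inv, diffs)

# Tablets whose T^2 exceeds the control limit stand out, as T2R and T8R did here.
```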
This is the model-dependent approach in the guidance documents
that is used for modeling dissolution curves.
In running such an analysis with an information quality perspective,
the first question asks: what is the goal of the analysis?
And then we can consider the method of analysis.
Here we're using nonlinear regression and functional data analysis.
Chris will get into how this is combined
with data derived from experimental design.
We have a utility function, and the information quality
is the utility of applying a method f on data X conditioned on the goal.
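Written out, the usual definition is

$$\mathrm{InfoQ}(f, X, g) = U\big(f(X \mid g)\big)$$

where $g$ is the goal, $X$ the data, $f$ the analysis method, and $U$ the utility function.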
It is evaluated with eight dimensions.
And here Chris will again talk about data resolution and data structure.
So Chris, the floor is yours.
Thanks Ron.
Now I'm going to give an example that is a little bit more complicated
than the first one.
In Ron's example,
he was comparing the dissolution curves of test tablets
to those from a set of reference tablets.
In that situation, the expectation is that
the curves should generally be following the same path.
And he showed how to find anomalous curves
that deviate from the rest of the population.
In this second example, we also have a reference dissolution curve.
But we are analyzing data from a designed experiment
where the goal is to find a formulation in two polymer additives
and the amount of force used in the tablet production process
that leads to a close match
to the reference batch's dissolution curve.
The graph you see here shows
the data from the reference curve that we want to match.
To do this, I'm going to demonstrate three analyses of this data
that use different methods and models to find factor settings
that will best match the reference curve.
In the first analysis,
I'm going to summarize each of the DoE curves
down to a single metric called F2
that is typically used in dissolution curve analysis,
a measure of agreement with the reference batch.
There, I'll use standard DoE methods to model that F2 response
and then find the factor settings that are predicted
to best agree with the reference.
In the second analysis,
I'll use a functional DoE modeling approach
where I model the curves using B-splines,
extract functional principal component scores, and model them.
I'll load the reference batch as a target function
in the Functional Data Explorer platform
and then use the FDoE profiler
to find the closest match recommended by that model.
These first two approaches use little subject matter information
about these types of tablets.
In the third analysis,
I'll model the curves using a nonlinear model
that was known to fit this type of tablet well
and use the Curve DoE option in the Fit Curve platform
to model the relationship between the DoE factors
and the shape of the curve.
I want to credit Clay Barker for adding this capability to JMP Pro 16.
I think it has a lot of promise for modeling curves
whose general shape can be assumed to be known in advance
to come from one of the supported nonlinear models.
At the end, verification batches were made
using the recommended formulation settings for each of the three analyses,
and we compared them to a new reference batch.
What we found was that the nonlinear regression-based approach
led to the closest match to the reference.
What we see here is a scatter plot matrix of the four factors
in the designed experiment.
There was a mixture constraint between the two polymers,
as well as a constraint on the total amount of polymer
and the proportions of the individual polymers.
Here's a look at some of the raw data from the experiment.
At the top of the table, we have data from the reference that we wish to match.
There are 16 DoE formulations or batches in the experiment.
We can only see data from two of them in this picture, though.
There were six tablets per formulation.
There were four dissolution measurements per tablet.
Here we see plots of the dissolution curves for each of the 16 DoE formulations
with the dissolution curve of the reference batch here at the lower right.
Now, I'm going to do a quick preliminary information quality assessment
using the questions that you'll find in the spreadsheet
that you can download from the JMP user community page.
The first part of the assessment is related to the data resolution.
In this case, I think we're looking pretty good.
The data scale is well aligned
with the stated goal because it's a design experiment.
The measuring devices seem to be reliable and precise,
and the data analysis is definitely going to be suitable
for the data aggregation level,
and we'll be illustrating different kinds of data aggregation
as we extract features from these dissolution curves.
As far as the data structure goes, we're in pretty good shape.
The data is certainly aligned with the stated goal,
we don't have any problems with outliers or missing values,
and the analysis methods are all suitable for the data structure,
although we do see some variation
in the quality of the results depending on the type of analysis we do.
As far as data integration goes, this is a pretty simple analysis.
We have multiple responses,
and we're exploring different ways of combining them into extracted features.
So there's a common workflow to all three of the analyses I'm going to be showing.
First, we have to get the data into a form
that is analyzable by the platform that we're using.
Then there's a round of feature extraction.
Then we model those features.
That's where there's a lot of difference between the methods,
and then we use the profiler in different ways
to find a formulation that closely matches the reference.
First, I'm going to go over the F2 analysis.
F2 is a standard measure of agreement of a dissolution curve
relative to a reference dissolution curve.
In the formula,
the Rs are the means of the reference curve at each time point,
and the Ts are the means of the non-reference curve.
The convention is to say that the two curves are equivalent
when F2 is greater than or equal to 50.
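For reference, the standard similarity factor is

$$F_2 = 50 \cdot \log_{10}\!\left\{\left[1 + \frac{1}{n}\sum_{t=1}^{n}\left(R_t - T_t\right)^2\right]^{-1/2} \times 100\right\}$$

where $n$ is the number of time points.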
It's important to point out
that I'm including this F2-based analysis
not just as an example of a dissolution DoE analysis,
but more broadly as an example of how reducing a response
that is inherently a curve down to a single number
leads to a much lower quality analysis
and results at the end
than a procedure that treats curves as first-class citizens.
So now I'm going to share the F2 analysis of the dissolution DoE data.
The first thing we have to do is calculate the batch means
of the dissolution curves at the different time points.
Then we create a formula column
that calculates the F2 dissolution curve agreement statistic
for each of these curves relative to the reference batch,
and we model the F2 using the DoE factors as inputs
and use the profiler to find the factor settings that match the reference.
Before the analysis,
we use the Summary feature in the Tables menu
to calculate the means of the dissolution measurements
by batch and across each of the times.
We can save ourselves a little bit of work by using all of the DoE factors here
as grouping variables so they'll be carried through
into the subsequent table.
Now we have a 17-row data set
and we hide and exclude the reference batch.
Now take note of the values of the dissolution means for the reference,
because we're going to use those
when we create a formula column that calculates the F2 agreement metric
for each of the batches relative to the reference batch.
Now we're going to be able to use this F2 formula column
as a response to be modeled.
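As a sketch of what that formula column computes (not the actual JSL formula; the inputs are assumed arrays of batch-mean dissolution values at the shared time points), the calculation looks like this:

```python
import numpy as np

# Minimal sketch of the F2 agreement calculation used in the formula column.
# 'test_means' and 'ref_means' are assumed arrays of batch-mean dissolution
# values at the same time points as the reference.
def f2(test_means, ref_means):
    msd = np.mean((np.asarray(ref_means) - np.asarray(test_means)) ** 2)
    return 50.0 * np.log10(100.0 / np.sqrt(1.0 + msd))
```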
We use the model script created by the DoE platform
to set up our model for us.
We place F2 as our response variable
and we're going to analyze this data today
using the generalized regression platform in JMP Pro.
When we get into the platform
we see that it has automatically done a standard least squares analysis
because it found that there were enough degrees of freedom
in the data for it to do so
and it's given us an AICc of 155.6.
I'm going to see if we can do better by trying
a best subsets reduction of the model.
When we do that, we see
that the AICc of the best subsets fit goes down to 136.
Smaller is better with the AICc,
and a difference of 20 is pretty substantial,
so I would conclude that the normal best subsets fit is a better model
than the standard least squares one.
I'm going to try one more thing, though,
and fit a log normal distribution with best subsets to the data.
When I do that, the AICc goes down a little bit further to 130.6.
That's a modest difference, but it's good enough
that I'm going to conclude that we are going to work
with the LogNormal,
especially because we know that we're working with a strictly positive response
and the LogNormal distribution fits data that is strictly positive.
From there, the analysis is pretty straightforward,
so I'm going to jump straight ahead to using the profiler.
F2 is an agreement metric that we want to maximize.
So we get into the profiler,
we turn on Desirability functions and have them set to maximize,
and then maximize desirability to find the combination of factor settings
that this model says gives us the closest match to the reference,
and that would be at this combination of factor settings that we see here.
Now the F2 analysis is complete,
and we're going to go into the second analysis,
the functional DoE analysis.
For this analysis,
we're going to work with the data in a stacked format
where all of the dissolution measurements have been combined into a single column,
and we have a time column as well.
The first thing we do is go into the Functional Data Explorer platform.
In the platform launch,
we put dissolution as our response, time is our X,
the batch column as our ID,
and we supply the four DoE factors as supplementary variables.
Once we're in the platform,
we take a look at the data using the initial data plot.
This particular data set doesn't need any clean up or alignment options,
but we are going to go ahead and load the reference dissolution curve
as a target function.
For relatively simple functions like these,
I typically use B-Splines for my functional model.
When we do that, we see our B-Spline model fit,
and the initial fit that has come up is a cubic model that is behaving poorly.
It's interpolating the data points well,
but kind of doing crazy things in between them.
So I'm going to change from the default recommended model
over to a quadratic Spline model instead of the cubic one.
We do that by simply clicking on Quadratic over here
in the right of the B-Spline model fit.
We'll see that this quadratic model fits the data well.
A functional principal components analysis is automatically calculated,
and we see that the Functional Data Explorer platform has found
three functional principal components.
The leading one is very dominant,
explaining 97.9 percent of the functional variation,
and it looks like a shape component that shifts the overall level up or down.
The second one looks like a rate component,
and the third one almost looks like a quadratic.
Looking a little closer at this quadratic B-Spline model fit,
we see that this model is fitting the individual dissolution curves
pretty well.
So now we're ready to do our functional DoE analysis.
Each of our individual dissolution curves has been approximated now
by an underlying mean function common to all the batches
plus a batch dependent FPC score times the first eigenfunction,
plus another batch dependent FPC score times the second eigenfunction,
and so on with the third one.
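In other words, each batch's curve is represented as

$$y_i(t) \approx \mu(t) + \sum_{k=1}^{3} s_{ik}\,\phi_k(t)$$

where $\mu(t)$ is the shared mean function, $\phi_k(t)$ are the eigenfunctions, and $s_{ik}$ are batch $i$'s functional principal component scores.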
What we're going to do is set up individual DoE models
for each of these functional principal component scores
as responses using our DoE factors as inputs.
The Functional Data Explorer platform, of course,
makes all this simple and kind of ties it up into a bow for us.
And when I say that,
it ties it up in a bow for us, what I really mean is the FDoE profiler.
So this pane here shows our predicted trajectory
of dissolution as a function of time,
and then we can see how that trajectory would change
by altering the DoE factors.
That relationship with the DoE factors
comes from these three generalized regression models
for each of our functional principal component scores.
If we want, we can open those up
and we can look at the relationship
between the DoE factors and that functional principal component score,
and we could even alter the model
by moving to other models along the solution path.
I just want to point out that it's possible to change the DoE model
for an FPC score.
In the interest of time,
I'm just going to have to move on and not demonstrate that, though.
We have diagnostic plots,
the most important one probably being the Actual by Predicted Plot.
This has our actual dissolution measurements on the Y-axis
and the predicted dissolution values from the functional DoE model on the X-axis.
And as always, we want to see that plot
have data points tight along the 45 degree line.
And in this case, I think this model looks pretty good.
We don't want to see any patterns in our residuals,
and I'm not seeing any bad ones here.
So this model looks pretty good, and we're going to work with it.
So I've already explained how this pane right here
represents the predicted dissolution curve as a function of time
and the individual DoE factors.
Now, these other two rows here are because
we've loaded the reference as a target function.
So this row is the difference
of the predicted dissolution curve from the target reference curve.
And then the bottom pane here is the integrated distance
of the predicted curve from the target.
When we maximize desirability in this profiler,
it gives us the combination of factor settings
that minimize this integrated distance from the target.
So I'm going to do that by bringing up maximize desirability.
And now we see the results of the functional DoE analysis,
where we have identified that 0.725 of polymer A,
0.275 of polymer B,
a total polymer of about 0.17,
and a compression force of about 1,700
minimize the distance between our predicted curve and the reference.
Now, we've done two analyses.
Both of those analyses have recommended that we go
to the lowest setting of polymer A and the highest setting of polymer B.
They differ in their recommendations
for what total polymer amount to use and how much compression force to use.
The third analysis I'm going to do is the Curve DoE analysis.
This is going to be structured pretty similarly
in some ways to the functional DoE analysis,
in that we're going to use the same version of the data
where dissolution measurements are all in one column
and we have a time column.
But we don't have a built-in target function option
in the Fit Curve platform yet.
So the first thing we have to do is fit just the reference batch
and save its prediction formula back to the table.
Then we do a Curve DoE analysis,
which is largely similar to a functional DoE analysis
in that we're extracting features from the curves
and modeling those features.
Then we go to the profiler under the Graph menu
to find settings that best match the reference.
The nonlinear model that we're going to be using is
a three parameter Weibull Growth Curve,
which has a long history in the analysis of dissolution curves.
Weibull growth curves have an asymptote parameter A
that represents the value as time goes to infinity.
There's what's called the inflection point parameter,
which I see as a scaling factor that kind of stretches out
or squeezes in the entirety of the curve.
And then there's also a growth rate parameter
that dictates the shape of the curve.
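A common parameterization matching that description (my reading of the parameter roles, so treat the exact form as an assumption) is

$$f(t) = a\left(1 - e^{-(t/b)^{c}}\right)$$

with asymptote $a$, inflection point $b$, and growth rate $c$.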
What I think is really valuable about using this model
relative to the functional DoE model or the F2 type analysis
is that we're going to be modeling features extracted from the data
that have real scientific meaning,
especially the asymptote and inflection point parameters.
Now Curve DoE analysis doesn't have a target matching capability
like the Functional Data Explorer.
So we begin the analysis by excluding all of the DoE rows in the data table.
These are represented by the Set column being equal to A.
So I select a cell there,
select matching cells, and then hide and exclude those rows
so that I only have the reference batch not excluded.
Then I go to the Fit Curve platform,
load it up, get in there, fit the Weibull growth model,
and then I save that prediction formula back to the table.
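A minimal sketch of that reference fit (not JMP's Fit Curve internals; the parameterization and starting values are assumptions) might look like this:

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical sketch of the step described above: fit a three-parameter
# Weibull growth curve to the reference batch and keep its prediction
# function for later comparison against the Curve DoE model.
def weibull_growth(t, a, b, c):
    # a: asymptote, b: inflection point (scale), c: growth rate (shape)
    return a * (1.0 - np.exp(-(t / b) ** c))

def fit_reference(time, dissolution):
    """time, dissolution: arrays of the reference batch measurements."""
    popt, _ = curve_fit(weibull_growth, time, dissolution, p0=[100.0, 10.0, 1.0])
    return lambda t: weibull_growth(t, *popt)  # the saved "prediction formula"
```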
Once we complete the Curve DoE analysis,
we're going to compare the Curve DoE prediction formula
to this reference predictor
to find combinations of the factor settings
that get us as close to this curve as possible.
So now we unhide and unexclude the DoE batches,
go back into the Fit Curve platform,
and, just like in the Functional Data Explorer platform,
we load up the DoE factors as supplementary variables.
Now that we're in the platform, we can fit our Weibull growth model.
The initial fit here looks pretty good.
Looks like we're capturing the shape of the dissolution curves.
One thing I like to do next is to make a parameter table.
This creates a data table with our fitted nonlinear regression parameters.
I like to look at these in the distribution platform
to see if there are outliers in there or anything unusual.
I also like to look at the patterns in the Multivariate platform;
it just gives you a better sense of what's going on
with the nonlinear model fit.
Once we know that everything is looking pretty good,
we can do our curve DoE analysis
and this looks very much like the functional DoE analysis from before.
We have a profiler that shows
the relationship between dissolution and time
and how that relationship changes as a function of our DoE factors.
And then we also have a generalized regression model
for each of those three parameters that we can take a look at individually.
The first thing I would do before trying to use the model in any way
is look at the Actual by Predicted Plot,
so that's what we see here.
These are the predicted values, incorporating
both the nonlinear model in time
and the DoE models on the nonlinear regression parameters.
This looks pretty good.
Because there is a fairly easy interpretation
for the Weibull growth model parameters,
it can be useful and interesting to open up
the individual model fits for these parameters.
For example, here are the coefficients for the inflection point model.
Because the inflection point is a strictly positive quantity,
a LogNormal best subsets model has been fit to the data
by the generalized regression platform.
We see that the mixture main effects have been forced in
and that the compression force by polymer A interaction
is the only other term in the model.
if we hold the polymer proportions constant
and increase the compression force,
we would expect a larger value of the inflection point.
One would observe this as a tablet that takes longer to dissolve,
which is exactly what we would expect to have happen.
We can save the Curve DoE prediction formula back to the table,
and we can see in all its gory detail
how the models for the asymptote, inflection point,
and growth rate are combined with time to come up with our overall prediction
for the dissolution curve.
Fortunately, with JMP we don't have to look at the formula too closely, though,
because we have profilers that
let us see the relationships in a visual way
rather than an algebraic one.
To solve our problem of finding the combination of factors
that give us the dissolution curve
that would be closest to the reference,
I created a formula column that calculated
the percentage difference between the predicted curve,
which takes the DoE factors into account, and the reference.
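Roughly, my reconstruction of that column (the exact formula is an assumption) is

$$\%\,\mathrm{diff}(t, \mathbf{x}) = 100 \times \frac{\left|\hat{y}(t, \mathbf{x}) - \hat{y}_{\mathrm{ref}}(t)\right|}{\hat{y}_{\mathrm{ref}}(t)}$$

where $\hat{y}(t, \mathbf{x})$ is the Curve DoE prediction at time $t$ and factor settings $\mathbf{x}$, and $\hat{y}_{\mathrm{ref}}(t)$ is the saved reference prediction.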
The last step of the analysis is
to bring up this percent difference response
in the profiler that is under the graph menu,
being sure to check the Expand Intermediate formulas option.
This led to a profiler where we're able to see the percent difference
from the reference as a function of time and the DoE factors.
I've shaded the region where the difference
is less than one percent in green.
By manually adjusting the factors,
I was able to find settings where the predicted curve
is less than one percent from the reference across all time values.
This looks really good, but in practice I bet that this is overly optimistic.
Here we see the optimal values of the factor settings for all three analyses.
The curve DoE analysis is in the interior of the range for the polymers.
The optimal value for total polymer is 0.16,
which is close to the functional DoE analysis result,
and compression force is in between the optimal values
recommended by the F2 analysis and the functional DoE analysis.
After this, we made new formulations based on the recommended factor settings
from each of these models and measured their dissolution curves
as well as took a new set of measurements from the reference.
Here we see a summary of the final results from the verification runs.
The new reference dissolution curve is in black,
and the curve DoE in green is the closest curve to it,
followed by the FDoE curve in blue.
The result of modeling F2 is in red, and it did the poorest overall.
This should perhaps not be too surprising.
The F2 approach was the simplest,
reducing the data down to a single metric, and it did the poorest.
The functional DoE model had to empirically
derive the shapes of the curves
and then model three features of those shapes,
essentially using more of the information in the data.
The curve DoE led to the best formulation
because it used the data efficiently via some prior knowledge
about the parametric form of the dissolution curves.
We see that the results of the F2-based analysis
are not equivalent to the new reference batch,
while the approaches that treated curves as first-class objects are equivalent.
What this means is that the F2 approach would have required
at least another round of DoE runs,
and so an inefficient analysis leads to an inefficient use
of time and resources.
I'm going to close the presentation
with a retrospective Info Q Assessment of the Results.
Overall, we found that the Curve DoE prediction
generalized the best to new data,
but was the most difficult analysis to perform.
I want to note that if we didn't have a known nonlinear model
to work with that fit the data well, we could not have done that analysis.
The functional DoE analysis and the F2-based approach
can be used more broadly in other situations.
The profiler leads to excellent communication scores
for all three analyses.
The ability to see how the shape of the dissolution curve changes
with the DoE factors in the functional and curve-based approaches
leads me to give them a better communication score.
I see the curve DoE approach
having the highest communication score by a little bit
because we're directly modeling more meaningful parameters
than the functional DoE approach.
That's all we have for you today.
I want to thank you for your time, interest, and attention.