Functional Data Analysis and Regression Models: Pros and Cons, and Their Combin...

Hi, I'm Ron Kenett.

This is a joint talk with Chris Gotwalt.

The talk is on Functional Data Analysis and Nonlinear Regression Models.

And in order to examine the options

and what we get out of this type of analysis,

we will take an information quality perspective.

In a sense, this is a follow up to a talk we gave last year

at the same Discovery Summit.

So I will start with simple examples

to introduce FDA and Nonlinear Regression.

And then Chris will cover a

substantially more complex example of optimization,

which includes mixture experiments designed to match a reference profile.

So the story starts with data on tablets that are dissolved

and measurements are done at different time points,

five minutes, ten minutes, 15 minutes, every five minutes,

and then 20 minutes, ten minutes later,

30, and then 15 minutes later, 45 minutes.

We have 12 tablets that are our product

and 12 tablets that are the reference.

Our goal is to have a product that matches the reference.

And in this type of data, we have a profile,

and we consider two options,

FDA and NLR.

In Chris' example, we'll talk about something called F2,

which is a third option for analyzing this type of data.

So here's what it looks like with the graph builder.

On the left, we have the reference profiles.

On the right, we have the test tablets.

We can see this is an example from my book on modern industrial statistics,

the book with Shelley Zacks, which is now in its third edition.

So on the left you can see there is a tablet that seems

a bit different.

It's labeled T5R.

And if we run a functional data analysis of this data,

T5R does look different.

We see that the growth part is different.

It has slow but consistent growth.

It does not have the shape that we see in the other dissolution curves.

This was done with a quadratic B-spline with one knot,

and the quadratic was, in this case,

fitting the data better than the cubic.

This is a bit of an unusual situation.

So because of the shapes, the Quadratic B-spline was a better fit.

If we look at T1R,

the first tablet, it has a still different shape:

it shoots up and then it stays.

So basically, the tablet has dissolved.

Obviously, beyond a high level of dissolution,

there's not much left to dissolve.

So T1R and T5R, they seem different.

T5R stands out more than T1R.

So, yes, T5R, in the cluster analysis

of the functional principal components, does stand out.

So here we see how the scatter plot

of the first two functional principal components points to

what we observed visually.

And T1R, which is next to T2R,

is in a different cluster.

We can proceed with a nonlinear regression approach.

Here we are fitting a Gompertz three parameter model

with three parameters,

the asymptote, the growth rate, and the inflection point.
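
As a rough illustration of the three-parameter Gompertz curve just described, here is a minimal Python sketch. It assumes the common parameterization a * exp(-exp(-b * (t - c))), with a the asymptote, b the growth rate, and c the inflection point. The function name and all parameter values are hypothetical and are only meant to contrast a fast-dissolving tablet with a slow one; they are not taken from the talk's data.

```python
import math

def gompertz_3p(t, asymptote, growth_rate, inflection):
    """Gompertz 3P curve: asymptote * exp(-exp(-growth_rate * (t - inflection)))."""
    return asymptote * math.exp(-math.exp(-growth_rate * (t - inflection)))

# Measurement times in minutes, as described in the talk.
times = [5, 10, 15, 20, 30, 45]

# Hypothetical parameter values, chosen only to contrast a fast and a slow tablet.
fast_tablet = [gompertz_3p(t, 100.0, 0.25, 5.0) for t in times]
slow_tablet = [gompertz_3p(t, 100.0, 0.08, 12.0) for t in times]

for t, fast, slow in zip(times, fast_tablet, slow_tablet):
    print(f"t={t:>2} min  fast={fast:6.2f}  slow={slow:6.2f}")
```

With this parameterization the curve reaches asymptote / e at the inflection point, so a larger inflection point pushes the rise of the curve to the right.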

This is the model and when we fit the profiles,

we again see that T5R stands out.

So we have the same qualitative impression that we had with FDA.

Now we have these three parameters listed

and we can run a profiler on the model because we now have a model.

This is where T1R stands.

So by running the profiler on the different tablets,

we can also see how similar or different they look.

This is the table that maps out the estimated parameters.

So T1R has a growth rate of 0.21.

T5R, the tablet that stood out,

has a growth rate of 0.075,

a very slow growth rate,

consistent but slow.

The inflection point is 11.5, way on the right.

So we can see the difference through these parameter values.

We can also pick out two tablets that stand out for their growth rates:

T2R, at 1.77,

and T8R, with almost no growth.

We'll get back to T2 and T8 in a minute.

If we take the principal components of this three-parameter space,

we consider the parameters as if they were the measurements,

and we run a multivariate control chart.

We can see T1, this is the first one,

and we can see T5, this is the blue one, the fifth one.

This we already saw.

They are within the control limits of the T-square

multivariate statistical distance control chart.

And T2 and T8, which I highlighted before, now stand out

and we can see qualitatively why.
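
The idea of treating the fitted parameters as the measurements can be sketched as a Hotelling T-square calculation. The Python below is only a sketch under stated assumptions: the parameter rows are made up (asymptote, growth rate, inflection point), the helper names are hypothetical, and no control limit is computed, since that would need a distribution quantile that the talk does not discuss.

```python
def mean_vector(rows):
    """Column means of a list of equal-length tuples."""
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(len(rows[0]))]

def covariance(rows):
    """Sample covariance matrix (divisor n - 1)."""
    n, p = len(rows), len(rows[0])
    m = mean_vector(rows)
    return [[sum((r[i] - m[i]) * (r[j] - m[j]) for r in rows) / (n - 1)
             for j in range(p)] for i in range(p)]

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        rest = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - rest) / aug[i][i]
    return x

def t_squared(rows):
    """T-square distance of each row from the mean: d' S^-1 d."""
    m, s = mean_vector(rows), covariance(rows)
    values = []
    for r in rows:
        d = [r[j] - m[j] for j in range(len(m))]
        values.append(sum(di * zi for di, zi in zip(d, solve(s, d))))
    return values

# Made-up (asymptote, growth rate, inflection point) rows, one per tablet.
parameter_rows = [
    (98.2, 0.21, 4.1), (97.5, 0.35, 3.2), (99.0, 0.28, 4.6), (96.9, 0.24, 5.3),
    (98.6, 0.19, 6.0), (97.8, 0.31, 3.9), (98.9, 0.26, 4.4),
]
t2_values = t_squared(parameter_rows)
for tablet, t2 in enumerate(t2_values, start=1):
    print(f"tablet {tablet}: T2 = {t2:.2f}")
```

Tablets with a large T-square value are the ones whose parameter combination is unusual relative to the rest, which is how a tablet can stand out here even if no single parameter looks extreme.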

This is the model-dependent approach in the guidance documents

that is used for modeling dissolution curves.

In running such analysis with an information quality perspective,

the first question to ask is: what is the goal of the analysis?

And then we can consider the method of analysis.

Here we're using nonlinear regression and functional data analysis.

Chris will get into how this is combined

with data derived from experimental design.

We have a utility function and the information quality

is the utility of applying a method F to data X, conditional on the goal.

It is evaluated with eight dimensions.

And here again, Chris will talk about data resolution and data structure.

So Chris, the floor is yours.

Thanks Ron.

Now I'm going to give an example that is a little bit more complicated

than the first one.

In Ron's example,

he was comparing the dissolution curves of test tablets

to those from a set of reference tablets.

In that situation, the expectation is that

the curves should generally be following the same path.

And he showed how to find anomalous curves

that deviate from the rest of the population.

In this second example, we also have a reference dissolution curve.

But we are analyzing data from a designed experiment

where the goal is to find a formulation in two polymer additives

and the amount of force used in the tablet production process

that leads to a close match

to the reference batch's dissolution curve.

The graph you see here shows

the data from the reference curve that we want to match.

To do this, I'm going to demonstrate three analyses of this data

that use different methods and models to find factor settings

that will best match the reference curve.

In the first analysis,

I'm going to summarize each of the DoE curves

down to a single metric called F2

that is typically used in dissolution curve analysis,

a measure of agreement with the reference batch.

Then I'll use standard DoE methods to model that F2 response

and then find the factor settings that are predicted

to best agree with the reference.

In the second analysis,

I'll use a functional DoE modeling approach

where I model the curves using B-splines,

extract functional principal component scores, and model them.

I'll load the reference batch as a target function

in the Functional Data Explorer platform

and then use the FDoE profiler

to find the closest match recommended by that model.

These first two approaches use little subject matter information

about these types of tablets.

In the third analysis,

I'll model the curves using a nonlinear model

that was known to fit this type of tablet well

and use the Curve DoE option in the Fit Curve platform

to model the relationship between the DoE factors

and the shape of the curve.

I want to credit Clay Barker for adding this capability to JMP Pro 16.

I think it has a lot of promise for modeling curves

whose general shape can be assumed to be known in advance

to come from one of the supported nonlinear models.

At the end, verification batches were made

using the recommended formulation settings for each of the three analyses,

and we compared them to a new reference batch.

What we found was that the nonlinear regression-based approach

led to the closest match to the reference.

What we see here is a scatter plot matrix of the four factors

in the designed experiment.

There was a mixture constraint between the two polymers,

as well as a constraint on the total amount of polymer

and the proportions of the individual polymers.

Here's a look at some of the raw data from the experiment.

At the top of the table, we have data from the reference that we wish to match.

There are 16 DoE formulations or batches in the experiment.

We can only see data from two of them in this picture, though.

There were six tablets per formulation.

There were four dissolution measurements per tablet.

Here we see plots of the dissolution curves for each of the 16 DoE formulations

with the dissolution curve of the reference batch here at the lower right.

Now, I'm going to do a quick preliminary information quality assessment

using the questions that you'll find in the spreadsheet

that you can download from the JMP user community page.

The first part of the assessment is related to the data resolution.

In this case, I think we're looking pretty good.

The data scale is well aligned

with the stated goal because it's a designed experiment.

The measuring devices seem to be reliable and precise,

and the data analysis is definitely going to be suitable

for the data aggregation level,

and we'll be illustrating different kinds of data aggregation

as we extract features from these dissolution curves.

As far as the data structure goes, we're in pretty good shape.

The data is certainly aligned with the stated goal,

we don't have any problems with outliers or missing values,

and the analysis methods are all suitable for the data structure,

although we do see some variation

in the quality of the results depending on the type of analysis we do.

As far as data integration goes, this is a pretty simple analysis.

We have multiple responses,

and we're exploring different ways of combining them into extracted features.

So there's a common workflow to all three of the analyses I'm going to be showing.

First, we have to get the data into a form

that is analyzable by the platform that we're using.

Then there's a round of feature extraction.

Then we model those features.

That's where there's a lot of difference between the methods,

and then we use the profiler in different ways

to find a formulation that closely matches the reference.

First, I'm going to go over the F2 analysis.

F2 is a standard measure of agreement of a dissolution curve

relative to a reference dissolution curve.

In the formula,

the Rs are the means of the reference curve at each time point,

and the Ts are the means of the non-reference curve.

The convention is to say that the two curves are equivalent

when F2 is greater than or equal to 50.
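
For readers who cannot see the slide, the F2 similarity factor is usually written as 50 * log10(100 / sqrt(1 + mean squared difference)), where the mean is taken over the time points of the reference means R and the test means T. The Python below is a minimal sketch of that calculation; the function name and the four-point curves are made up for illustration and are not the talk's data.

```python
import math

def f2_similarity(reference_means, test_means):
    """F2 = 50 * log10(100 / sqrt(1 + mean((R_t - T_t)^2)))."""
    if not reference_means or len(reference_means) != len(test_means):
        raise ValueError("curves must be non-empty and share the same time points")
    n = len(reference_means)
    mean_sq_diff = sum((r - t) ** 2
                       for r, t in zip(reference_means, test_means)) / n
    return 50.0 * math.log10(100.0 / math.sqrt(1.0 + mean_sq_diff))

# Made-up batch means (percent dissolved) at four time points.
reference = [25.0, 50.0, 75.0, 95.0]
close_batch = [23.0, 48.0, 73.0, 93.0]   # 2 points below the reference everywhere
far_batch = [15.0, 40.0, 65.0, 85.0]     # 10 points below the reference everywhere

print(f"identical curves: {f2_similarity(reference, reference):.1f}")
print(f"close batch:      {f2_similarity(reference, close_batch):.1f}")
print(f"far batch:        {f2_similarity(reference, far_batch):.1f}")
```

Identical curves give F2 = 100, and a uniform 10-point gap lands just below the conventional threshold of 50, which is one way to read where that threshold comes from.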

It's important to point out

that I'm including this F2-based analysis

not just as an example of a dissolution DoE analysis,

but more broadly as an example of how reducing a response

that is inherently a curve down to a single number

leads to a much lower quality analysis

and results at the end

than a procedure that treats curves as first-class citizens.

So now I'm going to share the F2 analysis of the dissolution DoE data.

The first thing we have to do is calculate the batch means

of the dissolution curves at the different time points.

Then we create a formula column

that calculates the F2 dissolution curve agreement statistic

for each of these curves relative to the reference batch,

and we model the F2 using the DoE factors as inputs

and use the profiler to find the factor settings that match the reference.

Before the analysis,

we use the Tables Summary feature

to calculate the means of the dissolution measurements

by batch and across each of the times.

We can save ourselves a little bit of work by using all of the DoE factors here

as grouping variables so they'll be carried through

into the subsequent table.
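
Outside of JMP, the same summary step amounts to a group-by mean. The following Python sketch assumes the data is stacked as (batch, time, dissolution) rows; the function name and the measurements are hypothetical and only show the shape of the calculation.

```python
from collections import defaultdict
from statistics import mean

def batch_means(rows):
    """Average the dissolution values within each (batch, time) group."""
    grouped = defaultdict(list)
    for batch, time, dissolution in rows:
        grouped[(batch, time)].append(dissolution)
    return {key: mean(values) for key, values in grouped.items()}

# Made-up measurements: two tablets per batch at two time points.
rows = [
    ("Batch 1", 15, 30.0), ("Batch 1", 15, 34.0),
    ("Batch 1", 30, 60.0), ("Batch 1", 30, 62.0),
    ("Reference", 15, 40.0), ("Reference", 15, 44.0),
    ("Reference", 30, 70.0), ("Reference", 30, 74.0),
]

means = batch_means(rows)
for (batch, time), value in sorted(means.items()):
    print(f"{batch:<10} t={time:>2}  mean={value:.1f}")
```

In the talk's table the DoE factors are also used as grouping variables; since they are constant within a batch, that only carries them along and does not change the means.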

Now we have a 17-row data set

and we hide and exclude the reference batch.

Now take note of the values of the dissolution means for the reference

because we're going to use those

when we create a formula column that calculates the F2 agreement metric

for each of the batches relative to the reference batch

and now we're going to be able to use this F2 formula column

as a response to be modeled.

We use the model script created by the DoE platform

to set up our model for us.

We place F2 as our response variable

and we're going to analyze this data today

using the generalized regression platform in JMP Pro.

When we get into the platform

we see that it has automatically done a standard least squares analysis

because it found that there were enough degrees of freedom

in the data for it to do so

and it's given us an AICc of 155.6.

I'm going to see if we can do better by trying

a best subsets reduction of the model

and when we do that we see

that the AICc of that best subsets model goes down to 136.

Smaller is better with the AICc,

and a difference of 20 is pretty substantial,

so I would conclude that the Normal best subsets model is a better model

than the standard least squares one.

I'm going to try one more thing, though,

and fit a LogNormal distribution with best subsets to the data.

When I do that, the AICc goes down a little bit further to 130.6.

That's a modest difference, but it's good enough

that I'm going to conclude that we are going to work

with the LogNormal,

especially because we know that we're working with a strictly positive response

and the LogNormal distribution fits data that is strictly positive.
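
To make the AICc comparison concrete, here is a small Python sketch. The aicc helper uses the usual small-sample correction, AIC plus 2k(k+1)/(n-k-1); its name and arguments are assumptions for illustration. The three AICc values are the ones quoted in the talk, and the code only ranks them and reports differences from the best model.

```python
def aicc(neg2_log_likelihood, n_params, n_obs):
    """AICc = -2 log L + 2k + 2k(k + 1) / (n - k - 1)."""
    if n_obs - n_params - 1 <= 0:
        raise ValueError("AICc needs n_obs > n_params + 1")
    aic = neg2_log_likelihood + 2 * n_params
    return aic + (2 * n_params * (n_params + 1)) / (n_obs - n_params - 1)

# AICc values as quoted in the talk for the three candidate models.
reported_aicc = {
    "standard least squares": 155.6,
    "Normal best subset": 136.0,
    "LogNormal best subset": 130.6,
}

best_model = min(reported_aicc, key=reported_aicc.get)
delta_from_best = {name: value - reported_aicc[best_model]
                   for name, value in reported_aicc.items()}

for name, delta in sorted(delta_from_best.items(), key=lambda item: item[1]):
    print(f"{name:<24} AICc={reported_aicc[name]:6.1f}  delta={delta:5.1f}")
```

The correction term grows quickly when the number of parameters approaches the number of runs, which is part of why a reduced model can score so much better on a 16-run experiment.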

From there, the analysis is pretty straightforward,

so I'm going to jump straight ahead to using the profiler.

F2 is an agreement metric that we want to maximize.

So we get into the profiler,

we turn on Desirability functions and have them set to maximize,

and then maximize desirability to find the combination of factor settings

that this model says gives us the closest match to the reference,

and that would be at this combination of factor settings that we see here.

Now the F2 analysis is complete,

and we're going to go into the second analysis,

the functional DoE analysis.

For this analysis,

we're going to work with the data in a stacked format

where all of the dissolution measurements have been combined into a single column,

and we have a time column as well.

The first thing we do is go into the Functional Data Explorer platform.

In the platform launch,

we put dissolution as our response, time is our X,

the batch column as our ID,

and we supply the four DoE factors as supplementary variables.

Once we're in the platform,

we take a look at the data using the initial data plot.

This particular data set doesn't need any clean up or alignment options,

but we are going to go ahead and load the reference dissolution curve

as a target function.

For relatively simple functions like these,

I typically use B-Splines for my functional model.

When we do that, we see our B-Spline model fit,

and the initial fit that has come up is a cubic model that is behaving poorly.

It's interpolating the data points well,

but kind of doing crazy things in between them.

So I'm going to change from the default recommended model

over to a quadratic Spline model instead of the cubic one.

We do that by simply clicking on Quadratic over here

in the right of the B-Spline model fit.

We'll see that this quadratic model fits the data well.

A functional principal components analysis is automatically calculated,

and we see that the Functional Data Explorer platform has found

three functional principal components.

The leading one is very dominant,

explaining 97.9 percent of the functional variation.

And it looks like this is a level shift, up or down, kind of shape component.

The second one looks like a rate component,

and the third one almost looks like a quadratic.

Looking a little closer at this quadratic B-Spline model fit,

we see that this model is fitting the individual dissolution curves

pretty well.

So now we're ready to do our functional DoE analysis.

Each of our individual dissolution curves has been approximated now

by an underlying mean function common to all the batches

plus a batch dependent FPC score times the first eigenfunction,

plus another batch dependent FPC score times the second eigenfunction,

and so on with the third one.
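
The decomposition just described, a common mean function plus batch-specific scores times eigenfunctions, can be sketched in a few lines of Python. Everything here is illustrative: the function name is hypothetical, the curves are evaluated on a made-up four-point time grid, and only two eigenfunctions are used to keep it short.

```python
def reconstruct_curve(mean_curve, eigenfunctions, scores):
    """Return mean(t) + sum_k score_k * eigenfunction_k(t) on a shared time grid."""
    if len(eigenfunctions) != len(scores):
        raise ValueError("need one score per eigenfunction")
    if any(len(e) != len(mean_curve) for e in eigenfunctions):
        raise ValueError("eigenfunctions must share the mean curve's time grid")
    return [
        m + sum(s * e[i] for s, e in zip(scores, eigenfunctions))
        for i, m in enumerate(mean_curve)
    ]

# Made-up values on a four-point time grid (not taken from the talk's data).
mean_curve = [10.0, 40.0, 70.0, 90.0]
eigenfunctions = [
    [1.0, 1.0, 1.0, 1.0],    # moves the whole curve up or down
    [-1.0, 0.0, 0.0, 1.0],   # hypothetical rate-like shape
]
batch_scores = [5.0, 2.0]    # one score per eigenfunction for a single batch

batch_curve = reconstruct_curve(mean_curve, eigenfunctions, batch_scores)
print(batch_curve)
```

In a functional DoE model, the scores are what get predicted from the DoE factors; the mean function and eigenfunctions stay fixed, so predicting the scores is enough to predict a whole curve.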

What we're going to do is set up individual DoE models

for each of these functional principal component scores

as responses using our DoE factors as inputs.

The Functional Data Explorer platform, of course,

makes all this simple and kind of ties it up into a bow for us.

And when I say that

it ties it up in a bow for us, what I really mean is the FDoE profiler.

So this pane here shows our predicted trajectory

of dissolution as a function of time,

and then we can see how that trajectory would change

by altering the DoE factors.

That relationship with the DoE factors

comes from these three generalized regression models

for each of our functional principal component scores.

If we want, we can open those up

and we can look at the relationship

between the DoE factors and that functional principal component score,

and we could even alter the model

by moving around to other ones in the solution path.

I just want to point out that it's possible to change the DoE model

for an FPC score.

In the interest of time,

I'm just going to have to move on and not demonstrate that, though.

We have diagnostic plots,

the most important one probably being the Actual by Predicted Plot.

This has our actual dissolution measurements on the Y-axis

and the predicted dissolution values from the functional DoE model on the X-axis.

And as always, we want to see that plot

have data points tight along the 45 degree line.

And in this case, I think this model looks pretty good.

We don't want to see any patterns in our residuals,

and I'm not seeing any bad ones here.

So this model looks pretty good, and we're going to work with it.

So I've already explained how this pane right here

represents the predicted dissolution curve as a function of time

and the individual DoE factors.

Now, these other two rows here are because

we've loaded the reference as a target function.

So this row is the difference

of the predicted dissolution curve from the target reference curve.

And then the bottom pane here is the integrated distance

of the predicted curve from the target.

When we maximize desirability in this profiler,

it gives us the combination of factor settings

that minimize this integrated distance from the target.
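
The talk does not spell out how the platform computes the integrated distance, so the following Python is only a sketch of one plausible definition: the squared difference between the predicted and target curves, integrated over time with the trapezoid rule. The function name and the example curves are assumptions.

```python
def integrated_squared_distance(times, predicted, target):
    """Trapezoid-rule integral of (predicted - target)^2 over the time grid."""
    if not (len(times) == len(predicted) == len(target)) or len(times) < 2:
        raise ValueError("need matching grids with at least two time points")
    total = 0.0
    for i in range(len(times) - 1):
        left = (predicted[i] - target[i]) ** 2
        right = (predicted[i + 1] - target[i + 1]) ** 2
        total += 0.5 * (left + right) * (times[i + 1] - times[i])
    return total

# Made-up curves on a shared time grid (minutes, percent dissolved).
times = [0.0, 10.0, 20.0, 30.0]
target = [0.0, 40.0, 70.0, 90.0]
candidate_a = [0.0, 42.0, 72.0, 92.0]   # close to the target
candidate_b = [0.0, 30.0, 55.0, 80.0]   # further from the target

for name, curve in [("A", candidate_a), ("B", candidate_b)]:
    print(f"candidate {name}: {integrated_squared_distance(times, curve, target):.1f}")
```

Minimizing a quantity like this over the factor settings is what maximizing desirability does in this profiler: the closer the predicted curve tracks the target at every time, the smaller the integral.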

So I'm going to do that by bringing up maximize desirability.

And now we see the results of the functional DoE analysis,

where we have identified that 0.725 of polymer A,

0.275 of polymer B,

a total polymer of about 0.17,

and a compression force of about 1700

minimize the distance between our predicted curve and the reference.

Now, we've done two analyses.

Both of those analyses have recommended that we go

to the lowest setting of polymer A and the highest setting of polymer B.

They differ in their recommendations

for what total polymer amount to use and how much compression force to use.

The third analysis I'm going to do is the Curve DoE analysis.

This is going to be structured pretty similar

in some ways to the functional DoE analysis,

in that we're going to use the same version of the data

where dissolution measurements are all in one column

and we have a time column.

But we don't have a built-in target function option

in the Fit Curve platform yet.

So the first thing we have to do is fit just the reference batch

and save its prediction formula back to the table.

Then we do a Curve DoE analysis,

which is largely similar to a functional DoE analysis

in that we're extracting features from the curves

and modeling them.

Then we go to the graph profiler to find settings that best match the reference.

The nonlinear model that we're going to be using is

a three parameter Weibull Growth Curve,

which has a long history in the analysis of dissolution curves.

Weibull Growth Curves have an asymptote parameter A

that represents the value as time goes to infinity.

There's what's called the inflection point parameter

that I see as a scaling factor that kind of stretches out

or squeezes in the entirety of the curve.

And then there's also a growth rate parameter

that dictates the shape of the curve.
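
A short Python sketch may help show what the three parameters do. It assumes the parameterization a * (1 - exp(-(t / b)^c)), with a the asymptote, b the inflection point (acting as a time scale), and c the growth rate; the function name and the parameter values are hypothetical.

```python
import math

def weibull_growth(t, asymptote, inflection, growth_rate):
    """Weibull growth curve: asymptote * (1 - exp(-(t / inflection) ** growth_rate))."""
    if t <= 0:
        return 0.0
    return asymptote * (1.0 - math.exp(-((t / inflection) ** growth_rate)))

# Hypothetical time grid (minutes) and parameter values.
times = [5, 10, 20, 40, 80]
baseline = [weibull_growth(t, 100.0, 10.0, 1.5) for t in times]
stretched = [weibull_growth(t, 100.0, 20.0, 1.5) for t in times]   # larger inflection point

for t, base, slow in zip(times, baseline, stretched):
    print(f"t={t:>2} min  baseline={base:6.2f}  stretched={slow:6.2f}")
```

Doubling the inflection point gives the same curve on a time axis stretched by a factor of two, so the tablet takes longer to dissolve, while the asymptote sets where the curve levels off.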

What I think is really valuable about using this model

relative to the functional DoE model or the F2 type analysis

is that we're going to be modeling features extracted from the data

that have real scientific meaning,

especially the asymptote and inflection point parameters.

Now Curve DoE analysis doesn't have a target matching capability

like the Functional Data Explorer.

So we begin the analysis by excluding all of the DoE rows in the data table.

These are represented with the set column equal to A.

So I select a cell there,

select matching cells, and then hide and exclude those rows

so that I only have the reference batch not excluded.

Then I go to the Fit Curve platform,

load it up, get in there, fit the Weibull growth model,

and then I save that prediction formula back to the table.

Once we complete the Curve DoE analysis,

we're going to compare the Curve DoE prediction formula

to this reference predictor

to find combinations of the factor settings

that get us as close to this curve as possible.

So now we unhide and unexclude the DoE batches

and go back into the Fit Curve platform.

Just like in the Functional Data Explorer platform,

we're going to load up the DoE factors as supplementary variables.

Now that we're in the platform, we can fit our Weibull growth model.

The initial fit here looks pretty good.

Looks like we're capturing the shape of the dissolution curves.

One thing I like to do next is to make a parameter table.

This creates a data table with our fitted nonlinear regression parameters.

I like to look at these in the distribution platform

to see if there are outliers in there or anything unusual.

I also like to look at the patterns I see in the multivariate platform.

It just gives you a better sense of what's going on

with the nonlinear model fit.

Once we know that everything is looking pretty good,

we can do our curve DoE analysis

and this looks very much like the functional DoE analysis from before.

We have a profiler that shows

the relationship between dissolution and time

and how that relationship changes as a function of our DoE factors.

And then we also have a generalized regression model

for each of those three parameters that we can take a look at individually.

The first thing I would do before trying to use the model in any way

is look at the Actual by Predicted Plot,

so that's what we see here.

This is the predicted values incorporating

both the time model from the mean function and the eigen functions,

as well as the DoE models on the nonlinear regression parameters.

This looks pretty good.

Because there is a fairly easy interpretation

for the Weibull growth model parameters,

it can be useful and interesting to open up

the individual model fits for these parameters.

For example, here are the coefficients for the inflection point model.

Because inflection point is a strictly positive quantity,

a LogNormal best subsets model has been fit to the data

by the generalized regression platform.

We see that the mixture main effects have been forced in

and that the compression force by polymer A interaction

is the only other term in the model.

What this means is that

if we hold the polymer proportions constant

and increase the compression force,

we would expect a larger value of the inflection point.

One would observe this as a tablet that takes longer to dissolve,

which is exactly what we would expect to have happen.

We can save the curve DoE prediction formula back to the table

and we can see in all its gory detail

how the models for asymptote, inflection point,

and growth rate are combined with time to come up with our overall prediction

for the dissolution curve.

Fortunately, with JMP we don't have to look at the formula too closely, though,

because we have profilers that

let us see the relationships in a visual way

rather than an algebraic one.

To solve our problem of finding the combination of factors

that give us the dissolution curve

that would be closest to the reference,

I created a formula column that calculated

the percentage difference of the predicted curve,

which takes the DoE factors into consideration, from the reference.

The last step of the analysis is

to bring up this percent difference response

in the profiler that is under the graph menu,

being sure to check the Expand Intermediate Formulas option.

This led to a profiler where we're able to see the percent difference

from the reference as a function of time and the DoE factors.

I've shaded the region where the difference

is less than one percent in green.

By manually adjusting the factors,

I was able to find settings where the predicted curve

is less than one percent from the reference across all time values.
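
The percent difference check can be sketched in Python as follows. This assumes the difference is taken relative to the reference prediction, which the talk does not state explicitly. The two curve functions are hypothetical stand-ins for the saved reference formula and the Curve DoE prediction at one factor setting, and the time grid is made up.

```python
import math

def percent_difference(predicted, reference):
    """Signed percent difference of a predicted value from the reference value."""
    if reference == 0:
        raise ValueError("reference value must be nonzero")
    return 100.0 * (predicted - reference) / reference

def max_abs_percent_difference(predicted_curve, reference_curve, times):
    """Largest absolute percent difference over the given time points."""
    return max(abs(percent_difference(predicted_curve(t), reference_curve(t)))
               for t in times)

def reference_curve(t):
    # Hypothetical stand-in for the saved reference prediction formula.
    return 100.0 * (1.0 - math.exp(-t / 10.0))

def candidate_curve(t):
    # Hypothetical stand-in for the Curve DoE prediction at one factor setting.
    return 1.005 * reference_curve(t)

times = [5, 10, 15, 20, 30, 45]   # made-up time grid in minutes
worst = max_abs_percent_difference(candidate_curve, reference_curve, times)
print(f"largest absolute percent difference: {worst:.3f}")
print("within one percent at all times:", worst < 1.0)
```

Checking the worst case across time mirrors the manual search described above: a setting is acceptable only if the curve stays inside the one percent band at every time point, not just on average.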

This looks really good, but in practice I bet that this is overly optimistic.

Here we see the optimal values of the factor settings for all three analyses.

The curve DoE analysis is in the interior of the range for the polymers.

The optimal value for total polymer is 0.16,

which is close to the functional DoE analysis result,

and compression force is in between the optimal values

recommended by the F2 analysis and the functional DoE analysis.

After this, we made new formulations based on the recommended factor settings

from each of these models and measured their dissolution curves

as well as took a new set of measurements from the reference.

Here we see a summary of the final results from the verification runs.

The new reference dissolution curve is in black,

and the curve DoE in green is the closest curve to it,

followed by the FDoE curve in blue.

The result of modeling F2 is in red, and it did the poorest overall.

This should perhaps not be too surprising.

The F2 metric was the simplest,

reducing the data down to a single metric and did the poorest.

The functional DoE model had to empirically

derive the shapes of the curves

and then model three features of those shapes,

essentially using more of the information in the data.

The curve DoE led to the best formulation

because it used the data efficiently via some prior knowledge

about the parametric form of the dissolution curves.

We see that the results of the F2 based analysis

are not equivalent with the new reference batch,

while the approaches that treated curves as first class objects are equivalent.

What this means is that the F2 approach would have required

at least another round of DoE runs,

and so an inefficient analysis leads to an inefficient use

of time and resources.

I'm going to close the presentation

with a retrospective InfoQ assessment of the results.

Overall, we found that the Curve DoE prediction

generalized the best to new data,

but was the most difficult analysis to perform.

I want to note that if we didn't have a known nonlinear model

to work with that fit the data well, we could not have done that analysis.

The functional DoE analysis and the F2-based approach

can be used more broadly in other situations.

The profiler leads to excellent communication scores

for all three analyses.

The ability to see how the shape of the dissolution curve changes

with the DoE factors in the functional and curve-based approaches

leads me to give them a better communication score.

I see the curve DoE approach

having the highest communication score by a little bit

because we're directly modeling more meaningful parameters

than the functional DoE approach.

That's all we have for you today.

I want to thank you for your time, interest, and attention.
