The most novel, innovative, and promising therapeutics in biopharmaceuticals are cell therapies. Cell therapies transfer human cells into a patient to treat disease. These cells either come directly from the patient or from a healthy (cell) donor. Multiple regulatory guidance documents recognize the importance of providing cell therapy manufacturers the flexibility to improve their processes. Therefore, it is imperative to show that the pre- and post-change processes are comparable and process changes pose no threat to the safety or efficacy of the drug product.

One method used to ensure comparability is an equivalence test of means. There is a regulatory expectation that the study is done as a paired design, often referred to as a split-apheresis study, unless there is minimal donor-to-donor variability. In split-apheresis studies, the same donor material is split and processed in the pre- and post-change process for comparison. The design of this study presents unique challenges in cell therapies as they require adequate sample sizes to ensure properly powered designs, yet the number of healthy donors available is usually quite low (three to six donors). Additionally, the power depends on lot-to-lot and assay variability, assay replication strategy, and the effect size used for the equivalence acceptance criterion (EAC).

This talk presents a series of JMP scripts that extend the existing capabilities of the Sample Size Explorer platform to address each of these relevant study nuances, as well as the capability to overlay power curves to address trade-offs with different sample sizes and approaches.

I am Heath Rushing.

Although Andrew Karl, Jeff Hofer, and Rick Burdick, some teammates of mine,

did the majority of the technical work here, I'm going to be the one presenting today.

I'm going to talk about how JMP and JMP scripts can be used

in a very specific application in cell therapies.

I'm going to talk a little bit about what gene and cell therapies are

and the very specific instance that I want to talk about

is comparability.

I'm going to focus on process changes.

Interestingly enough, last year,

I gave a talk, and it focused on cell and gene therapies.

They're very novel therapeutics.

The first one was approved in the United States in 2017.

They're a little bit different from most of

what I call the small molecule and the large molecule therapeutics

that you may have heard of in the past.

Let me just touch on what cell and gene therapies are.

The first thing I'm going to do is touch on what a gene therapy is.

What you're essentially doing is replacing

a defective gene with a healthy one,

or turning off bad genes.

A lot of cancers are caused by defective genes.

What you're doing is inserting these healthy genes back into a patient,

either in vivo or ex vivo.

An ex vivo example would be more of a bone marrow transplant.

Last year, I talked about how

the challenge with gene therapies is that patient-to-patient variability.

I focused on process development.

Then I talked about cell therapies.

In cell therapies, what you're doing is replacing diseased cells.

You're either transferring some sort of healthy cell into a patient,

or replacing missing cells in a patient.

Where do these cells come from?

They either come from the patient themselves,

so you would have to deal with that patient-to-patient variability,

or in most cases, they come from a healthy donor.

Now you're not dealing with this patient-to-patient variability,

but you're dealing with donor-to-donor variability.

Whenever I say donor, I'm talking about a healthy donor.

I could be a healthy donor.

Then someone else could be a healthy donor also.

In both of those cases,

you have to deal with that patient-to-patient

or donor-to-donor variability.

What's interesting is that last year, I gave an example in process development,

and it looks something like this.

It's the exact same data set that I used last year, where I said,

say that you were developing a process where you look at time, temperature, and pH,

and you're measuring their effect on cell viability and byproduct.

In that case, I could not use one donor's material;

I had to split that up across four different donors.

I said, "If you ran these experiments for process development,

and you did not consider that there was donor-to-donor variability,

this is what you would see."

Since we're looking for p-values that are below 0.05,

you would say nothing affects cell viability

and nothing affects byproduct.

You were not able to detect that you had any significant

or critical process parameters

for the very reason that you did not consider

that there could be a difference between donors.

Now, if you do consider donor as what's called a fixed donor effect,

the only thing that I did is bring in donor,

then you see that it really sticks out:

what significantly affects cell viability

and what significantly affects byproduct.

The whole talk was on how that donor-to-donor variability

affects statistical inference and also process capability.

I'm going to focus on that statistical inference.

What you're trying to do in process development

is determine if things like pH, temperature, and time

significantly affect your critical quality attributes.

Say that I was a drug manufacturer,

and I have set up a process development study.

I run this process development study.

I want to determine if temperature affects, say,

cell viability.

I say, "Hey whenever I'm looking at that , is I want to make sure that if something

significantly affects my quality attributes,

I control that in my process.

But if it doesn't, I am not spending money and time and resources controlling it."

What I'm concerned with as a drug manufacturer

is the Type I error rate.

I do not want to inflate a Type I error rate.

A Type I error would be saying,

"Hey , this is significant when, in fact, it's not."

What do you think that regulatory agencies would be more concerned with?

You controlling more things?

Or you not controlling things that should be controlled?

That's exactly right: they'd be more concerned

about that patient risk, that Type II error.

In process development, drug manufacturers

do not want to inflate the Type I error.

They also want sufficient power. Why?

Because that controls that patient risk.

The whole point of me showing that last year

was to show the effect of donor-to-donor variability

on trying to determine your critical process parameters.

I call it statistical inference.

Right now, what happens if I change my process?

I was working with a colleague just last week.

Whenever we're talking about cell and gene therapies, she said,

and this is her quote, "Heath, in cell and gene therapies,

things are constantly changing.

You could have things like analytical methods change.

You could have things like process change."

Today, I'm going to focus on this process right here.

Mainly, I'm going to focus on that process change.

I do want to point out that regulatory agencies understand

that you have a need for improving your process.

Even if you improve your process, you are changing your process.

They recognize the need for that, but they also recognize

that the therapeutics that you're making from that process

should be similar in terms of product quality.

You're using these in clinical trials.

What does it mean to be similar?

That doesn't say that they have to be exactly the same,

but that they have to be similar or comparable.

In terms of me saying that something is similar,

what I want to do is make sure that I have some similarity condition.

That's the whole point of comparability.

For very low-risk attributes,

what I can do is show that process A and process B

are similar in side-by-side plots.

For higher-risk attributes,

what I want to do is maybe something like a quality range.

For quality ranges,

I just take that reference group, the old process,

and I build some range around it

and ensure that all of the measured quality attributes

from the new process fall within that range.
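As a concrete sketch of that quality range (my notation here; the multiplier c is an illustrative assumption, since I'm not specifying one in the talk), a common construction bounds every new-process result by the reference mean plus or minus some number of reference standard deviations:

```latex
\bar{x}_{\text{ref}} - c\, s_{\text{ref}} \;\le\; y_{\text{new},i} \;\le\; \bar{x}_{\text{ref}} + c\, s_{\text{ref}} \qquad \text{for every new lot } i
```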

For very high-risk attributes,

what I want to do is equivalence testing.

This is what I'm going to focus on today,

tell you about what equivalence testing is,

and how that acceptable difference or that similarity condition is set.

It's called equivalence testing, or the Two One-Sided t-Test (TOST).

To reiterate what we talked about before,

whenever I'm using design of experiments in process development,

what I do is change some variable, like temperature,

from low to high,

and measure the effect on my critical quality attributes.

I am assuming in the null hypothesis that they are the same.

What I do is I set up a design to see if they're different.

A Type I error in that case would be me saying,

"Wow, they're different " when , in fact, they're not.

That would mean that I would control that.

I would spend resources controlling that in the process.

If I'm a drug manufacturer,

I do not want to control things that I don't need to.

I'm concerned about that Type I error rate.

If I was a regulatory agency,

I would be even more concerned with the Type II.

A Type II error says there's no difference when, in fact, there is.

You should be controlling something and you're not.

If I was a regulatory agency,

I'd be more concerned with the Type II error.

Now, we're going to flip it.

We're going to talk about equivalence testing.

With equivalence testing, I'm not saying that they are the same.

I am assuming that there is a difference.

I just want to make sure that the difference isn't too big.

That "too big" I'm going to call delta.

There's a lot of different ways to calculate that delta.

I'm going to call it d, or that delta right there,

often called the equivalence acceptance criterion.

I would like it to come from subject-matter expertise,

but the majority of the time, it comes from taking

some k-value times a historical standard deviation.

That's split into two different tests.

In one, I'm determining if the difference is less than positive d.

In the other one, I want to show in the alternative hypothesis

that the difference is greater than negative d.

I'm testing two different sides.

That's what's called the left-hand side on the bottom,

or the top.

In terms of, if I was a drug manufacturer, what would I want to do?

I would want to be able to reject both of those hypotheses.

I would want low Type II error and high power.

This is equivalent to taking a 90% confidence interval

around the difference in means and ensuring that that 90% confidence interval,

whenever I'm looking at the low and high, is within the bounds of that lower delta

and the upper delta.
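Written out, that is the standard two one-sided tests formulation, with delta set as described above:

```latex
H_{01}:\ \mu_B - \mu_A \le -\delta \quad \text{vs.} \quad H_{a1}:\ \mu_B - \mu_A > -\delta
H_{02}:\ \mu_B - \mu_A \ge +\delta \quad \text{vs.} \quad H_{a2}:\ \mu_B - \mu_A < +\delta
```

Rejecting both null hypotheses at alpha = 0.05 declares equivalence, and that is exactly the condition that the 100(1 - 2 alpha)% = 90% confidence interval for the difference in means lies inside (-delta, +delta).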

If you're looking at this, you should think to yourself,

"I want the width of that confidence interval to be very small."

What are the different ways that I could make the width

of that confidence interval

for the difference between those two means very small?

I could decrease my standard deviation.

That's a good thing.

I could increase my sample size.

That's a good thing.

I could also increase my alpha level.

Maybe that wouldn't be so good because what you're doing

is you're inflating your Type I error rate.

In inflating your Type I error rate, what can happen is

you state that they're equivalent when indeed they're not.

The different ways to control the width of that confidence interval

are to lower s, increase n, or increase alpha.

We talked about two of those being good and one of those not being good.

It makes sense that if I'm a drug manufacturer,

I want to maximize the power of the design.

That's the flip. I want to minimize my Type II error.

Regulatory agencies want to make sure that you do not inflate

that Type I error rate.

That Type I error would be you assuming equivalence,

or stating equivalence, when indeed the processes are not equivalent.

In JMP, you can do these equivalence tests,

and I want to show you an example of that.

In my journal, the first thing I want to do

is show you that, in terms of determining

your Type I error rates and your Type II error rates,

JMP provides power curves

under Sample Size Explorer, Power, Two Sample Independent Equivalence.

Caleb King did an awful great job with this.

I say awful great job, but he did a great job with this.

Let's just say that my margin, my equivalence acceptance criterion,

is plus or minus 2 standard deviations.

I'm just going to put a 2 here,

and that's just the 2 standard deviations that I'm talking about.

That's all that I'm doing.

Let's just say that in my historical process

I have 10 lots,

and I'm going to compare it to a new process that has 5 lots.

I want to see what the power is if they are exactly the same,

that is, if there's no difference between them.

A few things that I want to point out here: JMP gives you those power calculations.

The other thing that it does is it allows you to change those.

What's going to happen if I do things like increase

the number of samples in my new process?

My power is going to go up.

What would happen if I do things like,

"Hey, Heath, I want to decrease that margin of error

to, instead 2 standard deviations , to say maybe 1.5 standard deviations,

essentially , as I'm taking those boundaries

and I'm tightening them up."

What I see is my power is going to go down.

I'm able to ask myself all those typical questions

that you would in equivalence testing.

Something else that I want to show you that's going to come up:

JMP has the ability to ask,

do I know the true standard deviation or not?

If I know the true standard deviation, that is going to be better.

You're going to see that your power goes up.

Indeed, what happens is my power goes up.

That's usually not the case.

I always call that known case the utopia,

versus the case where I do not know what that true standard deviation is.

I always call the "yes" the optimum, the utopia,

and I always call the "no" the realism.

I would be remiss

if I did not show you the tools that JMP does have

for showing that equivalence,

like if I had a historical process where I had 10 lots and I made 5 new ones.

The first thing I want to do

is look at this through Graph Builder,

and I see that there is no difference between those two.

I can see both of those, and they both look like

they came from the same process, the blue versus the red.

How about if there is an effect? What I see is a shift.

Just like I showed you before, that is the Two One-Sided t-Test.

JMP has tools for that.

Jin Feng did a great job with this. My goodness. I love the scores plot.

Here's the difference in means. Here's the lower, and here's the upper,

and that's within the boundaries.

In that case, what you've done is you've rejected both of the null hypotheses

in favor of the alternatives,

which is the same as what you see in the picture.

What you also see here is that if there is an effect,

I am not going to reject both of the nulls.

One of those I am going to fail to reject, and indeed I did.

What you'll see is my confidence interval outside that boundary.

I would like to talk about a very specific case.

A very specific case in cell therapy is called the split-apheresis design.

A split-apheresis design is a situation where,

in cell therapies, you're changing the process.

What you do is use donor material split

between the two different processes.

We kept getting questions over and over and over again

from our customers about,

"Can I look at the sample size and power calculations

for these pair of designs ?"

Cannot overlay them .

You cannot see if they're dependent upon that donor -to-donor variability?

Let's talk about a split-apheresis design.

For a split-apheresis design, the first thing I want to do

is tell you about the regulatory expectation.

There is even a recent draft guidance document from the FDA

from July of 2023, just last month.

In that, they said that you need to select a suitable statistical test

for analysis of differences between paired data,

where those donors are paired up.

That's where the split-apheresis design comes from.

For every single donor material that you have,

you split it between process A and process B.

This is not two independent t-tests.

What this is, is a paired design.

That's the first thing that I wanted to talk about.

The second thing I wanted to talk about is

that you are very often in early stage,

so you do not have a lot of donor material,

so you have very low sample sizes.

It's hard to get power out of low sample sizes.

The third thing that I'm going to tell you is,

how do you come up with your EAC?

How do you come up with your similarity condition,

that difference, that acceptable difference?

What you do is use historical data that is made from multiple donors.

You take the standard deviation of that historical data.

I'm going to call the number of those historical lots n1.

You take some k number of standard deviations of the historical data.

You do a test using the split-apheresis design,

and judge it against that criterion from the historical data.
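In symbols (my notation, chosen to match the script parameters that come up later): with n1 historical lots giving a standard deviation s_hist, and n2 donors each split across the two processes giving paired differences,

```latex
\delta = k\, s_{\text{hist}}, \qquad d_i = y_{i,B} - y_{i,A}, \qquad
\bar{d} \pm t_{1-\alpha,\; n_2-1}\, \frac{s_d}{\sqrt{n_2}}
```

and the paired test passes when that 90% interval (alpha = 0.05) falls entirely inside (-delta, +delta).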

These are two examples that I want to show you.

The first example here

is where you're looking at process A and process B.

What you see is six different donors here.

What you see in the one on the left is that the majority of the variation is coming

from donor-to-donor variability,

not the difference between process A and process B.

You have high donor-to-donor variability.

I'm going to call that proportion rho.

In the case on the right,

the majority of the variation is coming from the difference

between process A and process B, not the donor-to-donor variability.

The majority of the variation is coming from the analytical method or the process.

What that tells you is you have very low rho.

You'd have low donor-to-donor variability.
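In symbols (my notation, consistent with how rho is used in the rest of the talk):

```latex
\rho = \frac{\sigma^2_{\text{donor}}}{\sigma^2_{\text{donor}} + \sigma^2_{\text{process}}}
```

where the process component collects the analytical and process variability, so the left panel corresponds to rho near 1 and the right panel to rho near 0.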

I'm going to show you a series of scripts that we worked on.

These are typical questions that came from our customers.

In our cases, we do not know what the standard deviation is.

How does that compare to the known case?

How about those Type I and Type II error rates?

Remember, if I'm a drug manufacturer, I want to increase the power.

If I'm a regulatory agency,

I want to make sure that you do not inflate that Type I error rate.

How are we going to do this?

This is from the European Medicines Agency, 2001.

The best way to do that is with things

called expected operating characteristic curves.

Those give you power on the y-axis and a shift in the mean on the x-axis.

I'm going to go through a series of scripts,

and this series of scripts...

It's really one script that I have right here

that's going to allow me to change things like that rho,

that proportion of donor-to-donor variability.

That k-value, remember, how do I set the acceptance criterion?

It is k times that standard deviation.

The typical way of doing this is k times the standard deviation of those historical lots.

This is the number of historical lots that you use, n1.

n2 is the number of lots that I'm going to use for that paired design.

Whenever you run the script,

it does a series of simulations.

In this case, it did 5,000 simulations,

and it calculates the power for you.

In those 5,000 runs, what percentage of those passed?

It looks something like this. It gives you a lot of different options.

My goodness. I can look at different k-values.

I can look at a different number of n1, the historical lots.

I can also look at a different number of n2, the paired lots.
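To make the mechanics concrete, here is a minimal JSL sketch of this kind of simulation. It is a hedged reconstruction, not the exact script we used; the parameter names and the simple variance model are illustrative assumptions.

```jsl
// Hypothetical sketch of a split-apheresis equivalence power simulation.
// Not the actual script; names and the variance model are assumptions.
nSim = 5000;   // simulated studies
n1 = 10;       // historical lots used to set the EAC
n2 = 6;        // donors split across the two processes
k = 2;         // EAC multiplier: delta = k * s(historical)
rho = 0.9;     // proportion of total variance from donor to donor
alpha = 0.05;
shift = 0;     // true process difference, in total-SD units

sdDonor = Sqrt( rho );     // total variance scaled to 1
sdProc = Sqrt( 1 - rho );  // analytical + process component

nPass = 0;
For( i = 1, i <= nSim, i++,
	// EAC from an independent draw of historical lots
	hist = J( n1, 1, 0 );
	For( j = 1, j <= n1, j++,
		hist[j] = Random Normal( 0, sdDonor ) + Random Normal( 0, sdProc )
	);
	eac = k * Std Dev( hist );

	// Paired differences: the donor effect cancels, the process noise does not
	d = J( n2, 1, 0 );
	For( j = 1, j <= n2, j++,
		d[j] = shift + Random Normal( 0, sdProc ) - Random Normal( 0, sdProc )
	);
	se = Std Dev( d ) / Sqrt( n2 );
	hw = t Quantile( 1 - alpha, n2 - 1 ) * se;  // 90% interval half-width

	// TOST passes when the 90% interval sits inside (-EAC, +EAC)
	If( (Mean( d ) + hw < eac) & (Mean( d ) - hw > -eac), nPass++ );
);
Show( nPass / nSim );  // estimated probability of claiming equivalence
```

Sweeping shift across a grid and plotting the pass rate against it gives one expected operating characteristic curve; repeating that for different rho, k, n1, and n2 values gives the overlays shown next.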

Right now, I want to talk about...

Whenever I do this, I can select

which of these different cases I want to look at

to be able to answer typical questions.

Let me open up my typical comparisons here.

The first one I want to talk about is,

"Heath, what if I have a known standard deviation?"

It looks something like this.

That's what the known standard deviation case looks like.

A few things that I want to point out:

this is the percentage of the time that you're going to claim equivalence.

If they're exactly the same, you're going to claim equivalence

a high percentage of the time.

If there's a huge difference between them, like a two-standard-deviation shift

or a three-standard-deviation shift, you're not going to claim equivalence.

That's a good thing.

The other thing that I want to show you here

is this alpha of 0.05.

Given that I set my k-value at 2,

k standard deviations based on 10 historical lots,

the standard deviation of those 10 historical lots,

you would expect that the alpha level would be 0.05,

the exact alpha level that I set in my equivalence test.

Right now, the thing that I want to show you

is that this is for a proportion of donor-to-donor variability of 90%.

What happens if I change that?

What happens if I change that to 60%?

What happens if I change that to 30%?

What if there's no donor-to-donor variability?

What you see is that for the paired test, the power curve looks really good

whenever I have high donor-to-donor variability.

The other thing that you notice with the known standard deviation

is that the alpha level, regardless of the operating characteristic curve,

is always at 0.05.

Let's talk about some other typical questions.

One typical question is,

how does it compare for the different levels of rho?

How does my typical way of doing this compare?

I do not know what the standard deviation is.

My typical way of doing this is in the blue.

The known standard deviation is in the red.

One thing that I want to point out

is this one right here.

What you see for the preferred approach,

the approach that even regulatory documents have said you should use,

the paired approach,

using the standard deviation

that is calculated from my historical lots,

is that I have an inflated Type I error rate.

This should be 0.05 just like it is here.

That was really strange to us, and we looked into this.

When we looked into it, what we found is,

it has everything to do with this right here.

The reason it has everything to do with this right here is,

as I said, how do I decrease the width of that confidence interval?

The ways to decrease the width of that confidence interval

were either to decrease s, or increase n, or increase my alpha level.

Understand this.

This is why you have an inflated Type I error rate

with this paired test:

those deltas, which you're using to judge this against,

are based on the standard deviation of historical data

that contains donor-to-donor variability.

That confidence interval right there

does not contain donor-to-donor variability.

Why? Because you did a paired test.

It contains only analytical and process variability.

That's where that inflated Type I error rate comes from.
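In the earlier notation, the mismatch is easy to write down (a sketch of the bookkeeping, not a full derivation):

```latex
\delta = k\, s_{\text{hist}} \approx k \sqrt{\sigma^2_{\text{donor}} + \sigma^2_{\text{process}}},
\qquad
\operatorname{Var}(d_i) = \operatorname{Var}(y_{i,B} - y_{i,A}) = 2\,\sigma^2_{\text{process}}
```

The goalposts grow with the total variance while the paired interval width grows only with the process variance, so the larger rho is, the wider the goalposts are relative to the interval being judged against them.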

Using this paired approach, understand that

you have an inflated Type I error rate.

We see that, and it's even more prevalent

when you have high donor-to-donor variability.

Why? Because if you have low donor-to-donor variability,

that process variability

is the largest variance component that you have.

Let's look at a few more questions that you have.

As I said, this one script answers these different questions.

This is answering the question,

"Hey, Heath, if I use that paired approach that's recommended,

can I look at what happens as I increase sample size

from 3 to 4 to 5 to 6?"

There are two things that I want to point out here.

Number one, what you see is that as I increase sample size,

I'm going to have higher power.

I still do not have adequate power if there's no donor-to-donor variability,

that is, if I have 0 donor-to-donor variability.

I would need at least a sample size of 8, that is, 8 different donors.

If I do have high donor-to-donor variability,

like 0.9, 90% of the variability, what you see is that I do have high power

when there is no difference between the means.

What I can do is make sure to answer those questions

with overlaid operating characteristic curves for different sample sizes.

I can also answer the question, "Hey,

I've set my different sample sizes,

but what if we look at different k-values?"

Understand that your acceptance criterion is k number of standard deviations.

What's going to happen is that the acceptance criteria,

those what I call goalposts, are going to widen as you increase k.

Therefore, you're going to have a much higher ability

to pass equivalence,

and you're going to have much higher power.

Another typical question is this.

What if I want to change both of those together?

I'm a big fan of Graph Builder.

What you're looking at here in Graph Builder

is not only,

"Hey, Heath, I am increasing sample size: in blue, that would be 3,

in red, that would be 4,

in green, that would be 5, and in purple, that would be 6,"

but I also looked at it for different k-values.

What would your operating characteristic curves look like?"

Good?

I want to revisit this.

Just like I said before, I said, "Hey, I want to revisit this

and show you that for..."

Whenever I have a large proportion of donor-to-donor variability,

I said, "What you see for k of 2 right here: I would expect my alpha level,

the proportion of the time that I pass this test, to be 0.05."

But what you see is that you have an inflated Type I error rate.

How does this look?

Whenever I'm looking at a rho

or a proportion of donor-to-donor variability

that is very small,

I do not have much power.

The question was, what if we did this instead?

What if, when we had low donor-to-donor variability,

we used information from those historical lots?

If I have no donor-to-donor variability or very low donor-to-donor variability,

why couldn't I just do an independent t-test,

where for process A, my historical process,

I consider not just the paired lots

but also those 10 historical lots,

and compare that to the mean of the new process?

We wanted to see how that compared.

Doing it that way, the independent test is in the red.

The paired approach is in the blue.

What you see is, if I have little to no donor-to-donor variability

in my cell therapy split-apheresis process,

the independent t-test

has a much better profile than the paired approach.

However, if I have high donor-to-donor variability,

that paired approach in the blue

has a much better operating characteristic than the red.

Right now, the question is, instead of just automatically

doing that split-apheresis paired design,

maybe it would be better

to make a decision based upon that donor-to-donor variability.

How does this compare whenever I'm looking at different k-values?

I see the exact same thing, the exact same phenomenon:

with low donor-to-donor variability,

it makes sense to do the independent t-test.

With high donor-to-donor variability, I have a much better

operating characteristic curve,

or higher power, associated with the paired approach.

It doesn't matter if I looked at a k of 1.5, or 2, or even 3.0.

Regardless of the k-value,

I have a much better operating characteristic curve

if I consider that donor-to-donor variability.

What if I looked at different numbers of those paired lots?

We looked at 3.

We looked at 4.

We looked at 5 paired lots. We looked at 6 paired lots.

Regardless, you see the same phenomenon.

We're currently writing a paper on this

to try to propose

that if you have low donor-to-donor variability,

maybe it does not make sense for you to use a split-apheresis

or a paired analysis approach.

Maybe the approach is only good

whenever you have high donor-to-donor variability.

These are typical questions that are asked about split-apheresis designs.

What I want to do is just cover

two or three more of these,

just to show you a few other things that you could do.

These are different things that we were looking at.

We looked at, "Hey , how does the operating characteristic curve,

how does that compare if we looked at in the blue

that's using nothing but the historical lots

to estimate the standard deviation

versus if you use the paired and the historical lots,

which is in the red?"

What you see is there's not much difference between these two,

especially if I'm using higher sample sizes for n2.

We also looked at

estimating that standard deviation

a few different ways:

using the historical lots,

which is in the blue, versus, in the red,

using the historical lots and the paired lots.

I compare the independent case versus the paired case.

What do I see?

As I said before, you see that exact same phenomenon:

with low donor-to-donor variability,

the much better way of doing this would be an independent t-test.

In the lower right-hand corner,

that is where you have high donor-to-donor variability.

There, it makes sense that we would use the paired approach.

The last one that I want to show you

is something that we've been working on.

We looked at the paired approach versus the independent.

The paired approach is in the blue.

The independent is in the red.

I've said this over and over and over again:

it makes sense that if I have low donor-to-donor variability,

the independent case, in the red, looks much better.

If I have high donor-to-donor variability, the paired approach looks better.

But one thing that we did is we took a look and just said,

"What if I took the approach that gave me the shortest

width for that confidence interval?"

That's in the green.

What you see is that usually gives you the best approach

regardless of what your rho is

or what your proportion of donor-to-donor variability is.
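As a hedged sketch of that shortest-interval rule (my construction, not the exact script, and simplified to equal group sizes rather than the unequal historical-versus-paired comparison in the talk):

```jsl
// Hypothetical shortest-interval rule: build both 90% intervals for the
// process difference and judge equivalence with whichever is narrower.
shortestIntervalTOST = Function( {yA, yB, eac},
	{Default Local},
	alpha = 0.05;
	n = N Rows( yA );
	// Paired half-width: differences within each donor
	d = yB - yA;
	hwPair = t Quantile( 1 - alpha, n - 1 ) * Std Dev( d ) / Sqrt( n );
	// Independent half-width: pooled two-sample standard error
	sp2 = (Std Dev( yA ) ^ 2 + Std Dev( yB ) ^ 2) / 2;
	hwInd = t Quantile( 1 - alpha, 2 * n - 2 ) * Sqrt( 2 * sp2 / n );
	hw = Min( hwPair, hwInd );
	dbar = Mean( yB ) - Mean( yA );
	// Pass when the chosen interval sits inside (-EAC, +EAC)
	(dbar + hw < eac) & (dbar - hw > -eac);
);
```

The appeal of the rule is that it adapts on its own: when donor-to-donor variability dominates, the paired interval is the narrower one, and when it is negligible, the independent interval wins.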

In closing,

I would like to just point out a few things.

This script that we have answers the typical questions

that our customers have about operating characteristic curves

associated with these split-apheresis designs.

What I do want you to take away from here, though,

is that if you do have a low proportion of donor-to-donor variability,

you'll see that these designs are very underpowered

for fewer than 8 lots, fewer than 8 different donors' material.

We live in a world in cell therapies

where you do not have a lot of donor material,

so you have very low sample sizes.

It would be much more efficient if you had low donor-to-donor variability

to use the independent case.

We do have other revisions that we made on this,

where if you were able to make multiple lots

for those paired approaches with the same donor,

or if you're able to take multiple measurements,

you're able to look at those operating characteristic curves.

Thank you.
