Time-Efficient Strategy for Selecting a Test Set in the Validation of an Image Detection Algorithm - (2023-US-30MP-1362)

Caroll Co, Statistician, Social & Scientific Systems
Sandra McBride, Principal Statistician, Social & Scientific Systems Inc., A DLH Holdings Corp. Company
Shawn Harris, Statistician, Social & Scientific Systems Inc., A DLH Holdings Corp. Company
Debra A. Tokarz, Senior Pathologist, Experimental Pathology Laboratories, Inc.
Thomas J. Steinbach, Vice President/Senior Pathologist, Experimental Pathology Laboratories, Inc.
Mark F. Cesta, Pathologist, National Institute of Environmental Health Sciences
Helen Cunny, Toxicologist, National Institute of Environmental Health Sciences
Keith R. Shockley, Staff Scientist, National Institute of Environmental Health Sciences

Advances in digital image analysis have created opportunities for quantitative histopathology assessments in rodent toxicology studies. Microscopic evaluation of rodent spleen is performed to assess for test article-induced immunotoxic effects but can be subject to inter- and intra-pathologist variability in characterization of differences between treatment groups and across studies.

 

To address this problem, an image detection algorithm was trained to quantify tissue compartments in histologic sections of rodent spleen. Our aim was to design a study to evaluate how the image detection algorithm compared to digital annotations performed by human raters for specific features of the spleen, while keeping within operational constraints (e.g., rater time and effort).

 

In this talk, we show how we used JMP Custom Designer, using data generated by the image algorithm as inputs, to select and allocate a test set across human raters. We used a response surface model, which is designed to select samples that fall on the boundaries and center of the input space. The resulting study design allowed us to strategically select a test set and create a balanced sampling plan for use across several pathologists from different institutions.

 

 

Hello everyone. I'm Caroll Co. I'm a statistician at DLH. Today, I will talk about a project I worked on, creating a time-efficient strategy for selecting a test set in the validation of an image detection algorithm. This work was done in collaboration with scientists from the National Institute of Environmental Health Sciences, pathologists from Experimental Pathology Laboratories, and my fellow coworkers at DLH.

In rodent toxicology, advances in digital image analysis offer opportunities for quantitative histopathology assessments. An example of where digital image analysis could be useful is in the evaluation of test article-induced immunotoxic effects in rodent spleens. Typically, pathologists use a microscope to evaluate whether there is immunotoxicity in the spleen and make judgments on whether spleens from animals that received a treatment are different from the control group. This workflow is prone to inter- and intra-rater variability in characterizing differences within a study and also across different studies.

Here's an example of a zoomed-in cross-section view of a rodent spleen. I'm just pointing out specific features of interest to our collaborators that we want to capture.

Our problem: when we got involved in this project, our pathologist collaborators had already trained an algorithm that measures these features of interest. The question posed to us was, how do we validate this algorithm? Validation is a very broad term. To narrow our focus, we thought about the main questions we wanted to address. First, the algorithm was trained by a select few people, so we wanted to see whether different pathologists from different laboratories would agree with the algorithm output. Second, there were multiple features that the algorithm was trained to measure, and we wanted to see if there are any specific features where performance is better or worse. Then, we also wanted to see if this algorithm can hold up against a wide range of cases. If there are any blind spots, can we find them? I'll also mention that from here onwards, I will refer to the image algorithm as the AI.

One solution we thought of was to have both humans and the AI annotate the same images and compare the output. This is how it would work. A tissue sample gets scanned so it turns into a digital image, or whole slide image (WSI), and that image gets fed into the AI for processing. The humans view the same image using image software where they can manually annotate the features of interest.

One of the questions we got asked was, how many images do we need to validate? As statisticians, our answer is always as many as you can: do you have hundreds? Do you have thousands? But after talking to the pathologists, we realized there were a number of operational constraints on implementing this.

The first constraint was that each image needed to be evaluated by three different people. Having three people was useful so that we could also estimate the variability between raters. It also turns out that this annotation process is very time-consuming; after talking to the pathologists, the maximum number of images they were willing to annotate was about 24 per person. Lastly, one of the goals of this project was to get buy-in, or support, from pathologists at other labs. That meant we needed participation from multiple labs and multiple people from each lab. In this study, we got participation from three centers, with three pathologists representing each center, so in total we had nine pathologists recruited for the study. Based on all of these constraints, we determined that we could validate only 72 images.

The sampling plan. Now we know our sample size; how do we select the 72? Random selection is okay, but can we do better? What we came up with was this: since the cost per image for the AI is relatively low, we had the AI process a larger set of images and then used information from the AI output to better select our 72 samples. We used a response surface model (RSM) to select our points. An RSM will select points towards the boundaries and center of the input space; it's a model containing your main effects, two-way interaction effects, and quadratic effects. This model was particularly useful for our validation problem because we wanted to look for areas where agreement fails between humans and the AI. Generally, bugs tend to occur on the boundaries and edges of the space, so this type of model fits the problem that we had.
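As a rough illustration of the idea (outside of JMP), here is a minimal Python sketch of selecting a test set from AI-scored candidates by greedily maximizing the determinant of the RSM information matrix. This is not JMP's Custom Design algorithm; the column names (feature_1 through feature_4) and the greedy exchange are assumptions for the example. It only shows why an RSM criterion pushes picks toward the boundaries and center of the input space.

```python
import numpy as np
import pandas as pd
from itertools import combinations

def rsm_model_matrix(df, factors):
    """Intercept, main effects, two-way interactions, and quadratics."""
    cols = [np.ones(len(df))]
    cols += [df[f].to_numpy(float) for f in factors]
    cols += [df[a].to_numpy(float) * df[b].to_numpy(float)
             for a, b in combinations(factors, 2)]
    cols += [df[f].to_numpy(float) ** 2 for f in factors]
    return np.column_stack(cols)

def greedy_d_optimal(candidates, factors, n_runs):
    """Greedily pick n_runs candidate rows that maximize log|X'X| for the RSM."""
    X = rsm_model_matrix(candidates, factors)
    ridge = 1e-8 * np.eye(X.shape[1])      # keeps early iterations non-singular
    chosen = []
    for _ in range(n_runs):
        best_i, best_logdet = None, -np.inf
        for i in range(len(candidates)):
            if i in chosen:
                continue
            Xt = X[chosen + [i]]
            _, logdet = np.linalg.slogdet(Xt.T @ Xt + ridge)
            if logdet > best_logdet:
                best_i, best_logdet = i, logdet
        chosen.append(best_i)
    return candidates.iloc[chosen]

# candidates = pd.DataFrame of AI output with columns feature_1 ... feature_4
# test_set = greedy_d_optimal(candidates, ["feature_1", "feature_2",
#                                          "feature_3", "feature_4"], n_runs=72)
```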

Now we have a plan to select the images. The question now is, how do we allocate these 72 images across nine raters? We expected the samples to have a wide range of complexity, and we wanted to make sure that everyone got a balanced mix of slides. This complexity, or case mix, is determined by the output we got from the AI. We also needed to satisfy the constraint that each image is seen by three different people.

Here is my workflow. I will show you how I created the sampling plan that satisfied all of our operational constraints in JMP. There are three steps in this workflow. First, I'm going to show you how we selected the 72 images from a larger set. Second is how we replicated the 72 images three times, so we have 216 runs; we needed to do this because we wanted each image to be seen by three raters. In the last step, I'll show you how we allocated the 216 runs across nine raters so that each person gets exactly 24 images. In each of these steps, I will be using the DOE platform.

Now I'm just going to move over to JMP. I have my JMP journal here. I am going to open a sample data set, and I say sample because our data has not been published yet, so for the purposes of this demonstration I will be using a fake data set. This data set has the same features as the original data that we collected.

This is what the data looks like. I have my slide ID, which is just a numeric variable going from 1 to 100. I have four variables collected by the AI: features 1, 2, 3, and 4. Three of them are continuous and one is a count variable. If you go under Analyze, Multivariate Methods, Multivariate, and show the scatterplot matrix of all of the variables, you'll see that all of them are uncorrelated. This is the spread and range of our variables.

Step one. Under DOE, there's Custom Design; that's what I'm clicking. We can leave the response Y here alone. For the factors, there are actually two ways to do this. One way is to click on this button that says Add Factor; about a fifth of the way down there's a Covariate selection. If you click that, it'll ask you which columns of covariates you want to include in the design. In my case, it's these four features, so I'm just selecting them and clicking OK. JMP will automatically populate the min and max for each of these variables, and you'll see that they're all treated as covariates. That looks good.

I'll just quickly close this and show you another way of doing the same thing. Again, in DOE, Custom Design: instead of adding the factors the way I just showed you, you can use the button that says Select Covariate Factors, and it'll give you the exact same thing. Again, I'm picking the factors that I'm interested in, and it automatically populates the Factors window. I'm just scrolling down to the bottom and clicking Continue.

This is where we can put in the model that we want. For us, we wanted an RSM, or response surface model. There is again a shortcut button right here, so I'm just going to click that. We already have our main factors in there, and what that did is add our quadratic and two-way interaction effects. Then, lastly, at the bottom where it says Number of Runs, we don't want 100, because we actually just want to select 72 out of the 100, so I'm going to change this to 72. This all looks good to me, so I'll click Make Design and give JMP a few seconds to create the design for us.

Here we go. This is the design that it created for us. As you can see, if you scroll down, it only gave us 72 rows, which is what we had asked for. I'm going to turn this into a data table by hitting Make Table right here on the bottom left, and then close this for now.

I'm  going  to  close  this  for  now.

Now,  before  I  show  you what  this  design  looks  like,

I  actually  want  to  go  back   to  the  original  table,

just  so  you  could  see...

We  can  compare  what  happened to the observations that were picked

versus  observations  that  were  not  picked.

If  you  go  back  to  the  original  data  table, the  ones  that  were  chosen,

that  72  are  actually  highlighted.

What  I'd  like  to  do  at  this  point is  I'd  like  to  create  a  new  column

that  would  identify  which  rows   were  selected  and  which  ones  weren't.

To  do  that,  you  go  under  Rows.

Under  Row  Selection,

there  is  the  last  option  in  here says  Name  Selection  in  Column.

What  it'll  do  is it'll  label   the  currently selected rows

and  save  whatever  values   you  assign  for  that  column.

I'm  going  to  click  that.

The  column  name.

Y ou  get  to  create   a  name for that column.

T he  rows  that  are  highlighted,   I'm  going to give them a value of 1.

The  ones  that  were  not  selected, I  want  to  give  them  a  value  of  zero.

I'm  going  to  press O kay, and  that  created  this  column  right  here.

All  of  my  wants,  there's  72...

There's  72  rows  for  the  ones   that  were selected

and  then  28  for  the  ones that  were  not  selected.
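For readers following along outside of JMP, a pandas equivalent of this selection flag might look like the short sketch below (assuming `candidates` is the 100-slide table and `test_set` is the 72-row selection from the earlier sketch; the column names are assumptions).

```python
# Flag which of the 100 candidate slides made it into the 72-image test set.
candidates["selected"] = candidates["slide_id"].isin(test_set["slide_id"]).astype(int)
print(candidates["selected"].value_counts())   # expect 72 ones and 28 zeros
```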

I want to go back to my scatterplot matrix to see which observations were picked and which ones weren't. But before we do that, I want to color-code them so that when you look at the graph, you can immediately spot which ones were picked. A quick way of doing that is Rows, Color or Mark by Column. Here, I want to use this column called Selected, so that all the zeros will be identified by an orange circle and all the ones by blue. We'll use a marker as well, which might make it easier to identify the observations. I'll just click OK, and now all of my rows are marked by the pluses or the circles.

If I go back to my scatterplot matrix (again, it's under Multivariate Methods, Multivariate, and I just hit Recall to do the same thing as I did initially), click OK, and make this a little bit bigger, you'll now see that the blue pluses are the observations that were selected, and the orange circles are the ones that weren't. The ones that were selected tend to occur more on the boundaries and edges of our space. This is exactly what we wanted; this looks great to me. Let's close this.

Just to convince yourself that the model is doing what it's supposed to be doing, I did the same setup, but this time, instead of picking 72, I'm only picking 24, to make it a little bit more extreme and show you an example of what that looks like. Now there are only 24 rows selected. Let's do the same thing and make this a little bit bigger. You see fewer blue pluses, because there should only be 24, but you'll see how those observations are getting picked versus the ones that were not picked: they do tend to occur more on the outer boundaries, but still some in the center.

Let's go back to our original problem; we'll just close this.

This was our original data table, and the one created by the design is this one. What do we have here? We're still keeping the same variables that we asked JMP to include: features 1, 2, 3, and 4. Our response Y is still missing. We now have this new column that says Covariate Row Index. Basically, this just links you back to your original table: if it says 88, it is row 88 of your original table that got captured, so this row here should be the same as that row right there. Before we move on, I'm just going to rename this to Slide ID; in my case, the slide ID, which is just a number from 1 to 100, is actually the same as the Covariate Row Index. That was all for step one.

Step two is now replicating the 72 images. How do we do that? A quick way is to go to DOE; the second selection is Augment Design. A window will pop up asking for your responses and factors. We don't really have a Y response, but I think you still need to put something in there, so that's okay; you just put it in even though it's all missing values. Then, for the factors, we're going to select our features 1 to 4, and this time we also need to select our Slide ID. Click OK.

A new window pops up with all of our factors, and again JMP auto-populates the ranges of all of our variables. On the bottom, under Augmentation Choices, there's a Replicate button, and that's exactly what we need. Click that, and it'll ask how many times you want to perform each run. The default is two, but we actually want three, because we need each image to show up three times in our design. Then we click OK. Now you'll see that instead of just 72 runs, our design has 72 times three; scrolling all the way to the bottom, we now have 216 rows. I'll just click Make Table to turn that into a data table, and we'll close this.

We'll  close  this.

A  few  more  things  that  I  want  to  check.

A  couple  more  things  I  want  to  check before  we  move  on  to  step  three.

That  is,   every  time  you're  doing   these  steps,

you  want  to  make  sure  that  it's  actually doing  what  you  think  it  should  be  doing.

In  this  case,

I  wanted  to  check  that  each  slide  ID actually  occurs  three  times.

We're  going  to  use  tabulate  to  do  that.

Tabulate, S lide  ID,  and  I  should  have a  count  of  three  for  each  ID.

That's  what  we  have.

That  looks  great.

That's  it  for  step  two.
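Outside of JMP, the replicate-and-check step could be mimicked with a short pandas sketch like the one below (assuming `test_set` is the 72-row table from step one with a `slide_id` column; the names are assumptions).

```python
import pandas as pd

# Replicate each selected slide three times, analogous to Augment Design > Replicate.
runs = pd.concat([test_set] * 3, ignore_index=True)

# The Tabulate check: 216 runs in total, each slide ID appearing exactly three times.
assert len(runs) == 216
assert (runs["slide_id"].value_counts() == 3).all()
```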

Let's move on to the last step, which is the most exciting one. Now we have 216 rows and we have the slide IDs, and we want to distribute them to nine different people. How would we do that in a way that gives each person a balanced mix? A cool way to do it is to use DOE again: click DOE, Custom Design. Like before, we're still going to use our covariates as factors, so select features 1 to 4 and our Slide ID, and JMP will auto-populate the mins and maxes. All of the rows here are listed as covariates.

At this point, I actually want to add two more factors. One factor is a categorical factor with three levels, and that is the center, or laboratory, because we have nine pathologists but they're coming from three different centers; I'm just going to name the levels A, B, and C. Within each center, we also have three different people participating, so I want to add another categorical factor, again with three levels. This is going to be our rater, that is, the individual people; let's name the levels 1, 2, and 3. Basically, what I'm saying is that Center A will have Raters 1, 2, and 3, Center B has Raters 1, 2, and 3, and so on. In total, we have nine different people, nine different combinations of center and rater. I'm going to minimize this window; it just shows the data table where we pulled our covariates from. Then hit Continue.

Now we get to tell JMP what kind of model we want, and there are a couple of things to note here. JMP automatically puts in your main effects; these are all the factors in my model. I wanted to add an interaction term between center and rater, because I wanted to make sure that all combinations of center and rater appear in our design, and that guarantees it. The other thing to know is that we actually don't have enough runs to estimate a slide ID effect. There are 72 distinct slide IDs in here, but we don't actually want a slide ID effect; we just want JMP to take slide ID into account when it's constructing the design. Under Estimability, you can change the estimability for slide ID from Necessary to If Possible. Then, lastly, the number of runs: I think the number it calculated for me was 18, but we actually wanted to use up all of our runs, so I set this to 216 and let JMP tell us how to allocate those 216. Make Design.

This one might take a little bit longer, so while it runs I'll talk about how we check the design and a little bit about the run order. This is what the design looks like. We have our original features 1 to 4 in here, the slide ID, and now it has added a center and rater assignment for each slide ID. What this is saying is that rater B1 would have to annotate slide number 27, A1 will annotate slide 96, and so on. There should be 216 runs in here, and that looks okay.
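To make the allocation constraints concrete, here is a minimal Python sketch of the combinatorial part only: each slide goes to three distinct raters, and each of the nine raters ends up with exactly 24 slides. This is not what JMP's Custom Design does; the optimal design additionally balances the case mix through the center, rater, and center-by-rater terms, which this greedy assignment ignores. The variable names and the least-loaded heuristic are assumptions for the example.

```python
import random

raters = [f"{c}{r}" for c in "ABC" for r in "123"]    # A1 ... C3, nine raters
load = {p: 0 for p in raters}
assignment = {}                                       # slide_id -> three raters

slide_ids = list(range(1, 73))                        # stand-in for the 72 selected slide IDs
random.shuffle(slide_ids)
for sid in slide_ids:
    # Give the slide to the three least-loaded raters (ties broken at random).
    picks = sorted(raters, key=lambda p: (load[p], random.random()))[:3]
    for p in picks:
        load[p] += 1
    assignment[sid] = picks

assert all(n == 24 for n in load.values())                    # 24 images per rater
assert all(len(set(v)) == 3 for v in assignment.values())     # 3 distinct raters per slide
```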

The last part here, under Data Table Options, is a checkbox for Include Run Order Column. I'm going to check that because, for annotation, you might expect there to be some type of time effect in whatever process you're doing. In our case, we were worried about whether there would be a learning curve or a fatigue effect. We want to make sure that not everyone starts with the lower-numbered slide IDs and ends with the higher-numbered slide IDs.
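In the Python sketch, the same idea is just a per-rater shuffle of each person's 24 slides, so that slide number is not confounded with any learning-curve or fatigue effect (this assumes the `assignment`, `raters`, and `random` objects from the allocation sketch above).

```python
# Build each rater's slide list from the allocation, then randomize their run order.
slides_by_rater = {p: [sid for sid, picks in assignment.items() if p in picks]
                   for p in raters}
run_order = {}
for rater, slides in slides_by_rater.items():
    order = slides[:]           # copy so the allocation itself is left untouched
    random.shuffle(order)
    run_order[rater] = order    # the sequence this rater annotates, first to last
```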

I'm just going to click Make Table and close this window for now. This is what our design now looks like. Before we do our checks, I'm going to create a new column where I concatenate the center and rater: I highlight these two columns, right-click, and under New Formula Column, Character, I concatenate them with a comma. This is going to be our Center, Rater variable. There you go.

There are three things we're checking here. First, let's use Tabulate to make sure that each person has exactly 24 images. Tabulating by Center, Rater, I see 24 for each, so that's great. The other thing we want to check is that there are no repeats; for example, we don't want slide ID 1 being assigned to the same person twice, because that would not be fun for that person. If I do a crosstab of slide ID by Center, Rater, I should see just a column of 1s, which means each slide ID was assigned to three different people. I'm scrolling through this table to confirm that there are no 2s or 3s in here, and that looks great.
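The equivalent checks in pandas might look like the sketch below, assuming `design` is the 216-run table exported with Make Table and that it has `slide_id`, `center`, and `rater` columns (those names are assumptions).

```python
import pandas as pd

# Concatenate center and rater, like the New Formula Column step.
design["center_rater"] = design["center"].astype(str) + "," + design["rater"].astype(str)

# Check 1: each of the nine raters gets exactly 24 images.
assert (design["center_rater"].value_counts() == 24).all()

# Checks 2 and 3: no slide is assigned to the same rater twice,
# and every slide appears exactly three times overall.
xtab = pd.crosstab(design["slide_id"], design["center_rater"])
assert xtab.to_numpy().max() == 1
assert (xtab.sum(axis=1) == 3).all()
```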

Then, lastly, I want to check what the case mix looks like across these nine raters. One way I thought of to at least check that visually is to do a parallel plot using Graph Builder. I highlight features 1, 2, 3, and 4, because those are my original variables, drag Center, Rater onto the x-axis, and hit the parallel plot option on the top right icon. Actually, we don't want Center, Rater in there as one of the plotted variables, so I'll turn that off; what I want is Center, Rater as its own panel. What this shows us is the assignments for the nine people, A1 all the way to C3, and the case mix of the images they were assigned. Again, we're just checking this visually. What I'm looking for is that, taken as a whole, the panels all look about the same: the lines are somewhat blended, there's no clumping happening, and they look to be okay.

Another way to do it is to overlay them. It gets a little bit hard to read because I have nine different colors in here, but again, you're just looking to make sure there are no patterns, no clumps of green up here or down here or anywhere in this space. You could also check it by center; I guess my colors are not the best, with yellows, blues, and greens, but they all look to be well-mixed.
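A quick non-JMP version of this visual check could use pandas' parallel_coordinates, as in the sketch below (same assumed `design` table and `center_rater` column as in the previous sketch; matplotlib is assumed to be available).

```python
import matplotlib.pyplot as plt
from pandas.plotting import parallel_coordinates

# One line per run, colored by the rater it was assigned to; a well-mixed plot
# (no clumping of any one color) suggests a balanced case mix across raters.
cols = ["feature_1", "feature_2", "feature_3", "feature_4", "center_rater"]
parallel_coordinates(design[cols], class_column="center_rater", alpha=0.4)
plt.show()
```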

Then, I did mention the run order. The point is that when you're telling people how to do their annotation, you want that order to be randomized as well, so that if there is a time effect, it's already taken into account. What I'm plotting here is the slide ID, which is sequential from 1 to 100, against the run order, the sequence in which the pathologists would rate these images. This looks random. We can also plot it by center and rater, so this is for each individual person, and they look good. I'm just going to close that.

Now, I'll just go back to my PowerPoint slides and end with some conclusions. We can use DOE to select samples or test cases when you have prior information. If you have data or covariates that you can use to inform the selection, why not use them? A response surface model is advantageous if you're interested in the boundary or edge cases. The second point is that you can use the replicate option in Augment Design if you have a situation where you need multiple raters per sample. The design also gives you an opportunity to factor run order into your plan, which can be really useful if you expect there to be a time effect such as a learning curve or fatigue. I have a couple of links here to blog posts that discuss what a covariate is in design of experiments; please check them out if you want to learn more about this technique. Then, lastly, I just want to say thank you to all my collaborators who helped make this project possible. Thanks.