Hi, I'm Dave Trindade,
founder and owner of Stat-Tech, a consulting firm specializing
in the use of JMP software for solving industrial problems.
Today I'm going to talk about a consulting project
that I worked on over the last year
with a robotics company.
We're going to be talking about Drones Flying in Warehouses:
An Application of Attribute Gauge Analysis.
Attribute gauge analysis is typically applied to compare agreement,
or lack thereof, between two or more rating approaches to a problem.
For example, two inspectors may have differences of opinion
as to whether a part is conforming, call it a pass,
or nonconforming, call it a fail.
Based on consideration of specific quality indicators for individual parts,
how do we quantitatively measure the degree of agreement?
Let's actually start off with an example.
Let's say we have two inspectors,
Inspector 1 and Inspector 2,
and they are presented with a list of 100 parts,
along with the critical characteristics of those 100 parts,
and asked to determine whether each part should be classified as a pass or a fail.
I've summarized the results in the table on the right, a partial view of the table.
You see there are 100 rows in the table.
All variables are nominal. The first column
is the part number, 1 through 100.
The second column is the rating by Inspector 1,
either pass or fail, and the third column
is the pass or fail rating by Inspector 2.
Now, if we were not familiar with JMP's attribute gauge analysis program,
a first step that we could take would be
to look at the two classification distributions
and use dynamic linking to compare.
What I will do is go through the slides first,
and then, after I've covered a certain amount of material,
I will go into JMP and demonstrate the results.
For example, let's generate distributions
of the two columns, Inspector 1 and Inspector 2.
If we then click on the fail bar for Inspector 1,
you see mostly matches for Inspector 2,
but there are a few disagreements over here:
some parts where Inspector 1
classified it as a fail, but Inspector 2 classified it as a pass.
Now, when you do click on that, JMP highlights
the actual rows that correspond to those parts.
You can see over here, for example, row four, Inspector 1 called it a fail
and Inspector 2 called it a pass.
Generally though, they're mostly in agreement,
fail, failed, fail, fail and so forth.
We could also do that by clicking on Inspector 2 fail,
and then seeing how it compares to Inspector 1.
We see that there are actually five instances of disagreement
of this kind between the two inspectors:
five parts where Inspector 1 classified it as a fail
but Inspector 2 classified it as a pass.
Now we can also visualize the inspector comparison data.
To do that, we can use graph builder with tabulate to view agree and disagree
counts between the two inspectors. Here's one way of visualizing it.
We can put Inspector 1 using graph builder on the horizontal axis
and Inspector 2 on the vertical axis.
Then we see, with color coding, whether it's agreement or disagreement:
agree is green, and the rows
where they disagree are color coded with red markers.
Now we can see the actual distribution.
Then we can use tabulate to actually total the numbers that are involved.
We can see over here for Inspector 1 and Inspector 2.
Inspector 1 and Inspector 2 agreed on the fail categorization
for 42 of the parts and they agreed on 44 of the pass parts.
They disagreed on nine instances over here
where Inspector 2 called it a fail and Inspector 1 called it a pass,
and on five instances where Inspector 2 called it a pass and Inspector 1 called it a fail.
So those total 14.
The inspectors agreed on a classification for 86% of the parts
and they disagreed on 14%.
From there now we can go and do attribute gauge analysis
and see what JMP can do for this analysis.
To do attribute gauge analysis, we're going to go to Analyze,
Quality and Process, Variability / Attribute Gauge Chart.
Then we cast the columns into roles. And here I've shown the inspectors
are listed under the Y response, both of them.
Then the column for the part is listed as the grouping.
These are required entries.
We notice the chart type is attribute, we click okay.
And now JMP provides us with this attribute gauge analysis report.
The first chart that's shown over here is the percent agreement for each part.
So we have 100 parts on the horizontal rows axis over here,
and when there's 100%, it means the two inspectors agreed.
When there's 0%, it means they disagreed.
The left chart shows the overall percent agreement 86% by inspector.
Since the comparison is between only two inspectors,
both are going to have the same 86% agreement value.
The agreement report now includes
a numerical summary of the overall 86% agreement.
You can see 86 matches out of 100 inspected
and the individual inspector values are going to be the same,
since there's only the one pass/fail comparison for each part.
And 95% confidence limits are provided for the two results,
both for the inspector and for the agreement.
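As a side note, here's a minimal Python sketch of how a percent agreement and an approximate 95% confidence interval can be computed from the raw counts. The Wilson score interval below is just one common choice and may not match JMP's exact interval method; the counts are the 86 matches out of 100 from this example.

# Minimal sketch: percent agreement with an approximate 95% CI
# (Wilson score interval; JMP's reported limits may differ slightly).
from math import sqrt

matches, n = 86, 100
p_hat = matches / n                       # observed agreement = 0.86

z = 1.96
center = (p_hat + z**2 / (2 * n)) / (1 + z**2 / n)
half = z * sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / (1 + z**2 / n)
print(f"agreement = {p_hat:.2%}, 95% CI ~ ({center - half:.3f}, {center + half:.3f})")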
Now the agreement comparisons report includes a new statistic
that perhaps many people are not familiar with,
called the Kappa Statistic.
It was devised by Cohen; the reference is given at the end.
The Cohen Kappa Statistic, which in this case is 0.7203,
is designed to correct for agreement by chance alone.
This was very interesting to me when I first read about this.
Like, "What do we mean by agreement by chance alone?"
Let's go into a little bit of an explanation of agreement by chance
and how we can estimate it.
Let's consider two raters, R1 and R2.
We'll assume totally random choices
for each rater for each sample, that is, for each part.
We further assume that the probability a rater selects either choice pass
or fail over the other is 50%. So it's 50/50.
One hundred samples, or trials, are independently categorized as pass/fail by each rater,
similar to flipping a coin for each choice.
Just visualize two inspectors, each flipping a coin,
and count how many times they get heads/heads,
tails/tails, heads/tails, or tails/heads.
What's the expected fraction of agreements by chance?
Well, it's a simple problem in probability.
Similar to tossing two coins, there are only four possible
and equally likely chance outcomes between the two inspectors for each part.
Rater 1 could call it a fail, and Rater 2 could call it a fail;
they would agree. Rater 1 could call it a pass,
and Rater 2 could call it a pass, and there'd be agreement there.
The disagreements would be when they don't agree on whether it's a pass or a fail.
Now, these are four equally likely outcomes.
Two of them agree and two of them disagree.
Therefore, the probability of agreement by chance alone is two out of four, or 50%.
It's a simple probability problem.
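Just to make that concrete, here's a tiny Python enumeration of the coin-flip argument: with two raters each choosing pass or fail at random with probability one half, exactly half of the equally likely outcomes agree.

# Enumerate the four equally likely (rater1, rater2) outcomes
# and compute the fraction that agree.
from itertools import product

outcomes = list(product(["pass", "fail"], repeat=2))   # 4 outcomes
agreements = [r1 == r2 for r1, r2 in outcomes]
print(outcomes)
print(sum(agreements) / len(outcomes))                 # 0.5 = chance agreement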
Now, how do we calculate the Kappa Statistic?
As I say, it's meant to correct for this expected probability
of agreement by chance alone.
The simple formula for the Kappa Statistic takes the percent agreement,
in this case 86%, minus the agreement expected by chance
from the data, which we know is going to be around 50%,
and divides by one minus the expected agreement by chance.
How do we actually use the data to estimate the expected chance
from the data itself, the expected agreement by chance from the data?
Well, the estimation of the Cohen Kappa Statistic
is shown below, and this is basically how it's done.
This is the tabulated table we saw earlier:
agreement on fail/fail between Inspector 1 and Inspector 2 in 42 instances,
and agreement between Inspector 1 and Inspector 2
on the pass criterion in 44 instances.
You add those up and that's 86%.
I show you over here, in the Excel format, that we added 42 plus 44
and divided by 100.
Then the disagree fraction is one minus that,
or just five plus nine divided by 100.
Now, to calculate the agreement by chance for the Cohen Kappa Statistic, we take the sum
of the products of the marginal fractions for each pass/fail category.
Here are the marginal fractions.
For the fail category, the marginal fractions are 51 divided by 100 and 47 divided by 100,
so we form the product of those two: 51 divided by 100
times 47 divided by 100.
Then for the pass category,
we take 49 out of 100 and 53 out of 100,
multiply those two together, and add the two products.
That gives us, when we calculate it out, 0.4994,
which is obviously very close to 0.5.
Then the Kappa Statistic is the percentage agreement
minus the expected by chance, divided by one minus the expected by chance.
That comes out to be 0.7203 in this case.
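To make the hand calculation concrete, here's a short Python sketch of the kappa arithmetic using the 2x2 counts from the tabulation (42 fail/fail, 44 pass/pass, 5 and 9 disagreements). JMP computes the same index internally; this just reproduces the calculation above.

# Cohen's kappa from the 2x2 agreement table in this example.
n = 100
fail_fail, pass_pass = 42, 44          # both inspectors agree
i1fail_i2pass, i1pass_i2fail = 5, 9    # disagreements

p_observed = (fail_fail + pass_pass) / n               # 0.86

# Marginal fractions for each inspector
i1_fail = (fail_fail + i1fail_i2pass) / n              # 47/100
i2_fail = (fail_fail + i1pass_i2fail) / n              # 51/100
i1_pass = 1 - i1_fail                                  # 53/100
i2_pass = 1 - i2_fail                                  # 49/100

p_chance = i1_fail * i2_fail + i1_pass * i2_pass       # 0.4994

kappa = (p_observed - p_chance) / (1 - p_chance)
print(round(p_chance, 4), round(kappa, 4))             # 0.4994, 0.7203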
Here are some guidelines
for interpreting Kappa.
Going back up again,
I want to show you that the Kappa was 0.7203.
The guidelines for interpreting Kappa are: if Kappa is greater
than 0.75, it's excellent;
if it's between 0.40 and 0.75, it's good;
and if it's less than 0.40, it's called marginal or poor.
There are some defined dividing lines on this.
Total 100% agreement
would give a Kappa of one, and we could actually get a negative Kappa,
which would mean agreement less than expected by chance alone.
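As a small illustration of those guidelines, here's a helper that maps a kappa value to the categories quoted above; the 0.40 and 0.75 cutoffs are the conventions from the talk, not JMP output.

# Map a kappa value to the interpretation guidelines quoted in the talk.
def interpret_kappa(kappa: float) -> str:
    if kappa > 0.75:
        return "excellent"
    if kappa >= 0.40:
        return "good"
    return "marginal/poor"

print(interpret_kappa(0.7203))   # "good" -- just under the 0.75 cutoff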
The references for these are given at the end.
All right, let me just stop here and go into JMP to what we've done so far.
We can see that.
The data file that we've got over here is the inspectors
that I talked about, and I said we could take a look at the distribution.
Obviously, if you're familiar with JMP, you just go analyze distribution
and put in the inspectors over here.
Then we can put this next to the rows,
and if we click fail, we see the comparison between the fails
for Inspector 1 versus the fails for Inspector 2, and some disagreements.
Then you can see which rows disagree, for example row four.
Similarly, if we click over there, we can compare Inspector 2 to Inspector 1
and get a different set of numbers over here.
The other option that I mentioned for visualization is to do a graph builder
and in graph builder, we can put Inspector 1 on the horizontal axis
and Inspector 2 on the vertical axis.
Now we have a comparison, and now we can actually see the numbers
of times, for example, how many times did Inspector 1
rate something as a pass where Inspector 2 rated it as a fail.
If we go back to the data table,
we see that there are nine instances of that, and those rows are highlighted.
This is a very quick way of seeing what the numbers are
for the different categories that we're working with.
Let's click Done over here.
The other thing we can use as I mentioned, is the tabulate feature.
We can go to tabulate, and we can put Inspector 1 on the top
and Inspector 2 on the bottom over here this way.
Then we can add in another row for the marginal totals down here and so forth.
Now we have the summary that we can put next to the graph builder
and see what the actual tabulations are that we've got for that.
That's something that we would do
perhaps if we were not familiar with JMP's program.
But let's use JMP now to do the analysis.
We're going to come over here, we're going to go to Quality and Process.
We're going to go to Variability / Attribute Gauge Chart.
We're going to put in the inspectors over here.
We're going to put in the part.
Over here, you notice the chart type, attribute, is required.
Okay, click okay, and now we have our output.
This output shows again for each part the agreement.
Zero means they disagreed. 100% means they agreed.
This shows the rating between the two inspectors, 86%.
This is the summary, 86%.
Here's our Kappa index over here, and we have the agreement within raters.
This is somewhat redundant,
because we're only looking at one binary comparison.
Then further on down here, we can do the agreement by categories.
We can actually calculate the agreement on the fail category
individually, or on the pass category individually.
Okay, so that's how we would do it for a simple comparison.
But what if we now consider that the actual diagnosis
of the part was known or confirmed?
Let's go back into our PowerPoint presentation.
I introduce a standard.
Okay, this is a measure of what we call effectiveness.
We're going to assume the correct part classification was either known
or subsequently confirmed.
So this is the true correct diagnosis that should have been done on that part.
How accurate are the inspectors' choices?
In other words, how can we determine how accurately each inspector
matched up with the true standard?
We set up that column in JMP, and now we can go through the process
that we said earlier of looking at a distribution.
For example, if we generate the distributions,
this time including the standard, now we can click on pass,
and we can see the agreements between Inspector 1 and Inspector 2
on pass classifications.
You can see both of them had some misclassifications, some wrong diagnosis.
We can click on fail and do the same thing for the other category.
Then JMP will highlight those in the data table.
To do the attribute gauge analysis in JMP using the standard,
all we have to do now is enter standard into the dialog box as we did before.
This is the additional column.
The big difference now, and this is not highlighted in the manual,
is that under attribute gauge, we can now get a chart
that applies specifically to the effectiveness.
What we're going to do is unclick these agreement points on the chart,
and click instead the effectiveness points under attribute gauge.
When we do that, we get another chart that measures the effectiveness.
And this effectiveness has three ratings for it.
This gauge attribute chart now shows the percent agreement,
0%, 50%, or 100%, of the two inspectors to the standard for each part.
A 0% implies both inspectors misclassified the part.
Seven events of that occurred.
A 50% signifies that one of the inspectors got the correct classification,
and obviously, 100% means they both got it right.
Then the left chart shows the overall percent agreement
to the standard for each inspector.
We noticed that there was some slight difference between the two inspectors.
We now generate the effectiveness report
that incorporates the pass/ fail comparisons
to the standard for each inspector.
You can see Inspector 1 got 42 of the fails correct.
He got 43 of the passes correct, but he got 10 of the fails incorrect,
calling them passes.
And he got five of the passes incorrect.
I find this notation a little bit confusing.
I put it down at the bottom.
When we say incorrect fail,
that means a fail was incorrectly classified as a pass.
When we say incorrect pass, it means a pass was incorrectly
classified as a fail.
You can get your mind going in crazy ways
just trying to always interpret what's in there.
What I've done is I created my own chart to simplify things.
The misclassification report shows over here that 17 actual fail parts were classified
as pass, and 11 pass parts were classified as fail.
So that's in the JMP output.
But here's what I did over here, and I'd love to see JMP include
something similar to this as a clearer explanation of what's going on.
For Inspector 1 and Inspector 2, when the standard is pass,
the correct classifications as pass are 43 and 42,
and the parts misclassified as fail are five and six.
Then over here, when the standard is fail,
the correct choices by Inspector 1 and Inspector 2 are 42 and 45.
And for the misclassifications, when the part was a fail,
10 of them were classified as pass by Inspector 1 and seven by Inspector 2.
Now, understand, a fail part classified as pass
is a defective part going out.
That's called a miss.
On the other hand, a pass part classified as fail
is basically producer's risk; that's a false alarm.
And JMP uses those terms, false alarm and miss;
I'll explain them later on.
I like this chart because it seems to make a clear explanation
of what's going on.
Using graph builder, again we can view the classifications
by each inspector as shown over here.
Again, you can highlight specific issues there.
JMP also allows you to define a conformance.
In other words, we said non-conforming is a fail and conforming is a pass.
That way we can take a look at the rate of false alarms and misses
in the data itself as determined by the inspectors.
We can see that the probability of a false alarm for Inspector 1 was roughly 0.104
and for Inspector 2 it was 0.125. The probability of a miss,
which means we let a defective part go out,
was higher for Inspector 1 than for Inspector 2.
I'll show how these calculations are done.
To emphasize this, a false alarm
occurs when a part is incorrectly classified as a fail
when it is actually a pass. That's called a false positive.
The false alarm rate is the number of parts
that have been incorrectly judged to be fails,
divided by the total number of parts that are passes according to the standard.
Now, here's where that calculation is done.
If I go up here, here are the passes misclassified as fail,
so if I take five out of 48, I end up with that number, 0.1042.
Now the next thing is a miss.
The part is incorrectly classified as a pass
when it actually is a fail. That's a false negative.
In this case, we're sending out a defective part.
The number of parts that have been incorrectly judged to be passes,
divided by the total number of parts that are fails according to the standard,
is 10 out of 42 plus 10, which is 0.1923.
Again, going back to this table,
these are the parts that are fails, but 10 of them were misclassified as a pass.
So the number of parts that should have been classified as fail is 52.
Ten divided by 52 gives you that number, 0.1923.
So I like that table; it's easier to interpret.
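To make that arithmetic concrete, here's a small Python sketch of the false alarm and miss rates for Inspector 1, using the counts quoted above (the standard has 48 passes and 52 fails).

# False alarm and miss rates for Inspector 1 from the effectiveness counts.
true_pass, true_fail = 48, 52

i1_pass_called_fail = 5    # pass parts incorrectly called fail (false alarms)
i1_fail_called_pass = 10   # fail parts incorrectly called pass (misses)

p_false_alarm = i1_pass_called_fail / true_pass   # 5/48  ~ 0.1042
p_miss        = i1_fail_called_pass / true_fail   # 10/52 ~ 0.1923
print(round(p_false_alarm, 4), round(p_miss, 4))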
The final thing about the conformance report
is you can change your conformance category,
or you can switch conform to non-conform.
You can also calculate an escape rate, and that is the probability
that a non-conforming part is actually produced and not detected.
To do that, we have to provide some estimate or probability
of non-conformance to the JMP program.
I put in like 10%, let's say 10% of the time we produce a defective part.
Given that we've produced a defective part, what's the probability
that it's going to be a miss and escape?
That's the escape rate: the multiplication of the two,
the probability that the process produces a failing part
times the probability of a miss.
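Here's a one-line sketch of that multiplication, using the 10% nonconformance estimate entered in the dialog (an assumed illustration) and Inspector 1's miss rate from above.

# Escape rate = P(nonconforming part produced) * P(miss by the inspector).
p_nonconforming = 0.10          # user-supplied estimate, as entered in the dialog
p_miss_inspector1 = 10 / 52     # ~ 0.1923 from the effectiveness report

escape_rate = p_nonconforming * p_miss_inspector1
print(round(escape_rate, 4))    # ~ 0.0192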
Now let's go into JMP again
and we're going to use the inspection with the standard.
I'm quickly going to go through this.
We do Analyze, Distribution again, and put in the three columns over here,
and now we can click on the standard down here,
and then we can highlight and compare Inspector 1 and Inspector 2.
Another way to visualize it is to use graph builder as we've done before.
Then we can put Inspector 1 over here.
Let me do it this way.
And Inspector 2 can be on this side now.
Then we can enter the standard over here on this side.
And now we have a way of clicking and seeing what the categories
were relative to the standard.
That's a very nice little graph, and if you wanted to ask,
"Okay, how many times did Inspector 1
classify a part as a pass when the standard was a fail?"
we can click on that cell, and the corresponding rows are highlighted too.
Let's go into JMP. Now we're going to go to Analyze,
Quality and Process, Variability / Attribute Gauge Chart,
recall the dialog, and I'm going to add in the standard over here.
We click the standard. Now here's the issue.
JMP gives us the attribute gauge chart, but this was for the agreement.
What we'd like to measure is against the standard.
We come up here on the attribute gauge chart
and what we're going to do is unclick anything that says agreement.
And click on anything that says effectiveness.
There might be a simpler way to do this eventually
through programming in JMP.
Now we have the effectiveness chart; again, as I said, 50% means
that one of the inspectors got it right, and 0% means they both got it wrong.
And we had the agreement report showing the 86% that we've seen before.
But what we want to get down to is the effectiveness rating,
the effectiveness report.
And now we see that Inspector 1 was 85% effective.
Inspector 2 was 87% effective.
Overall it was 86% effective.
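As a quick check, here's where those effectiveness percentages come from, using the correct-call counts from the effectiveness report.

# Effectiveness = correct calls out of 100 parts for each inspector.
inspector1_correct = 42 + 43   # correct fails + correct passes = 85
inspector2_correct = 42 + 45   # correct fails + correct passes = 87
n = 100

eff1 = inspector1_correct / n                                   # 0.85
eff2 = inspector2_correct / n                                   # 0.87
overall = (inspector1_correct + inspector2_correct) / (2 * n)   # 0.86
print(eff1, eff2, overall)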
Here's the summary of the misclassifications.
And these are the ones that are listed over here.
As I said, with this terminology you need to understand that incorrect fails
are fails that were called passes, and incorrect passes are passes that were called fails.
Then the conformance report is down here, we showed you how to do the calculation
and then we can change the conforming category by doing that over here.
Or we can calculate the probability of escape, the escape rate,
by putting in some number that estimates
how often we'd expect to see a defective part.
I'm putting in 0.1 over here,
click, okay.
And then JMP gives us the probability of non-conformance and the escape rates
as shown over here now for each inspector.
Let's go back now to my PowerPoint presentation.
Now that we have a feeling for these concepts of agreement,
effectiveness and the Kappa index, let us see how we can apply the approach
to a more complex problem in gauge analysis called inventory tracking.
This was part of a consulting project with a robotics company,
Vimaan Robotics.
By the way, there are some wonderful videos,
if you click over here, that show the drones flying
in the warehouse and so forth,
doing the readings, and some of the results from the analysis.
As part of a consulting project I was introduced to the problem
of drones flying in a warehouse using optical character recognition
to read inventory labels on boxes and shelves.
In measurement system analysis (MSA),
the purpose is to determine if the variability in the measurement system
is low enough to accurately detect
differences in product-to-product variability.
A further objective is to verify that the measurement system
is accurate, precise and stable.
In this study, the product to be measured via OCR
on drones is the label on each container stored in racks in a warehouse.
The measurement system must read the labels accurately.
Furthermore, the measurement system will also validate the ability to detect,
for example, empty bins, damaged items, counts of items in a given location,
dimensions, and so forth, all being done by the drones.
In gauge R&R studies, one concern addresses pure error,
that is the repeatability of repeated measurements on the same label.
Repeatability is a measure of precision.
In addition, in gauge R&R studies,
a second concern is the bias associated with differences in the tools,
that is, differences among the drones reading the same labels.
This aspect is called reproducibility, that's a measure of accuracy.
The design that I proposed
was a crossed study in which the same locations in the warehouse,
the bins, are measured multiple times, that's for repeatability,
across different bias factors, the drones, for reproducibility.
The proposal will define several standards for the drones to measure.
The comparisons will involve within-drone repeatability,
drone-to-drone agreement (consistency),
and drone-to-standard accuracy.
The plan was to measure 50 locations, numbered 1 through 50.
Three drones will be used to measure reproducibility,
that's drone-to-drone comparisons, and there will be three passes
for each location by each drone to measure repeatability.
Now, multiple responses can be measured against each specific standard.
So we don't have to have just one item and a standard.
We can have different characteristics.
The reading can be binary, that is, classified as either correct or incorrect.
And also the reading can provide status
reporting for a location, like the number of units, any damaged units, and so forth.
Examples of different responses
are how accurately can the drones read a standard label?
Are there any missing or inverted labels?
Are the inventory items in the correct location?
Is the quantity of boxes in a location correct?
Are any of the boxes damaged? This would be something
that a person would be checking as part of inventory control,
but now we're doing it all with drones.
Here's the proposal.
I have 50 locations over here,
150 rows actually, because each location is being read three times by each drone.
So I have drone A, drone B, and drone C.
Then these are the results of a comparison to the standard.
We have five standards, A, B, C, D, and E,
and they're randomly arranged as far as location goes,
with one characteristic specified for each of the 50 locations.
Since we're doing three readings, it's 150 rows:
three drones for reproducibility, three passes for each location
by each drone for repeatability,
and the standards are specified for each location.
I'm going to make an important statement over here that the data that I'm using
for illustration is made-up data and not actual experimental
results from the company.
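Just to make the crossed-study layout concrete, here's a hedged Python sketch (assuming pandas is available) of a 150-row table like the one described above: 50 locations, three passes per location, a randomly assigned standard per location, and one column per drone. The column names are illustrative, not the company's actual schema, and the standards here are just randomly generated placeholders.

# Sketch of the crossed design: 50 locations x 3 passes, with columns
# for three drones whose readings would be filled in during the study.
import random
import pandas as pd

random.seed(1)
locations = list(range(1, 51))
standards = {loc: random.choice(list("ABCDE")) for loc in locations}

rows = []
for loc in locations:
    for pass_no in (1, 2, 3):                     # repeatability: three passes
        rows.append({"Location": loc, "Pass": pass_no, "Standard": standards[loc]})

design = pd.DataFrame(rows)                       # 150 rows
for drone in ("Drone A", "Drone B", "Drone C"):   # reproducibility: three drones
    design[drone] = None                          # readings to be filled in
print(design.head())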
We can start off with distributions and dynamic linking.
We can now compare the classification of the drones by standard.
We generate the distributions and then we click on, say, standard A,
and we can see how many times the drones got standard A right,
or whether any drones misclassified it.
Similarly, if we click on standard E, we can see drone A had a higher propensity
for misclassifying standard E, and the same is true of drone C.
Now the chart below shows how well
the drones agreed with each other for each location.
Here are the 50 locations, and we're looking at how the drones compare.
Now, when you're comparing a drone to other drones, you've got a lot of comparisons.
You're comparing drone one to itself across its three passes,
and you're comparing drone one to drone two for each one of the measurements.
So you have readings 1, 2, 3 for drone one
against readings 1, 2, 3 for drone two,
and you're comparing all possible combinations of those readings across the drones.
That's why the calculations get a little bit complex
when you have multiple drones in there.
But you're doing a comparison.
This shows the agreement among all the comparisons.
Now, we noticed that for a few locations, roughly between locations 5 and 10,
the agreement dropped quite significantly,
and that prompted further investigation as to why.
It could have been the lighting, it could have been the location,
it could have been something else that was interfering with the proper diagnosis.
You see most of the drones are reading accurately, at 100%.
This is the agreement between the drones,
so they were agreeing roughly 90 to 91% of the time.
And these are the confidence intervals for each drone.
So this told us how well the drones were comparing to each other.
Now we get the agreement comparisons; the tables show the agreement values
comparing pairs of drones and comparing drones to the standard.
The Kappa index is given against the standard, and repeatability
within drones and reproducibility across drones are all excellent based on the Kappa Statistics,
and agreement across the categories is also excellent.
So we're comparing here drone A to drone B,
drone A to drone C, and drone B to drone C, all showing excellent agreement.
We're comparing here the drones to the standards, all in excellent agreement.
And then this is the agreement, basically a summary of it.
Then this is the agreement by the different categories.
Now again, we can look at the attribute chart for effectiveness.
The same way, we unclick all the agreement check boxes
and then click on the effectiveness boxes.
We see again over here that locations seven and eight had the lowest
agreement to the standard.
Again, that could have been something associated with the lighting.
It could have been something associated with some other issue there.
Then for the overall agreement to the standard by drone,
you can see they're about 95%.
The drones are pretty accurate, they were quite reproducible,
and the repeatability was excellent.
This is the effectiveness report. Now this is a little bit more elaborate
because now we're comparing it for each of the five characteristic standards
and these are the incorrect choices that were made for each one.
Out of 150 possible measurements,
drone A measured 142 correctly, drone B 145 and drone C 140.
So effectiveness is the important one.
How accurate were the drones?
We can see that the drones are all running up around an average about 95%.
This appears to be highly effective.
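Here's a quick check of those effectiveness percentages, using the correct-reading counts quoted above out of 150 possible measurements per drone.

# Effectiveness per drone = correct readings out of 150.
correct = {"Drone A": 142, "Drone B": 145, "Drone C": 140}
total = 150

for drone, n_correct in correct.items():
    print(drone, f"{n_correct / total:.1%}")
# These average to roughly 95%, matching the effectiveness report.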
Then we have a detailed analysis
by level, provided in the misclassification report.
So we can see individually how each of the different characterizations
was measured correctly or incorrectly by each of the drones.
These are the misclassifications.
Again, let me go into JMP.
Oh, one further example I meant to show over here.
Using graph builder,
we can view the classifications and misclassifications by each drone.
This is a really neat way of showing it. I wish JMP would include this possibly
as part of the output, but you can see where the misclassifications occur.
For example, for drone A and drone C,
most of the parts were classified correctly, but there are a few that were not.
These show the misclassifications.
I like that kind of representation in graph builder.
Now let's go back into JMP, and we're going to do
attribute gauge analysis with multiple drones, using the actual experiment that was run.
Okay, so we're going to analyze distributions so we can do this.
We can compare the drones to the standard.
Again, we can just click on a standard and see how it compares across the drones.
We can also do a Graph Builder analysis.
We can put drone A,
and then drone B, and then drone C over here.
And then we can put the standard in there,
and it shows very clearly what's happening.
But we can also go into JMP and use Quality and Process,
Variability / Attribute Gauge Chart.
So we add the three drones in here, we add the standard,
and we put in the location, and we get our attribute gauge chart report
showing that the drones, as far as agreement with each other, were at about 90%.
This also shows which locations were the most difficult to characterize.
Here are the agreement reports that I've shown you.
Drone A, drone B, and drone C agreement with the other drones, and with themselves too.
Drone A to drone B, these are the Kappa values.
This is the measurement to the standard, all very high.
And then these are the agreement across categories.
And then for the effectiveness,
to get that graph that we like to see for the effectiveness report,
we take out the agreement over here and now click on the effectiveness.
We now have the effectiveness plot
at the top that shows us how the drones agreed with the standard.
We now go back into the PowerPoint presentation over here.
Okay, to summarize what we've done over here,
the use of attribute gauge analysis allowed the company to provide solid data
on the agreement and effectiveness of drones for inventory management.
The results are very impressive. Subsequent results reported
on the company's website show inventory counts to be 35% faster, inventory costs
reduced by 40%, and missed shipments
and damage claims reduced by 50% compared to the previous methods.
In addition, the system generates what we call actionable data
for more accurate, more effective, safer, more cost-effective,
and faster inventory control.
Some excellent references are given here: Cohen's original paper;
the book by Fleiss, which is excellent
and has a lot of detail; and the book by Le, which is also well done.
I thank you very much for listening. Have a good day.