Welcome, and thank you for joining my poster presentation
at this year's JMP Conference.
My name is Kaitlin Shorkey,
and I'm a senior statistical engineer at Corning Incorporated.
How do you get a glimpse of product quality
before it completes the production process?
We chose to build a model that will predict the product quality outcome
before it has completed the entire process.
There are two major benefits of this approach.
One is the operations team
receives instant feedback on how the parts will perform
and can adjust the process in real time.
The second is that we can deem the product acceptable or not.
Like I just mentioned, the main objective of this work
is to build a predictive model using a few modeling approaches
to understand and predict product quality
based on certain product measurements.
Our major steps in building this model
are data collection, model development, testing and implementation.
First off, for the data collection phase, parts are collected at the end
of the production line and appropriate product measurements are completed.
The parts are then subjected to the quality test.
The product measurements and quality measurement results
are combined into a data set and used for building the model.
In this case, the quality measurement is the response
and all the product measurements are the predictors.
The dataset consists of 767 predictors and 990 observations, or parts.
This data collection can take a long time to complete.
Since we're building a model, it's important to get as large of a range
of product measurements and quality measurement results as we can.
By doing this, the accuracy of the model predictions
is more consistent across the range.
Essentially, this allows the model to accurately predict at all levels
of the product quality results.
Once the dataset is compiled,
it is thoroughly examined to ensure it is as clean as possible.
After the data collection and cleaning,
the second phase of model development is started.
For this, we begin with variable clustering,
then stepwise regression, to remove highly correlated variables
and select the most important ones.
With so many predictors we first apply variable clustering.
This method allows for the reduction in the number of variables.
Variable clustering groups the predictors,
or variables, into clusters that share common characteristics.
Each cluster can be represented by a single component or variable.
A snippet of the cluster summary from JMP is shown,
which indicates that 85 % of the variation
is explained by clustering.
Cluster 12 has 49 members,
and V232 is the most representative of that cluster.
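The clustering idea described above can be sketched in code. This is a hypothetical approximation, not JMP's actual implementation (JMP's variable clustering is PCA-based); here, a hierarchical clustering on correlation distance groups near-duplicate predictors, and the member most correlated with the rest of its cluster is kept as the representative. All data and thresholds are illustrative.

```python
# Sketch of variable clustering: group correlated predictors and keep one
# representative per cluster. Illustrative stand-in for JMP's PCA-based method.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))                     # toy data: 100 parts, 20 predictors
X[:, 1] = X[:, 0] + 0.05 * rng.normal(size=100)    # make two columns nearly identical

corr = np.corrcoef(X, rowvar=False)
dist = 1 - np.abs(corr)                            # highly correlated variables are "close"
Z = linkage(dist[np.triu_indices(20, k=1)], method="average")
labels = fcluster(Z, t=0.3, criterion="distance")  # cut tree at an illustrative threshold

# representative = member most correlated with the rest of its cluster
reps = []
for c in np.unique(labels):
    members = np.where(labels == c)[0]
    scores = np.abs(corr[np.ix_(members, members)]).sum(axis=1)
    reps.append(int(members[np.argmax(scores)]))
print(sorted(reps))
```

Only one of the two near-duplicate columns survives into `reps`, which is the reduction the method provides.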
The variables that are identified as the most representative ones
are then used in the next method of stepwise regression.
Stepwise regression is used
on the identified clustered variables to select the most important ones
to use in the model,
and further reduces the number of variables.
For this, the forward direction
and minimum corrected AIC stopping rule is used.
The direction controls how variables enter and leave the model.
The forward direction means that terms are entered into the model
that have the smallest p-value.
The stopping rule is used to determine which model is selected.
The corrected AIC is based on the negative two log-likelihood,
and the model with the smallest corrected AIC is the preferred model.
From this, 51 variables are entered into the model
out of the 99 available variables
from the variable clustering step.
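The forward-stepwise procedure with a corrected-AIC (AICc) stopping rule can be sketched as follows. This is an illustrative stand-in for JMP's Stepwise platform: at each step the candidate variable that most reduces AICc enters, and selection stops when no candidate improves it. The data and the Gaussian AICc formula (up to an additive constant) are illustrative assumptions.

```python
# Sketch of forward stepwise selection with a minimum-AICc stopping rule.
import numpy as np

def aicc(rss, k, n):
    # Gaussian AIC up to a constant: n*ln(RSS/n) + 2k, then the small-sample
    # correction 2k(k+1)/(n-k-1); k counts coefficients, intercept, and sigma.
    aic = n * np.log(rss / n) + 2 * k
    return aic + 2 * k * (k + 1) / (n - k - 1)

def fit_rss(X, y):
    # ordinary least squares residual sum of squares, with an intercept
    Xc = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(Xc, y, rcond=None)
    resid = y - Xc @ beta
    return float(resid @ resid)

def forward_stepwise(X, y):
    n, p = X.shape
    selected = []
    best = aicc(float(((y - y.mean()) ** 2).sum()), 2, n)  # intercept-only model
    improved = True
    while improved:
        improved = False
        scores = []
        for j in range(p):
            if j in selected:
                continue
            rss = fit_rss(X[:, selected + [j]], y)
            scores.append((aicc(rss, len(selected) + 3, n), j))
        if scores:
            score, j = min(scores)
            if score < best:                 # stop when AICc no longer improves
                best, selected, improved = score, selected + [j], True
    return selected

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 10))
y = 3 * X[:, 2] - 2 * X[:, 5] + rng.normal(size=200)
print(forward_stepwise(X, y))
```

On this toy data the two truly active variables enter first, and the AICc penalty keeps most of the noise variables out.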
At this point, we have reduced the number
of variables from 767 to 51
using variable clustering and stepwise regression.
The final method is to fit a generalized regression model.
For this, the lognormal distribution
is used with an adaptive lasso estimation method.
The lognormal distribution is the best
fit for the response, so it is chosen for use in the regression model.
The adaptive lasso estimation method
is a penalized regression technique
which shrinks the size of the regression coefficient
and reduces the variance in the estimate.
This helps to improve predictive ability of the model.
Also the data set was split into a training and validation set.
The training set has 744 observations, and the validation set has 246.
From this, the resulting model produces a .81 generalized R-square
for the training set and .8 for the validation set.
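The final modeling step can be sketched as follows. This is a hypothetical approximation of what JMP's Generalized Regression platform does natively: the lognormal response is handled by modeling its logarithm, and the adaptive lasso is implemented with the standard reweighting trick (scale each column by its initial coefficient magnitude, fit an ordinary lasso, then map back). The data, penalty value, and two-stage construction are illustrative assumptions.

```python
# Sketch of adaptive lasso on a log-transformed (lognormal) response.
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(2)
n, p = 300, 8
X = rng.normal(size=(n, p))
log_y = 1.5 * X[:, 0] - 1.0 * X[:, 3] + 0.3 * rng.normal(size=n)
y = np.exp(log_y)                                # lognormal-looking quality measurement

z = np.log(y)                                    # model the response on the log scale
w = np.abs(LinearRegression().fit(X, z).coef_)   # stage 1: initial coefficient weights
lasso = Lasso(alpha=0.05).fit(X * w, z)          # stage 2: lasso on reweighted columns
coef = lasso.coef_ * w                           # map back to the original scale
print(np.round(coef, 2))
```

The adaptive weights penalize small initial coefficients heavily, so the noise variables shrink to zero while the true effects keep nearly their full size; this shrinkage is what reduces the variance of the estimates and improves predictive ability.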
These R-squares are acceptable for our process, so we will now evaluate
the accuracy and predictability of the resulting model.
Now that we have a model, we need to review its accuracy
and predictability to see if it would be suitable to use in production.
In doing this,
a graph is produced that compares a predicted quality measurement
for a specific part to the actual quality measurement.
In the graph the x-axis shows the predicted value,
and the y-axis shows the actual.
Also, the quality measurement is bucketed into three groups
based on its value, which is shown
by the three colors on the graph.
In general, the model predicts the quality measurement well.
It does appear that the model may fit better
in the lower product quality range
than the upper, which may be due to more observations in the lower range.
As mentioned, the quality measurement
was bucketed into three different categories based on its value.
This was also done for the predicted quality measurement.
For each observation, if the quality measurement category
is the same as the predicted measurement category,
it is assigned a one.
If not, it is assigned a zero.
For both the training and validation sets, the average of these ones and zeros
is calculated and used as the accuracy measure.
We see that training set has an accuracy of 87.5%
and the validation set has an accuracy of 84%.
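The accuracy measure just described can be sketched directly. The cut points and measurement values below are illustrative, not the real process limits: each measurement is bucketed into one of three categories, a match between the actual and predicted category scores a one, a mismatch scores a zero, and the average of those ones and zeros is the accuracy.

```python
# Sketch of the bucketed accuracy measure: category match rate between
# actual and predicted quality measurements. Cut points are illustrative.
import numpy as np

def bucket(values, cuts=(10.0, 20.0)):
    # category 1 below the first cut, 2 between the cuts, 3 above the second
    return np.digitize(values, cuts) + 1

actual = np.array([4.0, 12.0, 25.0, 18.0, 9.0])
predicted = np.array([5.0, 11.0, 19.0, 17.0, 12.0])

match = (bucket(actual) == bucket(predicted)).astype(int)  # 1 if categories agree
accuracy = match.mean()
print(match, accuracy)
```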
For the model to be moved to the testing phase,
accuracy must be above a certain limit,
and both of these accuracy values are.
This will allow us to move to the testing phase of the project.
In addition, we look at the confusion matrix
to visualize the accuracy of the model
by comparing the actual to the predicted categories.
Ideally, the off diagonals of each matrix should be all zeros,
with the diagonal from top left to bottom right containing all the counts.
The matrices shown on the poster indicate that the higher counts are along
that diagonal with lower numbers in the off-diagonal,
but discrepancies still exist among the three categories.
For example, in the training set, there are 29 instances where an actual
quality measurement of three was predicted as a two.
In the same case for the validation set, there are 12.
The confusion matrix helps to understand where these discrepancies are
so further investigations can be done and improvements made.
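The confusion matrix check can be sketched as follows, with illustrative category data rather than the real counts: rows are the actual category, columns the predicted one, and anything off the main diagonal is a discrepancy of the kind discussed above.

```python
# Sketch of a 3x3 confusion matrix for the bucketed quality categories.
# Rows = actual category, columns = predicted category; data are illustrative.
import numpy as np

actual    = np.array([1, 1, 2, 2, 2, 3, 3, 3, 3])
predicted = np.array([1, 2, 2, 2, 3, 3, 3, 2, 3])

cm = np.zeros((3, 3), dtype=int)
for a, p in zip(actual, predicted):
    cm[a - 1, p - 1] += 1
print(cm)

# off-diagonal total = number of misclassified parts
misclassified = cm.sum() - np.trace(cm)
print(misclassified)
```

Reading down a column or across a row shows exactly which category pairs account for the errors, which is what directs the further investigation.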
Overall though, the model has an accuracy above the required limit,
so our next step is the testing and implementation phases.
Now that our model is through the development phase,
it's time to test it in live situations.
For this, the model is used under engineering control
to determine how well it predicts the quality measurement
in small, controlled experiments.
This is done by the engineering team
with support from the project team when necessary.
Once the engineering team is satisfied with this testing,
the model is fully implemented into production and monitored over time.
In conclusion, this model development process
has allowed us to build
predictive models for the production process.
The methods of variable clustering,
stepwise variable selection, and generalized regression
were the most appropriate and best suited to use
for this application.
With further research and investigation,
other methods could be potentially applied
to improve model performance even more.
From a production standpoint, the benefit of this model
is that the operations team will receive instant feedback on how a part
or group of parts will perform, and can adjust
and tune the process in real time.
We can also deem the product acceptable or not.
If rejected, the product is disposed of and will not continue through the process,
which over time reduces production costs.
Lastly, I'd like to give a huge special thank you
to Zvouno and Chova and the entire project team
at Corning Incorporated.
Thank you for joining and listening to my poster presentation.