Data Mining Tools and Applications in Manufacturing (2021-EU-30MP-770)

Level: Beginner

 

Yassir EL HOUSNI, R&D Engineer/Data Scientist, Saint-Gobain Research Provence
Mickael Boinet, Smart Manufacturing Group Leader, Saint-Gobain Research Provence

 

Working on data projects across different departments such as R&D, production, quality and maintenance requires taking a step-by-step approach to the pursuit of progress. For this reason, a protocol based on Knowledge Discovery in Databases (KDD) methodology and Six Sigma philosophy was implemented. A real case study of this protocol using JMP as a supporting tool is presented in this paper. The following steps were used: data preprocessing, data transformation and data mining. The goal of this case study was to improve the technical yield of a specific product through statistical analysis. Due to the complexity of the process (multi-physics phenomena: chemical, electrical, thermal and time), this approach was combined with physical interpretations. In this paper, the aggregation of data coming from more than 100 sensors is explained. To explain the yield, decision tree learning was used as the predictive modelling approach. Decision tree learning is a method commonly used in data mining; the goal is to create a model that predicts the value of a target variable based on several input variables. In our case, a model based on three input variables was used to predict the yield.

 

 

Auto-generated transcript...

 


YASSIR EL HOUSNI Hello.
I am Yassir El Housni, R&D engineer and data scientist in the smart manufacturing team of Saint-Gobain Research Provence in Cavaillon, France.
We work for the ceramic materials business units. In this presentation, we have two parts. In the first, we will present the data project life cycle that we propose for manufacturing data projects.
And in the second, we will present two use cases from Saint-Gobain's ceramic materials industries.
Working on data projects across different departments, such as R&D, production, quality and maintenance, requires taking a step-by-step approach to pursue progress. For this reason, we implemented a protocol based on the Knowledge Discovery in Databases (KDD) methodology and the Six Sigma philosophy, also known as DMAIC: define, measure, analyze, improve, and control.
In this infinity loop we define seven steps to follow in order to manage a data analysis project correctly. In all of them, we make sure to have a good understanding of the process, because we believe it is a key to successful data projects in the industrial world.
For example, to detect variation in the process we use a SIPOC or flow chart map, and to detect the causes of variation we use our problem-solving toolbox, which contains a ??? or Ishikawa diagram.
The infinity loop also presents another route to achieve continuous improvement. Next, we will detail the approach step by step.
Let's start with defining the project. It is necessary to clearly define three elements before starting a data project.
We propose here some questions which we found very useful to clearly define the elements of the trade(?). First of all, the business need definition.
Frequently, the target of a data project in manufacturing is to optimize a process, maximize a yield, improve the quality of a specific product, or reduce energy consumption.
As part of defining the business opportunity, we should know what will be used: does the target need just visualization, or analytics?
And after that, the impact should bring a quantified gain to the business.
Secondly, data availability and usability. Here we launch a quick first-pass analysis of the data quality. This step is very important to determine the feasibility of the data project. And thirdly, the team setup:
a person from the data team, a person from the business unit team, and a person from the plant, a process engineer with a Six Sigma green or black belt.
Let's move to the second step.
Data preparation.
With transformation, integration and cleansing, it is an important step which consumes a lot of time in data projects. For example, we have different sources of data and we need to centralize them in one table. Mainly we use X for inputs and Y for outputs. In this step we use different tools in JMP, such as missing value processing, the ??? of constant variables, and of course the JMP data table tools, which ensure the right SQL requests to transform the tables correctly.
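As a rough, non-JMP illustration of this centralization step, the sketch below (Python with pandas; the table contents, column names and batch_id key are invented for the example) joins a sensor table and a quality table into one X/Y table and applies some basic cleansing.

import numpy as np
import pandas as pd

# Two hypothetical source tables keyed by a batch identifier:
# sensor aggregates (inputs X) and quality results (output Y).
sensors = pd.DataFrame({"batch_id": [1, 2, 3, 4],
                        "X1": [4.2, 4.8, np.nan, 5.1],
                        "X2": [150, 148, 151, 149],
                        "X3": [7.0, 7.0, 7.0, 7.0]})     # constant column
quality = pd.DataFrame({"batch_id": [1, 2, 3, 4],
                        "Y": [92.5, 88.1, 90.3, 91.0]})  # yield

# Centralize everything in one table: X columns for inputs, Y for the output.
data = sensors.merge(quality, on="batch_id", how="inner")

# Basic cleansing: drop constant columns, then rows with remaining missing values
# (JMP's missing value and recode tools offer richer options).
constant_cols = [c for c in data.columns if data[c].nunique(dropna=True) <= 1]
data = data.drop(columns=constant_cols).dropna()
print(data)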
The third step is about exploring data with dynamic visualization, and with JMP we have a large choice of visualizations. For example, plot the distribution of a variable and estimate the law(?) that it follows,
detect outliers with box plot diagrams,
fit a nonlinear regression between two variables,
or use contour or density mapping to determine the main areas of concentration of each population; we have a large choice of ways to plot it.
???, we use them regularly in our work and we have found them very useful.
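A minimal sketch of the same kind of exploration outside JMP, using pandas and matplotlib on synthetic data (the columns X1, X2 and Y are placeholders, not the real process variables), could look like this:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
data = pd.DataFrame({"X1": rng.normal(5, 1, 300),
                     "X2": rng.normal(50, 10, 300)})
data["Y"] = 80 + 2 * data["X1"] - 0.1 * data["X2"] + rng.normal(0, 1, 300)  # synthetic yield

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
data["Y"].hist(ax=axes[0], bins=30)                        # distribution of the output
axes[0].set_title("Distribution of Y")
data.boxplot(column="X1", ax=axes[1])                      # box plot to spot outliers
axes[1].set_title("Box plot of X1")
axes[2].scatter(data["X1"], data["X2"], s=10, alpha=0.5)   # relationship between two inputs
axes[2].set_title("X1 vs X2")
plt.tight_layout()
plt.show()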
The fourth step is the development of the model, and it depends on the kind of analysis that we need:
whether the target is to explain or to predict. The first is about links between variables, and it serves to explain patterns in the data. The second is about a formula, and it serves to predict patterns in the data.
Generally we cut our data sets into three blocks: 70% for training, 20% for testing, and 10% for validation. Sometimes, if we have a small data set, we use 70% for training and 30% for validation. If the model is good, we request a new set of data in order to drive decision making.
We have two approaches, supervised and unsupervised learning. Today at Saint-Gobain we use the standard version of JMP,
and we have access to the supervised learning tools, such as linear regression, decision trees and neural networks. We work a lot today with decision trees because they give us relevant results, which help us to resolve challenges in the ceramic materials industry.
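As a hedged analogue of this modelling step, the sketch below uses scikit-learn on synthetic data to make the 70/20/10 split described above and fit a shallow regression tree; it is not the model from the talk, just an illustration of the workflow.

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(500, 5)), columns=[f"X{i}" for i in range(1, 6)])
y = 90 + 3 * X["X1"] - 2 * (X["X3"] > 0) + rng.normal(scale=1.0, size=500)  # synthetic yield

# 70% training, 20% testing, 10% validation, as in the split described above.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, train_size=0.7, random_state=0)
X_test, X_val, y_test, y_val = train_test_split(X_rest, y_rest, train_size=2/3, random_state=0)

# A shallow tree keeps the model explainable for the plant.
tree = DecisionTreeRegressor(max_depth=3, random_state=0)
tree.fit(X_train, y_train)
print("R2 test:", round(r2_score(y_test, tree.predict(X_test)), 2),
      "R2 validation:", round(r2_score(y_val, tree.predict(X_val)), 2))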
The fifth step is about finding the optimal solution. Sometimes it is a single solution, but in other cases it is a combination of several models. To make sure the solution makes physical sense, we add some constraints, for example the min, max and step of variation of each variable.
The JMP profiler gives us large possibilities to optimize solutions quickly.
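A very rough equivalent of this constrained optimization, assuming a placeholder prediction function in place of the real profiler model and invented min/max/step constraints, is to enumerate the grid of allowed settings and keep the best predicted yield:

import itertools
import numpy as np

# Placeholder for the fitted model's prediction; in practice this would be the
# decision tree or profiler model from the previous step, not this invented formula.
def predict_yield(x1, x2, x3):
    return 90.0 - 0.5 * abs(x1 - 4.0) - 0.02 * abs(x2 - 150.0) - 2.0 * abs(x3 - 2.5)

# Constraints per variable: (min, max, step) -- placeholder values for the example.
constraints = {"X1": (0.0, 10.0, 0.5), "X2": (100.0, 200.0, 10.0), "X3": (1.0, 5.0, 0.25)}
grids = [np.arange(lo, hi + step / 2, step) for lo, hi, step in constraints.values()]

best_setting, best_pred = None, -np.inf
for x1, x2, x3 in itertools.product(*grids):
    pred = predict_yield(x1, x2, x3)
    if pred > best_pred:
        best_setting, best_pred = (x1, x2, x3), pred

print("Best predicted yield:", round(best_pred, 2), "at X1, X2, X3 =", best_setting)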
From here, the next step passes to the plant, with the support of our process engineer with a Six Sigma green or black belt.
The sixth step is about implementing the best solution in the plant, governed by only one representative model. For example, we implement control charts of the output Y1 and analyze the different variations.
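As a small, generic sketch of an individuals control chart for an output Y1 (synthetic measurements, not the plant's data), the limits below use the usual moving-range estimate of sigma:

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
y1 = rng.normal(loc=95.0, scale=1.2, size=60)   # synthetic Y1 measurements

center = y1.mean()
mr_bar = np.mean(np.abs(np.diff(y1)))           # average moving range
sigma = mr_bar / 1.128                          # d2 constant for subgroups of size 2
ucl, lcl = center + 3 * sigma, center - 3 * sigma

plt.plot(y1, marker="o")
plt.axhline(center, color="green", label="center line")
plt.axhline(ucl, color="red", linestyle="--", label="UCL")
plt.axhline(lcl, color="red", linestyle="--", label="LCL")
plt.title("Individuals control chart for Y1")
plt.legend()
plt.show()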
In the seventh step, we monitor the model's effectiveness and we visualize the global gain of working on our data project. For example, in the pie chart, we see the relative impact on the global yield.
And last but not least, we prepare for step one of a new data project, to ensure continuous improvement and the continuity of the infinity loop.
That was all about the data project life cycle, and now we will present two real case studies of the protocol using JMP.
In the two examples we studied the same process technology, the electric arc furnace, but with different products and different targets.
In this technology, we have complex multi-physical phenomena: electrical, thermal, chemical, time and other physical effects.
Here we have, for example, more than 100 process variables that come from different kinds of sensors.
The business need was to explain the global yield of a specific product, JO7.
In it we detect many kinds of defects, from Defect #1 and #2 up to #N.
And to prepare the data sets correctly, we used Pareto charts,
outlier processing with the Mahalanobis distance, recoding of attributes to correct typing errors, and missing value processing.
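A minimal sketch of the Mahalanobis-distance screening, with a random matrix standing in for the real sensor data, flags the rows whose squared distance exceeds a chi-square cutoff:

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))          # stand-in for the sensor inputs

mean = X.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
diff = X - mean
d2 = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)   # squared Mahalanobis distance per row

cutoff = stats.chi2.ppf(0.975, df=X.shape[1])        # 97.5% chi-square quantile
outliers = np.where(d2 > cutoff)[0]
print("Rows flagged as outliers:", outliers)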
In step three, explore, we present here just an example of the correlations that we study between inputs, to reduce the number of variables before working with the models.
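One possible way to do this correlation screening outside JMP (pandas, with an assumed 0.9 threshold and synthetic inputs) is to drop one input from each highly correlated pair:

import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
inputs = pd.DataFrame(rng.normal(size=(100, 4)), columns=["X1", "X2", "X3", "X4"])
inputs["X4"] = inputs["X1"] * 0.95 + rng.normal(scale=0.1, size=100)   # make X4 redundant with X1

corr = inputs.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))      # keep upper triangle only
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]   # drop one of each correlated pair

reduced = inputs.drop(columns=to_drop)
print("Dropped:", to_drop, "-> remaining inputs:", list(reduced.columns))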
As our result, we found a decision tree with just three variables; here, the goal was to explain why the yield is not at its maximum.
We have a decision tree with just three variables out of the 100. For this model we used 70% of the data for training, because we do not have a large data set, and 30% for validation, and we got good results with a high R square.
As you see, it is more than 70%.
So the message that we passed to the plant is that just with a specific setting of X1, X2, and X3 we can explain the global yield, and if we need to maximize Y, the percentage of this yield, we need a specific setting for just X1 and X3.
And the global yield should improve rapidly. It is this point...at the cluster of points here.
And for each project, we also give the physical understanding of each parameter to the plant.
For the second example, with the same technology but for another product, we have just 80 process variables, and the target was to increase the number of pieces with no defect, D1.
The need is about explaining the quality of a specific product, so we used the same methodology for our steps. For example, here we studied the same correlations between inputs to reduce the number of variables that we put into the model.
And, as a result, we also used a decision tree, but here we found 12 variables that explain this global yield with good results. As you see, the R square was 84%, the RMSE was 3%, and the sample size was 287. For this, we used the cross-validation method
because we have a very small table.
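For a small table like this, k-fold cross-validation of the tree is the usual safeguard; a minimal scikit-learn sketch on synthetic stand-in data (not the real 287-row table) could look like this:

import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(3)
X = rng.normal(size=(287, 12))                               # 287 rows, 12 retained inputs
y = 5.0 * (X[:, 0] > 0.5) + rng.normal(scale=1.0, size=287)  # synthetic target with a threshold effect

tree = DecisionTreeRegressor(max_depth=4, random_state=0)
scores = cross_val_score(tree, X, y, cv=5, scoring="r2")     # 5-fold cross-validation
print("R2 per fold:", np.round(scores, 2), "mean:", round(scores.mean(), 2))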
So the first parameter was very important; as you see, it contributes 50%.
And it was difficult to explain a model with 12 variables to the plant. So when we plot just the first variable, we see visually that we can define a threshold with just the variable X1, and with it the global yield should improve rapidly.
Thank you.