Predicting Effluents from the Glass Melting Process for Sustainable Zero-Waste Production
Modelling manufacturing processes becomes challenging when measurements occur on different time scales. One such case is a company that produces glass containing SeO2. Because SeO2 is volatile at high melting temperatures, a significant amount evaporates and is captured by a dust filter. Complicating matters, the SeO2-laden dust is toxic, posing disposal challenges. Predicting the SeO2 content of the dust enables it to be recycled back into the manufacturing process, promoting sustainable, circular, zero-waste production and reducing costs.
The glass manufacturing process operates continuously, with one-minute sensor readings for the input variables. However, the SeO2 content of the dust is measured at much longer intervals, since sufficient dust must accumulate before it can be homogenized. In this presentation, we demonstrate the development of a predictive model for SeO2 dust concentration from a limited data set. To ensure that process parameters are averaged over the duration of dust sampling, we employ Monte Carlo simulations that exploit the variability of the process parameters.
With JMP Pro, we outline data collection and preparation methods, as well as the subsequent use of simulated data to construct predictive models for SeO2 concentration. We explore the potential applicability of this methodology to industries facing similar challenges, such as chemicals or biotechnology, where modelling processes with disparate time scales and uncertainties is common.
Hi. My name is Drejc Kopac, and this is the work done in collaboration with my co-worker and co-presenter, Matej Emin, and Professor Philip Ramsey.
The talk is about predicting effluents from the glass melting process, focusing on the modeling of a continuous process with sparse batch sampling of filter dust.
This project was done together with the largest Slovenian glass manufacturer, and the aim was to build a predictive model of the filter dust composition in order to recycle it.
In the beginning, let's visualize what we are modeling. Here is a 3D scheme of a glass melting furnace, and we can see its size when compared to a human being. The raw material goes into the furnace and is melted, and the molten glass then flows down the channels to the production line. We can also see two heat exchangers through which the flue gas passes on its way to the filter.
We can also see two burners, left and right, which switch periodically. This periodicity induces a short-term variability in the data set. The conditions in this furnace and the properties of the flue gas are crucial for our work, and this is what we will focus on.
To put things into perspective, here is a process scheme depicted in a circular economy fashion. The flue gas, as mentioned, goes through the heat exchanger, where it is cooled, and is then captured by the electrostatic filter. There the gas is further cooled and mixed with Calcium Hydroxide, which cleans it.
The cleaned flue gas exits the filter via the chimney, while the particulate dust is collected in big bags. This is where we enter the story. Can this 300–400 kilograms of dust per day be recycled back into the furnace instead of being disposed of? To recycle it, we need to know the filter dust composition. This was the main part of the project.
The main problem with recycling this dust is the selenium. Selenium acts as a glass decolorizer. Selenium Dioxide is volatile at glass melting temperatures, so it evaporates quickly and is later captured by the filter. This filter dust is treated as hazardous waste, so disposal is expensive. The question is, is there a way to recycle these 300–400 kilograms of dust per day in the furnace? I have to mention that this is not a huge amount given the approximately 150 tons of glass being manufactured per day.
The issue is that glass color is extremely sensitive to the selenium concentration, which means the selenium concentration must be known quite precisely. The measurements and chemical analyses of this dust are expensive and time-consuming. A possible solution would be to build a predictive model, but there are quite a few challenges.
First, the glass manufacturing process is continuous, with one-minute sensor readings. There are 40-plus online sensors measuring various temperatures, gas consumption, heater power, glass pull rate, and so on. The challenge was how to build a predictive model for Selenium Dioxide in the dust based on sparse sampling of the filter dust, on the differing timescales of that sampling, and, of course, on the many process parameters.
Usually, when we start a project like this, the most typical approach is to gather historical data, sit down with the experts in the company, such as the process plant managers, and do the initial data screening and brainstorming. We did that. We gathered historical data for over one year and tried to identify the influential parameters with the subject-matter experts.
Here on the right-hand side, a bit blurred, of course, we can see a typical multivariate plot which shows parameter correlations. This plot often helps us reduce the number of influential parameters in the model.
Often, getting to such a data set, which can be visualized and analyzed, takes quite a lot of time, especially when dealing with industrial data. I will quickly show you what the raw data from the SCADA reports looks like.
For example, this is a typical report from the company for one day of data. You can see that the parameters are not in columns, as we are used to, but in rows. If you scroll down this data, you will see that the parameter name changes every minute, with the value written next to it, and that there are 30 or even 40 parameters recorded in rows instead of columns.
On the other hand, there is a second part of this data set that is already ordered in columns; here you can see the parameters entered as columns. The challenge was to reorder, or reformat, this data set so it could be brought into JMP. We wrote a simple JSL script for this. I will just show you quickly.
This JSL, this JMP Scripting Language script, goes into a loop; within each 24 hours of data, each process factor is handled in turn. The script loops through the data set, puts the parameters into columns, and then attaches the columns from the second part of the data set. This is the script that we used. In the end, we came to this nice data set, ordered as we are used to: in columns, with all these parameters (anonymized here, of course), on a one-minute interval. This is the first part, and as I mentioned, it often takes quite some time to get to this stage.
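For readers following along outside JMP, here is a minimal sketch of what that reformatting step does. The actual script was written in JSL; the Python version below, with assumed file and column names, only illustrates the idea:

```python
# Minimal sketch of the row-to-column reshape; the actual script was written
# in JSL, and the file and column names here are assumptions for illustration.
import pandas as pd

# Part 1 of the daily SCADA report: one row per (timestamp, parameter) pair
raw = pd.read_csv("scada_day_report.csv")  # columns: Timestamp, Parameter, Value

# Pivot so that each parameter becomes its own column on the one-minute grid
wide = raw.pivot_table(index="Timestamp", columns="Parameter", values="Value")

# Part 2 of the report is already column-oriented; join it on the timestamp
extra = pd.read_csv("scada_day_columns.csv", index_col="Timestamp")
full = wide.join(extra)

full.to_csv("process_data_wide.csv")  # one-minute rows, parameters in columns, ready for JMP
```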
Initially, it is important to visualize the data so that one can get an impression of what it might tell us. Here, we see a collection of X-Y graphs. This nicely shows the short-term variability in the data. These 20-minute spikes correspond to the 20-minute interval between switching the burner side, left or right; most of the short-term variability occurs within this interval. This was an important piece of information for us when we sampled the filter dust.
To confirm this 20-minute period in the data set, we used the Time Series platform in JMP, which clearly showed the periodicity and the roughly 20-minute lag due to the left or right burner being ignited. This information, together with other subject-matter knowledge, such as the effluent transition time from melt to filter (around 15 seconds) and the dust extraction time from the filter (around one minute), was later used to align dust samples, measurements, and process parameters. These steps were quite crucial at the beginning, before even sampling the dust.
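As a rough illustration of this periodicity check (the talk used JMP's Time Series platform; the sensor column name below is an assumption), one can look at the autocorrelation of a signal sampled every minute:

```python
# Sketch of a periodicity check via autocorrelation; 'furnace_temp' is a
# hypothetical sensor column, the data file comes from the reshape above.
import numpy as np
import pandas as pd

data = pd.read_csv("process_data_wide.csv")
x = data["furnace_temp"].dropna().to_numpy()
x = x - x.mean()

# Autocorrelation for lags 0..N-1 minutes, normalized to lag 0
acf = np.correlate(x, x, mode="full")[len(x) - 1:]
acf = acf / acf[0]

# A peak near lag 20 reflects the 20-minute burner-switching cycle
peak_lag = int(np.argmax(acf[5:60])) + 5
print(f"dominant short-term lag: ~{peak_lag} minutes")
```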
Based on all this information, subject-matter expertise, and previous tests done in-house, one of the most influential parameters for the Selenium Dioxide content should be the glass pull rate. We therefore asked the manufacturer for the pull rate plan, which is prepared one month ahead, and selected the optimal dates for physical sampling of the filter dust. We targeted minimum, maximum, and roughly middle values of the pull rate. With that, we tried to cover the parameter space as well as possible.
Once we knew the dates when the production line would be operating at those pull rates, we went to the factory, collected the dust, and homogenized it. The dust was then sent to an external institute where ICP-OES analysis was done to determine the composition of the filter dust. We were particularly interested in the Selenium Dioxide content.
These are some photographs from the process. The one on the right-hand side is not your typical day in the office of a data scientist, but this task was done by ourselves because it was a crucial step for getting quality data. We did not want to risk having samples mixed up or wrongly labeled.
Once we got the results back from the external institute, we entered them into the data set. Here you can see the original data set, but now with a column named IGR Results. At the times when we captured the dust, this column contains the corresponding measured Selenium Dioxide content.
Once we have this data set, we can start modeling. We started with a very simple model, using the nine data points where the Selenium Dioxide content is determined. We are very confident about these data points because the time and date are well known and the dust was consistently homogenized.
Additionally, five more data points were available, obtained from various big-bag samplings done by people in-house. For those five points, however, the time is only approximately known because of the coarse sampling of dust from the big bags, so we expected them to introduce larger uncertainty into the model. Without including those extra points, though, the models were statistically non-significant, so we had to extend the data set with them as well.
We focused on three input parameters, namely natural gas flow, glass pull rate, and power on the electrical heaters. Those parameters were considered the most influential and showed approximately constant values during the sampling. I should mention that we took the median values of those parameters during each dust sampling period.
Here are also the plots of those parameters on the dates on which the dust was sampled. We used a simple standard least squares method for modeling and included the second-order interactions. The end result was the following model, which I will show you in JMP. These are the nine points obtained from our own dust sampling and the five additional points from big-bag sampling. It is a simple and clean model, including the second-order interactions.
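The model itself was fit in JMP with Standard Least Squares; as a sketch of the same specification in Python (column names are assumptions), it amounts to:

```python
# Sketch of the simple second-order-interaction model (fit in JMP with
# Standard Least Squares in the talk); column names are assumptions.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("dust_samples.csv")   # 14 rows: 9 own samplings + 5 big-bag points

# Main effects plus all two-way interactions of the three inputs
model = smf.ols(
    "SeO2 ~ (gas_flow + pull_rate + heater_power) ** 2",
    data=df,
).fit()

print(model.summary())                 # R^2 and p-values for factors and interactions
```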
We can see that our R² is good, close to one, and the p-value is low, so one would say that this is a statistically significant model. The p-values for the factors are also low, meaning all those factors are important. But the problems, or the caveats, of this simple model can be seen in the profiler here at the bottom. I will just zoom out a bit.
The main problem, which the experts in the company also saw, was the steep relationships between the Selenium Dioxide content and the natural gas flow and heater power. If we move to the extremes of these values, the Selenium Dioxide content gets over- or underpredicted.
Furthermore, we can see for the pull rate parameter that if we move to the extremes, such as a pull rate of 130 tons per day, the uncertainty on the Selenium Dioxide prediction gets pretty high. Those caveats already suggested that, despite the good statistics, this was not a reliable model. The simple model did not predict new measurements very well, and some predictions were physically unrealistic.
We saw that perhaps other parameters could also affect the Selenium Dioxide level, but we could not include them given the scarcity of the data set. As mentioned, the prediction profiler indicated large uncertainties at extreme values of the parameter space. To have a better model, we would need to consider more process parameters.
But of course, we were limited to only nine measurements, and we did not want to send more dust samples for analysis because the analyses were time-consuming and expensive. We asked ourselves whether information on process variation could add something to the existing data set, because additional parameters might also be influential.
Here, we started to think in terms of the variability of the data. First of all, the Selenium Dioxide determination from the external laboratory has a measurement uncertainty, which is already a variation we can use. Furthermore, for each of these 20-minute sampling intervals there are 18 data points from the one-minute sensor readings, excluding the two minutes during the burner switch, of course. These 18 measurements have underlying distributions; here I show only five of them, so you can see what they look like.
Our idea was that, given the distribution means (in fact, we used medians because some distributions are slightly skewed) and the distribution dispersions, we can randomly draw values from these parent populations and thus generate a synthetic data set using a Monte Carlo approach.
Together with the process experts, we considered 13 input parameters out of all these process parameters. Those 13 were said to be the most influential. They were also normally distributed, or at least approximately normally distributed, during the sampling periods, with no outliers, no strange artifacts, and no short-term temporal drifts. This was our initial requirement: factors with well-behaved distributions.
A very important thing we had to be careful about was preserving the correlations between factors, that is, keeping the relationships between process parameters as they are. You can imagine that if two parameters are highly correlated and you draw a very high random value for one of them, then the random value of the other should also be biased towards higher values. These parameter correlations must be preserved in the random draws of the Monte Carlo approach, and we did that via covariance matrices.
You are probably aware of the Multivariate platform in JMP. I can show you: once the data set is cleaned, for each dust measurement we can build these multivariate plots, which you see here. There are nine multivariate analyses, one per sampling, and from each of them we can extract the covariance matrix, which tells us how those parameters are correlated with each other.
For each dust sampling, the corresponding covariance matrix is generated. We exported those matrices and here used Python, because NumPy has a method called multivariate_normal, which we used to generate multidimensional normal distributions that take those covariances into account. I'm sure this could also be done in JMP with a script, but for us it was a bit easier in Python because we are more used to these methods there.
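Here is a minimal sketch of that step, assuming the medians and the exported covariance matrix for one sampling interval are available as files (the file and parameter names are hypothetical):

```python
# Sketch of the covariance-preserving Monte Carlo draw for one dust-sampling
# interval; file and variable names are assumptions. The covariance matrix
# comes from JMP's Multivariate platform as described above.
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

params = [f"P{i}" for i in range(1, 14)]                          # the 13 chosen process parameters
medians = np.loadtxt("interval_1_medians.csv", delimiter=",")     # length-13 vector
cov = np.loadtxt("interval_1_covariance.csv", delimiter=",")      # 13 x 13 matrix

# 500 synthetic rows that keep the correlations between parameters
synthetic = rng.multivariate_normal(mean=medians, cov=cov, size=500)

df = pd.DataFrame(synthetic, columns=params)
df["SeO2"] = 0.85          # hypothetical measured SeO2 content for this interval
df.to_csv("synthetic_interval_1.csv", index=False)
```

Replacing cov with only its diagonal, that is, the variances alone, gives the "no correlations" variant that we compare against near the end of the talk.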
Using this, we can generate a synthetic data set, and we did that using those matrices. Five hundred synthetic values for each chosen dust sampling were generated. We used nine data points, or let's say nine intervals, from those trustworthy samples that we gathered and homogenized ourselves.
But we also used one additional measurement from big-bag sampling, which had a very low Selenium Dioxide content. This was needed to extend the parameter space towards low Selenium Dioxide values. I will show you what this looks like.
This is the simulated, or synthetic, data set. You can see 5,000 rows, that is, 500 times 10. The correlations between parameters are preserved here; we checked. This is a very nice data set for more detailed modeling, because now we have enough data available. We immediately thought of modeling it with neural networks. We did, and the results showed a nice statistical model.
We used a simple neural network with three TanH functions in one hidden layer, so a very simple approach. The R² values, both for training and validation, are very high. By the way, we split those 5,000 data points into two-thirds for training and one-third for validation. The Actual by Predicted plots also show a good fit.
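For orientation, this is roughly what such a network looks like outside JMP Pro; a sketch with scikit-learn, assuming the synthetic data sit in a single file (the file name is an assumption):

```python
# Rough analogue (scikit-learn) of the JMP Pro neural network: one hidden
# layer with three tanh nodes, 2/3 training and 1/3 validation split.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import r2_score

df = pd.read_csv("synthetic_all.csv")                  # 5,000 synthetic rows (assumed file name)
X, y = df.drop(columns="SeO2"), df["SeO2"]

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=1 / 3, random_state=0
)

net = MLPRegressor(hidden_layer_sizes=(3,), activation="tanh",
                   max_iter=5000, random_state=0)
net.fit(X_train, y_train)

print("R^2 train:", r2_score(y_train, net.predict(X_train)))
print("R^2 valid:", r2_score(y_val, net.predict(X_val)))
```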
We then exported this model. Let's go back to JMP: you can click here and save the prediction formulas. But more importantly, and I will jump back to the presentation, we wanted to test the model's predictions against the other big-bag sampling data that we had. We used those big-bag samples as validation points.
Here we have an actual measurement of Selenium Dioxide, and here is the predicted value. Plotting these two numbers in the Actual by Predicted plot, we saw that the model predicts the other measurements quite well. Only one big bag, number one, with two samples taken from it, is slightly overpredicted. The Actual by Predicted line here has an R² of 0.86 based on those measurements. We were quite happy with this result, and the next step was to present it to the people in the company.
More importantly, we presented it with this profiler. In the profiler we invoked the simulator, which you can do under the red triangle menu. Here, instead of just a single predicted value of the Selenium Dioxide content based on these 13 process parameters, we tried to simulate the distribution of the Selenium Dioxide using the uncertainties of the process parameters during the time when the dust was being collected into the big bag. We determined the mean and the standard deviation of these process parameters during that interval, and based on these values we could simulate the prediction for Selenium Dioxide.
We can also see in this profiler that the responses are more complex than simple linear or quadratic behavior; there are lots of interactions between these parameters. But more importantly, the simulator for prediction scatter gave us this distribution of Selenium Dioxide in this batch, in this big bag. In fact, the Distribution platform in JMP showed us that a normal two-mixture distribution is the best fit in this case. We were quite happy with this result because the distribution actually matches the value that was measured later.
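A minimal sketch of that prediction-scatter idea, assuming independent normal draws for the 13 parameters and using the fitted network from the previous sketch (file and column names are hypothetical):

```python
# Sketch of the prediction-scatter idea (done with the Profiler simulator in
# JMP); 'net' is the fitted network from the previous sketch, and the
# per-parameter means and standard deviations are assumptions read from a file.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

stats = pd.read_csv("bigbag_interval_stats.csv", index_col="parameter")  # columns: mean, sd
draws = {p: rng.normal(stats.loc[p, "mean"], stats.loc[p, "sd"], size=2000)
         for p in stats.index}
sim_inputs = pd.DataFrame(draws)          # column names/order must match the training data

pred = net.predict(sim_inputs)            # distribution of predicted SeO2 for this big bag
print("predicted SeO2: mean %.3f, 5th-95th pct [%.3f, %.3f]"
      % (pred.mean(), *np.percentile(pred, [5, 95])))
```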
The crucial step that I mentioned before was to take the correlations, that is, the covariances between factors, into account. We tested what happens if we do not include these correlations in the Monte Carlo simulation. Here you can see the Actual by Predicted plot of the same measurements, but where the synthetic data set was simulated without using these covariances. We can see that, first of all, the span of the Selenium Dioxide predictions is unrealistically large, and the points are much more scattered.
Comparing both models, the proper one in blue and the wrong one, where we do not take the correlations into account, in red, we can see that the blue one is much better and, of course, gives much more realistic predictions for Selenium Dioxide.
To conclude, despite the sparse dust sampling and the small initial data set, we derived a useful predictive model for the Selenium Dioxide content in the filter dust. We saw that the Monte Carlo approach was really what was required, considering the complexity of the process and the scarce data. We tested some other methods as well, but the neural network model proved to be better because those parameters, while not highly correlated, do interact with each other, and the neural network captured these interactions.
The predictive power of the model was tested with new dust samplings, and it showed realistic and practically useful predictions. The most important conclusion is that this approach led to a 100% return of the waste filter dust in a pilot attempt. This replaced 60% of an expensive primary raw material for glass decolorization and, of course, completely removed the cost of filter dust disposal.
That's it. Thank you very much.