Handling large data sets continues to present unique challenges, even in an era where advanced machine learning algorithms can process vast amounts of information. Relying on brute-force techniques to analyze massive data sets can lead to inefficiencies, model overfitting, noise accumulation, and diminishing returns from adding more data.
Intelligent subsampling, which involves selecting a representative fraction of the data, often provides a more targeted and insightful approach. Subsampling encourages more interpretable models, as the reduced data set size simplifies the relationships between variables. For these reasons, smart subsampling should be a preferred approach for a wide range of applications, including material science, biomedical research, environmental modelling, marketing analysis, and social sciences.
But why go brute force when you can go smart? Through an interactive demonstration using the latest capabilities of JMP Pro in the field of complex material formulation, this presentation shows that a well-designed subsampling approach, combined with both classical and advanced modeling techniques (multilinear regression, neural nets, SVM, generalized regression) can lead to robust predictions.
Hello. Thank you for watching this presentation for the 2025 edition of the Europe Discovery Summit conference. My name is Damien Perret. I am an R&D scientist at CEA in France, and I am presenting today with Carole Soual, who is a statistician at Ippon Innovation. Carole and I are very happy to be here today, and we would like to thank the steering committee for giving us the opportunity to talk about this work, which is about a smart subsampling approach applied to predictive modeling.
Let's start with a few words about the CEA. The CEA is a French government organization for research, development, and innovation in four areas: low-carbon energy, digital technology, medical technologies of the future, and defense and security.
The CEA has about 21,000 employees across 9 sites. We have strong relationships with universities through various joint research units, a high number of patents, and startup creation, with a budget of around €6 billion.
Ippon Innovation is a company of data scientists dedicated to providing industry with advanced algorithms. The objectives are to optimize manufacturing costs and test times, to improve the quality of complex products, and to detect weak signals in big data. We offer consulting, training, and on-demand advanced studies. We have several projects, such as customized statistical solutions, or advanced tools, some of which are patented.
The motivation for this work comes from the fact that even if the most recent machine learning algorithms can process large amounts of information, handling large data sets still presents challenges. Instead of using brute force techniques on massive data sets, which can degrade the prediction for several reasons, we believe that intelligent subsampling is a more insightful approach, which also leads to more interpretable models. Here we show an application in material science, but we can imagine that this approach could also be applied to new applications, even outside the fields of physics and chemistry.
For those who attended Discovery in 2021, a preview of this study was presented there to explain the approach. The approach has been refined considerably over the last four years, and the predictive ability of the method has been significantly improved, which is why we wanted to show you our latest results.
Our main objective in this work is to develop a well-designed subsampling approach, which is combined as you will see, with both classical and more advanced modeling techniques to reach robust predictions. To do that, we wanted the algorithms to be coded in JSL and implemented in JMP Pro 18.
For the example today, we show how this methodology was applied to the prediction of the glass melt viscosity. Therefore, the response of the model is the glass viscosity. Factors are the weight percent of the glass oxides, and our database contains both commercial and proprietary data sets.
Very briefly, glass is a non-crystalline material obtained by a rapid quenching, and it is a mixture of different oxides. The number of oxides is variable from 2 or 3 in a very simple glass to about 30, and even more in the most complex compositions.
An important point about glass viscosity is its very wide range of variation, which makes this property very challenging to predict. Glass viscosity is also highly sensitive to composition, and mechanisms like crystallization usually have a strong impact on the rheology of the melt.
There is a nice analogy with the lava behavior in geology to illustrate the complex relationship between crystallization and viscosity. At high temperature, lava is a viscous and homogeneous fluid, very similar to a melt of oxide glass. When lava temperature decreases, it usually induces fast and massive crystallization, which has a huge impact on the viscosity of the melt.
As you can see on this graph, the viscosity increase caused by this crystallization can reach up to eight orders of magnitude, also changing the rheological behavior of the melt and making the viscosity prediction very difficult.
What we wanted to show here is that viscosity prediction is not easy, even for very simple glass compositions. Here we selected three compositions of SBN glass, a very simple glass with only three oxides. We applied the best-known models from the literature to calculate the viscosity, then compared the predicted values with the experimental values we measured with our own device, shown in red on these graphs. You can see that even for a very simple glass, it is not easy to obtain one reliable value for the predicted viscosity.
Machine learning techniques are more and more applied in glass material science and for glass viscosity prediction in particular, but with a limited accuracy in the prediction. We tried to apply neural nets in the past to predict glass viscosity, but we were not very satisfied with the results we got.
For example, we observed some unexpected patterns and also deviations of several orders of magnitude. This was also the case in recent publications using neural nets, where we can see differences of up to six orders of magnitude between the prediction and the measurement.
In the end, we came to the conclusion that there was so much disparity in the global data set, and so many nonlinear physical and chemical mechanisms occurring in the glass melt and impacting its viscosity, that applying traditional machine learning to the global database was not the best approach.
Here is a good picture to give a view of our database, where each dot is one glass in a multidimensional view of the composition domain. Data may come from isolated studies, from experimental designs, or from parametric studies with variations of one component at a time, which in the end makes for a very heterogeneous and unevenly populated database.
The strategy is based on the hypothesis that instead of using all the data, we think it's better to train the predictive models only on similar data. For example, if we want to predict here, models will be created from the data we have in this area. Different models will be created if we want to predict the property here.
We say that the approach is dynamic because the predictive models depend on the composition and are fitted where we want to predict. It is automatic because we do not have to do this manually: every step is performed by algorithms implemented in our tool.
The strategy we applied to create training data sets is based on the design of experiment methodology. For each prediction, an optimal DOE is generated around the composition of interest and then each run of the design is replaced by the most similar experimental data present in the database, leading to the final training data set.
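As a minimal sketch of this replacement step, the logic could look like the following Python snippet. This is purely illustrative: the actual tool is coded in JSL, and the function name, the plain Euclidean distance, and the toy data are all assumptions made for readability.

```python
import numpy as np

def build_training_set(database, design_points):
    """For each run of a DOE generated around the composition of interest,
    find the most similar measured glass in the database (here by plain
    Euclidean distance on wt% vectors) and collect those glasses as the
    training set. Names and data are illustrative, not the CEA tool."""
    chosen = []
    for run in design_points:
        dists = np.linalg.norm(database - run, axis=1)  # distance to every glass
        chosen.append(int(np.argmin(dists)))            # index of the closest one
    return sorted(set(chosen))                          # de-duplicated training subset

# Toy database of 5 glasses described by 2 oxides (wt%)
db = np.array([[70.0, 10.0], [68.0, 12.0], [55.0, 20.0], [72.0, 9.0], [40.0, 30.0]])
# Two runs of a small local design around a target composition near [70, 10]
design = np.array([[68.5, 11.5], [71.5, 9.2]])
print(build_training_set(db, design))  # → [1, 3]
```

Each theoretical design run is thus replaced by a real, measured glass, so the models are always trained on experimental data rather than on hypothetical compositions.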
For the predictive part of the tool, four algorithms were implemented: a classical polynomial model obtained from multilinear regression, a GENREG model, a neural net model, and a support vector regression model, which is a new addition. At the end, we don't have one single predicted value, but four predicted values. By taking the median of these four predictions, we have a very robust estimation of the response. As we will see during the demo, several statistical criteria are also computed and analyzed to compare the performance of the different predictive algorithms.
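The aggregation of the four predictions into one robust estimate can be sketched as follows. This Python snippet only mirrors the logic (the tool itself is in JSL), and the numbers are illustrative example values.

```python
import statistics

# Illustrative predicted viscosities (dPa·s) from the four model types;
# the values are example numbers, not outputs of the actual tool.
predictions = {"MLR": 61.0, "GenReg": 67.0, "NeuralNet": 49.0, "SVR": 64.0}

# The median of the four predictions gives a robust final estimate,
# insensitive to any single model going badly wrong.
robust_estimate = statistics.median(predictions.values())
print(robust_estimate)  # → 62.5
```

With four values, the median is the mean of the two middle predictions, so one outlying model cannot pull the final estimate far off.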
Here are some of the key parameters. It was very important to take into account as many inputs from the glass experts as possible. One point of major importance was the origin and reliability of the data. For this reason, a large amount of time in this project was spent building a reliable database.
We also had to create specific algorithms to handle the nature and role of the oxides in viscosity, and we had to study different ways of calculating the distances between similar compositions. All of this was implemented in the tool's algorithms.
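To give a flavor of the distance question, one option among those that could be studied is a weighted Euclidean distance, where each oxide is weighted by its presumed influence on viscosity. The sketch below is hypothetical: the weights, names, and values are made up, and the distance actually used by the tool is implemented in JSL.

```python
import numpy as np

def weighted_distance(x, y, weights):
    """Weighted Euclidean distance between two compositions given as
    wt% vectors. The weights are a hypothetical way to encode how
    strongly each oxide influences viscosity."""
    x, y, w = (np.asarray(v, dtype=float) for v in (x, y, weights))
    return float(np.sqrt(np.sum(w * (x - y) ** 2)))

# Two similar glasses; give the first oxide (e.g. SiO2) double weight
print(weighted_distance([70.0, 10.0], [68.0, 12.0], [2.0, 1.0]))
```

Changing the weights changes which glasses count as "similar", which is exactly why several weighting and distance schemes had to be compared with the glass experts' input.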
Just before moving to the demo, here is a schematic view of the tool, also showing which evolutions have been recently added. Many algorithms have been updated, for both the sampling part and the predictive models. SVR has been implemented, as we said before. It is also now possible to enter either a single glass composition or a list of compositions, which makes it possible to evaluate the tool's capability using a large number of compositions.
In terms of calculation time, it takes less than a minute for one prediction. This is acceptable, but we are currently working on improving the efficiency of the code. As we said, everything is coded in JSL and comes as an add-in.
Now for the demo. Thank you, Damien. I'm going to share my screen. A JMP add-in has been developed since JMP Pro version 16 for the MS [inaudible 00:13:28] tool at CEA. It now runs in JMP Pro version 18. It offers two ways to run predictions for a glass. The first button in the JMP menu opens a user interface to predict the viscosity or Tg of a single glass at one or several temperatures. The second button is an automated version, giving predictions for a list of glasses at one temperature only.
If I go to the add-in folder, you can see the two main scripts that are run by the two buttons, and the different functions or expressions used by these two main scripts. The folder contains scripts for the steps of pre-processing the data, constructing the learning sample, fitting the prediction models, and finally building the output of the analysis.
Let's run the first button, the interactive version. The first step for the user is to select the folder containing the global database of all glass viscosities in JMP format. Then I select the composition elements, the oxides of the glass whose viscosity I want to predict. Then I click OK.
Here I have to enter the amount, as a percentage, of each element in the global composition of our glass. For example, the glass whose viscosity I want to predict contains 6.5% of CaO. The last step is to enter the temperatures. Here I want to predict the viscosity at two temperatures, 1,100 and 1,200 degrees. All four models explained by Damien are fitted at the two temperatures; the objective is to predict the viscosity of my glass. For two temperatures, it takes about one minute to give the results.
Finally, I get this output window. We can choose the most interesting temperature for further analysis of the predictions; here, 1,200 degrees. Clicking Go opens the final output window, which zooms in on the viscosity at 1,200 degrees.
The first outline contains the composition of the glass we want to predict and the elements that are actually considered in the models. Then we have an outline giving the quality of the models, with the mean absolute error or R-squared, for example. More details are available on the adaptive DOE method, such as the percentage of correlation between the optimal DOE and the adaptive DOE; here, 80% of the elements are correlated between the two DOEs, which is a very good result.
Prediction formulas and effects are also given. Here, for example, for the model selected by BIC, we can see that SiO2 is the most significant oxide in the model. The most important output is here: the summary of the predictions, with a graph showing the predicted viscosity values as a function of the model. We can see the predicted viscosity given by each point: 61 decipascal seconds for BIC F, 67 for GLM, 49 for the neural net, and 64 for SVR.
We also have the nearest glass, indicated by the green dashed horizontal line, with a measured true viscosity of 68 decipascal seconds. This nearest glass is here; it is very close to the one we want to predict. As a reminder, our glass has 6.5% of CaO, for example, while the nearest glass has 6.4% of CaO. We do not know the true viscosity of the glass to be predicted, so we compare the predictions to the true viscosity of its nearest glass.
Finally, for this glass, the median of the predictions is 62.5, which is a very good result given the measured viscosity of its nearest glass. GLM and SVR, here and here, are the best models for predicting the viscosity.
Damien is now going to present the analysis of the automated version and the output table of this tool.
The tool's capability was evaluated using a test set containing 7,500 data points. The metric of interest is the absolute error on the log of the viscosity, which reflects how far the predicted viscosity is from the reported value in terms of orders of magnitude. We usually consider a prediction very effective if the glass viscosity is predicted within one order of magnitude, over the whole range of possible viscosity values.
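The metric itself is simple, and can be sketched in a few lines of Python (the evaluation is actually carried out in JMP/JSL; the example values below are illustrative):

```python
import math

def abs_log_error(predicted, measured):
    """Absolute error on log10(viscosity): how many orders of magnitude
    the predicted value is from the reported one."""
    return abs(math.log10(predicted) - math.log10(measured))

# e.g. a prediction of 62.5 dPa·s against a reported 68 dPa·s
print(round(abs_log_error(62.5, 68.0), 3))  # → 0.037
```

A value below 1 means the prediction is within one order of magnitude of the reported viscosity, which is the target stated above.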
Here are the results we obtained, where the predicted viscosity is the median of the predictions from the four algorithms. We were quite impressed by the accuracy, because the error on the predicted viscosity is less than one order of magnitude, which was our target, for the vast majority of the data. Second, there are only very few samples for which the error is higher: the highest error is 3.3 orders of magnitude, which is much lower than with viscosity models from the literature.
Another important point is that the predictive accuracy is good over the whole range of viscosity values, from 1 to 10^6 decipascal seconds. The last point is that the accuracy is even better for the glass compositions we mostly focus on for our application.
We are currently investigating what lies behind the data showing the highest prediction errors. In terms of composition, it appears that the prediction error is highly correlated with the presence of chemical elements that are known to have a very strong impact on glass viscosity but are present in only a few compositions in the database. We also observed that the distance to the nearest neighbor was significantly higher for these outliers.
Finally, we also tried to check if there was an effect of the year of the reported data because we can imagine that the experimental devices for glass viscosity measurements were maybe less accurate 60 or 80 years ago. Indeed, these graphs tend to show that the error of prediction is lower for more recent data.
Finally, we observed that the four types of models give very close predictions in terms of quantiles of the absolute error. The maximum absolute error is significantly smaller for neural nets and even smaller for SVR, giving in the end a smaller average error. Because all four models were found to be effective, we decided to keep the median of the four predictions to ensure highly robust predictions.
As a conclusion, we have presented a predictive approach based on the dynamic subsampling of the data. We have implemented four different models, which are trained on a subset of similar data generated from an optimal DOE. This procedure is fully automatic with all steps coded in JSL and implemented as an add-in in JMP Pro 18.
The predictive ability of the tool has been evaluated on a data set containing about 7,500 glass compositions, giving very accurate results, especially for the glass compositions of interest.
Next, we would like to evaluate, and possibly implement, additional predictive techniques that we think could be promising for this application. We will also try to develop a Python interface to couple this tool with the finite element simulation codes we already use for the vitrification process.
Thank you very much for your attention.
Thank you.