Linear solvation energy relationship (LSER) models are used in adsorption and chromatography to describe how molecular interactions influence solute retention or adsorption. These models relate the partitioning coefficient of a solute to various molecular properties, enabling predictions based on solute descriptors, which can be looked up or calculated via quantum chemistry.

Mathematically, LSER models are expressed as linear equations, with coefficients obtained through multiple linear regression of experimental data from a set of solutes. Since obtaining data for solutes is labor-intensive, and solutes may have limitations (e.g., low solubility, high cost, or instability), selecting an optimal minimal set of solutes becomes important.

This study discusses strategies for selecting a chemically diverse minimal solute set that minimizes the standard error of the model's coefficients. Monte Carlo simulations (performed in JMP via Python integration) are used to explore potential solutes, considering cases where solute descriptors span a limited range. Theoretical upper and lower bounds for the standard error are presented. Both homoscedastic and heteroscedastic LSER models are considered. Finally, the impact of interdependencies among solute descriptors on the statistical robustness of these strategies is discussed.

 

 

Hi, everyone, and welcome to our poster. Today we're going to talk about linear solvation energy relationships, or LSER for short, which are a type of thermodynamic model. Our focus is on processes where a solute interacts with a solid. A simple example is a dye molecule, which we call the solute, binding to a textile, which acts as the solid phase.

How strong the interaction is depends both on the dye properties and on the specific characteristics of the textile. We can quantify this interaction using, for example, properties like adsorption constants. In LSER models, which were introduced by the chemist Michael Abraham, this property is expressed as a linear combination of two sets of quantities. The first is the molecular descriptors, the uppercase letters E, S, A, B, and V: five descriptors representing characteristics like the acidity or polarizability of a solute. These values are well documented, tabulated, and collected in large databases, like the one from the Helmholtz Centre for Environmental Research.

The second part of the LSER models is the set of lowercase letters c, e, s, a, b, and v. These are called system constants, and they have to be determined experimentally by performing multiple linear regression. LSER models are widely used in chemical and environmental engineering to describe complex systems: they help us predict, for example, the toxicity of chemicals, sensory irritation properties, soil-water sorption coefficients, and so on.

They are also very much alive, and their applicability is currently expanding, for example, through various quantum chemical techniques. So much for the models themselves. The question we ask ourselves is as follows: since the LSER model is a linear regression, we want to come up with a minimal set of solutes that lets us estimate the system coefficients as well as possible, with the lowest possible uncertainty. Looking at this equation, we know from multiple linear regression that we need at least 6 solutes to determine the six system coefficients.
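As a sketch of what this regression looks like in practice, here is a minimal hypothetical example in Python/NumPy (the descriptor values and noise level are invented for illustration; the actual study used JMP with its Python integration). It fits the Abraham equation, property = c + eE + sS + aA + bB + vV, to simulated data for twelve solutes, a few more than the minimum of six:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical descriptor matrix for 12 solutes; columns stand for E, S, A, B, V.
X = rng.uniform(0.0, 2.0, size=(12, 5))

# Ground-truth system constants c, e, s, a, b, v, all set to 1 as in the study.
beta_true = np.ones(6)

# Simulated property (e.g. log K) with normally distributed measurement noise.
X1 = np.column_stack([np.ones(12), X])  # column of ones for the intercept c
y = X1 @ beta_true + rng.normal(0.0, 0.05, 12)

# Multiple linear regression: least-squares estimates of the six system constants.
beta_hat, *_ = np.linalg.lstsq(X1, y, rcond=None)
print(beta_hat)  # all six estimates should lie near 1
```

With only six solutes the system would be exactly determined and there would be no degrees of freedom left to estimate the uncertainty, which is why a few extra solutes are used here.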

From a statistical point of view, what we know for a linear regression problem is that, to get reliable estimates, our molecular descriptors must span as large a range as possible. This can be illustrated with a very simple homoscedastic model with only one descriptor and normally distributed noise, as given here. The noise standard deviation determines the standard error of the estimator, as shown. If you look at this equation, you will also see that the standard deviation of the x values, the molecular descriptors, appears in it as well.

What we have demonstrated is that this ratio depends only on the number of points. In other words, for the standard error of the slope, this relation holds: the larger the coverage, the more reliable the estimate of the corresponding slope. In fact, it is more important to have large coverage than a large number of experimental points. That is the purely statistical aspect. But there is another, chemical aspect: picking solutes from a data bank, we may by chance take solutes which are chemically very similar.
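This point can be checked with a small hypothetical simulation (Python/NumPy; noise level and point counts are invented). The slope of a one-descriptor model y = β₀ + β₁x + ε is re-fitted many times with fresh noise, once for a narrow descriptor range and once for a doubled range, keeping the number of points fixed. Since SE(β̂₁) = σ/√(Σᵢ(xᵢ−x̄)²), doubling the range should roughly halve the empirical standard error:

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = 0.1   # noise standard deviation
n = 10        # number of experimental points (held fixed)

def empirical_se_slope(x, n_rep=10_000):
    """Standard deviation of the fitted slope over repeated noisy regressions."""
    slopes = np.empty(n_rep)
    for i in range(n_rep):
        y = 1.0 + 1.0 * x + rng.normal(0.0, sigma, x.size)
        slopes[i] = np.polyfit(x, y, 1)[0]  # fitted slope of a degree-1 fit
    return slopes.std()

x_narrow = np.linspace(0.0, 1.0, n)   # small descriptor coverage
x_wide = np.linspace(0.0, 2.0, n)     # doubled coverage, same n

se_narrow = empirical_se_slope(x_narrow)
se_wide = empirical_se_slope(x_wide)
print(se_narrow, se_wide)  # the wide-range SE is about half the narrow one
```

The same σ and the same n appear in both runs; only the spread of the x values differs, which is exactly the "coverage" argument made above.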

That means the five molecular descriptors are correlated with each other, a problem known as multicollinearity. The question, of course, is whether taking such solutes is a major issue, and whether reducing multicollinearity might even be more important than maximizing the range. To answer this question, we decided to explore different strategies in a more systematic way. With this, I will hand over to Preethi, who will explain our approach in more detail.

Thank you, Pavel. To address these issues while minimizing the experimental effort, we explore two strategies for selecting minimal solute sets. In strategy 1, we try to minimize descriptor correlation: we reduce multicollinearity by selecting compounds with minimal interdependence among the descriptors. This strategy aims to isolate the individual contribution of each descriptor and improve statistical robustness. In strategy 2, a set of solutes with maximum differences between their descriptors is selected.

To compute distances between compounds, the descriptors are first normalized between 0 and 1. This ensures we pick from a diverse chemical space, enhancing predictive power. This is a deterministic method, meaning we get the same set of solutes, and the same results, every time, provided we use the same database with the same descriptors. Both strategies are tested by selecting data sets of 20 and 50 compounds from our full data set of over 5,000 compounds. For the calculations, we assume the system coefficients, represented by the lowercase letters, are all equal to 1. For strategy 1, we calculate the correlation matrix using the Pearson correlation coefficient. From this, the average absolute correlation (AAC) is computed to quantify the multicollinearity.
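The AAC calculation can be sketched as follows (Python/NumPy; the descriptor table here is random stand-in data, not the real 5,000-compound database):

```python
import numpy as np

# Stand-in descriptor table: 20 solutes x 5 descriptors (E, S, A, B, V).
rng = np.random.default_rng(2)
D = rng.uniform(0.0, 2.0, size=(20, 5))

# Pearson correlation matrix between the five descriptor columns.
R = np.corrcoef(D, rowvar=False)  # shape (5, 5)

# Average absolute correlation (AAC): mean of |r| over the off-diagonal entries.
off_diag = ~np.eye(R.shape[0], dtype=bool)
aac = np.abs(R[off_diag]).mean()
print(round(aac, 3))
```

An AAC near 0 then indicates nearly uncorrelated descriptors, and an AAC near 1 indicates strong multicollinearity, as described next.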

If the AAC is close to 1, there is a high correlation between the descriptors; if it is close to 0, there is essentially none. For strategy 1, we perform 10,000 iterations, checking different combinations of 20 and 50 compounds and keeping the ones with the lowest AAC values. After selecting these smaller data sets, multiple linear regression is performed, adding random normal noise to the property in every iteration. This lets us analyze how noise impacts the coefficient distributions. For strategy 2, the descriptors are normalized using min-max scaling; the starting compound is chosen based on the median of the normalized descriptor values, and subsequent compounds are selected by maximizing dissimilarity. This ensures that the selected compounds span a diverse chemical space. Now, let's compare the results of both strategies.
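The strategy 2 procedure just described can be read as a greedy max-min selection; here is one plausible sketch of it (Python/NumPy, on random stand-in data, so the exact tie-breaking and distance metric are assumptions): min-max scale the descriptors, start from the compound closest to the median point, then repeatedly add the compound whose nearest already-selected neighbor is farthest away.

```python
import numpy as np

rng = np.random.default_rng(3)
D = rng.uniform(0.0, 3.0, size=(200, 5))  # stand-in for the ~5,000-compound table

# Min-max scaling: each descriptor column mapped to [0, 1].
Dn = (D - D.min(axis=0)) / (D.max(axis=0) - D.min(axis=0))

def select_diverse(Dn, k):
    """Greedy max-min selection: start near the median point, then repeatedly
    add the compound whose nearest already-selected neighbor is farthest."""
    median = np.median(Dn, axis=0)
    start = int(np.argmin(np.linalg.norm(Dn - median, axis=1)))
    chosen = [start]
    # Euclidean distance of every compound to its nearest selected compound.
    d_min = np.linalg.norm(Dn - Dn[start], axis=1)
    while len(chosen) < k:
        nxt = int(np.argmax(d_min))
        chosen.append(nxt)
        d_min = np.minimum(d_min, np.linalg.norm(Dn - Dn[nxt], axis=1))
    return chosen

subset = select_diverse(Dn, 20)
print(subset[:5])  # indices of the first few selected compounds
```

Because there is no randomness in the selection itself, re-running it on the same table returns the same compounds, which matches the determinism described above.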

Introducing random normal noise over 10,000 iterations of the multiple linear regression significantly shapes the resulting coefficient distributions, which are plotted using the JMP software. Due to the central limit theorem, the coefficient distributions tend toward a Gaussian-shaped curve, with the empirical histograms stabilizing as the number of iterations increases. For simplicity, we chose 2 of the 6 system coefficients for the discussion here. First, let's compare the AAC values of strategies 1 and 2. We see that strategy 1 has a lower AAC value than strategy 2, which is what we expect, since the main goal of the first strategy is precisely to minimize the correlation between the descriptors.
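The noise-injection loop behind these distributions can be sketched like this (Python/NumPy stand-in for the JMP workflow; the 20-compound subset, noise level, and the choice of tracking the e coefficient are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.uniform(0.0, 2.0, size=(20, 5))   # hypothetical 20-compound subset
X1 = np.column_stack([np.ones(20), X])    # intercept column for c
beta_true = np.ones(6)                    # ground-truth system constants
y_clean = X1 @ beta_true

# Re-fit with fresh normal noise in every iteration and collect one coefficient
# (here the 'e' coefficient, index 1) to build its empirical distribution.
n_iter = 10_000
e_coef = np.empty(n_iter)
for i in range(n_iter):
    y = y_clean + rng.normal(0.0, 0.1, 20)
    e_coef[i] = np.linalg.lstsq(X1, y, rcond=None)[0][1]

print(e_coef.mean(), e_coef.std())  # distribution centered near the ground truth 1
```

The spread of this histogram is the quantity the two strategies are compared on: a subset with better descriptor coverage yields a narrower distribution around the ground truth.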

On the one hand, we reduced the multicollinearity; on the other hand, we do not get a diverse chemical space: the mean values of the system coefficients, which lie between 0.7 and 1.5, deviate significantly from the ground-truth value of 1, with moderately high standard deviations, usually around 0.3. Turning to strategy 2, we see that its AAC values are almost twice those of strategy 1, indicating a stronger correlation between the descriptors and hence more multicollinearity.

Interestingly, the mean values of the system coefficients lie closely around the ground truth of 1, and the standard deviations of about 0.2 show that strategy 2 is better suited for achieving a diverse chemical space and predictive accuracy; moreover, the data set with 50 compounds has lower standard deviations than the one with 20 compounds. When we compare both strategies against the full data set of over 5,000 compounds, the AAC values are higher, as expected, since all compounds are considered and no minimization is involved. The mean is closer to the ground truth, and the standard deviations for the full data set are 10 times lower than those obtained with either strategy.

The higher the number of compounds chosen, the narrower the histogram distribution becomes. In conclusion, strategy 2, which looks for solutes with maximum differences between their descriptors, appears to provide a data set that better aligns with and represents the larger chemical space. Both strategies have distinct strengths, depending on the user's final objectives. We thank JMP for providing such powerful and intuitive tools for statistical analysis: the multiple linear regression functionality allowed us to effectively model the relationships between our variables, and the histogram plotting tools provided a clear visualization of our data distributions, aiding our analysis and interpretation. Thank you.


Skill level

Intermediate

Published on 12-15-2024 08:23 AM by Community Manager | Updated on 03-18-2025 01:12 PM


Start:
Thu, Mar 13, 2025 06:50 AM EDT
End:
Thu, Mar 13, 2025 07:30 AM EDT
Ballroom Gallery- Ped 3