Efficient Strategies for Selecting Minimal Solute Sets in Linear Solvation Energy Relationship (LSER) Models
Linear solvation energy relationship (LSER) models are used in adsorption and chromatography to describe how molecular interactions influence solute retention or adsorption. These models relate a solute's partition coefficient to descriptors of its molecular properties, enabling predictions from solute descriptors that can be looked up or calculated via quantum chemistry.
Mathematically, LSER models are expressed as linear equations, with coefficients obtained through multiple linear regression of experimental data from a set of solutes. Because obtaining experimental data for each solute is labor-intensive, and some solutes have practical limitations (e.g., low solubility, high cost, or instability), it is important to select an optimal minimal set of solutes.
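The regression step described above can be illustrated with a minimal sketch. It assumes an ordinary least squares fit with an intercept and homoscedastic errors; the function name `fit_lser`, the choice of five descriptors, and all numerical values are hypothetical and use synthetic data, not results from this study.

```python
import numpy as np

def fit_lser(descriptors, log_k):
    """OLS fit of log_k = c + sum_j coef_j * descriptor_j (illustrative form).

    Returns the fitted coefficients (intercept first) and their standard errors.
    """
    n, p = descriptors.shape
    # Design matrix: a column of ones for the intercept, then the descriptors.
    X = np.column_stack([np.ones(n), descriptors])
    beta, *_ = np.linalg.lstsq(X, log_k, rcond=None)
    residuals = log_k - X @ beta
    dof = n - (p + 1)                      # residual degrees of freedom
    sigma2 = residuals @ residuals / dof   # residual variance estimate
    cov = sigma2 * np.linalg.inv(X.T @ X)  # coefficient covariance matrix
    return beta, np.sqrt(np.diag(cov))

# Synthetic example: 12 hypothetical solutes, 5 hypothetical descriptors.
rng = np.random.default_rng(0)
descriptors = rng.uniform(0.0, 1.0, size=(12, 5))
true_beta = np.array([0.2, 0.5, -0.3, 1.0, -2.0, 3.0])  # assumed values
noise = rng.normal(0.0, 0.05, size=12)
log_k = true_beta[0] + descriptors @ true_beta[1:] + noise

beta, se = fit_lser(descriptors, log_k)
```

The standard errors depend on the solute set only through the matrix `X.T @ X`, which is why the choice of solutes, and not only their number, determines how precisely the coefficients are estimated.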
This study discusses strategies for selecting a chemically diverse minimal solute set that minimizes the standard error of the model's coefficients. Monte Carlo simulations (performed in JMP via Python integration) are used to explore potential solutes, considering cases where solute descriptors span a limited range. Theoretical upper and lower bounds for the standard error are presented. Both homoscedastic and heteroscedastic LSER models are considered. Finally, the impact of interdependencies among solute descriptors on the statistical robustness of these strategies is discussed.
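As a rough illustration of a Monte Carlo search over candidate solute sets, the sketch below draws random subsets from a pool and keeps the one with the smallest worst-case coefficient standard error. It is not the procedure used in this study: the selection criterion (the largest standard error per unit residual standard deviation), the homoscedastic assumption, the names `coefficient_se_factors` and `monte_carlo_select`, and the synthetic descriptor pool are all assumptions made for illustration.

```python
import numpy as np

def coefficient_se_factors(descriptors):
    """sqrt(diag((X^T X)^-1)): coefficient standard errors per unit residual sigma."""
    X = np.column_stack([np.ones(len(descriptors)), descriptors])
    return np.sqrt(np.diag(np.linalg.inv(X.T @ X)))

def monte_carlo_select(pool, subset_size, n_trials=2000, seed=0):
    """Randomly sample solute subsets; keep the one with the smallest
    maximum standard-error factor (an assumed criterion)."""
    rng = np.random.default_rng(seed)
    best_idx, best_score = None, np.inf
    for _ in range(n_trials):
        idx = rng.choice(len(pool), size=subset_size, replace=False)
        X = np.column_stack([np.ones(subset_size), pool[idx]])
        # Skip subsets whose design matrix is rank deficient.
        if np.linalg.matrix_rank(X) < X.shape[1]:
            continue
        score = coefficient_se_factors(pool[idx]).max()
        if score < best_score:
            best_idx, best_score = np.sort(idx), score
    return best_idx, best_score

# Synthetic pool: 40 hypothetical candidate solutes with 5 descriptors each.
rng = np.random.default_rng(1)
pool = rng.uniform(0.0, 1.0, size=(40, 5))
best_idx, best_score = monte_carlo_select(pool, subset_size=8, n_trials=500)
```

Under these assumptions, the full pool gives a lower bound on the achievable score, since adding solutes can only shrink the coefficient variances; a heteroscedastic variant would weight each row of the design matrix by the inverse of that solute's error standard deviation.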