Hi @AsymptoticRules,
Welcome to the Community!
Using chemical structures as categorical factors comes with several drawbacks:
- You may be able to analyze and select which molecule(s) perform best, but not always understand why they perform best (polarity, hydrophilic/lipophilic behaviour, molecular volume, number of atoms, H-donors/acceptors, ...),
- You can analyze the response for the molecular structures used in the design, but you cannot predict the response for new molecules not yet included in the experimental design,
- Using a categorical factor with so many levels comes at the price of many possible combinations with the other factors, creating a design with a high number of experiments (which may not be very convenient at an early stage like the screening phase).
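To put rough numbers on that last point, here is a small sketch. The three 2-level process factors and the reduced set of 4 molecules are assumptions for illustration only, not something taken from your study.

```python
from math import prod

def full_factorial_runs(levels):
    """Number of runs in a full factorial: product of the level counts."""
    return prod(levels)

def main_effects_parameters(levels):
    """Intercept plus (levels - 1) parameters for each factor."""
    return 1 + sum(n - 1 for n in levels)

# Hypothetical setup: one 10-level categorical factor (the fatty acid)
# plus three 2-level process factors (assumed, not from the original question).
all_molecules = [10, 2, 2, 2]
# Same hypothetical design after keeping only 4 molecules.
reduced = [4, 2, 2, 2]

for label, levels in [("10 molecules", all_molecules), ("4 molecules", reduced)]:
    print(label,
          "| full factorial runs:", full_factorial_runs(levels),
          "| main-effects parameters:", main_effects_parameters(levels))
```

The parameter count is the minimum number of runs needed just to estimate main effects, so the categorical factor alone already dominates the size of the design.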
As you mentioned being in a screening phase, it would be interesting to reduce the number of fatty acid candidates, and with it the number of experiments and interactions to screen. Keeping only the molecules with the highest chemical variability should still let you detect significant effects and interactions. From there, you can augment the design into an optimization/predictive design in a second step, possibly with other molecule candidates.
To reduce the number of fatty acid candidates, I would try to analyze the chemical properties/molecular descriptors of the initial 10 fatty acids you plan to screen. Here is how I would do it:
- Calculate/extract molecular descriptors from the chemical structures (several options are available: Python libraries like RDKit can calculate molecular descriptors, or you can extract them from public databases like PubChem, ChemSpider, and many others...)
- Use a PCA (or another dimension reduction analysis) to keep a large part of the chemical information in a low number of dimensions (chemical properties/molecular descriptors are frequently highly correlated, so you may be able to keep >70% of the chemical information of this chemical class with only one principal component). This step should facilitate the analysis and selection of molecule candidates.
- Plot the molecule candidates on a Parallel Coordinates Plot or another visualization, using their principal component scores or raw molecular attributes, to be able to select the most dissimilar molecules (at least a high and a low level for each principal component, for example). You can also simulate a DoE based only on the principal components as covariate factors, and see which molecules would have been selected in a D-optimal/screening design.
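The steps above can be sketched in Python as follows. This is only a minimal illustration under stated assumptions: the ten fatty acids are a set I picked for the example (not your actual candidates), and the descriptors are limited to what can be derived from the molecular formula (carbon count, number of C=C double bonds, molecular weight). In practice you would use a richer descriptor table from RDKit or a public database.

```python
import numpy as np

# Illustrative candidates (an assumption): (name, carbons, C=C double bonds).
fatty_acids = [
    ("caprylic", 8, 0), ("capric", 10, 0), ("lauric", 12, 0),
    ("myristic", 14, 0), ("palmitic", 16, 0), ("stearic", 18, 0),
    ("oleic", 18, 1), ("linoleic", 18, 2), ("alpha-linolenic", 18, 3),
    ("arachidic", 20, 0),
]
names = [fa[0] for fa in fatty_acids]
carbons = np.array([fa[1] for fa in fatty_acids], dtype=float)
double_bonds = np.array([fa[2] for fa in fatty_acids], dtype=float)

# Molecular weight from the formula C_n H_(2n - 2d) O_2 (standard atomic weights).
mol_weight = 12.011 * carbons + 1.008 * (2 * carbons - 2 * double_bonds) + 2 * 15.999

# Descriptor table: rows = molecules, columns = descriptors.
X = np.column_stack([carbons, double_bonds, mol_weight])

# Standardize each descriptor, then run PCA through an SVD.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
U, s, Vt = np.linalg.svd(Z, full_matrices=False)
scores = U * s                      # principal component scores
explained = s**2 / np.sum(s**2)     # share of variance per component

# Keep the lowest- and highest-scoring molecule on each retained component.
n_keep = 2
selected = set()
for k in range(n_keep):
    selected.add(names[int(np.argmin(scores[:, k]))])
    selected.add(names[int(np.argmax(scores[:, k]))])

print("Explained variance ratio:", np.round(explained, 3))
print("Selected molecules:", sorted(selected))
```

Because molecular weight is an exact linear function of the other two descriptors here, the third component carries essentially no variance, which is a small-scale version of the correlation argument above.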
You can then use these selected molecules as the levels of your categorical factor in the design, or directly use the principal components as continuous factors/covariates in the design.
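To illustrate the covariate idea, here is a rough sketch of which molecules a D-optimal criterion would pick. Everything specific in it is an assumption: the ten candidates are illustrative, standardized chain length and double-bond count stand in for principal component scores, the model only has an intercept and two main effects, and a brute-force search replaces the exchange algorithms that DoE software typically uses.

```python
from itertools import combinations
import numpy as np

# Illustrative candidates (an assumption): name -> (carbons, C=C double bonds).
candidates = {
    "caprylic": (8, 0), "capric": (10, 0), "lauric": (12, 0),
    "myristic": (14, 0), "palmitic": (16, 0), "stearic": (18, 0),
    "oleic": (18, 1), "linoleic": (18, 2), "alpha-linolenic": (18, 3),
    "arachidic": (20, 0),
}
names = list(candidates)
raw = np.array([candidates[n] for n in names], dtype=float)

# Standardized covariates (stand-ins for principal component scores).
Z = (raw - raw.mean(axis=0)) / raw.std(axis=0, ddof=1)

# Model matrix for the candidate set: intercept + two main effects.
model = np.column_stack([np.ones(len(names)), Z])

def d_criterion(rows):
    """Determinant of X'X for the chosen candidate rows (larger is better)."""
    X = model[list(rows)]
    return float(np.linalg.det(X.T @ X))

# Exhaustive search over all subsets of 4 molecules (210 subsets for 10 candidates).
n_runs = 4
best = max(combinations(range(len(names)), n_runs), key=d_criterion)
best_names = [names[i] for i in best]

print("D-optimal subset:", best_names)
```

In a real study the covariates would be combined with your other factors in the same design, so the selection would be done by the DoE platform rather than on the molecules alone.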
On this topic, you might find this presentation interesting: https://community.jmp.com/t5/Discovery-Summit-Europe-2017/Increase-Efficiency-and-Model-Applicabilit...
This is only one possible option. I'm sure other members of this forum have different experiences with molecules as factors. Personally, I always try to transform categorical information into continuous information whenever possible with this type of approach.
I hope this helps,
Victor GUILLER
L'Oréal Data & Analytics
"It is not unusual for a well-designed experiment to analyze itself" (Box, Hunter and Hunter)