Machine Learning for Formulation Science: A Technical Framework for Predictive, Data-Sparse Formulation Optimization

Formulation science – the design of chemical mixtures like detergents, agrochemicals, or cosmetics – is traditionally empirical and labor-intensive. Recent advances in machine learning (ML) promise to accelerate formulation R&D by intelligently exploring vast composition spaces. This review outlines key ML frameworks suitable for formulation (especially when data are scarce), including Bayesian optimization, graph neural networks, and generative models, and discusses validation metrics relevant to formulation quality.

Bayesian Optimization Strategies

Bayesian optimization (BO) is widely adopted for optimization under scarce data. It treats the formulation-property relationship as a black box. A Gaussian process (GP) or similar surrogate models the objective(s) (e.g. stability, viscosity) as a function of ingredient ratios, along with an uncertainty estimate. At each step, an acquisition function (like Expected Improvement) proposes the next experiment balancing exploration and exploitation. Critically, BO efficiently handles expensive experiments and constraints (e.g. maximum solubility, pH range).
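To make the acquisition step concrete, here is a minimal sketch of Expected Improvement for a maximization problem. It assumes the surrogate has already produced a predictive mean and standard deviation for each candidate. The candidate names, the numbers, and the exploration margin `xi` are illustrative assumptions, not values from any study cited here.

```python
import math

def normal_pdf(z):
    """Standard normal probability density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def normal_cdf(z):
    """Standard normal cumulative distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def expected_improvement(mu, sigma, best_so_far, xi=0.01):
    """Expected Improvement when larger objective values are better.

    mu, sigma   : surrogate predictive mean and standard deviation
    best_so_far : best objective value measured so far
    xi          : small margin that nudges the search toward exploration
    """
    improvement = mu - best_so_far - xi
    if sigma <= 0.0:
        # No predictive uncertainty: only a certain gain counts.
        return max(improvement, 0.0)
    z = improvement / sigma
    return improvement * normal_cdf(z) + sigma * normal_pdf(z)

# Hypothetical surrogate output (mean, std) for three untested formulations.
candidates = {
    "A": (0.80, 0.02),  # slightly above the current best, low uncertainty
    "B": (0.78, 0.10),  # slightly below the current best, high uncertainty
    "C": (0.70, 0.05),  # clearly below the current best
}
best_so_far = 0.79

scores = {name: expected_improvement(mu, sd, best_so_far)
          for name, (mu, sd) in candidates.items()}
next_experiment = max(scores, key=scores.get)
print(scores)
print("next experiment:", next_experiment)
```

In this toy case the uncertain candidate B scores highest even though its predicted mean is below the current best, which is the exploration side of the trade-off.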

For example, Waibel et al. (2025) applied BO to biologic drug formulations, simultaneously optimizing three properties (melting temperature, colloidal stability, and interfacial stability) of a monoclonal antibody. Their BO workflow identified an optimal formulation in just 33 experiments, far fewer than a full factorial screen. Importantly, they incorporated constraints (e.g. a fixed pH/osmolality range) in the acquisition step. Such multi-objective BO approaches (Pareto optimization) are directly transferable to formulations where trade-offs (e.g. high stability vs. low viscosity) must be balanced. In general, BO can achieve a 60–90% reduction in experimental runs compared to brute-force searches, especially when formulations involve dozens of potential components.
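The idea of a Pareto trade-off can be illustrated with a small non-dominated filter. This is a sketch only: the measurements are hypothetical, and viscosity is negated so that both objectives are maximized. It is not the workflow used by Waibel et al.

```python
def dominates(a, b):
    """True if a is at least as good as b on every objective and
    strictly better on at least one (all objectives maximized)."""
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))

def pareto_front(points):
    """Return the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

# Hypothetical measurements: (stability score, -viscosity in mPa*s).
measured = [
    (0.9, -120.0),  # most stable, but most viscous
    (0.8, -60.0),
    (0.7, -90.0),   # worse than (0.8, -60.0) on both objectives
    (0.6, -40.0),   # least stable, but least viscous
]

front = pareto_front(measured)
print(front)
```

A multi-objective BO loop would propose experiments expected to push this front outward rather than chase a single best value.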

Graph Neural Networks for Solubility/Miscibility Prediction

Graph Neural Networks (GNNs) are powerful for modeling molecular interactions. In formulation, a key task is predicting the solubility or miscibility of components (e.g. an active in a solvent blend). GNNs encode molecules as graphs (atoms as nodes, bonds as edges) and learn how molecules interact. Recent work shows strong performance: for example, Amiri & Khaleseh (2025) trained a graph convolutional network on 27,000 measurements of drug solubility in binary solvent mixtures across temperatures. Their model achieved a mean absolute error of just 0.28 log units (15% better than previous ML approaches) and reliably predicted the solubility of new molecules (MAE < 0.5). Critically, they demonstrated that GNNs could model non-linear solvent–solute interactions and reduce experimental needs by 60–80%. In practice, GNNs or message-passing networks can predict whether two ingredients are compatible (miscible) or the solubility limit of an active, aiding selection of solvent systems without trial-and-error.
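As a rough illustration of the graph encoding, the sketch below represents ethanol (heavy atoms only) as nodes and edges and runs one round of sum-aggregation message passing, followed by a sum readout. It is a toy with no learned weights; the two-element atom features are an assumption made for brevity and do not reproduce the architecture of the cited study.

```python
# Ethanol heavy-atom graph: C0 - C1 - O2 (hydrogens omitted for brevity).
atoms = ["C", "C", "O"]
bonds = [(0, 1), (1, 2)]

# Toy atom features: [is_carbon, is_oxygen].
features = [[1.0, 0.0] if a == "C" else [0.0, 1.0] for a in atoms]

def build_adjacency(n_atoms, bonds):
    """Map each atom index to the indices of its bonded neighbors."""
    adj = {i: [] for i in range(n_atoms)}
    for i, j in bonds:
        adj[i].append(j)
        adj[j].append(i)
    return adj

def message_passing_step(features, adj):
    """Update each atom as its own features plus the sum of its neighbors'.

    A trained GNN would apply learned weights and a non-linearity here.
    """
    updated = []
    for i, own in enumerate(features):
        agg = list(own)
        for j in adj[i]:
            agg = [a + b for a, b in zip(agg, features[j])]
        updated.append(agg)
    return updated

def readout(features):
    """Sum-pool atom vectors into a single molecule-level vector."""
    return [sum(column) for column in zip(*features)]

adj = build_adjacency(len(atoms), bonds)
h1 = message_passing_step(features, adj)
molecule_vector = readout(h1)
print(h1)
print(molecule_vector)
```

After one step each atom's vector already reflects its bonded neighbors. In a solubility model, solute and solvent vectors of this kind would be combined and passed to a regression head.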

ML with Proprietary or Incomplete Data

Formulation R&D often suffers from limited, proprietary data. When datasets are small or incomplete (missing measurements for some formulations), ML must adapt. Strategies include transfer learning (pre-training on public molecular data), active learning (selecting experiments to maximize information gain), and physics-informed models. Transfer learning can leverage large chemical databases to learn feature embeddings, which are then fine-tuned on a small formulation dataset. Active learning, of which BO is a closely related special case, chooses the most informative samples to measure next. Federated learning is emerging: companies can collaboratively train a shared model by exchanging (optionally encrypted) model updates, without sharing raw formulations. Despite these techniques, data scarcity is a persistent limitation. As one recent dataset paper notes, “availability of experimental data to train or validate ML models is the most critical factor”. Open datasets (e.g. the shampoo formulations dataset below) are a key step in addressing this gap.
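One simple way to pick "the most informative sample" is query-by-committee: train several models and measure the formulation they disagree on most. The sketch below assumes the committee predictions already exist. The formulation labels and values are hypothetical, and disagreement is measured with a plain standard deviation.

```python
import statistics

def committee_disagreement(predictions):
    """Spread of the committee's predictions for one candidate."""
    return statistics.pstdev(predictions)

# Hypothetical stability predictions from three models for each
# untested formulation in the candidate pool.
pool = {
    "F1": [0.71, 0.70, 0.72],  # models agree
    "F2": [0.40, 0.75, 0.55],  # models disagree strongly
    "F3": [0.90, 0.88, 0.91],  # models agree
}

ranked = sorted(pool,
                key=lambda name: committee_disagreement(pool[name]),
                reverse=True)
next_to_measure = ranked[0]
print(ranked)
print("measure next:", next_to_measure)
```

Measuring F2 first resolves the largest disagreement, which is the sense in which active learning spends a small experimental budget where the model knows least.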

Generative Models for Novel Formulations

Generative AI (VAEs, GANs, or diffusion models) offers a way to propose new candidate formulations or molecules. In drug discovery, such models propose novel molecules with desired properties. By analogy, generative models could design new surfactant structures or solvent blends. For example, a VAE could be trained to encode ingredient combinations and then sampled in regions of latent space associated with high stability or low cost. While literature on formulation-specific generative models is nascent, the potential is recognized: open-source toolkits such as DeepChem already include generative architectures for molecules, and property predictors such as Chemprop can be used to score generated candidates. We anticipate that future platforms will allow formulators to “ask” for a novel formulation given target specs, akin to AI-driven molecular design. This remains an active area of research.

Validation Metrics (Stability, Rheology, Shelf Life)

Any ML-predicted formulation must be validated by experiments. Key metrics include:

Stability: Chemical and physical stability tests measure how a formulation holds up over time. Accelerated stability protocols (e.g. elevated temperature/humidity) can predict shelf life. For example, a formulation is stressed at 40°C and monitored until a property (e.g. viscosity, potency) degrades. Kinetic models (often Arrhenius-based) extrapolate to room-temperature longevity. ML models may predict degradation rates or shelf life in months from limited stability data.
Rheology: Flow properties such as viscosity, yield stress, and viscoelastic moduli determine performance. Rheological tests (oscillatory or shear rheometry) quantify whether a formulation will pour, spray, or spread as intended. Formulations often need a specific viscosity range (e.g. low enough to pump but high enough to suspend particles). ML surrogate models predict viscosity profiles from composition, but final proof is measuring the actual rheogram. Typical endpoints include the zero-shear viscosity and shear-thinning behavior.
Other Performance Metrics: Depending on the application, these include cleaning efficacy (e.g. oil-lift tests), foam stability (for detergents), and drug release profile (for pharma). ML can output proxy predictions, but experimental validation (e.g. percent cleaning vs. control) remains essential. In agrochemicals, metrics include droplet spreading on leaves or uptake rates, which are then measured in bioassays.
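The Arrhenius extrapolation mentioned under Stability can be sketched as follows. The sketch assumes first-order degradation and rate constants measured at two stress temperatures. The rate constants and the 90% potency threshold are hypothetical numbers chosen for illustration, not data from any cited study.

```python
import math

R = 8.314  # universal gas constant, J/(mol*K)

def to_kelvin(t_celsius):
    return t_celsius + 273.15

def activation_energy(k1, t1_c, k2, t2_c):
    """Activation energy (J/mol) from rate constants at two temperatures.

    Uses ln(k2/k1) = (Ea/R) * (1/T1 - 1/T2).
    """
    t1, t2 = to_kelvin(t1_c), to_kelvin(t2_c)
    return R * math.log(k2 / k1) / (1.0 / t1 - 1.0 / t2)

def extrapolate_rate(k_ref, t_ref_c, t_target_c, ea):
    """Rate constant at the target temperature, given one reference point."""
    t_ref, t_target = to_kelvin(t_ref_c), to_kelvin(t_target_c)
    return k_ref * math.exp(-ea / R * (1.0 / t_target - 1.0 / t_ref))

def time_to_fraction(k, fraction_remaining):
    """Time for a first-order process to fall to the given fraction."""
    return math.log(1.0 / fraction_remaining) / k

# Hypothetical first-order degradation rate constants (per day).
k40, k50 = 0.010, 0.025

ea = activation_energy(k40, 40.0, k50, 50.0)
k25 = extrapolate_rate(k40, 40.0, 25.0, ea)
t90_days = time_to_fraction(k25, 0.90)
print(f"Ea = {ea / 1000:.1f} kJ/mol, k(25 C) = {k25:.5f} /day, "
      f"t90 = {t90_days:.0f} days")
```

Two temperatures give only a point estimate of the activation energy. Real stability studies typically use more temperatures and check that the degradation mechanism does not change across the range.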

Importantly, data used for ML training should include measurement uncertainty. The shampoo formulations dataset, for instance, records sample-specific uncertainty to train robust models. This allows error propagation into predictions and more reliable formulation recommendations.

Example Applications

Several industries are already piloting ML-driven formulation:

Cleaning Agents: A recent study created an open dataset of shampoo formulations. Chitre et al. (2024) reported 812 formulations (with 294 stable samples) of surfactants, polymers, and thickeners, each characterized by phase stability, turbidity, and rheology. This large, diverse dataset is used to train ML models that can predict which ingredient ratios yield a stable, viscous product without precipitation. In industry, Unilever has used AI (in partnership with Arzeda) to find new enzymes for detergents that perform better than traditional chemistry. Their ML-driven design identified stain-removing enzymes five times faster than traditional R&D and reduced water/energy use in cleaning by large margins. These examples illustrate how ML transforms formulation: from suggesting entirely new ingredient classes (enzymes) to fine-tuning conventional mixtures using data.
Agrochemicals: Formulation in agriculture (pesticides, fertilizers) involves optimizing delivery (e.g. controlled release granules, wetting) and safety. ML tools can predict how a pesticide formulation will perform under different climate conditions or how to reduce active dosage while maintaining efficacy. For instance, ML might suggest the optimal surfactant blend that maximizes leaf adhesion while minimizing run-off. Data scarcity is acute here (field tests are expensive), so ML methods focus on transferring lab-scale data to field predictions.
Cosmetics: Besides shampoos, cosmetics R&D uses ML for cream and lotion formulations. Companies collect small datasets on skin feel, spreadability, and stability. Bayesian optimization can suggest new ratios of emollients and emulsifiers to achieve desired viscosity and sensory properties. Graph models predict solubility of fragrance oils in various oil/water mixes. The goal is to shorten the cycle of testing new product variants. Although much of this work is proprietary, academic datasets (like the shampoo study) are beginning to enable public ML work in personal care.

Conclusion

Machine learning is rapidly becoming a key tool in formulation science. By using BO, GNNs, and other ML techniques, formulators can navigate high-dimensional ingredient spaces more efficiently than ever. Critical to success is rigorous validation: any ML-suggested formulation must be tested for physical stability, rheology, and shelf life (often using accelerated protocols). The examples above, from shampoo datasets to detergent enzymes, show that even data-sparse domains can benefit markedly from ML. As tools mature, we expect to see integrated “self-driving formulation labs” where ML algorithms propose experiments, robots execute them, and data loop back into the model, continuously improving product design. This data-driven, AI-enhanced approach promises faster innovation in cleaners, crop formulations, and cosmetics, aligning with both performance and sustainability goals.

---

References

Liu, R., Wang, Z., Yang, W., Cao, J., & Tao, S. (2024). Self-optimizing Bayesian for continuous flow synthesis process. Digital Discovery, 3, 1958–1966. https://doi.org/10.1039/D4DD00223G
(2025). Machine learning-driven optimization of deep eutectic solvents: Accelerating physicochemical properties modeling. Sustainable Materials and Technologies, 45, e01536. https://doi.org/10.1016/j.susmat.2025.e01536
(2025). SolECOs: a data-driven platform for sustainable and comprehensive solvent selection in pharmaceutical manufacturing. Green Chemistry, 27, 12621–12641. https://doi.org/10.1039/D5GC04176G

Shehan Makani