
Closing the Execution Gap in Generative AI for Chemicals and Materials: Freeways or Safeguards

GenAI may enable computers to create drugs or sustainable materials. But impact in chemistry happens further downstream, following synthesis, testing, and scale-up. We propose paths for closing this execution gap and creating powerful, safe AIs that can realize novel chemicals.

Published on Mar 27, 2024


Designing chemicals and materials is a winner-takes-all game in chemical space. Out of a nearly endless list of candidate chemicals, only a tiny subset can practically be made and tested for a given task. Of those, a mere handful will advance to the clinic or the market because of regulatory and economic factors, such as Food and Drug Administration approval or the capital expenditure required for manufacturing. This attrition is slow, expensive, and high stakes; committing to the wrong chemical can ruin a clinical trial or doom a consumer device. Generative AI has undoubtedly broadened and accelerated the early stages of chemical design. However, real-world success takes place further downstream, where the impact of AI has so far been limited. Here, we identify the nature of this ‘execution gap’ and analyze its technical and social sources, with a focus on the domain-specific nuances that separate chemistry and materials from other applications of AI. We explore strategies to address the gap, including directions for algorithmic development, strategies for the efficient creation and sharing of data, and maximization of feedback between early discovery and downstream validation. Like any powerful technology, generative AI in chemistry brings ethical concerns and dual-use risks. In our analysis, these are not as critical or immediate as they are in other applications of AI. Precisely because of the execution gap, AI has so far not meaningfully lowered the barrier for bad actors to produce chemicals known to be harmful, nor has AI creativity practically broadened the risk spectrum in terms of the types and severity of potential misuse.

Keywords: drug discovery; autonomous laboratories; sustainability; materials science


1. Introduction

The use of generative AI in chemistry and materials has closely tracked the evolution of the field and at times guided algorithmic innovations (Figure 1).[1],[2],[3] In the essentially infinite chemical space, generative AI offers the promise of inverse design by automatically producing winning designs instead of iterating over candidate chemicals by trial and error. This idea has been demonstrated: AI algorithms have expedited the discovery of small molecule drugs;[4],[5] protein structures created by generative AI have been experimentally verified;[6],[7] large language models, trained on scientific literature, have been used to design and plan the synthetic pathways of metal organic frameworks[8],[9] and to orchestrate data access and robotic experimentation;[10] and the concept of AI-guided closed-loop discovery, integrated with autonomous experimentation, has been demonstrated in multiple proofs of concept (Figure 2).[11],[12],[13],[14] These successes raise the question: Is it possible to conjure up a blockbuster drug or a high-temperature superconductor from a well-stated user prompt, similar to consumer-facing generative models for text or images? Such capabilities could alter the scientific discovery landscape by lowering knowledge and cost barriers to designing new chemicals. Naturally, concerns arise that misuse of such technology may enable the creation of illicit or toxic substances, warranting new safeguards and regulations.[15],[16]

Figure 1

Schematic illustration of generative AI and automation for chemistry and materials science.

Figure 2

Comparison between traditional and AI-guided design workflows. [Reproduced from reference 1]

In reality, generative AI is not yet an established success—or risk—in chemistry and materials science. Nearly a decade has elapsed since the first deep learning generative models for chemistry were published; reported victories are increasing but remain preliminary. Innovations that can be uniquely attributed to AI, while promising, have not been carried forward to the clinic or scaled up into consumer products, and they remain within the reach of just a few specialized laboratories.

Some existing barriers are likely addressable through algorithmic innovation that would generalize applicability. Unlike vision or natural language models, machine learning (ML) models in chemistry face challenges when operating beyond their training domain. Trustworthy extrapolation, essential for scientific discovery and innovative designs, is lacking. This limitation necessitates expensive experimental validation of potentially inaccurate ideas.

More critically, a practical execution gap exists between generative AI and the physical world. In chemistry and materials, outcomes must be embodied, evaluated for function and safety, scaled up, and commercialized. Ideas can be generated rapidly, but evaluating them is costly, involving small-scale laboratory testing, clinical trials, and slow-to-set-up pilot plants, all of which produce sparse data. Current AI struggles with these downstream stages and lacks the ability to learn from them. The definitions of ease of synthesis and scalability are themselves application dependent and may involve precursor cost, supply chain availability, adaptability of industrial infrastructure, reproducibility, yield, etc. This friction between computation and translation is not uniform across applications because of algorithmic reasons, availability of open data, speed and throughput of experimentation, and economic drivers of innovation. Social factors are relevant too. Scientists may be demotivated to carry out the designs of an AI that could be inscrutable, error prone, overconfident, or unoriginal. Likewise, there are instances of disagreement between the AI community and traditional subject matter experts on the size and nature of this gap and on how to measure AI success.

The reality of this execution gap must also be considered when addressing dual-use concerns. In theory, the difference between minimizing toxicity for food coloring and maximizing it for a chemical weapon is merely a negative sign, but most generative AI approaches to molecular design are fraught with mispredictions and critically bottlenecked by time-consuming empirical evaluation. The community lacks a standardized set of ethics and risk management rules, unlike human and animal experiments or the emerging standards in the AI community. The risks of purely digital innovations may be overblown, disregarding the difficulty of actually making chemicals, and the corresponding risk management strategies may be inordinately cautious, blocking the release of algorithms, training data, or even user inputs. Since embodiment is a technological chokepoint, it is also the natural place for guardrails: controlling access to precursors or laboratories is both easier and more meaningful than keeping technical programming decisions or training data secret.

Distributed efforts, including algorithmic innovation, open data initiatives, and collaboration between tech, chemical industry, and academia can help deliver generative AI’s potential in chemistry and materials for healthcare and sustainability. These efforts can bridge the gap between digital discovery and physical productization while maintaining barriers against misuse. Although strides are being made, the pace of these efforts currently lags behind advances in generative AI.

2. Gaps in Using Generative AI for Chemistry and Materials

2.1. Learning Generalizable Representations

Chemistry and materials discoveries involve identifying physical matter with unprecedented properties, an extrapolation from known to unknown, which is particularly challenging because it takes place in a regime where training data is scarce. Furthermore, to invent useful chemicals and not just random chemicals, generative models must be coupled with discriminative models (i.e., property prediction models) trained on expensive labels from laboratory experiments or accurate simulations. Both generative and discriminative models struggle to generalize in out-of-domain (OOD) regions. For example, a generator might lack the creativity to propose chemistries beyond its training data: a model trained on metal alloys would likely never suggest a carbon fiber–reinforced plastic for an aviation airframe. Additionally, property predictors trained on the same data may assign spurious fitness labels to OOD suggestions. Hallucinations, in which models are creative but factually wrong, are common in generative models, especially in this extrapolative regime. Pretraining models on a large, broad dataset is a common technique in generative AI to create foundation models that can be efficiently fine-tuned on downstream tasks.
Pretraining and foundation models have seen great success in images and text and have also been relatively successful in the space of protein design but—despite many efforts—not as much for molecules and even less so for solid crystalline materials, which have only recently started to be addressed.[17],[18],[19] Unlike those of small molecules, the properties of solid materials cannot be captured from the structure of a few atoms alone and depend on processing as much as on composition; the design of new catalysts, for instance, requires algorithms capable of going beyond small atomic structures.[20],[21] This broad challenge in generalizability is one reason why closed-loop, iterative, experimental validation is so vital in a typical workflow and why developing meaningful and comprehensive training datasets has been such an important focus.
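To make the generator–predictor coupling concrete, the toy Python sketch below pairs a stand-in “generator” with a stand-in property predictor and flags OOD proposals by their distance to a (hypothetical) training set, so that predicted scores in the extrapolative regime are discarded rather than trusted. All functions and numbers here are invented placeholders, not real chemical models.

```python
import random

# Hypothetical training points in a 2D feature space (placeholders).
TRAIN = [(0.1, 0.2), (0.3, 0.1), (0.2, 0.4), (0.5, 0.3)]

def predict(x):
    # Stand-in discriminative model: any fitted property predictor.
    return 1.0 - (x[0] - 0.3) ** 2 - (x[1] - 0.2) ** 2

def ood_distance(x):
    # Distance to the nearest training point; large => extrapolation.
    return min(((x[0] - t[0]) ** 2 + (x[1] - t[1]) ** 2) ** 0.5 for t in TRAIN)

def generate(n, rng):
    # Stand-in generative model: random proposals over the feature space.
    return [(rng.random(), rng.random()) for _ in range(n)]

def screen(candidates, ood_threshold=0.3):
    # Keep only in-domain candidates, ranked by predicted property.
    trusted = [(predict(x), x) for x in candidates if ood_distance(x) <= ood_threshold]
    return sorted(trusted, reverse=True)

rng = random.Random(0)
ranked = screen(generate(200, rng))
```

Real pipelines would replace the random proposer with a trained generative model and the distance check with a proper applicability-domain or uncertainty estimate, but the division of labor is the same.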

2.2. Generating Training and Validation Data

Laboratory experiments are slow, expensive, and poorly scalable, which usually makes them the rate-determining step in an ML-driven discovery pipeline. The significance of this problem is reflected in several efforts over the past half decade to accelerate the experimental validation process with autonomous laboratories.[11],[13],[22],[23],[24],[25],[26],[27] These have, however, experienced some unique practical challenges.[28],[29] For instance, the products that robotic platforms can synthesize are limited by the set of prespecified precursor chemicals and implemented physical operations. Narrower chemical and operational ranges allow for more efficient automation, as evidenced by the robustness and commercial availability of DNA, RNA, and peptide synthesizers; greater breadth in the molecular products a robotic platform can make comes with greater cost and complexity. Producing a new molecule, material, or device often requires a complex sequence of physical operations beyond mixing and heating, not the least of which is designing purification and isolation strategies to ensure that the properties one measures correspond to the molecule one intended to make. In addition, current off-the-shelf robotic arms lack the dexterity and coordination of human hands, which are crucial for tasks like the dosing of solid reagents with diverse textures.[30] While the initial examples of automated labs have been successful proofs of concept for accelerated synthesis and closed-loop design workflows, they have also highlighted the challenges: the time and capital investment in developing and installing these platforms is large, so the averaged time and cost per unit experiment are not necessarily lower than in traditional experimentation.
It is also likely that experimental validation will still remain the rate-determining step, even with the development of general purpose robotic labs in the future, placing high importance on the predictive quality of generative models.

2.3. Data Quantity and Quality

The performance of both generative and discriminative models is bounded by the quality of the training dataset. Labeled data in the chemical sciences is obtained from experiments that are susceptible to both instrumental and human errors. For instance, the signal-to-noise ratio in chemical and biochemical experiments can be small, particularly across replicated biological experiments, which makes the direct application of machine learning to results from human experiments challenging.[31] The quantity of the data is also an important consideration since the state-of-the-art models employing end-to-end learning typically demand large training datasets. Autonomous experiments could potentially enhance both quality and quantity, but they still face the challenges mentioned above. Up to now, ‘real’ success stories such as approved drugs or commercialized materials, whether aided by AI or not, are few and provide sparse reward signals to gauge the superiority of one discovery workflow over another.

Data bias is another concern, as it leads to biased predictions. This includes the obvious issue of erroneous labels but also the distribution mismatch between the training data and the target chemical space. In molecular design, a generative model is unlikely to generate candidates containing chemical scaffolds (or, in inorganics, crystal structures) that are not well represented in the training dataset, even if they might lead to better properties. Likewise, training models on data from the scientific literature[32] inherits reporting biases, posing a risk of systematically overoptimistic models.

Physics-based simulations can sometimes be a cheaper—but lower fidelity—alternative to experimental measurements, so it is common to use these synthetic labels to expand the coverage of models in chemistry. The choice of the size and time scale of simulations is an important factor. For instance, phenomena such as the degradation of metallic materials in natural environments depend on local microstructural features at a scale of micrometers to millimeters, which can only be observed with mesoscale simulations rather than atomistic simulations.[33] Datasets containing labels of multiple experimental fidelities, combining simulated and experimental data or various time and size scales, are and will continue to remain an integral part of chemical machine learning, which requires the development of algorithms able to exploit these dimensions of information.[34],[35],[36],[37],[38]
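One common way to combine fidelities, sketched below with invented numbers, is a “delta learning” correction: cheap simulated labels are abundant, and a simple model fitted on the few compounds that also have experimental labels corrects the simulation for the rest. This is a toy illustration of the idea, not a specific published algorithm; real corrections would typically be learned nonlinear models.

```python
# Cheap low-fidelity simulated labels for four hypothetical compounds.
sim = {"A": 1.0, "B": 2.0, "C": 3.0, "D": 4.0}
# Scarce high-fidelity experimental labels for a paired subset.
exp = {"A": 1.3, "C": 3.5}

def fit_linear_correction(sim_vals, exp_vals):
    # Ordinary least squares for exp ≈ a * sim + b on the paired subset.
    n = len(sim_vals)
    mx = sum(sim_vals) / n
    my = sum(exp_vals) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(sim_vals, exp_vals))
    var = sum((x - mx) ** 2 for x in sim_vals)
    a = cov / var
    b = my - a * mx
    return a, b

paired = [(sim[k], exp[k]) for k in exp]
a, b = fit_linear_correction([p[0] for p in paired], [p[1] for p in paired])

# Corrected predictions for compounds lacking experimental measurements.
corrected = {k: a * v + b for k, v in sim.items()}
```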

2.4. You Get—At Best—What You Ask For

On the algorithmic side, generative models must not only suggest novel candidates but also optimize them for desired properties. It has historically been difficult to clearly describe—in machine readable format—the functional requirements that accurately capture real-world complexities. For instance, efficacious drug molecules should not only demonstrate a high affinity towards specific targets but also exhibit selectivity against alternative targets, coupled with a favorable ‘drug-likeness’ profile, encompassing acceptable absorption, distribution, metabolism, excretion, and toxicity characteristics, and be synthetically accessible at the appropriate scale while meeting regulatory standards of purity, etc. We not only require quantitative estimates of all these properties but also need to understand the tradeoff between all these objectives during property optimization.
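A minimal way to make such tradeoffs explicit, rather than collapsing all objectives into one score, is Pareto filtering: keep only candidates that no other candidate beats on every objective at once. The sketch below uses invented property values (affinity and selectivity, higher better; toxicity, lower better) purely for illustration.

```python
# Hypothetical candidates: (affinity, selectivity, toxicity).
candidates = {
    "mol1": (9.0, 0.80, 0.20),
    "mol2": (7.5, 0.90, 0.10),
    "mol3": (8.0, 0.70, 0.30),   # dominated by mol1 on all three objectives
    "mol4": (6.0, 0.95, 0.05),
}

def dominates(a, b):
    # a dominates b if it is no worse on every objective and strictly
    # better on at least one (toxicity is minimized, the others maximized).
    no_worse = (a[0] >= b[0], a[1] >= b[1], a[2] <= b[2])
    strictly = (a[0] > b[0], a[1] > b[1], a[2] < b[2])
    return all(no_worse) and any(strictly)

def pareto_front(cands):
    # Candidates not dominated by any other candidate.
    return {k for k, v in cands.items()
            if not any(dominates(w, v) for j, w in cands.items() if j != k)}

front = pareto_front(candidates)
```

The nondominated set exposes the tradeoff surface to a medicinal chemist; a single weighted score would silently hide it.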

2.5. Success Is Hard (To Measure)

As generative AI in chemistry and materials transitions from a scientific novelty to a practical technology, different coexisting measures of success must be reconciled. Proof-of-principle instances of AI creativity or autonomy may look trivial, or below standard, when compared with expert reasoning or traditional scientific rigor. The Automatic Chemical Design (ACD) levels, for instance, provide a means to classify the autonomy of agents based on the degree of contributions by chemists and machines in ideation and decision stages of the workflow.[39]

Recent instances such as the work on a deep learning method to identify potent DDR1 kinase inhibitors in Nature Biotechnology[5] and the work on an autonomous laboratory for materials synthesis in Nature[13] are examples of this difference in standards. In the former, the best candidate produced by the generative algorithm was structurally similar to ponatinib, a known DDR1 kinase inhibitor, posing the question of whether the result was as ‘revolutionary’ as suggested by some media.[40],[41] In the latter, the phase identification and Rietveld refinement of post-synthesis X-ray diffraction patterns were performed in an automated fashion using ML components, resulting in refinements of lower quality than the standard expected from human scientists, which might have led to inaccurate composition assignments.[42],[43] While the methodological advances were quite significant in both cases, friction arose between holding the AI to its own standard versus the scientific standard in the given field.

A similar clash of standards sometimes arises around data and model availability. The gold standard for openness within the AI community is open sourcing of data and code,[44],[45] while publications in chemistry and materials are expected to comply with high standards of disclosure in methodology and result validation. Ideally, research articles at the interface would apply the superset of both expectations, but AI works in chemistry and materials are at risk of both underreporting experimental details compared with traditional works in the field and not making their code and data available (in a way that is usable). Releasing training data, code, and model checkpoints along with published papers should be a mandatory requirement in journals when the authors hold the intellectual property rights. In works coupling AI with robotic platforms, it is a reasonable choice to not make access to robotic hardware controls publicly accessible for safety concerns. However, not releasing accompanying software and datasets impedes scientific progress by blocking the community from accessing useful tools. Furthermore, it prevents due scientific scrutiny. Lastly, commercial interests in developing software or chemicals may be perfectly justified, but they should not interfere with reporting standards and must be accompanied with appropriate disclosures in publications. It is particularly dangerous for the community to mask for-profit or other spurious interests with poorly justified misuse concerns as a reason to underreport.

3. Paths Forward

Bridging the execution gap of generative models in chemical science includes tackling components arising from (1) the performance of generative algorithms and (2) the expense of experimental validation.

3.1. Algorithm Innovation

The unique challenges posed by scientific tasks demand algorithmic innovation in areas that may not be the focus for the development of general-purpose AI tools. Inverse chemical design models must couple generative AI with property prediction models to drive design to novel structures that optimize for properties of interest. While the first developments were seen in molecules,[46],[47] periodic materials have gained traction more recently.[48],[49] Improving predictive quality and generalization is critical to maximize success rates and avoid endless data acquisition loops focused on repairing unreliable models.[50] Since there is some overlap in problems and solutions for both generative and discriminative models, we discuss both cases within a common framework.

Pretraining, transfer-learning, and foundation models. Datasets in chemical science suffer from sparse real-world successes and expensive labeling, which often translates to small training datasets, so designing optimized molecules with as little labeled data as possible remains an open question.[51] Improving the effectiveness of pretraining and transfer-learning strategies in AI for chemistry is critical. Being able to train foundation models for chemistry that can be fine-tuned for specific tasks is a promising path to address the scarcity of quality labeled data. There have been promising recent attempts at transfer learning for chemistry, especially for property prediction tasks, but they have yet to reach the effectiveness of language and vision models.[52],[53],[54] Maximizing the usage of existing publicly available datasets across domains will help drive such foundation models. Recent works suggest that generalized architectures for chemistry and materials follow favorable scaling laws and can continue to improve with larger datasets.[55],[56] Novel approaches bridging the gap between pretraining and transfer-learning strategies and the intricacies of scientific tasks remain a pivotal focus for advancing the field.
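The essence of the fine-tuning strategy can be reduced to a toy numerical sketch: a frozen “pretrained” featurizer (standing in for a foundation model’s learned representation) is reused unchanged, and only a small head is fitted on a handful of downstream labels. The featurizer, data, and hyperparameters below are invented for illustration.

```python
def pretrained_features(x):
    # Frozen representation, nominally learned elsewhere on broad data.
    return (x, x * x)

# Scarce labeled downstream data; the underlying law y = 2x + 0.5x^2
# is unknown to the model and only observed through these four labels.
data = [(x, 2.0 * x + 0.5 * x * x) for x in (0.0, 1.0, 2.0, 3.0)]

def fit_head(data, lr=0.01, steps=5000):
    # Fit only the linear head on top of the frozen features by SGD;
    # the featurizer itself receives no gradient updates.
    w = [0.0, 0.0]
    for _ in range(steps):
        for x, y in data:
            f = pretrained_features(x)
            err = w[0] * f[0] + w[1] * f[1] - y
            w[0] -= lr * err * f[0]
            w[1] -= lr * err * f[1]
    return w

w = fit_head(data)  # converges toward weights (2.0, 0.5)
```

With a representation that matches the task, a two-parameter head recovers the target law from four labels; with a mismatched representation, no amount of head-fitting would, which is why representation quality dominates in the low-data regime.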

Inductive bias. Incorporating known science-based, domain-specific priors into models could be another avenue for extracting value from small datasets. A recent successful trend is the use of equivariant models that incorporate spatial symmetry rigorously, such as the E3NN framework.[57] By having the internal state and the operation of the model respect strictly the symmetry constraints of their inputs and labels, equivariance improves data efficiency and physical faithfulness in some classes of prediction tasks.[58],[59],[60],[61],[62],[63],[64] It is, however, uncertain how effective such physical priors are in generative models, with some recent research showing equivalent generalization performance being achieved without domain-specific inductive biases.[3],[65] There is hence a need for further research to explore the coupling between physical priors, dataset sizes, and the task being performed with generative models.
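A minimal flavor of this idea, below in pure Python, is to describe a structure by its sorted pairwise interatomic distances: the representation is rotation invariant by construction, so the symmetry never has to be learned from data. (Equivariant networks such as those built with E3NN generalize this to internal features that transform predictably under rotation rather than merely staying constant; the coordinates here are rough, illustrative values.)

```python
import math

def pairwise_distances(coords):
    # Sorted pairwise distances: invariant to rotation, translation,
    # and atom ordering by construction.
    n = len(coords)
    d = [math.dist(coords[i], coords[j]) for i in range(n) for j in range(i + 1, n)]
    return sorted(d)

def rotate_z(coords, theta):
    # Rigid rotation of all atoms about the z axis.
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x - s * y, s * x + c * y, z) for x, y, z in coords]

# Rough O, H, H positions of a water-like geometry (angstroms, illustrative).
water = [(0.0, 0.0, 0.0), (0.96, 0.0, 0.0), (-0.24, 0.93, 0.0)]
feat = pairwise_distances(water)
feat_rot = pairwise_distances(rotate_z(water, 1.234))
# feat and feat_rot agree to floating-point precision.
```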

Digitizing synthetic accessibility. A direct adoption of graph generative algorithms for designing molecules exhibits severe problems in synthetic accessibility,[66] which motivates considering synthetic accessibility during generation.[67],[68],[69],[70],[71] Different chemical entities—such as small molecules, proteins, RNA, and solid materials—present unique challenges in generative modeling. The formalisms of small molecules as graphs, proteins as sequences of canonical amino acids, and RNA as sequences of nucleotides do not necessarily have straightforward analogs for solid materials with periodicity, certain defect structures, or well-defined microstructure that arise from the interplay of composition and processing. Consequently, the creation of models specifically tailored to each of these entities is vital.[49],[72],[73],[74] On the other hand, given that all chemical species adhere to universal physical principles and given that many applications demand simultaneous modeling of different chemical entities (e.g., small organic molecules and proteins in drug discovery or zeolites and organic structure-directing agents in catalysis), there is an expectation for models to generalize across diverse chemical spaces and multiple length scales, exemplified by initiatives like RosettaFoldAA[75] or so-called foundational machine learning potentials in materials.[56],[76],[77],[78] ML algorithms can also find use in accelerating experimental validation, such as by planning chemical synthesis, which is particularly promising when coupled with automated laboratories.[79],[80],[81]

Uncertainty quantification and active learning. To tackle experimental challenges in throughput and scalability, one possible algorithmic approach involves enhancing the efficiency of design and optimization workflows while easing the burden on validation components. Active learning (AL) strategies, utilizing Bayesian optimization for instance, can be employed to minimize costly queries/experiments, particularly in machine learning–accelerated synthesis workflows.[82],[83] Uncertainty estimates of machine learning models, that is, knowing when the model prediction is not reliable, are a vital part of the data acquisition strategies in AL. Existing approaches for uncertainty quantification, while varied, face limitations tied to specific datasets and tasks, hindering widespread adoption.[84] Developing universally applicable, well-calibrated uncertainty quantification techniques in chemistry and materials would enhance the efficiency of AL strategies.
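As a toy illustration of the AL loop, the sketch below fits a small ensemble of linear models on bootstrap resamples of the labeled data and spends each simulated “experiment” where ensemble disagreement, a crude uncertainty proxy, is largest. The target function and models are invented stand-ins for costly lab measurements; practical systems would use Gaussian processes or calibrated deep ensembles.

```python
import random

def oracle(x):                      # stands in for a costly experiment
    return (x - 0.7) ** 2

POOL = [i / 20 for i in range(21)]  # candidate experiments in [0, 1]

def fit_ensemble(labeled, n_models, rng):
    # Each member fits a line to a bootstrap resample of the labeled data.
    models = []
    for _ in range(n_models):
        sample = [rng.choice(labeled) for _ in labeled]
        xs = [p[0] for p in sample]
        ys = [p[1] for p in sample]
        mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
        var = sum((x - mx) ** 2 for x in xs) or 1e-12
        a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / var
        models.append((a, my - a * mx))
    return models

def disagreement(models, x):
    # Variance of ensemble predictions at x: a crude uncertainty proxy.
    preds = [a * x + b for a, b in models]
    m = sum(preds) / len(preds)
    return sum((p - m) ** 2 for p in preds) / len(preds)

rng = random.Random(1)
labeled = [(0.0, oracle(0.0)), (1.0, oracle(1.0))]
for _ in range(5):  # five acquisition rounds
    models = fit_ensemble(labeled, 8, rng)
    x_next = max((x for x in POOL if x not in [p[0] for p in labeled]),
                 key=lambda x: disagreement(models, x))
    labeled.append((x_next, oracle(x_next)))
```

Each round queries one unlabeled point, so the budget of “experiments” is explicit: the quality of the uncertainty proxy directly controls how few oracle calls the loop needs.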

3.2. Experimental Capabilities

A promising solution to accelerate data generation and experimental validation is robotic execution in automated, and eventually autonomous, laboratories. Automation of chemical experiments is the first step towards autonomous experimentation, based on the modularization and scaling of basic experimental operations.[11],[13],[14],[24],[25],[26],[27] Many of these operations have commercialized solutions, such as liquid handling and plate or vial transfer, but certain key steps still lack effective commercial solutions, such as accurately handling powdered solids, viscous substances, or very small liquid volumes. Integrating characterization to effectively ‘test and analyze’ and thus close the loop also challenges current workflows. There are both hardware and software opportunities. Characterization instrumentation is very diverse and typically costly; only some techniques are amenable to physical integration alongside automated synthesis equipment, while many others reside in physically separated facilities. This is particularly meaningful for solid materials, since their properties arise from complex multiscale interactions. Depending on the level of integration and physical proximity of the characterization equipment, samples can be transferred using fluidic techniques, stationary robotic arms, or mobile robotic arms.[85] On the software side, ML techniques offer the potential for automated analysis of characterization data.[86] The integration of heterogeneous characterization data from multiple sources, such as X-ray crystallography, microscopy, nuclear magnetic resonance, or various spectroscopic techniques, is also an ongoing area of research.

Autonomous experimentation requires using AI planning to fully or partially replace detailed human-written instructions. This involves collaborative advancements in synthesis planning algorithms, algorithms for chemical hypothesis generation (molecular design), communication across multiple automated laboratories, and the ability to monitor experimental progress through multimodal sensing systems to verify experiment success. Some issues demand breakthroughs in fundamental science, such as analyzing chemical components in a mixture without the use of authentic reference standards at scale. Overall, the main challenges are often interdisciplinary engineering problems, requiring collaborative efforts from mechanical engineers, electrical engineers, chemical engineers, and others. Such challenges are not typically suited for resolution by individual academic teams. Government or private-backed user facilities, as they exist for materials characterization (beamlines) or biomedical research (contract research organizations), are a possible pathway to ameliorate the capital and operating expense of autonomous labs and maximize their impact.[30],[85],[87]

3.3. Data, Benchmarks, and Other AI Infrastructure

The development, training, evaluation, and deployment of AI algorithms require additional components, which we refer to as infrastructure.[88] Among these, as mentioned above, data quality and quantity are a major issue. On the one hand, we need to accelerate data generation by promoting high-throughput experiments (see above). On the other hand, we need to encourage the production of more open data and reforms in data-sharing practices. Given the importance of consistent measurements of molecular and materials properties, it is crucial that government agencies or nonprofit organizations invest in the development of open standard datasets. A prime example is the creation of large-scale, high-quality datasets similar to the National Institutes of Health’s Tox21,[89] which has proven valuable for evaluating chemical predictive models.

Even when open data is available, locating and curating it and evaluating model performance in task-appropriate ways are nuanced and time-consuming tasks that require domain expertise.

To mitigate these issues, the creation of centralized, expert-maintained databases and benchmarks by authoritative institutions is crucial. For example, the Protein Data Bank (PDB),[90] a database of protein structures, facilitated the establishment of the Critical Assessment of Structure Prediction (CASP), a competition for benchmarking protein structure prediction models. Even before AI was a thinkable solution to CASP, the data availability enabled by the PDB and the regular assessment of computational models enabled by CASP were pivotal in the evolution and success of AI models and directly catalyzed the development of AlphaFold.[91] Initiatives such as the Therapeutics Data Commons[92] and the Open Reaction Database[93] are also making strides in providing uniform access to open data in therapeutic science and organic reactions, respectively, marking progress towards resolving the highlighted challenges in data utilization and model evaluation. Moreover, publishers of scientific journals must support open data efforts, from requiring digital supplementary information to adopting unified standards for machine-readable versions of figures and chemical compound names, so that publications can readily be utilized as training data for models, reducing the need for complicated literature and figure mining algorithms.[50],[94],[95],[96],[97],[98],[99],[100],[101],[102]

4. Ethical and Dual-Use Risks and Mitigation Strategies

The gap between generative model predictions and practical execution, observed in positive-use cases, also extends to malicious-use cases because of the same issues of predictive reliability, synthetic accessibility, and experimental production of novel chemicals. Recent work highlights concerns about generative models being used for malicious purposes, citing an example of designing toxic molecules with predicted LD50 values lower than that of the nerve agent VX.[16] However, current generative models are far from being able to accurately design synthesizable compounds with specific property profiles, especially when compared to human experts.

Yet, as generative AI becomes an increasingly viable avenue for scientific discovery, its use in accelerating the design of novel harmful compounds or predicting synthetic routes for known harmful compounds could become more feasible. The inability of current models to reliably perform these tasks on basic molecules like aspirin [10] is not a viable safeguard in the long term.

4.1. Information Risk and the Broader Public

If, however, the objective shifts from instructing generative models to design compounds with novel properties to a more straightforward task, such as enhancing knowledge access about existing toxic chemicals, ML models and tools like conversational AI bots can pose significant risks on a shorter timeline. This was demonstrated in a recent incident in which a conversational meal-planner bot recommended a recipe for creating chlorine gas when guided to.[103] In situations in which the level of expertise of the end user and anticipated intent of usage are highly variable, it is crucial to balance the level of autonomy of the agent and provide due safeguards on what decisions are allowed and disallowed. This is a challenge for conversational AI in general (which can provide just as objectionable advice about dieting, medication, or investment) and must be addressed cohesively through alignment practices for generative AI, relying on ethics boards, legislative guidance, community engagement, and continuous oversight to implement strong safety nets.

4.2. Power-User Bad Actors

Existing safeguards in conversational AIs cannot consistently repel voluntary misuse by bad-actor power users in the form of prompt hacking and attacks, and this challenge carries over to chemical information. The level of risk that arises from exposing ‘forbidden’ chemical knowledge should be compared to the amount of information that would be available on the internet to bad actors with equal levels of motivation and expertise. As of today, actually making dangerous chemicals is harder than just knowing how to create them (which is harder than imagining what to create, as in the nerve agent example) because it requires access to specialized equipment and chemicals, on top of the combination of background technical knowledge and misuse-specific information. It is foreseeable that AI could supplant some or all of the latter, making hardware and hands-on knowhow limitations the only blocking elements.

4.3. Execution Risk

A more distinctive scope for dangerous situations arises for AI applied further downstream. This is particularly important as experimental facilities gain more autonomy and the chances of candidates escaping close human inspection increase.[11],[13],[14],[24],[25],[26],[27] In automated workflows, whether fully autonomous or human-in-the-loop, it is important that the questions of who proposes experiments, who approves or selects them, and who executes them are carefully considered, for instance within the ACD levels framework, so that the responsibilities of the various parties involved are clearly delineated. Establishing guidelines and safeguards in this regard could help prevent both intentional and unintentional misuse of autonomous platforms. This level of risk is more reminiscent of critical decisions like those of autonomous weapons systems, which—for clear reasons—are receiving strong attention. AI agents capable of making chemicals must receive very rigorous oversight, and the community needs first to develop the institutions to decide the nature of these limitations and then to enforce them.

The interface between ML model predictions and robotic hardware is a critical nexus for implementing checks. Methods may involve rule-based filters to ensure adherence to safety guidelines or advanced approaches with human feedback for thorough screening. The latter of course comes at the expense of lower overall throughput, so the level of regulation can be made commensurate with the level of risk and autonomy of the agent.
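A rule-based filter at this interface can be as simple as the sketch below: each proposed synthesis is screened against an allow-list of precursors, a deny-list of flagged substructure tags, and platform safety limits before the robot may execute it. The chemicals, tags, and limits here are invented placeholders to illustrate the shape of such a check, not a real hazard screen.

```python
# Hypothetical allow/deny lists for a robotic synthesis platform.
ALLOWED_PRECURSORS = {"water", "ethanol", "sodium_chloride", "acetic_acid"}
FLAGGED_TAGS = {"organophosphate", "scheduled_precursor"}

def screen_proposal(proposal):
    """Return (approved, reasons); an empty reasons list means approved."""
    reasons = []
    for p in proposal["precursors"]:
        if p not in ALLOWED_PRECURSORS:
            reasons.append(f"precursor not on allow-list: {p}")
    for tag in proposal.get("tags", []):
        if tag in FLAGGED_TAGS:
            reasons.append(f"flagged substructure tag: {tag}")
    if proposal.get("temperature_c", 25) > 150:
        reasons.append("temperature exceeds platform safety limit")
    return (not reasons, reasons)

ok, why = screen_proposal(
    {"precursors": ["water", "acetic_acid"], "temperature_c": 80})
bad, why_bad = screen_proposal(
    {"precursors": ["water"], "tags": ["organophosphate"], "temperature_c": 200})
```

Because the filter returns explicit reasons rather than a bare refusal, rejected proposals can be logged and escalated to a human reviewer, matching the level of scrutiny to the level of autonomy.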

In addressing broader safety and misuse concerns of generative AI, President Biden’s recent executive order, issued on October 30, 2023, mandates AI companies to disclose red-teaming exercise results and large-scale model training to the government. It also tasks the Department of Energy with investigating AI’s potential role in cyberattacks and biological and chemical weapons development.[104] Yet, the need still remains for policies tailored to the aforementioned, more specific questions that chemistry and materials science pose.

5. Conclusions

Generative AI offers both promise and challenges in reshaping molecule and materials discovery. The journey from concept to application faces computational limitations, ethical concerns, and a practical execution gap. As AI advances towards expert-free chemical realization, ethical considerations become crucial, requiring a balanced approach to prevent misuse and ensure responsible innovation. Collaboration across domains, open data initiatives, algorithmic advancements, and the fusion of computer science with domain expertise are vital for advancing scientific ML. This paper emphasizes the need for a balanced approach, leveraging generative AI’s potential while addressing ethical and safety considerations in this evolving field.
