Large Language Models for Chemistry and Drug Discovery

By now you have likely heard of Large Language Models like ChatGPT: deep learning models trained on a massive volume of text that can generate new text in response to a command (prompt) from the user. The same concept can be applied to designing molecules with the very same tools. For example, imagine a molecule written as a SMILES string or some other text representation. Train a model on an equally massive number of molecule SMILES strings, and the algorithm can learn how molecules are put together. Just as ChatGPT learns the patterns and meaning of written text, one could infer that a model trained on small molecules would learn chemistry. Taking the analogy further, large language models have also been trained on protein sequences, so they can be used to design new proteins and to understand how a protein may fold.
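To make the "molecules as text" idea concrete, here is a minimal sketch of how a SMILES string can be broken into tokens, the first step any language model takes before learning from text. The tokenizer below is a deliberately naive illustration we wrote for this post, not the tokenizer used by any particular model; real chemistry language models use more sophisticated, chemistry-aware tokenization schemes.

```python
# Aspirin, written as a SMILES string - a plain-text encoding of its structure.
aspirin = "CC(=O)Oc1ccccc1C(=O)O"

def tokenize_smiles(smiles):
    """Naive character-level SMILES tokenizer (for illustration only).
    The one chemistry-aware touch: keep two-letter element symbols
    such as Cl and Br together as single tokens."""
    tokens = []
    i = 0
    while i < len(smiles):
        if smiles[i:i + 2] in ("Cl", "Br"):
            tokens.append(smiles[i:i + 2])
            i += 2
        else:
            tokens.append(smiles[i])
            i += 1
    return tokens

print(tokenize_smiles(aspirin))
```

Once molecules are tokenized like this, training proceeds just as it does for English text: the model learns to predict the next token, and in doing so picks up the statistical regularities of chemical structure.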

So how can you put this into practice? There is an art to asking questions of tools like ChatGPT - the design of the ‘prompt’ is so important that a whole field has grown up around it, along with a new job title: the prompt engineer. The following example shows how simple it is to get a Large Language Model to design a molecule. The question, of course, is whether it can possibly design something that is both novel and useful.
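As a sketch of what ‘chemistry prompt engineering’ might look like in code, here is a hypothetical prompt template for asking an LLM to propose a molecule. The wording, constraints, and parameter names are our own assumptions for illustration; in practice you would iterate on the phrasing and constraints, which is exactly the craft the prompt engineer brings.

```python
# Hypothetical prompt template for molecule design (illustrative only).
# Constraining the output format and properties tends to give more
# usable answers than an open-ended request.
PROMPT_TEMPLATE = (
    "You are a medicinal chemist. Propose a small molecule that may "
    "inhibit {target}. Return only a valid SMILES string for a molecule "
    "with molecular weight under {mw_limit} Da."
)

def build_prompt(target, mw_limit=500):
    """Fill in the template for a given target and weight limit."""
    return PROMPT_TEMPLATE.format(target=target, mw_limit=mw_limit)

print(build_prompt("EGFR kinase"))
```

The string returned by `build_prompt` would then be sent to whichever model you are using; the template itself is the part worth engineering.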

One must assume that ChatGPT has been trained on a large number of SMILES strings. It can also provide shortcuts, especially if you need advice on cheminformatics software.
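Because a model trained on SMILES will happily emit strings that merely look like SMILES, it is worth sanity-checking its output before doing anything with it. Below is a minimal, assumption-laden sketch of such a check in plain Python: it only verifies balanced parentheses and paired ring-closure digits, which is a necessary but far from sufficient condition for validity. A real workflow would parse the string with proper cheminformatics software such as RDKit.

```python
def looks_like_valid_smiles(smiles):
    """Very rough pre-filter for LLM-proposed SMILES strings.
    Checks only that branch parentheses are balanced and that each
    ring-closure digit appears an even number of times. It does NOT
    check valence, aromaticity, or real chemical sense."""
    depth = 0
    ring_digits = {}
    for ch in smiles:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:          # closing a branch that was never opened
                return False
        elif ch.isdigit():
            ring_digits[ch] = ring_digits.get(ch, 0) + 1
    return depth == 0 and all(n % 2 == 0 for n in ring_digits.values())

print(looks_like_valid_smiles("c1ccccc1"))   # benzene: True
print(looks_like_valid_smiles("c1ccccc"))    # unclosed ring: False
```

Even a crude filter like this catches a surprising fraction of malformed model output before it reaches the rest of your pipeline.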

How about synthesizing the molecule it proposed - could it do that? Sure! But is the result meaningful, or is it just random information?

One could also ask for a retrosynthetic analysis. Again, this appears to be quite general information rather than anything specific.

Or even ask for more detail on the synthesis. It even copes with a typo in the word ‘synthesis’. But most of what it returns is fairly general information, and probably not specific enough to act on.

While the results above are certainly not 100% accurate - such models are prone to hallucination (confidently delivering fabricated information) - they do provide useful starting points that the user can learn from, and they may save time in the research process. Currently these tools do not predict properties such as solubility and logP, so there is still some work to be done! Models for synthesis and retrosynthesis prediction that are trained on real data are likely to be more useful - see our earlier work on MegaSyn, for example. As you can see, there is potential here for helping chemists and others do their work, but there is still a way to go before the results are tailored enough to deliver meaningful answers.

At Collaborations Pharmaceuticals, Inc. we have been exploring how Large Language Models like this can be employed in drug discovery. In addition, we have been training our own Large Language Models to help us predict many properties simultaneously and design new molecules. If you would like to learn how we can help your organization with ‘chemistry prompt engineering’ or put Large Language Models into your drug discovery workflow, please get in touch!