There has been a recurring thread of “unification” in AI research over the last few years. The thread goes like this: rather than trying to solve a bunch of different tasks with different models specialized for each task, let’s make one really massive model, keep it relatively simple, and train it on all the tasks at once. The most prominent example is in language: instead of training individual models for text translation, sentiment analysis, and summarization—which was the norm in natural language processing a few years ago—just train one giant model on the entire internet and have it get really good at predicting the next word in a sequence. This approach has worked to a degree that has astonished everyone, forming the foundation for products like ChatGPT and Claude. The same thing has been happening in computer vision and robotics: task-specific models are being superseded by giant generalist models. The question then becomes: can we do the same thing for science?
Machine learning is already being used extensively in science, helping us model everything from the weather to the folding of proteins to the interactions of electrons. Neural networks not only make our predictions of physical systems more accurate and efficient; they have also shown they can predict emergent phenomena they never saw during training. In an especially exciting result this month, a neural network trained on simulations of atoms accurately predicted behaviors like the formation of crystal lattices and water molecules exchanging protons, even though it had never seen such behavior in training. A friend explained it like this: imagine you train a neural network on images of stars, without ever showing it a black hole or a supernova. Then you prompt the model to show you “a dense star”, then “a very dense star”, then “a very very dense star”, and so on, and suddenly the model starts giving you pictures of black holes and supernovas, even though it never saw one in its training data. This is what we’re doing at the atomic scale: the model is learning something fundamental about how atoms work, rather than just pattern-matching what it has seen in training.
This is all exciting, but one problem with applying machine learning to science is that collecting the required data[1] is expensive, training a model is also expensive, and scientific institutions—well, they don’t tend to have the kind of money that big tech companies have. Add to that the fact that each physical system is described by a slightly different set of equations, with different parameters and initial conditions[2], and therefore requires a different neural network to model, and you have a very expensive problem. This is where the prospect of “foundation models” in science comes in: maybe we can pre-train one large model on a bunch of different physical systems, have it get better at simulating all of them, and then have individual research groups “fine-tune” the model to the specific system they’re interested in. This is how ChatGPT works: first it’s pre-trained on all the text on the internet, tasked with learning how to autocomplete text, and then it’s fine-tuned with human feedback to be a useful chat assistant—but it can also be fine-tuned for other tasks, like giving medical diagnoses or generating SQL queries.
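To make the pre-train/fine-tune workflow concrete, here is a minimal sketch in PyTorch. The tiny architecture, the random tensors standing in for simulation data, and the choice to freeze all but the last layer are illustrative assumptions on my part, not the setup used by any of the projects mentioned here.

```python
# Minimal sketch of the pre-train / fine-tune workflow for a physics surrogate.
# The architecture and the "data" are toy stand-ins, not any real project's setup.
import torch
import torch.nn as nn

# A tiny surrogate that maps the current state of a system to the next state.
surrogate = nn.Sequential(
    nn.Linear(64, 256), nn.GELU(),
    nn.Linear(256, 256), nn.GELU(),
    nn.Linear(256, 64),
)

def train(model, states, next_states, steps=300, lr=1e-3):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = nn.functional.mse_loss(model(states), next_states)
        loss.backward()
        opt.step()
    return loss.item()

# 1) Pre-training: a large, diverse corpus of trajectories from many systems
#    (here just random tensors standing in for that corpus).
pretrain_x, pretrain_y = torch.randn(2000, 64), torch.randn(2000, 64)
train(surrogate, pretrain_x, pretrain_y)

# 2) Fine-tuning: a small dataset from the one system a research group cares
#    about. Freeze the earlier layers and adapt only the final one.
for param in surrogate[:-1].parameters():
    param.requires_grad = False
finetune_x, finetune_y = torch.randn(200, 64), torch.randn(200, 64)
train(surrogate, finetune_x, finetune_y, steps=100, lr=1e-4)
```

The point of the sketch is just the division of labor: the expensive first stage happens once, and the cheap second stage is what an individual research group would run on its own data.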
Multiple groups of researchers are now building foundation models for fields like climate and fluid mechanics; these models learn to simulate systems across a range of conditions and spatiotemporal scales, and end up outperforming state-of-the-art models trained on more specific conditions. The team at PolymathicAI—a consortium of researchers on a mission to build foundation models for science—recently built a multimodal model of galaxies, showing that combining images of galaxies with their spectral distributions (i.e. the relative amount of radiation they emit at each frequency) lets us predict properties like mass, age, and star formation rate more accurately than existing models. And just like with language models, scientific foundation models improve with scale: as they get larger they become more accurate, and also more general, requiring less “prompting” to accurately model physical systems they haven’t seen before. In a particularly strange twist, even a model that was pre-trained just on YouTube videos performed better at simulating fluid mechanics than a model that got no pre-training and learned the task from scratch.[3]
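For a rough sense of what “multimodal” means here, the sketch below combines two encoders, one for a galaxy image and one for its spectrum, and regresses physical properties from the joined embedding. The layer sizes, the fusion by concatenation, and the input shapes are my own illustrative choices, not the architecture the Polymathic team actually used.

```python
# Generic two-modality sketch: encode a galaxy image and its spectrum
# separately, concatenate the embeddings, and regress physical properties.
# Shapes and fusion strategy are illustrative, not the Polymathic architecture.
import torch
import torch.nn as nn

class GalaxyPropertyModel(nn.Module):
    def __init__(self, embed_dim=128, n_properties=3):  # e.g. mass, age, star formation rate
        super().__init__()
        self.image_encoder = nn.Sequential(       # 64x64 single-band image -> embedding
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.GELU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.GELU(),
            nn.Flatten(), nn.Linear(32 * 16 * 16, embed_dim),
        )
        self.spectrum_encoder = nn.Sequential(     # 1,000-bin spectrum -> embedding
            nn.Linear(1000, 256), nn.GELU(), nn.Linear(256, embed_dim),
        )
        self.head = nn.Linear(2 * embed_dim, n_properties)

    def forward(self, image, spectrum):
        z = torch.cat([self.image_encoder(image), self.spectrum_encoder(spectrum)], dim=-1)
        return self.head(z)

model = GalaxyPropertyModel()
properties = model(torch.randn(8, 1, 64, 64), torch.randn(8, 1000))  # shape (8, 3)
```

The intuition the paragraph describes is that each modality carries information the other lacks, so a model that sees both can predict properties that neither an image-only nor a spectrum-only model could pin down as well.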
What could these models be learning that applies across different modalities and physical systems, and could even extend to cat videos? Just like in language, the idea is that there are a number of shared properties across the sciences—symmetries, conservation laws, even basic concepts like causality—that foundation models can learn. Professor Miles Cranmer of Cambridge makes the following analogy: when we pre-train a model on a variety of systems, we’re giving it the equivalent of a high school science education before asking it to solve graduate-level physics problems. If we instead take a model without any pre-training and immediately try to teach it a specific task, it’s like trying to teach a toddler to predict the fine-grained properties of black holes before it has even learned how to speak.
Now, the holy grail of science is not just more accurate predictions but new theories: explanations of the things we observe that give us an intuitive understanding, which in turn can guide further research and technology. Here, too, machine learning can help: we are starting to learn how to take the neural networks we use in science and distill them into human-interpretable equations. The method is called symbolic distillation: we effectively “factorize” the neural network into smaller components, then extract an equation from each component using symbolic regression. Scientists have used this approach to discover physical laws: a neural network rediscovered Newton’s law of gravity from the trajectories of the planets in our solar system, without being given the planets’ masses or any physical constants. And it’s not just rediscoveries of laws we already know: the same approach has been used to discover new laws governing the scaling of galaxies and the dynamics of cloud cover.
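To give a flavor of the equation-extraction step, here is a toy symbolic-regression example using PySR, Cranmer’s open-source library (exact parameter names and methods may differ between versions). In the real symbolic-distillation pipeline the regression targets would be internal components of a trained network, such as the messages passed inside a graph neural network, rather than the raw data used in this sketch.

```python
# Toy illustration of the symbolic-regression step of symbolic distillation:
# rediscover an inverse-square "force law" from data with PySR.
import numpy as np
from pysr import PySRRegressor

rng = np.random.default_rng(0)
m1, m2, r = rng.uniform(1, 10, (3, 500))
force = m1 * m2 / r**2  # the "unknown" law we want the regressor to rediscover

X = np.column_stack([m1, m2, r])
model = PySRRegressor(
    niterations=40,
    binary_operators=["+", "-", "*", "/"],
    maxsize=15,  # keep candidate equations short and interpretable
)
model.fit(X, force)
print(model.sympy())  # ideally something equivalent to x0*x1/x2**2
```

The search is over symbolic expressions rather than network weights, which is what makes the output something a human can read, check, and build a theory around.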
The primary promise of foundation models is to democratize the use of machine learning in science, by giving more people and institutions the ability to fine-tune powerful state-of-the-art models for their specific applications.[4] In the long run, though, they could give us not just more cost-effective and accurate predictions, but new insight. We can build large general-purpose models that “understand” a variety of physical systems, from climate to astrophysics to molecular biology, and couple them with better methods for interpreting neural networks and distilling them into human-interpretable equations. What if, by training large models on physics, chemistry, and biology, we are bringing our models a step closer to “base reality,” giving them a new kind of knowledge of the world—knowledge that isn’t constrained by our perceptual faculties (as in image and video) or by our conceptual priors (as in language)—knowledge that even we don’t have yet? If there’s anything we’ve learned from the last few years, it’s that we should expect to be surprised by the capabilities these models develop as we give them more compute and more data. It could be, as Cranmer puts it in a lecture, that the next great scientific theory is hiding somewhere inside a neural network.
Thanks to Suzanne and Andrew for feedback on drafts, and cover photo credit to Yves Pommier.
[1] Keep in mind that it’s not always “collecting” data in the strict sense, as these models are often trained on simulated data. The simulations are computed using standard numerical methods (e.g. molecular dynamics), but those simulations are cost-intensive and tend to accumulate a lot of error quickly, which is what makes neural network models of the simulations so useful.
[2] You might be surprised, for example, that modeling fluids involves very different sets of equations depending on whether we’re talking about shallow water or not, and whether the fluid is compressible or incompressible, and traditionally each of these problems is tackled with a neural network of its own, trained from scratch.
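To make the contrast concrete, here are two of the equation families this footnote alludes to, written in their simplest textbook forms (constant density, flat bottom, no external forcing). The incompressible Navier–Stokes equations are

$$\nabla \cdot \mathbf{u} = 0, \qquad \frac{\partial \mathbf{u}}{\partial t} + (\mathbf{u} \cdot \nabla)\mathbf{u} = -\frac{1}{\rho}\nabla p + \nu \nabla^2 \mathbf{u},$$

while the one-dimensional shallow-water equations are

$$\frac{\partial h}{\partial t} + \frac{\partial (hu)}{\partial x} = 0, \qquad \frac{\partial (hu)}{\partial t} + \frac{\partial}{\partial x}\left(hu^2 + \tfrac{1}{2} g h^2\right) = 0.$$

A surrogate model trained on one of these families has traditionally told us nothing about the other.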
[3] In Multiple Physics Pretraining for Physical Surrogate Models, they trained a number of models on several fluid mechanics simulations, comparing models that were (1) pre-trained on fluid mechanics, (2) pre-trained on video using the Something-Something video dataset and another dataset of human actions, and (3) not pre-trained at all. They found that both kinds of pre-trained models outperformed the models that learned the task from scratch.
[4] Foundation models also pose their own set of risks and challenges, like the fact that any biases baked into the model will get propagated to downstream tasks. See On the Opportunities and Risks of Foundation Models for more.