6 Why believe AI? The role of machine learning in science
Artificial intelligence has made significant inroads into science over the past decade, especially through the machine learning variant known as deep learning. This chapter examines the challenges and questions concerning the role of AI in the scientific process. We proceed by building on two articles by Naomi Oreskes, ‘Why Believe a Computer?’ [87] and ‘The Role of Quantitative Models in Science’ [88], written at the turn of the millennium, which critically examine the role that computers should play in science. Computer simulation has since become widely accepted as the ‘third pillar of science’ [89], alongside theory and physical experimentation. While AI will not replace any of these paradigms, it may enrich them, and perhaps come to be regarded as a fourth paradigm in its own right [90]. With the benefit of two decades’ worth of hindsight, we revisit some of Oreskes’ arguments and refurbish them for the era of AI.
6.1 Testability and complexity
As a historian of science, Oreskes [87] describes how scientific epistemology has evolved across the centuries, with the aim of identifying the proper place (if any) that should be granted to computers within it. A key notion arising early on is that testing lies at the heart of science. Since at least the 17th century, following the case made by Sir Francis Bacon, many scientists have agreed that a theory should be testable, and tested, in order to be accepted as correct. This notion was refined by Karl Popper in the 1930s. Popper held that a theory should indeed be testable (or falsifiable), yet could never be fully proven correct: it could only be shown to be in accord with the available experimental data, not proven to remain so with respect to any future evidence. Pierre Duhem, in his 1906 publication ‘The Aim and Structure of Physical Theory’, did not reject the need for testability, but argued that any theory could be modified and extended post hoc so as to accommodate new findings, and that the more complex the theory, the easier it would be to extend in this way. This argument lends further justification to the principle of Occam’s razor, widely embraced by science, according to which a simple theory is preferable to a complex one of similar explanatory power: a simple theory is easier to test, and it is more obvious when someone modifies it post hoc to accommodate incompatible evidence.
The dilemma18 that arises in the context of computer models is the following: the more faithfully a computer model represents a complex real-world system, the harder it is to test. Unfortunately, a complete description of a natural process, if one is even possible, would be extremely verbose and intricate. For instance, when expressed in a programming language, the number of lines of computer code in a global climate model can easily reach into the millions, at least one order of magnitude longer than the complete works of William Shakespeare (albeit much less pleasant to read), and even then it remains only a rough approximation of the underlying processes.
On one level of abstraction, deep learning is a much simpler program than any general circulation model. In fact, only a handful of basic mathematical functions are required to implement even the largest deep learning models. However, a deep learning model contains an enormous number of parameters (millions, and in the largest recent models even trillions), all of which influence its operation, so that the complexity is shifted from the computer code to the parameter space. This abundance of parameters, or degrees of freedom, seems to fly in the face of Occam’s razor, and is one of the main points of contention concerning the admissibility of deep learning in the scientific toolbox. A quote attributed to John von Neumann reads, ‘with four parameters I can fit an elephant, with five I can make him wiggle his trunk’, and this sentiment permeates the scientific and statistical thinking in which most scientists and engineers are steeped throughout their education.
The danger being alluded to is that of severe overfitting (cf. Section 2.5), that is, of having a model so plastic that it contorts itself to accommodate any input data without capturing its essential characteristics, and consequently generalizes poorly: its accuracy on previously unseen input will be low. The traditional advice for avoiding overfitting has therefore steered statistical modeling strongly towards using as few parameters as possible. Since deep learning maximizes rather than minimizes the number of parameters, and yet attains state-of-the-art prediction accuracy in a continuously growing list of domains, how has the problem of overfitting been resolved?
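Before turning to the answer, the classical concern itself can be illustrated with a toy experiment (a sketch of our own, not taken from the sources cited here): fit polynomials of low and high degree to a handful of noisy points and compare their errors on held-out data.

```python
# Toy illustration of overfitting: a flexible model matches the training
# points almost exactly but generalizes poorly. Degrees, sample sizes and
# noise level are arbitrary illustrative choices.
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    x = rng.uniform(-1, 1, n)
    y = np.sin(np.pi * x) + rng.normal(0, 0.1, n)   # underlying signal plus noise
    return x, y

x_train, y_train = make_data(10)
x_test, y_test = make_data(200)

for degree in (3, 9):
    coeffs = np.polyfit(x_train, y_train, degree)   # least-squares polynomial fit
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree}: train MSE {train_mse:.4f}, test MSE {test_mse:.4f}")
```

With only ten training points, the degree-9 polynomial typically drives its training error to nearly zero while its test error far exceeds that of the degree-3 fit.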
Information-theoretic studies of deep learning, going under the heading of the ‘information bottleneck’ [91], suggest that the optimization procedure used in deep learning, stochastic gradient descent (cf. Section 2.4), may be naturally somewhat immune to overfitting under certain assumptions about the statistics of the data. All the same, additional algorithmic techniques have been introduced with the explicit aim of avoiding overfitting. One such technique, called dropout [92] (cf. Section 2.9) and developed in the lab of Turing laureate Geoffrey Hinton, randomly perturbs the model during training, forcing it to build internal redundancy and thereby robustness. Dropout turned out to be highly effective against overfitting, and has been widely applied ever since the deep learning model ‘AlexNet’ described in Chapter 5 won the ImageNet Large Scale Visual Recognition Challenge in September 2012. That image classification competition had up to then been dominated by bespoke hand-crafted models, and the surprise takeover arguably ignited the ongoing deep learning revolution. From this perspective, Hinton’s dropout can be seen as an alternative principle to Occam’s razor. But is it compatible with science?
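Before moving on, the mechanism behind dropout is simple enough to sketch in a few lines. The following minimal ‘inverted dropout’ forward pass is an illustrative sketch, not the implementation of [92]: each activation is zeroed with some probability during training, and the survivors are rescaled so that the expected activation stays the same and nothing needs to change at test time.

```python
# Minimal sketch of (inverted) dropout applied to a layer's activations.
import numpy as np

rng = np.random.default_rng(42)

def dropout(activations, p_drop=0.5, training=True):
    """Zero each unit with probability p_drop during training and rescale
    the surviving units; at test time the activations pass through unchanged."""
    if not training or p_drop == 0.0:
        return activations
    keep = rng.random(activations.shape) >= p_drop     # random keep/drop mask
    return activations * keep / (1.0 - p_drop)         # rescale the survivors

h = np.array([0.2, 1.5, -0.7, 3.0])
print(dropout(h))   # roughly half of the entries come out zeroed
```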
6.2 The purpose of science
This leads us to step back and ascertain which goals we are pursuing in a given scientific endeavor. Oreskes [88] states that
the purpose of modeling in science must be congruent with the purpose of science itself: to gain understanding of the natural world.
Achieving understanding may be the purest driver of science, but there are others. Obtaining reliable predictions without true understanding would certainly rank as a more desirable scientific outcome than the absence of both. Arguably, quantum mechanics is an example of a theory which provides extraordinary capabilities for prediction, without necessarily delivering a deep understanding of the natural world. Deep learning, at the time of writing, is certainly much better at delivering predictions than understanding.
Besides prediction, informing policy is another important goal of science, that is, providing sensible governance recommendations based on the current body of knowledge, weighed against estimated risks and benefits. Here, deep learning may be less suitable for direct application, since assessing risk and benefit typically requires context and common sense, two dimensions in which AI still needs vast improvement. It may, however, be usefully applied to produce predictions which inform such recommendations.
Although other objectives exist, understanding reigns supreme among the desired outcomes of the scientific process; even flawed understanding may serve as a stepping stone. Understanding allows us to build on what we know and to theorize about further implications, thereby motivating new experiments which may in turn reveal flaws in our presumed understanding, driving progress as a result. In this area, deep learning has not made many contributions19. Research into extracting understanding from deep learning models does exist, going under such headings as explainable AI (XAI) and interpretable AI, but these efforts are still in their early stages.
Causal models (Section 2.17) are an alternative approach to AI and may be more fundamental to science. They are championed, among others, by Judea Pearl, also a Turing laureate, who holds that ‘all the impressive achievements of deep learning amount to just curve fitting’ [94]. In other words, deep learning focuses on finding statistical relationships in high-dimensional data, without considering the causal relationships that give rise to them. Human understanding is closely tied to causal interpretations of the natural world, and hence our AI models ought to speak that same causal language if we are to learn something from them. Causal inference and causal discovery are highly active areas of research today, although not nearly as active as deep learning. Building a causal model for a given problem domain requires much more domain knowledge than deep learning does, and the software packages for causal modeling are mostly at the research stage, whereas the deep learning software stack is now industrial-strength. The causal approach also lags far behind deep learning in a fundamental respect: it requires the modeler to define what the variables are (and, for best results, to specify any partial domain knowledge about the causal relationships between those variables). In contrast, deep learning’s most significant productivity enhancer has been to figure out all by itself what the variables are, from raw data, eschewing the need for such ‘feature engineering’. Unfortunately, the variables that deep learning comes up with do not necessarily have a causal interpretation. Perhaps combining the best of both approaches will supply an answer, but this will require major conceptual breakthroughs.
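The gap between statistical association and causal effect can be made concrete with a small simulation (a hedged sketch of our own, not an example from Pearl’s work): a hidden confounder Z drives both X and Y, so a curve-fitting model finds a strong relationship between X and Y, yet intervening on X leaves Y untouched.

```python
# Correlation without causation: a confounder Z causes both X and Y, and
# there is no causal arrow from X to Y. Coefficients are arbitrary.
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# Observational regime: Z -> X and Z -> Y.
z = rng.normal(size=n)
x = 2.0 * z + rng.normal(scale=0.1, size=n)
y = 3.0 * z + rng.normal(scale=0.1, size=n)
slope_observed = np.polyfit(x, y, 1)[0]     # regression ('curve fitting') slope of Y on X

# Interventional regime, do(X): X is set externally, severing the Z -> X arrow.
x_do = rng.normal(size=n)
y_do = 3.0 * z + rng.normal(scale=0.1, size=n)
slope_do = np.polyfit(x_do, y_do, 1)[0]

print(f"observed slope       ~ {slope_observed:.2f}  (close to 1.5)")
print(f"interventional slope ~ {slope_do:.2f}  (close to 0)")
```

A purely predictive model trained on the observational data would report a strong effect of X on Y; only a model that represents the causal structure can tell us that manipulating X achieves nothing.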
6.3 High-dimensional output, low-dimensional internals
Returning to the concern of overfitting, the picture we may often implicitly have in mind is that of a very high-dimensional input combined with a low-dimensional output. For instance, the input might consist of petabytes of physical measurements of tens or hundreds of atmospheric, oceanic and other variables across time and space, while the output is a single quantity such as the global average surface temperature. In such a setting, overfitting is indeed an overwhelming concern. The situation is qualitatively different, however, if the complexity of the output matches that of the input: for instance, if a model received the same petabytes of input described above, but now had to predict, as its output, values for all of those variables at subsequent timesteps and at myriad physical locations. Predicting a billion values correctly on the basis of a fundamentally flawed model would be much more of a fluke than predicting just a single value or a single time series.
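A back-of-the-envelope calculation (a toy sketch of our own, under the admittedly crude assumption that the outputs are matched independently) shows how quickly the chance of such a fluke vanishes as the output dimension grows: if a structurally flawed model matches any single output by chance with probability p, it matches d outputs with probability p^d.

```python
# Toy calculation: probability that a flawed model matches d independent
# output values purely by chance, assuming probability p per value.
import math

p = 0.9   # a deliberately generous chance of matching any single value
for d in (1, 100, 10_000, 1_000_000_000):
    log10_prob = d * math.log10(p)      # work in log space to avoid underflow
    print(f"d = {d:>13,}: chance of fluke agreement ~ 10^({log10_prob:.3g})")
```

Even with a 90% chance per value, matching a billion values by accident is, for all practical purposes, impossible.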
For such high-dimensional ambitions on both the input and the output side to become practical, the availability of the required data can of course be a limiting factor; however, Oreskes’ observation [87] that
the availability of data from the natural world has not kept pace with advances in theory and computation
is potentially being turned on its head. The data volumes from Earth Observation programs do not yet match those generated by model simulations, but the challenge is shifting from data dearth to data deluge, as in many other scientific arenas. For example, the Copernicus data archive, hosting the satellite observations collected by the European Space Agency’s Sentinel missions, is expected to grow from 34 to 80 petabytes within six years [95]. NASA’s Earth Science Data Systems (ESDS) archive is expected to grow to approximately 250 petabytes by 2025 [96]. To get a sense of scale, imagine 250 petabytes of written text, printed out on A4 paper sheets. The stack of paper would reach far past the moon’s orbit, and cover nearly 10% of the distance from the Earth to the sun.
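This claim can be checked with a rough back-of-envelope calculation; the figures assumed below for text per page and paper thickness are our own approximations, not values from [95] or [96].

```python
# Rough scale check for 250 petabytes of text printed on A4 paper.
# Assumptions: ~2,000 bytes of plain text per page, ~0.1 mm per sheet.
bytes_total = 250e15
bytes_per_page = 2_000
sheet_thickness_m = 0.0001

pages = bytes_total / bytes_per_page
stack_km = pages * sheet_thickness_m / 1_000

moon_km = 384_400      # average Earth-Moon distance
sun_km = 149.6e6       # average Earth-Sun distance

print(f"stack height : {stack_km:,.0f} km")
print(f"vs the Moon  : {stack_km / moon_km:.0f} times the Earth-Moon distance")
print(f"vs the Sun   : {stack_km / sun_km:.1%} of the Earth-Sun distance")
```

Under these assumptions the stack comes out at roughly ten million kilometres: dozens of Earth-Moon distances, and a high single-digit percentage of the way to the sun, consistent with the figures quoted above.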
A deep learning design pattern has emerged as a powerful way to improve both the trainability and the interpretability of models: combining high-dimensional input and high-dimensional output with a lower-dimensional part in the middle. Work along these lines has shown that manipulating the values within this lower-dimensional mid-section can have meaningful or intuitive effects on the output20, while manipulating values in other layers typically results in more haphazard changes, which indicates that the network is somehow distilling the task down to its essence within the lowest-dimensional layers. Whether this representation is in a language that we can truly understand remains an open question.
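A minimal sketch of this design pattern is an autoencoder-style network whose middle layer is much narrower than its input and output; the PyTorch code and layer widths below are illustrative choices of our own, not an architecture from the sources cited here.

```python
# High-dimensional input -> low-dimensional middle -> high-dimensional output.
# Layer widths are illustrative only.
import torch
import torch.nn as nn

class BottleneckNet(nn.Module):
    def __init__(self, dim_io=4096, dim_code=32):
        super().__init__()
        # Encoder: compress the high-dimensional input into a narrow code.
        self.encoder = nn.Sequential(
            nn.Linear(dim_io, 512), nn.ReLU(),
            nn.Linear(512, dim_code),
        )
        # Decoder: expand the code back to the full output dimension.
        self.decoder = nn.Sequential(
            nn.Linear(dim_code, 512), nn.ReLU(),
            nn.Linear(512, dim_io),
        )

    def forward(self, x):
        code = self.encoder(x)            # the low-dimensional internal representation
        return self.decoder(code), code

model = BottleneckNet()
x = torch.randn(8, 4096)                  # a batch of high-dimensional inputs
output, code = model(x)
print(output.shape, code.shape)           # torch.Size([8, 4096]) torch.Size([8, 32])
```

Perturbing the 32 values in `code` and decoding the result is precisely the kind of mid-section manipulation referred to above.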
6.4 Data impedance mismatch and end-to-end DL
On the topic of data, Oreskes raises an important limitation of working with computer models, which is that the data must typically first be transformed in order to match the variables that were designed as inputs to the models. This can be referred to as the data ingest problem. The same applies on the output side of the process [87]:
The gap that exists between empirical input and model parameters is mirrored by a gap between model output and the data that could potentially confirm it.
In engineering terms, we could describe this as an impedance mismatch between the data and the typical computer model. As it turns out, deep learning has an ace up its sleeve in this respect, because it can handle essentially any input and output we choose. The trend in many areas where deep learning is applied to large datasets is to use an end-to-end approach, from raw data directly to the desired output. For instance, the classical approach to speech recognition decomposes the problem into a sequence of sub-problems: first extract specific features from the audio data, then process those features to obtain phonemes or syllables, then use these to construct words, and finally assemble a full textual transcript from those words [97]. In contrast, the end-to-end deep learning approach skips all of these explicit intermediate representations and directly learns to map raw audio to a transcript. One could argue that this contributes to the black-box character of deep learning. However, since using a traditional model already requires mappings from raw data to intelligible variables and from model output to verification data, those same mappings remain available for evaluating deep learning models. Indeed, end-to-end thinking is being considered in scientific applications, such as numerical weather prediction [98]. The related topic of generative models is discussed in the upcoming Chapter 7.
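To make the contrast concrete, the following sketch maps audio features directly to per-frame character probabilities and trains them with a CTC loss, with no explicit phoneme or word stages in between; the architecture and sizes are hypothetical illustrations of our own, not the systems of [97] or [98].

```python
# End-to-end sketch: audio features in, character probabilities out, with
# no hand-designed phoneme or word stages. Sizes are hypothetical throughout.
import torch
import torch.nn as nn

n_mels, n_chars = 80, 29     # e.g. log-mel features; 26 letters + space + apostrophe + blank

model = nn.Sequential(
    nn.Conv1d(n_mels, 256, kernel_size=11, stride=2, padding=5), nn.ReLU(),
    nn.Conv1d(256, 256, kernel_size=11, padding=5), nn.ReLU(),
    nn.Conv1d(256, n_chars, kernel_size=1),           # per-frame character scores
)
ctc_loss = nn.CTCLoss(blank=0)

features = torch.randn(4, n_mels, 400)                # batch of 4 spectrograms, 400 frames
logits = model(features)                              # shape (batch, chars, frames)
log_probs = logits.permute(2, 0, 1).log_softmax(-1)   # CTC expects (frames, batch, chars)

targets = torch.randint(1, n_chars, (4, 30))          # dummy transcripts, 30 characters each
input_lengths = torch.full((4,), log_probs.size(0), dtype=torch.long)
target_lengths = torch.full((4,), 30, dtype=torch.long)

loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
loss.backward()                                       # trained end-to-end from features to text
```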
Given the current uncertainties surrounding the future (and even present) capabilities of deep learning, it is difficult to assign it a clear-cut role in the scientific process. On the one hand, the predictive capabilities it provides are so useful that we may be compelled to rethink basic principles of scientific thinking, such as Occam’s razor. On the other, efforts to understand and interpret the internals of deep learning models may yet produce the kind of understanding that we classically expect from scientific models, allowing us to have our cake and eat it too. In Chapter 8, we will cover a different kind of AI that is fully aligned with the current paradigms of the scientific process: causal models.