7 Generative AI
The deep learning revolution was largely sparked by successes in image classification, epitomized by the victory of the convolutional neural network AlexNet in the 2012 ImageNet competition. This model, trained on two consumer-grade GPUs, disrupted decades of research into handcrafted feature detectors and radically changed how computer scientists viewed neural networks. Prior to this incontrovertible demonstration, the consensus view was that neural networks were a subpar type of machine learning model. As far back as the 1960s, early mathematical results had shown that single-layer neural networks were severely limited in the variety of functions they could represent, which likely discouraged further inquiries into their potential. In 1989, it was mathematically proven that a neural network with just three layers (including the input and output layers) and a non-linear activation function is in fact a universal function approximator, meaning that given enough neurons, the model can approximate (nearly) any function arbitrarily well [99], [100]. Despite the ensuing revival in scientific interest, neural networks did not reach mainstream status at that time because they were difficult and expensive to train compared to other models. As discussed earlier in the book, the combination of abundant data and cheap computation was key in finally propelling neural networks to the fore.
Two limitations that had beset machine learning up to that point have since undergone a significant reexamination. Firstly, previous ML models were task-specific; a model trained on a given task was essentially useless when applied to a different, even highly related task. Secondly, the output of an ML model was usually restricted to be low-dimensional compared to the input data. Deep learning started out in much the same vein. The input to the AlexNet model is a \(256 \times 256\) pixel color image, and each pixel can take on \(2^{24}\) possible colors. This means that the number of possible inputs is over \(10^{473479},\) an astounding number given that the number of atoms in the universe is commonly estimated to be about \(10^{80}.\) The model’s output, on the other hand, is simply one of 1000 object labels (such as ‘Persian cat’ or ‘volleyball’). This is a very low-dimensional output compared to the dimensionality of the input. It is also highly specific to the particular dataset of images that was compiled for the purposes of the ImageNet competition. In subsequent years, deep learning research has shown that low-dimensional output and task specificity were in fact not fundamental limitations, as we outline below.
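To see where this figure comes from, note that the image contains \(256 \times 256 = 65536\) pixels, each of which can take one of \(2^{24}\) values, so the number of possible inputs is

\[
\left(2^{24}\right)^{65536} = 2^{1572864} = 10^{1572864 \log_{10} 2} \approx 10^{473479}.
\]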
7.1 Transfer learning and fine-tuning
It came as a surprise that image classification models trained on one task could in fact be repurposed for another, similar task, with modest computational effort. For example, a model trained on the ILSVRC 2012 dataset can be adapted to distinguish between an image of an American football and an image of a papaya, neither of which is a category in the original training dataset. It turns out that the weights learned in the early layers of the model are quite general-purpose feature detectors, which can be used as building blocks to detect nearly any kind of object. Only the last layers need to be retrained to adjust to a new object category. This technique is known as transfer learning, and requires much less computation and data than training the original model did. In the case of AlexNet, millions of images were required to train the model, and each image was ‘looked at’ 90 times over the course of the training (this is referred to as 90 epochs). By contrast, adapting the model to two new categories may require a few hundred additional images and a few epochs. Once the last layers have been retrained, while keeping the early layers frozen, one can also unfreeze the whole network and train a bit longer with a low learning rate, in order to further improve the final accuracy. This process is known as fine-tuning.
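To make this recipe concrete, the following sketch uses PyTorch and torchvision to adapt an ImageNet-pretrained AlexNet to two new categories; the specific optimizer and learning rates are illustrative assumptions, and the data loading and training loops are omitted.

import torch
import torch.nn as nn
from torchvision import models

# Load an AlexNet pretrained on ImageNet (ILSVRC 2012).
model = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)

# Transfer learning: freeze all pretrained weights ...
for param in model.parameters():
    param.requires_grad = False

# ... and replace the final layer (1000 ImageNet classes) with a new 2-class head,
# e.g. 'American football' versus 'papaya'. The new layer is trainable by default.
num_features = model.classifier[6].in_features
model.classifier[6] = nn.Linear(num_features, 2)

# Phase 1: train only the new head for a few epochs on the new images.
optimizer = torch.optim.Adam(model.classifier[6].parameters(), lr=1e-3)

# Phase 2 (fine-tuning): unfreeze the whole network and continue training
# briefly with a much lower learning rate.
for param in model.parameters():
    param.requires_grad = True
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)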
The success of early deep learning was thus achieved in supervised learning (cf. Section 2.2), where each data point comes with a label: the image category in the case of ImageNet. A drawback of supervised data is that it is labor-intensive to compile. For each data point, a label needs to be supplied, either by human annotation or by some automatic process which itself needs human verification. The ImageNet images were labeled by enlisting tens of thousands of paid workers via crowdsourcing platforms [101], [102]. Because unlabeled data (in other words, simply data) is much more prevalent in the world than labeled data, the ability to learn useful patterns from unlabeled data would obviously constitute an enormous advantage.
7.2 Unsupervised learning and generative models
Learning from unlabeled data is referred to as unsupervised learning, and it is the principal source of the power of generative deep learning models. Where the typical supervised ML model aims to learn a good mapping from \(X\) to \(Y,\) for example from images to category labels, an unsupervised model instead learns the joint probability distribution of \(X\) and \(Y,\) or simply the probability distribution of \(X\) if there are no labels \(Y.\) If the model is generative, it has the ability to sample from that learned distribution. In other words, it can generate new data points. If the data is a collection of cat pictures, the trained model will be able to produce a ‘new’ cat picture that was not part of the training data. Such unsupervised generative AI models have hugely improved in the past few years, and first gained prominence in the medium of text. The most emblematic generative models at the time of writing are the GPT family of large language models (LLMs) [103], [104], which are able to generate highly plausible text on nearly any given subject. GPT is the acronym for ‘generative pre-trained transformer’. The ‘transformer’ part of the name refers to the transformer neural network architecture (cf. Section 2.12), and ‘pre-trained’ signifies that the model has first been trained in an unsupervised fashion, and hence on a large body of unlabeled text. Finally, the ‘generative’ prefix indicates that the model is able to produce new data from the distribution that it has learned, i.e. to generate new text. With additional supervised fine-tuning of the pre-trained model, using for instance reinforcement learning from human feedback, the model can be adjusted so as to generate conversational responses to the user’s inputs, while attempting to avoid offensive, harmful or otherwise inappropriate output. The most famous such interface is ChatGPT, released by OpenAI in late 2022, but other offerings of a similar nature have rapidly joined the market since then. In particular, several competitive open-source models have been released, such as the Llama models [105], [106] by Meta, Gemma [107] by Google DeepMind, and BLOOM [108] through an initiative led by the company Hugging Face. These models are being applied to the generation of a wide variety of text, from poetry to book summaries to programming source code. Some current limitations of such models will be briefly discussed below.
Text is not the only medium in which generative deep learning has made leaps of progress. Image generation has also come a very long way, especially in the ‘text-to-image’ variety: the user supplies a short textual prompt describing the image to generate, and the model creates one or several images derived from that prompt. In this medium as well, OpenAI has supplied the most popular interface, based on its DALL-E family of models [109]–[111]. Notable image generation models on the open-source front include Stable Diffusion and DeepFloyd IF [22] by the company Stability AI. Note that many of the new large models are multimodal, meaning that they are able to deal with data from different media at once. For example, GPT-4 can understand both text and images as input, even though its generative capability is limited to text only. The new frontier of generative models is video generation. Lumiere [112] is a model released by Google in early 2024, which uses diffusion (see Section 2.15) and a U-Net architecture that downsamples video in both space and time, in order to generate new short videos based on input text and/or an input still image. A few weeks later, OpenAI announced its own video generation model, called Sora [113], showing samples of minute-long footage with much higher temporal consistency and realism than the prior state of the art. Given the pace at which generative models are advancing, it is very likely that several more highly capable models will have been released by the time this book is published.
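As an illustration of how such a text-to-image model is typically invoked, here is a minimal sketch using the open-source Stable Diffusion weights through the Hugging Face diffusers library; the model identifier, prompt and sampler settings are assumptions made for the example.

import torch
from diffusers import StableDiffusionPipeline

# Load an open-source Stable Diffusion checkpoint (repository identifier assumed).
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")  # assumes a CUDA-capable GPU is available

# The text prompt conditions the denoising diffusion process; the output is an image.
prompt = "an astronaut riding a horse on the moon, photorealistic"
image = pipe(prompt, num_inference_steps=50, guidance_scale=7.5).images[0]
image.save("generated.png")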
7.3 Limitations
Generative models have mass-market appeal, and some of the models discussed in this chapter have already been accessed by over a hundred million users [114]. The curious public has been experimenting with this new technology, attempting to figure out what it can do for them and to understand its strengths and limitations. We briefly discuss several of these limitations in this section.
7.3.1 Hallucination
ChatGPT and other LLMs are most commonly accessed via a text interface, enabling a written chat conversation with the AI. As such, users initially expect them to behave similarly to their usual human conversation partners. However, an LLM differs in several important respects. When speaking with a friend or colleague, we normally expect them to be truthful. In addition, we assume that they will adjust their level of confidence according to their knowledge of the subject being discussed, indicating by verbal and non-verbal cues when they feel competent to opine, and when they are less sure of themselves. Finally, we can generally expect them to be able to explain why they hold a certain belief, in particular by citing their sources for key pieces of information relevant to the topic. As it turns out, LLMs presently fall short of these expectations. An LLM will regularly generate textual statements that are completely false. In fact, it will do this in a highly confident tone, producing a mix of truthful and false statements with the aplomb of a con man. This has been referred to as hallucination, and sometimes confabulation. When users realize that the model has invented something in the course of their conversation, they can naturally take a dislike to the technology, associating its response with a fabrication designed to mislead.
Indeed, LLMs are trained on enormous text datasets containing a significant portion of the web, including discussion forums of all kinds. Not all utterances on the web are truthful, of course, and much of the text that exists in the world contradicts itself in part, reflecting the huge diversity in human opinions and beliefs, along with their falsehoods and biases. Consider also that some text may have been accurate when it was written but has since become out of date, or is otherwise learned without the necessary context. For example, novels and fiction in general account for a sizeable share of published text, and should be distinguished from newspapers, scientific literature, or online discussion forums. Therefore, even if the model were able to reproduce ‘the web’ with high fidelity, the user could get very different outputs depending on which part of the web the model was tapping into. On top of this, an LLM only learns an approximation of this enormous amount of text, which adds another level of difficulty to the goal of keeping it truthful, and which would remain even if one had fully vetted the entire body of text used for training. Unfortunately, in its current design, an LLM is unable to trace back from its choice of words in generated text to the portions of the training data that had the largest influence on that choice. Consequently, we typically cannot determine why the model responds in a certain way to a given prompt without clever detective work. If we ask it to explain itself, it will typically just hallucinate a response that sounds plausible, or apologize (a somewhat generic response added by the service provider through fine-tuning or other means) and state that the mistake may be due to incomplete data. Finally, one should be aware that the process of generating text includes randomness. At every stage of inference, the model samples a random word (or token) from a conditional probability distribution given the preceding words. This means that the text provided as an answer to any prompt is partly driven by rolls of numeric dice.
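The following sketch makes this sampling process explicit, generating a continuation token by token from a small open language model via the Hugging Face transformers library; the choice of GPT-2 as a stand-in, the prompt and the temperature value are illustrative assumptions.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")          # small open model as a stand-in
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

ids = tok("The weather in Zurich today is", return_tensors="pt").input_ids
for _ in range(20):
    with torch.no_grad():
        logits = model(ids).logits[0, -1]            # scores for every token in the vocabulary
    probs = torch.softmax(logits / 0.8, dim=-1)      # conditional distribution (temperature 0.8)
    next_id = torch.multinomial(probs, num_samples=1)  # the 'roll of the dice'
    ids = torch.cat([ids, next_id.unsqueeze(0)], dim=1)

print(tok.decode(ids[0]))

Running this loop twice without fixing the random seed will generally produce different continuations for the same prompt.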
Where images are concerned, it appears that we have very different expectations of a generative model than we do in the case of text. We are aware that the model generates fictional images, which may contain inaccuracies and counterfactual features. Even though it may be able to depict familiar objects and famous persons, we are not surprised by hallucination in this context. In fact, here it is arguably considered a feature rather than a bug. Image generation services are seeing widespread adoption for creative purposes, and inaccuracies are often humorous or otherwise interesting. Of course, this also brings new potential for misuse, such as in the case of deepfakes, which consist of applying a generative model to produce an image or video using someone’s likeness. Political commentators fear that deepfakes can be used to influence elections [115], [116], and indeed, if the technology reaches such a level of quality that even forensic analysis is unable to differentiate between real and fake images, we will need to rethink the very standing of photographic, video and audio evidence in our epistemic processes. The ability to choose between factual and invented content is certain to be an important requirement in the future. Especially in scientific applications, models will need to contain mechanisms to guarantee that important aspects of the generated data are governed by applicable conservation laws or other physical constraints.
7.3.2 Bias
As was alluded to in the section on hallucination, a generative model learns to approximate the probability distribution of the dataset that it is trained on. If the dataset contains inaccuracies, falsehoods, and biases, the model will inherit these aspects of the data, unless specific care is taken to avoid them. For example, AI bias has been well documented in the domain of face recognition. Identifying a person based on a photograph of their face is a challenge that many research efforts have tackled. Since the 1990s, models have consistently performed best on the combinations of race and gender that were most represented in the training dataset [117], sometimes resulting in extremely poor accuracy when attempting to recognize a person with a racial minority background. The EU AI Act, which the European Parliament and Council adopted in March 2024, explicitly bans ‘the use of real-time remote biometric identification systems in publicly accessible spaces for the purpose of law enforcement’, unless some stringent legal criteria are met [118]. The main reason for this prohibition is based on considerations of the fundamental privacy rights of citizens. If innocent persons got tangled up in police investigations simply because of a technological bias with respect to their immutable characteristics, it would be a severe injustice. The problem of bias permeates all types of ML models, including generative ones. If a group is underrepresented in the training dataset, it is likely to be equally underrepresented in the generated data, unless the training procedure (or the downstream generative procedure) is adjusted to rebalance the representation. On the topic of generative models, the EU AI Act stipulates that the use of generative models for producing deepfake image, audio or video content must be disclosed. Likewise, the use of generative models for textual output must be disclosed if the text is published with the purpose of informing the public on matters of public interest, unless a natural or legal person reviews the material and holds editorial responsibility for the published content [119].
7.3.3 Retaining copies and collapsing diversity
Several kinds of text-to-image generative models have been shown to retain a nearly exact memory of a small subset of their training data. As a consequence, given the right text prompt, the model ‘generates’ an image that is nearly identical to one of the images in the training dataset [120]. In most cases, this is an undesirable aspect of the model, causing issues in terms of privacy, intellectual property rights, or lack of creative ability. A similar phenomenon has been observed in the case of LLMs. In particular, when prompted to repeat a given word forever, ChatGPT repeated that word many times and then, all of a sudden, began outputting some of the text it had been pre-trained on, verbatim [121]. These discoveries provide a clear indication that the way in which deep generative models learn is not yet well understood.
Current generative models also seem unable to generate as much diversity as is present in their training datasets. This was noticed when using a dataset of images generated by a model to train a second model, then in turn using the output of the latter to train a third, and so on. Each successive dataset obtained in this way tends to be less diverse than the previous one. This has been referred to as ‘model collapse’ [122], and has been observed in text [123] and images [124]. From a statistical point of view, there is evidence that ‘rare’ features are discarded over successive generations, so that the models increasingly converge on average or majority features. This would be akin to the model creating its own, ever narrower view of the world.
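A deliberately simplified sketch of this effect (not the experimental setup of the cited studies) is to fit a one-dimensional Gaussian to data, sample a new dataset from the fit, refit, and repeat; with small samples, the estimated spread tends toward zero over many generations as the tails of the distribution are progressively lost.

import numpy as np

rng = np.random.default_rng(0)

# Generation 0: 'real' data from a standard normal distribution.
data = rng.normal(loc=0.0, scale=1.0, size=20)

for generation in range(1, 501):
    # 'Train' a toy generative model by estimating mean and standard deviation,
    # then 'generate' the next training set by sampling from the fitted model.
    mu, sigma = data.mean(), data.std()
    data = rng.normal(loc=mu, scale=sigma, size=20)
    if generation % 100 == 0:
        print(f"generation {generation}: estimated sigma = {sigma:.4f}")

# A single run is noisy, but because each fit only sees a finite sample, rare
# (tail) values are lost and the estimated sigma drifts toward zero over time.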
7.4 Implications for the Earth sciences
So far, generative models have not been broadly used in scientific applications, with a notable exception in numerical weather prediction (NWP). Here, three deep learning weather models were published in close succession, within the span of one year. Nvidia built FourCastNet, a neural network based on Fourier neural operators, which does much of its work in frequency space instead of pixel space [46], [125]. Google released its graph neural network model called GraphCast, which operates on a graph composed of the vertices of a globe-spanning mesh [47]. Finally, Huawei announced its Pangu-Weather model, built on a 3D Earth-specific transformer architecture [48]. All three models used the ERA5 global atmospheric reanalysis dataset in their training procedure, and were made to autoregressively predict the new state of global weather from the previous state. They were compared to the best-in-class classical numerical weather prediction models, which operate by numerically solving partial differential equations describing state transitions. Each neural network model outperformed the classical ones in several important metrics, disproving a recent assessment that a number of fundamental breakthroughs were likely to be required before achieving this feat [98]. In particular, Pangu-Weather showed higher accuracy across the board in predicting geopotential, temperature, specific humidity, and wind speed at the 500 hPa pressure level. Nevertheless, a classical physics-based model was employed to produce the reanalysis that underlies the training data, and therefore classical models remain a critical requirement for the success of AI/ML in NWP, as well as in many other applications. It should also be noted that the three AI/ML models did not perform well in predicting some important aspects of the 2023 storm Ciarán, such as the maximum wind speeds at 10-meter height [126]. Generally speaking, AI/ML models will perform much better at interpolation than at extrapolation. When faced with a situation that lies far outside of their training data, they may predict with much lower accuracy than a physics-based model would.
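The autoregressive forecasting loop shared by these models can be sketched generically as follows; the model, the initial state (for instance an ERA5 snapshot) and the length of the time step are placeholders, not the actual interfaces of the cited systems.

import torch

def autoregressive_forecast(model: torch.nn.Module,
                            initial_state: torch.Tensor,
                            n_steps: int) -> list:
    """Roll a learned weather model forward by feeding each prediction back in as input."""
    states = [initial_state]
    with torch.no_grad():
        for _ in range(n_steps):
            # One forward pass advances the global state by one model time step
            # (several hours of simulated weather).
            states.append(model(states[-1]))
    return states

# Example with placeholder model and state (illustrative only).
dummy_model = torch.nn.Identity()
dummy_state = torch.zeros(1, 5, 721, 1440)   # e.g. 5 atmospheric fields on a lat/lon grid
forecast = autoregressive_forecast(dummy_model, dummy_state, n_steps=4)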
Generative models are also being pursued in remote sensing. Instead of manipulating discrete language tokens, some research streams attempt to use the same technology on multispectral and multi-instrument satellite observations. In this context, a token is a pixel with several layers, each layer stemming from a specific band of a satellite image. The abundance of such pixel data in the remote sensing archive is such that autoregressive self-supervised approaches, akin to those used in large language models, are feasible, and they have led to promising predictive performance for several downstream tasks, such as predicting future surface reflectances across the 400-2300 nanometer range [127]. Another generative model architecture, the generative adversarial network (GAN), has previously been applied to Earth observation tasks. For example, conditional GANs (CGANs) were used to fill in voids in incomplete satellite observations, such as mountain shadows in incomplete radar data, as well as for spatial interpolation and image pansharpening [128].
7.5 Outlook
The potential of generative models is still being explored with much excitement, especially given their success in the language modeling space. Some voices argue that they are a red herring. Yann LeCun, the deep learning pioneer who shared the 2018 Turing Award with Geoffrey Hinton and Yoshua Bengio, believes that generative models as they are currently designed, especially for image sequences, do not have the right approach to succeed in the long run. The strategy thus far has been to degrade the input by adding noise or occluding parts of it, and then to learn to reconstruct all of the pixels. However, in many situations this is not a good objective, because there is not enough information to reconstruct the entire image exactly. According to LeCun, instead of focusing on the input distribution, one should aim to achieve high-quality reconstruction in an abstract space [129]. The JEPA proposal from his group is to produce a joint embedding of an image and of a noisy version of that image, and to learn to reconstruct the latent representation of the original image from the latent representation of the noisy one [130].
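A schematic interpretation of this idea in PyTorch might look as follows; the tiny encoder and predictor networks, the noise-based corruption and the stop-gradient on the target are illustrative simplifications rather than the published JEPA architecture.

import torch
import torch.nn as nn

# Tiny placeholder networks standing in for large vision backbones.
encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 256))             # image -> latent
predictor = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 256))

def corrupt(x):
    # Degrade the input, here by adding noise; masking parts of it is another option.
    return x + 0.5 * torch.randn_like(x)

x = torch.randn(8, 3, 64, 64)                      # a batch of (dummy) images

target_latent = encoder(x).detach()                # latent of the clean image, no gradient
predicted_latent = predictor(encoder(corrupt(x)))  # predicted from the corrupted view

# The reconstruction objective lives in latent space, not in pixel space.
loss = nn.functional.mse_loss(predicted_latent, target_latent)
loss.backward()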
It is very much an open question to what extent neural network based systems will be able to perform complex reasoning. DeepMind’s AlphaGeometry model [131] is reported to have achieved top-ranking performance in geometry problems, solving 25 out of 30 Olympiad-level geometry test problems, nearly as many as the average gold medalist of the International Mathematical Olympiad. It is a neuro-symbolic system in which a neural language model and a symbolic deduction engine work in concert. Essentially, the neural network proposes possible geometric constructions, and the symbolic deduction engine attempts to use them to obtain the desired outcome. The model produces human-readable proofs; that is, it is able to ‘show its work’. Nevertheless, the deduction part is still done in a rather ‘brute force’ fashion, unlike how a human would go about the task, using intuition to guide the flow of the proof as well. Causal thinking is very much part of our intuitive thinking, and thus the ability to consider causal relationships while reasoning is likely to be an important milestone on the way to more advanced levels of intelligence. A workshop of causal researchers found evidence that large language models were not consistently able to perform causal reasoning, and suggested the term ‘causal parrots’ [132] to describe them, building on an earlier critique that LLMs can behave like ‘stochastic parrots’, haphazardly stitching together linguistic forms they have observed in their training data [133]. In the next chapter, we provide an introduction to causal models: a branch of AI that aims to stay grounded in causal reasoning.