5 AI hardware and quantum computing

Computer hardware is the substrate of AI, just as the biological brain is thought to be the substrate of natural intelligence. Consequently, hardware is a key factor determining the abilities and performance of AI. According to one school of thought, AI is best pursued by imitating what we observe in biology. Such approaches could be built on commercial off-the-shelf (COTS) hardware, or may benefit from bespoke hardware. We are currently heading down the path of hardware specialization, although the wave of deep learning was launched on COTS hardware that serendipitously matched its computational needs to a tee.

Indeed, the revolutionary deep learning model ‘AlexNet’ [71] was a CNN-based neural network (cf. Section 2.10), essentially a scaled-up version of a precursor model proposed some 25 years earlier [72]. AlexNet was implemented to run on Graphics Processing Units (GPUs), which initially had nothing to do with AI: GPUs were designed specifically to process and display graphical data, especially in 3D for the computer gaming sector. They had been steadily growing in performance for two decades, fueled by the public’s voracious demand for high-powered gaming experiences. At their core, GPUs are built to perform matrix multiplications at high speed. As it happens, this is precisely the workhorse operation underlying all of deep learning. Consequently, AI quickly added itself to the list of GPU customers, alongside gamers and, more recently, cryptocurrency miners, and the AI use case has been growing ever since, gaining official support from the major manufacturers. Many of today’s largest supercomputers are equipped with hundreds or thousands of GPUs, and deep learning has become a staple scientific workload on these machines.
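
To make the link between deep learning and matrix multiplication concrete, the following minimal NumPy sketch (with arbitrarily chosen, illustrative shapes) shows that the forward pass of a single fully connected layer is essentially one matrix product; convolutional layers are likewise commonly lowered to batched matrix multiplications by libraries such as cuDNN.

```python
import numpy as np

# Minimal sketch: the forward pass of one fully connected layer is a matrix
# multiplication followed by an elementwise non-linearity. Shapes are illustrative.
rng = np.random.default_rng(0)
x = rng.standard_normal((64, 1024))    # a batch of 64 inputs with 1024 features each
W = rng.standard_normal((1024, 512))   # learned weight matrix
b = np.zeros(512)                      # learned bias vector

h = x @ W + b            # the matrix multiplication GPUs excel at
h = np.maximum(h, 0.0)   # ReLU activation
print(h.shape)           # (64, 512)
```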

5.1 Data and compute power

The concept of artificial neural networks, as well as the main algorithm for training them (cf. Section 2.4), has been around in some form since at least the 1960s12. However, neural networks remained impractical until fairly recently, because two chief ingredients were not available in sufficient quantity: data and compute power.

In order to train a deep learning model from scratch, one typically requires a large dataset. The public availability of large datasets is a relatively recent phenomenon, and building such a dataset to get started was a significant hurdle that would have made this avenue of research impractical for most researchers. In fact, the AlexNet breakthrough was achieved as an entry to the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) competition, using a subset of one of the first publicly available large collections of annotated images – over 1 million images, spanning 1000 image categories [71].

The second ingredient whose availability to the average researcher lagged far behind the invention of neural networks was compute power (although high-end computers were available to the defense sector). As was previously discussed, stochastic gradient descent is an iterative algorithm, repeatedly cycling over the dataset’s entries and slightly improving the model at every pass. This incremental training process is highly compute- and memory-intensive, and training a model of sufficient size to obtain interesting results would have been beyond the means of most major research institutes, let alone individual researchers. Interestingly, research on neural networks had lain dormant for so many years that, by the time it was revived, consumer-grade hardware had caught up: the ImageNet-winning model was trained on consumer-grade hardware rather than on a supercomputer.
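
As a reminder of what this incremental process looks like, here is a schematic NumPy sketch of stochastic gradient descent on a simple linear least-squares model; the dataset, model size, learning rate and number of epochs are all arbitrary illustrative choices, not values from the text.

```python
import numpy as np

# Schematic stochastic gradient descent on a linear model with squared-error loss.
# Everything here (dataset, sizes, learning rate) is illustrative.
rng = np.random.default_rng(1)
X = rng.standard_normal((10_000, 100))   # synthetic dataset inputs
y = X @ rng.standard_normal(100)         # synthetic targets
w = np.zeros(100)                        # model parameters to be learned
lr, batch = 1e-3, 32

for epoch in range(10):                  # repeated passes over the data
    for i in range(0, len(X), batch):
        xb, yb = X[i:i + batch], y[i:i + batch]
        grad = 2 * xb.T @ (xb @ w - yb) / len(xb)   # gradient of the batch loss
        w -= lr * grad                              # small parameter update
```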

In a prescient piece entitled ‘The unreasonable effectiveness of data’ [74], prominent Google researchers shared their observation that large volumes of data, fed into a simple (but large) model, often lead to better outcomes than spending one’s efforts on building a more intricate and sophisticated model. They came to this conclusion while working on huge swathes of textual data, but it turns out to be applicable to image data as well, and appears likely to extend to further data modalities. Prominent AI researcher Richard Sutton reached a similar conclusion and published the ‘bitter lesson’ [75] he had learned over the course of many decades: computation trumps cleverness, because simple models are easier to scale in the long run. Kaplan et al [76] provided a striking illustration of this phenomenon in the context of large language models. They showed smooth power-law relationships between the test loss (i.e. the quality of the learned model) and the amount of computation put into the training, the dataset size in number of words/tokens, and the size of the model in number of parameters. Based on these scaling laws, it even became fairly predictable how well a model would perform, given a certain compute budget, a training set of a certain size, and a certain number of parameters. Such predictable benefits have spurred huge investments in AI hardware, mainly in the form of GPUs, as well as in custom AI hardware architectures. The increased investment is highly visible in Figure 5.1, which shows a steep uptick in the computational resources spent on training state-of-the-art ML models. Whereas the trend slightly outpaced Moore’s law up until the early 2010s, it has grown much more rapidly since then (Moore’s law is described in Section 5.4).

Figure 5.1: Floating-point operations (FLOP) necessary to train machine learning models, as a function of time. Each point represents a notable ML system according to the Epoch database [77]. The vertical dashed line splits the timeline into two segments, before and after the ImageNet 2012 competition. A regression line is displayed for each segment. The rate of increase in the computational requirements for model training has more than doubled since the advent of deep learning.
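
To illustrate the kind of relationship reported by Kaplan et al [76], the sketch below fits a power law of the form L(C) = a · C^(−α) between training compute C and test loss L, and then extrapolates it to a larger compute budget. The data points and the resulting exponent are invented for illustration only; the actual exponents must be taken from [76] itself.

```python
import numpy as np

# Hedged illustration of a compute scaling law: model the test loss L as a
# power law in training compute C, i.e. L(C) = a * C**(-alpha).
# All data points below are invented for illustration; they are NOT from [76].
compute = np.array([1e18, 1e19, 1e20, 1e21])   # training budgets in FLOP (made up)
loss    = np.array([3.8, 3.3, 2.9, 2.55])      # corresponding test losses (made up)

# Fit log L = log a - alpha * log C with ordinary least squares.
slope, intercept = np.polyfit(np.log(compute), np.log(loss), 1)
alpha = -slope

# Use the fitted law to predict the loss at a ten times larger compute budget.
predicted = np.exp(intercept) * (1e22) ** (-alpha)
print(f"fitted exponent alpha ~ {alpha:.3f}, predicted loss at 1e22 FLOP ~ {predicted:.2f}")
```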

5.2 Hardware co-evolution

The company Nvidia oriented itself towards AI as an important application early on, releasing its CUDA Deep Neural Network library (cuDNN) of deep learning primitives in 2014, allowing AI developers to make optimal use of its GPUs and feeding the deep learning wave [78]. The rapid successes of DL across a wide variety of tasks and domains have turned from individual snowballs into a global avalanche: AI methods are fast becoming mainstream tools, and accordingly claim a considerable share of the world’s computational resources. In fact, DL workloads are now so widespread that they have begun to influence and shape the development of computer hardware itself. During the Transforming AI panel at the 2024 Nvidia GTC conference, the authors of the transformer neural network architecture argued that the history of deep learning has been to ‘build an AI model that’s the shape of a GPU, and now the shape of a supercomputer’, to which Nvidia’s CEO Jensen Huang replied, ‘we’re building the supercomputer to the shape of the model’. Chips, circuits and systems are now designed with AI use cases in mind, making AI a first-class citizen in hardware design considerations, where a decade earlier it was riding on the coattails of computer games. Today, entire datacenters are built for the sole purpose of making AI more efficient.

Google was a trailblazer in this new industry, announcing in 2016 that its Tensor Processing Unit (TPU) had already been in use in its datacenters for over a year. The TPU was specifically designed to support Google’s TensorFlow software framework for training and running deep learning models. That was the initial design; the fifth generation of the TPU, announced in 2023, is orders of magnitude more capable. A host of companies has since joined the fray to produce machines that are highly optimized for typical deep learning workloads. A key difficulty to overcome in this respect is the scaling of ML training: how to train ever bigger models?

Simply splitting the model training across multiple computers linked together through a network introduces inefficiencies. Such a system is bottlenecked by bandwidth and latency, meaning that the heaps of data cannot make their way to the processing elements quickly enough. This situation has led the company Cerebras to design for the maximum amount of computation that can possibly be achieved on a single semiconductor device. Indeed, their revolutionary concept is to use an entire silicon wafer13 per unit, rather than follow the industry-standard method of splitting the wafer into hundreds or thousands of chips. The processing elements on the wafer have the considerable advantage of being directly connected to each other via sub-micron copper wires, produced by extremely accurate lithographic processes. In the traditional approach, transferring data between processing elements frequently means moving data off one chip and onto another via a comparatively much bulkier interconnect, which uses up space and energy and requires additional serialization/deserialization and caching components. The Cerebras unit, by contrast, shuffles data around the wafer over shorter distances and through a flat memory hierarchy, which results in compounding efficiency gains. In addition, it integrates optimizations for the sparsity of neural networks (many weights or activations are equal to zero, so that much of the computation is unnecessary), which must otherwise be remedied by additional, sometimes costly, procedures such as pruning [79]. Indeed, sparsity is an aspect that can crop up in many areas of AI/ML, modeling, and datasets. Efficiently handling this sparsity, in particular by avoiding unnecessary computation on zero-valued or null data, is likely to grow in importance as data sizes reach new heights, and can be tackled through special treatment at the hardware level [80].
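
The following small sketch (using NumPy and SciPy, with an arbitrarily chosen 90% sparsity level) illustrates the basic point: once most weights are zero, a sparse representation stores and multiplies only the non-zero entries, skipping the work that a dense computation would waste on zeros.

```python
import numpy as np
from scipy import sparse

# Illustration of sparsity: prune ~90% of the weights of a layer to zero
# (the percentage is arbitrary), then compare dense and sparse matrix-vector
# products. The sparse version touches only the non-zero weights.
rng = np.random.default_rng(2)
W = rng.standard_normal((4096, 4096))
W[rng.random(W.shape) < 0.9] = 0.0      # ~90% of the entries become zero

W_sparse = sparse.csr_matrix(W)         # compressed sparse row storage
x = rng.standard_normal(4096)

y_dense = W @ x                          # ~16.8 million multiply-accumulates
y_sparse = W_sparse @ x                  # only ~1.7 million, one per non-zero
print(np.allclose(y_dense, y_sparse), f"non-zero fraction: {W_sparse.nnz / W.size:.2f}")
```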

5.3 Training and inference

Thus far in this chapter, we have mainly discussed the training phase of deep learning. This refers to the stage of creating a model, and teaching it to perform well on a given task, such as detecting a pedestrian in an image. The training phase is highly demanding in computational resources, and is typically done in a centralized setting where resources may be more abundant.

Conversely, the inference phase of DL refers to the actual use of the model, once the training of its parameters is complete, and it is deployed in an operational setting. In certain cases, inference also takes place in a centralized setting. A good example of this is the processing of billions of queries to a large language model (LLM) such as ChatGPT. In many cases, however, inference is done in a decentralized fashion, close to the source of the input data. For instance, the pedestrian-detection model mentioned above could be installed as a component in a car’s driver-assistance system. This is referred to as edge AI, because it operates somewhere at the edge of the internet, rather than at its center.

A central training facility requires hefty amounts of compute power, memory and bandwidth, whereas an inference/edge AI system typically performs far fewer calculations. However, it needs to be sufficiently fast to reliably produce outputs for its intended purpose, and has to run on a fairly small energy budget, so as to avoid draining the battery of the edge device (mobile phone, electric car, satellite, etc). Because of these key differences in requirements, the inference phase has spawned a significant number of dedicated edge and IoT hardware designs, both from established manufacturers and from startup companies.
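
To give a rough sense of the asymmetry between the two phases, the back-of-the-envelope sketch below uses the common approximation (as in the scaling-law analysis of [76]) that a forward pass costs roughly 2N floating-point operations per token for a model with N parameters, and a training step (forward plus backward pass) roughly 6N per token. The model size, training-set size and query length are illustrative assumptions, not figures from this chapter.

```python
# Back-of-the-envelope comparison of training vs. inference compute, assuming the
# rough approximations: forward pass ~ 2*N FLOP per token, training step ~ 6*N FLOP
# per token for a model with N parameters. All concrete numbers are illustrative.
N = 7e9                    # model parameters (illustrative)
D = 1e12                   # tokens seen during training (illustrative)
tokens_per_query = 1_000   # size of a single inference request (illustrative)

training_flop = 6 * N * D                  # total training cost
inference_flop = 2 * N * tokens_per_query  # cost of answering one query
print(f"training  ~ {training_flop:.1e} FLOP")
print(f"inference ~ {inference_flop:.1e} FLOP per query")
print(f"ratio     ~ {training_flop / inference_flop:.0e} queries' worth of compute")
```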

5.4 Quantum computing

Much like AI, quantum computing has made many headlines in recent years, and the two are similar in that both are technologies with tremendous potential and promise, although both are still undergoing fundamental development.

Quantum computing aims to harness the laws of quantum mechanics to solve problems which are intractable for classical computers, in the sense that they would require far too much time and/or energy to solve. The key ingredient in quantum computing is the quantum entanglement of its memory elements14, which, if it can be maintained for a sufficient amount of time, allows them to work together as a single entity. Roughly speaking, this means that with every additional element, a quantum computer doubles the complexity that it can handle; its capability grows exponentially as we add computing elements. This is in stark contrast with classical computers, where adding a computing element (say, an additional CPU) results in a roughly linear15 increase in capabilities.
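
A simple way to see this exponential growth is to count what a classical computer would need in order merely to simulate a quantum register: the state of n entangled qubits is described by 2^n complex amplitudes, so each extra qubit doubles the storage (and work) required. The short sketch below tabulates this, assuming 16 bytes per double-precision complex amplitude.

```python
# The state of an n-qubit register is a vector of 2**n complex amplitudes.
# Tabulate how quickly a classical simulation becomes infeasible, assuming
# 16 bytes of storage per double-precision complex amplitude.
for n_qubits in (10, 20, 30, 40, 50):
    amplitudes = 2 ** n_qubits
    memory_gb = amplitudes * 16 / 1e9
    print(f"{n_qubits:2d} qubits -> {amplitudes:.2e} amplitudes, ~{memory_gb:.3g} GB")
```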

This exponential possibility is what generates so much enthusiasm for the pursuit of quantum computing. The semiconductor industry itself has been accustomed to rapid exponential growth since its beginnings, the number of transistors in a dense integrated circuit having doubled approximately every two years for more than half a century – a phenomenon named Moore’s law, after Intel co-founder Gordon Moore, who predicted the trend in the 1960s and 1970s. However, physical constraints such as heat dissipation and leakage currents are increasingly limiting our ability to keep packing ever more transistors into the same 2D planar area of silicon. While the industry is expanding this pursuit into the third dimension [81], it is also increasingly keen to find a fundamentally different approach to maintain – or even surpass – the exponential growth in computing capabilities.

Within quantum computing, there remain many issues to resolve. On the physical side, it is extremely difficult to build a quantum computer that is able to keep its computing elements reliably entangled over any significant length of time. This issue is known as maintaining coherence among the quantum elements, as opposed to deteriorating into decoherence, which is most often caused by stray electrical and magnetic noise. Current designs struggle to achieve long coherence times, even when the system is cooled to near absolute zero, which makes managing such a system physically demanding and expensive.

On the conceptual side of quantum computing, many questions remain open as well. A quantum computer cannot be programmed using the same methods as a classical computer. Designing an algorithm for a quantum computer boils down to composing just the right choreography16 of quantum interferences among the computing elements to achieve the desired outcome. This is a fundamentally different way of programming, closer in kind to analog computing, which predates digital computing.

5.5 How will AI and quantum computing shake hands?

Since both AI and quantum computing are still rapidly evolving, it is difficult to predict how the two will eventually interact; however, we can offer some speculation on the topic. One question of interest is whether quantum computing can be useful for the computationally intensive training phase of deep learning. Training a large neural network requires vast amounts of data and involves a large number of parameters, while quantum computing excels at complex calculations which can be compactly specified and parameterized, and whose answer can be compactly read out. Every additional parameter requires additional qubits, and every data input threatens the isolation needed to maintain entanglement/coherence. Accordingly, we can venture the hypothesis that, unless a large dataset can be very economically encoded in a quantum computer, the latter will be of limited use for training large models. There is also the significant issue that quantum computers have a very limited output, so an application using a quantum computer for AI inference would need to retrieve answers in a highly compact form: the size of the answer essentially needs to be at most as big as the number of qubits in the system [83]. Nevertheless, using quantum computers for machine learning (‘Quantum ML’) is an active field of study, pursued among others by CERN in the context of its quantum technology initiative17.

A second question of interest is whether AI will enable the future of quantum computing. For one thing, it may help us write programs for quantum computers. As discussed above, designing quantum algorithms remains an art mastered by very few so far. Potentially, an AI model could be trained to translate certain categories of classical programs to the quantum domain. After all, translation problems of all stripes have proven to be a strong suit of deep learning in particular, ranging over natural language text, source code, audio and other data, so qubits and quantum gates may also enter its vocabulary. On the hardware side of things, the ability of AI to efficiently simulate quantum systems [84], [85] could provide a helpful resource to advance research on how to build a good quantum computer.

There is also an aspect in which AI and quantum computing are in competition with each other. Both technologies require very substantial investments for research and development, and funds going to one are not available to the other. It has been argued that AI’s fast progress is detrimental to investment in quantum computing, because AI is able to massively speed up and enhance classical computing, pushing out the time horizon for quantum computing to achieve ‘supremacy’, or at least economic feasibility, in any specific area [86].

Since AI and quantum computers have different strengths, it is likely that they will complement each other in the future, by acting as components in a larger system which can draw on both, according to the nature of the task at hand, in a hybrid computer architecture.

References

[71]
A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” Communications of the ACM, vol. 60, no. 6, pp. 84–90, May 2017, doi: 10.1145/3065386.
[72]
Y. LeCun et al., “Backpropagation Applied to Handwritten Zip Code Recognition,” Neural Computation, vol. 1, no. 4, pp. 541–551, Dec. 1989, doi: 10.1162/neco.1989.1.4.541.
[73]
J. Schmidhuber, “Who Invented Backpropagation?” Nov. 2020. Available: https://people.idsia.ch/~juergen/who-invented-backpropagation.html
[74]
A. Halevy, P. Norvig, and F. Pereira, “The unreasonable effectiveness of data,” IEEE Intelligent Systems, vol. 24, no. 2, pp. 8–12, Mar. 2009, doi: 10.1109/MIS.2009.36.
[75]
R. Sutton, “The Bitter Lesson,” Mar. 19, 2019. http://www.incompleteideas.net/IncIdeas/BitterLesson.html (accessed Feb. 09, 2024).
[76]
J. Kaplan et al., “Scaling Laws for Neural Language Models,” Jan. 22, 2020. http://arxiv.org/abs/2001.08361 (accessed Feb. 09, 2024).
[77]
Epoch, “Parameter, compute and data trends in machine learning.” https://epochai.org/data/epochdb, 2024.
[78]
S. Chetlur et al., “cuDNN: Efficient primitives for deep learning,” arXiv preprint, 2014, doi: 10.48550/arXiv.1410.0759.
[79]
T. Hoefler, D. Alistarh, T. Ben-Nun, N. Dryden, and A. Peste, “Sparsity in Deep Learning: Pruning and growth for efficient inference and training in neural networks,” Journal of Machine Learning Research, vol. 22, no. 241, pp. 1–124, 2021.
[80]
M. Speiser, “On Sparsity in AI/ML and Earth Science Applications, and its Architectural Implications.” Geneva, Switzerland/Virtual, Jul. 2021.
[81]
M. Radosavljevic and J. Kavalieros, “3D-Stacked CMOS Takes Moore’s Law to New Heights,” IEEE Spectrum. https://spectrum.ieee.org/3d-cmos, Aug. 2022.
[82]
Lex Fridman, “Scott Aaronson: Quantum Computing.” Feb. 17, 2020. Available: https://www.youtube.com/watch?v=uX5t8EivCaM
[83]
F. Tennie and T. N. Palmer, “Quantum Computers for Weather and Climate Prediction: The Good, the Bad, and the Noisy,” Bulletin of the American Meteorological Society, vol. 104, no. 2, pp. E488–E500, Feb. 2023, doi: 10.1175/BAMS-D-22-0031.1.
[84]
A. Jolly, “Researchers Simulate Ice Formation by Combining AI and Quantum Mechanics,” HPCwire. https://www.hpcwire.com/off-the-wire/researchers-simulate-ice-formation-by-combining-ai-and-quantum-mechanics/, Aug. 2022.
[85]
L. Zhang, J. Han, H. Wang, R. Car, and W. E, “Deep Potential Molecular Dynamics: A Scalable Model with the Accuracy of Quantum Mechanics,” Physical Review Letters, vol. 120, no. 14, p. 143001, Apr. 2018, doi: 10.1103/PhysRevLett.120.143001.
[86]
Sabine Hossenfelder, “It looks like AI will kill Quantum Computing.” 2024. Accessed: Feb. 21, 2024. [Online]. Available: https://www.youtube.com/watch?v=Q8A4wEohqT0