Design engineers’ increasing reliance on accelerators to enhance system performance has begun to alter the basic concepts of how devices channel and process data. Many leading-edge devices entering today’s market incorporate technologies like high-resolution imagery and video, machine learning, and virtual and augmented reality. Faced with demand for products that require rapid processing of huge amounts of data while consuming minimal energy, designers now turn to programmable logic chips that optimize processing workloads. This trend has been further driven by CMOS scaling’s inability to keep pace with performance demands. As a result, accelerators such as graphics processing units (GPUs), field-programmable gate arrays (FPGAs), digital signal processors (DSPs), machine-learning accelerators and heterogeneous CPU cores that can perform tasks in parallel have begun to take on new, high-profile roles in electronics designs.
Chipmakers often link the deployment of accelerators with efficiency and performance requirements. “Acceleration is generally chosen because it either improves energy efficiency (performance/W) or performance density (performance/mm^2)—often both,” says Jem Davies, vice president, general manager and fellow at ARM.
At the core of these efficiency/performance issues lies a different type of compute workload. Applications like machine and deep learning and augmented and virtual reality require parallel processing hardware that can efficiently handle huge data sets, a feature that existing CPUs often cannot provide by themselves.
To better understand what accelerators bring to the table, compare their features with those of traditional CPUs. Typically, CPU cores are optimized for sequential processing. Processor manufacturers have pushed the performance of CPUs nearly as far as they can using conventional techniques, such as increasing clock speeds and straight-line instruction throughput.
Accelerators, on the other hand, leverage parallel architectures designed to handle repetitive tasks simultaneously. The primary benefit of accelerators is that they can offload and quickly execute compute-intensive portions of an application while the remainder of the code runs on the CPU.
“In the case of CPUs and GPUs, the contrast is large. A datacenter CPU can have eight to 24 cores,” says Robert Ober, chief platform architect for datacenter products at NVIDIA. “A datacenter GPU has more than 5,000 cores that have the same order of floating-point capability. The result is orders of magnitude more usable capability and throughput from a GPU for these tasks. The result is more efficiency—fewer steps for the same work, and less power and energy spent on that work.”
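The offload pattern described above can be sketched in a few lines. This is a minimal illustration only, not any vendor's API: a Python thread pool stands in for the accelerator's work queue, and the names heavy_kernel and run_application are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def heavy_kernel(samples):
    """Stand-in for a compute-intensive, data-parallel kernel (hypothetical)."""
    return [x * x for x in samples]


def run_application(samples):
    # Assumption: the executor only stands in for an accelerator's work queue.
    with ThreadPoolExecutor(max_workers=1) as accelerator:
        # 1. Offload the compute-intensive portion; submit() returns immediately.
        pending = accelerator.submit(heavy_kernel, samples)

        # 2. The remainder of the application keeps running on the host.
        log = ["host prepared %d samples" % len(samples)]

        # 3. Block only when the offloaded result is actually needed.
        squares = pending.result()

    log.append("accelerator returned %d results" % len(squares))
    return squares, log


squares, log = run_application([1, 2, 3, 4])
```

On real hardware, the submit step would correspond to launching a kernel on the GPU, FPGA or DSP, and the host would typically overlap its own work with the device's execution in the same way.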
CPU giant Intel hasn’t overlooked the need for acceleration and parallelism. When the company introduced its Xeon Phi processor family, it described it as enabling “machines to rapidly learn without being explicitly programmed, in addition to helping drive new breakthroughs using high performance modeling and simulation, visualization and data analytics.”
Just last month, Intel launched a hardware and software platform solution to enable faster deployment of customized FPGA-based acceleration of networking, storage and computing workloads. In a press release, the company introduced the Intel Programmable Acceleration Card with the Intel Arria 10 GX FPGA enabled by the acceleration stack as the first in a family of Intel Programmable Acceleration Cards. It is expected to be broadly available in the first half of 2018. The platform approach enables original equipment manufacturers to offer Intel Xeon processor-based server acceleration solutions.
A New Design Perspective
As engineers increasingly use accelerators to extract value from big data, design teams have to rethink the flow of data within the systems they are designing and reconsider where processing should occur. In doing so, they fundamentally change conventional design philosophy. For years, engineers focused on reducing energy consumption, often by turning off many of the cores in a chip. With changing consumer demands, many engineers are shifting their attention to improving performance. Rather than processing all functions in a single CPU, they have begun to use multiple heterogeneous types of processors, or cores, with specialized functionality.
“Machine learning workloads have very specific data flow patterns and computational requirements,” says Davies. “By tailoring designs toward these workloads, great gains can be achieved in both energy efficiency and performance density. Because of the importance of machine learning, we expect all computing platforms—CPUs, GPUs and machine learning accelerators—to evolve to meet emerging machine learning requirements.”
Proof of this approach can be seen in recent innovations. Data scientists have been using accelerators to make groundbreaking improvements in applications such as image classification, video analytics, speech recognition and natural language processing.
One Size Does Not Fit All
These advances, however, have not been enabled by just one type of accelerator. The implementation of machine learning and other specialized applications requires engineers to leverage a variety of accelerators. Although it’s true that all accelerators improve performance, no one size fits all. Most of these processors involve some form of customization.
For example, GPUs perform well when accelerating algorithms in the learning phase of machine learning because they can run floating-point calculations in parallel across many cores. But in the inference phase of machine learning, engineers benefit from using FPGA and DSP accelerators because these processors excel at fixed-point calculations.
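To make the floating-point versus fixed-point distinction concrete, here is a minimal sketch of a dot product, the core operation of neural-network inference, computed both ways. The 8-bit fractional format, the function names and the sample values are all assumptions chosen for illustration; real FPGA and DSP designs use fixed word widths with saturation and rounding logic that this sketch omits.

```python
def quantize(values, frac_bits=8):
    """Convert floats to signed fixed-point integers with frac_bits fractional bits."""
    scale = 1 << frac_bits
    return [round(v * scale) for v in values]


def fixed_point_dot(a_q, b_q, frac_bits=8):
    """Dot product using integer arithmetic only."""
    # Each product carries 2 * frac_bits fractional bits, so the accumulator does too.
    acc = sum(x * y for x, y in zip(a_q, b_q))
    # Shift back down so the result carries frac_bits fractional bits again.
    return acc >> frac_bits


def to_float(q, frac_bits=8):
    """Convert a fixed-point integer back to a float for comparison."""
    return q / (1 << frac_bits)


# Hypothetical weights and inputs, chosen to be exactly representable.
weights = [0.5, -0.25, 0.125]
inputs = [1.0, 2.0, 4.0]

float_result = sum(w * x for w, x in zip(weights, inputs))
fixed_result = to_float(fixed_point_dot(quantize(weights), quantize(inputs)))
```

With these values both paths produce 0.5. In general the fixed-point result is an approximation, and the trade is a small loss of precision for multipliers and adders that are much cheaper in silicon and energy.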
Some companies have even gone so far as to create their own customized accelerators for machine learning. “You’ve probably seen the first of these new architectures, the tensor processing unit [TPU] recently announced by Google as their proprietary custom accelerator for machine inference,” says Nigel Toon, CEO of Graphcore. “Startup Nervana—recently acquired by Intel—also claims they are working on a TPU. It’s especially exciting to see Google advocating tailored processor design for machine learning.”
Graphcore itself has introduced a custom accelerator for machine intelligence applications. “If we think of the central processing unit in your laptop as being designed for scalar-centric control tasks and the GPU as being designed for vector-centric graphics tasks, then this new class of processor would be an intelligence processing unit [IPU], designed for graph-centric intelligence tasks,” says Toon. “When we started thinking about building a machine to accelerate intelligence processing at Graphcore, we knew that we had to look beyond today’s deep neural networks … Our IPU has to outperform GPUs and CPUs at all these tasks. But perhaps more importantly, it has to provide a flexible platform for the discoveries yet to come.”
One of the latest entries in this arena is Inuitive’s new multi-core image processor called the NU4000. This chip supports 3D imaging, deep learning and computer vision processing for applications such as augmented and virtual reality, drones and robots.
Although these technologies give design engineers new options for enhancing advanced systems, they also introduce additional challenges into the design process.
At the nuts-and-bolts level, the inclusion of accelerators in designs forces engineers to make tough tradeoffs in performance and flexibility. “The gains in energy efficiency and performance density come from targeting your design toward a given workload or set of workloads. This, in turn, will reduce the flexibility and sometimes the programmability of a system,” says Davies.
These tradeoffs, as well as the increased emphasis on performance, also force designers to take a broader view. According to Davies: “In the past, an SoC designer may have focused primarily on a single benchmark or set of benchmarks. We’re seeing the focus switch from benchmarks to use cases, where the use case may consist of multiple different workloads, running on a combination of accelerators, working together in unison.”
Adding these new factors to the mix will make it essential that engineers adopt a new set of technology-specific tools. Not surprisingly, processor providers have already moved to meet this need.
New Architectures and Development Tools
Design engineers will find a broad assortment of tools and development environments tailored for the inclusion of accelerators in designs. Many of these go one step further and provide the means to use accelerators to develop leading-edge applications like machine intelligence.
NVIDIA offers its CUDA Toolkit. This development environment targets GPU-accelerated applications, providing libraries, debugging and optimization tools, a C/C++ compiler and a runtime library to deploy applications.
CUDA libraries promise to enable drop-in acceleration across multiple domains, such as linear algebra, image and video processing, deep learning and graph analytics. Using built-in capabilities for distributing computations across multi-GPU configurations, engineers can develop applications that scale across a variety of platforms, from single GPU workstations to cloud installations.
At the same time, Intel has been leveraging its nearly ubiquitous ecosystem of technology to position the Intel Xeon Phi family as easy for developers to adopt. According to the company, it allows implementers to simplify code modernization and reduce programming costs by sharing code and a developer base with Intel Xeon processors. “Standardizing on a unified Intel architecture means you can use a single programming model for all your code,” says Intel, “thereby reducing operational and programming expenses through a shared developer base and code reuse.”
Some accelerator makers like Graphcore have developed processors customized specifically for machine learning. To help with the deployment of its IPUs, Graphcore has developed Poplar, a graph-programming C++ framework that abstracts the graph-based machine learning development processes from the underlying graph-processing IPU hardware. The framework includes a graph compiler that promises to translate the standard operations used by machine learning frameworks into optimized application code for the IPU. The graph compiler builds up an intermediate representation of the computational graph to be scheduled and deployed across one or many IPU devices.
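The general idea of a graph compiler's intermediate representation can be illustrated with a toy example. The sketch below is not Poplar's API and makes no claim about how Graphcore's compiler works. It assumes a hypothetical five-operation graph, orders it with Python's standard graphlib module (Python 3.9 or later) and assigns operations that are ready at the same time to devices in round-robin fashion.

```python
from graphlib import TopologicalSorter

# Hypothetical computational graph: each operation maps to the set of
# operations whose results it depends on.
graph = {
    "load_input": set(),
    "load_weights": set(),
    "matmul": {"load_input", "load_weights"},
    "add_bias": {"matmul"},
    "relu": {"add_bias"},
}


def schedule(graph, num_devices):
    """Return a list of steps; each step lists (operation, device) pairs
    whose dependencies are already satisfied, so they can run together."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()
    steps = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted for a deterministic plan
        # Spread the operations that are ready now across devices, round-robin.
        steps.append([(op, i % num_devices) for i, op in enumerate(ready)])
        sorter.done(*ready)
    return steps


plan = schedule(graph, num_devices=2)
```

Here the two load operations have no dependencies, so they land in the first step on separate devices, while the remaining operations form a chain that runs one step at a time. A production compiler would also weigh data movement, memory limits and per-operation cost, none of which this sketch models.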
ARM has unveiled its latest generation of processor designs named DynamIQ. The semiconductor giant contends that chips built using the new multi-core microarchitecture will be easier to configure, allowing manufacturers to connect a wider assortment of CPUs. This could allow for not only more powerful systems-on-chip but also processors that can better perform computing tasks like artificial intelligence.
DynamIQ builds on ARM’s “big.LITTLE” approach, which pairs a cluster of “big” processors with a set of power-sipping “little” ones. DynamIQ takes this flexibility one step further by supporting cores that fall anywhere in between, an approach known as heterogeneous computing. DynamIQ will let chipmakers optimize their silicon, allowing them to build AI accelerators directly into chips, which promises to help systems manage data and memory more efficiently.
Looking to the Future
It’s clear after looking at developments like ARM’s DynamIQ that the rise of accelerators will not diminish the importance of CPUs. On the contrary, the CPU will continue to play a vital role in future systems. “CPUs are good for general-purpose computing,” says Ober. “Modern computers run dozens of applications and numerous background processes, so the need for powerful CPUs will likely continue to increase.”
A look at current system frameworks bears this out. SoC applications have evolved to a point where there is typically a central traditional processor complex that orchestrates tasks at a high level, while complex, specialized tasks are distributed to heterogeneous accelerators sprinkled throughout the device.
That said, both traditional CPU makers and accelerator developers are mindful of forces like machine learning and augmented reality, which promise to play a big part in shaping the form and function of the next generation of electronic devices. To ensure their roles in supporting these new technologies, processor providers of every kind continue to push the limits of their respective technologies.
ARM’s new microarchitecture DynamIQ certainly shows the semiconductor giant’s intention to provide the flexibility to support applications like artificial intelligence. NVIDIA’s Quadro and Tesla GPUs continue to advance the power of accelerators.
But where you are likely to see the most dynamic change is in accelerators that have been built specifically for advanced applications like machine intelligence. A good example of this can be seen in Graphcore’s Colossus IPU, a 16 nm massively parallel, mixed-precision floating-point processor expected to become available early in 2018. Designed from the ground up for machine intelligence, this new processor promises to be nothing like a GPU. What it is supposed to do is take the accommodation of new workloads one step further, pushing the envelope on how devices channel and process data.