This article is part of the Technology Insight series, made possible with funding from Intel.
As data sprawls out from the network core to the intelligent edge, increasingly diverse compute resources follow, balancing power, performance, and response time. Historically, graphics processors (GPUs) were the offload target of choice for data processing. Today field programmable gate arrays (FPGAs), vision processing units (VPUs), and application specific integrated circuits (ASICs) also bring unique strengths to the table. Intel refers to those accelerators (and anything else to which a CPU can send processing tasks) as XPUs.
The challenge software developers face is determining which XPU is best for their workload; arriving at an answer often involves lots of trial and error. Faced with a growing list of architecture-specific programming tools to support, Intel spearheaded a standards-based programming model called oneAPI to unify code across XPU types. Simplifying software development for XPUs can’t happen soon enough. After all, the move to heterogeneous computing—processing on the best XPU for a given application—seems inevitable, given evolving use cases and the many devices vying to address them.
KEY POINTS
- Intel sees heterogeneous computing (where a host device sends compute tasks to different accelerators) as inevitable.
- An XPU can be any offload target commanded by the CPU, built on any architecture from any hardware vendor.
- The oneAPI initiative is an open, standards-based programming model that allows developers to target multiple XPUs with a single code base.
Intel’s strategy faces headwinds from NVIDIA’s incumbent CUDA platform, which assumes you’re using NVIDIA graphics processors exclusively. That walled garden may not be as impenetrable as it once was. Intel already has a design win with its upcoming Xe-HPC GPU, code-named Ponte Vecchio. The Argonne National Laboratory’s Aurora supercomputer, for example, will feature more than 9,000 nodes, each with six Xe-HPC GPUs, totaling more than 1 exaFLOP/s of sustained double-precision performance.
Time will tell if Intel can deliver on its promise to streamline heterogeneous programming with oneAPI, lowering the barrier to entry for hardware vendors and software developers alike. A compelling XPU roadmap certainly gives the industry a reason to look more closely.
Heterogeneous computing is the future, but it’s not easy
The total volume of data spread between internal data centers, cloud repositories, third-party data centers, and remote locations is expected to increase by more than 42% from 2020 to 2022, according to The Seagate Rethink Data Survey. The value of that information depends on what you do with it, where, and when. Some data can be captured, classified, and stored to drive machine learning breakthroughs. Other applications require a real-time response.
The compute resources needed to satisfy those use cases look nothing alike. GPUs optimized for server platforms consume hundreds of watts each, while VPUs in the single-watt range might power smart cameras or computer vision-based AI appliances. In either example, a developer must decide on the best XPU for processing data as efficiently as possible. This isn’t a new phenomenon. Rather, it’s an evolution of a decades-long trend toward heterogeneity, where applications can run control, data, and compute tasks on the hardware architecture best suited to each specific workload.
Above: The quest for more performance will make heterogeneous computing a necessity.
“Transitioning to heterogeneity is inevitable for the same reasons we went from single core to multicore CPUs,” says James Reinders, an engineer at Intel specializing in parallel computing. “It’s making our computers more capable, and able to solve more problems and do things they couldn’t do in the past — but within the constraints of hardware we can design and build.”
As with the adoption of multicore processing, which forced developers to start thinking about their algorithms in terms of parallelism, the biggest obstacle to making computers more heterogeneous today is the complexity of programming them.
It used to be that developers programmed close to the hardware using low-level languages, providing very little abstraction. The code was often fast and efficient, but not portable. These days, higher-level languages extend compatibility across a broader swathe of hardware while hiding a lot of unnecessary details. Compilers, runtimes, and libraries underneath the code make the hardware do what you want. It makes sense that we’re seeing more specialized architectures enabling new functionality through abstracted languages.
oneAPI aims to simplify software development for XPUs
Even now, new accelerators require their own software stacks, gobbling up the hardware vendor’s time and money. From there, developers make their own investment in learning new tools so they can determine the best architecture for their application.
Instead of spending time rewriting and recompiling code using different libraries and SDKs, imagine an open, cross-architecture model that can be used to migrate between architectures without leaving performance on the table. That’s what Intel is proposing with its oneAPI initiative.
Above: The oneAPI Base Toolkit includes everything you need to start writing applications that take advantage of Intel’s CPU and XPU architectures.
oneAPI comprises a high-level language (Data Parallel C++, or DPC++), a set of APIs and libraries, and a hardware abstraction layer for low-level XPU access. On top of the open specification, Intel has its own suite of toolkits for various development tasks. The Base Toolkit, for example, includes the DPC++ compiler, a handful of libraries, a compatibility tool for migrating NVIDIA CUDA code to DPC++, the optimization-oriented VTune profiler, and the Advisor analysis tool, which helps identify the best kernels to offload. Other toolkits home in on more specific segments, such as HPC, AI and machine learning acceleration, IoT, rendering, and deep learning inference.
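To make that concrete, here is a minimal sketch (not from the article) of what single-source DPC++/SYCL code looks like: one kernel, dispatched at runtime to whatever XPU the queue selects. The vector-add workload and names here are illustrative, using standard SYCL 2020 constructs.

```cpp
#include <sycl/sycl.hpp>  // DPC++ / SYCL 2020
#include <iostream>
#include <vector>

int main() {
    // The runtime picks an available device: a GPU, the CPU, or another
    // accelerator. The kernel below stays the same either way.
    sycl::queue q{sycl::default_selector_v};
    std::cout << "Running on: "
              << q.get_device().get_info<sycl::info::device::name>() << "\n";

    constexpr size_t N = 1024;
    std::vector<float> a(N, 1.0f), b(N, 2.0f), c(N, 0.0f);
    {
        // Buffers hand the data to the runtime for the scope below.
        sycl::buffer bufA{a}, bufB{b}, bufC{c};
        q.submit([&](sycl::handler &h) {
            sycl::accessor A{bufA, h, sycl::read_only};
            sycl::accessor B{bufB, h, sycl::read_only};
            sycl::accessor C{bufC, h, sycl::write_only};
            h.parallel_for(sycl::range<1>{N},
                           [=](sycl::id<1> i) { C[i] = A[i] + B[i]; });
        });
    }  // Buffer destructors copy results back to the host vectors.
    std::cout << "c[0] = " << c[0] << "\n";  // prints 3
}
```

The point is that nothing in the kernel names a target architecture; retargeting the code means changing which device the queue binds to, not rewriting the computation.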
“When we talk about oneAPI at Intel, it’s a pretty simple concept,” says Intel’s Reinders. “I want as much as possible to be the same. It’s not that there’s one API for everything. Rather, if I want to do fast Fourier transforms, I want to learn the interface for an FFT library, then I want to use that same interface for all my XPUs.”
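Reinders’ FFT example maps onto oneAPI’s math library, oneMKL. Below is a hedged sketch of its DFT interface; the header name and call signatures follow recent oneMKL DPC++ releases and may differ by version. The queue decides which XPU runs the transform, while the library calls stay the same.

```cpp
#include <sycl/sycl.hpp>
#include <oneapi/mkl.hpp>  // umbrella header for oneMKL's DPC++ interfaces
#include <complex>
#include <cstdint>
#include <vector>

int main() {
    namespace dft = oneapi::mkl::dft;
    constexpr std::int64_t N = 1024;

    // Point the queue at any supported device; the FFT code doesn't change.
    sycl::queue q{sycl::default_selector_v};
    std::vector<std::complex<float>> data(N, {1.0f, 0.0f});

    // One descriptor type, committed to whichever device q targets.
    dft::descriptor<dft::precision::SINGLE, dft::domain::COMPLEX> desc(N);
    desc.commit(q);

    sycl::buffer buf{data};
    dft::compute_forward(desc, buf);  // in-place forward transform on the XPU
}
```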
Intel isn’t putting its clout behind oneAPI for purely selfless reasons. The company already has a rich portfolio of XPUs that stand to benefit from a unified programming model (in addition to the host processors tasked with commanding them). If each XPU were treated as an island, the industry would end up stuck where it was before oneAPI: with independent software ecosystems, marketing resources, and training for each architecture. Making as much as possible common across architectures lets developers spend more time innovating and less time reinventing the wheel.
What will it take for the industry to start caring about Intel’s message?
An enormous number of FLOP/s, or floating-point operations per second, come from GPUs. NVIDIA’s CUDA is the dominant platform for general purpose GPU computing, and it assumes you’re using NVIDIA hardware. Because CUDA is the incumbent technology, developers are reluctant to change software that already works, even if they’d prefer more hardware choice.
Above: Intel’s Xe-HPC GPU employs a brand new architecture, high-bandwidth memory, and advanced packaging technologies to deliver unprecedented performance.
If Intel wants the community to look beyond proprietary lock-in, it needs to build a better mousetrap than its competition, and that starts with compelling GPU hardware. At its recent Architecture Day 2021, Intel disclosed that a pre-production implementation of its Xe-HPC architecture is already producing more than 45 TFLOPS of FP32 throughput, more than 5 TB/s of fabric bandwidth, and more than 2 TB/s of memory bandwidth. At least on paper, that’s higher single-precision performance than NVIDIA’s fastest data center processor.
The world of XPUs is more than just GPUs though, which is exhilarating and terrifying, depending on who you ask. Supported by an open, standards-based programming model, a panoply of architectures might enable time-to-market advantages, dramatically lower power consumption, or workload-specific optimizations. But without oneAPI (or something like it), developers are stuck learning new tools for every accelerator, stymieing innovation and overwhelming programmers.
Above: Fugaku, the world’s fastest supercomputer, uses optimized oneDNN code to maximize the performance of its Arm-based CPUs.
Fortunately, we’re seeing signs of life beyond NVIDIA’s closed platform. As an example, the team responsible for RIKEN’s Fugaku supercomputer recently used Intel’s oneAPI Deep Neural Network Library (oneDNN) as a reference to develop its own deep learning library. Fugaku employs Fujitsu A64FX CPUs, based on Armv8-A with the Scalable Vector Extension (SVE) instruction set, an architecture that didn’t yet have a deep learning library. Optimizing Intel’s code for Armv8-A processors yielded up to a 400x speed-up compared to simply recompiling oneDNN without modification. Incorporating those changes into the library’s main branch makes the team’s gains available to other developers.
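For readers curious what device-neutral oneDNN code looks like, here is a minimal, hedged sketch of an in-place ReLU pass using oneDNN’s C++ API (v3.x-style constructor signatures assumed; they changed between major versions, and the tensor shape is arbitrary). A build of the library tuned for the target CPU, such as the Fugaku team’s A64FX port, supplies the fast kernels underneath the same calls.

```cpp
#include <dnnl.hpp>  // oneDNN C++ API (v3.x-style signatures assumed)
#include <vector>

int main() {
    // The API is identical whichever engine backs it; an optimized build
    // (x86, A64FX, GPU) determines how fast these calls run.
    dnnl::engine eng(dnnl::engine::kind::cpu, 0);
    dnnl::stream strm(eng);

    // A small float32 tensor: batch=1, channels=8, height=4, width=4.
    dnnl::memory::desc md({1, 8, 4, 4}, dnnl::memory::data_type::f32,
                          dnnl::memory::format_tag::nchw);
    std::vector<float> data(1 * 8 * 4 * 4, -1.0f);
    dnnl::memory mem(md, eng, data.data());

    // ReLU forward primitive, executed in place on the tensor.
    dnnl::eltwise_forward::primitive_desc pd(
        eng, dnnl::prop_kind::forward_inference,
        dnnl::algorithm::eltwise_relu, md, md, 0.0f);
    dnnl::eltwise_forward(pd).execute(
        strm, {{DNNL_ARG_SRC, mem}, {DNNL_ARG_DST, mem}});
    strm.wait();  // data now holds max(x, 0) for every element
}
```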
Intel’s Reinders acknowledges the whole thing sounds a lot like open source. However, the XPU philosophy goes a step further, affecting the way code is written so that it’s ready for different types of accelerators running underneath it. “I’m not worried that this is some type of fad,” he says. “It’s one of the next major steps in computing. It is not a question of whether an idea like oneAPI will happen, but rather when it will happen.”