What You Need to Know about Power8 and NVIDIA GPUs

Last December, I wrote a blog post about big data. In it, I mentioned the central role the massive number-crunching capability of the new Power8 CPU plays in meeting the demands of today’s big computing tasks. In this post, I want to focus on one of the key components of the Power8 compute solution: its ability to make use of hardware-based computational accelerators.

Math Co-Processors
The idea of a co-processor that performs specific math computations is not new. From the very beginning of the x86 architecture, for example, Intel offered a companion x87 chip that could be installed to handle floating-point arithmetic offloaded from the main x86 CPU. In the early days, one could buy a PC with or without what was called a math co-processor. By the early 1990s, however, with the introduction of the 80486DX CPU, transistor geometries had shrunk enough to incorporate the functions formerly performed by the co-processor directly on the main chip. The day of the separate co-processor, at least in the x86 family, had come to an end.

IBM’s Power architecture never had an optional external math co-processor in the way Intel did. The IBM architects always focused on providing maximum arithmetic horsepower within the core CPU package. This compute capacity was generally available to any thread running on the CPU; it was not task-specific, and it ran in series with all other CPU functions. This was in contrast to the external co-processor, which had the potential to do computation-specific work in parallel with the main CPU.

This architecture changed with the introduction of the Power7+ chip. The Power7+ was implemented at a 32nm feature size, down from the 45nm of the Power7, but it retained the same die size, giving the chip architects almost double the number of transistors to work with. They opted to spend some of those extra transistors on two specific math-intensive use cases: the computation of cryptographic algorithms, and the compression and decompression of real-memory pages demanded by the new Active Memory Expansion (AME) functionality introduced with the Power7 servers. The solution was an on-chip co-processor built and coded specifically to compute a set of common cryptographic algorithms and the AME memory compression and decompression algorithms.

The Power8 chip carried that design forward, adding on-chip accelerators for Hardware Transactional Memory (HTM), Virtual Memory Management and Partition Mobility. What’s interesting is that in addition to these on-chip accelerators, the Power8 adds a generic capability to support an x87-style external accelerator. This has significant implications for the future of the Power architecture, as it opens the possibility for third-party vendors to provide co-processors that extend the core capabilities of the Power8 chip in any specialized direction. Thus, Power8 becomes a platform upon which any number of specialized compute engines can be built.

Coherent Attached Processor Interface (CAPI)
A key component needed to support this external co-processor is CAPI. One of the challenges in implementing a co-processor is integrating it into the architecture of its host machine. Speed is critical; getting data to the co-processor and getting answers back as fast as possible is imperative. In the past, co-processor implementations used a device-driver model to do this, but that adds layers of protocol between the main system’s memory address space and the data addressed by the co-processor. Ideally, the co-processor should be able to address the same memory space as the main CPU, allowing it to operate as a peer of, and in parallel with, the main CPU. This coherent memory access model is what CAPI provides, eliminating the device-driver bottleneck.
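IBM’s actual CAPI programming interfaces are beyond the scope of this post, but the shared-address-space idea can be illustrated by analogy with unified memory in NVIDIA’s CUDA platform (discussed in the next section). The sketch below is my own illustrative example, not vendor code: a single allocation is addressable by both the CPU and the accelerator, with no staging copies written by the application.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Double every element in place. The kernel dereferences the very
// same pointer the host code uses; the runtime keeps the two views
// of the buffer consistent.
__global__ void scale(double *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0;
}

int main() {
    const int n = 1 << 20;
    double *data;
    // One allocation, visible to both CPU and GPU: no separate
    // host and device buffers, no explicit copies between them.
    cudaMallocManaged(&data, n * sizeof(double));
    for (int i = 0; i < n; ++i) data[i] = 1.0;

    scale<<<(n + 255) / 256, 256>>>(data, n);
    cudaDeviceSynchronize();            // wait for the GPU to finish

    printf("data[0] = %f\n", data[0]);  // prints 2.000000
    cudaFree(data);
    return 0;
}
```

The analogy is loose: with a CAPI-attached device the coherence is maintained in hardware across the processor bus rather than by a software runtime, but the programming-model payoff, one address space shared by CPU and accelerator, is the same.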

Access to CAPI technology is being offered by IBM through the OpenPOWER Foundation. While not open source, CAPI and its associated technologies are available to anyone willing to pay the license fee to become a member of the foundation.

The NVIDIA Tesla GPU
One of the early adopters of CAPI is NVIDIA, a leading developer of graphics processing units (GPUs). Originally developed to handle the large volumes of computation needed by graphics-intensive applications such as CAD and gaming, GPUs are at heart mathematical computation engines, and with appropriate coding they can perform any kind of mathematical calculation requested. Since 2007, NVIDIA has been doing just that, repurposing its industry-leading proprietary graphics processing technology for general-purpose number crunching. Today the NVIDIA Tesla K40 GPU can be ordered as a CAPI-attached GPU for Power8 servers. To support the Tesla GPU, NVIDIA also supplies a programming model and associated instruction set, the Compute Unified Device Architecture (CUDA), that makes it possible for developers to easily and effectively harness the power of the Tesla GPU and bring it to bear on their computations.
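To make that model concrete, here is a minimal sketch of the classic CUDA pattern (my own illustrative example, not code from IBM or NVIDIA documentation): a SAXPY kernel, the textbook y = a*x + y operation, launched across roughly a million GPU threads, one per vector element.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// SAXPY: y = a*x + y. Each GPU thread computes one element, so the
// loop a CPU would run serially is spread across thousands of cores.
__global__ void saxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

int main() {
    const int n = 1 << 20;
    std::vector<float> hx(n, 1.0f), hy(n, 2.0f);

    // Classic model: allocate device buffers and stage data across
    // explicitly before and after the kernel runs.
    float *dx, *dy;
    cudaMalloc((void **)&dx, n * sizeof(float));
    cudaMalloc((void **)&dy, n * sizeof(float));
    cudaMemcpy(dx, hx.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dy, hy.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    // 256 threads per block; enough blocks to cover all n elements.
    saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, dx, dy);
    cudaMemcpy(hy.data(), dy, n * sizeof(float), cudaMemcpyDeviceToHost);

    printf("hy[0] = %f\n", hy[0]);  // 2*1 + 2 = 4.000000
    cudaFree(dx);
    cudaFree(dy);
    return 0;
}
```

Note the explicit staging copies in and out of device memory; that is exactly the overhead a coherent attachment such as CAPI is designed to reduce.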

One of the areas IBM has recently targeted for a Tesla-based solution is the acceleration of Java applications. The IBM-developed CUDA4J library provides application programming interfaces (APIs) that allow Java programs to control the Tesla engine and direct work to it using the normal Java memory model. Early experiments with Tesla-accelerated Java applications have yielded speed improvements approaching, and in some cases exceeding, an order of magnitude, with the promise of more to come.

Reaching the Summit
The Power+NVIDIA combination has drawn the attention of one of the biggest supercomputer customers around: the U.S. Department of Energy (DoE). Responsible for both the Oak Ridge and Lawrence Livermore laboratories, the DoE operates some of the largest supercomputers in the world and has done so for a long time. Just last fall, the DoE announced that the next-generation flagship computers commissioned for these labs would be based on the IBM Power9 + NVIDIA Volta technology combination. The largest of these machines, codenamed Summit, is due for delivery in 2017. Taking over from Titan, an Opteron+Tesla-based system currently ranked as the second most powerful supercomputer in the world, Summit will be more than five times more powerful.

So when you realize that your big data is going to demand some big computing, you’ll know where to find it.

Related Courses
Power Systems for AIX I: LPAR Configuration and Planning (AN11G)
AIX Basics (AN10G)
POWER8 Systems and AIX Enhancements (AN101G)

Join the Conversation

  1. Jos

    Interesting article about co-processors and their CAD or supercomputing benefits, although I wonder what the benefit of Power8 (or any Power-x architecture) is for standard commercial apps versus ‘standard’ architectures.

    1. Iain Campbell

      In terms of commercial architecture, I would suggest that if you research market share for Power/AIX, you might just find a credible argument that Power *is* the standard architecture. As far as I can tell, Power/AIX market share currently exceeds Itanium/HP-UX and SPARC/Solaris combined … As far as performance is concerned, if you research any of the standard benchmarks for database and similar applications, you will see Power servers are more than competitive there as well at the Power7 level, and Power8 is significantly faster than Power7.

      The Power implementation that makes a lot of sense in the commercial world leverages the high-availability and virtualization platform to allow for a high level of convergence. (I didn’t focus on this in the blog post, but Power servers have a number of hardware-level redundancy, power management and availability features that create the possibility of uptimes realistically approaching five nines.) Production Power servers will commonly run several front-end and back-end applications in virtual machines on the same physical server, using virtual network communications with negligible latency and high bandwidth. If you want hardware redundancy, Power/AIX offers a very good enterprise-class availability cluster solution in PowerHA. Add GPFS into the mix and you can build geographically distant availability clusters too.

      If you don’t want to run AIX, then Linux on Power is also a happening thing these days. That surprises those who assume that UNIX-based shops move to Linux to leverage cheap x86 hardware, but if you look into it you will find that Power8 hardware may not be as expensive as you think when you look carefully at the price/performance bottom line, and the fact that these things will run 24/7/365 with a high degree of reliability. When it comes to hardware, you tend to get what you pay for.

      Finally, if you are running Java-based code, there are Power8-specific JVMs being developed that will leverage the CAPI-based Power8 co-processor capabilities to accelerate any Java code that runs on that machine, with no machine-specific Java code changes required. Stay tuned …

  2. Jos

    Great points. As for sales: taking a purely technical/architectural view, sales simply reflect market success, not good design or better performance/efficiency. But yes, they certainly underscore the degree of success against competing architectures, which is huge; I can’t argue with success. I did not look at the benchmarks earlier; I’ll have to take a look. Thanks for the great comments.