# HETEROGENEOUS SYSTEM ARCHITECTURE (HSA) AND THE SOFTWARE ECOSYSTEM

MANJU HEGDE, CORPORATE VP, HETEROGENEOUS SOLUTIONS, AMD

### CUDA BRINGS PERFORMANCE TO PRO/RESEARCH ON



CUDA gave developers access to unprecedented performance

Not easy to use ...but enough performance-hungry developers willing to endure pain

Low adoption in the consumer space ... esp. due to lack of cross-platform support



Abundant performance + same complexity as CUDA programming

Cross platform resonates with developers (needs per-platform optimization)



Easy to program

Truly cross platform – Write Once Run Anywhere

Lack of performance efficiency offset by platform capability



## You can get developers to change! (takes time and strategy)

### HSA FOUNDATION : DRIVING FUTURE OF HETEROGENEOUS COMPUTING





### GOALS FOR THE HETEROGENEOUS SYSTEM ARCHITECTURE





- Advanced Natural User Interfaces & Presence Capabilities
- Rich Cloud Computing User Experiences
- Perceptual Computing Experiences
- Bring Hollywood Class Realism to Real-time Entertainment

### HSA ARCHITECTURE

**GPU compute C++ support** 

**User Mode Scheduling** 

Fully coherent memory between CPU & GPU

GPU uses pageable system memory via CPU pointers

**GPU** graphics pre-emption

**GPU** compute context switch







### HSA INTERMEDIATE LANGUAGE - HSAIL

- Designed for C99, C++ 2011, Java, Renderscript, OpenCL, C++ AMP
- HSAIL is a virtual ISA for parallel programs
  - Finalized to ISA by a JIT compiler or "Finalizer"
  - ISA independent by design for CPU & GPU
- Explicitly parallel
  - Designed for data parallel programming
- Support for exceptions, virtual functions, and other high level language features
- Syscall methods
  - GPU code can call directly to system services (IO, printf, etc.)





### OPENCL<sup>™</sup> AND HSA



- HSA is an optimized platform architecture for OpenCL<sup>™</sup>
  - Not an alternative to OpenCL<sup>™</sup>
- OpenCL<sup>™</sup> on HSA will benefit from
  - Avoidance of wasteful copies
  - Low latency dispatch
  - Improved memory model
  - Pointers shared between CPU and GPU
- HSA also exposes a lower level programming interface, for those that want the ultimate in control and performance
  - Optimized libraries may choose the lower level interface



### HETEROGENEOUS COMPUTE DISPATCH



How compute dispatch operates today in the **driver model** 


How compute dispatch improves **under HSA** 










Hardware Queue







### HSA COMMAND AND DISPATCH FLOW





- Application codes to the hardware
- User mode queuing
- Hardware scheduling
- Low dispatch times

- No APIs
- No Soft Queues
- No User Mode Drivers
- No Kernel Mode Transitions
- No Overhead!
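As a rough illustration of user mode queuing, the sketch below models a dispatch queue as a ring buffer that the application writes to directly, with a simulated hardware scheduler reading from it. The class name `UserModeQueue`, its methods, and the packet contents are hypothetical and are not taken from any HSA specification; this is a toy model, not the actual queue format.

```python
class UserModeQueue:
    """Toy ring buffer standing in for a user-mode dispatch queue (hypothetical)."""

    def __init__(self, capacity):
        self.slots = [None] * capacity
        self.capacity = capacity
        self.write_index = 0  # advanced by the application
        self.read_index = 0   # advanced by the simulated hardware scheduler

    def enqueue(self, packet):
        """Application side: write a packet directly, with no driver call."""
        if self.write_index - self.read_index == self.capacity:
            return False  # queue is full; the caller retries later
        self.slots[self.write_index % self.capacity] = packet
        self.write_index += 1  # publishing the new index notifies the consumer
        return True

    def drain(self):
        """Hardware side: consume every packet published so far, in order."""
        done = []
        while self.read_index < self.write_index:
            done.append(self.slots[self.read_index % self.capacity])
            self.read_index += 1
        return done


queue = UserModeQueue(capacity=4)
for name in ["kernel_a", "kernel_b", "kernel_c"]:
    queue.enqueue(name)
print(queue.drain())  # ['kernel_a', 'kernel_b', 'kernel_c']
```

The point of the model is that the producer only touches memory it already owns, so no API call or kernel mode transition sits on the dispatch path.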

### COMMAND AND DISPATCH CPU <-> GPU

#### Application / Runtime



#### **Driver Stack**

#### **HSA Software Stack**







### ACCELERATED WORKLOADS CLIENT AND SERVER EXAMPLES



### HAAR Face Detection

CORNERSTONE TECHNOLOGY FOR COMPUTER VISION

### LOOKING FOR FACES IN ALL THE RIGHT PLACES





#### **Quick HD Calculations**

- Search square = $21 \times 21$
- Pixels = $1920 \times 1080 = 2,073,600$
- Search squares = $1900 \times 1060 \approx 2$ million
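The search-square count can be reproduced with a short calculation. The sketch assumes the 21 × 21 square slides one pixel at a time across the frame; that stride is an assumption, although it is consistent with the 1900 × 1060 figure on the slide. The function name `search_squares` is illustrative.

```python
def search_squares(width, height, window):
    """Number of positions a window x window square can occupy in the frame,
    assuming it moves one pixel at a time in each direction."""
    return (width - window + 1) * (height - window + 1)


pixels = 1920 * 1080
squares = search_squares(1920, 1080, 21)
print(pixels)   # 2073600
print(squares)  # 2014000
```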

### LOOKING FOR DIFFERENT SIZE FACES – BY SCALING THE VIDEO FRAME
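A minimal sketch of how scaling the frame adds search squares: the frame is shrunk repeatedly while the 21 × 21 square stays fixed, and the squares at every scale are summed. The scale step is a free parameter here (the slides do not state the factor used), and `squares_over_scales` is a hypothetical helper, so the totals it produces are illustrative only.

```python
def squares_over_scales(width, height, window, scale_step):
    """Sum search squares over successively smaller copies of the frame.

    Returns (total_squares, number_of_scales). scale_step is the factor by
    which each dimension shrinks per level; its value is an assumption.
    """
    if scale_step <= 1.0:
        raise ValueError("scale_step must be greater than 1")
    total = 0
    levels = 0
    w, h = float(width), float(height)
    while int(w) >= window and int(h) >= window:
        total += (int(w) - window + 1) * (int(h) - window + 1)
        levels += 1
        w /= scale_step
        h /= scale_step
    return total, levels


total, levels = squares_over_scales(1920, 1080, 21, scale_step=1.25)
print(f"{levels} scales, {total:,} search squares")
```

Each additional scale contributes fewer squares than the one before it, so the total stays within a small multiple of the full-resolution count.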





### HAAR CASCADE STAGES





© Copyright 2012 HSA Foundation. All Rights Reserved.

### 22 CASCADE STAGES, EARLY OUT BETWEEN EACH





**NO FACE** 
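The early-out behaviour between stages can be sketched as a loop that stops at the first stage a search square fails. The stage predicates below are placeholders (simple score thresholds), not real Haar feature tests, and `run_cascade` is a hypothetical name.

```python
def run_cascade(stages, window):
    """Return (is_face, stages_evaluated) for one search square."""
    for index, stage in enumerate(stages, start=1):
        if not stage(window):
            return False, index  # early out: "NO FACE"
    return True, len(stages)


# 22 placeholder stages; stage t passes when the square's score is at least t.
stages = [lambda score, t=t: score >= t for t in range(1, 23)]

print(run_cascade(stages, 5))   # (False, 6)
print(run_cascade(stages, 22))  # (True, 22)
```

Most squares contain no face and exit after a few stages, which is why the average work per square is far below the cost of running all 22 stages.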

#### **Final HD Calculations**

- Search squares = 3.8 million
- Average features per square = 124
- Calculations per feature = 100
- Calculations per frame = 47 GCalcs

#### **Calculation Rate**

- 30 frames/sec = 1.4 TCalcs/second
- 60 frames/sec = 2.8 TCalcs/second
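The per-frame and per-second figures follow directly from the three inputs on the slide. The short calculation below reproduces them; the variable and function names are illustrative.

```python
SEARCH_SQUARES = 3.8e6       # across all scales
FEATURES_PER_SQUARE = 124    # average
CALCS_PER_FEATURE = 100

calcs_per_frame = SEARCH_SQUARES * FEATURES_PER_SQUARE * CALCS_PER_FEATURE


def calcs_per_second(frames_per_second):
    """Calculation rate needed to keep up with the given frame rate."""
    return calcs_per_frame * frames_per_second


print(f"{calcs_per_frame / 1e9:.0f} GCalcs per frame")  # 47 GCalcs per frame
for fps in (30, 60):
    print(f"{fps} frames/sec = {calcs_per_second(fps) / 1e12:.1f} TCalcs/second")
# 30 frames/sec = 1.4 TCalcs/second
# 60 frames/sec = 2.8 TCalcs/second
```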

#### ...and this only gets front-facing faces

### CASCADE DEPTH ANALYSIS







### **PROCESSING TIME/STAGE**



"Trinity" A10-4600M (6CU@497Mhz, 4 cores@2700Mhz)



AMD A10 4600M APU with Radeon™ HD Graphics; CPU: 4 cores @ 2.3 GHz (turbo 3.2 GHz); GPU: AMD Radeon HD 7660G, 6 compute units, 685 MHz; 4 GB RAM; Windows 7 (64-bit); OpenCL™ 1.1 (873.1)

### PERFORMANCE CPU-VS-GPU





AMD A10 4600M APU with Radeon<sup>™</sup> HD Graphics; CPU: 4 cores @ 2.3 GHz (turbo 3.2 GHz); GPU: AMD Radeon HD 7660G, 6 compute units, 685 MHz; 4 GB RAM; Windows 7 (64-bit); OpenCL<sup>™</sup> 1.1 (873.1)

### HAAR SOLUTION – RUN DIFFERENT CASCADES ON GPU AND CPU



By seamlessly sharing data between CPU and GPU, HSA allows the right processor to handle its appropriate workload
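One way to picture the split is sketched below, under the assumption that the first few stages (which every search square must pass through) run as a uniform batch on the GPU and the remaining stages run on the CPU for the few survivors. The slides do not state where the split falls, so `gpu_stage_count` is a free parameter, the stage predicates are placeholders, and both phases run on the CPU in this toy model.

```python
def split_cascade(windows, stages, gpu_stage_count):
    """Run the first gpu_stage_count stages over all windows, then the rest
    over survivors only. Returns (survivors_of_first_phase, faces)."""
    gpu_stages = stages[:gpu_stage_count]
    cpu_stages = stages[gpu_stage_count:]

    # "GPU" phase: the same early stages applied uniformly to every window.
    survivors = [w for w in windows if all(stage(w) for stage in gpu_stages)]

    # "CPU" phase: the remaining stages applied to the survivors only.
    faces = [w for w in survivors if all(stage(w) for stage in cpu_stages)]
    return survivors, faces


# Placeholder data: windows are plain scores, stages are rising thresholds.
windows = list(range(10))
stages = [lambda s, t=t: s >= t for t in (2, 5, 8)]

survivors, faces = split_cascade(windows, stages, gpu_stage_count=1)
print(survivors)  # [2, 3, 4, 5, 6, 7, 8, 9]
print(faces)      # [8, 9]
```

Because both processors see the same memory under HSA, the survivor list can be handed over by pointer rather than copied between phases.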







### ACCELERATING MEMCACHED CLOUD SERVER WORKLOAD

### MEMCACHED



- A Distributed Memory Object Caching System Used in Cloud Servers
- Generally used for short-term storage and caching, handling requests that would otherwise require database or file system accesses
- Used by Facebook, YouTube, Twitter, Wikipedia, Flickr, and others
- Effectively a large distributed hash table
  - Responds to store and get requests received over the network
  - Conceptually:
    - store(key, object)
    - object = get(key)
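The conceptual `store`/`get` interface can be sketched as a hash table split across several shards, with each key hashed to pick its shard. This is an in-process toy, not the memcached protocol or client API; the class name `DistributedCache` and the CRC32-modulo placement are assumptions made for illustration.

```python
import zlib


class DistributedCache:
    """Toy distributed hash table: each shard stands in for one cache server."""

    def __init__(self, shard_count):
        self.shards = [{} for _ in range(shard_count)]

    def _shard_for(self, key):
        # Hash the key to decide which shard (server) owns it.
        index = zlib.crc32(key.encode("utf-8")) % len(self.shards)
        return self.shards[index]

    def store(self, key, obj):
        self._shard_for(key)[key] = obj

    def get(self, key):
        # Returns None on a cache miss.
        return self._shard_for(key).get(key)


cache = DistributedCache(shard_count=3)
cache.store("user:42", {"name": "Ada"})
print(cache.get("user:42"))   # {'name': 'Ada'}
print(cache.get("user:999"))  # None
```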

### OFFLOADING MEMCACHED KEY LOOKUP TO THE GPU



T. H. Hetherington, T. G. Rogers, L. Hsu, M. O'Connor, and T. M. Aamodt, "Characterizing and Evaluating a Key-Value Store Application on Heterogeneous CPU-GPU Systems," Proceedings of the 2012 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS 2012), April 2012. http://ieeexplore.ieee.org/xpl/articleDetails.jsp?tp=&arnumber=6189209





### ACCELERATING B+TREE SEARCHES

CLOUD SERVER WORKLOAD

### **B+TREE SEARCHES**

- B+Trees are a fundamental data structure
  - Used to reduce memory & disk access to locate a key
  - Can support index- and range-based queries
  - Can be updated efficiently
- B+Trees are used by enterprise DB applications
  - SQL: SQLite, MySQL, Oracle, and others
  - No-SQL: Apache CouchDB, Tokyo Cabinet, and others
    - Audio search, video copy detection



A simple B+Tree linking the keys 1-7. The linked list (red) allows rapid in-order traversal.
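A minimal sketch of such a tree over the keys 1-7, showing a point lookup that descends from the root and a range scan that follows the leaf links. The node layout (one root with separators 3 and 5 over three leaves) is an assumption, since the figure itself is not reproduced here, and the tree is built by hand rather than by insertion.

```python
from bisect import bisect_right


class Leaf:
    def __init__(self, keys):
        self.keys = keys
        self.next = None  # link to the next leaf, in key order


class Inner:
    def __init__(self, keys, children):
        self.keys = keys          # separator keys
        self.children = children  # len(children) == len(keys) + 1


def find_leaf(node, key):
    """Descend from the root to the leaf that would hold key."""
    while isinstance(node, Inner):
        node = node.children[bisect_right(node.keys, key)]
    return node


def contains(root, key):
    return key in find_leaf(root, key).keys


def range_scan(root, low, high):
    """Collect keys in [low, high] by walking the linked leaves in order."""
    leaf = find_leaf(root, low)
    found = []
    while leaf is not None:
        for key in leaf.keys:
            if key > high:
                return found
            if key >= low:
                found.append(key)
        leaf = leaf.next
    return found


leaves = [Leaf([1, 2]), Leaf([3, 4]), Leaf([5, 6, 7])]
for left, right in zip(leaves, leaves[1:]):
    left.next = right
root = Inner([3, 5], leaves)

print(contains(root, 4))       # True
print(range_scan(root, 2, 6))  # [2, 3, 4, 5, 6]
```

Each lookup is independent of the others, which is what makes a batch of searches a natural fit for parallel execution.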



### PARALLEL B+TREE SEARCHES ON HSA

By efficiently sharing data between CPU and GPU, HSA increases performance versus a multithreaded CPU, even for tree structures that reside in host memory. With HSA, the database can be larger than GPU memory and can be shared.

HSA lets us move compute to data

- Parallel search can move to GPU
- Sequential updates can remain on CPU

| Platform                  | Size < 1.5 GB | Size 1.5-2.7 GB | Size > 2.7 GB |
|---------------------------|---------------|-----------------|---------------|
| dGPU (memory size = 3 GB) | $\checkmark$  | $\checkmark$    | ×             |
| HSA                       | $\checkmark$  | $\checkmark$    | $\checkmark$  |

M. Daga and M. Nutter, "Exploiting Coarse-Grained Parallelism in B+Tree Searches on an APU," accepted at the Second Workshop on Irregular Applications: Algorithms and Architectures (IA3), November 2012.

AMD A10 4600M APU with Radeon™ HD Graphics; CPU: 4 cores @ 2.3 GHz (turbo 3.2 GHz); GPU: AMD Radeon HD 7660G, 6 compute units, 685 MHz; 4 GB RAM







### **ACCELERATING JAVA**

GOING BEYOND NATIVE LANGUAGES

### JAVA ENABLEMENT BY APARAPI



Aparapi = Runtime capable of converting Java<sup>™</sup> bytecode to OpenCL<sup>™</sup>



### JAVA AND APARAPI HSA ENABLEMENT ROADMAP








### EASE OF PROGRAMMING CODE COMPLEXITY VS. PERFORMANCE



#### Optimized template library routines for common GPU functions

- For OpenCL<sup>™</sup> and C++ AMP, across multiple platforms

Programming model interface similar to multicore Task Parallel Runtimes (TBB, ConCRT)

- CPU performance as good as or better than multicore Task Parallel Runtimes
- Excellent performance and power efficiency on HSA devices
- For many applications, a single source code base for both CPU and GPU!
- Leverage the robust Visual Studio C++ AMP debug solution

### LINES-OF-CODE AND PERFORMANCE FOR DIFFERENT PROGRAMMING MODELS



AMD A10-5800K APU with Radeon™ HD Graphics – CPU: 4 cores, 3800MHz (4200MHz Turbo); GPU: AMD Radeon HD 7660D, 6 compute units, 800MHz; 4GB RAM. Software – Windows 7 Professional SP1 (64-bit OS); AMD OpenCL™ 1.2 AMD-APP (937.2); Microsoft Visual Studio 11 Beta


#### **RESEARCH TOPICS IN HSA**



| Category            | Description                                                                                                                                                                                                                                                                                           | Comments |
|---------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------|
| Languages/Compilers | Higher-level languages. GPU languages are primitive today. OpenCL is a good expert tool.<br>Look into domain specific languages (graphics, math). Ex: HSA could have a database<br>accelerator component                                                                                              |          |
|                     | Split compilation model – high-level compilers & low-level compilers and how to make them work well together |          |
|                     | How to run best on a device with multiple ISAs |          |
| Software Run-Time   | Classic load balancing. Look for new ways to partition algorithms automatically in the runtime.<br>Simultaneous running of multiple kernels or multiple applications. Quality of service & virtualization. Scheduling for complex status graphs and scheduling dynamic parallelism                    |          |
| System Architecture | <ul> <li>Bandwidth/memory arch (balancing BW with compute)</li> <li>Load balancing</li> <li>Memory configurations: Stack memory devices will eventually appear and systems will change around idea of bandwidth. Shared memory stacks – what are the implications?</li> <li>TCU/LCU ratios</li> </ul> |          |
| Hardware            | <ul> <li>Logical split between split function hardware.</li> <li>Applying HSA to non-GPU devices (DSPs, FPGAs, etc.)</li> <li>Heterogeneous conformance optimization - how to run a program that runs well on all different HSA platforms and hardware</li> </ul>                                     |          |
|                     | Memory system design: low-cost support for coherency that gives programmers a way to optimize their use of coherence |          |
|                     | Security: looking into securing systems                                                                                                                                                                                                                                                               |          |
|                     | Efficient synchronization primitives                                                                                                                                                                                                                                                                  |          |
|                     | 3D graphics pipes – integration with HSA                                                                                                                                                                                                                                                              |          |

### THE HSA OPPORTUNITY



